Summary Video Accessibility Talk
I’ve just got off a call to the UK Digital TV Group, for which I gave a talk on HTML5 video accessibility (slides best viewed in Google Chrome).
The slide provide a high-level summary of the accessibility features that we’ve developed in the W3C for HTML5, including:
- Subtitles & Captions with WebVTT and the track element
- Video Descriptions with WebVTT, the track element and speech synthesis
- Chapters with WebVTT for semantic navigation
- Audio Descriptions through synchronising an audio track with a video
- Sign Language video synchronized with a main video
I received some excellent questions.
The obvious one was about why WebVTT and not TTML. While for anyone who has tried to implement TTML support, the advantages of WebVTT should be clear, for some the decision of the browsers to go with WebVTT still seems to be bothersome. The advantages of CSS over XSL-FO in a browser-context are obvious, but not as much outside browsers. So, the simplicity of WebVTT and the clear integration with HTML have to speak for themselves. Conversion between TTML and WebVTT was a feature that was being asked for.
I received a question about how to support ducking (reduce the volume of the main audio track) when using video descriptions. My reply was to either use video descriptions with WebVTT and do ducking during the times that a cue is active, or when using audio descriptions (i.e. actual audio tracks) to add an additional WebVTT file of kind=metadata to mark the intervals in which to do ducking. In both cases some JavaScript will be necessary.
I received another question about how to do clean audio, which I had almost forgotten was a requirement from our earlier media accessibility document. “Clean audio” consists of isolating the audio channel containing the spoken dialog and important non-speech information that can then be amplified or otherwise modified, while other channels containing music or ambient sounds are attenuated. I suggested using the mediagroup attribute to provide a main video element (without an audio track) and then the other channels as parallel audio tracks that can be turned on and off and attenuated individually. There is some JavaScript coding involved on top of the APIs that we have defined in HTML, but it can be implemented in browsers that support the mediagroup attribute.
Another question was about the possibilities to extend the list of @kind attribute values. I explained that right now we have a proposal for a new text track kind=”forced” so as to provide forced subtitles for sections of video with foreign language. These would be on when no other subtitle or caption tracks are activated. I also explained that if there is a need for application-specific text tracks, the kind=”metadata” would be the correct choice.
I received some further questions, in particular about how to apply styling to captions (e.g. color changes to text) and about how closely the browser are able to keep synchronization across multiple media elements. The earlier was easily answered with the ::cue pseudo-element, but the latter is a quality of implementation feature, so I had to defer to individual browsers.
Overall it was a good exercise to summarize the current state of HTML5 video accessibility and I was excited to show off support in Chrome for all the features that we designed into the standard.
New Challenges
I finished up at Google last week and am now working at NICTA, an Australian ICT research institute.
My work with Google was exciting and I learned a lot. I like to think that Google also got a lot out of me – I coded and contributed to some YouTube caption features, I worked on Chrome captions and video controls, and above all I worked on video accessibility for HTML at the W3C.
I was one of the key authors of the W3C Media Accessibility Requirements document that we created in the Media Accessibility Task Force of the W3C HTML WG. I then went on to help make video accessibility a reality. We created WebVTT and the <track> element and applied it to captions, subtitles, chapters (navigation), video descriptions, and metadata. To satisfy the need for synchronisation of video with other media resources such as sign language video or audio descriptions, we got the MediaController object and the @mediagroup attribute.
I must say it was a most rewarding time. I learned a lot about being productive at Google, collaborate successfully over the distance, about how the WebKit community works, and about the new way of writing W3C standard (which is more like pseudo-code). As one consequence, I am now a co-editor of the W3C HTML spec and it seems I am also about to become the editor of the WebVTT spec.
At NICTA my new focus of work is WebRTC. There is both a bit of research and a whole bunch of application development involved. I may even get to do some WebKit development, if we identify any issues with the current implementation. I started a week ago and am already amazed by the amount of work going on in the WebRTC space and the amazing number of open source projects playing around with it. Video conferencing is a new challenge and I look forward to it.
Video Conferencing in HTML5: WebRTC via Socket.io
Six months ago I experimented with Web sockets for WebRTC and the early implementations of PeerConnection in Chrome. Last week I gave a presentation about WebRTC at Linux.conf.au, so it was time to update that codebase.
I decided to use socket.io for the signalling following the idea of Luc, which made the server code even smaller and reduced it to a mere reflector:
var app = require('http').createServer().listen(1337);
var io = require('socket.io').listen(app);
io.sockets.on('connection', function(socket) {
socket.on('message', function(message) {
socket.broadcast.emit('message', message);
});
});
Then I turned to the client code. I was surprised to see the massive changes that PeerConnection has gone through. Check out my slide deck to see the different components that are now necessary to create a PeerConnection.
I was particularly surprised to see the SDP object now fully exposed to JavaScript and thus the ability to manipulate it directly rather than through some API. This allows Web developers to manipulate the type of session that they are asking the browsers to set up. I can imaging e.g. if they have support for a video codec in JavaScript that the browser does not provide built-in, they can add that codec to the set of choices to be offered to the peer. While it is flexible, I am concerned if this might create more problems than it solves. I guess we’ll have to wait and see.
I was also surprised by the need to use ICE, even though in my experiment I got away with an empty list of ICE servers – the ICE messages just got exchanged through the socket.io server. I am not sure whether this is a bug, but I was very happy about it because it meant I could run the whole demo on a completely separate network from the Internet.
The most exciting news since my talk is that Mozilla and Google have managed to get a PeerConnection working between Firefox and Chrome – this is the first cross-browser video conference call without a plugin! The code differences are minor.
Since the specification of the WebRTC API and of the MediaStream API are now official Working Drafts at the W3C, I expect other browsers will follow. I am also looking forward to the possibilities of:
- multi-peer video conferencing like the efforts around webrtc.io,
- the media stream recording API,
- and the peer-to-peer data API.
The best places to learn about the latest possibilities of WebRTC are webrtc.org and the W3C WebRTC WG. code.google.com has open source code that continues to be updated to the latest released and interoperable features in browsers.
The video of my talk is in the process of being published. There is a MP4 version on the Linux Australia mirror server, but I expect it will be published properly soon. I will update the blog post when that happens.
What is “interoperable TTML”?
I’ve just tried to come to terms with the latest state of TTML, the Timed Text Markup Language.
TTML has been specified by the W3C Timed Text Working Group and released as a RECommendation v1.0 in November 2010. Since then, several organisations have tried to adopt it as their caption file format. This includes the SMPTE, the EBU (European Broadcasting Union), and Microsoft.
Both, Microsoft and the EBU actually looked at TTML in detail and decided that in order to make it usable for their use cases, a restriction of its functionalities is needed.
EBU-TT
The EBU released EBU-TT, which restricts the set of valid attributes and feature. “The EBU-TT format is intended to constrain the features provided by TTML, especially to make EBU-TT more suitable for the use with broadcast video and web video applications.” (see EBU-TT).
In addition, EBU-specific namespaces were introduce to extend TTML with EBU-specific data types, e.g. ebuttdt:frameRateMultiplierType or ebuttdt:smpteTimingType. Similarly, a bunch of metadata elements were introduced, e.g. ebuttm:documentMetadata, ebuttm:documentEbuttVersion, or ebuttm:documentIdentifier.
The use of namespaces as an extensibility mechanism will ascertain that EBU-TT files continue to be valid TTML files. However, any vanilla TTML parser will not know what to do with these custom extensions and will drop them on the floor.
Simple Delivery Profile
With the intention to make TTML ready for “internet delivery of Captions originated in the United States”, Microsoft proposed a “Simple Delivery Profile for Closed Captions (US)” (see Simple Profile). The Simple Profile is also a restriction of TTML.
Unfortunately, the Microsoft profile is not the same as the EBU-TT profile: for example, it contains the “set” element, which is not conformant in EBU-TT. Similarly, the supported style features are different, e.g. Simple Profile supports “display-region”, while EBU-TT does not. On the other hand, EBU-TT supports monospace, sans-serif and serif fonts, while the Simple profile does not.
Thus files created for the Simple Delivery Profile will not work on players that expect EBU-TT and the reverse.
Fortunately, the Simple Delivery Profile does not introduce any new namespaces and new features, so at least it is an explicit subpart of TTML and not both a restriction and extension like EBU-TT.
SMPTE-TT
SMPTE also created a version of the TTML standard called SMPTE-TT. SMPTE did not decide on a subset of TTML for their purposes – it was simply adopted as a complete set. “This Standard provides a framework for timed text to be supported for content delivered via broadband means,…” (see SMPTE-TT).
However, SMPTE extended TTML in SMPTE-TT with an ability to store a binary blob with captions in another format. This allows using SMPTE-TT as a transport format for any caption format and is deemed to help with “backwards compatibility”.
Now, instead of specifying a profile, SMPTE decided to define how to convert CEA-608 captions to SMPTE-TT. Even if it’s not called a “profile”, that’s actually what it is. It even has its own namespace: “m608:”.
Conclusion
With all these different versions of TTML, I ask myself what a video player that claims support for TTML will do to get something working. The only chance it has is to implement all the extensions defined in all the different profiles. I pity the player that has to deal with a SMPTE-TT file that has a binary blob in it and is expected to be able to decode this.
Now, what is a caption author supposed to do when creating TTML? They obviously cannot expect all players to be able to play back all TTML versions. Should they create different files depending on what platform they are targeting, i.e. a EBU-TT version, a SMPTE-TT version, a vanilla TTML version, and a Simple Delivery Profile version? Should they by throwing all the features of all the versions into one TTML file and hope that the players will pick out the right things that they require and drop the rest on the floor?
Maybe the best way to progress would be to make a list of the “safe” features: those features that every TTML profile supports. That may be the best way to get an “interoperable TTML” file. Here’s me hoping that this minimal set of features doesn’t just end up being the usual (starttime, endtime, text) triple.
UPDATE:
I just found out that UltraViolet have their own profile of SMPTE-TT called CFF-TT (see UltraViolet FAQ and spec). They are making some SMPTE-TT fields optional, but introduce a new @forcedDisplayMode attribute under their own namespace “cff:”.
Video Conferencing in HTML5: WebRTC via Web Sockets
A bit over a week ago I gave a presentation at Web Directions Code 2012 in Melbourne. Maxine and John asked me to speak about something related to HTML5 video, so I went for the new shiny: WebRTC – real-time communication in the browser.
I only had 20 min, so I had to make it tight. I wanted to show off video conferencing without special plugins in Google Chrome in just a few lines of code, as is the promise of WebRTC. To a large extent, I achieved this. But I made some interesting discoveries along the way. Demos are in the slide deck.
UPDATE: Opera 12 has been released with WebRTC support.
Housekeeping: if you want to replicate what I have done, you need to install a Google Chrome Web Browser 19+. Then make sure you go to chrome://flags and activate the MediaStream and PeerConnection experiment(s). Restart your browser and now you can experiment with this feature. Big warning up-front: it’s not production-ready, since there are still changes happening to the spec and there is no compatible implementation by another browser yet.
Here is a brief summary of the steps involved to set up video conferencing in your browser:
- Set up a video element each for the local and the remote video stream.
- Grab the local camera and stream it to the first video element.
- (*) Establish a connection to another person running the same Web page.
- Send the local camera stream on that peer connection.
- Accept the remote camera stream into the second video element.
Now, the most difficult part of all of this – believe it or not – is the signalling part that is required to build the peer connection (marked with (*)). Initially I wanted to run completely without a server and just enter the remote’s IP address to establish the connection. This is, however, not a functionality that the PeerConnection object provides [might this be something to add to the spec?].
So, you need a server known to both parties that can provide for the handshake to set up the connection. All the examples that I have seen, such as https://apprtc.appspot.com/, use a channel management server on Google’s appengine. I wanted it all working with HTML5 technology, so I decided to use a Web Socket server instead.
I implemented my Web Socket server using node.js (code of websocket server). The video conferencing demo is in the slide deck in an iframe – you can also use the stand-alone html page. Works like a treat.
While it is still using Google’s STUN server to get through NAT, the messaging for setting up the connection is running completely through the Web Socket server. The messages that get exchanged are plain SDP message packets with a session ID. There are OFFER, ANSWER, and OK packets exchanged for each streaming direction. You can see some of it in the below image:

I’m not running a public WebSocket server, so you won’t be able to see this part of the presentation working. But the local loopback video should work.
At the conference, it all went without a hitch (while the wireless played along). I believe you have to host the WebSocket server on the same machine as the Web page, otherwise it won’t work for security reasons.
A whole new world of opportunities lies out there when we get the ability to set up video conferencing on every Web page – scary and exciting at the same time!
My crazy linux.conf.au week
In January I attended the annual Australian Linux and Open Source conference (LCA). But since I was sick all of January and had a lot to catch up on, I never got around to sharing all the talks that I gave during that time.
Drupal Down Under
It started with a talk at Drupal Down Under, which happened the weekend before LCA. I gave a talk titled “HTML5 video specifications” (video, slides).
I spoke about the video and audio element in HTML5, how to provide fallback content, how to encode content, how to control them from JavaScript, and briefly about Drupal video modules, though the next presentation provided much more insight into those. I explained how to make the HTML5 media elements accessible, including accessible controls, captions, audio descriptions, and the new WebVTT file format. I ran out of time to introduce the last section of my slides which are on WebRTC.
Linux.conf.au
On the first day of LCA I gave a talk both in the Multimedia Miniconf and the Browser Miniconf.
Browser Miniconf
In the Browser Miniconf I talked about “Web Standardisation – how browser vendors collaborate, or not” (slides). Maybe the most interesting part about this was that I tried out a new slide “deck” tool called impress.js. I’m not yet sure if I like it but it worked well for this talk, in which I explained how the HTML5 spec is authored and who has input.
I also sat on a panel of browser developers in the Browser Miniconf (more as a standards than as a browser developer, but that’s close enough). We were asked about all kinds of latest developments in HTML5, CSS3, and media standards in the browser.
Multimedia Miniconf
In the Multimedia Miniconf I gave a “HTML5 media accessibility update” (slides). I talked about the accessibility problems of Flash, how native HTML5 video players will be better, about accessible video controls, captions, navigation chapters, audio descriptions, and WebVTT. I also provided a demo of how to synchronize multiple video elements using a polyfill for the multitrack API.
I also provided an update on HTTP adaptive streaming APIs as a lightning talk in the Multimedia Miniconf. I used an extract of the Drupal conference slides for it.
Main conference
Finally, and most importantly, Alice Boxhall and myself gave a talk in the main linux.conf.au titled “Developing Accessible Web Apps – how hard can it be?” (video, slides). I spoke about a process that you can follow to make your Web applications accessible. I’m writing a separate blog post to explain this in more detail. In her part, Alice dug below the surface of browsers to explain how the accessibility markup that Web developers provide is transformed into data structures that are handed to accessibility technologies.
Open Media Developers Track at OVC 2011
The Open Video Conference that took place on 10-12 September was so overwhelming, I’ve still not been able to catch my breath! It was a dense three days for me, even though I only focused on the technology sessions of the conference and utterly missed out on all the policy and content discussions.
Roughly 60 people participated in the Open Media Software (OMS) developers track. This was an amazing group of people capable and willing to shape the future of video technology on the Web:
- HTML5 video developers from Apple, Google, Opera, and Mozilla (though we missed the NZ folks),
- codec developers from WebM, Xiph, and MPEG,
- Web video developers from YouTube, JWPlayer, Kaltura, VideoJS, PopcornJS, etc.,
- content publishers from Wikipedia, Internet Archive, YouTube, Netflix, etc.,
- open source tool developers from FFmpeg, gstreamer, flumotion, VideoLAN, PiTiVi, etc,
- and many more.
To provide a summary of all the discussions would be impossible, so I just want to share the key take-aways that I had from the main sessions.
WebRTC: Realtime Communications and HTML5
Tim Terriberry (Mozilla), Serge Lachapelle (Google) and Ethan Hugg (CISCO) moderated this session together (slides). There are activities both at the W3C and at IETF – the ones at IETF are supposed to focus on protocols, while the W3C ones on HTML5 extensions.
The current proposal of a PeerConnection API has been implemented in WebKit/Chrome as open source. It is expected that Firefox will have an add-on by Q1 next year. It enables video conferencing, including media capture, media encoding, signal processing (echo cancellation etc), secure transmission, and a data stream exchange.
Current discussions are around the signalling protocol and whether SIP needs to be required by the standard. Further, the codec question is under discussion with a question whether to mandate VP8 and Opus, since transcoding gateways are not desirable. Another question is how to measure the quality of the connection and how to report errors so as to allow adaptation.
What always amazes me around RTC is the sheer number of specialised protocols that seem to be required to implement this. WebRTC does not disappoint: in fact, the question was asked whether there could be a lighter alternative than to re-use dozens of years of protocol development – is it over-engineered? Can desktop players connect to a WebRTC session?
We are already in a second or third revision of this part of the HTML5 specification and yet it seems the requirements are still being collected. I’m quietly confident that everything is done to make the lives of the Web developer easier, but it sure looks like a huge task.
The Missing Link: Flash to HTML5
Zohar Babin (Kaltura) and myself moderated this session and I must admit that this session was the biggest eye-opener for me amongst all the sessions. There was a large number of Flash developers present in the room and that was great, because sometimes we just don’t listen enough to lessons learnt in the past.
This session gave me one of those aha-moments: it the form of the Flash appendBytes() API function.
The appendBytes() function allows a Flash developer to take a byteArray out of a connected video resource and do something with it – such as feed it to a video for display. When I heard that Web developers want that functionality for JavaScript and the video element, too, I instinctively rejected the idea wondering why on earth would a Web developer want to touch encoded video bytes – why not leave that to the browser.
But as it turns out, this is actually a really powerful enabler of functionality. For example, you can use it to:
- display mid-roll video ads as part of the same video element,
- sequence playlists of videos into the same video element,
- implement DVR functionality (high-speed seeking),
- do mash-ups,
- do video editing,
- adaptive streaming.
This totally blew my mind and I am now completely supportive of having such a function in HTML5. Together with media fragment URIs you could even leave all the header download management for resources to the Web browser and just request time ranges from a video through an appendBytes() function. This would be easier on the Web developer than having to deal with byte ranges and making sure that appropriate decoding pipelines are set up.
Standards for Video Accessibility
Philip Jagenstedt (Opera) and myself moderated this session. We focused on the HTML5 track element and the WebVTT file format. Many issues were identified that will still require work.
One particular topic was to find a standard means of rendering the UI for caption, subtitle, und description selection. For example, what icons should be used to indicate that subtitles or captions are available. While this is not part of the HTML5 specification, it’s still important to get this right across browsers since otherwise users will get confused with diverging interfaces.
Chaptering was discussed and a particular need to allow URLs to directly point at chapters was expressed. I suggested the use of named Media Fragment URLs.
The use of WebVTT for descriptions for the blind was also discussed. A suggestion was made to use the voice tag <v> to allow for “styling” (i.e. selection) of the screen reader voice.
Finally, multitrack audio or video resources were also discussed and the @mediagroup attribute was explained. A question about how to identify the language used in different alternative dubs was asked. This is an issue because @srclang is not on audio or video, only on text, so it’s a missing feature for the multitrack API.
Beyond this session, there was also a breakout session on WebVTT and the track element. As a consequence, a number of bugs were registered in the W3C bug tracker.
WebM: Testing, Metrics and New features
This session was moderated by John Luther and John Koleszar, both of the WebM Project. They started off with a presentation on current work on WebM, which includes quality testing and improvements, and encoder speed improvement. Then they moved on to questions about how to involve the community more.
The community criticised that communication of what is happening around WebM is very scarce. More sharing of information was requested, including a move to using open Google+ hangouts instead of Google internal video conferences. More use of the public bug tracker can also help include the community better.
Another pain point of the community was that code is introduced and removed without much feedback. It was requested to introduce a peer review process. Also it was requested that example code snippets are published when new features are announced so others can replicate the claims.
This all indicates to me that the WebM project is increasingly more open, but that there is still a lot to learn.
Standards for HTTP Adaptive Streaming
This session was moderated by Frank Galligan and Aaron Colwell (Google), and Mark Watson (Netflix).
Mark started off by giving us an introduction to MPEG DASH, the MPEG file format for HTTP adaptive streaming. MPEG has just finalized the format and he was able to show us some examples. DASH is XML-based and thus rather verbose. It is covering all eventualities of what parameters could be switched during transmissions, which makes it very broad. These include trick modes e.g. for fast forwarding, 3D, multi-view and multitrack content.
MPEG have defined profiles – one for live streaming which requires chunking of the files on the server, and one for on-demand which requires keyframe alignment of the files. There are clear specifications for how to do these with MPEG. Such profiles would need to be created for WebM and Ogg Theora, too, to make DASH universally applicable.
Further, the Web case needs a more restrictive adaptation approach, since the video element’s API is already accounting for some of the features that DASH provides for desktop applications. So, a Web-specific profile of DASH would be required.
Then Aaron introduced us to the MediaSource API and in particular the webkitSourceAppend() extension that he has been experimenting with. It is essentially an implementation of the appendBytes() function of Flash, which the Web developers had been asking for just a few sessions earlier. This was likely the biggest announcement of OVC, alas a quiet and technically-focused one.
Aaron explained that he had been trying to find a way to implement HTTP adaptive streaming into WebKit in a way in which it could be standardised. While doing so, he also came across other requirements around such chunked video handling, in particular around dynamic ad insertion, live streaming, DVR functionality (fast forward), constraint video editing, and mashups. While trying to sort out all these requirements, it became clear that it would be very difficult to implement strategies for stream switching, buffering and delivery of video chunks into the browser when so many different and likely contradictory requirements exist. Also, once an approach is implemented and specified for the browser, it becomes very difficult to innovate on it.
Instead, the easiest way to solve it right now and learn about what would be necessary to implement into the browser would be to actually allow Web developers to queue up a chunk of encoded video into a video element for decoding and display. Thus, the webkitSourceAppend() function was born (specification).
The proposed extension to the HTMLMediaElement is as follows:
partial interface HTMLMediaElement {
// URL passed to src attribute to enable the media source logic.
readonly attribute [URL] DOMString webkitMediaSourceURL;
bool webkitSourceAppend(in Uint8Array data);
// end of stream status codes.
const unsigned short EOS_NO_ERROR = 0;
const unsigned short EOS_NETWORK_ERR = 1;
const unsigned short EOS_DECODE_ERR = 2;
void webkitSourceEndOfStream(in unsigned short status);
// states
const unsigned short SOURCE_CLOSED = 0;
const unsigned short SOURCE_OPEN = 1;
const unsigned short SOURCE_ENDED = 2;
readonly attribute unsigned short webkitSourceState;
};
The code is already checked into WebKit, but commented out behind a command-line compiler flag.
Frank then stepped forward to show how webkitSourceAppend() can be used to implement HTTP adaptive streaming. His example uses WebM – there are no examples with MPEG or Ogg yet.
The chunks that Frank’s demo used were 150 video frames long (6.25s) and 5s long audio. Stream switching only switched video, since audio data is much lower bandwidth and more important to retain at high quality. Switching was done on multiplexed files.
Every chunk requires an XHR range request – this could be optimised if the connections were kept open per adaptation. Seeking works, too, but since decoding requires download of a whole chunk, seeking latency is determined by the time it takes to download and decode that chunk.
Similar to DASH, when using this approach for live streaming, the server has to produce one file per chunk, since byte range requests are not possible on a continuously growing file.
Frank did not use DASH as the manifest format for his HTTP adaptive streaming demo, but instead used a hacked-up custom XML format. It would be possible to use JSON or any other format, too.
After this session, I was actually completely blown away by the possibilities that such a simple API extension allows. If I wasn’t sold on the idea of a appendBytes() function in the earlier session, this one completely changed my mind. While I still believe we need to standardise a HTTP adaptive streaming file format that all browsers will support for all codecs, and I still believe that a native implementation for support of such a file format is necessary, I also believe that this approach of webkitSourceAppend() is what HTML needs – and maybe it needs it faster than native HTTP adaptive streaming support.
Standards for Browser Video Playback Metrics
This session was moderated by Zachary Ozer and Pablo Schklowsky (JWPlayer). Their motivation for the topic was, in fact, also HTTP adaptive streaming. Once you leave the decisions about when to do stream switching to JavaScript (through a function such a wekitSourceAppend()), you have to expose stream metrics to the JS developer so they can make informed decisions. The other use cases is, of course, monitoring of the quality of video delivery for reporting to the provider, who may then decide to change their delivery environment.
The discussion found that we really care about metrics on three different levels:
- measuring the network performance (bandwidth)
- measuring the decoding pipeline performance
- measuring the display quality
In the end, it seemed that work previously done by Steve Lacey on a proposal for video metrics was generally acceptable, except for the playbackJitter metric, which may be too aggregate to mean much.
Device Inputs / A/V in the Browser
I didn’t actually attend this session held by Anant Narayanan (Mozilla), but from what I heard, the discussion focused on how to manage permission of access to video camera, microphone and screen, e.g. when multiple applications (tabs) want access or when the same site wants access in a different session. This may apply to real-time communication with screen sharing, but also to photo sharing, video upload, or canvas access to devices e.g. for time lapse photography.
Open Video Editors
This was another session that I wasn’t able to attend, but I believe the creation of good open source video editing software and similar video creation software is really crucial to giving video a broader user appeal.
Jeff Fortin (PiTiVi) moderated this session and I was fascinated to later see his analysis of the lifecycle of open source video editors. It is shocking to see how many people/projects have tried to create an open source video editor and how many have stopped their project. It is likely that the creation of a video editor is such a complex challenge that it requires a larger and more committed open source project – single people will just run out of steam too quickly. This may be comparable to the creation of a Web browser (see the size of the Mozilla project) or a text processing system (see the size of the OpenOffice project).
Jeff also mentioned the need to create open video editor standards around playlist file formats etc. Possibly the Open Video Alliance could help. In any case, something has to be done in this space – maybe this would be a good topic to focus next year’s OVC on?
Monday’s Breakout Groups
The conference ended officially on Sunday night, but we had a third day of discussions / hackday at the wonderful New York Lawschool venue. We had collected issues of interest during the two previous days and organised the breakout groups on the morning (Schedule).
In the Content Protection/DRM session, Mark Watson from Netflix explained how their API works and that they believe that all we need in browsers is a secure way to exchange keys and an indicator of protection scheme is used – the actual protection scheme would not be implemented by the browser, but be provided by the underlying system (media framework/operating system). I think that until somebody actually implements something in a browser fork and shows how this can be done, we won’t have much progress. In my understanding, we may also need to disable part of the video API for encrypted content, because otherwise you can always e.g. grab frames from the video element into canvas and save them from there.
In the Playlists and Gapless Playback session, there was massive brainstorming about what new cool things can be done with the video element in browsers if playback between snippets can be made seamless. Further discussions were about a standard playlist file formats (such as XSPF, MRSS or M3U), media fragment URIs in playlists for mashups, and the need to expose track metadata for HTML5 media elements.
What more can I say? It was an amazing three days and the complexity of problems that we’re dealing with is a tribute to how far HTML5 and open video has already come and exciting news for the kind of applications that will be possible (both professional and community) once we’ve solved the problems of today. It will be exciting to see what progress we will have made by next year’s conference.
Thanks go to Google for sponsoring my trip to OVC.
UPDATE: We actually have a mailing list for open media developers who are interested in these and similar topics – do join at http://lists.annodex.net/cgi-bin/mailman/listinfo/foms.
WebVTT at W3C
Today we started a community group (CG) at the W3C for “Web Media Text Tracks”: http://www.w3.org/community/texttracks/.
The group has been created to work on many aspects of video text tracks of which captioning and the WebVTT format are key parts.
The main reason behind creating this group is to create a forum at the W3C for working on WebVTT to allow all browsers to support this format and be involved in its development.
We’ve not gone the full way to creating a Working Group, although that was the initial intention. We had objections from W3C members for going down that path, so are using the CG path for now.
This is actually a good thing because CGs are open for anyone to join, while WGs are only open to W3C members. The key difference is that specs coming out of WGs can become RECs (“standards”), while CG’s specs cannot.
If we eventually see a need to move WebVTT to a REC, that move will be straight forward, since there is a clear path for work to transition from a CG to a WG.
3rd W3C Web and TV Workshop, Hollywood
Curious about any new requirements that the TV community may have for HTML5 video, I attended the W3C Web and TV Workshop in Hollywood last week. It’s already the third of its kind and was also the largest to date showing an increasing interest of the TV community to converge with the Web community.
The Workshop Aim
I went into the Workshop not quite knowing what to expect. My previous contact with members of this community was restricted to email exchanges on the W3C Web and TV Interest Group (IG) mailing list. I knew there was some interest in video accessibility (well: particularly captions) and little knowledge of existing HTML5 specifications around text tracks and why the browsers were going with WebVTT. So I had decided to attend the workshop to get a better understanding of the community, it’s background, needs, and issues, and to hopefully teach some of the ways of HTML5. For that reason I had also submitted a WebVTT presentation/demo.
As it turned out, the workshop had as its key target the facilitation of communication between the TV and the HTML5 community. The aim was to identify features that need to be added to the HTML5 video element to satisfy the needs of the TV community. I obviously came to the right workshop.
The process that is being used by the W3C in the Interest Group is to have TV community members express their needs, then have HTML5 experts express how these needs can be satisfied with existing HTML5 features, then make trial implementations and identify any shortcomings, then move forward to progress these through HTML5 or HTML.next. This workshop clearly focused on the first step: expressing needs.
Often times it was painful for me to watch presenters defending their requirements and trying to impress on the audience how important a certain feature is to them when that features actually already has a HTML5 specification, but just not yet a browser implementations. That there were so few HTML5 video experts present and that they were given very little space to directly reply to the expressed needs and actually explain what is already possible (or specified to be possible) was probably one of the biggest drawbacks of the workshop.
To be fair, detailed technical discussions were not possible in a room with 150 attendees with a panel sitting at the front discussing topics and taking questions. Solving a use case with existing HTML5 markup and identifying the gaps requires smaller break-out groups of a maximum of maybe 20 people and sufficient HTML5 knowledge in the room. Ultimately they require a single person to try to implement it using JavaScript alone, and, failing that, writing browser extensions. Only such code actually proves that a feature is missing.
Now, the video features of HTML5 are still continuing to change almost on a daily basis. Much development is, for example, happening around real-time communication features and around the track element as we speak. So, focusing on further requirements finding around HTML5 video for now is probably a good thing.
The TV Community Approach
Before I move on to some of the topics covered by the workshop, I have to express some concern about the behaviour that I observed with lots of the TV community folks. Many people tried pushing existing solutions from other spaces into the Web unchanged with a claim of not re-inventing the wheel and following paved cowpaths, which are some of the underlying design principles for HTML5. I can understand where such behaviour originates thinking that having solved the same problems elsewhere before, those solutions should apply here, too. But I would like to warn people of this approach.
If we blindly apply solutions that were not developed for HTML5 into HTML we will end up with suboptimal solutions that will hurt us further down the track. The principles of not re-inventing the wheel and following paved cowpaths were introduced for features that were already implemented by browsers or in de-facto standard use by JavaScript libraries. They were not created for new features in HTML. The video element is a completely new feature in HTML thus everything around it is new.
I would therefore like to see some more respect given to HTML5 and the complexities involved in finding the best possible technical solutions for the Web given that the video element does not stand alone in HTML5, but is part of a much larger picture of technical capabilities on the Web where many of the requested features for TV applications may already be solved by existing HTML markup that is not part of the video element.
Also, HTML5 is not just about the HTML markup, but also about CSS and JavaScript and HTTP. There are several layers of technology involved in creating a Web application: not only a separation of work between client and servers, but also between the Operating System, the media framework, the browser, browser plugins, and JavaScript has to be balanced. To get this balance right is a fine art that will take many discussion, many experiments and sometimes several design approaches. We need patience and calm to work through this, not a rushed adoption of existing solutions from other spaces.
New Requirements
Now let’s get to the take-aways I had from the workshop’s sessions:
Session 1 / Content Provider and Consumer Perspective:
The sessions participants postulate that we will see the creation of application stores for TV applications similar to how we have experienced this for mobile phones and tablets. People enjoy collecting apps like they collect badges. Right now, the app store domain is dominated by native apps and now Web apps. The reason is that we haven’t got a standard platform for setting up Web app stores with Web apps that work in all browsers on all operating systems. Thus, developers have to re-deploy their app for many environments.
While essentially an orthogonal need to HTML standardisation, this seems to be one of the key issues that keep Web apps back from making big market inroads and W3C may do well in setting up a new WG to define a standard Web app manifest format and JS APIs.
Session 2+3 / Multi-screen TV in the Home Network:
Several technologies of hybrid TV broadcast and set-top-box Web content delivery were being pointed out, including the European HbbTV and the Japanese Hybridcast, the latter of which gave an in-depth demo.
Web purists would probably say that it would be simpler to just deliver all content over the Web and not have to worry about any further technical challenges encountered by having to synchronize content received via two vastly different delivery mechanisms. I personally believe this development is one of business models: we don’t yet know exactly how to earn money from TV content delivered over the Internet, but we do know how to do so with TV content. So, hybrids allow the continuation of existing income streams while allowing the features to be augmented with those people enjoy from the Internet.
Should requirements that emerge from such a use case for HTML5 video be taken seriously? I think they absolutely should. What I see happening is that a new way of using the Web is starting to emerge. The new way is video-focused rather than text-focused. We receive our Web content by watching video programming online – video channels, not Web pages are the core content that we consume in the living room. Video channels are where we start our browsing experience from. Search may still be our first point of call, but it will be search for video content or a video-centric app rather than search for a Web site.
And it will be a matter of many interconnected devices in the house that contribute to the experience: the 5.1 stereos that are spread all over the house and should receive our video’s sound, the different screens in the different areas of our house between which we move around, and remote controls, laptops or tablets that function as remote controls and preview stations and are used to determine our viewing experience and provide a back-channel to the publishers.
We have barely begun to identify how such interconnected devices within a home fit within the server-client-based view of the Web world, and the new Web Sockets functionality. The Home Networking Task Force of the Web and TV IG is looking at the issues and analysing existing protocols and standards that solve this picture. But I have a gnawing feeling that the best solution will be something new that is more Web-specific and fits better with the technology layers of the Web.
Session 4 / Synchronized Metadata:
The TV environment offers many data services, some of which have been legally prescribed. This session analysed TV needs and how they can be satisfied with current HTML5.
Subtitles and closed captioning support are one of the key requirements that have been legally prescribed to allow for equal access of non-native speakers, and blind and vision-impaired users to TV content. After demonstration of some key features defined into the HTML5 track element and the WebVTT format, it was generally accepted that HTML5 is making big progress in this space, in particular that browsers are in the process of implementing support for the track element. A concern still exists for complete coverage of all the CEA-608/708 features in WebVTT.
Further concern was raised for support of audio descriptions and audio translations, in particular since no browser has as yet committed to implementing the HTML5′s media multitrack API with the @mediagroup attribute. In this context I am excited to see first JavaScript polyfills emerge (see captionator.js & mediagroup.js).
Another concern was that many captions are actually delivered as raster images (in particular DVD captions) and how that would work in the Web context. The proposal was to use WebVTT and encode the raster images as data-URIs included in timed cues, then render them by JavaScript as an overlay. This is something to explore further.
Demos were shown using WebVTT to synchronize ads with videos, to display related metadata from a user’s life log with videos, to display thumbnails along a video’s timeline, and to show the rendering of text descriptions through screen readers. General agreement by the panel was that WebVTT offers many opportunities and that this area will continue to need further development and that we will see new capabilities on the Web around metadata that were not previously possible on TV.
Session 5 / Content Format and Codecs: DASH and Codec standards
The introduction of HTTP adaptive streaming into HTML5 was one of the core issues that kept returning in the discussions. This panel focused on MPEG DASH, but also mentioned the need for programmatic implementation of adaptive streaming functionality.
The work around MPEG DASH would require specifications of how to use DASH with WebM and Ogg Theora, as well as a specification of a HTML5 profile for DASH, which would limit the functionality possible in DASH files to the ones needed in a HTML5 video element. One criticism of DASH was its verbosity. Another was its unclear patent position. Panel attendees with included Qualcomm, Apple and Microsoft made very clear that their position is pro a royalty-free use of DASH.
The work around a programmatic implementation for adaptive streaming would require at least a JavaScript API to measure the quality of service of a presented video element and a JavaScript API to feed the video element with chunks of (encrypted) video content on the fly. Interestingly enough, there are existing experiments both around Video metrics and MediaSource extensions, so we can expect some progress in this space, even if these are not yet a strong focus of the HTML WG.
I would personally support the creation of Community Group at the W3C around HTTP adaptive streaming and DASH. I think it would work towards alleviating the perceived patent issues around DASH and allow the right members of the community to participate in preparing a specification for HTML5 without requiring them to become W3C members.
Session 6 / Content Protection and DRM
A core concern of the TV community is around content protection. The requirements in this space seem, however, very confused.
The key assumption here is that Web browsers should support the decoding of DRM-protected content in the HTML5 video element because the video element provides a desirable JavaScript API, accessibility features (the track element), default controls, and the possibility to synchronize multiple media elements. However, at the same time, the video element is part of the core content of a Web page and thus allows direct access to the image content in a canvas etc, so some of its functionality is not desirable.
The picture is further confused by requests for authentication, authorization, encryption, obfuscation, same-origin, secure transmission, secure decryption key delivery, unique content identification and other “content protection” techniques without a clear understanding of what is already possible on the Web and what requirements to content publishers actually have for delivering their content on the Web. This is further complicated by the fact that there are many competing solutions for DRM systems in the market with no clear standard that all browsers could support.
A thorough analysis of the technologies and solutions available in this space as well as an analysis of the needs for HTML5 is required before it becomes clear what solution HTML5 browsers may need to support. There seemed to be agreement in the group, though, that browsers would not need to implement DRM solutions, but rather only hand through the functionality of the platform on which they are running (including the media frameworks and operating system functionalities). How this is supposed to work was, however, unclear.
Session 7 / Web & TV: Additional Device & User Requirements
This was a catch-all session for topics that had not been addressed in other sessions. Among the topics addressed in this group were:
- Parental Guidance: how to deal with ratings in an internationally inconsistent ratings landscape, how to deliver the ratings with the content, and how to enforce the viewing restrictions
- Emergency Notifications: how to replicate on the Web the emergency notification functionality of TV by providing text overlays to alert users
- TV channels: how to detect what channels of programming are available to users
Overall, the workshop was a worthwhile experience. It seems there is a lot of work still ahead for making HTML5 video the best it can be on the Web.
The new FOMS: Open Media Developers at OVC
Since 2007 I have organised the annual Foundations of Open Media Software (FOMS) developers workshop. Last year it was held for the first time in the northern hemisphere, in fact on the two days straight after the Open Video Conference (OVC).
This year I’m really excited to announce that the workshop will be an integral part of the Open Video Conference on 10-12 September 2011.
FOMS 2011 will take place as the Open Media Developers track at OVC and I would like to see as many if not more open media software developers attend as we had in last year’s FOMS.
Why should you go?
Well, firstly of course the people. As in previous years, we will have some of the key developers in open media software attend – not as celebrities, but to work with other key developers on hard problems and to make progress.
Then, secondly we believe we have some awesome sessions in preparation:
- WebRTC: Realtime Communications and HTML5
- Standards for Video Accessibility
- WebM: Testing, Metrics and New features
- HTML5 video players: Shortcomings of the HTML5 video API for cross-platform player libraries
- Standards for HTTP Adaptive Streaming
- Standards for Browser Video Statistics
- Device Inputs for A/V in the Browser
How we run it
I’m actually not quite satisfied with just these sessions. I’d like to be more flexible on how we make the three days a success for everyone. And this implies that there will continue to be room to add more sessions, even while at the conference, and create breakout groups to address really hard issues all the way through the conference.
I insist on this flexibility because I have seen in past years that the most productive outcomes are created by two or three people breaking away from the group, going into a corner and hacking up some demos or solutions to hard problems and taking that momentum away after the workshop.
To allow this to happen, we will have a plenary on the first day during which we will identify who is actually present at the workshop, what they are working on, what sessions they are planning on a attending, and what other topics they are keen to learn about during the conference that may not yet be addressed by existing sessions.
We’ll repeat this exercise on the Monday after all the rest of the conference is finished and we get a quieter day to just focus on being productive.
But is it worth the effort?
As in the past years, whether the workshop is a success for you depends on you and you alone. You have the power to direct what sessions and breakout groups are being created, and you have the possibility to find others at the workshop that share an interest and drag them away for some productive brainstorming or coding.
I’m going to make sure we have an adequate number of rooms available to actually achieve such an environment. I am very happy to have the support of OVC for this and I am assured we have the best location with plenty of space.
Trip sponsorships
As in previous FOMSes, we have again made sure that travel and conference sponsorship is available to community software developers that would otherwise not be able to attend FOMS. We have several such sponsorships and I encourage you to email the FOMS committee or OVC about it. Mention what you’re working on and what you’re interested to take away from OVC and we can give you free entry, hotel and flight sponsorship.
Oh, and don’t forget to Register for OVC!