WebSRT and HTML5 media accessibility
On 23rd July, Ian Hickson, the HTML5 editor, posted an update to the WHATWG mailing list introducing the first draft of a platform for accessibility for the HTML5 <video> element. The platform provides for captions, subtitles, audio descriptions, chapter markers and similar time-synchronized text both in-band (inside the video resource) and out-of-band (as external text files). Right now, the proposal only covers <video>, but I personally believe the same can be applied to the <audio> element, provided we are a bit more flexible with the rendering approach. Anyway…
What I want to do here is to summarize what was introduced, together with the improvements that I and some others have proposed in follow-up emails, and list some of the media accessibility needs that we are not yet dealing with.
For those who only want to read selected sections, here is a table of contents of this rather long blog post:
- THE WebSRT TIMED TEXT FORMAT
- ASSOCIATING EXTERNAL TIMED TEXT RESOURCES WITH A VIDEO
- EXPOSING A LIST OF TimedTracks TO JAVASCRIPT
- RENDERING TimedTracks
- SUMMARY AND FURTHER NEEDS
THE WebSRT TIMED TEXT FORMAT
The first and to everyone probably most surprising part is the new file format that is being proposed to contain out-of-band time-synchronized text for video. A new format was necessary after the analysis of all relevant existing formats determined that they were either insufficient or hard to use in a Web environment.
The new format is called WebSRT and is an extension to the existing SRT SubRip format. It is actually also the part of the new specification that I am personally most uncomfortable with. Not that WebSRT is a bad format. It’s just not sufficient yet to provide all the functionality that a good time-synchronized text format for Web media should. Let’s look at some examples.
WebSRT is composed of a sequence of timed text cues (that’s what we’ve decided to call the pieces of text that are active during a certain time interval). Because of its SRT ancestry, the cues can optionally be numbered. The content of a cue is currently allowed to be one of three different types of text: plain text, minimal markup, and anything at all (also called “metadata”).
In its most simple form, a WebSRT file is just an ordinary old SRT file with optional cue numbers and only plain text in cues:
1
00:00:15.00 --> 00:00:17.95
At the left we can see...

2
00:00:18.16 --> 00:00:20.08
At the right we can see the...

3
00:00:20.11 --> 00:00:21.96
...the head-snarlers
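To make the structure concrete, here is a minimal JavaScript sketch of how such a file could be split into cues. It is not the parsing algorithm from the draft specification: the helper names `parseTimestamp` and `parseCues` are made up for this illustration, and the sketch assumes that cues are separated by blank lines and that anything after the end timestamp is a cue setting.

```javascript
// Convert "hh:mm:ss.fraction" into seconds.
function parseTimestamp(stamp) {
  const [hours, minutes, seconds] = stamp.trim().split(":");
  return Number(hours) * 3600 + Number(minutes) * 60 + Number(seconds);
}

// Split WebSRT-style text into cue objects (hypothetical helper, not the spec parser).
function parseCues(source) {
  const cues = [];
  for (const block of source.trim().split(/\r?\n\s*\r?\n/)) {
    const lines = block.split(/\r?\n/);
    // The cue number is optional: treat a first line without an arrow as the id.
    let id = "";
    if (!lines[0].includes("-->")) {
      id = lines.shift().trim();
    }
    if (lines.length === 0 || !lines[0].includes("-->")) continue; // not a cue
    const [timing, ...textLines] = lines;
    const [startPart, rest] = timing.split("-->");
    // Everything after the end timestamp is kept as a raw settings string.
    const [endPart, ...settings] = rest.trim().split(/\s+/);
    cues.push({
      id,
      startTime: parseTimestamp(startPart),
      endTime: parseTimestamp(endPart),
      settings: settings.join(" "),
      text: textLines.join("\n"),
    });
  }
  return cues;
}

const sample = [
  "1",
  "00:00:15.00 --> 00:00:17.95",
  "At the left we can see...",
  "",
  "2",
  "00:00:18.16 --> 00:00:20.08",
  "At the right we can see the...",
].join("\n");

const cues = parseCues(sample);
console.log(cues.length, cues[0].startTime, cues[1].text);
```

A real parser would also need to deal with malformed timestamps and encoding issues, which this sketch ignores.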
A bit of a more complex example results if we introduce minimal markup:
00:00:15.00 --> 00:00:17.95 A:start
Auf der <i>linken</i> Seite sehen wir...

00:00:18.16 --> 00:00:20.08 A:end
Auf der <b>rechten</b> Seite sehen wir die....

00:00:20.11 --> 00:00:21.96 A:end
<1>...die Enthaupter.

00:00:21.99 --> 00:00:24.36 A:start
<2>Alles ist sicher. Vollkommen <b>sicher</b>.
and add some CSS to this to provide colors and special formatting:
::cue { background: rgba(0,0,0,0.5); }
::cue-part(1) { color: red; }
::cue-part(2, b) { font-style: normal; text-decoration: underline; }
Minimal markup accepts <i>, <b>, <ruby> and a timestamp in <>, providing for italics, bold, and ruby markup as well as karaoke timestamps. Any further styling can be done using the CSS pseudo-elements ::cue and ::cue-part, which accept the properties ‘color’, ‘text-shadow’, ‘text-outline’, ‘background’, ‘outline’, and ‘font’.
Note that positioning requires some special notes at the end of the start/end timestamps which can provide for vertical text, line position, text position, size and alignment cue setting. Here is an example with vertically rendered Chinese text, right-aligned at 98% of the video frame:
00:00:15.00 --> 00:00:17.95 A:start D:vertical L:98%
在左边我们可以看到...

00:00:18.16 --> 00:00:20.08 A:start D:vertical L:98%
在右边我们可以看到...

00:00:20.11 --> 00:00:21.96 A:start D:vertical L:98%
...捕蝇草械.

00:00:21.99 --> 00:00:24.36 A:start D:vertical L:98%
一切都安全. 非常地安全.
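As a rough illustration of how these hints could be read, the following JavaScript sketch turns a settings string into a small object. The function name `parseCueSettings` and the readable key names are invented for this example. The letters D, L and A are taken from the examples in this post; T (text position) and S (size) are my assumption about the remaining settings and may not match the draft exactly.

```javascript
// Map single-letter setting names to readable keys.
// D, L and A appear in the examples; T and S are assumed, not confirmed.
const SETTING_NAMES = {
  D: "direction",
  L: "linePosition",
  T: "textPosition",
  S: "size",
  A: "alignment",
};

// Parse a string such as "A:start D:vertical L:98%" (hypothetical helper).
function parseCueSettings(settings) {
  const result = {};
  for (const token of settings.trim().split(/\s+/)) {
    const separator = token.indexOf(":");
    if (separator < 1) continue; // skip tokens that are not name:value
    const name = SETTING_NAMES[token.slice(0, separator)];
    if (name) result[name] = token.slice(separator + 1);
  }
  return result;
}

const parsed = parseCueSettings("A:start D:vertical L:98%");
console.log(parsed);
```

Unknown setting names are silently dropped here; whether a real parser should ignore or reject them is a design decision this sketch does not try to settle.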
Finally, WebSRT files can be authored with abstract metadata inside cues, which practically means anything at all. Here’s an example with HTML content:
00:00:15.00 --> 00:00:17.95 A:start
<img src="pic1.png"/>Auf der <i>linken</i> Seite sehen wir...

00:00:18.16 --> 00:00:20.08 A:end
<img src="pic2.png"/>Auf der <b>rechten</b> Seite sehen wir die....

00:00:20.11 --> 00:00:21.96 A:end
<img src="pic3.png"/>...die <a href="http://members.chello.nl/j.kassenaar/elephantsdream/subtitles.html">Enthaupter</a>.

00:00:21.99 --> 00:00:24.36 A:start
<img src="pic4.png"/>Alles ist <mark>sicher</mark>.<br/>Vollkommen <b>sicher</b>.
Here is another example with JSON in the cues:
00:00:00.00 --> 00:00:44.00
{
  "slide": "intro.png",
  "title": "\"Really Achieving Your Childhood Dreams\" by Randy Pausch, Carnegie Mellon University, Sept 18, 2007"
}

00:00:44.00 --> 00:01:18.00
{
  "slide": "elephant.png",
  "title": "The elephant in the room..."
}

00:01:18.00 --> 00:02:05.00
{
  "slide": "denial.png",
  "title": "I'm not in denial..."
}
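Since the browser treats such cues as opaque text, it is up to the page script to interpret them. A minimal sketch in JavaScript, assuming the raw cue text is already available as a string, could look like this. The helper name `readMetadataCue` is hypothetical. Note that `JSON.parse` only accepts strict JSON, with keys and string values in double quotes.

```javascript
// Interpret the text of a metadata cue as JSON (hypothetical helper).
// Returns null when the text is not strict JSON, so callers can decide what to do.
function readMetadataCue(cueText) {
  try {
    return JSON.parse(cueText);
  } catch (error) {
    return null;
  }
}

const slide = readMetadataCue(
  '{ "slide": "elephant.png", "title": "The elephant in the room..." }'
);
const broken = readMetadataCue("{ slide: elephant.png }");
console.log(slide.slide, broken);
```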
What I like about WebSRT:
- it allows for all sorts of different content in the text cues – plain text is useful for texted audio descriptions, minimal markup is useful for subtitles, captions, karaoke and chapters, and “metadata” is useful for, well, any data.
- it can be easily encapsulated into media resources and thus turned into in-band tracks by regarding each cue as a data packet with time stamps.
- it is not verbose
Where I think WebSRT still needs improvements:
- break with the SRT history: since WebSRT and SRT files are so different, WebSRT should get its own MIME type, e.g. text/websrt, and file extensions, e.g. .wsrt; this will free WebSRT for changes that wouldn’t be possible by trying to keep conformant with SRT
- introduce some header fields into WebSRT: the format needs
  - file-wide name-value metadata, such as author, date, copyright, etc
  - language specification for the file as a hint for font selection and speech synthesis
  - a possibility for style sheet association in the file header
  - a means to identify which parser is required for the cues
  - a magic identifier and a version string of the format
- allow innerHTML as an additional format in the cues with the CSS pseudo-elements applying to all HTML elements
- allow full use of CSS instead of just the restricted features and also use it for positioning instead of the hard to understand positioning hints
- on the minimum markup, provide a neutral structuring element such as <span @id @class @lang> to associate specific styles or specific languages with a subpart of the cue
Note that I undertook some experiments with an alternative, XML-based format called WMML to gain most of these insights and to determine the advantages and disadvantages of an XML-based format. The foremost advantage is that newlines in the source file are not automatically rendered as line breaks, which can make the source text file more readable. The foremost disadvantages are verbosity and the need for a simple encoding step that removes all encapsulating header-type content from around the timed text cues before encoding them into a binary media resource.
ASSOCIATING EXTERNAL TIMED TEXT RESOURCES WITH A VIDEO
Now that we have a timed text format, we need to be able to associate it with a media resource in HTML5. This is what the <track> element was introduced for. It associates the timestamps in the timed text cues with the timeline of the video resource. The browser is then expected to render these during the time interval in which the cues are expected to be active.
Here is an example for how to associate multiple subtitle tracks with a video:
<video src="california.webm" controls>
<track label="English" kind="subtitles" src="calif_eng.wsrt" srclang="en">
<track label="German" kind="subtitles" src="calif_de.wsrt" srclang="de">
<track label="Chinese" kind="subtitles" src="calif_zh.wsrt" srclang="zh">
</video>
In this case, the UA is expected to provide a text menu with a subtitle entry with these three tracks and their label as part of the video controls. Thus, the user can interactively activate one of the tracks.
Here is an example for multiple tracks of different kinds:
<video src="california.webm" controls>
<track label="English" kind="subtitles" src="calif_eng.wsrt" srclang="en">
<track label="German" kind="captions" src="calif_de.wsrt" srclang="de">
<track label="French" kind="chapters" src="calif_fr.wsrt" srclang="fr">
<track label="English" kind="metadata" src="calif_meta.wsrt" srclang="en">
<track label="Chinese" kind="descriptions" src="calif_zh.wsrt" srclang="zh">
</video>
In this case, the UA is expected to provide a text menu with a list of track kinds with one entry each for subtitles, captions and descriptions through the controls. The chapter tracks are expected to provide some sort of visual subdivision on the timeline and the metadata tracks are not exposed visually, but are only available through the JavaScript API.
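The expected handling of the different kinds can be summarized in a small JavaScript sketch. It works on plain objects rather than real DOM elements, the function name `planTrackUi` and the shape of its result are invented for this illustration, and it assumes `chapters` as the keyword for chapter tracks.

```javascript
// Kinds that are offered to the user in the controls menu.
const MENU_KINDS = ["subtitles", "captions", "descriptions"];

// Decide where each track would surface in the UI (hypothetical helper).
function planTrackUi(tracks) {
  const plan = { menu: {}, timeline: [], scriptOnly: [] };
  for (const track of tracks) {
    if (MENU_KINDS.includes(track.kind)) {
      // One menu entry per kind, listing the labels of its tracks.
      if (!plan.menu[track.kind]) plan.menu[track.kind] = [];
      plan.menu[track.kind].push(track.label);
    } else if (track.kind === "chapters") {
      // Chapter tracks subdivide the timeline instead of appearing in the menu.
      plan.timeline.push(track.label);
    } else {
      // Metadata (and anything unknown) is only reachable from script.
      plan.scriptOnly.push(track.label);
    }
  }
  return plan;
}

const plan = planTrackUi([
  { kind: "subtitles", label: "English" },
  { kind: "captions", label: "German" },
  { kind: "chapters", label: "French" },
  { kind: "metadata", label: "English" },
  { kind: "descriptions", label: "Chinese" },
]);
console.log(plan);
```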
Here are several ideas for improving the <track> specification:
- <track> is currently only defined for WebSRT resources – it should be made generic and then browsers can compete on the formats for which they provide support. WebSRT could be the baseline format. A @type attribute could be added to hint at the MIME type of the provided resource.
- <track> needs a means for authors to mark certain tracks as active, others as inactive. This can be overruled by browser settings e.g. on @srclang and by user interaction.
- karaoke and lyrics are supported by WebSRT, but aren’t in the HTML5 spec as track kinds – they should be added and made visible like subtitles or captions.
EXPOSING A LIST OF TimedTracks TO JAVASCRIPT
This is where we take an extra step and move to a uniform handling of both in-band and out-of-band timed text tracks. Further, a third type of timed text track has been introduced in the form of a MutableTimedTrack – i.e. one that can be authored and added through JavaScript alone.
The JavaScript API that is exposed for any of these track types is identical. A media element now has this additional IDL interface:
interface HTMLMediaElement : HTMLElement {
...
readonly attribute TimedTrack[] tracks;
MutableTimedTrack addTrack(in DOMString label, in DOMString kind,
in DOMString language);
};
A media element thus manages a list of TimedTracks and provides for adding TimedTracks through addTrack().
The timed tracks are associated with a media resource in the following order:
- The <track> element children of the media element, in tree order.
- Tracks created through the addTrack() method, in the order they were added, oldest first.
- In-band timed text tracks, in the order defined by the media resource’s format specification.
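The ordering rule above can be sketched as a sort over plain objects. The `source` field and its three values are hypothetical labels for where a track came from; they are not part of the proposed API.

```javascript
// Rank of each track origin, following the order in the list above.
const SOURCE_RANK = { element: 0, script: 1, inband: 2 };

// Return a new array with tracks in list order (hypothetical helper).
// This relies on Array.prototype.sort being stable (guaranteed since ES2019),
// so tracks from the same source keep the order in which they were given.
function orderTracks(tracks) {
  return [...tracks].sort(
    (a, b) => SOURCE_RANK[a.source] - SOURCE_RANK[b.source]
  );
}

const ordered = orderTracks([
  { label: "in-band captions", source: "inband" },
  { label: "added by script", source: "script" },
  { label: "German", source: "element" },
  { label: "English", source: "element" },
]).map((track) => track.label);
console.log(ordered);
```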
The IDL interface of a TimedTrack is as follows:
interface TimedTrack {
readonly attribute DOMString kind;
readonly attribute DOMString label;
readonly attribute DOMString language;
readonly attribute unsigned short readyState;
attribute unsigned short mode;
readonly attribute TimedTrackCueList cues;
readonly attribute TimedTrackCueList activeCues;
readonly attribute Function onload;
readonly attribute Function onerror;
readonly attribute Function oncuechange;
};
The first three attributes capture the values of the @kind, @label and @srclang attributes of a <track> element. For MutableTimedTracks they are provided through the addTrack() function, and for in-band tracks they are exposed from metadata in the binary resource.
The readyState captures whether the data is available and is one of “not loaded”, “loading”, “loaded”, “failed to load”. Data is only available in the “loaded” state.
The mode attribute captures whether the data is active for display and is one of “disabled”, “hidden” and “showing”. In the “disabled” mode, the UA doesn’t have to download the resource, allowing for some bandwidth management.
The cues and activeCues attributes provide the list of parsed cues for the given track and the subpart thereof that is currently active.
The onload, onerror, and oncuechange functions are event handlers for the load, error and cuechange events of the TimedTrack.
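The relationship between cues and activeCues can be illustrated with a few lines of JavaScript over plain objects. The function name `activeCuesAt` is made up, and treating the start time as inclusive and the end time as exclusive is my assumption rather than something stated in the draft.

```javascript
// Return the cues whose time interval contains currentTime (hypothetical helper).
// Assumption: start is inclusive, end is exclusive.
function activeCuesAt(cues, currentTime) {
  return cues.filter(
    (cue) => cue.startTime <= currentTime && currentTime < cue.endTime
  );
}

const trackCues = [
  { id: "1", startTime: 15.0, endTime: 17.95 },
  { id: "2", startTime: 18.16, endTime: 20.08 },
  { id: "3", startTime: 20.11, endTime: 21.96 },
];

const at16 = activeCuesAt(trackCues, 16).map((cue) => cue.id);
const at18 = activeCuesAt(trackCues, 18).map((cue) => cue.id);
console.log(at16, at18);
```

In a browser, a cuechange event would be the natural moment to recompute such a list; the sketch leaves the event wiring out.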
Individual cues expose the following IDL interface:
interface TimedTrackCue {
readonly attribute TimedTrack track;
readonly attribute DOMString id;
readonly attribute float startTime;
readonly attribute float endTime;
DOMString getCueAsSource();
DocumentFragment getCueAsHTML();
readonly attribute boolean pauseOnExit;
readonly attribute Function onenter;
readonly attribute Function onexit;
readonly attribute DOMString direction;
readonly attribute boolean snapToLines;
readonly attribute long linePosition;
readonly attribute long textPosition;
readonly attribute long size;
readonly attribute DOMString alignment;
readonly attribute DOMString voice;
};
The @track attribute links the cue to its TimedTrack.
The @id, @startTime, @endTime attributes expose a cue identifier and its associated time interval. The getCueAsSource() and getCueAsHTML() functions provide either the unparsed text content of the cue or the text content parsed into an HTML DOM subtree.
The @pauseOnExit attribute can be set to true/false and indicates whether at the end of the cue’s time interval the media playback should be paused and wait for user interaction to continue. This is particularly important as we are trying to support extended audio descriptions and extended captions.
The onenter and onexit functions are event handlers for the enter and exit events of the TimedTrackCue.
The @direction, @snapToLines, @linePosition, @textPosition, @size, @alignment and @voice attributes expose WebSRT positioning and semantic markup of the cue.
My only concerns with this part of the specification are:
- The WebSRT-related attributes in the TimedTrackCue are in conflict with CSS attributes and really should not be introduced into HTML5, since they are WebSRT-specific. They will not exist in other types of in-band or out-of-band timed text tracks. As there is a mapping to do already, why not rely on already available CSS features?
- There is no API to expose header-specific metadata from timed text tracks into JavaScript. Information such as the copyright holder, the creation date and the usage rights of a timed text track would be useful to have available. I would propose to add a list of name-value metadata elements to the TimedTrack API.
- In addition, I would propose to allow media fragment hyperlinks in a <video> @src attribute to point to the @id of a TimedTrackCue, thus defining that the playback position should be moved to the time offset of that TimedTrackCue. This is a useful feature and builds on bringing named media fragment URIs and TimedTracks together.
RENDERING TimedTracks
The third part of the timed track framework deals with how to render the timed text cues in a Web page. The rendering rules are explained in the HTML5 rendering section.
I’ve extracted the following rough steps from the rendering algorithm:
- All timed tracks of a media resource that are in “showing” mode are rendered together to avoid overlapping text from multiple tracks.
- The timed track cues that are to be rendered are collected from the active timed tracks and ordered by the timed track order first and by their start time second. Where there are identical start times, the cues are ordered by their end time, earliest first, or by their creation order if all else is identical.
- Each cue gets its own CSS box.
- The text in the CSS boxes is positioned and formatted by interpreting the positioning and formatting instructions of WebSRT that are provided on the cues.
- An anonymous inline CSS box is created into which all the cue CSS boxes are wrapped.
- The wrapping CSS box gets the dimensions of the video viewport. The cue CSS boxes are positioned so they don’t overlap. The text inside the cue CSS boxes inside the wrapping CSS box is wrapped at the edges if necessary.
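The ordering in the second step can be written as a comparison function. This JavaScript sketch uses plain objects; the fields `trackIndex` (position of the cue's track in the track list) and `created` (a creation counter) are hypothetical stand-ins for what a browser would track internally.

```javascript
// Compare two cues for rendering order (hypothetical helper):
// track order first, then start time, then end time, then creation order.
function compareCues(a, b) {
  return (
    a.trackIndex - b.trackIndex ||
    a.startTime - b.startTime ||
    a.endTime - b.endTime ||
    a.created - b.created
  );
}

const toRender = [
  { id: "c", trackIndex: 1, startTime: 15, endTime: 18, created: 0 },
  { id: "b", trackIndex: 0, startTime: 15, endTime: 20, created: 1 },
  { id: "a", trackIndex: 0, startTime: 15, endTime: 17, created: 2 },
  { id: "d", trackIndex: 0, startTime: 10, endTime: 30, created: 3 },
];

const renderOrder = [...toRender].sort(compareCues).map((cue) => cue.id);
console.log(renderOrder.join(","));
```

Each difference that evaluates to zero falls through to the next criterion, which mirrors the "if all else is identical" wording of the step.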
To overcome security concerns with this kind of direct rendering of a CSS box into the Web page where text comes potentially from a different and malicious Web site, it is required to have the cues come from the same origin as the Web page.
To allow application of a restricted set of CSS properties to the timed text cues, a set of pseudo-elements was introduced. This is necessary since all the CSS boxes are anonymous and cannot be addressed from the Web page. The introduced pseudo-elements are ::cue to address a complete cue CSS box, and ::cue-part to address a subpart of a cue CSS box based on a set of identifiers provided by WebSRT.
I have several issues with this approach:
- I believe that it is not a good idea to restrict rendering to same-origin files. This will disallow linking to external captioning services (or even just a separate caption server of the same company) to provide the captions for a video. Henri Sivonen proposed a means to overcome this by parsing every cue basically as its own HTML document (well, the body of a document) and then rendering these iframe-style into the Web page. This would overcome the same-origin restriction. It would also allow us to do away with the new ::cue CSS selectors, thus simplifying the solution.
- In general I am concerned about how tightly the rendering is tied to WebSRT. Step 4 should not be in the HTML5 specification, but only apply to WebSRT. Every external format should provide its own mapping to CSS. As it is specified right now, other formats, such as 3GPP in MPEG-4 or Kate in Ogg, are required to map their format and positioning information to WebSRT instructions. These are then converted again using the WebSRT to CSS mapping rules. That seems like overkill.
- I also find step 6 very limiting, since only the video viewport is regarded as a potential rendering area – this is also the reason why there is no rendering for audio elements. Instead, it would make a lot more sense if a CSS box was provided by the HTML page – the default being the video viewport, but it could be changed to any area on screen. This would allow music lyrics to be rendered under or above an audio element, or captions to be rendered below a video element to avoid any overlap at all.
SUMMARY AND FURTHER NEEDS
We’ve made huge progress on accessibility features for HTML5 media elements with the specifications that Ian proposed. I think we can turn it into a flexible and feature-rich framework once the improvements that Henri, I and others have proposed are included.
This will meet most of the requirements that the W3C HTML Accessibility Task Force has collected for media elements where the requirements relate to accessibility functionality provided through alternative text resources.
However, we are not solving any of the accessibility needs that relate to alternative audio-visual tracks and resources. In particular there is no solution yet to deal with multi-track audio or video files that have e.g. sign language or audio description tracks in them – not to speak of the issues that can be introduced through dealing with separate media resources from several sites that need to be played back in sync. The latter may be a challenge for future versions of HTML5, since the needs for such synchronisation of multiple resources have to be explored further.
In a first instance, we will require an API to expose in-band tracks, a means to control their activation interactively in a UI, and a description of how they should be rendered. E.g. should a sign language track be rendered as picture-in-picture? Clear audio and sign translation are the two key accessibility needs that can be satisfied with such a multi-track solution.
Finally, another key requirement area for media accessibility is described in a section called “Content Navigation by Content Structure”. This describes the need for vision-impaired users to be able to navigate through a media resource based on semantic markup – think of it as similar to a navigation through a book by book chapters and paragraphs. The introduction of chapter markers goes some way towards satisfying this need, but chapter markers tend to address only big time intervals in a video and don’t let you navigate on a different level to subchapters and paragraphs. It is possible to provide that navigation through providing several chapter tracks at different resolution levels, but then they are not linked together and navigation cannot easily swap between resolution levels.
An alternative might be to include different resolution levels inside a single chapter track and somehow control the UI to manage them as different resolutions. This would only require an additional attribute on text cues and could be useful to other types of text tracks, too. For example, captions could be navigated based on scenes, shots, conversations, or individual captions. Some experimentation will be required here before we can introduce a sensible extension to the given media accessibility framework.
“HTML5 Audio And Video Accessibility, Internationalisation And Usability” talk at Mozilla Summit
For 2 months now, I have been quietly working along on a new Mozilla contract that I received to continue working on HTML5 media accessibility. Thanks Mozilla!
Lots has been happening – the W3C HTML5 accessibility task force published a requirements document, the Media Text Associations proposal made it into the HTML5 draft as a <track> element, and there are discussions about the advantages and disadvantages of the new WebSRT caption format that Ian Hickson created in the WHATWG HTML5 draft.
In attending the Mozilla Summit last week, I had a chance to present the current state of development of HTML5 media accessibility and some of the ongoing work. I focused on the following four current activities on the technical side of things, which are key to satisfying many of the collected media accessibility requirements:
- Multitrack Video Support
- External Text Tracks Markup in HTML5
- External Text Track File Format
- Direct Access to Media Fragments
The first three now already have first drafts in the HTML5 specification, though the details still need to be improved and an external text track file format agreed on. The last has had a major push ahead with the Media Fragments WG publishing a Last Call Working Draft. So, on the specification side of things, major progress has been made. On the implementation – even on the example implementation – side of things, we still fall down badly. This is where my focus will lie in the next few months.
Follow this link to read through my slides from the Mozilla 2010 summit.
Media Fragment URI Specification in Last Call WD
After two years of effort, the W3C Media Fragment WG has now created a Last Call Working Draft document. This means that the working group is fairly confident that they have addressed all the required issues for media fragment URIs and their implementation on HTTP and is asking outside experts and groups for input. This is the time for you to get active, proofread the specification thoroughly and feed back all the concerns that you have and all the things you do not understand!
The media fragment (MF) URI specification specifies two types of MF URIs: those created with a URI fragment (“#”), e.g. video.ogv#t=10,20 and those with a URI query (“?”), e.g. video.ogv?t=10,20. There is a fundamental difference between the two that needs to be appreciated: with a URI fragment you can specify a subpart of a resource, e.g. a subpart of a video, while with a URI query you will refer to a different resource, i.e. a “new” video. This is an important difference to understand for media fragments, because only some things that we want to achieve with media fragments can be achieved with “#”, while others can only be achieved by transforming the resource into a different new bitstream.
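To illustrate the difference, here is a small JavaScript sketch that inspects a URI and reports whether a temporal range was given as a fragment or as a query. It uses the standard `URL` and `URLSearchParams` classes; the function names, the placeholder base `http://example.com/` and the restriction to plain seconds are assumptions of this example, not part of the specification.

```javascript
// Parse "start,end" in plain seconds; either part may be missing.
function parseTemporal(value) {
  const [start, end] = value.split(",");
  return {
    start: start === "" ? 0 : Number(start),
    end: end === undefined ? null : Number(end),
  };
}

// Report how a temporal range is attached to a media URI (hypothetical helper).
function describeMediaUri(uri) {
  // A placeholder base lets relative references such as "video.ogv#t=10,20" parse.
  const url = new URL(uri, "http://example.com/");
  const fromFragment = new URLSearchParams(url.hash.slice(1)).get("t");
  const fromQuery = url.searchParams.get("t");
  if (fromFragment !== null) {
    // Fragment: same resource, the range only selects a part of it.
    return { kind: "fragment", resource: url.pathname, ...parseTemporal(fromFragment) };
  }
  if (fromQuery !== null) {
    // Query: the query string is part of the identity of a new resource.
    return { kind: "query", resource: url.pathname + url.search, ...parseTemporal(fromQuery) };
  }
  return { kind: "none", resource: url.pathname };
}

const viaFragment = describeMediaUri("video.ogv#t=10,20");
const viaQuery = describeMediaUri("video.ogv?t=10,20");
console.log(viaFragment, viaQuery);
```

Note how the `resource` value differs: with “#” it is still the original video, with “?” the query is part of what identifies the resource.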
This all sounds very abstract, so let me give you an example. Say you want to retrieve a video without its audio track. Say you’d rather not download the audio track data, since you want to save on bandwidth. So, you are only interested in getting the video data. The URI that you may want to use is video.ogv#track=video. This means that you don’t want to change the video resource, but you only want to see the video. The user agent (UA) has two options to resolve such a URI: it can either map that request to byte ranges and just retrieve those – or it can download the full resource and ignore the data it has not been requested to display.
Since we do not want the extra bytes of the audio track to be retrieved, we would hope the UA can do the byte range requests. However, most Web video formats will interleave the different tracks of a media resource in time such that a video track will result in a gazillion small byte ranges. This makes it impractical to retrieve just the video through a “#” media fragment. Thus, if we really want this functionality, we have to make the server more intelligent and allow creation of a new resource from the existing one which doesn’t contain the audio. Then, the server, upon receiving a request such as video.ogv#track=video can redirect that to video.ogv?track=video and actually serve a new resource that satisfies the needs.
This is in fact exactly what was implemented in a recently published Firefox Plugin written by Jakub Sendor – also described in his presentation “Media Fragment Firefox plugin”.
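The rewrite from a fragment URI to a query URI discussed above is, at its core, a simple string transformation. The JavaScript sketch below shows only that transformation; the function name `fragmentToQuery` is invented, and this is not the plugin's actual code. Because the fragment part of a URI is not sent in an HTTP request, some component on the UA side has to be involved in this step.

```javascript
// Turn "resource#name=value" into "resource?name=value" (hypothetical helper).
function fragmentToQuery(uri) {
  const hashIndex = uri.indexOf("#");
  if (hashIndex === -1) return uri; // no fragment, nothing to rewrite
  const base = uri.slice(0, hashIndex);
  const fragment = uri.slice(hashIndex + 1);
  if (fragment === "") return base;
  // Append to an existing query string if there already is one.
  const joiner = base.includes("?") ? "&" : "?";
  return base + joiner + fragment;
}

const rewritten = fragmentToQuery("video.ogv#track=video");
console.log(rewritten);
```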
Media Fragment URIs are defined for four dimensions:
- temporal fragments
- spatial fragments
- track fragments
- named fragments
The temporal dimension, when not combined with another dimension, can be easily mapped to byte ranges, since all Web media formats interleave their tracks in time and thus create a simple relationship between time and bytes.
The spatial dimension is a very complicated beast. If you address a rectangular image region out of a video, you might want just the bytes related to that image region. That’s almost impossible since pixels are encoded both aggregated across the frame and across time. Also, actually removing the context, i.e. the image data outside the region of interest may not be what you want – you may only want to focus in on the region of interest. Thus, the proposal for what to do in the spatial dimension is to simply retrieve all the data and have the UA deal with the display of the focused region, e.g. putting a dark overlay over the regions outside the region of interest.
The track dimension is similarly complicated and here it was decided that a redirect to a URI query would be in order in the demo Firefox plugin. Since this requires an intelligent server – which is available through the Ninsuna demo server that was implemented by Davy Van Deursen, another member of the MF WG – the Firefox plugin makes use of that. If the UA doesn’t have such an intelligent server available, it may again be most useful to only blend out the non-requested data on the UA similar to the spatial dimension.
The named dimension is still a largely undefined beast. It is clear that addressing a named dimension cannot be done together with the other dimensions, since a named dimension can represent any of the other dimensions above, and even a combination of them. Thus, resolving a named dimension requires that either the UA or the server understands what the name maps to. If, for example, a track has a name in a media resource and that name is stored in the media header and the UA already has a copy of all the media headers, it can resolve the name to the track that is being requested and take adequate action.
But enough explaining – I have made a screencast of the Firefox plugin in action for all these dimensions, which explains things a lot more concisely than words will ever be able to – enjoy:

And do not forget to proofread the specification and send feedback to public-media-fragment@w3.org.
My first released WordPress plugin

I’m pretty proud of this, which is why I’m dedicating a short blog post to it: today, John and I released my first WordPress plugin as open source to the WordPress plugins site.
It’s got the boring name “External Videos” and builds a bridge between your WordPress instance and videos of channels on a video hosting site – currently supported are YouTube, Vimeo, and DotSub.
It does this by using a brand-new feature to be introduced in WordPress 3: custom post types.
Check out the screenshots on the plugins page to see more – I’m unfortunately not yet running this Website with WordPress 3, so am not yet using this plugin’s features.
In the admin interface of WordPress, you enter the video channels that you want to pull videos from. Then it goes and pulls the videos with their metadata from these sites and creates video posts for them. That pulling is done once a day to update with new posts. The videos can be looked at in the admin interface under a separate video post section. They can be linked to WordPress posts and pages where the video may be discussed in context.
The video posts can be exposed on the WordPress site through a gallery, which is created by a shortcode that can be added to any WordPress page. The gallery of thumbnails clicks through to an overlay with each video and its metadata as well as a link to the related WordPress post.
You can also add a widget to the side bar of the WordPress site with links to the most recent videos.
There are many more features that I want to develop for this plugin. I’d of course like to move it to HTML5 video instead of Adobe Flash. But for now I am happy with it.
I’d like to say thank you to John Ferlito, who helped with some of the coding, to Jeff Waugh for suggesting that it would best be developed using the new post types feature, and to Senator Kate Lundy and Pia Waugh at her office, who funded a part of the development. I am hoping they will find it useful to give their awesome collection of videos better exposure.
VP8/WebM: Adobe is the key to open video on the Web
Google have today announced the open sourcing of VP8 and the creation of a new media format WebM.
Technical Challenges
As I predicted earlier, Google had to match VP8 with an audio codec and a container format – their choice was a subpart of the Matroska format and the Vorbis codec. To complete the technical toolset, Google have:
- developed ffmpeg patches, so an open source encoding tool for WebM will be available
- developed GStreamer and DirectShow plugins, so players that build on these frameworks will be able to decode WebM,
- and developed an SDK such that commercial partners can implement support for WebM in their products.
This has already been successful and several commercial software products are already providing support for WebM.
Google haven’t forgotten the mobile space either – a bunch of hardware providers are listed as supporters on the WebM site and it can be expected that developments have started.
The speed of development of software and hardware around WebM is amazing. Google have done an amazing job at making sure the technology matures quickly – both through their own developments and by getting a substantial number of partners included. That’s just the advantage of being Google rather than a Xiph, but still an amazing achievement.
Browsers
As was to be expected, Google managed to get all the browser vendors that are keen to support open video to also support WebM: Chrome, Firefox and Opera all have come out with special builds today that support WebM. Nice work!
What is more interesting, though, is that Microsoft actually announced that they will support WebM in future builds of IE9 – not out of the box, but on systems where the codec is already installed. Technically, that is the same situation as it will be for Theora, but the difference in tone is amazing: in this blog post, any codec apart from H.264 was condemned and rejected, but the blog post about WebM is rather positive. It signals that Microsoft recognize the patent risk, but don’t want to be perceived as standing in the way of WebM’s uptake.
Apple have not yet made an announcement, but since they are not on the list of supporters and since all their devices exclusively support H.264, it is to be expected that they will not be keen to pick up WebM.
Publishers
What is also amazing is that Google have already achieved support for WebM by several content providers. The first of these is, naturally, YouTube, which is offering a subset of its collection in the WebM format and is continuing to transcode the whole collection. Google also have Brightcove, Ooyala, and Kaltura on their list of supporters, so content will emerge rapidly.
Uptake
So, where do we stand with respect to an open video format on the Web that could even become the baseline codec format for HTML5? It’s all about uptake – if a substantial enough ecosystem supports WebM, it has every chance of becoming a baseline codec format – and that would be a good thing for the Web.
And this is exactly where I have the most respect for Google. The main challenge in getting uptake is in getting the codec into the hands of all people on the Internet. This, in particular, includes people working on Windows with IE, which is still the largest browser from a market share point of view. Since Google could not realistically expect Microsoft to implement WebM support into IE9 natively, they have found a much better partner that will be able to make it happen – and not just on Windows, but on many platforms.
Yes, I believe Adobe is the key to creating uptake for WebM – and this is admittedly something I have completely overlooked previously. Adobe has its Flash plugin installed on more than 90% of all browsers. Most of their users will upgrade to a new version very soon after it is released. And since Adobe Flash is still the de-facto standard in the market, it can roll out a new Flash plugin version that will bring WebM codec support to many many machines – in particular to Windows machines, which will in turn enable all IE9 users to use WebM.
Why would Adobe do this and thus cement its Flash plugin’s replacement for video use by HTML5 video? It does indeed sound ironic that the current market leader in online video technology will be the key to creating an open alternative. But it makes a lot of sense to Adobe if you think about it.
Adobe itself has no substantial standing in codec technology and has traditionally always had to license codecs. Adobe will be keen to move to a free codec of sufficient quality to replace H.264. Also, Adobe doesn’t earn anything from the Flash plugins themselves – their source of income is their authoring tools. All they will need to do to succeed in an HTML5 WebM video world is implement support for WebM and HTML5 video publishing in their tools. They will continue to be the best tools for authoring rich internet applications, even if these applications are now published in a different format.
Finally, in the current hostile space between Apple and Adobe related to the refusal of Apple to allow Flash onto its devices, this may be Adobe’s most ingenious means of getting back at them. Right now, it looks as though the only company that will be left standing on the H.264-only front and outside the open WebM community will be Apple. Maybe implementing support for Theora wouldn’t have been such a bad alternative for Apple. But now we are getting a new open video format and it will be of better quality and supported on hardware. This is exciting.
IP situation
I cannot, however, finish this blog post on a positive note alone. After reading the review of VP8 by an x264 developer, it seems possible that VP8 is infringing on patents that are outside the patent collection that Google has built up in codecs. Maybe Google have allowed for the possibility of a patent suit and put money away for it, but Google certainly haven’t provided indemnification to everyone else out there. It is a tribute to Google’s achievement that given a perceived patent threat – which has been the main inhibitor of uptake of Theora – they have achieved such an uptake and industry support around VP8. Hopefully their patent analysis is sound and VP8 is indeed a safe choice.
UPDATE (22nd May): After having thought about patents and the situation for VP8 a bit more, I believe the threat is really minimal. You should also read these thoughts of a Gnome developer, these of a Debian developer and the emails on the Theora mailing list.
Introducing media accessibility into HTML5
In recent months, people in the W3C HTML5 Accessibility Task Force developed two proposals for introducing caption, subtitle, and more generally time-aligned text support into HTML5 audio and video.
These time-aligned text files can either come as external files that are associated with the timeline of the media resource, or they come as part of the media resource in a binary track.
For both cases we now have proposals to extend the HTML5 specification.
Firstly, let’s look at time-aligned text in external files. The change proposal introduces markup to associate such external files as a kind of “virtual track” with a media resource. Here is an example:
<video src="video.ogv">
<track src="video_cc.ttml" type="application/ttaf+xml" language="en" role="caption"></track>
<track src="video_tad.srt" type="text/srt" language="en" role="textaudesc"></track>
<trackgroup role="subtitle">
<track src="video_sub_en.srt" type="text/srt; charset='Windows-1252'" language="en"></track>
<track src="video_sub_de.srt" type="text/srt; charset='ISO-8859-1'" language="de"></track>
<track src="video_sub_ja.srt" type="text/srt; charset='EUC-JP'" language="ja"></track>
</trackgroup>
</video>
The video resource is “video.ogv”. Associated with it are five timed text resources.
The first one is written in TTML (which is the new name for DFXP), is a caption track and in English. TTML is particularly useful when you want to provide more than just an unformatted piece of text to the viewers. Hearing-impaired users appreciate any visual help they can be provided with to absorb the caption text more quickly. This includes colour coding of speakers, positioning of text close to the speaking person on screen, or even animated musical notes to signify music. Thus, a format like TTML that allows for formatting and positioning information is an appropriate format to specify captions.
All other timed text resources are provided in SRT format, which is a simpler format than TTML with only plain text in the text cues.
The second text track is a textual audio description track. A textual audio description is in fact targeted at the vision-impaired and contains text that is expected to be read out by a screen reader or routed to a braille device. Thus, as the video plays, a vision-impaired user receives additional information about the visual content of the scene through their screen reader or braille device. The SRT format is particularly useful for providing textual audio descriptions since it only provides plain text, which can easily be handed on to assistive technology. When authoring such textual audio descriptions, it is very important to pick time intervals in the original media resource where no other significant audio cue is provided, such that the vision-impaired user is able to listen to the screen reader during that time.
The last three text tracks are subtitle tracks. They are grouped into a trackgroup element, which is not strictly necessary, but enables the author to say that these tracks are supposed to be alternatives. Thus, a Web Browser can create a menu with all the available tracks and put the tracks in the trackgroup into a menu of their own where only one option is selectable (similar to how radio buttons work). Incidentally, the trackgroup element also makes it unnecessary to repeat the role attribute on each of the contained tracks. It is expected that these menus will be added to the default media controls and will thus be visible if the media element has a controls attribute.
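The radio-button behaviour of a trackgroup can be sketched in a few lines of JavaScript. This is only an illustration under assumptions: the `tracks` array of plain objects, its `group` field and the `selectTrack` helper are hypothetical stand-ins for whatever a browser would keep internally; the change proposal does not define them.

```javascript
// Hypothetical in-memory model of the five tracks in the example above.
// The "group" field is an assumption: tracks that share a group value
// are alternatives, tracks with group === null are independent.
const tracks = [
  { src: "video_cc.ttml", role: "caption", language: "en", group: null, enabled: false },
  { src: "video_tad.srt", role: "textaudesc", language: "en", group: null, enabled: false },
  { src: "video_sub_en.srt", role: "subtitle", language: "en", group: "subtitle", enabled: false },
  { src: "video_sub_de.srt", role: "subtitle", language: "de", group: "subtitle", enabled: false },
  { src: "video_sub_ja.srt", role: "subtitle", language: "ja", group: "subtitle", enabled: false },
];

// Enable one track. If it belongs to a group, disable every track in that
// group first, which gives the radio-button behaviour within the group.
function selectTrack(trackList, index) {
  const chosen = trackList[index];
  if (chosen.group !== null) {
    for (const t of trackList) {
      if (t.group === chosen.group) t.enabled = false;
    }
  }
  chosen.enabled = true;
}

selectTrack(tracks, 0); // captions on
selectTrack(tracks, 2); // English subtitles on
selectTrack(tracks, 3); // German subtitles replace the English ones

const enabledSources = tracks.filter((t) => t.enabled).map((t) => t.src);
console.log(enabledSources);
```

Tracks outside a group, such as the caption track, stay enabled when a subtitle is switched, which matches the idea of one general menu plus a separate single-choice menu for the group.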
With the role, type and language attributes, it is easy for a Web Browser to understand what the different tracks have to offer. A Web Browser can even decide to offer new functionality that is helpful to certain user groups. For example, an addition to a Web Browser’s default settings could be to allow users to instruct a Web Browser to always turn on captions or subtitles if they are available in the user’s main language. Or to always turn on textual audio descriptions. In this way, a user can customise their default experience of a media resource over and on top of what a Web page author decides to expose.
Incidentally, the choice of “track” as a name for relating external text resources to a media element has a deeper meaning. It is easily possible in future to extend “track” elements to not just point to dependent text resources, but also to dependent audio or video resources. For example, an actual audio description that is a recording of a human voice rather than a rendered text description could be associated in the same way. Right now, such an implementation is not envisaged by the Browser vendors, but it will be something to work towards in the future.
Now, with such functionality available, there is naturally a desire to be able to control activation or de-activation of text tracks through JavaScript, not just through user interaction. A Web Developer may for example want to override the default controls provided by a Web Browser and run their own JavaScript-based controls, which requires them to create their own selection menu for the tracks.
This is actually also an issue more generally and applies to all track types, including such tracks that come inside an existing media resource. In the current specification such tracks are not exposed and can therefore not be activated.
This is where the second specification that the W3C Accessibility Task Force has worked towards comes in: the media multitrack JavaScript API.
This specification introduces a read-only JavaScript interface to the audio and video elements to allow Web Developers to find out about the tracks (including the virtual tracks) that a media resource offers. The only action that the interface currently provides is to enable or disable tracks.
Here is an example use to turn on a French subtitle track:
if (video.tracks[2].role == "subtitle" && video.tracks[2].language == "fr") video.tracks[2].enabled = true;
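Hard-coding index 2 only works if the author knows the track order in advance. A slightly more robust variant scans the list instead. The sketch below assumes the same `tracks`, `role`, `language` and `enabled` properties as the line above; the plain `video` object is a stand-in so the code runs outside a browser, and `enableTrack` is a hypothetical helper, not part of the proposal.

```javascript
// Stand-in for a media element, so the sketch runs outside a browser.
// Only the properties used in the example above are modelled.
const video = {
  tracks: [
    { role: "caption", language: "en", enabled: false },
    { role: "subtitle", language: "de", enabled: false },
    { role: "subtitle", language: "fr", enabled: false },
  ],
};

// Hypothetical helper: enable the first track matching role and language.
// Returns the index of the enabled track, or -1 if none matches.
function enableTrack(media, role, language) {
  for (let i = 0; i < media.tracks.length; i++) {
    const track = media.tracks[i];
    if (track.role === role && track.language === language) {
      track.enabled = true;
      return i;
    }
  }
  return -1;
}

const frenchIndex = enableTrack(video, "subtitle", "fr");
const missingIndex = enableTrack(video, "subtitle", "ja");
console.log(frenchIndex, missingIndex);
```

Returning -1 for a missing track lets custom controls fall back gracefully, for example by hiding the corresponding menu entry.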
There is still a need to introduce a means to actually expose the text cues as they relate to the currentTime of the media resource. This has not yet been specified in the given proposals.
The text cues could be exposed in several ways. They could be exposed through introducing an event, i.e. every time a new text cue becomes active, a callback is called which is given the active text cue (if such a callback had been registered previously). Another option is to simply write the text cues into a specified div-element in the DOM and thus expose them directly in the Browser. A third idea could be to expose the text cues in an iframe-like element to avoid any cross-site security issues. And a fourth idea that we have discussed is to expose the text cues in an attribute of the track.
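The first option, a callback that fires when a cue becomes active, could look roughly like the sketch below. Nothing here is specified anywhere: the cue objects, `createCueWatcher` and the rule that a cue is active while `start <= currentTime < end` are all assumptions made for illustration.

```javascript
// Hypothetical cue list; times are in seconds.
const cues = [
  { start: 0.0, end: 2.5, text: "Hello." },
  { start: 2.0, end: 4.0, text: "[music]" },
  { start: 5.0, end: 7.0, text: "Goodbye." },
];

// Create a watcher that invokes onCue once each time a cue becomes active.
// A cue is treated as active while start <= currentTime < end.
function createCueWatcher(cueList, onCue) {
  const active = new Set();
  return function update(currentTime) {
    for (const cue of cueList) {
      const isActive = cue.start <= currentTime && currentTime < cue.end;
      if (isActive && !active.has(cue)) {
        active.add(cue);
        onCue(cue);
      } else if (!isActive) {
        active.delete(cue);
      }
    }
  };
}

const seen = [];
const update = createCueWatcher(cues, (cue) => seen.push(cue.text));

// Simulate playback by feeding in a few currentTime values.
for (const t of [0.5, 1.0, 2.2, 4.5, 5.1]) update(t);
console.log(seen);
```

In a browser, `update` would be driven by the media element's time updates, and the callback could write the text into a div, which is the second option listed above.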
All of this obviously also relates to how to actually render the text cues and whether to render them in a shadow DOM so as to make the JavaScript reading separate from the rendering and address security and copyright issues. I’d be curious to hear opinions here on how it should be done.
W3C Media Annotations API standard
Recently, I was asked to review the W3C Media Annotations specifications as they are about to go into Last Call (a state that comes before the request for implementations at the W3C).
The W3C Media Annotations group has defined a set of metadata that they believe is representative and common for media resources. The ontology consists of the following fields:
- ma:identifier: a URI or string to identify a resource
- ma:title: a string providing the title of the resource
- ma:language: a language code describing the language used in the resource
- ma:locator: the URI at which the resource can be accessed
- ma:contributor: a URI or string identifying the contributor and the nature of the contribution
- ma:creator: a URI or string identifying an author
- ma:createDate: a date of creation or publication of the resource
- ma:location: a string or geo code identifying where the resource has been shot/recorded
- ma:description: a string describing the content of the resource
- ma:keyword: a word or word combination providing a topic, keyword or tag representing the resource
- ma:genre: a string providing the genre of the resource
- ma:rating: rating value, including the rating scale
- ma:relation: a URI and string identifying a related resource and the relationship
- ma:collection: a URI or string providing the name of a collection to which the resource belongs
- ma:copyright: a URI or string with the copyright statement
- ma:license: a string or URI with the usage license
- ma:publisher: a string or URI with the publisher of the resource
- ma:targetAudience: a URI and classification string providing the issuer of the classification and the classification value
- ma:fragments: a list of string and URI values that identify media fragments and their type
- ma:namedFragments: a list of string and URI values that provide names to media fragments
- ma:frameSize: a width – height pair in pixels
- ma:compression: a string providing the compression algorithm
- ma:duration: a float to provide the resource duration in seconds
- ma:format: a string providing the MIME type of the resource
- ma:samplingrate: a float with the audio sampling rate
- ma:framerate: a float with the video frame rate
- ma:bitrate: a float providing the average bit rate in kbps
- ma:numTracks: an int of the number of tracks
Note that some of these fields are not single values, but simple constructs of multiple values. Thus, they are actually more complex than name-value pairs that, e.g. are typically used in HTML meta headers or in Dublin Core. I regard this as an issue for implementations.
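To illustrate why composite fields are awkward for implementations, here is a small sketch. The object shapes, the sample values and the `toNameValuePairs` helper are assumptions for illustration only; they are not taken from the specification.

```javascript
// Hypothetical in-memory form of three properties. The object shapes and
// the sample values are assumptions, not taken from the specification.
const annotations = {
  "ma:title": "Elephants Dream",
  "ma:contributor": [
    { identity: "Jane Doe", role: "director" },
    { identity: "John Doe", role: "composer" },
  ],
  "ma:frameSize": { width: 640, height: 360 },
};

// Flatten to name-value string pairs, as an HTML meta header would need.
// Composite values have to be squeezed into one string with an ad hoc
// separator, which is the implementation burden noted above.
function toNameValuePairs(props) {
  const pairs = [];
  for (const [name, value] of Object.entries(props)) {
    const items = Array.isArray(value) ? value : [value];
    for (const item of items) {
      const text =
        typeof item === "object" ? Object.values(item).join("; ") : String(item);
      pairs.push([name, text]);
    }
  }
  return pairs;
}

const pairs = toNameValuePairs(annotations);
console.log(pairs);
```

Every consumer of the flattened form would have to know the separator and the order of the parts, which a plain name-value model does not convey.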
The fields were chosen as typical metadata being available about media resources. The media fragments fields are a bit dubious in this respect, but could be useful in future.
The metadata is determined either from within the resource itself or from a metadata collection about the resource. As such, the document maps several existing metadata and media resource formats to this interface, amongst them:
- XMP
- ID3
- iTunes
- QT
- SearchMonkey
- MediaRDF
- LOM
- METS
- EXIF
- CableLabs 1.1
- CableLabs 2.0
- DIG35
- MIX
- FRBR
- MediaRSS
- TXFeed
- YouTube
- VRA
- IPTC
- TVA
- EBUCore
- EBUP
- MPEG7
- SMPTE
As they didn’t have a mapping table for Ogg content, I offered the following:
| MAWG | Relation | Ogg properties | How to do the mapping | Datatype |
|---|---|---|---|---|
| Descriptive Properties (Core Set) | | | | |
| Identification | | | | |
| ma:identifier | exact | Name | Name field in skeleton header (new) | String |
| ma:title | exact | Title | TITLE field in vorbiscomment header | String |
| | exact | Title | Title field in skeleton header (new) | String |
| | related | Album | ALBUM title in vorbiscomment header | String |
| ma:language | exact | Language | Language field in skeleton header (new) | language code |
| ma:locator | exact | | file URI from system | URI |
| Creation | | | | |
| ma:contributor | exact | Artist, Performer | ARTIST and PERFORMER vorbiscomment headers | Strings |
| ma:creator | related | Organization | ORGANIZATION field in vorbiscomment header | |
| ma:createDate | exact | Date | DATE field in vorbiscomment header | ISO date format |
| ma:location | exact | Location | LOCATION field in vorbiscomment header | String |
| Content description | | | | |
| ma:description | exact | Description | DESCRIPTION field in vorbiscomment header | String |
| ma:keyword | N/A | | | |
| ma:genre | exact | Genre | GENRE field in vorbiscomment header | String |
| ma:rating | N/A | | | |
| Relational | | | | |
| ma:relation | related | Version, Tracknumber | VERSION (version of a title), TRACKNUMBER (CD track) fields in vorbiscomment header | Strings |
| ma:collection | related | Album | ALBUM field of vorbiscomment header | String |
| Rights | | | | |
| ma:copyright | exact | Copyright | COPYRIGHT field of vorbiscomment header | String |
| ma:license | exact | License | LICENSE field of vorbiscomment header | String |
| Distribution | | | | |
| ma:publisher | related | Organization | ORGANIZATION field of vorbiscomment header | String |
| ma:targetAudience | more specific | Role | Role field of Skeleton header (new) | String |
| Fragments | | | | |
| ma:fragments | N/A | | | |
| ma:namedFragments | N/A | | | |
| Technical Properties | | | | |
| ma:frameSize | exact | | extract from binary header of video track | int, int (width x height) |
| ma:compression | exact | Content-type | Content-type field of Skeleton header | MIME type |
| ma:duration | exact | | calculate as duration = last_sample_time – first_sample_time of OggIndex header of skeleton | Float (or rather: rational – rational) |
| ma:format | exact | Content-type | Content-type field of Skeleton header | MIME type |
| ma:samplingrate | exact | | calculate as granulerate = granulerate_numerator / granulerate_denominator of Skeleton header | Rational (or rather int / int) |
| ma:framerate | exact | | calculate as granulerate = granulerate_numerator / granulerate_denominator of Skeleton header | Rational (or rather int / int) |
| ma:bitrate | exact | | calculate as bitrate = length_of_segment / duration from OggIndex headers of skeleton | Float |
| ma:numTracks | exact | Tracknumber | TRACKNUMBER field of vorbiscomment header (track number on album) | Int |
You will notice that the table mentions 4 fields in skeleton with a “new” marker – they are actually proposed fields in skeleton – a bit of coding will be necessary to introduce them into software. The space for these fields already exists in message header fields, so it won’t require a change of the skeleton format.
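The calculated entries in the Technical Properties part of the table amount to simple arithmetic. The sketch below shows it with made-up numbers; the field names follow the table, but the object layout, the sample values and the assumption that `length_of_segment` is measured in bytes are mine. The conversion to kbps is added because ma:bitrate is defined in kbps.

```javascript
// Hypothetical values as they might be read from Skeleton and OggIndex
// headers. Field names follow the table above; the numbers are made up.
const videoTrack = {
  granulerateNumerator: 30000,
  granulerateDenominator: 1001,
  firstSampleTime: 0, // seconds
  lastSampleTime: 60, // seconds
  lengthOfSegment: 7500000, // bytes (assumed unit)
};

// ma:framerate (or ma:samplingrate for audio): the granulerate as a float.
function granulerate(track) {
  return track.granulerateNumerator / track.granulerateDenominator;
}

// ma:duration in seconds.
function duration(track) {
  return track.lastSampleTime - track.firstSampleTime;
}

// ma:bitrate in kbps: bytes to bits, divided by seconds, divided by 1000.
function bitrateKbps(track) {
  return (track.lengthOfSegment * 8) / duration(track) / 1000;
}

console.log(granulerate(videoTrack), duration(videoTrack), bitrateKbps(videoTrack));
```

Keeping the granulerate as a numerator and denominator pair until the last step avoids rounding errors for rates such as 30000/1001, which is why the table lists the datatype as rational.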
In the second specification of the Media Annotations WG, the group offers a standard API to access (i.e. read) the defined fields. They also intend to create an API to write the fields, but I doubt that will be easy because of the vast number of file types they intend to support.
There is basically a single function that allows the extraction of metadata:
MAObject[] getProperty(in DOMString propertyName, in optional DOMString sourceFormat, in optional DOMString subtype, in optional DOMString language, in optional DOMString fragment );
I proposed it may be possible to include this into HTML5 as follows:
interface HTMLMediaElement : HTMLElement {
...
getter MAObject getProperty(in DOMString propertyName, in optional unsigned long trackIndex);
...
}
This would either extract the property for a particular track in a media resource or for the complete resource if no track index is given. The only problem I see is that the returned object is different depending on the requested property – the MAObject is only a parent class for the returned object types. I am not sure it is therefore possible to specify this easily in HTML5.
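The intended behaviour of the optional track index can be sketched with a plain JavaScript function. This is not the Media Annotations API and not an HTML5 interface: the `metadata` store, its layout and the sample values are hypothetical, and the function takes the store as an explicit first argument only so that it runs standalone.

```javascript
// Hypothetical metadata store for one media resource. Resource-level
// values live in "resource"; per-track values live in "tracks".
const metadata = {
  resource: { "ma:title": "Example video", "ma:numTracks": 2 },
  tracks: [
    { "ma:format": "video/theora", "ma:framerate": 25 },
    { "ma:format": "audio/vorbis", "ma:samplingrate": 44100 },
  ],
};

// Sketch of getProperty(propertyName, trackIndex): with a track index it
// reads from that track, without one it reads from the whole resource.
// Returns null when the property or the track is absent.
function getProperty(store, propertyName, trackIndex) {
  const source =
    trackIndex === undefined ? store.resource : store.tracks[trackIndex];
  if (source === undefined || !(propertyName in source)) return null;
  return source[propertyName];
}

console.log(getProperty(metadata, "ma:title"));
console.log(getProperty(metadata, "ma:framerate", 0));
```

Note that the first call returns a string and the second a number. In JavaScript that is unremarkable, but it is exactly the varying return type that makes the function hard to describe in an interface definition.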
Overall I thought the specification was a nice piece of work. I am not sure I agree with all the chosen fields, but that is always an issue with metadata. The most important fields are there and that’s what matters.
HTML5 Media and Accessibility presentation
Today, I was invited to give a talk at my old workplace CSIRO about the HTML5 media elements and accessibility.
A lot of the things that have gone into Ogg and that are now being worked on in the W3C in different working groups – including the Media Fragments and HTML5 WGs – were also of concern in the Annodex project that I worked on while at CSIRO. So I was rather excited to be able to report back about the current status in HTML5 and where we’re at with accessibility features.
Check out the presentation here. It contains a good collection of links to exciting demos of what is possible with the new HTML5 media elements when combined with other HTML features.
I tried something new with this presentation: I wrote it in a tool called S5, which makes use only of HTML features for the presentation. It was quite a bit slower than I expected, e.g. reloading a page always included having to navigate to that page. Also, it’s not easily possible to do drawings, unless you are willing to code them all up in HTML. But otherwise I have found it very useful, in particular for including all the used URLs and video element demos directly in the slides. I was inspired to use this tool by Chris Double’s slides from LCA about implementing HTML 5 video in Firefox.
Google’s challenges of freeing VP8
Since On2 Technology’s stockholders have approved the merger with Google, there are now first requests to Google to open up VP8.
I am sure Google is thinking about it. But … what does “it” mean?
Freeing VP8
Simply open sourcing it and making it available under a free license doesn’t help. That just provides open source code for a codec where relevant patents are held by a commercial entity and any other entity using it would still need to be afraid of using that technology, even if its use is free.
So, Google has to make the patents that relate to VP8 available under an irrevocable, royalty-free license for the VP8 open source base, but also for any independent implementations of VP8. This at least guarantees to any commercial entity that Google will not pursue them over VP8 related patents.
Now, this doesn’t mean that there are no submarine or unknown patents that VP8 infringes on. So, Google needs to also undertake an intensive patent search on VP8 to be able to at least convince themselves that their technology is not infringing on anyone else’s. For others to gain that confidence, Google would then further have to indemnify anyone who is making use of VP8 for any potential patent infringement.
I believe – from what I have seen in the discussions at the W3C – it would only be that last step that will make companies such as Apple have the confidence to adopt a “free” codec.
An alternative to providing indemnification is the standardisation of VP8 through an accepted video standardisation body. That would probably need to be ISO/MPEG or SMPTE, because that’s where other video standards have emerged and there are a sufficient number of video codec patent holders involved that a royalty-free publication of the standard will hold a sufficient number of patent holders “under control”. However, such a standardisation process takes a long time. For HTML5, it may be too late.
Technology Challenges
Also, let’s not forget that VP8 is just a video codec. A video codec alone does not encode a video. There is a need for an audio codec and an encapsulation format. In the interest of staying all open, Google would need to pick Vorbis as the audio codec to go with VP8. Then there would be the need to put Vorbis and VP8 in a container together – this could be Ogg or MPEG or QuickTime’s MOOV. So, apart from all the legal challenges, there are also technology challenges that need to be mastered.
It’s not simple to introduce a “free codec” and it will take time!
Google and Theora
There is actually something that Google should do before they start on the path of making VP8 available “for free”: They should formulate a new license agreement with Xiph (and the world) over VP3 and Theora. Right now, the existing license that was provided by On2 Technologies to Theora (link is to an early version of On2’s open source license of VP3) was only for the codebase of VP3 and any modifications of it, but doesn’t in an obvious way apply to independent re-implementations of VP3/Theora. The new agreement between Google and Xiph should be about the patents and not about the source code. (UPDATE: The actual agreement with Xiph apparently also covers re-implementations – see comments below.)
That would put Theora in a better position to be universally acceptable as a baseline codec for HTML5. It would allow, e.g. Apple to make their own implementation of Theora – which is probably what they would want for iPods and iPhones. Since Firefox, Chrome, and Opera already support Ogg Theora in their browsers using the On2-licensed codebase, they must have decided that the risk of submarine patents is low. So, presumably, Apple can come to the same conclusion.
Free codecs roadmap
I see this as the easiest path towards getting a universally acceptable free codec. Over time then, as VP8 develops into a free codec, it could become the successor of Theora on a path to higher quality video. And later still, when the Internet can handle high-resolution video, we can move on to the BBC’s Dirac/VC2 codec. It’s where the future is. The present is more likely here and now in Theora.
ADDITION:
Please note the comments from Monty from Xiph and from Dan, ex-On2, about the intent that VP3 was to be completely put into the hands of the community. Also, Monty notes that in order to implement VP3, you do not actually need any On2 patents. So, there is probably not a need for Google to refresh that commitment. Though it might be good to reconfirm that commitment.
ADDITION 10th April 2010:
Today, it was announced that Google put their weight behind the Theorarm implementation by helping to make it BSD and thus enabling it to be merged with Theora trunk. They also confirm on their blog post that Theora is “really, honestly, genuinely, 100% free”. Even though this is not a legal statement, it is good that Google has confirmed this.
Accessibility support in Ogg and liboggplay
At the recent FOMS/LCA in Wellington, New Zealand, we talked a lot about how Ogg could support accessibility. Technically, this means support for multiple text tracks (subtitles/captions), multiple audio tracks (audio descriptions parallel to main audio track), and multiple video tracks (sign language video parallel to main video track).
Creating multitrack Ogg files
The creation of multitrack Ogg files is already possible using one of the muxing applications, e.g. oggz-merge. For example, I have my own little collection of multitrack Ogg files at http://annodex.net/~silvia/itext/elephants_dream/multitrack/. But then you are stranded with files that no player will play back.
Multitrack Ogg in Players
As Ogg is now being used in multiple Web browsers in the new HTML5 media formats, there are in particular requirements for accessibility support for the hard-of-hearing and vision-impaired. Either multitrack Ogg needs to become more of a common case, or the association of external media files that provide synchronised accessibility data (captions, audio descriptions, sign language) to the main media file needs to become a standard in HTML5.
As it turns out, both these approaches are being considered and worked on in the W3C. Accessibility data that are audio or video tracks will in the near future have to come out of the media resource itself, but captions and other text tracks will also be available from external associated elements.
The availability of internal accessibility tracks in Ogg is a new use case – something Ogg has been ready to do, but has not gone into common usage. MPEG files on the other hand have for a long time been used with internal accessibility tracks and thus frameworks and players are in place to decode such tracks and do something sensible with them. This is not so much the case for Ogg.
For example, a current VLC build installed on Windows will display captions, because Ogg Kate support is activated. A current VLC build on any other platform, however, has Ogg Kate support deactivated in the build, so captions won’t display. This will hopefully change soon, but we have to look also beyond players and into media frameworks – in particular those that are being used by the browser vendors to provide Ogg support.
Multitrack Ogg in Browsers
Hopefully gstreamer (which is what Opera uses for Ogg support) and ffmpeg (which is what Chrome uses for Ogg support) will expose all available tracks to the browser so they can expose them to the user for turning on and off. Incidentally, a multitrack media JavaScript API is in development in the W3C HTML5 Accessibility Task Force for allowing such control.
The current version of Firefox uses liboggplay for Ogg support, but liboggplay’s multitrack support has been sketchy thus far. So, Viktor Gal – the liboggplay maintainer – and I sat down at FOMS/LCA to discuss this and Viktor developed some patches to make the demo player in the liboggplay package, the glut-player, support the accessibility use cases.
I applied Viktor’s patch to my local copy of liboggplay and I am very excited to show you the screencast of glut-player playing back a video file with an audio description track and an English caption track all in sync:
Further developments
There are still important questions open. For example, how will a player know that an audio description track is to be played together with the main audio track, but a dub track (e.g. a German dub for an English video) is to be played as an alternative? Such metadata for the tracks is something that Ogg is still missing, but that Ogg can be extended with fairly easily through the use of the Skeleton track. It is something the Xiph community is now working on.
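To make the distinction concrete, here is a sketch of how a player might decide once such metadata exists. Everything in it is hypothetical: the `role` values, the `ADDITIVE_ROLES` set and the `tracksToPlay` function are assumptions for illustration and do not describe any existing Skeleton field.

```javascript
// Hypothetical track metadata of the kind Skeleton could carry.
// The "role" values and the ADDITIVE_ROLES set are assumptions.
const audioTracks = [
  { name: "main", role: "main", language: "en" },
  { name: "audesc", role: "audiodescription", language: "en" },
  { name: "dub-de", role: "dub", language: "de" },
];

// Roles that are mixed on top of the base audio rather than replacing it.
const ADDITIVE_ROLES = new Set(["audiodescription"]);

// Decide which audio tracks to play given the user's choices.
// A matching dub replaces the main track; additive tracks are appended.
function tracksToPlay(trackList, { dubLanguage = null, audioDescription = false } = {}) {
  const dub = trackList.find((t) => t.role === "dub" && t.language === dubLanguage);
  const base = dub || trackList.find((t) => t.role === "main");
  const extras = audioDescription
    ? trackList.filter((t) => ADDITIVE_ROLES.has(t.role))
    : [];
  return [base, ...extras].map((t) => t.name);
}

console.log(tracksToPlay(audioTracks, { audioDescription: true }));
console.log(tracksToPlay(audioTracks, { dubLanguage: "de" }));
```

The point of the sketch is only that the player needs one extra piece of information per track, namely whether it is additive or alternative, before it can make this choice on its own.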
Summary
This is great progress towards accessibility support in Ogg and therefore in Web browsers. And there is more to come soon.