New proposal for captions and other timed text for HTML5
The first specification for how to include captions, subtitles, lyrics, and similar time-aligned text with HTML5 media elements has received a lot of feedback – probably because there are several demos available.
The feedback has encouraged me to develop a new specification that addresses those concerns and makes it easier to associate out-of-band time-aligned text (i.e. subtitles stored in files separate from the video/audio file). A simple example of the new specification using srt files is this:
<video src="video.ogv" controls>
<itextlist category="CC">
<itext src="caption_en.srt" lang="en"/>
<itext src="caption_de.srt" lang="de"/>
<itext src="caption_fr.srt" lang="fr"/>
<itext src="caption_jp.srt" lang="ja"/>
</itextlist>
</video>
By default, the charset of the itext file is UTF-8, and the default format is text/srt (incidentally a mime type that still needs to be registered). Also by default, the browser is expected to select for display the track that matches the browser's default language setting. This has proven to work well in the previous experiments.
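The default track selection could be sketched roughly as follows. This is only an illustration, not part of the specification: the function name `selectDefaultTrack`, the plain-object representation of the itext elements, and the fallback from a regional language tag such as "en-US" to its primary language are all my assumptions.

```javascript
// Hypothetical helper: pick the itext track whose lang matches the browser language.
// `tracks` mirrors the <itext> children of one <itextlist>: [{ src, lang }, ...].
function selectDefaultTrack(tracks, browserLang) {
  const wanted = browserLang.toLowerCase();
  // 1. Exact match, e.g. browser "de" and track "de".
  const exact = tracks.find((t) => t.lang.toLowerCase() === wanted);
  if (exact) return exact;
  // 2. Assumed fallback: compare primary subtags, e.g. browser "en-US" and track "en".
  const primary = wanted.split("-")[0];
  const partial = tracks.find(
    (t) => t.lang.toLowerCase().split("-")[0] === primary
  );
  // 3. No match at all: return null, meaning no track is displayed by default.
  return partial || null;
}

const captions = [
  { src: "caption_en.srt", lang: "en" },
  { src: "caption_de.srt", lang: "de" },
  { src: "caption_fr.srt", lang: "fr" },
];
```

In a browser, the second argument would typically come from `navigator.language`.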
Check out the new itext specification, read on to get an introduction to what has changed, and leave me your feedback if you can!
The itextlist element
You will have noticed that in comparison to the previous specification, this specification contains a grouping element called “itextlist”. This is necessary because we have to distinguish between alternative time-aligned text tracks and ones that can be additional, i.e. displayed at the same time. In the first specification this was done by inspecting each itext element’s category and grouping them together, but that resulted in much repetition and unreadable specifications.
Also, it was not clear which itext elements were to be displayed in the same region and which in different ones. Now, their styling can be controlled uniformly.
The final advantage is that association of callbacks for entering and leaving text segments as extracted from the itext elements can now be controlled from the itextlist element in a uniform manner.
This change also makes it simple for a parser to determine the structure of the menu that is created and included in the controls element of the audio or video element.
Incidentally, a patch for Firefox already exists that makes this part of the browser. It does not yet support this new itext specification, but here is a screenshot that Felipe Corrêa da Silva Sanches created to demonstrate it:
If several itextlist elements are specified, that menu will receive sub-menus, one for each itextlist. An example is the following:
<video src="video.ogv" aria-label="test video" controls>
<itextlist category="SUB" name="subtitles">
<itext src="sub_en.srt" lang="en"/>
<itext src="sub_de.srt" lang="de"/>
<itext src="sub_fr.srt" lang="fr"/>
<itext src="sub_jp.srt" lang="ja"/>
</itextlist>
<itextlist category="TAD" name="spoken transcript">
<itext id="tad_en" src="tad_en.srt" lang="en"/>
<itext id="tad_jp" src="tad_jp.srt" lang="ja"/>
</itextlist>
</video>
which will result in the following menu structure:
text
- subtitles
-- English
-- German
-- French
-- Japanese
-- none
- spoken transcript
-- English
-- Japanese
-- none
Similarly, a context menu would use the same structure.
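To illustrate how simple the mapping from markup to menu becomes, here is a rough sketch of how a parser might derive the menu structure. The function `buildTextMenu`, the `LANGUAGE_NAMES` table, and the object layout are hypothetical; a real browser would read the DOM and use its own localized language names.

```javascript
// Hypothetical display names; a browser would supply its own localized names.
const LANGUAGE_NAMES = { en: "English", de: "German", fr: "French" };

// `lists` mirrors the markup: one entry per <itextlist>, each with its <itext> children.
function buildTextMenu(lists) {
  return {
    label: "text",
    items: lists.map((list) => ({
      label: list.name,
      // One entry per track, plus a final "none" entry to switch this list off.
      items: list.tracks
        .map((t) => LANGUAGE_NAMES[t.lang] || t.lang)
        .concat("none"),
    })),
  };
}

const menu = buildTextMenu([
  { category: "SUB", name: "subtitles", tracks: [{ lang: "en" }, { lang: "de" }, { lang: "fr" }] },
  { category: "TAD", name: "spoken transcript", tracks: [{ lang: "en" }] },
]);
```

Each itextlist becomes exactly one sub-menu, so no grouping by category is needed at this stage.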
Callbacks on timed text segments
This specification further introduces callbacks on time-aligned text segments: onenter and onleave. At this stage this is an idea I am experimenting with, but I believe it has lots of potential to allow people to do fancy things when subtitles appear or disappear. Some ideas are: to have a specific picture displayed that relates to the text segment, to have text in another area of the display change e.g. because we have moved into a different part of the full text transcript, or to display Google ads that relate to the text in that particular text segment.
I am curious about feedback on this idea. It relates closely to the idea of cue ranges that was previously part of HTML5.
It is possible to achieve this effect simply through adding a timeupdate event listener, but proper callbacks like these are much more efficient.
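For comparison, the timeupdate approach could look roughly like this sketch. The segment objects, the function `diffSegments`, and the handler names in the commented wiring are assumptions for illustration only.

```javascript
// Segments as they might be extracted from an itext file (times in seconds).
const sampleSegments = [
  { id: 1, start: 0, end: 5, text: "first" },
  { id: 2, start: 5, end: 10, text: "second" },
];

// Returns which segments were entered and left between two playback times.
function diffSegments(segments, previousTime, currentTime) {
  const isActive = (seg, t) => t >= seg.start && t < seg.end;
  const entered = [];
  const left = [];
  for (const seg of segments) {
    const was = isActive(seg, previousTime);
    const is = isActive(seg, currentTime);
    if (!was && is) entered.push(seg);
    if (was && !is) left.push(seg);
  }
  return { entered, left };
}

// Browser wiring (assumes a <video> element in `video`; onEnter and onLeave
// are hypothetical handlers supplied by the page author):
// let last = -1;
// video.addEventListener("timeupdate", () => {
//   const { entered, left } = diffSegments(sampleSegments, last, video.currentTime);
//   left.forEach(onLeave);
//   entered.forEach(onEnter);
//   last = video.currentTime;
// });
```

Note that this scans every segment on each timeupdate event, and a segment that begins and ends between two consecutive events is never reported. Both are reasons to prefer callbacks provided by the browser.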
Synchronisation adjustments
Another addition to the itext element is the introduction of two attributes that together allow fixing synchronisation issues in the timing between the video (or audio) and the itext track. The two attributes are “delay” and “stretch”.
“delay” allows specification of a negative or positive float value that represents the amount of seconds with which to delay the display of the itext text segments relative to the timing of the video (or audio) element.
“stretch” allows fixing a constant drift in timing between the video (or audio) element and the text segments. It is given in percent, where 100% means no time stretch, 97% means the text segments are displayed 3% faster than their actual timing, and 108% means 8% slower.
These attributes are relevant since itext files are resources independent of the media resource and can therefore be synchronised to a different clock than the media files. This happens frequently with srt files that are being used for differently encoded video files.
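A small sketch of how the two attributes might be applied to a text segment follows. The function `adjustSegment` is hypothetical, and the order of operations (scale by stretch first, then shift by delay) is my assumption; it is not fixed by the description above.

```javascript
// Assumed interpretation: stretch scales the segment times, then delay shifts them.
// `stretch` is a percentage (100 = unchanged); `delay` is in seconds and may be negative.
function adjustSegment(segment, { delay = 0, stretch = 100 } = {}) {
  const adjust = (t) => (t * stretch) / 100 + delay;
  return { ...segment, start: adjust(segment.start), end: adjust(segment.end) };
}

const original = { start: 100, end: 104, text: "Hello" };

// stretch="97": the segment is shown 3% earlier in absolute time (97 s instead of 100 s).
const faster = adjustSegment(original, { stretch: 97 });

// delay="-1.5": the segment is shown 1.5 seconds earlier (98.5 s instead of 100 s).
const delayed = adjustSegment(original, { delay: -1.5 });
```

Because stretch is a multiplier, its effect grows with the position in the file, which is what corrects a constant drift; delay only corrects a fixed offset.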
Further feedback
I am currently experimenting with creating the same kind of JavaScript API for in-line annotation tracks through extending some Firefox patches. It is exciting to see it all come together.
At the same time, I am sure there is still feedback that will further improve the specification and I encourage you to contribute. I have set up a wiki page where you can leave your feedback. Also feel free to drop me an email or leave a comment on this blog post. Thanks!
UPDATE 30th Oct 2009:
There is now also a working implementation that demonstrates the approach with itextlist. Check out http://www.annodex.net/~silvia/itext/elephant_no_skin_v2.html, which will not look much different to the previous version, but does indeed behave very differently.
on October 6th, 2009 at 8:37 pm
Silvia,
I get the impression that all the functionality you need is already available in various specs, and could be re-used easily without inventing new syntax.
The category and name functionality could be picked up from XHTML role, if I’m not mistaken. The itext/itextlist is SMIL par and text (or ref). Both of these specs are modularized, so you should be able to pick up just the pieces you need. In the namespaced XML world then you would be done, in the HTML world you would need a bit of extra work to import things into your spec.
on October 7th, 2009 at 3:01 am
[...] HTML5 proposal (iTextlist): An interesting proposal for adding subtitles to HTML5 <video /> tags. [...]
on October 7th, 2009 at 4:00 am
[...] http://blog.gingertech.net/2009/10/06/new-proposal-for-captions-and-other-timed-text-for-html5/ a few seconds ago from web [...]
on October 7th, 2009 at 5:03 am
Hello;
a good proposal after all; alas, I don’t believe in additional data structures imposed through hypertext mark-up, which is already structured data in itself.
For this type of business my recommendation is scripting an extensible subtitle language in XML, where we could set all the elements and attributes as they are needed:
[timeline]
..
[group name="fall of the chopper"]
…
[scene order="122" start="15:25" end="15:40" /]
…
[/group]
..
[/timeline]
…
[sub type="ambiance" scene="122"]sounds of the chopper breaking-down[/sub]
[sub scene="122" source="6" imperatives="true" index="1"]Hey Julian, WATCH OUT![/sub]
[sub scene="122" source="0" index="-1"]Mom, are we gonna be all right?[/sub]
[sub scene="123" source="6" index="1"]Are you OK?[/sub]
…
In my book, this example is “marking up”, not the one in yours. And for this kind of mark-up, HTML is definitely not the most suitable environment; we need to have our own set of rules, elements and attributes.
To support and validate the above example, an appropriate DTD could be defined accordingly.
Yours, on the other hand, links to a different set of information, one that the actual HTML document is not supposed to represent: like a movie file that is linked from the HTML, but not marked up or embedded as part of its native structure, as we tried and failed to do with base64 images.
In my opinion, this is a non-HTML matter, and necessarily an XML one; therefore it could be linked to an HTML document, like RDF, as metadata inside the HEAD.
Thanks a lot for this well-written article so we could think about it and write something
best regards
p.s. MODERATOR: this is final – I promise
on October 7th, 2009 at 8:09 am
@Jack thanks for the comments – and you are right: there are plenty of existing syntax elements in other specifications that could be tweaked, adapted and possibly re-used. However, none of them really fit.
“category” is very different to “role” – it is the category of time-aligned text we are talking about, and there is a limited list of categories that is part of the spec.
“name” could be replaced by “title” or something else – I am not particularly fussed about this though I needed it as an attribute rather than as content model, which would have been more obvious.
I am also consciously refraining from re-implementing SMIL. I do not want the full complexity of the “seq” and “par” elements. Also, the “text” or “ref” elements do not compare to “itext” which references a particular type of interactive text files similar to how “img” references particular types of image files.
Further, HTML doesn’t do namespaces, so every adoption from another standard would need to be replicated into HTML anyway. And since there is not an exact match between the needs that itext and itextlist express and those provided by other specs, I’d rather avoid that complexity.
The important thing here, though, is that we have looked at existing syntaxes and have learnt from them, so even though there is no direct re-use, there is indeed conceptual re-use and learning.
on October 7th, 2009 at 8:24 am
@kunter There is no need to merge the markup of subtitles (or other time-aligned text for that matter) with HTML directly, just as there is no need to base64 encode images and include them directly in HTML. The itext proposal replicates for subtitles what we do for images and thus it follows completely along the HTML philosophy. DFXP is more than enough mark-up for subtitles, and so is srt or any of the other millions of formats that people have come up with over the years.
As for linking to subtitles in an HTML head element: that won’t work when you have multiple video elements on the web page. You really do need a solution that clearly associates with a particular video element.
on October 7th, 2009 at 9:18 am
Is the intent here that the name attribute of the itextlist element be localized as it’s sent down the wire? Just wondering if the idea is that it is a well-defined string that the browser localizes or it’s something that should be localized before the UA sees it.
on October 7th, 2009 at 9:42 am
@blizzard The “name” attribute could indeed be localized before the UA sees it, since it will get displayed in the menu. The idea is however to allow the page author to influence the name of the menu. This may be a bad idea, I don’t know – I’m happy for suggestions there.
on October 16th, 2009 at 6:54 pm
>> Callbacks on timed text segments <<
>> It is possible to achieve this effect simply through adding a timeupdate event listener, but proper callbacks like these are much more efficient. <<
>> I am also consciously refraining from re-implementing SMIL. I do not want the full complexity of the “seq” and “par” elements. <<
I'm also interested in how timed-event 'fragments' might be handled — SMIL seems oriented to complete presentations — and the possibility of live timed events: for example, subtitling of live broadcasts, or content pushed or pulled in addition to live streaming, such as commentary and additional content broadcast during a concert.
We've also been looking at how to implement custom timed 'events' in a flexible way (though states might be a better word). For example, a carousel widget could listen for chapter events emitted by a video:
{
"start": 20.00,
"end": 30.00,
"sender": "video#myVideo",
"type": "chapter",
"title": "Single-celled organisms",
"description": "Single-celled microorganisms began to develop 3-4 billion years ago.",
"src": "single_cell.jpg",
"href": "en.wikipedia.org/wiki/Microorganism"
}
(Data here is shown as a JavaScript object literal, but could be in other formats. The main thing is that the event object properties can have any name and any value type.)
Alternatively, an element could emit timed CSS, HTML (or even JavaScript) events:
{
"start": 5.00,
"end": 10.00,
"sender": "video#myVideo",
"event": "timeupdate",
"receiver": "div#subtitle",
"type": "HTML",
"value": "It is half past nine and we've only just passed Sheffield"
}
Subtitles could be made bold for a few seconds like this:
{
"start": 62.13,
"end": 65.29,
"sender": "video#myVideo",
"event": "timeupdate",
"receiver": "div#subtitle",
"type": "CSS",
"value": {"font-weight": "bold"}
}
on October 17th, 2009 at 8:13 pm
@Sam
Thanks for the extensive feedback – I’ve added it to https://wiki.mozilla.org/Accessibility/Experiment2_feedback .
Re SMIL: yes, it is oriented towards multi-media presentations where the timeline is in control – that is not how Web pages work, which are essentially static text content enriched with interactive and media elements. Thus the poor fit.
Re live timed events: I think it is possible to point the video src url to a live broadcast, which then gets updated continuously. It might make sense to turn off the controls for such an element. I also don’t see a problem in attaching a subtitle file that is continuously updated in an itext element to such a live video source. The text could continue to be pushed/pulled. With javascript, it would also be possible to continue pulling other content, such as images or other text.
Re timed events: I assume you are saying that you like the callback methods that were introduced on the timed text segments, since they allow you to do such timed updates? I can see how that would be possible – nice ideas!
on October 20th, 2009 at 8:15 pm
>> Re timed events: I assume you are saying that you like the callback methods that were introduced on the timed text segments, since they allow you to do such timed updates? <<
That's right. I was also trying to say (probably not very clearly!) that it would be good to be able to listen for enter and leave events as well as being able to attach event handler callbacks to onenter and onleave — if only to encourage coders to move JavaScript out of HTML.
on October 20th, 2009 at 10:55 pm
@Sam
Yes, the addition of enter and leave events makes sense and makes it complete.
on October 30th, 2009 at 3:07 am
It would be fantastic if there were a common video standard to embed text in a video file that every browser could just access the same way as the video and audio streams inside.
Your idea is a hack around the stupid patent driven reality.
Your idea is well designed but I am afraid it also reinforces the reality. Why should people care to use an open standard for video with text embedding abilities if they can just use your hack?
People saving the video will end up with a useless file not containing the text they might need. And no indication of that before they save it. The text will mysteriously disappear.
on October 30th, 2009 at 8:05 am
@Doris
There are use cases for both, text inside a video/audio file, and text outside, but related. For most Web developers, outside is in fact a lot more sensible since it’s easier to update such files.
I am neither trying to avoid a patent reality nor trying to hack around issues. Even if there existed only a single format in which we encoded and encapsulated audio and video, I would still propose to use both: in-stream and external (out-of-band) time-aligned text.
For Web pages we already have the reality that they consist of many files that together create a consistent presentation. It’s not a problem – we have zip/tar files and many other means to solve these issues.
In fact, I think it will be even less of a problem for video, since a server can provide the additional service of embedding text files (or, in fact, text from a database) inside a video file upon download, should that be desirable. Also, it is possible to download the video and the associated text files as a package. If the text “mysteriously disappears”, it’s a feature/bug of the Website rather than a fundamental design issue.