Six weeks ago, on a fatal Saturday, both my washing machine and cute little Mario died in one day. The washing machine was quickly repaired, but there was no hope for Mario, as the burnt smell of electronics indicated. It wasn’t going to start up again.
Mario had been the first server to run the code developed at Vquence. It was our development and testing server for more than 8 months until we moved to a server at The Planet – later to Voxel and now ultimately to Amazon.
After it was relieved of Vquence duty, Mario became what it was originally bought to become: a media server. Running Linux and MythTV, it was the beloved center of our living room for the last 2 years. But it seems the heavy-duty VCR work as well as running Linux exhausted him.
Well, it is now replaced by an ordinary HP machine – I will miss the cute little shuttle.
If anyone wants the remains, let me know.
UPDATE: The best demo I have seen so far of many of DFXP’s features is at http://www.w3.org/2009/02/ThisIsCoffee.html.
The W3C has published a third last call for the draft specification of DFXP, the Distribution Format Exchange Profile for the Timed Text Authoring Format – or in short: their new standard format for captions. Comments are due by 30th June, so rush if you want to give any feedback. Here is what came to my mind as I was reading the 183-page document.
Please note: This review looks at DFXP from a Web view, i.e. how compatible it is with existing Web technologies, since my main use case will be on the Web. Advocates will say that the Web is not its main purpose, which is strange for a standard coming out of the W3C.
The state of affairs with caption formats
When it comes to captions and subtitles, there is no lack of formats. Because it is an easy challenge to define a data format for something as simple as a piece of text and some timing information, it seems that every new project that wanted to deal with captions – or more generally timed text – created its own format. I am no exception to the rule.
Thus, the current state of affairs wrt timed text is that there are many different textual file formats to store such data, there are also many different video container formats each with their own data format (or even formats) for embedding timed text into them, and there is a lot of software that will deal with many input, output and encapsulation formats.
The problem with this situation is that the formats are all different in their complexity. The simple “piece of text and timing information” problem can be turned into as complex a problem as you desire. By adding layout information, styling information, animation functionality, metadata about the video and about the content, and possibly hyperlinks, we have ended up in a large mess of incompatible formats.
The aim of W3C Timed Text
The W3C Timed Text working group was chartered in January 2003 to attack this issue. It was supposed to become the super-format of all possible functionalities for timed text formats and therefore a perfect interchange format between applications (see requirements document). Its focus was for use on the Web and with SMIL and to make use of existing W3C technologies where possible.
However, the history of captioning is TV and the scope of Timed Text is beyond mere use on the Web, so while W3C Timed Text took a lot of inspiration from other Web standards, it has become a stand-alone standard that does not rely on, e.g. the availability of a CSS engine, and it has no in-built hyperlinking functionality (see what requirements it fulfills).
So, let’s look into some of what DFXP provides.
Here is an example file taken straight from the draft – check the presentation here:
<tt xml:lang="" xmlns="http://www.w3.org/2006/10/ttaf1">
  <head>
    <metadata xmlns:ttm="http://www.w3.org/2006/10/ttaf1#metadata">
      <ttm:title>Timed Text DFXP Example</ttm:title>
      <ttm:copyright>The Authors (c) 2006</ttm:copyright>
    </metadata>
    <styling xmlns:tts="http://www.w3.org/2006/10/ttaf1#styling">
      <!-- s1 specifies default color, font, and text alignment -->
      <style xml:id="s1" tts:color="white" tts:fontFamily="proportionalSansSerif" tts:fontSize="22px" tts:textAlign="center" />
      <!-- alternative using yellow text but otherwise the same as style s1 -->
      <style xml:id="s2" style="s1" tts:color="yellow"/>
      <!-- a style based on s1 but justified to the right -->
      <style xml:id="s1Right" style="s1" tts:textAlign="end" />
      <!-- a style based on s2 but justified to the left -->
      <style xml:id="s2Left" style="s2" tts:textAlign="start" />
    </styling>
    <layout xmlns:tts="http://www.w3.org/2006/10/ttaf1#styling">
      <region xml:id="subtitleArea" style="s1" tts:extent="560px 62px" tts:padding="5px 3px" tts:backgroundColor="black" tts:displayAlign="after" />
    </layout>
  </head>
  <body region="subtitleArea">
    <div>
      <p xml:id="subtitle1" begin="0.76s" end="3.45s">
        It seems a paradox, does it not,
      </p>
      <p xml:id="subtitle2" begin="5.0s" end="10.0s">
        that the image formed on<br/>
        the Retina should be inverted?
      </p>
      <p xml:id="subtitle3" begin="10.0s" end="16.0s" style="s2">
        It is puzzling, why is it<br/>
        we do not see things upside-down?
      </p>
      <p xml:id="subtitle4" begin="17.2s" end="23.0s">
        You have never heard the Theory,<br/>
        then, that the Brain also is inverted?
      </p>
      <p xml:id="subtitle5" begin="23.0s" end="27.0s" style="s2">
        No indeed! What a beautiful fact!
      </p>
      <p xml:id="subtitle6a" begin="28.0s" end="34.6s" style="s2Left">
        But how is it proved?
      </p>
      <p xml:id="subtitle6b" begin="28.0s" end="34.6s" style="s1Right">
        Thus: what we call
      </p>
      <p xml:id="subtitle7" begin="34.6s" end="45.0s" style="s1Right">
        the vertex of the Brain<br/>
        is really its base
      </p>
      <p xml:id="subtitle8" begin="45.0s" end="52.0s" style="s1Right">
        and what we call its base<br/>
        is really its vertex,
      </p>
      <p xml:id="subtitle9a" begin="53.5s" end="58.7s">
        it is simply a question of nomenclature.
      </p>
      <p xml:id="subtitle9b" begin="53.5s" end="58.7s" style="s2">
        How truly delightful!
      </p>
    </div>
  </body>
</tt>
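To get a feel for how much work a consumer of such a file has to do, here is a minimal Python sketch that pulls the cues out of a shortened copy of the example above. It is only an illustration: the helper names (seconds, cue_text, read_cues) are made up for this post, only offsets of the form "5.0s" are handled, and styling, regions and nested timing are ignored.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the draft's example; xml:id lives in the standard XML namespace.
TT_NS = "http://www.w3.org/2006/10/ttaf1"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened copy of the example file (assumption: head section is not needed here).
SAMPLE = """<tt xml:lang="" xmlns="http://www.w3.org/2006/10/ttaf1">
  <body>
    <div>
      <p xml:id="subtitle1" begin="0.76s" end="3.45s">
        It seems a paradox, does it not,
      </p>
      <p xml:id="subtitle2" begin="5.0s" end="10.0s">
        that the image formed on<br/>
        the Retina should be inverted?
      </p>
    </div>
  </body>
</tt>"""


def seconds(value):
    """Convert an offset such as '5.0s' to a float; other metrics are not handled."""
    if value.endswith("ms") or not value.endswith("s"):
        raise ValueError(f"unsupported time expression: {value!r}")
    return float(value[:-1])


def cue_text(p):
    """Collect the text of a <p>, starting a new line at every <br/>."""
    lines = [""]
    lines[-1] += p.text or ""
    for child in p:
        if child.tag == f"{{{TT_NS}}}br":
            lines.append("")
        else:
            lines[-1] += "".join(child.itertext())
        lines[-1] += child.tail or ""
    # Collapse the indentation whitespace inside each line.
    return "\n".join(" ".join(line.split()) for line in lines)


def read_cues(document):
    """Return (id, begin, end, text) tuples for every <p> in document order."""
    root = ET.fromstring(document)
    return [
        (p.get(XML_ID), seconds(p.get("begin")), seconds(p.get("end")), cue_text(p))
        for p in root.iter(f"{{{TT_NS}}}p")
    ]


cues = read_cues(SAMPLE)
for cue in cues:
    print(cue)
```

Even this toy reader has to know about two namespaces before it can find a single caption, which hints at the authoring overhead the format carries.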
I’m going to look at each of the different functionalities separately and discuss their strengths and weaknesses.
Let’s begin with the body of the DFXP document and what elements are defined for this area.
Firstly, <body> comes with optional begin, end, and dur attributes. As is the case for all time specifications in DFXP, there are both “end” and “dur” attributes. Why this over-specification? There is not even an explanation which of the two has higher priority when in conflict. This is plainly asking for trouble – why not simplify the spec?
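To make the conflict concrete, here is a small Python sketch of one possible resolution rule. The rule itself is an assumption borrowed from the SMIL timing model (the interval stops at whichever of begin + dur and end comes first), not something I am quoting from the DFXP draft, and the function name active_interval is made up.

```python
def active_interval(begin=0.0, end=None, dur=None):
    """Return (start, stop) in seconds on the parent's timeline.

    Assumed rule: when both "end" and "dur" are given, the earlier stop wins.
    stop is None when neither attribute is present (open-ended interval).
    """
    candidates = []
    if dur is not None:
        candidates.append(begin + dur)
    if end is not None:
        candidates.append(end)
    stop = min(candidates) if candidates else None
    return begin, stop


print(active_interval(5.0, end=10.0, dur=3.0))  # dur ends first
print(active_interval(5.0, end=7.0, dur=3.0))   # end comes first
```

Any other rule (for example "dur always wins") would be just as easy to implement, which is exactly why the spec should pick one and say so.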
The “region” and “style” attributes refer to a previously defined region and styles that are applied to the body. The “id” and “lang” attributes associate a name and a language with the body.
The “timeContainer” attribute enables the author to specify whether the elements in the body are all to be regarded as temporally parallel or in sequence, the default being parallel. This means that all text elements specified inside the body can render over the top of each other – a situation that is solved by giving them specific start and end times.
The child elements of body are a sequence of <div> tags. The div element functions as a logical container and a temporal structuring element for a sequence of textual content units. div elements, like body elements, are allowed “begin”, “end” and “dur” attributes and generally everything that the body element also has, except that their children can be more div or p elements. Again, the children of the div element are all regarded as being temporally parallel.
The p element is basically the inner-most element that contains the actual text, including new-lines (br) and spans to associate further styling, metadata, or animations. The children of the p or span element are also all regarded as being temporally parallel, unless otherwise specified.
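The difference between parallel and sequential containers is easiest to see with numbers. The following Python sketch is a deliberately simplified model, not the draft's algorithm: each node only has a "begin" offset and a "dur", the "end" attribute is ignored, and the names Node and schedule are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    begin: float = 0.0        # offset in seconds
    dur: float = 0.0          # own duration; ignored when the node has children
    container: str = "par"    # "par" (parallel) or "seq" (sequential)
    children: list = field(default_factory=list)


def schedule(node, origin=0.0, out=None):
    """Return (end_time, leaves), where leaves is a list of (name, start, end)."""
    out = [] if out is None else out
    start = origin + node.begin
    if not node.children:
        out.append((node.name, start, start + node.dur))
        return start + node.dur, out
    end = start
    cursor = start
    for child in node.children:
        child_end, _ = schedule(child, cursor, out)
        end = max(end, child_end)
        if node.container == "seq":
            # In a sequence, the next child is timed from the end of this one.
            cursor = child_end
    return end, out


def make_body(kind):
    return Node("body", container=kind, children=[
        Node("a", begin=0.0, dur=2.0),
        Node("b", begin=1.0, dur=2.0),
    ])


print(schedule(make_body("par"))[1])  # a and b overlap
print(schedule(make_body("seq"))[1])  # b starts after a has ended
```

With the default "par", the two paragraphs overlap between seconds 1 and 2 unless the author spaces out their begin and end times by hand.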
The structuring of text into div, p, and span elements seems to make sense and provide sufficient (if not even excessive) flexibility for any required timed text needs.
Once the text is specified and structured, the next question is where it should be positioned.
The extent attribute of the <tt> root element specifies the width and height of the root container, if not specified by the external authoring context.
Inside the root container, regions are defined through explicit <region> elements. The origin of placement for a region is the top left corner. Regions can define their “origin” offset, their width and height (through “extent”, as in the example above), the alignment of text within them through the “textAlign” and “displayAlign” styles, and whether text that “overflows” a region should be visible or hidden.
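As a rough illustration of the geometry, this Python sketch turns an origin and an extent into a rectangle and checks whether it fits into the root container. The extent "560px 62px" comes from the example file; the origin "40px 400px" and the root extent "640px 480px" are values I made up, the helper names are hypothetical, and only px lengths are handled.

```python
def parse_px_pair(value):
    """Parse a pair such as '560px 62px' into (560.0, 62.0); px lengths only."""
    first, second = value.split()
    for part in (first, second):
        if not part.endswith("px"):
            raise ValueError(f"unsupported length: {part!r}")
    return float(first[:-2]), float(second[:-2])


def region_box(origin, extent):
    """Return (left, top, right, bottom) measured from the root's top left corner."""
    x, y = parse_px_pair(origin)
    width, height = parse_px_pair(extent)
    return (x, y, x + width, y + height)


def fits(root_extent, box):
    """True if the box lies completely inside a root container of the given extent."""
    root_width, root_height = parse_px_pair(root_extent)
    left, top, right, bottom = box
    return left >= 0 and top >= 0 and right <= root_width and bottom <= root_height


box = region_box("40px 400px", "560px 62px")
print(box, fits("640px 480px", box))
```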
The way in which DFXP defines regions and placement of text within regions is very different to the way in which HTML and CSS work. By default, elements in HTML flow one after another in the same order as they appear in the source. CSS attributes applied to the elements can control their positioning through giving coordinates, or relative placements in relation to other elements. In DFXP elements are placed inside regions that are styled, making it incompatible with HTML.
The styling attributes available for DFXP are limited, but sufficient for timed text purposes. The way in which style associations to elements are resolved is quite diverse. Styles can be associated with regions, with individual elements, individually and as a group, through layouts and through parent elements. Compared to CSS, it feels complicated and potentially full of contradictions.
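One of these mechanisms, the chaining of styles through the “style” attribute, can be sketched in a few lines of Python. The style table below is copied from the example file; the function name resolve is made up, only a single referenced style per entry is handled, and inheritance through regions and parent elements is left out entirely.

```python
# Styles from the example file, keyed by xml:id; "style" names the referenced style.
STYLES = {
    "s1": {"color": "white", "fontFamily": "proportionalSansSerif",
           "fontSize": "22px", "textAlign": "center"},
    "s2": {"style": "s1", "color": "yellow"},
    "s1Right": {"style": "s1", "textAlign": "end"},
    "s2Left": {"style": "s2", "textAlign": "start"},
}


def resolve(style_id, styles, seen=()):
    """Merge a style with the chain of styles it references; local values win."""
    if style_id in seen:
        raise ValueError(f"circular style reference at {style_id!r}")
    entry = styles[style_id]
    resolved = {}
    parent = entry.get("style")
    if parent is not None:
        resolved.update(resolve(parent, styles, seen + (style_id,)))
    resolved.update({key: value for key, value in entry.items() if key != "style"})
    return resolved


print(resolve("s2Left", STYLES))
```

Here s2Left ends up with yellow text from s2, the font settings from s1, and its own left justification. A real implementation would have to layer region styles and inherited values on top of this, which is where the potential for contradictions comes in.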
Further to styling, DFXP defines animations, which are discrete changes to some style parameter value that applies over some time interval. This is relevant for example to implement karaoke style colouring of text over time.
The <metadata> element serves as a generic container for grouping metadata information. It can be associated virtually with any element – which seems somewhat over-flexible, but provides for interesting meta data information such as meta data for styles or for a <br>.
In addition, metadata is actually limited to a set number of elements: title, desc, copyright, agent, name, and actor. These are strange fields – in particular if you compare them to the flexibility of HTML meta data, which consists of free-form name-value pairs, bringing us domain-specific schemes such as the Dublin Core. This is not easily possible here, but instead one has to define extensions to allow for such flexible meta data.
DFXP provides other features such as information that describes the related video file, e.g. frameRate, subFrameRate, frameRateMultiplier, pixelAspectRatio, smpteMode, timeBase, and tickRate. Such information will help at the point in time when DFXP is supposed to be multiplexed into a binary media file together with audio and video tracks. These attributes can provide information required for the multiplexing process. I am not sure that justifies their existence though.
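To show where such parameters enter the picture, here is a Python sketch that converts offset time expressions into seconds. It assumes the metrics h, m, s, ms, f (frames) and t (ticks); the default frame and tick rates in the signature are placeholders of mine rather than values taken from the draft, frameRateMultiplier and sub-frames are ignored, and the function name is hypothetical.

```python
import re

# number followed by a metric; "ms" is listed before "m" so it is matched first
OFFSET = re.compile(r"^(\d+(?:\.\d+)?)(h|ms|m|s|f|t)$")


def offset_seconds(expr, frame_rate=30.0, tick_rate=1.0):
    """Convert an offset such as '0.76s', '50f' or '90000t' to seconds.

    frame_rate and tick_rate stand in for the document-level frameRate and
    tickRate parameters (assumed defaults, for illustration only).
    """
    match = OFFSET.match(expr)
    if match is None:
        raise ValueError(f"unsupported time expression: {expr!r}")
    value, metric = float(match.group(1)), match.group(2)
    if metric == "h":
        return value * 3600
    if metric == "m":
        return value * 60
    if metric == "s":
        return value
    if metric == "ms":
        return value / 1000
    if metric == "f":
        return value / frame_rate
    return value / tick_rate  # metric == "t"


print(offset_seconds("0.76s"))
print(offset_seconds("50f", frame_rate=25.0))
print(offset_seconds("90000t", tick_rate=90000.0))
```

The frame and tick based forms cannot be interpreted without knowing the rates, so a caption file that uses them is tied to the parameters of one particular video.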
Other, minor features are available too. Check out the full specification to get a complete picture.
Part of the publication of this draft is also a test suite. Several of the defined features are still not represented in the test suite, which to me raises the question if they are really required. It might do wonders to the draft size to remove them.
DFXP is a standard for timed text that is firmly grounded in past captioning specifications, but written in XML, and borrowing ideas from Web technologies. It is unfortunately not re-using existing Web infrastructure to implement its more complex features: no use of CSS for styling and layout, no use of hyperlinks. Also, the use of namespaces seems excessive and won’t make it easy to author this format, in particular since the defined namespaces do not map into the defined profiles.
DFXP is written in such a way that it is possible to put together a new profile with extensions that are more appropriate for specific needs, e.g. that fit better into existing Web infrastructure. Currently, DFXP has three defined profiles: one focused on transformation, one focused on presentation, and one that contains everything.
I think it’s time for an HTML5 profile of DFXP that at minimum extends DFXP with hyperlinks, making it a real timed text Web format.
In the year 2000, while working at CSIRO as a research scientist, I had the idea that video (and audio) should be hyperlinked content on the Web just like any Web page. Conrad Parker and I developed the vision of a “Continuous Media Web” and called the technology that was necessary to develop “Annodex” for “annotated and indexed media”.
Not many people now know that this was really the beginning of Ogg on the Web. Until then, Ogg Vorbis and the emerging Ogg Theora were only targeted at desktop applications in competition to MP3 and MPEG-2.
Within a few years, we developed the specifications for a markup language for video called CMML that would provide the annotations, anchor points, and hyperlinks for video to make it possible to search and index video, hyperlink into video sections, and hyperlink out of video sections.
We further developed the specification of temporal URIs to actually address temporal offsets or segments in video.
And finally, we developed extensions to the Xiph Ogg framework to allow it to carry CMML, and more generally multi-track codecs. The resulting files were originally called “Annodex files”, but through increasing collaboration with Xiph, the specifications were simplified and included natively into Ogg and are now known as “Ogg Skeleton”.
Apart from specifications, we also developed lots of software to make the vision actually come true. Conrad, in particular, developed many libraries that helped develop software on top of the raw Xiph codecs, which include liboggz and libfishsound. Libraries were developed to deal with CMML and with embedding CMML into Ogg. Apache modules were developed to deal with segmenting sections from Ogg files and deliver them as a reply to a temporal URI request. And finally we actually developed a Firefox extension that would allow us to display the Ogg Theora/Vorbis videos inside a Web Browser.
Over time, a lot more software was developed, amongst them: php, perl and python bindings for Annodex, DirectShow filters to have Ogg Theora/Vorbis support on Windows, an ActiveX control for Windows, an authoring tool for CMML on Windows, Ogg format validation software, mobile phone support for Ogg Theora/Vorbis, and a video wiki for CMML and Ogg Theora called cmmlwiki. Several students and Annodex team members at CSIRO helped develop these, including Andre Pang (who now works for Pixar), Zen Kavanagh (who now works for Microsoft), and Colin Ward (who now works for Symbian). Most of the software was released as open source software by CSIRO and is available now either in the Annodex repository or the Xiph repositories.
Annodex technology became increasingly part of Xiph technology as team members also became increasingly part of the Xiph community, such that by now it’s rather difficult to separate out the Annodex people from the Xiph people.
Over time, other projects picked up on the Annodex technology. The first were in fact ethnographic researchers, who wanted their audio-visual ethnographic recordings to be usable in depth. Also, other multimedia scientists experimented with Annodex. The first actual content site to publish a large collection of Ogg Theora video with annotations was OpenRoadTrip by Scott Shawcroft and Brandon Hines in 2006. Soon after, Michael Dale and Aphid from Metavid started really using the Annodex set of technologies and contributing to harden the technology. Michael was also a big advocate for helping Wikimedia and Archive.org move to using Ogg Theora.
By 2006, the team at CSIRO decided that it was necessary to develop a simple, cross-platform Ogg decoding and playback library that would allow easy development of applications that need deep control of Ogg audio and video content. Shane Stephens was the key developer of that. By the time that Chris Double from Firefox picked up liboggplay to include Ogg support into Firefox natively, CSIRO had stopped working on Annodex, Shane had left the project to work for Google on Wave, and we eventually found Viktor Gal as the new maintainer for liboggplay. We also found Cristian Adam as the new maintainer for the DirectShow filters (oggcodecs).
Now that the basic Ogg Theora/Vorbis support for the HTML5 <video> element is starting to be available in all major browsers (well, as soon as an ActiveX control is implemented for IE), we can finally move on to develop the bigger vision. This is why I am an invited expert on the W3C media fragments working group and why I am working with Mozilla on sorting out accessibility for <video>. Accessibility is an inherent part of making video searchable. So, if we can find a way to extend the annotations with hyperlinks, we will also be able to build Webs of videos and completely new experiences on the Web. Think about mashing up simply by creating a list of URLs. Think about tweeting video segments. Think about threaded video email discussions (Shane should totally include that into Google Wave!). And think about all the awesome applications that come to your mind that I haven’t even thought about yet!
I spent this week at the Open Video Conference in New York and was amazed by the more than 800 people who understand the value of open video and the need for open video technologies to allow free innovation and sharing. I can feel that the ball has got rolling – the vision developed almost 10 years ago is starting to take shape. Sometimes, in very very rare moments, you can feel that history has just been made. The Open Video Conference was exactly one such point in time. Things have changed. Forever. For the better. I am stunned.
On Jun 13th 2009 Chris DiBona of Google claimed on the WhatWG mailing list:
“If were to switch to theora and maintain even a semblance of the current youtube quality it would take up most available bandwidth across the Internet.”
Everyone who has ever encoded an Ogg Theora/Vorbis file and in parallel encoded one with another codec will have to immediately protest. It is sad that even the best people fall for FUD spread by the un-enlightened or the ones who have their own agenda.
Fortunately, Gregory Maxwell from Wikipedia came to the rescue and did an actual “YouTube / Ogg/Theora comparison”. It’s a good read and a comparison on one video. He has put his instructions there, so anyone can repeat it for themselves. You will have to start with a pretty good quality video though to see such differences.
I’ve always thought that the most compelling reason to go with HTML5 Ogg video over Flash is the cool things it enables you to do with video within the webpage.
I’ve previously collected the following videos and demos:
Then there were Michael Dale’s demos of Metavidwiki with its direct search, access and reuse of video segments, even a little web-based video editor:
Then there was Chris Double’s video SVG demo with cool moving, resizing and reshaping of video:
and Chris kept them coming:
Then Chris Blizzard also made a cool demo for showing synchronised video and graph updates as well as a motion detector:
And now we have Firefox Director Mike Beltzner show off the latest and coolest to TechCrunch, the dynamic content injection bit of which you can try out yourself here:
It just keeps getting better!
UPDATE: Here are some more I’ve come across:
Yesterday, somebody mentioned that the HTML5 video tag with Ogg Theora/Vorbis can be played back in Safari if you have XiphQT installed (btw: the 0.1.9 release of XiphQT is upcoming). So, today I thought I should give it a quick test. It indeed works straight through the QuickTime framework, so the player looks like a QuickTime player. So, by now, Firefox 3.5, Chrome, Safari with XiphQT, and experimental builds of Opera support Ogg Theora/Vorbis inside the HTML5 video tag. Now we just need somebody to write some ActiveX controls for the Xiph DirectShow Filters and it might even work in IE.
While doing my testing, I needed to go to some sites that actually use Ogg Theora/Vorbis in HTML5 video tags. Here is a list that I came up with in no particular order:
- Chris Double’s Tinyvid
- Dailymotion’s Open Video Demo (restricted to Firefox 3.5)
- Michael Dale and Aphid’s Metavid
- Archive.org’s videos
- Wikipedia’s videos
- the FOMS workshop videos
I’m sure there’s a lot more out there – feel free to post links in the comments.
Michael Dale just posted this to theora-dev. Go to one of the given URLs to install the Firefox plugin that lets you transcode video to Ogg using your Web browser.
On Fri, Jun 5, 2009 at 7:08 AM, Michael Dale
> I mentioned it in the #theora channel a few days ago but here it is with
> a more permanent url:
> These will be simple links you can send people so that they can encode
> source footage to a local ogg video file with the latest and greatest
> ogg encoders (presently thusnelda and vorbis). Updates to thusnelda and
> possible other free codecs will be pushed out via firefogg updates
> Pass along any feedback if things break or what not.
> I am also doing testing with “embed” these encoder interface. For those
> familiar with jQuery: an example to rewrite all your file inputs with
> firefogg enhanced inputs: $("input:[type='file']").firefogg() … Feel
> free to experiment based on those examples. The form rewrite has mostly
> only been tested in the mediaWiki context:
> but with minor hacking should work elsewhere
This past week was amazing, not because of Google Wave, which everybody seems to be talking about now, and not because of Microsoft’s launch of the bing search engine, but amazing for the world of open video.
- YouTube are experimenting with the HTML5 video tag. The demo only works in HTML5 video capable browsers, such as Firefox 3.5, Safari, Opera, and the new Chrome, which leads me straight to the next news.
- The Google Chrome 3 browser now supports the HTML5 video tag. The linked release only supports MPEG encoded video, but that’s a big step forward.
- More importantly even, recently committed code adds Ogg Theora/Vorbis support to Google Chrome 3′s video tag! This is based on using ffmpeg at this stage, which needs some further work to e.g. gain Ogg Kate support. But this is great news for open media!
- And then the biggest news: Dailymotion, one of the largest social video networks, has re-encoded all their videos to Ogg Theora/Vorbis and has launched an openvideo platform. The blog post is slightly negative about video quality – probably because they used an older encoder. The Xiph community recommends experimenting with the new Thusnelda encoder and the latest ffmpeg2theora release that supports it, since they provide higher compression ratios and better quality.
- That latest ffmpeg2theora release is really awesome news by itself, but I’d also like to mention two other encoding tools that were released last week: the updated XiphQT QuickTime components, that now allow export to Ogg Theora/Vorbis directly from iMovie (I tested it and it’s awesome) and the new GStreamer command-line based python encoder gst2ogg which works mostly like ffmpeg2theora.
Overall a really exciting week for open media and HTML5 video! I think things are only going to heat up more in this space as more content publishers and more browsers implement the video tag and add Ogg Theora/Vorbis support.