Monthly Archives: October 2009

Cortado 0.5.0 released

Cortado is a Java applet that brings Ogg Theora/Vorbis support to web publishers. It is particularly useful for publishers who want to use Ogg Theora/Vorbis in browsers that do not yet support the HTML5 video element with Ogg.
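For illustration, embedding Cortado into a page typically looks something like the following sketch – the applet class is com.fluendo.player.Cortado, but the URL, dimensions, and the exact set of supported param values here are placeholders that depend on the Cortado build you use:

<applet code="com.fluendo.player.Cortado.class"
        archive="cortado.jar" width="352" height="288">
  <param name="url" value="http://example.com/video.ogv"/>
</applet>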

Cortado was originally developed by Fluendo SA under the LGPL and contains a re-implementation of Theora and Vorbis in Java (jheora and jorbis). After a few years of low maintenance, the Wikimedia Foundation took it upon themselves to dust off the code for use on Wikimedia Commons, where only unencumbered open video formats are acceptable.

As Ralph states in his announcement of the new release: earlier this year, Xiph.org took over maintenance of the Cortado Java applet to help concentrate interest and expertise on this important component of the free media codec infrastructure. Consequently, the official website for Cortado is now part of the Xiph.org site. [If somebody could update the Wikipedia article – that would be awesome!]

So, I am very happy to point to the first Cortado release in three years. Source and sample builds are available from the Xiph.org download site.

Ralph writes further:

The new version is tagged 0.5.0 to indicate both the change in hosting and the significant new support for files from the new libtheora encoder implementation and Kate embedded subtitles.

In particular, 0.5.0 has:

  • Support for files encoded with Theora 1.1
  • Faster YUV to RGB conversion with better results
  • Basic support for embedded Ogg Kate streams
  • Seeking fixed for files with an Ogg Skeleton track
  • Maintained compatibility with the Microsoft VM

This is an awesome example of the power of open source and of what a group of people can achieve. Congratulations to everyone at Xiph, Wikimedia, and anyone else who contributed to the release!

Dealing with multi-track video (and audio)

We are slowly approaching the stage where we want to make multi-track video of the following type available and accessible:

  • original video track
  • original audio track
  • dubbed audio tracks in n different languages
  • audio description tracks in n different languages
  • sign language video tracks in n different sign languages
  • caption tracks in n different languages
  • multiple other time-aligned text tracks in different languages
  • audio and video tracks from different camera angles
  • separate music and speech tracks
  • tracks in different qualities
  • accompanying images, e.g. slides for a presentation

One of the issues with such a sizeable number of tracks is how to display them. Some of them are alternatives, some of them additions. Sign language is typically presented in a PiP (picture-in-picture) approach. If we have a music and a speech (or singing) track, we may want to have control over removing certain tracks – e.g. to be able to do karaoke. Caption and subtitle tracks in the same language are probably alternatives, while in different languages they could be additions. It is not a trivial challenge to handle such complex files in an application.

At this point, I am only trying to solve a sub-challenge. As we talk about a particular track in a multi-track media file, we will want to identify it by name. Should there be a standard for naming the tracks, so that we can address them, e.g. by a URL, with the intention of only delivering a subset of tracks from the larger file? We could introduce such a standard for Ogg – but maybe there is an opportunity to do this across file formats?

To find some answers to these and related questions, I want to discuss two approaches.

The first approach is a simple numbering approach. In it, the audio, video, and annotation tracks are each ordered and then numbered sequentially. This results in the following sets of track names: video[0] … [n], audio[0] … [n], timed text[0] … [n], and possibly even timed images[0] … [n]. This approach is simple, easy to understand, and only requires ordering the tracks within their types. It allows addressing of a particular track – e.g. as required by the media fragment URI scheme for track addressing. However, it does not allow identification of alternatives, additions, or presentation styles.
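To illustrate, track addressing through a URL under such a numbering approach might look as follows – these are hypothetical URLs, assuming the track dimension that the Media Fragments group is discussing, and ignoring the escaping that the square brackets would require:

http://example.com/video.ogv#track=video[0]
http://example.com/video.ogv#track=video[0]&track=audio[2]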

Should alternatives, additions, and presentation styles be encoded in the name of the track? Or should this information go into a meta description area of the multi-track video – something like Skeleton in Ogg? Or should it go a step further and live in an external information file such as an m3u file (or ROE for Ogg)?
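To make the external-description option more concrete, here is a purely hypothetical manifest – not actual m3u or ROE syntax, just a sketch of the kind of metadata needed – recording for each track its name, its role, and its relationship to other tracks:

# hypothetical track manifest – illustrative only
video[0]  main video
video[1]  main video, alternative camera angle to video[0]
audio[0]  main audio (en)
audio[1]  dubbed audio (de), alternative to audio[0]
text[0]   captions (en), addition to any of the above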

I want to experiment here with the naming scheme and what we would need to specify to be able to decide which tracks to ignore and which to combine for a presentation. And I want to ask for your comments and advice.

This requires listing exactly what types of content tracks we may have to deal with.

In the video space, we have at minimum the following track types:

  • main video content – with alternative camera angles
  • subsidiary video content – with alternative camera angles
  • sign language videos – in alternative languages

Alternatives are defined by camera angle and language. Also, each track can be made available in a different quality. I would also count additional image content, such as slides in a presentation, as subsidiary video content. So, here we could use a scheme such as video_[main,side,sign]_language_angle.
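For example, the video tracks of a subtitled, signed presentation recording might be named as follows – hypothetical names, with "ase" being the ISO 639-3 code for American Sign Language:

video_main_en_0   (main content, first camera angle)
video_main_en_1   (main content, second camera angle)
video_side_en_0   (subsidiary content, e.g. presentation slides)
video_sign_ase_0  (American Sign Language interpretation)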

In the audio space, we have at minimum the following track types:

  • main audio content – in alternative languages
  • background audio content – e.g. music, SFX, noise
  • foreground speech or singing content – in alternative languages
  • audio descriptions – in alternative languages

Alternatives are defined by language and content type. Again, each track can be made available in a different quality. Here we could use a scheme such as audio_type_language.
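For example – again hypothetical names, with the type abbreviations being my own:

audio_main_en    (original English audio)
audio_main_de    (German dub)
audio_music      (background music – no language applies)
audio_speech_en  (isolated speech/singing track)
audio_desc_fr    (French audio description)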

In the text space, we have at minimum the following track types:

  • subtitles – in different languages
  • captions – in different languages
  • textual audio descriptions – in different languages
  • other time-aligned text – in different languages

Alternatives are defined by language and content type – e.g. lyrics, captions and subtitles really compete for the same screen space. Here we could use a scheme such as text_type_language.
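For example – hypothetical names, with type abbreviations of my own:

text_cc_en      (English captions)
text_sub_de     (German subtitles)
text_tad_en     (English textual audio description)
text_lyrics_en  (English lyrics)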

A generic track naming scheme
It seems the generic naming scheme of

<content_type>_<track_type>_<language>[_<angle>]

can cover all cases.

Are there further track types, further alternatives I have missed? What do you think?

MySQL, Snow Leopard and ruby

I got a shiny new MacBook Pro on the weekend, yay! After months of complaining about the slowness of, and the heat radiating from, my old MacBook, I'm finally off to greener pastures.

But then there was the annoying task of setting up the machine with all the software that I'm using. MySQL and Ruby turned out to be particular problems. I installed MySQL for OS X 10.5, since MySQL hasn't published a build for OS X 10.6 yet. Then I ran "gem install mysql" – and the pain started.

I got all the errors that were reported elsewhere: "uninitialized constant MysqlCompat::MysqlRes" and "undefined method `real_connect' for Mysql:Class (NoMethodError)". I tried all the suggestions, including:

sudo env ARCHFLAGS="-arch x86" gem install mysql -- --with-mysql-config=/usr/local/mysql-5.1.39-osx10.5-x86/bin/mysql_config -V --debug

but just couldn't get there.

My laptop reports in the System Software Overview: "64-bit Kernel and Extensions: No", so I assumed I had to use the 32-bit versions. However, that was a wrong assumption. Even though my kernel is 32-bit, applications run as 64-bit.

So, eventually I re-installed MySQL for Mac OS X 10.5 (x86_64) and ran the correct gem install command:

sudo env ARCHFLAGS="-arch x86_64" gem install mysql -- --with-mysql-config=/usr/local/mysql-5.1.39-osx10.5-x86_64/bin/mysql_config

and things were fine.

Additionally, there was some fighting with the PrefPane and with re-starting MySQL. I had to kill the daemon manually, and I had to install the updated PrefPane from Swoon dot net to make it work.
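For reference, manually killing and restarting the daemon went roughly like this – a sketch that assumes the standard tarball install location under /usr/local/mysql:

# find the pid of the running daemon
ps aux | grep mysqld
# kill it (replace 1234 with the actual pid)
sudo kill 1234
# restart via the script that ships with the MySQL package
sudo /usr/local/mysql/support-files/mysql.server start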

Hope this helps somebody avoid the same pain!

Web Directions South 2009 talk on HTML5 video

Yesterday, I gave a talk on the HTML5 video element at Web Directions South.

The title was "Taking HTML5 <video> a step further" and the abstract that was provided goes as follows:

This talk focuses on the efforts of the W3C to improve the new HTML5 media elements with mechanisms that allow people to access multimedia content, including audio and video. Such developments are also useful beyond accessibility needs and will lead to a general improvement of the usability of media, making media discoverable and generally a first-class citizen on the Web.

Silvia will discuss what is currently technically possible with the HTML5 media elements, and what is still missing. She will describe a general framework of accessibility for HTML5 media elements and present her work for the Mozilla Corporation that includes captions, subtitles, textual audio annotations, timed metadata, and other time-aligned text with the HTML5 media elements. Silvia will also discuss work of the W3C Media Fragments group to further enhance video usability and accessibility by making it possible to directly address temporal offsets in video, as well as spatial areas and tracks.

Here are my slides:

Download the pdf from here.

There was also a video recording and I will add that here as soon as it is published.

UPDATE:
The video is available on Tinyvid:

I'm not going to try and upload this 50-minute-long video to YouTube – with its 10 min limit, I won't get very far.

WebJam 2009 talk on video accessibility

On Wednesday evening I gave a 3 min presentation on video accessibility in HTML5 at the WebJam in Sydney. I used a video as my presentation means and explained things while playing it back. Here is the video, without my oral descriptions, but probably still useful to some. Note in particular how you can experience the issues of deaf and hard-of-hearing (HoH), blind and vision-impaired (VI), and foreign-language users:

The Ogg version is here.

New proposal for captions and other timed text for HTML5

The first specification for how to include captions, subtitles, lyrics, and similar time-aligned text with HTML5 media elements has received a lot of feedback – probably because there are several demos available.

The feedback has encouraged me to develop a new specification that addresses these concerns and makes it easier to associate out-of-band time-aligned text (i.e. subtitles stored in files separate from the video/audio file). A simple example of the new specification using srt files is this:

<video src="video.ogv" controls>
   <itextlist category="CC">
     <itext src="caption_en.srt" lang="en"/>
     <itext src="caption_de.srt" lang="de"/>
     <itext src="caption_fr.srt" lang="fr"/>
     <itext src="caption_jp.srt" lang="jp"/>
   </itextlist>
 </video>

By default, the charset of the itext file is UTF-8, and the default format is text/srt (incidentally, a MIME type that still needs to be registered). Also by default, the browser is expected to select for display the track that matches the browser's default language setting. This has proven to work well in previous experiments.
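For reference, srt is a very simple format: each cue consists of a running number, a time range, and one or more lines of text, for example:

1
00:00:01,000 --> 00:00:04,200
Hello, and welcome to this video.

2
00:00:04,500 --> 00:00:08,000
Captions appear as time-aligned text on the video.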

Check out the new itext specification, read on to get an introduction to what has changed, and leave me your feedback if you can!

The itextlist element
You will have noticed that, in comparison to the previous specification, this specification contains a grouping element called "itextlist". This is necessary because we have to distinguish between alternative time-aligned text tracks and ones that are additional, i.e. can be displayed at the same time. In the first specification this was done by inspecting each itext element's category and grouping them together, but that resulted in much repetition and unreadable markup.

Also, it was not clear which itext elements were to be displayed in the same region and which in different ones. Now, their styling can be controlled uniformly.
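As an illustration of alternatives versus additions, a video could carry one itextlist for captions and a second one for textual audio descriptions: the tracks within each list are alternatives, while the two lists can be rendered at the same time. This is a sketch along the lines of the example above – the "TAD" category name and the file names are my assumptions here:

<video src="video.ogv" controls>
   <itextlist category="CC">
     <itext src="caption_en.srt" lang="en"/>
     <itext src="caption_de.srt" lang="de"/>
   </itextlist>
   <itextlist category="TAD">
     <itext src="tad_en.srt" lang="en"/>
     <itext src="tad_de.srt" lang="de"/>
   </itextlist>
 </video>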

The final advantage is that the association of callbacks for entering and leaving the text segments extracted from the itext elements can now be controlled from the itextlist element in a uniform manner.

This change also makes it simple for a parser to determine the structure of the menu that is created and included in the controls element of the audio or video element.

Incidentally, a patch for Firefox already exists that makes this part of the browser. It does not yet support this new itext specification, but here is a screenshot that Felipe Corrêa da Silva Sanches created: