ginger's thoughts

Silvia's blog

Manifests for exposing the structure of a Composite Media Resource

Posted in Digital Media, standards, video accessibility by silvia on November 25th, 2009

In the previous post I explained that there is a need to expose the tracks of a time-linear media resource to the user agent (UA). Here, I want to look in more detail at different possibilities of how to do so, their advantages and disadvantages.

Note: A lot of this has come out of discussions I had at the recent W3C TPAC and is still in flux, so I am writing this to start discussions and brainstorm.

Declarative Syntax vs JavaScript API

We can expose a media resource’s tracks either through a JavaScript function that can loop through the tracks and provide access to the tracks and their features, or we can do this through declarative syntax.

Using declarative syntax has the advantage of being available even if JavaScript is disabled in a UA. The markup can be parsed easily and default displays can be prepared without having to actually decode the media file(s).

OTOH, it has the disadvantage that it may not necessarily represent what is actually in the binary resource, but instead what the Web developer assumed was in the resource (or forgot to update). This may lead to a situation where a “404” may need to be given on a media track.

A further disadvantage is that when somebody copies the media element onto another Web page, together with all the track descriptions, and the original media resource is then changed (e.g. a subtitle track is added), this does not have the desired effect, since the change does not propagate to the other Web page.

For these reasons, I thought that a JavaScript interface was preferable over declarative syntax.

However, recent discussions, in particular with some accessibility experts, have convinced me that declarative syntax is preferable, because it allows the creation of a menu for turning tracks on/off without having to even load the media file. Further, declarative syntax allows multiple files and “native tracks” of a virtual media resource to be treated in an identical manner.

Extending Existing Declarative Syntax

The HTML5 media elements already have declarative syntax to specify multiple source media files for media elements. The <source> element is typically used to list video in MPEG-4 and Ogg formats for support in different browsers, but has also been envisaged for different screen size and bandwidth encodings.

The <source> elements are generally meant to list different resources that contribute towards the media element. In that respect, let’s try using it to declare a manifest of the tracks of the virtual media resource in an example:

  <video>
    <source id='av1' src='video.3gp' type='video/mp4' media='mobile' lang='en'
                     role='media' >
    <source id='av2' src='video.mp4' type='video/mp4' media='desktop' lang='en'
                     role='media' >
    <source id='av3' src='video.ogv' type='video/ogg' media='desktop' lang='en'
                     role='media' >
    <source id='dub1' src='video.ogv?track=audio[de]' type='audio/ogg' lang='de'
                     role='dub' >
    <source id='dub2' src='audio_ja.oga' type='audio/ogg' lang='ja'
                     role='dub' >
    <source id='ad1' src='video.ogv?track=auddesc[en]' type='audio/ogg' lang='en'
                     role='auddesc' >
    <source id='ad2' src='audiodesc_de.oga' type='audio/ogg' lang='de'
                     role='auddesc' >
    <source id='cc1' src='video.mp4?track=caption[en]' type='application/ttaf+xml'
                     lang='en' role='caption' >
    <source id='cc2' src='video.ogv?track=caption[de]' type='text/srt; charset="ISO-8859-1"'
                     lang='de' role='caption' >
    <source id='cc3' src='caption_ja.ttaf' type='application/ttaf+xml' lang='ja'
                     role='caption' >
    <source id='sign1' src='signvid_ase.ogv' type='video/ogg; codecs="theora"'
                     media='desktop' lang='ase' role='sign' >
    <source id='sign2' src='signvid_gsg.ogv' type='video/ogg; codecs="theora"'
                     media='desktop' lang='gsg' role='sign' >
    <source id='sign3' src='signvid_sfs.ogv' type='video/ogg; codecs="theora"'
                     media='desktop' lang='sfs' role='sign' >
    <source id='tad1' src='tad_en.srt' type='text/srt; charset="ISO-8859-1"'
                     lang='en' role='tad' >
    <source id='tad2' src='video.ogv?track=tad[de]' type='text/srt; charset="ISO-8859-1"'
                     lang='de' role='tad' >
    <source id='tad3' src='tad_ja.srt' type='text/srt; charset="EUC-JP"' lang='ja'
                     role='tad' >
  </video>

Note that this somewhat ignores my previously proposed special itext tag for handling text tracks. I am doing this here to experiment with a more integrative approach with the virtual media resource idea from the previous post. This may well be a better solution than a specific new text-related element. Most of the attributes of the itext element are, incidentally, covered.

You will also notice that some of the tracks are references to tracks inside binary media files using the Media Fragment URI specification, e.g. video.ogv?track=auddesc[en], while others link to full files. So, this is a uniform means of exposing all the tracks that are part of a (virtual) media resource to the UA, no matter whether in-band or in external files. It does, however, rely on the UA or server being able to resolve these URLs.
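To make the track references concrete, here is a minimal Python sketch of how a UA or server might split such a URL into the file and the addressed track. It only assumes the ?track=name[lang] shape used in the example markup; the helper name parse_track_ref and its return value are hypothetical and not taken from the Media Fragment URI specification.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches "name" or "name[lang]", the shape used in the example markup.
TRACK_RE = re.compile(r"^(?P<name>[\w-]+)(?:\[(?P<lang>[\w-]+)\])?$")


def parse_track_ref(src):
    """Split a source URL into (file, track name, track language).

    Hypothetical helper: name and language are None when the URL
    addresses a whole file rather than a track inside it.
    """
    parts = urlsplit(src)
    values = parse_qs(parts.query).get("track")
    if not values:
        return (parts.path, None, None)
    match = TRACK_RE.match(values[0])
    if match is None:
        raise ValueError(f"unrecognised track reference: {values[0]!r}")
    return (parts.path, match.group("name"), match.group("lang"))


print(parse_track_ref("video.ogv?track=auddesc[en]"))
print(parse_track_ref("audio_ja.oga"))
```

The first call yields the file video.ogv with track auddesc in language en; the second yields only the file name, since audio_ja.oga is referenced as a full file.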

“type” attribute

“media” and “type” are existing attributes of the <source> element in HTML5 and meant to help the UA determine what to do with the referenced resource. The current spec states:

The “type” attribute gives the type of the media resource, to help the user agent determine if it can play this media resource before fetching it.

The word “play” might need to be replaced with “decode” to cover several different MIME types.

The “type” attribute was also extended with the possibility to add the “charset” MIME parameter of a linked text resource – this is particularly important for SRT files, which don’t handle charsets very well. It avoids having to add an additional attribute and is analogous to the “codecs” MIME parameter used by audio and video resources.
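As a rough illustration of how a UA could use these parameters, the following Python sketch splits a “type” value into the bare MIME type and its parameters, and then uses “charset” to decode a text track. The helper names parse_type and decode_text, and the UTF-8 fallback when no charset is given, are assumptions made for this example only.

```python
from email.message import Message


def parse_type(value):
    """Split a type attribute into its bare MIME type and its parameters."""
    msg = Message()
    msg["Content-Type"] = value
    # get_params() returns (name, value) pairs; the first pair is the MIME type.
    params = dict(msg.get_params()[1:])
    return msg.get_content_type(), params


def decode_text(data, type_value, default="utf-8"):
    """Decode the bytes of a text track using the declared charset.

    The UTF-8 fallback is an assumption of this sketch.
    """
    _, params = parse_type(type_value)
    return data.decode(params.get("charset", default))


print(parse_type('text/srt; charset="ISO-8859-1"'))
print(parse_type('video/ogg; codecs="theora"'))
print(decode_text("Grüße".encode("iso-8859-1"), 'text/srt; charset="ISO-8859-1"'))
```

The point of the design is that the UA learns the encoding from the markup alone, before it has fetched a single byte of the SRT file.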

“media” attribute

Further, the spec states:

The “media” attribute gives the intended media type of the media resource, to help the user agent determine if this media resource is useful to the user before fetching it. Its value must be a valid media query.

The “mobile” and “desktop” values are hints that I’ve used for simplicity. They could be improved by giving appropriate bandwidth limits and width/height values, etc. Other values could be different camera angles such as topview, frontview, backview. The media query aspect has to be looked into in more depth.

“lang” attribute

The above example further uses “lang” and “role” attributes:

The “lang” attribute is an existing global attribute of HTML5, which typically indicates the language of the data inside the element. Here, it is used to indicate the language of the referenced resource. This is possibly not quite the best name choice and should maybe be called “hreflang”, which is already used in multiple other elements to signify the language of the referenced resource.

“role” attribute

The “role” attribute is also an existing attribute in HTML5, included from ARIA. It currently doesn’t cover media resources, but could be extended. The suggestion here is to specify the roles of the different media tracks – the ones I have used here are:

  • “media”: a main media resource – typically contains audio and video and possibly more
  • “dub”: an audio track that provides an alternative dubbed language track
  • “auddesc”: an audio track that provides an additional audio description track
  • “caption”: a text track that provides captions
  • “sign”: a video-only track that provides an additional sign language video track
  • “tad”: a text track that provides textual audio descriptions to be read by a screen reader or a braille device

Further roles could be “music”, “speech”, “sfx” for audio tracks, “subtitle”, “lyrics”, “annotation”, “chapters”, “overlay” for text tracks, and “alternate” for an alternate main media resource, e.g. a different camera angle.

Track activation

The given attributes help the UA decide what to display.

It will first find out from the “type” attribute if it is capable of decoding the track.

Then, the UA will find out from the “media” query, “role”, and “lang” attributes whether a track is relevant to its user. This will require checking the capabilities of the device, network, and the user preferences.

Further, it could be possible for Web authors to influence whether a track is displayed or not through CSS parameters on the <source> element: “display: none” or “visibility: hidden/visible”.

Examples for track activation that a UA would undertake using the example above:

Given a desktop computer with Firefox, German language preferences, captions and sign language activated, the UA will fetch the original video at video.ogv (for Firefox), the German caption track at video.ogv?track=caption[de], and the German sign language track at signvid_gsg.ogv (maybe also the German dubbed audio track at video.ogv?track=audio[de], which would then replace the original one).

Given a desktop computer with Safari, English language preferences and audio descriptions activated, the UA will fetch the original video at video.mp4 (for Safari) and the textual audio description at tad_en.srt to be displayed through the screen reader, since it cannot decode the Ogg audio description track at video.ogv?track=auddesc[en].

Also, all decodable tracks could be exposed in a right-click menu and added on demand.
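The selection steps and the two examples above can be sketched in Python as follows. This is only one possible reading of the activation logic, not a specified algorithm: the Source tuple, the pick and activate helpers, the sets of decodable MIME types per browser, and the inclusion of “gsg” among the German user’s languages are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Source(NamedTuple):
    id: str
    type: str  # bare MIME type, parameters dropped
    lang: str
    role: str
    media: Optional[str] = None


# The <source> elements from the example markup, in document order.
SOURCES = [
    Source("av1", "video/mp4", "en", "media", "mobile"),
    Source("av2", "video/mp4", "en", "media", "desktop"),
    Source("av3", "video/ogg", "en", "media", "desktop"),
    Source("dub1", "audio/ogg", "de", "dub"),
    Source("dub2", "audio/ogg", "ja", "dub"),
    Source("ad1", "audio/ogg", "en", "auddesc"),
    Source("ad2", "audio/ogg", "de", "auddesc"),
    Source("cc1", "application/ttaf+xml", "en", "caption"),
    Source("cc2", "text/srt", "de", "caption"),
    Source("cc3", "application/ttaf+xml", "ja", "caption"),
    Source("sign1", "video/ogg", "ase", "sign", "desktop"),
    Source("sign2", "video/ogg", "gsg", "sign", "desktop"),
    Source("sign3", "video/ogg", "sfs", "sign", "desktop"),
    Source("tad1", "text/srt", "en", "tad"),
    Source("tad2", "text/srt", "de", "tad"),
    Source("tad3", "text/srt", "ja", "tad"),
]


def pick(sources, role, decodable, languages, media):
    """Return the first source of a role that is decodable and relevant."""
    for s in sources:
        if s.role != role or s.type not in decodable:
            continue
        if s.media is not None and s.media != media:
            continue
        # The main resource is taken in whatever language it comes in.
        if role != "media" and s.lang not in languages:
            continue
        return s
    return None


def activate(sources, decodable, languages, media, wanted):
    """Pick the main resource, then one track per wanted feature.

    Each entry of `wanted` lists roles in order of preference, so a later
    role acts as a fallback when no track of an earlier role qualifies.
    """
    chosen = [pick(sources, "media", decodable, languages, media)]
    for alternatives in wanted:
        for role in alternatives:
            found = pick(sources, role, decodable, languages, media)
            if found is not None:
                chosen.append(found)
                break
    return [s.id for s in chosen if s is not None]


# Assumed capabilities: an Ogg-only browser and an MPEG-4-only browser.
firefox = activate(SOURCES, {"video/ogg", "audio/ogg", "text/srt"},
                   {"de", "gsg"}, "desktop", [["caption"], ["sign"]])
safari = activate(SOURCES, {"video/mp4", "text/srt"},
                  {"en"}, "desktop", [["auddesc", "tad"]])
print(firefox)  # ['av3', 'cc2', 'sign2']
print(safari)   # ['av2', 'tad1']
```

Even in this small sketch the UA has to be told that “tad” is an acceptable fallback for “auddesc”, and that “gsg” belongs to a German-speaking user; neither relationship is expressed in the markup itself.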

Display styling

Default styling of these tracks could be:

  • video or alternate video in the video display area,
  • sign language probably as picture-in-picture (making it useless on a mobile and only of limited use on the desktop),
  • captions/subtitles/lyrics as overlays on the bottom of the video display area (or whatever the caption format prescribes),
  • textual audio descriptions as ARIA live regions hidden behind the video or off-screen.

Multiple audio tracks can always be played at the same time.

The Web author could also define the display area for a track through CSS styling and the UA would then render the data into that area at the rate that is required by the track.

How good is this approach?

The advantage of this new proposal is that it builds basically on existing HTML5 components with minimal additions to satisfy requirements for content selection and accessibility of media elements. It is a declarative approach to the multi-track media resource challenge.

However, it leaves most of the decision on what tracks are alternatives of/additions to each other and which tracks should be displayed to the UA. The UA makes an informed decision because it gets a lot of information through the attributes, but it still has to make decisions that may become rather complex. Maybe there needs to be a grouping level for alternative tracks and additional tracks – similar to what I did with the second itext proposal, or similar to the <switch> and <par> elements of SMIL.

A further issue is one that is currently being discussed within the Media Fragments WG: how can you discover the track composition and the track naming/uses of a particular media resource? How, e.g., can a Web author on another Web site know how to address the tracks inside your binary media resource? An HTML specification like the above can help. But what if that doesn’t exist? And what if the file is being used offline?

Alternative Manifest descriptions

The need to manifest the track composition of a media resource is not a new one. Many other formats and applications had to deal with these challenges before – some have defined and published their format.

I am going to list a few of these formats here with examples. They could inspire a next version of the above proposal with grouping elements.

Microsoft ISM files (SMIL subpart)

With the release of IIS7, Microsoft introduced “Smooth Streaming”, which uses chunking on files on the server to deliver adaptive streaming to Silverlight clients over HTTP. To inform a smooth streaming client of the tracks available for a media resource, Microsoft defined ism files: IIS Smooth Streaming Server Manifest files.

This is a short example – a longer one can be found here:

<?xml version="1.0" encoding="utf-8"?>
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <meta name="clientManifestRelativePath" content="manifest" />
  </head>
  <body>
    <switch>
      <video src="video.ismv" systemBitrate="490000">
        <param name="trackID" value="1" valueType="data" />
      </video>
      <audio src="video.ismv" systemBitrate="76000">
        <param name="trackID" value="2" valueType="data" />
      </audio>
      <textstream src="video.ismv" systemBitrate="700000" systemLanguage="en">
        <param name="trackID" value="3" valueType="data" />
      </textstream>
    </switch>
  </body>
</smil>

This short example is a simple video file with a video, an audio, and a text track. The ismv file is actually an MPEG-4 file, but pre-chunked. Bitrate and trackID of the three tracks are specified, and the parallel nature of these three tracks is expressed by listing them side by side inside the <switch> element.
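To show how little is needed to read such a manifest, here is a small Python sketch that lists the tracks of the example with the standard library XML parser. The list_tracks helper and the dictionary keys it returns are hypothetical; nothing here reflects how IIS or Silverlight actually process the file.

```python
import xml.etree.ElementTree as ET

SMIL_NS = {"smil": "http://www.w3.org/2001/SMIL20/Language"}

# A clean copy of the example manifest.
ISM = b"""<?xml version="1.0" encoding="utf-8"?>
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <meta name="clientManifestRelativePath" content="manifest" />
  </head>
  <body>
    <switch>
      <video src="video.ismv" systemBitrate="490000">
        <param name="trackID" value="1" valueType="data" />
      </video>
      <audio src="video.ismv" systemBitrate="76000">
        <param name="trackID" value="2" valueType="data" />
      </audio>
      <textstream src="video.ismv" systemBitrate="700000" systemLanguage="en">
        <param name="trackID" value="3" valueType="data" />
      </textstream>
    </switch>
  </body>
</smil>"""


def list_tracks(manifest):
    """Return one dictionary per track listed inside <switch>."""
    root = ET.fromstring(manifest)
    tracks = []
    for element in root.find("smil:body/smil:switch", SMIL_NS):
        kind = element.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
        param = element.find("smil:param[@name='trackID']", SMIL_NS)
        tracks.append({
            "kind": kind,
            "src": element.get("src"),
            "bitrate": int(element.get("systemBitrate")),
            "track_id": int(param.get("value")),
            "lang": element.get("systemLanguage"),  # None when not declared
        })
    return tracks


for track in list_tracks(ISM):
    print(track)
```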

According to Microsoft, the server manifest serves three key roles:

  • Specify the group of media files that comprise the presentation.
  • Specify heuristic parameters, such as bit rate and fragment quality index, for each track.
  • Abstract the layout of the tracks into files on disk for consumption by the client.

This is very similar to our needs here and thus the specification also looks very similar to what we ended up with above, though the <source> element’s specification is much denser than the SMIL subpart used here.

Xiph ROE files

The Xiph community also realised that for varying use cases there is a need for a manifest file format for multi-track media files. Authoring of a multi-track file, content negotiation, and content representation are three example uses of ROE, the Rich Open multitrack media Exposition format.

This is an example ROE file:

<?xml version="1.0"?>
 <ROE>
   <head>
     <link id="html_linkback" rel="alternate" type="text/html"
                 href="http://example.com/full_video.html"/>
   </head>
   <body>
     <track id="v" provides="video">
       <switch distinction="angle" default="v1">
         <mediaSource id="v1" content-type="video/theora"
                                  src="http://example.com/angle1.ogv?track=v1" />
         <mediaSource id="v2" content-type="video/theora"
                                   src="http://example.com/angle2.ogv" />
       </switch>
     </track>
     <track id="a" provides="audio">
       <switch distinction="Content-Language" default="a3">
           <mediaSource id="a1" lang="en" content-type="audio/vorbis"
                                     src="http://example.com/lang1.oga" />
           <mediaSource id="a2" lang="de" content-type="audio/vorbis"
                                     src="http://example.com/lang2.oga" />
           <mediaSource id="a3" lang="fr" content-type="audio/vorbis"
                                     src="http://example.com/lang3.oga" />
       </switch>
     </track>
   </body>
 </ROE>

ROE uses many SMIL features, just like ISM, but has also introduced further attributes and elements. Since ROE is usable for authoring, it includes the <seq> element to sequence audio, video or text files. This is not necessary for a simple manifest of multi-track media resources and in fact destroys the single timeline paradigm.
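The grouping that <switch> provides in the ROE example can be illustrated with a short Python sketch that picks one mediaSource per track. The select_tracks helper is hypothetical, and treating the “default” attribute as the fallback when no language matches is an assumption of this sketch rather than documented ROE behaviour.

```python
import xml.etree.ElementTree as ET

# The <body> of the example ROE file, with the <head> left out.
ROE = """<ROE>
  <body>
    <track id="v" provides="video">
      <switch distinction="angle" default="v1">
        <mediaSource id="v1" content-type="video/theora"
                     src="http://example.com/angle1.ogv?track=v1" />
        <mediaSource id="v2" content-type="video/theora"
                     src="http://example.com/angle2.ogv" />
      </switch>
    </track>
    <track id="a" provides="audio">
      <switch distinction="Content-Language" default="a3">
        <mediaSource id="a1" lang="en" content-type="audio/vorbis"
                     src="http://example.com/lang1.oga" />
        <mediaSource id="a2" lang="de" content-type="audio/vorbis"
                     src="http://example.com/lang2.oga" />
        <mediaSource id="a3" lang="fr" content-type="audio/vorbis"
                     src="http://example.com/lang3.oga" />
      </switch>
    </track>
  </body>
</ROE>"""


def select_tracks(roe_xml, preferred_langs):
    """Map each track id to the id of the mediaSource chosen for it."""
    root = ET.fromstring(roe_xml)
    selection = {}
    for track in root.iter("track"):
        switch = track.find("switch")
        # Assumed fallback: the switch's declared default.
        chosen = switch.get("default")
        by_lang = {s.get("lang"): s.get("id") for s in switch.findall("mediaSource")}
        for lang in preferred_langs:
            if lang in by_lang:
                chosen = by_lang[lang]
                break
        selection[track.get("id")] = chosen
    return selection


print(select_tracks(ROE, ["de", "en"]))  # {'v': 'v1', 'a': 'a2'}
print(select_tracks(ROE, ["ja"]))        # {'v': 'v1', 'a': 'a3'}
```

Because exactly one source is chosen inside each <switch>, the UA no longer has to guess which sources are alternatives of each other, which is the decision the flat <source> list leaves open.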

Matterhorn MediaPackage Manifest

The Opencast Matterhorn project, which is defining an enterprise-level, easy-to-install open source podcast and rich media capture, processing and delivery system, has defined a media package manifest, which lists the package’s contents (tracks and metadata) along with their core technical properties. The track description part of the manifest is again very similar to all the formats described above, although it contains a lot more technical detail:

<mediapackage duration="2704016" id="1">
    <media>
        <track id="track-1" type="track/presentation">
            <mimetype>video/quicktime</mimetype>
            <checksum type="md5">0adc841a6dfd47bd7c8cf8db6cbb71c9</checksum>
            <url>http://repository.opencastproject.org/123/tracks/slides-vga.mov</url>
            <size>8754667</size>
            <duration>2704016</duration>
            <audio id="stream-1">
                <encoder type="AAC"/>
                <channels>2</channels>
                <bitdepth>16</bitdepth>
                <bitrate>256000.0</bitrate>
            </audio>
            <video id="stream-2">
                <encoder type="AVC"/>
                <resolution>1024x768</resolution>
                <scantype type="progressive"/>
                <bitrate>454904.0</bitrate>
                <framerate>2</framerate>
            </video>
        </track>
    </media>
</mediapackage>

Most of this technical information should only be relevant to a decoder, but some of it is helpful for choosing between tracks.
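As a sketch of which of these properties might feed a track choice, the following Python snippet reduces each track of the example to its MIME type, URL, combined stream bitrate, and video resolution. The summarise helper and the selection of fields are my own illustration, not part of the Matterhorn tooling.

```python
import xml.etree.ElementTree as ET

# The example media package, reduced to the elements used below.
MEDIAPACKAGE = """<mediapackage duration="2704016" id="1">
  <media>
    <track id="track-1" type="track/presentation">
      <mimetype>video/quicktime</mimetype>
      <url>http://repository.opencastproject.org/123/tracks/slides-vga.mov</url>
      <audio id="stream-1">
        <bitrate>256000.0</bitrate>
      </audio>
      <video id="stream-2">
        <resolution>1024x768</resolution>
        <bitrate>454904.0</bitrate>
      </video>
    </track>
  </media>
</mediapackage>"""


def summarise(package_xml):
    """Return the properties of each track that could drive a track choice."""
    root = ET.fromstring(package_xml)
    summary = []
    for track in root.iter("track"):
        resolution = track.findtext("video/resolution")
        if resolution:
            width, height = (int(n) for n in resolution.split("x"))
        else:
            width = height = None  # e.g. an audio-only track
        summary.append({
            "id": track.get("id"),
            "mimetype": track.findtext("mimetype"),
            "url": track.findtext("url"),
            # Sum of the bitrates of all streams directly inside the track.
            "bitrate": sum(float(b.text) for b in track.findall("*/bitrate")),
            "width": width,
            "height": height,
        })
    return summary


for entry in summarise(MEDIAPACKAGE):
    print(entry)
```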

Further Formats

Further formats that are capable of describing a media resource manifest, but whose functionality goes far beyond that goal, are MPEG-7, MPEG-21 DIDL, and general SMIL.

Since their functionality goes far beyond a mere description of the manifest of a multi-track media resource, they are not regarded here as options – it would be too hard to reduce them to the bare necessities for such a simple exercise. Apart from that, subparts of SMIL have already been used further up.

Summary

It is possible that the manifest stated above, which is already almost entirely supported by HTML5, is sufficient for many of the use cases and requirements that underpin this post. Maybe the introduction of a <text> or rather <itext> element is not necessary when the UA knows from the MIME type what kind of a data stream it is dealing with. However, a grouping element to specify alternate and additional tracks and which tracks should be displayed together with another track choice may be a good idea.

For content discovery issues and negotiation over the network, the existence of a manifest on the server that can describe the virtual media resource can also be a valuable addition. It could also be used to communicate the currently available tracks to an embedded location. This is, in fact, how ROE is being used on metavid – as an additional attribute on the video or audio element.

Please leave your feedback: Do you agree with the idea of re-using the <source> element for describing all the available tracks for a (virtual composite) media resource instead of defining new elements for specific track types (text, sign language, audio description etc.)? How should we solve the need to describe dependencies and relationships between tracks? Do you agree with the need to have an explicit manifest file on the server that accompanies the media resource?

11 Responses to 'Manifests for exposing the structure of a Composite Media Resource'



  1. on November 25th, 2009 at 8:08 pm

    Thanks for taking the time to go through all of this Silvia. I’ll send my feedback to public-html-a11y.

  2. Davy Van Deursen said,

    on November 25th, 2009 at 10:56 pm

    Hi Silvia, interesting post, thanks. Just a small comment regarding the server manifest of Microsoft’s Smooth Streaming technology. Microsoft only uses this manifest on server-side, to interpret incoming requests. The client (i.e., the Silverlight player) uses a so-called client manifest, which is based on a proprietary XML format, and is thus not aware of the SMIL-based server manifest. But as you said, the information contained in the server manifest is very similar to what we need here and could also be useful on client-side.

  3. silvia said,

    on November 25th, 2009 at 11:36 pm

    @Davy

    Yes, indeed, Microsoft’s client-side manifest is well explained here: http://msdn.microsoft.com/en-us/library/ee673438.aspx . It actually also contains all the data from the server manifest, but additionally all the chunking (or “fragment”) information. Thus, it actually exposes binary blocks that can be addressed. Very interesting!

  4. Reimar said,

    on December 27th, 2009 at 10:07 pm

    To “which tracks should be displayed together with another track choice may be a good idea”, MPEG and Matroska etc. use so-called “programs” for that, maybe that is worth imitating?
    As for sign language, I think there really should be a way to specify a video track with the sign language display already integrated and one with only the sign language contents.
    Maybe it is even possible that some time in the future useful sign language can be given via a kind of subtitle track that only describes the gestures (not that likely I admit).

  5. silvia said,

    on December 28th, 2009 at 11:13 am

    @Reimar

    Thanks for your comment.

    Do you have a link to the MPEG or Matroska programs? I wasn’t able to find that. I know there are program streams in MPEG, but that’s just an encapsulation format for MPEG streams. Matroska describes tracks very well, but I couldn’t find a description for how to choose which tracks to display together.

  6. Reimar said,

    on December 28th, 2009 at 6:53 pm

    Hmm, seems I might have had my facts wrong, I haven’t found anything for Matroska so far.
    However for MPEG there is this: http://en.wikipedia.org/wiki/MPEG_transport_stream#PAT
    It is mostly used for broadcast because there are multiple TV programs on one frequency, and this allows separating them out.
    It allows grouping stream IDs along with a bit of metadata.

  7. silvia said,

    on December 28th, 2009 at 7:53 pm

    @Reimar

    It seems the program specific information in MPEG allows you to list the different programs available (in parallel or sequential) WITH all their respective tracks. That last part is indeed interesting. However, I haven’t found a way to describe conditions between tracks – which tracks basically belong together from a language or accessibility POV.

    I guess, maybe my model is too broad and unrealistic. In reality, a video is only made for one language. Then the tracks inside it must contribute on that basis, i.e. contribute subtitles in other languages, contribute captions in that language, contribute sign language tracks that are typical for the area, contribute an audio description in only the main language.

    Then, when a dubbed version is created, a new resource is created that has again all the respective tracks as part of it.

    This will simplify the problem. I’ll have to think some more about it. Thanks.


  8. on December 28th, 2009 at 8:16 pm

    Within the ISO Base Media File Format (which forms the basis for MP4, 3GPP, …), there exists a concept called ‘alternate group’. More specifically, for each track, you can specify the group it belongs to: (copy-paste from the spec)

    “alternate_group”: is an integer that specifies a group or collection of tracks. If this field is 0 there is no information on possible relations to other tracks. If this field is not 0, it should be the same for tracks that contain alternate data for one another and different for tracks belonging to different such groups. Only one track within an alternate group should be played or streamed at any one time, and must be distinguishable from other tracks in the group via attributes such as bitrate, codec, language, packet size etc. A group may have only one member.

  9. silvia said,

    on December 28th, 2009 at 8:48 pm

    @Davy

    Now, that’s very interesting indeed. I only found reference in some papers on the “alternate group”, but it seems to be possible to use it to allow the user to switch between alternative tracks. I wonder if it is in practical use! But it certainly is a good idea and should be capable of encoding most of the ROE style specification. Very interesting idea!

  10. giander said,

    on January 12th, 2010 at 4:14 am

    @Silvia

    “alternate group” is used in order to switch between tracks, such as when you have a podcast with multiple subtitles or audio tracks. You can modify its settings if you go to QuickTime Pro > Show Movie Properties. Dumpster for Mac is a very good program to analyze the structure of mp4 files.

  11. silvia said,

    on January 12th, 2010 at 5:59 am

    @giander That’s helpful, thanks!
