On 14th October I gave a talk at Web Directions South on “HTML5 audio and video – using these exciting new elements in practice”.
I wanted to give people an introduction to how to use these elements while at the same time stirring their imagination as to the design possibilities now that these elements are available natively in browsers. I re-used some of the demos that I have put together for the book that I am currently writing, added some of the cool stuff that others have done, and finished off with an outlook on what new features will probably arrive next.
“Slides” are now available, which are really just a Web page with some demos that work in modern browsers.
Table of contents:
HTML5 Audio and Video
- Cross browser <video> element
- Cross browser <audio> element
- Fallback considerations
- CSS and <video> – samples
- <video> and SVG
- <video> and Canvas
- <video> and Web Workers
- <video> and Accessibility
- audio plans
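For readers who just want the basic markup pattern behind the "cross browser" and "fallback" items above, here is a minimal sketch; the file names and formats are illustrative, not the actual demo assets:

```html
<!-- The browser plays the first <source> whose format it supports;
     older browsers without <video> render the fallback content instead. -->
<video controls width="480" height="270" poster="clip.jpg">
  <source src="clip.webm" type="video/webm">
  <source src="clip.ogv"  type="video/ogg">
  <source src="clip.mp4"  type="video/mp4">
  <!-- Fallback, e.g. a Flash-based player or a plain download link. -->
  <a href="clip.ogv">Download the video</a>
</video>
```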
At this week’s FOMS in New York we had one over-arching topic that seemed to be of interest to every single participant: how to do adaptive bitrate streaming over HTTP for open codecs. On the first day, there was a general discussion about the advantages and disadvantages of adaptive HTTP streaming, while on the second day we moved towards designing a solution for Ogg and WebM. While I didn’t attend all the discussions, I want to summarize in this blog post the insights that I took out of those days and the alternative implementation strategies that we came up with.
Use Cases for Adaptive HTTP Streaming
RTP/RTSP has in the past been the main protocol used to provide live video streams, either for broadcast or for real-time communication. It was purpose-built for chunked video delivery and has features that many customers want, such as the ability to encrypt the stream, to tell players not to store the data, and to monitor the performance of the stream such that its bandwidth can be adapted. It has many disadvantages, however: not least that it goes over ports that normal firewalls block and is thus rather difficult to deploy, but also that it requires special server software and a client that speaks the protocol, and that it incurs signalling overhead on the transport layer for adapting the stream.
RTP/RTSP was invented to allow for video consumption with a high quality of service. In the last 10 years, however, it has become the norm to consume “canned” video (i.e. non-live video) over HTTP, making use of the byte-range request functionality of HTTP for seeking. While methods have been created to estimate the size of a pre-buffer before starting playback, so as to achieve continuous playback based on the bandwidth of your pipe at the beginning of downloading, not much can be done when you run out of pre-buffer in the middle of playback or when the CPU on the machine cannot keep up with decoding the sheer amount of video data: playback stops and goes into re-buffering in the first case, and becomes choppy in the latter.
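To make the byte-range mechanism concrete, here is a small JavaScript sketch of what such a request looks like; the URL and byte offsets are invented, and in practice the browser issues these requests itself when you seek in a <video> element:

```javascript
// Hypothetical example: fetch bytes 1,000,000 to 1,999,999 of a video file.
// Any Web server that supports HTTP/1.1 range requests can answer this.
var xhr = new XMLHttpRequest();
xhr.open('GET', '/videos/talk.ogv', true);
xhr.setRequestHeader('Range', 'bytes=1000000-1999999');
xhr.responseType = 'arraybuffer';
xhr.onload = function () {
  if (xhr.status === 206) {      // 206 Partial Content
    var slice = xhr.response;    // just the requested part of the file
    // ... hand the bytes to a buffer / decoder ...
  }
};
xhr.send();
```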
An obvious approach to improving this situation is to scale down the bandwidth of the video stream, potentially even switching to a lower resolution video, right in the middle of playback. Apple’s HTTP Live Streaming, Microsoft’s Smooth Streaming, and Adobe’s Dynamic Streaming are all solutions in this space. Also, ISO/MPEG is working on DASH (Dynamic Adaptive Streaming over HTTP), an effort to standardize the approach for MPEG media. No solution yet exists for the open formats within Ogg or WebM containers.
Some features of HTTP adaptive streaming are:
- Enables adaptation of the download to avoid continued re-buffering when the network or the machine cannot cope.
- Gapless switching between streams of different bitrate.
- No special server software is required – any existing Web server can be used to provide the streams.
- The adaptation comes from the media player, which actually knows what quality the user experiences, rather than from the network layer, which knows nothing about the performance of the computer and can only tell about the performance of the network.
- Adaptation means that several versions of different bandwidth are made available on the server and the client switches between them based on the knowledge it has about the video quality that the user experiences (a rough sketch of this decision follows the list).
- Bandwidth is not wasted by downloading video data that is not being consumed by the user; rather, content is pulled just moments before it is required, which works for both the live and the canned content case and is particularly useful for long-form content.
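As a rough sketch of the adaptation decision described in the last two points, a client-side player could do something like the following; the bitrate ladder and the headroom factor are invented for illustration:

```javascript
// Hypothetical bitrate ladder: the same video encoded at several bitrates.
var versions = [
  { label: 'low',    bitrate:  300000, url: 'talk_300k.webm'  },
  { label: 'medium', bitrate:  800000, url: 'talk_800k.webm'  },
  { label: 'high',   bitrate: 2000000, url: 'talk_2000k.webm' }
];

// Pick the highest bitrate that fits into the measured throughput,
// keeping roughly 50% headroom so that playback does not stall.
function pickVersion(measuredBitsPerSecond) {
  var choice = versions[0];
  for (var i = 0; i < versions.length; i++) {
    if (versions[i].bitrate * 1.5 <= measuredBitsPerSecond) {
      choice = versions[i];
    }
  }
  return choice;
}

// E.g. after timing the download of the previous piece of content:
var next = pickVersion(1200000);   // about 1.2 Mbit/s measured
// next.url is the version the player would fetch next
```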
In discussions at FOMS it was determined that mid-stream switching between audio files encoded at different bitrates is possible. Just looking at the PCM domain, it requires stitching the waveform together at the switch-over point, but that is not a complex function. To be able to do that stitching with Vorbis-encoded files, there is no need for an overlap of data, because the encoded samples of the previous window in a page of a different bitrate can be used as input into the decoding of the current bitrate page, as long as the resulting PCM samples are stitched.
For video, mid-stream switching to a stream encoded at a different bitrate is also acceptable, as long as the switch-over point falls on a keyframe, which can be decoded independently.
Thus, the preparation of the alternative bitrate videos requires temporal synchronisation of keyframes on the video track; the audio can deal with the switch-over at any point. A bit of intelligent encoding is necessary: requiring the encoding pipeline to provide regular keyframes at a certain rate would be sufficient. The switch-over points are then the keyframes.
With the solutions from Adobe, Microsoft and Apple, the technology has been created such that there are special tools on the server that prepare the content for adaptive HTTP streaming and provide a manifest of the prepared content. Typically, the content is encoded in versions of different bitrates, and these bandwidth versions are broken into chunks that can be decoded independently. The chunks are synchronised between the different bitrate versions such that there are defined switch-over points. The switch-over points as well as the file names of the different chunks are documented inside a manifest file. It is this manifest file that the player downloads at the beginning of streaming, instead of the resource itself. The manifest file informs the player of the available resources and enables it to orchestrate the correct URL requests to the server as it progresses through the resource.
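A sketch of how a player might use such a manifest; the JSON structure here is invented for illustration and is not any of the vendors’ actual formats:

```javascript
// Invented manifest: chunk lists for two bitrate versions, 10 seconds each.
var manifest = {
  chunkDuration: 10,
  versions: {
    low:  ['low/chunk000.webm',  'low/chunk001.webm',  'low/chunk002.webm'],
    high: ['high/chunk000.webm', 'high/chunk001.webm', 'high/chunk002.webm']
  }
};

// At every switch-over point the player re-decides which quality to fetch,
// based on how quickly the previous chunk arrived.
function nextChunkUrl(chunkIndex, measuredBitsPerSecond) {
  var quality = measuredBitsPerSecond > 1000000 ? 'high' : 'low';
  return manifest.versions[quality][chunkIndex];
}
```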
At FOMS, we took a step back from this approach and analysed what the general possibilities are for solving adaptive HTTP streaming. For example, it would be possible to not chunk the original media data, but instead perform range requests on the different bitrate versions of the resource. The following options were identified.
Chunking
With Chunking, the original bitrate versions are chunked into smaller full resources with defined switch-over points. This implies creating a header on each one of the chunks and thus introduces overhead. Assuming we use 10 sec chunks and a 6 kByte header per chunk, that results in about 5 kbit/sec of extra overhead. After chunking the files in this way, we provide a manifest file (similar to Apple’s m3u8 file, the SMIL-based manifest file of Microsoft, or Adobe’s Flash Media Manifest file). The manifest file informs the client about the chunks and the switch-over points, and the client requests those different resources at the switch-over points.
Disadvantages:
- Header overhead on the pipe.
- Switch-over delay for decoding the header.
- Possible problem with TCP slowstart on new files.
- A piece of software is necessary on server to prepare the chunked files.
- A large amount of files to manage on the server.
- The client has to hide the switching between full resources.
Advantages:
- Works for live streams, where an increasing number of chunks is written.
- Works well with CDNs, because mid-stream switching to another server is easy.
- Chunks can be encoded such that there is no overlap in the data necessary on switch-over.
- May work well with Web sockets.
- Follows the way in which proprietary solutions are doing it, so may be easy to adopt.
- If the chunks are concatenated on the client, you get chained Ogg files (similar concept in WebM?), which are planned to be supported by Web browsers and are thus legal files.
Chained Chunks
As an alternative to creating the large number of files, one could also just create the chained files. Then the switch-over is not between different files, but between different byte ranges. The headers still have to be read and parsed, and a manifest file still has to exist, but it now points to byte ranges rather than to different resources.
Advantages over Chunking:
- No TCP-slowstart problem.
- No large number of files on the server.
Disadvantages over Chunking:
- Mid-stream switching to other servers is not easily possible – CDNs won’t like it.
- Doesn’t work with Web sockets as easily.
- New approach that vendors will have to grapple with.
Since in Chained Chunks we are already doing byte-range requests, it is a short step towards simply dropping the repeating headers and just downloading them once at the beginning for all possible bitrate files. Then, as we seek to different positions in “the” file, the byte range of the bitrate version that makes sense to retrieve at that stage would be requested. This could even be done with media fragment URIs, though addressing with time ranges is less accurate than explicit byte ranges.
In contrast to the previous two options, this basically requires keeping n different decoding pipelines alive – one for every bitrate version. The byte ranges of the chunks will then be interpreted by the appropriate pipeline. The manifest now points to keyframes as switch-over points.
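For illustration, a media fragment URI addresses a time range directly in the URL; the file name below is invented, and the mapping of time to bytes is done by the user agent and/or server, which is what makes it less accurate than an explicit byte range:

```javascript
var video = document.querySelector('video');
// Media Fragments time range: roughly seconds 60 to 70 of the high-bitrate version.
video.src = 'talk_high.ogv#t=60,70';
```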
Advantages over Chained Chunking:
- No header overhead.
- No continuous re-initialisation of decoding pipelines.
Disadvantages over Chained Chunking:
- Multiple decoding pipelines need to be maintained and byte ranges managed for each.
Unchunked Byte Ranges
We can even consider going all the way and not preparing the alternative bitrate resources for switching at all, i.e. not making sure that the keyframes align. This then requires the player to do the switching itself: determine when the next keyframe comes up in its current stream, seek to that position in the next stream (always making sure to go back to the last keyframe before that position), and discard all data until it arrives at the same offset.
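Very roughly, and with invented helper functions standing in for whatever demuxer and decoder calls a real player would use, that switching logic would look something like this:

```javascript
// Sketch only: findNextKeyframeTime, seekToKeyframeBefore and readFrame are
// hypothetical stand-ins for real demuxer/decoder calls.
function switchStream(currentStream, nextStream) {
  // 1. Find the next keyframe in the stream we are currently playing.
  var switchTime = findNextKeyframeTime(currentStream);

  // 2. In the new stream, go back to the last keyframe before that time ...
  seekToKeyframeBefore(nextStream, switchTime);

  // 3. ... and decode-but-discard frames until the new stream has caught up,
  //    so that both streams line up at the same offset.
  var frame = readFrame(nextStream);
  while (frame && frame.time < switchTime) {
    frame = readFrame(nextStream);   // overlap data: downloaded but never shown
  }

  return switchTime;                 // playback continues on nextStream from here
}
```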
Disadvantages:
- There will be an overlap in the timeline for download, which has to be managed from the buffering and alignment POV.
- Overlap poses a challenge of downloading more data than necessary at exactly the time where one doesn’t have bandwidth to spare.
- Requires seeking.
Advantages:
- No special authoring of resources on the server is needed.
- Requires a very simple manifest file only with a list of alternative bitrate files.
At FOMS we weren’t able to make a final decision on how to achieve adaptive HTTP streaming for open codecs. Most agreed that moving forward with the first option (Chunking) would be the right thing to do, but the sheer number of files that it can create is daunting, and it would be nice to avoid that for users.
Today I gave a talk at the Open Video Conference about the state of the specifications in HTML5 for media accessibility.
Over the last two days we had the Open Subtitles Summit here in New York. It was very exciting to feel the energy in the room to make a change to media accessibility – I am sure we will see much development over the next 12 months. We spoke a lot about HTML5 video and standards and had many discussions about subtitles, captions, and other accessibility information.
On Wednesday we had a discussion about metadata and I quickly realized that “your metadata is not my metadata”: everyone used the word for something different. So I suggested having a metadata discussion on Thursday where we would put some structure on all of this and identify what kinds of metadata we have and whether and how they should be supported in HTML5 standards.
Our basic findings are very simple and widely accepted. There are three fundamentally different types of metadata:
- Technical metadata about video: information about the format of the resource – things that can be determined automatically and are non-controversial, such as the width, height, framerate, audio sample rate etc. This information can be used, for example, to decide whether a video is appropriate for a certain device.
- Semantic metadata about video: semantic information about the video resource – e.g. license, author, publication date, version, attribution, title, description. This information is good for search and identification.
- Timed semantic metadata: semantic information that is associated with time intervals of the video, not with the full video – e.g. active speaker, location, date-time, objects.
As we talked about this further, however, we identified subclasses of these generic types that are very important to identify because they will be handled differently.
We found that semantic metadata can be separated into universal metadata and domain-specific metadata. Universal metadata is semantic metadata that can basically be applied to any content. There is very little of it, and the W3C Media Annotations WG has done a pretty good job of identifying it. Domain-specific metadata is metadata that applies only to some content; e.g. videos about sports have metadata such as game scores, players, or the type of sport.
As for adding such metadata into media resources, we discussed that it makes sense to have the universal metadata explicitly spelled out and to have a generic means to associate name-value pairs with a resource. Of course it will all be stored in databases, but there was also a requirement to have it encoded into the media resource – and, in the case we discussed, into external caption or subtitle files.
As for timed metadata, it is possible to separate this into metadata that is only relevant as part of a subtitle or caption file, because it relates to a certain word or word sequence, and independent timed metadata that can be stored in, e.g., JSON or some similar format.
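To give a flavour of that second kind, here is a sketch of what independent timed metadata could look like as JSON; the field names are made up and do not follow any agreed format:

```javascript
// Invented structure: time-aligned metadata kept outside the caption file.
var timedMetadata = [
  { start:  0.0, end: 12.5, speaker: 'Alice', location: 'New York' },
  { start: 12.5, end: 30.0, speaker: 'Bob',
    links: ['http://example.com/related-article'] }
];
```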
Since we are particularly interested in subtitles and captions, the timed metadata that is associated with words or word sequences is particularly important. The most natural metadata that is useful as part of subtitles is of course speaker segmentation. We also identified that hyperlinks to related content are just as important, since they can enable applications such as popcorn.js.
Potentially there is a use for associating metadata with any sequence of words in a caption or subtitle, which could be satisfied by a generic markup element for a sequence of words with which microdata or RDFa can be associated. A request for such a generic means of associating metadata was made. However, the need for it still has to be confirmed with good use cases – the breakout group ran out of time as we came to this point. So, leave your ideas for use cases in the requirements – they will help shape standards.
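Purely as an illustration of what such a generic element could enable (the class names, data attributes and vocabulary below are invented, not anything that has been specified), speaker metadata might be attached to a run of words in a caption cue with microdata-style markup:

```html
<!-- Hypothetical caption cue with microdata attached to a word sequence. -->
<p class="cue" data-start="12.5" data-end="15.0">
  <span itemscope itemtype="http://example.org/Speaker">
    <span itemprop="name">Alice</span>
  </span>: and that is why accessible video matters.
</p>
```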