Hi Ian, all,

I'd like to pick up again the discussion about which file format should be supported as the baseline in HTML5 for providing time-synchronized text for media resources. This is particularly important now that we have the WebSRT proposal at http://www.whatwg.org/specs/web-apps/current-work/websrt.html and decisions need to be made about implementations.

My intention with this discussion is to clearly understand our reasons for going with the specifications that we will implement. I'd rather have the discussions now than later, when implementations exist and changes are difficult.

So, I want to make sure we have good reasons not to go with an existing alternative format. I also want to make sure we have good reasons for the decision not to go with an XML-based format, in particular given that so much of HTML could be reused if we went with a markup language. Finally, I want to point out some limitations of WebSRT that I've come across and that we can improve, should we really decide to stick with WebSRT.

CONSIDERING EXISTING FORMATS

The question about re-using existing formats has thus far led us to the following conclusions:

* SRT is nice and simple, but insufficient for most advanced use cases, including captions and subtitles
* DFXP/TTML is problematic since it uses XML namespaces, XSL-FO for styling (which conflicts with CSS), and doesn't support <ruby> and several other expressed needs.

There are many other formats that have been considered (see http://wiki.whatwg.org/wiki/Timed_track_formats) and dismissed for one reason or another.

The formats that have been (and others that could still be) inspected broadly fall into the following classes:

* non-XML text formats developed by applications to store subtitles (or captions), e.g. SRT, MicroDVD, DVD Studio Pro, SSA/ASS, Subsonic, Subviewer
* XML-based text formats developed by applications to store subtitles or captions, e.g. RealText, SAMI, USF, Structured Subtitle Format
* binary formats that encode text to go into media files, e.g. 3GPP in MPEG, QTText in Quicktime, Kate in Ogg
* legacy binary formats that were used to exchange subtitles and teletext between software/hardware in the pre-digital age, e.g. EBU-STL, PAC

The problem with looking at these existing formats is that each was built for a particular purpose that, almost without exception, does not apply to our problem at hand. Also, if we look at their capabilities, we find that the overlapping functionality that they all share is captured in simple SRT, while their combined functionality is so encompassing that not even TTML - which has been developed as an exchange format for such file formats - provides for all of it.

Note that the subtitling community has traditionally been using the SubRip (srt) or SubViewer (sub) formats as a simple format and SubStation Alpha (ssa/ass) as the comprehensive format. Aegisub, the successor of SubStation Alpha, is still the most popular subtitling software and ASS is the currently dominant format. However, even this community is right now developing a new format called AS6. This shows that the subtitling community hasn't really converged on a "best" format yet either.

So, given this background and the particular needs that we have with implementing support for a time-synchronized text format in the Web context, it would probably be best to start a new format from a clean slate rather than building it on an existing format. Learn from what exists, but make something completely new that works best for the Web environment. If others pick it up - good - but it's not a requirement.

A BRIEF OVERVIEW OF WMML

Given that WebSRT is basically an effort at creating a new format that does not use markup (or at least: not much), I took the step of experimenting with a new markup-based format, which I call WMML (Web Media Markup Language) to be able to refer to it. It is specified at https://wiki.mozilla.org/Accessibility/Video_Text_Format .

XML-based formats for subtitles have been successful in the past, e.g. SAMI, RealText, 3GPP. Also, many other successful formats on the Web are XML-based, e.g. RSS, XSPF, and there are plenty of parsers around to tokenize such formats.

Also, authors already know how to write HTML, so if our time-synchronized text format does not require a steep learning curve, we will end up with more captions and subtitles faster, both hand-made and from authoring applications. So, even though such a format will be more verbose than a plain text format, there are a lot of good reasons to build a new format with markup rather than without.

I developed WMML as an XML-based caption format that does not have the problems that have been pointed out for DFXP/TTML, namely: there are no namespaces, it doesn't use XSL-FO but instead fully reuses CSS, and it supports innerHTML markup in the cues instead of inventing its own markup. Check out the examples at https://wiki.mozilla.org/Accessibility/Video_Text_Format .

WMML's root element contains some attributes that are important for specifying how to use the resource:

* a @lang attribute which specifies what language the resource is written in
* a @kind attribute which specifies the intended use of the resource, e.g. caption, subtitle, chapter, description
* a @profile attribute which specifies the format used in the cues and thus the parser that should be chosen, including "plainText", "minimalMarkup", "innerHTML", "JSON", "any" (other formats can be developed)

WMML completely reuses the HTML <head> element. This has the following advantages:

* there is a means to associate metadata in the form of name-value pairs with the time-synchronized text resource.
  There is a particular need to be able to manage at least the following metadata for time-synchronized text resources:
  ** the association with the media resource and its metadata, such as title, actors, url, duration
  ** the author, authoring date, copyright, ownership, license and usage rights for the time-synchronized text resource
* there is a means to include in-line styles and a means to include a link to an external style sheet
* there is a possibility to provide script that affects only the time-synchronized text resource

WMML doesn't have a <body> element, but instead has a <cuelist> element. It was important not to reuse <body> in order to allow only <cue> elements inside the main part of the WMML resource. This makes it a document that can also easily be encapsulated in binary media resources such as WebM, Ogg or MPEG-4, because each cue is essentially a "codec data page" associated with a given timeline, while anything in the root and head elements is a "codec header". In this way, the hierarchical document structure is easily flattened.

The <cue> elements have a start and an end time attribute and contain innerHTML, thus there is already parsing code available in Web browsers to deal with this content. Any Web content can be introduced into a <cue> and the Web browsers will already be able to render it. A single addition has been made to WMML cue elements in the form of a <t> element, which enables Karaoke, but this is not strictly necessary since we already have the CSS3 transition-delay property, which can provide for this need.

COMPARING WebSRT and WMML

Examples that I experimented with are at https://wiki.mozilla.org/Accessibility/Video_Text_Format_Comparison .

There are a few things I like about WebSRT.

1. First and foremost I believe that the ability to put different types of content into a cue is really powerful.
   It turns WebSRT into a platform for delivering time-synchronized text rather than just markup for a particular application of time-synchronized text. Allowing absolutely anything in cues makes it future-proof.

2. There is a natural mapping of WebSRT into in-band text tracks. Each cue naturally maps into an encoding page (just like a WMML cue does, too). But in WebSRT, because the setup information is not carried in a hierarchical element surrounding all cues, it is easier to just chuck anything that comes before the first cue into an encoding header page. For WMML, this problem can be solved, but it is less natural.

3. I am not too sure, but the "voice" markup may be useful. At this point I do wonder whether it has any further use than a @class attribute has in normal markup, but the idea of providing some semantic information about the content in cues is interesting. Right now it's only used to influence styling, but it could have a semantic use, too - uses that microformats or RDFa are also targeting.

4. It's a light-weight format in that it is not very verbose. It is nice for hand-authoring if you don't have to write so much. This is particularly true for the simple case, e.g. when newlines that you author are automatically kept as newlines when interpreted. The drawback is that as soon as you include more complicated markup in the cues (e.g. HTML markup or an SVG image), you're not allowed to put empty lines into it, because they have a special meaning. So, while it is true that the number of characters for WebSRT will always be less than for any markup-based format, this may be really annoying in any of the cases that need more than plain text.

Now, I've tried to include point 1 in WMML, but because WMML is XML-based, the ability to include any kind of markup in cues is not so elegant. It is, however, controlled by the @profile attribute on the <wmml> element, so applications should be able to deal with it.
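
As a rough illustration of how an application might use @profile and split a WMML resource into header data and per-cue packets, here is a minimal Python sketch. It is not taken from the WMML draft: the element names follow the description above, but the "start"/"end" attribute names, the hh:mm:ss.mmm timestamp syntax, the fallback to "plainText" when @profile is missing, and the sample document itself are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical WMML resource. The "start"/"end" attribute names and the
# timestamp syntax are assumptions for this sketch, not part of the draft.
SAMPLE = """\
<wmml lang="en" kind="caption" profile="minimalMarkup">
  <head>
    <meta name="author" content="Jane Doe"/>
  </head>
  <cuelist>
    <cue start="00:00:01.000" end="00:00:04.000">Hello world</cue>
    <cue start="00:00:05.000" end="00:00:08.500">Second <b>cue</b></cue>
  </cuelist>
</wmml>
"""


def parse_time(value):
    """Convert an assumed hh:mm:ss.mmm timestamp into seconds."""
    hours, minutes, seconds = value.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def cue_payload(cue, profile):
    """Pick the cue content according to the @profile of the resource."""
    if profile == "plainText":
        # Plain text profile: drop any markup, keep only character data.
        return "".join(cue.itertext())
    # Any other profile: hand the inner markup on unparsed, as a string.
    inner = cue.text or ""
    for child in cue:
        inner += ET.tostring(child, encoding="unicode")
    return inner


def flatten(source):
    """Split a WMML resource into one header and a list of cue packets."""
    root = ET.fromstring(source)
    if root.tag != "wmml":
        raise ValueError("not a WMML resource")
    # Defaulting to "plainText" is an assumption of this sketch.
    profile = root.get("profile", "plainText")
    header = {
        "lang": root.get("lang"),
        "kind": root.get("kind"),
        "profile": profile,
        "meta": {m.get("name"): m.get("content")
                 for m in root.iterfind("head/meta")},
    }
    packets = [
        (parse_time(c.get("start")), parse_time(c.get("end")),
         cue_payload(c, profile))
        for c in root.iterfind("cuelist/cue")
    ]
    return header, packets


header, packets = flatten(SAMPLE)
print(header)
for packet in packets:
    print(packet)
```

Everything from the root and <head> ends up in a single header structure, and each <cue> becomes one (start, end, payload) tuple, which is the flattening that an in-band encapsulation would need.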

Point 2 is possible in WMML through "encoding" all outer markup in a header and the cues in the data packets.

Point 3 is also possible in WMML through the use of the @class attribute on cues.

Point 4 really is something where WMML cannot compete with WebSRT - it will always be more verbose. However, authors are able to write nicely formatted WMML files that contain innerHTML in cues without having to worry about the newlines that they use, so this is a double-edged sword.

Now to the things that WMML provides where WebSRT needs to improve.

1. Extensibility with header data.

In contrast to being flexible about what goes into the cues, WebSRT is completely restrictive and non-extensible in all the content that is outside the cues. In fact, no content other than comments is allowed outside the cues. This creates the following problems:

* there is no possibility to add file-wide metadata to WebSRT; things about authoring and usage rights as well as information about the media resource that the file relates to should be kept within the file. Almost all subtitle and caption formats have the possibility for such metadata, and we know from image, music and video resources how important it is to have the ability to keep such metadata inside the resource.
* there is no language specification for a WebSRT resource; while this will not be a problem when used in conjunction with a <track> element, it still is a problem when the resource is used just by itself, in particular as a hint for font selection and speech synthesis.
* there is no style sheet association for a WebSRT resource; this can be resolved by having the style sheet linked into the Web page where the resource is used with the video, but that's not possible when the resource is used by itself. It needs something like a <link> to a CSS resource inside the WebSRT file.
* there is no magic identifier for a WebSRT resource, i.e. what the <wmml> element is for WMML.
  This makes it almost impossible to create a program that can tell what file type this is, in particular since we have made the line numbers optional. We could use "-->" as an indicator, but it's not a good signature.
* there is no means to identify which parser is required in the cues (is it "plain text", "minimal markup", or "anything"?) and therefore it is not possible for an application to know how it should parse the cues.
* there is no version number on the format, thus it will be difficult to introduce future changes.

I believe the fundamental issue with the lack of such markup on the header level is that we are trying hard to stay backwards compatible with SRT. If we were to just break away from that, it would give us the opportunity to create solutions for all these issues. But let me address this issue properly.

2. Break the SRT link.

I can understand that the definition of WebSRT took inspiration from SRT for creating a simple format. But realistically most SRT files will not be conformant WebSRT files because they are not written in UTF-8. Further, realistically, all WebSRT files that use more than just the plain text markup are not conformant SRT files. So, let's stop pretending there is compatibility and just call WebSRT a new format. In fact, the subtitling community itself has already expressed its objections to building an extension of SRT, see http://forum.doom9.org/showthread.php?p=1396576 , so we shouldn't try to enforce something that the people it was meant for don't want. A clean slate will be better for all.

* the mime type of WebSRT resources should be different from that of SRT files, since they are so fundamentally different; e.g. text/websrt
* the file extension of WebSRT resources should be different from that of SRT files, e.g. wsrt

3. Introduce an innerHTML type for cues

Right now, there is "plain text", "minimal markup" and "anything" allowed in the cues.
Seeing as WebSRT is built with the particular purpose of bringing time-synchronized text to HTML5 media elements, it makes no sense to exclude all the capabilities of HTML. Also, with all the typical parsers and renderers available in UAs, support of innerHTML in cues should be simple to implement. The argument that offline applications don't support it is not relevant, since we have no influence on whether standalone media applications will actually follow the HTML5 format choice. That WebSRT with "plain text" and "minimal markup" can be supported easily in standalone media applications is a positive side effect, but not an aim in itself for HTML5, and it should have no influence on our choices.

4. Make full use of CSS

In its current form, WebSRT makes only limited use of existing CSS. I see the following limitations in particular:

* no use of the positioning functionality is made and instead a new means of positioning is introduced; it would be nicer to just have this reuse CSS functionality. It would also avoid having to repeat the positioning information on every single cue.
* little use of formatting functionality is made by restricting it to only use 'color', 'text-shadow', 'text-outline', 'background', 'outline' and 'font'
* cue-related metadata ("voice") could be made more generic; why not reuse "class"?
* there is no definition of the "canvas" dimensions that the cues are prepared for (width/height) and expected to work with, other than saying it is the video dimensions - but these can change and the proportions should change with them
* it is not possible to associate CSS styles with segments of text, but only with a whole cue using ::cue-part; it's thus not possible to just highlight a single word in a cue
* when HTML markup is used in cues, as the specification stands, that markup is not parsed and therefore cannot be associated with CSS; again, this can be fixed by making innerHTML in cues valid

5. Other issues

* I noticed that it is not possible to make a language association with segments of text and thus it is not possible to have text with mixed languages.
* Is it possible to reuse the HTML font systems?

IN SUMMARY

Having proposed an XML-based format, I would like to understand the reasons why it is not a good idea and why a plain text format that has no structure other than that provided through newlines and start/end times should be better and more extensible.

Also, if we really are to go with WebSRT, I am looking for a discussion on those suggested improvements.

Cheers,
Silvia.