Hi Ian, all,

I'd like to pick up again on the discussion about what file format should be
supported as baseline in HTML5 for providing time-synchronized text for
media resources. This is particularly important now that we have the WebSRT
proposal http://www.whatwg.org/specs/web-apps/current-work/websrt.html and
decisions need to be made about implementations.

My intention with this discussion is to clearly understand our reasons for
going with the specifications that we will implement. I'd rather have the
discussions now than later when implementations exist and changes are
difficult.

So, I want to make sure we have good reasons not to go with an alternative
existing format.
And I want to make sure we also have good reasons for the decision not to go
with an xml-based format, in particular given that so much of HTML could be
reused if we went with a markup language.
Finally, I want to point out some limitations of WebSRT that I've come
across and that we can improve, should we really decide to stick with
WebSRT.


CONSIDERING EXISTING FORMATS

The question about re-using existing formats has thus far led us to the
following conclusions:
* SRT is nice and simple, but insufficient for most advanced use cases,
including captions and subtitles
* DFXP/TTML is problematic since it uses XML namespaces, XSL-FO for styling
(which conflicts with CSS), and doesn't support <ruby> and several other
expressed needs.

Many other formats have been considered (see
http://wiki.whatwg.org/wiki/Timed_track_formats) and dismissed for one
reason or another.

The formats that have been (and others that could still be) inspected
broadly fall into the following classes:
* non-xml text formats developed by applications to store subtitles (or
captions), e.g. SRT, MicroDVD, DVD Studio Pro, SSA/ASS, Subsonic, SubViewer
* xml-based text formats developed by applications to store subtitles or
captions, e.g. RealText, SAMI, USF, Structured Subtitle Format
* binary formats that encode text to go into media files, e.g. 3GPP in MPEG,
QTText in Quicktime, Kate in Ogg
* legacy binary formats that were used to exchange subtitles and teletext
between software/hardware in the pre-digital age, e.g. EBU-STL, PAC

The problem with looking at these existing formats is that each was built
for a particular purpose that, almost without exception, does not match our
problem at hand. Also, if we look at their capabilities, we find that the
overlapping functionality that all share is captured in simple SRT, while
the combined functionality of them all is so encompassing that not even TTML
- which was developed as an exchange format for such file formats -
provides for them all.

Note that the subtitling community has traditionally been using the SubRip
(srt) or SubViewer (sub) formats as a simple format and SubStation Alpha
(ssa/ass) as the comprehensive format. Aegisub, the successor of SubStation
Alpha, is still the most popular subtitling software and ASS is the
currently dominant format. However, even this community is right now
developing a new format called AS6. This shows that the subtitling community
also hasn't really converged on a "best" format yet.

So, given this background and the particular needs that we have with
implementing support for a time-synchronized text format in the Web context,
it would probably be best to start a new format from a clean slate rather
than building it on an existing format. Learn from what exists, but make
something completely new that works best for the Web environment. If others
pick it up - good - but it's not a requirement.


A BRIEF OVERVIEW OF WMML

Given that WebSRT is basically an effort at creating a new format that is
not using markup (or at least: not much), I took the step to experiment with
a new markup-based format which I call WMML (Web Media Markup Language) to
be able to refer to it. It is specified at
https://wiki.mozilla.org/Accessibility/Video_Text_Format .

XML-based formats for subtitles have been successful in the past, e.g. SAMI,
RealText, 3GPP. Also, many other successful formats on the Web are
XML-based, e.g. RSS, XSPF and there are plenty of parsers around to tokenize
such formats. Also, authors already know how to write HTML, so if our
time-synchronized text format does not require a steep learning curve, we
will end up with more captions and subtitles faster, both hand-made and from
authoring applications. So, even though such a format will be more verbose
than a plain text format, there are a lot of good reasons to build a new
format with markup rather than without.

I developed WMML as an XML-based caption format that will not have the
problems that have been pointed out for DFXP/TTML, namely: there are no
namespaces, it doesn't use XSL-FO but instead fully reuses CSS, and it
supports innerHTML markup in the cues instead of inventing its own markup.
Check out the examples at
https://wiki.mozilla.org/Accessibility/Video_Text_Format .

WMML's root element contains some attributes that are important for
specifying how to use the resource:
* a @lang attribute which specifies the language the resource is written
in
* a @kind attribute which specifies the intended use of the resource, e.g.
caption, subtitle, chapter, description
* a @profile attribute which specifies the format used in the cues and thus
the parser that should be chosen, including "plainText", "minimalMarkup",
"innerHTML", "JSON", "any" (other formats can be developed)

WMML completely reuses the HTML <head> element. This has the following
advantages:
* there is a means to associate metadata in the form of name-value pairs
with the time-synchronized text resource. There is a particular need to be
able to manage at least the following metadata for time-synchronized text
resources:
  ** the association with the media resource and its metadata, such as
title, actors, url, duration
  ** the author, authoring date, copyright, ownership, license and usage
rights for the time-synchronized text resource
* there is a means to include in-line styles and a means to include a link
to an external style sheet
* there is a possibility to provide script that just affects the
time-synchronized text resource

WMML doesn't have a <body> element, but instead has a <cuelist> element. It
was important not to reuse <body> in order to allow only <cue> elements
inside the main part of the WMML resource. This makes it a document that can
also easily be encapsulated in binary media resources such as WebM, Ogg or
MPEG-4 because each cue is essentially a "codec data page" associated with a
given timeline, while anything in the root and head elements makes up the
"codec headers". In this way, the hierarchical document structure is easily
flattened.
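
As a rough illustration of this flattening, here is a minimal Python sketch.
It assumes cue attributes named "start" and "end" holding seconds as decimal
numbers, and a made-up sample document; neither is taken from the WMML
specification, and the "header"/"packet" structures are hypothetical
stand-ins for whatever a real container would use.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; attribute names and time syntax are assumptions.
SAMPLE = """<wmml lang="en" kind="subtitle" profile="plainText">
  <head><title>Example</title></head>
  <cuelist>
    <cue start="0.0" end="2.0">First line</cue>
    <cue start="2.0" end="4.5">Second line</cue>
  </cuelist>
</wmml>"""


def flatten(source):
    """Split a WMML-like document into one header and a list of cue packets."""
    root = ET.fromstring(source)
    head = root.find("head")
    # Everything outside the cues goes into the header.
    header = {
        "root_attributes": dict(root.attrib),
        "head": ET.tostring(head, encoding="unicode").strip()
        if head is not None
        else "",
    }
    packets = []
    for cue in root.iter("cue"):
        cue.tail = None  # drop the whitespace between cues
        packets.append(
            (
                float(cue.get("start")),
                float(cue.get("end")),
                ET.tostring(cue, encoding="unicode"),
            )
        )
    return header, packets


header, packets = flatten(SAMPLE)
print(header["root_attributes"])
for start, end, payload in packets:
    print(start, end, payload)
```

Each packet carries its own start and end time, so it can be placed on the
media timeline without reference to the surrounding document.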

The <cue> elements have a start and end time attribute and contain
innerHTML, thus there is already parsing code available in Web browsers to
deal with this content. Any Web content can be introduced into a <cue> and
the Web browsers will already be able to render it.
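
To make the role of the @profile attribute more concrete, here is a small
Python sketch of how an application might pick a cue parser from it. The
sample document, the "start"/"end" attribute names, the default profile and
the helper function names are all assumptions for illustration; only two of
the profiles are sketched.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample; cue attribute names and time syntax are assumptions.
SAMPLE = """<wmml lang="en" kind="caption" profile="plainText">
  <cuelist>
    <cue start="0.0" end="2.5">Hello, world.</cue>
  </cuelist>
</wmml>"""


def parse_plain_text(cue):
    # Plain text: concatenate all character data, ignoring child markup.
    return "".join(cue.itertext()).strip()


def parse_json(cue):
    # JSON profile: the cue's character data is treated as a JSON document.
    return json.loads(cue.text or "null")


# Only two of the listed profiles are covered in this sketch.
CUE_PARSERS = {"plainText": parse_plain_text, "JSON": parse_json}


def read_resource(source):
    root = ET.fromstring(source)
    if root.tag != "wmml":
        raise ValueError("not a WMML resource")
    profile = root.get("profile", "plainText")  # default is an assumption
    parser = CUE_PARSERS.get(profile)
    if parser is None:
        raise ValueError(f"unsupported profile: {profile}")
    return {
        "lang": root.get("lang"),
        "kind": root.get("kind"),
        "profile": profile,
        "cues": [parser(cue) for cue in root.iter("cue")],
    }


info = read_resource(SAMPLE)
print(info)
```

The point of the sketch is only that the application learns from the root
element, before touching any cue, which parser it needs.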

A single addition has been made to WMML cue elements in the form of a <t>
element, which enables Karaoke, but this is not strictly necessary since we
already have the CSS3 transition-delay property which can provide for this
need.



COMPARING WebSRT and WMML

Examples that I experimented with are at
https://wiki.mozilla.org/Accessibility/Video_Text_Format_Comparison .

There are a few things I like about WebSRT.

1. First and foremost I believe that the ability to put different types of
content into a cue is really powerful.
It turns WebSRT into a platform for delivering time-synchronized text rather
than just markup for a particular application of time-synchronized text. It
makes it future-proof to allow absolutely anything in cues.

2. There is a natural mapping of WebSRT into in-band text tracks.
Each cue naturally maps into an encoding page (just like a WMML cue does,
too). But in WebSRT, because the setup information is not brought in a
hierarchical element surrounding all cues, it is easier to just chuck
anything that comes before the first cue into an encoding header page. For
WMML, this problem can be solved, but it is less natural.

3. I am not too sure, but the "voice" markup may be useful.
At this point I do wonder whether it has any further use than a @class
attribute has in normal markup, but the idea of providing some semantic
information about the content in cues is interesting. Right now it's only
used to influence styling but it could have a semantic use, too - uses that
microformats or RDFa are also targeting.

4. It's a light-weight format in that it is not very verbose.
It is nice for hand-authoring if you don't have to write so much. This is
particularly true for the simple case, e.g. newlines that you author are
automatically kept as newlines when interpreted. The drawback is that as
soon as you include more complicated markup in the cues (e.g. HTML markup
or an SVG image), you're not allowed to put empty lines into it because
they have a special meaning. So, while it is true that the number of
characters for WebSRT will always be less than for any markup-based format,
this restriction may be really annoying in any of the cases that need more
than plain text.
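
The empty-line problem can be shown with a naive Python sketch of a splitter
that, as in SRT, treats blank lines as cue separators. This is not the
WebSRT parsing algorithm, just an illustration; the timing lines are
illustrative only and are not parsed.

```python
import re


def split_cues(text):
    """Split a cue file into blocks at runs of blank lines."""
    blocks = re.split(r"\n(?:[ \t]*\n)+", text.strip())
    return [block for block in blocks if block.strip()]


# Two plain-text cues: the blank line separates them as intended.
PLAIN = (
    "00:00:01.000 --> 00:00:03.000\n"
    "First line\n"
    "Second line\n"
    "\n"
    "00:00:04.000 --> 00:00:06.000\n"
    "Next cue\n"
)

# One cue with nicely spaced markup: the blank line inside the list
# is mistaken for a cue boundary.
MARKUP = (
    "00:00:01.000 --> 00:00:03.000\n"
    "<ul>\n"
    "  <li>one</li>\n"
    "\n"
    "  <li>two</li>\n"
    "</ul>\n"
)

print(len(split_cues(PLAIN)))
print(len(split_cues(MARKUP)))
```

Both inputs come out as two blocks, although the second one was meant to be
a single cue: its markup is cut in half at the blank line.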

Now, I've tried to include point 1 into WMML, but because WMML is xml-based,
the ability to include any kind of markup into cues is not so elegant. It
is, however, controlled by the @profile attribute on the <wmml> element, so
applications should be able to deal with it.

Point 2 is possible in WMML through "encoding" all outer markup in a header
and the cues in the data packets.

Point 3 is also possible in WMML through the use of the @class attribute on
cues.

Point 4 really is something where WMML cannot compete with WebSRT - it will
always be more verbose. However, authors are able to author nicely formatted
WMML files that contain innerHTML in cues without having to worry about the
newlines that they use, so this is a double-edged sword.


Now to the things that WMML provides where WebSRT needs to improve.


1. Extensibility with header data.

In contrast to being flexible about what goes into the cues, WebSRT is
completely restrictive and non-extensible in all the content that is outside
the cues. In fact, no content other than comments is allowed outside the
cues. This creates the following problems:

* there is no possibility to add file-wide metadata to WebSRT; things about
authoring and usage rights as well as information about the media resource
that the file relates to should be kept within the file. Almost all subtitle
and caption formats have the possibility for such metadata and we know from
image, music and video resources how important it is to have the ability to
keep such metadata inside the resource.

* there is no language specification for a WebSRT resource; while this will
not be a problem when used in conjunction with a <track> element, it still
is a problem when the resource is used just by itself, in particular as a
hint for font selection and speech synthesis.

* there is no style sheet association for a WebSRT resource; this can be
resolved by having the style sheet linked into the Web page where the
resource is used with the video, but that's not possible when the resource
is used by itself. It needs something like a <link> to a CSS resource inside
the WebSRT file.

* there is no magic identifier for a WebSRT resource, i.e. what the <wmml>
element is for WMML. This makes it almost impossible to create a program to
tell what file type this is, in particular since we have made the line
numbers optional. We could use "-->" as an indicator, but it's not a good
signature.

* there is no means to identify which parser is required in the cues (is it
"plain text", "minimal markup", or "anything"?) and therefore it is not
possible for an application to know how it should parse the cues.

* there is no version number on the format, thus it will be difficult to
introduce future changes.
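
To illustrate the point about the missing magic identifier, here is a small
Python sketch of a file-type sniffer. The function name, the 512-byte window
and the returned labels are hypothetical; it only shows why a root element is
a firm signature while "-->" is a weak one.

```python
def sniff(data: bytes) -> str:
    """Guess the type of a timed-text resource from its first bytes."""
    head = data[:512]
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]  # skip a UTF-8 byte order mark
    text = head.decode("utf-8", errors="replace").lstrip()
    if text.startswith("<wmml"):
        return "wmml"  # the root element acts as a signature
    if "-->" in text:
        return "maybe-websrt"  # weak hint: the arrow occurs in other files too
    return "unknown"


print(sniff(b'<wmml lang="en" kind="caption">'))
print(sniff(b"00:00:01.000 --> 00:00:02.000\nHi\n"))
print(sniff(b"<!-- just an HTML comment -->"))
```

The last call is the interesting one: an HTML comment also ends in "-->", so
the heuristic reports a possible WebSRT file where there is none.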

I believe the fundamental issue with the lack of such markup on the
header-level is that we are trying hard to stay backwards compatible with
SRT. If we were to just break away from that, it would give us the opportunity
to create solutions for all these issues. But let me address this issue
properly.


2. Break the SRT link.

I can understand that the definition of WebSRT took inspiration from SRT for
creating a simple format. But realistically most SRT files will not be
conformant WebSRT files because they are not written in UTF-8. Further,
realistically, all WebSRT files that use more than just the plain text
markup are not conformant SRT files. So, let's stop pretending there is
compatibility and just call WebSRT a new format. In fact, the subtitling
community itself has already expressed their objections to building an
extension of SRT, see http://forum.doom9.org/showthread.php?p=1396576 , so
we shouldn't try to enforce something that those for whom it was done don't
want. A clean slate will be better for all.

* the MIME type of WebSRT resources should differ from that of SRT files,
since they are so fundamentally different; e.g. text/websrt

* the file extension of WebSRT resources should be different from SRT files,
e.g. wsrt


3. Introduce an innerHTML type for cues

Right now, there is "plain text", "minimal markup" and "anything" allowed in
the cues. Seeing as WebSRT is built with the particular purpose of bringing
time-synchronized text for HTML5 media elements, it makes no sense to
exclude all the capabilities of HTML. Also, with all the typical parsers and
renderers available in UAs, support of innerHTML in cues should be simple to
implement. The argument that offline applications don't support it is not
relevant since we have no influence on whether standalone media applications
will actually follow the HTML5 format choice. That WebSRT with "plain text"
and "minimal markup" can be supported easily in standalone media
applications is a positive side effect, but not an aim in itself for HTML5
and it should have no influence on our choices.


4. Make full use of CSS

In the current form, WebSRT only makes limited use of existing CSS. I see
particularly the following limitations:

* no use of the positioning functionality is made and instead a new means of
positioning is introduced; it would be nicer to just have this reuse CSS
functionality. It would also avoid having to repeat the positioning
information on every single cue.
* little use of formatting functionality is made by restricting it to only
use 'color', 'text-shadow', 'text-outline', 'background', 'outline' and
'font'
* cue-related metadata ("voice") could be made more generic; why not reuse
"class"?
* there is no definition of the "canvas" dimensions that the cues are
prepared for (width/height) and expected to work with other than saying it
is the video dimensions - but these can change and the proportions should be
changed with that
* it is not possible to associate CSS styles with segments of text, but only
with a whole cue using ::cue-part; it's thus not possible to just highlight
a single word in a cue
* when HTML markup is used in cues, as the specification stands, that markup
is not parsed and therefore cannot be associated with CSS; again, this can
be fixed by making innerHTML in cues valid


5. Other issues

* I noticed that it is not possible to make a language association with
segments of text and thus it is not possible to have text with mixed
languages.
* Is it possible to reuse the HTML font systems?


IN SUMMARY

Having proposed an XML-based format, it would be good to understand the
reasons why it is not a good idea and why a plain text format that has no
structure other than that provided through newlines and start/end time
should be better and more extensible.

Also, if we really are to go with WebSRT, I am looking for a discussion on
those suggested improvements.


Cheers,
Silvia.
