Over the past week I've attended 3 video-related events in New York and have discussed <track> and WebSRT at all of them. Here's a lengthy report of feedback, mine and others'.

At the Open Subtitles Design Summit [1], there was some discussion about captioning for the HoH. I've already put this input into a related bug [2], but to summarize: The default rendering for the voices syntax should probably be to prefix the text cue with the name of the speaker, not to do anything funny with colors or positioning. What's less clear is if it's annoying to always prefix with the speaker, or if it should be done only to disambiguate.

For my Open Video Conference [3] presentation [4] I did a JavaScript implementation of the most interesting parts of <track> and WebSRT to be able to demo what the future might hold [5][6][7]. I have some issues with the parser that are at the end of this mail.

At FOMS [8] we had a session on WebSRT [9] which was extremely helpful. It turns out that SRT has more syntax variations than we had thought, as was kindly pointed out by VLC developer j-b. Even though there is no SRT spec, there is a test suite of sorts [10] that I had never seen before. I'll call SRT which follows the syntax implied by these tests ale5000-SRT. Apart from the HTML-like markup we knew about, ale5000-SRT also has various markup of the form {...} which was borrowed from SSA, as well as \h and \N for "hard space" and line break respectively. Also in the crazy department is that tags which aren't matched with an opening and closing tag should be rendered as plain text. Stray < should also just be displayed as text. VLC actually implements most of this, as does VSFilter, which we should have tested but didn't [11]. It would probably be possible to write a spec for ale5000-SRT, but extensibility would be limited to matched opening and closing tags, which doesn't work for the suggested voices syntax. With this mess, I'd rather not extend ale5000-SRT. I can only agree with Silvia that we should make WebSRT identifiable, so that different parsers can be used. So:

* Add magic bytes to identify WebSRT, maybe "WebSRT". (This will break some existing SRT parsers.)
* Make WebSRT always be UTF-8, since you can't reuse existing SRT files anyway.
* Note that certain ale5000-SRT syntax is not part of WebSRT, so that one doesn't have to debug the parsing algorithm to learn that.
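
To make the first two points concrete, here is a minimal sketch of how a player might choose a parser. Nothing here is specified anywhere: the signature string "WebSRT" is only the "maybe" from the list above, the function name and return values are made up for illustration, and tolerating a UTF-8 byte order mark is my own assumption.

```python
import codecs

MAGIC = "WebSRT"  # hypothetical signature; the mail only says 'maybe "WebSRT"'


def sniff_subtitle_format(data: bytes) -> str:
    """Return "websrt", "legacy-srt" or "unknown" for raw file contents."""
    # Tolerate an optional UTF-8 byte order mark (an assumption).
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    if data.startswith(MAGIC.encode("ascii")):
        try:
            data.decode("utf-8")  # strict decoding: WebSRT would always be UTF-8
        except UnicodeDecodeError:
            return "unknown"
        return "websrt"
    # No signature: hand the file to a legacy SRT parser, which still has
    # to guess the encoding on its own.
    return "legacy-srt"
```

With something like this, a file without the signature never reaches the WebSRT parser, so the WebSRT parser would not need to cope with ale5000-SRT syntax at all.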

Styling hooks were requested. If we only have the predefined tags (i, b, ...) and voices, these will most certainly be abused, e.g. resulting in <i> being used where italics isn't wanted or <v Foo> being used just for styling, breaking the accessibility value it has.

As an aside, the idea of using an HTML parser for the cue text wasn't very popular.

There was also some discussion about metadata. Language is sometimes necessary for the font engine to pick the right glyph. With legacy SRT the encoding could be used as a hint, but if we use UTF-8 that's not possible. License is also an often-requested piece of metadata. I have no strong opinion about how to solve this, but key-value pairs like HTTP headers come to mind.
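
As a rough illustration of the key-value idea, the sketch below reads "Key: value" lines up to the first blank line. The function name, the field names "Language" and "License", and the choice to lower-case keys (borrowed from HTTP's case-insensitive header names) are all assumptions for the sake of the example, not a proposed syntax.

```python
def parse_header(lines):
    """Parse 'Key: value' lines up to the first blank line.

    Returns (metadata, remaining_lines). Keys are lower-cased, which is
    an assumption modelled on HTTP header names.
    """
    metadata = {}
    for index, line in enumerate(lines):
        if not line.strip():
            # The blank line ends the header; everything after it is cues.
            return metadata, lines[index + 1:]
        key, sep, value = line.partition(":")
        if sep:  # lines without a colon are silently skipped
            metadata[key.strip().lower()] = value.strip()
    return metadata, []


# Hypothetical header followed by one cue.
meta, rest = parse_header([
    "Language: zh-Hant",
    "License: CC BY-SA 3.0",
    "",
    "00:00:00.000 --> 00:00:01.000",
    "Bla",
])
```

Here `meta` holds the two fields and `rest` holds the cue lines, so the cue parser itself would not need to know about metadata.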

Finally, some things I think are broken in the current WebSRT parser:

* Parsing of timestamps is more liberal than it needs to be. In particular, treating the part after the decimal separator as an integer and dividing by 1000 leads to 00:00:00.1 being interpreted as 0.001 seconds, which is weird. This is what e.g. VLC does, but if we need to add a header we could just as well change this to make it more sane. Alternatively, if we want to really align with C implementations using scanf, we should also handle negative numbers (00:01:-5,000 means 55 seconds), octal and hexadecimal.

* The current syntax looks like XML or HTML but has very different parsing. Voices like <narrator> don't create nodes at all, and for tags like <i> the parser has a whitelist and also special rules for inserting <rt>. Unless there are strong reasons for this, then for simplicity and forward compatibility, I'd much rather have the parser create an actual DOM (not a tree of "WebSRT Node Object") that reflects the input. If we also support attributes then people who actually want to use their (silly) <font color=red> tags can do so with CSS. This could also work as styling hooks. Obviously, a WebSRT parser should create elements in another namespace; we don't want e.g. <img> to work inside cues.

* The "bad cue" handling is stricter than it should be. After collecting an id, the next line must be a timestamp line. Otherwise, we skip everything until a blank line, so in the following the parser would jump to "bad cue" on line "2" and skip the whole cue.

1
2
00:00:00.000 --> 00:00:01.000
Bla

This doesn't match what most existing SRT parsers do, as they simply look for timing lines and ignore everything else. If we really need to collect the id instead of ignoring it like everyone else, this should be more robust, so that a valid timing line always begins a new cue. Personally, I'd prefer if it is simply ignored and that we use some form of in-cue markup for styling hooks.

* At the beginning of "cue text loop" (step 28) a newline should be collected.
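
The first and third points above (timestamps and "bad cue" handling) can be sketched together. This is not the spec's algorithm, just an illustration of the behavior I'm arguing for: the fraction is read as decimal digits, and any valid timing line begins a new cue while ids and other stray lines between cues are ignored. The regular expression is a deliberately strict assumption (two-digit minutes and seconds, one to three fraction digits, no negative numbers), and all names are made up.

```python
import re

# hh:mm:ss followed by "." or "," and one to three fraction digits.
TIMESTAMP = r"(\d+):(\d{2}):(\d{2})[.,](\d{1,3})"
TIMING_LINE = re.compile(r"^\s*" + TIMESTAMP + r"\s+-->\s+" + TIMESTAMP + r"\s*$")


def to_seconds(hours, minutes, seconds, fraction, legacy=False):
    """Convert captured timestamp fields (strings) to seconds.

    legacy=True treats the fraction as an integer number of milliseconds,
    so "1" means 0.001 s; the default treats it as decimal digits, so
    "1" means 0.1 s.
    """
    whole = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
    if legacy:
        return whole + int(fraction) / 1000
    return whole + int(fraction) / 10 ** len(fraction)


def split_cues(text):
    """Return a list of (start, end, cue_text) tuples."""
    cues = []
    current = None  # [start, end, text_lines] while inside a cue
    for line in text.splitlines():
        match = TIMING_LINE.match(line)
        if match:
            # A valid timing line always begins a new cue.
            if current is not None:
                cues.append((current[0], current[1], "\n".join(current[2])))
            fields = match.groups()
            current = [to_seconds(*fields[:4]), to_seconds(*fields[4:]), []]
        elif current is not None:
            if line.strip():
                current[2].append(line)
            else:
                # A blank line ends the cue text.
                cues.append((current[0], current[1], "\n".join(current[2])))
                current = None
        # Otherwise we are between cues: ids and stray lines are ignored.
    if current is not None:
        cues.append((current[0], current[1], "\n".join(current[2])))
    return cues
```

With this approach the example above, with "1" and "2" before the timing line, yields a single cue with the text "Bla" instead of being skipped. One consequence worth noting is that a timing line appearing inside cue text without a preceding blank line also starts a new cue.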

[1] http://universalsubtitles.org/opensubtitles2010
[2] http://www.w3.org/Bugs/Public/show_bug.cgi?id=10320
[3] http://www.openvideoconference.org/
[4] http://people.opera.com/philipj/2010/10/02/ovc/
[5] http://people.opera.com/philipj/2010/10/02/ovc/demos/captions.html
[6] http://people.opera.com/philipj/2010/10/02/ovc/demos/transcript.html
[7] http://people.opera.com/philipj/2010/10/02/ovc/demos/metadata.html
[8] http://www.foms-workshop.org/foms2010OVC/
[9] http://www.foms-workshop.org/foms2010OVC/pmwiki.php/Main/WebSRT
[10] http://ale5000.altervista.org/subtitles.htm
[11] http://wiki.whatwg.org/wiki/SRT_research

--
Philip Jägenstedt
Core Developer
Opera Software
