On Mon, 24 Oct 2011 22:50:43 +0200, Silvia Pfeiffer
<[email protected]> wrote:
So, in your opinion, should there be a change to the WebVTT spec that
separates cues differently?
Is there a recommendation you have from your analysis?
My recommendation is http://www.w3.org/Bugs/Public/show_bug.cgi?id=14550
Cheers,
Silvia.
On Mon, Oct 24, 2011 at 6:26 PM, Simon Pieters <[email protected]> wrote:
I wanted to research how common it is to fail to separate cues in SRT,
and for what reason.
SRT parsers usually interpret a timings line as a new cue, while WebVTT
wants a blank line (two consecutive newlines) before a new cue.
I took the 65k SRT files we've got, replaced comma with dot and
prepended "WEBVTT\n\n", then ran them in Opera's <track> impl, looking
for '-->' in cue data.
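For illustration, the conversion and the check above could be sketched
roughly as follows. This is not the actual test harness and not Opera's
<track> parser: the function names are hypothetical, the block splitting
is a simplified stand-in for WebVTT parsing (only a truly empty line ends
a cue), and it assumes every comma in the file is replaced, including
commas in cue text.

```python
def srt_to_vtt(srt_text: str) -> str:
    """Naive SRT -> WebVTT conversion: comma to dot, prepend the header.

    Assumption: *every* comma is replaced, not only those in timings.
    """
    return "WEBVTT\n\n" + srt_text.replace(",", ".")


def split_blocks(vtt_text: str):
    """Yield blocks of lines separated by one or more truly empty lines.

    A line holding only spaces is NOT empty here, so it does not end a
    block. This is a simplification, not the full WebVTT algorithm.
    """
    lines = vtt_text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    block = []
    for line in lines:
        if line == "":
            if block:
                yield block
                block = []
        else:
            block.append(line)
    if block:
        yield block


def arrow_lines_in_cue_data(vtt_text: str):
    """Return cue-data lines that contain '-->'.

    Cue data is taken to be everything after the first line containing
    '-->' in a block (that first line is treated as the timing line).
    """
    hits = []
    for block in split_blocks(vtt_text):
        for i, line in enumerate(block):
            if "-->" in line:
                hits.extend(l for l in block[i + 1:] if "-->" in l)
                break
    return hits
```

A file is counted as affected if arrow_lines_in_cue_data() returns a
non-empty list for it.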
There were 840 files with --> in cue data. This is 1.3% of the files.
Looking at the cue data, there were 11,118 lines that contained -->.
There were 8830 lines of only whitespace.
In the cue data, if I look at valid-looking timing lines
(/^\d\d:\d\d:\d\d\.\d\d\d\s*-->\s*\d\d:\d\d:\d\d\.\d\d\d(\s|$)/) and
check the line before that, or the line before *that* if it looks like
an SRT id (/^\d+\s*$/), then I see 7030 lines of only whitespace and
3761 lines of something else.
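A rough sketch of that classification, using the two regular expressions
quoted above. The function name is hypothetical, and two details are
assumptions rather than something stated in this mail: re.ASCII is used
to keep \d and \s close to their JavaScript meaning, and a timing line
with nothing before it in the cue data is simply skipped.

```python
import re

# The two patterns from the message, compiled with re.ASCII (assumption).
TIMING = re.compile(
    r"^\d\d:\d\d:\d\d\.\d\d\d\s*-->\s*\d\d:\d\d:\d\d\.\d\d\d(\s|$)",
    re.ASCII,
)
SRT_ID = re.compile(r"^\d+\s*$", re.ASCII)


def classify_preceding_lines(cue_lines):
    """Count what precedes each valid-looking timing line in cue data.

    Returns (whitespace_only, something_else). If the line directly
    before the timing line looks like an SRT id, the line before the id
    is inspected instead.
    """
    whitespace_only = 0
    something_else = 0
    for i, line in enumerate(cue_lines):
        if not TIMING.match(line):
            continue
        j = i - 1
        if j >= 0 and SRT_ID.match(cue_lines[j]):
            j -= 1  # step over the SRT id
        if j < 0:
            continue  # nothing to inspect (assumption: skip these)
        if cue_lines[j].strip() == "":
            whitespace_only += 1
        else:
            something_else += 1
    return whitespace_only, something_else
```

Summing the two counts over all affected cues would give the totals
reported above.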
Failing to separate cues results in an unpleasant experience for the
user, since basically the screen is filled with several "cues"
including their IDs and timing lines.
Some files had most or all of their cues parsed as a single cue with the
WebVTT parser, e.g. because all lines ended with one or more spaces.
Looking at such a file in a text editor, it's not immediately obvious
that there's an error, because the spaces are not visible. Moreover, the
file is not non-conforming, so a validator wouldn't help either.
So what about the cases that aren't whitespace? It seems to be mostly
just missing the newline completely. Some omitted the ID also. One file
had a "|" between all cues.
--
Simon Pieters
Opera Software