On Thu, 06 Oct 2011 01:45:13 +0200, Ralph Giles <[email protected]> wrote:

On 05/10/11 10:22 AM, Simon Pieters wrote:

I did some research on authoring errors in SRT timestamps to inform
whether WebVTT parsing of timestamps should be changed.

This is completely awesome, thanks for doing it.

hours too many '(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+'
834
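
For anyone who wants to reproduce this count, here is a minimal sketch of applying the pattern above one line at a time. Python and the name `has_long_hours` are assumptions for illustration; the tool actually used for the count is not stated in this thread.

```python
import re

# The pattern quoted above: three or more digits in the hours field,
# preceded by start of line, whitespace, or ">".
LONG_HOURS = re.compile(r'(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+')


def has_long_hours(line):
    """Return True if the line has a timestamp with 3+ digits in the hours field."""
    return LONG_HOURS.search(line) is not None
```

A line such as `10300:11:53,891 --> 00:11:56,155` is flagged, while `00:24:01,000 --> 00:24:03,000` is not.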

As Silvia mentioned, the WebVTT spec currently leaves the number of
digits in the hour field as implementation defined, so long as it's at
least two.

I asked previously[1] if we could agree on and specify a limit. Would
you mind checking what the histogram of digit numbers is in the hours
field? Especially if you can separate cases like

34500:24:01,000 --> 00:24:03,000

either because the index is missing, or because the interval is
negative (for which the WebVTT spec would reject the entire cue).

I don't know how many have a negative interval; I'd need to run a new script over the 52,000,000 lines to find out. (If you want me to check this, please contact me with details about what you want to count as a "negative interval".)
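
One possible definition, as a starting point for discussion: a cue has a negative interval if its end time is strictly earlier than its start time. The sketch below is only an assumption about what such a check could look like. The names `to_ms` and `is_negative_interval` are hypothetical, the fraction is read as whole milliseconds (i.e. three digits), and whether equal start and end times should also count is left open.

```python
import re

# Two timestamps of the form h:m:s,fff (or h:m:s.fff) around an arrow.
CUE_TIMING = re.compile(
    r'(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)'
)


def to_ms(hours, minutes, seconds, frac):
    """Convert timestamp fields to milliseconds (assumes frac is in ms)."""
    total_seconds = (int(hours) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(frac)


def is_negative_interval(line):
    """Return True if end < start, False otherwise, None if no timing is found."""
    match = CUE_TIMING.search(line)
    if match is None:
        return None
    fields = match.groups()
    start = to_ms(*fields[:4])
    end = to_ms(*fields[4:])
    return end < start
```

Under this definition, `34500:24:01,000 --> 00:24:03,000` counts as negative and `255:46:18,058 --> 255:46:25,191` does not.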

The cases where there were 3 or more digits in the hours field are distributed as follows:

leading id e.g.
10300:11:53,891 --> 00:11:56,155

33

hours set to 255 (these seem to all come from the same file and the minutes are evenly distributed between 0 and 46; maybe the hours were actually intended to be 00) e.g.
255:46:18,058 --> 255:46:25,191

671

hours in the first timestamp much greater than the second timestamp e.g.
244:00:13,320 --> 00:00:13,320

10

hours in the second timestamp much greater than the first timestamp e.g.
00:00:33,010 --> 415:54:55,400

3

leading zero (in first and/or second timestamp) e.g.
000:09:40,300 --> 00:09:45,519

150

other (garbage) e.g.
8247,711,7nsuacer :56:20,0071:15 -->ddar vid18

9
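
For the histogram of digit counts in the hours field, a minimal sketch of how it could be tallied follows. This is not the script used for the numbers above; the name `hour_digit_histogram` and the timestamp pattern are assumptions, and a fused leading id such as `10300:...` is counted as a five-digit hours field rather than being split off.

```python
import re
from collections import Counter

# Capture the hours field of each h:m:s,fff (or h:m:s.fff) timestamp.
TIMESTAMP = re.compile(r'(?<!\d)(\d+):\d+:\d+[,.]\d+')


def hour_digit_histogram(lines):
    """Map number of digits in the hours field to how often it occurs."""
    counts = Counter()
    for line in lines:
        for hours in TIMESTAMP.findall(line):
            counts[len(hours)] += 1
    return counts
```

Lines that contain no recognisable timestamp, like the garbage example above, contribute nothing to the tally.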

Cheers,
 -r

[1] http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2011-September/033271.html


--
Simon Pieters
Opera Software
