On Wed, 05 Oct 2011 23:07:17 +0200, Silvia Pfeiffer <[email protected]> wrote:

On Thu, Oct 6, 2011 at 4:22 AM, Simon Pieters <[email protected]> wrote:
I did some research on authoring errors in SRT timestamps to inform whether
WebVTT parsing of timestamps should be changed.

Our starting point was 70,000 files provided to Opera (for research
purposes) by opensubtitles.org (thanks!) supposedly being SRT files. We are
not allowed to share the files.

Filtering out files that don't contain "-->" left 65,000 files.

Grepping for lines that contain "-->" resulted in 52,000,000 lines (which
should represent roughly the total number of cues). Of those, there were
31,900 lines that were invalid, i.e. didn't match the Python regexp
'\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'.

Forgot to mention here that this regexp used re.match rather than re.search, which basically means that a leading '^' is implied.
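
For illustration, here is a minimal sketch of that validity check. The pattern is copied from above; the function name is_valid_timing_line and the sample lines are only assumptions for the example, not the script that was actually used.

```python
import re

# Validity pattern as quoted above, copied unchanged.
VALID_TIMING = re.compile(
    r'\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'
)


def is_valid_timing_line(line):
    """Return True if the line matches the pattern from its first character.

    re.match only tries the start of the string, so a leading '^' is implied.
    (Hypothetical helper name, for illustration only.)
    """
    return VALID_TIMING.match(line) is not None


if __name__ == "__main__":
    for sample in (
        "00:00:01,000 --> 00:00:03,000",
        "00:00:01.000 --> 00:00:03.000",
        "x 00:00:01,000 --> 00:00:03,000",
    ):
        print(repr(sample), is_valid_timing_line(sample))
```

With re.search instead of re.match, the third sample would be accepted, because the pattern could then start matching after the leading "x".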

Those are categorized as follows. Note that a line can belong to several
categories (except for "none of the above"):


hours too few '(^|\s|>)\d[:\.,]\d+[:\.,]\d+'
57
hours too many '(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+'
834

IIUC this means there are more than 2 characters used for the hours. I
think that's a bug in your regex then. More than 99 hours was always going
to be possible, and WebVTT timestamps are no different:
http://www.whatwg.org/specs/web-apps/current-work/webvtt.html#webvtt-timestamp
It says "two or more characters...".

Right. However, since movies are seldom longer than 99 hours, I figured that it was worth inspecting to see what kinds of mistakes were hidden there.


minutes too few '(^|\s|>)\d+[:\.,]\d[:\.,]\d+'
16
minutes too many '(^|\s|>)\d+[:\.,]\d{3,}[:\.,]\d+'
11
seconds too few '(^|\s|>)\d+[:\.,]\d+[:\.,]\d([:.,-]|\s|$)'
889
seconds too many '(^|\s|>)\d+[:\.,]\d+[:\.,]\d{3,}'
154
decimals too few '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{1,2}(\s|$|-)'
2085
decimals too many '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{4,}'
62
decimals missing '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+(\s|$|-)'
132
minutes gt 59 '(^|\s|>)\d+[:\.,]0{0,}[6-9]\d+[:\.,]\d+'
6

That's small.

seconds gt 59 '(^|\s|>)\d+[:\.,]\d+[:\.,]0{0,}[6-9]\d+'
184

That's fairly small, in particular considering that spaces in
timestamps or an elongated arrow create a lot more problems.

What problems?


leading garbage '^[^\s\d]+\d+[:\.,]\d+[:\.,]\d+'
599
trailing garbage '-->\s*(\d+[:\.,]){2,3}\d+(\s+[^\s]|[^\s\d:\.,])'
532
colon instead of comma '\d+[:\.,]\d+[:\.,]\d+[:\.,]\d+:\d+'
26
dot instead of comma '\d+[:\.,]\d+[:\.,]\d+\.\d+'
25372
comma instead of colon '\d+,\d+[:\.,]\d+'
82
dot instead of colon '\d+\.\d+[:\.,]\d+'
41
id before timestamp '^\s*\d+\s+\d+[:\.,]\d+'
115
spaces in timestamp '(\d[\d\s]*[:\.,]\s*){2,3}\d[\d\s]*' and not
'(\d+[:\.,]){2,3}\d+'
922
too long arrow '\d\s*-{3,}>\s*\d'
326
none of the above
969
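
A rough sketch of how such a categorization could be run over the invalid lines, using a handful of the patterns listed above copied verbatim. It assumes the category patterns are applied with re.search (the thread only states that the validity check used re.match); the names CATEGORIES and categorize are made up for the example.

```python
import re

# A subset of the category patterns listed above, copied verbatim.
# Assumption: they are applied with re.search, i.e. anywhere in the line.
CATEGORIES = {
    "hours too many": r'(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+',
    "decimals too few": r'(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{1,2}(\s|$|-)',
    "trailing garbage": r'-->\s*(\d+[:\.,]){2,3}\d+(\s+[^\s]|[^\s\d:\.,])',
    "dot instead of comma": r'\d+[:\.,]\d+[:\.,]\d+\.\d+',
    "too long arrow": r'\d\s*-{3,}>\s*\d',
}

COMPILED = {name: re.compile(pattern) for name, pattern in CATEGORIES.items()}


def categorize(line):
    """Return every category whose pattern is found in the line.

    A line can belong to several categories; an empty list corresponds to
    "none of the above" for a line that failed the validity check.
    """
    return [name for name, regex in COMPILED.items() if regex.search(line)]


if __name__ == "__main__":
    for sample in (
        "34500:24:01,000 --> 00:24:03,000",
        "00:00:01,000 --> 00:00:03,000Hello.",
        "00:00:01.000 --> 00:00:03.000",
    ):
        print(repr(sample), categorize(sample))
```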


The most common error is to use a dot instead of a comma.

They're WebVTT files already. ;-)

Unlikely. :-)


Some appear to be a different format, and some appear to be just garbage.

Too few or too many hours might not technically be an error. However, it
appeared that some of the "hours too many" lines were cases where the line
break between the id and the timestamp was missing (with no whitespace in
between), e.g.:

34500:24:01,000 --> 00:24:03,000

The trailing garbage is mostly cases where the line break between the
timestamp and the cue text is missing, e.g.:

00:00:01,000 --> 00:00:03,000Hello.

So we have a lot more errors coming from missing line breaks than from
mis-authoring the hours, minutes or seconds? That's encouraging.
The only common number mistake seems to be making the decimals
shorter than 3 digits. Maybe we can resolve this by just having a
rule for how that should be interpreted?

That's still very rare in this sample: 2,085/52,000,000 ≈ 0.004% of all cues.
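
To put the counts above in proportion, a small sketch that turns them into percentages of the roughly 52,000,000 timing lines. The counts are the ones reported in this thread; the helper name rate_percent is only illustrative.

```python
TOTAL_TIMING_LINES = 52_000_000  # approximate number of "-->" lines reported above

# Counts taken from the figures in this thread.
COUNTS = {
    "all invalid lines": 31_900,
    "dot instead of comma": 25_372,
    "decimals too few": 2_085,
}


def rate_percent(count, total=TOTAL_TIMING_LINES):
    """Return count as a percentage of total."""
    return 100.0 * count / total


if __name__ == "__main__":
    for name, count in COUNTS.items():
        print(f"{name}: {rate_percent(count):.3f}%")
```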

--
Simon Pieters
Opera Software
