Here's the script used: http://pastebin.com/KhdsydzJ

Input was counted as valid UTF-8 if text.decode('utf-8') didn't raise an exception, and as ASCII if text.decode('ascii') didn't. I haven't tried to analyze which other encodings were used.
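For anyone who wants to reproduce the test without opening the pastebin, here is a minimal sketch of the idea. It is not the script linked above: it assumes Python 3, where the raw file contents are a bytes object, and the name classify_encoding is made up for illustration.

```python
def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'other' for raw file contents.

    Hypothetical helper, not taken from the original script.
    ASCII is tried first because every ASCII file is also valid UTF-8.
    """
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)
        except UnicodeDecodeError:
            # Not decodable with this codec, try the next one.
            continue
        return name
    # Neither codec accepted the bytes; the real encoding is unknown.
    return "other"
```

Note that a decode test like this only says whether the bytes are well-formed. A Windows-1252 accent and a stray 0x80 byte both end up in "other", so it cannot tell those two cases apart.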

Philip

On Tue, 24 Aug 2010 21:47:14 +0200, Kevin Marks <kevinma...@gmail.com> wrote:

When you say 'invalid UTF-8', what were you seeing? Windows-1252 encoding of
accents, or illegal Unicode characters like 0x80?

On Tue, Aug 24, 2010 at 4:20 AM, Philip Jägenstedt <phil...@opera.com> wrote:

As mentioned deep in another thread, I've gotten hold of a big batch of SRT
files and have collected some statistics, which may help inform decisions on
the WebSRT format. Many thanks to OpenSubtitles for providing the data.

http://blog.foolip.org/2010/08/20/srt-research/

--
Philip Jägenstedt
