I searched the archives and did not find this precise issue.

I have a vob file extracted from a DVD.  Call it 0055743.vob if you like.

vlc plays this vob fine and displays the subtitles as they should be.


I use this transcode-based command to extract the subtitle substream:

tccat -i 0055743.vob  | tcextract -x ps1 -t vob -a 0x20 > 0055743.en_0.subtrack;

and then subtitle2pgm to break it out into images for OCR:

subtitle2pgm -o 0055743.en_0 -c 255,0,0,255 < 0055743.en_0.subtrack

Then I use various OCR engines etc. to get an srt file.
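
Since I run this over many files, the two extraction steps can be wrapped
in a small shell function. This is only a sketch: the name extract_subs is
made up for illustration, and the 0x20 track id, the colour values and the
".en_0" suffix are simply copied from my commands above, so they may need
changing for other discs.

```shell
# Sketch: run the two extraction steps on every vob given as an argument.
# extract_subs is a hypothetical helper name; the tool options are copied
# unchanged from the commands above.
extract_subs() {
    for vob in "$@"; do
        # 0055743.vob -> 0055743.en_0
        base="${vob%.vob}.en_0"
        # Pull the subtitle substream out of the vob; stop if tcextract fails.
        tccat -i "$vob" | tcextract -x ps1 -t vob -a 0x20 > "$base.subtrack" || return 1
        # Render the substream to pgm images for OCR; stop on failure.
        subtitle2pgm -o "$base" -c 255,0,0,255 < "$base.subtrack" || return 1
    done
}
```

It would be called as "extract_subs *.vob" in the directory holding the vobs.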


The problem is that when I follow this process, some of the timings and
subs come out wrong. Very often a sub is repeated where there should be two
different subs. This often happens where the end point of one is the start
of the next. Here is an example of this type from my process:


11
00:01:24,180 --> 00:01:26,819
30 barrels of rice for land taxes.

12
00:01:26,819 --> 00:01:29,510
30 barrels of rice for land taxes.


When it should give this:

11
00:01:24,180 --> 00:01:26,819
Yoza, it seems you have collected

12
00:01:26,819 --> 00:01:29,510
30 barrels of rice for land taxes.
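
To find every place where this pattern occurs in a finished srt, something
like the following could be used. It is a rough sketch, not part of my
actual process: find_repeats is a name I made up, and it assumes a plain
srt with Unix line endings where each block is an index line, a timing
line, then the text.

```shell
# Sketch: print "A B" for each pair of consecutive srt entries A and B that
# carry identical text and where B starts exactly when A ends.
# find_repeats is a hypothetical helper name.
find_repeats() {
    awk 'BEGIN { RS = ""; FS = "\n" }   # one record per blank-line-separated block
    {
        # $1 = index, $2 = "start --> end", $3 onwards = subtitle text
        split($2, t, " --> ")
        text = $3 ""                    # append "" to force a string comparison
        for (i = 4; i <= NF; i++) text = text "\n" $i
        if (NR > 1 && text == prev_text && t[1] == prev_end)
            print prev_idx " " $1
        prev_idx = $1; prev_end = t[2]; prev_text = text
    }' "$@"
}
```

Run as "find_repeats 0055743.en_0.srt" (the srt name here is just an
example), it lists the suspect pairs by their index numbers. A line that is
genuinely spoken twice in a row would be reported too, so the output still
needs checking by eye.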


Obviously the pgms extracted by subtitle2pgm are wrong. Sometimes there
are larger errors consisting of a sequence of pgms all displaced by one.
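
One way to check this before any OCR is involved would be to compare
consecutive pgms byte for byte. The sketch below does that with cmp. The
name find_same_pgms is made up, and it assumes the pgm file names sort
into display order when expanded by the shell, which I have not verified
for every subtitle2pgm version.

```shell
# Sketch: print each pair of consecutive files whose contents are
# byte-identical. find_same_pgms is a hypothetical helper name; files are
# compared in the order they are given on the command line.
find_same_pgms() {
    prev=""
    for f in "$@"; do
        # cmp -s is silent and returns success only for identical files.
        if [ -n "$prev" ] && cmp -s "$prev" "$f"; then
            echo "$prev $f"
        fi
        prev="$f"
    done
}
```

For example "find_same_pgms 0055743.en_0*.pgm". If the repeated subs show
up here as identical images, the damage is done before the OCR step; if
not, the images may still differ only slightly, so an empty result does
not prove the pgms are right.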



My question: is this a problem with tcextract or with subtitle2pgm?
Where should I look first for a fix?

Has anybody else seen this or related problems? I can host the 4G vob for
anybody to download and test their setup on.


Also, what other simple ways are there to do this? I extract a lot of
subs, so it has to be command-line based and manageable.


        Thanks in advance,

                Simon.











