Re: OCR Dvd Subtitles

Nick White Mon, 09 Jul 2012 05:01:13 -0700

Hi,

I'm glad the standard trained data is working well, particularly
given that (presumably) the text is against a non-white background;
people have found that to be a big obstacle in the past.

I would advise against completely retraining it. As the training
source files are not available, it would take quite a bit of work
just to get a new training up to the quality of the built-in one.

The sort of errors you describe could reasonably be addressed by
editing the lang.unicharambigs file (where lang is your language
code, e.g. eng). Unpack the training you're using with:
  combine_tessdata -u /path/to/lang.traineddata lang.
add the ambigs rules you need to lang.unicharambigs, then recombine
it with:
  combine_tessdata lang.
and copy the resulting lang.traineddata to the tessdata directory.
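
To make the steps above concrete, here is a rough sketch rather than
a tested recipe. It assumes the tab-separated unicharambigs layout as
I remember it from the training guide (number of source characters,
the source characters, number of target characters, the target
characters, then a type flag where 1 means always substitute and 0
means only a hint), so do check the guide before relying on it. The
function names ambig_rule and add_ambig_rule are made up for
illustration.

```shell
#!/bin/sh
# Print one unicharambigs rule, fields separated by tabs (assumed layout):
#   <n source chars> <source chars> <n target chars> <target chars> <type>
# Characters within a field are separated by single spaces, e.g. "r n".
ambig_rule() {
    rule_src=$1
    rule_dst=$2
    rule_type=$3
    set -f                      # stop characters such as * from globbing
    set -- $rule_src; n_src=$#  # count the space-separated source chars
    set -- $rule_dst; n_dst=$#  # count the space-separated target chars
    set +f
    printf '%s\t%s\t%s\t%s\t%s\n' \
        "$n_src" "$rule_src" "$n_dst" "$rule_dst" "$rule_type"
}

# Hypothetical wrapper: unpack, append one rule, recombine.
# Leaves the new <lang>.traineddata in the current directory.
add_ambig_rule() {
    traineddata=$1
    lang=$2
    rule=$3
    combine_tessdata -u "$traineddata" "$lang." || return 1
    printf '%s\n' "$rule" >> "$lang.unicharambigs"
    combine_tessdata "$lang."
}
```

For the / vs l confusion you could then try something like
  add_ambig_rule /path/to/lang.traineddata lang "$(ambig_rule / l 0)"
An optional rule (type 0) is probably the safer choice here, since a
mandatory one would rewrite every genuine slash as well. If the
unpacked file starts with a version line, leave that line in place.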

More information on the unicharambigs file is given in the training
guide on the wiki. You could also consider looking at the
configuration variables to do things like give higher penalties for
unexpected punctuation (which may help with confusions like / vs l),
but I think that would take a while and not be as effective for you.
Grep for '_VAR' in the source tree if you want to try it anyway.
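
As a small sketch of that search (list_tess_vars is a made-up helper
name, and the source path is whatever your checkout happens to be):

```shell
#!/bin/sh
# List every line mentioning a *_VAR declaration under a source tree.
# $1 is the path to the source checkout; defaults to the current directory.
list_tess_vars() {
    grep -rn '_VAR' "${1:-.}"
}
```

Running list_tess_vars /path/to/source prints each match with
its file name and line number; the declarations usually carry a short
description string that says what the variable does.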

Best of luck, and let us know how you get on.

Nick 
