Approach for training Tesseract with a new language and/or font faces

MilanKnizek Mon, 20 Apr 2009 12:46:44 -0700

I came to Tesseract recently because OGMRip uses it for OCR of DVD
subtitles. The first run on subtitles in the Czech language produced
garbage, but after studying Tesseract and training it, the output was
pretty good and needed only a little editing.


I would like to use Tesseract for other OCR tasks, too (Linux, Ubuntu
9.04 RC). Even after studying the wiki and doing test runs, the
following points are still not fully clear to me.

The wiki page http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
says that "The training data currently needs to fit on a single
page." and, just a few sentences later, "Upto 32 training pages can be
used. It is best to create pages in a mix of fonts and styles,
including italic and bold.".

These two statements look contradictory. I believe the latter is true,
since
    tesseract fontfile.tif junk nobatch box.train
can be run on several pages to create *.tr files.
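
As a rough sketch of running that command over several pages, the
snippet below only builds and prints one command line per page; it does
not call Tesseract itself. The page file names and both helper functions
are hypothetical, and the assumption that each run writes a .tr file
named after its input image is flagged in the code.

```python
import shlex
from pathlib import Path


def training_commands(pages, output_base="junk"):
    """Build one 'tesseract <page> junk nobatch box.train' command per page.

    The argument order is copied from the command quoted above;
    'output_base' defaults to the same throwaway name, 'junk'.
    """
    return [
        ["tesseract", str(page), output_base, "nobatch", "box.train"]
        for page in pages
    ]


def expected_tr_files(pages):
    """Guess the .tr file for each page.

    Assumption: each training run writes a .tr file named after its
    input image (fontfile.tif -> fontfile.tr).
    """
    return [str(Path(page).with_suffix(".tr")) for page in pages]


if __name__ == "__main__":
    # Hypothetical page names; substitute the real training TIFFs.
    pages = ["cze_page1.tif", "cze_page2.tif", "cze_page3.tif"]
    for argv in training_commands(pages):
        # Print instead of executing, so the sketch is safe to run anywhere.
        print(" ".join(shlex.quote(arg) for arg in argv))
    print("expected:", ", ".join(expected_tr_files(pages)))
```

To actually run the commands, each argument list could be passed to
subprocess.run(argv, check=True) once Tesseract is installed.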

Question: for training quality, does it make any difference whether a
given amount of text sits on a single (possibly huge) page or is split
over several pages? (I created the TIFFs manually from ripped VobSub
images, so I am free to choose the layout.)

The other question concerns different font faces and their variants
(italic, bold, small caps). From the viewpoint of OCR quality, is it
better to create separate sets of data files (cze1.*, cze2.*) for
serif, sans-serif, etc. faces, or to put everything together into a
single cze.* set?

Thanks for any answers or links to further sources of information.

Milan
