I came to Tesseract recently because OGMRIP uses it for OCR of DVD subtitles. The first run on Czech subtitles produced garbage, but after studying Tesseract and training it, the output was quite good and needed only a little editing.
I would like to use Tesseract for other OCR tasks, too (Linux, Ubuntu 9.04 RC). Even after studying the wiki and doing test runs, the following points are still not fully clear to me.

The wiki page http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract says "The training data currently needs to fit on a single page." and, just a few sentences later, "Upto 32 training pages can be used. It is best to create pages in a mix of fonts and styles, including italic and bold.". These two statements look contradictory. I believe the latter is true, since tesseract fontfile.tif junk nobatch box.train can be run on several pages, one at a time, to create the *.tr files.

Question 1: for the purposes of training quality, does it make any difference whether a given quantity of text sits on a single (possibly huge) page or is split over several pages? (I created the TIFFs manually from ripped vobsub images, so I am free to choose the layout.)

Question 2: this one relates to different font faces and their variants (italic, bold, small caps). From the viewpoint of OCR quality, is it better to create separate sets of data files (cze1.*, cze2.*) for serif, sans serif, etc. faces, or to put everything together into a single cze.* set?

Thanks for any answers or links to sources of information elsewhere.

Milan
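
P.S. To make the multi-page case concrete, here is a minimal Python sketch of how I understand the training step would be repeated once per page. The page file names (cze.page1.tif, cze.page2.tif) and the helper box_train_commands are made up for illustration; only the command line itself is the one quoted above. The sketch just builds and prints the commands and does not run Tesseract.

```python
from pathlib import Path


def box_train_commands(tiff_pages):
    """Return one 'tesseract ... box.train' command (as an argv list) per page.

    Assumption: each training page is a separate TIFF with its own box file,
    and the command quoted from the wiki is simply repeated for every page.
    """
    commands = []
    for page in tiff_pages:
        # 'junk' is the throwaway output base name used in the wiki command.
        commands.append(["tesseract", str(page), "junk", "nobatch", "box.train"])
    return commands


if __name__ == "__main__":
    # Hypothetical page names; substitute the real training TIFFs.
    pages = [Path("cze.page1.tif"), Path("cze.page2.tif")]
    for cmd in box_train_commands(pages):
        # Printing only; passing cmd to subprocess.run would execute it.
        print(" ".join(cmd))
```

If this matches how the tool is meant to be used, each command should leave one .tr file per page, which is what makes me think the "single page" sentence in the wiki is out of date.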

