Hi Jani,

Good questions. I'll answer them as best I can below:

> * Is any of the input formats preferable over others? I used PDF to TIFF via
> Ghostscript and I wonder if png/jpeg or other formats could have any
> advantage. If the original text is not color, does the TIFF device chosen
> matter?

TIFF should be fine. PNG can be easier to work with, because TIFF has so many variants that it sometimes causes unexpected problems. But if Tesseract is opening and processing your files without trouble, you can stick with TIFF.

> * Is there a way to ensure optimal quality of the TIFF for purposes of OCR
> file via Ghostscript's command line options? I tried -r600, -r1000, -r1200
> just to see if there's any difference and while there were improvements in
> recognition in 1000 vs 600 there were also regressions in Tesseract's output.

600 DPI is generally recommended. You could try higher, but since you saw both improvements and regressions at 1000, I'd just stay at 600 DPI.

> * The text is Romanian, so latin characters with a few twists but no complex
> shapes. Is there any extra training to be done or should the available
> language data be enough?

Does "a few twists" mean any extra characters or diacritics? If not, then just keep the training the same. However, you will get improvements by creating a 'dictionary' file for Romanian and telling Tesseract to use it. You can do that by getting a text file with one word per line, running 'wordlist2dawg' on it, then using 'combine_tessdata' to unpack the current eng.traineddata (or whichever traineddata file you are using), adding your dictionary 'dawg' file, and recombining everything with 'combine_tessdata' again. (There may be an easier way to do this, but I'm not sure; Zdenko, correct me if there is.)

> * Is it a common practice that is outside the scope of Tesseract to do
> post-processing/spelling correction if words are incorrectly recognized or is
> that a sign of more training/tweaking needed?

I don't know if it's common practice. Spell checking should really be handled by training Tesseract appropriately.
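
Going back to the Ghostscript question for a moment: below is a minimal Python sketch that only builds the command line for a 600 DPI PDF-to-TIFF conversion, so it can be inspected before anything is run. The `tiffgray` device, the file names and the helper name `ghostscript_tiff_command` are assumptions made for illustration and are not settled anywhere in this thread; substitute whichever TIFF device you already use.

```python
import shlex


def ghostscript_tiff_command(pdf_path, output_pattern, dpi=600, device="tiffgray"):
    """Build (but do not run) a Ghostscript command that renders a PDF to TIFF.

    `device` defaults to tiffgray as an illustrative assumption; 600 DPI is
    the resolution suggested in this thread.
    """
    return [
        "gs",
        "-dNOPAUSE",                      # do not pause between pages
        "-dBATCH",                        # exit once the input is processed
        "-sDEVICE=" + device,             # TIFF output device
        "-r" + str(dpi),                  # rendering resolution in DPI
        "-sOutputFile=" + output_pattern,  # %03d expands to the page number
        pdf_path,
    ]


if __name__ == "__main__":
    cmd = ghostscript_tiff_command("input.pdf", "page-%03d.tif")
    # Print a shell-safe version of the command for copy and paste.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the resolution as a parameter makes it easy to render the same page at 600 and 1000 DPI and compare the Tesseract output side by side.
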
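
The dictionary steps described above can be sketched roughly as follows. This is a hedged outline, not a tested recipe: the helper names, the `unpacked` directory and the file names are hypothetical, and the `-u` (unpack) form of 'combine_tessdata' and the argument order of 'wordlist2dawg' are taken from the Tesseract 3.x man pages as best remembered, so please check them against your installed version. The script only builds the word list and the command lines; it does not run anything.

```python
import re


def build_wordlist(text):
    """Return the unique words found in `text`, sorted, one per list entry.

    Words are runs of Unicode letters, so precomposed Romanian characters
    such as ș and ț are kept. Decomposed (combining) forms are not handled.
    """
    words = re.findall(r"[^\W\d_]+", text)
    return sorted(set(words))


def dictionary_commands(wordlist_path, traineddata_path, lang="eng", workdir="unpacked"):
    """Build the three command lines for swapping in a new word dictionary.

    Assumed steps: unpack the traineddata, build the word DAWG from the
    word list, then recombine the unpacked components.
    """
    prefix = f"{workdir}/{lang}."
    return [
        # 1. Unpack every component of the traineddata to <prefix>*
        ["combine_tessdata", "-u", traineddata_path, prefix],
        # 2. Compile the word list into a DAWG using the unpacked unicharset
        ["wordlist2dawg", wordlist_path, prefix + "word-dawg", prefix + "unicharset"],
        # 3. Recombine the components into <prefix>traineddata
        ["combine_tessdata", prefix],
    ]


if __name__ == "__main__":
    for word in build_wordlist("mere și mere, 2 pere"):
        print(word)
    for cmd in dictionary_commands("ro_words.txt", "eng.traineddata"):
        print(" ".join(cmd))
```

The word list would be written to a file with one word per line before step 2, and the `workdir` directory has to exist before unpacking.
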
I have been working on creating a training file for a while, and have been keen to keep as much as possible in the training rather than rely on post-processing. Is there anything in particular you had in mind for post-processing? We could let you know whether it would be possible or sensible to try fiddling with the training instead.

Nick

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

