Hello, I am using Ubuntu 12.04 with the stock Tesseract 3.02 packages.
I'd like to extract the text from multi-page black-and-white documents that were scanned to PDF, and I have a few questions after trying the most widely documented (and probably most basic) approach. The results so far are good, but I hope the output can be improved with more effort.

* Is any input format preferable to the others? I converted PDF to TIFF via Ghostscript, and I wonder whether PNG, JPEG, or other formats would have any advantage. If the original text is not in color, does the choice of TIFF device matter?
* Is there a way to ensure optimal TIFF quality for OCR via Ghostscript's command-line options? I tried -r600, -r1000, and -r1200 just to see if there was any difference. Recognition improved in some places at 1000 compared with 600, but Tesseract's output also regressed in others.
* The text is Romanian, so Latin characters with a few twists (diacritics) but no complex shapes. Is any extra training needed, or should the available language data be enough?
* Is it common practice to do post-processing/spelling correction outside of Tesseract when words are recognized incorrectly, or are such errors a sign that more training/tweaking is needed?

Thanks,
Jani
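
For reference, the baseline pipeline described above (PDF to TIFF via Ghostscript, then Tesseract on each page) can be sketched roughly as follows. This is only an illustration under assumptions: the `tiffg4` device, the extra Ghostscript switches, the `ron` language code, and all file names are placeholders chosen for the example, not necessarily the exact commands used. The helper names `ghostscript_args`, `tesseract_args`, and `run` are hypothetical.

```python
import subprocess


def ghostscript_args(pdf_path, tiff_pattern, dpi=600, device="tiffg4"):
    """Build a Ghostscript command that rasterizes a PDF to TIFF pages.

    Assumptions: "tiffg4" (bilevel, CCITT Group 4) is one possible device
    for black-and-white scans; a pattern such as "page-%03d.tif" makes
    Ghostscript write one file per page.
    """
    return [
        "gs",
        "-dNOPAUSE",                      # do not prompt between pages
        "-dBATCH",                        # exit when the input is processed
        "-sDEVICE=" + device,             # output device (assumed, see above)
        "-r" + str(dpi),                  # resolution, e.g. -r600
        "-sOutputFile=" + tiff_pattern,   # per-page output file pattern
        pdf_path,
    ]


def tesseract_args(tiff_path, output_base, lang="ron"):
    """Build a Tesseract command; writes the text to output_base + ".txt".

    Assumption: "ron" is the installed Romanian language data.
    """
    return ["tesseract", tiff_path, output_base, "-l", lang]


def run(args):
    """Run one command and raise if it exits with a non-zero status."""
    subprocess.check_call(args)
```

Calling `run(ghostscript_args("scan.pdf", "page-%03d.tif", dpi=600))` and then `run(tesseract_args("page-001.tif", "page-001"))` for each page would reproduce the basic approach. Keeping `dpi` and `device` as parameters makes it easy to compare resolutions and TIFF devices on the same input.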

