Hello,

I am using Ubuntu 12.04 with the stock Tesseract 3.02 packages that come
with it.

I'd like to extract the text from multi-page black-and-white documents
scanned into PDF, and I have a few questions after trying the most widely
documented and probably most basic approach. So far the results are good,
but I hope the output can be improved with more effort.

* Are any of the input formats preferable to others? I converted the PDF to
TIFF via Ghostscript and I wonder if PNG/JPEG or other formats could have
any advantage. If the original text is not in color, does the choice of
TIFF device matter?

* Is there a way to ensure optimal quality of the TIFF file for OCR
purposes via Ghostscript's command-line options? I tried -r600, -r1000 and
-r1200 just to see if there's any difference, and while recognition
improved in places at 1000 vs. 600, there were also regressions in
Tesseract's output.

* The text is Romanian, so Latin characters with a few diacritics but no
complex shapes. Is there any extra training to be done, or should the
available language data be enough?

* Is post-processing/spelling correction of incorrectly recognized words a
common practice that falls outside the scope of Tesseract, or are such
errors a sign that more training/tweaking is needed?
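
For reference, here is a minimal sketch in Python of the kind of pipeline
described above: Ghostscript renders each PDF page to a TIFF, then Tesseract
runs on each page. The tiffg4 device (bilevel, G4-compressed), the 600 dpi
default, the file names and the "ron" language code are assumptions chosen
for illustration, not necessarily the exact options used, and the helper
function names are hypothetical.

```python
import glob
import subprocess


def ghostscript_command(pdf_path, out_pattern, dpi=600, device="tiffg4"):
    """Build a Ghostscript command that renders each PDF page to a TIFF.

    out_pattern should contain a %d-style placeholder so that every page
    gets its own file, e.g. "page-%03d.tif".
    """
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dQUIET",  # non-interactive batch run
        "-sDEVICE=" + device,               # assumed device: bilevel G4 TIFF
        "-r%d" % dpi,                       # rendering resolution in dpi
        "-sOutputFile=" + out_pattern,
        pdf_path,
    ]


def tesseract_command(image_path, out_base, lang="ron"):
    """Build a Tesseract command; the text is written to out_base + '.txt'."""
    return ["tesseract", image_path, out_base, "-l", lang]


def run_pipeline(pdf_path, dpi=600):
    """Render the PDF with Ghostscript, then OCR every page in order."""
    subprocess.check_call(ghostscript_command(pdf_path, "page-%03d.tif", dpi))
    for image in sorted(glob.glob("page-*.tif")):
        out_base = image[:-len(".tif")]
        subprocess.check_call(tesseract_command(image, out_base))
```

Calling run_pipeline("scan.pdf", dpi=600) would leave page-001.txt,
page-002.txt and so on next to the TIFFs, assuming gs and tesseract are on
the PATH and the Romanian language data is installed. Swapping the dpi or
device argument makes it easy to compare settings on the same pages.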

Thanks,
Jani
