Re: How to improve recognition on TIFF black-and white Romanian text?

Nick White Wed, 22 Aug 2012 02:49:32 -0700

Hi Jani,

Good questions. I'll answer them as best I can below:


> * Is any of the input formats preferable over others? I used PDF to TIFF via
> Ghostscript and I wonder if png/jpeg or other formats could have any 
> advantage.
> If the original text is not color, does the TIFF device chosen matter?

TIFF should be fine. PNG can be easier to work with, because TIFF
has so many variants that it sometimes causes unexpected problems.
But if Tesseract is opening and processing your files without
trouble, you can stick with TIFF.

> * Is there a way to ensure optimal quality of the TIFF for purposes of OCR 
> file
> via Ghostscript's command line options? I tried -r600, -r1000, -r1200 just to
> see if there's any difference and  while there were improvements in 
> recognition
> in 1000 vs 600 there were also regressions in Tesseract's output. 

600 DPI is generally recommended. You could try higher, but since
you saw both improvements and regressions at higher resolutions,
I'd just stay at 600 DPI.
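
A minimal sketch of the 600 DPI conversion, written as a small Python
helper that only builds the Ghostscript command line. The file names
and the choice of the 'tiffg4' device (1-bit black and white) are
assumptions for illustration, not something tested on your files;
'tiffgray' would be the device to try if 1-bit output loses detail.

```python
def gs_tiff_command(pdf_path, tiff_path, dpi=600, device="tiffg4"):
    """Build a Ghostscript command that renders a PDF to TIFF.

    pdf_path, tiff_path and the default device are illustrative
    assumptions; adjust them to match your own files.
    """
    return [
        "gs",
        "-dNOPAUSE",                   # do not pause between pages
        "-dBATCH",                     # exit when the input is done
        "-sDEVICE=" + device,          # output device, e.g. tiffg4
        "-r" + str(dpi),               # rendering resolution in DPI
        "-sOutputFile=" + tiff_path,   # %03d gives one file per page
        pdf_path,
    ]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(gs_tiff_command("input.pdf", "page-%03d.tif")))
```

The returned list can be passed to subprocess.run(cmd, check=True)
once Ghostscript is installed and the paths are real.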
 
> * The text is Romanian, so latin characters with a few twists but no complex
> shapes. Is there any extra training to be done or should the available 
> language
> data be enough?

Does "a few twists" mean any extra characters or diacritics? If not,
then just keep the training the same. However, you will get
improvements by creating a 'dictionary' file for Romanian and
telling Tesseract to use it. To do that, get a text file with one
word per line and run 'wordlist2dawg' on it. Then use
'combine_tessdata' to unpack the current eng.traineddata, add your
dictionary 'dawg' file, and repack it with the 'combine_tessdata'
command again (there may be an easier way to do this, but I'm not
sure; Zdenko, correct me if there is).
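
The dictionary steps above can be sketched roughly as follows. This
is an untested outline, assuming the Tesseract 3.x training tools:
the 'word-dawg' component name, the 'wordlist.txt' file name and the
helper functions are assumptions for illustration. The sketch unpacks
first, because 'wordlist2dawg' needs the unpacked unicharset file.

```python
import re


def build_wordlist(text):
    """Return the sorted unique words found in text.

    The pattern matches runs of Unicode letters, so Romanian
    diacritics stay inside the words they belong to.
    """
    words = re.findall(r"[^\W\d_]+", text)
    return sorted(set(words))


def dictionary_commands(lang="eng", wordlist="wordlist.txt"):
    """Build the three tool invocations, in order (not executed)."""
    prefix = lang + "."
    return [
        # 1. Unpack the traineddata into eng.unicharset, eng.word-dawg, ...
        ["combine_tessdata", "-u", lang + ".traineddata", prefix],
        # 2. Compile the word list into a dawg using the unicharset.
        ["wordlist2dawg", wordlist, prefix + "word-dawg",
         prefix + "unicharset"],
        # 3. Repack every eng.* component into eng.traineddata.
        ["combine_tessdata", prefix],
    ]


if __name__ == "__main__":
    # Write one word per line, as wordlist2dawg expects.
    sample = "casa mare, casa mica"
    print("\n".join(build_wordlist(sample)))
    for cmd in dictionary_commands():
        print(" ".join(cmd))
```

Check each tool's usage message for your Tesseract version before
running the commands, since the arguments have changed over time.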

> * Is it a common practice that is outside the scope of Tesseract to do
> post-processing/spelling correction if words are incorrectly recognized or is
> that a sign of more training/tweaking needed?

I don't know if it's common practice. Spell checking should really
be handled by training Tesseract appropriately. I have been working
on creating a training file for a while, and have been keen to try
to keep everything possible in the training rather than rely on
post-processing. Is there anything particular you had in mind for
post-processing? We could let you know if it would be possible or
sensible to try fiddling with the training instead.

Nick
