Hi Nick,

Thanks for the prompt answer!

> > * Is there a way to ensure optimal quality of the TIFF for purposes of
> > OCR file via Ghostscript's command line options? I tried -r600, -r1000,
> > -r1200 just to see if there's any difference and while there were
> > improvements in recognition in 1000 vs 600 there were also regressions
> > in Tesseract's output.
>
> 600DPI is generally recommended. You could try higher, but if you 
> say there were some improvements and some regressions, I'd just stay 
> at 600DPI. 
>
 
Alright, although there seemed to be more improvements than regressions at 
1000dpi.
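
A minimal sketch of how such a resolution comparison could be scripted. It only builds and prints the Ghostscript and Tesseract command lines for each DPI value. The tiffg4 output device, the helper names and the file names are assumptions made for illustration; nothing in this thread says which device was actually used.

```python
import shlex


def render_command(pdf_path, tiff_path, dpi=600, device="tiffg4"):
    """Build a Ghostscript command that rasterizes a PDF to TIFF at `dpi`.

    The device is an assumption: tiffg4 gives 1-bit output, while
    tiffgray or tiff24nc would keep grey levels or colour.
    """
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        f"-sDEVICE={device}",
        f"-r{dpi}",
        f"-sOutputFile={tiff_path}",
        pdf_path,
    ]


def ocr_command(tiff_path, out_base, lang="ron"):
    """Build a Tesseract command; the text is written to out_base + '.txt'."""
    return ["tesseract", tiff_path, out_base, "-l", lang]


def commands_for_resolutions(pdf_path, resolutions=(600, 1000, 1200)):
    """Pair one render command and one OCR command per resolution."""
    plan = []
    for dpi in resolutions:
        tiff = f"page-{dpi}.tif"  # hypothetical output name
        plan.append((render_command(pdf_path, tiff, dpi),
                     ocr_command(tiff, f"page-{dpi}")))
    return plan


if __name__ == "__main__":
    # Print the commands instead of running them; pass each list to
    # subprocess.run(cmd, check=True) to execute for real.
    for render, ocr in commands_for_resolutions("005.pdf"):
        print(shlex.join(render))
        print(shlex.join(ocr))
```

Diffing the resulting page-600.txt, page-1000.txt and page-1200.txt files would then show where each resolution improves or regresses.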

>   
> > * The text is Romanian, so latin characters with a few twists but no
> > complex shapes. Is there any extra training to be done or should the
> > available language data be enough?
>
> Does "a few twists" mean any extra characters or diacritics? If not, 
>

By the available language data I meant the existing Romanian file,
/usr/share/tesseract-ocr/tessdata/ron.traineddata, that ships with the
Ubuntu/Debian packaging of Tesseract. There are diacritics in the text,
and I pass -l ron to tesseract.
I also tried the default English data, but it does not yield the correct
words for those with diacritics.
I was wondering whether the Romanian data needs further training; I am not
sure what "well-trained" means in this context.

> > * Is it a common practice that is outside the scope of Tesseract to do
> > post-processing/spelling correction if words are incorrectly recognized
> > or is that a sign of more training/tweaking needed?
>
> I don't know if it's common practise. Spell checking should really 
> be handled by training Tesseract appropriately. I have been working 
>
 
I only meant spelling correction as a post-processing step, since I see
quite a few non-words recognized in place of what the original document
has, usually one or two edits away from the correct word. Matching against
dictionary words could fix these, but then I wonder whether that would go
against the intention of the OCR process, which is to recognize what is in
the input, not what the correct spelling of the input would be. In my case
the originals are all correctly spelled, so I would need a post-processing
step anyway, but maybe it should not be a core part of Tesseract's pipeline.
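
To make the idea concrete, here is a minimal sketch of such a post-processing step, assuming a plain set of known-good words is available. The function names and the tiny word list are hypothetical, and this is not part of Tesseract. A word is replaced only if it is missing from the dictionary and some dictionary word lies within two edits of it; anything else is left exactly as recognized, which limits how far the step can drift from the input.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def correct_word(word, dictionary, max_distance=2):
    """Return the closest dictionary word within max_distance, else `word`."""
    if word in dictionary:
        return word
    best, best_distance = word, max_distance + 1
    # Sorting makes the result deterministic: on a tie, the
    # alphabetically first candidate is kept.
    for candidate in sorted(dictionary):
        distance = edit_distance(word, candidate)
        if distance < best_distance:
            best, best_distance = candidate, distance
    return best


if __name__ == "__main__":
    words = {"lege", "articol", "publicată"}  # hypothetical word list
    for token in ["articoI", "lcge", "publicată", "xyzxyz"]:
        print(token, "->", correct_word(token, words))
```

With a real word list the linear scan gets slow, and short words will have many neighbours within two edits, so a stricter threshold for short tokens may be needed.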

Here's an example file (text of a Romanian law publication from the early
nineties, when most were not prepared digitally but scanned to PDF later):

http://startx.ro/005.pdf

Jani
