I am using Tesseract to extract text from the attached images. Even though 
the images are nearly identical, Tesseract makes a mistake on one of them: 
for 'bad.png' the output is ELHADIJ, whereas for 'good.png' it is ELHADJ.

Here is my setup and what I have tried:

   - tesseract version: 4.0.0-beta.1
   - leptonica version: 1.75.3
   - I use the English .traineddata file from here: 
   https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata
   - I tried page segmentation modes 3, 7, 8, and 13; the mistake is 
   always there.

The commands I ran were:

tesseract good.png output1 -l eng --psm 8
tesseract bad.png output2 -l eng --psm 8

and similarly for the other PSMs.
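
For anyone who wants to reproduce the full comparison, the runs over both 
images and all four PSMs can be scripted roughly as below. This is only a 
sketch: the helper name list_runs and the output base names (good_psm8, 
bad_psm8, and so on) are my own choices for illustration, not what I 
originally used, and the script assumes good.png and bad.png are in the 
current directory. If tesseract or the images are missing, it just prints 
the commands instead of running them.

```shell
#!/bin/sh
# Print one tesseract command per image/PSM pair, using the same flags
# as the two commands above. Output base names are an assumed naming scheme.
list_runs() {
    for psm in 3 7 8 13; do
        for name in good bad; do
            echo "tesseract ${name}.png ${name}_psm${psm} -l eng --psm ${psm}"
        done
    done
}

# Run the commands only when tesseract and both images are available;
# otherwise show what would be run.
if command -v tesseract >/dev/null 2>&1 && [ -f good.png ] && [ -f bad.png ]; then
    list_runs | sh
else
    list_runs
fi
```

Each run writes its result to a .txt file named after the output base, so 
good_psm8.txt and bad_psm8.txt can then be compared directly.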


My question is: how do I make tesseract more robust? Why does it make a 
mistake in one case but not in the other?
