Hi Yash, in my experience you are going to see a lot of these errors on look-alike characters.
With only the pre-processed text to go on, I might make the same mistake myself. What I do is fix these characters according to a pattern, in this case WDDDDDDD (one letter followed by seven digits), and I replace:

  S <-> 8
  O <-> 0
  I <-> 1
  i <-> 1
  l <-> 1
  z <-> 2
  Z <-> 2
  etc.

Another option, though I'm not 100% sure it is possible with the latest version, is to ask tesseract for the whole list of predictions for each token, with confidences. For the first token you'd get something like:

  S: 0.6839
  8: 0.2123
  B: 0.1445
  ...

and, again according to a pattern, you select the best-matching one (you need to use the ChoiceIterator on the result object, iterating at the SYMBOL level). This second approach is more elegant, but I do not think it gives you much over the simpler approach.

Of course a little model fine-tuning helps, but it will not fix these problems 100% and it takes a lot of time to do properly.

I recommend using tesserocr, which is a real API/library wrapper (not a command-line wrapper...): it gives you access to the whole API and, if used properly, it is a lot faster.

Bye

Lorenzo

On Tue, 7 Jun 2022 at 09:50, 'Yash Mistry' via tesseract-ocr <[email protected]> wrote:

> I am facing a challenge extracting the correct letter from a word when
> characters look alike, e.g. 5 & S, I & 1, 8 & S.
>
> I applied image pre-processing techniques like blurring, erosion, dilation,
> noise normalisation, removal of unnecessary components, and text detection on
> the input image, but even after this much pre-processing tesseract OCR isn't
> giving the correct result.
>
> Please check the attached images.
>
> *Original Image*
>
> *[image: image.png]*
>
> *Pre-processed Image*
>
> [image: image (1).png]
>
> *Detected Text*
>
> *[image: image (2).png]*
>
> *[image: image (3).png]*
>
> *Tesseract Configuration*
>
> -l eng --oem 1 --psm 7 -c
> tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n"
> load_system_dawg=false load_freq_dawg=false
>
> *Result of OCR*: TITLENUMBER 81003716
>
> As we can see, OCR extracts S as 8 even after pre-processing and text
> detection.
>
> Is there any way we can overcome this problem?
>
> *Tesseract Version*: tesseract 5.1.0-32-gf36c0
>
> Note: I asked the same question in the pytesseract GitHub repo and was
> advised to post it here.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/72dac625-d07f-4240-9032-3fa856868b8dn%40googlegroups.com
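
For reference, here is a minimal Python sketch of the two pattern-based ideas from the reply. It is only an illustration under a few assumptions: the pattern alphabet (W = letter, D = digit) is my reading of WDDDDDDD, the function names fix_by_pattern and pick_by_pattern are made up for this example, and the per-symbol candidate lists are assumed to have already been collected (for instance with the ChoiceIterator at SYMBOL level) as non-empty lists of (character, confidence) pairs. No tesseract call is made here.

```python
# Look-alike substitutions, taken from the list in the reply.
TO_DIGIT = {"S": "8", "O": "0", "I": "1", "i": "1", "l": "1", "z": "2", "Z": "2"}
TO_LETTER = {"8": "S", "0": "O", "1": "I", "2": "Z"}


def matches(ch, slot):
    """Return True if character ch fits pattern slot 'W' (letter) or 'D' (digit)."""
    if slot == "W":
        return ch.isalpha()
    if slot == "D":
        return ch.isdigit()
    raise ValueError(f"unknown pattern slot: {slot!r}")


def fix_by_pattern(token, pattern):
    """Simple approach: swap look-alikes wherever the token violates the pattern."""
    if len(token) != len(pattern):
        return token  # the pattern does not apply; leave the token untouched
    fixed = []
    for ch, slot in zip(token, pattern):
        if not matches(ch, slot):
            table = TO_LETTER if slot == "W" else TO_DIGIT
            ch = table.get(ch, ch)  # keep the character if no look-alike is known
        fixed.append(ch)
    return "".join(fixed)


def pick_by_pattern(candidates, pattern):
    """Second approach: per symbol, keep the most confident candidate that fits.

    candidates holds one non-empty list of (character, confidence) pairs per
    symbol. If no candidate fits the slot, fall back to the most confident one.
    """
    chosen = []
    for choices, slot in zip(candidates, pattern):
        fitting = [c for c in choices if matches(c[0], slot)]
        pool = fitting or choices
        chosen.append(max(pool, key=lambda c: c[1])[0])
    return "".join(chosen)
```

With the OCR result from the question, fix_by_pattern("81003716", "WDDDDDDD") returns "S1003716". The second function does the same job from the candidate lists instead of a lookup table, so it can recover characters that are not in the table, at the cost of having to collect the alternatives first.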

