Re: [tesseract-ocr] Tesseract confused between a character and a digit which look-alike

Lorenzo Bolzani Tue, 07 Jun 2022 01:15:47 -0700

Hi Yash,
in my experience you are going to see a lot of these errors on
similar-looking characters.



Given only the pre-processed image, I might make the same mistake myself.


What I do is fix these characters according to a pattern, in this case
WDDDDDDD (W for a letter, D for a digit),

and I replace:

S <-> 8
O <-> 0
I  <->  1
i  <->  1
l  <->  1
z  <->  2
Z  <->  2
etc.
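
A minimal sketch of this pattern-based fix, in Python. The function name
fix_by_pattern and the two lookup tables are my own illustration, not part of
tesseract; I am also assuming that "W" marks a letter position and "D" a digit
position, and that a digit in a letter position maps back to the uppercase
letter (your whitelist is uppercase only).

```python
# Look-alike replacements from the list above, letter -> digit.
TO_DIGIT = {"S": "8", "O": "0", "I": "1", "i": "1", "l": "1", "z": "2", "Z": "2"}

# Reverse direction, digit -> letter. Where several letters share a digit
# (I, i, l -> 1) the uppercase form is chosen; this is an assumption.
TO_LETTER = {"8": "S", "0": "O", "1": "I", "2": "Z"}


def fix_by_pattern(text, pattern):
    """Correct look-alike characters so that `text` matches `pattern`.

    `pattern` uses "W" for a letter position and "D" for a digit position.
    Text whose length differs from the pattern is returned unchanged.
    """
    if len(text) != len(pattern):
        return text
    fixed = []
    for ch, kind in zip(text, pattern):
        if kind == "D" and not ch.isdigit():
            ch = TO_DIGIT.get(ch, ch)   # letter found where a digit is expected
        elif kind == "W" and ch.isdigit():
            ch = TO_LETTER.get(ch, ch)  # digit found where a letter is expected
        fixed.append(ch)
    return "".join(fixed)


print(fix_by_pattern("81003716", "WDDDDDDD"))  # S1003716
```

Characters with no entry in the tables are left as they are, so anything the
pattern cannot explain still shows up in the output for manual review.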

Another option, though I'm not 100% sure it is possible with the latest
version, is to ask tesseract for the whole list of predictions for each
character, with confidences. For the first character you'd get something like:

S: 0.6839
8: 0.2123
B: 0.1445
...

and, again according to a pattern, you select the best matching one (you
need to use the choiceIterator on the result object, iterating at level
SYMBOL). This second approach is more elegant, but I do not think it gives
you much more than the simpler approach.
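
A rough sketch of this second approach, in Python with the tesserocr wrapper.
The helper names pick_by_pattern and symbol_choices are hypothetical. The
tesserocr calls follow the choice-iterator example in its README, but whether
the LSTM engine returns more than one alternative per symbol depends on the
tesseract version and settings, so treat that part as untested.

```python
def pick_by_pattern(choices, kind):
    """Pick the most confident choice that fits the expected kind.

    `choices` is a list of (symbol, confidence) pairs for one position;
    `kind` is "W" for a letter or "D" for a digit. If no choice fits,
    fall back to the most confident choice overall.
    """
    wanted = str.isalpha if kind == "W" else str.isdigit
    matching = [c for c in choices if c[0] and wanted(c[0])]
    pool = matching or choices
    return max(pool, key=lambda c: c[1])[0] if pool else ""


def symbol_choices(image_path):
    """Return, for each recognised symbol, its list of (symbol, confidence)."""
    # Imported here so the selection logic above works without tesserocr.
    from tesserocr import PyTessBaseAPI, PSM, RIL, iterate_level

    results = []
    with PyTessBaseAPI(lang="eng", psm=PSM.SINGLE_LINE) as api:
        api.SetVariable("save_blob_choices", "T")
        api.SetImageFile(image_path)
        api.Recognize()
        for symbol in iterate_level(api.GetIterator(), RIL.SYMBOL):
            results.append(
                [(c.GetUTF8Text(), c.Confidence()) for c in symbol.GetChoiceIterator()]
            )
    return results


# The illustrative confidences from above, for the first position:
first = [("S", 0.6839), ("8", 0.2123), ("B", 0.1445)]
print(pick_by_pattern(first, "W"))  # S
print(pick_by_pattern(first, "D"))  # 8
```

For a whole line you would zip the output of symbol_choices with the pattern
and call pick_by_pattern once per position.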

Of course a little bit of model fine-tuning helps, but it will not fix these
problems 100% and it takes a lot of time to do properly.


I recommend using tesserocr, which is a real API/library wrapper (not a
command line wrapper...). It gives you access to the whole API and, if used
properly, it is a lot faster.



Bye

Lorenzo

On Tue, 7 Jun 2022 at 09:50, 'Yash Mistry' via tesseract-ocr <
[email protected]> wrote:

> I am facing a challenge extracting the correct letter from a word when
> characters look alike, e.g. 5 & S, I & 1, 8 & S.
>
> I applied image pre-processing techniques such as blurring, erosion,
> dilation, noise normalisation, removal of unnecessary components and text
> detection on the input image, but even after this much pre-processing
> tesseract OCR isn't giving the correct result.
>
> Please check attached images,
>
> *Original Image*
>
>
> *[image: image.png]*
>
> *Pre-processed Image*
>
> [image: image (1).png]
>
> *Detected Text*
>
>
> *[image: image (2).png]*
>
>
> *[image: image (3).png]*
>
> *Tesseract Configuration*
>
> -l eng --oem 1 --psm 7 -c
> tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n"
> load_system_dawg=false load_freq_dawg=false
>
> *Result of OCR*: TITLENUMBER 81003716
>
> As we can see, OCR extracts S as 8 even after pre-processing and text
> detection.
>
> Is there any way we can overcome this problem?
>
> *Tesseract Version*: tesseract 5.1.0-32-gf36c0
>
> Note: I asked the same question in the pytesseract GitHub repo and was
> advised to post it here.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/72dac625-d07f-4240-9032-3fa856868b8dn%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/72dac625-d07f-4240-9032-3fa856868b8dn%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

