Re: [tesseract-ocr] Tesseract confused between a character and a digit which look-alike

Lorenzo Bolzani Fri, 24 Jun 2022 02:15:18 -0700

Hi Yash,
please see the example at the bottom of this page:

https://github.com/sirfz/tesserocr


and this issue about the versions (I think you need version 5.x):

https://github.com/sirfz/tesserocr/issues/166


If you have problems with tesserocr make sure it matches the tesseract
version it was compiled for:


https://github.com/sirfz/tesserocr/releases/tag/v2.5.2


The alternative choices should also be available in the XML output, if I
remember correctly.
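
For reference, here is a condensed sketch of what the README example linked above does: it iterates the result at SYMBOL level and reads every alternative from the choice iterator. The function name `symbol_choices` is made up for illustration, and whether the LSTM engine returns more than one alternative per symbol may depend on your tesseract/tesserocr versions (see the issue linked above), so treat this as a starting point rather than a verified recipe.

```python
def symbol_choices(image_path):
    """Return, for each recognized symbol, a list of (text, confidence) pairs.

    Hypothetical helper: the tesserocr calls follow the README example.
    """
    # Imported here so this file can be loaded without tesserocr installed.
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    results = []
    with PyTessBaseAPI() as api:
        api.SetImageFile(image_path)
        # The README example sets this variable before calling Recognize().
        api.SetVariable("save_blob_choices", "T")
        api.Recognize()
        level = RIL.SYMBOL
        for symbol in iterate_level(api.GetIterator(), level):
            # The choice iterator yields itself at each step, so read the
            # text and confidence immediately instead of storing the object.
            choices = [(c.GetUTF8Text(), c.Confidence())
                       for c in symbol.GetChoiceIterator()]
            results.append(choices)
    return results


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        for position, choices in enumerate(symbol_choices(sys.argv[1])):
            print(position, choices)
```

Run it with the path of your pre-processed image as the only argument; each output line lists the alternatives for one symbol.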


Your input image is very small (the text is 9 pixels tall) and there are a lot
of compression artifacts. If possible, acquire a higher-resolution image
with less compression.

Also try to MANUALLY clean the text more (with Gimp for example) to remove
the black fragments of the border or the dot on the left to see IF this
gives you better results. Also try to MANUALLY remove almost all of the
white borders.

IF any of these gives you better results you can think about how to improve
your automated pre-processing step with a clear target, like the attached
images (I did not test them).

Your image uses two background colors. You can cut out the top and bottom parts
and process each fragment on its own (so adaptive thresholding does not get
confused).
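
To illustrate why splitting helps, here is a toy sketch in plain Python. It uses a simple min/max midpoint threshold instead of real adaptive thresholding, and the pixel values and the names `binarize` and `binarize_in_bands` are invented for the example. In practice you would do the cropping and thresholding with an image library.

```python
def binarize(rows):
    """Threshold a block of grayscale rows at the midpoint of its min and max."""
    values = [v for row in rows for v in row]
    if not values:
        return []
    threshold = (min(values) + max(values)) / 2
    return [[0 if v < threshold else 255 for v in row] for row in rows]


def binarize_in_bands(pixels, split_row):
    """Binarize the rows above and below split_row independently."""
    return binarize(pixels[:split_row]) + binarize(pixels[split_row:])


# Toy image: dark "text" in column 1, light background on top,
# mid-gray background at the bottom (all values are made up).
pixels = [
    [250, 40, 250, 250],
    [250, 40, 250, 250],
    [120, 20, 120, 120],
    [120, 20, 120, 120],
]

# One threshold for the whole image turns the gray band completely black,
# so the text in it is lost.
print(binarize(pixels)[2])
# Thresholding each band on its own keeps the text in both bands.
print(binarize_in_bands(pixels, 2)[2])
```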




Bye,

Lorenzo

On Fri, 24 Jun 2022 at 09:22, 'Yash Mistry' via tesseract-ocr <
[email protected]> wrote:

> Hi Lorenzo,
>
> Thank you for the suggestions.
>
> The first approach you suggest is not feasible for me because there is no
> certainty that a specific type of data will be present at a particular position.
>
> I am interested in the second approach. I am trying to find any functionality
> in tesseract which gives me all possible predictions for a specific letter,
> but I haven't found a solution yet.
>
> Can you please tell me where you found this kind of functionality
> in tesseract, and in which version of tesseract?
>
> Thank you
>
> On Tuesday, June 7, 2022 at 1:45:48 PM UTC+5:30 Lorenzo Blz wrote:
>
>> Hi Yash,
>> in my experience you are going to see a lot of these errors on similar
>> characters.
>>
>>
>> Given only the pre-processed text, I might make the same mistake myself.
>>
>>
>> What I do is to fix these letters according to a pattern, in this case
>> WDDDDDDD
>>
>> and I replace:
>>
>> S <-> 8
>> O <-> 0
>> I  <->  1
>> i  <->  1
>> l  <->  1
>> z  <->  2
>> Z  <->  2
>> etc.
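
A minimal sketch of this pattern-based fix, assuming that `W` in the pattern marks a letter position and `D` a digit position. The function name `fix_lookalikes` and the two tables are illustrative; the tables only cover the pairs listed above and should be extended for your own data.

```python
# Substitution tables built from the look-alike pairs listed above.
TO_LETTER = {"8": "S", "0": "O", "1": "I", "2": "Z"}
TO_DIGIT = {"S": "8", "O": "0", "I": "1", "i": "1", "l": "1", "z": "2", "Z": "2"}


def fix_lookalikes(text, pattern):
    """Coerce each character to the class the pattern expects.

    Assumption: 'W' marks a letter position and 'D' a digit position.
    Characters that already fit, or have no known look-alike, are kept.
    """
    if len(text) != len(pattern):
        return text  # the pattern does not apply; leave the OCR output alone
    fixed = []
    for char, kind in zip(text, pattern):
        if kind == "W" and not char.isalpha():
            char = TO_LETTER.get(char, char)
        elif kind == "D" and not char.isdigit():
            char = TO_DIGIT.get(char, char)
        fixed.append(char)
    return "".join(fixed)


print(fix_lookalikes("81003716", "WDDDDDDD"))  # S1003716
print(fix_lookalikes("S1OO37I6", "WDDDDDDD"))  # S1003716
```

Note that a single table cannot resolve every ambiguity: `S` can stand for either 5 or 8, so a digit position that was read as `S` still needs a decision on your side.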
>>
>> Another option, though I'm not 100% sure it is possible with the latest
>> version, is to ask tesseract for the whole list of predictions for each
>> token with confidence. For the first token you'd get something like:
>>
>> S: 0.6839
>> 8: 0.2123
>> B: 0.1445
>> ...
>>
>> and, again according to a pattern, you select the best matching one (you
>> need to use the choiceIterator on the result object, iterating at level
>> SYMBOL). This second approach is more elegant, but I do not think it gives
>> you much more than the simpler approach.
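
The selection step could look roughly like this, assuming the alternatives have already been collected into one list of (text, confidence) pairs per symbol. That input format, the name `pick_by_pattern`, the `W`/`D` reading of the pattern (letter/digit), and the confidence values are all assumptions made for the example.

```python
def pick_by_pattern(candidates, pattern):
    """Pick, per position, the most confident candidate of the expected class.

    candidates: one list per symbol of (text, confidence) pairs.
    pattern: 'W' for a letter position, 'D' for a digit position (assumed).
    Falls back to the overall best candidate if none fits the pattern.
    """
    result = []
    for choices, kind in zip(candidates, pattern):
        if not choices:
            continue  # nothing recognized at this position
        wanted = str.isalpha if kind == "W" else str.isdigit
        matching = [c for c in choices if wanted(c[0])]
        best = max(matching or choices, key=lambda c: c[1])
        result.append(best[0])
    return "".join(result)


# Made-up alternatives for a two-symbol string.
candidates = [
    [("8", 0.68), ("S", 0.21), ("B", 0.14)],
    [("1", 0.90), ("I", 0.08)],
]
print(pick_by_pattern(candidates, "WD"))  # S1
print(pick_by_pattern(candidates, "DD"))  # 81
```

Unlike a fixed replacement table, this picks `S` over `B` at the first position because it has the higher confidence among the letters.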
>>
>> Of course a little bit of model fine tuning helps but will not fix these
>> problems 100% and it takes a lot of time to do it properly.
>>
>>
>> I recommend using tesserocr, which is a real API/library wrapper (not a
>> command line wrapper...). It gives you access to the whole API and, if used
>> properly, it is a lot faster.
>>
>>
>>
>> Bye
>>
>> Lorenzo
>>
>> On Tue, 7 Jun 2022 at 09:50, 'Yash Mistry' via tesseract-ocr <
>> [email protected]> wrote:
>>
>>> I am having trouble extracting the correct letter from a word when
>>> characters look alike, e.g. 5 & S, I & 1, 8 & S.
>>>
>>> I applied image pre-processing techniques (blurring, erosion, dilation,
>>> noise normalization, removal of unnecessary components) and text detection
>>> to the input image, but even after this much pre-processing tesseract OCR
>>> isn't giving the correct result.
>>>
>>> Please check attached images,
>>>
>>> *Original Image*
>>>
>>>
>>> *[image: image.png]*
>>>
>>> *Pre-processed Image*
>>>
>>> [image: image (1).png]
>>>
>>> *Detected Text*
>>>
>>>
>>> *[image: image (2).png]*
>>>
>>>
>>> *[image: image (3).png]*
>>>
>>> *Tesseract Configuration*
>>>
>>> -l eng --oem 1 --psm 7 -c
>>> tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n"
>>> load_system_dawg=false load_freq_dawg=false
>>>
>>> *Result of OCR*: TITLENUMBER 81003716
>>>
>>> As we can see, OCR extracts S as 8 even after pre-processing and text
>>> detection.
>>>
>>> Is there any way we can overcome this problem?
>>>
>>> *Tesseract Version*: tesseract 5.1.0-32-gf36c0
>>>
>>> Note: I asked the same question in the pytesseract GitHub repo and was
>>> advised to post it here.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/72dac625-d07f-4240-9032-3fa856868b8dn%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/72dac625-d07f-4240-9032-3fa856868b8dn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>

