any ideas?

Zdenko Podobny Sat, 05 Oct 2019 11:24:11 -0700

The first image has several problems:

   1. not straight baseline
   2. different font size
   3. table like structure
   4. amount/digits fields



Items 1-3 could be solved with custom layout analysis, e.g. by splitting the
image into individual parts and sending them to tesseract via the API or a
uzn file.
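
As a rough illustration of the uzn route, the sketch below writes a zone file
next to the image. It assumes the commonly described UZN layout of one zone
per line as "left top width height label", and that tesseract picks up a .uzn
file sharing the image's base name when run with --psm 4. The zone coordinates
and the helper names (uzn_lines, write_uzn) are made up for this example and
are not taken from the attached 1_input_r.uzn.

```python
from pathlib import Path


def uzn_lines(zones):
    """Format (left, top, width, height, label) tuples as UZN text lines."""
    return [
        f"{left} {top} {width} {height} {label}"
        for left, top, width, height, label in zones
    ]


def write_uzn(image_path, zones):
    """Write the zones to <image stem>.uzn beside the image and return the path."""
    uzn_path = Path(image_path).with_suffix(".uzn")
    uzn_path.write_text("\n".join(uzn_lines(zones)) + "\n")
    return uzn_path


# Hypothetical zones for a three-column table: rank, name/role, amount.
EXAMPLE_ZONES = [
    (0, 0, 60, 400, "Text"),
    (60, 0, 300, 400, "Text"),
    (360, 0, 100, 400, "Text"),
]
```

Calling write_uzn("1_input_r.png", EXAMPLE_ZONES) would create 1_input_r.uzn in
the same directory; each zone is then recognized on its own, which is one way
to keep a table column from disturbing its neighbours.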

There was an analysis (you can find it in the forum) that suggests not using
letters taller than 30 pixels, so I also resized the input image.
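
A small sketch of the arithmetic behind such a resize: given the image size
and a measured letter height, it returns the size at which letters end up
about 30 pixels tall. The measured height is something you would have to
determine yourself, and the function name is invented for this example.

```python
def resize_for_glyph_height(size, glyph_height, target_height=30):
    """Return the (width, height) at which glyphs are target_height pixels tall.

    size         -- current (width, height) of the image in pixels
    glyph_height -- measured height of a typical letter in pixels (assumed input)
    """
    if glyph_height <= 0:
        raise ValueError("glyph_height must be positive")
    scale = target_height / glyph_height
    width, height = size
    # Never return a zero-sized dimension, even for extreme scale factors.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, an image of 1200x800 pixels with letters about 60 pixels tall
would be scaled to 600x400. The actual resampling can then be done with any
image tool, such as ImageMagick or Pillow's Image.resize.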

The LSTM engine is not (always) good at OCR of amount fields, so I suggest
using the legacy engine for this image (you will need eng.traineddata from the
tessdata repository).
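
For scripting such runs, here is a minimal sketch that only assembles the
command line. It uses the documented --oem values (0 = legacy only, 1 = LSTM
only, 2 = legacy and LSTM combined) and "-" as the output base to print to
stdout. The function and constant names are invented for this example; the
legacy and combined modes need a traineddata file that contains the legacy
model.

```python
OEM_LEGACY = 0    # legacy engine only
OEM_LSTM = 1      # LSTM engine only
OEM_COMBINED = 2  # legacy and LSTM engines together


def tesseract_command(image, psm, oem, lang=None):
    """Build an argument list for a tesseract run that prints to stdout."""
    cmd = ["tesseract", image, "-", "--psm", str(psm), "--oem", str(oem)]
    if lang:
        # e.g. "deu+eng" to combine several language models
        cmd += ["-l", lang]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, capture_output=True,
text=True) on a machine where tesseract is installed.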

Here is the result:
tesseract 1_input_r.png - --psm 4 --oem 2
UZN file 1_input_r.uzn loaded.
15.

16.

17.

18.

19.

Sophie
Mitglied

DerNick03
Mitglied

Joko
Mitglied

Jens
Mitglied

Christian
Mitglied

76

51

0

0


Zdenko


On Sat, 5 Oct 2019 at 18:27, test0r man <[email protected]> wrote:

> Thanks for your test. I added a border with ImageMagick to get a better
> result on the first image. With psm 6, tesseract detects all numbers
> correctly, but only on the second image. Have you tried the first image too?
>
>
> On Saturday, 5 October 2019 at 14:52:15 UTC+2, zdenop wrote:
>>
>>
>> tesseract 2_input_cropped.png - --psm 6 --oem 0
>> 6.
>> 7.
>> 8.
>> 9.
>> 10.
>>
>>
>>
>> Zdenko
>>
>>
>> On Sat, 5 Oct 2019 at 10:04, test0r man <[email protected]> wrote:
>>
>>> --Push--
>>>
>>> does anyone have an idea?
>>>
>>> thanks for help!
>>>
>>>
>>> On Sunday, 8 September 2019 at 12:23:28 UTC+2, test0r man wrote:
>>>>
>>>> Hi,
>>>> I use this command:
>>>>
>>>> tesseract input/image.jpg output/output --dpi 72 --oem 1 -l deu+eng
>>>>
>>>> to OCR images like "1_input.jpg" and "2_input.jpg". The OCR result is
>>>> good, but it seems that tesseract ignores short/single characters.
>>>> In the first image it ignores the three "0"s.
>>>> In the second image it only detects the "10.".
>>>>
>>>> The tessinput files are attached too.
>>>> If I use the "--psm 6" option, the other words are no longer detected
>>>> correctly.
>>>> If I scale the images to 300 dpi, the result is the same.
>>>>
>>>> Does anyone have an idea? Thanks for your help!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/6bb8a731-afa3-4dbf-a805-90b9120b791b%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/6bb8a731-afa3-4dbf-a805-90b9120b791b%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>


1_input_r.uzn
Description: Binary data

Re: [tesseract-ocr] Re: tesseract ignores single/short characters -> any ideas?
