any ideas?

Zdenko Podobny Sat, 05 Oct 2019 12:49:00 -0700

"end" is a typo ;-) It should be read as "eng" :-)

On Sat, 5 Oct 2019 at 21:31, test0r man <[email protected]> wrote:


> Hi Zdenko,
>
> very good job! I tried a lot of image manipulation, but that was the
> wrong approach for problems 1-3. The idea with the uzn file is great and, I
> think, the perfect solution. Thanks :-)
>
> I can confirm that scaling these images didn't help (letters taller than
> 30 pixels is the right explanation).
>
> What do you mean by the "end" traineddata? I have the "eng" traineddata
> and can't find "end.traineddata", not even on Google.
>
> I've tested it with your files and the result is perfect. Thank you, thank you,
> thank you!
>
>
> On Saturday, 5 October 2019 at 20:24:08 UTC+2, zdenop wrote:
>>
>> The first image has several problems:
>>
>>    1. baseline is not straight
>>    2. different font sizes
>>    3. table-like structure
>>    4. amount/digit fields
>>
>>
>> 1-3 could be solved with custom layout analysis, e.g. splitting the image
>> into individual parts and sending them to tesseract via the API or a uzn file.
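To make the uzn idea concrete, here is a minimal sketch in Python that writes a zone file for an image. It assumes the usual plain-text UZN layout of one zone per line as `left top width height label` in pixels, and that the file shares the image's base name (as in the "UZN file 1_input_r.uzn loaded." message in this thread). The helper names and the zone coordinates are made up for illustration; real values have to be measured on the actual image.

```python
from pathlib import Path


def format_uzn(zones):
    """Return UZN text: one 'left top width height label' line per zone."""
    lines = []
    for left, top, width, height, label in zones:
        if width <= 0 or height <= 0:
            raise ValueError("zone width and height must be positive")
        lines.append(f"{left} {top} {width} {height} {label}")
    return "\n".join(lines) + "\n"


def uzn_path_for(image_path):
    """Zone file path: same directory and base name as the image, .uzn suffix."""
    return Path(image_path).with_suffix(".uzn")


def write_uzn(image_path, zones):
    """Write the zone file next to the image and return its path."""
    path = uzn_path_for(image_path)
    path.write_text(format_uzn(zones), encoding="utf-8")
    return path


# Hypothetical zones for a three-column table: rank, name, amount.
EXAMPLE_ZONES = [
    (0, 0, 60, 400, "Text"),
    (60, 0, 300, 400, "Text"),
    (360, 0, 100, 400, "Text"),
]
```

Calling `write_uzn("1_input_r.png", EXAMPLE_ZONES)` would create `1_input_r.uzn` beside the image, after which the image can be run with `--psm 4` as in the command shown in this message.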
>>
>> There was an analysis (you can find it in the forum) that suggests not using
>> letters taller than 30 pixels, so I also resized the input image.
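The resize step can be worked out with simple arithmetic. The sketch below takes the 30-pixel limit from the message above; the measured letter height and the image size in the usage note are hypothetical, and the choice never to upscale is an assumption of this sketch, not something stated in the thread.

```python
def scale_for_letter_height(letter_height_px, max_height_px=30):
    """Scale factor that brings letters down to at most max_height_px.

    Returns 1.0 when the letters are already small enough (no upscaling).
    """
    if letter_height_px <= 0:
        raise ValueError("letter height must be positive")
    return min(1.0, max_height_px / letter_height_px)


def resized_size(width, height, scale):
    """New (width, height) in whole pixels after applying the scale factor."""
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, if capital letters measure 60 pixels in a 1000x400 image, the factor is 0.5 and the target size is 500x200. The actual resampling can then be done with any image tool.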
>>
>> The LSTM engine is not (always) good at OCR of amount fields, so I suggest
>> using the legacy engine for this image (you will need end.traineddata from
>> the tessdata repository).
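As a rough sketch of how such a call could be scripted, the Python helper below only assembles the command line used in this thread. The function names and defaults are illustrative assumptions. `--oem 2` mirrors the command in this message (legacy plus LSTM); `--oem 0` would select the legacy engine alone. Both need a traineddata file that contains the legacy model, and the language is given as `eng` following the typo correction at the top of the thread.

```python
import subprocess


def tesseract_args(image, psm=4, oem=2, lang="eng", tessdata_dir=None):
    """Build the argument list for a tesseract run that prints to stdout."""
    args = ["tesseract", image, "-", "--psm", str(psm), "--oem", str(oem), "-l", lang]
    if tessdata_dir:
        # Optional: point tesseract at a directory holding the legacy-capable model.
        args += ["--tessdata-dir", tessdata_dir]
    return args


def run_ocr(image, **options):
    """Run tesseract and return the recognised text (requires tesseract on PATH)."""
    result = subprocess.run(
        tesseract_args(image, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With the defaults, `tesseract_args("1_input_r.png")` yields the same arguments as the command shown in this message, with the language made explicit.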
>>
>> Here is result:
>> tesseract 1_input_r.png - --psm 4 --oem 2
>> UZN file 1_input_r.uzn loaded.
>> 15.
>>
>> 16.
>>
>> 17.
>>
>> 18.
>>
>> 19.
>>
>> Sophie
>> Mitglied
>>
>> DerNick03
>> Mitglied
>>
>> Joko
>> Mitglied
>>
>> Jens
>> Mitglied
>>
>> Christian
>> Mitglied
>>
>> 76
>>
>> 51
>>
>> 0
>>
>> 0
>>
>>
>> Zdenko
>>
>>
>> On Sat, 5 Oct 2019 at 18:27, test0r man <[email protected]> wrote:
>>
>>> Thanks for your test. I added the border with ImageMagick to get a better
>>> result on the first image. tesseract detects all numbers correctly with
>>> psm 6, but only on the second image. Have you tried the first image too?
>>>
>>>
>>> On Saturday, 5 October 2019 at 14:52:15 UTC+2, zdenop wrote:
>>>>
>>>>
>>>> tesseract 2_input_cropped.png - --psm 6 --oem 0
>>>> 6.
>>>> 7.
>>>> 8.
>>>> 9.
>>>> 10.
>>>>
>>>>
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Sat, 5 Oct 2019 at 10:04, test0r man <[email protected]> wrote:
>>>>
>>>>> --Push--
>>>>>
>>>>> Does anyone have an idea?
>>>>>
>>>>> Thanks for your help!
>>>>>
>>>>>
>>>>> On Sunday, 8 September 2019 at 12:23:28 UTC+2, test0r man wrote:
>>>>>>
>>>>>> Hi,
>>>>>> I use this command:
>>>>>>
>>>>>> tesseract input/image.jpg output/output --dpi 72 --oem 1 -l deu+eng
>>>>>>
>>>>>> to scan images like "1_input.jpg" and "2_input.jpg". The OCR result is
>>>>>> good, but it seems that tesseract ignores short/single characters.
>>>>>> In the first image it ignores the three "0".
>>>>>> In the second image it only detects the "10.".
>>>>>>
>>>>>> The tessinput files are attached too.
>>>>>> If I use the "--psm 6" option, the other words are no longer detected
>>>>>> correctly.
>>>>>> If I scale the images to 300 dpi, the result is the same.
>>>>>>
>>>>>> Does anyone have an idea? Thanks for your help!
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/6bb8a731-afa3-4dbf-a805-90b9120b791b%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/6bb8a731-afa3-4dbf-a805-90b9120b791b%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>


Re: [tesseract-ocr] Re: tesseract ignores single/short characters -> any ideas?
