The first image has several problems:
1. the baseline is not straight
2. different font sizes
3. table-like structure
4. amount/digit fields
Problems 1-3 could be solved with custom layout analysis, e.g. by splitting the image into individual parts and sending them to tesseract via the API or a UZN file. There was an analysis (you can find it in the forum) that suggests not using letters taller than 30 pixels, so I also resized the input image. The LSTM engine is not (always) good at OCR of amount fields, so I suggest using the legacy engine for this image (you will need eng.traineddata from the tessdata repository).

Here is the result:

tesseract 1_input_r.png - --psm 4 --oem 2
UZN file 1_input_r.uzn loaded.
15.
16.
17.
18.
19.
Sophie Mitglied
DerNick03 Mitglied
Joko Mitglied
Jens Mitglied
Christian Mitglied
76
51
0
0

Zdenko

On Sat, 5 Oct 2019 at 18:27, test0r man wrote:
> thanks for your test. i set the border with imagemagick for a better
> result on the first image. tesseract detects with psm 6 all numbers right,
> but only on the second image. have you tried the first image too?
>
> On Saturday, 5 October 2019 at 14:52:15 UTC+2, zdenop wrote:
>>
>> tesseract 2_input_cropped.png - --psm 6 --oem 0
>> 6.
>> 7.
>> 8.
>> 9.
>> 10.
>>
>> Zdenko
>>
>> On Sat, 5 Oct 2019 at 10:04, test0r man wrote:
>>
>>> --Push--
>>>
>>> does anyone have an idea?
>>>
>>> thanks for help!
>>>
>>> On Sunday, 8 September 2019 at 12:23:28 UTC+2, test0r man wrote:
>>>>
>>>> hi,
>>>> i use this command:
>>>>
>>>> tesseract input/image.jpg output/output --dpi 72 --oem 1 -l deu+eng
>>>>
>>>> to scan image like "1_input.jpg" and "2_input.jpg". the ocr result is
>>>> good, but it seems that tesseract ignores short/single characters.
>>>> in the first image it ignores the three "0".
>>>> in the second image it only detects the "10.".
>>>>
>>>> the tessinput files are attached too.
>>>> if i use the "--psm 6" command, all other words won't be detected right.
>>>> if i scale the images to 300 dpi, it's the same result.
>>>>
>>>> has anyone an idea? thanks for help!
1_input_r.uzn
Description: Binary data
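
The attached UZN file is not reproduced here, so the following is only a rough sketch of how such a zone file and the resize factor could be prepared in Python. It assumes the commonly described UZN layout of one zone per line as "left top width height label", with the file named after the image (1_input_r.uzn next to 1_input_r.png) and read when --psm 4 is used. The function names, the example zone coordinates and the measured letter height are all hypothetical and are not taken from the original images.

```python
from typing import List, Tuple

# One zone: left, top, width, height (pixels) and a free-form label.
Zone = Tuple[int, int, int, int, str]


def scale_for_max_glyph_height(glyph_height_px: float,
                               max_height_px: float = 30.0) -> float:
    """Return a resize factor so the tallest letters are at most max_height_px.

    The 30 px default follows the forum advice mentioned above; images whose
    letters are already small enough are left unscaled (factor 1.0).
    """
    if glyph_height_px <= 0:
        raise ValueError("glyph_height_px must be positive")
    return min(1.0, max_height_px / glyph_height_px)


def scale_zone(zone: Zone, factor: float) -> Zone:
    """Scale a zone's coordinates by the same factor used to resize the image."""
    left, top, width, height, label = zone
    return (int(round(left * factor)), int(round(top * factor)),
            int(round(width * factor)), int(round(height * factor)), label)


def format_uzn(zones: List[Zone]) -> str:
    """Render zones as UZN text: one 'left top width height label' per line."""
    return "".join(f"{l} {t} {w} {h} {label}\n" for l, t, w, h, label in zones)


if __name__ == "__main__":
    # Hypothetical zones for the three table columns (rank, name, amount),
    # measured on the unscaled image; replace with real coordinates.
    zones = [
        (0, 0, 120, 600, "Text"),
        (120, 0, 700, 600, "Text"),
        (820, 0, 180, 600, "Text"),
    ]
    factor = scale_for_max_glyph_height(glyph_height_px=60)  # assumed height
    scaled = [scale_zone(z, factor) for z in zones]
    with open("1_input_r.uzn", "w", encoding="ascii") as fh:
        fh.write(format_uzn(scaled))
```

Splitting the table into one zone per column keeps the short amount fields from being mixed into the text lines next to them, which is the idea behind using a UZN file here. The image itself still has to be resized by the same factor with a separate tool.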

