Hello Zdenko,
1) Can I assume you used the latest version of tesseract to produce that
output?
To install the latest version, do I first need to *uninstall* the older
version that I have on my PC?
2) How do I create a custom image segmentation?
Thanks,
Hylton
On Sat, Oct 3, 2020 at 12:21 PM Zdenko Podobny <[email protected]> wrote:
> 1. try the latest version
> 2. try playing with psm: e.g. tesseract 20201002.png - --psm 11 --dpi 300
> produces:
>
> 8 27 26 10 04 03 01
>
> N29 19 16 14 09 03
>
> 131 27 25 18 12 03
>
> N21 18 16 13 07 04
>
> N32 232112 10 07
>
> N 36 34 30 27 21 01
>
> X35 3417 13 10 08
>
> N36 33 29 28 14 09
>
> R 33 32 31 21 06 01
>
> - oe ————
>
> —— — ——— —— a = —
>
> R 37 27 19 09 05 03
>
> -———
>
> Fra anny
>
> 156136
>
> -——
>
> 3198(19): ‘on iam mn
>
> 10:52:25 28.11.19 1 09
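
The sparse-text run above can presumably be reproduced from Python through
the `config` argument of pytesseract's image_to_string. A minimal sketch,
assuming Tesseract 4 or later is installed (version 3.02 does not accept the
`--psm`/`--dpi` spelling); the helper names build_config and ocr_sparse are
made up for illustration:

```python
def build_config(psm=11, dpi=300):
    """Build the option string that pytesseract passes through to tesseract."""
    return f"--psm {psm} --dpi {dpi}"


def ocr_sparse(image_path, psm=11, dpi=300):
    """OCR one image file with the given page segmentation mode and DPI."""
    # Imported lazily so build_config() can be used without the OCR stack.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, config=build_config(psm, dpi))
```

Calling ocr_sparse("20201002.png") would then be the Python counterpart of
the command line quoted above; other psm values can be tried the same way.
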
>
>
> ... custom image segmentation would help too (and then OCR each "cell"
> individually)
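
One possible reading of "custom image segmentation" is to cut the table into
cells and run OCR on each crop separately. The sketch below assumes the table
fills the image and forms a regular grid, which may well not hold for the
real form (detecting the ruling lines would be the more robust route). The
function names, and the choice of `--psm 7` (treat the crop as a single text
line), are assumptions for illustration only:

```python
def grid_boxes(width, height, rows, cols):
    """Split a width x height area into rows x cols crop boxes, row by row.

    Each box is (left, upper, right, lower), the format Pillow's crop() takes.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            upper = r * height // rows
            lower = (r + 1) * height // rows
            boxes.append((left, upper, right, lower))
    return boxes


def ocr_cells(image_path, rows, cols):
    """OCR every grid cell on its own and return the texts in row-major order."""
    # Imported lazily so grid_boxes() can be used without the OCR stack.
    from PIL import Image
    import pytesseract

    texts = []
    with Image.open(image_path) as img:
        for box in grid_boxes(img.width, img.height, rows, cols):
            cell = img.crop(box)
            text = pytesseract.image_to_string(cell, config="--psm 7")
            texts.append(text.strip())
    return texts
```

The row and column counts have to be supplied by hand here; they are not
derived from the image.
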
>
> Zdenko
>
>
> On Sat, Oct 3, 2020 at 7:06 AM H Brenner <[email protected]> wrote:
>
>> Hi,
>>
>> I have tesseract 3.02 on a Windows 10 PC.
>>
>> I am trying to recognise text on a form photographed with a camera. The
>> form contains mostly numbers in tabular layout, a small amount of Hebrew
>> text and one English "graphical" word. I processed the photo to remove a
>> pink background pattern and to enhance the text (the original, minus the
>> pink pattern, produced the same results).
>>
>> [image: 3198Rfat.png]
>>
>> The Hebrew text on the bottom 2 lines is cut off on the right, but this
>> does not matter to me.
>>
>> Only the numbers are of interest to me in the output.
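
Since only the digits matter, one option might be to ask tesseract for digits
only and then pull the numbers out of whatever text comes back. A hedged
sketch: tessedit_char_whitelist is a tesseract configuration variable, but
whether it is honoured depends on the installed version and engine, so the
regular-expression filter is kept as a fallback. The names DIGITS_CONFIG and
extract_numbers are illustrative only:

```python
import re

# Passed as config=DIGITS_CONFIG to pytesseract.image_to_string();
# support for the whitelist varies between tesseract versions and engines.
DIGITS_CONFIG = "-c tessedit_char_whitelist=0123456789"


def extract_numbers(ocr_text):
    """Return the runs of digits found on each line, skipping lines with none."""
    rows = []
    for line in ocr_text.splitlines():
        numbers = re.findall(r"\d+", line)
        if numbers:
            rows.append(numbers)
    return rows
```

Applied to a line such as "N29 19 16 14 09 03", this keeps the six numbers
and drops the stray leading letter; it cannot repair digits that were merged
or misread in the first place.
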
>>
>> I am running tesseract in Python using the pytesseract wrapper, and I am
>> running the following command:
>>
>> - Imaj = Image.open(ImgPath)  # ImgPath is the full path to the .png file
>> - print('\n\n', 'v'*20, '\n', pytesseract.image_to_string(Imaj), '\n', '^'*20, '\n\n')  # uses the default language, eng
>>
>> I believe this corresponds to the command-line:
>>
>> - tesseract ImgPath out (I used the actual path)
>>
>> The output that I get is the following:
>>
>> - 7547512723 2
>> -
>> - 1334718913
>> - 0000000000
>> - 3927010465.
>> - 4483273819..
>> - 0.|..1|.|.1ln/_1|.7_n/.01
>> - 0556107919..
>> - 1|11n/Tln/_nJ110._O...|__
>> - 6978344327..
>> - n/..|9._..l9._Q.:1Jn.o3n/___
>> - _/0._1|.|9._n0EunD3./:
>> - n/L232333333““
>> -
>> - A —:1 qnnwn N
>> -
>> - 156138
>> -
>> - ::§1§§?13:?76fi-fi333ii‘ifi1
>> - 10:52:25 29.11.19 :1 ma‘
>>
>> Most of it is meaningless gibberish to me. Only the highlighted text is
>> recognised correctly.
>>
>> When I ran it with the Hebrew language selected, it produced similar
>> results, but with *some* of the Hebrew characters and only the "156138"
>> recognised correctly.
>>
>> Running tesseract manually (English) in a 'CMD' window produced the
>> attached file 'out.txt'.
>>
>> I suspect that the font used in the form is the problem - the form was
>> not printed on a normal Windows, Mac or Linux computer.
>>
>> Which fonts were used to create heb.traineddata? Is there a way for me to
>> display them?
>>
>> Do I have to train tesseract with the font in the form?
>>
>> Any help will be appreciated!
>>
>> Thanks!
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/a6602b5e-307e-406d-8650-510e8c2225e6n%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/a6602b5e-307e-406d-8650-510e8c2225e6n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>