Hi,
I have tesseract 3.02 on a Windows 10 PC.
I am trying to recognise text on a form photographed with a camera. The form
contains mostly numbers in tabular form, a small amount of Hebrew text, and
one English "graphical" word. I processed the photo to remove a pink
background pattern and to enhance the text. (The original with only the pink
pattern removed produced the same results.)
[image: 3198Rfat.png]
The Hebrew text on the bottom 2 lines is cut off on the right, but this
does not matter to me.
Only the numbers are of interest to me in the output.
I am running tesseract in Python using the pytesseract wrapper, and I am
running the following command:
- from PIL import Image
- import pytesseract
- Imaj = Image.open(ImgPath)  # ImgPath is the full path to the .png file.
- print('\n\n', 'v'*20, '\n', pytesseract.image_to_string(Imaj), '\n', '^'*20, '\n\n')  # uses the default language (eng)
I believe this corresponds to the command-line:
- tesseract ImgPath out (I used the actual path)
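Since only the digits matter, one variation that may be worth trying is the stock "digits" config file, which as far as I know restricts recognition to digits and a little punctuation. The sketch below only builds the command line that I assume corresponds to the pytesseract call. `build_tesseract_cmd` is a hypothetical helper (not part of pytesseract), the file names are placeholders, and `-psm 6` (treat the image as one uniform block of text) is a guess at a suitable page segmentation mode for 3.02; newer versions spell the flag `--psm`.

```python
import shlex


def build_tesseract_cmd(img_path, out_base, lang=None, config=""):
    """Return the argument list for a tesseract command-line run.

    Hypothetical helper for illustration only; it mirrors what I assume
    pytesseract passes to the tesseract executable.
    """
    cmd = ["tesseract", img_path, out_base]
    if lang:
        cmd += ["-l", lang]
    # Extra options and config-file names go last, e.g. "-psm 6 digits".
    cmd += shlex.split(config)
    return cmd


# Default run, as in the command above (placeholder file names).
print(build_tesseract_cmd("form.png", "out"))

# Digits-only attempt, assuming the stock "digits" config is installed.
print(build_tesseract_cmd("form.png", "out", config="-psm 6 digits"))
```

The same config string can be handed to the wrapper as `pytesseract.image_to_string(Imaj, config='-psm 6 digits')`; whether it helps with this particular font is untested.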
The output that I get is the following:
- 7547512723 2
-
- 1334718913
- 0000000000
- 3927010465.
- 4483273819..
- 0.|..1|.|.1ln/_1|.7_n/.01
- 0556107919..
- 1|11n/Tln/_nJ110._O...|__
- 6978344327..
- n/..|9._..l9._Q.:1Jn.o3n/___
- _/0._1|.|9._n0EunD3./:
- n/L232333333““
-
- A —:1 qnnwn N
-
- 156138
-
- ::§1§§?13:?76fi-fi333ii‘ifi1
- 10:52:25 29.11.19 :1 ma‘
Most of it is meaningless gibberish to me. Only the highlighted text is
recognised correctly.
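As a stopgap, the purely numeric lines can be separated from the rest of the output after the fact. This is a minimal sketch, assuming that a useful line is a run of digits optionally followed by stray dots or spaces; `numeric_lines` and the pattern are illustrative choices, and the sample string is a few lines copied from the output above. It only filters what tesseract returned and says nothing about whether the digits themselves match the form.

```python
import re

# A line counts as numeric if it is digits followed only by dots/spaces.
NUMERIC_LINE = re.compile(r"^(\d+)[ .]*$")


def numeric_lines(ocr_text):
    """Return the digit strings from lines that contain nothing else."""
    found = []
    for line in ocr_text.splitlines():
        match = NUMERIC_LINE.match(line.strip())
        if match:
            found.append(match.group(1))
    return found


# A few lines copied from the tesseract output above.
sample = "3927010465.\n0.|..1|.|.1ln/_1|.7_n/.01\n0556107919..\n156138\n"
print(numeric_lines(sample))  # ['3927010465', '0556107919', '156138']
```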
When I ran it with the Hebrew language selected, it produced similar
results, but with *some* of the Hebrew characters and only the "156138"
recognised correctly.
Running tesseract manually (English) in a 'CMD' window produced the
attached file 'out.txt'.
I suspect that the font used in the form is the problem: the form was not
printed from a normal Windows, Mac or Linux computer.
Which fonts were used to create heb.traineddata? Is there a way for me to
display them?
Do I have to train tesseract with the font in the form?
Any help will be appreciated!
Thanks!