Re: [tesseract-ocr] how to see which fonts are used in .traineddata files

2020-10-23 Thread Zdenko Podobny
e.g.
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.444.226&rep=rep1&type=pdf

https://arthurflor23.medium.com/text-segmentation-b32503ef2613

Zdenko
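
For readers who want a concrete starting point for "custom image segmentation", here is a minimal sketch of one simple technique, projection profiles. It is not taken from the links above. It assumes the scan has already been binarized into a grid of 0/1 values (1 = ink); the names ink_runs and segment_cells and the row_gap/col_gap parameters are hypothetical, chosen only for this illustration.

```python
from typing import List, Tuple

Grid = List[List[int]]  # binarized image: 1 = ink, 0 = background


def ink_runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges of non-zero profile values.

    Runs separated by fewer than min_gap blank positions are merged, so
    that narrow gaps (e.g. between two digits) do not split a cell.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def segment_cells(grid: Grid, row_gap: int = 1, col_gap: int = 1) -> List[Tuple[int, int, int, int]]:
    """Split the grid into (top, bottom, left, right) boxes, row band by row band."""
    row_profile = [sum(row) for row in grid]
    boxes = []
    for top, bottom in ink_runs(row_profile, row_gap):
        band = grid[top:bottom]
        # Column profile of this band only: ink count per column.
        col_profile = [sum(col) for col in zip(*band)]
        for left, right in ink_runs(col_profile, col_gap):
            boxes.append((top, bottom, left, right))
    return boxes


toy = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
]
for box in segment_cells(toy):
    print(box)
```

Each box could then be cropped from the original image and passed to tesseract on its own. On a real photo the gap thresholds would have to be tuned so that the digits of one number stay together while neighbouring columns are separated, and a skewed scan would need deskewing first.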



-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.

Re: [tesseract-ocr] how to see which fonts are used in .traineddata files

2020-10-22 Thread H Brenner
Hi Zdenko,

Per your suggestion I have installed the latest version of tesseract (Ver 
5), and I experimented with the psm setting.

I get the best result using --psm 11, like you did. Other values of psm 
give poor results. --psm 11 is the best, but it is still not good.

How do I create custom image segmentation?

Thank you in advance for your help.

Hylton
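
A comparison of psm values like the one described above can be automated. The sketch below is only an illustration: sweep_psm and digit_score are hypothetical helper names, and counting two-digit tokens is an assumed, rough proxy for "how many table numbers were read", not a real accuracy measure. The OCR call is passed in as a function so the sweep logic can be tried without tesseract installed.

```python
import re
from typing import Callable, Dict, Iterable, Tuple


def digit_score(text: str) -> int:
    """Count stand-alone two-digit tokens such as '27' or '04'."""
    return len(re.findall(r"\b\d{2}\b", text))


def sweep_psm(ocr: Callable[[str], str],
              modes: Iterable[int],
              dpi: int = 300) -> Tuple[int, Dict[int, int]]:
    """Run ocr(config) once per psm mode; return the best mode and all scores.

    Raises ValueError if modes is empty.
    """
    scores: Dict[int, int] = {}
    for psm in modes:
        config = f"--psm {psm} --dpi {dpi}"
        scores[psm] = digit_score(ocr(config))
    best = max(scores, key=scores.get)
    return best, scores


# With pytesseract and Pillow installed, the OCR function could be:
#
#   import pytesseract
#   from PIL import Image
#   img = Image.open("20201002.png")
#   best, scores = sweep_psm(
#       lambda cfg: pytesseract.image_to_string(img, config=cfg),
#       modes=[3, 4, 6, 11, 12],
#   )
```

The config string is handed to pytesseract's image_to_string through its config argument, which forwards it to the tesseract command line.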




Re: [tesseract-ocr] how to see which fonts are used in .traineddata files

2020-10-05 Thread H Brenner
Hello Zdenko,

1) Can I assume you used the latest version of tesseract to produce the
output you posted?
To install the latest version, do I need to first *uninstall* the older
version that I have on my PC?
2) How do I create a custom image segmentation?

Thanks,
Hylton




Re: [tesseract-ocr] how to see which fonts are used in .traineddata files

2020-10-03 Thread Zdenko Podobny
1. Try the latest version.
2. Try playing with psm, e.g. tesseract 20201002.png - --psm 11 --dpi 300
produces:

8 27 26 10 04 03 01

N29 19 16 14 09 03

131 27 25 18 12 03

N21 18 16 13 07 04

N32 232112 10 07

N 36 34 30 27 21 01

X35 3417 13 10 08

N36 33 29 28 14 09

R 33 32 31 21 06 01

- oe 

—— — ——— —— a = —

R 37 27 19 09 05 03

-———

Fra anny

156136

-——

3198(19): ‘on iam mn

10:52:25 28.11.19 1 09


... custom image segmentation would help too (and then OCR each "cell"
individually)

Zdenko
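
Since only the numbers matter here, the --psm 11 output above could be post-processed along these lines. This is a sketch under assumptions: the helper names numeric_rows and merged_tokens are made up for illustration, and treating every table entry as a two-digit number is inferred from the sample output, not confirmed by the form itself.

```python
import re
from typing import List


def numeric_rows(ocr_text: str, min_numbers: int = 3) -> List[List[str]]:
    """Keep lines with at least min_numbers digit groups; return the groups."""
    rows = []
    for line in ocr_text.splitlines():
        numbers = re.findall(r"\d+", line)
        if len(numbers) >= min_numbers:
            rows.append(numbers)
    return rows


def merged_tokens(rows: List[List[str]], width: int = 2) -> List[str]:
    """List tokens longer than the assumed cell width (likely merged cells)."""
    return [n for row in rows for n in row if len(n) > width]


# A few lines copied from the output above.
sample = """8 27 26 10 04 03 01

N29 19 16 14 09 03

N32 232112 10 07

- oe

156136
"""

rows = numeric_rows(sample)
print(rows)
print(merged_tokens(rows))
```

Tokens reported by merged_tokens, such as 232112, mark the places where neighbouring cells ran together, which is where per-cell segmentation would help most. The date and time line at the bottom of the form also contains several digit groups, so it would need separate filtering.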


On Sat, 3 Oct 2020 at 7:06, H Brenner wrote:

> Hi,
>
> I have tesseract 3.02 on a Windows 10 PC.
>
> I am trying to recognise text on a form scanned with a camera that has
> numbers mostly in tabular form with a small amount of Hebrew characters
> plus one English "graphical" word. I processed the photo to remove a pink
> background pattern, and to enhance the text in the image (the original -
> minus the pink pattern - produced the same results)
>
> [image: 3198Rfat.png]
>
> The Hebrew text on the bottom 2 lines is cut off on the right, but this
> does not matter to me.
>
> Only the numbers are of interest to me in the output.
>
> I am running tesseract in Python using the pytesseract wrapper, and I am
> running the following command:
>
>     from PIL import Image
>     import pytesseract
>
>     Imaj = Image.open(ImgPath)  # ImgPath is the full path to the .png file
>     # image_to_string() uses the default language (eng)
>     print('\n\n', 'v'*20, '\n',
>           pytesseract.image_to_string(Imaj), '\n', '^'*20, '\n\n')
>
> I believe this corresponds to the command line:
>
>     tesseract ImgPath out    (I used the actual path)
>
> The output that I get is the following:
>
>-  7547512723 2
>-
>- 1334718913
>- 00
>- 3927010465.
>- 4483273819..
>- 0.|..1|.|.1ln/_1|.7_n/.01
>- 0556107919..
>- 1|11n/Tln/_nJ110._O...|__
>- 6978344327..
>- n/..|9._..l9._Q.:1Jn.o3n/___
>- _/0._1|.|9._n0EunD3./:
>- n/L23233““
>-
>-  A —:1 qnnwn N
>-
>- 156138
>-
>- ::§1§§?13:?76fi-fi333ii‘ifi1
>- 10:52:25 29.11.19 :1 ma‘
>
> Most of it is meaningless gibberish to me. Only the highlighted text is
> recognised correctly.
>
> When I ran it with the Hebrew language selected, it produced similar
> results, but with *some* of the Hebrew characters and only the "156138"
> recognised correctly.
>
> Running tesseract manually (English) in a 'CMD' window produced the
> attached file 'out.txt'.
>
> I suspect that the font used in the form is the problem - the form was not
> printed on a normal Windows, Mac or Linux computer.
>
> Which fonts were used to create heb.traineddata? Is there a way for me to
> display them?
>
> Do I have to train tesseract with the font in the form?
>
> Any help will be appreciated!
>
> Thanks!
>
