e.g. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.444.226&rep=rep1&type=pdf
https://arthurflor23.medium.com/text-segmentation-b32503ef2613

Zdenko

On Fri, 23 Oct 2020 at 5:05, H Brenner <hyltonbren...@gmail.com> wrote:

> Hi Zdenko,
>
> Per your suggestion I have installed the latest version of tesseract (Ver
> 5), and I played with the psm.
>
> I get the best result using --psm 11, like you did. Other values of psm
> give poor results. psm 11 is the best, but it is still not good.
>
> How do I create custom image segmentation?
>
> Thank you in advance for your help.
>
> Hylton
>
> On Saturday, October 3, 2020 at 12:21:10 PM UTC+3 zdenop wrote:
>
>> 1. try the latest version
>> 2. try playing with psm: e.g. tesseract 20201002.png - --psm 11 --dpi 300
>> produces:
>>
>> 8 27 26 10 04 03 01
>>
>> N29 19 16 14 09 03
>>
>> 131 27 25 18 12 03
>>
>> N21 18 16 13 07 04
>>
>> N32 232112 10 07
>>
>> N 36 34 30 27 21 01
>>
>> X35 3417 13 10 08
>>
>> N36 33 29 28 14 09
>>
>> R 33 32 31 21 06 01
>>
>> - oe ————
>>
>> —— — ——— —— a = —
>>
>> R 37 27 19 09 05 03
>>
>> -———
>>
>> Fra anny
>>
>> 156136
>>
>> -——
>>
>> 3198(19): ‘on iam mn
>>
>> 10:52:25 28.11.19 1 09
>>
>>
>> .. . custom image segmentation would help too (and then to OCR each
>> "cell" individually)
>>
>> Zdenko
>>
>>
>> On Sat, 3 Oct 2020 at 7:06, H Brenner <hylton...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have tesseract 3.02 on a Windows 10 PC.
>>>
>>> I am trying to recognise text on a form scanned with a camera that has
>>> numbers mostly in tabular form with a small amount of Hebrew characters
>>> plus one English "graphical" word. I processed the photo to remove a pink
>>> background pattern, and to enhance the text in the image (the original -
>>> minus the pink pattern - produced the same results).
>>>
>>> [image: 3198Rfat.png]
>>>
>>> The Hebrew text on the bottom 2 lines is cut off on the right, but this
>>> does not matter to me.
>>>
>>> Only the numbers are of interest to me in the output.
>>>
>>> I am running tesseract in Python using the pytesseract wrapper, and I am
>>> running the following command:
>>>
>>> - from PIL import Image
>>> - import pytesseract
>>> - Imaj = Image.open(ImgPath) # ImgPath is the full path to the .png file.
>>> - print('\n\n', 'v'*20, '\n', pytesseract.image_to_string(Imaj), '\n', '^'*20, '\n\n') # use eng default
>>>
>>> I believe this corresponds to the command-line:
>>>
>>> - tesseract ImgPath out (I used the actual path)
>>>
>>> The output that I get is the following:
>>>
>>> - 7547512723 2
>>> -
>>> - 1334718913
>>> - 0000000000
>>> - 3927010465.
>>> - 4483273819..
>>> - 0.|..1|.|.1ln/_1|.7_n/.01
>>> - 0556107919..
>>> - 1|11n/Tln/_nJ110._O...|__
>>> - 6978344327..
>>> - n/..|9._..l9._Q.:1Jn.o3n/___
>>> - _/0._1|.|9._n0EunD3./:
>>> - n/L232333333““
>>> -
>>> - A —:1 qnnwn N
>>> -
>>> - 156138
>>> -
>>> - ::§1§§?13:?76fi-fi333ii‘ifi1
>>> - 10:52:25 29.11.19 :1 ma‘
>>>
>>> Most of it is meaningless gibberish to me. Only the highlighted text is
>>> recognised correctly.
>>>
>>> When I ran it with the Hebrew language selected, it produced similar
>>> results, but with *some* of the Hebrew characters and only the "156138"
>>> recognised correctly.
>>>
>>> Running tesseract manually (English) in a 'CMD' window produced the
>>> attached file 'out.txt'.
>>>
>>> I suspect that the font used in the form is the problem - the form was
>>> not printed on a normal Windows, Mac or Linux computer.
>>>
>>> Which fonts were used to create heb.traineddata? Is there a way for me
>>> to display them?
>>>
>>> Do I have to train tesseract with the font in the form?
>>>
>>> Any help will be appreciated!
>>>
>>> Thanks!
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/a6602b5e-307e-406d-8650-510e8c2225e6n%40googlegroups.com
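
The "custom image segmentation" idea from this thread (split the table into cells, then OCR each cell individually) can be illustrated with a rough sketch. This is not the method from either linked article; it is a plain projection-profile split, assuming the image has already been binarized into rows of 0/1 values (1 = ink). The function names and the `max_char_gap` parameter are hypothetical and chosen only for this example.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def runs(profile: List[int], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) spans (end exclusive) where profile >= min_ink."""
    spans = []
    start = None
    for i, value in enumerate(profile):
        if value >= min_ink and start is None:
            start = i
        elif value < min_ink and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def merge_close(spans: List[Tuple[int, int]], max_gap: int) -> List[Tuple[int, int]]:
    """Merge neighbouring spans separated by at most max_gap blank positions."""
    merged: List[Tuple[int, int]] = []
    for start, end in spans:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def segment_cells(binary: List[List[int]], max_char_gap: int = 1) -> List[List[Box]]:
    """Split a binarized page into text lines, then each line into cells.

    Lines are found from the horizontal ink profile; cells are found from
    the vertical ink profile of each line, with narrow gaps (between the
    characters of one number) merged so that only wider gaps split cells.
    """
    width = len(binary[0]) if binary else 0
    row_profile = [sum(row) for row in binary]
    result = []
    for top, bottom in runs(row_profile):
        col_profile = [
            sum(binary[y][x] for y in range(top, bottom)) for x in range(width)
        ]
        columns = merge_close(runs(col_profile), max_char_gap)
        result.append([(left, top, right, bottom) for left, right in columns])
    return result


if __name__ == "__main__":
    # Tiny synthetic "page": two text lines with two cells each.
    page = [[int(c) for c in row] for row in [
        "0000000000",
        "0110110011",
        "0110110011",
        "0000000000",
        "0011000110",
    ]]
    for line in segment_cells(page):
        print(line)
```

Each box uses the same (left, top, right, bottom) order as Pillow's `Image.crop`, so a cell could be cropped from the original image and passed to `pytesseract.image_to_string` with a single-line mode such as `config='--psm 7'`. On a real photo the gap threshold would need tuning, and skew or ruled table lines would have to be dealt with first.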