With this configuration:

    tesseract 3.05.01
     leptonica-1.75.3
      libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.6.20 : zlib 1.2.8

Running:

    tesseract --psm 7 -l eng 24-block-0-L-42.png out

gives me:

    3765 Sexualhormonbind. Globulin 1, 15 30 , 16

Upscaling the image to a height of 50px gives me:

    3765 Sexualhormonbind. Globulin 1,15 30,16

Attached you will find the hOCR output I get with your command.

This is the result for the second image (as is):

    3620 Risen 1,15 2,68

For images like this you may also cut the line into three parts:

    3765 Sexualhormonbind. Globulin 1,15 30,16

and use a different "tessedit_char_whitelist" for each part, like this:

    tesseract --psm 7 -l eng -c tessedit_char_whitelist="1234567890" crop.png out

Bye

Lorenzo

2018-06-22 16:47 GMT+02:00 <[email protected]>:

> I have tried adding margins to the lines, but it did not improve the
> results.
>
> I also tried other psm values (11, 12, ...), but that did not improve
> the output either.
>
> It looks like the hocr parameter forces the psm to treat the line as a
> page.
>
> Any ideas on how to improve the results?
>
> On Friday, June 15, 2018 at 2:42:00 PM UTC+2, [email protected] wrote:
>>
>> Dear All,
>>
>> In the project I am currently working on, I have a pure text line
>> cropped from a document image.
>>
>> As a next step, I need to recognize the text and, at the same time,
>> get the word coordinates.
>>
>> To get the coordinates, I pass the hocr parameter on the command line
>> and set the page segmentation mode to 7 (line):
>>
>> tesseract file.png out.txt --psm 7 hocr
>>
>> However, the output is really bad: with these parameters the line is
>> treated as a page, and some words are missing from the output.
>>
>> Is there another way to get the word coordinates of that line?
>>

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLzwEX0ynSyEsBbp%3D5NxkusLBdE44JYteiVXZ1VtxtVERw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
Attachment: out.txt.hocr
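On the original question about word coordinates: once tesseract has written an hOCR file, the coordinates can be read from the title attribute of each word element. The following is only a rough sketch in Python, assuming the usual hOCR layout in which every word is a span of class "ocrx_word" whose title starts with "bbox x0 y0 x1 y1". The names parse_hocr_words and _WordCollector are made up for this illustration and are not part of tesseract.

```python
import re
from html.parser import HTMLParser

# Assumed hOCR convention: the box sits in the title attribute,
# e.g. title="bbox 4 3 52 21; x_wconf 91" (pixels, origin at top left).
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class _WordCollector(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for every span of class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._chunks = []   # text pieces seen inside that span

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attr.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside an open word span.
        if self._bbox is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Inline tags such as <strong> or <em> inside a word are ignored;
        # only the closing </span> finishes the word.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._chunks).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return a list of (word, (x0, y0, x1, y1)) tuples from hOCR markup."""
    collector = _WordCollector()
    collector.feed(hocr_text)
    collector.close()
    return collector.words
```

To try it on real output, read the file produced by the hocr run (for example out.hocr when the output base is "out") and pass its contents to parse_hocr_words. This does not change what tesseract recognizes; it only extracts the boxes of the words that made it into the hOCR file.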

