Re: [tesseract-ocr] Re: Word coordinate for single lines.

Lorenzo Bolzani Fri, 22 Jun 2018 12:18:01 -0700

With this configuration:

tesseract 3.05.01
 leptonica-1.75.3
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.6.20 : zlib 1.2.8



Running:

tesseract --psm 7 -l eng 24-block-0-L-42.png out

gives me:

3765 Sexualhormonbind. Globulin 1, 15 30 , 16


Upscaling the image to height 50px gives me:

3765 Sexualhormonbind. Globulin 1,15 30,16
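
If you want to script the upscaling step, the target size can be computed like this. This is only a minimal sketch: the function name and the 400x24 input size are made up for illustration (I am not restating the real image dimensions here), and the resulting tuple would then be handed to whatever image library you use for the actual resize.

```python
def scaled_size(width, height, target_height=50):
    """Return (new_width, new_height) that scales an image to target_height
    while keeping the aspect ratio. Hypothetical helper, not part of tesseract."""
    if width <= 0 or height <= 0 or target_height <= 0:
        raise ValueError("dimensions must be positive")
    # Scale the width by the same factor as the height, rounded to whole pixels.
    new_width = max(1, round(width * target_height / height))
    return new_width, target_height


# Example with an assumed 400x24 px line crop.
print(scaled_size(400, 24))
```

A smooth resampling filter (bicubic or similar) is probably a better choice than nearest-neighbour for text, but that is something to verify on your own images.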


Attached is the hOCR output I get with your command.
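
To get the word coordinates out of an hOCR file, something along these lines should work. It is a sketch using only the Python standard library, and it assumes the usual hOCR layout where each word is a span with class "ocrx_word" and a title attribute starting with "bbox x0 y0 x1 y1". The class name HocrWordParser, the function hocr_words and the sample markup are made up for illustration; the sample is not the content of the attachment.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []
        self._depth = 0     # tags nested inside the current word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # e.g. <strong> or <em> inside a word
            return
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        # Closing tag of the word span itself: store the word.
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None


def hocr_words(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Illustrative sample only, with invented coordinates.
SAMPLE = """<span class='ocr_line' title='bbox 0 0 600 50'>
<span class='ocrx_word' id='word_1_1' title='bbox 4 8 70 40; x_wconf 91'>3765</span>
<span class='ocrx_word' id='word_1_2' title='bbox 90 8 330 42; x_wconf 85'><strong>Globulin</strong></span>
</span>"""

for text, bbox in hocr_words(SAMPLE):
    print(text, bbox)
```

If the image was upscaled before OCR, the coordinates refer to the upscaled image and need to be divided by the scale factor to map back to the original crop.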

This is the result for the second image (as is):

3620 Risen 1,15 2,68


For images like this you could also cut the line into three parts:

3765
Sexualhormonbind. Globulin
1,15 30,16

and use a different "tessedit_char_whitelist" for each, like this:


tesseract --psm 7 -l eng -c tessedit_char_whitelist="1234567890" crop.png out
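
If you end up with many such crops, the per-part commands can be generated from a small table. This is only a sketch: the crop file names, output names and the whitelists for the second and third part are my assumptions (digits plus a comma for the values, no whitelist for the name), so adjust them to your data. It only builds and prints the commands; passing each list to subprocess.run would execute them.

```python
import shlex

DIGITS = "0123456789"

# Hypothetical crops: (image file, output base, whitelist or None).
FIELDS = [
    ("crop-code.png", "out-code", DIGITS),
    ("crop-name.png", "out-name", None),            # free text, no whitelist
    ("crop-values.png", "out-values", DIGITS + ","),
]


def tesseract_argv(image, outbase, whitelist=None, psm=7, lang="eng"):
    """Build the tesseract argument list for one single-line crop."""
    argv = ["tesseract", "--psm", str(psm), "-l", lang]
    if whitelist:
        argv += ["-c", "tessedit_char_whitelist=" + whitelist]
    argv += [image, outbase]
    return argv


for image, outbase, whitelist in FIELDS:
    # shlex.join (Python 3.8+) quotes the arguments for display in a shell.
    print(shlex.join(tesseract_argv(image, outbase, whitelist)))
```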



Bye

Lorenzo

2018-06-22 16:47 GMT+02:00 <[email protected]>:

> I tried adding margins to the lines, but that did not improve the
> results.
>
> I also tried other psm values (11, 12, ...), which did not improve the
> output either.
>
> It looks like the hocr parameter forces the psm to treat the image as a
> page.
>
> Any ideas on how to improve the results?
>
> On Friday, June 15, 2018 at 2:42:00 PM UTC+2, [email protected] wrote:
>>
>> Dear All,
>>
>> In the project that I am currently working on, I have a pure text line
>> cropped from a document image.
>>
>> As a next step, I need to recognize the text and, at the same time,
>> get the word coordinates.
>>
>> To get those coordinates I pass the hocr parameter on the command
>> line and set the page segmentation mode to 7 (line).
>>
>> tesseract file.png out.txt --psm 7 hocr
>>
>> However, the output is really bad because, with these parameters,
>> the line is treated as a page and some words are missing from
>> the output.
>>
>> Is there another way to get the word coordinate of that line?
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/4f861275-6e2d-47ed-bc98-ceb31f6c9fe0%40googlegroups.com.
>
> For more options, visit https://groups.google.com/d/optout.
>


out.txt.hocr
Description: Binary data
