Re: [tesseract-ocr] Not Getting Proper Output using Tesseract

Dmitri Silaev Thu, 28 May 2015 07:44:51 -0700

I see you have a publication on document image processing, so I suppose
you are already familiar with many of these techniques.


These images require slightly different approaches. In general, in both
cases Tess needs some help with layout analysis and with table border or
frame removal.

4.png
-------
- Binarize. I think Otsu would suffice.
- Remove table borders. Use either CC (connected component) analysis
(filter by CC size, nesting level, etc.) or a Hough transform to detect
long straight lines (if table borders touch characters).
- Isolate the rotated text at the right. Tess can't recognize such text.
Unrotate it and OCR it separately. It would probably also need upscaling,
say by 3x.
- Isolate regions with dense text and OCR them separately, one by one. Tess
is bad at recognizing sparse text, let alone text that varies so much in
size.
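
The first two steps above (Otsu binarization, then border removal by CC
size) can be sketched roughly as follows. This is only an illustration in
plain Python on a tiny synthetic image, not the code I would ship: the
function names, the 8-connectivity choice and the size limits are all
assumptions made for the example. On real scans you would normally use an
image library instead, for example OpenCV's cv2.threshold with
cv2.THRESH_OTSU and cv2.connectedComponentsWithStats.

```python
from collections import deque


def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 0..255 grey values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t (dark class)
    sum0 = 0  # sum of grey values in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2  # between-class variance (unnormalized)
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(grey):
    """Return a boolean ink mask: True where the pixel is at or below the threshold."""
    t = otsu_threshold(p for row in grey for p in row)
    return [[p <= t for p in row] for row in grey]


def connected_components(ink):
    """Yield each 8-connected ink component as a list of (row, col) pairs."""
    h, w = len(ink), len(ink[0])
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if not ink[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, comp = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and ink[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            yield comp


def remove_large_components(ink, max_height, max_width):
    """Erase components whose bounding box exceeds either limit (likely borders)."""
    cleaned = [row[:] for row in ink]
    for comp in connected_components(ink):
        rows = [y for y, _ in comp]
        cols = [x for _, x in comp]
        height = max(rows) - min(rows) + 1
        width = max(cols) - min(cols) + 1
        if height > max_height or width > max_width:
            for y, x in comp:
                cleaned[y][x] = False
    return cleaned


def demo_image():
    """Synthetic 7x9 grey image: a dark frame around a small 2x2 dark 'glyph'."""
    h, w = 7, 9
    img = [[230] * w for _ in range(h)]
    for c in range(w):
        img[0][c] = img[h - 1][c] = 20
    for r in range(h):
        img[r][0] = img[r][w - 1] = 20
    for r in (2, 3):
        for c in (3, 4):
            img[r][c] = 20
    return img


if __name__ == "__main__":
    ink = binarize(demo_image())
    cleaned = remove_large_components(ink, max_height=4, max_width=4)
    for row in cleaned:
        print("".join("#" if v else "." for v in row))
```

Note that size filtering only works when the borders do not touch the
characters. If they do, border and text end up in one component, and that
is the case where the Hough transform is the better tool.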

82.png
---------
- Binarize. Otsu.
- Remove the frame. I suppose the easiest way is to filter CCs by pixel
count.
- Upper word. Isolate and OCR separately. Needs prior blurring (to make
characters more "fleshy") and upscaling (to provide more stroke detail to
Tess). Instead of blurring you may use dilation.
- Lower word. Isolate and OCR separately. May require erosion (as Tess's
stock traineddata might not work well for such a bold font).
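
To make the dilation, erosion and upscaling steps concrete, here is a
minimal sketch on a boolean ink mask (True = ink). It is plain Python so it
runs without any libraries. The function names, the 3x3 neighbourhood and
the nearest-neighbour upscaling are assumptions for the example. On a real
image you would more likely upscale the grey image with an interpolating
resampler and then binarize, since nearest-neighbour scaling only enlarges
pixels and adds no stroke detail.

```python
def dilate(ink):
    """3x3 dilation: a pixel becomes ink if it or any neighbour is ink (thickens strokes)."""
    h, w = len(ink), len(ink[0])
    return [[any(ink[y][x]
                 for y in range(max(0, r - 1), min(h, r + 2))
                 for x in range(max(0, c - 1), min(w, c + 2)))
             for c in range(w)]
            for r in range(h)]


def erode(ink):
    """3x3 erosion: a pixel stays ink only if its whole neighbourhood is ink (thins strokes).

    Neighbours outside the image are ignored, i.e. treated as ink.
    """
    h, w = len(ink), len(ink[0])
    return [[all(ink[y][x]
                 for y in range(max(0, r - 1), min(h, r + 2))
                 for x in range(max(0, c - 1), min(w, c + 2)))
             for c in range(w)]
            for r in range(h)]


def upscale(ink, factor):
    """Nearest-neighbour upscaling by an integer factor."""
    h, w = len(ink), len(ink[0])
    return [[ink[r // factor][c // factor] for c in range(w * factor)]
            for r in range(h * factor)]


def ink_count(ink):
    """Number of ink pixels in the mask."""
    return sum(v for row in ink for v in row)


if __name__ == "__main__":
    # A single-pixel "stroke" in the middle of a 5x5 mask.
    thin = [[r == 2 and c == 2 for c in range(5)] for r in range(5)]
    fat = dilate(thin)        # thin upper word -> thicker strokes
    big = upscale(fat, 3)     # then enlarge, e.g. by 3x
    print(ink_count(thin), ink_count(fat), ink_count(big))
```

For the upper word you would dilate and upscale. For the bold lower word
you would try one pass of erode instead. How many passes are needed depends
on the stroke width, so expect some trial and error.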

Locating dense text regions, vertical text and so on can be done by NN
chain analysis.

It seems you have already used all of the above-mentioned methods, judging
by your article's abstract. Tesseract is no miracle; you have to do many
things manually. All of the above is easier to do programmatically, but
might also be done with ImageMagick/shell scripts.

Best regards,
Dmitri Silaev
www.CustomOCR.com





On Thu, May 28, 2015 at 2:47 PM, supriya Das <[email protected]> wrote:

> Hello Dmitri Silaev,
> Thanks for your response. Please tell me the complex processing logic.
> Thanks in advance.
>
> On Thursday, 28 May 2015 15:59:22 UTC+5:30, Dmitri Silaev wrote:
>>
>> You won't get any improvement just by changing a few params. More
>> complex processing is required. Let me know if you're interested in more
>> details.
>>
>> Best regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>>
>>
>>
>>
>> On Thu, May 28, 2015 at 8:50 AM, supriya Das <[email protected]> wrote:
>>
>>> Hello Everybody,
>>>
>>>    I am not getting proper output for a couple of images. What kind of
>>> parameters should be set to get proper output?
>>>    Also, is it possible to set SetPageSegMode with multiple enum values
>>> at a time? Some problem images follow. Thanks in advance.
>>>
>>>
>>> For the images below I am not getting any kind of output. I also tried
>>> changing the ppi to 300 but got no result.
>>>
>>>
>>> <https://lh3.googleusercontent.com/-XlFRIZfDN-k/VWasN7JC1FI/AAAAAAAAAPU/y77aOoveOhk/s1600/4.png>
>>>
>>>
>>> <https://lh3.googleusercontent.com/-jW3aDb_4lZE/VWargKvFZsI/AAAAAAAAAPM/Y26kenYq93U/s1600/82.png>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/7431b25c-47ae-46d1-af90-e2ec80a7b7ca%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/7431b25c-47ae-46d1-af90-e2ec80a7b7ca%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>

