Re: [tesseract-ocr] Not Getting Proper Output using Tesseract

supriya Das Thu, 28 May 2015 22:51:44 -0700

Hello Dmitri Silaev,

   Thanks for your response.


On Thursday, 28 May 2015 20:14:30 UTC+5:30, Dmitri Silaev wrote:
>
> I see you have a publication on document image processing, so I suppose 
> you are already familiar with many of these techniques.
>
> The two images require slightly different approaches. In both cases, 
> Tess needs help with layout analysis and with table border or frame 
> removal.
>
> 4.png
> -------
> - Binarize. I think Otsu would suffice.
> - Remove table borders. Use either connected-component (CC) analysis 
> (filter by CC size, nesting level, etc.) or, if the table borders touch 
> characters, a Hough transform to detect long straight lines.
> - Isolate the rotated text at the right. Tess can't recognize such text. 
> Unrotate it and OCR it separately. It would probably also need upscaling, 
> say by 3x.
> - Isolate regions with dense text and OCR them separately, one by one. 
> Tess is bad at recognizing sparse text, let alone text that varies so 
> much in size.
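
The binarize and border-removal steps above can be sketched roughly as follows. This is only a minimal pure-Python illustration, not the code Dmitri has in mind: the function names (otsu_threshold, binarize, remove_large_components), the list-of-lists image format, and the assumption of dark ink on a light background are all made up for the example. It covers only the "filter by CC size" variant, so it would also delete any character that touches a border.

```python
from collections import deque

def otsu_threshold(gray):
    """Return the Otsu threshold for a 2-D list of 0..255 grey values."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of the grey values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2  # between-class variance
        if between > best_var:
            best_t, best_var = t, between
    return best_t

def binarize(gray, t):
    """1 = ink (assumed darker than the background), 0 = background."""
    return [[1 if v <= t else 0 for v in row] for row in gray]

def connected_components(binary):
    """Yield each 8-connected ink component as a list of (row, col)."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if not binary[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            yield pixels

def remove_large_components(binary, max_w, max_h):
    """Erase components whose bounding box exceeds max_w x max_h."""
    out = [row[:] for row in binary]
    for pixels in connected_components(binary):
        ys = [p[0] for p in pixels]
        xs = [p[1] for p in pixels]
        if max(xs) - min(xs) + 1 > max_w or max(ys) - min(ys) + 1 > max_h:
            for y, x in pixels:
                out[y][x] = 0
    return out
```

The size limits would have to be tuned per image, for example to a little more than the largest expected character. In a real pipeline an image library would normally replace these hand-written loops.
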
>
> 82.png
> ---------
> - Binarize. Otsu.
> - Remove the frame. I suppose the easiest way is to filter CCs by pixel 
> count.
> - Upper word. Isolate and OCR separately. It needs prior blurring (to make 
> the characters more "fleshy") and upscaling (to give Tess more stroke 
> detail). Instead of blurring you may use dilation.
> - Lower word. Isolate and OCR separately. It may require erosion (as Tess's 
> stock traineddata might not work well for such a bold font).
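
To make the dilation, erosion and upscaling steps concrete, here is a small pure-Python sketch. The helper names (dilate, erode, upscale), the 3x3 neighbourhood, and the list-of-lists binary image (1 = ink) are assumptions for illustration only. The upscaling here is plain pixel replication, whereas a real pipeline would more likely use an interpolating resize from an image library.

```python
def _morph(binary, pick):
    """Apply pick (max or min) over each pixel's 3x3 neighbourhood.

    The window is clipped at the image edge, so pixels outside the
    image are ignored rather than treated as background.
    """
    h, w = len(binary), len(binary[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [binary[y][x]
                      for y in range(max(r - 1, 0), min(r + 2, h))
                      for x in range(max(c - 1, 0), min(c + 2, w))]
            out[r][c] = pick(window)
    return out

def dilate(binary):
    """Thicken strokes: a pixel becomes ink if any neighbour is ink."""
    return _morph(binary, max)

def erode(binary):
    """Thin strokes: a pixel stays ink only if all neighbours are ink."""
    return _morph(binary, min)

def upscale(image, factor):
    """Enlarge by an integer factor using pixel replication."""
    out = []
    for row in image:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

# Hypothetical usage: thicken the thin upper word, thin the bold lower one.
# upper_ready = upscale(dilate(upper_word), 3)
# lower_ready = erode(lower_word)
```

How many dilation or erosion passes are needed depends on the stroke width in the actual image, so that is something to try out rather than fix in advance.
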
>
> Locating dense text regions, vertical text and so on can be done by NN 
> chain analysis.
>
> From your article's abstract, it seems you have already used all of the 
> methods mentioned above. Tesseract is no miracle; you have to do many 
> things manually. All of the above is easier to do programmatically, but it 
> could also be done with ImageMagick and shell scripts.
>
> Best regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Thu, May 28, 2015 at 2:47 PM, supriya Das <[email protected]> wrote:
>
>> Hello Dmitri Silaev,
>> Thanks for your response. Please tell me the complex processing logic. 
>> Thanks in advance.
>>
>> On Thursday, 28 May 2015 15:59:22 UTC+5:30, Dmitri Silaev wrote:
>>>
>>> You won't get any improvement just by changing a few params. More 
>>> complex processing is required. Let me know if you're interested in 
>>> more details.
>>>
>>> Best regards,
>>> Dmitri Silaev
>>> www.CustomOCR.com
>>>
>>> On Thu, May 28, 2015 at 8:50 AM, supriya Das <[email protected]> 
>>> wrote:
>>>
>>>> Hello Everybody,
>>>>
>>>>    I am not getting proper output for a couple of images. What 
>>>> parameters should be set to get proper output?
>>>>    Also, is it possible to call SetPageSegMode with multiple enum 
>>>> values at a time? Some problem images follow. Thanks in advance.
>>>>
>>>>
>>>> For the images below I am not getting any output at all. I also tried 
>>>> changing the ppi to 300, but still got no result.
>>>>
>>>>
>>>> <https://lh3.googleusercontent.com/-XlFRIZfDN-k/VWasN7JC1FI/AAAAAAAAAPU/y77aOoveOhk/s1600/4.png>
>>>>
>>>>
>>>> <https://lh3.googleusercontent.com/-jW3aDb_4lZE/VWargKvFZsI/AAAAAAAAAPM/Y26kenYq93U/s1600/82.png>
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/7431b25c-47ae-46d1-af90-e2ec80a7b7ca%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/7431b25c-47ae-46d1-af90-e2ec80a7b7ca%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/81ba6a60-0542-44ab-9c48-a0fef69ff363%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
