Re: [tesseract-ocr] General strategies for dealing with problem images

gl00637 Tue, 19 Mar 2019 13:23:26 -0700

Thank you for your response. My experience with OCR is limited to 
converting screenshots I take online; yours is far more extensive, I 
think.


And thank you particularly for items 2 and 5. Slight skewing of the image 
may explain the trouble better than the distortions in size and/or aspect 
ratio that I had suspected, because skewing can be localized to only parts 
of the image, which better describes tesseract's behavior with such images 
(good in parts, garbled in others).
Item 5 seems especially promising in that I could use it to ASCII-ize the 
text that tesseract produces. I do a lot of post-editing with vim scripts 
that often require ASCII text to work properly. I can't really cut out 
numerals, though, and 0 and O aren't the only problem there: 5 and S, 1 and 
I or l, and J and ] get confused as well.
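
To make the ASCII-izing idea concrete, here is a rough sketch of the kind of 
post-processing I have in mind, in Python rather than vim script. The 
function names, the punctuation table, and the "mostly digits" rule are my 
own assumptions for illustration, not anything tesseract provides.

```python
import unicodedata

# Hypothetical table for punctuation that has no useful ASCII decomposition;
# extend it to suit your own text.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # horizontal ellipsis
})

# Look-alike letters mapped to digits (assumed list, taken from the pairs above).
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "S": "5", "I": "1", "l": "1"})


def asciiize(text: str) -> str:
    """Return an ASCII-only approximation of text."""
    text = text.translate(PUNCT_MAP)
    # NFKD splits ligatures and accented letters into base + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping non-ASCII code points removes the marks (and anything unmapped).
    return decomposed.encode("ascii", "ignore").decode("ascii")


def fix_numeric_token(token: str) -> str:
    """Swap look-alike letters for digits, but only in mostly-numeric tokens."""
    digits = sum(ch.isdigit() for ch in token)
    if digits * 2 > len(token):  # more than half the characters are digits
        return token.translate(DIGIT_FIXES)
    return token
```

The digit rule is deliberately conservative: a word such as "SOS" is left 
alone, while "2O19" becomes "2019". It cannot help with J and ], which are 
confused in ordinary words rather than in numbers.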

On Monday, March 18, 2019 at 10:03:18 PM UTC-7, Jonathan wrote:
>
> I don't really agree with your statement. There are a lot of things we had 
> to consider in image processing before tesseract finally gave us accurate 
> results, but it all makes sense. Here is our actual pipeline:
>
>  1 - Clean up the image: remove any artifacts of the camera or scanning 
> device, crop to the paper accurately, remove noise, binarize
>  2 - Deskew the image: make the text lines horizontal
>  3 - Cut out the zone of interest: take the text zones of interest in the 
> document, using a DNN to recognize the zones
>  4 - Clean the text zone: remove any irrelevant parts of the image (like 
> lines, tables, stamps)
>  5 - Create a whitelist of probable characters based on the zone (this one 
> improves accuracy a lot!)
>  6 - Submit to tesseract with appropriate settings for the language
>
> 1: it is understandable how noise or image quality could affect recognition
> 2: tesseract expects lines of text to be straight
> 3: this reduces the processing time and allows us to focus on the zone for 
> further cleaning (next steps) or custom parameters before submitting
> 4: lines, tables, and other things can alter recognition, because a piece 
> of a line is sometimes recognised as |, -, _, l, or 1. It can also affect 
> nearby characters, especially when working with Chinese-based characters
> 5: whitelisting based on the content helps recognition a lot. A simple 
> example: if you are looking for numbers, whitelist "1234567890", because 0 
> is close to O. Even humans make that mistake, which is why we banned O from 
> Wifi passwords :laugh:
> 6: tesseract's settings can improve recognition a lot when working 
> with non-English scripts or when the image is not perfect (tesseract works 
> best at 300 dpi)
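
For reference, a minimal sketch of how steps 5 and 6 might look when driving 
the tesseract command-line tool from Python. The helper name, the file name 
`zone.png`, and the defaults are my own assumptions; the flags (`-l`, `--psm`, 
`--dpi`, `-c tessedit_char_whitelist=...`) are those of the Tesseract 4.x CLI. 
As far as I know the LSTM engine in 4.0 did not honor the whitelist and 
support only returned in 4.1, so check your version before relying on it.

```python
import shlex


def tesseract_args(image, outbase="stdout", lang="eng", psm=6,
                   whitelist=None, dpi=None):
    """Build an argument list for the tesseract command-line tool."""
    # psm 6 asks tesseract to assume a single uniform block of text.
    args = ["tesseract", image, outbase, "-l", lang, "--psm", str(psm)]
    if dpi is not None:
        # Tell tesseract the resolution instead of letting it guess.
        args += ["--dpi", str(dpi)]
    if whitelist:
        # Restrict recognition to the characters expected in this zone.
        args += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return args


# Hypothetical numeric zone cut out of a larger document.
cmd = tesseract_args("zone.png", whitelist="0123456789", dpi=300)
print(shlex.join(cmd))  # shows the command without running it
```

Passing the list to `subprocess.run(cmd, capture_output=True, text=True)` 
would run it and return the recognised text on stdout, provided tesseract is 
installed and on the PATH.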
>
> We have gone from 10% accuracy to nearly 95% now. Each image is different 
> and each may require different processing or parameters. Making a solution 
> that fits all is very complex, but I still think it is possible if the 
> application is specific enough. I guess that is why it is not included in 
> tesseract: making it work very well for a specific use case would break 
> others.
>
> I guess you just have to find the right pre-processing for your kind of 
> image.
>
> Hope it helps
>
> On Mon, 18 Mar 2019 at 18:59, <[email protected]> wrote:
>
>> I would like some advice concerning the general use of tesseract, because 
>> my experience with it tends to two extremes: either tesseract performs 
>> flawlessly, with no prior modification of the image necessary except 
>> cropping to the text and (most significant) enlarging the image by a factor 
>> of 2 or 4; or tesseract's output is riddled with errors.
>>
>> Following advice to improve the quality of the image (Fred's textcleaner 
>> script, or applying the ImageMagick functions it uses individually) 
>> usually produces a significant improvement in the *human readability* of 
>> the image, but as regards tesseract it usually produces no improvement, 
>> and often an actual deterioration in its performance.
>>
>> So I am looking for another reason to explain tesseract's difficulty with 
>> certain images. I thought perhaps its performance may be dependent on its 
>> trying to identify the particular font used, but 
>> https://github.com/tesseract-ocr/docs/blob/master/tesseracticdar2007.pdf 
>> seems to say not. 
>>
>> The only other possibility I can think of is that either the size or the 
>> aspect ratio of the text in the image has been subtly deformed. If so, it 
>> is not apparent to my eye, but tesseract is certainly very sensitive to 
>> size changes, because, when it works, resizing the image makes such a 
>> dramatic improvement.
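
The enlargement factor ties in with the remark in the reply that tesseract 
works best at 300 dpi. A back-of-the-envelope sketch, assuming screenshots are 
nominally 72 or 96 dpi (an assumption on my part, as are the function names; 
what really matters to tesseract is the pixel height of the characters, not 
the dpi tag in the file):

```python
def scale_for_target_dpi(source_dpi: float, target_dpi: float = 300.0) -> float:
    """Return the enlargement factor that takes source_dpi to target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return target_dpi / source_dpi


def scaled_size(width: int, height: int, source_dpi: float,
                target_dpi: float = 300.0) -> tuple:
    """Return the pixel dimensions of the image after enlargement."""
    factor = scale_for_target_dpi(source_dpi, target_dpi)
    return round(width * factor), round(height * factor)


# Nominal screen resolutions (assumed) and the factors they imply.
for dpi in (72, 96):
    print(dpi, round(scale_for_target_dpi(dpi), 2))
```

Under those assumptions the factor comes out at roughly 3 to 4, which is in 
line with enlarging by 2 or 4 being the one change that reliably helps.
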
>>
>> Does anyone have other suggestions as to the nature of the problem? I'm 
>> not asking for detailed advice here, which is why I've given no image 
>> samples, but for general lines of attack, strategy rather than tactics. 
>> Thank you.
>>
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/15dcee7c-0815-47c3-9c74-29f8e90a7ca2%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
> -- 
> Jonathan
> 06.49.32.74.55
>

