[tesseract-ocr] Re: poor recognition of 'fi'

Tom Morris Tue, 16 Jun 2015 09:43:44 -0700

It's difficult to tell what the problem is without any example images.  Are 
you saying that there are ligatures in the image and you don't want them 
recognized as such, or that there are no ligatures but the characters are 
touching due to low resolution, a poor-quality scan, over-inking, very 
tight kerning, or ...?


If everything else is satisfactory except for the occasional composed 
character being generated, why not just add a simple post-processing step 
to decompose the ligatures into their constituent characters?  It's a 
straight string substitution, since these characters are not confusable 
with anything else.
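
As a rough illustration of that substitution, here is a minimal Python 
sketch. The function name decompose_ligatures is hypothetical, and the 
mapping is an assumption taken from the Latin ligature decompositions 
quoted further down in this thread (FB00 to FB06); it is not anything 
Tesseract itself provides. Extend the table if other ligatures turn up 
in your output.

```python
# Sketch only: map each Latin ligature code point to the constituent
# characters given in the Unicode NamesList excerpt quoted in this thread.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "\u017ft",  # long s + t, as listed (017F 0074)
    "\ufb06": "st",
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def decompose_ligatures(text):
    """Return text with the ligatures above replaced by plain letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    # Samples taken from the OCR output reported in the original question.
    for sample in ("\ufb01sh", "\ufb02ambeau,", "arti\ufb01cial"):
        print(decompose_ligatures(sample))
```

If plain ASCII is the goal, FB05 could be mapped to "st" instead of the 
long-s form.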

Tom

On Monday, June 15, 2015 at 1:55:08 PM UTC-4, Rick Leir wrote:
>
> John: thanks, I had not seen that!  How does "tessedit_char_blacklist" 
> affect OCR speed? Accuracy? I want to use it, but feel as if I am walking 
> on thin ice.
>
> Here is a list of ligatures from 
> http://www.unicode.org/Public/UNIDATA/NamesList.txt . Which ones do you 
> commonly see in Tesseract output?
>
> FB00    LATIN SMALL LIGATURE FF
>     # 0066 0066
> FB01    LATIN SMALL LIGATURE FI
>     # 0066 0069
> FB02    LATIN SMALL LIGATURE FL
>     # 0066 006C
> FB03    LATIN SMALL LIGATURE FFI
>     # 0066 0066 0069
> FB04    LATIN SMALL LIGATURE FFL
>     # 0066 0066 006C
> FB05    LATIN SMALL LIGATURE LONG S T
>     # 017F 0074
> FB06    LATIN SMALL LIGATURE ST
>     # 0073 0074
>
> There are also Armenian ligatures, Hebrew, Arabic ..
>
> On Wednesday, June 10, 2015 at 9:23:12 AM UTC-4, John Slade wrote:
>>
>> Have a look at the options "tessedit_char_blacklist" and 
>> "tessedit_char_whitelist".  You could blacklist any ligatures you aren't 
>> interested in.
>>
>> Or go the other way and just whitelist the things you want - for instance 
>> you could whitelist to just the printable ascii characters.
>>
>> John
>>
>>
>>
>> On Monday, 8 June 2015 17:22:10 UTC+1, Rick Leir wrote:
>>>
>>> This problem with ligatures or digraphs is appearing frequently. How 
>>> can I avoid it? I want simple output text, without ligatures. It is 
>>> possible that the 'f' and 'i' are touching in the image. Is there a way to 
>>> pass hints to Tesseract? Version 3.03 on Linux. TIA
>>>
>>> image text: fish
>>> OCR: "\x{fb01}sh";
>>> utf8: ﬁsh  
>>>
>>> image text: flambeau
>>> OCR: "\x{fb02}ambeau,";
>>> utf8: ﬂambeau, 
>>>
>>>  "\x{fb01}xed";
>>> ﬁxed  
>>>
>>> "arti\x{fb01}cial";
>>> artiﬁcial 
>>>
>>
