use Encode qw(decode encode);
...
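    # Decode the raw hOCR bytes into Perl characters so \x{...} matches code points
    $hocr = decode( 'UTF-8', $rawhocr );

    # Decompose the ligatures Tesseract emitted into plain letters
    $hocr =~ s/\x{FB00}/ff/g;
    $hocr =~ s/\x{FB01}/fi/g;

    # Re-encode to UTF-8 octets for output
    $octets = encode( 'UTF-8', $hocr );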
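
The two substitutions above only cover FB00 and FB01. For anyone who wants to handle the whole FB00..FB06 range from the NamesList.txt excerpt quoted below in one pass, here is a minimal sketch. The helper name decompose_ligatures and the %LIGATURE table are my own, not part of Tesseract or Encode, and it assumes the input is valid UTF-8 octets:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Latin ligatures U+FB00..U+FB06 mapped to their constituent letters,
# following the decompositions listed in Unicode's NamesList.txt.
my %LIGATURE = (
    "\x{FB00}" => 'ff',
    "\x{FB01}" => 'fi',
    "\x{FB02}" => 'fl',
    "\x{FB03}" => 'ffi',
    "\x{FB04}" => 'ffl',
    "\x{FB05}" => "\x{017F}t",    # long s + t, as listed; change to 'st' if preferred
    "\x{FB06}" => 'st',
);

# Hypothetical helper: takes UTF-8 octets and returns UTF-8 octets
# with every ligature in the table replaced by its plain letters.
sub decompose_ligatures {
    my ($raw) = @_;
    my $text = decode( 'UTF-8', $raw );                  # octets -> characters
    $text =~ s/([\x{FB00}-\x{FB06}])/$LIGATURE{$1}/g;    # table-driven replacement
    return encode( 'UTF-8', $text );                     # characters -> octets
}
```

It would be called as `$octets = decompose_ligatures($rawhocr);`. A single character-class match keeps it to one pass over the text, and adding more ligatures only means extending the table and the range.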

On Tuesday, June 16, 2015 at 12:42:59 PM UTC-4, Tom Morris wrote:
>
> It's difficult to tell what the problem is without any example images.
> Are you saying that there are ligatures in the image and you don't want 
> them recognized as such or that there are not ligatures, but the characters 
> are touching due to low resolution or poor quality scan or over inking or 
> very tight kerning or ...?
>
> If everything else is satisfactory except for the occasional composed 
> character being generated, why not just add a simple post processing step 
> to decompose the ligatures into their constituent characters?  It's a 
> straight string substitution for characters which are not confusable with 
> anything else.
>
> Tom
>
> On Monday, June 15, 2015 at 1:55:08 PM UTC-4, Rick Leir wrote:
>>
>> John: thanks, I had not seen that! How does "tessedit_char_blacklist" 
>> affect OCR speed? Accuracy? I want to use it, but feel as if I am walking 
>> on thin ice.
>>
>> Here is a list of ligatures from 
>> http://www.unicode.org/Public/UNIDATA/NamesList.txt . Which ones do you 
>> commonly see in Tesseract output?
>>
>> FB00    LATIN SMALL LIGATURE FF
>>     # 0066 0066
>> FB01    LATIN SMALL LIGATURE FI
>>     # 0066 0069
>> FB02    LATIN SMALL LIGATURE FL
>>     # 0066 006C
>> FB03    LATIN SMALL LIGATURE FFI
>>     # 0066 0066 0069
>> FB04    LATIN SMALL LIGATURE FFL
>>     # 0066 0066 006C
>> FB05    LATIN SMALL LIGATURE LONG S T
>>     # 017F 0074
>> FB06    LATIN SMALL LIGATURE ST
>>     # 0073 0074
>>
>> There are also Armenian, Hebrew, and Arabic ligatures.
>>
>> On Wednesday, June 10, 2015 at 9:23:12 AM UTC-4, John Slade wrote:
>>>
>>> Have a look at the options "tessedit_char_blacklist" and 
>>> "tessedit_char_whitelist".  You could blacklist any ligatures you aren't 
>>> interested in.
>>>
>>> Or go the other way and just whitelist the things you want - for 
>>> instance you could whitelist just the printable ASCII characters.
>>>
>>> John
>>>
>>> On Monday, 8 June 2015 17:22:10 UTC+1, Rick Leir wrote:
>>>>
>>>> This problem with ligatures or digraphs appears frequently. How can I 
>>>> avoid it? I want simple output text, without ligatures. It is possible 
>>>> that the 'f' and 'i' are touching in the image. Is there a way to pass 
>>>> hints to Tesseract? I am using version 3.03 on Linux. TIA
>>>>
>>>> image text: fish
>>>> OCR: "\x{fb01}sh";
>>>> utf8: fish  
>>>>
>>>> image text: flambeau
>>>> OCR: "\x{fb02}ambeau,";
>>>> utf8: flambeau, 
>>>>
>>>>  "\x{fb01}xed";
>>>> fixed  
>>>>
>>>> "arti\x{fb01}cial";
>>>> artificial 
>>>>
>>>
