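use Encode qw(decode encode);
...
# Decode the raw hOCR octets so the substitutions match characters, not bytes.
$hocr = decode( 'UTF-8', $rawhocr );
# Replace each ligature with its constituent letters.
$hocr =~ s/\x{FB00}/ff/g;
$hocr =~ s/\x{FB01}/fi/g;
$hocr =~ s/\x{FB02}/fl/g;
# Re-encode to UTF-8 octets for output.
$octets = encode('UTF-8', $hocr);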
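
The substitutions above handle ligatures one at a time. As a rough sketch (not tried against real Tesseract output), the same idea can be driven by a table covering the whole Latin block U+FB00 to U+FB06 from NamesList.txt. The names %LIGATURES and decompose_ligatures are my own, made up for illustration. Mapping U+FB05 to long s + t simply follows the decomposition given in that list; use 'st' instead if you want plain ASCII.

```perl
use strict;
use warnings;

# Hypothetical lookup table: Latin ligatures U+FB00..U+FB06 and the
# decompositions listed in Unicode's NamesList.txt.
my %LIGATURES = (
    "\x{FB00}" => 'ff',
    "\x{FB01}" => 'fi',
    "\x{FB02}" => 'fl',
    "\x{FB03}" => 'ffi',
    "\x{FB04}" => 'ffl',
    "\x{FB05}" => "\x{017F}t",    # long s + t; use 'st' for plain ASCII
    "\x{FB06}" => 'st',
);

# Hypothetical helper: takes a decoded (character) string and returns a
# copy with every ligature in the table replaced by its letters.
sub decompose_ligatures {
    my ($text) = @_;
    $text =~ s/([\x{FB00}-\x{FB06}])/$LIGATURES{$1}/g;
    return $text;
}
```

Something like $hocr = decompose_ligatures( decode('UTF-8', $rawhocr) ); would then stand in for the individual s/// lines, assuming $rawhocr holds UTF-8 octets as above.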
On Tuesday, June 16, 2015 at 12:42:59 PM UTC-4, Tom Morris wrote:
>
> It's difficult to tell what the problem is without any example images.
> Are you saying that there are ligatures in the image and you don't want
> them recognized as such or that there are not ligatures, but the characters
> are touching due to low resolution, a poor-quality scan, over-inking,
> very tight kerning, or ...?
>
> If everything else is satisfactory except for the occasional composed
> character being generated, why not just add a simple post-processing step
> to decompose the ligatures into their constituent characters? It's a
> straight string substitution for characters which are not confusable with
> anything else.
>
> Tom
>
> On Monday, June 15, 2015 at 1:55:08 PM UTC-4, Rick Leir wrote:
>>
>> John: thanks, I had not seen that! How does "tessedit_char_blacklist"
>> affect OCR speed? Accuracy? I want to use it, but feel as if I am walking
>> on thin ice.
>>
>> Here is a list of ligatures from
>> http://www.unicode.org/Public/UNIDATA/NamesList.txt . Which ones do you
>> commonly see in Tesseract output?
>>
>> FB00 LATIN SMALL LIGATURE FF
>> # 0066 0066
>> FB01 LATIN SMALL LIGATURE FI
>> # 0066 0069
>> FB02 LATIN SMALL LIGATURE FL
>> # 0066 006C
>> FB03 LATIN SMALL LIGATURE FFI
>> # 0066 0066 0069
>> FB04 LATIN SMALL LIGATURE FFL
>> # 0066 0066 006C
>> FB05 LATIN SMALL LIGATURE LONG S T
>> # 017F 0074
>> FB06 LATIN SMALL LIGATURE ST
>> # 0073 0074
>>
>> There are also Armenian, Hebrew, and Arabic ligatures, among others.
>>
>> On Wednesday, June 10, 2015 at 9:23:12 AM UTC-4, John Slade wrote:
>>>
>>> Have a look at the options "tessedit_char_blacklist" and
>>> "tessedit_char_whitelist". You could blacklist any ligatures you aren't
>>> interested in.
>>>
>>> Or go the other way and just whitelist the things you want - for
>>> instance, you could whitelist just the printable ASCII characters.
>>>
>>> John
>>>
>>> On Monday, 8 June 2015 17:22:10 UTC+1, Rick Leir wrote:
>>>>
>>>> This problem with ligatures or digraphs is appearing frequently. How
>>>> can I avoid it? I want simple output text, without ligatures. It is
>>>> possible that the 'f' and 'i' are touching in the image. Is there a way to
>>>> pass hints to Tesseract? I am using version 3.03 on Linux. TIA
>>>>
>>>> image text: fish
>>>> OCR: "\x{fb01}sh";
>>>> utf8: fish
>>>>
>>>> image text: flambeau
>>>> OCR: "\x{fb02}ambeau,";
>>>> utf8: flambeau,
>>>>
>>>> OCR: "\x{fb01}xed";
>>>> utf8: fixed
>>>>
>>>> OCR: "arti\x{fb01}cial";
>>>> utf8: artificial
>>>>
>>>