It's difficult to tell what the problem is without any example images. Are you saying that there are ligatures in the image and you don't want them recognized as such, or that there are no ligatures but the characters are touching because of low resolution, a poor-quality scan, over-inking, very tight kerning, or something else?
If everything else is satisfactory except for the occasional composed character being generated, why not just add a simple post-processing step to decompose the ligatures into their constituent characters? It's a straight string substitution for characters which are not confusable with anything else.

Tom

On Monday, June 15, 2015 at 1:55:08 PM UTC-4, Rick Leir wrote:
>
> John: thanks, I had not seen that! How does "tessedit_char_blacklist"
> affect OCR speed? Accuracy? I want to use it, but feel as if I am walking
> on thin ice..
>
> Here is a list of ligatures from
> http://www.unicode.org/Public/UNIDATA/NamesList.txt , which ones do you
> commonly see in Tesseract output?
>
> FB00 LATIN SMALL LIGATURE FF
>     # 0066 0066
> FB01 LATIN SMALL LIGATURE FI
>     # 0066 0069
> FB02 LATIN SMALL LIGATURE FL
>     # 0066 006C
> FB03 LATIN SMALL LIGATURE FFI
>     # 0066 0066 0069
> FB04 LATIN SMALL LIGATURE FFL
>     # 0066 0066 006C
> FB05 LATIN SMALL LIGATURE LONG S T
>     # 017F 0074
> FB06 LATIN SMALL LIGATURE ST
>     # 0073 0074
>
> There are also Armenian ligatures, Hebrew, Arabic ..
>
> On Wednesday, June 10, 2015 at 9:23:12 AM UTC-4, John Slade wrote:
>>
>> Have a look at the options "tessedit_char_blacklist" and
>> "tessedit_char_whitelist". You could blacklist any ligatures you aren't
>> interested in.
>>
>> Or go the other way and just whitelist the things you want - for instance
>> you could whitelist to just the printable ascii characters.
>>
>> John
>>
>>
>>
>> On Monday, 8 June 2015 17:22:10 UTC+1, Rick Leir wrote:
>>>
>>> This problem with ligatures or digraphs is appearing frequently, how
>>> can I avoid it? I want simple output text, without ligatures. It is
>>> possible that the 'f' and 'i' are touching in the image. Is there a way to
>>> pass hints to Tesseract? Version 3.03 on Linux.
>>> TIA
>>>
>>> image text: fish
>>> OCR: "\x{fb01}sh";
>>> utf8: fish
>>>
>>> image text: flambeau
>>> OCR: "\x{fb02}ambeau,";
>>> utf8: flambeau,
>>>
>>> "\x{fb01}xed";
>>> fixed
>>>
>>> "arti\x{fb01}cial";
>>> artificial
>>>
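The post-processing substitution suggested at the top of this reply could look something like the Python sketch below. It is only an illustration, not anything Tesseract provides: the names `LIGATURES` and `decompose_ligatures` are hypothetical, and the table covers only the seven Latin ligatures (U+FB00 to U+FB06) from the Unicode list quoted in this thread, using the decompositions given there.

```python
# Hypothetical post-processing helper for OCR output (not part of Tesseract).
# The mapping follows the decompositions in the Unicode NamesList excerpt
# quoted in this thread; Armenian, Hebrew, and Arabic ligatures are not covered.
LIGATURES = {
    "\ufb00": "ff",        # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",        # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",        # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",       # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",       # LATIN SMALL LIGATURE FFL
    "\ufb05": "\u017ft",   # LATIN SMALL LIGATURE LONG S T (long s + t)
    "\ufb06": "st",        # LATIN SMALL LIGATURE ST
}

# Build the translation table once; str.translate then does the whole
# substitution in a single pass over the text.
_TABLE = str.maketrans(LIGATURES)


def decompose_ligatures(text: str) -> str:
    """Replace each Latin ligature code point with its constituent letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    # Samples taken from the OCR output reported earlier in the thread.
    for sample in ("\ufb01sh", "\ufb02ambeau,", "\ufb01xed", "arti\ufb01cial"):
        print(decompose_ligatures(sample))
```

An explicit table is used here so that nothing else in the text changes. `unicodedata.normalize("NFKC", text)` would also fold these ligatures, but it rewrites many other compatibility characters as well, which may or may not be what you want.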

