Since 'fi' and other ligatures generally get OCRed to a single ligature
character (for example U+FB01 for 'fi'), I just run a post-OCR sed script
on Linux to replace them with plain letters.
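
A minimal sketch of what such a post-OCR cleanup could look like, assuming the OCR output is UTF-8 text. The function name fix_ligatures and the set of ligatures covered (U+FB00 to U+FB04) are illustrative assumptions, not the script actually used here.

```shell
#!/bin/sh
# Hypothetical helper (not the poster's actual script): expand the Latin
# ligature code points U+FB00..U+FB04 into plain letters.
# Reads OCR text on stdin, writes cleaned text on stdout.
fix_ligatures() {
    # Build the UTF-8 byte sequences with printf octal escapes so the
    # script itself stays pure ASCII.
    lig_ff=$(printf '\357\254\200')    # U+FB00, "ff"
    lig_fi=$(printf '\357\254\201')    # U+FB01, "fi"
    lig_fl=$(printf '\357\254\202')    # U+FB02, "fl"
    lig_ffi=$(printf '\357\254\203')   # U+FB03, "ffi"
    lig_ffl=$(printf '\357\254\204')   # U+FB04, "ffl"

    # Each ligature is a distinct code point, so the order of the
    # substitutions does not matter.
    sed -e "s/${lig_ffi}/ffi/g" \
        -e "s/${lig_ffl}/ffl/g" \
        -e "s/${lig_ff}/ff/g" \
        -e "s/${lig_fi}/fi/g" \
        -e "s/${lig_fl}/fl/g"
}
```

For example, once `tesseract page.png page` has written page.txt, something like `fix_ligatures < page.txt > page.clean.txt` would apply the cleanup (the file names are placeholders).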

On Mon, Jun 8, 2015 at 12:22 PM, Rick Leir <[email protected]> wrote:

> This problem with ligatures or digraphs appears frequently. How can I
> avoid it? I want simple output text, without ligatures. It is possible
> that the 'f' and 'i' are touching in the image. Is there a way to pass
> hints to Tesseract? I am using version 3.03 on Linux. TIA
>
> image text: fish
> OCR: "\x{fb01}sh";
> utf8: fish
>
> image text: flambeau
> OCR: "\x{fb02}ambeau,";
> utf8: flambeau,
>
> OCR: "\x{fb01}xed";
> utf8: fixed
>
> OCR: "arti\x{fb01}cial";
> utf8: artificial



-- 
/greg

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CA%2BOX7tq1-TOnJFYsNLOYei2S5LJODp9qKcVf30Z1uowOFa7bng%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
