Since 'fi' and other ligatures generally get OCRed to their own single ligature characters, I just run a post-OCR sed script on Linux to replace them with the plain letters.
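
The sed script itself isn't included in this message. As a rough illustration of the same post-processing idea, here is a minimal sketch in Python rather than sed. It assumes the OCR output is already decoded Unicode text and that only the Latin ligatures U+FB00 through U+FB06 need expanding; the names `LIGATURES` and `expand_ligatures` are made up for this example.

```python
# Hypothetical post-OCR cleanup: expand Latin ligature code points
# (Unicode "Alphabetic Presentation Forms" block) into plain letters.
LIGATURES = {
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
    "\ufb05": "st",   # LATIN SMALL LIGATURE LONG S T (flattened to "st" here)
    "\ufb06": "st",   # LATIN SMALL LIGATURE ST
}

# str.maketrans accepts a dict of single characters mapped to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Return text with each ligature character replaced by its letters."""
    return text.translate(_TABLE)
```

Calling `expand_ligatures()` on each line of Tesseract's text output would turn "\ufb01sh" into "fish" and "arti\ufb01cial" into "artificial". Python's `unicodedata.normalize("NFKC", text)` also expands these ligatures, but it rewrites many other compatibility characters as well, so an explicit table is the more conservative choice.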

On Mon, Jun 8, 2015 at 12:22 PM, Rick Leir <[email protected]> wrote:

> This problem with ligatures or digraphs is appearing frequently, how can
> I avoid it? I want simple output text, without ligatures. It is possible
> that the 'f' and 'i' are touching in the image. Is there a way to pass
> hints to Tesseract? Version 3.03 on Linux. TIA
>
> image text: fish
> OCR: "\x{fb01}sh";
> utf8: fish
>
> image text: flambeau
> OCR: "\x{fb02}ambeau,";
> utf8: flambeau,
>
> "\x{fb01}xed";
> fixed
>
> "arti\x{fb01}cial";
> artificial

--
/greg

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CA%2BOX7tq1-TOnJFYsNLOYei2S5LJODp9qKcVf30Z1uowOFa7bng%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

