Hi Greg,

Which ligatures do you run into with Tesseract that you need to fix up after OCR?

Thanks -- Rick
For anyone having trouble with utf-8 in sed, see: http://stackoverflow.com/questions/27072558/sed-and-utf-8-encoding

On Monday, June 8, 2015 at 12:30:05 PM UTC-4, gdunkel wrote:
>
> Since 'fi' and other ligatures generally get OCRed to a separate
> character, I just run a post-ocr sed script to take care of them, in Linux.
>
> On Mon, Jun 8, 2015 at 12:22 PM, Rick Leir <[email protected]> wrote:
>
>> This problem with ligatures or digraphs is appearing frequently, how can
>> I avoid it? I want simple output text, without ligatures. It is possible
>> that the 'f' and 'i' are touching in the image. Is there a way to pass
>> hints to Tesseract? Version 3.03 on Linux. TIA
>>
>> image text: fish
>> OCR: "\x{fb01}sh";
>> utf8: fish
>>
>> image text: flambeau
>> OCR: "\x{fb02}ambeau,";
>> utf8: flambeau,
>>
>> "\x{fb01}xed";
>> fixed
>>
>> "arti\x{fb01}cial";
>> artificial
>
> --
> /greg

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/518fa216-f54d-48b1-a381-5f79a0c2684c%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
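
In case it helps anyone who would rather not fight sed's UTF-8 handling: below is a minimal sketch of the same post-OCR cleanup idea in Python. This is not Greg's sed script, just an illustration. The names `LIGATURES` and `expand_ligatures` are made up for this example, and the table only covers the Latin ligatures in U+FB00 to U+FB06, on the assumption that those are the ones Tesseract emits here (the thread only shows U+FB01 and U+FB02).

```python
# Hypothetical post-OCR cleanup: expand Latin ligature code points
# (U+FB00..U+FB06) into their plain-letter equivalents.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long-s + t ligature, flattened to plain "st"
    "\ufb06": "st",
}

# str.translate wants a table keyed by code point; maketrans builds it.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Return text with the ligatures listed above replaced by plain letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    # The four OCR outputs quoted in the original question.
    for sample in ("\ufb01sh", "\ufb02ambeau,", "\ufb01xed", "arti\ufb01cial"):
        print(expand_ligatures(sample))
```

Reading and writing the OCR output with an explicit `encoding="utf-8"` should keep this independent of the locale settings that trip up sed. `unicodedata.normalize("NFKC", text)` would also expand these ligatures, but it rewrites many other compatibility characters too, so an explicit table seems the safer choice if you only want ligatures touched.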

