On Mar 12, 8:08 pm, attoampere <[email protected]> wrote:
> hello!
>
> i am new to tesseract... and just found out about the hocr switch.
> i was asking myself if there is a command line parameter for removing
> hyphenation from the txt file, and perhaps seperating paragraphs?
> this would help tremendously in post processing.
>
> thx

I'm not the expert here, but you could post-process the output with a script that removes each hyphen at the end of a line together with the newline character(s) that follow it, joining the two halves of the word. You should log every replacement, though, and then review the log by eye to spot exceptions such as "Jean-Paul" and "ultra-orthodox", where the hyphen belongs to the word and must be kept (those are statistically rare, I think).
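
For what it's worth, here is a rough sketch of that kind of script in Python. It is only an illustration, not anything tesseract provides: the function name dehyphenate and the regular expression are my own assumptions, and it treats any word fragment followed by a hyphen and a line break as a split word.

```python
import re

# Assumed pattern: a word fragment, a hyphen at the end of the line
# (optionally followed by spaces or tabs), a line break, then the rest
# of the word on the next line.
HYPHEN_BREAK = re.compile(r"(\w+)-[ \t]*\r?\n[ \t]*(\w+)")


def dehyphenate(text):
    """Join words split by an end-of-line hyphen.

    Returns the cleaned text and a log of (original, joined) pairs so
    that every replacement can be reviewed by eye afterwards.
    """
    log = []

    def join(match):
        joined = match.group(1) + match.group(2)
        log.append((match.group(0), joined))
        return joined

    return HYPHEN_BREAK.sub(join, text), log


if __name__ == "__main__":
    sample = "this would help tremen-\ndously, said Jean-\nPaul"
    fixed, log = dehyphenate(sample)
    print(fixed)
    # Print the log so wrongly joined names and compounds can be spotted.
    for original, joined in log:
        print(repr(original), "->", repr(joined))
```

Hyphens in the middle of a line are left alone, so only names and compounds that happen to break across lines (the "Jean-Paul" case) end up in the log and need to be fixed by hand.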

