On Mar 12, 8:08 pm, attoampere <[email protected]> wrote:
> hello!
>
> i am new to tesseract... and just found out about the hocr switch.
> i was asking myself if there is a command line parameter for removing
> hyphenation from the txt file, and perhaps separating paragraphs?
> this would help tremendously in post processing.
>
> thx

I'm no expert here, but you could post-process the text output with a
script that removes each hyphen at the end of a line, together with the
newline character(s) that follow it, so the two halves of the word are
joined again.

You should log every replacement, though, and then review the log by
eye to spot exceptions like "Jean-Paul" and "ultra-orthodox", where the
hyphen belongs to the word and must be kept. (Those are statistically
rare, I think.)
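
Here is a minimal Python sketch of that idea, to be taken as a rough
starting point and not as anything built into tesseract. The names
HYPHEN_BREAK and dehyphenate are made up for this example, and it
assumes the OCR result is an ordinary text file that has already been
read into a string. It joins every word that is split by a hyphen at
the end of a line and records each replacement so it can be reviewed
afterwards.

```python
import re

# Hypothetical pattern for this sketch: a word fragment, a hyphen at the
# end of the line (optionally followed by spaces or tabs), the line
# break, optional indentation, and the fragment that starts the next line.
HYPHEN_BREAK = re.compile(r"(\w+)-[ \t]*\r?\n[ \t]*(\w+)")


def dehyphenate(text):
    """Join words split by an end-of-line hyphen.

    Returns the new text and a log of (original, replacement) pairs.
    Hyphens inside a line are left untouched.
    """
    log = []

    def join(match):
        joined = match.group(1) + match.group(2)
        # Record every replacement so false joins can be found by eye.
        log.append((match.group(0), joined))
        return joined

    return HYPHEN_BREAK.sub(join, text), log


if __name__ == "__main__":
    sample = "An ultra-orthodox view of post-\nprocessing by Jean-\nPaul.\n"
    cleaned, replacements = dehyphenate(sample)
    print(cleaned, end="")
    for original, replacement in replacements:
        print(repr(original), "->", repr(replacement))
```

Note that the sketch removes the line break along with the hyphen, so
the two lines are merged into one. It also joins "Jean-" and "Paul"
into "JeanPaul", which is exactly the kind of entry the log is meant to
expose. A word list or dictionary check could cut down on such cases,
but that is beyond this sketch.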



