From what I could find, Tesseract does paragraph breaking for hOCR output.
As far as I know, there are hOCR-based tools that can be used for Project Gutenberg.
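
If no existing tool fits, a small script may be enough. Below is a minimal
sketch in Python (standard library only), not a tested workflow. It assumes
the hOCR file marks each paragraph as a <p> element with class "ocr_par",
which is the class name the hOCR format defines for paragraphs; please check
what your Tesseract version actually emits. The names HocrParagraphs and
hocr_to_text are made up for this example.

```python
from html.parser import HTMLParser


class HocrParagraphs(HTMLParser):
    """Collect the text of each <p class="ocr_par"> element (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraphs, in document order
        self._chunks = None   # text pieces of the open paragraph, else None

    def handle_starttag(self, tag, attrs):
        if self._chunks is not None:
            # Inside a paragraph: separate nested line/word elements by a
            # space so adjacent words never run together.
            self._chunks.append(" ")
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "ocr_par" in classes:
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._chunks is not None:
            # Collapse line breaks and runs of whitespace within the paragraph.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
            self._chunks = None


def hocr_to_text(hocr):
    """Return the paragraphs of an hOCR string separated by one blank line."""
    parser = HocrParagraphs()
    parser.feed(hocr)
    parser.close()
    return "\n\n".join(parser.paragraphs) + "\n"
```

You would read the hOCR file into a string and pass it to hocr_to_text().
Note that this sketch joins the lines of each paragraph into one line; if the
proofing workflow needs the original line breaks, the line elements would
have to be handled separately.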

Warm regards,
Dmitri Silaev
www.CustomOCR.com

On Tue, Apr 26, 2011 at 12:47 AM, Enrico Segre
<[email protected]> wrote:
> I'm trying to use Tesseract to provide content to Project
> Gutenberg. There, the proofing workflow requires that one blank line be
> inserted between recognized paragraphs, paragraphs being defined
> by a change in the indentation of their first line w.r.t. the body text.
>
> I found this old post:
>
> http://groups.google.com/group/tesseract-ocr/browse_thread/thread/34ab77d8dd1636e3/35e59c6a67661ee3?lnk=gst&q=paragraph#35e59c6a67661ee3
>
> Am I understanding correctly that the situation hasn't changed since
> then, or is there a way?
>
> Enrico
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
