I tried tesseract .... hocr on a couple of test pages. I understand
the idea, but I have the impression that the marked-up output contains
far more paragraphs than I would expect (nearly every line is its own
paragraph). Are you perhaps aware of a config variable that sets a
tolerance threshold?
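
If no such variable turns up, one possible workaround is to ignore the
paragraph markup and regroup the recognized lines by first-line
indentation, which is how Project Gutenberg defines a paragraph. The
sketch below is only an illustration under assumptions: the function
name group_by_indent, the (left_x, text) input format and the default
tolerance of 15 pixels are all hypothetical, and the left edge of each
line is assumed to have been read beforehand from the bbox in the hOCR
title attribute.

```python
from statistics import median


def group_by_indent(lines, tolerance=15):
    """Group (left_x, text) lines into paragraphs by first-line indent.

    A line whose left edge lies more than `tolerance` pixels to the
    right of the body margin starts a new paragraph. The body margin is
    estimated as the median left edge, which assumes that indented
    lines are a minority on the page.
    """
    if not lines:
        return []
    margin = median(x for x, _ in lines)
    paragraphs = []
    for x, text in lines:
        if not paragraphs or x - margin > tolerance:
            paragraphs.append([text])      # indented line: new paragraph
        else:
            paragraphs[-1].append(text)    # body line: same paragraph
    return [" ".join(p) for p in paragraphs]
```

The tolerance would need tuning per scan resolution, and a skewed or
multi-column page would defeat the single-margin estimate.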

Also, I don't know how to filter the hOCR output down to plain text
with an additional line break between paragraphs. I've looked in
hocr-tools; hocr-as-no-html is listed as "possible", not even
"planned".
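
For illustration, such a filter could be sketched with nothing but the
Python standard library. This is a minimal sketch, not an existing
hocr-tools script: the names HocrText and hocr_to_text are made up
here, and it assumes that paragraphs and lines are marked with the
hOCR classes ocr_par and ocr_line.

```python
from html.parser import HTMLParser


class HocrText(HTMLParser):
    """Collect hOCR text as a list of paragraphs, each a list of lines."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_par" in classes:
            self.paragraphs.append([])          # start a new paragraph
        elif "ocr_line" in classes:
            if not self.paragraphs:             # line outside any paragraph
                self.paragraphs.append([])
            self.paragraphs[-1].append("")      # start a new line

    def handle_data(self, data):
        # Text is attached to the most recently opened line, if any.
        if self.paragraphs and self.paragraphs[-1]:
            self.paragraphs[-1][-1] += data


def hocr_to_text(hocr):
    """Return plain text with one blank line between paragraphs."""
    parser = HocrText()
    parser.feed(hocr)
    parser.close()
    blocks = []
    for par in parser.paragraphs:
        # Collapse whitespace inside each line and drop empty lines.
        lines = [" ".join(line.split()) for line in par]
        lines = [line for line in lines if line]
        if lines:
            blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"
```

Called as hocr_to_text(open("page.html").read()), where page.html is a
placeholder file name, it keeps the original line breaks and separates
paragraphs with a blank line.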

Do you have references for the hOCR-based tools for Project Gutenberg
that you mentioned?

Enrico

On Apr 26, 6:12 am, Dmitri Silaev <[email protected]> wrote:
> From what I could find, Tesseract does paragraph breaking for hOCR output.
> As far as I know, there are hOCR-based tools that can be used for
> Project Gutenberg.
>
> Warm regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Tue, Apr 26, 2011 at 12:47 AM, Enrico Segre
>
> <[email protected]> wrote:
> > I'm trying to use tesseract to provide content to Project
> > Gutenberg. There, the proofing workflow requires that one blank line
> > be inserted between recognized paragraphs, paragraphs being defined
> > by a change in the indentation of their first line w.r.t. the body text.
>
> > I found this old post:
>
> >http://groups.google.com/group/tesseract-ocr/browse_thread/thread/34a...
>
> > Am I understanding correctly that the situation hasn't changed since
> > then, or is there a way?
>
> > Enrico
>
