I gave tesseract .... hocr a try on a couple of test pages. I understand the idea, but I have the impression that the marked-up output contains far more paragraphs than I would expect (nearly every line is its own paragraph). Are you perhaps aware of a config variable that sets a tolerance threshold?
Also, I don't know what to do to filter the hOCR output down to plain text plus an additional line break between paragraphs. I've looked in hocr-tools; hocr-as-no-html is listed as "possible", not even "planned". Do you have refs for the "hOCR-based tools can be used for Project Gutenberg" you mentioned?

Enrico

On Apr 26, 6:12 am, Dmitri Silaev <[email protected]> wrote:
> From what I could find, Tesseract does paragraph breaking for hOCR output.
> As I know there are hOCR-based tools can be used for Project Gutenberg.
>
> Warm regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Tue, Apr 26, 2011 at 12:47 AM, Enrico Segre
> <[email protected]> wrote:
> > I'm striving to use tesseract for providing content to the Project
> > Gutenberg. There, proofing workflow requires that one blank line is
> > inserted between each recognized paragraph, paragraphs being defined
> > by a changing indentation of their first line w.r.o. the body text.
> >
> > I found this old post:
> >
> > http://groups.google.com/group/tesseract-ocr/browse_thread/thread/34a...
> >
> > Am I understanding correctly that the situation hasn't changed since
> > then, or is there a way?
> >
> > Enrico
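
For what it's worth, a filter of the kind asked about above (hOCR in, plain text with one blank line between paragraphs out) can be sketched with the Python standard library alone. This is only a minimal sketch, not an existing hocr-tools script. It assumes the hOCR file marks paragraphs with class `ocr_par` and text lines with class `ocr_line`, and that every line sits inside a paragraph. The names `HocrToText` and `hocr_to_text` are made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag; they are skipped so the stack stays balanced.
VOID = {"meta", "br", "img", "link", "hr", "input"}


class HocrToText(HTMLParser):
    """Collect text from hOCR, one list of line strings per ocr_par element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []  # finished paragraphs, each a list of line strings
        self._stack = []      # class lists of the currently open elements
        self._lines = None    # lines of the paragraph being read, or None
        self._words = None    # text chunks of the line being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        self._stack.append(classes)
        if "ocr_par" in classes:
            self._lines = []
        elif "ocr_line" in classes and self._lines is not None:
            self._words = []

    def handle_endtag(self, tag):
        if tag in VOID or not self._stack:
            return
        classes = self._stack.pop()
        if "ocr_line" in classes and self._words is not None:
            # Join the chunks and collapse runs of whitespace to single spaces.
            text = " ".join(" ".join(self._words).split())
            if text:
                self._lines.append(text)
            self._words = None
        elif "ocr_par" in classes and self._lines is not None:
            if self._lines:
                self.paragraphs.append(self._lines)
            self._lines = None

    def handle_data(self, data):
        # Text outside an ocr_line element is ignored.
        if self._words is not None:
            self._words.append(data)


def hocr_to_text(hocr):
    """Return plain text with one blank line between hOCR paragraphs."""
    parser = HocrToText()
    parser.feed(hocr)
    parser.close()
    return "\n\n".join("\n".join(p) for p in parser.paragraphs) + "\n"
```

Calling `hocr_to_text(open("page.html", encoding="utf-8").read())` on a hypothetical file `page.html` would return the text ready to write out. Note that this only reformats whatever paragraph breaks the hOCR already contains; if nearly every line is tagged as its own paragraph, the output will have a blank line after nearly every line.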

