page layout analysis for tesseract 2 (and tesseract default language)

Arno Teigseth Sat, 30 Oct 2010 10:52:55 -0700

Well, I hope I don't step on anyone's toes here, but I just wanted to share
how I've coped with page layout analysis in tesseract2:


sudo apt-get install ocropus

(on Ubuntu)

ocropus is a command-line tool for page layout analysis, and it uses
tesseract for the OCR.

I had a huge headache: 1700 two-column pages with some images here and
there, not scanned very precisely, so they were skewed, not very well
centered and so on.

Until now I've been straightening and cutting the pages with scripts
using imagemagick, but that isn't very practical. And the images are
still there afterwards, right? Lots of manual work.
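
For anyone wondering what that kind of imagemagick script looks like,
here is a rough sketch, not the exact scripts I used. The function name,
the 40% deskew threshold, the even 50/50 column split and the output
file names are all assumptions for illustration; real scans that are
off-center will need different crop values.

```shell
#!/bin/sh
# Sketch only: straighten one scanned page and cut it into two columns.
# straighten_and_split is a hypothetical helper name; the 40% threshold
# and the 50%x100% tile size are assumed values, not tuned ones.

straighten_and_split() {
    in="$1"
    base="${in%.*}"
    # -deskew straightens the page; -crop 50%x100% cuts it into a left
    # and a right half; +repage resets the canvas offsets after each step.
    # %d in the output name is filled in by convert (0 = left, 1 = right).
    convert "$in" -deskew 40% +repage -crop 50%x100% +repage "${base}_col%d.png"
}
```

Calling straighten_and_split 001.jpg should leave 001_col0.png and
001_col1.png next to the original. It does nothing about the images
inside the page, which is exactly the manual part.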

Now, the command

ocroscript rec-tess 001.jpg > page001.html

does page layout analysis, sends the pieces assumed to be text to
tesseract for recognition, and builds up an html page from the results.
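
With 1700 pages you will want to loop over them. A minimal sketch,
assuming the scans are named like 001.jpg and all sit in one directory;
the function name ocr_all_pages and the page<NAME>.html output pattern
are my own choices for illustration, only the ocroscript rec-tess call
itself is the command shown above.

```shell
#!/bin/sh
# Sketch only: run "ocroscript rec-tess" over every .jpg in a directory.
# ocr_all_pages is a hypothetical helper; file naming is an assumption.

ocr_all_pages() {
    dir="${1:-.}"
    for img in "$dir"/*.jpg; do
        # If no .jpg exists the pattern stays literal, so skip it.
        [ -e "$img" ] || continue
        base=$(basename "$img" .jpg)
        out="$dir/page$base.html"
        # Skip pages that already have a non-empty result, so an
        # interrupted run can simply be started again.
        if [ -s "$out" ]; then
            continue
        fi
        if ! ocroscript rec-tess "$img" > "$out"; then
            echo "failed: $img" >&2
            rm -f "$out"
        fi
    done
}
```

Run it as ocr_all_pages /path/to/scans (or with no argument for the
current directory). Failed pages are reported on stderr and their
partial output is removed, so they get retried on the next run.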

Hope this can be useful for someone, at least if you're stuck with
pre-tesseract3 versions.

best
Arno
