Well, I hope I don't step on anyone's toes here, but I just wanted to share how I've coped with page layout analysis in tesseract2:

    sudo apt-get install ocropus

(on Ubuntu). ocropus is a command-line page layout analysis tool, and it uses tesseract for the OCR.

I had a huge headache: 1700 two-column pages with some images here and there, not scanned very carefully, so they were skewed, not well centered and so on. Until now I had been straightening and cropping the pages with scripts using imagemagick, but that isn't very practical. And the images are still there, right? Lots of manual work.

Now, the command

    ocroscript rec-tess 001.jpg > page001.html

does page layout analysis, sends the pieces it assumes to be text to tesseract for recognition, and builds up an HTML page from the results.

Hope this can be useful for someone, at least if you're stuck with pre-tesseract3 versions.

best
Arno
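
P.S. For a batch like the 1700 pages above, the same command can be wrapped in a shell loop. This is only a rough sketch and has not been tried against ocropus itself: ocr_all_pages is a made-up helper name, and it assumes the scans sit in one directory and are named 001.jpg, 002.jpg and so on, so that the output names follow the page001.html pattern.

```shell
# Hypothetical helper (not part of ocropus): run the rec-tess command
# on every .jpg in a directory, writing one HTML file per page.
ocr_all_pages() {
    dir=${1:-.}                              # directory with the scans, default: current
    for img in "$dir"/*.jpg; do
        [ -e "$img" ] || continue            # no .jpg files: the glob stays literal
        base=$(basename "$img" .jpg)         # 001.jpg -> 001
        out="$dir/page$base.html"            # 001 -> page001.html
        [ -s "$out" ] && continue            # already converted, skip on re-runs
        # Write to a temporary file first so a failed run leaves no half-written page.
        if ocroscript rec-tess "$img" > "$out.tmp"; then
            mv "$out.tmp" "$out"
        else
            rm -f "$out.tmp"
            echo "failed: $img" >&2
        fi
    done
}
```

It would be called as ocr_all_pages followed by the directory holding the scans. Pages that already have a non-empty HTML file are skipped, so an interrupted run can simply be started again.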

