I think you should split the page's text block into columns first, and then into rows, using Leptonica or OpenCV. Each piece is then much easier to OCR.
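To make the idea concrete, here is a rough sketch of the split using projection profiles. It is plain Python and assumes the page has already been binarized (for example with OpenCV) into a list of rows of 0/1 values, where 1 means ink. The names find_runs and split_columns_then_rows and the gap thresholds are made up for illustration; they are not Leptonica or OpenCV functions, and real scans will likely need deskewing and tuned thresholds first.

```python
# Sketch only: projection-profile splitting on an already binarized page.
# `page` is a list of rows; each row is a list of 0 (background) / 1 (ink).

def find_runs(profile, min_gap=1):
    """Return (start, end) pairs of consecutive non-zero entries in profile.

    Runs separated by fewer than min_gap zero entries are merged.
    `end` is exclusive.
    """
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


def split_columns_then_rows(page, min_col_gap=2, min_row_gap=1):
    """Return (left, top, right, bottom) boxes, column by column."""
    if not page:
        return []
    width = len(page[0])
    # Vertical projection: amount of ink at each x position.
    col_profile = [sum(row[x] for row in page) for x in range(width)]
    boxes = []
    for left, right in find_runs(col_profile, min_col_gap):
        # Horizontal projection, restricted to the current column.
        row_profile = [sum(row[left:right]) for row in page]
        for top, bottom in find_runs(row_profile, min_row_gap):
            boxes.append((left, top, right, bottom))
    return boxes
```

Each returned box can then be cropped out of the image and passed to tesseract on its own, so the layout analysis has much less to get wrong.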
On Tuesday, March 29, 2016 at 2:29:27 PM UTC+8, [email protected] wrote:
>
> Hi All,
>
> I've been experimenting with tesseract and have been impressed with the
> accuracy of the software. I'm looking to use tesseract to process around
> 200 pages of printed material that was printed in around 1934. I've
> attached a sample of the PDF I need to work with.
>
> I'm looking to improve the accuracy of the OCR process as much as
> possible. I believe that with the vast, and I admit intimidating, list of
> options available that there are ways to improve the accuracy. Speed of
> recognition isn't as high a factor as accuracy for this project.
>
> The following steps are what I've found to work best so far:
>
> 1. Convert the PDF to TIFF
>
>    convert -density 350 input.pdf -type Grayscale -background white +matte -depth 32 input.tif
>
> 2. Clean the TIFF file using the text cleaner script [1]
>
>    textcleaner -t 25 -s 1 -g input.tif cleaned.tif
>
> 3. OCR the cleaned TIFF file.
>
>    tesseract cleaned.tif ./test-ocr
>
> Any thoughts on ways to improve the accuracy will be gratefully received.
>
> With thanks.
>
> -Corey
>
> [1] http://www.fmwconcepts.com/imagemagick/textcleaner/

