I think you should split the page's text block into columns first, and then
into rows, using Leptonica or OpenCV. Each piece is then much easier to OCR.
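
To make the idea concrete, here is a rough sketch in Python of one way to
do the split, using projection profiles: count the ink pixels in every pixel
column, cut wherever there is a wide enough run of empty columns, then repeat
the same thing along the rows inside each column. The function names and the
gap thresholds are my own assumptions for illustration, not anything taken
from Leptonica or OpenCV, and the thresholds would need tuning for a 350 dpi
scan. OpenCV is only used to load and binarize the image (Otsu threshold).

```python
import numpy as np


def find_runs(mask):
    """Return (start, end) pairs for each run of True in a 1-D boolean array.

    `end` is exclusive, so a run covers mask[start:end].
    """
    padded = np.concatenate(([False], mask, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(changes[0::2], changes[1::2])]


def split_spans(ink, axis, min_gap):
    """Split a 2-D boolean ink mask along one direction at whitespace gaps.

    axis=0 sums down the rows and yields column spans (x ranges);
    axis=1 sums across the columns and yields row spans (y ranges).
    Interior gaps narrower than `min_gap` pixels are ignored.
    """
    has_ink = ink.sum(axis=axis) > 0
    for start, end in find_runs(~has_ink):
        interior = start > 0 and end < len(has_ink)
        if interior and end - start < min_gap:
            has_ink[start:end] = True  # too narrow to be a real separator
    return find_runs(has_ink)


def split_page(ink, min_col_gap=30, min_row_gap=3):
    """Return (x0, y0, x1, y1) boxes: columns first, then rows per column.

    The default gap sizes are guesses, not measured values.
    """
    boxes = []
    for x0, x1 in split_spans(ink, axis=0, min_gap=min_col_gap):
        column = ink[:, x0:x1]
        for y0, y1 in split_spans(column, axis=1, min_gap=min_row_gap):
            boxes.append((x0, y0, x1, y1))
    return boxes


def load_ink(path):
    """Load an image and return a boolean mask that is True on dark pixels."""
    import cv2  # imported here so the splitting code works without OpenCV

    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(path)
    _, binary = cv2.threshold(
        gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
    )
    return binary > 0
```

Something like split_page(load_ink("cleaned.tif")) would then give a list of
boxes. Each box can be cropped out of the image, saved, and passed to
tesseract on its own. This only works if the scan is reasonably straight, so
a skewed page would need deskewing first.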

On Tuesday, March 29, 2016 at 2:29:27 PM UTC+8, [email protected] wrote:
>
> Hi All,
>
> I've been experimenting with tesseract and have been impressed with the 
> accuracy of the software. I'm looking to use tesseract to process around 
> 200 pages of material that was printed around 1934. I've attached a 
> sample of the PDF I need to work with. 
>
> I'm looking to improve the accuracy of the OCR process as much as 
> possible. I believe that, with the vast (and, I admit, intimidating) list 
> of options available, there are ways to improve the accuracy. Speed of 
> recognition matters less than accuracy for this project. 
>
> The following steps are what I've found work best so far:
>
> 1. Convert the PDF to TIFF
>
> convert -density 350 input.pdf -type Grayscale -background white +matte -depth 32 input.tif
>
>
> 2. Clean the TIFF file using the text cleaner script [1]
>
> textcleaner -t 25 -s 1 -g input.tif cleaned.tif
>
>
> 3. OCR the cleaned TIFF file.
>
> tesseract cleaned.tif ./test-ocr
>
>
> Any thoughts on ways to improve the accuracy will be gratefully received. 
>
>
> With thanks. 
>
>
> -Corey
>
>
> [1] http://www.fmwconcepts.com/imagemagick/textcleaner/
>
