On Feb 7, 2016 4:25 AM, khandy21yo <[email protected]> wrote:
>
> If you want to get serious about OCRing documents, look at how Project 
> Gutenberg does it.
>
> After OCR, each page goes through 3 passes of cleanup and formatting.
>

When I was doing my master's, I worked on OCR/IR. It has been a while, but here
are some things to consider:

Accuracy can be improved by proper training with known ground truth data. If
the typeface of the RT11 manuals is the 'DEC' standard and matches the other
scanned files, you can use that for training: produce images from the DOCUMENT
output alongside the straight ASCII, and there's the initial ground truth.
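
Once there is ground truth, one common way to check whether training actually
helped is to score the OCR output against it. This is a minimal sketch of a
character error rate, not a training procedure; the function names are my own
and the sample strings are made up for illustration.

```python
def edit_distance(truth, guess):
    """Levenshtein distance between two strings, computed one row at a time."""
    previous = list(range(len(guess) + 1))
    for i, t in enumerate(truth, start=1):
        current = [i]
        for j, g in enumerate(guess, start=1):
            current.append(min(
                previous[j] + 1,             # character missing from the OCR output
                current[j - 1] + 1,          # extra character in the OCR output
                previous[j - 1] + (t != g),  # match (cost 0) or substitution (cost 1)
            ))
        previous = current
    return previous[-1]


def character_error_rate(truth, guess):
    """Edits needed to turn the OCR output into the truth, per truth character."""
    if not truth:
        return 0.0 if not guess else 1.0
    return edit_distance(truth, guess) / len(truth)


if __name__ == "__main__":
    # Hypothetical ground-truth line and OCR guess (digit 1 misread as letter l).
    print(character_error_rate("RT-11", "RT-1l"))
```

Running the same pages through the engine before and after training and
comparing the two rates gives a rough idea of whether the training data is
doing any good.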

Tesseract is the OCR engine. There is also a project called OCRopus that
provides layout analysis and other processing, using Tesseract for the OCR.

You can improve accuracy by running multiple OCR engines and voting on the
results.
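
A minimal sketch of the voting idea, assuming each engine has already produced
its text for the same line. The function name and the sample strings are
hypothetical, and real voting schemes usually align at the character level and
handle dropped or extra words, which this one ignores.

```python
from collections import Counter
from difflib import SequenceMatcher


def vote_words(outputs):
    """Majority-vote word by word across several OCR outputs of the same line.

    The first output is the reference. Every other output is aligned to it
    with difflib, and only one-for-one matches or substitutions cast votes.
    Insertions and deletions are ignored in this sketch.
    """
    reference = outputs[0].split()
    # Each reference word starts with one vote from the reference engine.
    ballots = [Counter({word: 1}) for word in reference]

    for other in outputs[1:]:
        words = other.split()
        matcher = SequenceMatcher(a=reference, b=words, autojunk=False)
        for tag, a0, a1, b0, b1 in matcher.get_opcodes():
            if tag in ("equal", "replace") and (a1 - a0) == (b1 - b0):
                for offset in range(a1 - a0):
                    ballots[a0 + offset][words[b0 + offset]] += 1

    # On a tie the reference word wins, because it was counted first.
    return " ".join(ballot.most_common(1)[0][0] for ballot in ballots)


if __name__ == "__main__":
    # Made-up outputs from three engines for the same line.
    print(vote_words([
        "RT-11 systen manual",
        "RT-11 system manual",
        "RT-1l system manual",
    ]))
```

Voting only pays off when the engines make different mistakes, so two engines
with unrelated internals are worth more than one engine run twice.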

Some packages that may help: Tesseract and Cuneiform (another OCR engine, from
Russia). Unpaper is a package that can help clean up scanned images before
OCRing.
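
As a rough sketch of chaining the cleanup and OCR steps by hand: the helper
names, the file names and the "-clean" suffix below are my own assumptions.
The only things taken from the tools themselves are their basic command forms,
"unpaper input output" and "tesseract image outputbase", with unpaper working
on PNM images (pbm/pgm/ppm).

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(scan, workdir):
    """Return the argv lists for cleaning one scanned page and then OCRing it."""
    scan = Path(scan)
    cleaned = Path(workdir) / (scan.stem + "-clean" + scan.suffix)
    outbase = Path(workdir) / scan.stem  # tesseract appends .txt to this base name
    return [
        ["unpaper", str(scan), str(cleaned)],
        ["tesseract", str(cleaned), str(outbase)],
    ]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(argv[0] + " is not installed or not on PATH")
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical page image; this only prints the commands, it does not run them.
    for argv in build_commands("page001.pgm", "out"):
        print(" ".join(argv))
```

Keeping the command building separate from the running makes it easy to check
what would be executed before pointing it at a whole directory of scans.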

Having said all of that: for my personal stuff I use gscan2pdf under Ubuntu 
since it includes most of the above packages in a GUI.

-ron
_______________________________________________
Simh mailing list
[email protected]
http://mailman.trailing-edge.com/mailman/listinfo/simh
