Re: [Vo]:Neat new OCR technology

Michel Jullian Fri, 19 Mar 2010 05:11:13 -0700

2010/3/19 Michel Jullian <[email protected]>:
> ... if you convert a
> clearscan pdf back to image format in higher resolution e.g. 600 dpi
> (this can be set in edit>preferences>convert from pdf>TIFF>edit
> settings), make a new pdf from that, and re-do an OCR on it,
> interestingly the recognition accuracy is improved,


Let me retract this. After experimenting on a few more pages, it turns
out the 2nd OCR pass makes roughly the same number of recognition
errors as the 1st pass on average. What fooled me is that it doesn't
make them on the same words. So there is really no point in going
through the extra complexity and work of a 2nd pass.
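
To illustrate how two passes can make the same number of errors without
making them on the same words, here is a minimal Python sketch. The sample
strings and the helper name misread_positions are hypothetical, made up
for illustration and not taken from the pages tested, and the sketch
assumes OCR errors are plain word substitutions (no dropped or split
words).

```python
def misread_positions(truth, ocr):
    """Return the set of word positions where the OCR text differs from the truth.

    Assumption: both texts contain the same number of words, i.e. the OCR
    only substitutes words and never drops, merges, or splits them.
    """
    truth_words = truth.split()
    ocr_words = ocr.split()
    if len(truth_words) != len(ocr_words):
        raise ValueError("this sketch only handles equal word counts")
    return {i for i, (t, o) in enumerate(zip(truth_words, ocr_words)) if t != o}


# Hypothetical data: a proofread reference line and two OCR passes over it.
truth = "the quick brown fox jumps over the lazy dog"
pass1 = "the qu1ck brown fox jumps ovcr the lazy dog"
pass2 = "tbe quick brown fox jumps over the lazy d0g"

errors1 = misread_positions(truth, pass1)
errors2 = misread_positions(truth, pass2)

print(len(errors1), len(errors2))  # error count of each pass
print(sorted(errors1 & errors2))   # word positions misread by both passes
```

Comparing the two passes against each other shows many differences, which
is easy to mistake for an improvement. Only a comparison of each pass
against a proofread reference shows that the error counts are equal.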

There is, however, one genuinely useful application of the trick of saving
as TIFF and re-PDF-ing before OCRing: it circumvents the "Acrobat
could not perform recognition (OCR) on this page because: This page
contains renderable text." error you get on some documents, which
annoyingly aborts the whole OCR process. If anyone knows of a simpler
way, I am interested.

Last point: I see they have integrated the "OCR multiple files"
feature into the main menu in version 9, so one doesn't have to go
through the batch processing procedure to OCR a large collection of
documents. Much more convenient.

Michel
