Re: [Vo]:Neat new OCR technology

2010-03-19 Thread Michel Jullian
2010/3/19 Michel Jullian :
... if you convert a
> clearscan pdf back to image format in higher resolution e.g. 600 dpi
> (this can be set in edit>preferences>convert from pdf>TIFF>edit
> settings), make a new pdf from that, and re-do an OCR on it,
> interestingly the recognition accuracy is improved,

Let me retract this. After experimenting on a few more pages, it turns
out the 2nd OCR pass makes roughly the same number of recognition
errors as the 1st pass on average. What fooled me is that it doesn't
make them on the same words. So there is really no point in going
through the complexity and hard work of a 2nd pass.
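
For anyone who wants to check this kind of thing on their own pages, here
is a minimal sketch of the comparison, assuming you have a hand-corrected
reference text and the text output of each pass. The sample strings and
the error_positions helper are made up for illustration (they are not real
Acrobat output), and the word-by-word alignment assumes the OCR neither
drops nor inserts words.

```python
def error_positions(reference, ocr_output):
    """Return the set of word indices where the OCR text differs from the reference.

    Simplifying assumption: both texts contain the same number of words,
    so words can be compared position by position.
    """
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    return {i for i, (r, o) in enumerate(zip(ref_words, ocr_words)) if r != o}


# Hypothetical sample data: a corrected reference and two OCR passes.
reference = "the quick brown fox jumps over the lazy dog"
pass1 = "the qu1ck brown fox jumps ovcr the lazy dog"
pass2 = "the quick brovvn fox jumps over the lazy d0g"

errors1 = error_positions(reference, pass1)
errors2 = error_positions(reference, pass2)
shared = errors1 & errors2  # words that both passes got wrong

print(len(errors1), len(errors2), len(shared))
```

In this made-up sample both passes make the same number of errors but on
different words, which is exactly the situation that can look like an
improvement if you only re-check the words the 1st pass got wrong.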

There is, however, a genuinely useful application of the trick of saving
as TIFF and re-PDF-ing before OCRing: it circumvents the "Acrobat
could not perform recognition (OCR) on this page because: This page
contains renderable text." error you get on some documents, which
annoyingly aborts the whole OCR process. If anyone knows of a simpler
way, I am interested.
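
The rasterizing step could also be scripted outside Acrobat. The following
is only a sketch under stated assumptions: it assumes Ghostscript is
installed and available as "gs" (a different tool from the Acrobat menu
route described above), and the function name and file names are
hypothetical. It only builds the command line that renders each page of a
PDF to a grayscale TIFF at 600 dpi.

```python
def rasterize_command(pdf_path, out_pattern, dpi=600):
    """Build a Ghostscript command that renders each PDF page to a TIFF image.

    Assumptions: Ghostscript is installed as "gs"; out_pattern contains a
    %d-style page counter such as "page-%03d.tif".
    """
    return [
        "gs",
        "-dSAFER",
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=tiffgray",            # 8-bit grayscale TIFF output
        f"-r{dpi}",                     # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",  # one output file per page
        pdf_path,
    ]


# Hypothetical file names, for illustration only.
cmd = rasterize_command("scan.pdf", "page-%03d.tif")
print(" ".join(cmd))
```

To actually run it, pass the list to subprocess.run(cmd, check=True). The
resulting TIFF files contain no text layer, so once they are combined into
a new pdf the "renderable text" check should have nothing to object to.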

One last point: I see they have integrated the "OCR multiple files"
feature into the main menu in version 9, so one doesn't have to go
through the batch processing procedure to OCR a large collection of
documents. Much more convenient.

Michel



Re: [Vo]:Neat new OCR technology

2010-03-18 Thread Michel Jullian
One can download Acrobat 9 from their web site and try it for a month for free.

Disappointingly, the accuracy of the recognition itself is not better
with this clearscan option, only the look is. However, thanks to the
zoomable (vector) nature of the clearscan characters, if you convert a
clearscan pdf back to image format in higher resolution e.g. 600 dpi
(this can be set in edit>preferences>convert from pdf>TIFF>edit
settings), make a new pdf from that, and re-do an OCR on it,
interestingly the recognition accuracy is improved, at least it seemed
to be in the couple of trials I have done. If this is confirmed,
hopefully they will realize it and automate the two-pass OCR in
version 10.

Michel

2010/3/18 Jed Rothwell :
> That is impressive!
>
> I hate Adobe's user interface and documentation, but I might get this
> product anyway.
>
> - Jed
>
>



Re: [Vo]:Neat new OCR technology

2010-03-18 Thread Jed Rothwell

That is impressive!

I hate Adobe's user interface and documentation, but I might get this 
product anyway.


- Jed



[Vo]:Neat new OCR technology

2010-03-18 Thread Michel Jullian
Jed, have you tried the "clearscan" setting in Adobe Acrobat 9 OCR?
Very impressive.

They explain their clever (and "obvious", in retrospect) trick in this
demo video: http://my.adobe.acrobat.com/p28891758/

Michel