Re: Discrepancy between rendered and extracted characters.

Zeev Sands Sat, 19 Apr 2014 12:50:27 -0700

On 04/19/2014 03:28 PM, Tres Finocchiaro wrote:

@ZS,


Is the text part of the original PDF or has it been created with OCR?

That sounds similar to an OCR issue where the scanner that scanned in the
document made the mistake.

-Tres

I obtained the document from a 3rd party, so I am not sure, but looking at the "producer" field in its metadata I see 'Adobe Acrobat Pro 11.0.6 Paper Capture Plug-in'. So it appears you are correct: the document might have been scanned. Ouch!

What are my options for extracting error-free text? Using better OCR software? I have just started using pdfbox, so I haven't compiled any statistics on the variety or frequency of these errors. How do people deal with this issue? Is it possible to write a set of rules for a few characters?
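To make the "set of rules" idea concrete, here is a minimal sketch in plain Java of what such a post-processing step might look like. It is only an illustration: the class name OcrFixups, the method applyRules, and the three substitution rules are hypothetical, since the actual misrecognized characters are not listed in this message. It assumes the text has already been extracted into a String (for example by pdfbox) and does not call any pdfbox API itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: applies literal "wrong -> right" substitutions
// to text that has already been extracted from the PDF.
class OcrFixups {

    // Placeholder rules for illustration only; real rules would come from
    // comparing the rendered pages with the extracted text.
    // LinkedHashMap keeps the rules in insertion order.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("\uFB01", "fi"); // "fi" ligature code point -> two letters
        RULES.put("\uFB02", "fl"); // "fl" ligature code point -> two letters
        RULES.put("\u00AD", "");   // soft hyphen -> removed
    }

    // Applies every rule, in order, to the whole input string.
    static String applyRules(String text) {
        String result = text;
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            result = result.replace(rule.getKey(), rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(applyRules("\uFB01le and \uFB02ow"));
    }
}
```

Literal substitutions like these are only safe when the wrong character never appears legitimately in the text. Context-dependent confusions (a digit in place of a letter, for instance) would need rules that look at the neighbouring characters, or a dictionary check.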


Thank you,
-ZS
