On 04/19/2014 03:28 PM, Tres Finocchiaro wrote:
@ZS,

Is the text part of the original PDF or has it been created with OCR?

That sounds similar to an OCR issue where the scanner that scanned in the
document made the mistake.

-Tres


I obtained the document from a 3rd party, so I am not sure, but looking at the "producer" field in its metadata I see 'Adobe Acrobat Pro 11.0.6 Paper Capture Plug-in'. So it appears you are correct: the document may well have been scanned. Ouch!

What are my options for extracting error-free text? Would better OCR software help? I have only just started using pdfbox, so I haven't compiled any statistics on the variety or frequency of these errors. How do people deal with this issue? Is it possible to write a set of correction rules for a few characters?
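
To make the last question concrete, here is a minimal sketch of the kind of rule set I have in mind. It would run on the text after extraction and uses nothing from pdfbox. The class name OcrFixups and the three rules are hypothetical, since I don't yet know which confusions actually occur in my document; they only illustrate context-dependent replacements.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class OcrFixups {
    // Hypothetical rules, applied in insertion order. The real set would
    // have to come from the errors actually observed in the document.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        // A digit 0 between two letters is probably the letter o.
        RULES.put(Pattern.compile("(?<=[A-Za-z])0(?=[A-Za-z])"), "o");
        // A digit 1 between two lowercase letters is probably the letter l.
        RULES.put(Pattern.compile("(?<=[a-z])1(?=[a-z])"), "l");
        // A capital O between two digits is probably the digit 0.
        RULES.put(Pattern.compile("(?<=[0-9])O(?=[0-9])"), "0");
    }

    // Applies every rule to the extracted text and returns the result.
    public static String apply(String text) {
        String result = text;
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            result = rule.getKey().matcher(result).replaceAll(rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints "The book costs 105 dollars."
        System.out.println(apply("The b0ok costs 1O5 do1lars."));
    }
}
```

The lookbehind and lookahead keep each rule tied to its neighbours, so a genuine number such as 2014 is left alone. Rules like these can only catch confusions that are recognisable from context, so I assume they would complement a better OCR pass rather than replace it.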

Thank you,
-ZS
