On 04/19/2014 03:28 PM, Tres Finocchiaro wrote:
@ZS,
Is the text part of the original PDF or has it been created with OCR?
That sounds similar to an OCR issue where the scanner that scanned in the
document made the mistake.
-Tres
I obtained the document from a 3rd party, so I am not sure, but looking
at the "producer" field in its metadata I see 'Adobe Acrobat Pro
11.0.6 Paper Capture Plug-in'. So it appears you are correct: the
document might have been scanned. Ouch!
What are my options for extracting error-free text? Using better
OCR software? I have just started using pdfbox, so I haven't compiled
any statistics on the variety or frequency of these errors. How do
people deal with this issue? Is it possible to write a set of rules for
a few characters?
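
To make the "set of rules" idea concrete, here is a rough sketch of what
such a post-processing step might look like when applied to text that has
already been extracted. It uses only the Java standard library. The class
name OcrFixups and every rule in it are hypothetical examples, not taken
from pdfbox or from the actual errors in this document; real rules would
have to be derived from the mistakes observed in the extracted text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class OcrFixups {
    // One substitution rule: a regex and its replacement string.
    private static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    // Hypothetical example rules, applied in the order listed.
    private static final List<Rule> RULES = new ArrayList<Rule>();
    static {
        // Expand typographic ligatures into plain letters.
        RULES.add(new Rule("\uFB01", "fi"));
        RULES.add(new Rule("\uFB02", "fl"));
        // A digit zero between two lowercase letters is assumed to be "o".
        RULES.add(new Rule("(?<=[a-z])0(?=[a-z])", "o"));
        // A digit one between two lowercase letters is assumed to be "l".
        RULES.add(new Rule("(?<=[a-z])1(?=[a-z])", "l"));
    }

    // Runs every rule over the text and returns the corrected string.
    public static String apply(String text) {
        String result = text;
        for (Rule rule : RULES) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result;
    }
}
```

The input to OcrFixups.apply would be whatever string the text extraction
step returns. Rules like these can also damage text that was correct, so
they would need to be checked against a sample of the document first.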
Thank you,
-ZS