Re: PDF text extraction result different from what they look in PDF reader application

Qingchao Kong Mon, 12 May 2014 03:32:00 -0700

Is there a way to prevent this? I mean, is there a way to configure PDFBox
not to extract the scanned (OCR) text and to return the displayed text instead?


Regards,

On Tue, May 6, 2014 at 5:56 PM, Andreas Lehmkühler <[email protected]> wrote:
> Hi,
>
>> Qingchao Kong <[email protected]> wrote on 5 May 2014 at 12:50:
>>
>>
>> Hi, I am using PDFBox to extract text from PDF files.
>> I noticed that, for some PDF files (usually old PDFs), when you select
>> some text with the mouse in a PDF reader application (I use Evince
>> on Ubuntu), some other text comes up, different from the text shown
>> when nothing is selected.
>>
>> I find that PDFBox sometimes actually extracts the selected text, not
>> the text shown when nothing is selected. Could anybody tell me why this
>> happens? Am I making sense?
> Sounds like a scanned document. Some scanners combine the scanned picture and
> the scanned text (using more or less accurate OCR software) in one PDF.
> The picture is visible and the text is invisible but can be extracted, so that
> the displayed content differs from the extracted one.
>
> BR
> Andreas Lehmkühler
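
A minimal sketch of the idea behind the question above, assuming the hidden OCR layer is drawn with PDF text rendering mode 3 (neither fill nor stroke), which is the usual way invisible text is written. The types below (InvisibleTextFilter, TextRun, visibleText) are hypothetical stand-ins and not PDFBox API. In PDFBox the same check would probably belong in a PDFTextStripper subclass that overrides processTextPosition, but how the rendering mode is exposed depends on the PDFBox version and is not shown here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in types for illustration only; this is not PDFBox API.
class InvisibleTextFilter {

    // PDF text rendering mode 3: neither fill nor stroke, i.e. invisible text.
    static final int RENDER_MODE_INVISIBLE = 3;

    // A run of text together with the rendering mode active when it was drawn.
    static final class TextRun {
        final String text;
        final int renderingMode;

        TextRun(String text, int renderingMode) {
            this.text = text;
            this.renderingMode = renderingMode;
        }
    }

    // Keeps only the runs that are actually painted on the page,
    // joined with single spaces in their original order.
    static String visibleText(List<TextRun> runs) {
        return runs.stream()
                .filter(run -> run.renderingMode != RENDER_MODE_INVISIBLE)
                .map(run -> run.text)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        List<TextRun> runs = Arrays.asList(
                new TextRun("Visible heading", 0),  // mode 0: filled glyphs
                new TextRun("0CR gue55", 3),        // mode 3: hidden OCR layer
                new TextRun("visible body", 0));
        System.out.println(visibleText(runs));
    }
}
```

Note that for a purely scanned page the visible content is an image, so dropping the invisible text leaves nothing to extract. Getting the text as it is displayed would then require running OCR on the image itself.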



-- 
Qingchao Kong

Ph.D. Candidate
State Key Laboratory of Management and Control for Complex Systems
Institute of Automation, Chinese Academy of Sciences

No. 95 Zhongguancun East Road
Haidian District, Beijing 100190 China
