Is there a way to prevent this? I mean, a way to configure PDFBox so that it does not extract the scanned text and returns the text as it is actually displayed?
Regards,

On Tue, May 6, 2014 at 5:56 PM, Andreas Lehmkühler <[email protected]> wrote:
> Hi,
>
>> Qingchao Kong <[email protected]> wrote on May 5, 2014 at 12:50:
>>
>> Hi, I am using PDFBox to extract text from PDF files.
>> I noticed that, for some PDF files (usually old PDFs), when you select
>> some text using your mouse in the PDF reader application (I use Evince
>> on Ubuntu), some other text comes up, different from the text shown when
>> you don't select it.
>>
>> I find that PDFBox sometimes actually extracts the selected text, not
>> the text shown when you don't select it. Could anybody tell me why this
>> happens? Am I making sense?
>
> Sounds like a scanned document. Some scanners combine the scanned picture and
> the scanned text (using more or less accurate OCR software) in one PDF.
> The picture is visible and the text is invisible but can be extracted, so that
> the displayed content differs from the extracted one.
>
> BR
> Andreas Lehmkühler

--
Qingchao Kong
Ph.D. Candidate
State Key Laboratory of Management and Control for Complex Systems
Institute of Automation, Chinese Academy of Sciences
No. 95 Zhongguancun East Road
Haidian District, Beijing 100190
China

