Hi,
Stephen Haggai <[email protected]> wrote on 20 July 2012 at 05:44:
> Hi,
>
> I have looked at the PDF file. It looks as if text in all the pages were
> scanned as images. I am certain that one cannot extract text from (text
> scanned as) images using PDFBox. Could someone correct me if I am wrong.

You are correct. The PDF consists of scanned text, and PDFBox can't extract
that text, only the images. Those could be run through OCR software to get
the text. I haven't tried that, but it should work, with more or less
precision.

BTW: It is always a good idea to try extracting the text with Acrobat Reader
first. Just select the text, then copy and paste it into an editor. If that
doesn't work, it most likely won't work with PDFBox either.

> Thanks,
> Stephen
>
> -----Original Message-----
> From: Big Donkeys [mailto:[email protected]]
> Sent: Friday, 20 July 2012 6:09 AM
> To: [email protected]
> Subject: Can't extract text Adobe-WinCharSetFFFF-UCS2
>
> Hi, I'm having some troubles extracting text from some South Korean PDF
> files using PDFTextStripper. When I try I get a "severe error could not parse
> predefined CMAP file for 'Adobe-WinCharSetFFFF-UCS2'" message and then
> gives me some gibberish. File opens and displays fine in Adobe reader.
> I'm using pdfbox-app-1.7.0.jar.
>
> Here is a link to an example PDF that gives me trouble:
>
> http://eng.khoa.go.kr/inc/func/fileDownloadBlob_nori.asp?cmsCd=CM0237&ntNo=626&fNo=4
>
> Any ideas?

Br
Andreas Lehmkühler

