RE: Issues with extraction content of PDF files

Allison, Timothy B. Fri, 18 Dec 2015 05:29:23 -0800

Hi Edwin,
  Thank you for reaching out to Tika.  As I mentioned [0], the issue appears to 
be that the PDF file doesn’t contain Unicode mappings for the characters in the 
document.  This means that PDFBox has no way of converting character codes 
within the PDF into anything useful.  I checked with pdftotext, and it also 
didn’t pull out anything useful.
   I’m not a PDF expert, and you may want to drop a note to the PDFBox users 
list to see if someone there might have a workaround/solution.
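
   To illustrate the point about missing Unicode mappings, here is a toy 
sketch in Python.  It is not how PDFBox is implemented; the table TO_UNICODE 
and the function decode_codes are hypothetical names made up for this 
example.  It only shows the general idea: with a mapping table, character 
codes can be turned into text, and without one there is nothing to emit but 
placeholders.

```python
# Toy model of PDF text extraction; NOT PDFBox's actual implementation.
# TO_UNICODE and decode_codes are hypothetical names used for illustration.

# Hypothetical ToUnicode-style table: font character code -> Unicode text.
TO_UNICODE = {0x01: "中", 0x02: "文"}


def decode_codes(codes, to_unicode=None, placeholder="?"):
    """Map character codes to text, emitting a placeholder for unmapped codes."""
    if not to_unicode:
        # No mapping table at all: none of the codes can be recovered.
        return placeholder * len(codes)
    return "".join(to_unicode.get(code, placeholder) for code in codes)


codes = [0x01, 0x02, 0x01]
with_map = decode_codes(codes, TO_UNICODE)  # codes resolved through the table
without_map = decode_codes(codes)           # no table, placeholders only

print(with_map)
print(without_map)
```

   The second call resembles the symptom described in the original message: 
the character codes are present in the file, but nothing says which 
characters they stand for.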


               Best,

                       Tim


[0] 
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201512.mbox/%3cby2pr09mb11297223e13e266cfb2a5ffc7...@by2pr09mb112.namprd09.prod.outlook.com%3E

From: Zheng Lin Edwin Yeo [mailto:[email protected]]
Sent: Friday, December 18, 2015 4:44 AM
To: [email protected]
Subject: Issues with extraction content of PDF files

Hi,

I'm indexing some PDF documents in Solr. However, certain PDF files contain 
Chinese text, and after indexing, what is indexed in the content is either a 
series of "??????" or empty.

What could be causing this?

I've shared one of the files with the issue on Dropbox, which you can access via 
the link here: 
https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0


Regards,
Edwin
