Hi Edwin,
Thank you for reaching out to Tika. As I mentioned [0], the issue appears to
be that the PDF file doesn’t contain Unicode mappings for the characters in the
document. This means that PDFBox has no way of converting the character codes
within the PDF into Unicode text. I checked with pdftotext, and it also
didn’t pull out anything useful.
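If it helps to find which files are affected before they reach the index, here
is a rough Python sketch that flags extracted text that is empty or mostly
placeholder characters. The function name, the placeholder set, and the 0.5
threshold are my own assumptions for illustration; none of this is part of Tika
or PDFBox, and it only detects the symptom, it does not recover the text.

```python
# Characters that commonly stand in for code points the extractor could not map.
# This set is an assumption; adjust it to what your pipeline actually emits.
PLACEHOLDERS = {"?", "\ufffd"}


def looks_unmapped(text: str, threshold: float = 0.5) -> bool:
    """Return True if extracted text is empty or mostly placeholder characters."""
    # Ignore whitespace so layout padding does not dilute the ratio.
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # Nothing but whitespace counts as a failed extraction.
        return True
    placeholders = sum(1 for ch in visible if ch in PLACEHOLDERS)
    return placeholders / len(visible) >= threshold


if __name__ == "__main__":
    for sample in ["??????", "", "Desmophen 670 BA"]:
        print(repr(sample), looks_unmapped(sample))
```

You would call this on the content string that comes back from extraction and
set the flagged files aside for manual review or OCR.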
I’m not a PDF expert, and you may want to drop a note to the PDFBox users
list to see if someone there might have a workaround/solution.
Best,
Tim
[0]
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201512.mbox/%3cby2pr09mb11297223e13e266cfb2a5ffc7...@by2pr09mb112.namprd09.prod.outlook.com%3E
From: Zheng Lin Edwin Yeo [mailto:[email protected]]
Sent: Friday, December 18, 2015 4:44 AM
To: [email protected]
Subject: Issues with extraction content of PDF files
Hi,
I'm indexing some PDF documents in Solr. However, certain PDF files contain
Chinese text, and after indexing, the content field holds either a series of
"??????" or nothing at all.
What could be the reason that causes this?
I've shared one of the files with the issue on Dropbox, which you can access via
the link here:
https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
Regards,
Edwin