Hi Tim,

Thanks for your reply and advice.
I've dropped a note to the PDFBox user list too. I will also update here if I find any solutions from there.

Regards,
Edwin

On 18 December 2015 at 21:28, Allison, Timothy B. <[email protected]> wrote:

> Hi Edwin,
>
> Thank you for reaching out to Tika. As I mentioned [0], the issue
> appears to be that the pdf file doesn’t contain Unicode mappings for the
> characters in the document. This means that PDFBox has no way of
> converting character codes within the PDF into anything useful. I checked
> with pdftotext, and it also didn’t pull out anything useful.
>
> I’m not a PDF expert, and you may want to drop a note to the PDFBox
> users list to see if someone there might have a workaround/solution.
>
> Best,
>
> Tim
>
> [0]
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201512.mbox/%3cby2pr09mb11297223e13e266cfb2a5ffc7...@by2pr09mb112.namprd09.prod.outlook.com%3E
>
> *From:* Zheng Lin Edwin Yeo [mailto:[email protected]]
> *Sent:* Friday, December 18, 2015 4:44 AM
> *To:* [email protected]
> *Subject:* Issues with extraction content of PDF files
>
> Hi,
>
> I'm indexing some PDF documents in Solr. However, for certain PDF files,
> there are chinese text in the documents, but after indexing, what is
> indexed in the content is either a series of "??????" or an empty content.
>
> What could be the reason that causes this?
>
> I've shared one of the file with the issue on dropbox, which you can
> access via the link here:
> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
>
> Regards,
>
> Edwin
