Re: Issues with extraction content of PDF files

Zheng Lin Edwin Yeo Fri, 18 Dec 2015 09:59:38 -0800

Hi Tim,

Thanks for your reply and advice.


I've dropped a note to the PDFBox users list too. I will also update here
if I find any solutions from there.

Regards,
Edwin


On 18 December 2015 at 21:28, Allison, Timothy B. <[email protected]>
wrote:

> Hi Edwin,
>
>   Thank you for reaching out to Tika.  As I mentioned [0], the issue
> appears to be that the PDF file doesn’t contain Unicode mappings for the
> characters in the document.  This means that PDFBox has no way of
> converting character codes within the PDF into anything useful.  I checked
> with pdftotext, and it also didn’t pull out anything useful.
>
>    I’m not a PDF expert, and you may want to drop a note to the PDFBox
> users list to see if someone there might have a workaround/solution.
>
> Best,
> Tim
>
> [0]
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201512.mbox/%3cby2pr09mb11297223e13e266cfb2a5ffc7...@by2pr09mb112.namprd09.prod.outlook.com%3E
>
>
>
> *From:* Zheng Lin Edwin Yeo [mailto:[email protected]]
> *Sent:* Friday, December 18, 2015 4:44 AM
> *To:* [email protected]
> *Subject:* Issues with extraction content of PDF files
>
>
>
> Hi,
>
>
>
> I'm indexing some PDF documents in Solr. However, certain PDF files
> contain Chinese text, and after indexing, what is indexed in the content
> is either a series of "??????" or empty content.
>
>
>
> What could be the reason that causes this?
>
>
>
> I've shared one of the files with the issue on Dropbox, which you can
> access via the link here:
> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
>
>
>
>
>
> Regards,
>
> Edwin
>
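
The symptom described in this thread (content that is either a run of "?" characters or empty) can be flagged after extraction, so that affected files are easy to list before indexing. The following is only a minimal sketch and not part of Tika, PDFBox, or Solr: the function name `looks_unextractable` and the 0.5 threshold are hypothetical choices made for illustration, and it assumes the extracted text is already available as a string.

```python
def looks_unextractable(text: str, threshold: float = 0.5) -> bool:
    """Return True when extracted text is empty or mostly placeholder characters.

    Hypothetical helper: the name and the default threshold are assumptions,
    not something taken from Tika, PDFBox, or Solr.
    """
    # Ignore whitespace so page layout does not dilute the ratio.
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True  # nothing but whitespace: treat as empty content

    # Count "?" and the Unicode replacement character U+FFFD as placeholders.
    placeholders = sum(1 for ch in visible if ch in ("?", "\ufffd"))
    return placeholders / len(visible) >= threshold


if __name__ == "__main__":
    for sample in ["??????", "", "Desmophen 670 BA"]:
        print(repr(sample), looks_unextractable(sample))
```

A check like this only detects the problem. It cannot recover the text, because, as Tim notes above, a PDF without Unicode mappings gives the extractor nothing to convert the character codes into.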
