On Tue, 22 Jul 2014, Clemens Wyss DEV wrote:
I have thousands of PDFs that are extracted using Tika and then indexed/analyzed in Lucene. And there seems to be "cryptic" text (binary data?) in some of the PDFs.

Are you able to identify a small PDF (ideally under 100 KB) which shows the problem? If so, please open a new JIRA issue and upload the problematic file.

It might be a Tika bug, or it might be one in the upstream Apache PDFBox, but we'll need a sample file to work it out!

Nick
