I will try to find one ;)
Another question:
Is it possible to tell Tika to extract only the first n pages?

-----Original Message-----
From: Nick Burch [mailto:[email protected]]
Sent: Tuesday, 22 July 2014 13:35
To: [email protected]
Subject: Re: AW: Determine binary pdf?

On Tue, 22 Jul 2014, Clemens Wyss DEV wrote:
> I have thousands of PDFs that are extracted using Tika and then
> indexed/analyzed in Lucene. And there seems to be "cryptic" text
> (binary data?) in some of the PDFs.
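
A rough way to flag such documents before they reach the index is to measure how much of the extracted text consists of characters that normally never appear in prose. The sketch below is only an illustration under stated assumptions: the class name GarbledTextCheck, its methods, and the 0.3 threshold are hypothetical and not part of Tika or Lucene, and the heuristic only catches control characters, replacement characters (U+FFFD), and private-use or unassigned code points. It will not catch text that was mapped to the wrong, but otherwise ordinary, letters.

```java
/** Hypothetical helper, not part of Tika or Lucene. */
public final class GarbledTextCheck {

    private GarbledTextCheck() {}

    /** Returns the fraction of code points that look like binary debris. */
    static double suspiciousRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /** Control, private-use, unassigned, surrogate code points and U+FFFD count as debris. */
    private static boolean isSuspicious(int cp) {
        if (Character.isWhitespace(cp)) {
            return false; // tabs and line breaks are control characters, but expected
        }
        if (cp == 0xFFFD) {
            return true; // Unicode replacement character
        }
        switch (Character.getType(cp)) {
            case Character.CONTROL:
            case Character.PRIVATE_USE:
            case Character.UNASSIGNED:
            case Character.SURROGATE:
                return true;
            default:
                return false;
        }
    }

    /** The threshold is an assumption; tune it against known-good and known-bad files. */
    static boolean looksGarbled(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

Calling GarbledTextCheck.looksGarbled(extractedText, 0.3) on the string returned by the extraction step would let suspicious documents be skipped or logged for later inspection instead of being indexed.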

Are you able to identify a small PDF (ideally under 100 KB) which shows the
problem? If so, please open a new JIRA issue and upload the problematic file.

It might be a Tika bug, or it might be one in the upstream Apache PDFBox, but 
we'll need a sample file to work it out!

Nick
