I will try to find one ;) Another question: Is it possible to tell Tika to extract only the first n pages?
-----Original Message-----
From: Nick Burch [mailto:[email protected]]
Sent: Tuesday, 22 July 2014 13:35
To: [email protected]
Subject: Re: AW: Determine binary pdf?

On Tue, 22 Jul 2014, Clemens Wyss DEV wrote:
> I have thousands of PDFs that are extracted using Tika and then
> indexed/analyzed in Lucene. And there seems to be "cryptic" text
> (binary data?) in some of the PDFs.

Are you able to identify a small PDF (ideally sub 100kb) which shows the problem? If so, please open a new JIRA, and upload the problematic file.

It might be a Tika bug, or it might be one in the upstream Apache PDFBox, but we'll need a sample file to work it out!

Nick
