Using Tika 1.5 (the latest release, which uses PDFBox), I'm seeing the following IOException when parsing certain PDFs:

java.io.IOException: Error: Header doesn't contain versioninfo
   at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:335)
   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:177)
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
...

Should this be something more specific than just an IOException, so that Tika can know whether to just let it bubble up as an IOException, or encapsulate it into a TikaException?

I don't know enough about the PDFBox project to know whether it ever throws anything besides IOException. Perhaps there could be a PDFParseException or something like that for known parse failures such as this one. But if PDFBox only ever throws IOException in those known situations, then Tika could simply rely on that and wrap any IOException from PDFBox in a TikaException.
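To make the first option concrete, here is a rough Java sketch of what I have in mind. It is only an illustration under assumptions: PdfParseException, DocumentParseException (standing in for TikaException), parseHeader and parse are all hypothetical names defined locally, not real PDFBox or Tika classes or methods, and the header check is deliberately simplified.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class PdfExceptionSketch {

    // Hypothetical: a more specific exception the PDF library could throw
    // for malformed documents. It still extends IOException, so existing
    // callers that catch IOException keep working.
    static class PdfParseException extends IOException {
        PdfParseException(String message) {
            super(message);
        }
    }

    // Hypothetical stand-in for TikaException, defined here so the sketch
    // compiles without Tika on the classpath.
    static class DocumentParseException extends Exception {
        DocumentParseException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Simplified stand-in for the library's header check: reads up to five
    // bytes and requires them to be "%PDF-".
    static void parseHeader(InputStream in) throws IOException {
        byte[] head = new byte[5];
        int total = 0;
        while (total < head.length) {
            int n = in.read(head, total, head.length - total);
            if (n < 0) {
                break; // end of stream before the header was complete
            }
            total += n;
        }
        String prefix = new String(head, 0, total, StandardCharsets.US_ASCII);
        if (!prefix.equals("%PDF-")) {
            throw new PdfParseException("Header doesn't contain versioninfo");
        }
    }

    // Caller side: wrap only the specific parse failure and let any other
    // IOException (for example a real read error) bubble up unchanged.
    static void parse(InputStream in) throws IOException, DocumentParseException {
        try {
            parseHeader(in);
        } catch (PdfParseException e) {
            throw new DocumentParseException("Unable to parse PDF", e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] notAPdf = "hello".getBytes(StandardCharsets.US_ASCII);
        try {
            parse(new ByteArrayInputStream(notAPdf));
        } catch (DocumentParseException e) {
            System.out.println("wrapped: " + e.getCause().getMessage());
        }
    }
}
```

With a subclass like this, the caller can tell a bad document apart from a failing stream without inspecting exception messages. Without it, the caller would have to wrap every IOException, which is the second option above.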

What do you think?

Thanks,
Daniel Gibby
