Using Tika 1.5 (latest release which uses PDFBox), I'm seeing the
following IOException when parsing certain PDFs:
java.io.IOException: Error: Header doesn't contain versioninfo
at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:335)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:177)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
...
Should PDFBox throw something more specific than a plain IOException here,
so that Tika can tell whether to let it bubble up as an IOException or
wrap it in a TikaException?
I don't know enough about the PDFBox project to know whether it ever
throws anything other than IOException. Perhaps there could be a
PDFParseException or something like that for known failure conditions,
such as the malformed header above. But if PDFBox only ever throws
IOException in such known conditions, then Tika could simply rely on
that and wrap any IOException from PDFBox in a TikaException.
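To make the first option concrete, here is a minimal sketch of what I have in mind. It is not PDFBox or Tika code: PdfParseException, SketchTikaException, SketchPdfParser and SketchTikaSide are all hypothetical names made up for illustration, and the header check is a simplified stand-in, assuming only that a valid file starts with "%PDF-".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical: a more specific exception for malformed documents.
// Extending IOException keeps existing "catch (IOException e)" code working.
class PdfParseException extends IOException {
    PdfParseException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for TikaException, so the sketch needs no Tika jar.
class SketchTikaException extends Exception {
    SketchTikaException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical, simplified stand-in for the PDFBox header check.
class SketchPdfParser {
    static void parseHeader(InputStream in) throws IOException {
        byte[] head = new byte[5];
        int n = 0;
        while (n < head.length) {
            // Genuine I/O failures propagate from here as plain IOExceptions.
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // end of stream before five bytes were read
            }
            n += r;
        }
        String marker = new String(head, 0, n, StandardCharsets.US_ASCII);
        if (!marker.equals("%PDF-")) {
            // Known "bad document" condition: report it with the specific type.
            throw new PdfParseException("Error: Header doesn't contain versioninfo");
        }
    }
}

// Hypothetical caller, playing the role of the Tika PDF parser.
class SketchTikaSide {
    static void parse(InputStream in) throws IOException, SketchTikaException {
        try {
            SketchPdfParser.parseHeader(in);
        } catch (PdfParseException e) {
            // Malformed document: wrap it. Any other IOException bubbles up as-is.
            throw new SketchTikaException("Unable to parse PDF", e);
        }
    }
}

class PdfExceptionSketch {
    public static void main(String[] args) throws IOException {
        byte[] bad = "not a pdf".getBytes(StandardCharsets.US_ASCII);
        try {
            SketchTikaSide.parse(new ByteArrayInputStream(bad));
            System.out.println("parsed");
        } catch (SketchTikaException e) {
            System.out.println("wrapped: " + e.getCause().getMessage());
        }
    }
}
```

Because the hypothetical exception extends IOException, callers that already catch IOException would keep working, while Tika could catch the subclass first and treat everything else as a real I/O problem.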
What do you think?
Thanks,
Daniel Gibby