David,
this is a known issue; it is tracked at:
https://issues.apache.org/jira/browse/PDFBOX-536
It will be solved in the next release.
Greetings,
Erik
Hi,
I have several PDFs that open without any problem in Adobe Reader (8.1) and
Acrobat 9. I am using Tika 0.4 (I also tested with Tika 0.5; extraction fails
in both versions). I have tracked the issue down to PDFBox as follows.
When extracting text with PDFBox 0.7.3 (and also with 0.8.0-incubating), I get
the following exception (shown for ...0.8.0-incubating):
Exception in thread "main" org.apache.pdfbox.exceptions.WrappedIOException
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:860)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:825)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:750)
at org.apache.pdfbox.ExtractText.main(ExtractText.java:173)
Caused by: java.util.NoSuchElementException
at java.util.AbstractList$Itr.next(AbstractList.java:427)
at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
... 4 more
The source line PDFXrefStreamParser.java:115 is:
Integer objID = (Integer)objIter.next();
I downloaded the source for ...0.8.0-incubating and debugged it to see if I
could extract any further information. I am not an expert on the internals of
a PDF.
In the case of this PDF, objIter is instantiated and initialized before the
while loop on line 100, and it holds 879 entries. Everything works fine until
the while loop ( while(pdfSource.available() > 0) ) on line 100 reaches its
880th iteration, at which point objIter.next() has nothing left to return. I
verified this by stepping through a debug build of PDFBox 0.8.0-incubating in
the Eclipse debugger.
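The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not PDFBox code: the class XrefIteratorSketch, its methods, and the use of a plain counter in place of pdfSource.available() are all hypothetical stand-ins. It only shows why a loop driven by one condition (data remaining in the stream) throws when it pulls from an iterator that runs out first, and how a hasNext() check avoids the exception. Whether such a guard matches the fix tracked in PDFBOX-536 is an assumption.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only; none of these names exist in PDFBox.
class XrefIteratorSketch {

    // Mirrors the reported pattern: the loop condition looks only at the
    // remaining stream data, never at the iterator.
    static int unguarded(int entriesInStream, List<Integer> objIds) {
        Iterator<Integer> objIter = objIds.iterator();
        int read = 0;
        int remaining = entriesInStream; // stand-in for pdfSource.available()
        while (remaining > 0) {
            Integer objID = objIter.next(); // throws once the IDs are exhausted
            remaining--;
            read++;
        }
        return read;
    }

    // Same loop, but it also stops when no object IDs are left.
    static int guarded(int entriesInStream, List<Integer> objIds) {
        Iterator<Integer> objIter = objIds.iterator();
        int read = 0;
        int remaining = entriesInStream;
        while (remaining > 0 && objIter.hasNext()) {
            objIter.next();
            remaining--;
            read++;
        }
        return read;
    }

    public static void main(String[] args) {
        // 879 object IDs, but the stream claims 880 entries, as in the report.
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 879; i++) {
            ids.add(i);
        }
        System.out.println("guarded read: " + guarded(880, ids));
        try {
            unguarded(880, ids);
        } catch (NoSuchElementException e) {
            System.out.println("unguarded failed on the 880th iteration");
        }
    }
}
```

The guarded variant silently ignores the trailing stream data, which may or may not be the right behaviour for a real cross-reference stream; it is shown only to make the mismatch between the two loop conditions concrete.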
Any words of wisdom? The PDF would seem to be corrupt, except that PDF viewers
and editors work with it just fine as noted above.
David