Hi,

I have several PDFs that open without any problem in Adobe Reader (8.1) and 
Acrobat 9.  I am using Tika 0.4 (and see the same failure with Tika 0.5).  I 
have tracked the issue down to PDFBox as follows.

When extracting text with PDFBox 0.7.3 (and also with 0.8.0-incubating), I get 
the following (trace from 0.8.0-incubating):

Exception in thread "main" org.apache.pdfbox.exceptions.WrappedIOException
      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:860)
      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:825)
      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:750)
      at org.apache.pdfbox.ExtractText.main(ExtractText.java:173)
Caused by: java.util.NoSuchElementException
      at java.util.AbstractList$Itr.next(AbstractList.java:427)
      at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
      at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
      ... 4 more


The source line  PDFXrefStreamParser.java:115  is:

                   Integer objID = (Integer)objIter.next();

I downloaded the source for ...0.8.0-incubating and debugged it to see if I 
could extract any further information.  I am not an expert on the internals of 
a PDF.

In the case of this PDF, objIter is instantiated and initialized before the 
while loop on line 100, and it holds 879 elements.  All works fine until the 
while loop ( while(pdfSource.available() > 0) ) on line 100 reaches its 880th 
iteration, at which point objIter is exhausted and next() throws.  I verified 
this by stepping through a debug build of PDFBox 0.8.0-incubating in the 
Eclipse debugger.
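
To make the failure mode concrete, here is a minimal sketch of what I think is 
happening.  It is not PDFBox code: the class XrefLoopSketch, its two methods, 
and the fixed entry width are all hypothetical stand-ins.  It only assumes 
that each pass of the loop consumes one entry from the stream and one id from 
the iterator, so a stream holding more entries than there are ids runs the 
iterator dry.

```java
import java.io.ByteArrayInputStream;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the loop described above; this is NOT PDFBox code.
class XrefLoopSketch {

    // Unguarded: one next() per entry in the stream, like the failing loop.
    // Throws NoSuchElementException when the stream has more entries than ids.
    static int consumeUnguarded(ByteArrayInputStream in, List<Integer> objIds, int entryWidth) {
        Iterator<Integer> objIter = objIds.iterator();
        int count = 0;
        while (in.available() > 0) {
            in.skip(entryWidth);   // pretend to read one fixed-width entry
            objIter.next();        // fails once the ids are used up
            count++;
        }
        return count;
    }

    // Guarded: also stop when the ids are used up (an assumed workaround,
    // which silently ignores any trailing entries left in the stream).
    static int consumeGuarded(ByteArrayInputStream in, List<Integer> objIds, int entryWidth) {
        Iterator<Integer> objIter = objIds.iterator();
        int count = 0;
        while (in.available() > 0 && objIter.hasNext()) {
            in.skip(entryWidth);
            objIter.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 2, 3);
        // Four 2-byte entries but only three ids: the guarded loop stops after three.
        System.out.println(consumeGuarded(new ByteArrayInputStream(new byte[8]), ids, 2));
    }
}
```

Whether ignoring the extra entries is actually correct for a cross-reference 
stream is exactly what I do not know; the guard only shows where the mismatch 
between the stream length and the id list surfaces.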

Any words of wisdom?  The PDF would seem to be corrupt, except that PDF viewers 
and editors work with it just fine as noted above.

David
