Re: Valid PDF not Parsing: NoSuchElementException

Erik Scholtz, ArgonSoft GmbH Tue, 26 Jan 2010 09:32:58 -0800

David,

this is a known issue; you can find it at:


https://issues.apache.org/jira/browse/PDFBOX-536

It will be solved in the next release.

Greetings,
Erik

Hi,

I have several PDFs that open without any problem in Adobe Reader (8.1) and 
Acrobat 9.  I am using Tika 0.4 (and see the same failure with Tika 0.5).  I 
have tracked the issue down to PDFBox as follows.

When extracting text with PDFBox 0.7.3 (and also with 0.8.0-incubating), I get 
the following (trace from ...0.8.0-incubating):

Exception in thread "main" org.apache.pdfbox.exceptions.WrappedIOException
      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:860)
      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:825)
      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:750)
      at org.apache.pdfbox.ExtractText.main(ExtractText.java:173)
Caused by: java.util.NoSuchElementException
      at java.util.AbstractList$Itr.next(AbstractList.java:427)
      at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
      at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
      ... 4 more


The source line PDFXrefStreamParser.java:115 is:

                   Integer objID = (Integer)objIter.next();

I downloaded the source for ...0.8.0-incubating and debugged it to see if I 
could extract any further information.  I am not an expert on the internals of 
a PDF.

In the case of this PDF, objIter is instantiated and initialized before the 
while loop on line 100.  In my case, objIter has 879 elements.  All works fine 
until the while loop ( while(pdfSource.available() > 0) ) on line 100 reaches 
its 880th iteration, where objIter.next() has nothing left to return.  I 
verified this by stepping through a debug build of PDFBox 0.8.0-incubating in 
the Eclipse debugger.
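
To make the failure mode concrete, here is a minimal sketch of the pattern 
described above: a loop driven by the bytes remaining in a stream that calls 
next() on an iterator holding fewer elements than the loop has iterations.  
This is not the PDFBox source.  The class and method names are hypothetical, 
and "one byte per entry" is an assumption made only to keep the example small.  
The guarded variant simply shows what an added hasNext() check does; it is not 
necessarily how PDFBOX-536 is resolved.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for the loop described above; not PDFBox code.
public class XrefLoopSketch {

    // Loops while the stream has data and calls next() unconditionally.
    // Throws NoSuchElementException if the stream holds more entries than IDs.
    static int readUnguarded(ByteArrayInputStream source, List<Integer> objIds) {
        Iterator<Integer> objIter = objIds.iterator();
        int consumed = 0;
        while (source.available() > 0) {
            source.read();                   // assumption: one byte per entry
            Integer objID = objIter.next();  // fails once the IDs are used up
            consumed++;
        }
        return consumed;
    }

    // Same loop, but it also stops when the iterator is exhausted.
    static int readGuarded(ByteArrayInputStream source, List<Integer> objIds) {
        Iterator<Integer> objIter = objIds.iterator();
        int consumed = 0;
        while (source.available() > 0 && objIter.hasNext()) {
            source.read();
            Integer objID = objIter.next();
            consumed++;
        }
        return consumed;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 879; i++) {
            ids.add(i);
        }
        byte[] data = new byte[880];  // one entry more than there are IDs

        try {
            readUnguarded(new ByteArrayInputStream(data), ids);
        } catch (NoSuchElementException e) {
            System.out.println("unguarded loop ran out of object IDs");
        }
        System.out.println("guarded loop consumed "
                + readGuarded(new ByteArrayInputStream(data), ids) + " entries");
    }
}
```

With 879 IDs and 880 entries, the unguarded loop fails on its 880th pass, 
which matches what I see in the debugger.  Whether the extra data in the real 
stream is padding, an off-by-one in the index ranges, or something else is 
exactly what I cannot tell.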

Any words of wisdom?  The PDF would seem to be corrupt, except that PDF viewers 
and editors work with it just fine as noted above.

David
