Update: my first hunch that this error is related to date parsing was wrong. The error actually occurs inside a 'stream' element while parsing a number. The stream contains multiple 'Tm' sequences such as

1 0 0 1 - 783 Tm

According to PDF 1.7 [1], the 'Tm' operator must be preceded by six numbers, of which the fifth denotes the 'x' component of the translation (in what I assume are homogeneous coordinates). '-' is not a number in PDF, so Ben's parser is correct to throw an exception. I'm wondering, though, whether it's reasonable to substitute a '0' for a '-' where a number is expected?

I made that change to 0.8.0, which lets the parsing and text extraction complete. I'm now seeing a number of unrelated errors, which I will report in a separate thread.

 - Godmar

On Wed, Jan 6, 2010 at 10:45 PM, Godmar Back <[email protected]> wrote:
> Hi,
>
> I'm trying to use PDFBox to index PDF files via the Nutch plugin. Nutch
> uses PDFBox 0.7.4, but I also tried PDFBox 0.8.0-incubating, with the same
> effect.
>
> I am unable to parse any PDFs created by ScanSoft PDF Create! 3. I'm seeing
> the following error:
>
> In 0.7.4/Nutch:
>
> 2010-01-06 21:21:35,679 WARN parse.pdf - General exception in PDF parser: Error: value is not an integer type actual='-'
> 2010-01-06 21:21:35,679 WARN parse.pdf - java.io.IOException: Error: value is not an integer type actual='-'
> 2010-01-06 21:21:35,679 WARN parse.pdf - at org.pdfbox.cos.COSInteger.<init>(COSInteger.java:85)
> 2010-01-06 21:21:35,679 WARN parse.pdf - at org.pdfbox.cos.COSNumber.get(COSNumber.java:110)
> 2010-01-06 21:21:35,679 WARN parse.pdf - at org.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:260)
> 2010-01-06 21:21:35,679 WARN parse.pdf - at org.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:115)
> 2010-01-06 21:21:35,680 WARN parse.pdf - at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:133)
> 2010-01-06 21:21:35,680 WARN parse.pdf - at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:206)
> 2010-01-06 21:21:35,680 WARN parse.pdf - at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:178)
> 2010-01-06 21:21:35,680 WARN parse.pdf - at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
> 2010-01-06 21:21:35,680 WARN parse.pdf - at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
> 2010-01-06 21:21:35,680 WARN parse.pdf - at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
> 2010-01-06 21:21:35,680 WARN parse.pdf - at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)
> 2010-01-06 21:21:35,680 WARN parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:102)
> 2010-01-06 21:21:35,680 WARN parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
> 2010-01-06 21:21:35,680 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
> 2010-01-06 21:21:35,680 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
>
> When running PDFBox 0.8.0's ExtractText:
>
> Exception in thread "main" java.io.IOException: Error: value is not an integer type actual='-'
>     at org.apache.pdfbox.cos.COSInteger.<init>(COSInteger.java:71)
>     at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:96)
>     at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:255)
>     at org.apache.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:101)
>     at org.apache.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:119)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:216)
>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:188)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
>     at org.apache.pdfbox.ExtractText.main(ExtractText.java:229)
>
> Apparently, PDFBox attempts to interpret a '-' as a Long.
>
> pdfinfo and pdftotext, part of Poppler, do not have trouble parsing these
> files.
>
> I don't want to post the PDF in question, but would be willing to email it
> to an interested developer.
>
> The PDF contains:
>
> /CreationDate (D:20081026134850-05'00')
>
> Not having read the PDF spec, I'm guessing that PDFBox may have trouble
> parsing this date (and misinterprets the '-' as the next token).
> Looking at org.apache.pdfbox.util.DateConverter, I see:
>
>     private static final SimpleDateFormat[] POTENTIAL_FORMATS = new SimpleDateFormat[] {
>         new SimpleDateFormat("EEEE, dd MMM yyyy hh:mm:ss a"),
>         new SimpleDateFormat("EEEE, MMM dd, yyyy hh:mm:ss a"),
>         new SimpleDateFormat("MM/dd/yyyy hh:mm:ss"),
>         new SimpleDateFormat("MM/dd/yyyy")};
>
> Perhaps the date format used in these PDF files needs to be added to
> POTENTIAL_FORMATS?
>
> Thanks for any insight you could provide.
>
> This hiccup is preventing me from ingesting several PDFs into Nutch.
>
> - Godmar
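To make the substitution described in the update at the top of this message concrete, here is a minimal, standalone sketch of what "treat a bare '-' as 0 where a number is expected" could look like. It is not the actual change made to PDFBox 0.8.0 and does not use any PDFBox classes; the class name LenientTmOperands and the helpers parseOperand and parseTm are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

public class LenientTmOperands {

    // Hypothetical helper (not a PDFBox API): parse one numeric operand,
    // substituting 0 for a bare sign, which is not a valid PDF number.
    public static double parseOperand(String token) {
        if (token.equals("-") || token.equals("+")) {
            return 0.0;
        }
        return Double.parseDouble(token);
    }

    // Hypothetical helper: parse a single "a b c d e f Tm" sequence into
    // its six operands. The fifth and sixth operands are the translation.
    public static double[] parseTm(String sequence) {
        String[] tokens = sequence.trim().split("\\s+");
        if (tokens.length != 7 || !tokens[6].equals("Tm")) {
            throw new IllegalArgumentException(
                "expected six operands followed by Tm: " + sequence);
        }
        double[] operands = new double[6];
        for (int i = 0; i < 6; i++) {
            operands[i] = parseOperand(tokens[i]);
        }
        return operands;
    }

    public static void main(String[] args) {
        // The malformed sequence quoted above, as written by the producer.
        double[] m = parseTm("1 0 0 1 - 783 Tm");
        System.out.println(Arrays.toString(m));
        // prints [1.0, 0.0, 0.0, 1.0, 0.0, 783.0]
    }
}
```

Whether 0 is the right substitute is a judgment call: it keeps extraction going, but the text position on that line may not match what the producer intended.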

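Although the date turned out not to be the cause, the question about the /CreationDate format can still be illustrated. The sketch below parses the exact string quoted in the original message with a SimpleDateFormat, assuming the full form D:YYYYMMDDHHmmSS followed by a numeric offset written as HH'mm'. The class name PdfDateSketch and the method parsePdfDate are hypothetical, not part of PDFBox's DateConverter, and shorter date forms or a 'Z' offset are not handled.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class PdfDateSketch {

    // Hypothetical helper (not a PDFBox API): parse a PDF date string of
    // the form D:YYYYMMDDHHmmSS+HH'mm' or D:YYYYMMDDHHmmSS-HH'mm'.
    public static Date parsePdfDate(String pdfDate) throws ParseException {
        String s = pdfDate.startsWith("D:") ? pdfDate.substring(2) : pdfDate;
        // Drop the apostrophes so the offset becomes RFC 822 style, e.g. -0500.
        s = s.replace("'", "");
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmssZ");
        format.setLenient(false);
        return format.parse(s);
    }

    public static void main(String[] args) throws ParseException {
        Date d = parsePdfDate("D:20081026134850-05'00'");
        SimpleDateFormat utc = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        utc.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(utc.format(d));
        // prints 2008-10-26T18:48:50Z
    }
}
```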
