Hi,

I'm trying to use PDFBox to index PDF files via the Nutch plugin. Nutch uses
PDFBox 0.7.4, but I also tried PDFBox 0.8.0-incubating, with the same result.

I am unable to parse any PDFs created by ScanSoft PDF Create! 3. I'm seeing
the following error:

In 0.7.4/Nutch:

2010-01-06 21:21:35,679 WARN  parse.pdf - General exception in PDF parser: Error: value is not an integer type actual='-'
2010-01-06 21:21:35,679 WARN  parse.pdf - java.io.IOException: Error: value is not an integer type actual='-'
2010-01-06 21:21:35,679 WARN  parse.pdf - at org.pdfbox.cos.COSInteger.<init>(COSInteger.java:85)
2010-01-06 21:21:35,679 WARN  parse.pdf - at org.pdfbox.cos.COSNumber.get(COSNumber.java:110)
2010-01-06 21:21:35,679 WARN  parse.pdf - at org.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:260)
2010-01-06 21:21:35,679 WARN  parse.pdf - at org.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:115)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:133)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:206)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:178)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:102)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
2010-01-06 21:21:35,680 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

When running PDFBox 0.8.0's ExtractText:

Exception in thread "main" java.io.IOException: Error: value is not an integer type actual='-'
       at org.apache.pdfbox.cos.COSInteger.<init>(COSInteger.java:71)
       at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:96)
       at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:255)
       at org.apache.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:101)
       at org.apache.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:119)
       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:216)
       at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:188)
       at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
       at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
       at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
       at org.apache.pdfbox.ExtractText.main(ExtractText.java:229)

Apparently, PDFBox attempts to interpret a '-' as an integer (a Long).
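
To illustrate what I think is going on, here is a minimal sketch. It assumes
that the COSInteger constructor ends up in something like Long.parseLong (I
have not checked the PDFBox source for this); the class and method names are
made up for the example and are not PDFBox API.

```java
public class MinusTokenSketch {

    /**
     * Returns true if the token can be read as a long, false otherwise.
     * Hypothetical helper, not part of PDFBox.
     */
    static boolean isIntegerToken(String token) {
        try {
            Long.parseLong(token);
            return true;
        } catch (NumberFormatException e) {
            // A sign with no digits after it is not a valid number.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isIntegerToken("-5")); // a signed integer
        System.out.println(isIntegerToken("-"));  // a bare minus sign
    }
}
```

If that assumption holds, a bare '-' reaching the number parser on its own
would be enough to trigger the exception above.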

pdfinfo and pdftotext, part of Poppler, do not have trouble parsing these
files.

I don't want to post the PDF in question, but would be willing to email it
to an interested developer.

The PDF contains:

*/CreationDate (D:20081026134850-05'00')*

Not having read the PDF spec, I'm guessing that PDFBox may have trouble
parsing this date (and misinterprets the '-' as the next token).
Looking at org.apache.pdfbox.util.DateConverter, I see:

    private static final SimpleDateFormat[] POTENTIAL_FORMATS = new SimpleDateFormat[] {
        new SimpleDateFormat("EEEE, dd MMM yyyy hh:mm:ss a"),
        new SimpleDateFormat("EEEE, MMM dd, yyyy hh:mm:ss a"),
        new SimpleDateFormat("MM/dd/yyyy hh:mm:ss"),
        new SimpleDateFormat("MM/dd/yyyy")};

Perhaps the date format used in these PDF files needs to be added to
POTENTIAL_FORMATS?
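
As a rough sketch of what I have in mind, the following parses the
CreationDate value shown above with a plain SimpleDateFormat. It is not taken
from DateConverter: the class name PdfDateSketch and the method parsePdfDate
are hypothetical, and I am assuming every date has the same shape as my
sample (a "D:" prefix, 14 digits, then an offset written as -05'00').

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class PdfDateSketch {

    /**
     * Parses a date shaped like D:20081026134850-05'00'.
     * Returns null if the string does not match that shape.
     * Hypothetical helper, not part of PDFBox.
     */
    static Date parsePdfDate(String raw) {
        String s = raw.trim();
        if (s.startsWith("D:")) {
            s = s.substring(2);
        }
        // Drop the apostrophes so that -05'00' becomes -0500,
        // which the RFC 822 time zone pattern letter Z accepts.
        s = s.replace("'", "");
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmssZ", Locale.US);
        fmt.setLenient(false);
        try {
            return fmt.parse(s);
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Date d = parsePdfDate("D:20081026134850-05'00'");
        System.out.println(d == null ? "unparseable" : "epoch millis: " + d.getTime());
    }
}
```

The apostrophes are stripped before parsing because a single pattern with
quoted literals did not seem any simpler; a real fix would presumably also
need to handle dates without an offset.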

Thanks for any insight you could provide.

This hiccup is preventing me from ingesting several PDFs into Nutch.

 - Godmar
