Hi,
I'm trying to use PDFBox to index PDF files via the Nutch plugin. Nutch uses
PDFBox 0.7.4, but I also tried PDFBox 0.8.0-incubating, with the same result.
I am unable to parse any PDFs created by ScanSoft PDF Create! 3. I'm seeing
the following error:
In 0.7.4/Nutch:
2010-01-06 21:21:35,679 WARN parse.pdf - General exception in PDF parser: Error: value is not an integer type actual='-'
2010-01-06 21:21:35,679 WARN parse.pdf - java.io.IOException: Error: value is not an integer type actual='-'
2010-01-06 21:21:35,679 WARN parse.pdf - at org.pdfbox.cos.COSInteger.<init>(COSInteger.java:85)
2010-01-06 21:21:35,679 WARN parse.pdf - at org.pdfbox.cos.COSNumber.get(COSNumber.java:110)
2010-01-06 21:21:35,679 WARN parse.pdf - at org.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:260)
2010-01-06 21:21:35,679 WARN parse.pdf - at org.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:115)
2010-01-06 21:21:35,680 WARN parse.pdf - at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:133)
2010-01-06 21:21:35,680 WARN parse.pdf - at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:206)
2010-01-06 21:21:35,680 WARN parse.pdf - at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:178)
2010-01-06 21:21:35,680 WARN parse.pdf - at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
2010-01-06 21:21:35,680 WARN parse.pdf - at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
2010-01-06 21:21:35,680 WARN parse.pdf - at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
2010-01-06 21:21:35,680 WARN parse.pdf - at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)
2010-01-06 21:21:35,680 WARN parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:102)
2010-01-06 21:21:35,680 WARN parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
2010-01-06 21:21:35,680 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
2010-01-06 21:21:35,680 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
When running PDFBox 0.8.0's ExtractText:
Exception in thread "main" java.io.IOException: Error: value is not an integer type actual='-'
    at org.apache.pdfbox.cos.COSInteger.<init>(COSInteger.java:71)
    at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:96)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:255)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:101)
    at org.apache.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:119)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:216)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:188)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:229)
Apparently, PDFBox attempts to interpret a lone '-' as a Long.
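
To illustrate the failure mode, here is a minimal sketch of how a bare '-' token fails integer conversion in plain Java. The class IntegerTokenSketch and the method toLong are hypothetical names made up for this example; this is not PDFBox source, and the assumption that the conversion boils down to Long.parseLong is only a guess based on the error message.

```java
import java.io.IOException;

class IntegerTokenSketch {
    // Hypothetical stand-in for the conversion step; NOT PDFBox code.
    // Assumption: the token text is handed to Long.parseLong and a
    // NumberFormatException is rewrapped as an IOException.
    static long toLong(String token) throws IOException {
        try {
            return Long.parseLong(token);
        } catch (NumberFormatException e) {
            throw new IOException(
                "Error: value is not an integer type actual='" + token + "'");
        }
    }

    public static void main(String[] args) {
        for (String token : new String[] {"-5", "-"}) {
            try {
                System.out.println(token + " -> " + toLong(token));
            } catch (IOException e) {
                // A sign with no digits after it is not a valid long.
                System.out.println(token + " -> " + e.getMessage());
            }
        }
    }
}
```

If this guess is right, the tokenizer is handing over a sign character with no digits attached, which would point at how the number token is split rather than at the conversion itself.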
pdfinfo and pdftotext, part of Poppler, do not have trouble parsing these
files.
I don't want to post the PDF in question, but would be willing to email it
to an interested developer.
The PDF contains:
*/CreationDate (D:20081026134850-05'00')*
Not having read the PDF spec, I'm guessing that PDFBox may have trouble
parsing this date (and misinterprets the '-' as the next token).
Looking at org.apache.pdfbox.util.DateConverter, I see:
private static final SimpleDateFormat[] POTENTIAL_FORMATS = new SimpleDateFormat[] {
    new SimpleDateFormat("EEEE, dd MMM yyyy hh:mm:ss a"),
    new SimpleDateFormat("EEEE, MMM dd, yyyy hh:mm:ss a"),
    new SimpleDateFormat("MM/dd/yyyy hh:mm:ss"),
    new SimpleDateFormat("MM/dd/yyyy")
};
Perhaps the date format used in these PDF files needs to be added to
POTENTIAL_FORMATS?
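
For reference, a rough sketch of parsing a date string of the shape D:YYYYMMDDHHmmSSOHH'mm' with the standard java.time classes. PdfDateSketch and parse are hypothetical names, not part of PDFBox, and the sketch assumes all fourteen date and time digits are present. Note that the stack traces above go through PDFStreamParser rather than DateConverter, so it is not established that the date is what triggers the exception.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class PdfDateSketch {
    // Fixed-width date and time part: year, month, day, hour, minute, second.
    private static final DateTimeFormatter LOCAL_PART =
        DateTimeFormatter.ofPattern("uuuuMMddHHmmss");

    // Hypothetical helper; assumes the full 14-digit form is present.
    static OffsetDateTime parse(String raw) {
        String s = raw.startsWith("D:") ? raw.substring(2) : raw;
        if (s.length() < 14) {
            throw new IllegalArgumentException("date too short: " + raw);
        }
        LocalDateTime local = LocalDateTime.parse(s.substring(0, 14), LOCAL_PART);

        // Remainder is the offset, e.g. -05'00' -> -0500 once quotes are removed.
        String tz = s.substring(14).replace("'", "");
        ZoneOffset offset = (tz.isEmpty() || tz.startsWith("Z"))
            ? ZoneOffset.UTC            // assumption: treat a missing offset as UTC
            : ZoneOffset.of(tz);
        return OffsetDateTime.of(local, offset);
    }

    public static void main(String[] args) {
        System.out.println(parse("D:20081026134850-05'00'"));
    }
}
```

The apostrophes in the offset have to be stripped before the offset can be parsed, so simply adding one more pattern to a list of formats might not be enough on its own.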
Thanks for any insight you could provide.
This hiccup is preventing me from ingesting several PDFs into Nutch.
- Godmar