I'm regularly getting the following error when parsing PDFs with Nutch 1.1:

2010-07-13 10:57:32,719 ERROR tika.TikaParser - Error parsing http://www.careeronestop.org/TridionMutlimedia/spring2010_tcm24-5512.pdf
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.pushbackinputstr...@721ba923
        at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:380)
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:528)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:814)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:63)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:878)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
2010-07-13 10:57:32,788 WARN  fetcher.Fetcher - Error parsing: http://www.careeronestop.org/TridionMutlimedia/spring2010_tcm24-5512.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.pushbackinputstr...@721ba923

I manually downloaded the file and it opens just fine in a PDF viewer.  The
document properties show that it is 438,870 bytes long, PDF version 1.4
(Acrobat 5.x).

I have file.content.limit set to 1310720 (1,310,720) bytes, so file size
should not be the issue.  I checked several of the files that produced the
error, and every one was smaller than file.content.limit.
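
An empty actual='' in the message suggests the parser reached the end of its
input before it found the endstream keyword. One way to test whether the
fetched bytes were cut short is sketched below. It assumes the content Nutch
fetched can be saved to a local file. The function name check_pdf_bytes and
the limit passed to it are illustrative and not part of Nutch. Because the
URL is fetched over http, the limit that applies may be http.content.limit
rather than file.content.limit; that is an assumption to verify against the
configuration.

```python
def check_pdf_bytes(data: bytes, content_limit: int) -> dict:
    """Report simple signs that a PDF byte string was cut off.

    content_limit is whichever Nutch content limit applies to the fetch
    (an assumption of this sketch); a negative value means "no limit".
    """
    # A complete PDF normally carries an %%EOF marker in its final bytes.
    has_eof_marker = b"%%EOF" in data[-1024:]
    # Content that was cut at the limit is at least content_limit bytes long.
    at_or_over_limit = content_limit >= 0 and len(data) >= content_limit
    return {
        "size": len(data),
        "has_eof_marker": has_eof_marker,
        "at_or_over_limit": at_or_over_limit,
    }


if __name__ == "__main__":
    import sys

    # Usage: python check_pdf.py fetched.pdf 1310720
    with open(sys.argv[1], "rb") as handle:
        print(check_pdf_bytes(handle.read(), int(sys.argv[2])))
```

If the bytes Nutch stored are shorter than the 438,870 bytes of the manual
download, or the %%EOF marker is missing, truncation during the fetch is a
likely cause.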

Any ideas on what the problem may be and how to resolve it?

Thanks
Brad
