I'm getting the following error on a regular basis with PDFs on Nutch 1.1:

2010-07-13 10:57:32,719 ERROR tika.TikaParser - Error parsing http://www.careeronestop.org/TridionMutlimedia/spring2010_tcm24-5512.pdf
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.pushbackinputstr...@721ba923
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:380)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:528)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:814)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:63)
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:878)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
2010-07-13 10:57:32,788 WARN fetcher.Fetcher - Error parsing: http://www.careeronestop.org/TridionMutlimedia/spring2010_tcm24-5512.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.pushbackinputstr...@721ba923

I manually downloaded the file and it opens just fine. The document properties show that it is 438,870 bytes long, PDF version 1.4 (Acrobat 5.x). I have file.content.limit set to 1310720 (1,310,720) bytes, so file size should not be the issue. I checked several of the files that I got the error on, and every one was smaller than file.content.limit.

Any ideas on what the problem may be and how to resolve it?

Thanks,
Brad
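
A minimal sketch of the size check described above, assuming the PDFs have been saved to local disk. The function name `check_pdf` is hypothetical and nothing here is part of Nutch; the limit is simply the file.content.limit value quoted above. The `%%EOF` trailer test is an extra assumption on top of the size comparison: it is only a rough way to see whether a file on disk looks complete.

```python
import os

# Value of file.content.limit quoted in the post, in bytes.
FILE_CONTENT_LIMIT = 1310720


def check_pdf(path, limit=FILE_CONTENT_LIMIT):
    """Return (size, within_limit, has_eof_marker) for a local PDF file.

    Hypothetical helper for illustration only.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # A complete PDF normally carries a %%EOF marker near the end,
        # so only the last 1 KiB is inspected.
        f.seek(max(0, size - 1024))
        tail = f.read()
    return size, size <= limit, b"%%EOF" in tail
```

Called on the downloaded copy of the file above, the first two values would be expected to come back as 438870 and True, since 438,870 is below 1,310,720. Note that this only inspects the copy on disk, not the bytes the fetcher handed to the parser.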