This has been mentioned several times on the list. It is probably due to the fetch size limit. The default value in Nutch is:

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

Try setting it to -1 to see if this solves the issue.

You can also test the parsing using:

  bin/nutch org.apache.nutch.parse.ParserChecker blablabla.pdf

or by calling Tika directly on a URL, e.g.:

  java -jar /usr/local/bin/tika-0.7/tika-app/target/tika-app-0.7.jar blablabla.pdf

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

On 11 January 2011 12:18, Peter Litsegård <[email protected]> wrote:

> Hi!
>
> I'm running Nutch v1.2 and experience problems while trying to index
> PDF-documents. The error I receive is:
>
> Error parsing: <docname>.pdf: failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.pushbackinputstr...@cbf92
>
> I've inspected the security settings and printing/content copying/page
> extraction are all allowed. While inspecting the document properties I see:
>
> - PDF Producer: Adobe PDF Library 9.9
> - PDF Version: 1.5 (Acrobat 5.x)
>
> What might be the culprit here?
>
> Thanks in advance!
> /Peter
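
For reference, the suggestion above to set the limit to -1 might look like the following override. This is only a sketch: it assumes local settings are kept in conf/nutch-site.xml (which takes precedence over nutch-default.xml), and the description text is shortened for illustration.

```xml
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- A negative value disables truncation of content fetched over HTTP -->
    <value>-1</value>
    <description>No length limit for content downloaded using the http
    protocol.</description>
  </property>
</configuration>
```

The limit applies when the content is fetched, so a PDF that was already stored in truncated form would most likely need to be fetched again before the parse error goes away.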

