I'm having the same issue with nutch-1.2.

$ bin/nutch org.apache.nutch.parse.ParserChecker http://www.egamaster.com/datos/politica_fr.pdf
--------- Url ---------------
http://www.egamaster.com/datos/politica_fr.pdf
--------- ParseData ---------
Version: 5
Status: *failed*(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@2918958e
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:

$ bin/nutch org.apache.nutch.parse.ParserChecker http://wsdownload.bbc.co.uk/learningenglish/pdf/2011/03/110303122858_110303_6min_heart.pdf
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)

$ java -jar /usr/local/bin/tika-app-0.9.jar http://www.egamaster.com/datos/politica_fr.pdf
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="Creation-Date" content="2009-01-15T11:50:32Z"/>
....

<property>
  <name>http.content.limit</name>
  <value>*200000* <!-- 65536--></value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

<mime-type type="application/pdf">
  <alias type="application/x-pdf"/>
  <acronym>PDF</acronym>
  <comment>Portable Document Format</comment>
  <magic priority="50">
    <match value="%PDF-" type="string" offset="0"/>
  </magic>
  <glob pattern="*.pdf"/>
</mime-type>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-tf|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library. Nutch now also includes integration with
  Tika to leverage Tika's parsing capabilities for multiple content types.
  The existing Nutch parser implementations will likely be phased out in the
  next release or so, as such, it is a good idea to begin migrating away from
  anything not provided by parse-tika.
  </description>
</property>

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
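
For readers following the thread, here is a minimal Python sketch of the rule stated in the http.content.limit description quoted above: a nonnegative limit truncates longer content, a negative one does not. It is not Nutch code. The function names `apply_content_limit` and `looks_truncated_pdf` and the 300000-byte stand-in document are hypothetical, and the `%%EOF` check is only a rough heuristic. Whether the PDFs in this message actually exceed the configured limit is not established here.

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Model the documented rule: limit >= 0 truncates, a negative limit does not."""
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content


def looks_truncated_pdf(content: bytes) -> bool:
    """Rough heuristic: a complete PDF normally carries an %%EOF marker near its end."""
    return b"%%EOF" not in content[-1024:]


# Hypothetical stand-in for a 300000-byte PDF (not a real document).
pdf = b"%PDF-1.4\n" + b"0" * 299_986 + b"%%EOF"

for limit in (65536, 200000, -1):
    fetched = apply_content_limit(pdf, limit)
    print(limit, len(fetched), looks_truncated_pdf(fetched))
```

Under these assumptions, any document larger than the limit reaches the parser without its trailer, so comparing the file size on the server with the configured limit is a cheap first check.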

