I've tried both URLs with nutch-1.3 with

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description/>
</property>

and ./nutch org.apache.nutch.parse.ParserChecker ran without any problems.

It could be that the nutch-site.xml is loaded from conf/ and then overridden by the one found in the job file (which is the reason why we separated runtime/deploy from runtime/local in nutch 1.3 and 2.0). Try deleting the job file or generating a fresh one with 'ant job' and see if this fixes the issue.

Julien

On 21 March 2011 13:43, Gabriele Kahlout <[email protected]> wrote:

> I'm also having the same issue with nutch-1.2.
>
> $ bin/nutch org.apache.nutch.parse.ParserChecker http://www.egamaster.com/datos/politica_fr.pdf
> ---------
> Url
> ---------------
> http://www.egamaster.com/datos/politica_fr.pdf---------
> ParseData
> ---------
> Version: 5
> Status: *failed*(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@2918958e
> Title:
> Outlinks: 0
> Content Metadata:
> Parse Metadata:
>
> $ bin/nutch org.apache.nutch.parse.ParserChecker http://wsdownload.bbc.co.uk/learningenglish/pdf/2011/03/110303122858_110303_6min_heart.pdf
> Exception in thread "main" java.lang.NullPointerException
>     at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
>
>
> $ java -jar /usr/local/bin/tika-app-0.9.jar http://www.egamaster.com/datos/politica_fr.pdf
> <?xml version="1.0" encoding="UTF-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="xmpTPg:NPages" content="1"/>
> <meta name="Creation-Date" content="2009-01-15T11:50:32Z"/>
> ....
>
>
>
> <property>
>   <name>http.content.limit</name>
>   <value>*200000* <!-- 65536--></value>
>   <description>The length limit for downloaded content using the http
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
>
>
> <mime-type type="application/pdf">
>   <alias type="application/x-pdf"/>
>   <acronym>PDF</acronym>
>   <comment>Portable Document Format</comment>
>   <magic priority="50">
>     <match value="%PDF-" type="string" offset="0"/>
>   </magic>
>   <glob pattern="*.pdf"/>
> </mime-type>
>
> <property>
>   <name>plugin.includes</name>
>
>   <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-tf|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
>   enable protocol-httpclient, but be aware of possible intermittent
>   problems with the underlying commons-httpclient library. Nutch now also
>   includes integration with Tika to leverage Tika's parsing capabilities
>   for multiple content types. The existing Nutch parser implementations
>   will likely be phased out in the next release or so, as such, it is
>   a good idea to begin migrating away from anything not provided by
>   parse-tika.
>   </description>
> </property>
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the email
> does not contain a valid code then the email is not received. A valid code
> starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> L(-[a-z]+[0-9]X)).
>
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
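
For reference, the setting used in the successful nutch-1.3 run can be sketched as a complete conf/nutch-site.xml override. This is only a minimal sketch: the <configuration> wrapper is assumed to follow the usual Hadoop-style layout of nutch-default.xml, and the description text is illustrative rather than copied from Nutch. The quoted failure used a limit of 200000 bytes, so a PDF larger than that would be truncated before parsing, which would be consistent with the expected='endstream' error, though that is not confirmed in this thread.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml; values here override nutch-default.xml. -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- A negative value disables truncation of content fetched over http. -->
    <value>-1</value>
    <description>Illustrative override: do not truncate downloaded content.</description>
  </property>
</configuration>
```

If a job file is present, it may still carry an older copy of nutch-site.xml, so the change probably only takes effect after deleting the job file or regenerating it with 'ant job', as suggested above.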

