I've tried with a small file and it worked:

$ bin/nutch org.apache.nutch.parse.ParserChecker http://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3&hl=en
[1] 15177
michaela:nutch-1.2 simpatico$ ---------
Url
---------------
http://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: Moved Temporarily
Outlinks: 1
  outlink: toUrl: https://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3 anchor: here
Content Metadata: X-Frame-Options=SAMEORIGIN Date=Mon, 21 Mar 2011 15:05:06 GMT X-XSS-Protection=1; mode=block Expires=Mon, 21 Mar 2011 15:05:06 GMT Location=https://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3 Via=1.0 cerbero.localdomain:800 (squid/2.6.STABLE21) Connection=close Content-Type=text/html; charset=UTF-8 X-Cache=MISS from cerbero.localdomain Server=GSE X-Content-Type-Options=nosniff Cache-Control=private, max-age=0
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

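For what it's worth, here is how I read the http.content.limit description quoted further down in this thread, as a rough Python sketch rather than anything taken from the Nutch source. The helper names (read_content_limit, is_truncated) are made up for illustration, and the 65536 fallback is an assumption based on the commented-out value in my config. It only restates the documented rule: a nonnegative limit truncates anything longer, a negative one disables truncation.

```python
import xml.etree.ElementTree as ET


def read_content_limit(site_xml: str, default: int = 65536) -> int:
    """Return http.content.limit from a nutch-site.xml style string.

    `default` is an assumption (the value commented out in the quoted
    config); it is returned when the property is absent.
    """
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name == "http.content.limit":
            return int((prop.findtext("value") or "").strip())
    return default


def is_truncated(content_length: int, limit: int) -> bool:
    """Apply the documented rule: limit >= 0 truncates longer content,
    a negative limit means no truncation at all."""
    return limit >= 0 and content_length > limit


if __name__ == "__main__":
    site_xml = """
    <configuration>
      <property>
        <name>http.content.limit</name>
        <value>200000</value>
      </property>
    </configuration>
    """
    limit = read_content_limit(site_xml)
    # Hypothetical sizes, only to exercise the rule.
    for size in (1_000, 300_000):
        print(size, "truncated" if is_truncated(size, limit) else "kept whole")
```

If the limit that is actually in effect is smaller than the PDF, the parser would be handed a cut-off stream, which would be consistent with the expected='endstream' failure; that is a guess on my part, not something I have confirmed.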
On Mon, Mar 21, 2011 at 3:34 PM, Julien Nioche <[email protected]> wrote:

> I've tried both URLs with nutch-1.3 with
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description/>
> </property>
>
> and ./nutch org.apache.nutch.parse.ParserChecker ran without any problems.
>
> It could be that the nutch-site.xml is loaded from conf/ then overridden by
> the one found in the job file (which is the reason why we separated the
> runtime/deploy from runtime/local in nutch 1.3 and 2.0). Try deleting the
> job file or generating a fresh one with 'ant job' and see if this fixes the
> issue.
>
> Julien
>
>
> On 21 March 2011 13:43, Gabriele Kahlout <[email protected]> wrote:
>
>> I'm also having the same issue with nutch-1.2.
>>
>> $ bin/nutch org.apache.nutch.parse.ParserChecker http://www.egamaster.com/datos/politica_fr.pdf
>> ---------
>> Url
>> ---------------
>> http://www.egamaster.com/datos/politica_fr.pdf---------
>> ParseData
>> ---------
>> Version: 5
>> Status: *failed*(2,0): expected='endstream' actual=''
>> org.apache.pdfbox.io.PushBackInputStream@2918958e
>> Title:
>> Outlinks: 0
>> Content Metadata:
>> Parse Metadata:
>>
>> $ bin/nutch org.apache.nutch.parse.ParserChecker http://wsdownload.bbc.co.uk/learningenglish/pdf/2011/03/110303122858_110303_6min_heart.pdf
>> Exception in thread "main" java.lang.NullPointerException
>>     at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
>>
>>
>> $ java -jar /usr/local/bin/tika-app-0.9.jar http://www.egamaster.com/datos/politica_fr.pdf
>> <?xml version="1.0" encoding="UTF-8"?>
>> <html xmlns="http://www.w3.org/1999/xhtml">
>> <head>
>> <meta name="xmpTPg:NPages" content="1"/>
>> <meta name="Creation-Date" content="2009-01-15T11:50:32Z"/>
>> ....
>>
>>
>>
>> <property>
>>   <name>http.content.limit</name>
>>   <value>*200000* <!-- 65536--></value>
>>   <description>The length limit for downloaded content using the http
>>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>>   than it will be truncated; otherwise, no truncation at all. Do not
>>   confuse this setting with the file.content.limit setting.
>>   </description>
>> </property>
>>
>>
>> <mime-type type="application/pdf">
>>   <alias type="application/x-pdf"/>
>>   <acronym>PDF</acronym>
>>   <comment>Portable Document Format</comment>
>>   <magic priority="50">
>>     <match value="%PDF-" type="string" offset="0"/>
>>   </magic>
>>   <glob pattern="*.pdf"/>
>> </mime-type>
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-tf|urlnormalizer-(pass|regex|basic)</value>
>>   <description>Regular expression naming plugin directory names to
>>   include. Any plugin not matching this expression is excluded.
>>   In any case you need at least include the nutch-extensionpoints plugin.
>>   By default Nutch includes crawling just HTML and plain text via HTTP,
>>   and basic indexing and search plugins. In order to use HTTPS please
>>   enable protocol-httpclient, but be aware of possible intermittent
>>   problems with the underlying commons-httpclient library. Nutch now also
>>   includes integration with Tika to leverage Tika's parsing capabilities
>>   for multiple content types. The existing Nutch parser implementations
>>   will likely be phased out in the next release or so, as such, it is
>>   a good idea to begin migrating away from anything not provided by
>>   parse-tika.
>>   </description>
>> </property>
>>
>> --
>> Regards,
>> K. Gabriele
>>
>> --- unchanged since 20/9/10 ---
>> P.S. If the subject contains "[LON]" or the addressee acknowledges the
>> receipt within 48 hours then I don't resend the email.
>> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>>
>> If an email is sent by a sender that is not a trusted contact or the email
>> does not contain a valid code then the email is not received. A valid code
>> starts with a hyphen and ends with "X".
>> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
>> L(-[a-z]+[0-9]X)).
>>
>>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

