Otherwise, following your instructions is not working. When will nutch-1.3 be released?
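
For reference, here is how I read the truncation rule in the http.content.limit description quoted below, as a small Python sketch. It only illustrates the documented behaviour and is not Nutch's actual code: `apply_content_limit` is a hypothetical helper, and the sample bytes are a made-up stand-in for a PDF, not either of the real files.

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Truncate content to `limit` bytes when limit >= 0.

    A negative limit means no truncation at all, matching the wording of
    the http.content.limit description. Hypothetical helper for illustration.
    """
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content


# Made-up stand-in for a PDF body: header, filler, end-of-file marker.
fake_pdf = b"%PDF-1.4\n" + b"x" * 300_000 + b"\n%%EOF\n"

truncated = apply_content_limit(fake_pdf, 200_000)  # limit from my config
untouched = apply_content_limit(fake_pdf, -1)       # value Julien suggested

# With the 200000-byte limit the tail of the file (including the
# end-of-file marker) is cut off; with -1 the content is left intact.
print(len(truncated), truncated.endswith(b"%%EOF\n"))
print(len(untouched), untouched.endswith(b"%%EOF\n"))
```

If a PDF is larger than the configured limit, the parser would receive a file cut off like this, which might be one reason for an error such as expected='endstream'. I have not confirmed that this is what happens here.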

On Mon, Mar 21, 2011 at 4:07 PM, Gabriele Kahlout <[email protected]> wrote:

> I've tried with a small file and it worked:
>
> $ bin/nutch org.apache.nutch.parse.ParserChecker
> http://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3&hl=en
> [1] 15177
> michaela:nutch-1.2 simpatico$ ---------
> Url
> ---------------
>
> http://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: Moved Temporarily
> Outlinks: 1
> outlink: toUrl:
> https://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3anchor:
> here
> Content Metadata: X-Frame-Options=SAMEORIGIN Date=Mon, 21 Mar 2011 15:05:06
> GMT X-XSS-Protection=1; mode=block Expires=Mon, 21 Mar 2011 15:05:06 GMT
> Location=
> https://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3Via=1.0
> cerbero.localdomain:800 (squid/2.6.STABLE21) Connection=close
> Content-Type=text/html; charset=UTF-8 X-Cache=MISS from cerbero.localdomain
> Server=GSE X-Content-Type-Options=nosniff Cache-Control=private, max-age=0
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>
>
> On Mon, Mar 21, 2011 at 3:34 PM, Julien Nioche <
> [email protected]> wrote:
>
>> I've tried both URLs with nutch-1.3 with
>>
>> <property>
>> <name>http.content.limit</name>
>> <value>-1</value>
>> <description/>
>> </property>
>>
>> and ./nutch org.apache.nutch.parse.ParserChecker ran without any problems.
>>
>> It could be that the nutch-site.xml is loaded from conf/ then overridden by
>> the one found in the job file (which is the reason why we separated the
>> runtime/deploy from runtime/local in nutch 1.3 and 2.0). Try deleting the
>> job file or generating a fresh one with 'ant job' and see if this fixes the
>> issue.
>>
>> Julien
>>
>>
>> On 21 March 2011 13:43, Gabriele Kahlout <[email protected]> wrote:
>>
>>> I'm also having the same issue with nutch-1.2.
>>>
>>> $ bin/nutch org.apache.nutch.parse.ParserChecker
>>> http://www.egamaster.com/datos/politica_fr.pdf
>>> ---------
>>> Url
>>> ---------------
>>> http://www.egamaster.com/datos/politica_fr.pdf---------
>>> ParseData
>>> ---------
>>> Version: 5
>>> Status: *failed*(2,0): expected='endstream' actual=''
>>> org.apache.pdfbox.io.PushBackInputStream@2918958e
>>> Title:
>>> Outlinks: 0
>>> Content Metadata:
>>> Parse Metadata:
>>>
>>> $ bin/nutch org.apache.nutch.parse.ParserChecker
>>> http://wsdownload.bbc.co.uk/learningenglish/pdf/2011/03/110303122858_110303_6min_heart.pdf
>>> Exception in thread "main" java.lang.NullPointerException
>>> at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
>>>
>>>
>>> $ java -jar /usr/local/bin/tika-app-0.9.jar
>>> http://www.egamaster.com/datos/politica_fr.pdf
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <html xmlns="http://www.w3.org/1999/xhtml">
>>> <head>
>>> <meta name="xmpTPg:NPages" content="1"/>
>>> <meta name="Creation-Date" content="2009-01-15T11:50:32Z"/>
>>> ....
>>>
>>>
>>> <property>
>>> <name>http.content.limit</name>
>>> <value>*200000* <!-- 65536--></value>
>>> <description>The length limit for downloaded content using the http
>>> protocol, in bytes. If this value is nonnegative (>=0), content longer
>>> than it will be truncated; otherwise, no truncation at all. Do not
>>> confuse this setting with the file.content.limit setting.
>>> </description>
>>> </property>
>>>
>>>
>>> <mime-type type="application/pdf">
>>> <alias type="application/x-pdf"/>
>>> <acronym>PDF</acronym>
>>> <comment>Portable Document Format</comment>
>>> <magic priority="50">
>>> <match value="%PDF-" type="string" offset="0"/>
>>> </magic>
>>> <glob pattern="*.pdf"/>
>>> </mime-type>
>>>
>>> <property>
>>> <name>plugin.includes</name>
>>> <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-tf|urlnormalizer-(pass|regex|basic)</value>
>>> <description>Regular expression naming plugin directory names to
>>> include. Any plugin not matching this expression is excluded.
>>> In any case you need at least include the nutch-extensionpoints plugin.
>>> By default Nutch includes crawling just HTML and plain text via HTTP,
>>> and basic indexing and search plugins. In order to use HTTPS please
>>> enable protocol-httpclient, but be aware of possible intermittent
>>> problems with the underlying commons-httpclient library. Nutch now also
>>> includes integration with Tika to leverage Tika's parsing capabilities
>>> for multiple content types. The existing Nutch parser implementations
>>> will likely be phased out in the next release or so, as such, it is
>>> a good idea to begin migrating away from anything not provided by
>>> parse-tika.
>>> </description>
>>> </property>
>>>
>>> --
>>> Regards,
>>> K. Gabriele
>>>
>>> --- unchanged since 20/9/10 ---
>>> P.S. If the subject contains "[LON]" or the addressee acknowledges the
>>> receipt within 48 hours then I don't resend the email.
>>> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>>> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>>>
>>> If an email is sent by a sender that is not a trusted contact or the
>>> email does not contain a valid code then the email is not received. A valid
>>> code starts with a hyphen and ends with "X".
>>> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
>>> L(-[a-z]+[0-9]X)).
>>>
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>>
>
> --
> Regards,
> K. Gabriele
>

--
Regards,
K. Gabriele

