I'm still not able to parse those PDFs, although they are fetched:

QueueFeeder finished: total 3 records + hit by time limit: 0
fetching http://www.egamaster.com/datos/politica_fr.pdf
fetching http://singinst.org/upload/artificial-intelligence-risk.pdf
-finishing thread FetcherThread, activeThreads=6
fetching http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
*Error parsing: http://singinst.org/upload/artificial-intelligence-risk.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@3d3c33b7*
-finishing thread FetcherThread, activeThreads=2
*Error parsing: http://www.egamaster.com/datos/politica_fr.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@2bf8f8c8*
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
*Error parsing: http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@4b6c06dd*
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-03-31 10:50:37, elapsed: 00:00:14

Statistics for CrawlDb: crawl/crawldb/0
*TOTAL urls: 3*
retry 0: 3
min score: 1.0
avg score: 1.0
max score: 1.0
*status 2 (db_fetched): 3*
CrawlDb statistics: done
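That "failed(2,0): expected='endstream' actual=''" status from PDFBox generally means the parser ran out of bytes before the end of a PDF stream object, which points at the fetched content being truncated rather than the files themselves being broken. A quick sanity check, assuming curl is available and the servers answer HEAD requests with a Content-Length header, is to compare the server-reported size of the failing PDFs against the configured http.content.limit (200000 in the nutch-site.xml quoted further down); anything larger than that limit gets cut off before 'endstream' ever arrives:

$ curl -sI http://www.egamaster.com/datos/politica_fr.pdf | grep -i content-length
$ curl -sI http://singinst.org/upload/artificial-intelligence-risk.pdf | grep -i content-length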
On Mon, Mar 21, 2011 at 4:28 PM, Gabriele Kahlout <[email protected]> wrote:

> Otherwise, trying your instructions is not working out. When will nutch-1.3
> be released?
>
> On Mon, Mar 21, 2011 at 4:07 PM, Gabriele Kahlout <[email protected]> wrote:
>
>> I've tried with a small file and it worked:
>>
>> $ bin/nutch org.apache.nutch.parse.ParserChecker http://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3&hl=en
>> [1] 15177
>> michaela:nutch-1.2 simpatico$
>> ---------
>> Url
>> ---------------
>> http://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3
>> ---------
>> ParseData
>> ---------
>> Version: 5
>> Status: success(1,0)
>> Title: Moved Temporarily
>> Outlinks: 1
>>   outlink: toUrl: https://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3 anchor: here
>> Content Metadata: X-Frame-Options=SAMEORIGIN Date=Mon, 21 Mar 2011 15:05:06 GMT X-XSS-Protection=1; mode=block Expires=Mon, 21 Mar 2011 15:05:06 GMT Location=https://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3 Via=1.0 cerbero.localdomain:800 (squid/2.6.STABLE21) Connection=close Content-Type=text/html; charset=UTF-8 X-Cache=MISS from cerbero.localdomain Server=GSE X-Content-Type-Options=nosniff Cache-Control=private, max-age=0
>> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>>
>> On Mon, Mar 21, 2011 at 3:34 PM, Julien Nioche <[email protected]> wrote:
>>
>>> I've tried both URLs with nutch-1.3 with
>>>
>>> <property>
>>>   <name>http.content.limit</name>
>>>   <value>-1</value>
>>>   <description/>
>>> </property>
>>>
>>> and ./nutch org.apache.nutch.parse.ParserChecker ran without any problems.
>>>
>>> It could be that the nutch-site.xml is loaded from conf/ and then overridden by the one found in the job file (which is the reason why we separated runtime/deploy from runtime/local in Nutch 1.3 and 2.0). Try deleting the job file or generating a fresh one with 'ant job' and see if this fixes the issue.
>>>
>>> Julien
>>>
>>> On 21 March 2011 13:43, Gabriele Kahlout <[email protected]> wrote:
>>>
>>>> I'm also having the same issue with nutch-1.2.
>>>>
>>>> $ bin/nutch org.apache.nutch.parse.ParserChecker http://www.egamaster.com/datos/politica_fr.pdf
>>>> ---------
>>>> Url
>>>> ---------------
>>>> http://www.egamaster.com/datos/politica_fr.pdf
>>>> ---------
>>>> ParseData
>>>> ---------
>>>> Version: 5
>>>> Status: *failed*(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@2918958e
>>>> Title:
>>>> Outlinks: 0
>>>> Content Metadata:
>>>> Parse Metadata:
>>>>
>>>> $ bin/nutch org.apache.nutch.parse.ParserChecker http://wsdownload.bbc.co.uk/learningenglish/pdf/2011/03/110303122858_110303_6min_heart.pdf
>>>> Exception in thread "main" java.lang.NullPointerException
>>>>     at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
>>>>
>>>> $ java -jar /usr/local/bin/tika-app-0.9.jar http://www.egamaster.com/datos/politica_fr.pdf
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <html xmlns="http://www.w3.org/1999/xhtml">
>>>> <head>
>>>> <meta name="xmpTPg:NPages" content="1"/>
>>>> <meta name="Creation-Date" content="2009-01-15T11:50:32Z"/>
>>>> ....
>>>>
>>>> <property>
>>>>   <name>http.content.limit</name>
>>>>   <value>*200000* <!-- 65536 --></value>
>>>>   <description>The length limit for downloaded content using the http
>>>>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>>>>   than it will be truncated; otherwise, no truncation at all. Do not
>>>>   confuse this setting with the file.content.limit setting.
>>>>   </description>
>>>> </property>
>>>>
>>>> <mime-type type="application/pdf">
>>>>   <alias type="application/x-pdf"/>
>>>>   <acronym>PDF</acronym>
>>>>   <comment>Portable Document Format</comment>
>>>>   <magic priority="50">
>>>>     <match value="%PDF-" type="string" offset="0"/>
>>>>   </magic>
>>>>   <glob pattern="*.pdf"/>
>>>> </mime-type>
>>>>
>>>> <property>
>>>>   <name>plugin.includes</name>
>>>>   <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-tf|urlnormalizer-(pass|regex|basic)</value>
>>>>   <description>Regular expression naming plugin directory names to
>>>>   include. Any plugin not matching this expression is excluded.
>>>>   In any case you need to at least include the nutch-extensionpoints plugin.
>>>>   By default Nutch includes crawling just HTML and plain text via HTTP,
>>>>   and basic indexing and search plugins. In order to use HTTPS please enable
>>>>   protocol-httpclient, but be aware of possible intermittent problems with
>>>>   the underlying commons-httpclient library. Nutch now also includes
>>>>   integration with Tika to leverage Tika's parsing capabilities for multiple
>>>>   content types. The existing Nutch parser implementations will likely be
>>>>   phased out in the next release or so; as such, it is a good idea to begin
>>>>   migrating away from anything not provided by parse-tika.
>>>>   </description>
>>>> </property>
>>>>
>>>> --
>>>> Regards,
>>>> K. Gabriele
>>>
>>> --
>>> Open Source Solutions for Text Engineering
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>
>> --
>> Regards,
>> K. Gabriele
>
> --
> Regards,
> K. Gabriele
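Since Julien's point above is that a stale nutch-site.xml packed inside the job file can silently override the one in conf/, a minimal way to rule that out, assuming a nutch-1.2 source checkout built with ant (the build/nutch-1.2.job path is the default layout and may differ on your setup), is to regenerate the job file and inspect the copy of nutch-site.xml it actually contains:

$ ant job
$ unzip -p build/nutch-1.2.job nutch-site.xml | grep -A 2 http.content.limit

With the limit set to -1 (or at least above the size of those PDFs) in whichever nutch-site.xml ends up winning, re-running the same check from the thread should in principle report Status: success(1,0) instead of the endstream failure:

$ bin/nutch org.apache.nutch.parse.ParserChecker http://www.egamaster.com/datos/politica_fr.pdf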
--
Regards,
K. Gabriele

