I've finally tried the nutch-1.3 branch with NUTCH-967-1.3.patch [1] applied, and I get the same issue. Does it work for you?

bin/nutch parse crawl/segments/0/20110406195251
ParseSegment: starting at 2011-04-06 19:53:15
ParseSegment: segment: crawl/segments/0/20110406195251
Error parsing: http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@5d469658
Error parsing: http://singinst.org/upload/artificial-intelligence-risk.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@108f2ca6
Error parsing: http://www.egamaster.com/datos/politica_fr.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@13647278
ParseSegment: finished at 2011-04-06 19:53:23, elapsed: 00:00:08

[1] https://issues.apache.org/jira/secure/attachment/12474966/NUTCH-967-1.3.patch

On Thu, Mar 31, 2011 at 11:11 AM, McGibbney, Lewis John <[email protected]> wrote:

> Hi Gabriele
>
> Was wondering if you have had a chance to try checking out, building branch
> and trying out same crawl with it?
> ________________________________________
> From: Gabriele Kahlout [[email protected]]
> Sent: 31 March 2011 09:59
> To: Julien Nioche
> Cc: [email protected]
> Subject: Re: Unable to extract PDF content
>
> I'm still not able to parse those pdfs, although they are fetched:
>
> QueueFeeder finished: total 3 records + hit by time limit :0
> fetching http://www.egamaster.com/datos/politica_fr.pdf
> fetching http://singinst.org/upload/artificial-intelligence-risk.pdf
> -finishing thread FetcherThread, activeThreads=6
> fetching http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
> *Error parsing: http://singinst.org/upload/artificial-intelligence-risk.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@3d3c33b7*
> -finishing thread FetcherThread, activeThreads=2
> *Error parsing: http://www.egamaster.com/datos/politica_fr.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@2bf8f8c8*
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> *Error parsing: http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@4b6c06dd*
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-03-31 10:50:37, elapsed: 00:00:14
>
> Statistics for CrawlDb: crawl/crawldb/0
> *TOTAL urls: 3*
> retry 0: 3
> min score: 1.0
> avg score: 1.0
> max score: 1.0
> *status 2 (db_fetched): 3*
> CrawlDb statistics: done

--
Regards,
K. Gabriele
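
P.S. One possible reading of expected='endstream' actual='' is that the parser ran out of input before a stream's closing keyword, which is what a truncated copy of the PDF would produce. That is only an assumption; nothing in the logs above confirms it. A minimal Python sketch for checking a locally saved copy of one of these files follows. The function name pdf_truncation_hints is hypothetical, and the checks are rough heuristics (binary stream data can contain the word "stream" by chance), not a PDF validator.

```python
def pdf_truncation_hints(data: bytes) -> dict:
    """Return cheap signals that a PDF byte string may have been cut short."""
    # A complete PDF normally carries an %%EOF marker close to its end.
    has_eof = b"%%EOF" in data[-1024:]
    # Every "endstream" also contains "stream", so subtract to count openings.
    closes = data.count(b"endstream")
    opens = data.count(b"stream") - closes
    return {
        "has_eof_marker": has_eof,
        "stream_opens": opens,
        "stream_closes": closes,
        # Heuristic only: missing trailer marker or an unclosed stream.
        "looks_truncated": (not has_eof) or opens > closes,
    }
```

To use it, download one of the failing URLs by hand, read the file in binary mode, and pass the bytes to the function. If the downloaded copy looks complete while the parse still fails, truncation of the original file is probably not the cause.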

