bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://singinst.org/upload/artificial-intelligence-risk.pdf
No problems at all with branch-1.2 and 1.3 with standard configuration. Gabriele's URL works as well without issues:

bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.egamaster.com/datos/politica_fr.pdf

What's going on with your configurations?

> Hi Gabriele
>
> Was wondering if you have had a chance to try checking out, building branch
> and trying out same crawl with it?
> ________________________________________
> From: Gabriele Kahlout [[email protected]]
> Sent: 31 March 2011 09:59
> To: Julien Nioche
> Cc: [email protected]
> Subject: Re: Unable to extract PDF content
>
> I'm still not able to parse those pdfs, although they are fetched:
>
> QueueFeeder finished: total 3 records + hit by time limit :0
> fetching http://www.egamaster.com/datos/politica_fr.pdf
> fetching http://singinst.org/upload/artificial-intelligence-risk.pdf
> -finishing thread FetcherThread, activeThreads=6
> fetching http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
> *Error parsing:
> http://singinst.org/upload/artificial-intelligence-risk.pdf: failed(2,0):
> expected='endstream' actual=''
> org.apache.pdfbox.io.PushBackInputStream@3d3c33b7*
> -finishing thread FetcherThread, activeThreads=2
> *Error parsing: http://www.egamaster.com/datos/politica_fr.pdf:
> failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.PushBackInputStream@2bf8f8c8*
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> *Error parsing:
> http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf:
> failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.PushBackInputStream@4b6c06dd*
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-03-31 10:50:37, elapsed: 00:00:14
>
> Statistics for CrawlDb: crawl/crawldb/0
> *TOTAL urls: 3*
> retry 0: 3
> min score: 1.0
> avg score: 1.0
> max score: 1.0
> *status 2 (db_fetched): 3*
> CrawlDb statistics: done

