bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://singinst.org/upload/artificial-intelligence-risk.pdf

No problems at all with either branch-1.2 or 1.3 using the standard configuration.

Gabriele's URL works as well without issues:

bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.egamaster.com/datos/politica_fr.pdf

What's going on with your configurations?
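
To run the same check over all three URLs from the fetch log quoted below, a small loop like this should do. It is only a sketch: it assumes the current directory is the one where bin/nutch resolves, the NUTCH variable is a convenience introduced here, and the third URL is joined from the wrapped log lines.

```shell
#!/bin/sh
# Sketch only: assumes bin/nutch resolves from the current directory;
# set NUTCH to point elsewhere if needed (NUTCH is a name made up here).
NUTCH="${NUTCH:-bin/nutch}"

count=0
for url in \
  "http://singinst.org/upload/artificial-intelligence-risk.pdf" \
  "http://www.egamaster.com/datos/politica_fr.pdf" \
  "http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf"
do
  echo "== $url"
  # Same invocation as above, one URL at a time.
  "$NUTCH" org.apache.nutch.parse.ParserChecker -dumpText "$url" \
    || echo "ParserChecker exited non-zero for $url" >&2
  count=$((count + 1))
done
echo "checked $count URLs"
```

If the text is dumped here but the crawl still reports parse errors, the difference is most likely in the crawl configuration rather than in the parser itself.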

> Hi Gabriele
> 
> Was wondering if you have had a chance to check out and build the branch
> and try the same crawl with it?
> ________________________________________
> From: Gabriele Kahlout [[email protected]]
> Sent: 31 March 2011 09:59
> To: Julien Nioche
> Cc: [email protected]
> Subject: Re: Unable to extract PDF content
> 
> I'm still not able to parse those pdfs, although they are fetched:
> 
> QueueFeeder finished: total 3 records + hit by time limit :0
> fetching http://www.egamaster.com/datos/politica_fr.pdf
> fetching http://singinst.org/upload/artificial-intelligence-risk.pdf
> -finishing thread FetcherThread, activeThreads=6
> fetching http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
> *Error parsing: http://singinst.org/upload/artificial-intelligence-risk.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@3d3c33b7*
> -finishing thread FetcherThread, activeThreads=2
> *Error parsing: http://www.egamaster.com/datos/politica_fr.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@2bf8f8c8*
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> *Error parsing: http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@4b6c06dd*
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-03-31 10:50:37, elapsed: 00:00:14
> 
> Statistics for CrawlDb: crawl/crawldb/0
> *TOTAL urls: 3*
> retry 0: 3
> min score: 1.0
> avg score: 1.0
> max score: 1.0
> *status 2 (db_fetched): 3*
> CrawlDb statistics: done