Hi Gabriele,

I was wondering whether you have had a chance to check out and build the branch, and to try the same crawl with it?

________________________________________
From: Gabriele Kahlout [[email protected]]
Sent: 31 March 2011 09:59
To: Julien Nioche
Cc: [email protected]
Subject: Re: Unable to extract PDF content
I'm still not able to parse those pdfs, although they are fetched:

QueueFeeder finished: total 3 records + hit by time limit :0
fetching http://www.egamaster.com/datos/politica_fr.pdf
fetching http://singinst.org/upload/artificial-intelligence-risk.pdf
-finishing thread FetcherThread, activeThreads=6
fetching http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
*Error parsing: http://singinst.org/upload/artificial-intelligence-risk.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@3d3c33b7*
-finishing thread FetcherThread, activeThreads=2
*Error parsing: http://www.egamaster.com/datos/politica_fr.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@2bf8f8c8*
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
*Error parsing: http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@4b6c06dd*
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-03-31 10:50:37, elapsed: 00:00:14

Statistics for CrawlDb: crawl/crawldb/0
*TOTAL urls: 3*
retry 0: 3
min score: 1.0
avg score: 1.0
max score: 1.0
*status 2 (db_fetched): 3*
CrawlDb statistics: done

Glasgow Caledonian University is a registered Scottish charity, number SC021474

Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009.
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html

Winner: Times Higher Education’s Outstanding Support for Early Career Researchers of the Year 2010, GCU as a lead with Universities Scotland partners.
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html
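When re-running the crawl, it may help to compare which URLs were fetched against which ones failed to parse, rather than reading the interleaved thread messages by eye. The following is only a rough sketch: it assumes the fetcher output has been saved to a text file with one message per line, and that the lines look like the "fetching ..." and "Error parsing: ..." messages quoted in this thread. The function name `summarize` and both patterns are made up for illustration and are not part of Nutch.

```python
import re

# Patterns modelled on the log lines quoted in this thread (an assumption,
# not an official Nutch log format). A leading '*' is tolerated because the
# quoted errors were wrapped in asterisks for emphasis.
FETCH_RE = re.compile(r"^\*?fetching (\S+)")
ERROR_RE = re.compile(r"^\*?Error parsing: (\S+?): (failed\(.*)$")


def summarize(log_text):
    """Return (fetched_urls, failures), where failures maps URL -> message."""
    fetched = []
    failures = {}
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        match = FETCH_RE.match(line)
        if match:
            fetched.append(match.group(1))
            continue
        match = ERROR_RE.match(line)
        if match:
            # Drop the trailing emphasis asterisk, if there is one.
            failures[match.group(1)] = match.group(2).rstrip("*")
    return fetched, failures


def report(log_text):
    """Print one line per fetched URL, marking those that failed to parse."""
    fetched, failures = summarize(log_text)
    for url in fetched:
        status = failures.get(url, "no parse error logged")
        print(f"{url}: {status}")
```

For example, `report(open("fetch.log").read())` would print each fetched URL with its parse error, if any; `fetch.log` is a placeholder file name.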

