RE: Unable to extract PDF content

McGibbney, Lewis John Thu, 31 Mar 2011 02:13:35 -0700

Hi Gabriele

I was wondering whether you have had a chance to check out and build the branch,
and to try the same crawl with it?
________________________________________
From: Gabriele Kahlout [[email protected]]
Sent: 31 March 2011 09:59
To: Julien Nioche
Cc: [email protected]
Subject: Re: Unable to extract PDF content


I'm still not able to parse those PDFs, although they are fetched:

QueueFeeder finished: total 3 records + hit by time limit :0
fetching http://www.egamaster.com/datos/politica_fr.pdf
fetching http://singinst.org/upload/artificial-intelligence-risk.pdf
-finishing thread FetcherThread, activeThreads=6
fetching http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
*Error parsing: http://singinst.org/upload/artificial-intelligence-risk.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@3d3c33b7*
-finishing thread FetcherThread, activeThreads=2
*Error parsing: http://www.egamaster.com/datos/politica_fr.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@2bf8f8c8*
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
*Error parsing: http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@4b6c06dd*
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-03-31 10:50:37, elapsed: 00:00:14

Statistics for CrawlDb: crawl/crawldb/0
*TOTAL urls: 3*
retry 0: 3
min score: 1.0
avg score: 1.0
max score: 1.0
*status 2 (db_fetched): 3*
CrawlDb statistics: done
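
For anyone comparing runs, e.g. before and after trying a different build, the
failing URLs can be pulled out of fetcher output like the above with a few
lines of Python. This is only a rough sketch: the function name
parse_failures, the regular expression, and the group names are my own and
not part of Nutch, and the pattern assumes the "Error parsing: <url>:
failed(n,m): <message>" shape seen in this log. It keeps the first line of the
message only.

```python
import re

# One "Error parsing" entry. The \s* parts tolerate line breaks inside an
# entry, as can happen when the output has been wrapped in a mail.
# Assumption: entries follow the shape seen in the log quoted in this thread.
ERROR_RE = re.compile(
    r"Error parsing:\s*(?P<url>\S+?):\s*"
    r"failed\((?P<first>\d+),(?P<second>\d+)\):\s*"
    r"(?P<message>[^\n*]*)"
)


def parse_failures(log_text):
    """Return a list of (url, (first_code, second_code), message) tuples."""
    failures = []
    for m in ERROR_RE.finditer(log_text):
        codes = (int(m.group("first")), int(m.group("second")))
        failures.append((m.group("url"), codes, m.group("message").strip()))
    return failures


# Two entries copied from the log in this thread.
SAMPLE = """\
*Error parsing: http://singinst.org/upload/artificial-intelligence-risk.pdf:
failed(2,0): expected='endstream' actual=''
org.apache.pdfbox.io.PushBackInputStream@3d3c33b7*
-finishing thread FetcherThread, activeThreads=2
*Error parsing: http://www.egamaster.com/datos/politica_fr.pdf: failed(2,0):
expected='endstream' actual=''
org.apache.pdfbox.io.PushBackInputStream@2bf8f8c8*
"""

if __name__ == "__main__":
    for url, codes, message in parse_failures(SAMPLE):
        print(url, codes, message)
```

Running the same function over the output of two crawls and comparing the URL
lists should show quickly whether a given build changes which PDFs fail.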
