Hi Gabriele,

So it is the same PDFs which are causing the bother :0| on both Nutch-1.2 
and the branch.

I have not applied the patch, and all PDFs seem to be parsing fine with the 
branch. I am interested to understand what the problem is with these 
particular PDF files. I know that we should be able to parse both normal and 
encrypted PDFs; however, on the surface the files appear to be normal: two 
are PDF version 1.3 and one is version 1.4. Although I have not attempted to 
parse any version 1.3 files, files of version 1.4 and above parse without any 
hitches, so I doubt that the version is the problem. I am also unfamiliar 
with the 'expected' and 'actual' values you are getting as output. Might you 
be able to get more information on this output from somewhere?
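
One rough way to look at the files themselves, independent of Nutch and 
PDFBox, would be a small Python sketch like the one below. It reads the 
header version, looks for an /Encrypt entry, and compares the number of 
'stream' and 'endstream' keywords. The function name inspect_pdf_bytes is 
made up for illustration, and the keyword counting is only a heuristic, since 
binary stream data can contain those byte sequences by chance. It is a sanity 
check, not a real PDF parser.

```python
import re
import sys


def inspect_pdf_bytes(data: bytes) -> dict:
    """Collect rough structural facts about a PDF held in memory (heuristic)."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    # Count 'endstream' keywords, then 'stream' keywords that are not
    # the tail of 'endstream'.
    closed = len(re.findall(rb"endstream", data))
    opened = len(re.findall(rb"(?<!end)stream", data))
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "encrypted": b"/Encrypt" in data,
        "streams_opened": opened,
        "streams_closed": closed,
        # A complete PDF normally ends with an %%EOF marker.
        "has_eof_marker": b"%%EOF" in data[-1024:],
    }


if __name__ == "__main__":
    # Usage (paths are whatever local copies you saved):
    #   python inspect_pdf.py file1.pdf file2.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(path, inspect_pdf_bytes(handle.read()))
```

If a file reports more streams opened than closed, or has no %%EOF marker, 
that would suggest the copy being parsed is incomplete rather than encrypted 
or of an unsupported version, but that is only a guess until checked.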
________________________________________
From: Gabriele Kahlout [[email protected]]
Sent: 06 April 2011 19:04
To: [email protected]
Cc: McGibbney, Lewis John
Subject: Re: Unable to extract PDF content

I've finally tried with the nutch-1.3 branch, applying 
NUTCH-967-1.3.patch[1], and I get the same issue. Does it work for you?

bin/nutch parse crawl/segments/0/20110406195251
ParseSegment: starting at 2011-04-06 19:53:15
ParseSegment: segment: crawl/segments/0/20110406195251
Error parsing: http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@5d469658
Error parsing: http://singinst.org/upload/artificial-intelligence-risk.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@108f2ca6
Error parsing: http://www.egamaster.com/datos/politica_fr.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@13647278
ParseSegment: finished at 2011-04-06 19:53:23, elapsed: 00:00:08

[1] https://issues.apache.org/jira/secure/attachment/12474966/NUTCH-967-1.3.patch


