Hi Gabriele,

So it is the same PDFs that are causing the bother :0| even between Nutch-1.2 and the branch.
I have not applied the patch, and all PDFs seem to be parsing fine with the branch. I am interested in understanding what the problem is with these particular PDF files. I know that we should be able to parse both normal and encrypted PDFs; however, on the surface the files appear to be normal: two are PDF version 1.3 and one is version 1.4. Although I have not attempted to parse any version 1.3 files, files of version 1.4 and above parse without any hitches, so I doubt that the version is the problem. I am also unfamiliar with the 'expected' and 'actual' values you are getting as output, and I wonder whether you might be able to get more information on this output from somewhere.

________________________________________
From: Gabriele Kahlout [[email protected]]
Sent: 06 April 2011 19:04
To: [email protected]
Cc: McGibbney, Lewis John
Subject: Re: Unable to extract PDF content

I've finally tried with nutch-1.3 branch applying NUTCH-967-1.3.patch[1] and I get the same issue. Does it work for you?

bin/nutch parse crawl/segments/0/20110406195251
ParseSegment: starting at 2011-04-06 19:53:15
ParseSegment: segment: crawl/segments/0/20110406195251
Error parsing: http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@5d469658
Error parsing: http://singinst.org/upload/artificial-intelligence-risk.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@108f2ca6
Error parsing: http://www.egamaster.com/datos/politica_fr.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@13647278
ParseSegment: finished at 2011-04-06 19:53:23, elapsed: 00:00:08

[1] https://issues.apache.org/jira/secure/attachment/12474966/NUTCH-967-1.3.patch
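An error of the form expected='endstream' actual='' suggests the parser ran out of data before a stream object was closed, so one thing worth ruling out is that the copy of each file being parsed is incomplete. Below is a rough sketch of such a check, not anything from Nutch or PDFBox. The helper name inspect_pdf is hypothetical, and the keyword counts are only a heuristic, since binary stream data can happen to contain the same byte sequences.

```python
import re


def inspect_pdf(data: bytes) -> dict:
    """Collect a few cheap structural facts about raw PDF bytes.

    Heuristic only: this does not parse the PDF, it just looks for the
    header, the stream/endstream keywords, and the end-of-file marker.
    """
    # A PDF file starts with a header such as "%PDF-1.3" or "%PDF-1.4".
    header = re.match(rb"%PDF-(\d\.\d)", data)

    # The negative lookbehind stops "stream" from matching inside "endstream".
    opened = len(re.findall(rb"(?<!end)stream\b", data))
    closed = len(re.findall(rb"endstream\b", data))

    return {
        "version": header.group(1).decode("ascii") if header else None,
        "streams_opened": opened,
        "streams_closed": closed,
        # A complete file normally ends with "%%EOF" near the last bytes.
        "has_eof_marker": b"%%EOF" in data[-1024:],
    }
```

Running it on the bytes of a downloaded copy, for example inspect_pdf(open(path, "rb").read()) with path pointing at one of the three files, should show whether more streams are opened than closed or whether the %%EOF marker is missing. Either result would point at a truncated file rather than at the PDF version.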

