Could you share some public PDF links that parse correctly for you, so that I can test against the same material?
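While comparing files, a quick structural check on the raw bytes may help narrow things down. The sketch below is only a heuristic and not part of Nutch or PDFBox: `inspect_pdf` is a hypothetical helper name, and it assumes the failing PDFs have been saved locally and read into memory. It reports the version from the `%PDF-x.y` header, counts `stream` and `endstream` keywords, and looks for the `%%EOF` trailer near the end. One possible (unconfirmed) reading of `expected='endstream' actual=''` is that the data ends before a stream is closed, and mismatched counts or a missing `%%EOF` would be consistent with that.

```python
import re


def inspect_pdf(data: bytes) -> dict:
    """Heuristic structural summary of a PDF given its raw bytes.

    Hypothetical helper for illustration only. Binary stream payloads can
    contain the keyword bytes by chance, so treat the counts as a hint.
    """
    # The PDF header looks like "%PDF-1.4" at the very start of the file.
    header = re.match(rb"%PDF-(\d\.\d)", data)

    closed = data.count(b"endstream")
    # Every "endstream" also contains "stream", so subtract those matches
    # to get the number of opening "stream" keywords.
    opened = data.count(b"stream") - closed

    return {
        "version": header.group(1).decode("ascii") if header else None,
        "streams_opened": opened,
        "streams_closed": closed,
        # A complete PDF normally ends with a "%%EOF" marker.
        "has_eof_marker": b"%%EOF" in data[-1024:],
    }
```

For a file on disk, something like `inspect_pdf(open(path, "rb").read())` would give the summary; more opened than closed streams, or a missing `%%EOF`, would suggest the file is incomplete rather than unusual in format.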

On Wed, Apr 6, 2011 at 10:09 PM, McGibbney, Lewis John <[email protected]> wrote:

> Hi Gabriele,
>
> So it is the same PDF's which are causing the bother :0| even between
> Nutch-1.2 and branch
>
> I have not applied the patch and all pdf's seem to be parsing fine with
> branch. I am interested to understand what the problem is with these
> particular pdf files. I know that we should be able to parse both normal as
> well as encrypted pdf, however on the surface it appears the files are
> normal in nature, two of pdf version 1.3 and one of version 1.4. Although I
> have not attempted to parse any version 1.3 files the 1.4 and above files
> parse without any hitches so I doubt that this is the problem. I am also
> unfamiliar with the 'expected' and 'actual' values you are getting as
> output, wondering if you might be able to get more information on this
> output from somewhere.
> ________________________________________
> From: Gabriele Kahlout [[email protected]]
> Sent: 06 April 2011 19:04
> To: [email protected]
> Cc: McGibbney, Lewis John
> Subject: Re: Unable to extract PDF content
>
> I've finally tried with nutch-1.3 branch applying NUTCH-967-1.3.patch[1]
> and I get the same issue. Does it work for you?
>
> bin/nutch parse crawl/segments/0/20110406195251
> ParseSegment: starting at 2011-04-06 19:53:15
> ParseSegment: segment: crawl/segments/0/20110406195251
> Error parsing:
> http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf:
> failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.PushBackInputStream@5d469658
> Error parsing: http://singinst.org/upload/artificial-intelligence-risk.pdf:
> failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.PushBackInputStream@108f2ca6
> Error parsing: http://www.egamaster.com/datos/politica_fr.pdf:
> failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.PushBackInputStream@13647278
> ParseSegment: finished at 2011-04-06 19:53:23, elapsed: 00:00:08
>
> [1]
> https://issues.apache.org/jira/secure/attachment/12474966/NUTCH-967-1.3.patch

--
Regards,
K. Gabriele

