Would you share some working public PDF links, so that I can try with the same
material?
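
For what it's worth, the message expected='endstream' actual='' in the output
quoted below looks like what PDFBox reports when a stream object ends before
its closing 'endstream' keyword, which is what one would see if the fetched
file were cut short. That reading is an assumption, not something confirmed in
this thread. A rough Python sketch to check a locally downloaded copy (the
function names and the 1024-byte tail window are illustrative choices, and
keyword counting is only a heuristic, since binary stream data can contain
these bytes by chance):

```python
import re


def stream_keyword_counts(data):
    """Return (openings, closings) of stream keywords found in raw PDF bytes."""
    # A stream body starts with the keyword 'stream' followed by CRLF or LF.
    # The lookbehind keeps 'endstream' from being counted as an opening.
    opens = len(re.findall(rb"(?<!end)stream\r?\n", data))
    closes = data.count(b"endstream")
    return opens, closes


def looks_truncated(data):
    """Heuristic: more stream openings than closings, or no %%EOF near the end."""
    opens, closes = stream_keyword_counts(data)
    has_eof_marker = b"%%EOF" in data[-1024:]
    return opens > closes or not has_eof_marker
```

Calling looks_truncated(open(path, "rb").read()) on a saved copy of each
failing file would give a first hint as to whether the download is complete.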

On Wed, Apr 6, 2011 at 10:09 PM, McGibbney, Lewis John <
[email protected]> wrote:

> Hi Gabriele,
>
> So it is the same PDFs that are causing the bother :0| on both Nutch-1.2
> and the branch.
>
> I have not applied the patch, and all PDFs seem to be parsing fine with the
> branch. I am interested to understand what the problem is with these
> particular PDF files. I know that we should be able to parse both normal and
> encrypted PDFs; however, on the surface the files appear to be normal: two
> are PDF version 1.3 and one is version 1.4. Although I have not attempted to
> parse any version 1.3 files, files of version 1.4 and above parse without
> any hitches, so I doubt that the version is the problem. I am also
> unfamiliar with the 'expected' and 'actual' values you are getting as
> output, and wonder whether you might be able to get more information on
> this output from somewhere.
> ________________________________________
> From: Gabriele Kahlout [[email protected]]
> Sent: 06 April 2011 19:04
> To: [email protected]
> Cc: McGibbney, Lewis John
> Subject: Re: Unable to extract PDF content
>
> I've finally tried with the nutch-1.3 branch, applying
> NUTCH-967-1.3.patch [1], and I get the same issue. Does it work for you?
>
> bin/nutch parse crawl/segments/0/20110406195251
> ParseSegment: starting at 2011-04-06 19:53:15
> ParseSegment: segment: crawl/segments/0/20110406195251
> Error parsing:
> http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf:
> failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.PushBackInputStream@5d469658
> Error parsing: http://singinst.org/upload/artificial-intelligence-risk.pdf:
> failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.PushBackInputStream@108f2ca6
> Error parsing: http://www.egamaster.com/datos/politica_fr.pdf:
> failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.PushBackInputStream@13647278
> ParseSegment: finished at 2011-04-06 19:53:23, elapsed: 00:00:08
>
> [1]
> https://issues.apache.org/jira/secure/attachment/12474966/NUTCH-967-1.3.patch
>



-- 
Regards,
K. Gabriele
