I've finally tried the nutch-1.3 branch with NUTCH-967-1.3.patch [1] applied,
and I get the same issue. Does it work for you?

bin/nutch parse crawl/segments/0/20110406195251
ParseSegment: starting at 2011-04-06 19:53:15
ParseSegment: segment: crawl/segments/0/20110406195251
Error parsing: http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf:
failed(2,0): expected='endstream' actual=''
org.apache.pdfbox.io.PushBackInputStream@5d469658
Error parsing: http://singinst.org/upload/artificial-intelligence-risk.pdf:
failed(2,0): expected='endstream' actual=''
org.apache.pdfbox.io.PushBackInputStream@108f2ca6
Error parsing: http://www.egamaster.com/datos/politica_fr.pdf:
failed(2,0): expected='endstream' actual=''
org.apache.pdfbox.io.PushBackInputStream@13647278
ParseSegment: finished at 2011-04-06 19:53:23, elapsed: 00:00:08
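
For comparing runs, a small script can pull the failing URLs out of
ParseSegment output like the above. This is only a sketch: the function name
failed_parses and the regular expression are my own, written against the three
error records shown here, and it assumes every record has the shape
"Error parsing: <url>: failed(<major>,<minor>): <message>". It is not part of
Nutch.

```python
import re

# One "Error parsing" record: URL, "failed(major,minor):", then the message.
# Hypothetical pattern, derived only from the three records in this mail.
ERROR_RE = re.compile(
    r"Error parsing:\s*(?P<url>\S+?)\s*:\s*"
    r"failed\((?P<major>\d+),(?P<minor>\d+)\):\s*"
    r"(?P<message>expected='[^']*' actual='[^']*')"
)

# Console output copied from this mail, including the line wrapping.
SAMPLE = """\
ParseSegment: starting at 2011-04-06 19:53:15
ParseSegment: segment: crawl/segments/0/20110406195251
Error parsing:
http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf:
failed(2,0): expected='endstream' actual=''
org.apache.pdfbox.io.PushBackInputStream@5d469658
Error parsing: http://singinst.org/upload/artificial-intelligence-risk.pdf:
failed(2,0): expected='endstream' actual=''
org.apache.pdfbox.io.PushBackInputStream@108f2ca6
Error parsing: http://www.egamaster.com/datos/politica_fr.pdf: failed(2,0):
expected='endstream' actual=''
org.apache.pdfbox.io.PushBackInputStream@13647278
ParseSegment: finished at 2011-04-06 19:53:23, elapsed: 00:00:08
"""


def failed_parses(log_text):
    """Return (url, message) pairs for every 'Error parsing' record."""
    # Collapse all whitespace so records wrapped by a mail client still match.
    flat = " ".join(log_text.split())
    return [(m.group("url"), m.group("message")) for m in ERROR_RE.finditer(flat)]


if __name__ == "__main__":
    for url, message in failed_parses(SAMPLE):
        print(url, "->", message)
```

It does not handle "> " quote prefixes, so quoted logs would need those
stripped first.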

[1]
https://issues.apache.org/jira/secure/attachment/12474966/NUTCH-967-1.3.patch

On Thu, Mar 31, 2011 at 11:11 AM, McGibbney, Lewis John <
[email protected]> wrote:

> Hi Gabriele
>
> Was wondering if you have had a chance to check out and build the branch
> and try the same crawl with it?
> ________________________________________
> From: Gabriele Kahlout [[email protected]]
> Sent: 31 March 2011 09:59
> To: Julien Nioche
> Cc: [email protected]
> Subject: Re: Unable to extract PDF content
>
> I'm still not able to parse those pdfs, although they are fetched:
>
> QueueFeeder finished: total 3 records + hit by time limit :0
> fetching http://www.egamaster.com/datos/politica_fr.pdf
> fetching http://singinst.org/upload/artificial-intelligence-risk.pdf
> -finishing thread FetcherThread, activeThreads=6
> fetching http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
> *Error parsing: http://singinst.org/upload/artificial-intelligence-risk.pdf:
> failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.PushBackInputStream@3d3c33b7*
> -finishing thread FetcherThread, activeThreads=2
> *Error parsing: http://www.egamaster.com/datos/politica_fr.pdf:
> failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.PushBackInputStream@2bf8f8c8*
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> *Error parsing: http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf:
> failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.PushBackInputStream@4b6c06dd*
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-03-31 10:50:37, elapsed: 00:00:14
>
> Statistics for CrawlDb: crawl/crawldb/0
> *TOTAL urls: 3*
> retry 0: 3
> min score: 1.0
> avg score: 1.0
> max score: 1.0
> *status 2 (db_fetched): 3*
> CrawlDb statistics: done
>
>
>
>



-- 
Regards,
K. Gabriele

