RE: refetching

jeffersonzhou Sun, 10 Jul 2011 03:18:22 -0700

I have the same issue too.

Markus, could you be more specific about:
"Start a new generate/fetch/parse/update/index cycle."


thanks

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Saturday, July 09, 2011 8:39 PM
To: [email protected]
Cc: Cam Bazz
Subject: Re: refetching

Start a new generate/fetch/parse/update/index cycle.
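
To be a bit more specific, here is a minimal sketch of one such cycle, assuming Nutch 1.x and the crawldb and segments paths from the post below. The generate step writes a new timestamped segment containing the URLs that are due for fetching (which includes the db_unfetched ones), so the old segment is never fetched again. LINKDB, SOLR_URL, and the helper names run and crawl_cycle are assumptions made for illustration; check the usage output of bin/nutch solrindex for your version before relying on the last step.

```shell
#!/bin/sh
# One generate/fetch/parse/update/index round for Nutch 1.x (sketch).
# CRAWLDB and SEGMENTS come from the original post; LINKDB and SOLR_URL
# are assumed values, adjust them to your setup.
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-/home/crawl/crawldb}"
SEGMENTS="${SEGMENTS:-/home/crawl/segments}"
LINKDB="${LINKDB:-/home/crawl/linkdb}"                # assumption
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"   # assumption

# With DRY_RUN=1 the command is only printed, not executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

crawl_cycle() {
  # generate creates a NEW segment directory under $SEGMENTS
  run "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" || return 1
  # segment names are timestamps, so the last one listed is the newest
  segment=$(ls -d "$SEGMENTS"/* | tail -n 1)
  [ -n "$segment" ] || return 1
  run "$NUTCH" fetch "$segment" &&
  run "$NUTCH" parse "$segment" &&
  run "$NUTCH" updatedb "$CRAWLDB" "$segment" &&
  run "$NUTCH" solrindex "$SOLR_URL" "$CRAWLDB" "$LINKDB" "$segment"
}
```

Source the script and call crawl_cycle from the Nutch home directory; running it with DRY_RUN=1 first prints the commands without executing them. Repeat the cycle until readdb -stats shows the db_unfetched count going down as far as you need.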

> Hello,
> 
> I have injected, generated a fetch list, fetched, then parsed, and
> indexed to solr with success.
> 
> When I do:
> 
> 
> cam@glacier:/home/nutch$ bin/nutch readdb /home/crawl/crawldb -stats
> CrawlDb statistics start: /home/crawl/crawldb
> Statistics for CrawlDb: /home/crawl/crawldb
> TOTAL urls:     841285
> retry 0:        797382
> retry 1:        43903
> min score:      3.0
> avg score:      1.0366234
> max score:      117.74
> status 1 (db_unfetched):        219050
> status 2 (db_fetched):  543113
> status 3 (db_gone):     3250
> status 4 (db_redir_temp):       58224
> status 5 (db_redir_perm):       17648
> CrawlDb statistics: done
> 
> So I understand there are 219K unfetched documents.
> 
> I wanted to refetch the same segment, and I got:
> 
> cam@glacier:/home/nutch$ bin/nutch fetch /home/crawl/segments/20110708182237
> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> Fetcher: starting at 2011-07-09 14:30:13
> Fetcher: segment: /home/crawl/segments/20110708182237
> Fetcher: java.io.IOException: Segment already fetched!
>         at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:50)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1145)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1116)
> 
> 
> I understand Nutch will not refetch URLs that have already been fetched
> until some timeout expires (the default is one month, I guess).
> 
> What I want is only to fetch those that are not fetched in the same
> segment.
> 
> How can I do it?
> 
> Best Regards,
> C.B.
