I have the same issue too. Markus, could you be more specific about "Start a new generate/fetch/parse/update/index cycle"?
thanks

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Saturday, July 09, 2011 8:39 PM
To: [email protected]
Cc: Cam Bazz
Subject: Re: refetching

Start a new generate/fetch/parse/update/index cycle.

> Hello,
>
> I have injected, generated a fetch list, fetched, then parsed, and
> indexed to solr with success.
>
> When I do:
>
> cam@glacier:/home/nutch$ bin/nutch readdb /home/crawl/crawldb -stats
> CrawlDb statistics start: /home/crawl/crawldb
> Statistics for CrawlDb: /home/crawl/crawldb
> TOTAL urls: 841285
> retry 0: 797382
> retry 1: 43903
> min score: 3.0
> avg score: 1.0366234
> max score: 117.74
> status 1 (db_unfetched): 219050
> status 2 (db_fetched): 543113
> status 3 (db_gone): 3250
> status 4 (db_redir_temp): 58224
> status 5 (db_redir_perm): 17648
> CrawlDb statistics: done
>
> So I understand there are 219K unfetched documents.
>
> I wanted to refetch the same segment, and I got:
>
> cam@glacier:/home/nutch$ bin/nutch fetch /home/crawl/segments/20110708182237
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2011-07-09 14:30:13
> Fetcher: segment: /home/crawl/segments/20110708182237
> Fetcher: java.io.IOException: Segment already fetched!
>     at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:50)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1145)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1116)
>
> I understand nutch will not fetch urls that have been fetched, unless
> some timeout occurs (default is one month I guess).
>
> What I want is only to fetch those that are not fetched in the same
> segment.
>
> How can I do it?
>
> Best Regards,
> C.B.
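
For reference, one round of the generate/fetch/parse/update/index cycle might look roughly like the sketch below. It is not a confirmed recipe from this thread. It assumes the Nutch 1.x command line and the crawl paths quoted above. The linkdb location, the Solr URL, the -topN value and the run_cycle helper name are made up for illustration. The key point is that generate writes a new segment containing URLs that are due for fetching (such as the db_unfetched ones), so the already-fetched segment is never fetched again.

```shell
#!/bin/sh
# Sketch of one generate/fetch/parse/update/index round (Nutch 1.x CLI assumed).
# CRAWLDB and SEGMENTS come from the thread; LINKDB, SOLR_URL and the
# -topN value are assumptions and should be adapted to the local setup.

NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-/home/crawl/crawldb}"
SEGMENTS="${SEGMENTS:-/home/crawl/segments}"
LINKDB="${LINKDB:-/home/crawl/linkdb}"                # assumed location
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"   # assumed URL

# run_cycle is a hypothetical helper name, not part of Nutch.
run_cycle() {
    # 1. generate: pick URLs that are due for fetching into a NEW segment
    "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN 50000 || return 1

    # Segment names are timestamps, so the newest one sorts last.
    segment=$(ls -d "$SEGMENTS"/* | sort | tail -n 1)

    # 2. fetch and 3. parse the new segment only
    "$NUTCH" fetch "$segment" || return 1
    "$NUTCH" parse "$segment" || return 1

    # 4. update: merge the fetch results back into the crawldb
    "$NUTCH" updatedb "$CRAWLDB" "$segment" || return 1

    # 5. index: refresh the linkdb, then send the segment to Solr.
    # The solrindex argument order differs between Nutch versions;
    # running "bin/nutch solrindex" without arguments prints the usage.
    "$NUTCH" invertlinks "$LINKDB" "$segment" || return 1
    "$NUTCH" solrindex "$SOLR_URL" "$CRAWLDB" "$LINKDB" "$segment" || return 1
}
```

Calling run_cycle once performs one round. Repeating it, and checking readdb -stats in between, should show the db_unfetched count going down as new segments are generated and fetched.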

