Hello,
I have injected, generated a fetch list, fetched, parsed, and
indexed to Solr successfully.
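For reference, the cycle I ran was roughly the following (the seed
directory and the Solr URL here are placeholders, not my exact values;
the segment name is the one generate produced):

```shell
# inject seed URLs into the crawldb
bin/nutch inject /home/crawl/crawldb urls
# generate a fetch list as a new segment
bin/nutch generate /home/crawl/crawldb /home/crawl/segments
# fetch and parse that segment
bin/nutch fetch /home/crawl/segments/20110708182237
bin/nutch parse /home/crawl/segments/20110708182237
# update the crawldb with the fetch results
bin/nutch updatedb /home/crawl/crawldb /home/crawl/segments/20110708182237
# push the segment to Solr
bin/nutch solrindex http://localhost:8983/solr/ /home/crawl/crawldb /home/crawl/segments/20110708182237
```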
When I do:
cam@glacier:/home/nutch$ bin/nutch readdb /home/crawl/crawldb -stats
CrawlDb statistics start: /home/crawl/crawldb
Statistics for CrawlDb: /home/crawl/crawldb
TOTAL urls: 841285
retry 0: 797382
retry 1: 43903
min score: 3.0
avg score: 1.0366234
max score: 117.74
status 1 (db_unfetched): 219050
status 2 (db_fetched): 543113
status 3 (db_gone): 3250
status 4 (db_redir_temp): 58224
status 5 (db_redir_perm): 17648
CrawlDb statistics: done
So I understand there are about 219K unfetched URLs.
I wanted to re-fetch the same segment, and got:
cam@glacier:/home/nutch$ bin/nutch fetch /home/crawl/segments/20110708182237
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2011-07-09 14:30:13
Fetcher: segment: /home/crawl/segments/20110708182237
Fetcher: java.io.IOException: Segment already fetched!
at
org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:50)
at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1116)
I understand Nutch will not re-fetch URLs that have already been
fetched until their fetch interval expires (the default is one month,
I guess).
What I want is to fetch only the URLs from that segment that were not
fetched. How can I do that?
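The only workaround I can think of is to generate a fresh segment, on
the assumption that generate will select the URLs that are still due
(i.e. the db_unfetched ones), and run the cycle on that instead; a
sketch of what I mean:

```shell
# generate a new fetch list; generate prints/creates a new segment dir
bin/nutch generate /home/crawl/crawldb /home/crawl/segments
# pick the newest segment (segment names are timestamps, so sorting works)
seg=$(ls -d /home/crawl/segments/* | sort | tail -1)
# fetch, parse, and fold the results back into the crawldb
bin/nutch fetch "$seg"
bin/nutch parse "$seg"
bin/nutch updatedb /home/crawl/crawldb "$seg"
```

Is that the intended way, or is there a way to resume fetching inside
the existing segment?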
Best Regards,
C.B.