Hello,
I have injected, generated a fetch list, fetched, parsed, and
indexed to Solr successfully.
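For reference, the cycle I ran was roughly the following (the seed
directory and the Solr URL here are placeholders, not my exact values;
the segment name is the one generate produced):

```shell
# inject seed URLs into the crawldb
bin/nutch inject /home/crawl/crawldb urls
# generate a fetch list as a new segment
bin/nutch generate /home/crawl/crawldb /home/crawl/segments
# fetch and parse that segment
bin/nutch fetch /home/crawl/segments/20110708182237
bin/nutch parse /home/crawl/segments/20110708182237
# update the crawldb with the fetch results
bin/nutch updatedb /home/crawl/crawldb /home/crawl/segments/20110708182237
# push the segment to Solr
bin/nutch solrindex http://localhost:8983/solr/ /home/crawl/crawldb /home/crawl/segments/20110708182237
```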
When I do:
cam@glacier:/home/nutch$ bin/nutch readdb /home/crawl/crawldb -stats
CrawlDb statistics start: /home/crawl/crawldb
Statistics for CrawlDb: /home/crawl/crawldb
TOTAL urls: 841285
retry 0: 797382
retry 1: 43903
min score: 3.0
avg score: 1.0366234
max score: 117.74
status 1 (db_unfetched): 219050
status 2 (db_fetched): 543113
status 3 (db_gone): 3250
status 4 (db_redir_temp): 58224
status 5 (db_redir_perm): 17648
CrawlDb statistics: done
So I understand there are about 219K unfetched URLs.
I wanted to re-fetch the same segment, and got:
cam@glacier:/home/nutch$ bin/nutch fetch /home/crawl/segments/20110708182237
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2011-07-09 14:30:13
Fetcher: segment: /home/crawl/segments/20110708182237
Fetcher: java.io.IOException: Segment already fetched!
at
org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:50)
at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1116)
I understand Nutch will not re-fetch URLs that have already been
fetched until their fetch interval expires (the default is one month,
I guess).
What I want is to fetch only the URLs from that segment that were not
fetched. How can I do that?
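The only workaround I can think of is to generate a fresh segment, on
the assumption that generate will select the URLs that are still due
(i.e. the db_unfetched ones), and run the cycle on that instead; a
sketch of what I mean:

```shell
# generate a new fetch list; generate prints/creates a new segment dir
bin/nutch generate /home/crawl/crawldb /home/crawl/segments
# pick the newest segment (segment names are timestamps, so sorting works)
seg=$(ls -d /home/crawl/segments/* | sort | tail -1)
# fetch, parse, and fold the results back into the crawldb
bin/nutch fetch "$seg"
bin/nutch parse "$seg"
bin/nutch updatedb /home/crawl/crawldb "$seg"
```

Is that the intended way, or is there a way to resume fetching inside
the existing segment?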
Best Regards,
C.B.