Original fetch interval: what do you mean? The script starts once a week (and of course only if it is not already running). A fetch cycle takes 1-3 days, depending on -topN and -depth. If you mean the "next fetch time" attribute on each URL, I didn't change anything there - I think it defaults to 30 days.
The high scoring was just an assumption on my part; I haven't looked at the actual score values yet. All my target sites are quite big and should each contain more than 1 million URLs (e.g. www.motor-talk.de). So I tried a big -topN (20,000) and a big -depth (20), but one domain was always selected almost exclusively, and the runtime increased to up to five days. I blocked that predominant domain (BTW, this is www.carmondo.de) in regex-urlfilter.txt, but another domain moved up and became predominant (www.motor-talk.de).

Since I'm using Nutch 1.2, I implemented NutchBean for searching. Each search returns results well distributed across all domains, so I think the index is OK. Since everything else seems to be OK, I'll now run the script with big -topN and -depth values and hope that this behaviour changes some day - maybe after all URLs from the predominant domain have been fetched. I'll let you know.

2011/7/11 lewis john mcgibbney <[email protected]>

> What was the original fetch interval between successive crawls?
>
> Your script looks fine, which would also suggest that crawling itself is
> not the problem. You mentioned that the domain which is being fetched more
> than the others seems to receive a higher score than the other sites - how
> did you ascertain this? I know that this is a simple suggestion, but could
> it possibly be the case that -topN = 500 exceeds the number of pages in the
> domains which are not being fetched at subsequent recrawls?
> [...]
>
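For reference, the exclusion rule I mean takes roughly this form in conf/regex-urlfilter.txt (a sketch following the style of the default rules; the exact pattern in my config may differ):

```
# Exclude the predominant domain so URLs from other domains
# get generated and fetched instead. Rules are applied top-down;
# the first matching prefix (+ accept / - reject) wins.
-^http://([a-z0-9-]+\.)*carmondo\.de/

# Default catch-all from the stock config: accept everything else.
+.
```

A less drastic alternative, if it is available in your Nutch version, might be capping URLs per host at generate time (the generate.max.per.host property), so no single domain can fill an entire segment - but I haven't tried that yet.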

