Original fetch interval: What do you mean? The script starts once a week
(only if it is not already running). A fetch cycle takes 1-3 days depending
on -topN and -depth. If you mean the "next fetch time" attribute on each
URL, I didn't change anything - I think it is 30 days by default.
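For reference, that 30-day default comes from the `db.fetch.interval.default` property in nutch-default.xml (2592000 seconds in Nutch 1.x). If one wanted a different interval, it could be overridden in nutch-site.xml along these lines (value shown is just the stock default):

```xml
<property>
  <name>db.fetch.interval.default</name>
  <!-- 30 days, expressed in seconds (the Nutch 1.x default) -->
  <value>2592000</value>
</property>
```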

The higher scoring was just an assumption on my part; I haven't actually
looked at the score values yet.

All my target sites are quite big and should contain more than 1 million
URLs each (e.g. www.motor-talk.de). So I tried a big -topN (20,000) and a
big -depth (20), but only one domain is ever selected, and the runtime
grows to as much as five days.
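This is consistent with how the generate step works: it builds each fetchlist from the topN highest-scoring URLs in the CrawlDB, so if one domain's pages consistently outscore everything else, that domain can fill the entire segment. A minimal, self-contained sketch of that selection (the domains and score values below are made up for illustration; this is not Nutch code):

```java
import java.util.*;

public class TopNByScore {
    // Hypothetical CrawlDB entry: a URL plus its score (illustration only).
    static class Entry {
        final String url;
        final float score;
        Entry(String url, float score) { this.url = url; this.score = score; }
    }

    // Build a toy CrawlDB with two large domains and return how many
    // carmondo.de URLs end up in a topN fetchlist.
    static long carmondoCount(int topN) {
        List<Entry> crawlDb = new ArrayList<>();
        // One domain whose pages consistently score high ...
        for (int i = 0; i < 1000; i++)
            crawlDb.add(new Entry("http://www.carmondo.de/page" + i, 10.0f + i % 5));
        // ... and another large domain with lower scores.
        for (int i = 0; i < 1000; i++)
            crawlDb.add(new Entry("http://www.motor-talk.de/page" + i, 1.0f + i % 5));

        // The generate step sorts by score (descending) and keeps the topN.
        crawlDb.sort((a, b) -> Float.compare(b.score, a.score));
        long n = 0;
        for (Entry e : crawlDb.subList(0, topN))
            if (e.url.contains("carmondo")) n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println("carmondo.de URLs in a topN=500 fetchlist: "
                + carmondoCount(500) + " of 500");
    }
}
```

With these toy scores the fetchlist is 100% carmondo.de until all of its URLs are exhausted, which matches the behaviour I'm seeing.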

I blocked that predominant domain (BTW, it is www.carmondo.de) in
regex-urlfilter.txt, but another domain moved up and became predominant
(www.motor-talk.de).
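For context, the exclusion I added was along these lines (the exact pattern is my own; the important part is that a `-` rule rejecting the host appears before the catch-all accept rule):

```
# reject everything from the predominant domain
-^http://([a-z0-9-]+\.)*carmondo\.de/
```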

Since I'm using Nutch 1.2, I implemented searching with NutchBean. Each
search returns results well distributed across all domains, so I think the
index is OK.

Since everything else seems to be OK, I'll now keep running the script with
a big -topN and -depth and hope that this behaviour changes some day -
maybe once all URLs from the predominant domain have been fetched. I'll let
you know.
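For completeness, the recrawl invocation I'm using is essentially the following (the seed and output directory names are placeholders):

```shell
# weekly recrawl; the wrapper script skips this if a previous run is still active
bin/nutch crawl urls -dir crawl -topN 20000 -depth 20
```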

2011/7/11 lewis john mcgibbney <[email protected]>

> What was the original fetch interval between successive crawls?
>
> Your script looks fine, which would also suggest that crawling itself is
> not the problem. You mentioned that the domain which is being fetched more
> than the others seems to receive a higher scoring count than other sites;
> how did you ascertain this? I know that this is a simple suggestion, but
> could it possibly be the case that -topN = 500 exceeds the number of pages
> in the domains which are not being fetched at subsequent recrawls?
> [...]
>
