Andrzej, do you know if the results will be evenly distributed? In that setup,
will there be five depth 2 pages returned from each seed URL, or could it end
up being 100 from the first and none from the last 20?

There's also a property called db.max.outlinks.per.page - can that be used to 
limit the number of depth 2 sites fetched?

-- Rob

-----Original Message-----
From: Andrzej Bialecki [mailto:[email protected]] 
Sent: Thursday, October 28, 2010 1:43 AM
To: [email protected]
Subject: Re: downloading exact number of pages from list of seed urls

On 2010-10-28 02:14, Rob Hunter wrote:
> Krish,
> 
>    I think what you're looking for is a depth of 2 - I believe depth of
> 1 will only return foo.bar.  Also, due to your depth change, I think you
> can reduce your topN to 50k.  I'm unsure if your results will be evenly
> distributed across your domains, hopefully someone else has an answer
> for that.

Actually, the "depth" is a misnomer... it doesn't relate to the actual
depth of URL paths, only to the number of hops from the seed pages. A
seed page www.a.com/index.html may contain links like this:

www.a.com/one.html
www.a.com/two/one.html
www.a.com/two/three/one.html

..and from the point of view of Nutch they are all at "depth" 2, i.e. it
takes two rounds of fetching to fetch them - the first round to fetch
the seed page, and the second round to fetch the linked pages.

In other words, depth=1 means "fetch the seed pages", depth=2 means
"fetch pages outlinked from seed pages", and so on...
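
To make the "rounds of fetching" idea concrete, here is a minimal sketch in
Python. It is not Nutch code: the function crawl_rounds and the LINKS table are
hypothetical, and it ignores topN, scoring and URL filters. It only shows that
"depth" counts fetch rounds from the seed, not the number of path segments.

```python
def crawl_rounds(seeds, outlinks, depth):
    """Return {url: round} for every page fetched within `depth` rounds.

    `outlinks` is a hypothetical in-memory link graph standing in for
    the links a real crawler would parse out of each fetched page.
    """
    fetched = {}
    frontier = list(seeds)
    for rnd in range(1, depth + 1):
        next_frontier = []
        for url in frontier:
            if url in fetched:
                continue  # already fetched in an earlier round
            fetched[url] = rnd
            # Links found on this page become candidates for the next round.
            next_frontier.extend(outlinks.get(url, []))
        frontier = next_frontier
    return fetched


# The example links from above: different path depths, same hop count.
LINKS = {
    "www.a.com/index.html": [
        "www.a.com/one.html",
        "www.a.com/two/one.html",
        "www.a.com/two/three/one.html",
    ],
}

rounds = crawl_rounds(["www.a.com/index.html"], LINKS, depth=2)
for url, rnd in rounds.items():
    print(rnd, url)
```

With depth=1 only the seed page is fetched; with depth=2 all three linked
pages are fetched in the second round, whatever their path looks like.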

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
