On 2010-10-28 02:14, Rob Hunter wrote:
> Krish,
> 
>    I think what you're looking for is a depth of 2 - I believe depth of
> 1 will only return foo.bar.  Also, due to your depth change, I think you
> can reduce your topN to 50k.  I'm unsure if your results will be evenly
> distributed across your domains, hopefully someone else has an answer
> for that.

Actually, the "depth" is a misnomer... it doesn't relate to the actual
depth of URL paths, only to the number of hops from the seed pages. A
seed page www.a.com/index.html may contain links like this:

www.a.com/one.html
www.a.com/two/one.html
www.a.com/two/three/one.html

...and from the point of view of Nutch they are all at "depth" 2, i.e. it
takes two rounds of fetching to fetch them: the first round to fetch
the seed page, and the second round to fetch the linked pages.

In other words, depth=1 means "fetch the seed pages", depth=2 means
"fetch pages outlinked from seed pages", and so on...

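To make the hop counting concrete, here is a rough Python sketch of the
rounds described above. It is a toy model, not Nutch code: the crawl()
function, the LINKS table and the extra page www.a.com/four.html are
made up for illustration, and it ignores topN, scoring and everything
else a real crawl does.

```python
# Toy model of fetch rounds; NOT how Nutch is implemented.
# LINKS is a hypothetical link graph: page -> pages it links to.
LINKS = {
    "www.a.com/index.html": [
        "www.a.com/one.html",
        "www.a.com/two/one.html",
        "www.a.com/two/three/one.html",
    ],
    # Hypothetical extra link, only here to show a third round.
    "www.a.com/one.html": ["www.a.com/four.html"],
}


def crawl(seeds, depth):
    """Return {url: round in which it was fetched} after `depth` rounds."""
    fetched = {}
    frontier = list(seeds)
    for round_no in range(1, depth + 1):
        next_frontier = []
        for url in frontier:
            if url in fetched:
                continue  # already fetched in an earlier round
            fetched[url] = round_no
            # Outlinks become candidates for the next round.
            next_frontier.extend(LINKS.get(url, []))
        frontier = next_frontier
    return fetched


if __name__ == "__main__":
    for url, round_no in crawl(["www.a.com/index.html"], 2).items():
        print(round_no, url)
```

With depth=1 only the seed page comes back. With depth=2 all three
linked pages are fetched in round 2, no matter how many slashes are in
their paths, and www.a.com/four.html would only show up with depth=3.
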
-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
