On 2010-10-28 02:14, Rob Hunter wrote:
> Krish,
>
> I think what you're looking for is a depth of 2 - I believe depth of
> 1 will only return foo.bar. Also, due to your depth change, I think you
> can reduce your topN to 50k. I'm unsure if your results will be evenly
> distributed across your domains, hopefully someone else has an answer
> for that.

Actually, "depth" is a misnomer: it doesn't relate to the actual depth
of URL paths, only to the number of hops from the seed pages. A seed
page www.a.com/index.html may contain links like this:

www.a.com/one.html
www.a.com/two/one.html
www.a.com/two/three/one.html

...and from the point of view of Nutch they are all at "depth" 2, i.e.
it takes two rounds of fetching to fetch them: the first round to fetch
the seed page, and the second round to fetch the linked pages.

In other words, depth=1 means "fetch the seed pages", depth=2 means
"also fetch the pages outlinked from the seed pages", and so on...

-- 
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
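
For illustration only, the "depth = number of fetch rounds" idea can be
sketched as a toy model. This is not Nutch code: crawl_rounds and the
links dictionary are hypothetical names made up for this example, and
the sketch ignores topN, scoring, and URL filtering. It only records
the round in which each page would first be fetched.

```python
def crawl_rounds(seeds, outlinks, depth):
    """Toy model: return {url: round in which it was first fetched}."""
    fetched = {}
    frontier = list(seeds)
    for round_no in range(1, depth + 1):
        next_frontier = []
        for url in frontier:
            if url in fetched:
                continue  # already fetched in an earlier round
            fetched[url] = round_no
            # Links found on this page become candidates for the next round.
            next_frontier.extend(outlinks.get(url, []))
        frontier = next_frontier
    return fetched


# Hypothetical link graph matching the example URLs in the message.
links = {
    "www.a.com/index.html": [
        "www.a.com/one.html",
        "www.a.com/two/one.html",
        "www.a.com/two/three/one.html",
    ],
}
seeds = ["www.a.com/index.html"]

for url, round_no in crawl_rounds(seeds, links, depth=2).items():
    print(round_no, url)
```

With depth=1 only the seed page is recorded. With depth=2 all three
linked pages are recorded in round 2, regardless of how many path
segments their URLs contain.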

