Andrzej, do you know if the results will be evenly distributed? In that setup, will there be five depth-2 sites returned from each seed URL, or could it end up being 100 from the first and none from the last 20?
There's also a property called db.max.outlinks.per.page - can that be used to limit the number of depth 2 sites fetched?

-- Rob

-----Original Message-----
From: Andrzej Bialecki [mailto:[email protected]]
Sent: Thursday, October 28, 2010 1:43 AM
To: [email protected]
Subject: Re: downloading exact number of pages from list of seed urls

On 2010-10-28 02:14, Rob Hunter wrote:
> Krish,
>
> I think what you're looking for is a depth of 2 - I believe depth of
> 1 will only return foo.bar. Also, due to your depth change, I think you
> can reduce your topN to 50k. I'm unsure if your results will be evenly
> distributed across your domains, hopefully someone else has an answer
> for that.

Actually, the "depth" is a misnomer... it doesn't relate to the actual depth of URL paths, only to the number of hops from the seed pages. A seed page www.a.com/index.html may contain links like this:

www.a.com/one.html
www.a.com/two/one.html
www.a.com/two/three/one.html

..and from the point of view of Nutch they are all at "depth" 2, i.e. it takes two rounds of fetching to fetch them - the first round to fetch the seed page, and the second round to fetch the linked pages.

In other words, depth=1 means "fetch the seed pages", depth=2 means "fetch pages outlinked from seed pages", and so on...

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
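
To illustrate the hop-based meaning of "depth" described in the reply above, here is a minimal Python sketch. It is not Nutch code: the function name hop_depths and the OUTLINKS dictionary are hypothetical, and the link graph simply reuses the example URLs from the message. It only shows that depth counts fetch rounds from the seed, not path segments in the URL.

```python
from collections import deque


def hop_depths(seeds, outlinks):
    """Return {url: depth}, where depth is the fetch round in which
    the URL is first reached (seed pages are depth 1)."""
    depth = {url: 1 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for link in outlinks.get(url, []):
            if link not in depth:  # keep the earliest round only
                depth[link] = depth[url] + 1
                queue.append(link)
    return depth


# Hypothetical link graph built from the example URLs in the message.
OUTLINKS = {
    "www.a.com/index.html": [
        "www.a.com/one.html",
        "www.a.com/two/one.html",
        "www.a.com/two/three/one.html",
    ],
}

depths = hop_depths(["www.a.com/index.html"], OUTLINKS)
for url in sorted(depths):
    print(depths[url], url)
```

All three linked pages come out at depth 2 in this toy model, regardless of how many path segments their URLs have.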

