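Krish, I think what you're looking for is a depth of 2. I believe a depth of 1 will only fetch the seed pages themselves (foo.bar), not the pages they link to. Also, with the depth change, I think you can reduce your topN to 50k: as far as I know, topN caps the number of pages fetched per level, so the second round needs at most 10,000 x 5 = 50,000. I'm unsure whether your results will be evenly distributed across your domains; hopefully someone else has an answer for that.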
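Here is a rough sketch of that arithmetic and the resulting command, assuming -topN really is applied per level. The directory names `urls` and `crawl-out` are placeholders I made up, not your actual paths, and the script only prints the command instead of running it:

```shell
#!/bin/sh
# Assumption: -topN limits the pages fetched in each round (level),
# so round 1 fetches the seeds and round 2 needs room for 5 pages per seed.
seeds=10000          # number of seed URLs, from the original question
pages_per_seed=5     # pages wanted from each seed

topn=$((seeds * pages_per_seed))

# "urls" and "crawl-out" are hypothetical placeholder paths.
cmd="bin/nutch crawl urls -dir crawl-out -depth 2 -topN $topn"
echo "$cmd"
```

Note that this only sizes the overall budget for the second round; it does nothing to guarantee exactly 5 pages from each domain.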
-- Rob

-----Original Message-----
From: Krish Pan [mailto:[email protected]]
Sent: Wednesday, October 27, 2010 2:29 PM
To: [email protected]
Subject: downloading exact number of pages from list of seed urls

Hi,

I am trying to use Nutch to download an exact number (say 5) of HTML pages from each seed page I provide. I was wondering if this is the right approach:

Seed URLs = 10,000 total

bin/nutch crawl urls/<list of domains> -dir <out> -depth 1 -topN 60000

Here depth = 1 because I want pages from the first level only, i.e. if the domain is foo.bar I want to download

foo.bar/spam.htm
foo.bar/ham.htm
foo.bar/eggs.htm

but NOT

foo.bar/ham/spam.htm

And -topN is 60,000 because there are 10,000 seed URLs: 10,000 home pages plus 5 top pages per URL.

Any suggestions?

Thanks,
krish

