RE: downloading exact number of pages from list of seed urls

Rob Hunter Wed, 27 Oct 2010 17:14:50 -0700

Krish,

   I think what you're looking for is a depth of 2. I believe a depth of
1 will only return foo.bar itself.  Also, given that depth change, I think
you can reduce your topN to 50k.  I'm unsure whether your results will be
evenly distributed across your domains; hopefully someone else has an
answer for that.


-- Rob

-----Original Message-----
From: Krish Pan [mailto:[email protected]] 
Sent: Wednesday, October 27, 2010 2:29 PM
To: [email protected]
Subject: downloading exact number of pages from list of seed urls

Hi,

I am trying to use Nutch to download an exact number (say 5) of HTML
pages from each seed page I provide.

I was wondering if this is the right approach:

Seed URLs = 10,000 in total

 bin/nutch crawl urls/<list of domains> -dir <out> -depth 1 -topN 60000

Here depth = 1 because I only want pages from the first level,

i.e. if the domain is foo.bar I want to download

foo.bar/spam.htm
foo.bar/ham.htm
foo.bar/eggs.htm

but NOT
foo.bar/ham/spam.htm
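
The first-level rule in the example above can be written down as a small check. This is only a sketch in Python, not part of Nutch: the helper name is_first_level is hypothetical, and it assumes "first level" means exactly one path segment after the host. Nutch's -depth counts link hops from the seed rather than path segments, so a rule like this would probably have to be applied separately (for example through a URL filter) instead of through -depth alone.

```python
from urllib.parse import urlparse


def is_first_level(url: str) -> bool:
    """Return True if the URL has exactly one path segment after the host.

    Hypothetical helper for illustration; not a Nutch API.
    """
    # Bare "foo.bar/spam.htm" has no scheme, so add one for urlparse.
    if "://" not in url:
        url = "http://" + url
    path = urlparse(url).path
    # Drop empty pieces produced by leading or trailing slashes.
    segments = [s for s in path.split("/") if s]
    return len(segments) == 1


if __name__ == "__main__":
    for candidate in ("foo.bar/spam.htm", "foo.bar/ham/spam.htm"):
        print(candidate, is_first_level(candidate))
```

Under this assumption the home page itself (zero path segments) is not counted as a first-level page.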

And,

-topN is 60,000 because there are 10,000 seed URLs:
10,000 home pages plus 5 top pages per URL (10,000 + 5 x 10,000 = 60,000)
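
The arithmetic behind that topN figure, as a minimal Python sketch. The function name page_budget is hypothetical and the numbers are the ones from this message. Note that the Nutch tutorial describes -topN as a cap on the pages fetched at each level, so the total below is a page count for the whole crawl and not necessarily the right value for -topN; the 50k in the reply matches the second term.

```python
def page_budget(seed_count: int, pages_per_seed: int) -> tuple[int, int, int]:
    """Return (home_pages, linked_pages, total) for a two-level crawl.

    Hypothetical helper for illustration; not a Nutch API.
    """
    home_pages = seed_count                      # one home page per seed URL
    linked_pages = seed_count * pages_per_seed   # first-level pages wanted
    return home_pages, linked_pages, home_pages + linked_pages


if __name__ == "__main__":
    home, linked, total = page_budget(10_000, 5)
    print(home, linked, total)  # prints: 10000 50000 60000
```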

Any suggestions?

Thanks,
krish
