downloading exact number of pages from list of seed urls

Krish Pan Wed, 27 Oct 2010 14:29:31 -0700

Hi,

I am trying to use Nutch to download an exact number (say 5) of HTML pages
from each seed page I provide.

I was wondering if this is the right approach:

Seed Urls = total 10,000

 bin/nutch crawl urls/<list of domains> -dir <out> -depth 1 -topN 60000

Here depth = 1 because I want pages from the first level only,

i.e. if the domain is foo.bar, I want to download

foo.bar/spam.htm
foo.bar/ham.htm
foo.bar/eggs.htm

but NOT
foo.bar/ham/spam.htm
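
To make the first-level distinction concrete, here is a minimal sketch in Python. It is not part of Nutch; the helper name is_first_level is hypothetical, and it assumes "first level" means exactly one path segment after the host.

```python
from urllib.parse import urlparse


def is_first_level(url: str) -> bool:
    """Return True if the URL path has exactly one non-empty segment."""
    # urlparse only separates host from path when a scheme is present,
    # so prepend one for bare inputs such as "foo.bar/spam.htm".
    if "://" not in url:
        url = "http://" + url
    path = urlparse(url).path
    # Drop empty pieces produced by leading or trailing slashes.
    segments = [s for s in path.split("/") if s]
    return len(segments) == 1


for candidate in ("foo.bar/spam.htm", "foo.bar/ham.htm", "foo.bar/ham/spam.htm"):
    print(candidate, is_first_level(candidate))
```

Under this assumption the home page itself (foo.bar) has zero path segments, so it is not counted as a first-level page.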

And,

-topN is 60,000 because there are 10,000 seed urls:
10,000 home pages plus 5 top pages per url (10,000 x 5 = 50,000)
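
The arithmetic behind that figure can be checked with a short Python sketch. The function name total_pages is hypothetical; it only reproduces the calculation above and says nothing about how Nutch itself applies -topN.

```python
def total_pages(seed_count: int, pages_per_seed: int) -> int:
    """Each seed contributes its home page plus pages_per_seed linked pages."""
    return seed_count * (1 + pages_per_seed)


# 10,000 seeds, each with 1 home page and 5 further pages.
print(total_pages(10_000, 5))  # 60000
```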

Any suggestions?

Thanks,
krish
