Hi Siddharth,

Check out the bin/crawl script. There you can set the number of slave nodes, as well as the topN for your crawl (the fetch list size, which the script computes as a per-slave batch times the number of slaves), and you want that to be 700,000+.
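For reference, the relevant variables look roughly like this (a sketch based on the stock Nutch 1.x bin/crawl; the exact names and the per-slave multiplier may differ in your copy, so check the script itself):

    # inside bin/crawl -- tune these to your cluster
    numSlaves=3                                # slave nodes available for fetching
    numTasks=`expr $numSlaves \* 2`            # task parallelism derived from the slave count
    sizeFetchlist=`expr $numSlaves \* 50000`   # becomes -topN for the generate step

For your case you'd want sizeFetchlist to come out at 700,000+, either by raising the multiplier or by hard-coding the value.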
If you tell the bin/crawl script to execute 1 round of 700,000+ pages, you will get your entire seed list (a sketch of such an invocation follows the quoted message below). You'd really only want to do it like this if you're planning on crawling the pages once and aren't interested in any of the outlinks. If you run another crawl using the same crawl db, you will end up following the outlinks collected in the initial crawl, unless you've excluded everything but your desired pages in regex-urlfilter.txt (also sketched after the quoted message).

Hope that helps.

jce

On Tue, Mar 10, 2015 at 8:14 AM, Siddharth Shah <[email protected]> wrote:

> Hello All,
>            I have a question regarding running Nutch on Hadoop. The
> current setup is as follows
>
>    - Hadoop 1.0.3 cluster on AWS's EMR (1 Master - Medium Instance + 3
>      Slave Nodes - Small Instance)
>    - Nutch 1.7
>    - Apart from the default Hadoop config, only mapred.map.tasks set to 3
>    - On Nutch I've updated nutch-site.xml with the proper agent name
>
> I have a seed list of about 700,000 pages from a single domain. So my
> questions are
>
>    - What setting do I need to update so that the fetcher works on all 3
>      nodes as opposed to a single node?
>    - What would be appropriate settings for depth and topN values? (I am
>      assuming them to be 1 and 700000 respectively)
>
> Thank you,
> Sidharth

--
Jonathan Cooper-Ellis
Field Enablement Engineer
<http://www.cloudera.com>
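Just to make the one-round run concrete, the invocation would look something like this (if I remember the 1.7 usage right it's <seedDir> <crawlDir> <solrURL> <numberOfRounds>; the paths and Solr URL below are placeholders):

    # one generate/fetch/parse/update round over the whole seed list
    bin/crawl urls/ crawl/ http://localhost:8983/solr/ 1

With sizeFetchlist (i.e. -topN) at 700,000+ as above, that single round should cover the entire seed list.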


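And for keeping any later rounds on the one domain, something along these lines in conf/regex-urlfilter.txt is the usual approach (example.com is a placeholder for your domain; these lines would replace the catch-all "+." accept rule at the bottom of the stock file):

    # accept only URLs on the target domain
    +^https?://([a-z0-9-]+\.)*example\.com/
    # reject everything else
    -.

The filter applies the first matching rule, so the accept line has to come before the final reject-everything line.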