Crawling Pages from Single Domain

Siddharth Shah Tue, 10 Mar 2015 05:15:55 -0700

Hello All,
I have a question regarding running Nutch on Hadoop. The
current setup is as follows:


   - Hadoop 1.0.3 cluster on AWS EMR (1 master on a medium instance + 3
   slave nodes on small instances)
   - Nutch 1.7
   - Apart from the default Hadoop config, only mapred.map.tasks is set to 3
   - In Nutch, I've updated nutch-site.xml with a proper agent name
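
As a rough sketch, the two non-default settings listed above would typically look like the fragment below. This assumes mapred.map.tasks is set in mapred-site.xml (the post does not say where it was set) and that the agent name goes in the standard http.agent.name property; the value "MyCrawler" is a placeholder, since the actual agent name is not given.

```xml
<!-- mapred-site.xml (assumed location): hint for the number of map tasks per job -->
<property>
  <name>mapred.map.tasks</name>
  <value>3</value>
</property>

<!-- nutch-site.xml: agent name sent by the fetcher; "MyCrawler" is a placeholder -->
<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value>
</property>
```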

I have a seed list of about 700,000 pages from a single domain. My
questions are:

   - What settings do I need to update so that the fetcher runs on all 3
   nodes as opposed to a single node?
   - What would be appropriate values for depth and topN? (I am
   assuming 1 and 700000, respectively.)

Thank you,
Sidharth
