Hello All,
I have a question about running Nutch on Hadoop. The
current setup is as follows:
- Hadoop 1.0.3 cluster on AWS EMR (1 master node on a medium instance +
3 slave nodes on small instances)
- Nutch 1.7
- Apart from the default Hadoop config, only mapred.map.tasks is set to 3
- In Nutch, I've updated nutch-site.xml with a proper agent name
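
For reference, the two non-default settings look roughly like the
sketch below. The agent name value is a placeholder, not the real one,
and I am assuming here that mapred.map.tasks goes in mapred-site.xml;
on EMR it may be supplied some other way.

```xml
<!-- Hadoop side (assumed location: mapred-site.xml) -->
<property>
  <name>mapred.map.tasks</name>
  <value>3</value>
</property>

<!-- Nutch side: nutch-site.xml -->
<property>
  <name>http.agent.name</name>
  <!-- placeholder value, for illustration only -->
  <value>MyCrawlerAgent</value>
</property>
```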
I have a seed list of about 700,000 pages from a single domain. So my
questions are:
- What settings do I need to update so that the fetcher runs on all 3
nodes rather than on a single node?
- What would be appropriate values for depth and topN? (I am
assuming 1 and 700000, respectively.)
Thank you,
Sidharth