[ https://issues.apache.org/jira/browse/NUTCH-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-2384.
------------------------------------
    Resolution: Incomplete
 Fix Version/s:     (was: 2.4)

Hi [~shubham.gupta], this can hardly be traced to any issue without the full job and task logs. A possible reason: the map tasks may require more memory. The same applies to the reducers, depending on how many fetcher threads are configured. Questions regarding configuration are better asked on the [Nutch user mailing list|http://nutch.apache.org/mailing_lists.html]. Thanks!
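For illustration only (the values below are assumptions, not settings tested against this crawl), raising the per-task memory in mapred-site.xml could look like the following sketch, keeping each -Xmx at roughly 80% of its container so YARN's physical-memory check does not kill the tasks. A second sketch, on the quoted YARN settings themselves, follows the quoted issue below.

{code:xml}
<!-- Hypothetical mapred-site.xml overrides: larger containers for the
     Nutch fetch job. Suitable values depend on the number of fetcher
     threads and on the memory actually free on the node managers. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <!-- heap at ~80% of the 2048 MB container -->
  <value>-Xmx1600m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <!-- heap at ~80% of the 4096 MB container -->
  <value>-Xmx3276m</value>
</property>
{code}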
> nutch 2.3.1 job not properly interacting with hadoop 2.7.1
> ----------------------------------------------------------
>
>                 Key: NUTCH-2384
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2384
>             Project: Nutch
>          Issue Type: Test
>          Components: nutchNewbie
>    Affects Versions: 2.3.1
>         Environment: nutch 2.3.1 + hadoop 2.7.1 + mongodb
>            Reporter: Shubham Gupta
>            Priority: Major
>
> Hey,
> I am testing the Nutch crawler in a local environment as well as on a Hadoop cluster.
> The script is able to fetch millions of documents, but the Apache job created after running the command "ant clean runtime" fails to do so.
> When testing in the local environment, i.e. using the command
> bin/nutch fetch -all -crawlId <table-name>
> it ends up fetching all the URLs that are present in the queue, and I have been able to crawl over 100,000 URLs (5,000 seed URLs).
> Whereas, when I run the same project on the Hadoop cluster, I am not able to reach even the 100,000 mark: it has only fetched about 45,000 URLs (1,100 seed URLs).
> When tested with 5,000 seed URLs, it also fetched only a similar amount of data.
> The plugins used in Nutch are as follows:
> protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic
> The settings I am using with the Hadoop cluster are as follows:
> MAPRED-SITE.XML:
> <property>
>   <name>mapreduce.map.memory.mb</name>
>   <value>1024</value>
> </property>
> <property>
>   <name>mapreduce.reduce.memory.mb</name>
>   <value>2048</value>
> </property>
> <property>
>   <name>mapreduce.reduce.java.opts</name>
>   <value>-Xmx1800m</value>
> </property>
> <property>
>   <name>mapreduce.map.java.opts</name>
>   <value>-Xmx712m</value>
> </property>
> <property>
>   <name>mapred.job.tracker.http.address</name>
>   <value>master:50030</value>
> </property>
> <property>
>   <name>yarn.app.mapreduce.am.resource.mb</name>
>   <value>1024</value>
> </property>
> <property>
>   <name>yarn.app.mapreduce.am.command-opts</name>
>   <value>-Xmx800m</value>
> </property>
> YARN-SITE.XML:
> <property>
>   <name>yarn.scheduler.minimum-allocation-mb</name>
>   <value>1024</value>
>   <description>Minimum memory allocated to containers.</description>
> </property>
> <property>
>   <name>yarn.scheduler.maximum-allocation-mb</name>
>   <value>5120</value>
>   <description>Maximum memory allocated to containers.</description>
> </property>
> <property>
>   <name>yarn.scheduler.minimum-allocation-vcores</name>
>   <value>1</value>
> </property>
> <property>
>   <name>yarn.scheduler.maximum-allocation-vcores</name>
>   <value>4</value>
> </property>
> <property>
>   <name>yarn.nodemanager.resource.memory-mb</name>
>   <value>12288</value>
>   <description>Maximum memory allocated to the NodeManager.</description>
> </property>
> <property>
>   <name>yarn.nodemanager.vmem-pmem-ratio</name>
>   <value>2.1</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>   <value>100</value>
> </property>
> <property>
>   <name>yarn.nodemanager.vmem-check-enabled</name>
>   <value>false</value>
>   <description>Whether virtual memory limits will be enforced for containers.</description>
> </property>
> The RAM available to the system is 6 GB and the network bandwidth available is 4 Mb/s.
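One further hedged note on the quoted configuration (a sketch under assumptions, not a verified fix): yarn.nodemanager.resource.memory-mb is set to 12288 MB although the machine has 6 GB of RAM, so the scheduler may overcommit memory and push the node into swapping; and yarn.scheduler.capacity.maximum-am-resource-percent expects a fraction between 0 and 1 (default 0.1, read from capacity-scheduler.xml), so a value of 100 effectively disables that limit. A more conservative sketch:

{code:xml}
<!-- Sketch only: advertise less memory than the 6 GB of physical RAM,
     leaving headroom for the OS, the NodeManager and other daemons. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<!-- Belongs in capacity-scheduler.xml; a fraction, not a percentage.
     0.5 here is an illustrative value, not a recommendation. -->
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>
{code}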