[ 
https://issues.apache.org/jira/browse/NUTCH-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2384.
------------------------------------
       Resolution: Incomplete
    Fix Version/s:     (was: 2.4)

Hi [~shubham.gupta], this can hardly be traced to any issue without the full 
job and task logs. A possible reason: the map tasks may require more memory; 
the same holds for the reducers, depending on how many fetcher threads are 
configured. Questions regarding configuration are better asked on the [Nutch 
user mailing list|http://nutch.apache.org/mailing_lists.html]. Thanks!
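As a rough sketch of the kind of adjustment meant above (the values are 
illustrative assumptions, not a tested recommendation): give the map and reduce 
containers more headroom in mapred-site.xml, and/or lower the number of fetcher 
threads in nutch-site.xml so each reducer needs less memory.

<!-- mapred-site.xml: larger containers for the fetch job (example values) -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1640m</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx3280m</value>
</property>

<!-- nutch-site.xml: fewer fetcher threads per reduce task lowers its memory footprint (example value) -->
<property>
<name>fetcher.threads.fetch</name>
<value>10</value>
</property>

Keep the -Xmx heap size noticeably below the container size (roughly 80% here) 
so the JVM plus off-heap overhead stays within the container limit.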

> nutch 2.3.1 job not properly interacting with hadoop 2.7.1
> ----------------------------------------------------------
>
>                 Key: NUTCH-2384
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2384
>             Project: Nutch
>          Issue Type: Test
>          Components: nutchNewbie
>    Affects Versions: 2.3.1
>         Environment: nutch 2.3.1 + hadoop 2.7.1 + mongodb
>            Reporter: Shubham Gupta
>            Priority: Major
>
> Hey, 
> I am testing the Nutch crawler in a local environment as well as on a Hadoop 
> cluster. 
> The script is able to fetch millions of documents, but the Hadoop job created 
> by running the command "ant clean runtime" fails to do so.
> When testing in the local environment, i.e. using the following command:
> bin/nutch fetch -all -crawlId <table-name>
> it ends up fetching all the URLs present in the queue, and I have been able 
> to crawl over 100,000 URLs (5000 seed URLs).
> Whereas, when I run the same project on the Hadoop cluster, I am not able to 
> reach even the 100,000 mark; it has fetched only 45,000 URLs (1100 seed 
> URLs).
> Even when tested with 5000 seed URLs, it fetched only a similar amount 
> of data.
> The plugins used in Nutch are as follows:
> protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic
> The settings I am using with the hadoop cluster are as follows:
> MAPRED-SITE.XML:
> <property>
> <name>mapreduce.map.memory.mb</name>
> <value>1024</value>
> </property>
> <property>
> <name>mapreduce.reduce.memory.mb</name>
> <value>2048</value>
> </property>
> <property>
> <name>mapreduce.reduce.java.opts</name>
> <value>-Xmx1800m</value>
> </property>
> <property>
> <name>mapreduce.map.java.opts</name>
> <value>-Xmx712m</value>
> </property>
> <property>
> <name>mapred.job.tracker.http.address</name>
> <value>master:50030</value>
> </property>
> <property>
>     <name>yarn.app.mapreduce.am.resource.mb</name>
>         <value>1024</value>
>         </property>
>         <property>
>             <name>yarn.app.mapreduce.am.command-opts</name>
>                 <value>-Xmx800m</value>
>                 </property>
> YARN-SITE.XML:
> <property>
>     <name>yarn.scheduler.minimum-allocation-mb</name>
>     <value>1024</value>
>    <description>minimum memory allocated to containers.</description>
> </property>
> <property>
>     <name>yarn.scheduler.maximum-allocation-mb</name>
>     <value>5120</value>
>    <description>maximum memory allocated to containers.</description>
> </property>
> <property>
>     <name>yarn.scheduler.minimum-allocation-vcores</name>
>     <value>1</value>
> </property>
> <property>
>     <name>yarn.scheduler.maximum-allocation-vcores</name>
>     <value>4</value>
>  </property>
> <property>
>    <name>yarn.nodemanager.resource.memory-mb</name>
>    <value>12288</value>
> <description>maximum memory allocated to the nodemanager.</description>
> </property>
> <property>
>  <name>yarn.nodemanager.vmem-pmem-ratio</name>
>  <value>2.1</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>   <value>100</value>
> </property>
> <property>
>    <name>yarn.nodemanager.vmem-check-enabled</name>
>     <value>false</value>
>     <description>Whether virtual memory limits will be enforced for 
> containers</description>
>   </property>
> The RAM available to the system is 6 GB and the network bandwidth available 
> is 4 Mb/sec. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)