Nutch + Solr - Indexer causes java.lang.OutOfMemoryError: Java heap space
Hello everyone, I have configured my two servers to run in distributed mode (with Hadoop). My crawling setup is Nutch 2.2.1 with HBase as storage and Solr, where Solr runs under Tomcat. The problem appears every time I run the last step, i.e. when I index the data from HBase into Solr: it fails with the error shown at *[1]* below. I tried adding CATALINA_OPTS (or JAVA_OPTS) to Tomcat's catalina.sh script like this:

  CATALINA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC -Xms1g -Xmx6000m -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled"

and started the server with this script, but it didn't help. I also added the properties shown at *[2]* to my nutch-site.xml file, but it still ended with an OutOfMemoryError. Can you help me, please?

*[1]*

2014-09-06 22:52:50,683 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOf(Arrays.java:2367)
  at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
  at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
  at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:587)
  at java.lang.StringBuffer.append(StringBuffer.java:332)
  at java.io.StringWriter.write(StringWriter.java:77)
  at org.apache.solr.common.util.XML.escape(XML.java:204)
  at org.apache.solr.common.util.XML.escapeCharData(XML.java:77)
  at org.apache.solr.common.util.XML.writeXML(XML.java:147)
  at org.apache.solr.client.solrj.util.ClientUtils.writeVal(ClientUtils.java:161)
  at org.apache.solr.client.solrj.util.ClientUtils.writeXML(ClientUtils.java:129)
  at org.apache.solr.client.solrj.request.UpdateRequest.writeXML(UpdateRequest.java:355)
  at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:271)
  at org.apache.solr.client.solrj.request.RequestWriter.getContentStream(RequestWriter.java:66)
  at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getDelegate(RequestWriter.java:94)
  at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getName(RequestWriter.java:104)
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:247)
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
  at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
  at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
  at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
  at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:96)
  at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
  at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
  at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:650)
  at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1793)
  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:779)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
  at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)

*[2]*

<property>
  <name>http.content.limit</name>
  <value>15000</value>
  <description>The length limit for downloaded content using the http protocol, in bytes. If this value is nonnegative (>= 0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting. For our purposes it is twice the default - parsing big pages: 128 * 1024</description>
</property>

<property>
  <name>indexer.max.tokens</name>
  <value>10</value>
</property>

<property>
  <name>http.timeout</name>
  <value>5</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>solr.commit.size</name>
  <value>100</value>
  <description>Defines the number of documents to send to Solr in a single update batch. Decrease when handling very large documents to prevent Nutch from running out of memory. NOTE: It does not explicitly trigger a server side commit.</description>
</property>
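Note that in the trace at *[1]* the OutOfMemoryError is thrown inside the Hadoop child JVM that runs the indexing map task (org.apache.hadoop.mapred.Child), not inside Tomcat, so CATALINA_OPTS/JAVA_OPTS only enlarge the heap on the Solr side. A minimal sketch of raising the map task heap instead - assuming a Hadoop 1.x-style configuration, with the 2 GB figure as an arbitrary example rather than a recommendation from this thread - would be a property like this in mapred-site.xml:

  <property>
    <name>mapred.child.java.opts</name>
    <!-- illustrative value only; size this to the documents being indexed -->
    <value>-Xmx2048m</value>
  </property>

Combined with a smaller solr.commit.size (the number of documents Nutch batches into a single Solr update request, as described in *[2]*), this is the kind of knob usually adjusted for indexing-time heap failures.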
Re: Nutch 1.7 fetch happening in a single map task.
Hi Julien,

What do you mean by crawlID, please? I am using Nutch 1.8 and followed the instructions in the tutorial as mentioned before, and I seem to have a similar situation, that is, fetch runs in only one map task. I am running on a cluster of four nodes on Hadoop 2.4.1. I notice that the map task can be assigned to any node, but there is only one map task each round. I have set numSlaves=4 and mode=distributed. The seed URL list includes five different websites from different hosts. Are there any settings I missed? Thanks in advance.

Regards, Simon

On Fri, Aug 29, 2014 at 10:39 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote:
No, just do 'bin/crawl seedDir crawlID solrURL numberOfRounds' from the master node. It internally calls the nutch script for the individual commands, which takes care of sending the job jar to your Hadoop cluster; see https://github.com/apache/nutch/blob/trunk/src/bin/nutch#L271

On 29 August 2014 15:24, S.L simpleliving...@gmail.com wrote:
Sorry Julien, I overlooked the directory names. My understanding is that the Hadoop job is submitted to a cluster by using the following command on the RM node: bin/hadoop .job file params. Are you suggesting I submit the script instead of the Nutch .job jar, like below? bin/hadoop bin/crawl seedDir crawlID solrURL numberOfRounds

On Fri, Aug 29, 2014 at 10:01 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote:
As the name runtime/deploy suggests, it is used exactly for that purpose ;-) Just make sure HADOOP_HOME/bin is added to the path and run the script, that's all. Look at the bottom of the nutch script for details. Julien
PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU (http://sched.co/1pbE15n) where we'll cover things like these.

On 29 August 2014 14:30, S.L simpleliving...@gmail.com wrote:
Thanks, can this be used on a Hadoop cluster? Sent from my HTC
- Reply message -
From: Julien Nioche lists.digitalpeb...@gmail.com
To: user@nutch.apache.org
Subject: Nutch 1.7 fetch happening in a single map task.
Date: Fri, Aug 29, 2014 9:00 AM
See http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script - just go to runtime/deploy/bin and run the script from there. Julien

On 29 August 2014 13:38, Meraj A. Khan mera...@gmail.com wrote:
Hi Julien, I have 15 domains and they are all being fetched in a single map task, which does not fetch all the URLs no matter what depth or topN I give. I am submitting the Nutch job jar, which seems to be using the Crawl.java class. How do I use the crawl script on a Hadoop cluster - are there any pointers you can share? Thanks.

On Aug 29, 2014 4:40 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote:
Hi Meraj, The generator will place all the URLs in a single segment if they all belong to the same host, for politeness reasons. Otherwise it will use whichever value is passed with the -numFetchers parameter in the generation step. Why don't you use the crawl script in /bin instead of tinkering with the (now deprecated) Crawl class? It comes with a good default configuration and should make your life easier. Julien

On 28 August 2014 06:47, Meraj A. Khan mera...@gmail.com wrote:
Hi All, I am running Nutch 1.7 on a Hadoop 2.3.0 cluster and I noticed that there is only a single reducer in the generate-partition job. I am running into a situation where the subsequent fetch runs in only a single map task (I believe as a consequence of the single reducer in the earlier phase). How can I force Nutch to fetch in multiple map tasks? Is there a setting to force more than one reducer in the generate-partition job so that there are more map tasks? Please also note that I have commented out the code in Crawl.java so that it does not do the LinkInversion phase, as I don't need the scoring of the URLs that Nutch crawls; every URL is equally important to me. Thanks.

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
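For reference, a minimal invocation of the crawl script from runtime/deploy - with a placeholder seed directory, crawl name, Solr URL, and round count, since the thread only gives the parameter names - might look like:

  cd runtime/deploy
  bin/crawl urls/ myCrawl http://localhost:8983/solr/collection1 2

Here urls/ is the directory holding the seed list, myCrawl is the second argument discussed above (crawlID in Julien's message), the third argument is the Solr endpoint, and 2 is the number of crawl rounds.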
Re: Nutch 1.7 fetch happening in a single map task.
I think that is a typo, and it is actually the crawl directory (CrawlDirectory). As for the single-map-task issue, although I have not tried it yet, we can control the number of fetchers with the numFetchers parameter when running the generate step via bin/nutch generate.

On Sep 7, 2014 9:23 AM, Simon Z simonz.nu...@gmail.com wrote:
Hi Julien, What do you mean by crawlID, please? I am using Nutch 1.8 and followed the instructions in the tutorial as mentioned before, and I seem to have a similar situation, that is, fetch runs in only one map task. I am running on a cluster of four nodes on Hadoop 2.4.1. I notice that the map task can be assigned to any node, but there is only one map task each round. I have set numSlaves=4 and mode=distributed. The seed URL list includes five different websites from different hosts. Are there any settings I missed? Thanks in advance. Regards, Simon
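Regarding the numFetchers suggestion above, a hedged sketch of that generate invocation on a Nutch 1.x crawl layout - the paths, topN, and fetcher count here are placeholders - could be:

  bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -numFetchers 4

-numFetchers sets how many fetch lists (and hence fetch map tasks) the generator partitions the selected URLs into; as Julien noted earlier in the thread, URLs belonging to the same host are still kept together for politeness, so a seed list dominated by a single host will not spread across tasks.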