Nutch + Solr - Indexer causes java.lang.OutOfMemoryError: Java heap space

2014-09-07 Thread glumet
Hello everyone, 

I have configured my 2 servers to run in distributed mode (with Hadoop), and
my crawling setup is Nutch 2.2.1 with HBase as the storage backend and Solr,
which runs under Tomcat. The problem is that every time I run the last step,
i.e. when I index the data from HBase into Solr, the job fails with the error
shown in *[1]*. I tried adding CATALINA_OPTS (or JAVA_OPTS) like this:

CATALINA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC -Xms1g -Xmx6000m
-XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m
-XX:+CMSClassUnloadingEnabled"

to Tomcat's catalina.sh script and started the server with that script, but it
didn't help. I also added the properties shown in *[2]* to my nutch-site.xml
file, but it still ended with an OutOfMemoryError. Can you help me please?

*[1]*
2014-09-06 22:52:50,683 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:587)
    at java.lang.StringBuffer.append(StringBuffer.java:332)
    at java.io.StringWriter.write(StringWriter.java:77)
    at org.apache.solr.common.util.XML.escape(XML.java:204)
    at org.apache.solr.common.util.XML.escapeCharData(XML.java:77)
    at org.apache.solr.common.util.XML.writeXML(XML.java:147)
    at org.apache.solr.client.solrj.util.ClientUtils.writeVal(ClientUtils.java:161)
    at org.apache.solr.client.solrj.util.ClientUtils.writeXML(ClientUtils.java:129)
    at org.apache.solr.client.solrj.request.UpdateRequest.writeXML(UpdateRequest.java:355)
    at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:271)
    at org.apache.solr.client.solrj.request.RequestWriter.getContentStream(RequestWriter.java:66)
    at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getDelegate(RequestWriter.java:94)
    at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getName(RequestWriter.java:104)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:247)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:96)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:650)
    at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1793)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:779)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)

*[2]*

<property>
  <name>http.content.limit</name>
  <value>15000</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  For our purposes it is twice the default - parsing big pages: 128 * 1024
  </description>
</property>

<property>
  <name>indexer.max.tokens</name>
  <value>10</value>
</property>

<property>
  <name>http.timeout</name>
  <value>5</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>solr.commit.size</name>
  <value>100</value>
  <description>
  Defines the number of documents to send to Solr in a single update batch.
  Decrease when handling very large documents to prevent Nutch from running
  out of memory. NOTE: It does not explicitly trigger a server side commit.
  </description>
</property>
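
Since the trace in *[1]* shows the OutOfMemoryError being thrown in the Hadoop
child task (org.apache.hadoop.mapred.Child) rather than inside Tomcat, a
hedged sketch of one more setting worth checking is the heap given to the
map/reduce child JVMs. The property name is the standard Hadoop one; the
-Xmx1000m value is only an illustrative assumption, not something confirmed in
this thread, and whether setting it in nutch-site.xml (instead of the
cluster's mapred-site.xml) reaches the job depends on how the job
configuration is assembled:

<property>
  <name>mapred.child.java.opts</name>
  <!-- heap for each map/reduce child JVM; -Xmx1000m is an illustrative
       value, not one taken from this thread -->
  <value>-Xmx1000m</value>
</property>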





Re: Nutch 1.7 fetch happening in a single map task.

2014-09-07 Thread Simon Z
Hi Julien,

What do you mean by crawlID, please? I am using Nutch 1.8 and followed the
instructions in the tutorial mentioned before, and I seem to have a similar
situation, that is, the fetch runs in only one map task. I am running on a
cluster of four nodes on Hadoop 2.4.1.

Note that the map task can be assigned to any node, but there is only one map
task in each round.

I have set

numSlaves=4
mode=distributed


The seed URL list includes five different websites from different hosts.


Are there any settings I missed?

Thanks in advance.

Regards,

Simon


On Fri, Aug 29, 2014 at 10:39 PM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 No, just do 'bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>' from
 the master node. It internally calls the nutch script for the individual
 commands, which takes care of sending the job jar to your hadoop cluster,
 see https://github.com/apache/nutch/blob/trunk/src/bin/nutch#L271
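
 For illustration (the seed directory, crawl ID, Solr URL and number of rounds
 below are placeholders, not values from this thread), an invocation from the
 master node might look like:

     bin/crawl urls/ mycrawl http://localhost:8983/solr/ 2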




 On 29 August 2014 15:24, S.L simpleliving...@gmail.com wrote:

  Sorry Julien, I overlooked the directory names.

  My understanding is that the Hadoop job is submitted to a cluster by using
  the following command on the RM node: bin/hadoop <.job file> <params>

  Are you suggesting I submit the script instead of the Nutch .job jar, like
  below?

  bin/hadoop bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
 
 
  On Fri, Aug 29, 2014 at 10:01 AM, Julien Nioche 
  lists.digitalpeb...@gmail.com wrote:
 
   As the name runtime/deploy suggests - it is used exactly for that purpose
   ;-) Just make sure HADOOP_HOME/bin is added to the path and run the
   script, that's all.
   Look at the bottom of the nutch script for details.

   Julien

   PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU
   (http://sched.co/1pbE15n) where we'll cover things like these
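
   A minimal sketch of that PATH setup (assuming HADOOP_HOME already points
   at the Hadoop installation):

       export PATH="$HADOOP_HOME/bin:$PATH"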
  
  
  
   On 29 August 2014 14:30, S.L simpleliving...@gmail.com wrote:
  
Thanks, can this be used on a hadoop cluster?
   
Sent from my HTC
   
- Reply message -
From: Julien Nioche lists.digitalpeb...@gmail.com
To: user@nutch.apache.org user@nutch.apache.org
Subject: Nutch 1.7 fetch happening in a single map task.
Date: Fri, Aug 29, 2014 9:00 AM
   
See
   
  http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
   
just go to runtime/deploy/bin and run the script from there.
   
Julien
   
   
On 29 August 2014 13:38, Meraj A. Khan mera...@gmail.com wrote:
   
  Hi Julien,

  I have 15 domains and they are all being fetched in a single map task,
  which does not fetch all the URLs no matter what depth or topN I give.

  I am submitting the Nutch job jar, which seems to be using the Crawl.java
  class. How do I use the crawl script on a Hadoop cluster? Are there any
  pointers you can share?

  Thanks.
 On Aug 29, 2014 4:40 AM, Julien Nioche 
   lists.digitalpeb...@gmail.com
 wrote:

   Hi Meraj,

   The generator will place all the URLs in a single segment if they all
   belong to the same host, for politeness reasons. Otherwise it will use
   whichever value is passed with the -numFetchers parameter in the
   generation step.

   Why don't you use the crawl script in /bin instead of tinkering with the
   (now deprecated) Crawl class? It comes with a good default configuration
   and should make your life easier.

   Julien
 
 
  On 28 August 2014 06:47, Meraj A. Khan mera...@gmail.com
 wrote:
 
    Hi All,

    I am running Nutch 1.7 on a Hadoop 2.3.0 cluster and I noticed that
    there is only a single reducer in the generate-partition job. I am in a
    situation where the subsequent fetch runs in only a single map task
    (I believe as a consequence of the single reducer in the earlier phase).
    How can I force Nutch to fetch in multiple map tasks? Is there a setting
    to force more than one reducer in the generate-partition job so that
    there are more map tasks?

    Please also note that I have commented out the code in Crawl.java so it
    does not do the link inversion phase, as I don't need the scoring of the
    URLs that Nutch crawls; every URL is equally important to me.

    Thanks.
  
 
 
 
  --
 
  Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
  http://twitter.com/digitalpebble
 

   
   
   
--
   
Open Source Solutions for Text Engineering
   
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
   
  
  
  
   --
  
   Open Source Solutions for Text Engineering
  
   http://digitalpebble.blogspot.com/
   http://www.digitalpebble.com
   http://twitter.com/digitalpebble
  
 



 

Re: Nutch 1.7 fetch happening in a single map task.

2014-09-07 Thread Meraj A. Khan
I think that is a typo, and it is actually the crawl directory. As for the
single-map-task issue, although I have not tried it yet, we can control the
number of fetchers with the numFetchers parameter when doing the generate
step via bin/nutch generate.
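
For illustration (the paths and numbers below are placeholders, not values
taken from this thread), passing -numFetchers at the generate step in Nutch
1.x looks roughly like this and should allow up to four fetch map tasks for
the generated segment:

    bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -numFetchers 4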