Thanks. We should decrease the default setting for commit.size.

> Confirming that this worked. Also, times look interesting: to send 73K
> documents in 1000 doc batches (default) took 16 minutes; to send 73K
> documents in 100 doc batches took 15 minutes 24 seconds.
> 
> Regards,
> 
> Arkadi
> 
> > -----Original Message-----
> > From: arkadi.kosmy...@csiro.au [mailto:arkadi.kosmy...@csiro.au]
> > Sent: Friday, 28 October 2011 12:11 PM
> > To: user@nutch.apache.org; markus.jel...@openindex.io
> > Subject: [ExternalEmail] RE: OutOfMemoryError when indexing into Solr
> > 
> > Hi Markus,
> > 
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > > Sent: Thursday, 27 October 2011 11:33 PM
> > > To: user@nutch.apache.org
> > > Subject: Re: OutOfMemoryError when indexing into Solr
> > > 
> > > Interesting, how many records and how large are your records?
> > 
> > There are a bit more than 80,000 documents.
> > 
> > <property>
> >   <name>http.content.limit</name>
> >   <value>150000000</value>
> > </property>
> > <property>
> >   <name>indexer.max.tokens</name>
> >   <value>100000</value>
> > </property>
> > 
> > > How did you increase JVM heap size?
> > 
> > opts="-XX:+UseConcMarkSweepGC -Xms500m -Xmx6000m -
> > XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m -
> > XX:+CMSClassUnloadingEnabled"
> > 
> > > Do you have custom indexing filters?
> > 
> > Yes. They add a few fields to each document. These fields are small,
> > within a hundred bytes or so per document.
> > 
> > > Can you decrease the commit.size?
> > 
> > Yes. Thank you, good idea. I had not even considered it because, for
> > whatever reason, this option was not in my nutch-default.xml. I've set
> > it to 100. I hope Solr does not commit after each batch is sent;
> > otherwise this would hurt performance badly, because Solr commits are
> > very expensive.
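> > 
> > (For reference, the override lives in nutch-site.xml. A minimal sketch,
> > assuming the solr.commit.size property name used by the 1.x SolrWriter:
> > 
> > <property>
> >   <name>solr.commit.size</name>
> >   <value>100</value>
> >   <description>Number of documents to buffer before sending them to
> >   Solr in a single update request.</description>
> > </property>
> > )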
> > 
> > > Do you also index large amounts of anchors (without deduplication)
> > > and pass in a very large linkdb?
> > 
> > I do index anchors, but I don't think there is anything extraordinary
> > about them. As I index fewer than 100K pages, my linkdb should be
> > nowhere near as large as in cases where people index millions of
> > documents.
> > 
> > > The reducer of IndexerMapReduce is a notorious RAM consumer.
> > 
> > If reducing solr.commit.size helps, it would make sense to decrease the
> > default value. Sending documents to Solr in smaller batches, without
> > committing, is not expensive enough to be worth risking memory problems.
> > 
> > Thanks again.
> > 
> > Regards,
> > 
> > Arkadi
> > 
> > > On Thursday 27 October 2011 05:54:54 arkadi.kosmy...@csiro.au wrote:
> > > > Hi,
> > > > 
> > > > I am working with a Nutch 1.4 snapshot and having a very strange
> > > > problem that makes the system run out of memory when indexing into
> > > > Solr. This does not look like a trivial lack of memory problem that
> > > > can be solved by giving more memory to the JVM. I've increased the
> > > > max memory size from 2Gb to 3Gb, then to 6Gb, but this did not make
> > > > any difference.
> > > > 
> > > > A log extract is included below.
> > > > 
> > > > Would anyone have any idea of how to fix this problem?
> > > > 
> > > > Thanks,
> > > > 
> > > > Arkadi
> > > > 
> > > > 
> > > > 2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000
> > 
> > documents
> > 
> > > > 2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000
> > 
> > documents
> > 
> > > > 2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner -
> > 
> > job_local_0254
> > 
> > > > java.lang.OutOfMemoryError: Java heap space
> > > > 
> > > >        at java.util.Arrays.copyOfRange(Arrays.java:3209)
> > > >        at java.lang.String.<init>(String.java:215)
> > > >        at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
> > > >        at java.nio.CharBuffer.toString(CharBuffer.java:1157)
> > > >        at org.apache.hadoop.io.Text.decode(Text.java:350)
> > > >        at org.apache.hadoop.io.Text.decode(Text.java:322)
> > > >        at org.apache.hadoop.io.Text.readString(Text.java:403)
> > > >        at
> > > 
> > > org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
> > > 
> > > >        at
> > 
> > org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWri
> > 
> > > tab
> > > 
> > > > leConfigurable.java:54) at
> > 
> > org.apache.hadoop.io.serializer.WritableSerialization$WritableDeseriali
> > 
> > > zer
> > > 
> > > > .deserialize(WritableSerialization.java:67) at
> > 
> > org.apache.hadoop.io.serializer.WritableSerialization$WritableDeseriali
> > 
> > > zer
> > > 
> > > > .deserialize(WritableSerialization.java:40) at
> > 
> > org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:99
> > 
> > > 1)
> > > 
> > > > at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
> > > 
> > > at
> > 
> > org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(Red
> > 
> > > uce
> > > 
> > > > Task.java:241) at
> > 
> > org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTas
> > 
> > > k.j
> > > 
> > > > ava:237) at
> > 
> > org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:
> > > 81)
> > > 
> > > > at
> > 
> > org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:
> > > 50)
> > > 
> > > > at
> > 
> > org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
> > 
> > > > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411) at
> > 
> > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216
> > 
> > > )
> > > 
> > > > 2011-10-27 07:13:54,382 ERROR solr.SolrIndexer -
> > 
> > java.io.IOException:
> > > Job
> > > 
> > > > failed!
> > > 
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
