Thanks. We should decrease the default setting for solr.commit.size.
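For reference, a sketch of what that override could look like in nutch-site.xml. The property name solr.commit.size and the value 100 come from the thread below; the comment's timings are the ones Arkadi reported:

```xml
<!-- Lower the number of documents buffered per Solr update request.
     Smaller batches reduce indexer RAM usage at a modest cost in
     indexing time (16 min vs. 15 min 24 s for 73K docs in this thread). -->
<property>
  <name>solr.commit.size</name>
  <value>100</value>
</property>
```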
> Confirming that this worked. Also, the times look interesting: sending 73K
> documents in 1000-doc batches (the default) took 16 minutes; sending 73K
> documents in 100-doc batches took 15 minutes 24 seconds.
>
> Regards,
>
> Arkadi
>
> > -----Original Message-----
> > From: arkadi.kosmy...@csiro.au [mailto:arkadi.kosmy...@csiro.au]
> > Sent: Friday, 28 October 2011 12:11 PM
> > To: user@nutch.apache.org; markus.jel...@openindex.io
> > Subject: [ExternalEmail] RE: OutOfMemoryError when indexing into Solr
> >
> > Hi Markus,
> >
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > > Sent: Thursday, 27 October 2011 11:33 PM
> > > To: user@nutch.apache.org
> > > Subject: Re: OutOfMemoryError when indexing into Solr
> > >
> > > Interesting, how many records and how large are your records?
> >
> > There are a bit more than 80,000 documents.
> >
> > <property>
> >   <name>http.content.limit</name> <value>150000000</value>
> > </property>
> > <property>
> >   <name>indexer.max.tokens</name><value>100000</value>
> > </property>
> >
> > > How did you increase JVM heap size?
> >
> > opts="-XX:+UseConcMarkSweepGC -Xms500m -Xmx6000m
> >   -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30
> >   -XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled"
> >
> > > Do you have custom indexing filters?
> >
> > Yes. They add a few fields to each document. These fields are small,
> > within a hundred bytes per document.
> >
> > > Can you decrease the commit.size?
> >
> > Yes, thank you, good idea. I had not even considered it because, for
> > whatever reason, this option was not in my nutch-default.xml. I've set
> > it to 100. I hope that a Solr commit is not issued after each batch;
> > that would have a very negative impact on performance, because Solr
> > commits are very expensive.
> >
> > > Do you also index large amounts of anchors (without deduplication)
> > > and pass in a very large linkdb?
> >
> > I do index anchors, but I don't think there is anything extraordinary
> > about them. As I index fewer than 100K pages, my linkdb should not be
> > nearly as large as in cases where people index millions of documents.
> >
> > > The reducer of IndexerMapReduce is a notorious RAM consumer.
> >
> > If reducing solr.commit.size helps, it would make sense to decrease
> > the default value. Sending small batches of documents to Solr without
> > commits is not expensive enough to justify risking memory problems.
> >
> > Thanks again.
> >
> > Regards,
> >
> > Arkadi
> >
> > > On Thursday 27 October 2011 05:54:54 arkadi.kosmy...@csiro.au wrote:
> > > > Hi,
> > > >
> > > > I am working with a Nutch 1.4 snapshot and having a very strange
> > > > problem that makes the system run out of memory when indexing into
> > > > Solr. This does not look like a trivial lack-of-memory problem that
> > > > can be solved by giving more memory to the JVM. I've increased the
> > > > max memory size from 2Gb to 3Gb, then to 6Gb, but this did not make
> > > > any difference.
> > > >
> > > > A log extract is included below.
> > > >
> > > > Would anyone have any idea how to fix this problem?
> > > >
> > > > Thanks,
> > > >
> > > > Arkadi
> > > >
> > > > 2011-10-27 07:08:22,162 INFO solr.SolrWriter - Adding 1000 documents
> > > > 2011-10-27 07:08:42,248 INFO solr.SolrWriter - Adding 1000 documents
> > > > 2011-10-27 07:13:54,110 WARN mapred.LocalJobRunner - job_local_0254
> > > > java.lang.OutOfMemoryError: Java heap space
> > > >     at java.util.Arrays.copyOfRange(Arrays.java:3209)
> > > >     at java.lang.String.<init>(String.java:215)
> > > >     at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
> > > >     at java.nio.CharBuffer.toString(CharBuffer.java:1157)
> > > >     at org.apache.hadoop.io.Text.decode(Text.java:350)
> > > >     at org.apache.hadoop.io.Text.decode(Text.java:322)
> > > >     at org.apache.hadoop.io.Text.readString(Text.java:403)
> > > >     at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
> > > >     at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
> > > >     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
> > > >     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
> > > >     at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
> > > >     at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
> > > >     at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
> > > >     at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
> > > >     at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
> > > >     at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
> > > >     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
> > > >     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
> > > >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> > > > 2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
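The JVM flags quoted in the thread are normally exported before invoking the Nutch scripts; a minimal sketch, assuming your bin/nutch honors a NUTCH_OPTS environment variable (some versions instead take the heap size via NUTCH_HEAPSIZE, so check your script):

```shell
# Pass the heap and GC settings from the thread to the Nutch CLI.
# NUTCH_OPTS is an assumption here -- verify the variable name in bin/nutch.
export NUTCH_OPTS="-XX:+UseConcMarkSweepGC -Xms500m -Xmx6000m \
  -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 \
  -XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled"

# Sanity-check that the max heap made it into the options string.
echo "$NUTCH_OPTS" | grep -o 'Xmx[0-9]*m'
```

Note that even with a 6 GB heap, Arkadi's OOM persisted; the fix that actually helped was shrinking the per-request document batch, since the IndexerMapReduce reducer buffers whole documents (including ParseText) in memory.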