Confirming that this worked. The timings are also interesting: sending 73K documents in 1000-document batches (the default) took 16 minutes; sending the same 73K documents in 100-document batches took 15 minutes 24 seconds.
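For reference, the batch size in question is the solr.commit.size property discussed below (the log extract shows solr.SolrWriter adding documents in batches of that size). A minimal sketch of the override used for the 100-document runs, assuming it goes into conf/nutch-site.xml as usual; the description text is illustrative, not copied from nutch-default.xml:

  <!-- conf/nutch-site.xml: override the default Solr batch size of 1000 -->
  <property>
    <name>solr.commit.size</name>
    <value>100</value>
    <description>Number of documents buffered by SolrWriter before they
    are sent to Solr in one update request during indexing.</description>
  </property>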
Regards,

Arkadi

> -----Original Message-----
> From: arkadi.kosmy...@csiro.au [mailto:arkadi.kosmy...@csiro.au]
> Sent: Friday, 28 October 2011 12:11 PM
> To: user@nutch.apache.org; markus.jel...@openindex.io
> Subject: [ExternalEmail] RE: OutOfMemoryError when indexing into Solr
>
> Hi Markus,
>
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Thursday, 27 October 2011 11:33 PM
> > To: user@nutch.apache.org
> > Subject: Re: OutOfMemoryError when indexing into Solr
> >
> > Interesting, how many records and how large are your records?
>
> There are a bit more than 80,000 documents.
>
> <property>
>   <name>http.content.limit</name> <value>150000000</value>
> </property>
>
> <property>
>   <name>indexer.max.tokens</name><value>100000</value>
> </property>
>
> > How did you increase JVM heap size?
>
> opts="-XX:+UseConcMarkSweepGC -Xms500m -Xmx6000m -XX:MinHeapFreeRatio=10
> -XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled"
>
> > Do you have custom indexing filters?
>
> Yes. They add a few fields to each document. These fields are small,
> within a hundred bytes per document.
>
> > Can you decrease the commit.size?
>
> Yes, thank you, good idea. I did not even consider it because, for
> whatever reason, this option was not in my nutch-default.xml. I have set
> it to 100. I hope that a Solr commit is not issued after each batch is
> sent; otherwise this would have a very negative impact on performance,
> because Solr commits are very expensive.
>
> > Do you also index large amounts of anchors (without deduplication)
> > and pass in a very large linkdb?
>
> I do index anchors, but I don't think there is anything extraordinary
> about them. As I index fewer than 100K pages, my linkdb should not be
> nearly as large as in cases where people index millions of documents.
>
> > The reducer of IndexerMapReduce is a notorious RAM consumer.
>
> If reducing solr.commit.size helps, it would make sense to decrease the
> default value. Sending small batches of documents to Solr without
> commits is not so expensive that it is worth risking memory problems.
>
> Thanks again.
>
> Regards,
>
> Arkadi
>
> > On Thursday 27 October 2011 05:54:54 arkadi.kosmy...@csiro.au wrote:
> > > Hi,
> > >
> > > I am working with a Nutch 1.4 snapshot and having a very strange
> > > problem that makes the system run out of memory when indexing into
> > > Solr. This does not look like a trivial lack-of-memory problem that
> > > can be solved by giving more memory to the JVM. I have increased the
> > > maximum memory size from 2 GB to 3 GB, then to 6 GB, but this did not
> > > make any difference.
> > >
> > > A log extract is included below.
> > >
> > > Would anyone have any idea of how to fix this problem?
> > >
> > > Thanks,
> > >
> > > Arkadi
> > >
> > > 2011-10-27 07:08:22,162 INFO solr.SolrWriter - Adding 1000 documents
> > > 2011-10-27 07:08:42,248 INFO solr.SolrWriter - Adding 1000 documents
> > > 2011-10-27 07:13:54,110 WARN mapred.LocalJobRunner - job_local_0254
> > > java.lang.OutOfMemoryError: Java heap space
> > >   at java.util.Arrays.copyOfRange(Arrays.java:3209)
> > >   at java.lang.String.<init>(String.java:215)
> > >   at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
> > >   at java.nio.CharBuffer.toString(CharBuffer.java:1157)
> > >   at org.apache.hadoop.io.Text.decode(Text.java:350)
> > >   at org.apache.hadoop.io.Text.decode(Text.java:322)
> > >   at org.apache.hadoop.io.Text.readString(Text.java:403)
> > >   at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
> > >   at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
> > >   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
> > >   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
> > >   at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
> > >   at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
> > >   at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
> > >   at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
> > >   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
> > >   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
> > >   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
> > >   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
> > >   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> > > 2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
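A side note on where the -Xmx setting quoted above takes effect: the trace shows mapred.LocalJobRunner, i.e. the job runs in local mode, so the IndexerMapReduce reducer executes inside the same JVM as the Nutch client process and is governed by that process's heap options. Only in a distributed Hadoop setup is the child-task heap configured separately, typically via mapred.child.java.opts. A hedged sketch of that job-configuration override, reusing values from the thread; the property name is the standard Hadoop 0.20 one and is not taken from this thread:

  <!-- Hadoop/Nutch job configuration: heap options for child task JVMs.    -->
  <!-- Only relevant when the indexing job runs on a real cluster, not with -->
  <!-- the LocalJobRunner shown in the trace above.                         -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx6000m -XX:+UseConcMarkSweepGC -XX:MaxPermSize=512m</value>
  </property>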