Confirming that this worked. Also, the times look interesting: sending 73K
documents in 1000-doc batches (the default) took 16 minutes; sending 73K
documents in 100-doc batches took 15 minutes 24 seconds.
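In other words, the smaller batch size made almost no difference to throughput. A quick back-of-the-envelope check (assuming exactly 73,000 documents; the real count is "a bit more than 80,000" per the thread, so these are only rough rates):

```python
# Rough throughput comparison for the two batch sizes reported above.
# The document count and timings are taken from the message; exact
# figures may differ slightly.
docs = 73_000

secs_1000 = 16 * 60        # 1000-doc batches: 16 min
secs_100 = 15 * 60 + 24    # 100-doc batches: 15 min 24 s

rate_1000 = docs / secs_1000   # ~76 docs/sec
rate_100 = docs / secs_100     # ~79 docs/sec

print(f"1000-doc batches: {rate_1000:.0f} docs/sec")
print(f"100-doc batches:  {rate_100:.0f} docs/sec")
```

So batching overhead is negligible here; the batch size mainly matters for the indexer's memory footprint, not speed.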

Regards,

Arkadi

> -----Original Message-----
> From: arkadi.kosmy...@csiro.au [mailto:arkadi.kosmy...@csiro.au]
> Sent: Friday, 28 October 2011 12:11 PM
> To: user@nutch.apache.org; markus.jel...@openindex.io
> Subject: [ExternalEmail] RE: OutOfMemoryError when indexing into Solr
> 
> Hi Markus,
> 
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Thursday, 27 October 2011 11:33 PM
> > To: user@nutch.apache.org
> > Subject: Re: OutOfMemoryError when indexing into Solr
> >
> > Interesting, how many records and how large are your records?
> 
> There are a bit more than 80,000 documents.
> 
> <property>
>       <name>http.content.limit</name> <value>150000000</value>
> </property>
> 
> <property>
>    <name>indexer.max.tokens</name><value>100000</value>
> </property>
> 
> > How did you increase JVM heap size?
> 
> opts="-XX:+UseConcMarkSweepGC -Xms500m -Xmx6000m -XX:MinHeapFreeRatio=10
> -XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled"
> 
> > Do you have custom indexing filters?
> 
> Yes. They add a few fields to each document. These fields are small,
> within a hundred bytes per document.
> 
> > Can you decrease the commit.size?
> 
> Yes, thank you, good idea. I did not even consider it because, for
> whatever reason, this option was not in my nutch-default.xml. I've set
> it to 100. I hope that Solr does not commit after each batch is sent;
> otherwise this would have a very negative impact on performance,
> because Solr commits are very expensive.
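For reference, the property discussed here would go in conf/nutch-site.xml. A minimal sketch (the property name follows the thread; check the nutch-default.xml shipped with your Nutch version for the actual default value):

```xml
<!-- Number of documents buffered before being sent to Solr in one
     update request (not a Solr commit). Smaller values reduce the
     indexer's memory footprint at the cost of more HTTP requests. -->
<property>
  <name>solr.commit.size</name>
  <value>100</value>
</property>
```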
> 
> 
> > Do you also index large amounts of anchors (without deduplication)
> > and pass in a very large linkdb?
> 
> I do index anchors, but I don't think there is anything
> extraordinary about them. As I index fewer than 100K pages, my
> linkdb should not be nearly as large as in cases where people index
> millions of documents.
> 
> > The reducer of IndexerMapReduce is a notorious RAM consumer.
> 
> If reducing solr.commit.size helps, it would make sense to decrease
> the default value. Sending smaller batches of documents to Solr,
> without committing each one, is cheap enough that it is not worth
> risking memory problems.
> 
> Thanks again.
> 
> Regards,
> 
> Arkadi
> 
> 
> >
> > On Thursday 27 October 2011 05:54:54 arkadi.kosmy...@csiro.au wrote:
> > > Hi,
> > >
> > > I am working with a Nutch 1.4 snapshot and having a very strange
> > > problem that makes the system run out of memory when indexing into
> > > Solr. This does not look like a trivial lack-of-memory problem that
> > > can be solved by giving more memory to the JVM. I've increased the
> > > max memory size from 2Gb to 3Gb, then to 6Gb, but this did not
> > > make any difference.
> > >
> > > A log extract is included below.
> > >
> > > Would anyone have any idea of how to fix this problem?
> > >
> > > Thanks,
> > >
> > > Arkadi
> > >
> > >
> > > 2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000 documents
> > > 2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000 documents
> > > 2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner - job_local_0254
> > > java.lang.OutOfMemoryError: Java heap space
> > >        at java.util.Arrays.copyOfRange(Arrays.java:3209)
> > >        at java.lang.String.<init>(String.java:215)
> > >        at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
> > >        at java.nio.CharBuffer.toString(CharBuffer.java:1157)
> > >        at org.apache.hadoop.io.Text.decode(Text.java:350)
> > >        at org.apache.hadoop.io.Text.decode(Text.java:322)
> > >        at org.apache.hadoop.io.Text.readString(Text.java:403)
> > >        at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
> > >        at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
> > >        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
> > >        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
> > >        at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
> > >        at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
> > >        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
> > >        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
> > >        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
> > >        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
> > >        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
> > >        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
> > >        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> > > 2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
