Thanks for the info.  I'll give the Solr deduplication a try.  It looks like
it's not as thorough as the regular dedup process (URL, content, highest
score, shortest URL), but I think it will work.

Brad 

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Friday, September 24, 2010 5:27 AM
To: [email protected]
Subject: Re: Nutch 1.2 solrdedup and OutOfMemoryError

I'm not surprised that your memory is exhausted when fetching almost 10 million
documents! The deduplication code is a bit hard to read, but it looks like
it's hardcoded to fetch all records and split them between maps. If you've
got one map, it will fetch all records and thus eat your memory.

I'm unsure how this can be fixed, but in the meantime you can work around it by
implementing deduplication in your solrconfig.
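For reference, Solr's built-in deduplication uses SignatureUpdateProcessorFactory in an update processor chain in solrconfig.xml. A minimal sketch follows; the field names (url, content, signature) match typical Nutch schemas but should be checked against your own schema, and the signature field must exist there as an indexed field:

```xml
<!-- Sketch of Solr-side dedup in solrconfig.xml; field names are assumptions. -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Field that will hold the computed signature -->
    <str name="signatureField">signature</str>
    <!-- Delete older documents that produce the same signature -->
    <bool name="overwriteDupes">true</bool>
    <!-- Fields the signature is computed from -->
    <str name="fields">url,content</str>
    <!-- TextProfileSignature does fuzzy matching; Lookup3Signature is exact -->
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain then has to be attached to the update handler (e.g. via an `update.chain` default on the `/update` request handler) so that deduplication happens at index time instead of as a separate post-indexing job.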

On Friday 24 September 2010 04:59:03 brad wrote:
> I'm running into an error trying to run solrdedup: bin/nutch solrdedup
> http://127.0.0.1:8080/solr-nutch/
> 
> 2010-09-23 18:37:16,119 INFO  mapred.JobClient - Running job: job_local_0001
> 2010-09-23 18:37:17,123 INFO  mapred.JobClient -  map 0%  reduce 0%
> 2010-09-23 18:52:17,801 WARN  mapred.LocalJobRunner - job_local_0001
> java.lang.OutOfMemoryError: Java heap space
>       at org.apache.solr.common.util.JavaBinCodec.readSolrDocument(JavaBinCodec.java:323)
>       at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:204)
>       at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:405)
>       at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:171)
>       at org.apache.solr.common.util.JavaBinCodec.readSolrDocumentList(JavaBinCodec.java:339)
>       at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:206)
>       at org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:110)
>       at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:173)
>       at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:101)
>       at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
>       at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
>       at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
>       at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
>       at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
>       at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:233)
>       at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> 
> 
> I'm running solr via tomcat.  Tomcat is being started with the memory 
> parameters of :
> -Xms2048m -Xmx2048m
> 
> So basically there is 2 GB of memory allocated to heap space.  I have
> noticed that by changing the parameters, the location of the error can
> change, but the bottom line is I still run out of heap space.
> 
> Nutch runs for about 15 minutes and then the error occurs.
> 
> I only have 1 Solr index, and the data/index directory size is about 85 GB.
> I'm using the delivered solrconfig.xml file.
> 
> Is there something else I need to do?  Is there some change to the Solr or
> Tomcat config I have missed?
> 
> 
> Config:
> Nutch Release 1.2 - 08/07/2010
> CentOS Linux 5.5
> Linux 2.6.18-194.3.1.el5 on x86_64
> Intel(R) Xeon(R) CPU X3220 @ 2.40GHz
> 8gb of ram
> 
> 
> Thanks
> Brad
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

