I'm not surprised that your memory gets eaten when fetching almost 10 million 
documents! The deduplication code is a bit hard to read, but it looks like it 
is hardcoded to fetch all records and split them between map tasks. If you've 
got only one map, it will fetch every record, and that is what eats your memory.
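
To illustrate, here is a rough SolrJ sketch. It is not the actual 
SolrDeleteDuplicates code; the URL is just the one from your solrdedup command 
and the page size is an arbitrary placeholder. Pulling everything in a single 
response forces the whole result set through JavaBinCodec into the client heap, 
while paging with start/rows keeps memory bounded:

// A rough SolrJ sketch, not the actual SolrDeleteDuplicates code: it only
// illustrates why a single map task can blow the heap. The URL comes from
// the solrdedup command above; PAGE_SIZE is an arbitrary placeholder.
import java.net.MalformedURLException;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class DedupFetchSketch {

  private static final int PAGE_SIZE = 10000; // placeholder page size

  public static void main(String[] args)
      throws MalformedURLException, SolrServerException {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://127.0.0.1:8080/solr-nutch/");

    // What effectively happens with a single split: one query whose rows
    // value equals numFound, so ~10M documents are deserialized by
    // JavaBinCodec into the client heap at once.
    //   SolrQuery all = new SolrQuery("*:*");
    //   all.setRows((int) numFound);   // the whole index in one response

    // Bounded alternative: walk the index one page at a time.
    long numFound = Long.MAX_VALUE;     // learned from the first response
    for (int start = 0; start < numFound; start += PAGE_SIZE) {
      SolrQuery q = new SolrQuery("*:*");
      q.setStart(start);
      q.setRows(PAGE_SIZE);
      QueryResponse rsp = solr.query(q);
      SolrDocumentList page = rsp.getResults();
      numFound = page.getNumFound();    // total number of matching docs
      for (SolrDocument doc : page) {
        // inspect doc.getFieldValue("digest") / getFieldValue("id") here
      }
    }
  }
}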

I'm not sure how this can be fixed in Nutch itself, but in the meantime you can 
work around it by implementing deduplication in your solrconfig.
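
The usual way to do that is a SignatureUpdateProcessorFactory chain. Treat the 
snippet below as a sketch: the chain name, the field list and the signature 
field name are examples, and you still need to add the signature field to your 
schema.xml (string, indexed, stored). With overwriteDupes set to true, Solr 
removes older duplicates at index time, so you no longer depend on the 
solrdedup job:

<!-- in solrconfig.xml: example dedupe chain, names are placeholders -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <!-- fields that define a duplicate; adjust to your Nutch schema -->
    <str name="fields">url,content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- hook the chain into the update handler; on Solr 1.4 the parameter is
     update.processor, on later releases it is update.chain -->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>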

On Friday 24 September 2010 04:59:03 brad wrote:
> I'm running into an error trying to run solrdedup:
> bin/nutch solrdedup http://127.0.0.1:8080/solr-nutch/
> 
> 2010-09-23 18:37:16,119 INFO  mapred.JobClient - Running job: job_local_0001
> 2010-09-23 18:37:17,123 INFO  mapred.JobClient -  map 0% reduce 0%
> 2010-09-23 18:52:17,801 WARN  mapred.LocalJobRunner - job_local_0001
> java.lang.OutOfMemoryError: Java heap space
>       at org.apache.solr.common.util.JavaBinCodec.readSolrDocument(JavaBinCodec.java:323)
>       at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:204)
>       at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:405)
>       at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:171)
>       at org.apache.solr.common.util.JavaBinCodec.readSolrDocumentList(JavaBinCodec.java:339)
>       at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:206)
>       at org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:110)
>       at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:173)
>       at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:101)
>       at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
>       at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
>       at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
>       at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
>       at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
>       at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:233)
>       at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> 
> 
> I'm running Solr via Tomcat.  Tomcat is being started with the memory
> parameters:
> -Xms2048m -Xmx2048m
> 
> So basically there are 2 GB of memory allocated to heap space.  I have
> noticed that changing the parameters moves the location of the error around
> a bit, but the bottom line is that I still run out of heap space.
> 
> Nutch runs for about 15 minutes and then the error occurs.
> 
> I only have one Solr index, and the data/index directory is about 85 GB.
> I'm using the solrconfig.xml file as delivered.
> 
> Is there something else I need to do?  Is there some change to the Solr or
> Tomcat config that I have missed?
> 
> 
> Config:
> Nutch Release 1.2 - 08/07/2010
> CentOS Linux 5.5
> Linux 2.6.18-194.3.1.el5 on x86_64
> Intel(R) Xeon(R) CPU X3220 @ 2.40GHz
> 8 GB of RAM
> 
> 
> Thanks
> Brad
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
