Although it might be possible to list multiple fields for a deduplication 
processor, I doubt its usefulness. If multiple fields are concatenated 
before hashing, you can only deduplicate documents whose bodies are identical 
for all fields. I'd rather define a single processor per field and create 
a Solr field for storing each digest. That way I can deduplicate 
documents that have similar bodies (beware of the breadcrumbs on sites) OR 
exactly the same title under different URLs.
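For illustration, a one-processor-per-field setup along these lines could go in solrconfig.xml. This is a sketch, not a tested config: the chain name, field names, and signature-field names are assumptions, and the signature fields would still need to be declared in schema.xml.

```xml
<!-- Illustrative sketch: one SignatureUpdateProcessorFactory per field,
     each writing its digest into its own Solr field. Names below
     (dedup, content_signature, title_signature) are assumptions. -->
<updateRequestProcessorChain name="dedup">
  <!-- Near-duplicate detection on the body text -->
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">content_signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">content</str>
    <!-- fuzzy signature: tolerates small differences such as breadcrumbs -->
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <!-- Exact-duplicate detection on the title -->
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">title_signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">title</str>
    <!-- exact hash: only identical titles collide -->
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain then has to be attached to the update request handler (via its update.processor/update.chain setting, depending on the Solr version) for it to run at index time.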
 
-----Original message-----
From: Nemani, Raj <[email protected]>
Sent: Fri 24-09-2010 23:09
To: [email protected]; 
Subject: RE: Nutch 1.2 solrdedup and OutOfMemoryError

Well, I think you can specify a list of fields in solrconfig.xml when
configuring deduplication, to control how Solr determines whether two
documents are identical.  It should be pretty flexible.  Correct me, of
course, if I misunderstood your comment.

-----Original Message-----
From: brad [mailto:[email protected]] 
Sent: Friday, September 24, 2010 4:00 PM
To: [email protected]
Subject: RE: Nutch 1.2 solrdedup and OutOfMemoryError

Thanks for the info.  I'll give the Solr deduplication a try.  It looks
like it's not as thorough as the regular dedup process (URL, content,
highest score, shortest URL), but I think it will work.

Brad 

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Friday, September 24, 2010 5:27 AM
To: [email protected]
Subject: Re: Nutch 1.2 solrdedup and OutOfMemoryError

I'm not surprised that your memory is eaten when fetching almost 10
million documents! The deduplication code is a bit tough to read, but it
looks like it's hardcoded to fetch all records and split them between
maps. If you've got one map, it'll fetch all records and eat all your
memory.

I'm unsure how this can be fixed, but in the meantime you can work around
it by implementing deduplication in your solrconfig.

On Friday 24 September 2010 04:59:03 brad wrote:
> I'm running into an error trying to run solrdedup:
> 
> bin/nutch solrdedup http://127.0.0.1:8080/solr-nutch/
> 
> 2010-09-23 18:37:16,119 INFO  mapred.JobClient - Running job: job_local_0001
> 2010-09-23 18:37:17,123 INFO  mapred.JobClient -  map 0%  reduce 0%
> 2010-09-23 18:52:17,801 WARN  mapred.LocalJobRunner - job_local_0001
> java.lang.OutOfMemoryError: Java heap space
>         at org.apache.solr.common.util.JavaBinCodec.readSolrDocument(JavaBinCodec.java:323)
>         at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:204)
>         at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:405)
>         at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:171)
>         at org.apache.solr.common.util.JavaBinCodec.readSolrDocumentList(JavaBinCodec.java:339)
>         at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:206)
>         at org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:110)
>         at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:173)
>         at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:101)
>         at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
>         at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
>         at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
>         at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
>         at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:233)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> 
> 
> I'm running Solr via Tomcat.  Tomcat is being started with the memory
> parameters:
> -Xms2048m -Xmx2048m
> 
> So basically there are 2 GB of memory allocated to heap space.  I have
> noticed that changing the parameters shifts the location of the error
> somewhat, but the bottom line is that I still run out of heap space.
> 
> Nutch runs for about 15 minute and then the error occurs.
> 
> I only have one Solr index, and the data/index directory size is about
> 85 GB.  I'm using the solrconfig.xml file as delivered.
> 
> Is there something else I need to do?  Is there some change to the Solr
> or Tomcat config that I have missed?
> 
> 
> Config:
> Nutch Release 1.2 - 08/07/2010
> CentOS Linux 5.5
> Linux 2.6.18-194.3.1.el5 on x86_64
> Intel(R) Xeon(R) CPU X3220 @ 2.40GHz
> 8gb of ram
> 
> 
> Thanks
> Brad
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

