I already have the digest field in the schema because the index is shared between Nutch documents and others. I do not know whether the second approach is the quickest in my case.
I can easily set the digest value to something unique for non-Nutch documents: I have an id field that I can use to populate the digest field when indexing new non-Nutch documents, and I have a custom tool that does the indexing of these docs. But I already have more than 3 million documents in the index, and I don't want to start over and index them again if I don't have to. Is there a way I can update the digest field with the value from the corresponding id field using Solr?

Thanks
Raj

----- Original Message -----
From: Markus Jelsma <markus.jel...@buyways.nl>
To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
Sent: Tue Sep 28 18:19:17 2010
Subject: RE: Solr Deduplication and Field Collpasing

You could create a custom update processor that adds a digest field to newly added documents that do not have one themselves. This way, the documents that are not added by Nutch get a proper non-empty digest field, so the deduplication processor won't create the same empty hash for all of them and overwrite them. Or you could extend org.apache.solr.update.processor.SignatureUpdateProcessorFactory so it skips documents with an empty digest field. I'd think the latter would be the quickest route, but correct me if I'm wrong.

Cheers,

-----Original message-----
From: Nemani, Raj <raj.nem...@turner.com>
Sent: Tue 28-09-2010 23:28
To: solr-user@lucene.apache.org;
Subject: Solr Deduplication and Field Collpasing

All,

I have set up Nutch to submit the crawl results to a Solr index. I have some duplicates in the documents generated by the Nutch crawl. There is a field 'digest' that Nutch generates that is the same for documents that are duplicates. While setting up the dedupe processor in the Solr config file, I used this 'digest' field in the following way (see below for config details).
Since my index has documents other than the ones generated by Nutch, I cannot use 'overwriteDupes=true': for non-Nutch documents the digest field will not be populated, and I found that Solr deletes every one of the documents that do not have the digest field populated. Probably because they all get the same 'sig' field value, generated from an 'empty' digest field, forcing Solr to delete everything?

In any case, given the scenario, I thought I would set 'overwriteDupes=false' and use field collapsing based on the digest or sig field, but I could not get field collapsing to work. Based on the wiki documentation, I was adding the query string "&group=true&group.filed=sig&" or "&group=true&group.filed=digest&" to my overall query in the admin console, and I still got the duplicate documents in the results. Is there anything special I need to do to get field collapsing working? I am running Solr 1.4.

All this is because Nutch thinks that http://mysite.mydomain.com/index.html and http://mysite/index.html are different documents, depending on how the link is set up (the url *is* the unique id for the Nutch document; the difference is only in the alias, and for an internal site both are valid). This is the reason for me to try deduplication. I cannot submit the SolrDedup command from Nutch because non-Nutch documents do not have the digest field populated, and I read on the mailing lists that this will cause the SolrDedup initiated from Nutch to fail. This forced me to try deduplication on the Solr side.

Thanks so much in advance for your help.
------------------------------------------------------------------------

Here is my configuration:

SolrConfig.xml

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    <str name="fields">digest</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

Schema.xml
--------------------
<field name="sig" type="string" stored="true" indexed="true" multiValued="true" />

Thanks so much for your help
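
To illustrate the suspicion raised in the original message (that every document without a digest ends up with the same signature), here is a minimal Python sketch. It is not Solr code: MD5 stands in for Lookup3Signature, and the function names, the document ids and the digest value "abc123" are all made up for illustration. It only shows why an empty digest collides and how filling the digest from the id field, as discussed in this thread, would avoid that.

```python
import hashlib


def signature(doc, fields=("digest",)):
    """Hash the values of the configured fields (illustrative stand-in,
    MD5 instead of Lookup3Signature). A missing field contributes nothing."""
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
    return h.hexdigest()


def with_digest_fallback(doc, source_field="id", target_field="digest"):
    """Return a copy of doc; if the digest is missing or empty,
    fill it from the id field so the signature becomes unique."""
    filled = dict(doc)
    if not filled.get(target_field):
        filled[target_field] = str(doc[source_field])
    return filled


# Hypothetical documents: two Nutch duplicates and two non-Nutch docs.
nutch_a = {"id": "http://mysite.mydomain.com/index.html", "digest": "abc123"}
nutch_b = {"id": "http://mysite/index.html", "digest": "abc123"}
other_1 = {"id": "doc-1"}
other_2 = {"id": "doc-2"}

# Without a digest, both non-Nutch docs hash the same empty input.
print(signature(other_1) == signature(other_2))

# With the fallback, their signatures differ, while the Nutch
# duplicates still share one signature.
print(signature(with_digest_fallback(other_1))
      == signature(with_digest_fallback(other_2)))
print(signature(with_digest_fallback(nutch_a))
      == signature(with_digest_fallback(nutch_b)))
```

The fallback assumes an id value never coincides with a Nutch digest; prefixing the id with a fixed marker would be one way to rule that out.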