I already have the digest field in the schema because the index is shared 
between Nutch docs and others.  I do not know if the second approach is the 
quickest in my case.

I can easily set the digest value to something unique for non-Nutch documents (I 
have an id field that I can use to populate the digest field during indexing of 
new non-Nutch documents, and I have a custom tool that does the indexing of these 
docs).  But I already have more than 3 million documents in the index, and I 
don't want to start over with a new indexing run if I don't have to. Is there a 
way I can update the digest field with the value from the corresponding id 
field using Solr? 

Thanks
Raj

----- Original Message -----
From: Markus Jelsma <markus.jel...@buyways.nl>
To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
Sent: Tue Sep 28 18:19:17 2010
Subject: RE: Solr Deduplication and Field Collpasing

You could create a custom update processor that adds a digest field for newly 
added documents that do not have the digest field themselves. This way, the 
documents that are not added by Nutch get a proper non-empty digest field so 
the deduplication processor won't create the same empty hash and overwrite 
those. Or you could extend 
org.apache.solr.update.processor.SignatureUpdateProcessorFactory so it skips 
documents with an empty digest field. I'd think the latter would be the 
quickest route, but correct me if I'm wrong.
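
The core logic of the first option might look roughly like the sketch below. It is plain Java with no Solr dependency: the class and method names (DigestFallback, ensureDigest) are hypothetical, the Map only stands in for a Solr input document, and falling back to the id field is an assumption based on the follow-up above. A real version would have to live in a custom update processor placed ahead of the signature processor in the chain, which is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr. A Map stands in for a Solr input document.
final class DigestFallback {

    // If "digest" is missing or blank, copy the document's "id" into it so that
    // each non-Nutch document gets a distinct, non-empty value to hash.
    // Documents that already carry a digest are returned unchanged.
    static Map<String, String> ensureDigest(Map<String, String> doc) {
        Map<String, String> out = new HashMap<>(doc);
        String digest = out.get("digest");
        if (digest == null || digest.trim().isEmpty()) {
            String id = out.get("id");
            if (id != null) {
                out.put("digest", id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> nonNutch = new HashMap<>();
        nonNutch.put("id", "doc-42");
        // The digest field is filled from the id because none was supplied.
        System.out.println(ensureDigest(nonNutch).get("digest"));
    }
}
```

The input map is copied rather than modified in place, so the sketch has no side effects on the caller's document.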

 

Cheers,
 
-----Original message-----
From: Nemani, Raj <raj.nem...@turner.com>
Sent: Tue 28-09-2010 23:28
To: solr-user@lucene.apache.org; 
Subject: Solr Deduplication and Field Collpasing

All,



I have set up Nutch to submit the crawl results to a Solr index.  I have
some duplicates in the documents generated by the Nutch crawl.  There is a
field 'digest' that Nutch generates that is the same for those documents
that are duplicates.  While setting up the dedupe processor in the
Solr config file, I have used this 'digest' field in the following
way (see below for config details).  Since my index has documents other
than the ones generated by Nutch, I cannot use 'overwriteDupes=true'
because for non-Nutch generated documents the digest field will not be
populated, and I found that Solr deletes every one of those documents
that do not have the digest field populated. Probably because they all
will have the same 'sig' field value generated based on an 'empty'
digest field, forcing Solr to delete everything?
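
If that guess is right, the effect can be reproduced outside Solr. The sketch below is only an illustration: it uses MD5 from the JDK as a stand-in for the Lookup3Signature class in my config, and the class and method names (EmptyDigestCollision, signature) are made up for the example. It shows that every document with a missing or empty digest hashes to the same signature, so a dedupe step that overwrites on equal signatures would treat all of them as one document.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustration only: MD5 stands in for Solr's Lookup3Signature.
final class EmptyDigestCollision {

    // Hash the digest field value; a missing field is treated as an empty string
    // (an assumption about how an unpopulated field behaves).
    static String signature(String digest) {
        String input = digest == null ? "" : digest;
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] hash = md5.digest(input.getBytes(StandardCharsets.UTF_8));
            // Render as 32 hex characters, keeping leading zeros.
            return String.format("%032x", new BigInteger(1, hash));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Two unrelated non-Nutch documents, neither with a digest value.
        String sigA = signature(null);
        String sigB = signature("");
        System.out.println("same signature: " + sigA.equals(sigB));
    }
}
```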



In any case, given the scenario I thought I would set
'overwriteDupes=false' and use field collapsing based on the digest or sig
field, but I could not get field collapsing to work.  Based on the wiki
documentation I was adding the query string
"&group=true&group.field=sig&" or "&group=true&group.field=digest&" to my
overall query in the admin console and I still got the duplicate documents
in the results.  Is there anything special I need to do to get field
collapsing working?  I am running Solr 1.4.



All this is because Nutch thinks that (the url *is* the unique id for the
Nutch document)

http://mysite.mydomain.com/index.html and http://mysite/index.html (the
difference is only in the alias and for an internal site both are valid)
are different documents depending on how the link is set up.  This is the
reason for me to try deduplication.  I cannot submit the SolrDedup command
from Nutch because non-Nutch generated documents do not have the digest
field populated, and I read on the mailing lists that this will cause the
SolrDedup initiated from Nutch to fail.  This forced me to try
deduplication on the Solr side.



Thanks so much in advance for your help.

------------------------------------------------------------------------

Here is my configuration:



SolrConfig.xml

               

               

               

               

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    <str name="fields">digest</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

               

               

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

               

Schema.xml
--------------------

<field name="sig" type="string" stored="true" indexed="true" multiValued="true" />



Thanks so much for your help


