Hi,

We index documents from multiple sources, and the same document can arrive
from more than one of them.

We can identify duplicates as documents that share, say,
md5(title,short_desc,location). However, a more up-to-date doc might be
added to Solr either before or after an older one (order is not guaranteed).

One thought I had for identifying duplicates and keeping only the latest doc
was to have something like

version=md5(title,short_desc,location).'_'.sprintf("%013d",epoch_modified)
e.g.
version=97d88afe14f66e8d7f54cccc986c3e8a_0001638206899
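
A minimal Python sketch of how such a value could be built. The function name
make_version and the separator used to join the three fields before hashing are
assumptions for illustration only; the zero-padding to 13 digits is what lets a
plain string comparison order the timestamps correctly.

```python
import hashlib


def make_version(title, short_desc, location, epoch_modified):
    """Build '<md5>_<zero-padded epoch>' as described above."""
    # Assumption: the fields are joined with a separator before hashing,
    # so that ("ab", "c") and ("a", "bc") do not collide.
    joined = "\x1f".join([title, short_desc, location])
    digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
    # Equivalent of sprintf("%013d", epoch_modified).
    return f"{digest}_{epoch_modified:013d}"


older = make_version("Title", "Short description", "London", 1638206899)
newer = make_version("Title", "Short description", "London", 1638206900)
# Same md5 prefix, so the string comparison is decided by the padded epoch.
print(older < newer)
```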

and then:
schema.xml
<field name="version"  type="string"  indexed="true"  stored="true"
 multiValued="false"/>

solrconfig.xml
<updateRequestProcessorChain default="true">
  <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
    <str name="versionField">version</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
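
To make the intended behaviour concrete, here is a toy in-memory model in
Python. It is not Solr's implementation. It assumes that duplicates end up
under the same key (here the md5 part of the version, which in Solr would have
to correspond to the uniqueKey), and that an update is accepted only when its
version is strictly greater than the stored one. The names add_doc and
should_replace are hypothetical.

```python
def should_replace(existing_version, new_version):
    """Return True if the incoming doc should overwrite the stored one."""
    if existing_version is None:
        return True
    # Assumption: equal versions are treated as "not newer" and rejected.
    return new_version > existing_version


def add_doc(index, doc):
    """Add doc to the toy index, keeping only the highest version per key."""
    # Assumption: the md5 prefix of the version doubles as the dedup key.
    key = doc["version"].split("_", 1)[0]
    current = index.get(key)
    current_version = current["version"] if current else None
    if should_replace(current_version, doc["version"]):
        index[key] = doc
        return True
    return False


index = {}
md5_key = "97d88afe14f66e8d7f54cccc986c3e8a"
# The newer doc arrives first, the older one afterwards.
add_doc(index, {"version": md5_key + "_0001638206899", "source": "B"})
add_doc(index, {"version": md5_key + "_0001638200000", "source": "A"})
print(index[md5_key]["source"])
```

In this model the late-arriving older doc is dropped, so the order in which the
sources deliver documents does not matter.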

Has anyone tried this, or can anyone see potential pitfalls?

Many thanks,
Dan
