Hi,
We index documents from multiple sources, and the same document may arrive
from more than one of them.
We can identify duplicates as documents that share, say,
md5(title,short_desc,location), but the order in which they are added to
Solr is not guaranteed, so an older copy may arrive AFTER a more up-to-date one.
One thought I had, to identify duplicates and keep only the latest doc, was
to have something like
version=md5(title,short_desc,location).'_'.sprintf("%013d",epoch_modified)
e.g.
version=97d88afe14f66e8d7f54cccc986c3e8a_0001638206899
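
A minimal sketch of how such a value could be built, assuming Python on the indexing side. The helper name `make_version`, the field separator, and the UTF-8 encoding are illustrative assumptions, not part of the setup described here; `epoch_modified` is assumed to be an integer number of seconds, as in the example value above.

```python
import hashlib


def make_version(title: str, short_desc: str, location: str, epoch_modified: int) -> str:
    """Build '<md5 of the identifying fields>_<zero-padded modification time>'."""
    # Assumption: join the fields with a separator that is unlikely to occur in
    # the data, so ("ab", "c") and ("a", "bc") do not hash to the same value.
    key = "\x1f".join((title, short_desc, location))
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # Fixed-width zero padding keeps plain string comparison in numeric order
    # for values that share the same md5 prefix.
    return f"{digest}_{epoch_modified:013d}"


older = make_version("Some title", "Some description", "London", 1638200000)
newer = make_version("Some title", "Some description", "London", 1638206899)
print(older < newer)
```

Because the field is a string, the padding matters: without it, a 9-digit timestamp would sort after a 10-digit one in a character-by-character comparison.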
and then:
schema.xml
<field name="version" type="string" indexed="true" stored="true" multiValued="false"/>
solrconfig.xml
<updateRequestProcessorChain default="true">
  <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
    <str name="versionField">version</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
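
To make the intended outcome concrete, here is a small plain-Python model of the behaviour being asked for. It does not call Solr and is not a description of how the processor is implemented; `apply_updates` is a hypothetical name used only for illustration. For each md5 prefix, an incoming version is kept only if it compares greater than the one already held, whatever the arrival order.

```python
def apply_updates(versions):
    """Return, per md5 prefix, the greatest version string seen so far."""
    latest = {}
    for version in versions:
        # Everything before the first underscore is the duplicate key.
        key, _, _stamp = version.partition("_")
        # With fixed-width timestamps, string comparison matches numeric order
        # for versions that share the same key.
        if key not in latest or version > latest[key]:
            latest[key] = version
    return latest


arrivals = [
    "97d88afe14f66e8d7f54cccc986c3e8a_0001638206899",  # newer copy arrives first
    "97d88afe14f66e8d7f54cccc986c3e8a_0001638200000",  # older copy arrives later
]
print(apply_updates(arrivals))
```

In this model the older copy is dropped even though it arrives last, which is the result the configuration above is meant to achieve.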
Just wondering whether anyone has tried this, or knows of any potential pitfalls?
Many thanks,
Dan