Hi,

Actually, to answer my own question: this won't help, since I'd need to know the id in order to update the document. In this case it's probably best to make the id equal to whatever identifies a duplicate, and then use versionField so that a document is only replaced by a later version.
i.e. using the previous example

id=97d88afe14f66e8d7f54cccc986c3e8a
version=1638206899

with schema.xml

<field name="version" type="long" indexed="true" stored="true" multiValued="false"/>

and solrconfig.xml is the same as before.

Cheers,
Dan

On Mon, 29 Nov 2021 at 17:36, Dan Rosher <[email protected]> wrote:

> Hi,
>
> We have documents from multiple sources, which might have duplicates from
> different sources.
>
> We might identify a duplicate document which shares say
> md5(title,short_desc,location), although a more up to date doc might come
> AFTER an older one (order not guaranteed) added to solr.
>
> One thought I had to identify, and keep only the latest doc was to have
> something like
>
> version=md5(title,short_desc,location).'_'.sprintf("%013d",epoch_modified)
> e.g.
> version=97d88afe14f66e8d7f54cccc986c3e8a_0001638206899
>
> and then:
> schema.xml
> <field name="version" type="string" indexed="true" stored="true"
> multiValued="false"/>
>
> solrconfig.xml
> <updateRequestProcessorChain default="true">
>   <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
>     <str name="versionField">version</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> Just wondering whether anyone has tried this or potential pitfalls?
>
> Many thanks,
> Dan
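
For reference, a rough sketch of how the id and version values could be built on the client side before sending docs to Solr. This is only an illustration under a few assumptions: the function names make_doc and keep_latest are made up, the way the three fields are joined before hashing (a unit-separator character) is my own choice, and keep_latest only mimics locally the "latest version wins" outcome I'm hoping versionField gives me. It does not talk to Solr at all.

```python
import hashlib


def make_doc(title, short_desc, location, epoch_modified):
    """Build a doc whose id identifies a duplicate and whose version is its modified time."""
    # Assumption: fields are joined with a separator so that ("ab", "c") and
    # ("a", "bc") do not hash to the same id. The separator is illustrative.
    key = "\x1f".join([title, short_desc, location])
    dup_id = hashlib.md5(key.encode("utf-8")).hexdigest()
    return {
        "id": dup_id,                    # same for every duplicate of this doc
        "version": int(epoch_modified),  # matches the "long" version field
        "title": title,
        "short_desc": short_desc,
        "location": location,
    }


def keep_latest(docs):
    """Local stand-in for the intended result: per id, keep the highest version."""
    latest = {}
    for doc in docs:
        current = latest.get(doc["id"])
        if current is None or doc["version"] > current["version"]:
            latest[doc["id"]] = doc
    return latest


# Hypothetical sample data: the newer doc arrives BEFORE the older one.
older = make_doc("Java Developer", "Backend role", "London", 1638200000)
newer = make_doc("Java Developer", "Backend role", "London", 1638206899)
latest = keep_latest([newer, older])
```

With the id carrying the duplicate key and the version being a plain number, arrival order should no longer matter: both docs above get the same id, and only the one with the larger epoch survives.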
