Hi,

Actually, to answer my own question: this won't help, since I'd need to know
the id to update the document. In this case it might be best to make the id
equal to whatever identifies a duplicate, and then use versionField so that
only the latest version is kept.

i.e. using the previous example, id=97d88afe14f66e8d7f54cccc986c3e8a and
version=1638206899, with

schema.xml
<field name="version" type="long" indexed="true" stored="true" multiValued="false"/>

and solrconfig.xml is the same as before.
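
For completeness, here is a sketch of the two pieces together under this
approach. The ignoreOldUpdates setting is not in my original config; it is an
optional parameter of DocBasedVersionConstraintsProcessorFactory which, as far
as I understand, makes Solr silently ignore an update whose version is too low
instead of rejecting it with an error. Whether that is desirable depends on how
the indexing client handles failures, so treat it as an assumption:

```xml
<!-- schema.xml: numeric version, so versions compare as numbers, not strings -->
<field name="version" type="long" indexed="true" stored="true" multiValued="false"/>

<!-- solrconfig.xml: same chain as before, plus the optional ignoreOldUpdates -->
<updateRequestProcessorChain default="true">
  <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
    <str name="versionField">version</str>
    <!-- assumption: ignore stale updates quietly rather than returning an error -->
    <bool name="ignoreOldUpdates">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With id=md5(title,short_desc,location), two docs that count as duplicates
share the same id, so the version check is applied between them.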

Cheers,
Dan


On Mon, 29 Nov 2021 at 17:36, Dan Rosher <[email protected]> wrote:

> Hi,
>
> We have documents from multiple sources, which might have duplicates from
> different sources.
>
> We might identify duplicates as documents that share, say,
> md5(title,short_desc,location), although a more up-to-date doc might be
> added to Solr before or AFTER an older one (order not guaranteed).
>
> One thought I had, to identify duplicates and keep only the latest doc, was
> to have something like
>
> version=md5(title,short_desc,location).'_'.sprintf("%013d",epoch_modified)
> e.g.
> version=97d88afe14f66e8d7f54cccc986c3e8a_0001638206899
>
> and then:
> schema.xml
> <field name="version"  type="string"  indexed="true"  stored="true"
>  multiValued="false"/>
>
> solrconfig.xml
> <updateRequestProcessorChain default="true">
>   <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
>     <str name="versionField">version</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> Just wondering whether anyone has tried this, or knows of potential
> pitfalls?
>
> Many thanks,
> Dan
>
>
