Re: what is SOLR syntax to remove duplicated documents

Mikhail Khludnev Sun, 22 Oct 2023 13:00:45 -0700

You can find id terms that repeat in an index with the Terms Component
https://solr.apache.org/guide/solr/latest/query-guide/terms-component.html
and terms.mincount=2,
or do the same with facets:
q=*:*&facet=true&facet.field=id&facet.limit=-1&facet.mincount=2 (just off
the top of my head).
Then you can query the duplicated ids one by one. If you don't have a strictly
unique field assigned, it's not possible to drop duplicates. You can get an
internal unique identifier, a kind of analog of ROW_NUMBER, via [docid]; see
https://solr.apache.org/guide/solr/latest/query-guide/document-transformers.html#docid-docidaugmenterfactory
.
But I'm not aware of a query that accepts this number.
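
A rough sketch of the steps above in Python, in case it helps. The host, port
and core name (localhost:8983, "mycore") are placeholders I made up, and
rows=0 is my own addition to skip the document list; it also assumes Solr's
default flat JSON layout for facet counts. Treat it as a starting point, not a
tested recipe.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: placeholder URL and core name, not taken from the thread.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def duplicate_facet_params(field="id"):
    """Parameters for the facet request: terms of `field` seen 2+ times."""
    return {
        "q": "*:*",
        "rows": 0,  # assumption: only the facet counts are needed
        "facet": "true",
        "facet.field": field,
        "facet.limit": -1,
        "facet.mincount": 2,
        "wt": "json",
    }


def parse_duplicates(response, field="id"):
    """Return {term: count} from a facet response in the flat layout
    [term1, count1, term2, count2, ...]."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


def docid_lookup_params(dup_id, field="id"):
    """Parameters to list every copy of one duplicated id with its [docid]."""
    # Quote the value so special characters in the id are not read as syntax.
    escaped = str(dup_id).replace("\\", "\\\\").replace('"', '\\"')
    return {"q": f'{field}:"{escaped}"', "fl": f"{field},[docid]", "wt": "json"}


def fetch(params, base_url=SOLR_SELECT):
    """Run a select request against Solr and decode the JSON body."""
    with urlopen(f"{base_url}?{urlencode(params)}") as resp:
        return json.load(resp)
```

To use it, call parse_duplicates(fetch(duplicate_facet_params())) and then, for
each id returned, fetch(docid_lookup_params(that_id)) to see the copies side by
side. Deleting the surplus copies is left out on purpose, since it needs a
field that tells them apart.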



On Sun, Oct 22, 2023 at 3:22 PM Vince McMahon <sippingonesandze...@gmail.com>
wrote:

> I have SOLR 8.X.  I suspect one of the cores has duplicates and want to
> remove the duplicated documents.  Signature, as in the SOLR guide, is not
> implemented.  https://solr.apache.org/guide/6_6/de-duplication.html
>
> In SQL, a query without the use of a hash column would look like:
> ;WITH CTE AS
> (
>     SELECT  cols,
>             RN = ROW_NUMBER() OVER( PARTITION BY cols
>                                     ORDER BY updated DESC)
>     FROM [table]
> )
> DELETE FROM CTE
> WHERE RN > 1
>
> What would be the syntax for a SOLR query?
>


-- 
Sincerely yours
Mikhail Khludnev
