[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

Yonik Seeley (JIRA) Tue, 04 Nov 2008 12:07:06 -0800

    [ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645061#action_12645061 ]


Yonik Seeley commented on SOLR-799:
-----------------------------------

bq. Maybe we just do overwrite dupe for now?

+1, as long as we don't do anything to preclude the other stuff. We just need 
to leave "room" in the config XML and the update API so that we don't have to 
break backward compatibility with this patch if/when future features are 
implemented.

bq. Another point that was brought up is whether or not to delete any docs that 
match the update doc's uniqueField id term, but not its similarity/update term.  
You are choosing to use the updateTerm to do updates rather than the unique 
term.

It seems like uniqueField should normally enforce uniqueness, regardless of 
what this component does.  If one wants duplicate ids, then it seems like a 
different field should be used for that (other than the uniqueKey field).  If 
one wants to delete *only* on the hash field, then they can make the hash field 
the id field.
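
To make the two cases above concrete, here is a rough sketch of the behaviour 
being described. It is a toy in-memory model, not Solr code and not what the 
patch does: the ToyIndex class, its parameters, and the use of MD5 over a 
joined field string are all assumptions made for illustration only.

```python
import hashlib


class ToyIndex:
    """Toy in-memory stand-in for an index (hypothetical, not Solr's implementation)."""

    def __init__(self, id_field="id", sig_field="signature", overwrite_dupes=True):
        self.id_field = id_field
        self.sig_field = sig_field
        self.overwrite_dupes = overwrite_dupes
        self.docs = []

    def add(self, doc, hash_fields):
        doc = dict(doc)
        # Exact-duplicate signature: MD5 over the selected field values.
        text = "\x00".join(str(doc.get(f, "")) for f in hash_fields)
        doc[self.sig_field] = hashlib.md5(text.encode("utf-8")).hexdigest()

        def survives(old):
            # The unique key enforces uniqueness regardless of the signature.
            if old[self.id_field] == doc[self.id_field]:
                return False
            # "Overwrite dupe": also replace documents sharing the signature.
            if self.overwrite_dupes and old[self.sig_field] == doc[self.sig_field]:
                return False
            return True

        self.docs = [d for d in self.docs if survives(d)]
        self.docs.append(doc)


# Separate id and hash fields: a doc is replaced on either match.
index = ToyIndex()
index.add({"id": "1", "body": "hello"}, ["body"])
index.add({"id": "2", "body": "hello"}, ["body"])  # same hash, replaces id 1

# Hash field used as the id field: deletes happen only on the hash.
hash_only = ToyIndex(id_field="sig", sig_field="sig")
hash_only.add({"body": "hello"}, ["body"])
hash_only.add({"body": "hello"}, ["body"])  # replaces the earlier copy
```

In this model, setting id_field and sig_field to the same name is the "make 
the hash field the id field" case: the only deletes are on the hash.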


> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch, SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking 
> as well as field collapsing. Let's put it into Solr. 
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
