[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645061#action_12645061 ]
Yonik Seeley commented on SOLR-799:
-----------------------------------

bq. Maybe we just do overwrite dupe for now?

+1, as long as we don't do anything to preclude the other stuff. We just need to leave "room" in the config XML and the update API so that we don't have to break the back compatibility of this patch if/when future features are implemented.

bq. Another point that was brought up is whether or not to delete any docs that match the update doc's uniqueField id term, but not its similarity/update term. You are choosing to use the updateTerm to do updates rather than the unique term.

It seems like uniqueField should normally enforce uniqueness, regardless of what this component does. If one wants duplicate ids, then a different field (other than the uniqueKey field) should be used for that. If one wants to delete *only* on the hash field, then they can make the hash field the id field.

> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch, SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking
> as well as field collapsing. Let's put it into Solr.
> http://wiki.apache.org/solr/Deduplication

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.