[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638850#action_12638850
 ] 

Mark Miller commented on SOLR-799:
----------------------------------

bq. 1. Prevent new insert - SignatureUpdateProcessor generates a signature and 
adds it as a field; AbortIfExistingUpdateProcessor aborts the update if a doc 
exists with a specific field in common with the doc to be added.
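
To make the signature step concrete, here is a minimal sketch, assuming an MD5 hash over a chosen set of field values. The class name SignatureSketch, the map-based document, and the choice of fields are hypothetical illustrations, not the code in SOLR-799.patch or any Solr API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;

// Hypothetical stand-in for the signature step; not a Solr class.
class SignatureSketch {
    /** Returns a hex MD5 digest over the named fields, in the order given. */
    static String signature(Map<String, String> doc, String... fields) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String field : fields) {
                String value = doc.get(field);
                if (value != null) {
                    md.update(value.getBytes(StandardCharsets.UTF_8));
                    // Separator byte so that "ab"+"c" and "a"+"bc" hash differently.
                    md.update((byte) 0);
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Two docs that agree on the hashed fields get the same signature even if their ids differ, which is what a later "abort if existing" check would key on.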

I like the idea of using UpdateProcessors for all of this. It's very clean 
compared to hacking around the DirectUpdateHandler. Unfortunately, I think 
AbortIfExistingUpdateProcessor would require locks that are too coarse. 
Ideally, you want to be able to inject code into the DirectUpdateHandler's 3 
levels of locking (iw, sync(this), none). That's what's needed for efficiency, 
but the cleanness gets whacked: it's ugly to get that done, and it doesn't 
really mesh with the UpdateHandler API that's been defined. The linking of 
DirectUpdateHandler2's addDoc implementation to the whole idea...there would 
have to be changes that just don't seem worth the added functionality.
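
As a rough illustration of the coarse-lock concern, here is a minimal sketch, assuming an in-memory set of signatures stands in for the index. The class AbortIfExistingSketch and its method are hypothetical and do not reflect the DirectUpdateHandler code. The point is that the existence check and the add have to happen as one atomic step, so a processor sitting outside the handler ends up serializing every add behind a single lock.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for an "abort if existing" step; not a Solr class.
class AbortIfExistingSketch {
    private final Set<String> signatures = new HashSet<>();
    private final Object lock = new Object();

    /** Returns true if the doc was added, false if the add was aborted as a duplicate. */
    boolean addIfAbsent(String signature) {
        // Coarse lock: without it, two threads could both pass the check
        // and both add a doc with the same signature.
        synchronized (lock) {
            if (signatures.contains(signature)) {
                return false;
            }
            signatures.add(signature);
            return true;
        }
    }
}
```

Every add pays for the lock here, whether or not it is a duplicate, which is the efficiency cost that finer-grained locking inside the handler would avoid.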

Which leaves just hardcoding the support into DirectUpdateHandler, kind of like 
what was done before for deletes/id dupes, and then just giving options on the 
add doc cmd. Again, I don't like it. But anything else quickly breaks down for 
me. Any suggestions, insights?

> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking 
> as well as field collapsing. Let's put it into Solr. 
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
