[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638850#action_12638850 ]
Mark Miller commented on SOLR-799:
----------------------------------

bq. 1. Prevent new insert - SignatureUpdateProcessor generates a signature and adds it as a field; AbortIfExistingUpdateProcessor aborts the update if a doc exists with a specific field in common with the doc to be added.

I like the idea of using UpdateProcessors for all of this. It's very clean compared to hacking around the DirectUpdateHandler. Unfortunately, I think AbortIfExistingUpdateProcessor would require locks that are too coarse. Ideally, you want to be able to inject code into the DirectUpdateHandler's 3 levels of locking (iw, sync(this), none). That's what's needed for efficiency, but the cleanness gets whacked - it's ugly to get that done, and it doesn't really mesh with the UpdateHandler API that's been defined. Linking DirectUpdateHandler2's addDoc implementation to the whole idea... there would have to be changes that just don't seem worth the added functionality.

That leaves just hardcoding the support into DirectUpdateHandler, kind of like what was done before for deletes/id dupes, and then just giving options on the add doc cmd. Again, I don't like it. But anything else quickly breaks down for me.

Any suggestions, insights?

> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking
> as well as field collapsing. Let's put it into Solr.
> http://wiki.apache.org/solr/Deduplication

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.