[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638217#action_12638217
 ] 

Andrzej Bialecki  commented on SOLR-799:
----------------------------------------

+1 on the incremental sig calculation.

Re: different "types" of signatures. Our experience in Nutch is that the 
signature type is rarely changed, and we assume that this setting is selected 
once per lifetime of an index, i.e. an index never mixes documents with 
incompatible signatures. If we want to be sure that only comparable signatures 
are compared, we could prepend a byte or two of unique signature type id. That 
way, even if two signature values match but were calculated using different 
implementations, the documents won't be considered duplicates, which is the 
way it should work, because signatures from different algorithms are not 
comparable.
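
A minimal sketch of the type id prefix idea, to make it concrete. The class 
name, method names and type id values below are hypothetical and are not taken 
from the SOLR-799 patch; a single prefix byte is assumed.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the SOLR-799 patch.
final class TypedSignature {
    // Assumed type ids; any stable, unique values would do.
    static final byte TYPE_EXACT = 0x01;
    static final byte TYPE_FUZZY = 0x02;

    /** Returns a new array laid out as [typeId | rawSignature bytes]. */
    static byte[] tag(byte typeId, byte[] rawSignature) {
        byte[] out = new byte[rawSignature.length + 1];
        out[0] = typeId;
        System.arraycopy(rawSignature, 0, out, 1, rawSignature.length);
        return out;
    }

    /** Two tagged signatures match only if both type id and value match. */
    static boolean isDuplicate(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }
}
```

With this layout, the same raw value tagged with TYPE_EXACT and with TYPE_FUZZY 
yields two different byte arrays, so the documents are not treated as 
duplicates.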

Re: signature as byte[] - I think it's better if we return byte[] from 
Signature, and until we support binary fields we just turn this into a hex 
string.
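
For illustration, the byte[] to hex string step could look roughly like this. 
SignatureHex and toHex are hypothetical names, and lowercase hex digits are an 
assumption.

```java
// Hypothetical helper, not part of the SOLR-799 patch.
final class SignatureHex {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    /** Encodes each byte as two lowercase hex digits, high nibble first. */
    static String toHex(byte[] signature) {
        char[] out = new char[signature.length * 2];
        for (int i = 0; i < signature.length; i++) {
            int v = signature[i] & 0xFF;   // treat the byte as unsigned
            out[2 * i] = HEX[v >>> 4];
            out[2 * i + 1] = HEX[v & 0x0F];
        }
        return new String(out);
    }
}
```

The string is always twice as long as the byte[], so it can be stored in a 
regular string field until binary fields are available.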

Re: field ordering in DeduplicateUpdateProcessorFactory: I think that both 
sigFields (if defined) and any other document fields (if sigFields is 
undefined) should first be ordered in a predictable way (lexicographic?). 
The current patch uses a HashSet, which doesn't guarantee any particular 
ordering - in fact the ordering may be different if you run the same code under 
different JVMs, so the same document could end up with different signatures.
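
A rough sketch of the ordering idea, assuming the fields arrive as a map of 
name to value. OrderedFieldSignature, the use of MD5 and the 0 byte separator 
are assumptions made for the example, not what the patch does. The fields are 
fed to the digest one at a time after sorting by name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the SOLR-799 patch.
final class OrderedFieldSignature {
    /** Digests name/value pairs in sorted field-name order. Values must be non-null. */
    static byte[] compute(Map<String, String> fields) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        // TreeMap iterates keys in String's natural order, whatever
        // order the caller's map happens to use.
        Map<String, String> sorted = new TreeMap<>(fields);
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            md.update(e.getKey().getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0);   // separator so "ab"+"c" differs from "a"+"bc"
            md.update(e.getValue().getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0);
        }
        return md.digest();
    }
}
```

Two maps with the same entries inserted in a different order produce the same 
signature with this approach.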

> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking 
> as well as field collapsing. Lets put it into solr. 
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
