[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638217#action_12638217 ]
Andrzej Bialecki commented on SOLR-799:
----------------------------------------

+1 on the incremental sig calculation.

Re: different "types" of signatures. Our experience in Nutch is that the signature type is rarely changed, and we assume that this setting is selected once per lifetime of an index, i.e. there are never any mixed cases of documents with incompatible signatures. If we want to be sure that they are comparable, we could prepend a byte or two of unique signature type id. This way, even if a signature value matches but was calculated using another implementation, the documents won't be considered duplicates, which is the way it should work, because signatures produced by different algorithms are not comparable.

Re: signature as byte[] - I think it's better if we return byte[] from Signature, and until we support binary fields we just turn this into a hex string.

Re: field ordering in DeduplicateUpdateProcessorFactory: I think that both sigFields (if defined) and any other document fields (if sigFields is undefined) should first be ordered in a predictable way (lexicographic?). The current patch uses a HashSet, which doesn't guarantee any particular ordering. In fact the ordering may be different if you run the same code under different JVMs, which may introduce a random factor into the signature calculation.

> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking
> as well as field collapsing. Let's put it into Solr.
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
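
The three suggestions in the comment above (hash fields in a predictable order, prepend a signature type id, and expose the byte[] as a hex string) can be illustrated with a minimal Java sketch. This is not code from the SOLR-799 patch: the class name SignatureSketch, the TYPE_ID_MD5 constant, the choice of MD5, the zero-byte separator, and the assumption of single-valued, non-null string fields are all made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Solr or of the attached patch.
final class SignatureSketch {

    // Assumed one-byte id for the signature implementation (here: MD5).
    static final byte TYPE_ID_MD5 = 0x01;

    // Hashes the fields in lexicographic order of field name, then
    // prepends the type id so that signatures computed by different
    // implementations never compare equal.
    static byte[] signature(Map<String, String> fields) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        // A TreeMap iterates its keys in natural (lexicographic) order,
        // regardless of the iteration order of the input map.
        for (Map.Entry<String, String> e : new TreeMap<>(fields).entrySet()) {
            md.update(e.getKey().getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator between name and value
            md.update(e.getValue().getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator between fields
        }
        byte[] digest = md.digest();
        byte[] sig = new byte[digest.length + 1];
        sig[0] = TYPE_ID_MD5;
        System.arraycopy(digest, 0, sig, 1, digest.length);
        return sig;
    }

    // Renders the signature as a lowercase hex string, as a stand-in
    // until binary fields are supported.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "Hello");
        doc.put("body", "World");
        System.out.println(toHex(signature(doc)));
    }
}
```

Because the field names are sorted before hashing, two documents with the same fields produce the same signature whatever order the fields arrive in, and the leading type byte shows up as the first two hex digits of the stored value.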