[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638009#action_12638009 ]
Yonik Seeley commented on SOLR-799:
-----------------------------------

Some thoughts...

- How should different "types" be handled (for example, when we support binary fields)? Different base64 encoders might use different line lengths or different line endings (CR/LF). Perhaps it's good enough to say that the string form must be identical, and leave it at that for now? The alternative would be signatures based on the Lucene Document about to be indexed.
- It would be nice to be able to calculate a signature for a document without having to concatenate all the fields together. Perhaps change calculate(String content) to something like calculate(Iterable<CharSequence> content)? An alternative option would be incremental hashing...
{code}
Signature sig = ourSignatureCreator.create();
sig.add(f1);
sig.add(f2);
sig.add(f3);
String s = sig.getSignature();
{code}
Looking at how TextProfileSignature works, I'd lean toward incremental hashing to avoid building yet another big string. Having a hashing object also opens up the possibility of easily adding other method signatures for more efficient hashing.
- It appears that if you put fields in a different order, the signature will change.
- It appears that documents with different field names but the same content will have the same signature.
- I don't understand the dedup logic in DUH2... it seems like we want to delete by id and by sig. Unfortunately there is no IndexWriter.updateDocument(Term[] terms, Document doc), so we'll have to do a separate non-atomic delete on the sig for now, right?
- There's probably no need for a separate test solrconfig-deduplicate.xml if all it adds is an update processor. Tests could just explicitly specify the update handler on updates.
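To make the incremental idea above concrete, here is a minimal sketch of what such a hashing object could look like. This is only an illustration, not code from the attached patch: the class name, the MD5 choice, and the zero-byte field separator are all assumptions. The separator shows one way to avoid the concatenation problem where ("ab", "c") and ("a", "bc") would otherwise collide.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical incremental signature: field contents are fed one at a
// time, so no big concatenated string is ever built.
public class IncrementalSignature {
    private final MessageDigest digest;

    public IncrementalSignature() {
        try {
            digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    // Feed one field's content into the running hash.
    public void add(CharSequence field) {
        digest.update(field.toString().getBytes(StandardCharsets.UTF_8));
        digest.update((byte) 0); // separator: ("ab","c") must differ from ("a","bc")
    }

    // Finish and return the signature as a hex string.
    public String getSignature() {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Note that this sketch still has the field-order sensitivity mentioned below: adding the same fields in a different order produces a different signature, which a real implementation might address by sorting fields before hashing.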
> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking
> as well as field collapsing. Let's put it into Solr.
> http://wiki.apache.org/solr/Deduplication

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.