[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637719#action_12637719 ]
Mark Miller commented on SOLR-799:
----------------------------------

Thanks for the review, Andrzej. I've made the first two changes (I noted at the top of TextProfileSignature that it's 'borrowed' from Nutch, and I grabbed Hadoop's MD5Hash class and stripped its Hadoop dependencies), and I'm investigating change 3. I'll put up another patch in a couple of days.

- Mark

> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash-based duplicate document detection is efficient and allows for blocking
> as well as field collapsing. Let's put it into Solr.
> http://wiki.apache.org/solr/Deduplication