[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638427#action_12638427 ]
Hoss Man commented on SOLR-799:
-------------------------------

Some misc comments from a user perspective, based on the current state of the wiki...

1) Rather than a comma-separated <str> of fields, we should just use an <arr>.

2) We should consider if/how we want to support dynamicFields (i.e. field name globs) when listing the fields that are included in the signature.

3) "By default, all non null fields on the document will be used." ... There's no such thing as a null field -- there are fields that have no value, and there are fields whose value is an empty string, but there is no null value.

4) Yonik already asked the other questions I had based on the wiki: how the order of fields in the update command affects the signature that gets computed, both for fields with different names and for fields with the same name. The fields should probably be stable-sorted by field name, so that the order of fields with the same name affects the signature but the relative order of fields with different names doesn't (since the order of fields with the same name actually affects the way the document is indexed, but the order of different field names does not).

> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking
> as well as field collapsing. Lets put it into solr.
> http://wiki.apache.org/solr/Deduplication

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
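
For illustration only, the stable-sort suggestion in point 4 of the comment could look roughly like the sketch below. It is not taken from SOLR-799.patch: the function name `signature`, the `include` parameter, the (name, value) pair representation, and the choice of MD5 with length-prefixed values are all assumptions made for this example.

```python
import hashlib


def signature(fields, include=None):
    """Compute a hex signature for a document (hypothetical helper).

    fields  -- list of (name, value) string pairs, in update-command order
    include -- optional set of field names to hash; None means all fields
    """
    selected = [(n, v) for n, v in fields if include is None or n in include]

    # sorted() is stable: pairs sharing a name keep their original relative
    # order, while pairs with different names are reordered by name.
    ordered = sorted(selected, key=lambda pair: pair[0])

    digest = hashlib.md5()
    for name, value in ordered:
        for part in (name, value):
            data = part.encode("utf-8")
            # Length prefix (an assumption of this sketch) so that
            # ("ab", "c") and ("a", "bc") cannot hash to the same bytes.
            digest.update(len(data).to_bytes(4, "big"))
            digest.update(data)
    return digest.hexdigest()
```

With this approach, moving a `title` field before or after the `cat` fields leaves the signature unchanged, whereas swapping two `cat` values changes it, which matches the behavior the comment asks for.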