[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638427#action_12638427
 ] 

Hoss Man commented on SOLR-799:
-------------------------------

some misc comments from a user perspective based on the current state of the 
wiki...

1) rather than a comma separated <str> of fields, we should just use an <arr>

2) we should consider if/how we want to support using dynamicFields (ie: field 
name globs) when listing the fields that are included in the signature

3) "By default, all non null fields on the document will be used." ... there's 
no such thing as a null field -- there are fields that have no value, and there 
are fields whose value is an empty string, but no null value.

4) yonik already asked the other questions i had based on the wiki: how the order 
of fields in the update command affects the signature that gets computed -- 
both for fields with different names, and for fields with the same name.  
the fields should probably be stable sorted by field name, so that the order of 
fields with the same name affects the signature, but the relative order of 
fields with different names doesn't (since the order of fields with the same 
name actually affects the way the document is indexed, but the order of 
different field names does not)
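
to make the stable sort idea in (4) concrete, here is a rough sketch in python. it is not taken from the patch: the function name, the (name, value) pair representation, and the choice of MD5 are all assumptions made for illustration only.

```python
import hashlib


def signature(fields):
    """Compute a hex digest over (name, value) pairs.

    `fields` is a list of (name, value) string pairs in the order they
    appeared in the update command. Hypothetical helper, not Solr's API.
    """
    # sorted() is guaranteed stable: pairs with the same name keep their
    # original relative order, while different names end up in a fixed order.
    ordered = sorted(fields, key=lambda pair: pair[0])

    digest = hashlib.md5()  # assumption: any hash function would do here
    for name, value in ordered:
        for part in (name, value):
            data = part.encode("utf-8")
            # Length-prefix each part so ("ab", "c") and ("a", "bc") differ.
            digest.update(len(data).to_bytes(4, "big"))
            digest.update(data)
    return digest.hexdigest()


doc_a = [("title", "x"), ("tag", "a"), ("tag", "b")]
doc_b = [("tag", "a"), ("title", "x"), ("tag", "b")]  # names interleaved differently
doc_c = [("title", "x"), ("tag", "b"), ("tag", "a")]  # same-name values swapped

print(signature(doc_a) == signature(doc_b))
print(signature(doc_a) == signature(doc_c))
```

with this approach doc_a and doc_b get the same signature, because only the relative order of differently named fields changed, while doc_c gets a different one, because the two "tag" values were swapped.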

> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking 
> as well as field collapsing. Lets put it into solr. 
> http://wiki.apache.org/solr/Deduplication

