The "TextProfileSignature" page has been changed by JoelNothman:
http://wiki.apache.org/solr/TextProfileSignature

Comment:
description of algorithm

New page:
TextProfileSignature calculates a fuzzy hash of textual fields for 
[[Deduplication]]. It is enabled through a SignatureUpdateProcessorFactory 
definition and accepts the following parameters:

|| Name || Type || Description || Default value ||
|| `minTokenLen` || int || The minimum token length to consider || 2 ||
|| `quantRate` || float || Multiplied by the maximum token frequency to determine the count quantization step `quant` || 0.01 ||

The signature calculation proceeds as follows:

=== Tokenization and normalization ===

* A token is a run of contiguous alphanumeric characters.
* Each token is normalized to lowercase.
* Tokens shorter than `minTokenLen` are discarded.

Tokens are then counted, tracking the frequency `maxFreq` of the most frequent 
token.
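
The tokenization and counting step can be sketched in Python as follows. This is an illustration of the description above, not the Solr implementation: the function name `count_tokens` is hypothetical, and Python's `str.isalnum()` and `str.lower()` are assumed to be close enough stand-ins for the character tests the real class performs.

```python
from collections import Counter


def count_tokens(text, min_token_len=2):
    """Count lowercase alphanumeric tokens, following the wiki description.

    Returns (counts, max_freq), where max_freq is the frequency of the
    most frequent token (0 if no token survives).
    """
    counts = Counter()
    current = []
    # Appending a space guarantees the final token is flushed.
    for ch in text + " ":
        if ch.isalnum():
            current.append(ch.lower())
        elif current:
            token = "".join(current)
            # Assumption: tokens shorter than min_token_len are dropped.
            if len(token) >= min_token_len:
                counts[token] += 1
            current = []
    max_freq = max(counts.values(), default=0)
    return counts, max_freq
```

For example, `count_tokens("The cat; the CAT, a dog!")` counts `the` and `cat` twice each and `dog` once, drops the one-character token `a`, and reports a `max_freq` of 2.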

=== Count quantization ===

A value `quant` is calculated as follows:

|| || 1 || if `maxFreq` <= 1 ||
||`quant` := || 2 || if `maxFreq` > 1 and round(`maxFreq * quantRate`) < 2 ||
|| || round(`maxFreq * quantRate`) || otherwise ||

Token frequencies are then rounded down to the nearest multiple of `quant`, and 
any token occurring fewer than `quant` times is discarded.
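
A minimal Python sketch of the quantization rule may help make the three cases concrete. The function name `quantize` is hypothetical, and rounding half up via `math.floor(x + 0.5)` is an assumption about how round() behaves here; Python's built-in `round` is avoided because it rounds halves to the nearest even number.

```python
import math


def quantize(counts, quant_rate=0.01):
    """Apply the count quantization described above to a {token: freq} dict.

    Returns (quant, profile), where profile holds the surviving tokens
    with their frequencies rounded down to a multiple of quant.
    """
    max_freq = max(counts.values(), default=0)
    # Assumption: round() means round-half-up.
    rounded = math.floor(max_freq * quant_rate + 0.5)
    if max_freq <= 1:
        quant = 1
    elif rounded < 2:
        quant = 2
    else:
        quant = rounded
    profile = {}
    for token, freq in counts.items():
        quantized = (freq // quant) * quant
        # A token seen fewer than quant times quantizes to 0 and is dropped.
        if quantized >= quant:
            profile[token] = quantized
    return quant, profile
```

With the default `quantRate`, counts of `{"solr": 5, "index": 3, "hash": 1}` give `quant` = 2, so the profile becomes `{"solr": 4, "index": 2}` and `hash` is discarded. Small differences in token counts between near-duplicate documents are absorbed by this rounding, which is what makes the hash fuzzy.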

=== Hashing ===

The remaining tokens and their quantized frequencies are serialized as a 
space-delimited string, in descending frequency order. This string is then 
MD5-hashed.
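
The hashing step might look like the following Python sketch. Several details are assumptions rather than facts taken from the Solr source: the function name `profile_signature` is hypothetical, and the exact delimiters, the ordering of tokens with equal frequency, the UTF-8 encoding, and the hexadecimal output are illustrative choices, so the digests produced here should not be expected to match Solr's.

```python
import hashlib


def profile_signature(profile):
    """Serialize a {token: quantized_freq} dict and return its MD5 hex digest."""
    # Descending frequency order; ties keep their input order (assumption).
    ordered = sorted(profile.items(), key=lambda item: item[1], reverse=True)
    # Assumption: every token and frequency is joined by a single space.
    text = " ".join(f"{token} {freq}" for token, freq in ordered)
    # Assumption: the string is encoded as UTF-8 before hashing.
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

Two documents whose quantized profiles are identical serialize to the same string and so receive the same signature, even if their raw token counts differed slightly.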

See also [[http://lucene.apache.org/solr/api-4_0_0-BETA/org/apache/solr/update/processor/TextProfileSignature.html|TextProfileSignature's javadoc]]
