The "TextProfileSignature" page has been changed by JoelNothman:
http://wiki.apache.org/solr/TextProfileSignature

Comment: description of algorithm

New page:

TextProfileSignature calculates a fuzzy hash of textual fields for [[Deduplication]], and may be incorporated using a SignatureUpdateProcessorFactory definition including the following parameters:

|| Name || Type || Description || Default value ||
|| `minTokenLen` || int || The minimum token length to consider || 2 ||
|| `quantRate` || float || When multiplied by the maximum token frequency, this determines count quantization || .01 ||

The signature calculation proceeds as follows:

=== Tokenization and normalization ===
 * Tokens are contiguous alphanumeric characters
 * Normalized to lowercase
 * Discarded if shorter than `minTokenLen`

Tokens are then counted, tracking the frequency `maxFreq` of the most frequent token.

=== Count quantization ===
A value `quant` is calculated as follows:

|| || 1 || if `maxFreq` <= 1 ||
||`quant` := || 2 || if round(`maxFreq * quantRate`) < 2 ||
|| || round(`maxFreq * quantRate`) || otherwise ||

Token frequencies are then rounded down to the nearest multiple of `quant`, and any token occurring less than `quant` times is discarded.

=== Hashing ===
The set of frequencies is transformed to a string as a space-delimited sequence of tokens and their frequencies, in descending frequency order. This is then MD5-hashed.

See also [[http://lucene.apache.org/solr/api-4_0_0-BETA/org/apache/solr/update/processor/TextProfileSignature.html|TextProfileSignature's javadoc]]
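
The three steps described on this page (tokenization, count quantization, hashing) can be sketched in a few lines of Python. This is only an illustration of the description above, not Solr's Java implementation, and it should not be expected to reproduce Solr's signatures byte for byte. The function names `compute_quant`, `profile_string` and `text_profile_signature` are hypothetical. The sketch also makes assumptions the page does not state: "round" is taken as round-half-up, tokens with equal frequency are ordered alphabetically, and the profile string is encoded as UTF-8 before hashing.

```python
import hashlib
import math
import re
from collections import Counter


def compute_quant(max_freq, quant_rate=0.01):
    """Quantization step, following the three cases in the table above."""
    if max_freq <= 1:
        return 1
    # Assumption: "round" means round-half-up (Python's round() rounds half to even).
    rounded = int(math.floor(max_freq * quant_rate + 0.5))
    return 2 if rounded < 2 else rounded


def profile_string(text, min_token_len=2, quant_rate=0.01):
    """Build the 'token frequency' string that is later MD5-hashed."""
    # Tokens: contiguous alphanumeric characters, lowercased, short ones discarded.
    tokens = [t.lower() for t in re.findall(r"[^\W_]+", text)]
    counts = Counter(t for t in tokens if len(t) >= min_token_len)

    max_freq = max(counts.values(), default=0)
    quant = compute_quant(max_freq, quant_rate)

    profile = []
    for token, freq in counts.items():
        freq = (freq // quant) * quant  # round down to a multiple of quant
        if freq >= quant:               # drop tokens occurring fewer than quant times
            profile.append((token, freq))

    # Descending frequency; the alphabetical tie-break is an assumption.
    profile.sort(key=lambda item: (-item[1], item[0]))
    return " ".join(f"{token} {freq}" for token, freq in profile)


def text_profile_signature(text, min_token_len=2, quant_rate=0.01):
    """MD5 hex digest of the quantized token profile (UTF-8 is an assumption)."""
    profile = profile_string(text, min_token_len, quant_rate)
    return hashlib.md5(profile.encode("utf-8")).hexdigest()
```

With the default parameters, `profile_string("a bb bb ccc ccc ccc ccc")` drops the one-character token `a`, computes `quant` as 2, and yields the string `ccc 4 bb 2`. Because rare tokens fall below `quant` and are discarded, two texts that differ only in a few infrequent words produce the same signature, which is what makes the hash "fuzzy".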