Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "TextProfileSignature" page has been changed by JoelNothman:
http://wiki.apache.org/solr/TextProfileSignature?action=diff&rev1=2&rev2=3

Comment:
some analysis

  See also [[http://lucene.apache.org/solr/api-4_0_0-BETA/org/apache/solr/update/processor/TextProfileSignature.html|TextProfileSignature's javadoc]]

+ == Implications and limitations ==
+
+ Although the signature is meant to match similar texts approximately, it still relies on an exact match of a single hash. It can therefore fail to match two documents that differ by exactly one word, if that difference moves the word's frequency from `k * quant - 1` to `k * quant`.
+
+ Words appearing once are ignored unless the text consists only of words appearing once. Hence, "the cat sat on a mat" hashes differently from "the cat sat on the mat".
+
+ For the default `quantRate` (0.01), `quant` will exceed 2 only if the most frequent word occurs `maxFreq >= 251` times.
+
+ These properties all suggest that TextProfileSignature is brittle for short texts.
+
+ TextProfileSignature operates on raw text, without the filtering provided by Analyzers, and hence does not strip HTML, normalize diacritics, stem words, or weight tokens by their relative importance.
+
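The quantisation behaviour described above can be reproduced with a small Python sketch. This is a simplified model written for illustration, not the Solr source: the function names (`quantum`, `profile`, `signature`) are hypothetical, the tokenizer is a plain regular expression, ties between equally frequent words are broken alphabetically, and MD5 is assumed as the final hash. Rounding at an exact half (for example `maxFreq = 250` with the default `quantRate`) depends on the rounding mode and floating-point details, so the sketch makes no claim about that boundary.

```python
import hashlib
import math
import re
from collections import Counter


def quantum(max_freq, quant_rate=0.01):
    """Bucket size used to quantise word frequencies (illustrative model)."""
    quant = math.floor(max_freq * quant_rate + 0.5)  # assumed round-half-up
    if quant < 2:
        # Small texts: 2 if any word repeats, otherwise 1.
        quant = 2 if max_freq > 1 else 1
    return quant


def profile(text, quant_rate=0.01):
    """Return (word, quantised frequency) pairs that survive quantisation."""
    # Assumption: tokens are lower-cased runs of letters and digits.
    counts = Counter(re.findall(r"[^\W_]+", text.lower()))
    if not counts:
        return []
    quant = quantum(max(counts.values()), quant_rate)
    # Round each frequency down to a multiple of quant; drop words below quant.
    kept = [(w, (n // quant) * quant) for w, n in counts.items() if n >= quant]
    # Most frequent first; alphabetical tie-break is a choice made for this sketch.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


def signature(text, quant_rate=0.01):
    """Hash the profile; two texts match only if the hashes are identical."""
    lines = "\n".join(f"{w} {n}" for w, n in profile(text, quant_rate))
    return hashlib.md5(lines.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # One repeated word changes quant from 1 to 2 and discards everything else.
    print(profile("the cat sat on a mat"))
    print(profile("the cat sat on the mat"))

    base = "solr solr solr solr index index index query"
    near = "solr solr solr solr index index index search"  # rare word swapped
    edge = "solr solr solr solr index index index index"   # 'index' goes 3 -> 4
    print(signature(base) == signature(near))
    print(signature(base) == signature(edge))
```

In this model, swapping one rare word for another leaves the signature unchanged, while a single extra occurrence that pushes `index` from 3 to 4 (that is, from `k * quant - 1` to `k * quant` with `quant = 2`) changes it.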