Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.

The "TextProfileSignature" page has been changed by JoelNothman:
http://wiki.apache.org/solr/TextProfileSignature?action=diff&rev1=2&rev2=3

Comment:
some analysis

  
  See also 
[[http://lucene.apache.org/solr/api-4_0_0-BETA/org/apache/solr/update/processor/TextProfileSignature.html|TextProfileSignature's
 javadoc]]
  
+ == Implications and limitations ==
+ 
+ Although the signature is designed to match texts approximately, two texts 
match only if their profiles produce exactly the same hash. Documents that 
differ by a single word may therefore fail to match if that word's frequency 
crosses a quantization boundary, i.e. changes from `k * quant - 1` to `k * quant`.
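
The boundary case can be illustrated with a minimal Python sketch. It assumes each word's frequency is rounded down to a multiple of `quant`, which is what the `k * quant - 1` to `k * quant` case implies; `quantize` is a hypothetical helper, not part of Solr's API.

```python
def quantize(freq: int, quant: int) -> int:
    """Round freq down to the nearest multiple of quant (assumed behaviour)."""
    return (freq // quant) * quant


quant = 2
# One extra occurrence that stays inside the same bucket leaves the value unchanged...
same_bucket = quantize(4, quant) == quantize(5, quant)
# ...but going from k * quant - 1 to k * quant (here 3 -> 4) changes it.
crosses_boundary = quantize(3, quant) != quantize(4, quant)
print(same_bucket, crosses_boundary)
```

Because the final comparison is an exact hash match, a single changed quantized frequency is enough to make two otherwise identical documents look unrelated.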
+ 
+ Words appearing once are ignored unless the text consists only of words 
appearing once. Hence "the cat sat on a mat" will hash differently from "the cat 
sat on the mat": the repeated "the" in the second text causes every other word 
to be dropped from its profile.
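
The two example sentences can be run through a rough Python sketch of the profile computation. This is not Solr's code: the ASCII-only tokenization, the sort order, the profile serialization and the use of MD5 are all assumptions made for illustration, and `profile` and `signature` are hypothetical names.

```python
import hashlib
import re
from collections import Counter


def profile(text, quant_rate=0.01):
    """Return the quantized (word, frequency) profile of text (illustrative only)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    max_freq = max(counts.values(), default=0)
    # Round half up, then raise quant to 2 whenever any word repeats (assumed rule).
    quant = int(max_freq * quant_rate + 0.5)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1
    # Round each frequency down to a multiple of quant; drop words below quant.
    kept = {w: (c // quant) * quant for w, c in counts.items() if c >= quant}
    return sorted(kept.items(), key=lambda wc: (-wc[1], wc[0]))


def signature(text):
    """Hash the serialized profile (hash choice and serialization are assumptions)."""
    joined = "\n".join(f"{w} {c}" for w, c in profile(text))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


a = "the cat sat on a mat"
b = "the cat sat on the mat"
print(profile(a))  # every word occurs once, so all of them are kept
print(profile(b))  # "the" occurs twice, so it is the only word that survives
print(signature(a) == signature(b))
```

Under these assumptions the first sentence keeps all six words while the second keeps only "the", so the two signatures differ even though the texts differ by one word.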
+ 
+ For the default `quantRate` (0.01), quant will exceed 2 only if the most 
frequent word occurs `maxFreq >= 251` times.
+ 
+ These properties all suggest that TextProfileSignature is brittle for short 
texts.
+ 
+ TextProfileSignature operates on raw text, without the filtering provided by 
Analyzers. Hence it will not ignore HTML markup, normalize diacritics, stem 
words, or take the relative importance of different tokens into account.
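
The effect of skipping analysis can be sketched in Python. The tokenizer below assumes the text is simply split on characters that are not letters or digits and then lowercased; this is an assumption about the behaviour, and `raw_tokens` is a hypothetical helper rather than anything Solr provides.

```python
def raw_tokens(text):
    """Split on non-alphanumeric characters and lowercase; nothing else is normalized."""
    tokens, cur = [], []
    for ch in text:
        if ch.isalnum():
            cur.append(ch.lower())
        elif cur:
            tokens.append("".join(cur))
            cur = []
    if cur:
        tokens.append("".join(cur))
    return tokens


# The tag name "strong" is counted as a word, and the accented form is a different token.
print(raw_tokens("<strong>Caf\u00e9s</strong> open"))
print(raw_tokens("Cafes open"))
```

With this kind of tokenization, markup contributes words of its own and accented and unaccented spellings never share a count, so two renderings of the same text can yield different profiles.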
+ 
