On 25 May 2016, at 13:15, Dianne Skoll wrote:
On Wed, 25 May 2016 18:10:57 +0100
Paul Stead <paul.st...@zeninternet.co.uk> wrote:
[quoting Dianne]
"Concepts" is a lossy process. You are throwing away information.
That is by design, similar to fingerprinting emails with iXhash or
Razor.
iXhash and Razor are designed to detect mass-mailings of identical or
very similar messages; they measure "bulk-ness" and not hammy-ness or
spammy-ness directly.
More importantly (IMHO) they aren't designed to collide with existing
common tokens and be added back into messages that may contain those
tokens already in order to influence Bayesian classification.
There is sound statistical theory consistent with empirical evidence
underpinning the Bayes classifier implementation in SA. While there can
be legitimate critiques of the SA implementation specifically and, more
generally, of how well email word frequency fits Bayes' Theorem,
injecting a pile of new derivative meta-tokens based on preconceived
notions of "concepts" into the Bayesian analysis violates the assumption
Naive Bayes makes about its input: *independent* features. The
"concepts" approach adds words that are *dependent* on the presence of
other words in the document and, to make it worse, those dependent words
may already exist in some pristine messages. It unmoors the SA Bayes
implementation from any theoretical grounding, converting its complex
math from statistical analysis into arbitrary numerology.
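
To make the double-counting concrete, here is a rough Python sketch. It
is a toy presence-based Naive Bayes with add-one smoothing, NOT SA's
actual Bayes code, and the corpus, the add_concepts() tagger and the
"concept_pharma" token are all made up for illustration. The test
message contains the single word "pills" in both runs, so it carries
exactly the same information, yet its spam score rises once the derived
meta-token is counted as if it were independent evidence.

```python
import math

def token_log_odds(spam_docs, ham_docs, token):
    """Smoothed log-odds contribution of one token (add-one smoothing)."""
    p_spam = (sum(token in d for d in spam_docs) + 1) / (len(spam_docs) + 2)
    p_ham = (sum(token in d for d in ham_docs) + 1) / (len(ham_docs) + 2)
    return math.log(p_spam / p_ham)

def message_log_odds(spam_docs, ham_docs, message):
    """Naive Bayes sums per-token evidence as if tokens were independent."""
    return sum(token_log_odds(spam_docs, ham_docs, t) for t in message)

def add_concepts(doc):
    """Hypothetical 'concept' tagger: the meta-token is fully determined
    by words that are already in the message."""
    doc = set(doc)
    if doc & {"pills", "meds"}:
        doc.add("concept_pharma")
    return doc

# Made-up training corpus: each document is a set of tokens.
spam = [{"pills", "cheap"}, {"pills", "offer"}, {"meds", "cheap"}, {"offer", "click"}]
ham = [{"meeting", "notes"}, {"pills", "doctor"}, {"lunch", "notes"}, {"meeting", "lunch"}]

msg = {"pills"}

# Score using only the original tokens.
plain = message_log_odds(spam, ham, msg)

# Score after tagging both the corpus and the message with concepts.
spam_c = [add_concepts(d) for d in spam]
ham_c = [add_concepts(d) for d in ham]
with_concepts = message_log_odds(spam_c, ham_c, add_concepts(msg))

print(f"log-odds without concepts: {plain:.3f}")
print(f"log-odds with concepts:    {with_concepts:.3f}")
```

The second number is larger only because "pills" is effectively counted
twice: once as itself and once as the meta-token derived from it.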