On 25 May 2016, at 13:15, Dianne Skoll wrote:

On Wed, 25 May 2016 18:10:57 +0100
Paul Stead <paul.st...@zeninternet.co.uk> wrote:

[quoting Dianne]
"Concepts" is a lossy process.  You are throwing away information.
That is by design, similar to fingerprinting emails with iXhash or
Razor.

iXhash and Razor are designed to detect mass-mailings of identical or
very similar messages; they measure "bulk-ness" and not hammy-ness or
spammy-ness directly.

More importantly (IMHO), their output isn't designed to collide with existing common tokens, or to be added back into messages that may already contain those tokens, in order to influence Bayesian classification.

There is sound statistical theory, consistent with empirical evidence, underpinning the Bayes classifier implementation in SA. There can be legitimate critiques of the SA implementation specifically, and of how well email word frequency fits Bayes' Theorem in general. But injecting a pile of new derivative meta-tokens, based on preconceived notions of "concepts", into the Bayesian analysis invalidates the assumption about what the input to Naive Bayes analysis is: *independent* features. The "concepts" approach adds words that are *dependent* on the presence of other words in the document and, to make it worse, those dependent words may already exist in some pristine messages. It unmoors the SA Bayes implementation from any theoretical grounding, converting its complex math from statistical analysis into arbitrary numerology.
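
To make the double-counting concrete, here is a rough sketch in Python. It is a textbook Naive Bayes log-odds sum, not SA's actual combining code, and the token names, the probabilities, and the CONCEPT_PHARMA tag are all made up for illustration. It assumes equal spam/ham priors, so each per-token spam probability p contributes a likelihood ratio of p / (1 - p).

```python
import math

def spam_log_odds(tokens, spam_prob, prior=0.5):
    """Textbook Naive Bayes: add one log-likelihood ratio per distinct token.

    Summing the terms is only justified if the tokens are independent.
    """
    log_odds = math.log(prior / (1 - prior))
    for token in set(tokens):
        p = spam_prob[token]
        log_odds += math.log(p / (1 - p))
    return log_odds

def to_probability(log_odds):
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical learned per-token spam probabilities (invented numbers).
spam_prob = {"viagra": 0.9, "meeting": 0.3, "CONCEPT_PHARMA": 0.9}

message = ["viagra", "meeting"]

# Hypothetical concept tagger: emits CONCEPT_PHARMA whenever "viagra"
# is present, so the new token carries no information of its own.
with_concept = message + ["CONCEPT_PHARMA"]

plain = to_probability(spam_log_odds(message, spam_prob))
inflated = to_probability(spam_log_odds(with_concept, spam_prob))

print(f"without concept token: {plain:.3f}")
print(f"with concept token:    {inflated:.3f}")
```

With these invented numbers the score moves from about 0.79 to about 0.97, even though the concept token is fully determined by a word that was already counted. The same evidence is simply multiplied in twice.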
