Re: SA Concepts - plugin for email semantics

Bill Cole Sat, 28 May 2016 12:38:44 -0700

On 25 May 2016, at 13:15, Dianne Skoll wrote:

On Wed, 25 May 2016 18:10:57 +0100
Paul Stead <paul.st...@zeninternet.co.uk> wrote:

[quoting Dianne]

"Concepts" is a lossy process.  You are throwing away information.

That is by design, similar to fingerprinting emails with iXhash or
Razor.


iXhash and Razor are designed to detect mass-mailings of identical or
very similar messages; they measure "bulk-ness" and not hammy-ness or
spammy-ness directly.

More importantly (IMHO) they aren't designed to collide with existing common tokens and be added back into messages that may contain those tokens already in order to influence Bayesian classification.

There is sound statistical theory consistent with empirical evidence underpinning the Bayes classifier implementation in SA. While there can be legitimate critiques of the SA implementation specifically and in general how well email word frequency fits Bayes' Theorem, injecting a pile of new derivative meta-tokens based on pre-conceived notions of "concepts" into the Bayesian analysis invalidates the assumption of what the input for Naive Bayes analysis is: *independent* features. The "concepts" approach adds words that are *dependent* on the presence of other words in the document and to make it worse, those dependent words may already exist in some pristine messages. It unmoors the SA Bayes implementation from any theoretical grounding, converting its complex math from statistical analysis into arbitrary numerology.
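To illustrate the independence point, here is a minimal sketch of textbook Naive Bayes scoring. It is not SA's actual Bayes code, which combines token probabilities differently. The token names, the likelihood values, and the function naive_bayes_posterior are all hypothetical, chosen only to show the effect. The meta-token "CONCEPT_PHARMA" is assumed to be injected whenever "pills" appears, so it carries no new evidence, yet the classifier counts the same likelihood ratio twice.

```python
import math


def naive_bayes_posterior(tokens, likelihoods, prior_spam=0.5):
    """Return P(spam | tokens), treating every token as independent."""
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    for tok in tokens:
        p_given_spam, p_given_ham = likelihoods[tok]
        # Each token contributes its own likelihood ratio, whether or not
        # it is really independent of the tokens already counted.
        log_odds += math.log(p_given_spam / p_given_ham)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical per-token likelihoods: (P(token | spam), P(token | ham)).
# The derived token always accompanies "pills", so it has the same values.
likelihoods = {
    "pills": (0.30, 0.10),
    "CONCEPT_PHARMA": (0.30, 0.10),
}

# Message as written: one piece of evidence.
original = naive_bayes_posterior(["pills"], likelihoods)

# Same message after the derived token is added back in.
augmented = naive_bayes_posterior(["pills", "CONCEPT_PHARMA"], likelihoods)

print(f"original:  {original:.2f}")
print(f"augmented: {augmented:.2f}")
```

Under these assumed numbers the score rises from 0.75 to 0.90 even though the message contains no additional information, because the derived token is fully determined by a word that was already counted.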
