On Tue, 31 May 2016 15:20:56 -0400
Bill Cole wrote:

> On 29 May 2016, at 11:07, RW wrote:
> 

> > Statistical filters are based on some statistical theory combined
> > with pragmatic kludges and assumptions. Practical filters have been
> > developed based on what's been found to work, not on what's more
> > statistically correct.  
> 
> I'm not aware of any hard evidence that the SA Bayes pragmatic
> kludges and assumptions perform better or worse than an
> implementation that used fewer or different ones. 

It's not specific to SA. For example, there's no sound basis for
assigning a probability to tokens that have zero ham or spam counts,
yet many classifications turn on these completely made-up
probabilities. There's also no way of assigning meaningful
probabilities to tokens that enter or re-enter the database once it's
mature without making an assumption about the current spam/ham
training ratio.
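To make that concrete, here's a minimal sketch of the Robinson-style
smoothing that many Bayes implementations apply. The function name and
the constants s and x are illustrative assumptions, not any particular
filter's defaults; the point is just that a zero-count token gets
exactly the assumed prior, and a low-count token is dominated by it.

# Sketch of Robinson-style token probability smoothing.  The constants
# s and x are illustrative assumptions, not any filter's real defaults.

def token_probability(spam_count, ham_count, nspam, nham, s=1.0, x=0.5):
    """Estimate P(spam|token) with an assumed prior blended in.

    spam_count, ham_count: times the token was seen in spam/ham training
    nspam, nham:           total trained spam/ham messages
    s:                     strength given to the assumed prior x
    x:                     assumed probability for a never-seen token
    """
    # Raw ratio, normalised by training volume -- this is where the
    # assumption about the current spam/ham training ratio sneaks in.
    spam_freq = spam_count / nspam if nspam else 0.0
    ham_freq = ham_count / nham if nham else 0.0
    denom = spam_freq + ham_freq
    p = spam_freq / denom if denom else x

    # Blend with the prior: with zero counts this is just x, i.e. a
    # completely made-up probability.
    n = spam_count + ham_count
    return (s * x + n * p) / (s + n)

# A token never seen in training gets exactly the assumed prior.
print(token_probability(0, 0, 10000, 10000))   # 0.5
# A token seen once in spam is pulled only part-way towards 1.0.
print(token_probability(1, 0, 10000, 10000))   # 0.75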

The assumption that tokens are independent was never reasonable in the
first place: there's plenty of natural duplication (e.g. IP address
and rDNS) and strong correlation between important tokens. There's
also a lot of inadvertent duplication, for example from metadata
headers that are not primarily intended for Bayes.
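A toy illustration of the independence point, using the simple
naive-Bayes combining formula rather than SA's chi-squared combining;
the token probabilities are made up:

# Why token independence matters: duplicated evidence counts twice.
from math import prod

def combine(probs):
    """Naive-Bayes combination assuming every token is independent."""
    ps = prod(probs)
    qs = prod(1 - p for p in probs)
    return ps / (ps + qs)

# One piece of evidence: a relay IP that is 90% spammy.
print(combine([0.9]))        # 0.9

# The same relay also appears as an rDNS token carrying the same
# information.  Treated as independent, the duplicate alone pushes the
# combined score to ~0.988, even though nothing new was learned.
print(combine([0.9, 0.9]))   # ~0.988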


I don't think concepts is a particularly good idea, but I don't like
to see someone's work dismissed on such paper-thin theoretical
grounds.



> > I think the OP is probably underselling it, in that it could be
> > used to extract information that normal tokenization can't get,
> > for example:
> > ...
> > The main problem is that you'd need a lot of rules to make a
> > substantial difference.
> 
> So: re-invent SpamAssassin v1 but without rule scores, using Bayes to
> do half-assed dynamic score adjustment per site with rules that will
> either evolve constantly or grow stale?

I was thinking that it would be an alternative to local custom rules
- particularly for spams that leave Bayes with little to work with and
where individual body rules aren't worth much of a score.
