On Thu, January 9, 2014 6:20 pm, Karsten Bräckelmann wrote:
> Even the most effective results I have ever seen on a non-personal
> attack is merely getting the Bayes classification to a neutral. And that
> was not a "regular" text token, but includes mail headers. And a biased
> Bayes database towards some specific mail headers that spam run happened
> to use...

So, unfortunately I still see the occasional FN slipping through my filters with BAYES_00... which means either these spams are magically hitting some very hammy tokens, or I've got some major problems with my Bayes DB.

I've been training the DB both with autolearn and with manual sa-learn spam classification (the latter run every week or two on my spam folder, which holds the last 30 days of spam). I admit that autolearn had been running for probably years before I actually started to "properly" set up and train SA, so one possible issue is that it autolearned spam as ham. On the other hand, other users on my system who have ALSO been autolearning for years don't seem to get BAYES_00 FNs, just BAYES_50-ish hits (sometimes as low as BAYES_20, but that's rare), so I'm not sure autolearn is the problem (unless, for some reason, I was mistakenly autolearning a helluva lot more spam than they have over that time).

I'd prefer not to dump my entire Bayes DB and start over, though I can do that if I have to... but I'd like to try to diagnose the issue before burning down the house.

How can I inject the Bayes-identified tokens (hammy or spammy) into my SA headers, so that I can try to debug what's causing this? I'd want to do this for all emails, not just the ones identified as ham or spam. I've seen people posting real-language Bayes hits here, so I'm wondering how to do that.

(I imagine there's no way to get the actual real-language words back out of the existing Bayes DB, since they're stored as hashes, right? That is, the actual words aren't stored, only their hashes? Or is that not right?)

Thanks.

--- Amir
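P.S. To make that header question concrete, here's the kind of thing I'm imagining in local.cf. I believe _BAYES_, _HAMMYTOKENS(N)_ and _SPAMMYTOKENS(N)_ are the relevant template tags, and that "all" makes the header appear on every scanned message rather than just spam, but please correct me if I've got the tag names or syntax wrong -- the header name "Bayes-Info" is just a made-up example:

    # Sketch only: add the Bayes score plus the most significant hammy and
    # spammy tokens to every scanned message ("all" = ham and spam alike).
    add_header all Bayes-Info bayes=_BAYES_ hammy=_HAMMYTOKENS(10)_ spammy=_SPAMMYTOKENS(10)_

If that's not the right mechanism for seeing which tokens are driving the BAYES_00 hits, I'd appreciate a pointer to whatever is.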