On Thu, January 9, 2014 6:20 pm, Karsten Bräckelmann wrote:
> Even the most effective results I have ever seen on a non-personal
> attack is merely getting the Bayes classification to a neutral. And that
> was not a "regular" text token, but includes mail headers. And a biased
> Bayes database towards some specific mail headers that spam run happened
> to use...

So, unfortunately I still see the occasional FN slipping through my filters with BAYES_00... which means either these spams are magically hitting some very hammy tokens, or I've got some major problems with my Bayes DB.

I've been training the DB both with autolearn and with manual sa-learn spam classification (the latter run every week or two on my spam folder, which holds the last 30 days of spam). I admit that autolearn had been running for probably years before I actually started to "properly" set up and train SA, so one possible issue is that it autolearned spam as ham. On the other hand, other users on my system who have ALSO been autolearning for years don't seem to get BAYES_00 FNs, just BAYES_50-ish hits (sometimes as low as BAYES_20, but that's rare), so I'm not sure autolearn is the problem (unless, for some reason, I was mistakenly autolearning a helluva lot more spam than they have over that time).

I'd prefer not to dump my entire Bayes DB and start over, though I can do that if I have to... but I'd like to try to diagnose the issue before burning down the house.

How can I inject the Bayes-identified tokens (hammy or spammy) into my SA headers, so that I can try to debug what's causing this? I'd want to do this for all emails, not just the ones identified as ham or spam. I've seen people posting real-language Bayes hits here, so I'm wondering how to do that.

(I imagine there's no way to get the actual real-language words back out of the existing Bayes DB, since they're stored as hashes, right? That is, the actual words aren't stored, only their hashes? Or is that not right?)

Thanks.

--- Amir
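P.S. To make that header question concrete, here's the kind of thing I'm imagining in local.cf. I believe _BAYES_, _HAMMYTOKENS(N)_ and _SPAMMYTOKENS(N)_ are the relevant template tags, and that "all" makes the header appear on every scanned message rather than just spam, but please correct me if I've got the tag names or syntax wrong -- the header name "Bayes-Info" is just a made-up example:

    # Sketch only: add the Bayes score plus the most significant hammy and
    # spammy tokens to every scanned message ("all" = ham and spam alike).
    add_header all Bayes-Info bayes=_BAYES_ hammy=_HAMMYTOKENS(10)_ spammy=_SPAMMYTOKENS(10)_

If that's not the right mechanism for seeing which tokens are driving the BAYES_00 hits, I'd appreciate a pointer to whatever is.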