Re: shifting the midpoint between the average spam and average

Joe Flowers 4 Sep 2004 14:02:18 -0000

Joe E.:

Thanks for getting past the usual knee-jerk reaction and seeing.

Joe F.

Joe Emenaker wrote:

Steve Bertrand wrote:
> SA isn't about the "average" it's about the accuracy.
If this were the case, then why aren't the spam scores ("*required_hits*") for each message either 1 or 0 and nothing else?
Oh, come on now. This is just a troll on a very legitimate and informative statement.
No... actually, I think he's got you there.
When I read his first post, my kneejerk reaction was: "Dude... RTFM!! Learn about 'spam_threshold', adjust that to your spam/ham averages and not the other way around, and stop asking silly questions..."

Then, he mentioned that he has a bunch of users who already are using spam_threshold, but their values need to be tweaked, and it would be easier for him to tweak the scoring, than everybody's thresholds. At *that* point, my kneejerk reaction was to tell him to write a script that records all of the spam scores for each user, along with whether that user categorized it as spam or ham and then write a little script like this (http://fruitpie.blastpoint.com/~jemenake/spamreport.cgi) to let the users custom-pick their desired level of false-positives/false-negatives.
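A rough sketch of that kind of per-user report, assuming each recorded message is stored as a (score, is_spam) pair where is_spam is the user's own verdict. The record layout and the function name are made up for illustration; this is not the code behind the script linked above.

```python
def error_rates(records, threshold):
    """Return (false_positive_rate, false_negative_rate) for one threshold.

    records: iterable of (score, is_spam) pairs, is_spam being the user's verdict.
    A message counts as flagged when its score is at or above the threshold.
    """
    ham = [score for score, is_spam in records if not is_spam]
    spam = [score for score, is_spam in records if is_spam]
    false_pos = sum(1 for score in ham if score >= threshold)   # ham flagged as spam
    false_neg = sum(1 for score in spam if score < threshold)   # spam let through
    fp_rate = false_pos / len(ham) if ham else 0.0
    fn_rate = false_neg / len(spam) if spam else 0.0
    return fp_rate, fn_rate


# Made-up history for one user.
history = [
    (1.0, False), (4.0, False), (6.0, False),
    (3.0, True), (5.5, True), (8.0, True), (12.0, True),
]

# Tabulate the trade-off so the user can pick the threshold they prefer.
report = {t: error_rates(history, t) for t in (3.0, 5.0, 7.0)}
```

Showing the user this table for a range of thresholds lets them choose the false-positive/false-negative balance directly instead of guessing at a number.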
But *THEN*, I finally saw the light.
From what I can gather, he's talking about the problem presented whenever you add/remove/change your SA rules (or... heaven forbid... upgrade to 3.0). Whenever that happens, SA's scoring is going to shift, and everyone's individual optimum spam_threshold will shift, too.
That's pretty screwed.
What the-other-Joe seems to be asking for is some way for SA to keep "re-centering" itself so that he doesn't have to go fix everybody's spam_threshold (or ask the individual users to) whenever he changes the rules.

The easiest way to do this is probably for SA to somehow track what the highest and lowest scores have been for the last week or so and, if they both shift in the same direction by some amount, compensate for that. On the face of it, this could probably be implemented with something similar to the auto-whitelisting which SA already has (since the auto-whitelist is just an averaging feature).
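One way the high/low tracking could look, as a minimal sketch. It assumes the scores from an earlier window and from the most recent window are available as plain lists; the function name and the min_shift cutoff are hypothetical, and nothing here is an existing SA feature.

```python
def drift_compensation(previous_scores, recent_scores, min_shift=0.5):
    """Return a correction to add to new scores, or 0.0 if no clear drift.

    Drift is only assumed when the lowest and the highest score both moved
    in the same direction by at least min_shift between the two windows.
    """
    low_shift = min(recent_scores) - min(previous_scores)
    high_shift = max(recent_scores) - max(previous_scores)
    same_direction = low_shift * high_shift > 0
    big_enough = min(abs(low_shift), abs(high_shift)) >= min_shift
    if same_direction and big_enough:
        # Undo the average movement of the two extremes.
        return -(low_shift + high_shift) / 2
    return 0.0


# Made-up example: after a rule change, everything scores about 2 points higher.
before = [-2.0, 3.0, 20.0]
after = [0.0, 5.0, 22.0]
correction = drift_compensation(before, after)
```

Extremes are sensitive to single outliers, so a real version would probably want something more robust than min and max; they are used here only because they are what the paragraph above describes.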

The even slicker way to do what the-other-Joe is talking about is like this, but it requires user feedback in the form of them having Spam and Ham trash folders (like many people already use, myself included, for Bayes training). If you had that, then SA would have available reliable data regarding the average score of all ham and all spam. Then, SA would be able to always adjust its scoring so that these averages fell equally on either side of 5. Then, nobody would ever need to mess with their spam_threshold, really. The admin could change the ruleset, the spammers could change their tactics, etc. As long as the user kept using their Spam and Ham trash folders, SA would keep learning and keep re-centering. The user would experience brief spikes of false-negatives and false-positives, but the auto-centering would correct for it within a week or so.
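The re-centering arithmetic can be sketched as follows, assuming the scores of everything in the Ham and Spam folders are available as two lists. The names are illustrative only; this is a constant shift applied to every score, not a description of anything SA does today.

```python
from statistics import mean

TARGET_MIDPOINT = 5.0  # the "5" that the two averages should straddle


def recentering_offset(ham_scores, spam_scores, target=TARGET_MIDPOINT):
    """Return the constant to add to every raw score so that the midpoint
    between the average ham score and the average spam score is `target`."""
    midpoint = (mean(ham_scores) + mean(spam_scores)) / 2
    return target - midpoint


# Made-up scores standing in for the contents of the two folders.
ham_folder = [0.5, 1.0, 1.5]
spam_folder = [11.0, 12.0, 13.0]

offset = recentering_offset(ham_folder, spam_folder)
adjusted_ham = [score + offset for score in ham_folder]
adjusted_spam = [score + offset for score in spam_folder]
```

Recomputing the offset periodically from the folders' current contents is what would give the "keeps re-centering" behavior: a rule change moves both averages, and the next recomputation moves the midpoint back to 5.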

Now, the ultra-deluxe-honeymoon-suite version of this would go one step further. If SA did have access to the scores of everything dropped into the Spam and Ham folders, then SA could not only adjust so that a score of 5 fell squarely between the averages, but it could *scale* the scoring so that a spam_threshold of "10" would be guaranteed to *catch* everything that the user has ever dropped into their Spam trash folder (aka, all known spam from the past).... and a spam_threshold of "0" would be guaranteed to *pass* everything that the user has ever dropped into their Ham trash folder (aka, all known ham from the past).
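Read literally, that scaling is a linear map that sends the lowest-scoring known spam to 10 and the highest-scoring known ham to 0. The sketch below does exactly that and nothing more. It is an illustration under two assumptions that are not in the post: the known ham and spam scores must not overlap (otherwise no increasing linear map can satisfy both guarantees), and the boundary case of a ham landing exactly on 0 is left to whichever comparison the filter uses. Since a linear map has only two free parameters, it also does not try to keep the averages centered on 5 at the same time.

```python
def make_rescaler(ham_scores, spam_scores, low=0.0, high=10.0):
    """Build a function mapping raw scores onto the low..high scale.

    The highest known ham score maps to `low`, the lowest known spam score
    maps to `high`, and everything else is scaled linearly between or beyond.
    """
    top_ham = max(ham_scores)
    bottom_spam = min(spam_scores)
    if bottom_spam <= top_ham:
        raise ValueError("known ham and spam scores overlap; cannot scale")
    span = bottom_spam - top_ham

    def rescale(raw_score):
        return low + (high - low) * (raw_score - top_ham) / span

    return rescale


# Made-up folder contents with a clean gap between ham and spam.
ham_folder = [-1.0, 0.5, 2.0]
spam_folder = [7.0, 9.0, 15.0]

rescale = make_rescaler(ham_folder, spam_folder)
scaled_ham = [rescale(score) for score in ham_folder]
scaled_spam = [rescale(score) for score in spam_folder]
```

With these made-up numbers, every known spam ends up at 10 or above and every known ham at 0 or below, so any threshold in between separates the known mail and the user's choice only affects how unseen mail near the boundary is treated.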

If SA could do *that*, then the spam_threshold just becomes a 0-10 number that the user chooses to indicate their personal preference between false-positives and false-negatives. The user would never have to change that value unless the user's *preference* changed.
And I think *that* should be an ultimate goal.
So, to summarize, I think that the-other-Joe has hit upon an important idea here... but I think that it can, ultimately, be taken even further and that it could make SA really, really slick.
- Joe
