Re: Time for my monthly beating again...

Joe Flowers 19 Feb 2005 05:55:53 -0000

I'll try to keep it as short as possible.

By my preference and from hearing continuing horror stories about spamd, I have a C program in the place of spamd. It makes calls to Perl - Perl is "embedded" in the C program. The C spamd replacement talks to a C program running on our NetWare NetMail (soon to be Hula) email servers. Actually, the same Linux box running SpamAssassin uses this spamd replacement to talk to 3 different email servers over TCP sockets at the same time.

The default SpamAssassin v2.64 (Mail::SpamAssassin::PerMsgStatus::get_hits) score of 5.0 corresponds to 51 "+" marks that are placed into each incoming email message header. The NetMail "Rule server" on the email servers then filters on those pluses. The more pluses, the more likely the message is a Spam message. Every user can adjust his or her threshold away from the 51 default. So, the user does have some control over his/her own Spam settings. The program on the email servers will never put more than 101 "+" marks nor fewer than 1 "+" mark in any email header. If the number of pluses in the email message header reaches or exceeds the threshold set by the user, then the message is filtered by the NetMail Rule server and placed in the user's "MostlySpam" folder. In other words, the filtering is done server-side.

An SA get_hits score of 0 or less corresponds to 1 "+" mark in the email message header. An SA get_hits score of 10 or more corresponds to 101 "+" marks in the email message header.
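For reference, the fixed (non-drifting) mapping described above, where 0 maps to 1 plus, 5.0 to 51 and 10 to 101, could be sketched roughly as follows. The function name score_to_pluses is hypothetical and not part of the actual program; this is only a minimal illustration of ten pluses per point with the result clamped to the 1..101 range.

```c
/* Hypothetical helper (not from the actual spamd replacement):
 * map a get_hits score to a number of "+" marks, with no averaging. */
int score_to_pluses(double hits)
{
    /* Ten pluses per point, offset by one so that 0.0 -> 1 and 5.0 -> 51. */
    double scaled = 10.0 * hits + 1.0;

    /* Never fewer than 1 plus and never more than 101. */
    if (scaled < 1.0)   { scaled = 1.0; }
    if (scaled > 101.0) { scaled = 101.0; }

    /* The value is positive here, so adding 0.5 and truncating
     * rounds to the nearest whole plus. */
    return (int)(scaled + 0.5);
}
```

With this mapping the tenths digit of the score survives: a score of 5.3 gives 54 pluses rather than being lumped in with 5.0.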

Right or wrong (?), I thought that since SA defaults to 5.0, then most of the crucial action must be happening between 0 and 10. Also, I didn't want to deal with NULL problems that are associated with 0 "+" marks in the headers, and I didn't want to clog up the headers unnecessarily with an ungodly number of pluses, but I still wanted as fine a control as possible within a get_hits of 0 and 10 - I didn't want to just discard the significant information held in the tenths spot of the get_hits score.

On the spamd replacement side, the average of the get_hits scores of all of the messages is stored in a very small ("tiny") text file, along with the number of messages contributing to this average - the total number of messages processed.
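A tiny state file like this could be read and written along the following lines. This is only a sketch under assumptions: the struct, the function names and the one-line "average count" file layout are all hypothetical, since the actual file format is not given here. It also does no locking, which would matter if several connections updated the file at the same time.

```c
#include <stdio.h>

/* Hypothetical state record: the running average and the message count. */
struct avg_state {
    double average;   /* OldAverage */
    long   count;     /* TotalNumberOfMessagesProcessed */
};

/* Load the state from `path` (assumed layout: "<average> <count>").
 * Returns 0 on success. Returns -1 and leaves an empty state
 * (average 0.0, count 0) if the file is missing or malformed. */
int load_state(const char *path, struct avg_state *st)
{
    FILE *fp = fopen(path, "r");
    int fields;

    st->average = 0.0;
    st->count = 0;
    if (fp == NULL) { return -1; }

    fields = fscanf(fp, "%lf %ld", &st->average, &st->count);
    fclose(fp);

    if (fields != 2 || st->count < 0) {
        st->average = 0.0;
        st->count = 0;
        return -1;
    }
    return 0;
}

/* Overwrite `path` with the current state. Returns 0 on success, -1 on error. */
int save_state(const char *path, const struct avg_state *st)
{
    FILE *fp = fopen(path, "w");
    int ok;

    if (fp == NULL) { return -1; }

    /* %.17g keeps enough digits to round-trip a double exactly. */
    ok = fprintf(fp, "%.17g %ld\n", st->average, st->count) > 0;
    if (fclose(fp) != 0) { ok = 0; }
    return ok ? 0 : -1;
}
```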

Basically and roughly:
-----------------------------------------------------------------------------

hits=get_hits; //Mail::SpamAssassin::PerMsgStatus::get_hits

//Let's try to keep control of the "outliers" - prevent the averages from
//being so sensitive to large positive or negative get_hits values.
//Hopefully, the averages will never reach -20 or 30 (the "walls").
//If they do run into these walls, then we either need to adjust/broaden
//these limits, abandon this technique altogether, or re-think the implementation.

if(hits<-20.0) {
    hits=-20.0;
}

if(hits>30.0) {
    hits=30.0;
}

//Read the "OldAverage" and "TotalNumberOfMessagesProcessed" from the tiny text file.

NumberOfPluses = (10.0*(hits-OldAverage)) + 51.0;

//Round NumberOfPluses off to the nearest whole number.
FractionPartOfNumberOfPluses=modf(NumberOfPluses, &IntegerPartOfNumberOfPluses);
NumberOfPluses=IntegerPartOfNumberOfPluses; //drop the fraction first
if(FractionPartOfNumberOfPluses >= 0.5) {
    NumberOfPluses=(NumberOfPluses+1.0);
}

//Put an upper and lower bound on the number of pluses (+).
if(NumberOfPluses < 1.0) { NumberOfPluses = 1.0; }
if(NumberOfPluses > 101.0) { NumberOfPluses = 101.0; }

NewAverage=((OldAverage*TotalNumberOfMessagesProcessed) + hits);
TotalNumberOfMessagesProcessed++;
NewAverage=(NewAverage/TotalNumberOfMessagesProcessed);

//Update the "OldAverage" (with NewAverage) and "TotalNumberOfMessagesProcessed" in the tiny text file.

-----------------------------------------------------------------------------
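The steps above can be sketched as one self-contained C function. This is a hedged illustration rather than the actual program: the struct and function names are hypothetical, the state is kept in memory instead of the tiny text file, and the clamp to 1..101 is applied before rounding, which gives the same result as rounding first because both bounds are whole numbers.

```c
/* Hypothetical in-memory stand-in for the tiny text file. */
struct drift_state {
    double average;   /* OldAverage */
    long   count;     /* TotalNumberOfMessagesProcessed */
};

/* Return the number of "+" marks for one message and fold its score
 * into the running average. */
int drifting_pluses(double hits, struct drift_state *st)
{
    double pluses;
    int rounded;

    /* Keep outliers from dragging the average around (the "walls"). */
    if (hits < -20.0) { hits = -20.0; }
    if (hits > 30.0)  { hits = 30.0; }

    /* Ten pluses per point of distance from the running average;
     * a message scoring exactly the average gets the default 51. */
    pluses = 10.0 * (hits - st->average) + 51.0;

    /* Bound to 1..101, then round to the nearest whole number.
     * The value is positive after clamping, so +0.5 and truncation works. */
    if (pluses < 1.0)   { pluses = 1.0; }
    if (pluses > 101.0) { pluses = 101.0; }
    rounded = (int)(pluses + 0.5);

    /* Update the running average with the (clamped) score. */
    st->average = (st->average * (double)st->count + hits)
                  / (double)(st->count + 1);
    st->count += 1;

    return rounded;
}
```

Note that the plus count is computed against the old average, before the current message is folded in, which matches the order of the steps above.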

That's the heart of it.... I hope that made sense with enough meat. The jury is still out of course and I've got my fingers crossed, but everything is still going very very well. Right now, we're somewhere around the 15K TotalNumberOfMessagesProcessed mark.

Just looking at it, SpamAssassin itself has to be doing an incredibly good job at identifying and scoring these messages relative to one another; otherwise, there is no way I would be seeing these great results, and I probably would have run into a "wall" long before now.

Joe
------------------------------------------------------------

Joe Emenaker wrote:

Joe Flowers wrote:
Very preliminary results are no less than AWESOME.
So... how are you implementing the "drifting" spam threshold?
- Joe
