Re: TMDA false assumptions

Karsten M. Self Mon, 22 Sep 2003 21:07:51 -0700

on Mon, Sep 22, 2003 at 06:45:56PM -0700, Chris Berry ([EMAIL PROTECTED]) wrote:
> >From: "Karsten M. Self" <[EMAIL PROTECTED]>
> >As I have:
> >
> >    http://kmself.home.netcom.com/Rants/challenge-response.html
> >
> >    I might add that I myself use a mix of whitelisting and spam
> >    filtering (via SpamAssassin) to filter my own mail with a very high
> >    level of accuracy, in terms of true positives, true negatives, false
> >    positives, and false negatives. Namely: better than 98% true
> >    positive (filtered spam),
> 
> That's good, and spamassassin is definitely a nice tool especially
> with the bayesian filtering and vipul's razor modules.


Quite ;-)

> >less than 2% false negative (unfiltered spam)
> 
> That's bad, I don't know about you, but as an administrator, I get
> A LOT of email, often as many as a thousand a day or more (counting spam,
> around 100-300 legit) and a 2% false negative rate would give me
> nearly 20 spam messages a day, which is highly annoying, but that's a
> problem I could live with if it weren't for the next issue.

Ah, Grasshopper, but I use a whitelist.  Much in the same way that TMDA
provides one, except that I manage the adds and deletes myself.  While
it's partial homebrew, I could (and may) scrap it for SA's own whitelist
feature.

Well, that 2% makes it to my "greymail" box.  That's addresses I haven't
seen before, and mail that hasn't tripped some other rule.  So spam in
my actual inbox is a rare event -- once or twice a month, with a daily
volume of 300-500 (typical) to 2k+ (bad virus day) mail.  Similar
experience when serving as admin for an ISP.  Daily mail was ~800+
items, spam was 60-120 items/day, and my inbox was virtually spotless.
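
A minimal sketch of that three-way routing, in Python.  The addresses,
folder names, the 5.0 threshold (SpamAssassin's stock required_score),
and the choice to check the whitelist before the score are all
assumptions made for illustration, not a description of the actual
homebrew setup:

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    sa_score: float  # SpamAssassin score, assumed to be computed upstream


# Hypothetical hand-maintained whitelist (adds and deletes done manually).
WHITELIST = {"friend@example.org", "list-owner@example.net"}

# Assumed tagging threshold; 5.0 is SpamAssassin's stock required_score.
SPAM_THRESHOLD = 5.0


def route(msg: Message) -> str:
    """Return the folder a message would be filed into."""
    # Assumption: known senders go straight to the inbox.
    if msg.sender.lower() in WHITELIST:
        return "inbox"
    # Unknown sender that tripped the spam rule.
    if msg.sa_score >= SPAM_THRESHOLD:
        return "spam"
    # Unknown sender, no rule tripped: hold for occasional review.
    return "greymail"
```

With this ordering, the spam that the filter misses lands in "greymail"
rather than in the inbox, which is the effect described above.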

> >99.98% true negative (unfiltered non-spam), and less than
> >    0.02% false positive (filtered non-spam).
> 
> That's totally unacceptable, at that rate I could lose as many as 5-10 
> legitimate emails a week and that's just for one person, 

I disagree for several reasons:

  - It's better than I can do by hand.  Far better.  At much less cost
    (time).

  - Most of the false positives come from one or more of:  list mail,
    someone I've never communicated with before, or a commercial mailing
    from an organization I have a legitimate relationship with (usually
    newsletters or catalog/special offer pitches).  All of it's pretty
    dodgy.

    When it comes to false positives on mail from persons I really
    shouldn't have roundfiled, I can think of two specific cases:

    1.  Rusty Foster in a post to a mailing list intentionally included
        a bunch of spam triggers.  Come to think of it, I don't remember
        if it filtered as spam or if I just thought it should have.
        This was over two years ago.

    2.  An email from a very seldom used, but long-established account,
        was tagged as spam by SpamAssassin based on autowhitelist rules
        (they've been changed).  The reason for this was that the same
        address had been spoofed in spam three times, with high SA
        scores (20+), and this then factored into the score on the
        specific item.  The specific incident contributed to a redesign
        of SA whitelisting features such that they are used to lower
        spam scores, but not raise them (or not much).  And yes, I
        caught hell for this.

     Other than these two cases, in three years of SA use, I've had no
     misclassification of significant mail.  If limited to direct
     personal correspondence, the number of false positives is
     effectively zero.

   - My error rate in other aspects of mail processing is far higher.
     I've got list filters that go whacky, my POP mailbox fills to
     capacity when I can't empty it regularly, corrupted mailboxes crap
     out hundreds (or thousands) of messages.  This last has happened to
     users at work or associates more often than myself.  In the grand
     scheme of things, though, the leakage from spam classification is
     minimal, marginal, and low-value.
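
As a rough check on scale, the quoted rates can be turned into expected
counts.  This sketch assumes the 0.02% figure is a fraction of
legitimate mail only, and uses the 100-300 legitimate messages a day
mentioned earlier in the thread; the function name is made up for the
example:

```python
def expected_per_week(daily_volume: float, error_rate: float) -> float:
    """Expected number of misclassified messages in a seven-day week."""
    return daily_volume * error_rate * 7


FALSE_POSITIVE_RATE = 0.0002  # 0.02%, assumed to apply to legitimate mail

for legit_per_day in (100, 300):
    lost = expected_per_week(legit_per_day, FALSE_POSITIVE_RATE)
    print(f"{legit_per_day} legit/day: about {lost:.2f} false positives/week")
```

Under those assumptions the expected loss is roughly 0.14 to 0.42
messages a week for one mailbox, that is, well under one a week.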

> if you figure a more reasonable email volume for the rest of our
> employees at about 1/3 that rate you're talking hundreds of lost
> emails every week.  

Most of which will likely be:

  - Newsletters.
  - Marketing materials.
  - Mailing list posts.
  - Other "out-of-the-blue" posts.

Inherently:  stuff that's tough to classify.  The _value_ of any
misclassified posts will be low.

This is, however, all the more reason not to refuse all suspected spam
point blank, but to filter the borderline stuff to a review folder.  My
own recommendation is to SMTP reject spam at scores of 10 or higher.
Valid mail rejected in this manner will result in a notice to the sender
-- and it is the responsibility of their mailserver to deliver in any
regard.  Spammers' SMTP engines will likely simply toss the rejection.

Mail scoring 5 <= sa < 10 would be tagged, but delivered.  In practice
you may find that you can split this range based on observed SA scores,
for which I provide a plot:

    http://twiki.iwethey.org/twiki/pub/Main/SpamEmailTrends/SAScores.png

Note that rejecting at a score of 10+ effectively blocks 86% of spam.
This is one year old data, and the situation may have changed...  OK,
I'm running new numbers now, may post later...
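
The score-based policy recommended above can be sketched as follows.
The constant and function names are hypothetical, and in a real
deployment the reject step would have to happen in the MTA at SMTP
time rather than in a standalone script:

```python
REJECT_AT = 10.0  # reject at SMTP time at or above this score
TAG_AT = 5.0      # tag, but still deliver, from this score up to REJECT_AT


def disposition(sa_score: float) -> str:
    """Map a SpamAssassin score to one of three actions."""
    if sa_score >= REJECT_AT:
        return "reject"   # sender's mailserver gets the rejection
    if sa_score >= TAG_AT:
        return "tag"      # delivered, marked for a review folder
    return "deliver"      # treated as ordinary mail
```

Splitting the 5-10 range further, as suggested, would only mean adding
another threshold between TAG_AT and REJECT_AT.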


> Even if my numbers are two orders of magnitude too high, that would
> still be at least one a week, and I can tell you that my boss'
> acceptable loss rate is ZERO, EVER, FOR THE LIFE OF THE COMPANY.  

Provide your boss with a chart, having three axes.

Plot on this chart:

  - Spam received.
  - Spam accepted.
  - Cost

Ask him where he chooses to draw the line.

I know of no solution that is more effective than the mix of:

  - Virus rejection filters.
  - A content or Bayesian classifier.  Not restricted to SpamAssassin,
    though it is among the better solutions.
  - A whitelist.


> Which just can't be achieved using spamassassin alone, unless you want
> to manually go through all your junk mail, in which case why are you
> bothering in the first place?

Well, as a system administrator, GNU/Linux user, and mutt fan of some
experience....

It's pretty easy to look through my spam folder and classify what's in
it by eyeball with a high level of accuracy.

More likely:  someone says "why haven't you responded to my mail", and
because I actually *do* archive spam (how do you think I pulled out the
three spoofs of the client email above), I can produce the 

> >While some C-R proponents claim filtering doesn't work, it clearly does.
> 
> In my opinion, no one spam fighting system is effective by itself, a
> good system should take advantage of all of them, to try and mitigate
> the worst features of each.

SA tends to provide this within a single package.  It doesn't restrict
itself to a single formula as other systems do.


Peace.

-- 
Karsten M. Self <[EMAIL PROTECTED]>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
    Defeat EU Software Patents!                         http://swpat.ffii.org/

_____________________________________________
tmda-users mailing list ([EMAIL PROTECTED])
http://tmda.net/lists/listinfo/tmda-users
