[SLUG] Re: DSPAM vs SpamAssassin FYI

Jonathan A. Zdziarski Sat, 21 Feb 2004 20:10:50 -0800

Greetings to all,

I'm not a member of SLUG, so if you expect a response from me, please
respond directly to my email address.


While doing a Google search today, I noticed some drivel posted back in
October about DSPAM vs. SpamAssassin. It is obvious that the information
was posted by SpamAssassin bigots, as evidenced by the utter lack of any
real information and an abundance of bullsh*t.

I felt compelled to send my rebuttal to this list so that the
individuals on the list could get a better grasp of the issue and make
an _informed_ decision - something they've apparently been deprived of
by a poorly constructed comparison.

As always, you're free to disagree with me - and I'm sure some will have
plenty of personal comments which are quite irrelevant to the issue
(evidence that the individual really has nothing useful to say).  A
dying animal is always at its fiercest. I'm sure some will agree with
this rebuttal, some will disagree, and some on the list may continue to
not have a clue or not care.  In any event, I felt an accurate rebuttal
of the DSPAM vs. SpamAssassin thread was necessary.  So necessary that
I've decided to add this to my FAQ.  It's possible some of the
information about SA's accuracy may have changed since October, and if
that's so - wonderful - but the result is still the same.

Finally, if you receive no response to this email from the original
authors of the thread, you may assume that they didn't receive it
because SpamAssassin marked it as a False Positive.

On with the show! =)

Q. Why should I use DSPAM instead of SpamAssassin?
A. SpamAssassin is a heuristic anti-spam tool which has been
well-respected among the open-source community due to its longevity. It
is, however, a dying animal. Statistical filters have greatly exceeded
heuristic filter capabilities in terms of both speed and accuracy. The
answer as to why DSPAM is a more sensible solution involves first
dispelling several myths about SpamAssassin:

      * Myth 1: SpamAssassin is accurate enough for my network.
        SpamAssassin is programmed to detect very specific
        characteristics of emails. As a result, its rulesets require a
        significant amount of updating and tuning, and it is only around
        95% accurate at best (as reported by the documentation, and most
        users), an error rate at least one HUNDRED times higher than
        DSPAM's! While 95% may sound pretty close to DSPAM's level of
        accuracy, it's worlds apart. 95% translates to 1
        misclassification for every 20 messages. DSPAM can easily
        achieve 99.95% (which is 1 misclassification in every 2000
        messages) and even up to 99.984% (1 in about 6250). Out of the
        box, SpamAssassin is nowhere near 95%, but closer to 90%, which
        is 1 misclassification out of 10. While some users may be able
        to tune SpamAssassin to squeeze 99% out of it, that's still only
        1 in 100 compared to DSPAM's 1 in 2000 minimum.  Even at 99.5%,
        this is only 1 out of 200 and is very poor compared to even the
        most naive of statistical filters.
        
      * Myth 2: SpamAssassin requires no training
        The 95% accuracy level represents only well-tuned applications
        of SpamAssassin. Unfortunately, SpamAssassin requires both
        frequent updating of its rulesets and tuning in order to make it
        live up to this mediocre level of filtering. While your users
        won't spend time training it, your high-paid sysadmins will. If
        you are an ISP, it is generally best to use a filter that allows
        your users to train instead for two reasons: 1. it gives the
        users something to do with spam, to improve the accuracy of the
        filter. 2. it gives the user a sense of satisfaction rather than
        futility, so they don't spend that time calling your abuse
        department.
        
      * Myth 3: SpamAssassin makes a good front-end "firewall" for spam
        Heuristic Anti-Spam tools are significantly less accurate than
        statistical filters, and in the case of SpamAssassin, much
        slower than DSPAM. Many people argue that SpamAssassin makes a
        good front-line filter because it requires no training. Not so.
        For starters, it requires significant tuning to achieve even
        tolerable levels of accuracy. Many of the newer features
        supported by DSPAM such as pre-seeded dictionaries and shared
        groups make learning very fast. Secondly, if you're being
        mailbombed then SpamAssassin is the last thing you want running
        on your servers - I've seen it take up 99.9% CPU on some _small_
        systems during a mailbomb. This is primarily due to the fact
        that SpamAssassin is written in PERL, an interpreted language
        with high overhead. DSPAM is written in C (a compiled language)
        and experiences execution times between 0.03s and 0.10s on
        average hardware (such as my 4200RPM laptop). Also think about
        this: The amount of resources you consume running SpamAssassin
        as a first line of defense actually weakens the second line of
        defense, because it's not getting the opportunity to see and
        train on several messages...so you're killing your servers only
        to get poor accuracy.
        
      * Myth 4: PERL is designed for language processing, so
        SpamAssassin is written in a more appropriate language.
        WRONG! And let me preface this with the fact that I've had about
        10 years of experience coding PERL. While PERL is very useful
        for language processing and web applications, it is also an
        extremely slow, interpreted language - even when compiled. The
        average overhead for a single PERL process is around 2MB of RAM
        on many systems, or more. Even compiled PERL still requires the
        use of a bootstrapped interpreter and bytecode translation. PERL
        is slow as Christmas compared to a compiled language, and the
        regular expression functions PERL touts for text extraction have
        their roots in the C implementation of regular expressions,
        which are much faster. DSPAM has very low-level string functions
        coded in C which are extremely fast, effective, and don't even
        require the use of processor-intensive regular expressions.
        While PERL is useful for data extraction and reporting, it is
        the completely wrong choice for language processing, especially
        in a large-scale environment. If you were analyzing one mailbox,
        PERL would be acceptable...but if you plan on running this on a
        production system with live users, it is a death wish. Take it
        from this PERL geek, and don't believe any other PERL geek who
        tells you otherwise.
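
The accuracy figures quoted under Myth 1 can be checked with a few lines
of Python. This is only an arithmetic sketch: the helper name one_in_n
is made up for illustration, and the percentages are the ones quoted
above, not new measurements.

```python
from fractions import Fraction


def one_in_n(accuracy_percent: str) -> Fraction:
    """Return N such that the error rate is 1 misclassification in N messages."""
    # Fraction parses the decimal string exactly, avoiding float rounding.
    error_rate = 1 - Fraction(accuracy_percent) / 100
    return 1 / error_rate


for pct in ("90", "95", "99", "99.5", "99.95", "99.984"):
    n = float(one_in_n(pct))
    print(f"{pct}% accurate -> 1 misclassification in {n:g} messages")

# How many times higher the error rate is at 95% than at 99.95%.
print(one_in_n("99.95") / one_in_n("95"))
```

Run as-is, this reports 1 in 20 for 95% and 1 in 2000 for 99.95%, so the
error rates differ by a factor of 100.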
        
Now that we've taken a look at a few myths, let's analyze the differences
from a functional perspective which make DSPAM a more feasible solution
for anyone serious about spam filtering:

      * Approach to Analysis
        SpamAssassin is based primarily on a set of rules to detect the
        individual characteristics of spam. DSPAM, on the other hand,
        puts all of its weight on statistical filtering and supporting
        algorithms. The advantage to using DSPAM's approach is that
        almost all of the rules SpamAssassin uses to identify the
        characteristics of spam are automatically performed by DSPAM's
        approach. On top of this, because DSPAM's analysis is on a
        per-user basis, it is able to determine just how important each
        characteristic (or "rule" in SpamAssassin talk) is to each user,
        rather than collectively. For example, SpamAssassin's first rule
        is to identify if the MUA is pine. Many users receive more spams
        from a pine MUA than not. DSPAM performs this automatically as
        part of its Bayesian analysis and is able to calculate the
        probability on a per-user basis, so a user who receives a lot of
        innocent pine mail will get a more innocent probability than
        someone whose only pine mail is spam. This keeps DSPAM very
        lightweight and resource friendly. Out of SpamAssassin's 921
        rules, only 133 rules were not performed by the advanced
        Bayesian filtering of DSPAM. Out of that 133, 39 were
        duplicates, range rules, or nearly identical rules. 33 were
        blackhole rules, 31 were rare, very low scoring, or unmeaningful
        rules, and 4 were illogical. This left a total of 26 good rules
        performed by SpamAssassin that were not performed by DSPAM.
        While these 26 remaining rules are good, they themselves do not
        positively identify spam, but only a few underlying
        characteristics that may or may not identify a particular
        message (innocent or spam).
        
      * Learning Method
        SpamAssassin provides the bonus of filtering right out of the
        box due to its ruleset compilation and scoring mechanism. Once
        it's tuned, however, there's no way to train it to learn new
        spam or make it any better than it's going to get. DSPAM
        supports the dynamic creation of seeded dictionaries to provide
        a very rapid learning process. Many high-volume users start
        filtering spam with DSPAM on the first day. One problem with
        SpamAssassin's approach to this is that because SpamAssassin's
        rules are fairly "static", many spammers are actually
        downloading and installing SpamAssassin to test how well their
        messages will get around SpamAssassin. DSPAM's adaptive learning
        approach makes this practice impossible. It is extremely
        difficult for a spammer to construct a message that will
        circumvent a large number of users' filters because each user's
        filter is very different. Advanced algorithms such as Bayesian
        Noise Reduction make it computationally infeasible to perform
        wordlist attacks and other such attacks on DSPAM users.
        Because spam is ever-changing, DSPAM provides a means of
        constantly adapting to the new spam without having to maintain a
        list of rules, or update anything.
        
      * Maintenance
        Maintenance between the two is very different. SpamAssassin
        keeps the maintenance focused primarily around the
        administrator, by providing new rulesets frequently to meet the
        constant changing of spam. This is ideal if you have a dedicated
        SpamAssassin administrator, and provides a much more effort-free
        experience to the end-user, but unfortunately leaves the
        end-user with no means of recourse or satisfaction when they
        receive a spam (and they will receive several using a heuristic
        based filter), meaning not only can they do nothing about it,
        but your customers are more likely to complain about it as a
        result. DSPAM does just the opposite. No additional
        administrative maintenance is required for DSPAM, and the effort
        lies entirely on each end-user's shoulders to forward spams they
        receive into the system. The added bonus of keeping the
        responsibility to the user enables your users to choose whether
        or not they want to have DSPAM filter their spam. By not using
        it, they effectively opt out of spam filtering.
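
The per-user idea described under Approach to Analysis and Learning
Method can be illustrated with a toy example. This is a minimal sketch,
not DSPAM's actual algorithm or API: the class UserFilter, the token
names "mua:pine" and "mua:other", and the smoothing constants are all
assumptions made for illustration.

```python
from collections import Counter


class UserFilter:
    """Toy per-user token statistics (hypothetical; not DSPAM's real code)."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.spam_msgs = 0
        self.ham_msgs = 0

    def train(self, tokens, is_spam):
        # Count each distinct token once per message.
        if is_spam:
            self.spam_tokens.update(set(tokens))
            self.spam_msgs += 1
        else:
            self.ham_tokens.update(set(tokens))
            self.ham_msgs += 1

    def token_probability(self, token):
        # Smoothed share of spam vs. innocent messages containing the token.
        p_spam = (self.spam_tokens[token] + 1) / (self.spam_msgs + 2)
        p_ham = (self.ham_tokens[token] + 1) / (self.ham_msgs + 2)
        return p_spam / (p_spam + p_ham)


# Alice gets mostly innocent mail from pine users; Bob only sees pine in spam.
alice, bob = UserFilter(), UserFilter()
for _ in range(8):
    alice.train(["mua:pine"], is_spam=False)
alice.train(["mua:pine"], is_spam=True)
for _ in range(3):
    alice.train(["mua:other"], is_spam=True)

for _ in range(6):
    bob.train(["mua:other"], is_spam=False)
for _ in range(4):
    bob.train(["mua:pine"], is_spam=True)

print(f"alice: {alice.token_probability('mua:pine'):.2f}")
print(f"bob:   {bob.token_probability('mua:pine'):.2f}")

# A user forwarding a missed spam retrains only that user's statistics.
before = alice.token_probability("mua:pine")
alice.train(["mua:pine"], is_spam=True)
after = alice.token_probability("mua:pine")
print(f"alice after one more pine spam: {before:.2f} -> {after:.2f}")
```

The same token ends up with a low spam probability for Alice and a high
one for Bob, and each correction a user submits moves only that user's
numbers. A real statistical filter combines many such token
probabilities per message; that step is left out here.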
        
Overall, DSPAM is the more sensible choice for any network that is
serious about spam filtering. It's faster and more accurate, and better
equipped to act as a first line of defense against spam...so ditch that
old dinosaur and give DSPAM a test drive; you won't be disappointed.


-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
