Greetings to all,
I'm not a member of SLUG, so if you expect a response from me, please
respond directly to my email address.
While doing a Google search today, I noticed some drivel about DSPAM vs.
SpamAssassin from back in October. It is obvious that the information
was posted by SpamAssassin bigots, as evidenced by the utter lack of any
real information and an abundance of bullsh*t.
I felt compelled to send my rebuttal to this list so that the
individuals on the list could get a better grasp of the issue and make
an _informed_ decision - something they've apparently been deprived of
for lack of a decent comparison.
As always, you're free to disagree with me - and I'm sure some will have
plenty of personal comments which are quite irrelevant to the issue
(evidence that the individual really has nothing useful to say). A
dying animal is always at its fiercest. I'm sure some will agree with
this rebuttal, some will disagree, and some on the list may continue to
not have a clue or not care. In any event, I felt an accurate rebuttal
of the DSPAM vs. SpamAssassin thread was necessary. So necessary that
I've decided to add this to my FAQ. It's possible some of the
information about SA's accuracy may have changed since October, and if
that's so - wonderful - but the result is still the same.
Finally, if you receive no response to this email from the original
authors of the thread, you may assume that they didn't receive it
because SpamAssassin marked it as a False Positive.
On with the show! =)
Q. Why should I use DSPAM instead of SpamAssassin?
A. SpamAssassin is a heuristic anti-spam tool which has been
well-respected among the open-source community due to its longevity. It
is, however, a dying animal. Statistical filters have greatly exceeded
heuristic filter capabilities in terms of both speed and accuracy. The
answer as to why DSPAM is a more sensible solution involves first
dispelling several myths about SpamAssassin:
* Myth 1: SpamAssassin is accurate enough for my network.
SpamAssassin is programmed to detect very specific
characteristics of emails. As a result, its rulesets require a
significant amount of updating and tuning, and it is only around
95% accurate at best (as reported by the documentation, and most
users). That is an error rate at least one HUNDRED times higher
than DSPAM's! While 95% may sound pretty close to DSPAM's level of
accuracy, it's worlds apart. 95% translates to 1
misclassification for every 20 messages. DSPAM can easily
achieve 99.95% (which is 1 misclassification in every 2000
messages) and even up to 99.984% (1 in about 6250). Out of the
box, SpamAssassin is nowhere near 95%, but closer to 90%, which
is 1 misclassification out of every 10. While some users may be
able to tune SpamAssassin to squeeze 99% out of it, that's still
only 1 in 100 compared to DSPAM's 1 in 2000 minimum. Even at
99.5%, this is only 1 out of 200 and is very poor compared to even
the most naive of statistical filters.
* Myth 2: SpamAssassin requires no training
The 95% accuracy level represents only well-tuned applications
of SpamAssassin. Unfortunately, SpamAssassin requires both
frequent updating of its rulesets and tuning in order to make it
live up to this mediocre level of filtering. While your users
won't spend time training it, your highly paid sysadmins will. If
you are an ISP, it is generally best to use a filter that lets
your users do the training instead, for two reasons: 1. it gives
users something to do with spam that improves the accuracy of the
filter; 2. it gives users a sense of satisfaction rather than
futility, so they don't spend that time calling your abuse
department.
* Myth 3: SpamAssassin makes a good front-end "firewall" for spam
Heuristic anti-spam tools are significantly less accurate than
statistical filters, and in the case of SpamAssassin, much
slower than DSPAM. Many people argue that SpamAssassin makes a
good front-line filter because it requires no training. Not so.
For starters, it requires significant tuning to achieve even
tolerable levels of accuracy, while many of the newer features
supported by DSPAM, such as pre-seeded dictionaries and shared
groups, make learning very fast. Secondly, if you're being
mailbombed then SpamAssassin is the last thing you want running
on your servers - I've seen it take up 99.9% CPU on some _small_
systems during a mailbomb. This is primarily due to the fact
that SpamAssassin is written in Perl, an interpreted language
with high overhead. DSPAM is written in C (a compiled language)
and experiences execution times between 0.03s and 0.10s on
average hardware (such as my laptop with its 4200RPM drive).
Also think about this: the resources you consume running
SpamAssassin as a first line of defense actually weaken the
second line of defense, because it's not getting the opportunity
to see and train on several messages...so you're killing your
servers only to get poor accuracy.
* Myth 4: Perl is designed for language processing, so
SpamAssassin is written in a more appropriate language.
WRONG! And let me preface this with the fact that I've had about
10 years of experience coding Perl. While Perl is very useful
for language processing and web applications, it is also an
extremely slow, interpreted language - even when compiled. The
average overhead for a single Perl process is around 2MB of RAM
or more on many systems. Even compiled Perl still requires the
use of a bootstrapped interpreter and bytecode translation. Perl
is slow as Christmas compared to a compiled language, and the
regular expression functions Perl touts for text extraction have
their roots in the C implementation of regular expressions,
which is much faster. DSPAM has very low-level string functions
coded in C which are extremely fast, effective, and don't even
require the use of processor-intensive regular expressions.
While Perl is useful for data extraction and reporting, it is
completely the wrong choice for language processing, especially
in a large-scale environment. If you were analyzing one mailbox,
Perl would be acceptable...but if you plan on running this on a
production system with live users, it is a death wish. Take it
from this Perl geek, and don't believe any other Perl geek who
tells you otherwise.
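
For anyone who wants to check the arithmetic in Myth 1, the accuracy
figures quoted there convert to "1 in N" error rates as sketched below.
This is only a small Python illustration of those numbers, not part of
DSPAM or SpamAssassin; the helper name one_in_n is made up for this
example.

```python
from fractions import Fraction


def one_in_n(accuracy_percent: str) -> Fraction:
    """Return N such that a filter with the given accuracy
    misclassifies 1 message in every N.

    The accuracy is passed as a string (e.g. "99.95") so that
    Fraction can parse it exactly, avoiding float rounding.
    """
    error_rate = 1 - Fraction(accuracy_percent) / 100
    if error_rate <= 0:
        raise ValueError("accuracy must be below 100%")
    return 1 / error_rate


# The accuracy levels mentioned in Myth 1.
for acc in ("90", "95", "99", "99.5", "99.95", "99.984"):
    print(f"{acc}% accurate -> 1 misclassification in {one_in_n(acc)}")

# How many times higher the error rate is at 95% than at 99.95%.
ratio = one_in_n("99.95") / one_in_n("95")
print(f"95% vs 99.95%: {ratio}x the error rate")
```

The point of working in error rates rather than accuracy percentages is
that the gap between 95% and 99.95% looks small as a percentage but is a
factor of 100 in misclassified messages.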
Now that we've taken a look at a few myths, let's analyze the differences
from a functional perspective which make DSPAM a more feasible solution
for anyone serious about spam filtering:
* Approach to Analysis
SpamAssassin is based primarily on a set of rules to detect the
individual characteristics of spam. DSPAM, on the other hand,
puts all of its weight on statistical filtering and supporting
algorithms. The advantage to using DSPAM's approach is that
almost all of the rules SpamAssassin uses to identify the
characteristics of spam are automatically performed by DSPAM's
approach. On top of this, because DSPAM's analysis is on a
per-user basis, it is able to determine just how important each
characteristic (or "rule" in SpamAssassin talk) is to each user,
rather than collectively. For example, SpamAssassin's first rule
checks whether the MUA is pine. Many users receive more spam than
legitimate mail from a pine MUA. DSPAM performs this check
automatically as part of its Bayesian analysis and calculates
the probability on a per-user basis, so a user who receives a
lot of innocent pine mail will get a more innocent probability
than someone whose only pine mail is spam. This keeps DSPAM very
lightweight and resource friendly. Out of SpamAssassin's 921
rules, only 133 were not already performed by the advanced
Bayesian filtering of DSPAM. Of those 133, 39 were duplicates,
range rules, or nearly identical rules; 33 were blackhole rules;
31 were rare, very low scoring, or meaningless rules; and 4 were
illogical. This left a total of 26 good rules performed by
SpamAssassin that were not performed by DSPAM. While these 26
remaining rules are good, they do not by themselves positively
identify spam, only a few underlying characteristics that may or
may not identify a particular message (innocent or spam).
* Learning Method
SpamAssassin provides the bonus of filtering right out of the
box due to its ruleset compilation and scoring mechanism. Once
it's tuned, however, there's no way to train it to learn new
spam or make it any better than it's going to get. DSPAM
supports the dynamic creation of seeded dictionaries to provide
a very rapid learning process. Many high-volume users start
filtering spam with DSPAM on the first day. One problem with
SpamAssassin's approach is that because its rules are fairly
"static", many spammers are actually downloading and installing
SpamAssassin to test how well their messages will get around
it. DSPAM's adaptive learning approach makes this practice
impossible. It is extremely difficult for a spammer to construct
a message that will circumvent a large number of users' filters
because each user's filter is very different. Advanced
algorithms such as Bayesian Noise Reduction make it
computationally infeasible to perform wordlist attacks and
similar attacks on DSPAM users. Because spam is ever-changing,
DSPAM provides a means of constantly adapting to new spam
without having to maintain a list of rules or update anything.
* Maintenance
Maintenance differs greatly between the two. SpamAssassin keeps
the maintenance focused primarily on the administrator, by
frequently providing new rulesets to keep up with ever-changing
spam. This is ideal if you have a dedicated SpamAssassin
administrator, and provides a much more effort-free experience
to the end-user. Unfortunately, it leaves the end-user with no
means of recourse or satisfaction when they receive a spam (and
they will receive several using a heuristic-based filter): not
only can they do nothing about it, but they are more likely to
complain about it as a result. DSPAM does just the opposite. No
additional administrative maintenance is required for DSPAM, and
the effort lies entirely on each end-user's shoulders to forward
the spams they receive into the system. Keeping the
responsibility with the user has the added bonus of letting your
users choose whether or not they want DSPAM to filter their
spam. By not using it, they effectively opt out of spam
filtering.
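
To illustrate the per-user point made under "Approach to Analysis", here
is a rough Python sketch of how a per-user token probability could be
computed. It uses a simple Graham-style estimate and is an assumption
for illustration only, NOT DSPAM's actual code or algorithm. The class
UserStats, the token name "mua:pine", and the sample messages are all
hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UserStats:
    """Per-user training counts (hypothetical structure, not DSPAM's)."""
    spam_total: int = 0
    ham_total: int = 0
    spam_hits: Counter = field(default_factory=Counter)
    ham_hits: Counter = field(default_factory=Counter)

    def train(self, tokens, is_spam):
        """Record one message; each token counts once per message."""
        unique = set(tokens)
        if is_spam:
            self.spam_total += 1
            self.spam_hits.update(unique)
        else:
            self.ham_total += 1
            self.ham_hits.update(unique)

    def token_probability(self, token):
        """Estimate how spammy a token is for this user (0.0 to 1.0).

        Returns a neutral 0.5 when there is not enough data.
        """
        if self.spam_total == 0 or self.ham_total == 0:
            return 0.5
        spam_rate = self.spam_hits[token] / self.spam_total
        ham_rate = self.ham_hits[token] / self.ham_total
        if spam_rate + ham_rate == 0:
            return 0.5
        return spam_rate / (spam_rate + ham_rate)


# Alice gets lots of innocent pine mail; Bob's only pine mail is spam.
alice = UserStats()
bob = UserStats()
for i in range(10):
    alice.train(["mua:pine", "meeting"], is_spam=False)
    alice.train(["mua:pine"] if i == 0 else ["cheap-pills"], is_spam=True)
    bob.train(["meeting"], is_spam=False)
    bob.train(["mua:pine"] if i < 5 else ["cheap-pills"], is_spam=True)

print("alice:", alice.token_probability("mua:pine"))
print("bob:  ", bob.token_probability("mua:pine"))
```

The same token comes out innocent-leaning for Alice and spam-leaning for
Bob, which is the behaviour a single global rule score cannot give you.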
Overall, DSPAM is the more sensible choice for any network that is
serious about spam filtering. It's faster and more accurate, and better
equipped to act as a first line of defense against spam...so ditch that
old dinosaur and give DSPAM a test drive. You won't be disappointed.
--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html