Re: Some thoughts on Bayesian Setup...

Chris St. Pierre Mon, 27 Aug 2007 07:47:22 -0700

On Mon, 27 Aug 2007, OliverScott wrote:

1. Most users don't know how, aren't allowed, or can't be bothered to train
Bayes. In most cases SpamAssassin is left to auto-train Bayes.


Disagree.  With proper training -- or if you make it trivially easy,
like Gmail/Yahoo's "Report as Spam" links -- users will train Bayes.

2. Most people would consider the same emails to be SPAM. 90% of what I
think is spam would also be what you think is spam, with only a small
percentage of emails that we disagree on.


Strongly disagree.  Many users consider anything they don't want to be
spam, including all sorts of solicited email.  I had one user who,
rather than turn off email notifications from Facebook, reported them
as spam until they started getting blocked.  Since we've implemented a
system where reporting a message as spam automatically blacklists the
sender for the reporting user, I've had a number of reports of
students blacklisting their professors because they didn't want some
notification they got sent.

Perhaps you and I might agree on what spam is, but Joe User does _not_.

3. The emails which we would disagree on would probably be newsletters and
advertising emails from legitimate companies. Unwanted newsletters and
advertising emails which people have deliberately (possibly due to
stupidity) signed up to should not be trained as SPAM, but should be
manually blacklisted if necessary.


Again, you and I would probably agree on this, but you and Joe
User (or I and Joe User) would not.

4. Site wide bayes saves disk space and more importantly it saves
significantly on disk IO or memory requirements.


Not sure on this one.  None of the performance statistics I gather showed
any noticeable hit when I switched from site-wide to per-user.

5. A larger database leads to more accurate Bayesian identification - I am
guessing this is right?


"It depends." :)  With Bayes poisoning all the rage, it sometimes
helps to avoid a really huge database.  A few months ago, we started
over and, for the first week or two, spam went up, but then it dropped
to below previous levels; cleaning out the crap can help from
time to time.

So what's important is having a well-tuned database -- not necessarily
a large database.  If Joe and Jane User get different kinds of mail,
disagree on what spam is, etc., then they should have different
databases.  (What if Joe receives a legitimate newsletter on stock
tips, for instance?)

1. What I think of as HAM emails could be widely different from what you
think of as HAM emails - if I were to sort your inbox by hand (without
knowing you personally) I would probably delete some good emails by mistake
while getting rid of the spam.


I again disagree.  We retain all of the messages that users report as
FPs and FNs, and, in general, the FPs are more obvious and certainly
easier to agree on.  I would never use the FNs as a spam corpus, for
aforementioned reasons, but I think the FPs would be pretty reliable.

2. If a server has one customer who is a plumber and one who is an artist,
site wide bayes would learn that emails containing the words pipes or canvas
are good. The plumber will get emails with the word canvas in them tagged as
bayes_00 and vice versa.


Agree, mostly.

If you have one customer who is a day trader and one who works with
Pfizer Canada, then they'll constantly be fighting each other because
the former doesn't want spam about Viagra from our neighbors to the
north and the latter doesn't want spam about the latest stock that's
about to blow up.  (This is obviously a contrived example, but you get
the idea.)

With a diverse user base, any sort of one-size-fits-all filtering is
bound to increase FPs and FNs.
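
To put rough numbers on that, here is a toy sketch in Python.  It is not
SpamAssassin's actual Bayes code, just a naive per-token estimate with
add-one smoothing.  The function name and all of the message counts are
made up for illustration.

```python
def token_spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Naive estimate of P(spam | token) with add-one smoothing.

    spam_count / ham_count: messages of each class containing the token.
    n_spam / n_ham: total messages learned in each class.
    """
    p_token_given_spam = (spam_count + 1) / (n_spam + 2)
    p_token_given_ham = (ham_count + 1) / (n_ham + 2)
    return p_token_given_spam / (p_token_given_spam + p_token_given_ham)


# Hypothetical counts for the token "stock"; each user has learned
# 100 spam and 100 ham messages.

# Day trader: "stock" shows up in a lot of wanted mail.
trader = token_spam_probability(spam_count=20, ham_count=60, n_spam=100, n_ham=100)

# Pfizer employee: "stock" shows up almost only in spam.
pfizer = token_spam_probability(spam_count=20, ham_count=0, n_spam=100, n_ham=100)

# Site-wide: both users' counts are pooled into a single database.
sitewide = token_spam_probability(spam_count=40, ham_count=60, n_spam=200, n_ham=200)

print(f"trader={trader:.2f} pfizer={pfizer:.2f} sitewide={sitewide:.2f}")
```

With these invented counts the pooled estimate lands between the two
per-user estimates, so it is too spammy for the trader and too hammy for
the Pfizer employee.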

3. If per user bayes is chosen then bayes_00 will only fire on emails
containing words which have occurred in emails which YOU have received in
the past and which scored low enough to be autolearned.


...or were expressly learned by the user.  Agree.
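
For anyone unfamiliar with the autolearn step, the decision can be
sketched roughly as below.  This is a simplification and not
SpamAssassin's real logic, which applies further conditions before
learning.  The function name is hypothetical, and the default thresholds
are an assumption: they are what I believe stock SpamAssassin uses for
bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam,
so check your own configuration.

```python
def autolearn_decision(score, ham_threshold=0.1, spam_threshold=12.0):
    """Return "ham", "spam", or None for a message's total score.

    Simplified sketch: only the two score thresholds are considered.
    The defaults are assumed values, not read from any real config.
    """
    if score < ham_threshold:
        return "ham"    # low enough to be learned as ham
    if score > spam_threshold:
        return "spam"   # high enough to be learned as spam
    return None         # in between: not autolearned at all


for score in (-1.0, 5.0, 15.0):
    print(score, autolearn_decision(score))
```

Anything in the middle band is never autolearned, so it only enters the
database if a user trains on it explicitly.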

4. If a HAM email is misclassified as SPAM then users are more likely to
report this to their admin or to train the filter themselves, than for SPAM
emails which are not tagged. People will ignore a few spam slipping through
but not false positives!


For some value of "few," I agree.

SPAM tokens are stored on a server wide basis - can be a LARGE database if
this helps
HAM tokens are stored on a per user basis - probably only needs a 1-2 MB file
per user.


I think users would be just as adept at poisoning such a split
database as they would be at poisoning a unified, site-wide database.
In any reasonably diverse user base, what my fellow user thinks is
spam should not affect what I get in my mailbox.
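
A toy model of the split layout shows the problem.  This is only a
sketch under my own assumptions: the class, the user name "alice", the
tokens, and the crude scoring formula are all hypothetical, and none of
it is how SpamAssassin stores or scores tokens.

```python
from collections import Counter


class SplitBayes:
    """Toy split database: one shared spam table, per-user ham tables."""

    def __init__(self):
        self.shared_spam = Counter()  # site-wide spam token counts
        self.user_ham = {}            # username -> Counter of ham token counts

    def learn_spam(self, tokens):
        # Any user's spam report lands in the shared table.
        self.shared_spam.update(set(tokens))

    def learn_ham(self, user, tokens):
        # Ham is learned only into the reporting user's own table.
        self.user_ham.setdefault(user, Counter()).update(set(tokens))

    def spamminess(self, user, token):
        """Crude score in (0, 1): shared spam count vs. this user's ham count."""
        spam = self.shared_spam[token]
        ham = self.user_ham.get(user, Counter())[token]
        return (spam + 1) / (spam + ham + 2)


db = SplitBayes()

# A professor's notifications are ham for one student (hypothetical "alice")...
for _ in range(5):
    db.learn_ham("alice", ["syllabus", "assignment"])
before = db.spamminess("alice", "assignment")

# ...but another student keeps reporting the same notifications as spam.
for _ in range(20):
    db.learn_spam(["syllabus", "assignment"])
after = db.spamminess("alice", "assignment")

print(f"before={before:.2f} after={after:.2f}")
```

Alice never changed her own training, yet her score for "assignment"
moves from the ham side to the spam side purely because of someone
else's reports.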

Chris St. Pierre
Unix Systems Administrator
Nebraska Wesleyan University
