On Mon, 27 Aug 2007, OliverScott wrote:
1. Most users don't know how, aren't allowed, or can't be bothered to train Bayes. In most cases SpamAssassin is left to auto-train Bayes.
Disagree. With proper training -- or if you make it trivially easy, like GMail/Yahoo's "Report as Spam" links -- then users will train Bayes.
2. Most people would consider the same emails to be SPAM. 90% of what I think is spam would also be what you think is spam, with only a small percentage of emails that we disagree on.
Strongly disagree. Many users consider anything they don't want to be spam, including all sorts of solicited email. I had one user who, rather than turn off email notifications from Facebook, reported them as spam until they started getting blocked. Since we've implemented a system where reporting a message as spam automatically blacklists the sender for the reporting user, I've had a number of reports of students blacklisting their professors because they didn't want some notification they got sent. Perhaps you and I might agree on what spam is, but Joe User does _not_.
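To make the side effect concrete, here is a minimal sketch of that kind of report-triggered, per-user blacklist. It is not our actual implementation; the `blacklists` table, the function names, and the addresses are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical per-user blacklist: reporting user -> set of blocked senders.
blacklists = defaultdict(set)

def report_spam(user, sender):
    """Reporting a message blacklists its sender, but only for the reporter."""
    blacklists[user].add(sender.lower())

def is_blocked(user, sender):
    """True if this user has previously reported this sender."""
    return sender.lower() in blacklists[user]

# A student reports an unwanted course notification as spam...
report_spam("student@example.edu", "Prof@example.edu")
```

After that one report the professor is blocked for that student, while every other user still receives the professor's mail.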
3. The emails which we would disagree on would probably be newsletters and advertising emails from legitimate companies. Unwanted newsletters and advertising emails which people have deliberately (possibly due to stupidity) signed up for should not be trained as SPAM, but should be manually blacklisted if necessary.
Again, you and I would probably find this to be the case, but you and Joe User (or I and Joe User) would not.
4. Site wide bayes saves disk space and more importantly it saves significantly on disk IO or memory requirements.
Not sure on this one. None of the performance statistics I gather showed any noticeable hit when I switched from sitewide to per-user.
5. A larger database leads to more accurate Bayesian identification - I am guessing this is right?
"It depends." :) With Bayes poisoning all the rage, it sometimes helps to avoid a really huge database. A few months ago, we started over and, for the first week or two, spam went up, but then it dropped to below previous levels; cleaning out the crap can help from time to time. So what's important is having a well-tuned database -- not necessarily a large database. If Joe and Jane User get different kinds of mail, disagree on what spam is, etc., then they should have different databases. (What if Joe receives a legitimate newsletter on stock tips, for instance?)
1. What I think of as HAM emails could be widely different from what you think of as HAM emails - if I were to sort your inbox by hand (without knowing you personally) I would probably delete some good emails by mistake while getting rid of the spam.
I again disagree. We retain all of the messages that users report as FPs and FNs, and, in general, the FPs are more obvious and certainly easier to agree on. I would never use the FNs as a spam corpus, for aforementioned reasons, but I think the FPs would be pretty reliable.
2. If a server has one customer who is a plumber and one who is an artist, site wide bayes would learn that emails containing the words pipes or canvas are good. The plumber will get emails with the word canvas in them tagged as bayes_00 and vice versa.
Agree, mostly. If you have one customer who is a day trader and one who works with Pfizer Canada, then they'll constantly be fighting each other because the former doesn't want spam about Viagra from our neighbors to the north and the latter doesn't want spam about the latest stock that's about to blow up. (This is obviously a contrived example, but you get the idea.) With a diverse user base, any sort of one-size-fits-all filtering is bound to increase FPs and FNs.
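A toy illustration of that tug-of-war, assuming a much-simplified smoothed token estimate rather than SpamAssassin's real Bayes math. The messages, the `train` and `spam_prob` helpers, and the variable names are invented for the example.

```python
from collections import Counter

def train(messages):
    """Count, per token, how many ham and spam messages contain it."""
    ham, spam = Counter(), Counter()
    n_ham = n_spam = 0
    for text, is_spam in messages:
        tokens = set(text.lower().split())
        if is_spam:
            spam.update(tokens)
            n_spam += 1
        else:
            ham.update(tokens)
            n_ham += 1
    return ham, spam, n_ham, n_spam

def spam_prob(token, db):
    """Smoothed estimate of P(spam | token); a simplification, not SA's formula."""
    ham, spam, n_ham, n_spam = db
    h = (ham[token] + 1) / (n_ham + 2)    # smoothed share of ham containing token
    s = (spam[token] + 1) / (n_spam + 2)  # smoothed share of spam containing token
    return s / (s + h)

# (text, is_spam) pairs for two hypothetical users.
trader_mail = [
    ("quarterly stock report attached", False),
    ("stock split announced today", False),
    ("cheap viagra shipped from canada", True),
]
pharma_mail = [
    ("viagra trial results from canada", False),
    ("viagra labeling update", False),
    ("hot stock about to explode", True),
]

per_user_trader = train(trader_mail)
per_user_pharma = train(pharma_mail)
site_wide = train(trader_mail + pharma_mail)

for name, db in [("trader", per_user_trader),
                 ("pharma", per_user_pharma),
                 ("site-wide", site_wide)]:
    print(name, round(spam_prob("viagra", db), 2), round(spam_prob("stock", db), 2))
```

In the per-user databases each token leans clearly one way for its owner. In the combined database the two users' training cancels out and both tokens land at a neutral 0.5, so neither user gets any help from them.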
3. If per user bayes is chosen then bayes_00 will only fire on emails containing words which have occurred in emails which YOU have received in the past and which scored low enough to be autolearned.
..or were expressly learned by the user. Agree.
4. If a HAM email is misclassified as SPAM then users are more likely to report this to their admin, or to train the filter themselves, than they are to report SPAM emails which slip through untagged. People will ignore a few spam slipping through but not false positives!
For some value of "few," I agree.
SPAM tokens are stored on a server wide basis - can be a LARGE database if this helps.
HAM tokens are stored on a per user basis - probably only needs a 1-2 MB file per user.
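For concreteness, the split layout proposed above might look something like the sketch below. This is not an existing SpamAssassin feature as far as this thread establishes; the `SplitBayesStore` class, its methods, and the simplified scoring are all hypothetical.

```python
from collections import Counter, defaultdict

class SplitBayesStore:
    """Hypothetical store: one shared spam table, one ham table per user."""

    def __init__(self):
        self.spam_tokens = Counter()             # site-wide spam token counts
        self.spam_msgs = 0
        self.ham_tokens = defaultdict(Counter)   # user -> ham token counts
        self.ham_msgs = Counter()                # user -> number of ham messages

    def learn_spam(self, text):
        """Spam training goes into the shared table, whoever reported it."""
        self.spam_tokens.update(set(text.lower().split()))
        self.spam_msgs += 1

    def learn_ham(self, user, text):
        """Ham training only touches the given user's table."""
        self.ham_tokens[user].update(set(text.lower().split()))
        self.ham_msgs[user] += 1

    def spam_prob(self, user, token):
        """Smoothed P(spam | token) for one user; a simplified estimate."""
        s = (self.spam_tokens[token] + 1) / (self.spam_msgs + 2)
        h = (self.ham_tokens[user][token] + 1) / (self.ham_msgs[user] + 2)
        return s / (s + h)

store = SplitBayesStore()
store.learn_spam("cheap viagra from canada")
store.learn_spam("hot stock tip")
store.learn_ham("trader", "stock report for today")
store.learn_ham("pharma", "viagra trial results")
```

In this sketch a user's own ham pulls a token's score down for that user only, but anything learned as spam lands in the shared table and moves the score for everyone.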
I think users would be just as adept at poisoning such a split database as they would be at poisoning a unified, site-wide database. In any reasonably diverse user base, what my fellow user thinks is spam should not affect what I get in my mailbox.

Chris St. Pierre
Unix Systems Administrator
Nebraska Wesleyan University