I've been thinking about this a lot, and I think there's a better way to do Bayesian filtering than the way it is currently being implemented.

The problem now is that too much information is being looked at. The solution is to look only at the parts of the message that are most likely to be spammy.

What do I mean, you ask? Basically, I suggest ignoring most of the message body and focusing on the headers - especially the subject - and extracting only selective data from the body. I would also include the names of the triggered rules from the headers as tokens, and let Bayes help score them - see the sketch below.
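To make this concrete, here's a rough sketch in Python of what such a tokenizer might look like. This is only an illustration, assuming the triggered rules can be read back out of an X-Spam-Status header; the "subj:"/"rule:" token prefixes are my own convention, not anything SpamAssassin does today:

    import email

    def header_tokens(raw_message):
        """Tokenize only the headers, plus the names of any triggered rules."""
        msg = email.message_from_string(raw_message)
        tokens = []

        # Subject words, prefixed so they can't collide with body tokens
        for word in (msg.get("Subject") or "").split():
            tokens.append("subj:" + word.lower())

        # Sender-related headers kept as whole tokens
        for hdr in ("From", "Reply-To", "Return-Path"):
            value = msg.get(hdr)
            if value:
                tokens.append(hdr.lower() + ":" + value.lower())

        # Triggered rule names, assuming a header like
        # "X-Spam-Status: Yes, score=7.1 tests=HTML_MESSAGE,URI_HEX"
        status = msg.get("X-Spam-Status", "")
        if "tests=" in status:
            rules = status.split("tests=", 1)[1].split()[0]
            tokens.extend("rule:" + r for r in rules.split(","))

        return tokens

Each rule name becomes just another token in the Bayes database, so rules that fire mostly on spam end up with a high spam probability automatically.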

From the BODY I would extract all URLs (maybe just the domain part), email addresses, phone numbers, maybe large text, and maybe strange HTML tags. The rest I would ignore.
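Again as a sketch, that body extraction could be a handful of regular expressions. These patterns are deliberately simplistic - real mail would also need handling for obfuscated URLs, IDN domains, and international phone formats:

    import re

    URL_RE = re.compile(r"https?://([^\s/'\">]+)", re.IGNORECASE)
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

    def body_tokens(body):
        """Extract only the 'hot' body features; ignore everything else."""
        tokens = []
        # Keep just the domain part of each URL
        tokens.extend("url:" + d.lower() for d in URL_RE.findall(body))
        tokens.extend("email:" + a.lower() for a in EMAIL_RE.findall(body))
        # Normalize phone numbers to digits only
        tokens.extend("phone:" + re.sub(r"\D", "", p)
                      for p in PHONE_RE.findall(body))
        return tokens

Everything else in the body simply never reaches the Bayes store, which is the whole point.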

WHY, you ask? The HOT SPAMMY parts of the message are the parts I specified - the BODY is the least spammy part. Looking at the body is like pouring warm water into hot water, or in the case of non-spam, cool water into cold water. Especially with spammers putting invisible non-spam text in the body, it dilutes the message - waters it down. But there are parts that spammers are far less likely to water down: the Subject and the links.

Remember - the thing that gives spam away is that spam wants you to DO SOMETHING. So spam has to catch your eye in the subject line, which makes the SUBJECT HOT. It also needs you to act, which usually means a link to click on, so the LINKS are HOT. Then there are things in the headers that they can't entirely conceal, so those are hot as well.

I know this is a somewhat different concept than what we are used to, but I think it deserves investigating.

Marc Perkel


