Mr. A.J. O'Neill wrote on Friday, November 17, 2006 7:25 PM -0500:

> One of the many steps involved in the processing was the removal (or
> ignoring of) punctuation before searching for search tokens. I draw
> your attention to the following extract from a Spam Clues report:
>
>     'beneficiary'   0.844828   0   1
>     'beneficiary.'  0.844828   0   1
>
> I would argue that there is no difference between these two tokens
> and that the inclusion of the punctuation adds nothing to the process
> but in this instance is likely to give the token a lower score than
> may be appropriate.
This type of specific choice in the tokenizer resulted from testing in a
number of people's working environments. It was shown empirically to
improve classification. This suggests that the intuition behind your
argument, which I originally shared as well, is not correct for the
purpose of classifying email as ham/spam, at least at the time this was
tested. A lot of the small choices in Spambayes turn out to be the
results of empirical testing rather than intuition, and it's surprising
(non-intuitive) how often our intuition about our own language is
incorrect. If you're looking for a reason to explain the empirical
results, one possibility is that keeping the punctuation provides
differentiation based on grammar, as opposed to just word occurrence.
That is something you normally don't get from a tokenizer that only
recognizes words and not sentence structure.

> I also used a stop list of words which are so common that they are
> useless to index or use in search engines or other search indexes.
> Below are a number of instances of words which I believe are not
> appropriate tokens to use to differentiate between spam and ham
> emails.

There is a clash between the philosophy of naive Bayesian
classification and that of rule-based schemes. The idea behind
rule-based schemes is that we can tap human beings' pattern-recognition
ability to create rules that we then run in a computer. Since we can
recognize spam easily when we see it, we are the best experts to
consult when forming a rule set. The problem with this notion is that
computers are not currently capable of making inferences the way people
do, because the system architecture is so different. While people can
indeed reliably distinguish spam, often from only a part of the
message, they cannot reliably tell you how they made the decision.
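To make the point about punctuation concrete, here is a minimal sketch
(not the actual SpamBayes tokenizer, whose rules are considerably more
involved) of why a whitespace-based tokenizer produces 'beneficiary'
and 'beneficiary.' as distinct tokens, letting the classifier learn
separate probabilities for a word mid-sentence versus at a sentence
boundary:

```python
def tokenize(text):
    """Split on whitespace, keeping any punctuation attached to words.
    A toy illustration, not SpamBayes' real tokenizer."""
    return text.lower().split()

tokens = tokenize("You are the beneficiary. The beneficiary must reply.")
print(tokens)
# 'beneficiary.' (sentence-final) and 'beneficiary' (mid-sentence) show
# up as two different tokens, each with its own training counts.
```

Stripping the period would merge the two counts and throw away the
grammatical signal that the empirical testing suggests is useful.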
The aim of naive Bayesian classification is to avoid all the particular
problems of trying to construct a useful rule set and instead look at
simple statistical properties of language that do not require
human-like inference. The underlying model is fundamentally different:
a Bayesian classifier is not trying to emulate a speaker of natural
language. The approach has strengths as well as weaknesses.

One of the strengths is that you don't have to decide which words you
think are the best or worst spam indicators. If you tend to favor
rule-based approaches, this also looks like a huge weakness. The
classifier learns word probabilities by observing your message
classifications. To the extent that you are surprised by the spam
probabilities of individual words, you would make the classifier worse
by manually overriding the training results on a token-by-token basis.
This happens far more often than you would think. Words that are about
equally likely in spam and ham score somewhere near 0.5 and do not
contribute to the final score.

Another of the strengths is that the word probabilities vary widely
among different recipients. It's a strength because there is no such
thing as a ham word list that will reliably evade Bayesian classifiers.
It's also a weakness, if you wish to apply Bayesian methods on a server
without tracking the word probabilities separately for each mailbox.
What this suggests is that it is equally difficult to come up with a
list of words for the classifier to ignore that would work for most
users.

There is a fundamental disagreement in the approaches of Bayesian and
rule-based systems. Proponents of rule-based systems believe that
people can best identify which clues are most significant, while
proponents of Bayesian systems believe either that people cannot
reliably identify the most important clues, or that even if they can,
they don't care to do so.
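A rough sketch of the idea (not SpamBayes' actual scoring code, which
uses Gary Robinson's chi-squared combining) shows why hand-picked stop
lists are usually unnecessary: tokens seen about equally often in ham
and spam land near 0.5 and carry almost no weight. The counts below are
made up for illustration.

```python
def spamprob(spam_count, ham_count, nspam, nham):
    """Estimate P(spam | token) from training counts, assuming equal
    priors; a simplified Graham-style calculation, not SpamBayes' own."""
    if nspam == 0 or nham == 0:
        return 0.5
    s = spam_count / nspam   # token frequency in the spam corpus
    h = ham_count / nham     # token frequency in the ham corpus
    if s + h == 0:
        return 0.5           # unseen token: no evidence either way
    return s / (s + h)

# 'beneficiary' in 84 of 100 spams but only 2 of 100 hams:
print(spamprob(84, 2, 100, 100))   # strongly spammy, well above 0.9

# 'the' appears in roughly every message of both corpora:
print(spamprob(95, 95, 100, 100))  # exactly 0.5: contributes nothing
```

Common words effectively stop-list themselves, so manually excluding
them gains nothing, and manually overriding surprising probabilities
discards evidence the training data actually supports.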
The last condition is important if spam avoidance is simply a
utilitarian goal, not a hobby. Personally, I tried rule-based systems
first and then experimented with Spambayes. I found that my intuition
on word probabilities was indeed wrong a significant proportion of the
time, and the naive Bayesian approach did about as well as my
rule-based system at its peak. The Bayesian approach required much less
maintenance, and it works well for a wide variety of end-users without
requiring insight from them.

I still feel there are very useful rules for detecting spam that are
complementary to word frequency. These are things such as whether the
message comes from a particular mailing list, whether the sending IP is
on a DNS blacklist that I choose, or to which of my mailbox addresses
the message is addressed. My own compromise is to either put them in
the domain MTA, or to write Outlook rules that run before the Bayesian
classifier.

In terms of overall system architecture, I tend to believe that
rule-based approaches belong in the domain MTA whenever possible, and
should generate rejections during the SMTP session, preferably before
DATA. This eliminates most of the spam at the lowest possible system
cost and with the largest savings in bandwidth. You can eliminate
another significant amount of spam by running rule-based content
filters, such as SpamAssassin, in the MTA. This is very expensive, so
it is important to run it on as few messages as possible. It generates
rejections at the end of DATA, which are still useful for legitimate
messages that are improperly classified. For the spam that slips
through global rule-based systems, it then makes sense to do
computationally intensive and user-specific content filtering like
Spambayes in the MUA. The spam load is hopefully reduced enough that
the end-user doesn't mind scanning the junk folder for the occasional
false positive.
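The staged architecture above can be sketched as a simple decision
pipeline. Everything here is illustrative: the blacklist, rule scores,
thresholds, and toy classifier are invented for the example and do not
correspond to any real MTA's or SpamBayes' API.

```python
DNS_BLACKLIST = {"192.0.2.1"}   # example IP from the RFC 5737 test range
RULE_THRESHOLD = 5.0            # made-up rule-score cutoff
SPAM_CUTOFF = 0.9               # made-up Bayesian spam cutoff

def rule_score(body):
    """Toy rule-based content score: sum points for suspicious phrases."""
    rules = {"free money": 3.0, "act now": 3.0, "viagra": 5.0}
    text = body.lower()
    return sum(pts for phrase, pts in rules.items() if phrase in text)

class ToyClassifier:
    """Stand-in for a trained per-user Bayesian classifier."""
    def spamprob(self, body):
        return 0.95 if "beneficiary" in body.lower() else 0.1

def filter_message(sender_ip, body, classifier):
    # Stage 1: pre-DATA checks in the MTA (cheapest; saves bandwidth).
    if sender_ip in DNS_BLACKLIST:
        return "reject before DATA"
    # Stage 2: rule-based content filter at end of DATA (expensive,
    # so run only on messages that survive stage 1).
    if rule_score(body) >= RULE_THRESHOLD:
        return "reject at end of DATA"
    # Stage 3: user-specific Bayesian classification in the MUA.
    if classifier.spamprob(body) > SPAM_CUTOFF:
        return "file in junk folder"
    return "deliver to inbox"

clf = ToyClassifier()
print(filter_message("192.0.2.1", "hello", clf))
print(filter_message("203.0.113.5", "FREE MONEY, act now!", clf))
print(filter_message("203.0.113.5", "you are the beneficiary", clf))
print(filter_message("203.0.113.5", "lunch tomorrow?", clf))
```

Each stage only sees the traffic the previous stage couldn't reject, so
the expensive, personalized filtering runs on the fewest messages.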
--
Seth Goodman

_______________________________________________
SpamBayes@python.org
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html