More years ago than I care to remember, I did a Master's thesis on incorporating time-dependent query terms in search queries used for searching "News" feeds. Part of the thesis involved implementing a test system.
One of the many steps involved in the processing was the removal (or ignoring) of punctuation before searching for search tokens. I draw your attention to the following extract from a Spam Clues report:

    'beneficiary'     0.844828  0  1
    'beneficiary.'    0.844828  0  1

I would argue that there is no difference between these two tokens, and that the inclusion of the punctuation adds nothing to the process but in this instance is likely to give the token a lower score than may be appropriate.

I further draw your attention to the following extracts from the same Spam Clues report:

    '+31633775038'    0.844828  0  1
    '30%'             0.844828  0  1
    '65%to'           0.844828  0  1
    '7.5.430'         0.867197  4  2
    '17/11/2006'      0.909938  1  2
    '268.14.7/537'    0.909938  1  2
    '5:56'            0.909938  1  2

While strings of numbers such as TCP/IP addresses may be useful in differentiating spam from ham, numbers, digits and currency amounts are generally not good choices for tokens. In particular, the date '17/11/2006' and time '5:56' tokens above can normally be considered random and are unlikely to be of any use in classifying spam/ham.

I also used a stop list of words which are so common that they are useless to index or to use in search engines or other search indexes. Below are a number of instances of words which I believe are not appropriate tokens for differentiating between spam and ham emails:

    'under'           0.814607  3  1
    'its'             0.862812  1  1
    'us.'             0.862812  1  1
    'our'             0.611666 16  2
    'when'            0.637817  7  1
    'that'            0.664752 19  3
    'all'             0.674394 12  2
    'around'          0.739628  4  1
    'it,'             0.848794  1  1
    'up,'             0.848794  1  1
    'p.m.'            0.813589  7  2
    'does'            0.814607  3  1

Generally I find the current version of SpamBayes to be a very useful tool, but I would like the ability to permanently set the value of a token. For example, I'd like to be able to set the token 'pharmacy' to value 1.0 to ensure that all emails containing it are classified as spam; likewise, I'd like to give certain terms the value 0.0 so that emails containing them are always classified as ham.
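To illustrate the kind of preprocessing I have in mind, here is a minimal sketch in Python. It is not SpamBayes code: the names normalize, token_score, STOP_WORDS and PINNED_SCORES are hypothetical, the stop list is only a sample of the words quoted above, and the default score of 0.5 for unseen tokens is an assumption. It also drops every numeric token for simplicity, so address-like strings that might be worth keeping would need their own rule.

```python
import re
import string

# Hypothetical stop list: a small sample of the words quoted above.
STOP_WORDS = {"under", "its", "us", "our", "when", "that",
              "all", "around", "it", "up", "does"}

# Hypothetical user-pinned scores: 1.0 means "always spam", 0.0 "always ham".
PINNED_SCORES = {"pharmacy": 1.0}

# Tokens built only from digits and common numeric separators,
# e.g. dates, times, percentages and version strings.
NUMERIC_RE = re.compile(r"[\d.,:/%\-]+")


def normalize(token):
    """Return a cleaned token, or None if the token should be ignored."""
    # Lower-case and remove leading/trailing punctuation only.
    token = token.lower().strip(string.punctuation)
    if not token:
        return None
    if NUMERIC_RE.fullmatch(token):
        return None
    if token in STOP_WORDS:
        return None
    return token


def token_score(token, learned_scores, default=0.5):
    """Look up a token's score, letting a pinned value override training."""
    if token in PINNED_SCORES:
        return PINNED_SCORES[token]
    return learned_scores.get(token, default)
```

With this sketch, 'beneficiary' and 'beneficiary.' collapse to the same token, while '17/11/2006', '5:56' and 'us.' are ignored. How a classifier would combine a pinned score with the other clues in a message is outside the scope of the sketch.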
Keep up the good work and I hope that my suggestions are worthwhile.

Regards
A.J. O'Neill
M. App. Sc., M.B. Computing, Grad. Dip. K.B.S.