> In the meantime (if you have time), don't you think the
> strategy of pasting a portion of a legitimate message in with
> the spam is going to be troublesome? Mathematically it seems
> like a problem, since half or more of the message wouldn't
> look like spam.
If the 'legitimate message' part is actually composed of words (tokens) that are also in messages you have trained as ham, then yes, that could be a problem (if the ratio of those words is high enough). However (ignoring personally tailored messages for the moment), the chance of hitting on words that happen to be in your database as ham is pretty low, and there's the additional chance that a word will be used that's actually in your spam database. This is where an individual filter shines, since a word that's ham for you could be spam for me.

If the message is tailored to you (say it's a copy of a ham message that you received), then the chance is much higher that those tokens will be in your database as ham. However, this greatly raises the cost of sending that spam message to you. That sort of spam is extremely rare, since it's much more cost-effective to just send bulk mail out to everyone and rely on those without (effective) filters to generate your revenue.

It seems like there are two main methods of combating this spamming technique at the moment: using effective training (particularly training that keeps the database size small, which greatly reduces the chance of a random hit), and analysis techniques like DSPAM's "Bayesian Dobly".

> Here's the scoring and a good sample message. The scoring is
> higher than when it arrived because I used it to train as spam.

In the future, it would really help if you could send us clues prior to training - training changes the clue list drastically, especially with messages like this.

> # ham trained on: 19365
> # spam trained on: 1719

You have trained on a lot more ham than spam (11.3::1), which is probably the biggest problem here. SpamBayes works best with approximately even numbers of ham and spam - with this imbalance, everything will look a lot more like ham. That's also a fairly large database. It seems that the best results generally come from fairly small databases (a few hundred messages).
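To see why pasted legitimate text can drag a message's score toward ham, here is a toy sketch of combining per-token spam probabilities. This is a simplified naive-Bayes-style combination, not SpamBayes's actual chi-squared combining, and the probability values are made up for illustration:

```python
from math import prod

def combined_spamprob(token_probs):
    # Toy naive-Bayes-style combination of per-token spam probabilities:
    #   P = prod(p) / (prod(p) + prod(1 - p))
    # SpamBayes actually uses a chi-squared combining scheme; this
    # simpler formula is enough to show the dilution effect.
    p = prod(token_probs)
    q = prod(1 - t for t in token_probs)
    return p / (p + q)

# A short spam made up of strongly spammy tokens scores near 1.0:
spam_tokens = [0.95, 0.9, 0.97, 0.92]
print(combined_spamprob(spam_tokens))

# The same spam padded with hammy tokens copied from legitimate text
# (low per-token probabilities) is pulled sharply toward ham:
padded = spam_tokens + [0.1, 0.05, 0.08, 0.1, 0.07, 0.06]
print(combined_spamprob(padded))
```

This is exactly why small, balanced databases help: the fewer random tokens that already exist in your database as ham, the fewer low-probability hits the pasted text can score.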
It would definitely be worth retraining from scratch, and seeing if that resolves the problem. With Outlook, the best method would probably be 'train on mistakes' (i.e. train unsures, false positives, and false negatives). See <http://entrian.com/sbwiki/TrainingIdeas> for (a lot) more on training styles.

Since you're retraining, you might also like to try the "use_bigrams" option, which generally gives good results (and should be good with randomly appended words) and reduces the required training time. If you'd like to do this, open the file "default_bayes_customize.ini" in your data directory (create one if there isn't one already) in a text editor (like notepad or wordpad). Add these lines (excluding the """) to the end of the file:

"""
[Classifier]
x-use_bigrams: True
"""

=Tony.Meyer

--
Please always include the list ([email protected]) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.

_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
