> I wrote to you last week about Spam with Ham tacked on the end.
> I've retrained my database using the "exception" method and
> turned on bigrams.  This message came in at 2%.
>
> Thanks in advance for any further ideas you have.  SpamBayes
> gets most of my Spam, just not Spam using this new technique.
>
> 'send'            0.0636887  12  2
> 'long'            0.0768659   6  1
> 'sorry'           0.0918367   2  0
> 'subject:Video'   0.0918367   2  0
> 'take'            0.109281    9  3
> 'let'             0.120729    8  3
These were in the body of the spam, not the tacked-on bit.  They're
quite strong ham clues for you.  Training on a few more spams like
this should change that (assuming that they're in the same sort of
format, and assuming that your ham doesn't look like this).

> 'to:addr:above-the-garage.com'  0.3861  25  50

You've trained on twice as many spam messages as ham messages with
this token, but it still scores as strongly ham.  That's not good.
This is because of the imbalance in your training data (this is one
of the main issues that needs to be solved to make SpamBayes easier
to use).  For example, if you had trained on 118 ham (the same
number as spam), and none of the extra ones had this token in them
(so the counts stayed at 25 ham / 50 spam), then the score for this
token would be 0.67.  Similar changes apply to the other tokens in
the clues list.

Try grabbing a random selection of 81 ham messages (that will bring
the numbers into balance) and training on them, and see if that
helps.
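In case it helps to see where the 0.3861 and 0.67 come from, here's
a rough sketch of the per-token calculation SpamBayes does (Gary
Robinson's method).  The two trailing numbers on each clue line are
the ham and spam counts for that token; the 37 ham total below is
inferred from your numbers (118 spam, 81 more ham to reach balance),
and the function name is just for illustration:

    # Per-token spam probability, Robinson-style (a sketch, not the
    # actual SpamBayes code; s and x are the unknown_word_strength
    # and unknown_word_prob defaults).
    def spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
        hamratio = hamcount / float(nham)     # fraction of trained ham with the token
        spamratio = spamcount / float(nspam)  # fraction of trained spam with the token
        prob = spamratio / (hamratio + spamratio)
        n = hamcount + spamcount  # low counts shrink the score towards x
        return (s * x + n * prob) / (s + n)

    # 'to:addr:above-the-garage.com' (25 ham, 50 spam occurrences):
    print(spamprob(25, 50, 37, 118))    # ~0.3861 with today's 37:118 totals
    print(spamprob(25, 50, 118, 118))   # ~0.67 with balanced totals

Note that the token's own counts don't change at all; only the
totals do, and that alone flips it from a ham clue to a spam clue.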
=Tony.Meyer

-- 
Please always include the list ([email protected]) in your
replies (reply-all), and please don't send me personal mail about
SpamBayes.  http://www.massey.ac.nz/~tameyer/writing/reply_all.html
explains this.