> Back in http://mail.python.org/pipermail/spambayes/2006-January/018702.html ,
> Tim Peters and I had a dialog about training on unusual ham - monthly
> messages from http://www.boldtype.com. I just got another one and it
> scored 50% on the spam scale. The clues follow -
> [...]
> Combined Score: 50% (0.5)
> Internal ham score (*H*): 1
> Internal spam score (*S*): 1
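For context, here is a minimal sketch of how a 0.5 combined score can arise when both internal scores are close to 1. It assumes the chi-squared combining scheme as I understand SpamBayes to use it, where the final score is (S - H + 1) / 2. The function names `chi2q` and `combine` and the clue probabilities are illustrative, not taken from the SpamBayes source or from the message above.

```python
import math


def chi2q(x2, dof):
    """Chi-squared survival function for an even number of degrees of
    freedom (illustrative helper, not the SpamBayes implementation)."""
    assert dof % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.0.
    return min(total, 1.0)


def combine(probs):
    """Return (H, S, score) for a list of per-clue spam probabilities."""
    n = len(probs)
    if n == 0:
        # No clues at all: no evidence either way.
        return 0.0, 0.0, 0.5
    ln_s = sum(math.log(1.0 - p) for p in probs)
    ln_h = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_s, 2 * n)
    h = 1.0 - chi2q(-2.0 * ln_h, 2 * n)
    return h, s, (s - h + 1.0) / 2.0


# Hypothetical message: ten strongly hammy clues and ten strongly spammy ones.
mixed_h, mixed_s, mixed_score = combine([0.01] * 10 + [0.99] * 10)

# Hypothetical message with only strongly spammy clues, for comparison.
spam_h, spam_s, spam_score = combine([0.99] * 10)

print(mixed_h, mixed_s, mixed_score)
print(spam_h, spam_s, spam_score)
```

With strong evidence on both sides, H and S both end up near 1 and cancel, so the score sits in the middle of the unsure range instead of leaning either way.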

IOW, the message looked a lot like ham, *and* a lot like spam.

> # ham trained on: 1229
> # spam trained on: 20331

As others have said, there's quite an imbalance here, as well as quite a
large database. My personal opinion (which is backed up by at least some of
the research) is that larger databases are worse.

> '1950' 0.97619 0 9
> [...]
> 'broke' 0.997512 0 90
> 'accordance' 0.998921 0 208
> 'discreet' 0.999019 0 229

None of the spam clues look very spammy to me (although I don't know what
you consider spam, of course). Do you have any idea what the 9 to 90
messages that had these clues were? Were these all in some sort of 'word
salad' spam? If so, then perhaps avoiding training on these would help (and
I believe the large database and the imbalance will contribute to the
problem).

=Tony.Meyer

--
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
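For reference, the quoted clue lines appear to read as word, spam probability, ham count, spam count. The sketch below shows how such probabilities can be derived from the counts, assuming the Robinson-style adjustment I understand SpamBayes to apply, with what I believe are its default settings (`unknown_word_strength` of 0.45 and `unknown_word_prob` of 0.5). The function name `word_spamprob` is illustrative. Under these assumptions it reproduces the quoted values.

```python
def word_spamprob(ham_count, spam_count, nham, nspam,
                  unknown_word_strength=0.45, unknown_word_prob=0.5):
    """Estimate a word's spam probability from training counts.

    Assumes the word was seen at least once (ham_count + spam_count > 0).
    The two keyword defaults are assumed, not confirmed from the source.
    """
    # Normalise by corpus size so that an imbalance between the amount of
    # ham and spam trained is (partly) compensated for.
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    prob = spam_ratio / (ham_ratio + spam_ratio)

    # Pull the estimate towards unknown_word_prob when there is little
    # evidence (small n); the pull fades as n grows.
    n = ham_count + spam_count
    return ((unknown_word_strength * unknown_word_prob + n * prob)
            / (unknown_word_strength + n))


NHAM, NSPAM = 1229, 20331  # training counts quoted in the message

for word, spam_count in [("1950", 9), ("broke", 90),
                         ("accordance", 208), ("discreet", 229)]:
    print(word, round(word_spamprob(0, spam_count, NHAM, NSPAM), 6))

# Hypothetical word seen in 5 ham and 5 spam: equal raw counts, but the
# imbalance makes it look strongly hammy after normalisation.
balanced_counts_prob = word_spamprob(5, 5, NHAM, NSPAM)
print(balanced_counts_prob)
```

Note that a word never seen in ham gets a probability that depends only on how many spam messages contained it, so with 20331 trained spam even fairly ordinary words can accumulate enough hits to score above 0.99.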
