On Fri, 14 Jan 2005 09:19:37 +1300, "Tony Meyer" <[EMAIL PROTECTED]> wrote:
>> >With 'classic' train to exhaustion, the database is kept exactly
>> >balanced, I believe.  How well is your system working for you?
>>
>> Erm, not all that well. :|
>
>:(  I'm trying to get things rearranged a little for 1.1 so that it's easier
>to try out different training regimes (including tte) with the various apps,
>so hopefully that'll help.

Ah, that sounds good. You mean by making it easier to tweak the training
code, or by exposing more options in the user interface?

>> My incoming mail is very unbalanced - 17:1 spam:ham since I
>> started the training - which can't help, but so far I have
>> 18% unsure spam and 3% false negatives.  No mistakes on ham
>> though; none scored higher than 0.5%.  Given that, I suppose I
>> could simply mess with the thresholds.
>
>I've read reports of people who have done that (in an extreme way, so that
>the cutoffs are 5% and 10% or something like that).  It seems pretty risky
>to me, though, since a message that contains nothing that has been seen
>before will score 0.5 and that would be same under that system...

I don't think I'd need to go that far. Most of the unsure spam I get is in
the 70-90% range. The FNs are all 419-type scams - got to give the Nigerians
points for effort, they're laboriously written and different every time.

BTW, I'm also seeing better results since finally re-enabling my
SpamCopAndAssassin patch and retraining (was running vanilla 1.0.1 before).
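
To make the threshold trade-off concrete, here is a minimal sketch of how two cutoffs split the score range into ham, unsure and spam. The function name and the 0.20/0.90 default cutoffs are assumptions for illustration, not taken from the SpamBayes source. The 5%/10% pair is the extreme case mentioned above, and 0.70 is just a guess at where the spam cutoff could go to catch the 70-90% unsures.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a combined score in [0, 1] to 'ham', 'unsure' or 'spam'.

    The cutoff defaults are illustrative assumptions, not values
    read out of any particular SpamBayes configuration.
    """
    if score < ham_cutoff:
        return "ham"
    if score < spam_cutoff:
        return "unsure"
    return "spam"


# A message with no previously seen tokens scores 0.5.
never_seen = 0.5

# With the extreme 5%/10% cutoffs it lands in the spam folder...
print(classify(never_seen, ham_cutoff=0.05, spam_cutoff=0.10))

# ...whereas only lowering the spam cutoff to 0.70 leaves it unsure,
# while an 80% message that used to be unsure now counts as spam.
print(classify(never_seen, ham_cutoff=0.20, spam_cutoff=0.70))
print(classify(0.80, ham_cutoff=0.20, spam_cutoff=0.70))
```

The point is that a modest move of the spam cutoff leaves the 0.5 "know nothing" score safely in the unsure band, which the 5%/10% scheme does not.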

The URL blacklist support (http://surbl.org) recently added to SpamAssassin
seems to make for particularly good spam clues; from the most recent spam to
come in:

Combined Score: 100% (0.999917)
Internal ham score (*H*): 2.23853e-005
Internal spam score (*S*): 0.999857

# ham trained on: 187
# spam trained on: 193

29 Significant Tokens

token                               spamprob    #ham  #spam
'out:'                              0.0918367      2      0
'viagra,'                           0.155172       1      0
'to:addr:[munged]'                  0.299867      84     37
'url:index'                         0.301585      16      7
'url:com'                           0.603076      79    124
'private'                           0.611308       3      5
'over'                              0.616477      18     30
'discount'                          0.648476       2      4
'save'                              0.654963       5     10
'proto:http'                        0.656739      87    172
'url:'                              0.694878      22     52
'sell'                              0.719354       1      3
'prescription'                      0.801118       2      9
'header:Received:8'                 0.802243       7     30
'70%'                               0.805954       1      5
'sa_rule:3.0:DRUGS_ERECTILE'        0.84212        4     23
'drugs.'                            0.844828       0      1
'shipping!'                         0.844828       0      1
'subject:discount'                  0.844828       0      1
'x-mailer:microsoft outlook [snip]  0.89925        1     11
'generic'                           0.908163       0      2
'subject:without'                   0.908163       0      2
'sa_rule:3.0:URIBL_SBL'             0.939819       6    100
'required!!'                        0.949438       0      4
'sa_rule:3.0:URIBL_AB_SURBL'        0.965581       2     64
'thanks:'                           0.969799       0      7
'sa_rule:3.0:URIBL_WS_SURBL'        0.973332       3    121
'sa_rule:3.0:URIBL_OB_SURBL'        0.976995       2     97
'sa_rule:3.0:URIBL_SC_SURBL'        0.977448       2     99

Some spammers have now resorted to removing explicit links from their spam
and asking recipients to cut and paste an address into their browser,
apparently to avoid their URLs automatically being picked up and added to
these blacklists.

-- 
Mat.
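
As an aside, the spamprob column in the clue list above can be reproduced from the #ham/#spam counts and the training totals (187 ham, 193 spam). What follows is only a sketch: the function names are made up, and the smoothing constants s=0.45 and x=0.5 are assumptions chosen because they match the figures shown, so check them against your own configuration before relying on them. The combined score likewise appears to be (S - H + 1) / 2.

```python
def spamprob(ham_count, spam_count, nham, nspam, s=0.45, x=0.5):
    """Smoothed per-token spam probability.

    s (weight given to the prior) and x (prior probability for an
    unseen token) are assumed values that happen to reproduce the
    table; they are not read from any configuration file.
    """
    n = ham_count + spam_count
    if n == 0:
        return x  # never-seen token: fall back to the prior
    # Use ratios rather than raw counts so unbalanced training
    # (different numbers of ham and spam) does not skew the estimate.
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    prob = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the raw estimate towards x; the pull weakens as n grows.
    return (s * x + n * prob) / (s + n)


def combined_score(h, s):
    """Combine the internal ham (*H*) and spam (*S*) scores."""
    return (s - h + 1.0) / 2.0


NHAM, NSPAM = 187, 193

# Compare with the 'drugs.', 'out:' and URIBL_SC_SURBL rows.
print(round(spamprob(0, 1, NHAM, NSPAM), 6))
print(round(spamprob(2, 0, NHAM, NSPAM), 6))
print(round(spamprob(2, 99, NHAM, NSPAM), 6))

# Compare with "Combined Score: 100% (0.999917)".
print(round(combined_score(2.23853e-05, 0.999857), 6))
```

This also shows why the SURBL rule tokens are such strong clues: with around a hundred spam hits each, the smoothing term barely pulls them back towards 0.5, whereas a token seen once in spam is capped at about 0.84.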
