Magnus Holmgren wrote:
>>>>DISCLAIMER: I *really* think it's a bad idea to adjust this. But if you
>>>>insist, it is possible.
>>>>
>>>>I want there to still be some difficulty to intimidate you from changing
>>>>this without some consideration. (it shouldn't be hard to find the setting
>>>>knowing what file it's in, so this isn't much of a hurdle)
>>
>> You can always hack the source, and yes, it was easy to find. :-)
>>
>> Now for the consideration part:
>>
>> First, we don't want to learn anything as spam that isn't. With a
>> default lower limit of 12 points that's very unlikely and as already
>> mentioned I haven't yet noticed a single false positive in my case.
>> Second, we don't want bayes poisoning, i.e. "hammy" words recorded as
>> "spammy". I guess the reasoning is that if the header scores lots of
>> points while the body scores low or even zero, then the body isn't
>> spammy enough and shouldn't be learnt from. Conversely, if the header is
>> clean then any (at least 9!) body points are probably just coincidence.
>> Right?
>>
>> Now, whether bayes poisoning is really is an issue is debated. Someone
>> pointed out that the random words hidden by spammers in the message in
>> various ways aren't likely to resemble typical legit correspondence;
>> indeed they are just random noise that doesn't contribute in any
>> direction. In my case most real messages are in Swedish, meaning less
>> problem with those (but slightly more with English ones). Also, many
>> body points doesn't mean there is no bayes poison. Finally, when spam
>> slips through, the user would want to feed it to sa-learn regardless of
>> any bayes poison.

Yes, bayes poison should be trained without worry. However, bayes poison is
not the topic of discussion here. We are talking about mis-learning,
something COMPLETELY different.

Mis-learning a ham message as spam is always bad, and can have a minor or
severe impact depending on the circumstances.
There is no question that mis-learning should be avoided whenever possible.

Learning bayes poison as spam isn't a matter of "oh, it doesn't matter
because it's in the random noise"; it's a matter of accurate training. You
WANT SA to learn about common tokens that are used by both categories. This
is important to SA's accuracy, because it reflects reality.

Mis-learning is not random noise, it doesn't reflect reality, and it is not
the same thing as bayes poison. Not at ALL the same. It's just bad.

>> In conclusion, I feel confident in letting SA learn from every message
>> that I am certain that it can be certain is spam.

Are you sure your conclusions are based on accurate perceptions of the
consequences?
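
To make the difference between bayes poison and mis-learning concrete, here
is a toy sketch. It is NOT SpamAssassin's actual Bayes code; the ToyBayes
class, the per-token ratio it computes, and the four sample messages are all
made up for illustration. It only shows the direction of the effect: a
poison word that really occurs in both ham and spam ends up with a mixed
score, while a ham message learnt as spam flips a purely hammy token to
fully spammy.

```python
from collections import Counter


class ToyBayes:
    """Hypothetical per-token spam ratio; not SpamAssassin's real algorithm."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, text, is_spam):
        # Count each distinct token once per message.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.n_spam += 1
        else:
            self.ham_tokens.update(tokens)
            self.n_ham += 1

    def spamminess(self, token):
        # Fraction of spam (resp. ham) messages that contain the token.
        s = self.spam_tokens[token] / self.n_spam if self.n_spam else 0.0
        h = self.ham_tokens[token] / self.n_ham if self.n_ham else 0.0
        if s + h == 0:
            return 0.5  # never seen: neutral
        return s / (s + h)


def train(messages):
    model = ToyBayes()
    for text, is_spam in messages:
        model.learn(text, is_spam)
    return model


# Correct training: the spam carries the poison word "meeting",
# which also occurs in real ham.
clean = train([
    ("meeting agenda attached", False),
    ("lunch meeting tomorrow", False),
    ("cheap pills meeting", True),
    ("cheap pills now", True),
])

# Mis-learning: the first ham message is wrongly learnt as spam.
mislearned = train([
    ("meeting agenda attached", True),
    ("lunch meeting tomorrow", False),
    ("cheap pills meeting", True),
    ("cheap pills now", True),
])

print("poison word 'meeting':", clean.spamminess("meeting"))
print("'agenda', trained correctly:", clean.spamminess("agenda"))
print("'agenda', after mis-learning:", mislearned.spamminess("agenda"))
```

In the correctly trained model "meeting" stays on the ham side of neutral
even though a spammer used it, which is the data reflecting reality. In the
mis-learned model "agenda" has never appeared in any real spam, yet it is
scored as if it appeared in nothing else.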