Magnus Holmgren wrote:

>>>>
>>>>DISCLAIMER: I *really* think it's a bad idea to adjust this. But if you
>>>>insist, it is possible.
>>>>
>>>>I want there to still be some difficulty to intimidate you from changing
>>>>this without some consideration. (it shouldn't be hard to find the setting
>>>>knowing what file it's in, so this isn't much of a hurdle)
>
>>
>>
>> You can always hack the source, and yes, it was easy to find. :-)
>>
>> Now for the consideration part:
>>
>> First, we don't want to learn anything as spam that isn't. With a
>> default lower limit of 12 points that's very unlikely and as already
>> mentioned I haven't yet noticed a single false positive in my case.
>> Second, we don't want bayes poisoning, i.e. "hammy" words recorded as
>> "spammy". I guess the reasoning is that if the header scores lots of
>> points while the body scores low or even zero, then the body isn't
>> spammy enough and shouldn't be learnt from. Conversely, if the header is
>> clean then any (at least 9!) body points are probably just coincidence.
>> Right?
>>
>> Now, whether bayes poisoning really is an issue is debated. Someone
>> pointed out that the random words hidden by spammers in the message in
>> various ways aren't likely to resemble typical legit correspondence;
>> indeed they are just random noise that doesn't contribute in any
>> direction. In my case most real messages are in Swedish, meaning less
>> problem with those (but slightly more with English ones). Also, many
>> body points don't mean there is no bayes poison. Finally, when spam
>> slips through, the user would want to feed it to sa-learn regardless of
>> any bayes poison.



Yes, bayes poison should be trained without worry. However, bayes poison is not
the topic of discussion here. We are talking about mis-learning, something
COMPLETELY different.

Mis-learning a ham message as spam is always bad, and can have a minor or severe
impact depending on the circumstances. There is no question that mis-learning
should be avoided whenever possible.
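
For reference, the safeguard being discussed can be sketched roughly as below.
This is a simplified Python illustration, not SpamAssassin's actual code: the
function and parameter names are hypothetical, the 12-point total is the default
mentioned above, and the 3-point header and body minimums are the documented
auto-learn defaults as I understand them.

```python
def should_autolearn_as_spam(total_points, header_points, body_points,
                             total_threshold=12.0, min_part_points=3.0):
    """Return True only if the message clears every auto-learn hurdle.

    Assumed defaults: 12 points overall, and at least 3 points each
    from header rules and from body rules.
    """
    return (total_points >= total_threshold
            and header_points >= min_part_points
            and body_points >= min_part_points)


# Plenty of points from both header and body: learned as spam.
print(should_autolearn_as_spam(14.0, 10.0, 4.0))   # True

# Same total, but all of it from the header: not learned.
print(should_autolearn_as_spam(14.0, 14.0, 0.0))   # False
```

The point of requiring both parts is that one lopsided set of rule hits is not
enough on its own to commit a message to the database.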

Learning bayes poison as spam isn't a matter of "oh, it doesn't matter because
it's in the random noise"; it's a matter of accurate training. You WANT SA to
learn about common tokens that are used by both categories. This is important to
SA's accuracy, because those tokens really do occur in both spam and ham.


Mis-learning is not random noise, it doesn't reflect reality, and it is not the
same thing as bayes poison. Not at ALL the same. It's just bad.
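
A toy calculation may make the difference concrete. The sketch below is NOT
SpamAssassin's Bayes implementation; it uses a hypothetical naive per-token
estimate and made-up counts, purely to show which way a mis-learned message
pushes a token.

```python
def token_spamminess(spam_hits, ham_hits, n_spam, n_ham):
    """Naive estimate: P(token|spam) / (P(token|spam) + P(token|ham))."""
    p_spam = spam_hits / n_spam
    p_ham = ham_hits / n_ham
    return p_spam / (p_spam + p_ham)


# Hypothetical database: 100 spam and 100 ham messages learned so far.
# The token appears in 40 of the ham messages and in none of the spam.
before = token_spamminess(0, 40, 100, 100)

# Ten ham messages containing the token are mis-learned as spam.
# The spam side now has 110 messages, 10 of which contain the token.
after = token_spamminess(10, 40, 110, 100)

print(round(before, 3))  # 0.0
print(round(after, 3))   # 0.185
```

If those ten messages had been real spam carrying the word as poison, the same
shift would be correct, because the word really would occur in spam. With
mis-learned ham, the database now claims something about spam that isn't true.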




>>
>> In conclusion, I feel confident in letting SA learn from every message
>> that I am certain that it can be certain is spam.


Are you sure your conclusions are based on accurate perceptions of the 
consequences?



