At 05:12 2/07/2004, Nick Leverton wrote:
On Thu, Jul 01, 2004 at 01:07:51PM +1200, Simon Byrnand wrote:
> So the new algorithm does this:
>
> When trying to autolearn spam, don't learn it if the bayes score is
> already > +4.0 (BAYES_99), and when trying to autolearn ham, don't learn it
> if the bayes score is already < -4.0 (BAYES_00). There is no limitation on
> learning a message the opposite way to what bayes previously thought it
> should score, as this is sometimes necessary.
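The guard quoted above can be sketched in a few lines. This is illustrative Python only, not the actual Perl patch against SpamAssassin; the function and constant names are hypothetical:

```python
# Hypothetical sketch of the autolearn guard described above.
# Names and thresholds are illustrative, not SpamAssassin internals.

BAYES_HIGH = 4.0    # roughly the score contributed by BAYES_99
BAYES_LOW = -4.0    # roughly the score contributed by BAYES_00

def should_autolearn(as_spam: bool, bayes_score: float) -> bool:
    """Decide whether to autolearn a message.

    as_spam:      the autolearn verdict (True = learn as spam)
    bayes_score:  the score Bayes already contributed to this message
    """
    if as_spam and bayes_score > BAYES_HIGH:
        return False  # already confidently spam: skip redundant reinforcement
    if not as_spam and bayes_score < BAYES_LOW:
        return False  # already confidently ham: skip redundant reinforcement
    return True       # otherwise learn, including in the opposite direction

# A message Bayes already scores as BAYES_99 is not re-learnt as spam,
# but a message Bayes thinks is ham CAN still be learnt as spam (correction).
print(should_autolearn(True, 4.5))    # skip: already BAYES_99
print(should_autolearn(True, -4.5))   # learn: corrective, opposite direction
print(should_autolearn(False, -4.5))  # skip: already BAYES_00
```

The point of the last branch is exactly what the quoted text says: corrective learning against Bayes's current opinion is always allowed; only redundant reinforcement is suppressed.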
I think it's a great idea, and I had been thinking along the same lines myself. From all the analyses I've read, most learning systems learn best from teaching by exception. I find that reinforcing SA's Bayes when (for instance) it guessed ham makes it very hard in the future to learn enough contrary examples of similar mail.
Exactly my thoughts.... I find too much autolearning also makes manual learning ineffective. After a couple of months of autolearning it had learnt > 10,000 spams and hams, and when something like those German spams came through, I found that I could manually learn them with sa-learn as spam and they were STILL getting BAYES_00. Same with Nigerian spams - manually learning them would sometimes raise their bayes score a bit, but I could never get them to BAYES_90 or more, and they would always creep back towards BAYES_40 or less.
After blowing away the bayes database and letting it autolearn for only 4 hours, the accuracy was far better than with a several-month-old autolearnt database, which goes to show that something is going very wrong with long-term autolearning. I think it's a combination of two things: excessive reinforcement of messages that are already correctly learnt, which dilutes the strength of individually learnt messages (e.g. very large token count values), and the current code that won't let bayes learn in the opposite direction to what it currently thinks a message should be. (Very wrong IMHO.)
I haven't had time to dig into the code to do it myself, though. I'll be interested to see how it works for you!
Sure.... the patch seems to work fine and behaves as the algorithm above describes, but as to how effective my bayes database will be a month from now, that remains to be seen :)
Early indications are promising, and it is learning FAR fewer messages than before, maybe 1/5 or less, which goes to show just how much redundant reinforcement was going on before. I've checked the BAYES scores on a decent amount of ham and spam and things look good so far.
Regards, Simon
