At 13:07 1/07/2004, you wrote:
So the new algorithm does this:
When trying to autolearn spam, don't learn the message if the Bayes score is already > +4.0 (BAYES_99); when trying to autolearn ham, don't learn it if the Bayes score is already < -4.0 (BAYES_00). There is no limitation on learning a message the opposite way to what Bayes previously thought it should score, as this is sometimes necessary.
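As a rough sketch of the rule above (written in Python for illustration, not the actual patch; the function name, the constant, and the "spam"/"ham" labels are all made up for this example):

```python
# Assumed magnitude of the points given to BAYES_99 / BAYES_00 (about 4.0 here).
BAYES_SKIP_SCORE = 4.0


def should_autolearn(learn_as: str, bayes_score: float) -> bool:
    """Return False when Bayes already agrees strongly with the intended class."""
    if learn_as == "spam":
        # Skip only if Bayes already scores this as strong spam.
        return not bayes_score > BAYES_SKIP_SCORE
    if learn_as == "ham":
        # Skip only if Bayes already scores this as strong ham.
        return not bayes_score < -BAYES_SKIP_SCORE
    raise ValueError("learn_as must be 'spam' or 'ham'")
```

Note that a message Bayes scored strongly the "wrong" way (say, -4.5 on something about to be learnt as spam) is still learnt, which is the case where correcting the database matters most.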
This patch is a bit of a hack in the sense that, rather than checking for a score of plus or minus 4, it should really be checking either for BAYES_00 or BAYES_99 specifically, or for a probability of <0.01 or >0.99, neither of which I know how to do.
The reason is that it should rely on the Bayes probabilities, not the points assigned to them, in case the GA changes the scores in the future or people customize them.
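What the probability-based check might look like, again only as an illustrative Python sketch: it assumes the caller can somehow get at the raw Bayes probability (0.0 to 1.0) for the message, which is exactly the part I don't know how to do in SA itself. All names here are hypothetical.

```python
# Probability bounds taken from the <0.01 / >0.99 ranges mentioned above.
LOW_PROB = 0.01
HIGH_PROB = 0.99


def should_autolearn_by_prob(learn_as: str, bayes_prob: float) -> bool:
    """Same rule as the score check, but independent of the points assigned."""
    if not 0.0 <= bayes_prob <= 1.0:
        raise ValueError("bayes_prob must be between 0.0 and 1.0")
    if learn_as == "spam":
        # Skip only if Bayes is already nearly certain this is spam.
        return not bayes_prob > HIGH_PROB
    if learn_as == "ham":
        # Skip only if Bayes is already nearly certain this is ham.
        return not bayes_prob < LOW_PROB
    raise ValueError("learn_as must be 'spam' or 'ham'")
```

Because nothing here depends on the rule scores, changing or customizing the points would not alter which messages get learnt.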
Thoughts, anybody? I'll leave this running for a while to see whether it solves (for me) the problem of having to start the Bayes database over from scratch every couple of months...
One further thought: since this drastically reduces the number of messages autolearnt, it may take quite some time to reach the 200 ham/spam threshold when first starting out. So if something like this were to go into SA, it might make sense for it to be conditional on 200 hams and 200 spams having been learnt.
E.g., with fewer than 200 hams/spams it's OK to learn BAYES_99 and BAYES_00 messages, but once 200/200 is reached, it should start behaving as above. "Dilution" isn't a problem with only a couple of hundred messages learnt; it's when you start getting tens of thousands that it becomes a problem...
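Putting the warm-up condition together with the score check, a hedged Python sketch (hypothetical names throughout; it assumes "200/200 is reached" means both the ham count and the spam count are at least 200, and that the caller supplies those counts):

```python
MIN_LEARNED = 200        # the 200 ham/spam threshold discussed above
BAYES_SKIP_SCORE = 4.0   # assumed magnitude of the BAYES_99 / BAYES_00 points


def should_autolearn(learn_as: str, bayes_score: float,
                     nham: int, nspam: int) -> bool:
    """Learn everything during warm-up, then skip messages Bayes already nails."""
    if learn_as not in ("spam", "ham"):
        raise ValueError("learn_as must be 'spam' or 'ham'")
    # Warm-up: until both counts reach the threshold, dilution is not a concern.
    if nham < MIN_LEARNED or nspam < MIN_LEARNED:
        return True
    if learn_as == "spam":
        return bayes_score <= BAYES_SKIP_SCORE
    return bayes_score >= -BAYES_SKIP_SCORE
```

For example, should_autolearn("spam", 4.5, 150, 300) still learns the message because only 150 hams have been seen, while the same call with 200 and 200 skips it.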
Regards, Simon
