Hi All,

After being unhappy with some aspects of the Bayes autolearning for a while now, I decided to have a go myself and see what I could change with my very limited Perl abilities. I came up with the following patch to PerMsgStatus.pm, which I think kills two birds with one stone:


--- PerMsgStatus.bak	Thu Jul  1 11:51:42 2004
+++ PerMsgStatus.pm	Thu Jul  1 12:55:54 2004
@@ -281,8 +281,8 @@
     return;
   }

-  my $learner_said_ham_hits = -1.0;
-  my $learner_said_spam_hits = 1.0;
+  my $learner_said_ham_hits = -4.0;
+  my $learner_said_spam_hits = 4.0;

   if ($isspam) {
     my $required_body_hits = 3;
@@ -298,16 +298,16 @@
                  $self->{head_only_hits}." < ".$required_head_hits.")");
       return;
     }
-    if ($self->{learned_hits} < $learner_said_ham_hits) {
-      dbg ("auto-learn? no: learner indicated ham (".
-                 $self->{learned_hits}." < ".$learner_said_ham_hits.")");
+    if ($self->{learned_hits} > $learner_said_spam_hits) {
+      dbg ("auto-learn? no: learner indicated spam (".
+                 $self->{learned_hits}." > ".$learner_said_spam_hits.")");
       return;
     }

   } else {
-    if ($self->{learned_hits} > $learner_said_spam_hits) {
-      dbg ("auto-learn? no: learner indicated spam (".
-                 $self->{learned_hits}." > ".$learner_said_spam_hits.")");
+    if ($self->{learned_hits} < $learner_said_ham_hits) {
+      dbg ("auto-learn? no: learner indicated ham (".
+                 $self->{learned_hits}." < ".$learner_said_ham_hits.")");
       return;
     }
   }



Rationale:

The old behaviour (to the best of my understanding) does this: when trying to autolearn ham, if the message ALREADY scores more than +1.0 from BAYES (e.g. BAYES_60 through BAYES_99), the message will NOT be learnt.

Conversely, when trying to autolearn spam, if the existing BAYES score is less than -1.0 (e.g. BAYES_20, BAYES_01, and BAYES_00 in set 4, or BAYES_00 in set 3), then it will NOT be learnt.

The fact that BAYES_10 won't stop it autolearning while BAYES_20 will (due to the scoring) shows that simply comparing the score is flawed IMHO, but that's slightly to the side of my main points.
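To make the old behaviour concrete, here's an illustrative Python sketch of the check described above (not the actual Perl from PerMsgStatus.pm; the example score of -2.6 for BAYES_00 is just a plausible stand-in):

```python
# Old autolearn gate, as I understand it: refuse to learn a message in
# the direction OPPOSITE to what Bayes already says, even weakly.

LEARNER_SAID_HAM_HITS = -1.0   # old threshold from the unpatched code
LEARNER_SAID_SPAM_HITS = 1.0   # old threshold from the unpatched code

def old_autolearn_allowed(learned_hits, as_spam):
    """Return True if the old logic would allow autolearning."""
    if as_spam:
        # Refuse to learn as spam if Bayes already leans ham (score < -1.0).
        return learned_hits >= LEARNER_SAID_HAM_HITS
    # Refuse to learn as ham if Bayes already leans spam (score > +1.0).
    return learned_hits <= LEARNER_SAID_SPAM_HITS

# A message already scoring BAYES_00 (say -2.6) can never be
# autolearnt as spam, no matter what the other rules say:
old_autolearn_allowed(-2.6, as_spam=True)   # False
```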

The reasoning behind this, I guess, is that if a message already gets a negative Bayes score, we shouldn't be trying to learn it as spam. Unfortunately, that reasoning is flawed IMHO.

Two good examples: Nigerian spams and the recent German spams. I've always wondered why autolearning Bayes seems to have such a hard time with the Nigerian stuff, and now I think I know why.

Initially, Nigerian spams don't look very spammy; they use mostly common language, so it's understandable for them to get negative scores from Bayes to start with. However, when other tests give a high enough score to trigger autolearning as spam, they DON'T get learnt under the old logic, because they already have BAYES_00 or BAYES_01.

Another example is the German spams: when they first came out, there was NOTHING in them for SpamAssassin to pick up on, and because they didn't match ANY tests, they mistakenly got learnt as ham. Once they were learnt as ham and getting BAYES_00, it became impossible for the autolearning system to correct itself and learn them as spam, even once other rules started catching them, such as DCC, Razor, new custom rules, etc.

It's not uncommon for something which doesn't match any tests to initially be learnt the wrong way, but as soon as DCC, Razor, RBLs, etc. pick up on it, it should be possible to automatically learn it the RIGHT way around; otherwise you have a system that keeps needing manual intervention.

The second issue is one of database "dilution". Without fail, if I leave the Bayes database autolearning for too long (a couple of months), it starts to perform poorly (false negatives).

I've mentioned this before, and I still believe the root cause is simply dilution of data, caused by excessive reinforcement of messages that already have maximal probabilities.

E.g. autolearning as spam a message that already gets BAYES_99, or autolearning as ham a message that already gets BAYES_00.

Even though the learner might not be quite as smart in the short term (since it is still possible to learn useful tokens from a message already getting maximum probabilities), in the long term the system should be more stable, since only truly useful messages will be learnt (those that aren't already getting BAYES_00 or BAYES_99).

So the new algorithm does this:

When trying to autolearn spam, don't learn it if the Bayes score is already > +4.0 (BAYES_99), and when trying to autolearn ham, don't learn it if the Bayes score is already < -4.0 (BAYES_00). There is no limitation on learning a message the opposite way to what Bayes previously thought it should score, as this is sometimes necessary.
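The new gate from the patch can be sketched the same way (again an illustrative Python rendering, not the patched Perl itself):

```python
# New autolearn gate: only skip learning when Bayes already agrees at
# FULL strength with the verdict being learnt, so re-learning would
# merely reinforce tokens that are already maximal.

LEARNER_SAID_HAM_HITS = -4.0   # new threshold from the patch
LEARNER_SAID_SPAM_HITS = 4.0   # new threshold from the patch

def new_autolearn_allowed(learned_hits, as_spam):
    """Return True if the patched logic would allow autolearning."""
    if as_spam:
        # Skip only if Bayes already scores it strongly as spam (BAYES_99).
        return learned_hits <= LEARNER_SAID_SPAM_HITS
    # Skip only if Bayes already scores it strongly as ham (BAYES_00).
    return learned_hits >= LEARNER_SAID_HAM_HITS

# A Nigerian-style spam currently scoring BAYES_00 (say -2.6) CAN now
# be learnt as spam, so the database can correct itself:
new_autolearn_allowed(-2.6, as_spam=True)   # True
```

Note the direction of the comparison has flipped relative to the old code: each branch now guards against reinforcing its OWN verdict, rather than forbidding a correction of the opposite one.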

This patch is a bit of a hack in the sense that rather than checking for a score of plus or minus 4, it should really be checking either for BAYES_00 and BAYES_99 specifically, or for a probability of < 0.01 or > 0.99, neither of which I know how to do.

The reason is that it should be relying on the Bayes probabilities, not the points that are assigned, in case the GA changes them in the future or people customize them.
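If someone does know how to get at the raw probability, the check I'm after would look something like this (a hypothetical sketch; `bayes_prob` is a stand-in for the classifier's 0..1 output, not a real SpamAssassin accessor):

```python
# Probability-based gate: independent of whatever scores the GA (or a
# local admin) assigns to the BAYES_* rules, so it keeps working even
# if the point values change.

def prob_autolearn_allowed(bayes_prob, as_spam):
    """Return True unless Bayes is already near-certain of this verdict."""
    if as_spam:
        # Skip reinforcement only when Bayes is already near-certain spam.
        return bayes_prob <= 0.99
    # Skip reinforcement only when Bayes is already near-certain ham.
    return bayes_prob >= 0.01
```

The design point is the same as the score-based patch, just expressed in the classifier's native units: a message at 0.995 probability adds nothing new as spam training, while one at 0.3 still can.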

Thoughts, anybody? I'll leave this running for a while and see whether it solves (for me) the problem of having to start the Bayes database over from scratch every couple of months...

Regards,
Simon


