Experiment: I removed the current sliding/shrinking window code and
replaced it with this simple bit:
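    my ($lo, $hi);
    if ($is_nice{$test}) {
        # nice (negative-scoring) rules: range is [-4.5 * rank, 0]
        $hi = 0;
        $lo = $ranking{$test} * -4.5;
    }
    else {
        # all other rules: range is [0, 4.5 * rank]
        $lo = 0;
        $hi = $ranking{$test} * 4.5;
    }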
This relies on the new RANKING code, which produces a reasonably good
distribution of RANKs from low to high. I then took last night's
corpus submission results and ran a 10-fold cross-validation (10fcv):
BEFORE:
# TCR: 38.978094 SpamRecall: 98.376% SpamPrec: 99.809% FP: 0.16% FN: 1.40%
# TCR: 39.409745 SpamRecall: 98.699% SpamPrec: 99.750% FP: 0.21% FN: 1.12%
# TCR: 46.518954 SpamRecall: 98.693% SpamPrec: 99.829% FP: 0.15% FN: 1.12%
# TCR: 43.135758 SpamRecall: 98.651% SpamPrec: 99.804% FP: 0.17% FN: 1.16%
# TCR: 40.119504 SpamRecall: 98.491% SpamPrec: 99.801% FP: 0.17% FN: 1.30%
# TCR: 41.669789 SpamRecall: 98.485% SpamPrec: 99.821% FP: 0.15% FN: 1.30%
# TCR: 43.030230 SpamRecall: 98.491% SpamPrec: 99.835% FP: 0.14% FN: 1.30%
# TCR: 43.879162 SpamRecall: 98.494% SpamPrec: 99.843% FP: 0.13% FN: 1.30%
# TCR: 38.722524 SpamRecall: 98.556% SpamPrec: 99.770% FP: 0.20% FN: 1.24%
# TCR: 42.063830 SpamRecall: 98.676% SpamPrec: 99.787% FP: 0.18% FN: 1.14%
average TCR -> 41.752759
AFTER:
# TCR: 67.784762 SpamRecall: 99.213% SpamPrec: 99.861% FP: 0.12% FN: 0.68%
# TCR: 76.040598 SpamRecall: 99.149% SpamPrec: 99.907% FP: 0.08% FN: 0.73%
# TCR: 87.009780 SpamRecall: 99.174% SpamPrec: 99.935% FP: 0.06% FN: 0.71%
# TCR: 85.340528 SpamRecall: 99.292% SpamPrec: 99.907% FP: 0.08% FN: 0.61%
# TCR: 78.383260 SpamRecall: 99.118% SpamPrec: 99.921% FP: 0.07% FN: 0.76%
# TCR: 76.038462 SpamRecall: 99.050% SpamPrec: 99.926% FP: 0.06% FN: 0.82%
# TCR: 84.527316 SpamRecall: 99.056% SpamPrec: 99.952% FP: 0.04% FN: 0.81%
# TCR: 77.193059 SpamRecall: 99.098% SpamPrec: 99.921% FP: 0.07% FN: 0.78%
# TCR: 78.039474 SpamRecall: 99.070% SpamPrec: 99.929% FP: 0.06% FN: 0.80%
# TCR: 79.080000 SpamRecall: 99.087% SpamPrec: 99.929% FP: 0.06% FN: 0.79%
average TCR -> 78.9437239
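For anyone who wants to re-check the averages, here is a minimal sketch
of how they can be recomputed from the per-fold lines. The function name
average_tcr is made up for illustration and is not part of the masses
tools; it only assumes each fold prints a line containing "TCR: <number>".

```perl
use strict;
use warnings;

# Hypothetical helper: return the mean of the TCR values found in
# lines of per-fold output, or undef if no line carries a TCR value.
sub average_tcr {
    my @lines = @_;
    my @tcr;
    for my $line (@lines) {
        # Each fold prints a line such as "# TCR: 38.978094 SpamRecall: ..."
        push @tcr, $1 if $line =~ /\bTCR:\s*([0-9.]+)/;
    }
    return undef unless @tcr;
    my $sum = 0;
    $sum += $_ for @tcr;
    return $sum / @tcr;
}

# First two BEFORE folds from the results above.
my @before = (
    '# TCR: 38.978094 SpamRecall: 98.376% SpamPrec: 99.809% FP: 0.16% FN: 1.40%',
    '# TCR: 39.409745 SpamRecall: 98.699% SpamPrec: 99.750% FP: 0.21% FN: 1.12%',
);
printf "average TCR -> %f\n", average_tcr(@before);
```

Feeding it all ten lines of a run should reproduce the "average TCR"
figure for that run.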
Now, we might not want to use RANK directly, since inevitably some of
those low-ranking rules will get removed and things will get shifted
around. Still, this does suggest we should think about something a bit
more straightforward based on RANK or maybe S/O. Whatever the current
windowing system does, I think it is limiting the scores a bit too much.
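To make the "more straightforward" idea concrete, here is a rough sketch
of the experiment's range rule as a standalone function. The name
score_range, the optional $max parameter and the rule names are all
hypothetical, and it assumes ranks fall between 0 and 1; only the 4.5
multiplier and the nice/non-nice split come from the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper: map a rule's rank (assumed to lie in 0..1) to a
# (lo, hi) score range, mirroring the experiment's snippet.
sub score_range {
    my ($rank, $is_nice, $max) = @_;
    $max = 4.5 unless defined $max;    # multiplier used in the experiment
    # Nice rules may only score at or below zero, all others at or above.
    return $is_nice ? (-$rank * $max, 0) : (0, $rank * $max);
}

# Made-up rules and ranks, purely for illustration.
my %ranking = (HYPOTHETICAL_SPAM_RULE => 0.8, HYPOTHETICAL_NICE_RULE => 0.5);
my %is_nice = (HYPOTHETICAL_NICE_RULE => 1);

for my $test (sort keys %ranking) {
    my ($lo, $hi) = score_range($ranking{$test}, $is_nice{$test});
    printf "%-24s lo=%.3f hi=%.3f\n", $test, $lo, $hi;
}
```

Swapping S/O in for RANK would only change what gets passed as the first
argument, provided it is scaled the same way.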
But, it's not quite that simple...
I suspected a lot of the benefit came merely from lowering the minimum
score to always be 0 (whereas the current ranging code sometimes forces
a rule to be a specific non-zero number like 2.800 or something, which
is absurd), giving the perceptron a lot more freedom.
This is where it gets interesting (Henry, thanks for the pointer in the
perceptron code)... the score ranges (mine or the original) aren't even
being used by the perceptron, since that code got commented out somewhere
along the way. The *only* effect of my change was on the scores.h file,
which went from about 431 non-mutable rules to 107. The count drops
because the new ranging code I wrote doesn't do the crazy "thou shalt
have a score of 2.800" thing.
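A small sketch of why the count would drop, under one assumption that I
have not checked against the generator: a rule whose range collapses to a
single value (lo == hi) ends up non-mutable. The function name, the hash
layout and the rule names are all invented for illustration.

```perl
use strict;
use warnings;

# Assumption: a rule whose range has collapsed to a single value
# (lo == hi) is effectively non-mutable, since there is nothing left
# for the optimizer to adjust.
sub count_non_mutable {
    my ($ranges) = @_;    # hashref: rule name => [lo, hi]
    return scalar grep { $_->[0] == $_->[1] } values %$ranges;
}

# Made-up ranges, purely for illustration.
my %ranges = (
    HYPOTHETICAL_RULE_A => [2.8, 2.8],    # pinned to one score
    HYPOTHETICAL_RULE_B => [0, 3.6],      # free to move between 0 and 3.6
    HYPOTHETICAL_RULE_C => [-2.25, 0],    # nice rule, free to move
);
printf "%d of %d rules are non-mutable\n",
    count_non_mutable(\%ranges), scalar keys %ranges;
```

With ranges that always have 0 at one end, only rules with a rank of 0
would collapse like this.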
While the improvement and the fix were not exactly accidental, the fact
that replacing ugly, complicated code with clean, simple code fixed a
bug (a perfectly valid and good way to fix one) illustrates that we need
to trim a lot of fat.
Daniel
--
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting