http://bugzilla.spamassassin.org/show_bug.cgi?id=2419
------- Additional Comments From [EMAIL PROTECTED] 2004-05-06 20:16 -------
Subject: Re: Increased Bayes Score Breakdown Near Extremes
> I wonder if those ranges will work for everyone. I'm using chi^2
> combining and I find that it's the rare ham message that exceeds a
> Bayes score of 0.5. Here are my hand-tweaked scores around the center
> (a union of the old and new rules):
Well, looking at other people's Bayes histograms, the middle certainly
does behave differently for each of them, probably because their corpora
differ, their balance of ham and spam counts differs, etc.
Justin's is very spammy in the middle, but some people are very hammy.
The peak in the center of the histogram is going to vary in composition
for everyone.
jmason (columns: score range, spam%, ham%, S/O):
0.00-0.01 1.930 98.654 0.0191879
0.01-0.05 0.660 0.443 0.598368
0.05-0.20 1.072 0.382 0.737276
0.20-0.40 1.105 0.260 0.809524
0.40-0.60 35.302 0.260 0.992689 <- very spammy
0.60-0.80 4.404 0.000 1
0.80-0.95 4.602 0.000 1
0.95-0.99 4.289 0.000 1
0.99-1.00 46.635 0.000 1
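
For anyone reading these histograms: the last column appears to be
spam% / (spam% + ham%), which matches the rows above when recomputed.
A minimal Python sketch of that check follows; the function name
s_o_ratio is made up for illustration and is not part of SpamAssassin.

```python
def s_o_ratio(spam_pct: float, ham_pct: float) -> float:
    """Share of the hits in a score range that are spam."""
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("range has no hits")
    return spam_pct / total


# Two rows copied from the jmason histogram above.
rows = [
    ("0.00-0.01", 1.930, 98.654),
    ("0.40-0.60", 35.302, 0.260),
]
for label, spam_pct, ham_pct in rows:
    print(label, round(s_o_ratio(spam_pct, ham_pct), 6))
```
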
> Based on the nightly rule results, my Bayes scores look something like
> jm's, so I'm not the only one. (I don't know which combining rule was
> used by jm.)
I can divide up the middle a bit more. At worst, the GA will just zero
it and it'll just be a very tiny bit of dead weight. 0.40-0.45 has very
few hits in it, so I'm not too inclined to go with 5% ranges.
I think Justin's mail (and perhaps yours too) is sort of a best case for
how non-spammy the ham is. Here's his middle range a bit more fleshed
out:
0.400-0.425 0.231 0.076 0.752443 <- too thin alone
0.425-0.450 0.495 0.092 0.843271 <- too thin alone
0.450-0.475 0.627 0.015 0.976636 <- too thin alone
0.475-0.500 14.599 0.076 0.994821 <- thick enough from here down
0.500-0.525 16.084 0.000 1
0.525-0.550 1.452 0.000 1
0.550-0.575 0.940 0.000 1
0.575-0.600 0.874 0.000 1
0.600-0.800 4.404 0.000 1
So, I could see going with:
0.400-0.475
0.475-0.525
0.525-0.600
That's a 5% wide band in the middle. I could do less, I could do more.
Here's mine:
0.400-0.425 0.007 0.143 0.0466667 <- too thin alone
0.425-0.450 0.000 0.191 0 <- too thin alone
0.450-0.475 0.014 0.252 0.0526316 <- too thin alone
0.475-0.500 0.238 1.350 0.149874 <- too thin alone
0.500-0.525 1.253 0.587 0.680978 <- thick enough
0.525-0.550 0.272 0.014 0.951049 <- too thin alone
0.550-0.575 0.204 0.020 0.910714 <- too thin alone
0.575-0.600 0.143 0.020 0.877301 <- too thin alone
Even for Justin's, the 0.40-0.475 range is about as sparse as you'd
want a range to be and still get good score optimization.
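
To see what a merged range looks like, the thin sub-bins can simply be
summed, assuming the two numeric columns are percentages of all spam
and all ham so that disjoint bins add. A rough Python sketch using
Justin's three thin rows from 0.400 to 0.475 (merge_bins is a
hypothetical helper, not anything in the tree):

```python
def merge_bins(bins):
    """Sum (spam%, ham%) pairs and recompute the spam share."""
    spam = sum(s for s, _ in bins)
    ham = sum(h for _, h in bins)
    total = spam + ham
    so = spam / total if total else None
    return spam, ham, so


# Justin's rows for 0.400-0.425, 0.425-0.450 and 0.450-0.475.
justin_thin = [(0.231, 0.076), (0.495, 0.092), (0.627, 0.015)]
spam, ham, so = merge_bins(justin_thin)
print(f"0.400-0.475 {spam:.3f} {ham:.3f} {so:.6f}")
```

The merged row still covers well under 2% of his spam, which is why I
wouldn't want to cut it any finer.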
So, if I go ahead and change the middle into 3 regions, how wide do we
want the middle to be? The more I think about it, the narrower I want
to make it. I could even make it 0.4999 to 0.5001 -- there are more than
enough hits in there.
> To accommodate those who rarely get ham scoring >= 0.5 it might be
> better to at least split the rules at 0.5, perhaps [0.4,0.5) and
> [0.5,0.6). (IIRC the evals only allow ranges like (0.4,0.5], which is
> not quite as good.)
Hmmm.... 0.5 is so meaningless I doubt it really matters.
> Users could have lines such as
> body BAYES_44 eval:check_bayes('0.44', '0.49999999999')
> in their preferences file, but that should not be necessary for what
> might be a common preference.
I doubt we want to complicate this unnecessarily. It really won't make
any difference in terms of end results.
> Also, it would be better if the rules were named consistently, say
> using the lower end of the range for the number (BAYES_01 instead of
> BAYES_05 for check_bayes('0.01', '0.05')).
Maybe. :-)
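
For illustration only, the lower-bound naming suggested above could be
generated mechanically. This Python sketch assumes two-digit names and
copies the check_bayes rule syntax from the quoted example; rule_line
is a hypothetical helper, and the ranges are the ones from the jmason
histogram, not a proposed rule set.

```python
def rule_line(lower: float, upper: float) -> str:
    """Build a rule line named after the lower end of the range."""
    name = f"BAYES_{round(lower * 100):02d}"
    return f"body {name} eval:check_bayes('{lower:.2f}', '{upper:.2f}')"


# Range edges taken from the jmason histogram.
edges = [0.00, 0.01, 0.05, 0.20, 0.40, 0.60, 0.80, 0.95, 0.99, 1.00]
for lower, upper in zip(edges, edges[1:]):
    print(rule_line(lower, upper))
```

Two digits stop being enough once a range starts at something like
0.475, so a finer split in the middle would need a different scheme.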