I don't have time to develop this idea further, but thought I'd offer it
for consideration. Let me begin by saying that I think that Henry Stern's
proposal likely has much more promise,
http://bugzilla.spamassassin.org/show_bug.cgi?id=2910
and that I'm describing this idea just in case someone wants to
pursue it further.
The motivation for an alternative scoring technique is to make it more
tractable for users who write their own custom rules and want to
rescore the entire collection of rules, or who want to rescore the rules
based upon their own collection of ham and spam. The current GA framework
as I understand it, isn't set up to allow a quick, effective score
calculation.
A linear programming model generally consists of a set of linear equations
with
inequalities (constraints), and an objective function that is to be
minimized
(or maximized).
I'll sketch the idea out here, but it definitely needs some work to make
this into something useable.
1) constraint: no false positive. For each ham message, find the rules
that hit on that message. Call them R1, R2 ... RN. Generate the following
constraint:
R1*S1 + R2*S2 ... RN*SN <= 0.8 * C
where S1, S2 ... SN are the respective scores for those rules and C is our
cutoff value (In SA, by default, C = 5.0). The reason for using only 80% as
the cutoff (i.e., 4.0) is to leave some room for error when the rules and
scores are deployed.
2) constraint: no single rule can tag a spam. For all rules, establish the
following constraint: Ri <= 0.8 * C (i.e., Ri <= 4.0), for all i.
3) constraint: no false negatives. For each spam message which triggers
rules R1, R2 ... RN, establish the following constraint:
R1*S1 + R2*S2 ... RN*SN >= 1.2 * C
(in SA, C defaults to 5.0, so 1.2 * C is 6.0).
Note that if a set of rules {R1,R2 ... RN} appeared together in both a spam
and ham context, then this constraint and constraint (1) are mutually
exclusive,
and the respective scores couldn't be determined, and those rules must be
excluded entirely. One way to deal with this might be to leave out the spam
constraint entirely and accept false negatives in this case.
4) constraint: no spam feature can be scored less than 0. This would be true
for all
rules which are intended to trigger on spam and not ham. For rules like the
Habeas rule, the constraint can be changed to allow/encourage a negative
score.
for all i in the set of spam detection rules, Ri >= 0.
The objective function:
Maximize Sum(Si*Pi) for all i, where Pi is the probability that rule Ri
is not a false positive. Said differently, if 95% of the time, Rule Ri
appears in spam messages, and thus in 5% of the messages it appears in
ham, then Pi = 0.95. There may be better objective functions, but this
seemed
like a start.
There would be some fall out from the rules above. Because of the rather
severe constraints intended to eliminate false positives, this approach
might have a tendency to set more scores to 0, or a very small number than
the present GA, and other proposed "fuzzy" and non-linear methods of
scoring.
A possible advantage is that LP solvers already exist and should be easy to
plug into the existing framework. The LP solvers should also be fast, esp.
compared to the existing GA framework.