Justin Mason wrote:
> hey Robert -- what does the "spamp" mean?
>
> --j.
I found this in a message from Robert dated 2004-01-24.
Quoting his message:
I'm very close to completing the first draft of my standardized scoring
algorithm.
I've identified a number of different types of rules/algorithms I'm
working with:
* spam / default - any rule which is not assigned to a different rule
type falls into this algorithm.
* ham - rules that should have negative scores (they hit more ham than spam)
* hamp - rules which indicate this is probably ham
* hamg - rules guaranteed(?) to identify ham
* hamgg - rules REALLY guaranteed to identify ham
* spamp - rules which indicate this is probably spam
* spamnow - rules which probably indicate spam now, though they may have
hit corpus ham in the past (useful for discontinued email addresses)
* spamg - rules guaranteed(?) to identify spam
* spamgg - rules REALLY guaranteed to identify spam
* vbg - rules guaranteed(?) to identify a virus or bogus bounce
* vbgg - rules REALLY guaranteed to identify a virus or bogus bounce
* obfu - rules which identify obfuscated spam words. Scores high since
these don't hit normal English
* spamu - guaranteed spammer URI (eg: BigEvil)
* FP* - spam rule which could cause FP under default algorithm
* max:* - set maximum score for this rule (part of large ruleset)
These names can probably be improved upon. Suggestions are welcome.
Algorithms ($sc = spam count, $hc = ham count, $rh = Required Hits):
* spam:
    $sr (spam ratio) = $sc / ( $hc + 1 )
    $t = int( $sr / 100 )    # how many hundreds?
    if ( $t > 0 ) then tscore = $t + ( $sr mod 100 ) / 100
    elif ( $sr > 9 ) then tscore = 1 + ( $sr mod 100 ) / 100
    else tscore = $sr / 10
    case $tscore in
      0)           tscore = 0.050
      0.001-3.000) use as-is
      3.001+)      if $hc > 0 then tscore = 3.000
                   if $sc < 1000 then tscore = 3.000
      4.001+)      if $sc < 10000 then tscore = 4.000
      5.001+)      if $sc < 100000 then tscore = 5.000
      6.001+)      tscore = 6.000
    esac
    rulescore = $tscore * 9.0 / $rh
* ham: Like spam, but reverse the use of $sc and $hc, and then make the
result a negative score.
* spamp: If any ham, treat as spam. Otherwise,
    case digits($sc) in
      1) div=9 ;;
      2) div=5 ;;
      3) div=3 ;;
      4) div=2 ;;
      5) div=1 ;;
      *) div=1/2 ;;
    esac
    tscore = ns = $rh / $div
    if ( default score > $tscore ) use default, else use $tscore
* hamp - like spamp, but reverse to negative scores.
* spamnow - treat like spamp, unless $hc > 0. If any ham matches, then
simply double the default spam score.
* spamg - algorithm like spamp, but
    case digits($sc) in
      1) div=3 ;;
      2) div=2 ;;
      3) div=1 ;;
      *) div=1/2 ;;
    esac
to increase scores faster (lowest score with no ham = $rh/3)
* hamg - like spamg, but with negative scores
* spamgg - if ham, then score at $rh/3, else $rh * 0.85
* hamgg - like spamgg, but with negative scores
* vbg - like spamg
* vbgg - like spamgg
* obfu - like spamg, but
    case digits($sc) in
      1) div=2 ;;
      2) div=1 ;;
      *) div=1/2 ;;
    esac
to increase scores faster (lowest score with no ham = $rh/2)
* spamu - like obfu
* FP* - the "*" is a number indicating how many FPs have been caused by
this rule in the past. Divide the default spam score by that number.
* max:* - The "*" is a maximum score numerator $max, such that the
intended maximum score for the rule is $max/$rh. If the default score
calculated is above this $max/$rh maximum, then use this maximum instead.
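For anyone who wants to experiment with these algorithms, here is a rough Python sketch of the default "spam" algorithm, the digit-based types (spamp, spamg, obfu), and the FP*/max:* adjustments. It is only one reading of the pseudocode above, not Bob's bash/bc script. Assumptions: the remainder term is taken as ($sr mod 100)/100, which is the reading that reproduces the 1.246 and 1.250 scores of the RM_sw_Free sample rules quoted further down when $rh = 9; the 3/4/5/6 caps are applied as a cascade; spamg and obfu inherit spamp's "if any ham, treat as spam" rule; all function and table names are made up for illustration.

```python
def digits(n):
    """Number of decimal digits in a non-negative hit count."""
    return len(str(int(n)))


def spam_score(sc, hc, rh):
    """Default 'spam' algorithm: sc = spam count, hc = ham count, rh = required hits."""
    sr = sc / (hc + 1)                 # spam ratio
    t = int(sr / 100)                  # how many hundreds?
    frac = (sr % 100) / 100            # ASSUMPTION: remainder scaled to a fraction
    if t > 0:
        tscore = t + frac
    elif sr > 9:
        tscore = 1 + frac
    else:
        tscore = sr / 10
    if tscore == 0:
        tscore = 0.050
    # ASSUMPTION: the caps cascade, each one checked in turn.
    if tscore > 3 and (hc > 0 or sc < 1000):
        tscore = 3.0
    if tscore > 4 and sc < 10000:
        tscore = 4.0
    if tscore > 5 and sc < 100000:
        tscore = 5.0
    if tscore > 6:
        tscore = 6.0
    return tscore * 9.0 / rh


# Divisors keyed by digits(sc); anything not listed uses the "*" case (1/2).
DIVISORS = {
    "spamp": {1: 9, 2: 5, 3: 3, 4: 2, 5: 1},
    "spamg": {1: 3, 2: 2, 3: 1},
    "obfu": {1: 2, 2: 1},
}


def typed_score(rule_type, sc, hc, rh):
    """spamp / spamg / obfu: never score below the default spam algorithm."""
    default = spam_score(sc, hc, rh)
    if hc > 0:                         # "If any ham, treat as spam"
        return default
    div = DIVISORS[rule_type].get(digits(sc), 0.5)
    return max(default, rh / div)


def fp_score(default, fp_count):
    """FP*: divide the default spam score by the number of past FPs."""
    return default / fp_count


def max_score(default, max_numerator, rh):
    """max:*: cap the default score at max_numerator / rh."""
    return min(default, max_numerator / rh)


if __name__ == "__main__":
    print(round(spam_score(2560, 103, 9), 3))   # RM_sw_Free counts
    print(round(spam_score(75, 2, 9), 3))       # RM_sw_FreeBang counts
    print(typed_score("spamg", 7, 0, 5))        # single-digit hit count, no ham
```

The negative types (ham, hamp, hamg) would follow by swapping the two counts and negating the result, as described in the list.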
These rules seem to cover almost all situations I have in my custom
rules file. The only significant exceptions are rules like:
header RM_sw_Free Subject =~ /\bfree\b/i
describe RM_sw_Free Subject includes word suggesting spammer
score RM_sw_Free 1.246 # 2560s/103h of 91714 corpus (74113s/17601h) 01/24/04
header RM_sw_FreeBang Subject =~ /\bFree\!/i
describe RM_sw_FreeBang Subject includes Free! exclamation
score RM_sw_FreeBang 1.250 # 75s/2h of 91714 corpus (74113s/17601h) 01/24/04
Every email hit by the second rule is also hit by the first. Therefore
one rule or the other should have its score reduced to take this into
account. Any suggestions?
Actually, this case is too simple, since I can rewrite this as follows to
avoid the duplication, but the general question applies to other, more
complicated rule combinations.
header RM_sw_Free Subject =~ /(?!\bfree\!)\bfree\W/i
describe RM_sw_Free Subject includes word suggesting spammer
score RM_sw_Free 1.246 #
header RM_sw_FreeBang Subject =~ /\bFree\!/i
describe RM_sw_FreeBang Subject includes Free! exclamation
score RM_sw_FreeBang 1.250 # 75s/2h of 91714 corpus (74113s/17601h) 01/24/04
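A quick way to sanity-check a rewrite like this is to run the old and new patterns against a few sample subjects. The sketch below uses Python's re module rather than SpamAssassin itself, so it only approximates how the rules would behave in Perl; the sample subjects and the hits() helper are invented for illustration.

```python
import re

# The three Subject patterns from the rules above, compiled case-insensitively.
FREE_ORIGINAL = re.compile(r"\bfree\b", re.I)
FREE_REWRITTEN = re.compile(r"(?!\bfree\!)\bfree\W", re.I)
FREE_BANG = re.compile(r"\bFree\!", re.I)


def hits(subject):
    """Return which of the three patterns match a Subject line."""
    return {
        "free_original": bool(FREE_ORIGINAL.search(subject)),
        "free_rewritten": bool(FREE_REWRITTEN.search(subject)),
        "free_bang": bool(FREE_BANG.search(subject)),
    }


if __name__ == "__main__":
    for subject in (
        "Free! Prescription drugs",   # hypothetical sample subjects
        "Get it free today",
        "Totally free",
        "freedom of choice",
    ):
        print(subject, hits(subject))
```

In this sketch the rewritten pattern no longer fires on "Free! Prescription drugs", which is the point of the rewrite. It also stops firing on a subject that ends in "free", because \W needs a character to match, so the hit counts for the rewritten rule may differ a little from the original for that reason as well.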
The algorithm is written into my masscheck script right now as a
combination of bash and bc, but I figure once it's finalized I should be
able to recode it in perl, something like:
> # perl stdscore.pl $type $rh $sc $hc
> 2.345
Thanks for any feedback and suggestions.
Bob Menschel