Justin Mason wrote:
> hey Robert -- what does the "spamp" mean?
>
> --j.

I found this in a message from Robert dated 2004-01-24.

Quoting his message:
I'm very close to completing the first draft of my standardized scoring
algorithm.

I've identified a number of different types of rules/algorithms I'm
working with:

* spam / default - any rule which is not assigned to a different rule
  type falls into this algorithm.
* ham - rules that should have negative scores, hit more ham than spam
* hamp - rules which indicate this is probably ham
* hamg - rules guaranteed(?) to identify ham
* hamgg - rules REALLY guaranteed to identify ham
* spamp - rules which indicate this is probably spam
* spamnow - rules which probably indicate spam now, though they may have
  hit corpus ham in the past (useful for discontinued email addresses)
* spamg - rules guaranteed(?) to identify spam
* spamgg - rules REALLY guaranteed to identify spam
* vbg - rules guaranteed(?) to identify a virus or bogus bounce
* vbgg - rules REALLY guaranteed to identify a virus or bogus bounce
* obfu - rules which identify obfuscated spam words. Scores high since
  these don't hit normal English
* spamu - guaranteed spammer URI (eg: BigEvil)
* FP* - spam rule which could cause FP under default algorithm
* max:* - set maximum score for this rule (part of large ruleset)

These names can probably be improved upon. Suggestions are welcome.

Algorithms ($sc = spam count, $hc = ham count, $rh = Required Hits):

* spam:
  $sr (spam ratio) = $sc / ( $hc + 1 )
  $t = int( $sr / 100 )  # how many hundreds?
  if   ( $t > 0 )  then tscore = $t + ( $sr mod 100 )
  elif ( $sr > 9 ) then tscore = 1  + ( $sr mod 100 )
  else                  tscore = $sr / 10
  case $tscore in
     0) tscore = 0.050
     0.001-3.000) use as-is
     3.001+) if $hc > 0      then tscore = 3.000
             if $sc < 1000   then tscore = 3.000
     4.001+) if $sc < 10000  then tscore = 4.000
     5.001+) if $sc < 100000 then tscore = 5.000
     6.001+)                      tscore = 6.000
  esac
  rulescore = $tscore * 9.0 / $rh

* ham: Like spam, but reverse the use of $sc and $hc, and then make the
  result a negative score.

* spamp: If any ham, treat as spam. Otherwise,
  case digits($sc) in
     1) div=9 ;;
     2) div=5 ;;
     3) div=3 ;;
     4) div=2 ;;
     5) div=1 ;;
     *) div=1/2 ;;
  esac
  tscore = ns = $rh / $div
  if ( default score > $tscore ) use default, else use $tscore

* hamp - like spamp, but reverse to negative scores.

* spamnow - treat like spamp, unless $hc > 0. If any ham matches, then
  simply double the default spam score.

* spamg - algorithm like spamp, but
  case digits($sc) in
     1) div=3 ;;
     2) div=2 ;;
     3) div=1 ;;
     *) div=1/2 ;;
  esac
  to increase scores faster (lowest score with no ham = $rh/3)

* hamg - like spamg, but with negative scores

* spamgg - if ham, then score at $rh/3, else $rh * 0.85

* hamgg - like spamgg, but with negative scores

* vbg - like spamg

* vbgg - like spamgg

* obfu - like spamg, but
  case digits($sc) in
     1) div=2 ;;
     2) div=1 ;;
     *) div=1/2 ;;
  esac
  to increase scores faster (lowest score with no ham = $rh/2)

* spamu - like obfu

* FP* - the "*" is a number indicating how many FPs have been caused by
  this rule in the past. Divide the default spam score by that number.

* max:* - The "*" is a maximum score numerator $max, such that the
  intended maximum score for the rule is $max/$rh. If the default score
  calculated is above this $max/$rh maximum, then use this maximum instead.
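
As a rough illustration of the spam, ham, spamp, spamg and obfu algorithms above, here is a minimal Python sketch. It is not the masscheck script itself, and the function names (spam_score, ham_score, tiered_score) are made up for this example. Three readings are assumptions: (1) the "$sr mod 100" remainder is divided by 100 before being added, since that is the reading that reproduces the 1.246 and 1.250 scores quoted for RM_sw_Free and RM_sw_FreeBang when $rh is 9; (2) the case statement is treated as a cascade of caps tested from lowest to highest; (3) "default score" in spamp means the result of the spam algorithm, and spamg and obfu fall back to the spam algorithm on any ham hit just as spamp does.

```python
# Divisor tables keyed by the number of digits in the spam count.
# Any digit count not listed falls back to 1/2.
DIVISORS = {
    "spamp": {1: 9, 2: 5, 3: 3, 4: 2, 5: 1},
    "spamg": {1: 3, 2: 2, 3: 1},
    "obfu": {1: 2, 2: 1},
}


def spam_score(sc, hc, rh):
    """Default 'spam' algorithm: sc = spam count, hc = ham count, rh = required hits."""
    sr = sc / (hc + 1)          # spam ratio
    t = int(sr / 100)           # how many hundreds?
    frac = (sr % 100) / 100     # ASSUMPTION: remainder is scaled down by 100
    if t > 0:
        tscore = t + frac
    elif sr > 9:
        tscore = 1 + frac
    else:
        tscore = sr / 10
    # ASSUMPTION: the caps are applied as a cascade, first match wins.
    if tscore == 0:
        tscore = 0.050
    elif tscore > 3 and (hc > 0 or sc < 1000):
        tscore = 3.0
    elif tscore > 4 and sc < 10000:
        tscore = 4.0
    elif tscore > 5 and sc < 100000:
        tscore = 5.0
    elif tscore > 6:
        tscore = 6.0
    return tscore * 9.0 / rh


def ham_score(sc, hc, rh):
    """'ham' algorithm: swap the two counts and negate the result."""
    return -spam_score(hc, sc, rh)


def tiered_score(rule_type, sc, hc, rh):
    """spamp / spamg / obfu: floor the score at rh / div when no ham was hit."""
    default = spam_score(sc, hc, rh)
    if hc > 0:                  # any ham: treat as a plain spam rule
        return default
    div = DIVISORS[rule_type].get(len(str(sc)), 0.5)
    return max(default, rh / div)


if __name__ == "__main__":
    print(round(spam_score(2560, 103, 9), 3))
    print(round(spam_score(75, 2, 9), 3))
    print(round(tiered_score("spamg", 7, 0, 5), 3))
```

Under these assumptions, a spamg rule with 7 spam hits and no ham at $rh = 5 scores $rh/3, which agrees with the "lowest score with no ham" note in the spamg entry. If the remainder is not meant to be scaled by 100, only the `frac` line needs to change.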

These rules seem to cover almost all situations I have in my custom
rules file. The only significant exceptions are rules like:

header    RM_sw_Free             Subject =~ /\bfree\b/i
describe  RM_sw_Free             Subject includes word suggesting spammer
score     RM_sw_Free             1.246  # 2560s/103h of 91714 corpus (74113s/17601h) 01/24/04
header    RM_sw_FreeBang         Subject =~ /\bFree\!/i
describe  RM_sw_FreeBang         Subject includes Free! exclamation
score     RM_sw_FreeBang         1.250  # 75s/2h of 91714 corpus (74113s/17601h) 01/24/04

Every email hit by the second rule is also hit by the first. Therefore
one rule or the other should have its score reduced to take this into
account. Any suggestions?

Actually, this case is too simple, since I can rewrite this as follows to
avoid the duplication, but the general question applies to other, more
complicated rule combinations.
header    RM_sw_Free             Subject =~ /(?!\bfree\!)\bfree\W/i
describe  RM_sw_Free             Subject includes word suggesting spammer
score     RM_sw_Free             1.246  #
header    RM_sw_FreeBang         Subject =~ /\bFree\!/i
describe  RM_sw_FreeBang         Subject includes Free! exclamation
score     RM_sw_FreeBang         1.250  # 75s/2h of 91714 corpus (74113s/17601h) 01/24/04
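
The overlap between these patterns can be checked outside SpamAssassin. The sketch below runs the three regexes from the rules above through Python's re module against a few invented subject lines. The constant names, the hits() helper and the sample subjects are all made up for illustration, and Python's regex engine stands in for Perl's here, which is safe only because these particular patterns use syntax both engines share.

```python
import re

# Patterns copied from the rules above; /i becomes re.IGNORECASE.
FREE = re.compile(r"\bfree\b", re.IGNORECASE)
FREE_BANG = re.compile(r"\bFree\!", re.IGNORECASE)
FREE_REWRITTEN = re.compile(r"(?!\bfree\!)\bfree\W", re.IGNORECASE)

# Hypothetical subject lines, not taken from any corpus.
SUBJECTS = [
    "Get it FREE!",
    "Free shipping today",
    "Feel carefree",
    "Totally free",
]


def hits(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


if __name__ == "__main__":
    for subject in SUBJECTS:
        print(
            subject,
            hits(FREE, subject),
            hits(FREE_BANG, subject),
            hits(FREE_REWRITTEN, subject),
        )
```

In this test every subject matched by the FreeBang pattern is also matched by the original Free pattern, while the rewritten Free pattern and FreeBang never match the same subject. One side effect shows up too: the rewritten pattern does not match "Totally free", because \W needs a character after "free" and the bare Python string has none. Whether that matters inside SpamAssassin depends on how the header value is handed to the regex, which this sketch does not model.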

The algorithm is written into my masscheck script right now as a
combination of bash and bc, but I figure once it's finalized I should be
able to recode it in perl, something like:
> # perl stdscore.pl $type $rh $sc $hc
> 2.345

Thanks for any feedback and suggestions.

Bob Menschel
