Joe Emenaker wrote:

Florin Andrei wrote:

Does anyone have a generic rule to match "stuck keyboard" spam like this one?

"Baaaargaiiiiin baaaasemeeent priiiiiiicing for viaaaaagraaaa"


This regexp...

       (\w)\1{2}

should catch any word where the same character is repeated three times or more. Change the "2" to whatever you want and it will match that number PLUS one. So "(\w)\1{4}" would match 5 chars in a row.

You need the parens to sample the matched character and then it is reused with the "\1". You can't just say "\w{5}" because it would match ANY 5 word chars.
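To make the difference between the backreference and a plain quantifier concrete, here is a small sketch in Python rather than Perl (the `re` module handles `(\w)\1{2}` the same way for this purpose). The names `REPEAT3`, `ANY3` and `has_repeat` are made up for the illustration.

```python
import re

# (\w)\1{2}: capture one word character, then require two more copies of it.
REPEAT3 = re.compile(r"(\w)\1{2}")

# \w{3}: any three word characters in a row, not necessarily the same one.
ANY3 = re.compile(r"\w{3}")


def has_repeat(text: str) -> bool:
    """Return True if some word character occurs three or more times in a row."""
    return REPEAT3.search(text) is not None


if __name__ == "__main__":
    for word in ("Baaaargaiiiiin", "bookkeeper", "10000"):
        print(word, has_repeat(word), ANY3.search(word) is not None)
```

Note that `\w` also covers digits and the underscore, so a number such as "10000" counts as a repeat as well. Using `[a-z]` with a case-insensitive match instead would restrict the test to letters.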

You probably want to give some points to any word with three or more (since a scan through my spell dict didn't find ANY English words with more than two of the same char in a row), and then some more points for one with four or more... and then some for five, etc.
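The "more points for longer runs" idea could be sketched roughly like this. It is only an illustration in Python: the function names and the `per_level` value are invented here, and in SpamAssassin itself you would write one rule per run length and give each its own score.

```python
import re

# A run of the same word character, three or more long (greedy, so maximal).
RUN = re.compile(r"(\w)\1{2,}")


def longest_run(word: str) -> int:
    """Length of the longest run of 3+ identical characters, or 0 if none."""
    return max((len(m.group(0)) for m in RUN.finditer(word)), default=0)


def stuck_key_points(text: str, per_level: float = 0.5) -> float:
    """Hypothetical scoring: per_level for a run of 3, twice that for 4, etc."""
    total = 0.0
    for word in text.split():
        run = longest_run(word)
        if run >= 3:
            total += per_level * (run - 2)
    return total


if __name__ == "__main__":
    print(stuck_key_points("baaaad deal"))
    print(stuck_key_points("hello world"))
```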

I've written a few rules that seem to work. They haven't had very thorough testing yet, but they might do for now. We SARE Ninjas will probably work on this a bit more and add them to our sets shortly.

Jesse Houwing
SARE Ninja
http://www.rulesemporium.com/

For now these are the rules including their results:

Mon Jun 21 19:37:21 2004 -- masscheck.27.sh -- beginning test of 00_keystuck.cf

Section 1 -- Emails flagged as spam

Spam identified as spam by these rules: 10 of 10643
Ham  identified as spam by these rules: 0 of 23351


Section 2 -- Rules tested

body sare_t_keystuck_a /[a-z]+([a-z])\1{3}[a-z]+\s.{0,10}[a-z]+([a-z])\2{3}[a-z]+\s.{0,10}[a-z]+([a-z])\3{3}[a-z]+/i
body sare_t_keystuck_b /[a-z]+([a-z])\1{4}[a-z]+\s.{0,10}[a-z]+([a-z])\2{4}[a-z]+\s.{0,10}[a-z]+([a-z])\3{4}[a-z]+/i
body sare_t_keystuck_c /[a-z]+([a-z])\1{3}[a-z]+\s.{0,10}[a-z]+([a-z])\2{3}[a-z]+\s.{0,10}[a-z]+([a-z])\3{3}[a-z]+\s.{0,10}[a-z]+([a-z])\4{3}[a-z]+/i
body sare_t_keystuck_d /[a-z]+([a-z])\1{4}[a-z]+\s.{0,10}[a-z]+([a-z])\2{4}[a-z]+\s.{0,10}[a-z]+([a-z])\3{4}[a-z]+\s.{0,10}[a-z]+([a-z])\4{4}[a-z]+/i

score sare_t_keystuck_a 1
score sare_t_keystuck_b 1
score sare_t_keystuck_c 1
score sare_t_keystuck_d 1
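If you want to try the first pattern outside SpamAssassin, a rough Python sketch follows. It copies the sare_t_keystuck_a regex as-is and assumes Python's `re` treats these constructs like Perl does. The helper name `matches_keystuck_a` and the plain-text sample sentence are made up for the test.

```python
import re

# sare_t_keystuck_a: three words close together, each containing a letter
# repeated four times in a row with at least one letter on either side.
KEYSTUCK_A = re.compile(
    r"[a-z]+([a-z])\1{3}[a-z]+\s.{0,10}"
    r"[a-z]+([a-z])\2{3}[a-z]+\s.{0,10}"
    r"[a-z]+([a-z])\3{3}[a-z]+",
    re.IGNORECASE,
)


def matches_keystuck_a(text: str) -> bool:
    """Return True if the text would trigger the sare_t_keystuck_a pattern."""
    return KEYSTUCK_A.search(text) is not None


if __name__ == "__main__":
    spam = "Baaaargaiiiiin baaaasemeeent priiiiiiicing for viaaaaagraaaa"
    ham = "The committee will meet again on Tuesday afternoon"
    print(matches_keystuck_a(spam))
    print(matches_keystuck_a(ham))
```

Because each capture group gets its own backreference (`\1`, `\2`, `\3`), the three words may each stretch a different letter. The `\s.{0,10}` between them lets up to ten arbitrary characters sit between one stretched word and the next.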

Section 3 -- Frequencies Log
(First numeric frequencies, followed by percentage frequencies)

OVERALL     SPAM      HAM     S/O    RANK   SCORE  NAME
 33994    10643    23351    0.313   0.00    0.00  (all messages)
    13       13        0    1.000   1.00   1.00  sare_t_keystuck_b
    13       13        0    1.000   1.00   1.00  sare_t_keystuck_c
    11       11        0    1.000   1.00   1.00  sare_t_keystuck_d
    17       16        1    0.972   0.90   1.00  sare_t_keystuck_a

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
 33994    10643    23351    0.313   0.00    0.00  (all messages)
100.000  31.3085  68.6915    0.313   0.00    0.00  (all messages as %)
 0.038   0.1221   0.0000    1.000   1.00    1.00  sare_t_keystuck_b
 0.038   0.1221   0.0000    1.000   1.00    1.00  sare_t_keystuck_c
 0.032   0.1034   0.0000    1.000   1.00    1.00  sare_t_keystuck_d
 0.050   0.1503   0.0043    0.972   0.90    1.00  sare_t_keystuck_a
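For anyone wondering how to read the S/O column: it looks like the share of a rule's hits that are spam, after scaling for the different corpus sizes, that is SPAM% / (SPAM% + HAM%). That formula is an assumption on my part, not taken from the mass-check script, but it reproduces the 0.972 shown for sare_t_keystuck_a. A small Python sketch (the function name is invented):

```python
def spam_over_overall(spam_hits, spam_total, ham_hits, ham_total):
    """Assumed S/O formula: SPAM% / (SPAM% + HAM%), or None with no hits."""
    spam_pct = 100.0 * spam_hits / spam_total
    ham_pct = 100.0 * ham_hits / ham_total
    if spam_pct + ham_pct == 0:
        return None
    return spam_pct / (spam_pct + ham_pct)


if __name__ == "__main__":
    # sare_t_keystuck_a: 16 spam hits of 10643, 1 ham hit of 23351.
    print(round(spam_over_overall(16, 10643, 1, 23351), 3))
    # sare_t_keystuck_b: 13 spam hits, no ham hits.
    print(spam_over_overall(13, 10643, 0, 23351))
```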


Section 4 -- Recommended Scores and Hit Log

Mon Jun 21 20:06:13 2004 -- masscheck.27.sh -- completed test of 00_keystuck.cf






