Joe Emenaker wrote:
Florin Andrei wrote:
Does anyone have a generic rule to match "stuck keyboard" spam like this one?
"Baaaargaiiiiin baaaasemeeent priiiiiiicing for viaaaaagraaaa"
This regexp...
(\w)\1{2}
should catch any word in which the same character appears three or more
times in a row. Change the "2" to whatever you want and it will match that
number PLUS one. So "(\w)\1{4}" would match 5 chars in a row.
You need the parens to capture the matched character, which is then
reused via the "\1". You can't just say "\w{5}" because it would
match ANY 5 word chars.
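To see the difference between the backreference and a plain repeat count, here is a minimal sketch in Python rather than Perl (the "\1" syntax behaves the same way in Python's re module). The names REPEATED and ANY_THREE and the sample words are made up for illustration.

```python
import re

# (\w) captures one word character; \1{2} demands that same character
# two more times, so the whole pattern needs a run of at least three.
REPEATED = re.compile(r"(\w)\1{2}")

# Without the backreference, \w{3} is satisfied by ANY three word chars.
ANY_THREE = re.compile(r"\w{3}")

for word in ["viaaagra", "bookkeeper", "abc"]:
    print(word, bool(REPEATED.search(word)), bool(ANY_THREE.search(word)))
```

Only the first word contains a tripled character, while "\w{3}" accepts all three. Keep in mind that "\w" also covers digits and the underscore, so a number such as 10000 trips the pattern as well.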
You probably want to give some points to any word with three or more
(since a scan through my spell dict didn't find ANY English words with
more than two of the same char in a row), and then some more points
for one with four or more... and then some for five, etc.
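The graded scoring idea could be prototyped outside SpamAssassin along these lines. This is only a sketch: the function names and the 0.5 points per extra repeat are invented for illustration and are not tuned or tested values.

```python
import re
from itertools import groupby


def longest_run(word: str) -> int:
    """Length of the longest run of one repeated character, ignoring case."""
    return max((len(list(group)) for _, group in groupby(word.lower())), default=0)


def stuck_key_points(text: str, per_step: float = 0.5) -> float:
    """Add per_step points for every repeat beyond two in each word's longest run.

    A run of three earns per_step, a run of four earns 2 * per_step, and so on.
    The per_step default is an arbitrary placeholder, not a recommended score.
    """
    total = 0.0
    for word in re.findall(r"[A-Za-z]+", text):
        run = longest_run(word)
        if run >= 3:
            total += per_step * (run - 2)
    return total


print(stuck_key_points("cheeeap viaaaagra"))
```

Ordinary words with doubled letters score nothing, and each stretched word adds more the longer its run gets, which mirrors the "some points for three, more for four" scheme.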
I've written a few rules that seem to work. They haven't had very
thorough testing yet, but they might do for now. We SARE Ninjas will
probably work on this a bit more and add them to our sets shortly.
Jesse Houwing
SARE Ninja
http://www.rulesemporium.com/
For now these are the rules including their results:
Mon Jun 21 19:37:21 2004 -- masscheck.27.sh -- beginning test of
00_keystuck.cf
Section 1 -- Emails flagged as spam
Spam identified as spam by these rules: 10 of 10643
Ham identified as spam by these rules: 0 of 23351
Section 2 -- Rules tested
body sare_t_keystuck_a /[a-z]+([a-z])\1{3}[a-z]+\s.{0,10}[a-z]+([a-z])\2{3}[a-z]+\s.{0,10}[a-z]+([a-z])\3{3}[a-z]+/i
body sare_t_keystuck_b /[a-z]+([a-z])\1{4}[a-z]+\s.{0,10}[a-z]+([a-z])\2{4}[a-z]+\s.{0,10}[a-z]+([a-z])\3{4}[a-z]+/i
body sare_t_keystuck_c /[a-z]+([a-z])\1{3}[a-z]+\s.{0,10}[a-z]+([a-z])\2{3}[a-z]+\s.{0,10}[a-z]+([a-z])\3{3}[a-z]+\s.{0,10}[a-z]+([a-z])\4{3}[a-z]+/i
body sare_t_keystuck_d /[a-z]+([a-z])\1{4}[a-z]+\s.{0,10}[a-z]+([a-z])\2{4}[a-z]+\s.{0,10}[a-z]+([a-z])\3{4}[a-z]+\s.{0,10}[a-z]+([a-z])\4{4}[a-z]+/i
score sare_t_keystuck_a 1
score sare_t_keystuck_b 1
score sare_t_keystuck_c 1
score sare_t_keystuck_d 1
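For a quick look at what sare_t_keystuck_a is after, the pattern can be tried in Python before it goes anywhere near a mass-check. This is a rough sketch only: Python's re is not Perl and SpamAssassin preprocesses the body before matching, so the real test remains "spamassassin --lint" plus a mass-check run. The stretch() helper and the sample strings are made up for illustration.

```python
import re

# Pattern copied from sare_t_keystuck_a; re.IGNORECASE stands in for /i.
KEYSTUCK_A = re.compile(
    r"[a-z]+([a-z])\1{3}[a-z]+\s.{0,10}"
    r"[a-z]+([a-z])\2{3}[a-z]+\s.{0,10}"
    r"[a-z]+([a-z])\3{3}[a-z]+",
    re.IGNORECASE,
)


def stretch(word: str, index: int, times: int) -> str:
    """Hypothetical helper: repeat the character at index so it appears times in a row."""
    return word[:index] + word[index] * times + word[index + 1:]


# Three stretched words in a row, each with a run of four identical letters.
spam_like = " ".join(
    [stretch("bargain", 1, 4), stretch("basement", 1, 4), stretch("pricing", 2, 4)]
)
ham_like = "the bookkeeper will succeed in getting a good bargain"

print(spam_like, bool(KEYSTUCK_A.search(spam_like)))
print(ham_like, bool(KEYSTUCK_A.search(ham_like)))
```

Each stretched word needs at least one ordinary letter on either side of its run, and the ".{0,10}" gap lets one short normal word sit between two stretched ones. Requiring three such words in sequence is what makes the rule much stricter than a single "(\w)\1{2}".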
Section 3 -- Frequencies Log
(First numeric frequencies, followed by percentage frequencies)
OVERALL   SPAM    HAM    S/O  RANK  SCORE  NAME
  33994  10643  23351  0.313  0.00   0.00  (all messages)
     13     13      0  1.000  1.00   1.00  sare_t_keystuck_b
     13     13      0  1.000  1.00   1.00  sare_t_keystuck_c
     11     11      0  1.000  1.00   1.00  sare_t_keystuck_d
     17     16      1  0.972  0.90   1.00  sare_t_keystuck_a
OVERALL%    SPAM%     HAM%    S/O  RANK  SCORE  NAME
   33994    10643    23351  0.313  0.00   0.00  (all messages)
 100.000  31.3085  68.6915  0.313  0.00   0.00  (all messages as %)
   0.038   0.1221   0.0000  1.000  1.00   1.00  sare_t_keystuck_b
   0.038   0.1221   0.0000  1.000  1.00   1.00  sare_t_keystuck_c
   0.032   0.1034   0.0000  1.000  1.00   1.00  sare_t_keystuck_d
   0.050   0.1503   0.0043  0.972  0.90   1.00  sare_t_keystuck_a
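For anyone wanting to check the percentage rows against the raw counts, the arithmetic can be sketched as follows. It assumes S/O for a rule is SPAM% / (SPAM% + HAM%), which is my reading of the column and does reproduce the 0.972 shown for sare_t_keystuck_a; the function names are made up.

```python
SPAM_TOTAL = 10643  # spam messages in the corpus, from the "(all messages)" row
HAM_TOTAL = 23351   # ham messages in the corpus, from the same row


def percent(hits: int, total: int) -> float:
    """Hits as a percentage of the given corpus size."""
    return 100.0 * hits / total


def so_ratio(spam_hits: int, ham_hits: int) -> float:
    """Assumed S/O for one rule: SPAM% / (SPAM% + HAM%)."""
    spam_pct = percent(spam_hits, SPAM_TOTAL)
    ham_pct = percent(ham_hits, HAM_TOTAL)
    return spam_pct / (spam_pct + ham_pct)


# sare_t_keystuck_a hit 16 spam and 1 ham message.
print(f"{percent(16, SPAM_TOTAL):.4f} {percent(1, HAM_TOTAL):.4f} {so_ratio(16, 1):.3f}")
```

Because each percentage is taken against its own corpus, a single ham hit out of 23351 pulls S/O down only slightly even though ham outnumbers spam by more than two to one here.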
Section 4 -- Recommended Scores and Hit Log
Mon Jun 21 20:06:13 2004 -- masscheck.27.sh -- completed test of
00_keystuck.cf