Keith C. Ivey wrote:
> Note that \w matches letters and numbers, so that regex will match the string "000",
Ah... true. In fact, I almost suggested.... uh... what is it... I think it's "\l" for lowercase chars or something, but I figured that we might see spam with uppercase stuck keys. And I felt that [a-zA-Z] was a little ugly. But I think I've changed my mind now.
> Even with that change, it will match "www.google.com" and "American Automobile Association (AAA)" and "Henry VIII" and "Baadasssss! (2004)" and "page xxxii" and various other non-spam-indicating words, so be very careful with constructing and scoring such a rule.
Well, that's kinda why I suggested a few points for 3-in-a-row, then some more for 4-in-a-row, etc.
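To make the tiered-scoring idea concrete, here is a rough sketch in Python rather than as an actual SpamAssassin rule. The letters-only pattern is an assumption about what the rule would look like after the [a-zA-Z] change, and the point values and function names are made up purely for illustration.

```python
import re

# Assumed letters-only form of the rule; the earlier regex is not quoted in this message.
LETTER_RUN = re.compile(r"([a-zA-Z])\1{2,}")

# Hypothetical point values, for illustration only.
POINTS_FOR_THREE = 0.5
POINTS_PER_EXTRA = 0.5


def longest_run(text):
    """Length of the longest run of one repeated letter (0 if there is no run of 3+)."""
    return max((len(m.group(0)) for m in LETTER_RUN.finditer(text)), default=0)


def tiered_score(text):
    """A few points for 3-in-a-row, then some more for each extra repeat."""
    run = longest_run(text)
    if run < 3:
        return 0.0
    return POINTS_FOR_THREE + POINTS_PER_EXTRA * (run - 3)


if __name__ == "__main__":
    for sample in ("000", "www.google.com", "Henry VIII", "Baadasssss! (2004)"):
        print(sample, tiered_score(sample))
```

With this kind of scoring the legitimate examples above still hit, but a bare 3-in-a-row only earns the lowest tier, so the damage from a false positive stays small.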
You can also make a rule which only catches three repetitions in the same word:
([a-z])\1{2}\w([a-z])\2{2}
Anyway... this all triggered a hunch... which I just checked on...
I scanned my bayes db for all 3-or-more repetitions of letters which had high spam probabilities. Then, I had a perl script go through the list and tell me which letter was the first one repeated in the word. And, like I suspected, there were only 5 of them. Wanna guess which 5? :P
63 e
47 o
28 a
23 i
8 u
The numbers are the number of occurrences. And, keep in mind that I didn't convert everything to lowercase. In other words, the spammers, at this time, don't seem to be doing the stuck-key trick with uppercase letters.
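For anyone who wants to repeat the count, the tally step could look roughly like the sketch below. It is in Python rather than the Perl script I actually used, it assumes you have already pulled the high-spam-probability tokens out of the Bayes db into a plain list (that part is not shown), and the function name and demo tokens are hypothetical.

```python
import re
from collections import Counter

# A letter repeated three or more times in a row; case is deliberately left untouched.
REPEAT = re.compile(r"([A-Za-z])\1{2,}")


def first_repeated_letters(tokens):
    """For each token, count the first letter that appears 3+ times in a row."""
    counts = Counter()
    for token in tokens:
        match = REPEAT.search(token)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical tokens for illustration only; not taken from a real Bayes database.
    demo_tokens = ["freee", "looow", "looook", "AAA", "plain"]
    for letter, n in first_repeated_letters(demo_tokens).most_common():
        print(n, letter)
```

Because nothing is lowercased, an uppercase run such as "AAA" would show up as its own entry, which is how you can tell whether spammers are using uppercase stuck keys.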
So, now, we could change the regexp to something like:
([aeiou])\1{2}
which wouldn't catch "www.google.com", and it wouldn't catch "AAA" and even "Henry VIII" is off the shit-list. I guess we won't all have to sing "I am Henry the VIIIth, I'm spam, I am...." :P
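A quick sanity check of the vowel-only pattern against the examples from earlier in the thread, sketched with Python's re module instead of a real rule. The match is case-sensitive, and the "spammy" test word at the end is invented for illustration.

```python
import re

# The vowel-only pattern: one lowercase vowel followed by two more copies of itself.
VOWEL_RUN = re.compile(r"([aeiou])\1{2}")

# Legitimate strings mentioned earlier in the thread.
LEGIT = [
    "000",
    "www.google.com",
    "American Automobile Association (AAA)",
    "Henry VIII",
    "Baadasssss! (2004)",
    "page xxxii",
]


def is_hit(text):
    """True if the text contains a lowercase vowel three times in a row."""
    return VOWEL_RUN.search(text) is not None


if __name__ == "__main__":
    for sample in LEGIT + ["freeee"]:  # "freeee" is a made-up stuck-key example
        print(sample, is_hit(sample))
```

Note that "Henry VIII" and "AAA" only escape because the pattern is lowercase-only; with a case-insensitive flag they would match again.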
- Joe
