> Anybody know of a rule for the long strings of random words that don't
> contain words like 'the, to, a, an, then, and' and those sort of words?
> I'd

Here are two sets.  The first one checks only for long strings without
punctuation, and works pretty well.  The second set is a modified version
that allows some punctuation, specifically to catch some recent spams that
made it through the first set by including random punctuation.  This second
set hasn't been through a mass check, and for all I know it may be hitting
all kinds of legit mail as well as the spam.

# match Bayes-poison lists of lowercase words without articles or common
# prepositions

body  PT_WORDLIST_10  /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z]{4,12}\s+){10}/
describe PT_WORDLIST_10  string of 10+ random words
score  PT_WORDLIST_10  1.0

body  PT_WORDLIST_13  /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z]{4,12}\s+){13}/
describe PT_WORDLIST_13  string of 13+ random words
score  PT_WORDLIST_13  3.0

body  PT_WORDLIST_30  /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z]{4,12}\s+){30}/
describe PT_WORDLIST_30  string of 30+ random words
score  PT_WORDLIST_30  10.0
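
If you want to poke at the first set outside SpamAssassin, here is a rough
Python sketch of the PT_WORDLIST_10 pattern.  The sample strings and the
hits() helper are made up for illustration, and Python's re module only
approximates what SpamAssassin does (it does not model how body text is
preprocessed before the rules run), so treat it as a quick sanity check
rather than a substitute for a mass check.

```python
import re

# Same pattern as PT_WORDLIST_10, transcribed for Python's re module.
STOP = r"(?:from|that|have|this|were|with)"
WORDLIST_10 = re.compile(r"(?:\b(?!" + STOP + r"\b)[a-z]{4,12}\s+){10}")

def hits(text):
    """Return True if the text contains a run of 10+ 'random' words."""
    return WORDLIST_10.search(text) is not None

# Hypothetical samples: a Bayes-poison style word list and ordinary prose.
poison = ("crimson ladder velvet anchor pebble lantern meadow "
          "copper thimble garden walnut ")
prose = ("we were going to send this along with the notes that you "
         "asked for from the meeting today ")

print(hits(poison))
print(hits(prose))
```

Note that the {4,12} length limit already breaks a run on short words like
'the', 'to', 'a' and 'and', so the explicit exclusion list only has to cover
common four-letter words.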

# modified version: match Bayes-poison lists of lowercase words without
# articles or common prepositions, allowing some punctuation (. , - ;)

body  XX_WORDLIST_10  /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z\.\,\-\;]{4,18}\s+){10}/
describe XX_WORDLIST_10  string of 10+ random words
score  XX_WORDLIST_10  1.0

body  XX_WORDLIST_13  /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z\.\,\-\;]{4,18}\s+){13}/
describe XX_WORDLIST_13  string of 13+ random words
score  XX_WORDLIST_13  3.0

body  XX_WORDLIST_30  /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z\.\,\-\;]{4,18}\s+){30}/
describe XX_WORDLIST_30  string of 30+ random words
score  XX_WORDLIST_30  10.0
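
To show what the punctuation change buys, here is a small Python sketch
comparing the two 10-word patterns on a word list with random punctuation
mixed in.  The sample string is invented for illustration, and Python's re
module is only standing in for SpamAssassin's Perl regex engine here, so
this says nothing about false positives on real mail.

```python
import re

STOP = r"(?:from|that|have|this|were|with)"

# PT_WORDLIST_10: plain lowercase words only.
PT_10 = re.compile(r"(?:\b(?!" + STOP + r"\b)[a-z]{4,12}\s+){10}")

# XX_WORDLIST_10: also allows . , - ; inside each token.
XX_10 = re.compile(r"(?:\b(?!" + STOP + r"\b)[a-z\.\,\-\;]{4,18}\s+){10}")

# Hypothetical sample: a random word list with punctuation sprinkled in.
punctuated = ("crimson, ladder velvet; anchor pebble. lantern meadow, "
              "copper thimble garden; walnut ")

pt_match = PT_10.search(punctuated) is not None
xx_match = XX_10.search(punctuated) is not None

print("PT_WORDLIST_10:", pt_match)
print("XX_WORDLIST_10:", xx_match)
```

Because each token still has to start at a \b word boundary, a token that
begins with punctuation (say "-garden") would break the run even in the
second set.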

