> Anybody know of a rule for the long strings of random words that don't
> contain words like 'the, to, a, an, then, and' and those sort of words?
Here are two sets. The first checks only for long strings of words without
punctuation, and works pretty well. The second is a modified version that
allows some punctuation, specifically to catch some recent spams that made
it through the first set by including random punctuation. This second set
hasn't been through a mass check, and for all I know it may be hitting all
kinds of legit stuff as well as the spam.
# match Bayes-poison lists of lowercase words without articles or common prepositions
body PT_WORDLIST_10 /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z]{4,12}\s+){10}/
describe PT_WORDLIST_10 string of 10+ random words
score PT_WORDLIST_10 1.0

body PT_WORDLIST_13 /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z]{4,12}\s+){13}/
describe PT_WORDLIST_13 string of 13+ random words
score PT_WORDLIST_13 3.0

body PT_WORDLIST_30 /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z]{4,12}\s+){30}/
describe PT_WORDLIST_30 string of 30+ random words
score PT_WORDLIST_30 10.0
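
A quick way to sanity-check the first set outside SpamAssassin is to run the
same pattern over a couple of sample strings. The sketch below is only an
illustration: it uses Python's re module instead of Perl (the two should
behave the same for these constructs on plain ASCII text), and the helper
name wordlist_hit and both sample strings are made up. It shows that words
shorter than four letters ("the", "to", "and") break a run on their own
because of {4,12}, while the lookahead handles the common four-letter words.

```python
import re

# One "word" as the PT_WORDLIST rules define it: 4-12 lowercase letters,
# not one of the listed common words, followed by whitespace.
WORD = r"(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z]{4,12}\s+)"


def wordlist_hit(text, count):
    """Return True if text contains `count` qualifying words in a row.

    Hypothetical helper for testing only; not part of SpamAssassin.
    """
    return re.search(WORD + "{%d}" % count, text) is not None


# Made-up samples, not taken from real mail.
poison = ("marble cactus window lantern pebble orchard "
          "velvet harbor copper meadow thistle ")
prose = ("the quick fox went over to the barn and then "
         "came back with a very large stick ")

print(wordlist_hit(poison, 10))  # 11 qualifying words in a row
print(wordlist_hit(poison, 13))  # not enough words for the 13 rule
print(wordlist_hit(prose, 10))   # short and common words break the runs
```

This only exercises the regex itself. It says nothing about how often the
rules hit real ham, which is what a mass check is for.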
# same as above, but allow some punctuation (. , - ;) inside the words
body XX_WORDLIST_10 /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z\.\,\-\;]{4,18}\s+){10}/
describe XX_WORDLIST_10 string of 10+ random words
score XX_WORDLIST_10 1.0

body XX_WORDLIST_13 /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z\.\,\-\;]{4,18}\s+){13}/
describe XX_WORDLIST_13 string of 13+ random words
score XX_WORDLIST_13 3.0

body XX_WORDLIST_30 /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z\.\,\-\;]{4,18}\s+){30}/
describe XX_WORDLIST_30 string of 30+ random words
score XX_WORDLIST_30 10.0
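
To see why the second set exists, here is a small comparison of the two
patterns on a word list with punctuation sprinkled in. Again this is just a
sketch under assumptions: Python's re stands in for Perl, the helper name
hits is hypothetical, and the sample string is invented rather than taken
from a real spam.

```python
import re

STOP = r"(?!(?:from|that|have|this|were|with)\b)"
# First set: letters only. Second set: letters plus . , - ;
PT_WORD = r"(?:\b" + STOP + r"[a-z]{4,12}\s+)"
XX_WORD = r"(?:\b" + STOP + r"[a-z\.\,\-\;]{4,18}\s+)"


def hits(word_pattern, text, count):
    """Return True if `count` words matching word_pattern occur in a row.

    Hypothetical helper for testing only; not part of SpamAssassin.
    """
    return re.search(word_pattern + "{%d}" % count, text) is not None


# Made-up poison list where every other word carries punctuation.
sample = ("marble, cactus window. lantern pebble; orchard "
          "velvet, harbor copper. meadow thistle, ")

print(hits(PT_WORD, sample, 10))  # punctuation splits every run
print(hits(XX_WORD, sample, 10))  # punctuation now counts as part of a word
```

The wider character class is also what makes the second set riskier:
ordinary sentences contain commas and periods too, so they can now form
longer runs than they could under the first set.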