On Wed, 2011-06-08 at 16:11 -0700, John Hardin wrote:
> On Wed, 8 Jun 2011, Martin Gregorie wrote:
>
> > On Wed, 2011-06-08 at 07:53 -0700, John Hardin wrote:
> >> How about this (untested):
> >>
> >> header __SUBJ_BROKEN_WORD Subject =~ /\s(?!i[PT])[a-z]{1,3}[A-Z][a-z]{2}/
> >> tflags __SUBJ_BROKEN_WORD multiple
> >> meta __SUBJ_BROKEN_WORDS __SUBJ_BROKEN_WORD > 2
> >
> > I tested this as well as my own variant:
> >
> > describe MG_SPLIT322 Two or more words obfuscated with a "xxx xx xx"
> > split
> > body __MG_SPL322 /\b[a-z]{3} [a-z]{2} [a-z]{2}\b/i
> > tflags __MG_SPL322 multiple
> > meta MG_SPLIT322 __MG_SPL322 > 2
> > score MG_SPLIT322 4
> >
> > against a private collection of 491 spam messages which I use to test my
> > private rules.
> >
> > I got 8 FPs (1.6%) with either regex because both hit on fairly common
> > text such as "Log in to", "rolling out up to", "want you to be" and "and
> > so on",
>
> My version shouldn't hit on _any_ of those examples, it's
> intentionally case-sensitive.
>
I wrote my rule without looking at all closely at yours and did it
w.r.t. the OP-supplied body text. This didn't seem to have any case
pattern that I could match with a fairly simple regex.
I was totally gobsmacked to find that your regex, when applied to body
text and run against my spam corpus, caused my rule to fire on exactly
the same messages as my regex did. I probably wouldn't have posted about
it if that hadn't happened.

Martin
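
For anyone who wants to check the four quoted phrases against both patterns outside SpamAssassin, here is a minimal sketch in Python. It is not part of either rule set. The names BROKEN_WORD, SPLIT_322, hits and check are made up for illustration, and it assumes Python's re module treats these two particular patterns the same way Perl does. It only looks at the short phrases quoted in the thread, so it says nothing about why both rules fired on the same messages in the corpus.

```python
import re

# John Hardin's pattern as quoted in the thread (deliberately case-sensitive).
BROKEN_WORD = re.compile(r"\s(?!i[PT])[a-z]{1,3}[A-Z][a-z]{2}")

# Martin Gregorie's pattern as quoted in the thread; Perl's /i flag
# is expressed here as re.IGNORECASE.
SPLIT_322 = re.compile(r"\b[a-z]{3} [a-z]{2} [a-z]{2}\b", re.IGNORECASE)

# The four phrases named in the thread as false-positive triggers.
PHRASES = ["Log in to", "rolling out up to", "want you to be", "and so on"]


def hits(pattern, text):
    """Count non-overlapping matches of pattern in text.

    Assumption: this approximates the per-message hit count that
    'tflags multiple' feeds into the meta rule.
    """
    return len(pattern.findall(text))


def check(phrases=PHRASES):
    """Map each phrase to (BROKEN_WORD hits, SPLIT_322 hits)."""
    return {p: (hits(BROKEN_WORD, p), hits(SPLIT_322, p)) for p in phrases}


if __name__ == "__main__":
    for phrase, (broken, split) in check().items():
        print(f"{phrase!r}: broken_word={broken} split322={split}")
```

Note that both meta rules need more than two hits in one message before they fire, so a single phrase like these would not trigger either rule on its own.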