Bob Proulx wrote on 02.11.2007 18:24:
body FRT_OPPORTUN1 /<inter SP2><post P2>(?!opportun)<O><P><P><O><R><T><U><N>/I
body FRT_OPPORTUN2 /<inter W0><post P2>(?!opportun)<O><P><P><O><R><T><U><N>/I
Huh? How are those rules matching? I am missing something. That
can't be the right rule that is being hit here. Can someone educate me
as to what is happening here?
This rule is preprocessed by the ReplaceTags plugin. This plugin is kind
of a simple macro expander. Words between <> are macros which are
expanded by this plugin. <P> expands to [p\xfe] according to line 2808
in 72_active.cf, for example. This is done to ease rule creation for
obfuscated words.
I don't know if or how it is possible to output the processed rule, but
I guess that <post P2> is appended after every normal expansion. So <P>
becomes <P><P2>, and since P2 expands to {1,2}, <P> finally expands to
[p\xfe]{1,2}. That matches one or two occurrences of p or \xfe. There
are two <P> tags in a row (<P><P>), so pp, ppp and pppp all match this
part of the rule.
On the other hand, I don't know if "oppertun" matches this rule,
although it is given this description:
describe FRT_OPPORTUN1 ReplaceTags: Oppertun (1)
The second O expands to
[go0\xd2\xd3\xd4\xd5\xd6\xd8\xf0\xf2\xf3\xf4\xf5\xf6\xf8] and there is
no e in it.
This rule will match only an obfuscated "opportun". Because of the
negative look-ahead (?!opportun), it never matches a plain "opportun"
as in "opportunity". An "oppportunity" (three p's) is not rejected by
the look-ahead, so it matches the pattern.
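To see the effect of the look-ahead, here is a small Python check against
a hand-expanded pattern. It is a sketch under assumptions: the <O> and
<P> classes are the ones quoted in this message, but <R>, <T>, <U> and
<N> are replaced by the plain letters (their real definitions are not
shown here), the <inter ...> part is left out, and Python's re module
stands in for Perl.

```python
import re

# <O> and <P> as quoted in the message, each followed by the {1,2} post part.
O = r"[go0\xd2\xd3\xd4\xd5\xd6\xd8\xf0\xf2\xf3\xf4\xf5\xf6\xf8]{1,2}"
P = r"[p\xfe]{1,2}"
# Assumption: plain letters instead of the real <R> <T> <U> <N> classes.
REST = "r{1,2}t{1,2}u{1,2}n{1,2}"

rx = re.compile("(?!opportun)" + O + P + P + O + REST, re.IGNORECASE)

def hits(text):
    """True if the hand-expanded rule matches anywhere in text."""
    return rx.search(text) is not None

for word in ("opportunity", "oppportunity", "oppertunity"):
    print(word, hits(word))
```

With these assumptions only "oppportunity" is reported as a hit: plain
"opportunity" is blocked by the look-ahead, and "oppertunity" fails at
the second <O> because e is not in that class.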
Since these rules were assigned such a high score, very little ham (if
any) in the score-generating corpus seems to contain this misspelling.
If I understand the process correctly, the scores are not set manually
but are determined by a lengthy automatic analysis of a big message
corpus, which tries to minimize the scores of known ham and maximize
the scores of known spam as a whole.
What you can do:
- lower the score for these rules manually
- and perhaps send your FP (false positive) to the SA developers so
they can include it in their corpus.