On 12/5/2018 4:50 PM, Grant Taylor wrote:
> On 12/05/2018 02:45 PM, John Hardin wrote:
>> I've added a "too many [ascii][unicode][ascii]" rule based on that
>> but I suspect it will be pretty FP-prone and will be pretty large if
>> we want to avoid whack-a-mole syndrome. For this, normalize + bayes
>> is probably the best bet.
>
> Is it possible to detect when a Unicode code point is being used in
> place of an ASCII / ANSI character specifically to avoid pattern
> detection?  I.e. multiple Unicode code points that represent or are
> otherwise a stand in for an ASCII / ANSI "a"?
>
> Or is keeping up with this list tantamount to whack-a-mole?
>
> I would think that too high of a percentage of Unicode when bog
> standard ASCII / ANSI would suffice would be an indication in and of
> itself.  I'm not seeing how legitimate (non-spam) email would trigger
> a false positive if the percentage was tuned correctly.
>
Yes. Look at KAM.cf, specifically the Replace Tags feature used in the
KAM_CRIM rules, as well as SCC_SHORT_WORDS.  Both are designed to catch
exactly this type of obfuscation: one is more specific (CRIM) and the
other more generic (SHORT_WORDS).
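
For anyone who wants to experiment with the idea outside SpamAssassin,
here is a rough Python sketch of the two approaches discussed in this
thread: folding lookalike code points back to ASCII before matching, and
measuring how many words mix ASCII with non-ASCII letters.  It is only
an illustration and is not how KAM.cf or Replace Tags work internally.
The LOOKALIKES table and the function names are hypothetical, and the
table is deliberately tiny; a real list would be far larger.

```python
import unicodedata

# Hypothetical, deliberately tiny lookalike table (assumption for this sketch).
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
}


def normalize(text):
    """Fold compatibility forms, drop combining marks, map lookalikes to ASCII."""
    out = []
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch):
            continue  # strip accents and other combining marks
        out.append(LOOKALIKES.get(ch, ch))
    return "".join(out)


def mixed_word_ratio(text):
    """Fraction of whitespace-separated words mixing ASCII and non-ASCII letters."""
    words = text.split()
    if not words:
        return 0.0
    mixed = 0
    for word in words:
        letters = [c for c in word if c.isalpha()]
        has_ascii = any(ord(c) < 128 for c in letters)
        has_other = any(ord(c) >= 128 for c in letters)
        if has_ascii and has_other:
            mixed += 1
    return mixed / len(words)
```

Running ordinary pattern rules or Bayes over the output of normalize()
is the "normalize + bayes" approach John mentions.  Note that
mixed_word_ratio() also counts legitimately accented words such as
"café" as mixed, which is the kind of false positive he is worried
about, so any threshold on it would need tuning against real mail.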
