HTML-based accent characters seem to be becoming more popular in my personal
corpus. So far I've seen accents that match this basic regex. It isn't usable
as a rule on its own, mind you, but it should be accounted for in
keyword-based rules.
/&.{1}(acute|uml|ring|grave|circ|tilde);/
A quick and dirty egrep vs corpus:
egrep '&.{1}(acute|uml|ring|grave|circ|tilde);' spam-corpus -c
1817
egrep '&.{1}(acute|uml|ring|grave|circ|tilde);' spam-corpus -v -c
7161007
egrep '&.{1}(acute|uml|ring|grave|circ|tilde);' ham-corpus -c
3
egrep '&.{1}(acute|uml|ring|grave|circ|tilde);' ham-corpus -v -c
2052964
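To put those counts in proportion, here is a small Python sketch that turns them into per-line hit rates. The numbers are copied from the egrep output above; `hit_rate` is just an illustrative helper name, and the rates are per line, not per message. By my arithmetic that works out to roughly 0.025% of spam lines versus roughly 0.00015% of ham lines.

```python
# Line counts copied from the egrep runs above (matching / non-matching lines).
spam_hits, spam_misses = 1817, 7161007
ham_hits, ham_misses = 3, 2052964


def hit_rate(hits: int, misses: int) -> float:
    """Fraction of lines in a corpus that matched the pattern."""
    return hits / (hits + misses)


spam_rate = hit_rate(spam_hits, spam_misses)
ham_rate = hit_rate(ham_hits, ham_misses)

print(f"spam: {spam_rate:.6%} of lines")
print(f"ham:  {ham_rate:.6%} of lines")
print(f"spam/ham ratio: {spam_rate / ham_rate:.0f}x")
```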
Of course, those are just simple matched-line counts. Again, as a generic rule
it isn't useful; however, the technique is being used to evade keyword
matching, such as in the anti-drug custom rules.
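
One way keyword rules could account for this is to fold the accented entities back to plain letters before matching. The following is only a rough Python sketch of that idea, not how any particular filter does it: `strip_accent_entities`, `keyword_hit`, and the sample keyword are hypothetical names made up for illustration, and it assumes the text is available as a decoded string.

```python
import html
import re
import unicodedata

# Same pattern as the egrep above, with .{1} written as a plain dot.
ACCENT_ENTITY = re.compile(r"&.(acute|uml|ring|grave|circ|tilde);")


def strip_accent_entities(text: str) -> str:
    """Decode HTML entities, then drop combining accents (hypothetical helper)."""
    decoded = html.unescape(text)                        # "&eacute;" -> "é"
    decomposed = unicodedata.normalize("NFKD", decoded)  # "é" -> "e" + U+0301
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def keyword_hit(text: str, keywords) -> bool:
    """Case-insensitive substring match on the normalized text (hypothetical helper)."""
    normalized = strip_accent_entities(text).lower()
    return any(keyword in normalized for keyword in keywords)


# Example with a made-up keyword list:
sample = "Cheap ph&aacute;rm&agrave;cy deals"
print(ACCENT_ENTITY.search(sample) is not None)
print(keyword_hit(sample, ["pharmacy"]))
```

Note that `html.unescape` decodes every HTML entity, not just the six accent types in the regex, so this normalizes more than the pattern alone would catch.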