On Fri, 9 Nov 2018, RW wrote:
On Thu, 8 Nov 2018 19:24:47 -0700
Amir Caspi wrote:
On Nov 8, 2018, at 4:51 PM, RW <rwmailli...@googlemail.com> wrote:
Unnecessary encoding is fairly common, but long runs of ASCII
characters encoded like this seem extreme.
Right, that was a question I had asked in my email this morning...
whether we have a rule to detect long sequences of HTML entities.
I was really referring to the fact that it's pure ASCII text that's
being encoded rather than long runs per se, so I'm trying:
rawbody HTML_ENC_ASCII /(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}/i
I'll add that too so that we can compare the results.
but you may well be right that long runs are inherently suspicious; I'm
not very familiar with HTML practices.
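
To make the behaviour of the HTML_ENC_ASCII pattern easier to check outside SpamAssassin, here is a minimal Python sketch of the same regex. The helper names (encode_decimal, encode_hex) and the sample strings are hypothetical and exist only for illustration; Python's re module is close to, but not identical to, the Perl regex engine SpamAssassin uses, so treat this as an approximation rather than a faithful test harness.

```python
import re

# RW's pattern, transcribed for Python. The "\#" escape is only needed in
# SpamAssassin config files, where a bare "#" would start a comment.
HTML_ENC_ASCII = re.compile(
    r"(?:&#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}",
    re.IGNORECASE,
)


def encode_decimal(text: str) -> str:
    """Encode every character as a decimal numeric entity (hypothetical helper)."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def encode_hex(text: str) -> str:
    """Encode every character as a two-digit hex numeric entity (hypothetical helper)."""
    return "".join(f"&#x{ord(ch):02x};" for ch in text)


# Twelve ASCII characters, all needlessly encoded: the rule should fire.
ascii_decimal = encode_decimal("Hello, world")
ascii_hex = encode_hex("Hello, world")

# Twelve non-ASCII characters (U+00E9): encoding these is legitimate,
# and their code points fall outside the 0-127 range the pattern accepts.
non_ascii = encode_decimal("\u00e9" * 12)

# Only five entities: below the ten-in-a-row threshold.
short_run = encode_decimal("short")

print(bool(HTML_ENC_ASCII.search(ascii_decimal)))
print(bool(HTML_ENC_ASCII.search(ascii_hex)))
print(bool(HTML_ENC_ASCII.search(non_ascii)))
print(bool(HTML_ENC_ASCII.search(short_run)))
```

The point of the numeric ranges is that only code points 0 to 127 count, so a message that encodes genuinely non-ASCII text is left alone.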
Proposed rule:
body AC_HTML_ENTITY_BONANZA (?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}
describe AC_HTML_ENTITY_BONANZA Long run of HTML-encoded characters
score AC_HTML_ENTITY_BONANZA
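
For comparison with the ASCII-only rule, the proposed pattern can be sketched in Python as well. It is compiled here exactly as quoted, with no case-insensitive flag; whether the deployed rule carries /i is not stated in this thread, so that is an assumption of the sketch. The function name has_entity_run is hypothetical.

```python
import re

# The proposed pattern as quoted: named entities (&amp;), decimal entities
# (&#1055;) or uppercase-hex entities (&#x4A;), twenty in a row, with
# optional whitespace after each semicolon.
ENTITY_RUN = re.compile(
    r"(?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}"
)


def has_entity_run(html: str) -> bool:
    """Return True if the text contains a run of 20 consecutive entities."""
    return ENTITY_RUN.search(html) is not None


# Unlike HTML_ENC_ASCII, any entity counts, not only encoded ASCII.
print(has_entity_run("&#1055;" * 20))   # decimal, non-ASCII code point
print(has_entity_run("&amp; " * 20))    # named entities separated by spaces
print(has_entity_run("&#72;" * 19))     # one short of the threshold
print(has_entity_run("&#x4A;" * 20))    # uppercase hex digits
print(has_entity_run("&#x4a;" * 20))    # lowercase hex digits
```

As written, the character class [0-9A-F] only accepts uppercase hex digits, so the last sample does not match unless the rule is made case-insensitive.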
Early results (not all corpora are in yet) look *very* promising:
https://ruleqa.spamassassin.org/20181109-r1846219-n/__AC_HTML_ENTITY_BONANZA/detail
3% of spam, S/O .958 and almost all spam hits are <5 points.
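
As a rough way to read those two figures together: assuming S/O here is the usual "spam over overall" ratio, computed as spam% / (spam% + ham%), the ham hit rate implied by the rounded numbers can be backed out. That formula and both function names are assumptions of this sketch, not something stated in this message, and the inputs are rounded, so the result is only indicative.

```python
def spam_over_overall(spam_pct: float, ham_pct: float) -> float:
    """S/O under the assumed definition: spam% / (spam% + ham%)."""
    return spam_pct / (spam_pct + ham_pct)


def implied_ham_pct(spam_pct: float, s_o: float) -> float:
    """Invert the assumed S/O definition to recover the ham hit percentage."""
    return spam_pct * (1.0 - s_o) / s_o


# Rounded figures quoted above: 3% of spam, S/O 0.958.
ham_pct = implied_ham_pct(3.0, 0.958)
print(f"implied ham hit rate: {ham_pct:.2f}%")

# Sanity check: feeding the result back in reproduces the quoted S/O.
print(f"round-trip S/O: {spam_over_overall(3.0, ham_pct):.3f}")
```

Under that assumption the rule would be hitting on the order of a tenth of a percent of ham.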
I think we have a winner. Thanks, Amir (and possibly RW)!
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79