On Nov 8, 2018, at 4:51 PM, RW <rwmailli...@googlemail.com> wrote:
> 
> Unnecessary encoding is fairly common, but a long run of ASCII
> characters encoded like this seems extreme.

Right, that was a question I had asked in my email this morning... whether we 
have a rule to detect long sequences of HTML entities.  It would seem not.

John, is that something we can test in a sandbox and see how it performs in 
masscheck?

Proposed rule:
body    AC_HTML_ENTITY_BONANZA  /(?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}/
describe        AC_HTML_ENTITY_BONANZA  Long run of HTML-encoded characters
score   AC_HTML_ENTITY_BONANZA  0.001

This should catch either decimal or hex encoding, or named entities, and allows 
the characters to be separated by variable-length whitespace (in case they use 
actual whitespace instead of encoded whitespace).
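
For anyone who wants to poke at the pattern outside SpamAssassin, here is a 
minimal Python sketch. It only exercises the regexp itself (Python's re module 
accepts the same syntax for this pattern); it says nothing about how 
SpamAssassin prepares the message text before matching. The helper names and 
the sample string are made up for illustration.

```python
import re

# The proposed pattern, copied as-is from the rule above.
STRICT = re.compile(
    r"(?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}"
)


def encode_decimal(text, sep=""):
    """Hypothetical helper: encode each character as a decimal entity."""
    return sep.join(f"&#{ord(ch)};" for ch in text)


def encode_hex(text, sep=""):
    """Hypothetical helper: encode each character as an uppercase hex entity."""
    return sep.join(f"&#x{ord(ch):02X};" for ch in text)


sample = "Click here to claim your prize"  # 30 characters, so 30 entities

decimal_run = encode_decimal(sample)           # &#67;&#108;...
hex_run = encode_hex(sample)                   # &#x43;&#x6C;...
spaced_run = encode_decimal(sample, sep="  ")  # real whitespace between entities
named_run = "&nbsp;" * 20                      # named entities only
short_run = encode_decimal("Hello")            # only 5 entities, below the threshold

for label, text in [
    ("decimal", decimal_run),
    ("hex", hex_run),
    ("spaced", spaced_run),
    ("named", named_run),
    ("short", short_run),
]:
    print(label, bool(STRICT.search(text)))
```

The first four samples should hit and the short one should not, since the rule 
needs 20 entities in a row.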

If the regexp above is too complex, we could just match on the entity 
boundaries, restricting to allowable characters inside:

body    AC_HTML_ENTITY_BONANZA  /(?:&[A-Za-z0-9#]{2,};\s*){20}/

Either should work, I believe.
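
The two patterns are not quite equivalent, though. A rough Python sketch of 
where they differ, again testing only the regexps themselves and using made-up 
sample strings: the stricter pattern accepts only uppercase hex digits as 
written, while the simpler one takes any mix of letters, digits and "#" 
between "&" and ";".

```python
import re

# Both patterns copied from the proposed rules.
STRICT = re.compile(
    r"(?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}"
)
LOOSE = re.compile(r"(?:&[A-Za-z0-9#]{2,};\s*){20}")


def compare(text):
    """Return (strict_hit, loose_hit) for one sample string."""
    return bool(STRICT.search(text)), bool(LOOSE.search(text))


upper_hex = "&#x41;" * 20   # uppercase hex digits
lower_hex = "&#x6c;" * 20   # lowercase hex digits
named = "&nbsp; " * 20      # named entities separated by real spaces
junk = "&#zz#;" * 20        # not a valid entity at all

print("upper_hex", compare(upper_hex))  # (True, True)
print("lower_hex", compare(lower_hex))  # (False, True)
print("named", compare(named))          # (True, True)
print("junk", compare(junk))            # (False, True)
```

So the simpler rule is more forgiving of lowercase hex, at the cost of also 
matching strings that are not real entities.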

Cheers.

--- Amir
