Re: Bayes underperforming, HTML entities?

Amir Caspi Thu, 08 Nov 2018 18:25:17 -0800

On Nov 8, 2018, at 4:51 PM, RW <rwmailli...@googlemail.com> wrote:
> 
> Unnecessary encoding is fairly common, but a long run of ASCII
> characters encoded like this seems extreme.


Right, that was a question I had asked in my email this morning... whether we 
have a rule to detect long sequences of HTML entities.  It would seem not.

John, is that something we can test in a sandbox and see how it performs in 
masscheck?

Proposed rule:
body    AC_HTML_ENTITY_BONANZA  /(?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}/
describe        AC_HTML_ENTITY_BONANZA  Long run of HTML-encoded characters
score   AC_HTML_ENTITY_BONANZA  0.001

This should catch either decimal or hex encoding, or named entities, and allows 
the characters to be separated by variable-length whitespace (in case they use 
actual whitespace instead of encoded whitespace).
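
As a quick sanity check outside SpamAssassin, the pattern can be exercised in Python, whose re module accepts the same syntax for this expression. This is only a sketch: the sample strings are made up for illustration, the helper name has_entity_run is hypothetical, and it says nothing about what text a body rule actually sees or how the rule would fare in masscheck.

```python
import re

# Pattern copied from the proposed rule; nothing here is SpamAssassin-specific.
STRICT = re.compile(
    r"(?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}"
)


def has_entity_run(text: str) -> bool:
    """Return True if text contains 20 consecutive entities matching STRICT."""
    return STRICT.search(text) is not None


# Hypothetical samples, not taken from real mail.
samples = {
    "decimal": "&#72;" * 20,
    "hex": "&#x4A;" * 20,
    "named_with_whitespace": "&amp; \n" * 20,
    "too_short": "&#72;" * 19,
    "plain_text": "Fish & chips; salt & vinegar.",
}

for name, text in samples.items():
    print(name, has_entity_run(text))
```

The first three samples should match and the last two should not, since a run of 19 entities falls one short of the {20} quantifier.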

If the regexp above is too complex, we could just match on the entity 
boundaries, restricting to allowable characters inside:

body    AC_HTML_ENTITY_BONANZA  /(?:&[A-Za-z0-9#]{2,};\s*){20}/

Either should work, I believe.
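
The two patterns are not quite equivalent, though. The sketch below (again in Python, with invented test strings) compares them on a few edge cases. As written, the stricter pattern is case-sensitive and requires at least two digits, so it skips lowercase hex and single-digit references, while the simpler one accepts those along with malformed entities.

```python
import re

# Both patterns copied from the proposed rules above.
STRICT = re.compile(
    r"(?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}"
)
LOOSE = re.compile(r"(?:&[A-Za-z0-9#]{2,};\s*){20}")

# Hypothetical edge cases, each repeated 20 times to satisfy the {20} quantifier.
cases = {
    "uppercase_hex": "&#x6A;" * 20,
    "lowercase_hex": "&#x6a;" * 20,
    "single_digit": "&#9;" * 20,
    "malformed": "&a#b;" * 20,
}

# Map each case name to (strict matched?, loose matched?).
results = {
    name: (STRICT.search(text) is not None, LOOSE.search(text) is not None)
    for name, text in cases.items()
}

for name, (strict_hit, loose_hit) in results.items():
    print(f"{name}: strict={strict_hit} loose={loose_hit}")
```

If lowercase hex matters, adding the /i modifier to the stricter rule (or widening the class to [0-9A-Fa-f]) would close that gap. Which variant is preferable is a question for masscheck.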

Cheers.

--- Amir
