On Nov 7, 2018, at 12:33 PM, Amir Caspi <ceph...@3phase.com> wrote:
> 
> In many cases, it would appear that these spams have either very little 
> (real) text (besides the usual attempt at Bayes poisoning) and/or are using 
> HTML-entity encoding to try to bypass Bayes.  Here are a couple of spamples:
> 
> https://pastebin.com/peiXZivJ
> https://pastebin.com/3h3r7r7j
> 
> Does SA decode HTML entities as part of normalize_charset?  If not ... can 
> this be added?

I'm seeing many more of these this morning, and all of them use HTML 
entities to encode the spammy text.  Each has some "Bayes poison" cleartext, 
but all of the actual spam content is entity-encoded.

(1) Do we have any rules to detect long sequences of HTML entities?  That by 
itself seems spammy, but not definitive.

(2) Does normalize_charset decode HTML entities?  If not, is this something 
that can be included?  Should I file a Bugzilla ticket?
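
To make both questions concrete, here is a rough sketch of what I have in 
mind.  It is Python purely for illustration, not SpamAssassin code and not 
how normalize_charset works; the function names and the sample string are 
made up, and any threshold for "long" would have to be tuned against real 
mail.

```python
import html
import re

# One HTML entity: named (&amp;), decimal (&#65;), or hex (&#x41;).
ENTITY = r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);"
# One or more entities back to back, with nothing in between.
ENTITY_RUN = re.compile(r"(?:%s)+" % ENTITY)


def longest_entity_run(text):
    """Return how many entities are in the longest back-to-back run."""
    best = 0
    for match in ENTITY_RUN.finditer(text):
        count = len(re.findall(ENTITY, match.group(0)))
        best = max(best, count)
    return best


def decode_entities(text):
    """Decode entities so a tokenizer would see the visible text."""
    return html.unescape(text)


if __name__ == "__main__":
    # Hypothetical sample: every character is entity-encoded.
    sample = "".join("&#%d;" % ord(c) for c in "free offer")
    print(longest_entity_run(sample))
    print(decode_entities(sample))
```

A rule for (1) could fire when the longest run exceeds some cutoff, and for 
(2) the decoded text is what I would want Bayes to tokenize.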

Thanks.

--- Amir
