On Fri, 09 Oct 2015 14:22:18 +0200
Mark Martinec wrote:
 
> The problem with this message is that it declares the encoding
> as UTF-16, i.e. without explicitly stating the endianness as
> UTF-16BE or UTF-16LE, and there is no BOM at the beginning of
> each textual part, so the endianness cannot be determined.
> RFC 2781 says that big-endian encoding should be assumed in
> the absence of a BOM.
> See https://en.wikipedia.org/wiki/UTF-16
> 
> In the provided message the actual endianness is LE and the
> BOM is missing, so decoding as UTF-16BE fails and the rule
> does not hit. Garbage in, garbage out.


I'm not seeing any body tokens, even after training.

I was expecting the text to be tokenized as individual multi-byte
UTF-8 sequences. ASCII characters encoded as UTF-16 and then decoded
with the wrong endianness are still valid UTF-16: each byte-swapped
code unit is a real code point. Normalizing that into UTF-8 should
produce text that is entirely multi-byte UTF-8, with no ASCII
whitespace or punctuation (the space comes out as U+2000, which is
itself multi-byte in UTF-8).
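
Just to illustrate what I mean (plain Python, nothing to do with SA's
internals; the sample text is from the message):

text = "Dear potencial partner"
le_bytes = text.encode("utf-16-le")          # how the message is actually encoded
swapped = le_bytes.decode("utf-16-be")       # RFC 2781 fallback: assume big-endian
print(swapped)                               # CJK-range characters, 'D' comes out as U+4400
print(all(ord(c) > 0x7f for c in swapped))   # True: nothing maps back to ASCII
print("\u2000" in swapped)                   # True: the space has become U+2000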

If I add John Hardin's diagnostic rule

# unscored diagnostic subrule: matches every rendered body line;
# tflags multiple lets it hit once per match rather than once per message
body     __ALL_BODY     /.*/
tflags   __ALL_BODY     multiple

I get:

ran body rule __ALL_BODY ======> got hit: " _ _D_e_a_r_
_p_o_t_e_n_c_i_a_l_ _p_a_r_t_n_e_r_,_ _ _W_e_ _a_r_e_
_p_r_o_f_e_s_s_i_o_n_a_l_ _i_n_ _e_n_g_i_n_e_e_r_i_n_g_,_
_...

It looks like it's still UTF-16, and Bayes is seeing individual
letters (which are too short to be tokens) separated by nulls.
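
Roughly what that leaves for the tokenizer (again just an
illustration, not SA's actual tokenizer):

import re
raw = "Dear potencial partner".encode("utf-16-le")
print(raw)                               # b'D\x00e\x00a\x00r\x00 \x00p\x00...'
print(re.split(rb"[^A-Za-z0-9]+", raw))  # [b'D', b'e', b'a', b'r', b'p', ...]
# every "word" is a single letter, too short to make a useful Bayes token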

If I change the mime charset to utf-16le it works correctly, except
that the subject isn't converted - including the copy of it in the
body.  If I set the mime to utf-16be I get what appears to be the
multi-byte UTF-8 I was expecting.

So SA isn't falling back to big-endian; it won't normalize without an
explicit endianness.


BTW, with normalize_charset 0 it looks like a spammer can effectively
turn off body tokenization by using UTF-16 (with the correct
endianness).
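
FWIW, a minimal mitigation for local.cf (assuming the message declares
a usable endianness, as above) would be:

normalize_charset 1

with which the body at least gets decoded to UTF-8 before Bayes
tokenizes it.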
