On Fri, 09 Oct 2015 14:22:18 +0200 Mark Martinec wrote:
> The problem with this message is that it declares encoding
> as UTF-16, i.e. not explicitly stating endianness like
> UTF-16BE or UTF-16LE, and there is no BOM mark at the
> beginning of each textual part, so endianness cannot be
> determined. The RFC 2781 says that big-endian encoding
> should be assumed in absence of BOM.
> See https://en.wikipedia.org/wiki/UTF-16
>
> In the provided message the actual endianness is LE, and
> BOM is missing, so decoding as UTF-16BE fails and the
> rule does not hit. Garbage-in, garbage-out.
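
To make the mismatch concrete, here's a quick Python sketch (illustration
only, not SA's actual decoding code; the sample text is lifted from the
message body):

  # Illustration only: UTF-16LE text with no BOM, decoded as UTF-16BE
  # (the RFC 2781 default when no BOM is present).
  import re

  text = "Dear potencial partner,"        # taken from the sample message
  raw = text.encode("utf-16-le")          # little-endian, no BOM

  swapped = raw.decode("utf-16-be")       # decodes without error, but to nonsense
  print(repr(swapped))                    # CJK-range characters, U+2000 where spaces were

  # Every character ends up as a multi-byte sequence once normalized to UTF-8...
  print(all(len(ch.encode("utf-8")) > 1 for ch in swapped))   # True

  # ...and a body rule looking for the original words can no longer hit.
  print(bool(re.search(r"partner", swapped)))                 # False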

I'm not seeing any body tokens, even after training. I was expecting the
text to be tokenized as individual UTF-8 sequences: ASCII characters
encoded as UTF-16 and decoded with the wrong endianness are still valid
UTF-16, and normalizing them into UTF-8 should produce text made up
entirely of multi-byte UTF-8 sequences, with no ASCII whitespace or
punctuation (not counting the U+2000 that each space turns into).

If I add John Hardin's diagnostic rule

  body   __ALL_BODY  /.*/
  tflags __ALL_BODY  multiple

I get:

  ran body rule __ALL_BODY ======> got hit: " _ _D_e_a_r_ _p_o_t_e_n_c_i_a_l_ _p_a_r_t_n_e_r_,_ _ _W_e_ _a_r_e_ _p_r_o_f_e_s_s_i_o_n_a_l_ _i_n_ _e_n_g_i_n_e_e_r_i_n_g_,_ _...

It looks like the body is still UTF-16, and Bayes is seeing individual
letters (too short to be tokens) separated by nulls.

If I change the MIME charset to utf-16le, the body is normalized to what
appears to be the multi-byte UTF-8 I was expecting and it works correctly,
except that the subject isn't converted - including the copy of it in the
body. So SA isn't falling back to big-endian; it won't normalize at all
without an explicit endianness.

BTW, with normalize_charset 0 it looks like a spammer can effectively
turn off body tokenization by using UTF-16 (with the correct endianness).
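
To illustrate that last point (the 3-character minimum token length below
is my assumption for the sketch, not necessarily the exact Bayes tokenizer
limit):

  # Illustration only: a UTF-16 body that never gets normalized looks like
  # single letters separated by NULs, so nothing survives tokenization.
  import re

  raw = "Dear potencial partner,".encode("utf-16-le")

  # Seen as single-byte text (no charset normalization), every letter is
  # followed by a NUL - roughly the _D_e_a_r_ pattern in the rule hit above.
  pseudo = raw.decode("latin-1")
  print(repr(pseudo))                       # 'D\x00e\x00a\x00r\x00 \x00p\x00...'

  # Splitting on whitespace and NULs leaves only one-character fragments,
  # all shorter than a plausible minimum token length.
  words = [w for w in re.split(r"[\s\x00]+", pseudo) if w]
  print(words[:8])                          # ['D', 'e', 'a', 'r', 'p', 'o', 't', 'e']
  print([w for w in words if len(w) >= 3])  # [] - nothing for Bayes to learn from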