Mark Martinec wrote:
Reindl Harald wrote:

no custom body rules hit like they do for ISO/UTF8 :-(
What is your normalize_charsets setting?
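(For reference, it is set in local.cf or a similar config file, e.g.:

    normalize_charsets 1

which recodes textual parts to UTF-8 before body rules run.)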

The problem with this message is that it declares its encoding
as UTF-16, i.e. without explicitly stating endianness as
UTF-16BE or UTF-16LE, and there is no BOM at the
beginning of the textual parts, so endianness cannot be
determined. RFC 2781 says that big-endian encoding
should be assumed in the absence of a BOM.
See https://en.wikipedia.org/wiki/UTF-16

In the provided message the actual endianness is LE, and
BOM is missing, so decoding as UTF-16BE fails and the
rule does not hit. Garbage-in, garbage-out.
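To see the failure mode concretely, here is a minimal Perl sketch
using the Encode module (the bytes and the rule string are made-up
placeholders, not taken from the sample):

    use Encode qw(decode);

    # "viagra" as UTF-16LE bytes, two bytes per character, no BOM
    my $le_bytes = "v\0i\0a\0g\0r\0a\0";

    # RFC 2781 default: no BOM means assume big-endian
    my $as_be = decode('UTF-16BE', $le_bytes);  # decodes to CJK garbage
    my $as_le = decode('UTF-16LE', $le_bytes);  # decodes to "viagra"

    print $as_be =~ /viagra/ ? "BE: hit\n" : "BE: no hit\n";  # no hit
    print $as_le =~ /viagra/ ? "LE: hit\n" : "LE: no hit\n";  # hit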
----
        In the real world, RFC 2781 is full of bovine excrement.

The most common and real-world default is UTF-16LE, as blessed by
MS.  And the big-endian fanboys who are MS-haters have always hated
that fact -- but that doesn't change what any intelligent person
would assume about UTF-16 in the real world.

So you can follow rules written for the big-iron days before the PC,
or you can follow the real world.  I've encountered multiple UTF-16
files in the wild that predate the use of BOMs as an attempt
to tame MS, and use of a BOM is not inherent in the core NT OS;
it's always a Win32 consumer add-on -- sorta like how MS's Unicode
support is still, at its fullest, only Unicode 2.0, with spurious
additions in the later versions (Unicode being at version 8 now).

So it basically boils down to whether or not you want to go with
reality, or with last-generation losers.  I ran into this stupidity
when the perl community came out with a supposed replacement for
iconv, except that it wasn't compatible w/iconv's defaults.

iconv's output for UTF-16 is LE (w/BOM), and UCS2 = UTF-16 w/no BOM.
UCS2 is MS's full Unicode Standard 2, and that's been the standard since
MS came out with their full UCS2 support in the UCS2 charset (except that
Unicode 3 and beyond wouldn't fit in 2 bytes, so they had to go with an
encoding similar to UTF-8 within UTF-16 -- surrogate pairs -- where they
still kept the UCS2 byte ordering, i.e. LE).
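For what it's worth, here is how Perl's Encode module maps those names
onto bytes for the single character "A" (Encode's behaviour, which may
or may not match any particular iconv build or MS tool):

    use Encode qw(encode);

    encode('UTF-16LE', 'A');  # 41 00  -- LE code unit, no BOM
    encode('UTF-16BE', 'A');  # 00 41  -- BE code unit, no BOM
    encode('UCS-2LE',  'A');  # 41 00  -- 2-byte code units only, no surrogates
    encode('UTF-16',   'A');  # BOM followed by the code unit; the byte
                              # order is whatever Encode picks when none
                              # is specified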

But the big-iron camp struck back by pushing through an unrealistic
default for non-BOM UTF-16 files... and yeah, it's in the standard, but
in the real world it's not the default.  Unfortunately,
SA is written in Perl, which goes against real-world
usage and was about 10 years late to the game w/UTF-8
support, reactively and completely reverting its UTF-8
support in perl-5.8.0.  Only in the past few years has it
restored somewhat proper behaviour of assuming the locale encoding
on console-centric byte streams, while requiring files opened
w/open (as opposed to reading <> or writing to STDOUT/STDERR) to
declare a text encoding if you don't want perl's unicode bug: it
reads and writes binary data (0-255) as latin-1, but issues runtime
warnings or fatal errors if you manipulate that binary data
so that a char value > 255 ends up in the stream.  In such a case it
writes chars > 255 out in UTF-8 encoding, but all
chars below 256 are written out in incompatible latin-1 -- unless
you pre-declare the output charset as one or the other -- leaving
the default case to always generate wrong output on mixed usage
of chars below 256 and those above.
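A minimal sketch of that default behaviour (assuming a stock perl >= 5.8
with no -C switches or PERL_UNICODE environment settings in effect):

    use strict;
    use warnings;

    my $small = "caf\x{E9}\n";  # all code points fit in one byte
    my $big   = "\x{263A}\n";   # contains a code point above 255

    # With no :encoding() layer on the handle, the first print goes out
    # as latin-1 bytes, while the second warns "Wide character in print"
    # and goes out as UTF-8 -- two encodings mixed in one stream.
    print STDOUT $small;
    print STDOUT $big;

    # Declaring the encoding on the handle avoids the mixed output:
    binmode(STDOUT, ':encoding(UTF-8)');
    print STDOUT $small;        # now both strings come out as UTF-8
    print STDOUT $big;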

So that's the rough history, and it's still a problem today:
the real world vs. after-the-fact standards.

If you manually edit the sample and replace UTF-16
with UTF-16LE (and normalize_charsets is enabled), your rule should
hit -- at least it does so in the current trunk code.
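(The edit in question is just the charset parameter on the text part's
header, shown here with an illustrative header rather than the one from
the sample:

    before:  Content-Type: text/plain; charset="UTF-16"
    after:   Content-Type: text/plain; charset="UTF-16LE"
)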

If this seems to be common in the wild, please open a
bug ticket, as Kevin suggested, and attach the sample there.

  Mark
