Re: ASCII control codes in sequences of multibyte character sets

Richard Wordingham Sat, 31 Aug 2013 12:28:00 -0700

On Fri, 30 Aug 2013 22:23:14 +0200
Steffen "Daode" Nurpmeso <[email protected]> wrote:


> Hello character plus experts,
> I'm wondering whether there are any multibyte character sets known
> which use the numerical values of ASCII control characters that
> are vital to Unix/POSIX (plus) as part of multibyte sequences?
> In particular U+000A and U+000D?

Infamously, UTF-16, as implied by Doug's mention of SCSU.

If you count fixed length (>1) character sets as multibyte, you can add
UCS-2 and UTF-32.

UTF-16 does have the property that characters occupy a multiple of
2 bytes, so it is well behaved in this respect if one knows to work
with aligned pairs of bytes rather than bytes, and if one knows the
endianity.  Also, at present, U+0A00 and U+0D00 (the byte-swapped
forms of U+000A and U+000D) are unassigned.
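
A minimal Python sketch of the aligned-pairs point, assuming the byte
order is already known. The helper name find_line_feeds and the sample
string are illustrative only; U+010A is used because its UTF-16BE form
contains the byte 0x0A.

```python
import struct
from typing import List


def find_line_feeds(data: bytes, big_endian: bool) -> List[int]:
    """Return byte offsets of U+000A code units in UTF-16 data.

    Hypothetical helper: the caller must already know the byte order.
    """
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    fmt = (">" if big_endian else "<") + "H"
    return [
        off
        for off in range(0, len(data), 2)
        if struct.unpack_from(fmt, data, off)[0] == 0x000A
    ]


# U+010A followed by a real line feed: bytes 01 0A 00 0A in UTF-16BE.
sample = "\u010a\n".encode("utf-16-be")

# A byte-wise scan reports a false hit inside U+010A.
naive = [i for i, b in enumerate(sample) if b == 0x0A]

# Scanning aligned 16-bit units finds only the real line feed.
aligned = find_line_feeds(sample, big_endian=True)
```

Here the byte-wise scan hits offsets 1 and 3, while the aligned scan
reports only offset 2, where the actual U+000A code unit starts.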

Note that the old belief that U+FFFE would not occur externally to an
application has been decreed a fallacy, so an apparent U+FEFF or U+FFFE
at the start of a file from an external source only indicates the
endianity if one knows that file is encoded in the UTF-16 encoding
scheme as opposed to the UTF-16LE or UTF-16BE encoding scheme.
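
To illustrate with Python's standard codecs (one implementation's
behaviour, shown as a sketch rather than as a statement of the
standard): the same four bytes give three different results depending
on which encoding scheme is assumed. The sample bytes are made up for
the example.

```python
# Bytes FF FE 41 00: a little-endian BOM followed by "A", if this is UTF-16.
raw = b"\xff\xfe\x41\x00"

# UTF-16 encoding scheme: the leading FF FE is a byte order mark and is consumed.
as_utf16 = raw.decode("utf-16")

# UTF-16LE encoding scheme: no BOM is defined, so U+FEFF stays in the text.
as_utf16le = raw.decode("utf-16-le")

# UTF-16BE encoding scheme: the same bytes are U+FFFE followed by U+4100.
as_utf16be = raw.decode("utf-16-be")
```

Only the first decoding treats the initial two bytes as a signature;
the other two keep them as ordinary content, which is why the initial
bytes alone cannot settle the byte order.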

For UTF-32, reversing the bytes of a C0 control character would yield
an invalid byte sequence.
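
A short Python check of this, as a sketch: the helper names are
hypothetical, and U+0000 is left out of the loop because it is the
trivial case of a value that is its own byte-reversal.

```python
def byte_swapped(cp: int) -> int:
    """Reverse the four bytes of a 32-bit code unit."""
    return int.from_bytes(cp.to_bytes(4, "big"), "little")


def is_scalar_value(v: int) -> bool:
    """True if v is a Unicode scalar value (in range and not a surrogate)."""
    return v <= 0x10FFFF and not (0xD800 <= v <= 0xDFFF)


def decodes_as_utf32le(data: bytes) -> bool:
    """True if Python's UTF-32LE decoder accepts the bytes."""
    try:
        data.decode("utf-32-le")
        return True
    except UnicodeDecodeError:
        return False


# U+000A byte-swapped is 0x0A000000, far above the U+10FFFF limit.
swapped_lf = byte_swapped(0x0A)

# Every C0 control from U+0001 to U+001F lands out of range when swapped.
invalid_controls = [
    cp for cp in range(0x01, 0x20) if not is_scalar_value(byte_swapped(cp))
]

# Big-endian bytes of U+000A read as little-endian are rejected.
misread_ok = decodes_as_utf32le(b"\x00\x00\x00\x0a")
```

So a UTF-32 stream read with the wrong byte order fails on the first
line feed rather than silently producing a different character.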

Richard.
