Goodness, sorry, no, I didn't mean that at all! What I meant was that a recognised encoding should be used consistently, regardless of the number of bytes required, and all encodings of Unicode code points are necessarily potentially multi-byte. Single-byte encodings may save a little space, and may be Windows-1252, or Windows-1253, or one of many other encodings, but they are not, in any sense, Unicode encodings.
Windows code pages and their ilk predate Unicode, and I would only ever expect to see them used in environments where legacy support is needed; I would not expect a significant amount of new documentation about them to be written. When it is necessary to describe them, one should do so fully and properly, whatever that entails, but they really have no meaning in a Unicode context.

Nor, as far as I'm aware, do the 0x80 to 0xFF octets have any special meaning in Unicode that would require a recognisable term to describe them. Code that processes arbitrary *character* sequences (for legibility or any other reason) should, surely, work with characters, which may be sequences of code points, each of which may be a sequence of bytes. I can think of no reason for chopping up byte sequences except where they are going to be recombined later by the reverse treatment, and any code that does so probably has, and need have, no idea of meaning; it can, surely, only work with bytes. The actual octets are, of course, used in combinations, but not singly in any way that requires them to be described in Unicode terms.

Or am I missing something fundamental?

Best,
Tony

-----Original Message-----
From: Unicode [mailto:[email protected]] On Behalf Of Richard Wordingham
Sent: 21 September 2015 19:18
To: [email protected]
Subject: Re: Concise term for non-ASCII Unicode characters

On Mon, 21 Sep 2015 12:46:48 +0100
"Tony Jollans" <[email protected]> wrote:

> These days, it is pretty sloppy coding that cares how many bytes an
> encoding of something requires, although there may be many
> circumstances where legacy support is required.

Wow! Are you saying that code chopping up arbitrary character sequences for legibility (and editability!) and to avoid buffering issues should generally assume it will be read as UTF-8, and avoid splitting well-formed UTF-8 characters?
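[Richard's scenario of chopping up a byte stream without cutting a multi-byte character in half can be sketched in Python. This is an illustrative helper of my own, not code from the thread: it backs up over UTF-8 continuation bytes (0x80 to 0xBF) so the cut always falls on a character boundary.]

```python
def safe_split_point(data: bytes, limit: int) -> int:
    """Return a split index at or before `limit` that does not cut a
    well-formed UTF-8 sequence in half (hypothetical helper)."""
    if limit >= len(data):
        return len(data)
    i = limit
    # UTF-8 continuation bytes are 0x80-0xBF; back up to the lead byte.
    while i > 0 and 0x80 <= data[i] <= 0xBF:
        i -= 1
    return i

text = "héllo wörld".encode("utf-8")
# 'é' is two bytes (0xC3 0xA9); a naive cut at index 2 would split it,
# but safe_split_point backs up to index 1.
cut = safe_split_point(text, 2)
assert text[:cut].decode("utf-8") == "h"
```

[Each chunk then decodes cleanly on its own, and concatenating the chunks reproduces the original stream, which is exactly the recombine-later case Tony describes.]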
(If the text is actually Windows-1252, there may be a lot of apparently ill-formed UTF-8 characters/gibberish.)

> You say that, in some contexts, one needs to be really clear that the
> octets 0x80 - 0xFF are Unicode. Either something "is" Unicode, or it
> isn't. Either something uses a recognised encoding, or it doesn't.
> Using these octets to represent Unicode code points is not ASCII, is
> not UTF-8, and is not UCS-2/UTF-16; it could, perhaps, be EBCDIC.

But most of these octets *are* used to represent non-ASCII scalar values. It's just that they have to operate in combinations for UTF-8.

Richard.
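[Richard's closing point, that the 0x80-0xFF octets represent non-ASCII scalar values only in combination, is easy to see in Python; the parenthetical about Windows-1252 gibberish falls out of the same check. This is my own illustration, not code from the thread.]

```python
# Every non-ASCII scalar value is encoded in UTF-8 entirely with bytes
# in the 0x80-0xFF range: a lead byte (0xC2 or above) followed by one
# or more continuation bytes (0x80-0xBF).
for ch in ("é", "€", "😀"):
    encoded = ch.encode("utf-8")          # C3 A9 / E2 82 AC / F0 9F 98 80
    assert all(b >= 0x80 for b in encoded)

# A lone Windows-1252 byte such as 0x93 (left curly quote) is a UTF-8
# continuation byte with no lead byte, hence "ill-formed" when a
# Windows-1252 stream is read as UTF-8:
try:
    b"\x93".decode("utf-8")
except UnicodeDecodeError:
    pass  # ill-formed on its own, as the parenthetical above suggests
```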

