Re: Unicode String Models

Hans Åberg via Unicode Wed, 12 Sep 2018 01:40:18 -0700


> On 12 Sep 2018, at 04:34, Eli Zaretskii via Unicode <[email protected]> 
> wrote:
> 
>> Date: Wed, 12 Sep 2018 00:13:52 +0200
>> Cc: [email protected]
>> From: Hans Åberg via Unicode <[email protected]>
>> 
>> It might be useful to represent non-UTF-8 bytes as Unicode code points. One 
>> way might be to use a codepoint to indicate high bit set followed by the 
>> byte value with its high bit set to 0, that is, truncated into the ASCII 
>> range. For example, U+0080 looks like it is not in use, though I could not 
>> verify this.
> 
> You must use a codepoint that is not defined by Unicode, and never
> will.  That is what Emacs does: it extends the Unicode codepoint space
> beyond 0x10FFFF.


The idea is to extend Unicode itself, so that those bytes can be represented by 
legal codepoints. U+0080 has had some use in other encodings, but apparently 
not in Unicode itself. One could equally pick some other value or values and 
reserve them for this special purpose.
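
A minimal sketch of that scheme in Python, assuming U+0080 is the reserved 
marker and that it never occurs in the valid part of the input. Both are 
assumptions for illustration, and the names MARKER, bytes_to_text and 
text_to_bytes are hypothetical. Python's "surrogateescape" error handler is 
used only as a convenient way to locate the undecodable bytes:

```python
MARKER = "\u0080"  # hypothetical escape code point; U+0080 is only the example from this thread


def bytes_to_text(data: bytes) -> str:
    """Map arbitrary bytes to code points; invalid bytes become MARKER + 7-bit value."""
    # surrogateescape turns each undecodable byte 0x80..0xFF into U+DC80..U+DCFF.
    decoded = data.decode("utf-8", errors="surrogateescape")
    if MARKER in decoded:
        # Assumption of this sketch: the marker is reserved and never appears as data.
        raise ValueError("input already contains the marker code point")
    out = []
    for ch in decoded:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Keep the low 7 bits of the byte; the marker stands for the high bit.
            out.append(MARKER + chr(cp & 0x7F))
        else:
            out.append(ch)
    return "".join(out)


def text_to_bytes(text: str) -> bytes:
    """Inverse of bytes_to_text."""
    out = bytearray()
    chars = iter(text)
    for ch in chars:
        if ch == MARKER:
            low = next(chars, None)
            if low is None or ord(low) > 0x7F:
                raise ValueError("marker must be followed by an ASCII-range value")
            out.append(ord(low) | 0x80)  # restore the high bit
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw = b"ok \xff\xfe caf\xc3\xa9"      # two stray bytes, then valid UTF-8
text = bytes_to_text(raw)
assert text_to_bytes(text) == raw     # round trip preserves every byte
```

The obvious cost is that the marker can no longer carry any other meaning, 
which is why it would have to be set aside for this purpose.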

There are a number of other byte sequences that are in use, too, such as 
overlong UTF-8. The original UTF-8 design can also be extended to handle all 
32-bit words, including those with the high bit set.
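
To make the last two points concrete, here is a rough Python sketch. It assumes 
one possible layout for the extension: the original 1- to 6-byte forms (up to 
31 bits) plus a 7-byte form with lead byte 0xFE for values that need the 32nd 
bit. That layout and the function names are assumptions for illustration, not 
anything defined by Unicode. The decoder is deliberately lenient, so it also 
accepts overlong sequences:

```python
def encode_extended(value: int) -> bytes:
    """Encode a 32-bit word using UTF-8-style lead and continuation bytes."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    if value < 0x80:
        return bytes([value])
    # An n-byte form carries 5*n + 1 payload bits: 11, 16, 21, 26, 31, 36.
    for n, lead in ((2, 0xC0), (3, 0xE0), (4, 0xF0), (5, 0xF8), (6, 0xFC), (7, 0xFE)):
        if value < (1 << (5 * n + 1)):
            head = lead | (value >> (6 * (n - 1)))
            tail = [0x80 | ((value >> (6 * i)) & 0x3F) for i in range(n - 2, -1, -1)]
            return bytes([head] + tail)
    raise AssertionError("unreachable: 36 payload bits cover every 32-bit value")


def decode_lenient(seq: bytes) -> int:
    """Decode one sequence, accepting overlong forms that strict UTF-8 rejects."""
    first = seq[0]
    if first < 0x80:
        n, value = 1, first
    else:
        n, mask = 0, 0x80
        while first & mask:          # count the leading 1 bits of the lead byte
            n += 1
            mask >>= 1
        if n == 1 or n > 7:
            raise ValueError("not a lead byte")
        value = first & (mask - 1)   # payload bits of the lead byte (none for 0xFE)
    if len(seq) != n:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (b & 0x3F)
    return value


assert encode_extended(0x1F600) == "\U0001F600".encode("utf-8")  # agrees with UTF-8 in range
assert decode_lenient(b"\xc1\x81") == 0x41                       # overlong form of "A"
assert decode_lenient(encode_extended(0xFFFFFFFF)) == 0xFFFFFFFF # high bit set
```

A strict decoder would reject both the overlong sequence and anything above 
U+10FFFF, so text in this form could only be exchanged between programs that 
agree on the extension.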
