RE: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers

Shawn Steele Sat, 30 Jan 2016 16:48:46 -0800

Why do you need illegal Unicode code points?

-----Original Message-----
From: Unicode [mailto:[email protected]] On Behalf Of J Decker
Sent: Saturday, January 30, 2016 6:40 AM
To: [email protected]
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers


I do see that the code points D800-DFFF should not be encoded in any UTF format 
(UTF8/32)...

UTF8 has a way to define any byte that might otherwise be used as an encoding 
byte.

UTF16 has no way to define a code point that is D800-DFFF; this is an issue if 
I want to apply some sort of encryption algorithm and still have the result 
treated as text for transmission and encoding to other string systems.

http://www.azillionmonkeys.com/qed/unicode.html lists the Unicode private 
areas: Area-A, which is U-F0000:U-FFFFD, and Area-B, which is 
U-100000:U-10FFFD. These will suffice as a workaround for my purposes....

For my purposes I will implement F0000-F0800 to be (code point minus
D800 and then add F0000 (or vice versa)), and the result is then encoded as a 
surrogate pair... it would have been super nice if the Unicode standard 
included a way to specify a code point even if there isn't a language 
character assigned to that point.
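
The remapping described above can be sketched roughly as follows. This is only 
an illustration under assumptions: the function names are hypothetical, the 
target range is taken as F0000-F07FF inclusive (0x800 values, one per 
surrogate code point), and it assumes the input never contains genuine 
F0000-F07FF private use code points, since those would be indistinguishable 
from remapped surrogates on the way back.

```python
SURROGATE_FIRST = 0xD800   # first surrogate code point
SURROGATE_LAST = 0xDFFF    # last surrogate code point
PUA_A_BASE = 0xF0000       # start of Supplementary Private Use Area-A


def to_private_use(cp: int) -> int:
    """Move a surrogate code point into the private use area; pass others through."""
    if SURROGATE_FIRST <= cp <= SURROGATE_LAST:
        return cp - SURROGATE_FIRST + PUA_A_BASE
    return cp


def from_private_use(cp: int) -> int:
    """Reverse to_private_use for values in F0000-F07FF; pass others through."""
    if PUA_A_BASE <= cp <= PUA_A_BASE + (SURROGATE_LAST - SURROGATE_FIRST):
        return cp - PUA_A_BASE + SURROGATE_FIRST
    return cp


def to_surrogate_pair(cp: int) -> tuple:
    """Encode a supplementary code point (>= 0x10000) as a UTF-16 surrogate pair."""
    offset = cp - 0x10000
    return (0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF))


if __name__ == "__main__":
    mapped = to_private_use(0xD800)
    print(hex(mapped), [hex(u) for u in to_surrogate_pair(mapped)])
```

With this mapping every value that would have been an unpaired surrogate ends 
up as a well-formed pair, at the cost of two 16-bit units instead of one.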

http://unicode.org/faq/utf_bom.html
does say: "Q: Are there any 16-bit values that are invalid?

A: Unpaired surrogates are invalid in UTFs. These include any value in the 
range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any 
value in the range DC00 to DFFF not preceded by a value in the range D800 to 
DBFF "

and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?

A different issue arises if an unpaired surrogate is encountered when 
converting ill-formed UTF-16 data. By representing such an unpaired surrogate 
on its own as a 3-byte sequence, the resulting UTF-8 data stream would become 
ill-formed. While it faithfully reflects the nature of the input, Unicode 
conformance requires that encoding form conversion always results in a valid 
data stream. Therefore a converter must treat this as an error. "
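
To make the two FAQ answers concrete, here is a small sketch of a check for 
unpaired surrogates and of a converter that treats them as an error. The 
function names are hypothetical, the input is assumed to be a list of 16-bit 
code units (integers 0-0xFFFF), and this is just one way to do it, not how any 
particular library does.

```python
HIGH_SURROGATES = range(0xD800, 0xDC00)  # D800-DBFF
LOW_SURROGATES = range(0xDC00, 0xE000)   # DC00-DFFF


def find_unpaired_surrogates(units):
    """Return the indexes of code units that are unpaired surrogates."""
    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if unit in HIGH_SURROGATES:
            # A high surrogate is valid only when a low surrogate follows it.
            if i + 1 < len(units) and units[i + 1] in LOW_SURROGATES:
                i += 2
                continue
            bad.append(i)
        elif unit in LOW_SURROGATES:
            # A low surrogate reached here was not preceded by a high one.
            bad.append(i)
        i += 1
    return bad


def utf16_units_to_utf8(units):
    """Convert 16-bit code units to UTF-8, rejecting unpaired surrogates."""
    bad = find_unpaired_surrogates(units)
    if bad:
        raise ValueError("unpaired surrogate at index %d" % bad[0])
    raw = b"".join(unit.to_bytes(2, "big") for unit in units)
    return raw.decode("utf-16-be").encode("utf-8")


if __name__ == "__main__":
    print(find_unpaired_surrogates([0x41, 0xD800, 0x42]))
    print(utf16_units_to_utf8([0x41, 0xD83D, 0xDE00]))
```

Raising an error is only one of the options a converter has; substituting 
U+FFFD for the offending unit would also give a valid data stream, but then 
the original value is lost.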



I did see these older messages... (not that they talk about this much, just 
more info)
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html
