Re: Encoding/Use of potential unpaired UTF-16 surrogate pair specifiers

J Decker Sat, 30 Jan 2016 18:32:57 -0800

On Sat, Jan 30, 2016 at 4:45 PM, Shawn Steele
<[email protected]> wrote:
> Why do you need illegal unicode code points?


This came out of learning JavaScript, which uses UTF-16 internally.
While playing with localStorage I noticed that some browsers store the
values in a sqlite3 database.  The database is UTF-8, so there must be
a valid conversion between the internal UTF-16 and the UTF-8 in
localStorage (and back).  I wanted to obfuscate the data stored for a
certain application, and cover any content that someone might send.
Having slept on this, I realized that even if hieroglyphics were
stored, if I pulled out the character using codePointAt() and XORed it
with a 20 bit random value, it could end up as a normal character, and
on the way back I wouldn't know that it needed a 20 bit value... so
every character would have to use a 20 bit mask (which could end up
with a value that's D800-DFFF).
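
To make the problem concrete, here is a rough sketch in Python (not the JavaScript I was actually playing with).  The helper names and the two mask values are made up for illustration; the masks are picked by hand to hit the two bad cases, they are not random.

```python
# Sketch only: masks are chosen by hand to show the failure cases.

def is_surrogate(cp):
    """True if cp falls in the surrogate range D800-DFFF."""
    return 0xD800 <= cp <= 0xDFFF

def xor_code_point(cp, mask):
    """XOR a whole code point with a mask of up to 20 bits."""
    return cp ^ (mask & 0xFFFFF)

# A hieroglyph outside the BMP can land on a plain ASCII letter...
hieroglyph = 0x13000                      # EGYPTIAN HIEROGLYPH A001
as_letter = xor_code_point(hieroglyph, 0x13041)

# ...and a plain ASCII letter can land inside D800-DFFF.
letter = ord("A")
as_surrogate = xor_code_point(letter, 0x0D800)

print(hex(as_letter), hex(as_surrogate), is_surrogate(as_surrogate))
```

In the first case the result gives no hint that the original needed a surrogate pair; in the second the result is a lone surrogate that a conforming UTF-8 converter has to reject.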

I've reconsidered, and for ease of implementation I think I'll just
mask every UTF-16 code unit (not code point) with a 10 bit value.  XOR
with a 10 bit value leaves the top 6 bits alone, so no character will
change from BMP space to surrogate pair or vice versa.
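
A minimal sketch of the per-unit masking, again in Python with the UTF-16 code units pulled out explicitly.  The function names and the key are hypothetical, and a repeating key like this is obfuscation only, not encryption.

```python
import struct

def to_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def from_units(units):
    """Rebuild a string from a list of UTF-16 code units."""
    data = struct.pack("<%dH" % len(units), *units)
    return data.decode("utf-16-le", "surrogatepass")

def unit_class(u):
    """Classify a code unit as high surrogate, low surrogate, or plain BMP."""
    if 0xD800 <= u <= 0xDBFF:
        return "high"
    if 0xDC00 <= u <= 0xDFFF:
        return "low"
    return "bmp"

def mask_units(units, key):
    """XOR each unit with a 10 bit value taken from key (cycled)."""
    return [u ^ (key[i % len(key)] & 0x3FF) for i, u in enumerate(units)]

key = [0x3FF, 0x155, 0x2AA]            # hypothetical 10 bit key values
text = "A\U00013000z"                  # BMP, surrogate pair, BMP
units = to_units(text)
masked = mask_units(units, key)

# Only the low 10 bits change, so every unit keeps its class.
print([unit_class(u) for u in units])
print([unit_class(u) for u in masked])
print(from_units(mask_units(masked, key)) == text)
```

Applying the same key a second time undoes the mask, and since high and low surrogates stay high and low, the masked string is still well-formed UTF-16 and should survive the trip through UTF-8.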

Thanks for the feedback.
(sorry if I've used some terms inaccurately)

>
> -----Original Message-----
> From: Unicode [mailto:[email protected]] On Behalf Of J Decker
> Sent: Saturday, January 30, 2016 6:40 AM
> To: [email protected]
> Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
>
> I do see that the code points D800-DFFF should not be encoded in any UTF 
> format (UTF8/32)...
>
> UTF8 has a way to define any byte that might otherwise be used as an encoding 
> byte.
>
> UTF16 has no way to define a code point that is D800-DFFF; this is an issue 
> if I want to apply some sort of encryption algorithm and still have the 
> result treated as text for transmission and encoding to other string systems.
>
> http://www.azillionmonkeys.com/qed/unicode.html   lists Unicode
> private areas Area-A which is U-F0000:U-FFFFD and Area-B which is 
> U-100000:U-10FFFD which will suffice for a workaround for my purposes....
>
> For my purposes I will implement F0000-F07FF to be (code point minus
> D800 and then add F0000 (or vice versa)) and then encoded as a surrogate 
> pair... it would have been super nice if the Unicode standard included a way to 
> specify a code point even if there isn't a language character assigned to that 
> point.
>
> http://unicode.org/faq/utf_bom.html
> does say: "Q: Are there any 16-bit values that are invalid?
>
> A: Unpaired surrogates are invalid in UTFs. These include any value in the 
> range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any 
> value in the range DC00 to DFFF not preceded by a value in the range D800 to 
> DBFF "
>
> and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
>
> A different issue arises if an unpaired surrogate is encountered when 
> converting ill-formed UTF-16 data. By representing such an unpaired surrogate 
> on its own as a 3-byte sequence, the resulting UTF-8 data stream would become 
> ill-formed. While it faithfully reflects the nature of the input, Unicode 
> conformance requires that encoding form conversion always results in a valid 
> data stream. Therefore a converter must treat this as an error. "
>
>
>
> I did see these older messages... (not that they talk about this much, just 
> more info)
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html
