Obfuscate is right. It might conceivably be better than nothing, but at its best it will stop someone for an hour or so. Why not run it through a standard encryption protocol and if necessary use one of the options mentioned before to turn it into valid text?
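To make the "turn it into valid text" step concrete, here is a minimal sketch. It assumes Base64 as the text encoding, which may or may not be one of the options mentioned earlier in the thread, and the helper names `bytes_to_text` and `text_to_bytes` are made up for illustration. The encryption step itself is left out: the input stands for whatever bytes a vetted cipher would produce.

```python
import base64


def bytes_to_text(data: bytes) -> str:
    """Encode arbitrary bytes (e.g. ciphertext) as plain ASCII text.

    Base64 is an assumption here; any binary-to-text encoding would do.
    """
    return base64.b64encode(data).decode("ascii")


def text_to_bytes(text: str) -> bytes:
    """Recover the original bytes from the Base64 text."""
    return base64.b64decode(text.encode("ascii"))
```

Because the result is pure ASCII, it is valid in UTF-8 and UTF-16 alike, so it should survive the localStorage round trip regardless of which byte values the cipher emits, including ones that would look like unpaired surrogates if treated as 16-bit units.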
On Sat, Jan 30, 2016, 6:31 PM J Decker <[email protected]> wrote:

> On Sat, Jan 30, 2016 at 4:45 PM, Shawn Steele
> <[email protected]> wrote:
> > Why do you need illegal unicode code points?
>
> This originated from learning Javascript; which is internally UTF-16.
> Playing with localStorage, some browsers use a sqlite3 database to
> store values. The database is UTF-8 so there must be a valid
> conversion between the internal UTF-16 and UTF-8 localStorage (and
> reverse). I wanted to obfuscate the data stored for a certain
> application; and cover all content that someone might send. Having
> slept on this, I realized that even if hieroglyphics were stored, if I
> pulled out the character using codePointAt() and applied a 20 bit
> random value to it using XOR it could end up as a normal character,
> and I wouldn't know I had to use a 20 bit value... so every character
> would have to use a 20 bit mask (which could end up with a value
> that's D800-DFFF).
>
> I've reconsidered and think for ease of implementation to just mask
> every UTF-16 character (not codepoint) with a 10 bit value, This will
> result in no character changing from BMP space to surrogate-pair or
> vice-versa.
>
> Thanks for the feedback.
> (sorry if I've used some terms inaccurately)
>
> > -----Original Message-----
> > From: Unicode [mailto:[email protected]] On Behalf Of J Decker
> > Sent: Saturday, January 30, 2016 6:40 AM
> > To: [email protected]
> > Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
> >
> > I do see that the code points D800-DFFF should not be encoded in any UTF
> > format (UTF8/32)...
> >
> > UTF8 has a way to define any byte that might otherwise be used as an
> > encoding byte.
> >
> > UTF16 has no way to define a code point that is D800-DFFF; this is an
> > issue if I want to apply some sort of encryption algorithm and still have
> > the result treated as text for transmission and encoding to other string
> > systems.
> >
> > http://www.azillionmonkeys.com/qed/unicode.html lists Unicode
> > private areas Area-A which is U-F0000:U-FFFFD and Area-B which is
> > U-100000:U-10FFFD which will suffice for a workaround for my purposes....
> >
> > For my purposes I will implement F0000-F0800 to be (code point minus
> > D800 and then add F0000 (or vice versa)) and then encoded as a surrogate
> > pair... it would have been super nice of unicode standards included a way
> > to specify code point even if there isn't a language character assigned to
> > that point.
> >
> > http://unicode.org/faq/utf_bom.html
> > does say: "Q: Are there any 16-bit values that are invalid?
> >
> > A: Unpaired surrogates are invalid in UTFs. These include any value in
> > the range D800 to DBFF not followed by a value in the range DC00 to DFFF,
> > or any value in the range DC00 to DFFF not preceded by a value in the range
> > D800 to DBFF "
> >
> > and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
> >
> > A different issue arises if an unpaired surrogate is encountered when
> > converting ill-formed UTF-16 data. By represented such an unpaired
> > surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream
> > would become ill-formed. While it faithfully reflects the nature of the
> > input, Unicode conformance requires that encoding form conversion always
> > results in valid data stream. Therefore a converter must treat this as an
> > error. "
> >
> > I did see these older messages... (not that they talk about this much
> > just more info)
> > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html
> > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html
> > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html
> > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html
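
For reference, the 10-bit masking idea from the quoted message can be sketched roughly as below. This is an illustration only, written in Python instead of Javascript, and the helper names (`to_code_units`, `from_code_units`, `mask_text`) and the repeating-key scheme are assumptions, not anything from the thread. The point it shows: XOR with a value below 0x400 changes only the low 10 bits of a 16-bit code unit, so the top 6 bits that mark a unit as high surrogate (D800-DBFF), low surrogate (DC00-DFFF), or ordinary BMP character never change, and the result is still well-formed UTF-16.

```python
MASK_BITS = 10  # only the low 10 bits of each code unit are altered


def to_code_units(text: str) -> list[int]:
    """Split a str into UTF-16 code units; non-BMP characters give two units."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def from_code_units(units: list[int]) -> str:
    """Rebuild a str from UTF-16 code units; fails on an unpaired surrogate."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    return raw.decode("utf-16-le")


def mask_text(text: str, key: list[int]) -> str:
    """XOR each code unit with a repeating key of values below 2**MASK_BITS.

    Applying the same key a second time restores the original text.
    """
    if not key or any(k < 0 or k >= (1 << MASK_BITS) for k in key):
        raise ValueError("key values must be in the range 0..1023")
    units = to_code_units(text)
    masked = [u ^ key[i % len(key)] for i, u in enumerate(units)]
    return from_code_units(masked)
```

Calling `mask_text(mask_text(s, key), key)` returns `s`, and the masked string can be encoded to UTF-8 without hitting the unpaired-surrogate error described in the FAQ. The output can still contain control characters such as U+0000, and, as noted at the top of this reply, it is obfuscation only.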

