Any algorithm that turns Unicode character data into non-character data produces binary output, not Unicode. It's inappropriate to shove binary data into Unicode streams, because stuff will break.
https://blogs.msdn.microsoft.com/shawnste/2005/09/26/avoid-treating-binary-data-as-a-string/
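
As a rough illustration of "stuff will break", here is a minimal Python sketch. The payload values are made up for the example, and base64 is shown only as one common way to carry binary data in text; it is an assumption here, not something proposed in this thread.

```python
import base64
import struct

# Arbitrary 16-bit values (made up for illustration): a lone high surrogate,
# the letter "A", and a lone low surrogate. This is not well-formed UTF-16.
payload = struct.pack("<3H", 0xD800, 0x0041, 0xDC00)

def decodes_strictly(data: bytes) -> bool:
    """Return True if the bytes are accepted as well-formed UTF-16-LE."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False

# A strict decoder rejects the data outright.
strict_ok = decodes_strictly(payload)

# A lenient decoder substitutes U+FFFD, so the original bytes are lost.
as_text = payload.decode("utf-16-le", errors="replace")
round_tripped = as_text.encode("utf-16-le")

# Carrying the bytes in an explicit binary-to-text encoding survives intact.
armored = base64.b64encode(payload).decode("ascii")
restored = base64.b64decode(armored)

print(strict_ok, round_tripped == payload, restored == payload)
```

The point is only that a Unicode-aware layer may reject or rewrite anything that is not well-formed text, whereas bytes that are explicitly labeled and transported as binary are left alone.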

-----Original Message-----
From: Unicode [mailto:[email protected]] On Behalf Of Chris Jacobs
Sent: Sunday, January 31, 2016 10:08 AM
To: J Decker <[email protected]>
Cc: [email protected]
Subject: Re: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers

J Decker wrote on 2016-01-31 18:56:
> On Sun, Jan 31, 2016 at 8:31 AM, Chris Jacobs <[email protected]> wrote:
>>
>> J Decker wrote on 2016-01-31 03:28:
>>>
>>> I've reconsidered and think for ease of implementation to just mask
>>> every UTF-16 character (not codepoint) with a 10 bit value, This
>>> will result in no character changing from BMP space to
>>> surrogate-pair or vice-versa.
>>>
>>> Thanks for the feedback.
>>
>> So you are still trying to handle the unarmed output as plaintext.
>> Do you realize that if a string in the output is replaced by a
>> canonical equivalent one this may mess up things because the
>> originals are not canonical equivalent?
>>
> I see ... things like mentioned here
> http://websec.github.io/unicode-security-guide/character-transformations/

Yes, especially the part about normalization.
This would not only spoil the normalized string, but also, as the string can have a different length, for anything after that your ever-changing xor-values may go out of sync.
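
The out-of-sync effect can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `xor_units` and the `KEYS` values are hypothetical stand-ins for the masking scheme discussed above (the real key schedule is not given in the thread), and the text is BMP-only so that one Python code point corresponds to one UTF-16 code unit.

```python
import unicodedata

# Hypothetical 10-bit mask values, picked so the first masked unit is U+00E9.
KEYS = [0x088, 0x001, 0x002, 0x003]

def xor_units(text: str, keys: list[int]) -> str:
    """XOR each code unit with a position-dependent 10-bit value.

    Assumes BMP-only text, so one code point equals one UTF-16 code unit.
    """
    return "".join(
        chr(ord(ch) ^ (keys[i % len(keys)] & 0x3FF)) for i, ch in enumerate(text)
    )

plain = "abcd"
masked = xor_units(plain, KEYS)  # looks like ordinary text

# Left untouched, the masked string unmasks correctly (XOR is its own inverse).
assert xor_units(masked, KEYS) == plain

# Some layer in between normalizes to NFD: U+00E9 becomes "e" + U+0301.
normalized = unicodedata.normalize("NFD", masked)
print(len(masked), len(normalized))  # 4 5

# Every unit after the decomposed character now meets the wrong key value.
recovered = xor_units(normalized, KEYS)
print(recovered == plain)  # False
```

The masked string and its normalized form are canonically equivalent, so a conforming process is free to swap one for the other, yet the strings they unmask to have nothing to do with each other.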

