Re: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers

2016-01-31 Thread Philippe Verdy
I also agree.

To transport binary data over a plain-text format, there are other common
encodings, including Base64 and Quoted-Printable. You can also compress the
binary data before this transformation (using gzip or deflate, for example
in MIME for email), or compress it after the transformation, only on the
transport channel, as in HTTP, which natively supports transparent 8-bit
streams; the latter solution generally performs better.
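
A minimal sketch of the Base64 route, assuming Python and two hypothetical
helper names (armor_units, dearmor_units) that are not from this thread. It
treats the 16-bit values as binary data rather than as text, so unpaired
surrogates survive the round trip:

```python
import base64
import struct

def armor_units(units):
    """Pack 16-bit code units as little-endian bytes and Base64-encode them."""
    raw = struct.pack("<%dH" % len(units), *units)
    return base64.b64encode(raw).decode("ascii")

def dearmor_units(text):
    """Reverse armor_units: Base64-decode and unpack the 16-bit code units."""
    raw = base64.b64decode(text)
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

# 'A', an unpaired high surrogate, 'B', an unpaired low surrogate, U+FFFF
units = [0x0041, 0xD800, 0x0042, 0xDFFF, 0xFFFF]
armored = armor_units(units)

assert armored.isascii()                 # safe in any plain-text channel
assert dearmor_units(armored) == units   # exact 16-bit values are preserved
```

The armored string is ordinary ASCII text, so normalization or conversion
between UTFs leaves it unchanged.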

There's no reliable way to preserve the exact binary encoding of texts
that use invalid UTF sequences (including unpaired surrogates in UTF-16,
isolated surrogate code points and other non-characters in other UTFs, or
forbidden byte values and restricted byte sequences in UTF-8) without using
a binary envelope (which in turn cannot preserve the same encoding of valid
UTF sequences).

Even by using another encoding scheme, encoding form, or legacy charset
mapped to Unicode (including the GB and HKSCS charsets), you will fail each
time, because of canonical equivalences and because the existing conforming
conversions between all UTFs are made to preserve the identity of
characters, not the equality of their binary encodings.
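
To illustrate the point about canonical equivalence, here is a small sketch
(assuming Python and its standard unicodedata module). Two canonically
equivalent strings have different binary encodings, while a conversion
between UTFs keeps the characters and changes only the bytes:

```python
import unicodedata

decomposed = "e\u0301"                                # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)   # single code point U+00E9

# Canonically equivalent text, different code point sequences and byte lengths.
assert composed == "\u00e9"
assert decomposed.encode("utf-8") == b"e\xcc\x81"     # 3 bytes
assert composed.encode("utf-8") == b"\xc3\xa9"        # 2 bytes
assert len(decomposed.encode("utf-16-le")) == 4
assert len(composed.encode("utf-16-le")) == 2

# Converting UTF-16 -> UTF-8 preserves the characters, not the bytes.
via_utf16 = composed.encode("utf-16-le").decode("utf-16-le")
assert via_utf16 == composed
assert via_utf16.encode("utf-8") != composed.encode("utf-16-le")
```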

In summary, what you need is:
- a transport syntax (see HTTP for example) to allow decoding your
envelope, and
- a separate media type (see HTTP and MIME for example; don't choose one
in "text/*", but one in "application/*" such as
"application/octet-stream"), or some filesystem convention or standard for
file types (such as file name extensions in common Unix/Linux filesystems
or FTP, or external metadata streams for file attributes such as in MacOS,
VMS, or even NTFS and almost all HTTP-based filesystems), for your chosen
binary encoding encapsulated in a text-compatible format.
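
As a rough sketch of such an envelope, assuming Python's standard email
package, "application/octet-stream" as the media type, and two hypothetical
helper names (wrap, unwrap), the payload can be declared as binary and
carried with a Base64 transfer encoding:

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

def wrap(payload):
    """Put raw bytes in a MIME envelope declared as opaque binary data."""
    msg = EmailMessage()
    msg.set_content(payload, maintype="application",
                    subtype="octet-stream", cte="base64")
    return msg.as_bytes()

def unwrap(wire):
    """Parse the envelope and return (media type, original bytes)."""
    msg = message_from_bytes(wire, policy=policy.default)
    return msg.get_content_type(), msg.get_content()

# Arbitrary bytes that are not valid UTF-8 text.
payload = bytes([0x00, 0xD8, 0xFF, 0xFE, 0x80, 0xC0])
media_type, restored = unwrap(wrap(payload))

assert media_type == "application/octet-stream"
assert restored == payload
```

The receiver never tries to interpret the payload as text; the declared
media type tells it to treat the content as a blob.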

If your encoded document does not exactly meet the strict conformance
requirements of a text encoding, it cannot be declared or handled as if it
were valid text. You have to handle it as an opaque BLOB (as if it were data
for a bitmap image, executable code, a PKI encryption key, a digest such as
SHA, or an encrypted stream such as DES).

Basic filesystems for Unix/Linux, or FAT, treat all their files as
unrestricted blobs. That's why they use separate data to represent a file's
actual type and to decode it with specific algorithms: most commonly,
filename extensions determine the envelope format, and then internal data
structures in this envelope are used, such as MPEG, OGG, XML with schema
validation, or ZIP archives embedding multiple structured streams with some
conventions.

All these options are out of scope of the Unicode standard, which is not
made to transport and preserve binary encodings. It is made purposely
to allow transparent conversions between all conforming UTFs of valid text
only (nothing else) and to support canonical equivalences as much as
possible in Unicode-conforming processes, so that they are able to choose
between these well-known and standardized text representations.

2016-01-31 20:52 GMT+01:00 Shawn Steele:

> It should be understood that any algorithm that changes the Unicode
> character data to non-character data is therefore binary, and not Unicode.
> It's inappropriate to shove binary data into unicode streams because stuff
> will break.
>
> https://blogs.msdn.microsoft.com/shawnste/2005/09/26/avoid-treating-binary-data-as-a-string/
>
>
> -----Original Message-----
> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Chris Jacobs
> Sent: Sunday, January 31, 2016 10:08 AM
> To: J Decker 
> Cc: unicode@unicode.org
> Subject: Re: Encoding/Use of pontial unpaired UTF-16 surrogate pair
> specifiers
>
>
>
> J Decker schreef op 2016-01-31 18:56:
> > On Sun, Jan 31, 2016 at 8:31 AM, Chris Jacobs 
> > wrote:
> >>
> >>
> >> J Decker schreef op 2016-01-31 03:28:
> >>>
> >>> I've reconsidered and think for ease of implementation to just mask
> >>> every UTF-16 character (not  codepoint) with a 10 bit value, This
> >>> will result in no character changing from BMP space to
> >>> surrogate-pair or vice-versa.
> >>>
> >>> Thanks for the feedback.
> >>
> >>
> >> So you are still trying to handle the unarmed output as plaintext.
> >> Do you realize that if a string in the output is replaced by a
> >> canonical equivalent one this may mess up things because the
> >> originals are not canonical equivalent?
> >>
> > I see ... things like mentioned here
> > http://websec.github.io/unicode-security-guide/character-transformations/
>
> Yes especially the part about normalization.
> This would not only spoil the normalized string, but also, as the string
> can have a different length, for anything after that your ever-changing
> xor-values may go out of sync.


Re: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers

2016-01-31 Thread Chris Jacobs



J Decker schreef op 2016-01-31 18:56:
> On Sun, Jan 31, 2016 at 8:31 AM, Chris Jacobs wrote:
>>
>> J Decker schreef op 2016-01-31 03:28:
>>>
>>> I've reconsidered and think for ease of implementation to just mask
>>> every UTF-16 character (not codepoint) with a 10 bit value, This will
>>> result in no character changing from BMP space to surrogate-pair or
>>> vice-versa.
>>>
>>> Thanks for the feedback.
>>
>> So you are still trying to handle the unarmed output as plaintext.
>> Do you realize that if a string in the output is replaced by a canonical
>> equivalent one this may mess up things because the originals are not
>> canonical equivalent?
>>
> I see ... things like mentioned here
> http://websec.github.io/unicode-security-guide/character-transformations/

Yes especially the part about normalization.
This would not only spoil the normalized string, but also, as the string
can have a different length, for anything after that your ever-changing
xor-values may go out of sync.
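
A small sketch of that failure mode, assuming Python. The helper name
xor_units and the three mask values are hypothetical and hand-picked to
force the problem; in the scheme discussed here the masks would come from
some changing sequence. The masked text happens to contain 'e' followed by
U+0301, which NFC replaces with a single character:

```python
import unicodedata

def xor_units(text, masks):
    """XOR each BMP character (one UTF-16 code unit) with its 10-bit mask."""
    return "".join(chr(ord(ch) ^ m) for ch, m in zip(text, masks))

masks = [0x004, 0x363, 0x001]      # hypothetical per-position 10-bit masks
plain = "abc"

stored = xor_units(plain, masks)
assert stored == "e\u0301b"                 # looks like ordinary text
assert xor_units(stored, masks) == plain    # round trip works if untouched

# A conforming process replaces the string with its NFC equivalent.
normalized = unicodedata.normalize("NFC", stored)
assert normalized == "\u00e9b"              # one code unit shorter

recovered = xor_units(normalized, masks)
assert recovered != plain                   # wrong values, and masks misaligned
```

After normalization the first unit decodes to the wrong value, and every
later unit is XORed with the mask meant for its neighbour.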





RE: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers

2016-01-31 Thread Shawn Steele
It should be understood that any algorithm that changes the Unicode character 
data to non-character data is therefore binary, and not Unicode.  It's 
inappropriate to shove binary data into unicode streams because stuff will 
break.
https://blogs.msdn.microsoft.com/shawnste/2005/09/26/avoid-treating-binary-data-as-a-string/







RE: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers

2016-01-31 Thread Shawn Steele
Typically XOR’ing a constant isn’t really considered worth messing with.  It’s 
somewhat trivial to figure out the key to un-XOR.

On Sat, Jan 30, 2016, 6:31 PM J Decker wrote:
On Sat, Jan 30, 2016 at 4:45 PM, Shawn Steele wrote:
> Why do you need illegal unicode code points?

This originated from learning Javascript; which is internally UTF-16.
Playing with localStorage, some browsers use a sqlite3 database to
store values.  The database is UTF-8 so there must be a valid
conversion between the internal UTF-16 and UTF-8 localStorage (and
reverse).  I wanted to obfuscate the data stored for a certain
application; and cover all content that someone might send.  Having
slept on this, I realized that even if hieroglyphics were stored, if I
pulled out the character using codePointAt() and applied a 20 bit
random value to it using XOR it could end up as a normal character,
and I wouldn't know I had to use a 20 bit value... so every character
would have to use a 20 bit mask (which could end up with a value
that's D800-DFFF).
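
The arithmetic behind this can be sketched as follows (assuming Python; the
function names and mask values are hypothetical and chosen only to show each
case). XORing a code point with a 20-bit value can produce an ordinary BMP
character, a surrogate code point, or, for plane 16 input, a value above
U+10FFFF:

```python
def xor_code_point(cp, mask20):
    """XOR a code point with a mask limited to 20 bits."""
    return cp ^ (mask20 & 0xFFFFF)

def is_surrogate(cp):
    return 0xD800 <= cp <= 0xDFFF

hieroglyph = 0x13000   # U+13000 EGYPTIAN HIEROGLYPH A001

# Case 1: the result is an ordinary BMP character ('A').
assert xor_code_point(hieroglyph, 0x13041) == 0x41

# Case 2: the result is a surrogate code point, not valid in any UTF.
assert is_surrogate(xor_code_point(hieroglyph, 0x1E800))

# Case 3: a plane 16 code point can leave the Unicode range entirely.
assert xor_code_point(0x100000, 0x10000) > 0x10FFFF
```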

I've reconsidered and think for ease of implementation to just mask
every UTF-16 character (not  codepoint) with a 10 bit value, This will
result in no character changing from BMP space to surrogate-pair or
vice-versa.
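
A quick check of that claim, sketched in Python with hypothetical helper
names. A 10-bit mask only touches the low 10 bits of a code unit, and the
high-surrogate block (D800-DBFF) and low-surrogate block (DC00-DFFF) are
both aligned on 1024-unit boundaries, so the top 6 bits that decide the
class never change:

```python
def unit_class(unit):
    """Classify a UTF-16 code unit by its top 6 bits."""
    top = unit >> 10
    if top == 0x36:          # 0xD800..0xDBFF
        return "high surrogate"
    if top == 0x37:          # 0xDC00..0xDFFF
        return "low surrogate"
    return "BMP"

def mask_unit(unit, mask10):
    """XOR a code unit with a mask limited to 10 bits."""
    return unit ^ (mask10 & 0x3FF)

# Every code unit keeps its class, for a sample of masks.
for mask in (0x000, 0x001, 0x155, 0x2AA, 0x3FF):
    for unit in range(0x10000):
        assert unit_class(mask_unit(unit, mask)) == unit_class(unit)

# The result is still not guaranteed to be well-behaved text:
assert mask_unit(0xFFFD, 0x003) == 0xFFFE   # a noncharacter
assert mask_unit(0x0041, 0x041) == 0x0000   # NUL
```

Pairing is preserved, but the output can still contain noncharacters,
control codes, or sequences that normalization will rewrite.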

Thanks for the feedback.
(sorry if I've used some terms inaccurately)

>
> -----Original Message-----
> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of J Decker
> Sent: Saturday, January 30, 2016 6:40 AM
> To: unicode@unicode.org
> Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
>
> I do see that the code points D800-DFFF should not be encoded in any UTF 
> format (UTF8/32)...
>
> UTF8 has a way to define any byte that might otherwise be used as an encoding 
> byte.
>
> UTF16 has no way to define a code point that is D800-DFFF; this is an issue 
> if I want to apply some sort of encryption algorithm and still have the 
> result treated as text for transmission and encoding to other string systems.
>
> http://www.azillionmonkeys.com/qed/unicode.html lists Unicode
> private areas Area-A which is U-F0000:U-FFFFD and Area-B which is
> U-100000:U-10FFFD which will suffice for a workaround for my purposes
>
> For my purposes I will implement F0000-F0800 to be (code point minus
> D800 and then add F0000 (or vice versa)) and then encoded as a surrogate
> pair... it would have been super nice if unicode standards included a way to
> specify code point even if there isn't a language character assigned to that
> point.
>
> http://unicode.org/faq/utf_bom.html
> does say: "Q: Are there any 16-bit values that are invalid?
>
> A: Unpaired surrogates are invalid in UTFs. These include any value in the 
> range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any 
> value in the range DC00 to DFFF not preceded by a value in the range D800 to 
> DBFF "
>
> and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
>
> A different issue arises if an unpaired surrogate is encountered when 
> converting ill-formed UTF-16 data. By represented such an unpaired surrogate 
> on its own as a 3-byte sequence, the resulting UTF-8 data stream would become 
> ill-formed. While it faithfully reflects the nature of the input, Unicode 
> conformance requires that encoding form conversion always results in valid 
> data stream. Therefore a converter must treat this as an error. "
>
>
>
> I did see these older messages... (not that they talk about this much just 
> more info) http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html
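
The converter behaviour described in the FAQ answers quoted above can be
observed with Python's built-in codecs; this is only an illustration, and
the two helper names are hypothetical. A Python str can hold a lone
surrogate code point, but the strict UTF-8 encoder and UTF-16 decoder both
treat it as an error:

```python
def to_utf8_strict(text):
    """Return UTF-8 bytes, or None when the string holds a lone surrogate."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None

def from_utf16le_strict(data):
    """Return the decoded text, or None when the UTF-16 data is ill-formed."""
    try:
        return data.decode("utf-16-le")
    except UnicodeDecodeError:
        return None

# A lone surrogate code point cannot be converted to UTF-8.
assert to_utf8_strict("A\ud800B") is None
# A supplementary character (a valid surrogate pair in UTF-16) can.
assert to_utf8_strict("A\U00013000B") is not None

# Ill-formed UTF-16: unpaired high surrogate D800 followed by 'A'.
ill_formed = b"\x00\xd8\x41\x00"
assert from_utf16le_strict(ill_formed) is None

# A lossy converter substitutes U+FFFD, so the original value is gone.
lossy = ill_formed.decode("utf-16-le", errors="replace")
assert "\ufffd" in lossy
```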


Re: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers

2016-01-31 Thread J Decker
On Sun, Jan 31, 2016 at 12:21 AM, Shawn Steele wrote:
> Typically XOR’ing a constant isn’t really considered worth messing with.
> It’s somewhat trivial to figure out the key to un-XOR.
>
Obviously. It's not constant, nor is it stored anywhere in the code or data.



Re: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers

2016-01-31 Thread Chris Jacobs



J Decker schreef op 2016-01-31 03:28:
>
> I've reconsidered and think for ease of implementation to just mask
> every UTF-16 character (not codepoint) with a 10 bit value, This will
> result in no character changing from BMP space to surrogate-pair or
> vice-versa.
>
> Thanks for the feedback.

So you are still trying to handle the unarmed output as plaintext.
Do you realize that if a string in the output is replaced by a canonical
equivalent one this may mess up things because the originals are not
canonical equivalent?