Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Adam Borowski via Unicode
On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote:
> On Sat, 13 Oct 2018 at 18:58, Steffen Nurpmeso via Unicode <
> unicode@unicode.org> wrote:
> > The only variance is described as:
> >
> >   Care must be taken to use the proper octets for line breaks if base64
> >   encoding is applied directly to text material that has not been
> >   converted to canonical form.  In particular, text line breaks must be
> >   converted into CRLF sequences prior to base64 encoding.  The
> >   important thing to note is that this may be done directly by the
> >   encoder rather than in a prior canonicalization step in some
> >   implementations.
> >
> > This is MIME, it specifies (in the same RFC):
> 
> I've not spoken about the encoding of new lines **in the actual encoded
> text**:
> -  if their existing text-encoding ever gets converted to Base64 as if the
> whole text was an opaque binary object, their initial text-encoding will be
> preserved (so yes it will preserve the way these embedded newlines are
> encoded as CR, LF, CR+LF, NL...)
> 
> I spoke about newlines used in the transport syntax to split the initial
> binary object (which may actually contain text but it does not matter).
> MIME defines this operation and even requires splitting the binary object
> in fragments with maximum binary size so that these binary fragments can be
> converted with Base64 into lines with maximum length. In the MIME Base64
> representation you can insert newlines anywhere between fragments encoded
> separately.

There's another kind of fragmentation that can make the encoding differ (but
still decode to the same payload):

The data stream gets split into packets of 3 bytes internally, 4 characters
externally.  Any packet may contain fewer than 3 bytes, in which case it is
padded with = characters:
3 bytes XXXX
2 bytes XXX=
1 byte  XX==

Usually, such smaller packets happen only at the end of a message, but to
support encoding a stream piecewise, they are allowed at any point.

For example:
"meow"        is bWVvdw==
"me" + "ow"   is bWU=b3c=
yet both carry the same payload.
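
A minimal Python sketch of the piecewise encoding described above.  The
helper names encode_piecewise and decode_quads are made up for this
illustration.  Decoding goes one 4-character packet at a time on purpose:
decoders differ in how they treat padding in mid-stream (some stop at the
first "="), so the sketch does not rely on any particular library behaviour
there.

```python
import base64


def encode_piecewise(chunks):
    """Encode each chunk on its own and concatenate the results."""
    return "".join(base64.b64encode(chunk).decode("ascii") for chunk in chunks)


def decode_quads(text):
    """Decode one 4-character packet at a time, so padding may appear anywhere."""
    if len(text) % 4 != 0:
        raise ValueError("expected a whole number of 4-character packets")
    quads = (text[i:i + 4] for i in range(0, len(text), 4))
    return b"".join(base64.b64decode(quad) for quad in quads)


whole = encode_piecewise([b"meow"])        # encoded in one piece
split = encode_piecewise([b"me", b"ow"])   # two pieces, each padded separately

print(whole, split)                              # two different texts
print(decode_quads(whole), decode_quads(split))  # the same payload twice
```

The two encoded strings differ, yet both decode to the same four bytes.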

> Base64 is used exactly to support this flexibility in transport (or
> storage) without altering any bit of the initial content once it is
> decoded.

Right, any such variations are in packaging only.


ᛗᛖᛟᚹ
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 10 people enter a bar: 1 who understands binary,
⢿⡄⠘⠷⠚⠋⠀ 1 who doesn't, D who prefer to write it as hex,
⠈⠳⣄ and 1 who narrowly avoided an off-by-one error.


Fallback for Sinhala Consonant Clusters

2018-10-13 Thread Richard Wordingham via Unicode
Are there fallback rules for Sinhala consonant clusters?  There are
fallback rules for Devanagari, but I'm not sure if they read across.

The problem I am seeing is that the Pali syllable 'ndhe' න්‍ධෙ  is being rendered identically to a hypothetical Sinhalese
'nēdha' නේධ ,  which in NFD is
, when I use a font that lacks the
conjunct.  (Most fonts lack the conjunct.)  The Devanagari rules and my
preference would lead to a fallback rendering as න්ධෙ  (Sinhalese
'ndhe'), which is encoded as .  Is the rendering I am getting
technically wrong, or is it merely undesirable?

The ambiguity arises in part because, like the Brahmi script, the
Sinhala script uses its virama character as a vowel length indicator.

Missing touching consonants are being rendered almost as though there
were no ZWJ, but the combination of consonant and al-lakuna is being
rendered badly.

Richard.



Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Philippe Verdy via Unicode
On Sat, 13 Oct 2018 at 18:58, Steffen Nurpmeso via Unicode <
unicode@unicode.org> wrote:

> Base64 is defined in RFC 2045 (Multipurpose Internet Mail
> Extensions (MIME) Part One: Format of Internet Message Bodies).
> It is a content-transfer-encoding and encodes any data
> transparently into a 7 bit clean ASCII _and_ EBCDIC compatible
> (the authors commemorate that) text.
> When decoding it reverts this representation into its original form.
> Ok, there is the CRLF newline problem, as below.
> What do you mean by "splitting"?
>
> ...
> The only variance is described as:
>
>   Care must be taken to use the proper octets for line breaks if base64
>   encoding is applied directly to text material that has not been
>   converted to canonical form.  In particular, text line breaks must be
>   converted into CRLF sequences prior to base64 encoding.  The
>   important thing to note is that this may be done directly by the
>   encoder rather than in a prior canonicalization step in some
>   implementations.
>
> This is MIME, it specifies (in the same RFC):


I was not talking about how newlines are encoded **in the actual encoded
text**:
-  if the text is converted to Base64 as if it were an opaque binary object,
its existing text encoding is preserved (so yes, the embedded newlines stay
encoded the way they were: CR, LF, CR+LF, NL...)

I was talking about newlines used in the transport syntax to split the
initial binary object (which may actually contain text, but that does not
matter). MIME defines this operation and even requires splitting the binary
object into fragments of bounded binary size, so that these fragments can be
converted with Base64 into lines of bounded length. In the MIME Base64
representation you can insert newlines anywhere between fragments encoded
separately.

The fragment size is not fixed: RFC 2045 only caps each encoded line at 76
ASCII characters, i.e. at most 57 binary octets per line, each line being
followed by a newline (CR+LF is strongly suggested for MIME, although other
newline sequences are tolerated). Email forwarding agents frequently needed
these line lengths to process the mail properly, not just in the MIME
headers but also in the content body, where they want at least some
whitespace or newline at which they can freely rearrange lines, by
compressing whitespace or by splitting lines to a shorter length as their
processing requires. This is much less frequent today because most mail
agents are 8-bit clean and allow arbitrary line lengths... except in MIME
headers.
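
To illustrate the line splitting described above, here is a small Python
sketch using the standard base64 module.  The rewrap helper and the width
of 40 are assumptions chosen for the example, not something MIME
prescribes; the point is only that the wrapped forms differ as text while
decoding to identical bytes.

```python
import base64

data = bytes(range(200))  # arbitrary binary payload

unwrapped = base64.b64encode(data)   # one long line, no newlines
wrapped = base64.encodebytes(data)   # newline after every 76 output characters


def rewrap(b64_line, width):
    """Re-split an unwrapped Base64 line at another width (hypothetical helper)."""
    pieces = [b64_line[i:i + width] for i in range(0, len(b64_line), width)]
    return b"\r\n".join(pieces) + b"\r\n"


narrow = rewrap(unwrapped, 40)

# Three different transport texts ...
assert len({unwrapped, wrapped, narrow}) == 3

# ... one payload: the default (non-validating) decoder skips the newlines.
for text in (unwrapped, wrapped, narrow):
    assert base64.b64decode(text) == data
```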

In MIME headers the situation is different: there really is a maximum line
length, and a header that is too long has to be split over multiple lines
using continuation sequences, i.e. a newline (CR+LF is standard here)
followed by at least one space. (This insertion/change/removal of whitespace
is permitted everywhere in the MIME header after the header type, and even
before the colon that follows the header type.) So a MIME header value whose
text gets encoded with Base64 will be split into fragments, each introduced
by a "=?" sequence carrying the indication that the fragment is
Base64-encoded (instead of being QuotedPrintable-encoded), then a separator
and the encapsulated Base64 encoding of the fragment. A single header value
may hold multiple Base64-encoded fragments, and there is a lot of freedom
about where to split the value into fragments of a convenient size that
satisfies the MIME requirements. These multiple fragments may then occur on
the same line (separated by whitespace) or on multiple lines (separated by
continuation sequences).
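
As a rough illustration of the header case, the following Python sketch
feeds three differently fragmented header values to the standard
email.header module.  The sample values and the header_text helper are
invented for the example; the charset label UTF-8 is likewise an
assumption.

```python
from email.header import decode_header, make_header


def header_text(raw_value):
    """Decode the "=?charset?B?...?=" fragments of a header value into one string."""
    return str(make_header(decode_header(raw_value)))


# The same text, fragmented three different ways.
single = "=?UTF-8?B?bWVvdw==?="                   # one fragment
same_line = "=?UTF-8?B?bWU=?= =?UTF-8?B?b3c=?="   # two fragments on one line
folded = "=?UTF-8?B?bWU=?=\r\n =?UTF-8?B?b3c=?="  # two fragments, folded header

for value in (single, same_line, folded):
    print(repr(header_text(value)))
```

All three values are different as transported text, but a conforming
decoder is expected to give back the same header text for each.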

In that case, the same initial text can have multiple valid representations
in a MIME envelope format using Base64: it is not Base64 itself that 

Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Steffen Nurpmeso via Unicode
Philippe Verdy via Unicode wrote in :
 |You forget that Base64 (as used in MIME) does not follow these rules,
 |as it allows multiple different encodings for the same source binary.
 |MIME actually splits a binary object into multiple fragments at random
 |positions, and then encodes these fragments separately. Also MIME uses
 |an extension of Base64 where it allows some variations in the encoding
 |alphabet (so even the same fragment of the same length may have two
 |distinct encodings).
 |
 |Base64 in MIME is different from standard Base64 (which never splits
 |the binary object before encoding it, and uses a strict alphabet of
 |64 ASCII characters, allowing no variation). So MIME requires special
 |handling: the assumption that a binary message is always encoded the
 |same way is wrong, but MIME still requires that this non-unique Base64
 |encoding be decoded back to the same initial (unsplit) binary object
 |(independently of its size and independently of the splitting
 |boundaries used in the transport, which may change during the
 |transport).

Base64 is defined in RFC 2045 (Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message Bodies).
It is a content-transfer-encoding and encodes any data
transparently into a 7 bit clean ASCII _and_ EBCDIC compatible
(the authors commemorate that) text.
When decoding it reverts this representation into its original form.
Ok, there is the CRLF newline problem, as below.
What do you mean by "splitting"?

...
The only variance is described as:

  Care must be taken to use the proper octets for line breaks if base64
  encoding is applied directly to text material that has not been
  converted to canonical form.  In particular, text line breaks must be
  converted into CRLF sequences prior to base64 encoding.  The
  important thing to note is that this may be done directly by the
  encoder rather than in a prior canonicalization step in some
  implementations.

This is MIME, it specifies (in the same RFC):

  2.10.  Lines

   "Lines" are defined as sequences of octets separated by a CRLF
   sequences.  This is consistent with both RFC 821 and RFC 822.
   "Lines" only refers to a unit of data in a message, which may or may
   not correspond to something that is actually displayed by a user
   agent.

and furthermore

  6.5.  Translating Encodings

   The quoted-printable and base64 encodings are designed so that
   conversion between them is possible.  The only issue that arises in
   such a conversion is the handling of hard line breaks in quoted-
   printable encoding output. When converting from quoted-printable to
   base64 a hard line break in the quoted-printable form represents a
   CRLF sequence in the canonical form of the data. It must therefore be
   converted to a corresponding encoded CRLF in the base64 form of the
   data.  Similarly, a CRLF sequence in the canonical form of the data
   obtained after base64 decoding must be converted to a quoted-
   printable hard line break, but ONLY when converting text data.

So we go over

  6.6.  Canonical Encoding Model

   There was some confusion, in the previous versions of this RFC,
   regarding the model for when email data was to be converted to
   canonical form and encoded, and in particular how this process would
   affect the treatment of CRLFs, given that the representation of
   newlines varies greatly from system to system, and the relationship
   between content-transfer-encodings and character sets.  A canonical
   model for encoding is presented in RFC 2049 for this reason.

to RFC 2049 where we find

 For example, in the case of text/plain data, the text
  must be converted to a supported character set and
  lines must be delimited with CRLF delimiters in
  accordance with RFC 822.  Note that the restriction on
  line lengths implied by RFC 822 is eliminated if the
  next step employs either quoted-printable or base64
  encoding.

and, later

   Conversion from entity form to local form is accomplished by
   reversing these steps. Note that reversal of these steps may produce
   differing results since there is no guarantee that the original and
   final local forms are the same.

and, later

   NOTE: Some confusion has been caused by systems that represent
   messages in a format which uses local newline conventions which
   differ from the RFC822 CRLF convention.  It is important to note that
   these formats are not canonical RFC822/MIME.  These formats are
   instead *encodings* of RFC822, where CRLF sequences in the canonical
   representation of the message are encoded as the local newline
   convention.  Note that formats which encode CRLF sequences as, for
   example, LF are not capable of representing MIME messages containing
   binary data which contains LF octets not part of CRLF line separation
   sequences.
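
The canonicalization step quoted from RFC 2049 above can be sketched in
Python as follows.  The function names to_canonical and encode_text are
made up for this illustration, and UTF-8 is just an assumed charset; this
is a sketch of the idea, not a full MIME encoder.

```python
import base64
import re


def to_canonical(text):
    """Turn every CR, LF or CRLF line break into CRLF (canonical form)."""
    return re.sub(r"\r\n|\r|\n", "\r\n", text)


def encode_text(text, charset="utf-8"):
    """Canonicalize line breaks, convert to the charset, then base64-encode."""
    return base64.b64encode(to_canonical(text).encode(charset)).decode("ascii")


unix_text = "line one\nline two\n"
dos_text = "line one\r\nline two\r\n"

# Different local forms, same canonical form, same base64 text.
assert encode_text(unix_text) == encode_text(dos_text)

# Decoding gives back the canonical form, not necessarily the local form
# the sender started from.
decoded = base64.b64decode(encode_text(unix_text)).decode("utf-8")
assert decoded == dos_text
assert decoded != unix_text
```

This is the "no guarantee that the original and final local forms are the
same" caveat in practice: the base64 layer itself is lossless, the
canonicalization done before it is not.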

Whoever understands this emojibake.
My MUA still gnaws at 

Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Philippe Verdy via Unicode
In summary, two distinct implementations are allowed to return different
values t and t' of Base64_Encode(d) from the same message d, but
Base64_Decode(t') and Base64_Decode(t) must be equal, and both MUST return
d exactly.

There's an allowed choice of implementation for Base64_Encode() but
Base64_Decode() must then be updated to be permissive/flexible and ensure
that in all cases,
Base64_Decode[Base64_Encode[d]] = d, for every value of d.

The reverse is not true because of this flexibility (needed for various
transport protocols that have different requirements, notably on the
allowed set of characters, and on their maximum line lengths):
Base64_Encode[Base64_Decode[t]] = t may be false.
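
A short Python check of the two statements above, using the standard
base64 module.  The sample text t_wrapped is made up: it is the encoding
of "meow" with a CRLF inserted after the first 4 characters, as a
line-wrapping transport might do.

```python
import base64

d = b"meow"

t = base64.b64encode(d)            # what this particular encoder produces
t_wrapped = b"bWVv\r\ndw==\r\n"    # another valid transport form of the same data

# Decode(Encode(d)) = d, and both texts decode to d.
assert base64.b64decode(t) == d
assert base64.b64decode(t_wrapped) == d

# Encode(Decode(t)) = t can fail: re-encoding gives this encoder's own form.
assert base64.b64encode(base64.b64decode(t_wrapped)) != t_wrapped
assert base64.b64encode(base64.b64decode(t_wrapped)) == t
```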



Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Philippe Verdy via Unicode
You forget that Base64 (as used in MIME) does not follow these rules, as it
allows multiple different encodings for the same source binary. MIME
actually splits a binary object into multiple fragments at random
positions, and then encodes these fragments separately. Also MIME uses an
extension of Base64 where it allows some variations in the encoding
alphabet (so even the same fragment of the same length may have two distinct
encodings).

Base64 in MIME is different from standard Base64 (which never splits the
binary object before encoding it, and uses a strict alphabet of 64 ASCII
characters, allowing no variation). So MIME requires special handling: the
assumption that a binary message is always encoded the same way is wrong,
but MIME still requires that this non-unique Base64 encoding be decoded back
to the same initial (unsplit) binary object (independently of its size and
independently of the splitting boundaries used in the transport, which may
change during the transport).

This also applies to the Base64 encoding used in HTTP transport syntax, and
notably in the HTTP/1.1 streaming feature where fragment sizes are also
variable.



RE: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Costello, Roger L. via Unicode
Hi Folks,

Thank you for your outstanding responses! 

Below is a summary of what I learned. Are there any errors in the summary? Is 
there anything you would add? Please let me know of anything that is not clear. 
  /Roger

1. While base64 encoding is usually applied to binary, it is also sometimes 
applied to text, such as Unicode text.

Note: Since base64 encoding may be applied to both binary and text, in the 
following bullets I use the more generic term "data". For example, "Data d is 
base64-encoded to yield ..."

2. Neither base64 encoding nor decoding should presume any special knowledge of 
the meaning of the data or do anything extra based on that presumption. 

For example, converting Unicode text to and from base64 should not perform any 
sort of Unicode normalization, convert between UTFs, insert or remove BOMs, 
etc. This is like saying that converting a JPEG image to and from base64 should 
not resize or rescale the image, change its color depth, convert it to another 
graphic format, etc.

If you use base64 for encoding MIME content (e.g. emails), the base64 decoding 
will not transform the content. The email parser must ensure that the content 
is valid, so the parser might have to transform the content (possibly replacing 
some invalid sequences or truncating), and then apply Unicode normalization to 
render the text. These transforms are part of the MIME application and are 
independent of whether you use base64 or any other encoding or transport 
syntax.

3. If data d is different than d', then the base64 text resulting from encoding 
d is different than the base64 text resulting from encoding d'.

4. If base64 text t is different than t', then the data resulting from decoding 
t is different than the data resulting from decoding t'.

5. For every data d there is exactly one base64 encoding t.

6. Every base64 text t is an encoding of exactly one data d.

7. For all data d, Base64_Decode[Base64_Encode[d]] = d
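
Points 2, 3 and 7 can be checked with a few lines of Python, shown here as
a sketch for one fixed encoder (the standard base64 module, without line
wrapping).  The helper name b64_text and the choice of UTF-8 and UTF-16LE
are assumptions made for the example.

```python
import base64
import unicodedata


def b64_text(text, charset="utf-8"):
    """Base64 of the bytes of text in the given charset; no other processing."""
    return base64.b64encode(text.encode(charset)).decode("ascii")


nfc = unicodedata.normalize("NFC", "\u00e9")   # e-acute as one code point
nfd = unicodedata.normalize("NFD", "\u00e9")   # "e" followed by combining acute

# Point 3: canonically equivalent texts are still different data,
# so they get different base64 texts.
assert nfc != nfd
assert b64_text(nfc) != b64_text(nfd)

# Point 2: the same text in another UTF is different data as well.
assert b64_text(nfc, "utf-16-le") != b64_text(nfc, "utf-8")

# Point 7: decoding returns exactly the bytes that went in, unnormalized.
assert base64.b64decode(b64_text(nfd)).decode("utf-8") == nfd
assert base64.b64decode(b64_text(nfd)).decode("utf-8") != nfc
```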