Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Steffen Nurpmeso via Unicode
Philippe Verdy via Unicode wrote in :
 |Padding itself does not clearly indicate the length.
 |
 |It's an artefact that **may** be inferred only in some other layers \
 |of protocols which specify when and how padding is needed (and how \
 |many padding bytes 
 |are required or accepted), it works only if these upper layer protocols \
 |are using **octets** streams, but it is still not usable for more general 
 |bitstreams (with arbitrary bit lengths).
 |
 |This RFC does not mandate/require these padding bytes and in fact many \
 |upper layer protocols do not ever need it (including UTF-7 for example), \
 |they are 
 |never necessary to infer a length in octets and insufficient for specify\
 |ing a length in bits.
 |
 |As well the usage in MIME (where there's a requirement that lines of \
 |headers or in the content body is limited to 1000 bytes) requires free \
 |splitting of 
 |Base64 (there's no agreed maximum length, some sources insist it should \
 |not be more than 72 bytes, others use 80 bytes, but mail forwarding \
 |may add other 
 |characters at start of lines, forcing them to be shorter (leaving for \
 |example a line of 72 bytes+CRLF and another line of 8 bytes+CRLF): \
 |this means that 
 |padding may not be used where one would expect it, and padding can \
 |even occur in the middle of the encoded stream (not just at end) along \

That was actually a bug in my MUA.  Other MUAs were not capable of
decoding this correctly.
Sorry :-(!!

 |with other 
 |whitespaces or separators (like "> " at start of lines in cited messages).

In fact, MIME explicitly says that garbage bytes may be embedded.
Most decoders handle that correctly and skip them (silently, which
may not be right), but some standalone base64 decoders fail
miserably when they see such bytes (openssl base64, the current
NetBSD base64 decoder); others do not (busybox base64, for example).
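As a rough illustration of the two behaviours described above, here is a
sketch using Python's standard base64 module (not any of the tools named
in this message). With its default settings b64decode silently skips
bytes outside the alphabet, while validate=True makes it reject them. The
sample input, with "> " quote prefixes and CRLFs, is made up for the
example.

```python
import base64
import binascii

# Hypothetical input: the base64 of b"hello" ("aGVsbG8="), split over two
# lines and prefixed with "> " the way a quoting mail client might do it.
mangled = b"> aGVs\r\n> bG8=\r\n"

# Lenient mode (the default): characters outside the base64 alphabet are
# discarded before decoding.
lenient = base64.b64decode(mangled)
print(lenient)

# Strict mode: the same input is rejected with binascii.Error.
try:
    base64.b64decode(mangled, validate=True)
    strict_ok = True
except binascii.Error:
    strict_ok = False
print(strict_ok)
```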

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Philippe Verdy via Unicode
Padding itself does not clearly indicate the length.
It is an artefact whose meaning **may** be inferred only in other protocol
layers, which specify when and how padding is needed (and how many padding
bytes are required or accepted). It works only if these upper-layer
protocols use **octet** streams; it is still not usable for more general
bitstreams (with arbitrary bit lengths).

This RFC does not mandate/require these padding bytes, and in fact many
upper-layer protocols never need them (UTF-7, for example): they are never
necessary to infer a length in octets and are insufficient for specifying a
length in bits.

Likewise, the usage in MIME (where lines of headers or of the content body
are limited to 1000 bytes) requires free splitting of Base64. There is no
agreed maximum length: some sources insist it should not be more than 72
bytes, others use 80 bytes, and mail forwarding may add other characters at
the start of lines, forcing them to be shorter (leaving, for example, a line
of 72 bytes+CRLF and another line of 8 bytes+CRLF). This means that padding
may not be used where one would expect it, and padding can even occur in the
middle of the encoded stream (not just at the end), along with other
whitespace or separators (like "> " at the start of lines in cited
messages).

More generally, the padding in MIME offers no benefit at all. The actual
length is inferred from the whole content body, and it is just safer to
ignore/discard all padding symbols in decoders (just as they discard
whitespace or ">"). If one wants a sure indication that the stream is not
truncated and has the expected length, the encoded message must either embed
this length as part of the original binary stream itself, or embed secure
"digital signatures", "message digests" or "hashes"; alternatively, the
length can be specified separately in the unencoded MIME body, or as part of
the MIME header if the whole MIME content body is specified as using a
base64 encoding. The same applies to HTTP.

I have rarely seen RFC 4648 used alone, outside of another upper-layer
protocol. The statement in RFC 4648 section 3.2 is, for example, completely
wrong for Base16, where padding is almost always avoided.

Various other Base-N profiles for other upper-layer protocols never need
(and sometimes even forbid) any padding symbol, or consider that padding can
also be made using 0 bits to pad the original binary stream, or using other
ignored/discarded whitespace or symbols, without assigning any specific role
to "=" (as a length indicator or stream terminator).
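The arithmetic behind "padding is not needed to infer a length in octets"
can be sketched as follows. This is only an illustration in Python, under
the assumption that the decoder sees the complete run of base64 symbols and
nothing else (no whitespace, no "="); the helper names decoded_length and
decode_unpadded are invented for the example.

```python
import base64

def decoded_length(symbol_count: int) -> int:
    """Octets carried by `symbol_count` base64 symbols (padding excluded)."""
    if symbol_count % 4 == 1:
        # 6 leftover bits cannot hold a whole octet, so this count never
        # results from encoding an octet stream.
        raise ValueError("impossible base64 length")
    return symbol_count * 6 // 8

def decode_unpadded(text: str) -> bytes:
    """Decode base64 whose trailing "=" signs were dropped."""
    decoded_length(len(text))  # rejects impossible lengths
    # Restore the padding the standard decoder expects, then decode.
    return base64.b64decode(text + "=" * (-len(text) % 4))

for sample in ("YQ", "YWI", "YWJj", "aGVsbG8"):
    data = decode_unpadded(sample)
    print(sample, data, len(data) == decoded_length(len(sample)))
```

The octet count follows from the symbol count alone; what padding cannot do,
as noted above, is express a length in bits.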



Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Peter Saint-Andre via Unicode
On 10/14/18 3:59 PM, Philippe Verdy via Unicode wrote:
> 
> 
> On Sun, Oct 14, 2018 at 21:21, Doug Ewell via Unicode
> <unicode@unicode.org> wrote:
> 
> Steffen Nurpmeso wrote:
> 
> > Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
> > (MIME) Part One: Format of Internet Message Bodies).
> 
> Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
> Encodings." RFC 2045 defines a particular implementation of base64,
> specific to transporting Internet mail in a 7-bit environment.
> 
> 
> Wrong, this is "specific" to transporting Internet mail in any 7 bit or
> 8 bit environment (today almost all mail agents are operating in 8 bit),
> and then it is referenced directly by HTTP (and its HTTPS variant).
> 
> So this is not so "specific". MIME is extremely popular, RFC 4648 is
> extremely exotic (and RFC 4648 is wrong when saying that IMAP is very
> specific as it is now a very popular protocol, widely used as well).
> MIME is so frequently used, that almost all people refer to it when they
> look for Base64, or do not explicitly state that another definition
> (found in an exotic RFC) is explicitly used.

RFC 4648 is used in many, many Internet protocols. It's definitely not
"extremely exotic".

Peter



Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Steffen Nurpmeso via Unicode
Doug Ewell via Unicode wrote in <2A67B4F082F74F8AADF34BA11D885554@DougEwell>:
 |Steffen Nurpmeso wrote:
 |> Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
 |> (MIME) Part One: Format of Internet Message Bodies).
 |
 |Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
 |Encodings." RFC 2045 defines a particular implementation of base64,
 |specific to transporting Internet mail in a 7-bit environment.
 |
 |RFC 4648 discusses many of the "higher-level protocol" topics that some
 |people are focusing on, such as separating the base64-encoded output
 |into lines of length 72 (or other), alternative target code unit sets or
 |"alphabets," and padding characters. It would be helpful for everyone to
 |read this particular RFC before concluding that these topics have not
 |been considered, or that they compromise round-tripping or other
 |characteristics of base64.
 |
 |I had assumed that when Roger asked about "base64 encoding," he was
 |asking about the basic definition of base64.

Sure; i have only followed the discussion superficially, and even
though everybody can read RFCs, i felt the necessity to polemicize
against the claim, false however i look at it, that "MIME actually
splits a binary object into multiple fragments at random positions".
Solely my fault.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


RE: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Tex via Unicode
Philippe, quote the entire section:

 

   In some circumstances, the use of padding ("=") in base-encoded data
   is not required or used.  In the general case, when assumptions about
   the size of transported data cannot be made, padding is required to
   yield correct decoded data.

   Implementations MUST include appropriate pad characters at the end of
   encoded data unless the specification referring to this document
   explicitly states otherwise.

 

The first para clarifies that padding is required when the length is not 
otherwise known. Only if the length is provided or predefined can the padding 
be dropped.

The second para clarifies it must be included unless the higher level protocol 
states otherwise, in which case it is likely using another mechanism to define 
length.

 

It doesn’t seem to me to be as open ended as you implied in your initial mails, 
but well-defined depending on whether base64 is being used as spec’d in the 
RFC, or being explicitly modified to suit an embedding protocol.

And certainly the first sentence in this section isn’t intended to be taken 
without the context of the rest of the section.
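One situation where the length is not otherwise known, chosen here purely as
an illustration, is a stream made of several encoded pieces placed back to
back. The Python sketch below (the splitting regex is an assumption of this
example, not something taken from the RFC) shows that the boundaries survive
with padding and are lost without it.

```python
import base64
import re

first = base64.b64encode(b"a").decode("ascii")   # "YQ=="
second = base64.b64encode(b"a").decode("ascii")  # "YQ=="

# With padding, the boundary between the two encoded pieces is visible, so
# each piece can be decoded on its own.
padded_stream = first + second
pieces = re.findall(r"[A-Za-z0-9+/]+={0,2}", padded_stream)
recovered = b"".join(base64.b64decode(piece) for piece in pieces)

# Without padding the boundary is lost: the four symbols read as one
# complete group of three octets.
unpadded_stream = first.rstrip("=") + second.rstrip("=")
misread = base64.b64decode(unpadded_stream)

print(pieces, recovered)
print(unpadded_stream, misread)
```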

 

tex

From: Philippe Verdy [mailto:verd...@wanadoo.fr] 
Sent: Monday, October 15, 2018 4:14 AM
To: Tex Texin
Cc: Adam Borowski; unicode Unicode Discussion
Subject: Re: Base64 encoding applied to different unicode texts always yields 
different base64 texts ... true or false?

 

Look into https://tools.ietf.org/html/rfc4648, section 3.2, paragraph 1, 1st 
sentence, where it is explicitly stated:

 

In some circumstances, the use of padding ("=") in base-encoded data is not 
required or used.

 

On Mon, Oct 15, 2018 at 03:56, Tex wrote:

Philippe,

 

Where is the use of whitespace or the idea that 1-byte pieces do not need all 
the equal sign paddings documented?

I read the rfc 3501 you pointed at, I don’t see it there.

 

Are these part of any standards? Or are you claiming these are practices 
despite the standards? If so, are these just tolerated by parsers, or are they 
actually generated by encoders?

 

What would be the rationale for supporting unnecessary whitespace? If 
linebreaks are forced at some line length they can presumably be removed at 
that length and not treated as part of the encoding.

Maybe we differ on how we define where the encoding begins and ends, and where 
higher level protocols prescribe how they are embedded within the protocol.

 

Tex

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe Verdy 
via Unicode
Sent: Sunday, October 14, 2018 1:41 AM
To: Adam Borowski
Cc: unicode Unicode Discussion
Subject: Re: Base64 encoding applied to different unicode texts always yields 
different base64 texts ... true or false?

 

Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is enough 
to indicate the end of an octets-span. The extra = after it does not add any 
other octet. As well, you're allowed to insert whitespace anywhere in the 
encoded stream; this is what ensures that the Base64-encoded octets-stream will 
not be altered if line breaks are forced anywhere (notably within the body of 
emails).

 

So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR, LF, 
NEL) in the middle is non-significant and ignorable on decoding (their 
"encoded" bit length is 0 and they don't terminate an octets-span, unlike "=" 
which discards extra bits remaining from the encoded stream before that are not 
on 8-bit boundaries).

 

Also:

- For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol 
before "=" can vary in its 4 lowest bits (which are then ignored/discarded by 
the "=" symbol)

- For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X" symbol 
before "=" can vary in its 2 lowest bits (which are then ignored/discarded by 
the "=" symbol)

 

So you can use Base64 by encoding each octet in separate pieces, as two Base64 
symbols followed by an "=" symbol, and even insert any amount of whitespace 
between them: there's an infinite number of valid Base64 encodings for 
representing the same octets-stream payload.
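The point about the ignored low bits can be checked with a lenient decoder.
The sketch below uses Python's standard base64 module in its default mode,
so it shows only what that one decoder accepts; RFC 4648 section 3.5 lets a
decoder reject encodings whose pad bits are not zero, so other decoders may
behave differently. The helper is_canonical is invented for this example.

```python
import base64

# "YQ==" is the canonical encoding of b"a"; in the variants below only the
# 4 lowest bits of the second symbol differ.
variants = ["YQ==", "YR==", "YS==", "Yf=="]

# Lenient decoding: the unused low bits are dropped, so all variants agree.
decoded = [base64.b64decode(v) for v in variants]
print(decoded)

def is_canonical(text: str) -> bool:
    """True if re-encoding the decoded octets gives back the same text."""
    return base64.b64encode(base64.b64decode(text)).decode("ascii") == text

canonical_flags = [is_canonical(v) for v in variants]
print(canonical_flags)
```

Comparing against the re-encoded form is one simple way for a receiver to
insist on the canonical spelling when distinct texts must map to distinct
encodings.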

 

Base64 allows encoding any octets streams but not directly any bits-streams : 
it assumes that the effective bits-stream has a binary length multiple of 8. To 
encode a bits-stream with an exact number of bits (not multiple of 8), you need 
to encode an extra payload to indicate the effective number of bits to keep at 
end of the encoded octets-stream (or at start):

- Base64 does not specify how you convert a bitstream of arbitrary length to an 
octets-stream;

- for that purpose, you may need to pad the bits-stream at start or at end with 
1 to 7 bits (so that the resulting bitstream has a length multiple of 8, 
then encodable with Base64 which takes only octets on input).

- these extra paddin

Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Philippe Verdy via Unicode
Note that all this discussion about padding applies to all other base-N
encodings, including base-10.

For example, to represent numbers of arbitrary precision: padding does not
require a separate symbol but can use the "0" digit, which is part of the
10-symbol alphabet, or encoders can discard such digits on the left, or on
the right if there's a decimal dot. When the precision is less than an
integral number of decimal digits, the extra bits or fractional bits of
information in the last digit of the encoded sequence do not matter:
encoders may choose not to set them to 0 but may prefer to use rounding,
which may conditionally set these bits to 1, depending on the value of the
last significant bits or fractional bits of maximum precision.

Likewise, the same decoders may want to allow extra whitespace (notably to
limit line lengths at arbitrary lengths, notably for embedding the encoded
sequences in printed documents or documents with a page layout and rendered
with a readable font size suitable for the page width, or for presentation
purposes by grouping symbols).

In summary, padding is not required at all by all Base-N encoders/decoders,
and non-significant whitespace is frequently needed.



Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Philippe Verdy via Unicode
If you want an example where padding with "=" is not used at all,
- look into URL-shortening schemes
- look into database fields or data input forms and numerous data formats
where the "=" sign is restricted (just like in URLs and file paths, or in
identifiers)
Padding is not used anywhere in the middle of the binary encoding or even
at the end; only the 64 symbols of the encoding alphabet are needed, and the
extra 2 or 4 lowest bits that may be encoded in the last character of the
encoded sequence are discarded by the decoder (these extra bits are not
necessarily set to 0 by encoders in the last symbol, even if this is the
canonical form recommended for encoders; their value is simply ignored by
decoders).
Some Base64 encoders do not necessarily encode binary octets-streams, but
bits-streams whose length in bits is not necessarily a multiple of 8, in
which case there may be 1 to 7 trailing bits (not just 2 or 4) in the last
symbol of the encoded sequence.
Other encoders use streams of binary code units that are larger than 8
bits, and may want to encode more padding symbols to force the alignment of
data required in their associated decoders, or will choose not to use any
padding at all, letting the decoder discard the trailing bits itself at
the end of the encoded stream.
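A minimal sketch of the kind of "=" -free token described above, assuming
the URL-safe alphabet from Python's standard base64 module; the function
names encode_token and decode_token are made up for this example and do not
refer to any particular URL-shortening scheme.

```python
import base64

def encode_token(data: bytes) -> str:
    """URL-safe base64 with the trailing "=" signs removed."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def decode_token(token: str) -> bytes:
    """Inverse of encode_token: restore the padding, then decode."""
    return base64.urlsafe_b64decode(token + "=" * (-len(token) % 4))

for payload in (b"", b"\xfb", b"\xfb\xff", b"\xfb\xff\xfe", b"hello"):
    token = encode_token(payload)
    assert "=" not in token
    assert decode_token(token) == payload
    print(repr(payload), token)
```

The round trip works here because the decoder receives the whole token and
can recompute the missing padding from its length.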


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Philippe Verdy via Unicode
Also, the rationale for supporting "unnecessary" whitespace is found in
MIME's version of Base64, and also in RFCs describing encoding formats for
digital certificates, or for exchanging public keys in encryption
algorithms like PGP (notably, but not only, as text in the body of emails
or in documentation and websites).


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Philippe Verdy via Unicode
Look into https://tools.ietf.org/html/rfc4648, section 3.2, paragraph 1, 1st
sentence, where it is explicitly stated:

In some circumstances, the use of padding ("=") in base-encoded data
is not required or used.


On Mon, Oct 15, 2018 at 03:56, Tex wrote:

> Philippe,
>
>
>
> Where is the use of whitespace or the idea that 1-byte pieces do not need
> all the equal sign paddings documented?
>
> I read the rfc 3501 you pointed at, I don’t see it there.
>
>
>
> Are these part of any standards? Or are you claiming these are practices
> despite the standards? If so, are these just tolerated by parsers, or are
> they actually generated by encoders?
>
>
>
> What would be the rationale for supporting unnecessary whitespace? If
> linebreaks are forced at some line length they can presumably be removed at
> that length and not treated as part of the encoding.
>
> Maybe we differ on how we define where the encoding begins and ends, and where
> higher level protocols prescribe how they are embedded within the protocol.
>
>
>
> Tex
>
> *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Philippe
> Verdy via Unicode
> *Sent:* Sunday, October 14, 2018 1:41 AM
> *To:* Adam Borowski
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Base64 encoding applied to different unicode texts always
> yields different base64 texts ... true or false?
>
>
>
> Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
> enough to indicate the end of an octets-span. The extra = after it does not
> add any other octet. As well, you're allowed to insert whitespace
> anywhere in the encoded stream; this is what ensures that the
> Base64-encoded octets-stream will not be altered if line breaks are forced
> anywhere (notably within the body of emails).
>
>
>
> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR,
> LF, NEL) in the middle is non-significant and ignorable on decoding (their
> "encoded" bit length is 0 and they don't terminate an octets-span, unlike
> "=" which discards extra bits remaining from the encoded stream before that
> are not on 8-bit boundaries).
>
>
>
> Also:
>
> - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol
> before "=" can vary in its 4 lowest bits (which are then ignored/discarded
> by the "=" symbol)
>
> - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X"
> symbol before "=" can vary in its 2 lowest bits (which are then
> ignored/discarded by the "=" symbol)
>
>
>
> So you can use Base64 by encoding each octet in separate pieces, as two
> Base64 symbols followed by an "=" symbol, and even insert any amount of
> whitespace between them: there's an infinite number of valid Base64
> encodings for representing the same octets-stream payload.
>
>
>
> Base64 allows encoding any octets streams but not directly any
> bits-streams : it assumes that the effective bits-stream has a binary
> length multiple of 8. To encode a bits-stream with an exact number of bits
> (not multiple of 8), you need to encode an extra payload to indicate the
> effective number of bits to keep at end of the encoded octets-stream (or at
> start):
>
> - Base64 does not specify how you convert a bitstream of arbitrary length
> to an octets-stream;
>
> - for that purpose, you may need to pad the bits-stream at start or at end
> with 1 to 6 bits (so that it the resulting bitstream has a length multiple
> of 8, then encodable with Base64 which takes only octets on input).
>
> - these extra padding bits are not significant for the original bitstream,
> but are significant for the Base64 encoder/decoder, they will be discarded
> by the bitstream decoder built on top of the Base64 decoder, but not by the
> Base64 decoder itself.
>
>
>
> You need to encode somewhere with the bitstream encoder how many padding
> bits (0 to 7) are present at start or end of the octets-stream; this can be
> done:
>
> - as a separate payload (not encoded by Base64), or
>
> - by prepending 3 bits at start of the bits-stream then padded at end with
> 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64
> encoding.
>
> - by appending 3 bits at end of the  bits-stream, just after 1 to 7 random
> bits needed to get a bit-length multiple of 8 suitable for Base64 encoding.
>
> Finally your bits-stream decoder will be able to use this padding count to
> discard these random padding bits (and possibly realign the stream on
> different byte-boundaries when the effective bitlength bits-stream 

RE: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-14 Thread Tex via Unicode
Philippe,

 

Where is the use of whitespace, or the idea that 1-byte pieces do not need all
the equal-sign padding, documented?

I read the RFC 3501 you pointed at, and I don't see it there.

Are these part of any standards? Or are you claiming these are practices
despite the standards? If so, are these just tolerated by parsers, or are they
actually generated by encoders?

What would be the rationale for supporting unnecessary whitespace? If
line breaks are forced at some line length, they can presumably be removed at
that length and not treated as part of the encoding.

Maybe we differ on where the encoding begins and ends, and on where
higher-level protocols prescribe how it is embedded within the protocol.

 

Tex

 

 

 

 


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-14 Thread Philippe Verdy via Unicode
Le dim. 14 oct. 2018 à 21:21, Doug Ewell via Unicode 
a écrit :

> Steffen Nurpmeso wrote:
>
> > Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
> > (MIME) Part One: Format of Internet Message Bodies).
>
> Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
> Encodings." RFC 2045 defines a particular implementation of base64,
> specific to transporting Internet mail in a 7-bit environment.
>

Wrong, this is "specific" to transporting Internet mail in any 7 bit or 8
bit environment (today almost all mail agents are operating in 8 bit), and
then it is referenced directly by HTTP (and its HTTPS variant).

So this is not so "specific". MIME is extremely popular, RFC 4648 is
extremely exotic (and RFC 4648 is wrong when saying that IMAP is very
specific, as it is now a very popular protocol, widely used as well). MIME
is so frequently used that almost all people refer to it when they look
for Base64, or do not explicitly state when another definition (found in an
exotic RFC) is used.


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-14 Thread Philippe Verdy via Unicode
It's also interesting to look at https://tools.ietf.org/html/rfc3501
- which defines (for IMAP v4) another "BASE64" encoding,
- and also defines a "Modified UTF-7" encoding using it, deviating from
Unicode's definition of UTF-7,
- and adding other requirements (which forbids alternate encodings
permitted in UTF-7 and all other Base64 variants, including those used in
MIME/RFC 2045 or SMTP, used in strong relations with IMAP !).

And nothing in RFC 4648 is clear about the fact that it only covers the
encoding of "octets streams" and not "bits streams". It also does not
discuss the adaptation for "Base64" for transport and storage (needed for
MIME, IMAP, but also in HTTP, and in several file/data formats including
XML, or digital signatures).

That RFC 4648 is only superficial, and does not cover everything (even
Unicode has its own definition for UTF-7 and also allows variations).

As we are on this Unicode list, the definition used by Unicode (more in
line with MIME), does not follow at all those in RFC 4648.
Most uses of Base64 encodings are based on the original MIME definition,
and all of them perform new adaptations. (Even the definition of "Base16"
in RFC4648 contradicts most other definitions).


Le dim. 14 oct. 2018 à 21:21, Doug Ewell via Unicode 
a écrit :

> Steffen Nurpmeso wrote:
>
> > Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
> > (MIME) Part One: Format of Internet Message Bodies).
>
> Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
> Encodings." RFC 2045 defines a particular implementation of base64,
> specific to transporting Internet mail in a 7-bit environment.
>
> RFC 4648 discusses many of the "higher-level protocol" topics that some
> people are focusing on, such as separating the base64-encoded output
> into lines of length 72 (or other), alternative target code unit sets or
> "alphabets," and padding characters. It would be helpful for everyone to
> read this particular RFC before concluding that these topics have not
> been considered, or that they compromise round-tripping or other
> characteristics of base64.
>
> I had assumed that when Roger asked about "base64 encoding," he was
> asking about the basic definition of base64.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-14 Thread Doug Ewell via Unicode

Steffen Nurpmeso wrote:


Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
(MIME) Part One: Format of Internet Message Bodies).


Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data 
Encodings." RFC 2045 defines a particular implementation of base64, 
specific to transporting Internet mail in a 7-bit environment.


RFC 4648 discusses many of the "higher-level protocol" topics that some 
people are focusing on, such as separating the base64-encoded output 
into lines of length 72 (or other), alternative target code unit sets or 
"alphabets," and padding characters. It would be helpful for everyone to 
read this particular RFC before concluding that these topics have not 
been considered, or that they compromise round-tripping or other 
characteristics of base64.


I had assumed that when Roger asked about "base64 encoding," he was 
asking about the basic definition of base64.


--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-14 Thread Philippe Verdy via Unicode
Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
enough to indicate the end of an octets-span. The extra = after it does not
add any other octet. You are also allowed to insert whitespace
anywhere in the encoded stream (this is what ensures that the
Base64-encoded octets-stream will not be altered if line breaks are forced
anywhere, notably within the body of emails).

So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR,
LF, NEL) in the middle is non-significant and ignorable on decoding (their
"encoded" bit length is 0 and they don't terminate an octets-span, unlike
"=" which discards extra bits remaining from the encoded stream before that
are not on 8-bit boundaries).

Also:
- For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol
before "=" can vary in its 4 lowest bits (which are then ignored/discarded
by the "=" symbol)
- For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X" symbol
before "=" can vary in its 2 lowest bits (which are then ignored/discarded
by the "=" symbol)

So you can use Base64 by encoding each octet in separate pieces, as two
Base64 symbols followed by an "=" symbol (a single symbol carries only 6
bits), and even insert any amount of whitespace between them: there's an
infinite number of valid Base64 encodings for representing the same
octets-stream payload.
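
To make the permissive model above concrete, here is a minimal Python sketch.
It illustrates that model only; it is not what RFC 4648 prescribes, nor how
any particular library behaves. The name lenient_decode and the rule that
every "=" simply discards the leftover bits are assumptions made for this
example.

```python
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, "+", "/".
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def lenient_decode(text: str) -> bytes:
    """Decode Base64 under the permissive model sketched in this thread.

    Assumptions (hypothetical, for illustration): whitespace is ignored
    anywhere, and each "=" ends the current octets-span by dropping any
    bits that have not yet formed a full octet.
    """
    out = bytearray()
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    for ch in text:
        if ch.isspace():
            continue                      # whitespace carries zero bits
        if ch == "=":
            acc, nbits = 0, 0             # discard bits not on an octet boundary
            continue
        acc = (acc << 6) | ALPHABET.index(ch)   # ValueError on foreign characters
        nbits += 6
        if nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
    return bytes(out)


for candidate in ["YQ==", "YQ=", "YR==", "Y Q = ="]:
    print(repr(candidate), lenient_decode(candidate))
```

Under these assumptions all four inputs decode to the same single octet, which
is the point being argued: the low 4 bits of the second symbol and the number
of "=" signs do not affect the payload.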

Base64 allows encoding any octets-stream but not directly any bits-stream:
it assumes that the effective bits-stream has a bit length that is a multiple
of 8. To encode a bits-stream with an exact number of bits (not a multiple of
8), you need to encode an extra payload to indicate the effective number of
bits to keep at the end of the encoded octets-stream (or at the start):
- Base64 does not specify how you convert a bitstream of arbitrary length
to an octets-stream;
- for that purpose, you may need to pad the bits-stream at start or at end
with 1 to 7 bits (so that the resulting bitstream has a length that is a
multiple of 8, and is then encodable with Base64, which takes only octets as
input);
- these extra padding bits are not significant for the original bitstream,
but are significant for the Base64 encoder/decoder: they will be discarded
by the bitstream decoder built on top of the Base64 decoder, but not by the
Base64 decoder itself.

You need to encode somewhere with the bitstream encoder how many padding
bits (0 to 7) are present at the start or end of the octets-stream; this can
be done:
- as a separate payload (not encoded by Base64), or
- by prepending 3 bits at the start of the bits-stream, which is then padded
at the end with 1 to 7 random bits to get a bit length that is a multiple of 8,
suitable for Base64 encoding, or
- by appending 3 bits at the end of the bits-stream, just after the 1 to 7
random bits needed to get a bit length that is a multiple of 8, suitable for
Base64 encoding.
Finally your bits-stream decoder will be able to use this padding count to
discard these random padding bits (and possibly realign the stream on
different byte boundaries when the effective bit length of the bits-stream
payload is not a multiple of 8 and padding bits were added).
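
The "prepend a 3-bit padding count" option can be sketched in Python as
follows. Everything here is an assumption made for illustration: the names
pack_bits and unpack_bits, the use of a "0"/"1" string as the bitstream, the
most-significant-bit-first packing, and zero padding bits instead of random
ones. Base64 itself says nothing about any of these choices.

```python
import base64


def pack_bits(bits: str) -> str:
    """Wrap a bitstream (a string of '0'/'1') for Base64 transport.

    Hypothetical layout: 3-bit padding count, then the payload bits,
    then 0 to 7 zero padding bits so the total is a multiple of 8.
    """
    pad = (-(len(bits) + 3)) % 8
    framed = format(pad, "03b") + bits + "0" * pad
    octets = int(framed, 2).to_bytes(len(framed) // 8, "big")  # MSB first
    return base64.b64encode(octets).decode("ascii")


def unpack_bits(encoded: str) -> str:
    """Reverse pack_bits: read the count, then strip the padding bits."""
    octets = base64.b64decode(encoded)
    framed = "".join(format(octet, "08b") for octet in octets)
    pad = int(framed[:3], 2)
    return framed[3:len(framed) - pad]


payload = "1011011"            # 7 bits: not a multiple of 8
wire = pack_bits(payload)
print(wire, unpack_bits(wire) == payload)
```

The Base64 layer only ever sees whole octets; the padding count is interpreted
by the layer above it, which is the separation described in the list above.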

Base64 also does not specify how bits of the original bits-stream payload
are packed into the octets-stream input suitable for Base64 encoding;
notably it does not specify their order and endianness. The same remark
applies as well to MIME and HTTP. So a lot of network protocols and file
formats need to specify how to encode which of the possible options is used
to encode bits-streams of arbitrary length, or need to specify which default
choice applies if this option is not encoded, or which option must be used
(with no possible variation). And this also adds to the number of distinct
encodings that are possible but are still equivalent for the same effective
bits-stream payload.

All these allowed variations are from the encoder perspective. For
interoperability, the decoder has to be flexible and support various
options to be compatible with different implementations of the encoder,
notably when the encoder was run on a different system. And this is the
case for the MIME transport by mail, for HTTP and FTP transports, and for
file/media storage formats (even if the file is stored on the same system,
because it may actually be a copy stored locally but coming from another
system where the file was actually encoded).

Now if we come back to the encoding of plain-text payloads, Unicode just
specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code
points (it actually does not mandate an exact bit length, because the range
does not exactly fill 21 bits and an encoder can still pack
multiple code points together into more compact code units).

However Unicode provides and standardizes several encodings (UTF-8/16/32)
which use code units whose size is directly suitable as input for an
octets-stream, so that they are directly encodable with Base64, without
having to specify an extra layer for the bits-stream encoder/decoder.
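
As a small illustration of this point, the Python sketch below encodes one
text with three Unicode encoding forms and then applies Base64 to each
octets-stream. The sample string and the little-endian variants are arbitrary
choices made for the example.

```python
import base64

text = "\u00e9\u20ac"                     # "é€", an arbitrary sample
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

# Same text, three different octets-streams, hence three Base64 texts.
encoded = {
    name: base64.b64encode(text.encode(name)).decode("ascii")
    for name in encodings
}

for name, b64 in encoded.items():
    # Base64 restores the octets; recovering the text needs the same UTF.
    restored = base64.b64decode(b64).decode(name)
    print(f"{name:10} {b64}  round-trips: {restored == text}")
```

Base64 is transparent to the octets it is given, so the differences between
the three outputs come entirely from the choice of UTF made before encoding.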

But many other encodings are still possible (and can be 

Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Adam Borowski via Unicode
On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote:
> Le sam. 13 oct. 2018 à 18:58, Steffen Nurpmeso via Unicode <
> unicode@unicode.org> a écrit :
> > The only variance is described as:
> >
> >   Care must be taken to use the proper octets for line breaks if base64
> >   encoding is applied directly to text material that has not been
> >   converted to canonical form.  In particular, text line breaks must be
> >   converted into CRLF sequences prior to base64 encoding.  The
> >   important thing to note is that this may be done directly by the
> >   encoder rather than in a prior canonicalization step in some
> >   implementations.
> >
> > This is MIME, it specifies (in the same RFC):
> 
> I've not spoken aboutr the encoding of new lines **in the actual encoded
> text**:
> -  if their existing text-encoding ever gets converted to Base64 as if the
> whole text was an opaque binary object, their initial text-encoding will be
> preserved (so yes it will preserve the way these embedded newlines are
> encoded as CR, LF, CR+LF, NL...)
> 
> I spoke about newlines used in the transport syntax to split the initial
> binary object (which may actually contain text but it does not matter).
> MIME defines this operation and even requires splitting the binary object
> in fragments with maximum binary size so that these binary fragments can be
> converted with Base64 into lines with maximum length. In the MIME Base64
> representation you can insert newlines anywhere between fragments encoded
> separately.

There's another kind of fragmentation that can make the encoding differ (but
still decode to the same payload):

The data stream gets split into 3-byte internal, 4-byte external packets.
Any packet may contain less than those 3 bytes, in which cases it is padded
with = characters:
3 bytes XXXX
2 bytes XXX=
1 byte  XX==

Usually, such smaller packets happen only at the end of a message, but to
support encoding a stream piecewise, they are allowed at any point.

For example:
"meow" is bWVvdw==
"me""ow"   is bWU=b3c=
yet both carry the same payload.
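
The two "meow" encodings can be checked with a short Python sketch. Decoders
differ in how they treat "=" in the middle of the input, so the sketch does
not rely on any library behaviour there: it splits the text into 4-character
packets itself and decodes each packet separately. The function name
decode_packets is made up for this example.

```python
import base64


def decode_packets(text: str) -> bytes:
    """Decode Base64 packet by packet (4 external characters each)."""
    compact = "".join(text.split())           # drop whitespace and newlines
    if len(compact) % 4 != 0:
        raise ValueError("input is not a whole number of 4-character packets")
    packets = [compact[i:i + 4] for i in range(0, len(compact), 4)]
    return b"".join(base64.b64decode(packet) for packet in packets)


whole = base64.b64encode(b"meow").decode("ascii")
pieces = (base64.b64encode(b"me") + base64.b64encode(b"ow")).decode("ascii")

print(whole, pieces)
print(decode_packets(whole) == decode_packets(pieces) == b"meow")
```

Encoding "me" and "ow" separately and concatenating the results reproduces the
second form, and both forms decode to the same four octets.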

> Base64 is used exactly to support this flexibility in transport (or
> storage) without altering any bit of the initial content once it is
> decoded.

Right, any such variations are in packaging only.


ᛗᛖᛟᚹ
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 10 people enter a bar: 1 who understands binary,
⢿⡄⠘⠷⠚⠋⠀ 1 who doesn't, D who prefer to write it as hex,
⠈⠳⣄ and 1 who narrowly avoided an off-by-one error.


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Philippe Verdy via Unicode
Le sam. 13 oct. 2018 à 18:58, Steffen Nurpmeso via Unicode <
unicode@unicode.org> a écrit :

> Philippe Verdy via Unicode wrote in  w9+jearw4ghyk...@mail.gmail.com>:
>  |You forget that Base64 (as used in MIME) does not follow these rules \
>  |as it allows multiple different encodings for the same source binary. \
>  |MIME actually
>  |splits a binary object into multiple fragments at random positions, \
>  |and then encodes these fragments separately. Also MIME uses an extension
> \
>  |of Base64
>  |where it allows some variations in the encoding alphabet (so even the \
>  |same fragment of the same length may have two disting encodings).
>  |
>  |Base64 in MIME is different from standard Base64 (which never splits \
>  |the binary object before encoding it, and uses a strict alphabet of \
>  |64 ASCII
>  |characters, allowing no variation). So MIME requires special handling: \
>  |the assumpton that a binary message is encoded the same is wrong, but \
>  |MIME still
>  |requires that this non unique Base64 encoding will be decoded back \
>  |to the same initial (unsplitted) binary object (independantly of its \
>  |size and
>  |independantly of the splitting boundaries used in the transport, which \
>  |may change during the transport).
>
> Base64 is defined in RFC 2045 (Multipurpose Internet Mail
> Extensions (MIME) Part One: Format of Internet Message Bodies).
> It is a content-transfer-encoding and encodes any data
> transparently into a 7 bit clean ASCII _and_ EBCDIC compatible
> (the authors commemorate that) text.
> When decoding it reverts this representation into its original form.
> Ok, there is the CRLF newline problem, as below.
> What do you mean by "splitting"?
>
> ...
> The only variance is described as:
>
>   Care must be taken to use the proper octets for line breaks if base64
>   encoding is applied directly to text material that has not been
>   converted to canonical form.  In particular, text line breaks must be
>   converted into CRLF sequences prior to base64 encoding.  The
>   important thing to note is that this may be done directly by the
>   encoder rather than in a prior canonicalization step in some
>   implementations.
>
> This is MIME, it specifies (in the same RFC):


I did not speak about the encoding of newlines **in the actual encoded
text**:
-  if the existing text encoding gets converted to Base64 as if the
whole text were an opaque binary object, the initial text encoding will be
preserved (so yes, it will preserve the way these embedded newlines are
encoded as CR, LF, CR+LF, NL...)

I spoke about newlines used in the transport syntax to split the initial
binary object (which may actually contain text but it does not matter).
MIME defines this operation and even requires splitting the binary object
in fragments with maximum binary size so that these binary fragments can be
converted with Base64 into lines with maximum length. In the MIME Base64
representation you can insert newlines anywhere between fragments encoded
separately.

The maximum fragment size is not fixed: it is usually about 60 binary
octets, which are converted to lines of 80 ASCII characters followed by a
newline (CR+LF is strongly suggested for MIME, but other newline sequences
are tolerated). Email forwarding agents frequently needed these
line lengths to process the mail properly (not just the MIME headers but
also the content body, where they want at least some whitespace or newline
in the middle so that they can freely rearrange the lines by compressing
whitespace or splitting lines to a shorter length as their processing
requires). This is much less frequent today because most mail agents are
8-bit clean and allow arbitrary line lengths... except in MIME headers.
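
For reference, RFC 2045 itself limits encoded lines to 76 characters (57
octets of input per line). The Python sketch below uses the standard
library's base64.encodebytes, which wraps its output that way, to show that
line breaks change the encoded text but not the decoded payload. The sample
data is arbitrary.

```python
import base64

data = bytes(range(256)) * 2                 # arbitrary 512-octet payload

unwrapped = base64.b64encode(data)           # one long line
wrapped = base64.encodebytes(data)           # newline after every 76 characters

lines = wrapped.splitlines()
print(len(lines), "lines, longest:", max(len(line) for line in lines))

# The default (non-validating) decoder discards the inserted newlines.
print(base64.b64decode(wrapped) == base64.b64decode(unwrapped) == data)
```

The wrapped and unwrapped forms differ as texts, yet removing the newlines
from the wrapped form gives back exactly the unwrapped one.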

In MIME headers the situation is different: there really is a maximum
line length there, and if a header is too long, it has to be split over
multiple lines using continuation sequences, i.e. a newline (CR+LF is
standard here) followed by at least one space (this
insertion/change/removal of whitespace is permitted everywhere in the MIME
header after the header type, and even before the colon that follows the
header type). So a MIME header value whose included text gets encoded with
Base64 will be split using "=?" sequences, which start the indication that
the fragment is Base64-encoded (instead of being Quoted-Printable-encoded),
followed by a separator and the encapsulated Base64 encoding of a fragment. A
single header may have multiple Base64-encoded fragments in the same header
value, and there is large freedom about where to split the value to isolate
fragments of a convenient size that satisfies the MIME requirements. These
multiple fragments may then occur on the same line (separated by
whitespace) or on multiple lines (separated by continuation sequences).
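
A minimal Python sketch of this, using the standard library's email.header
module: the same header text is written once as a single Base64 fragment and
once as two fragments separated by whitespace. The header values and the
helper name header_text are invented for the example.

```python
from email.header import decode_header, make_header


def header_text(raw: str) -> str:
    """Decode a header value that may contain several encoded fragments."""
    return str(make_header(decode_header(raw)))


one_fragment = "=?utf-8?b?bWVvdw==?="
two_fragments = "=?utf-8?b?bWU=?= =?utf-8?b?b3c=?="

print(header_text(one_fragment))
print(header_text(two_fragments))
```

Whitespace between two adjacent encoded fragments is not part of the text, so
both header values should decode to the same string even though their Base64
parts differ.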

In that case, the same initial text can have multiple valid representation
in a MIME envelope format using Base64: it is not Base64 itself that 

Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Steffen Nurpmeso via Unicode
Philippe Verdy via Unicode wrote in :
 |You forget that Base64 (as used in MIME) does not follow these rules \
 |as it allows multiple different encodings for the same source binary. \
 |MIME actually 
 |splits a binary object into multiple fragments at random positions, \
 |and then encodes these fragments separately. Also MIME uses an extension \
 |of Base64 
 |where it allows some variations in the encoding alphabet (so even the \
 |same fragment of the same length may have two disting encodings).
 |
 |Base64 in MIME is different from standard Base64 (which never splits \
 |the binary object before encoding it, and uses a strict alphabet of \
 |64 ASCII 
 |characters, allowing no variation). So MIME requires special handling: \
 |the assumpton that a binary message is encoded the same is wrong, but \
 |MIME still 
 |requires that this non unique Base64 encoding will be decoded back \
 |to the same initial (unsplitted) binary object (independantly of its \
 |size and 
 |independantly of the splitting boundaries used in the transport, which \
 |may change during the transport).

Base64 is defined in RFC 2045 (Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message Bodies).
It is a content-transfer-encoding and encodes any data
transparently into a 7 bit clean ASCII _and_ EBCDIC compatible
(the authors commemorate that) text.
When decoding it reverts this representation into its original form.
Ok, there is the CRLF newline problem, as below.
What do you mean by "splitting"?

...
The only variance is described as:

  Care must be taken to use the proper octets for line breaks if base64
  encoding is applied directly to text material that has not been
  converted to canonical form.  In particular, text line breaks must be
  converted into CRLF sequences prior to base64 encoding.  The
  important thing to note is that this may be done directly by the
  encoder rather than in a prior canonicalization step in some
  implementations.

This is MIME, it specifies (in the same RFC):

  2.10.  Lines

   "Lines" are defined as sequences of octets separated by a CRLF
   sequences.  This is consistent with both RFC 821 and RFC 822.
   "Lines" only refers to a unit of data in a message, which may or may
   not correspond to something that is actually displayed by a user
   agent.

and furthermore

  6.5.  Translating Encodings

   The quoted-printable and base64 encodings are designed so that
   conversion between them is possible.  The only issue that arises in
   such a conversion is the handling of hard line breaks in quoted-
   printable encoding output. When converting from quoted-printable to
   base64 a hard line break in the quoted-printable form represents a
   CRLF sequence in the canonical form of the data. It must therefore be
   converted to a corresponding encoded CRLF in the base64 form of the
   data.  Similarly, a CRLF sequence in the canonical form of the data
   obtained after base64 decoding must be converted to a quoted-
   printable hard line break, but ONLY when converting text data.

So we go over

  6.6.  Canonical Encoding Model

   There was some confusion, in the previous versions of this RFC,
   regarding the model for when email data was to be converted to
   canonical form and encoded, and in particular how this process would
   affect the treatment of CRLFs, given that the representation of
   newlines varies greatly from system to system, and the relationship
   between content-transfer-encodings and character sets.  A canonical
   model for encoding is presented in RFC 2049 for this reason.

to RFC 2049 where we find

 For example, in the case of text/plain data, the text
  must be converted to a supported character set and
  lines must be delimited with CRLF delimiters in
  accordance with RFC 822.  Note that the restriction on
  line lengths implied by RFC 822 is eliminated if the
  next step employs either quoted-printable or base64
  encoding.

and, later

   Conversion from entity form to local form is accomplished by
   reversing these steps. Note that reversal of these steps may produce
   differing results since there is no guarantee that the original and
   final local forms are the same.

and, later

   NOTE: Some confusion has been caused by systems that represent
   messages in a format which uses local newline conventions which
   differ from the RFC822 CRLF convention.  It is important to note that
   these formats are not canonical RFC822/MIME.  These formats are
   instead *encodings* of RFC822, where CRLF sequences in the canonical
   representation of the message are encoded as the local newline
   convention.  Note that formats which encode CRLF sequences as, for
   example, LF are not capable of representing MIME messages containing
   binary data which contains LF octets not part of CRLF line separation
   sequences.
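
The practical effect of the canonicalization rule quoted above can be sketched
in Python. The helper name to_crlf and the sample text are assumptions for
the example; the regular expression treats CRLF, lone CR and lone LF as line
breaks, which is a simplification of what a real MUA has to handle.

```python
import base64
import re


def to_crlf(text: str) -> str:
    """Convert CRLF, lone CR and lone LF line breaks to canonical CRLF."""
    return re.sub(r"\r\n|\r|\n", "\r\n", text)


local = "line one\nline two\n"                # local (Unix) newline convention
canonical = to_crlf(local)

b64_local = base64.b64encode(local.encode("ascii"))
b64_canonical = base64.b64encode(canonical.encode("ascii"))

print(b64_local)
print(b64_canonical)
```

The two Base64 texts differ because the octets differ: the variance comes from
the text canonicalization step, not from Base64 itself.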

Whoever understands this emojibake.
My MUA still gnaws at 

Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Philippe Verdy via Unicode
In summary, two distinct implementations are allowed to return different
values t and t' of Base64_Encode(d) from the same message d, but both
Base64_Decode(t') and Base64_Decode(t) will be equal and MUST return
d exactly.

There's an allowed choice of implementation for Base64_Encode() but
Base64_Decode() must then be updated to be permissive/flexible and ensure
that in all cases,
Base64_Decode[Base64_Encode[d]] = d, for every value of d.

The reverse is not true because of this flexibility (needed for various
transport protocols that have different requirements, notably on the
allowed set of characters, and on their maximum line lengths):
Base64_Encode[Base64_Decode[t]] = t may be false.
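
The asymmetry can be demonstrated with Python's standard base64 module. The
line-wrapped text t_wrapped is a hand-made example of a transport-friendly
variant; the default, non-validating decoder is used because it discards
characters outside the alphabet such as CR and LF.

```python
import base64

d = b"meow"

t = base64.b64encode(d)                  # what this encoder produces
t_wrapped = b"bWVv\r\ndw=="              # same payload with a line break inserted

# Decode(Encode(d)) == d holds for every d.
print(base64.b64decode(t) == d)

# A permissive decoder also accepts the wrapped variant ...
print(base64.b64decode(t_wrapped) == d)

# ... but Encode(Decode(t)) == t fails for it: the line break is not restored.
print(base64.b64encode(base64.b64decode(t_wrapped)) == t_wrapped)
```

Re-encoding gives back the encoder's own preferred form, not necessarily the
text that was received.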


Le sam. 13 oct. 2018 à 16:45, Philippe Verdy  a écrit :

> You forget that Base64 (as used in MIME) does not follow these rules as it
> allows multiple different encodings for the same source binary. MIME
> actually splits a binary object into multiple fragments at random
> positions, and then encodes these fragments separately. Also MIME uses an
> extension of Base64 where it allows some variations in the encoding
> alphabet (so even the same fragment of the same length may have two disting
> encodings).
>
> Base64 in MIME is different from standard Base64 (which never splits the
> binary object before encoding it, and uses a strict alphabet of 64 ASCII
> characters, allowing no variation). So MIME requires special handling: the
> assumpton that a binary message is encoded the same is wrong, but MIME
> still requires that this non unique Base64 encoding will be decoded back to
> the same initial (unsplitted) binary object (independantly of its size and
> independantly of the splitting boundaries used in the transport, which may
> change during the transport).
>
> This also applies to the Base64 encoding used in HTTP transport syntax,
> and notably in the HTTP/1.1 streaming feature where fragment sizes are also
> variable.
>
>
> Le sam. 13 oct. 2018 à 16:27, Costello, Roger L. via Unicode <
> unicode@unicode.org> a écrit :
>
>> Hi Folks,
>>
>> Thank you for your outstanding responses!
>>
>> Below is a summary of what I learned. Are there any errors in the
>> summary? Is there anything you would add? Please let me know of anything
>> that is not clear.   /Roger
>>
>> 1. While base64 encoding is usually applied to binary, it is also
>> sometimes applied to text, such as Unicode text.
>>
>> Note: Since base64 encoding may be applied to both binary and text, in
>> the following bullets I use the more generic term "data". For example,
>> "Data d is base64-encoded to yield ..."
>>
>> 2. Neither base64 encoding nor decoding should presume any special
>> knowledge of the meaning of the data or do anything extra based on that
>> presumption.
>>
>> For example, converting Unicode text to and from base64 should not
>> perform any sort of Unicode normalization, convert between UTFs, insert or
>> remove BOMs, etc. This is like saying that converting a JPEG image to and
>> from base64 should not resize or rescale the image, change its color depth,
>> convert it to another graphic format, etc.
>>
>> If you use base64 for encoding MIME content (e.g. emails), the base64
>> decoding will not transform the content. The email parser must ensure that
>> the content is valid, so the parser might have to transform the content
>> (possibly replacing some invalid sequences or truncating), and then apply
>> Unicode normalization to render the text. These transforms are part of the
>> MIME application and are independent of whether you use base64 or any
>> other encoding or transport syntax.
>>
>> 3. If data d is different than d', then the base64 text resulting from
>> encoding d is different than the base64 text resulting from encoding d'.
>>
>> 4. If base64 text t is different than t', then the data resulting from
>> decoding t is different than the data resulting from decoding t'.
>>
>> 5. For every data d there is exactly one base64 encoding t.
>>
>> 6. Every base64 text t is an encoding of exactly one data d.
>>
>> 7. For all data d, Base64_Decode[Base64_Encode[d]] = d
>>
>>


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Philippe Verdy via Unicode
You forget that Base64 (as used in MIME) does not follow these rules as it
allows multiple different encodings for the same source binary. MIME
actually splits a binary object into multiple fragments at random
positions, and then encodes these fragments separately. Also MIME uses an
extension of Base64 where it allows some variations in the encoding
alphabet (so even the same fragment of the same length may have two distinct
encodings).

Base64 in MIME is different from standard Base64 (which never splits the
binary object before encoding it, and uses a strict alphabet of 64 ASCII
characters, allowing no variation). So MIME requires special handling: the
assumption that a binary message is always encoded the same way is wrong, but MIME
still requires that this non-unique Base64 encoding will be decoded back to
the same initial (unsplit) binary object (independently of its size and
independently of the splitting boundaries used in the transport, which may
change during the transport).

This also applies to the Base64 encoding used in HTTP transport syntax, and
notably in the HTTP/1.1 streaming feature where fragment sizes are also
variable.
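
As a rough illustration of the line-splitting point, assuming Python's standard base64 module (the wrap helper and the width of 8 are made up for the example), here are three different texts that all decode to the same binary object:

```python
import base64

data = bytes(range(100))

single_line = base64.b64encode(data)   # no line breaks at all
mime_style = base64.encodebytes(data)  # newline after at most 76 characters

def wrap(t: bytes, width: int) -> bytes:
    """Split an encoded text into CRLF-separated lines (illustrative helper)."""
    return b"\r\n".join(t[i:i + width] for i in range(0, len(t), width))

narrow = wrap(single_line, 8)

# Three different texts ...
texts = {single_line, mime_style, narrow}
assert len(texts) == 3

# ... that a lenient decoder maps back to the same data.
for t in texts:
    assert base64.b64decode(t) == data
```

This relies on b64decode discarding characters outside the alphabet by default; with validate=True the wrapped variants are rejected instead.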


Le sam. 13 oct. 2018 à 16:27, Costello, Roger L. via Unicode <
unicode@unicode.org> a écrit :

> Hi Folks,
>
> Thank you for your outstanding responses!
>
> Below is a summary of what I learned. Are there any errors in the summary?
> Is there anything you would add? Please let me know of anything that is not
> clear.   /Roger
>
> 1. While base64 encoding is usually applied to binary, it is also
> sometimes applied to text, such as Unicode text.
>
> Note: Since base64 encoding may be applied to both binary and text, in the
> following bullets I use the more generic term "data". For example, "Data d
> is base64-encoded to yield ..."
>
> 2. Neither base64 encoding nor decoding should presume any special
> knowledge of the meaning of the data or do anything extra based on that
> presumption.
>
> For example, converting Unicode text to and from base64 should not perform
> any sort of Unicode normalization, convert between UTFs, insert or remove
> BOMs, etc. This is like saying that converting a JPEG image to and from
> base64 should not resize or rescale the image, change its color depth,
> convert it to another graphic format, etc.
>
> If you use base64 for encoding MIME content (e.g. emails), the base64
> decoding will not transform the content. The email parser must ensure that
> the content is valid, so the parser might have to transform the content
> (possibly replacing some invalid sequences or truncating), and then apply
> Unicode normalization to render the text. These transforms are part of the
> MIME application and are independent of whether you use base64 or any
> other encoding or transport syntax.
>
> 3. If data d is different than d', then the base64 text resulting from
> encoding d is different than the base64 text resulting from encoding d'.
>
> 4. If base64 text t is different than t', then the data resulting from
> decoding t is different than the data resulting from decoding t'.
>
> 5. For every data d there is exactly one base64 encoding t.
>
> 6. Every base64 text t is an encoding of exactly one data d.
>
> 7. For all data d, Base64_Decode[Base64_Encode[d]] = d
>
>


RE: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Costello, Roger L. via Unicode
Hi Folks,

Thank you for your outstanding responses! 

Below is a summary of what I learned. Are there any errors in the summary? Is 
there anything you would add? Please let me know of anything that is not clear. 
  /Roger

1. While base64 encoding is usually applied to binary, it is also sometimes 
applied to text, such as Unicode text.

Note: Since base64 encoding may be applied to both binary and text, in the 
following bullets I use the more generic term "data". For example, "Data d is 
base64-encoded to yield ..."

2. Neither base64 encoding nor decoding should presume any special knowledge of 
the meaning of the data or do anything extra based on that presumption. 

For example, converting Unicode text to and from base64 should not perform any 
sort of Unicode normalization, convert between UTFs, insert or remove BOMs, 
etc. This is like saying that converting a JPEG image to and from base64 should 
not resize or rescale the image, change its color depth, convert it to another 
graphic format, etc.

If you use base64 for encoding MIME content (e.g. emails), the base64 decoding 
will not transform the content. The email parser must ensure that the content 
is valid, so the parser might have to transform the content (possibly replacing 
some invalid sequences or truncating), and then apply Unicode normalization to 
render the text. These transforms are part of the MIME application and are 
independent of whether you use base64 or any other encoding or transport 
syntax.

3. If data d is different than d', then the base64 text resulting from encoding 
d is different than the base64 text resulting from encoding d'.

4. If base64 text t is different than t', then the data resulting from decoding 
t is different than the data resulting from decoding t'.

5. For every data d there is exactly one base64 encoding t.

6. Every base64 text t is an encoding of exactly one data d.

7. For all data d, Base64_Decode[Base64_Encode[d]] = d
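
A small spot check of items 3, 5 and 7, assuming Python's standard base64 module as the encoder (a sketch over short inputs, not a proof). Items 4 and 6 are statements about decoding and depend on how strict the decoder is, so they are not checked here:

```python
import base64
from itertools import product

# Every byte string of length 0, 1 and 2 (1 + 256 + 65536 inputs).
samples = [bytes(p) for n in range(3) for p in product(range(256), repeat=n)]

encodings = {}
for d in samples:
    t = base64.b64encode(d)
    # Item 7: decoding the encoding gives back the original data.
    assert base64.b64decode(t, validate=True) == d
    # Items 3 and 5: no two different inputs share an encoding.
    assert t not in encodings
    encodings[t] = d

print(len(samples), "inputs,", len(encodings), "distinct encodings")
```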



Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-12 Thread Philippe Verdy via Unicode
I think the reverse is also true!

Decoding a Base64 entity does not guarantee that it will return valid text in any
known encoding. So Unicode normalization of the output cannot apply.

Even if it represents text, nothing indicates that the result will be
encoded with some Unicode encoding form (unless this is tagged separately,
like in MIME).

If you use Base64 for decoding MIME contents (e.g. for emails), the Base64
decoding itself will not transform the encoding, but then the email parser
will have to ensure that the text encoding is valid, at which time it will
have to transform it (possibly replace some invalid sequences or truncate
it), and only then may it apply normalization to help render that text. But
these transforms are part of the MIME application and independent of whether
you used Base64 or any other binary encoding or transport syntax.

In other words: "If m is not equal to m', then t will not equal t'" also
holds in reverse, but nothing indicates that the Base64-decoded m or m' are
texts; they are just opaque binary objects, which are equal in value exactly
when their Base64 encodings t and t' are.

Note: some Base64 envelope formats (like MIME) allow multiple
representations t and t' of the same message m, by adding padding or
transport syntaxes like line-splitting (with variable length). Base64 alone
does not allow that variation (it normally uses a static alphabet), but
there are variants that accept decoding extended alphabets as binary
equivalent. So you may have two MIME-encoded texts that have different
encodings (with Base64 or Quoted-Printable, with variable line lengths)
but that represent the same source binary object, and decoding these
different encoded messages will yield the same binary object: this does not
depend on Base64 but on the permissiveness/flexibility of decoders for these
envelope formats (using **extensions** of Base64 specific to the envelope
format).
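
To make the alphabet point concrete, here is a small sketch assuming Python's standard base64 module, which offers both the standard alphabet and the URL-safe variant; the two-byte input is chosen only because it exercises the last two alphabet positions:

```python
import base64

data = b"\xfb\xff"

standard = base64.b64encode(data)          # alphabet ends in "+" and "/"
url_safe = base64.urlsafe_b64encode(data)  # alphabet ends in "-" and "_"

# Same binary object, two different encoded texts.
assert standard != url_safe

# Each text decodes back to the same data with the matching decoder.
assert base64.b64decode(standard) == data
assert base64.urlsafe_b64decode(url_safe) == data
```

Whether a given decoder accepts both alphabets at once is a property of that decoder, not of Base64 itself.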


Le ven. 12 oct. 2018 à 18:27, Doug Ewell via Unicode 
a écrit :

> J Decker wrote:
>
> >> How about the opposite direction: If m is base64 encoded to yield t
> >> and then t is base64 decoded to yield n, will it always be the case
> >> that m equals n?
> >
> > False.
> > Canonical translation may occur which the different base64 may be the
> > same sort of string...
>
> Base64 is a binary-to-text encoding. Neither encoding nor decoding
> should presume any special knowledge of the meaning of the binary data,
> or do anything extra based on that presumption.
>
> Converting Unicode text to and from base64 should not perform any sort
> of Unicode normalization, convert between UTFs, insert or remove BOMs,
> etc. This is like saying that converting a JPEG image to and from base64
> should not resize or rescale the image, change its color depth, convert
> it to another graphic format, etc.
>
> So I'd say "true" to Roger's question.
>
> I touched on this a little bit in UTN #14, from the standpoint of trying
> to improve compression by normalizing the Unicode text first.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


RE: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-12 Thread Tex via Unicode
I agree with Doug. Base64 maps each byte of the source string to unique bytes 
in the destination string. Decoding is also a unique mapping.

If the encoded string is “translated” in some way by additional processes, 
canonical or otherwise, then all bets are off.

If you disagree, please offer an example or additional details of how 2 base64 
strings might be equivalent.

Tex

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of J Decker via 
Unicode
Sent: Friday, October 12, 2018 9:29 AM
To: d...@ewellic.org
Cc: Unicode Discussion
Subject: Re: Base64 encoding applied to different unicode texts always yields 
different base64 texts ... true or false?

On Fri, Oct 12, 2018 at 9:23 AM Doug Ewell via Unicode  
wrote:

J Decker wrote:

>> How about the opposite direction: If m is base64 encoded to yield t
>> and then t is base64 decoded to yield n, will it always be the case
>> that m equals n?
>
> False.
> Canonical translation may occur which the different base64 may be the
> same sort of string...

Base64 is a binary-to-text encoding. Neither encoding nor decoding
should presume any special knowledge of the meaning of the binary data,
or do anything extra based on that presumption.

Converting Unicode text to and from base64 should not perform any sort
of Unicode normalization, convert between UTFs, insert or remove BOMs,
etc. This is like saying that converting a JPEG image to and from base64
should not resize or rescale the image, change its color depth, convert
it to another graphic format, etc.

So I'd say "true" to Roger's question.

On the first side (X to base64) definitely true.

But there is potential that text resulting from some decoded buffer is 
translated, resulting in a 'congruent' string that's not exactly the same... 
and the base64 will be different.

Comparing some base64 string with some other base64 string shows a binary 
difference, but may be still the 'same' string. 
 


I touched on this a little bit in UTN #14, from the standpoint of trying
to improve compression by normalizing the Unicode text first.

--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-12 Thread J Decker via Unicode
On Fri, Oct 12, 2018 at 9:23 AM Doug Ewell via Unicode 
wrote:

> J Decker wrote:
>
> >> How about the opposite direction: If m is base64 encoded to yield t
> >> and then t is base64 decoded to yield n, will it always be the case
> >> that m equals n?
> >
> > False.
> > Canonical translation may occur which the different base64 may be the
> > same sort of string...
>
> Base64 is a binary-to-text encoding. Neither encoding nor decoding
> should presume any special knowledge of the meaning of the binary data,
> or do anything extra based on that presumption.
>
> Converting Unicode text to and from base64 should not perform any sort
> of Unicode normalization, convert between UTFs, insert or remove BOMs,
> etc. This is like saying that converting a JPEG image to and from base64
> should not resize or rescale the image, change its color depth, convert
> it to another graphic format, etc.
>
> So I'd say "true" to Roger's question.
>
On the first side (X to base64) definitely true.

But there is potential that text resulting from some decoded buffer is
translated, resulting in a 'congruent' string that's not exactly the
same... and the base64 will be different.

Comparing some base64 string with some other base64 string may show a binary
difference even though they still represent the 'same' string.
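
A small sketch of what I mean, assuming Python's standard base64 and unicodedata modules, UTF-8 as the text encoding and NFC as the notion of 'same' (the helper names are made up for the example):

```python
import base64
import unicodedata

def b64_of_text(text: str) -> bytes:
    """Base64-encode the UTF-8 bytes of text (illustrative helper)."""
    return base64.b64encode(text.encode("utf-8"))

def same_text(t1: bytes, t2: bytes) -> bool:
    """Compare two base64 texts by the NFC form of the UTF-8 text they carry."""
    s1 = base64.b64decode(t1).decode("utf-8")
    s2 = base64.b64decode(t2).decode("utf-8")
    return unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2)

precomposed = b64_of_text("\u00e9")  # U+00E9, e with acute as one code point
decomposed = b64_of_text("e\u0301")  # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The base64 texts differ, because the underlying bytes differ ...
assert precomposed != decomposed
# ... but the strings are canonically equivalent once normalized.
assert same_text(precomposed, decomposed)
```

The difference comes from the text being changed before encoding, not from base64 itself.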


>
> I touched on this a little bit in UTN #14, from the standpoint of trying
> to improve compression by normalizing the Unicode text first.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-12 Thread Doug Ewell via Unicode
J Decker wrote:

>> How about the opposite direction: If m is base64 encoded to yield t
>> and then t is base64 decoded to yield n, will it always be the case
>> that m equals n?
>
> False.
> Canonical translation may occur which the different base64 may be the
> same sort of string...

Base64 is a binary-to-text encoding. Neither encoding nor decoding
should presume any special knowledge of the meaning of the binary data,
or do anything extra based on that presumption.

Converting Unicode text to and from base64 should not perform any sort
of Unicode normalization, convert between UTFs, insert or remove BOMs,
etc. This is like saying that converting a JPEG image to and from base64
should not resize or rescale the image, change its color depth, convert
it to another graphic format, etc.

So I'd say "true" to Roger's question.
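
As a quick sanity check of that, here is a minimal sketch assuming Python's standard base64 module: text that is deliberately not normalized and carries a UTF-8 BOM comes back byte for byte.

```python
import base64
import codecs

# "e" followed by U+0301 COMBINING ACUTE ACCENT: deliberately not in NFC form.
text = "e\u0301"
# Prefix a UTF-8 BOM so we can check that it survives as well.
original = codecs.BOM_UTF8 + text.encode("utf-8")

encoded = base64.b64encode(original)
decoded = base64.b64decode(encoded)

# The round trip returns the exact same bytes: BOM kept, no normalization.
assert decoded == original
assert decoded.startswith(codecs.BOM_UTF8)
assert decoded[len(codecs.BOM_UTF8):].decode("utf-8") == text
```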

I touched on this a little bit in UTN #14, from the standpoint of trying
to improve compression by normalizing the Unicode text first.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-12 Thread J Decker via Unicode
On Fri, Oct 12, 2018 at 3:57 AM Costello, Roger L. via Unicode <
unicode@unicode.org> wrote:

> Hi Unicode Experts,
>
> Suppose base64 encoding is applied to m to yield base64 text t.
>
> Next, suppose base64 encoding is applied to m' to yield base64 text t'.
>
> If m is not equal to m', then t will not equal t'.
>
> In other words, given different inputs, base64 encoding always yields
> different base64 texts.
>
> True or false?
>
True. Base64 encoding and decoding always give the same result.

>
> How about the opposite direction: If m is base64 encoded to yield t and
> then t is base64 decoded to yield n, will it always be the case that m
> equals n?
>
False.
Canonical translation may occur, so that different base64 texts may
represent the same sort of string...

https://en.wikipedia.org/wiki/Unicode_equivalence
https://en.wikipedia.org/wiki/Canonical_form


> /Roger
>
>