RE: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-14 Thread Tex via Unicode
Philippe,

Where is the use of whitespace, or the idea that 1-byte pieces do not need
all of the equals-sign padding, documented?

I read RFC 3501, which you pointed to, and I don't see it there.

Are these part of any standard? Or are you claiming these are practices
that exist despite the standards? If so, are they merely tolerated by
parsers, or are they actually generated by encoders?

What would be the rationale for supporting unnecessary whitespace? If
line breaks are forced at some line length, they can presumably be removed
at that length and not treated as part of the encoding.

Maybe we differ on where the encoding begins and ends, and on where
higher-level protocols prescribe how it is embedded within the protocol.

Tex

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe Verdy 
via Unicode
Sent: Sunday, October 14, 2018 1:41 AM
To: Adam Borowski
Cc: unicode Unicode Discussion
Subject: Re: Base64 encoding applied to different unicode texts always yields 
different base64 texts ... true or false?

 


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-14 Thread Philippe Verdy via Unicode
On Sun, 14 Oct 2018 at 21:21, Doug Ewell via Unicode wrote:

> Steffen Nurpmeso wrote:
>
> > Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
> > (MIME) Part One: Format of Internet Message Bodies).
>
> Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
> Encodings." RFC 2045 defines a particular implementation of base64,
> specific to transporting Internet mail in a 7-bit environment.
>

Wrong: it is "specific" to transporting Internet mail in any 7-bit or 8-bit
environment (today almost all mail agents operate in 8-bit mode), and it is
also referenced directly by HTTP (and its HTTPS variant).

So it is not so "specific". MIME is extremely popular, while RFC 4648 is
extremely exotic (and RFC 4648 is wrong to say that IMAP is very specific,
since it is now a very popular and widely used protocol). MIME is used so
frequently that almost everyone refers to it when looking for Base64, or
does not explicitly state that another definition (found in an exotic RFC)
is being used.


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-14 Thread Philippe Verdy via Unicode
It's also interesting to look at https://tools.ietf.org/html/rfc3501
- which defines (for IMAP v4) another "BASE64" encoding,
- and also defines a "Modified UTF-7" encoding using it, deviating from
Unicode's definition of UTF-7,
- and adds other requirements (which forbid alternate encodings
permitted in UTF-7 and all other Base64 variants, including those used in
MIME/RFC 2045 or SMTP, which are closely related to IMAP!).

And nothing in RFC 4648 is clear about the fact that it only covers the
encoding of "octets streams" and not "bits streams". It also does not
discuss the adaptation of "Base64" for transport and storage (needed for
MIME and IMAP, but also in HTTP, and in several file/data formats including
XML, or digital signatures).

RFC 4648 is only superficial, and does not cover everything (even
Unicode has its own definition for UTF-7 and also allows variations).

As we are on this Unicode list: the definition used by Unicode (more in
line with MIME) does not follow the one in RFC 4648 at all.
Most uses of Base64 encodings are based on the original MIME definition,
and all of them make new adaptations. (Even the definition of "Base16"
in RFC 4648 contradicts most other definitions.)


On Sun, 14 Oct 2018 at 21:21, Doug Ewell via Unicode wrote:

> Steffen Nurpmeso wrote:
>
> > Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
> > (MIME) Part One: Format of Internet Message Bodies).
>
> Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
> Encodings." RFC 2045 defines a particular implementation of base64,
> specific to transporting Internet mail in a 7-bit environment.
>
> RFC 4648 discusses many of the "higher-level protocol" topics that some
> people are focusing on, such as separating the base64-encoded output
> into lines of length 72 (or other), alternative target code unit sets or
> "alphabets," and padding characters. It would be helpful for everyone to
> read this particular RFC before concluding that these topics have not
> been considered, or that they compromise round-tripping or other
> characteristics of base64.
>
> I had assumed that when Roger asked about "base64 encoding," he was
> asking about the basic definition of base64.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-14 Thread Doug Ewell via Unicode

Steffen Nurpmeso wrote:

> Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
> (MIME) Part One: Format of Internet Message Bodies).


Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data 
Encodings." RFC 2045 defines a particular implementation of base64, 
specific to transporting Internet mail in a 7-bit environment.


RFC 4648 discusses many of the "higher-level protocol" topics that some 
people are focusing on, such as separating the base64-encoded output 
into lines of length 72 (or other), alternative target code unit sets or 
"alphabets," and padding characters. It would be helpful for everyone to 
read this particular RFC before concluding that these topics have not 
been considered, or that they compromise round-tripping or other 
characteristics of base64.
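
As a small illustration of the "alphabets" and line-length points, here is
a sketch using Python's standard base64 module (my choice of tool for the
example, not something the RFC prescribes). It encodes the same octets
with the standard and the URL-safe alphabets, and with MIME-style line
wrapping, and checks that each form round-trips to the same payload.

```python
import base64

payload = bytes([0xFB, 0xFF])

standard = base64.b64encode(payload)          # standard alphabet uses "+" and "/"
url_safe = base64.urlsafe_b64encode(payload)  # URL-safe alphabet uses "-" and "_"

# Different text, same octets once decoded with the matching alphabet.
assert standard != url_safe
assert base64.b64decode(standard) == payload
assert base64.urlsafe_b64decode(url_safe) == payload

# encodebytes() adds a newline after every 76 output characters (MIME style);
# decodebytes() skips those newlines again.
long_payload = bytes(range(256))
wrapped = base64.encodebytes(long_payload)
assert b"\n" in wrapped
assert base64.decodebytes(wrapped) == long_payload
```

The line breaks and the alphabet are chosen by the surrounding protocol;
the decoded octets are the same in every case.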


I had assumed that when Roger asked about "base64 encoding," he was 
asking about the basic definition of base64.


--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Fallback for Sinhala Consonant Clusters

2018-10-14 Thread Harshula via Unicode
Hi Richard,

1) From a pronunciation perspective, your first and third examples will
be similar. Your second example will be pronounced very differently. I
did some quick testing on Linux and reproduced the behaviour that you
observed.

2) Going back more than a decade, the state tables used by some
layout/shaping engines used the same 'virama' rules for North Indian
scripts and Sinhala. This resulted in undesirable *implicit* conjuncts
being created for Sinhala consonant clusters. That then resulted in
undesirable positioning of dependent vowels. e.g.
https://bugzilla.gnome.org/show_bug.cgi?id=161981

3) However, what you have observed is an issue with *explicit* conjunct
creation. After the segmentation is completed, the layout/shaping engine
needs to first check whether there is a corresponding lookup for the
explicit conjunct; if not, it needs to remove the ZWJ and redo the
segmentation and lookup(s). Perhaps that is not happening in HarfBuzz.
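
The fallback described in point 3 could be sketched roughly as follows.
This is only an illustration in Python, not HarfBuzz code or its API:
`fallback_cluster` and the `font_has_conjunct` callback are hypothetical
names standing in for whatever lookup check a real shaping engine performs.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

def fallback_cluster(cluster: str, font_has_conjunct) -> str:
    """Return the text to (re)segment and shape for one cluster.

    font_has_conjunct is a hypothetical callback that reports whether the
    font has a lookup for the explicit conjunct requested by the ZWJ.
    """
    if ZWJ in cluster and not font_has_conjunct(cluster):
        # No conjunct available: drop the ZWJ so the cluster is shaped
        # as an ordinary consonant + al-lakuna sequence.
        return cluster.replace(ZWJ, "")
    return cluster

# Pali 'ndhe': NAYANNA, AL-LAKUNA, ZWJ, MAHAPRAANA DAYANNA, KOMBUVA
ndhe = "\u0db1\u0dca\u200d\u0db0\u0dd9"
without_conjunct = fallback_cluster(ndhe, lambda c: False)
with_conjunct = fallback_cluster(ndhe, lambda c: True)
```

With a font that lacks the conjunct, the result is the same sequence
without the ZWJ, which would then be segmented and shaped again.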

4) I've been out of the loop for many years, so I have CC'd Ruvan &
Harsha who may already be aware of what you have observed.

cya,
#

On 14/10/18 11:02 am, Richard Wordingham via Unicode wrote:
> Are there fallback rules for Sinhala consonant clusters?  There are
> fallback rules for Devanagari, but I'm not sure if they read across.
> 
> The problem I am seeing is that the Pali syllable 'ndhe' න්‍ධෙ <U+0DB1
> DANTAJA NAYANNA, U+0DCA AL-LAKUNA, 200D ZWJ, U+0DB0 MAHAPRAANA DAYANNA,
> U+0DD9 KOMBUVA> is being rendered identically to a hypothetical Sinhalese
> 'nēdha' නේධ <U+0DB1, U+0DDA, U+0DB0>, which in NFD is
> <U+0DB1, U+0DD9, U+0DCA, U+0DB0>, when I use a font that lacks the
> conjunct.  (Most fonts lack the conjunct.)  The Devanagari rules and my
> preference would lead to a fallback rendering as න්ධෙ (Sinhalese
> 'ndhe'), which is encoded as <U+0DB1 DANTAJA NAYANNA, U+0DCA AL-LAKUNA,
> U+0DB0 MAHAPRAANA DAYANNA, U+0DD9 KOMBUVA>.  Is the rendering I am getting
> technically wrong, or is it merely undesirable?
> 
> The ambiguity arises in part because, like the Brahmi script, the
> Sinhala script uses its virama character as a vowel length indicator.
> 
> Missing touching consonants are being rendered almost as though there
> were no ZWJ, but the combination of consonant and al-lakuna is being
> rendered badly.
> 
> Richard.
> 


Re: Fallback for Sinhala Consonant Clusters

2018-10-14 Thread Richard Wordingham via Unicode
On Sun, 14 Oct 2018 17:15:26 +0900
"Martin J. Dürst via Unicode"  wrote:

> Hello Richard,
> 
> On 2018/10/14 09:02, Richard Wordingham via Unicode wrote:
> > Are there fallback rules for Sinhala consonant clusters?  There are
> > fallback rules for Devanagari, but I'm not sure if they read across.
> > 
> > The problem I am seeing is that the Pali syllable 'ndhe' න්‍ධෙ
> > <U+0DB1 DANTAJA NAYANNA, U+0DCA AL-LAKUNA, 200D ZWJ, U+0DB0 MAHAPRAANA
> > DAYANNA, U+0DD9 KOMBUVA>
> 
> Let's label this as (1)
> 
> > is being rendered identically to a hypothetical Sinhalese
> > 'nēdha' නේධ <U+0DB1, U+0DDA, U+0DB0>,
> 
> It (2) doesn't look identical to (1) here (Thunderbird on Win 8.1).
> 
> Your mail is written as if you are speaking about a general
> phenomenon, but I guess there are differences depending on the font
> and rendering stack.

The critical one is whether the font has the conjunct.  The default
Sinhala font on supported Windows, Iskoola Pota, has the conjunct. For
an example that should illustrate my points with that font (at least,
as on Windows 7) and the HarfBuzz renderer (as I believe in
Thunderbird), we have

1') Pali thve ථ්‍වෙ 

It's a very rare syllable - it only occurs in sandhi, and I have only
a single example.  Iskoola Pota has neither the conjunct nor the
touching form; I would actually expect it to be the touching form
that exists.

2') Misleading look-alike thēva ථේව 

3') Preferred fallback appearance thve ථ්වෙ.
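
For readers who want to compare the three sequences at the code point
level, here is a rough Python sketch. The code points are my reading of
the three strings above (U+0DAE for the first consonant and U+0DC0 for
the second), so treat them as an assumption; only the standard
unicodedata module is used.

```python
import unicodedata

ZWJ = "\u200d"
THA, VA = "\u0dae", "\u0dc0"          # assumed consonants of the examples
AL_LAKUNA = "\u0dca"
KOMBUVA, DIGA_KOMBUVA = "\u0dd9", "\u0dda"

conjunct_thve = THA + AL_LAKUNA + ZWJ + VA + KOMBUVA   # 1'
lookalike_theva = THA + DIGA_KOMBUVA + VA              # 2'
fallback_thve = THA + AL_LAKUNA + VA + KOMBUVA         # 3'

# DIGA KOMBUVA decomposes canonically into KOMBUVA + AL-LAKUNA, so in NFD
# the look-alike contains the same marks as 3', but in a different order.
nfd_lookalike = unicodedata.normalize("NFD", lookalike_theva)
assert nfd_lookalike == THA + KOMBUVA + AL_LAKUNA + VA
assert nfd_lookalike != fallback_thve

# Dropping the ZWJ from 1' gives exactly the preferred fallback 3'.
assert conjunct_thve.replace(ZWJ, "") == fallback_thve
```

The text of 1' and 2' therefore stays distinct; the confusion arises only
in how they are rendered.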

My question is, 'What should a rendering stack that claims to support
the Sinhala script display when it lacks the conjunct in the font
being used?'

Now what does get displayed does depend on the rendering stack.
HarfBuzz (e.g. Firefox, Google Chrome, LibreOffice, and most Linux) and
Notepad on Windows 7 move the vowel to the left and display al-lakuna,
the display I object to. iPhone and Notepad on Windows 10 display
the vowel in the middle and display al-lakuna (possibly ligated), which
is the solution I prefer.

> Hope this helps.

Well, it has prompted me to find a 'me-too' argument for improving the
rendering.  I wanted a standards-based argument.

>> Missing touching consonants are being rendered almost as though
>> there were no ZWJ, but the combination of consonant and al-lakuna
>> is being rendered badly.

This looks like a common font problem.  Iskoola Pota does not suffer
from it.

Richard.



Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-14 Thread Philippe Verdy via Unicode
Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
enough to indicate the end of an octets-span. Any extra = after it does not
add any other octet. You are also allowed to insert whitespace anywhere in
the encoded stream (this is what ensures that the Base64-encoded
octets-stream will not be altered if line breaks are forced anywhere,
notably within the body of emails).

So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR,
LF, NEL) in the middle is non-significant and ignorable on decoding (their
"encoded" bit length is 0 and they don't terminate an octets-span, unlike
"=" which discards extra bits remaining from the encoded stream before that
are not on 8-bit boundaries).

Also:
- For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol
before "=" can vary in its 4 lowest bits (which are then ignored/discarded
by the "=" symbol)
- For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X" symbol
before "=" can vary in its 2 lowest bits (which are then ignored/discarded
by the "=" symbol)

So you can use Base64 by encoding each octet in separate pieces, as two
Base64 symbols followed by an "=" symbol, and even insert any number of
whitespace characters between them: there is an infinite number of valid
Base64 encodings for representing the same octets-stream payload.
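
A quick way to see part of this with one particular decoder is the sketch
below, which uses Python's standard base64 module in its default lenient
mode. It only illustrates the ignored low bits and the ignored whitespace;
whether other decoders (or the single-"=" form) behave the same way is not
something this example shows, and stricter settings may reject these
inputs.

```python
import base64

# Canonical encoding of the single octet 0x41 ("A").
canonical = base64.b64encode(b"A")   # b"QQ=="

variants = [
    b"QR==",    # 2nd symbol differs only in its 4 lowest (discarded) bits
    b"Q\nQ==",  # line break inside the encoded text
    b"Q Q==",   # space inside the encoded text
]

# With the default validate=False, characters outside the alphabet are
# skipped and the leftover low bits of the last symbol are dropped.
decoded = [base64.b64decode(v) for v in variants]
assert all(d == b"A" for d in decoded)
```

All three variants decode to the same octet as the canonical form, even
though none of them is what the encoder itself would produce.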

Base64 allows encoding any octets-stream, but not directly any bits-stream:
it assumes that the effective bits-stream has a bit length that is a
multiple of 8. To encode a bits-stream with an exact number of bits (not a
multiple of 8), you need to encode an extra payload to indicate the
effective number of bits to keep at the end of the encoded octets-stream
(or at the start):
- Base64 does not specify how you convert a bits-stream of arbitrary length
to an octets-stream;
- for that purpose, you may need to pad the bits-stream at start or at end
with 1 to 7 bits (so that the resulting bits-stream has a length that is a
multiple of 8, and is then encodable with Base64, which takes only octets
on input);
- these extra padding bits are not significant for the original
bits-stream, but are significant for the Base64 encoder/decoder: they will
be discarded by the bits-stream decoder built on top of the Base64 decoder,
but not by the Base64 decoder itself.

You need to encode somewhere with the bits-stream encoder how many padding
bits (0 to 7) are present at start or end of the octets-stream; this can be
done:
- as a separate payload (not encoded by Base64), or
- by prepending 3 bits at the start of the bits-stream, which is then padded
at the end with 1 to 7 random bits to get a bit-length multiple of 8
suitable for Base64 encoding, or
- by appending 3 bits at the end of the bits-stream, just after the 1 to 7
random bits needed to get a bit-length multiple of 8 suitable for Base64
encoding.

Finally, your bits-stream decoder will be able to use this padding count to
discard these random padding bits (and possibly realign the stream on
different byte boundaries when the effective bit length of the bits-stream
payload is not a multiple of 8 and padding bits were added).
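
The "prepend a 3-bit padding count" option can be sketched as follows.
This is a minimal illustration, not a standardized framing: the function
names are made up, the bits are packed most-significant-bit first (one of
the choices Base64 leaves open), and the padding bits are zeros rather
than random bits to keep the example deterministic.

```python
import base64

def pack_bits(bits: str) -> bytes:
    """Pack a string of '0'/'1' into octets with a 3-bit pad-count prefix."""
    pad = (-(len(bits) + 3)) % 8                  # 0..7 padding bits needed
    framed = format(pad, "03b") + bits + "0" * pad
    return bytes(int(framed[i:i + 8], 2) for i in range(0, len(framed), 8))

def unpack_bits(octets: bytes) -> str:
    """Inverse of pack_bits: read the pad count, then drop the padding."""
    framed = "".join(format(b, "08b") for b in octets)
    pad = int(framed[:3], 2)
    return framed[3:len(framed) - pad]

bits = "1011001110"                               # 10 bits, not a multiple of 8
encoded = base64.b64encode(pack_bits(bits))       # Base64 sees only whole octets
restored = unpack_bits(base64.b64decode(encoded))
assert restored == bits
```

The Base64 layer never learns about the padding; only the bits-stream
layer built on top of it interprets the 3-bit count.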

Base64 also does not specify how bits of the original bits-stream payload
are packed into the octets-stream input suitable for Base64 encoding;
notably, it does not specify their order and endianness. The same remark
applies to MIME and HTTP. So a lot of network protocols and file formats
need to specify how to encode which option is used to encode bits-streams
of arbitrary length, or which default choice applies if this option is not
encoded, or which option must be used (with no possible variation). This
also adds to the number of distinct encodings that are possible but are
still equivalent for the same effective bits-stream payload.

All these allowed variations are from the encoder perspective. For
interoperability, the decoder has to be flexible and support various
options to be compatible with different implementations of the encoder,
notably when the encoder was run on a different system. This is the case
for MIME transport by mail, for HTTP and FTP transports, and for
file/media storage formats, even if the file is stored on the same system
(because it may actually be a local copy of a file that was encoded on
another system).

Now if we come back to the encoding of plain-text payloads, Unicode just
specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code
points (it does not actually mandate an exact bit-length, because the range
does not fit exactly into 21 bits and an encoder can still pack multiple
code points together into more compact code units).

However Unicode provides and standardizes several encodings (UTF-8/16/32)
which use code units whose size is directly suitable as input for an
octets-stream, so that they are directly encodable with Base64, without
having to specify an extra layer for the bits-stream encoder/decoder.
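
As a small illustration of this point, the sketch below (Python, standard
library only; the helper name `b64_text` is made up for the example)
encodes one character through three UTFs and then through Base64. Each
UTF already yields whole octets, so no extra bits-stream layer is needed,
but the Base64 text depends on which UTF was chosen.

```python
import base64

def b64_text(text: str, codec: str) -> str:
    """Encode text to octets with the given codec, then to canonical Base64."""
    return base64.b64encode(text.encode(codec)).decode("ascii")

text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    encoded = b64_text(text, codec)
    # Round trip: Base64 -> octets -> text
    assert base64.b64decode(encoded).decode(codec) == text

# Same text, different UTFs: different octets, hence different Base64.
assert b64_text(text, "utf-8") != b64_text(text, "utf-16-le")
```

So any comparison of Base64 texts only makes sense once the underlying
encoding form has been fixed.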

But many other encodings are still possible (and can be 

Re: Fallback for Sinhala Consonant Clusters

2018-10-14 Thread Martin J. Dürst via Unicode

Hello Richard,

On 2018/10/14 09:02, Richard Wordingham via Unicode wrote:
> Are there fallback rules for Sinhala consonant clusters?  There are
> fallback rules for Devanagari, but I'm not sure if they read across.
>
> The problem I am seeing is that the Pali syllable 'ndhe' න්‍ධෙ
> <U+0DB1 DANTAJA NAYANNA, U+0DCA AL-LAKUNA, 200D ZWJ, U+0DB0 MAHAPRAANA
> DAYANNA, U+0DD9 KOMBUVA>

Let's label this as (1)

> is being rendered identically to a hypothetical Sinhalese
> 'nēdha' නේධ <U+0DB1, U+0DDA, U+0DB0>,

It (2) doesn't look identical to (1) here (Thunderbird on Win 8.1).

Your mail is written as if you are speaking about a general phenomenon,
but I guess there are differences depending on the font and rendering stack.

> which in NFD is <U+0DB1, U+0DD9, U+0DCA, U+0DB0>, when I use a font
> that lacks the conjunct.  (Most fonts lack the conjunct.)  The
> Devanagari rules and my preference would lead to a fallback rendering
> as න්ධෙ (Sinhalese 'ndhe'),

Here, this (3) looks like it has the same three components as (2), but
the first two are exchanged, so that the piece that looks like @ is now
in the middle (it was at the left in (1) and (2)).

Hope this helps.  Regards,   Martin.

> which is encoded as <U+0DB1 DANTAJA NAYANNA, U+0DCA AL-LAKUNA, U+0DB0
> MAHAPRAANA DAYANNA, U+0DD9 KOMBUVA>.  Is the rendering I am getting
> technically wrong, or is it merely undesirable?
>
> The ambiguity arises in part because, like the Brahmi script, the
> Sinhala script uses its virama character as a vowel length indicator.
>
> Missing touching consonants are being rendered almost as though there
> were no ZWJ, but the combination of consonant and al-lakuna is being
> rendered badly.
>
> Richard.



--
Prof. Dr.sc. Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan