Re: Fallback for Sinhala Consonant Clusters

2018-10-15 Thread Richard Wordingham via Unicode
On Tue, 16 Oct 2018 11:59:54 +1100
Harshula via Unicode  wrote:

> Hi Richard,
> 
> On 16/10/18 6:57 am, Richard Wordingham via Unicode wrote:
> > On Tue, 16 Oct 2018 02:47:36 +1100
> > Harshula via Unicode  wrote:
> >   
> >> Note, touching letters are formed by <ZWJ, al-lakuna>, so they
> >> should not be displayed as a fallback for <al-lakuna, ZWJ>
> >> conjuncts.
> > 
> > I don't follow that.  While the conjuncts with r-, -r and -y are
> > very different to pairs of touching letters, the conjuncts for tth,
> > nd, ndr, ndh, kv and tv would be very similar to the hypothetical
> > corresponding touching letters and quite different to the fallbacks
> > with visible al-lakuna.  
> 
> If you haven't already, it's best you read SLS 1134:2011:
> http://www.language.lk/en/download/standards/
> 
> or the older SLS 1134:2004:
> http://unicode.org/wg2/docs/n2737.pdf

The latter actually says, in Section 5.8, that  may be
used for either!  I suspect that that is a printing error.

The Sri Lankan standard simply assumes that the rendering system can
accommodate what is requested in the backing store.  It says nothing
about fallbacks.  So, if the user specifies the syllable ddho
written with a conjunct and encoded as ද්‍ධො but the conjunct is
missing from the fonts' repertoires, why is it right to display it with
al-lakuna as though it were ද්ධො but wrong to display it with the
touching letters encoded as ද‍්ධො?   There are three different
correct ways of writing 'ddho', but many systems only support one of
them (and some weirdly use a fourth method). 
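For readers whose display hides the joiners, the three spellings of 'ddho' discussed above can be listed code point by code point. This is only an illustrative sketch: the names DA, DHA, O, SPELLINGS and describe are made up here, and the sequences assume that the conjunct uses <al-lakuna, ZWJ> and the touching form uses <ZWJ, al-lakuna>.

```python
import unicodedata

ZWJ = "\u200d"        # ZERO WIDTH JOINER
AL_LAKUNA = "\u0dca"  # SINHALA SIGN AL-LAKUNA
# Assumed letters for the example syllable: da, dha and the vowel sign o.
DA, DHA, O = "\u0daf", "\u0db0", "\u0ddc"

SPELLINGS = {
    "conjunct": DA + AL_LAKUNA + ZWJ + DHA + O,
    "visible al-lakuna": DA + AL_LAKUNA + DHA + O,
    "touching letters": DA + ZWJ + AL_LAKUNA + DHA + O,
}


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


for label, text in SPELLINGS.items():
    print(label)
    for line in describe(text):
        print("   ", line)
```

The three strings differ only in whether a ZWJ is present and on which side of the al-lakuna it sits, so a renderer has to make its fallback decision from that one character.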

Richard.




Re: Fallback for Sinhala Consonant Clusters

2018-10-15 Thread Harshula via Unicode
Hi Richard,

On 16/10/18 6:57 am, Richard Wordingham via Unicode wrote:
> On Tue, 16 Oct 2018 02:47:36 +1100
> Harshula via Unicode  wrote:
> 
>> Note, touching letters are formed by <ZWJ, al-lakuna>, so they should
>> not be displayed as a fallback for <al-lakuna, ZWJ> conjuncts.
> 
> I don't follow that.  While the conjuncts with r-, -r and -y are very
> different to pairs of touching letters, the conjuncts for tth, nd, ndr,
> ndh, kv and tv would be very similar to the hypothetical corresponding
> touching letters and quite different to the fallbacks with visible
> al-lakuna.

If you haven't already, it's best you read SLS 1134:2011:
http://www.language.lk/en/download/standards/

or the older SLS 1134:2004:
http://unicode.org/wg2/docs/n2737.pdf

cya,
#


Re: Fallback for Sinhala Consonant Clusters

2018-10-15 Thread Richard Wordingham via Unicode
On Tue, 16 Oct 2018 02:47:36 +1100
Harshula via Unicode  wrote:

> Note, touching letters are formed by <ZWJ, al-lakuna>, so they should
> not be displayed as a fallback for <al-lakuna, ZWJ> conjuncts.

I don't follow that.  While the conjuncts with r-, -r and -y are very
different to pairs of touching letters, the conjuncts for tth, nd, ndr,
ndh, kv and tv would be very similar to the hypothetical corresponding
touching letters and quite different to the fallbacks with visible
al-lakuna.

Richard.


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Steffen Nurpmeso via Unicode
Philippe Verdy via Unicode wrote:
 |Padding itself does not clearly indicate the length.
 |
 |It's an artefact that **may** be inferred only in some other layers of
 |protocols which specify when and how padding is needed (and how many
 |padding bytes are required or accepted), it works only if these upper
 |layer protocols are using **octets** streams, but it is still not usable
 |for more general bitstreams (with arbitrary bit lengths).
 |
 |This RFC does not mandate/require these padding bytes and in fact many
 |upper layer protocols do not ever need it (including UTF-7 for example),
 |they are never necessary to infer a length in octets and insufficient
 |for specifying a length in bits.
 |
 |As well the usage in MIME (where there's a requirement that lines of
 |headers or in the content body is limited to 1000 bytes) requires free
 |splitting of Base64 (there's no agreed maximum length, some sources
 |insist it should not be more than 72 bytes, others use 80 bytes, but
 |mail forwarding may add other characters at start of lines, forcing them
 |to be shorter (leaving for example a line of 72 bytes+CRLF and another
 |line of 8 bytes+CRLF): this means that padding may not be used where one
 |would expect them, and padding can even occur in the middle of the
 |encoded stream (not just at end) along

That was actually a bug in my MUA.  Other MUAs were not capable of
decoding this correctly.
Sorry :-(!!

 |with other whitespaces or separators (like "> " at start of lines in
 |cited messages).

In fact, MIME explicitly says that garbage bytes may be embedded.
Most decoders handle that correctly and skip them (silently, which
may not be right), but some standalone base64 decoders fail
miserably when they see such things (openssl base64, the current
NetBSD base64 decoder); others do not (busybox base64, for example).
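The difference between skipping and rejecting can be reproduced with Python's standard base64 module, used here only as a convenient stand-in and not as a statement about any of the tools named above. The sample input and the helper name strict_decode are invented for this sketch.

```python
import base64
import binascii

# "hello" in base64, broken across two lines and prefixed with "> "
# the way a cited mail body might look.
quoted = b"aGVs\n> bG8="

# Default mode (validate=False): bytes outside the alphabet are discarded.
lenient = base64.b64decode(quoted)


def strict_decode(data):
    """Return the decoded bytes, or None if any foreign byte is present."""
    try:
        return base64.b64decode(data, validate=True)
    except binascii.Error:
        return None


print(lenient)                # the skipping behaviour
print(strict_decode(quoted))  # the rejecting behaviour
```

Which of the two is "right" depends on the embedding protocol: a MIME body decoder is expected to skip, while a decoder for a format that forbids foreign characters may reasonably reject.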

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Philippe Verdy via Unicode
Padding itself does not clearly indicate the length.
It is an artefact whose meaning **may** be inferred only in other layers
of protocols that specify when and how padding is needed (and how many
padding bytes are required or accepted). It works only if these
upper-layer protocols use **octet** streams, and it is still not usable
for more general bitstreams (with arbitrary bit lengths).

This RFC does not mandate/require these padding bytes, and in fact many
upper-layer protocols never need them (UTF-7, for example). They are
never necessary to infer a length in octets and are insufficient for
specifying a length in bits.

Also, the usage in MIME (where lines of headers or of the content body
are limited to 1000 bytes) requires free splitting of Base64. There is no
agreed maximum length: some sources insist it should not be more than 72
bytes, others use 80 bytes, but mail forwarding may add other characters
at the start of lines, forcing them to be shorter (leaving, for example,
a line of 72 bytes+CRLF and another line of 8 bytes+CRLF). This means
that padding may not be used where one would expect it, and padding can
even occur in the middle of the encoded stream (not just at the end),
along with other whitespace or separators (like "> " at the start of
lines in cited messages).

More generally, the padding in MIME offers no benefit at all. The actual
length is inferred from the whole content body, and it is just safer to
ignore/discard all padding symbols in decoders (just as they discard
whitespace or ">"). If one wants a sure indication that the stream is not
truncated and has the expected length, the encoded message must either
embed this length as part of the original binary stream itself, or embed
secure "digital signatures", "message digests" or "hashes";
alternatively, the length can be specified separately in the unencoded
MIME body, or as part of the MIME header if the whole MIME content body
is specified as using a base64 encoding. The same applies to HTTP.
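As a rough illustration of carrying the length and a digest next to the encoded body, here is a minimal Python sketch. The envelope layout and the function names wrap and unwrap are assumptions made for this example; they are not taken from MIME, HTTP or any RFC.

```python
import base64
import hashlib


def wrap(payload: bytes) -> dict:
    """Bundle the base64 body with its octet length and a SHA-256 digest."""
    return {
        "length": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "body": base64.b64encode(payload).decode("ascii"),
    }


def unwrap(envelope: dict) -> bytes:
    """Decode the body and refuse it if length or digest do not match."""
    payload = base64.b64decode(envelope["body"])
    if len(payload) != envelope["length"]:
        raise ValueError("length mismatch: body truncated or extended")
    if hashlib.sha256(payload).hexdigest() != envelope["sha256"]:
        raise ValueError("digest mismatch: body altered")
    return payload


envelope = wrap(b"hello world!")
print(unwrap(envelope))
```

With this kind of envelope the "=" characters carry no information the receiver depends on; a body cut at a four-character boundary is still detected by the length check.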

I have rarely seen RFC 4648 used alone, outside of another upper-layer
protocol. This statement in RFC 4648 section 3.2 is, for example,
completely wrong for Base16, where padding is almost always avoided.

Various other Base-N profiles for other upper-layer protocols never need
(and sometimes even forbid) any padding symbol, or consider that padding
can also be done with 0 bits appended to the original binary stream, or
with other ignored/discarded whitespace or symbols, without assigning
any specific role to "=" (as a length indicator or stream terminator).



Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Peter Saint-Andre via Unicode
On 10/14/18 3:59 PM, Philippe Verdy via Unicode wrote:
> On Sun, 14 Oct 2018 at 21:21, Doug Ewell via Unicode
> <unicode@unicode.org> wrote:
>
> > Steffen Nurpmeso wrote:
> >
> > > Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
> > > (MIME) Part One: Format of Internet Message Bodies).
> >
> > Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
> > Encodings." RFC 2045 defines a particular implementation of base64,
> > specific to transporting Internet mail in a 7-bit environment.
>
> Wrong, this is "specific" to transporting Internet mail in any 7 bit or
> 8 bit environment (today almost all mail agents are operating in 8 bit),
> and then it is referenced directly by HTTP (and its HTTPS variant).
>
> So this is not so "specific". MIME is extremely popular, RFC 4648 is
> extremely exotic (and RFC 4648 is wrong when saying that IMAP is very
> specific as it is now a very popular protocol, widely used as well).
> MIME is so frequently used, that almost all people refer to it when they
> look for Base64, or do not explicitly state that another definition
> (found in an exotic RFC) is explicitly used.

RFC 4648 is used in many, many Internet protocols. It's definitely not
"extremely exotic".

Peter



Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Steffen Nurpmeso via Unicode
Doug Ewell via Unicode wrote in <2A67B4F082F74F8AADF34BA11D885554@DougEwell>:
 |Steffen Nurpmeso wrote:
 |> Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
 |> (MIME) Part One: Format of Internet Message Bodies).
 |
 |Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
 |Encodings." RFC 2045 defines a particular implementation of base64,
 |specific to transporting Internet mail in a 7-bit environment.
 |
 |RFC 4648 discusses many of the "higher-level protocol" topics that some
 |people are focusing on, such as separating the base64-encoded output
 |into lines of length 72 (or other), alternative target code unit sets or
 |"alphabets," and padding characters. It would be helpful for everyone to
 |read this particular RFC before concluding that these topics have not
 |been considered, or that they compromise round-tripping or other
 |characteristics of base64.
 |
 |I had assumed that when Roger asked about "base64 encoding," he was
 |asking about the basic definition of base64.

Sure; I have only followed the discussion superficially, and even
though everybody can read RFCs, I felt the need to argue against
the claim, false however I look at it, that "MIME actually splits
a binary object into multiple fragments at random positions".
Solely my fault.

--steffen


Re: Fallback for Sinhala Consonant Clusters

2018-10-15 Thread Harshula via Unicode
Hi Richard,

On 15/10/18 6:53 pm, Richard Wordingham via Unicode wrote:
> On Mon, 15 Oct 2018 01:55:24 +1100
> Harshula via Unicode  wrote:
> 
>> 3) However, what you have observed is an issue with *explicit*
>> conjunct creation. After the segmentation is completed, the
>> layout/shaping engine needs to first check if there is a
>> corresponding lookup for the explicit conjunct, if not, then it needs
>> to remove the ZWJ and redo the segmentation and lookup(s). Perhaps
>> that is not happening in Harfbuzz.
> 
> This indeed seems to be the problem with HarfBuzz and with Windows 7
> Uniscribe.  Curiously, they almost adopt this behaviour when touching
> letters are not available.  (The ZWJ seems not to be completely removed
> - in HarfBuzz at least it can result in the al-lakuna not interacting
> properly with the base character.)
> 
> But where is this usually useful behaviour specified?
> 
> 1.  There may be nothing but time and money to stop fallbacks being
> built into the font.  For example, what prohibits the rendering of a
> conjunct falling back to touching letters or a missing glyph symbol?

I had not considered the missing glyph symbol. Perhaps that is the most
accurate solution when a font is missing a glyph during an *explicit*
conjunct lookup.

Note, touching letters are formed by <ZWJ, al-lakuna>, so they should
not be displayed as a fallback for <al-lakuna, ZWJ> conjuncts.
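The fallback described in point 3 of the quoted text (when the font has no lookup for an explicit conjunct, drop the ZWJ and shape again) might be sketched as follows. This is not how HarfBuzz or Uniscribe are implemented: fallback_text is an invented helper, and the set font_conjuncts is a hypothetical stand-in for the font's actual substitution lookups.

```python
ZWJ = "\u200d"        # ZERO WIDTH JOINER
AL_LAKUNA = "\u0dca"  # SINHALA SIGN AL-LAKUNA
CONJUNCT_JOIN = AL_LAKUNA + ZWJ  # marks an explicit conjunct request


def fallback_text(cluster, font_conjuncts):
    """Return the text to shape for one consonant cluster.

    If the cluster asks for an explicit conjunct that the font cannot
    form, the ZWJ is dropped so the al-lakuna is displayed visibly.
    Touching-letter requests (<ZWJ, al-lakuna>) are left unchanged.
    """
    if CONJUNCT_JOIN not in cluster or cluster in font_conjuncts:
        return cluster
    return cluster.replace(CONJUNCT_JOIN, AL_LAKUNA)


DA, DHA = "\u0daf", "\u0db0"  # example consonants
conjunct = DA + AL_LAKUNA + ZWJ + DHA
touching = DA + ZWJ + AL_LAKUNA + DHA

print(fallback_text(conjunct, font_conjuncts=set()) == DA + AL_LAKUNA + DHA)
print(fallback_text(conjunct, font_conjuncts={conjunct}) == conjunct)
print(fallback_text(touching, font_conjuncts=set()) == touching)
```

Returning a missing glyph symbol instead would be a one-line change in the final return statement; the sketch only shows the ZWJ-removal variant.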

cya,
#


RE: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Tex via Unicode
Philippe, quote the entire section:

 

   In some circumstances, the use of padding ("=") in base-encoded data
   is not required or used.  In the general case, when assumptions about
   the size of transported data cannot be made, padding is required to
   yield correct decoded data.

   Implementations MUST include appropriate pad characters at the end of
   encoded data unless the specification referring to this document
   explicitly states otherwise.

 

The first para clarifies that padding is required when the length is not 
otherwise known. Only if the length is provided or predefined can the padding 
be dropped.

The second para clarifies it must be included unless the higher level protocol 
states otherwise, in which case it is likely using another mechanism to define 
length.
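One situation in which the size cannot be assumed is when separately encoded pieces are concatenated into one stream. The Python sketch below shows this under that assumption only; the helper names encode and decode_quanta are invented for the illustration.

```python
import base64


def encode(piece: bytes, pad: bool) -> str:
    """Base64-encode one piece, optionally stripping the "=" padding."""
    text = base64.b64encode(piece).decode("ascii")
    return text if pad else text.rstrip("=")


def decode_quanta(stream: str) -> bytes:
    """Decode a stream four characters (one quantum) at a time."""
    out = b""
    for i in range(0, len(stream), 4):
        out += base64.b64decode(stream[i:i + 4])
    return out


pieces = [b"a", b"b"]
padded = "".join(encode(p, pad=True) for p in pieces)
unpadded = "".join(encode(p, pad=False) for p in pieces)

print(padded, decode_quanta(padded))      # piece boundaries survive
print(unpadded, decode_quanta(unpadded))  # boundaries are lost
```

With padding, every piece occupies whole four-character quanta and the original two octets come back. Without it, the four remaining characters form a single valid quantum that decodes to three unrelated octets, with no error raised.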

 

It doesn’t seem to me to be as open ended as you implied in your initial mails, 
but well-defined depending on whether base64 is being used as spec’d in the 
RFC, or being explicitly modified to suit an embedding protocol.

And certainly the first sentence in this section isn’t intended to be taken 
without the context of the rest of the section.

 

tex

 

 

 

From: Philippe Verdy [mailto:verd...@wanadoo.fr] 
Sent: Monday, October 15, 2018 4:14 AM
To: Tex Texin
Cc: Adam Borowski; unicode Unicode Discussion
Subject: Re: Base64 encoding applied to different unicode texts always yields 
different base64 texts ... true or false?

 

See https://tools.ietf.org/html/rfc4648, section 3.2, paragraph 1, first
sentence, where it is explicitly stated:

 

In some circumstances, the use of padding ("=") in base-encoded data is not 
required or used.

 

On Mon, 15 Oct 2018 at 03:56, Tex wrote:

Philippe,

 

Where is the use of whitespace or the idea that 1-byte pieces do not need all 
the equal sign paddings documented?

I read the rfc 3501 you pointed at, I don’t see it there.

 

Are these part of any standards? Or are you claiming these are practices 
despite the standards? If so, are these just tolerated by parsers, or are they 
actually generated by encoders?

 

What would be the rationale for supporting unnecessary whitespace? If 
linebreaks are forced at some line length they can presumably be removed at 
that length and not treated as part of the encoding.

Maybe we differ on where we define the encoding to begin and end, and where
higher-level protocols prescribe how it is embedded within the protocol.

 

Tex

 

 

 

 

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe Verdy 
via Unicode
Sent: Sunday, October 14, 2018 1:41 AM
To: Adam Borowski
Cc: unicode Unicode Discussion
Subject: Re: Base64 encoding applied to different unicode texts always yields 
different base64 texts ... true or false?

 

Note that 1-byte pieces do not need to be padded by 2 "=" signs; 1 is enough
to indicate the end of an octet span. The extra "=" after it does not add any
other octet. You are also allowed to insert whitespace anywhere in the
encoded stream; this is what ensures that the Base64-encoded octet stream
will not be altered if line breaks are forced anywhere (notably within the
body of emails).

So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR, LF,
NEL) in the middle is non-significant and ignorable on decoding: its
"encoded" bit length is 0 and it does not terminate an octet span, unlike
"=", which discards any bits remaining from the encoded stream before it
that are not on 8-bit boundaries.

 

Also:

- For 1-octet pieces the minimum format is "XX= ", but the 2nd "X" symbol
before "=" can vary in its 4 lowest bits (which are then ignored/discarded by
the "=" symbol).

- For 2-octet pieces the minimum format is "XXX= ", but the 3rd "X" symbol
before "=" can vary in its 2 lowest bits (which are then ignored/discarded by
the "=" symbol).
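The effect of the ignored low bits can be observed with a lenient decoder. The sketch below uses Python's standard base64 module in its default mode, which is only one example of such a decoder; the helper name is_canonical and the list of variants are made up for this illustration, and fully padded forms are used because that is what this decoder accepts.

```python
import base64


def is_canonical(text: str) -> bool:
    """True if re-encoding the decoded octets reproduces the same text."""
    decoded = base64.b64decode(text)
    return base64.b64encode(decoded).decode("ascii") == text


# The second symbol differs only in its 4 lowest bits in each variant.
variants = ["YQ==", "YR==", "YV==", "Yf=="]

for text in variants:
    print(text, base64.b64decode(text), is_canonical(text))
```

All four texts decode to the same single octet, but only the one whose unused bits are zero survives a decode and re-encode round trip. Different octet streams therefore always give different canonical encodings, while the reverse direction is not one-to-one for lenient decoders.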

 

So you can use Base64 by encoding each octet in a separate piece, as two
Base64 symbols followed by an "=" symbol, and even insert any number of
whitespace characters between them: there is an infinite number of valid
Base64 encodings representing the same octet-stream payload.

 

Base64 allows encoding any octet stream but not directly any bit stream:
it assumes that the effective bit stream has a length in bits that is a
multiple of 8. To encode a bit stream with an exact number of bits (not a
multiple of 8), you need to encode an extra payload to indicate the effective
number of bits to keep at the end of the encoded octet stream (or at the
start):

- Base64 does not specify how you convert a bitstream of arbitrary length to
an octet stream;

- for that purpose, you may need to pad the bit stream at the start or at the
end with 1 to 7 bits (so that the resulting bitstream has a length that is a
multiple of 8, and is then encodable with Base64, which takes only octets as
input).

- these extra padding bits are not significant for the original bitstream, but 
are significant for the Base64 encoder/decoder, they will be discarded by the 

Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Philippe Verdy via Unicode
Note that all this discussion about padding applies to all other base-N
encodings, including base-10.

For example, to represent numbers of arbitrary precision, padding does not
require a separate symbol but can use the "0" digit, which is part of the
10-symbol alphabet, or encoders can discard such digits on the left, or on
the right if there is a decimal point. When the precision is less than an
integral number of decimal digits, the extra bits or fractional bits of
information in the last digit of the encoded sequence do not matter:
encoders may choose not to set them to 0 and may prefer rounding, which may
conditionally set these bits to 1, depending on the value of the last
significant bits or fractional bits of maximum precision.

Likewise, the same decoders may want to allow extra whitespace (notably to
limit line lengths to arbitrary values, for embedding the encoded sequences
in printed documents or documents with a page layout rendered with a
readable font size suitable for the page width, or for presentation
purposes, by grouping symbols).

In summary, not all Base-N encoders/decoders require padding, and
non-significant whitespace is frequently needed.
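Wrapping at an arbitrary width and undoing it on input takes only a few lines. In this Python sketch the function names encode_wrapped and decode_wrapped and the default width of 64 are arbitrary choices for illustration, not values taken from any specification.

```python
import base64


def encode_wrapped(data: bytes, width: int = 64) -> str:
    """Base64-encode data and break the text into lines of at most width."""
    text = base64.b64encode(data).decode("ascii")
    return "\n".join(text[i:i + width] for i in range(0, len(text), width))


def decode_wrapped(text: str) -> bytes:
    """Remove all whitespace, then decode what is left."""
    return base64.b64decode("".join(text.split()))


data = bytes(range(256))
wrapped = encode_wrapped(data, width=20)
print(wrapped.splitlines()[0])
print(decode_wrapped(wrapped) == data)
```

Because the whitespace is removed before decoding, the chosen line width has no influence on the decoded octets.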



Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Philippe Verdy via Unicode
If you want an example where padding with "=" is not used at all:
- look into URL-shortening schemes;
- look into database fields, data input forms and numerous data formats
  where the "=" sign is restricted (just as in URLs, file paths or
  identifiers).

Padding is not used anywhere in the middle of the binary encoding, or even
at the end: only the 64 symbols of the encoding alphabet are needed, and
the extra 2 or 4 lowest bits that may be encoded in the last character of
the encoded sequence are discarded by the decoder (encoders do not
necessarily set these extra bits to 0 in the last symbol, even if this is
the canonical form recommended for encoders; their value is simply ignored
by decoders).

Some Base64 encoders do not necessarily encode binary octet streams, but
bit streams whose length in bits is not necessarily a multiple of 8, in
which case there may be 1 to 7 trailing bits (not just 2 or 4) in the last
symbol of the encoded sequence.

Other encoders use streams of binary code units that are larger than 8
bits, and may want to encode more padding symbols to force the alignment of
data required by their associated decoders, or will choose not to use any
padding at all, letting the decoder discard the trailing bits itself at the
end of the encoded stream.
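A padding-free profile of the kind used in URLs can be sketched with Python's standard base64 module. The function names encode_unpadded and decode_unpadded are invented here; the decoder simply restores the "=" characters that the length of the text implies before handing it to the library.

```python
import base64


def encode_unpadded(data: bytes) -> str:
    """URL-safe Base64 without any trailing "=" characters."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def decode_unpadded(text: str) -> bytes:
    """Re-add the padding implied by the text length, then decode."""
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


for payload in (b"", b"a", b"ab", b"abc", b"\xff\xfe\xfd\xfc"):
    token = encode_unpadded(payload)
    print(repr(token), decode_unpadded(token) == payload)
```

This works because the total length of the token is known to the decoder, which is the situation in a URL path segment or a database field; it does not by itself say anything about streams whose end is not otherwise delimited.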


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Philippe Verdy via Unicode
Also, the rationale for supporting "unnecessary" whitespace is found in
MIME's version of Base64, and also in the RFCs describing encoding formats
for digital certificates or for exchanging public keys in encryption
systems like PGP (notably, but not only, as text in the body of emails or
in documentation and websites).

Le lun. 15 oct. 2018 à 03:56, Tex  a écrit :

> Philippe,
>
>
>
> Where is the use of whitespace or the idea that 1-byte pieces do not need
> all the equal sign paddings documented?
>
> I read the rfc 3501 you pointed at, I don’t see it there.
>
>
>
> Are these part of any standards? Or are you claiming these are practices
> despite the standards? If so, are these just tolerated by parsers, or are
> they actually generated by encoders?
>
>
>
> What would be the rationale for supporting unnecessary whitespace? If
> linebreaks are forced at some line length they can presumably be removed at
> that length and not treated as part of the encoding.
>
> Maybe we differ on define where the encoding begins and ends, and where
> higher level protocols prescribe how they are embedded within the protocol.
>
>
>
> Tex
>
>
>
>
>
>
>
>
>
> *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Philippe
> Verdy via Unicode
> *Sent:* Sunday, October 14, 2018 1:41 AM
> *To:* Adam Borowski
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Base64 encoding applied to different unicode texts always
> yields different base64 texts ... true or false?
>
>
>
> Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
> enough to indicate the end of an octets-span. The extra = after it does not
> add any other octet. You are also allowed to insert whitespace anywhere in
> the encoded stream; this is what ensures that the Base64-encoded
> octets-stream will not be altered if line breaks are forced anywhere
> (notably within the body of emails).
>
>
>
> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR,
> LF, NEL) in the middle is non-significant and ignorable on decoding (their
> "encoded" bit length is 0 and they don't terminate an octets-span, unlike
> "=" which discards extra bits remaining from the encoded stream before that
> are not on 8-bit boundaries).
>
>
>
> Also:
>
> - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol
> before "=" can vary in its 4 lowest bits (which are then ignored/discarded
> by the "=" symbol)
>
> - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X"
> symbol before "=" can vary in its 2 lowest bits (which are then
> ignored/discarded by the "=" symbol)
>
>
>
> So you can use Base64 by encoding each octet in separate pieces, as two
> Base64 symbols followed by an "=" symbol, and even insert any amount of
> whitespace between them: there's an infinite number of valid Base64
> encodings for representing the same octets-stream payload.
>
>
>
> Base64 allows encoding any octets streams but not directly any
> bits-streams : it assumes that the effective bits-stream has a binary
> length multiple of 8. To encode a bits-stream with an exact number of bits
> (not multiple of 8), you need to encode an extra payload to indicate the
> effective number of bits to keep at end of the encoded octets-stream (or at
> start):
>
> - Base64 does not specify how you convert a bitstream of arbitrary length
> to an octets-stream;
>
> - for that purpose, you may need to pad the bits-stream at start or at end
> with 1 to 7 bits (so that the resulting bitstream has a length multiple
> of 8, then encodable with Base64 which takes only octets on input).
>
> - these extra padding bits are not significant for the original bitstream,
> but are significant for the Base64 encoder/decoder, they will be discarded
> by the bitstream decoder built on top of the Base64 decoder, but not by the
> Base64 decoder itself.
>
>
>
> You need to encode somewhere with the bitstream encoder how many padding
> bits (0 to 7) are present at start or end of the octets-stream; this can be
> done:
>
> - as a separate payload (not encoded by Base64), or
>
> - by prepending 3 bits at start of the bits-stream then padded at end with
> 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64
> encoding.
>
> - by appending 3 bits at end of the  bits-stream, just after 1 to 7 random
> bits needed to get a bit-length multiple of 8 suitable for Base64 encoding.
>
> Finally your bits-stream decoder will be able to use this padding count to
> discard these random padding bits (and possibly realign the stream on
> different byte-boundaries when the effective bitlength bits-stream payload
> is not a multiple of 8 and padding bits were added)
>
>
>
> Base64 also does not specify how bits of the original bits-stream payload
> are packed into the octets-stream input suitable for Base64-encoding,
> notably it does not specify their order and endian-ness. The same remark
> applies as well for MIME, HTTP. So lot of network protocols and file
> 

Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Philippe Verdy via Unicode
Look into https://tools.ietf.org/html/rfc4648, section 3.2, paragraph 1,
1st sentence, where it is explicitly stated:

In some circumstances, the use of padding ("=") in base-encoded data
is not required or used.
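
Python's standard decoder happens to insist on the padding, so a sketch of
reading such unpadded data has to restore it first. `decode_unpadded` is a
hypothetical helper name, not part of any library or of the RFC:

```python
import base64

def decode_unpadded(text: str) -> bytes:
    """Decode Base64 text whose trailing "=" padding was omitted."""
    if len(text) % 4 == 1:
        # One leftover symbol carries only 6 bits, never a whole octet.
        raise ValueError("not a valid unpadded Base64 length")
    # Restore the omitted "=" so the standard decoder accepts the input.
    return base64.b64decode(text + "=" * (-len(text) % 4))

# Padding-free encoding of the seven octets of "Unicode".
sample = base64.b64encode(b"Unicode").decode("ascii").rstrip("=")
```

Calling `decode_unpadded(sample)` gives the original seven octets back, since
the length of the text alone tells the decoder how many "=" were dropped.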


On Mon, 15 Oct 2018 at 03:56, Tex wrote:

> Philippe,
>
>
>
> Where is the use of whitespace or the idea that 1-byte pieces do not need
> all the equal sign paddings documented?
>
> I read the rfc 3501 you pointed at, I don’t see it there.
>
>
>
> Are these part of any standards? Or are you claiming these are practices
> despite the standards? If so, are these just tolerated by parsers, or are
> they actually generated by encoders?
>
>
>
> What would be the rationale for supporting unnecessary whitespace? If
> linebreaks are forced at some line length they can presumably be removed at
> that length and not treated as part of the encoding.
>
> Maybe we differ on where the encoding begins and ends, and where
> higher level protocols prescribe how they are embedded within the protocol.
>
>
>
> Tex
>
>
>
>
>
>
>
>
>
> *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Philippe
> Verdy via Unicode
> *Sent:* Sunday, October 14, 2018 1:41 AM
> *To:* Adam Borowski
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Base64 encoding applied to different unicode texts always
> yields different base64 texts ... true or false?
>
>
>
> Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
> enough to indicate the end of an octets-span. The extra = after it does not
> add any other octet. You are also allowed to insert whitespace anywhere in
> the encoded stream; this is what ensures that the Base64-encoded
> octets-stream will not be altered if line breaks are forced anywhere
> (notably within the body of emails).
>
>
>
> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR,
> LF, NEL) in the middle is non-significant and ignorable on decoding (their
> "encoded" bit length is 0 and they don't terminate an octets-span, unlike
> "=" which discards extra bits remaining from the encoded stream before that
> are not on 8-bit boundaries).
>
>
>
> Also:
>
> - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol
> before "=" can vary in its 4 lowest bits (which are then ignored/discarded
> by the "=" symbol)
>
> - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X"
> symbol before "=" can vary in its 2 lowest bits (which are then
> ignored/discarded by the "=" symbol)
>
>
>
> So you can use Base64 by encoding each octet in separate pieces, as two
> Base64 symbols followed by an "=" symbol, and even insert any amount of
> whitespace between them: there's an infinite number of valid Base64
> encodings for representing the same octets-stream payload.
>
>
>
> Base64 allows encoding any octets streams but not directly any
> bits-streams : it assumes that the effective bits-stream has a binary
> length multiple of 8. To encode a bits-stream with an exact number of bits
> (not multiple of 8), you need to encode an extra payload to indicate the
> effective number of bits to keep at end of the encoded octets-stream (or at
> start):
>
> - Base64 does not specify how you convert a bitstream of arbitrary length
> to an octets-stream;
>
> - for that purpose, you may need to pad the bits-stream at start or at end
> with 1 to 7 bits (so that the resulting bitstream has a length multiple
> of 8, then encodable with Base64 which takes only octets on input).
>
> - these extra padding bits are not significant for the original bitstream,
> but are significant for the Base64 encoder/decoder, they will be discarded
> by the bitstream decoder built on top of the Base64 decoder, but not by the
> Base64 decoder itself.
>
>
>
> You need to encode somewhere with the bitstream encoder how many padding
> bits (0 to 7) are present at start or end of the octets-stream; this can be
> done:
>
> - as a separate payload (not encoded by Base64), or
>
> - by prepending 3 bits at start of the bits-stream then padded at end with
> 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64
> encoding.
>
> - by appending 3 bits at end of the  bits-stream, just after 1 to 7 random
> bits needed to get a bit-length multiple of 8 suitable for Base64 encoding.
>
> Finally your bits-stream decoder will be able to use this padding count to
> discard these random padding bits (and possibly realign the stream on
> different byte-boundaries when the effective bitlength bits-stream payload
> is not a multiple of 8 and padding bits were added)
>
>
>
> Base64 also does not specify how bits of the original bits-stream payload
> are packed into the octets-stream input suitable for Base64-encoding,
> notably it does not specify their order and endian-ness. The same remark
> applies as well for MIME, HTTP. So lot of network protocols and file
> formats need to how to properly encode which possible option is used to
> encode bits-streams of arbitrary length, 

Re: Fallback for Sinhala Consonant Clusters

2018-10-15 Thread Richard Wordingham via Unicode
On Mon, 15 Oct 2018 01:55:24 +1100
Harshula via Unicode  wrote:

> 3) However, what you have observed is an issue with *explicit*
> conjunct creation. After the segmentation is completed, the
> layout/shaping engine needs to first check if there is a
> corresponding lookup for the explicit conjunct, if not, then it needs
> to remove the ZWJ and redo the segmentation and lookup(s). Perhaps
> that is not happening in Harfbuzz.

This indeed seems to be the problem with HarfBuzz and with Windows 7
Uniscribe.  Curiously, they almost adopt this behaviour when touching
letters are not available.  (The ZWJ seems not to be completely removed
- in HarfBuzz at least it can result in the al-lakuna not interacting
properly with the base character.)
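
As a rough model of the fallback Harshula describes (this is not HarfBuzz or
Uniscribe code; `fallback_cluster` and the `supported` collection are
hypothetical names), assuming the explicit conjunct is encoded as al-lakuna
followed by ZWJ:

```python
from typing import Collection

AL_LAKUNA = "\u0DCA"  # SINHALA SIGN AL-LAKUNA
ZWJ = "\u200D"        # ZERO WIDTH JOINER

def fallback_cluster(cluster: str, supported: Collection[str]) -> str:
    """Return the character sequence to shape for one cluster.

    `supported` stands in for the conjuncts the font has lookups for.
    """
    if cluster in supported:
        return cluster  # the font can form the requested conjunct
    # Otherwise drop the ZWJ that requested the conjunct, leaving the
    # plain consonant + al-lakuna sequence with a visible al-lakuna.
    return cluster.replace(AL_LAKUNA + ZWJ, AL_LAKUNA)

# 'ddho' requested as a conjunct: da, al-lakuna, ZWJ, dha, vowel sign o.
ddho_conjunct = "\u0DAF" + AL_LAKUNA + ZWJ + "\u0DB0\u0DDC"
```

With an empty `supported` collection the sketch returns the same syllable
without the ZWJ, which is the visible-al-lakuna spelling.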

But where is this usually useful behaviour specified?

1.  There may be nothing but time and money to stop fallbacks being
built into the font.  For example, what prohibits the rendering of a
conjunct falling back to touching letters or a missing glyph symbol?

2. One could argue that the current behaviour falls back to a  display; Pali in Thai script does use sequences
of . The problem is
that al-lakuna also acts as a vowel modifier.

3. What stops one arguing that a conjunct is an abstract
character and that to render it with a sequence using a visible
al-lakuna would violate its identity? 

Richard.