RE: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?
Philippe,

Where is the use of whitespace, or the idea that 1-byte pieces do not need all the equal-sign padding, documented? I read RFC 3501, which you pointed at, and I don't see it there.

Are these part of any standard? Or are you claiming these are practices despite the standards? If so, are they just tolerated by parsers, or are they actually generated by encoders?

What would be the rationale for supporting unnecessary whitespace? If line breaks are forced at some line length, they can presumably be removed at that length and not treated as part of the encoding.

Maybe we differ on where we define the encoding to begin and end, and on where higher-level protocols prescribe how they are embedded within the protocol.

Tex

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe Verdy via Unicode
Sent: Sunday, October 14, 2018 1:41 AM
To: Adam Borowski
Cc: unicode Unicode Discussion
Subject: Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is enough to indicate the end of an octets-span. The extra = after it do not add any other octet. and as well you're allowed to insert whitespaces anywhere in the encoded stream (this is what ensures that the Base64-encoded octets-stream will not be altered if line breaks are forced anywhere (notably within the body of emails). So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR, LF, NEL) in the middle is non-significant and ignorable on decoding (their "encoded" bit length is 0 and they don't terminate an octets-span, unlike "=" which discards extra bits remaining from the encoded stream before that are not on 8-bit boundaries).
Also: - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol before "=" can vary in its 4 lowest bits (which are then ignored/discarded by the "=" symbol) - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X" symbol before "=" can vary in its 2 lowest bits (which are then ignored/discarded by the "=" symbol) So you can use Base64 by encoding each octet in separate pieces, as one Base64 symbol followed by an "=" symbol, and even insert any number of whitespaces between them: there's a infinite number of valid Base64 encodings for representing the same octets-stream payload. Base64 allows encoding any octets streams but not directly any bits-streams : it assumes that the effective bits-stream has a binary length multiple of 8. To encode a bits-stream with an exact number of bits (not multiple of 8), you need to encode an extra payload to indicate the effective number of bits to keep at end of the encoded octets-stream (or at start): - Base64 does not specify how you convert a bitstream of arbitrary length to an octets-stream; - for that purpose, you may need to pad the bits-stream at start or at end with 1 to 6 bits (so that it the resulting bitstream has a length multiple of 8, then encodable with Base64 which takes only octets on input). - these extra padding bits are not significant for the original bitstream, but are significant for the Base64 encoder/decoder, they will be discarded by the bitstream decoder built on top of the Base64 decoder, but not by the Base64 decoder itself. You need to encode somewhere with the bitstream encoder how many padding bits (0 to 7) are present at start or end of the octets-stream; this can be done: - as a separate payload (not encoded by Base64), or - by prepending 3 bits at start of the bits-stream then padded at end with 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64 encoding. 
- by appending 3 bits at end of the bits-stream, just after 1 to 7 random bits needed to get a bit-length multiple of 8 suitable for Base64 encoding. Finally your bits-stream decoder will be able to use this padding count to discard these random padding bits (and possibly realign the stream on different byte-boundaries when the effective bitlength bits-stream payload is not a multiple of 8 and padding bits were added) Base64 also does not specify how bits of the original bits-stream payload are packed into the octets-stream input suitable for Base64-encoding, notably it does not specify their order and endian-ness. The same remark applies as well for MIME, HTTP. So lot of network protocols and file formats need to how to properly encode which possible option is used to encode bits-streams of arbitrary length, or need to specify which default choice to apply if this option is not encoded, or which option must be used (with no possible variation). And this also adds to the number of distinct encodings that are possible but are still equivalent for the same effective bits-stream payload. All these allowed variations are from the encoder perspective. For interoperability, the decoder has to be flexible and
Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?
On Sun, 14 Oct 2018 at 21:21, Doug Ewell via Unicode wrote:

> Steffen Nurpmeso wrote:
>
> > Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
> > (MIME) Part One: Format of Internet Message Bodies).
>
> Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
> Encodings." RFC 2045 defines a particular implementation of base64,
> specific to transporting Internet mail in a 7-bit environment.

Wrong, this is "specific" to transporting Internet mail in any 7-bit or 8-bit environment (today almost all mail agents operate in 8-bit), and it is then referenced directly by HTTP (and its HTTPS variant). So this is not so "specific".

MIME is extremely popular; RFC 4648 is extremely exotic (and RFC 4648 is wrong when it says that IMAP is very specific, as it is now a very popular protocol, widely used as well). MIME is so frequently used that almost all people refer to it when they look for Base64, or do not explicitly state that another definition (found in an exotic RFC) is explicitly used.
Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?
It's also interesting to look at https://tools.ietf.org/html/rfc3501
- which defines (for IMAP v4) another "BASE64" encoding,
- and also defines a "Modified UTF-7" encoding using it, deviating from Unicode's definition of UTF-7,
- and adds other requirements (which forbid alternate encodings permitted in UTF-7 and all other Base64 variants, including those used in MIME/RFC 2045 or SMTP, which are used in close relation with IMAP!).

And nothing in RFC 4648 is clear about the fact that it only covers the encoding of "octets streams" and not "bits streams". It also does not discuss the adaptation of "Base64" for transport and storage (needed for MIME and IMAP, but also in HTTP, and in several file/data formats including XML, or digital signatures).

RFC 4648 is only superficial and does not cover everything (even Unicode has its own definition for UTF-7 and also allows variations).

As we are on this Unicode list, note that the definition used by Unicode (more in line with MIME) does not follow those in RFC 4648 at all. Most uses of Base64 encodings are based on the original MIME definition, and all of them perform new adaptations. (Even the definition of "Base16" in RFC 4648 contradicts most other definitions.)

On Sun, 14 Oct 2018 at 21:21, Doug Ewell via Unicode wrote:

> Steffen Nurpmeso wrote:
>
> > Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
> > (MIME) Part One: Format of Internet Message Bodies).
>
> Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
> Encodings." RFC 2045 defines a particular implementation of base64,
> specific to transporting Internet mail in a 7-bit environment.
>
> RFC 4648 discusses many of the "higher-level protocol" topics that some
> people are focusing on, such as separating the base64-encoded output
> into lines of length 72 (or other), alternative target code unit sets or
> "alphabets," and padding characters. It would be helpful for everyone to
> read this particular RFC before concluding that these topics have not
> been considered, or that they compromise round-tripping or other
> characteristics of base64.
>
> I had assumed that when Roger asked about "base64 encoding," he was
> asking about the basic definition of base64.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?
Steffen Nurpmeso wrote:

> Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
> (MIME) Part One: Format of Internet Message Bodies).

Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data Encodings." RFC 2045 defines a particular implementation of base64, specific to transporting Internet mail in a 7-bit environment.

RFC 4648 discusses many of the "higher-level protocol" topics that some people are focusing on, such as separating the base64-encoded output into lines of length 72 (or other), alternative target code unit sets or "alphabets," and padding characters. It would be helpful for everyone to read this particular RFC before concluding that these topics have not been considered, or that they compromise round-tripping or other characteristics of base64.

I had assumed that when Roger asked about "base64 encoding," he was asking about the basic definition of base64.

--
Doug Ewell | Thornton, CO, US | ewellic.org
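
As a rough illustration of the "basic definition" reading of the question, the sketch below uses Python's standard `base64` module, which implements the RFC 4648 alphabets. The helper name `encode_text`, the choice of UTF-8 as the text encoding, and the sample strings are assumptions made for the example, not anything specified in this thread. With a fixed text encoding and canonical encoder output, distinct texts give distinct Base64 strings and decoding restores the original; line wrapping and the URL-safe alphabet change the surface form only.

```python
import base64

def encode_text(text: str) -> str:
    """Canonical Base64 of the UTF-8 bytes of `text` (hypothetical helper)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

# The last two samples are canonically equivalent but are different
# code point sequences (precomposed vs. decomposed), so they differ too.
samples = ["Hello", "Hellp", "\u00e9", "e\u0301"]
encoded = [encode_text(s) for s in samples]

all_distinct = len(set(encoded)) == len(samples)
round_trips = all(
    base64.b64decode(e).decode("utf-8") == s for s, e in zip(samples, encoded)
)

# MIME-style output: same symbols, with a line break after every 76 characters.
data = b"x" * 100
wrapped = base64.encodebytes(data)
unwrapped_matches = wrapped.replace(b"\n", b"") == base64.b64encode(data)

# Alternative "URL-safe" alphabet from RFC 4648: only symbols 62 and 63 change.
standard = base64.b64encode(b"\xfb\xff")
url_safe = base64.urlsafe_b64encode(b"\xfb\xff")
```

The comparison only holds once the text-to-octets step is fixed; the same text encoded as UTF-16 would produce a different Base64 string.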
Re: Fallback for Sinhala Consonant Clusters
Hi Richard, 1) From a pronunciation perspective, your first and third examples will be similar. Your second example will be pronounced very differently. I did some quick testing on Linux and reproduced the behaviour that you observed. 2) Going back more than a decade, the state tables used by some layout/shaping engines used the same 'virama' rules for North Indian scripts and Sinhala. This resulted in undesirable *implicit* conjuncts being created for Sinhala consonant clusters. That then resulted in undesirable positioning of dependent vowels. e.g. https://bugzilla.gnome.org/show_bug.cgi?id=161981 3) However, what you have observed is an issue with *explicit* conjunct creation. After the segmentation is completed, the layout/shaping engine needs to first check if there is a corresponding lookup for the explicit conjunct, if not, then it needs to remove the ZWJ and redo the segmentation and lookup(s). Perhaps that is not happening in Harfbuzz. 4) I've been out of the loop for many years, so I have CC'd Ruvan & Harsha who may already be aware of what you have observed. cya, # On 14/10/18 11:02 am, Richard Wordingham via Unicode wrote: > Are there fallback rules for Sinhala consonant clusters? There are > fallback rules for Devanagari, but I'm not sure if they read across. > > The problem I am seeing is that the Pali syllable 'ndhe' න්ධෙ NAYANNA, U+0DCA AL-LAKUNA, 200D ZWJ, U+0DB0 MAHAPRAANA DAYANNA, U+0DD9 > KOMBUVA> is being rendered identically to a hypothetical Sinhalese > 'nēdha' නේධ , which in NFD is > , when I use a font that lacks the > conjunct. (Most fonts lack the conjunct.) The Devanagari rules and my > preference would lead to a fallback rendering as න්ධෙ (Sinhalese > 'ndhe'), which is encoded as MAHAPRAANA DAYANNA, U+0DD9 KOMBUVA>. Is the rendering I am getting > technically wrong, or is it merely undesirable? > > The ambiguity arises in part because, like the Brahmi script, the > Sinhala script uses its virama character as a vowel length indicator. 
> > Missing touching consonants are being rendered almost as though there > were no ZWJ, but the combination of consonant and al-lakuna is being > rendered badly. > > Richard. >
Re: Fallback for Sinhala Consonant Clusters
On Sun, 14 Oct 2018 17:15:26 +0900 "Martin J. Dürst via Unicode" wrote: > Hello Richard, > > On 2018/10/14 09:02, Richard Wordingham via Unicode wrote: > > Are there fallback rules for Sinhala consonant clusters? There are > > fallback rules for Devanagari, but I'm not sure if they read across. > > > > The problem I am seeing is that the Pali syllable 'ndhe' න්ධෙ > > > DAYANNA, U+0DD9 > > KOMBUVA> > > Let's label this as (1) > > > is being rendered identically to a hypothetical Sinhalese > > 'nēdha' නේධ , > > It (2) doesn't look identically to (1) here (Thunderbird on Win 8.1). > > Your mail is written as if you are speaking about a general > phenomenon, but I guess there are differences depending on the font > and rendering stack. The critical one is whether the font has the conjunct. The default Sinhala font on supported Windows, Iskoola Pota, has the conjunct. For an example that should illustrate my points with that font (at least, as on Windows 7) and the HarfBuzz renderer (as I believe in Thunderbird), we have 1') Pali thve ථ්වෙ It's a very rare syllable - it only occurs in sandhi, and I have only a single example. Iskoola Pota has neither the conjunct nor the touching form; I would actually expect it to be the touching form that exists. 2') Misleading look-alike thēva ථේව 3') Preferred fallback appearance thve ථ්වෙ . My question is, 'What should a rendering stack that claims to support the Sinhala script display when it lacks the conjunct in the font being used?' Now what does get displayed does depend on the rendering stack. HarfBuzz (e.g. Firefox, Google Chrome, LibreOffice, and most Linux) and Notepad on Windows 7 move the vowel to the left and display al-lakuna, the display I object to. iPhone and Notepad on Windows 10 display the vowel in the middle and display al-lakuna (possibly ligated), which is the solution I prefer. > Hope this helps. Well, it has prompted me to find a 'me-too' argument for improving the rendering. 
I wanted a standards-based argument. >> Missing touching consonants are being rendered almost as though >> there were no ZWJ, but the combination of consonant and al-lakuna >> is being rendered badly. This looks like a common font problem. Iskoola Pota does not suffer from it. Richard.
Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?
Note that 1-byte pieces do not need to be padded by 2 "=" signs; only 1 is enough to indicate the end of an octets-span. The extra "=" after it does not add any other octet. You are also allowed to insert whitespace anywhere in the encoded stream; this is what ensures that the Base64-encoded octets-stream will not be altered if line breaks are forced anywhere (notably within the body of emails).

So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR, LF, NEL) in the middle is non-significant and ignorable on decoding: its "encoded" bit length is 0 and it does not terminate an octets-span, unlike "=", which discards any extra bits left over from the preceding encoded stream that are not on 8-bit boundaries.

Also:
- For 1-octet pieces the minimum format is "XX=", but the 2nd "X" symbol before "=" can vary in its 4 lowest bits (which are then ignored/discarded by the "=" symbol).
- For 2-octet pieces the minimum format is "XXX=", but the 3rd "X" symbol before "=" can vary in its 2 lowest bits (which are then ignored/discarded by the "=" symbol).

So you can use Base64 by encoding each octet in a separate piece, as two Base64 symbols followed by an "=" symbol, and even insert any amount of whitespace between them: there is an infinite number of valid Base64 encodings representing the same octets-stream payload.

Base64 allows encoding any octets-stream but not directly any bits-stream: it assumes that the effective bits-stream has a binary length that is a multiple of 8.
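
How much of this flexibility a given decoder accepts varies between implementations. A minimal sketch, assuming CPython's standard `base64` module as the decoder under test (other libraries may well behave differently; the helper name `accepts` is hypothetical): in its default lenient mode it skips embedded line breaks and ignores nonzero discarded bits, but it rejects a single "=" where the canonical form has two, and with `validate=True` it rejects the line breaks as well.

```python
import base64
import binascii

def accepts(encoded: str, strict: bool = False) -> bool:
    """Return True if CPython's decoder accepts `encoded` (hypothetical helper)."""
    try:
        base64.b64decode(encoded, validate=strict)
        return True
    except binascii.Error:
        return False

canonical = base64.b64encode(b"Hello").decode("ascii")
with_breaks = "SGVs\r\nbG8="          # same symbols with a CRLF inserted

# Lenient mode discards characters outside the alphabet before decoding.
lenient_whitespace = base64.b64decode(with_breaks) == b"Hello"
# Strict mode treats the CRLF as an error.
strict_whitespace = accepts(with_breaks, strict=True)

# "QQ==" is the canonical form of b"A"; "QR==" differs only in the
# 4 low bits of the 2nd symbol, which the decoder throws away.
same_octet = base64.b64decode("QQ==") == base64.b64decode("QR==") == b"A"

# One "=" after two symbols: this decoder reports incorrect padding.
single_pad = accepts("QQ=")
```

So with this particular decoder, several distinct strings do map to the same octets, while the shortened "XX=" form is not accepted.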
To encode a bits-stream with an exact number of bits (not a multiple of 8), you need to encode an extra payload indicating the effective number of bits to keep at the end of the encoded octets-stream (or at the start):
- Base64 does not specify how you convert a bitstream of arbitrary length to an octets-stream;
- for that purpose, you may need to pad the bits-stream at the start or at the end with 1 to 7 bits (so that the resulting bitstream has a length that is a multiple of 8, and is then encodable with Base64, which takes only octets as input);
- these extra padding bits are not significant for the original bitstream, but are significant for the Base64 encoder/decoder: they will be discarded by the bitstream decoder built on top of the Base64 decoder, but not by the Base64 decoder itself.

You need to encode somewhere, with the bitstream encoder, how many padding bits (0 to 7) are present at the start or end of the octets-stream; this can be done:
- as a separate payload (not encoded by Base64), or
- by prepending 3 bits at the start of the bits-stream, which is then padded at the end with 1 to 7 random bits to get a bit-length that is a multiple of 8, suitable for Base64 encoding, or
- by appending 3 bits at the end of the bits-stream, just after the 1 to 7 random bits needed to get a bit-length that is a multiple of 8, suitable for Base64 encoding.

Finally, your bits-stream decoder will be able to use this padding count to discard these random padding bits (and possibly realign the stream on different byte boundaries when the effective bit-length of the bits-stream payload is not a multiple of 8 and padding bits were added).

Base64 also does not specify how the bits of the original bits-stream payload are packed into the octets-stream input suitable for Base64 encoding; notably, it does not specify their order and endianness. The same remark applies to MIME and HTTP.
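
The "prepend 3 bits" option can be sketched as follows. This is one possible layering under stated assumptions, not a scheme defined by any Base64 specification: bits are packed most-significant-bit first, the 3-bit header holds the number of trailing padding bits (0 to 7), and the padding is zero bits rather than random ones. The function names `pack_bits` and `unpack_bits` are hypothetical.

```python
import base64

def pack_bits(bits: list[int]) -> bytes:
    """Pack bits MSB-first behind a 3-bit header holding the trailing pad count."""
    pad = -(3 + len(bits)) % 8                 # 0..7 padding bits needed at the end
    header = [(pad >> 2) & 1, (pad >> 1) & 1, pad & 1]
    stream = header + list(bits) + [0] * pad   # assumption: zero padding, not random
    out = bytearray()
    for i in range(0, len(stream), 8):
        byte = 0
        for bit in stream[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)

def unpack_bits(data: bytes) -> list[int]:
    """Inverse of pack_bits: read the header, then drop header and padding."""
    stream = [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]
    pad = (stream[0] << 2) | (stream[1] << 1) | stream[2]
    return stream[3:len(stream) - pad]

payload = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1]    # 11 bits, not a multiple of 8
packed = pack_bits(payload)
encoded = base64.b64encode(packed).decode("ascii")
decoded = unpack_bits(base64.b64decode(encoded))
```

The Base64 layer only ever sees whole octets; the padding count is interpreted solely by `unpack_bits`, which is the separation of layers described above.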
So a lot of network protocols and file formats need to specify how to encode which of the possible options is used to encode bits-streams of arbitrary length, or need to specify which default choice to apply if this option is not encoded, or which option must be used (with no possible variation). And this also adds to the number of distinct encodings that are possible but still equivalent for the same effective bits-stream payload.

All these allowed variations are from the encoder's perspective. For interoperability, the decoder has to be flexible and support various options to be compatible with different implementations of the encoder, notably when the encoder was run on a different system. This is the case for MIME transport by mail, for HTTP and FTP transports, and for file/media storage formats, even if the file is stored on the same system (because it may actually be a local copy of a file that was encoded on another system).

Now if we come back to the encoding of plain-text payloads, Unicode just specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code points (it does not actually mandate an exact bit-length, because the range does not exactly fill 21 bits and an encoder can still pack multiple code points together into more compact code units). However, Unicode provides and standardizes several encodings (UTF-8/16/32) which use code units whose size is directly suitable as input for an octets-stream, so that they are directly encodable with Base64, without having to specify an extra layer for the bits-stream encoder/decoder. But many other encodings are still possible (and can be
Re: Fallback for Sinhala Consonant Clusters
Hello Richard, On 2018/10/14 09:02, Richard Wordingham via Unicode wrote: Are there fallback rules for Sinhala consonant clusters? There are fallback rules for Devanagari, but I'm not sure if they read across. The problem I am seeing is that the Pali syllable 'ndhe' න්ධෙ Let's label this as (1) is being rendered identically to a hypothetical Sinhalese 'nēdha' නේධ , It (2) doesn't look identically to (1) here (Thunderbird on Win 8.1). Your mail is written as if you are speaking about a general phenomenon, but I guess there are differences depending on the font and rendering stack. which in NFD is , when I use a font that lacks the conjunct. (Most fonts lack the conjunct.) The Devanagari rules and my preference would lead to a fallback rendering as න්ධෙ (Sinhalese 'ndhe'), Here, this (3) looks like it has the same three components as (2), but the first two are exchanged, so that the piece that looks like @ is now in the middle (it was at the left in (1) and (2)). Hope this helps. Regards,Martin. which is encoded as . Is the rendering I am getting technically wrong, or is it merely undesirable? The ambiguity arises in part because, like the Brahmi script, the Sinhala script uses its virama character as a vowel length indicator. Missing touching consonants are being rendered almost as though there were no ZWJ, but the combination of consonant and al-lakuna is being rendered badly. Richard. . -- Prof. Dr.sc. Martin J. Dürst Department of Intelligent Information Technology College of Science and Engineering Aoyama Gakuin University Fuchinobe 5-1-10, Chuo-ku, Sagamihara 252-5258 Japan