Look at https://tools.ietf.org/html/rfc4648, section 3.2, paragraph 1, first sentence, where it is explicitly stated:
In some circumstances, the use of padding ("=") in base-encoded data is
not required or used.

On Mon, Oct 15, 2018 at 03:56, Tex <texte...@xencraft.com> wrote:

> Philippe,
>
> Where is the use of whitespace or the idea that 1-byte pieces do not need
> all the equal sign paddings documented?
>
> I read the rfc 3501 you pointed at, I don’t see it there.
>
> Are these part of any standards? Or are you claiming these are practices
> despite the standards? If so, are these just tolerated by parsers, or are
> they actually generated by encoders?
>
> What would be the rationale for supporting unnecessary whitespace? If
> linebreaks are forced at some line length they can presumably be removed at
> that length and not treated as part of the encoding.
>
> Maybe we differ on where we define the encoding to begin and end, and where
> higher level protocols prescribe how they are embedded within the protocol.
>
> Tex
>
> *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of* Philippe
> Verdy via Unicode
> *Sent:* Sunday, October 14, 2018 1:41 AM
> *To:* Adam Borowski
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Base64 encoding applied to different unicode texts always
> yields different base64 texts ... true or false?
>
> Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
> enough to indicate the end of an octets-span. The extra = after it does not
> add any other octet. You are also allowed to insert whitespace anywhere in
> the encoded stream; this is what ensures that the Base64-encoded
> octets-stream will not be altered if line breaks are forced anywhere
> (notably within the body of emails).
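The whitespace point can be checked against one concrete decoder. The sketch below assumes Python's standard `base64` module, whose default (non-validating) mode discards characters outside the Base64 alphabet before decoding. The sample string and the helper names `decode_lenient` and `decode_strict` are made up for illustration, and other decoders may well behave differently.

```python
import base64
import binascii

# Hypothetical sample: "meow" encoded as Base64, with a CRLF and a space
# inserted in the middle of the encoded stream.
ENCODED_WITH_WHITESPACE = "bWVv\r\n dw=="


def decode_lenient(text):
    """Decode with the default mode, which skips non-alphabet characters."""
    return base64.b64decode(text)


def decode_strict(text):
    """Decode with validate=True, which rejects non-alphabet characters."""
    return base64.b64decode(text, validate=True)


if __name__ == "__main__":
    print(decode_lenient(ENCODED_WITH_WHITESPACE))
    try:
        decode_strict(ENCODED_WITH_WHITESPACE)
    except binascii.Error as exc:
        print("strict decoder rejected the input:", exc)
```

So whether embedded whitespace is tolerated depends on the decoder and the mode it runs in, which is roughly the distinction Tex is asking about.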
>
> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR,
> LF, NEL) in the middle is non-significant and ignorable on decoding (their
> "encoded" bit length is 0 and they don't terminate an octets-span, unlike
> "=", which discards any extra bits remaining from the preceding encoded
> stream that are not on 8-bit boundaries).
>
> Also:
>
> - For 1-octet pieces the minimum format is "XX= ", but the 2nd "X" symbol
>   before "=" can vary in its 4 lowest bits (which are then ignored/discarded
>   by the "=" symbol)
>
> - For 2-octet pieces the minimum format is "XXX= ", but the 3rd "X"
>   symbol before "=" can vary in its 2 lowest bits (which are then
>   ignored/discarded by the "=" symbol)
>
> So you can use Base64 by encoding each octet in separate pieces, as one
> Base64 symbol followed by an "=" symbol, and even insert any number of
> whitespaces between them: there's an infinite number of valid Base64
> encodings for representing the same octets-stream payload.
>
> Base64 allows encoding any octets-stream but not directly any
> bits-stream: it assumes that the effective bits-stream has a binary
> length that is a multiple of 8. To encode a bits-stream with an exact
> number of bits (not a multiple of 8), you need to encode an extra payload
> to indicate the effective number of bits to keep at the end of the encoded
> octets-stream (or at the start):
>
> - Base64 does not specify how you convert a bitstream of arbitrary length
>   to an octets-stream;
>
> - for that purpose, you may need to pad the bits-stream at start or at end
>   with 1 to 7 bits (so that the resulting bitstream has a length that is a
>   multiple of 8, and is then encodable with Base64, which takes only
>   octets on input);
>
> - these extra padding bits are not significant for the original bitstream,
>   but are significant for the Base64 encoder/decoder; they will be discarded
>   by the bitstream decoder built on top of the Base64 decoder, but not by
>   the Base64 decoder itself.
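The layering described in the bullets above can be sketched as follows. This is only an illustration under stated assumptions: bits are represented as a Python string of '0' and '1', padding is added at the end with zero bits (the message allows any bits), and the pad count is carried out of band next to the Base64 text. The helper names `pack_bits` and `unpack_bits` are hypothetical.

```python
import base64


def pack_bits(bits):
    """Pad a '0'/'1' string at the end with zero bits to a multiple of 8.

    Returns (octets, pad), where pad is the number of padding bits (0 to 7).
    Assumption: the pad count travels separately from the Base64 text.
    """
    pad = (-len(bits)) % 8
    padded = bits + "0" * pad
    octets = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return octets, pad


def unpack_bits(octets, pad):
    """Rebuild the bit string and drop the trailing padding bits."""
    bits = "".join(format(b, "08b") for b in octets)
    return bits[:len(bits) - pad] if pad else bits


if __name__ == "__main__":
    payload = "10110"                       # 5-bit payload, not a multiple of 8
    octets, pad = pack_bits(payload)        # one octet, 3 padding bits
    text = base64.b64encode(octets).decode("ascii")
    # The Base64 decoder returns the padded octets unchanged; only the
    # bit-stream layer on top of it knows which bits to discard.
    restored = unpack_bits(base64.b64decode(text), pad)
    print(text, pad, restored)
```

Here the Base64 layer round-trips the padded octet exactly, and removing the three padding bits is left to the bit-stream decoder, as the last bullet says.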
>
> You need to encode somewhere with the bitstream encoder how many padding
> bits (0 to 7) are present at the start or end of the octets-stream; this
> can be done:
>
> - as a separate payload (not encoded by Base64), or
>
> - by prepending 3 bits at the start of the bits-stream, which is then
>   padded at the end with 1 to 7 random bits to get a bit-length multiple
>   of 8 suitable for Base64 encoding, or
>
> - by appending 3 bits at the end of the bits-stream, just after the 1 to 7
>   random bits needed to get a bit-length multiple of 8 suitable for Base64
>   encoding.
>
> Finally your bits-stream decoder will be able to use this padding count to
> discard these random padding bits (and possibly realign the stream on
> different byte-boundaries when the effective bit-length of the bits-stream
> payload is not a multiple of 8 and padding bits were added).
>
> Base64 also does not specify how bits of the original bits-stream payload
> are packed into the octets-stream input suitable for Base64-encoding;
> notably it does not specify their order and endianness. The same remark
> applies as well to MIME and HTTP. So a lot of network protocols and file
> formats need to specify how to encode which of the possible options is
> used to encode bits-streams of arbitrary length, or need to specify which
> default choice to apply if this option is not encoded, or which option
> must be used (with no possible variation). And this also adds to the
> number of distinct encodings that are possible but are still equivalent
> for the same effective bits-stream payload.
>
> All these allowed variations are from the encoder perspective. For
> interoperability, the decoder has to be flexible and to support various
> options to be compatible with different implementations of the encoder,
> notably when the encoder was run on a different system.
> And this is the case for the MIME transport by mail, or for HTTP and FTP
> transports, or file/media storage formats even if the file is stored on
> the same system (because it may actually be a copy stored locally but
> coming from another system where the file was actually encoded).
>
> Now if we come back to the encoding of plain-text payloads, Unicode just
> specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code
> points (it actually does not mandate an exact bit-length because the range
> does not fit exactly into 21 bits and an encoder can still pack multiple
> code points together into more compact code units).
>
> However Unicode provides and standardizes several encodings (UTF-8/16/32)
> which use code units whose size is directly suitable as input for an
> octets-stream, so that they are directly encodable with Base64, without
> having to specify an extra layer for the bits-stream encoder/decoder.
>
> But many other encodings are still possible (and can be conforming to
> Unicode, provided they preserve each Unicode scalar value, or at least the
> code point identity because an encoder/decoder is not required to support
> non-character code points such as surrogates or U+FFFE), where Base64 may
> be used for internally generated octets-streams.
>
> On Sun, Oct 14, 2018 at 03:47, Adam Borowski via Unicode
> <unicode@unicode.org> wrote:
>
> On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote:
> > On Sat, Oct 13, 2018 at 18:58, Steffen Nurpmeso via Unicode
> > <unicode@unicode.org> wrote:
> > > The only variance is described as:
> > >
> > > Care must be taken to use the proper octets for line breaks if base64
> > > encoding is applied directly to text material that has not been
> > > converted to canonical form. In particular, text line breaks must be
> > > converted into CRLF sequences prior to base64 encoding.
> > > The important thing to note is that this may be done directly by the
> > > encoder rather than in a prior canonicalization step in some
> > > implementations.
> > >
> > > This is MIME, it specifies (in the same RFC):
> >
> > I've not spoken about the encoding of new lines **in the actual encoded
> > text**:
> > - if their existing text-encoding ever gets converted to Base64 as if the
> >   whole text was an opaque binary object, their initial text-encoding
> >   will be preserved (so yes it will preserve the way these embedded
> >   newlines are encoded as CR, LF, CR+LF, NL...)
> >
> > I spoke about newlines used in the transport syntax to split the initial
> > binary object (which may actually contain text but it does not matter).
> > MIME defines this operation and even requires splitting the binary object
> > into fragments with a maximum binary size so that these binary fragments
> > can be converted with Base64 into lines with a maximum length. In the
> > MIME Base64 representation you can insert newlines anywhere between
> > fragments encoded separately.
>
> There's another kind of fragmentation that can make the encoding differ
> (but still decode to the same payload):
>
> The data stream gets split into 3-byte internal, 4-byte external packets.
> Any packet may contain less than those 3 bytes, in which case it is padded
> with = characters:
> 3 bytes XXXX
> 2 bytes XXX=
> 1 byte  XX==
>
> Usually, such smaller packets happen only at the end of a message, but to
> support encoding a stream piecewise, they are allowed at any point.
>
> For example:
> "meow" is bWVvdw==
> "me""ow" is bWU=b3c=
> yet both carry the same payload.
>
> > Base64 is used exactly to support this flexibility in transport (or
> > storage) without altering any bit of the initial content once it is
> > decoded.
>
> Right, any such variations are in packaging only.
>
> ᛗᛖᛟᚹ
> --
> ⢀⣴⠾⠻⢶⣦⠀
> ⣾⠁⢰⠒⠀⣿⡁ 10 people enter a bar: 1 who understands binary,
> ⢿⡄⠘⠷⠚⠋⠀ 1 who doesn't, D who prefer to write it as hex,
> ⠈⠳⣄⠀⠀⠀⠀ and 1 who narrowly avoided an off-by-one error.
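Adam's "meow" example quoted above can be reproduced with a short sketch. It assumes Python's standard `base64` module for the per-piece work; the helpers `encode_piecewise` and `decode_quanta` are hypothetical names, and decoding each 4-symbol group on its own is just one simple way to accept padding in the middle of a stream, not something any RFC mandates.

```python
import base64


def encode_piecewise(chunks):
    """Encode each chunk separately and concatenate the padded results."""
    return "".join(base64.b64encode(c).decode("ascii") for c in chunks)


def decode_quanta(text):
    """Decode a Base64 text one 4-symbol group at a time.

    Whitespace is dropped first, so forced line breaks do not matter.
    Each group may carry its own "=" padding, so padded groups are
    accepted at any position, not only at the end.
    """
    symbols = "".join(text.split())
    if len(symbols) % 4:
        raise ValueError("length is not a multiple of 4 symbols")
    return b"".join(
        base64.b64decode(symbols[i:i + 4]) for i in range(0, len(symbols), 4)
    )


if __name__ == "__main__":
    whole = encode_piecewise([b"meow"])        # bWVvdw==
    split = encode_piecewise([b"me", b"ow"])   # bWU=b3c=
    print(whole, split)
    print(decode_quanta(whole), decode_quanta(split))
```

Both texts decode to the same four octets, so the difference between them is indeed only in the packaging.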