Re: Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask
wjgo_10...@btinternet.com via Unicode wrote in
<141cecf1.23e.1702ea529c1.webtop@btinternet.com>:
|Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good
|reason why I ask
|
|There is a German song, Lorelei, and I searched to find an English
|translation.

Regarding the Rhine and this thing of yours, there is also a German
joke, from the middle of the 1950s i think, with "Tünnes und Schäl".

  Tünnes und Schäl stehen auf der Rheinbrücke. Da fällt Tünnes die
  Brille in den Fluß und er sagt "Da schau, jetzt ist mir die Brille
  in die Mosel gefallen", worauf Schäl sagt, "Mensch, Tünnes, dat is
  doch de Ring!", und Tünnes antwortet "Da kannste mal sehen wie
  schlecht ich ohne Brille sehen kann!"

  Tuennes and Schael stand on the Rhine bridge. Then Tuennes's
  glasses fall into the river, and he says "Look, now i lost my
  glasses to the Moselle", whereupon Schael says "Crumbs!, Tuennes,
  that is the Rhine!", and Tuennes responds "There you can see how
  badly i can see without glasses!"

P.S.: i cannot speak "Kölsch" aka Cologne dialect.
P.P.S.: i think i got you wrong.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
Re: Access to the Unicode technical site (was: Re: Unicode's got a new logo?)
Hello Mr. Ken Whistler.

Ken Whistler wrote in <3d1676bb-f3c1-8a3e-fdc5-1c0bdd74a...@sonic.net>:
|On 7/18/2019 11:50 AM, Steffen Nurpmeso via Unicode wrote:
|> I also decided to enter /L2 directly from now on.
|
|For folks wishing to access the UTC document register, Unicode
|Consortium standards, and so forth, all of those links will be
|permanently stable. They are not impacted by the rollout of the new home
|page and its related content.
|
|If you need access to the more technical information from the UTC,
|CLDR-TC, ICU-TC, etc., feel free to bookmark such pages as:
|
|https://www.unicode.org/L2/
|
|for the UTC document register.
|
|https://www.unicode.org/charts/
|
|for the Unicode code charts index,
|
|https://www.unicode.org/versions/latest/
|
|for the latest version of the Unicode Standard, and so forth. All such
|technical links are stable on the site, and will continue to be stable.

Are these things still linked from the top homepage?
Thank you very much for the information.
(My gut feeling is that it is tremendous that very highly qualified
people care for such vanities.)

|For general access to the technical content on the Unicode website, see:
|
|https://www.unicode.org/main.html
|
|which provides easy link access to all the technical content areas and
|to the ongoing technical committee work.

I hopefully will come to truly Unicode the things i do!!
(By then programming will hopefully be true fun again. I hope..)
A nice weekend i wish, from soon-sunny-again Germany!

--steffen
Re: Unicode's got a new logo?
Yifán Wáng via Unicode wrote:
|I cannot help but notice the new home.unicode.org site embraces a new
|logo, blue base color with a humanist type, rather than the
|traditional one, red and geometric. Does anybody know if it means that
|Unicode wants to renew its logo or that they serve for different
|purposes? Which should I cite as the official logo? I think I've read
|the description and the blog post but couldn't find an explanation.

I also decided to enter /L2 directly from now on.
I am happy that you give me the opportunity to finally send a mail
regarding this topic.
(Apologies to the designers from Adobe.)

--steffen
Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?
Philippe Verdy via Unicode wrote:
|Padding itself does not clearly indicate the length.
|
|It's an artefact that **may** be infered only in some other layers \
|of protocols which specify when and how padding is needed (and how \
|many padding bytes
|are required or accepted), it works only if these upper layer protocols \
|are using **octets** streams, but it is still not usable for more general
|bitstreams (with arbitrary bit lengths).
|
|This RFC does not mandate/require these padding bytes and in fact many \
|upper layer protocols do not ever need it (including UTF-7 for example), \
|they are
|never necessary to infer a length in octets and insufficient for specify\
|ing a length in bits.
|
|As well the usage in MIME (where there's a requirement that lines of \
|headers or in the content body is limited to 1000 bytes) requires free \
|splitting of
|Base64 (there's no agreed maximum length, some sources insist it should \
|not be more than 72 bytes, others use 80 bytes, but mail forwarding \
|may add other
|characters at start of lines, forcing them to be shorter (leaving for \
|example a line of 72 bytes+CRLF and another line of 8 bytes+CRLF): \
|this means that
|padding may not be used where one would expect them, and padding can \
|event occur in the middle of the encoded stream (not just at end) along \

That was actually a bug in my MUA. Other MUAs were not capable of
decoding this correctly. Sorry :-(!!

|with other
|whitespaces or separators (like "> " at start of lines in cited messages).

In fact, MIME explicitly says that garbage bytes may be embedded.
Most decoders handle that correctly and skip them (silently, which may
not be right), but some standalone base64 decoders fail miserably when
they see such things (openssl base64, the current NetBSD base64
decoder), while others do not (busybox base64, for example).

--steffen
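
The difference between decoders that skip embedded garbage and decoders that reject it can be sketched with Python's standard base64 module. This is only an illustration under stated assumptions, not a description of how openssl, NetBSD or busybox implement it: the payload, the "> " quote prefix, the 8-character line length and the helper names decode_lenient and decode_strict are all made up for the example.

```python
import base64
import binascii

# Arbitrary sample payload (an assumption for this sketch).
payload = "Grüße vom Rhein".encode("utf-8")
clean = base64.b64encode(payload).decode("ascii")

# Simulate a quoted mail body: short lines, each with a "> " prefix and CRLF.
noisy = "".join(
    "> " + clean[i:i + 8] + "\r\n" for i in range(0, len(clean), 8)
)


def decode_lenient(text: str) -> bytes:
    """Skip characters outside the base64 alphabet (validate=False is the default)."""
    return base64.b64decode(text)


def decode_strict(text: str) -> bytes:
    """Reject any character outside the base64 alphabet."""
    return base64.b64decode(text, validate=True)


if __name__ == "__main__":
    print(decode_lenient(noisy) == payload)
    try:
        decode_strict(noisy)
    except binascii.Error as exc:
        print("strict decoder refused the input:", exc)
```

The lenient variant recovers the payload because "> ", CR and LF are not part of the base64 alphabet and are dropped; the strict variant raises binascii.Error on the same input.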
Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?
Doug Ewell via Unicode wrote in <2A67B4F082F74F8AADF34BA11D885554@DougEwell>:
|Steffen Nurpmeso wrote:
|> Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
|> (MIME) Part One: Format of Internet Message Bodies).
|
|Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
|Encodings." RFC 2045 defines a particular implementation of base64,
|specific to transporting Internet mail in a 7-bit environment.
|
|RFC 4648 discusses many of the "higher-level protocol" topics that some
|people are focusing on, such as separating the base64-encoded output
|into lines of length 72 (or other), alternative target code unit sets or
|"alphabets," and padding characters. It would be helpful for everyone to
|read this particular RFC before concluding that these topics have not
|been considered, or that they compromise round-tripping or other
|characteristics of base64.
|
|I had assumed that when Roger asked about "base64 encoding," he was
|asking about the basic definition of base64.

Sure; i have only followed the discussion superficially, and even
though everybody can read RFCs, i felt the need to polemicize against
the statement "MIME actually splits a binary object into multiple
fragments at random positions", which is false however i look at it.
Solely my fault.

--steffen
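
The round-tripping point can be checked with a small sketch: the same octets encoded once as a single line and once with MIME-style line breaks give different encoded texts, yet both decode to the original data. This uses Python's standard base64 module; the sample data and the variable names are assumptions chosen for illustration only.

```python
import base64

# 512 arbitrary octets (sample data, an assumption for this sketch).
data = bytes(range(256)) * 2

# Single-line encoding, no line breaks.
one_line = base64.b64encode(data)

# MIME-style encoding: a newline after at most 76 output characters.
wrapped = base64.encodebytes(data)

if __name__ == "__main__":
    print(one_line == wrapped)                         # the texts differ
    print(base64.b64decode(one_line) == data)          # round trip, unwrapped
    print(base64.decodebytes(wrapped) == data)         # round trip, wrapped
    print(wrapped.replace(b"\n", b"") == one_line)     # only line breaks differ
```

Here the wrapped form differs from the single-line form only by the inserted line breaks, so the wrapping does not compromise the round trip.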
Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?
Philippe Verdy via Unicode wrote:
|You forget that Base64 (as used in MIME) does not follow these rules \
|as it allows multiple different encodings for the same source binary. \
|MIME actually
|splits a binary object into multiple fragments at random positions, \
|and then encodes these fragments separately. Also MIME uses an extension \
|of Base64
|where it allows some variations in the encoding alphabet (so even the \
|same fragment of the same length may have two disting encodings).
|
|Base64 in MIME is different from standard Base64 (which never splits \
|the binary object before encoding it, and uses a strict alphabet of \
|64 ASCII
|characters, allowing no variation). So MIME requires special handling: \
|the assumpton that a binary message is encoded the same is wrong, but \
|MIME still
|requires that this non unique Base64 encoding will be decoded back \
|to the same initial (unsplitted) binary object (independantly of its \
|size and
|independantly of the splitting boundaries used in the transport, which \
|may change during the transport).

Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
(MIME) Part One: Format of Internet Message Bodies).
It is a content-transfer-encoding and transparently encodes any data
into 7-bit clean, ASCII _and_ EBCDIC compatible text (the authors
point that out).
Decoding restores the original form from this representation.
Ok, there is the CRLF newline problem, as below.
What do you mean by "splitting"?

...

The only variance is described as:

  Care must be taken to use the proper octets for line breaks if base64
  encoding is applied directly to text material that has not been
  converted to canonical form. In particular, text line breaks must be
  converted into CRLF sequences prior to base64 encoding. The
  important thing to note is that this may be done directly by the
  encoder rather than in a prior canonicalization step in some
  implementations.

This is MIME, it specifies (in the same RFC):

  2.10. Lines

  "Lines" are defined as sequences of octets separated by a CRLF
  sequences. This is consistent with both RFC 821 and RFC 822.
  "Lines" only refers to a unit of data in a message, which may or may
  not correspond to something that is actually displayed by a user
  agent.

and furthermore

  6.5. Translating Encodings

  The quoted-printable and base64 encodings are designed so that
  conversion between them is possible. The only issue that arises in
  such a conversion is the handling of hard line breaks in
  quoted-printable encoding output. When converting from
  quoted-printable to base64 a hard line break in the quoted-printable
  form represents a CRLF sequence in the canonical form of the data.
  It must therefore be converted to a corresponding encoded CRLF in the
  base64 form of the data. Similarly, a CRLF sequence in the canonical
  form of the data obtained after base64 decoding must be converted to
  a quoted-printable hard line break, but ONLY when converting text
  data.

So we go over

  6.6. Canonical Encoding Model

  There was some confusion, in the previous versions of this RFC,
  regarding the model for when email data was to be converted to
  canonical form and encoded, and in particular how this process would
  affect the treatment of CRLFs, given that the representation of
  newlines varies greatly from system to system, and the relationship
  between content-transfer-encodings and character sets. A canonical
  model for encoding is presented in RFC 2049 for this reason.

to RFC 2049 where we find

  For example, in the case of text/plain data, the text must be
  converted to a supported character set and lines must be delimited
  with CRLF delimiters in accordance with RFC 822. Note that the
  restriction on line lengths implied by RFC 822 is eliminated if the
  next step employs either quoted-printable or base64 encoding.

and, later

  Conversion from entity form to local form is accomplished by
  reversing these steps. Note that reversal of these steps may produce
  differing results since there is no guarantee that the original and
  final local forms are the same.

and, later

  NOTE: Some confusion has been caused by systems that represent
  messages in a format which uses local newline conventions which
  differ from the RFC822 CRLF convention. It is important to note that
  these formats are not canonical RFC822/MIME. These formats are
  instead *encodings* of RFC822, where CRLF sequences in the canonical
  representation of the message are encoded as the local newline
  convention. Note that formats which encode CRLF sequences as, for
  example, LF are not capable of representing MIME messages containing
  binary data which contains LF octets not part of CRLF line separation
  sequences.

Whoever understands this emojibake.
My MUA still gnaws at
Re: Tales from the Archives
Terrible!

Ken Whistler wrote in <12e6ad91-89e4-ec87-85ad-8fc4ab3f6...@att.net>:
|Steffen,
|
|Are you looking for the Unicode list email archives?
|
|https://www.unicode.org/mail-arch/
|
|Those contain list content going back all the way to 1994.

Dear Ken Whistler, no, and yes, having an archive is very good,
though i cannot agree with your statement from 1997-07-16 ("Plan 9 (a
Unix OS) uses UTF-8"); it feels very different from Unix.

It was just that, on one of the mailing-lists i am subscribed to, i
read a citation of a Unicode statement that i had never seen anything
about on the Unicode mailing-list. It is very awkward, but i _again_
cannot find what attracted my attention, even with the help of a
search engine. I think "faith alone will reveal the true name of
shuruq" (1997-07-18).

--steffen
Re: Tales from the Archives
James Kass via Unicode wrote:
 ...
|Eighteen years pass, display issues have mostly gone away, nearly
|everything works "out-of-the-box", and list traffic has dropped
|dramatically. Today's questions are usually either highly technical
|or emoji-related.
|
|Recent threads related to emoji included some questions and issues
|which remain unanswered in spite of the fact that there are list
|members who know the answers.
|
|This gives the impression that the Unicode public list has become
|passé. That's almost as sad as looking down the archive posts, seeing
|the names of the posters, and remembering colleagues who no longer
|post.
|
|So I'm wondering what changed, but I don't expect an answer.

I have the impression that many things which were posted here some
years ago are now only available via some forums or other
browser-based services. What is posted here seems to be mostly just a
duplicate of the blog. (And the website has its pitfalls too, for
example [1] is linked from [2], but does not exist.)

  [1] http://www.unicode.org/resources/readinglist.html
  [2] http://www.unicode.org/publications/

--steffen
Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
"Costello, Roger L. via Unicode" wrote:
|Suppose an application splits a UTF-8 multi-octet sequence. The application \
|then sends the split sequence to a client. The client must restore \
|the original sequence.
|
|Question: is it possible to split a UTF-8 multi-octet sequence in such \
|a way that the client cannot unambiguously restore the original sequence?
|
|Here is the source of my question:
|
|The iCalendar specification [RFC 5545] says that long lines must be folded:
|
| Long content lines SHOULD be split
| into a multiple line representations
| using a line "folding" technique.
| That is, a long line can be split between
| any two characters by inserting a CRLF
| immediately followed by a single linear
| white-space character (i.e., SPACE or HTAB).
|
|The RFC says that, when parsing a content line, folded lines must first \
|be unfolded using this technique:
|
| Unfolding is accomplished by removing
| the CRLF and the linear white-space
| character that immediately follows.
|
|The RFC acknowledges that simple implementations might generate improperly \
|folded lines:
|
| Note: It is possible for very simple
| implementations to generate improperly
| folded lines in the middle of a UTF-8
| multi-octet sequence. For this reason,
| implementations need to unfold lines
| in such a way to properly restore the
| original sequence.

That is not what the RFC says. It says that simple implementations
simply split lines when the limit is reached, which might be in the
middle of a UTF-8 sequence. The RFC is thus an improvement over other
RFCs in the email standards area, which do not give any hints on how
to do that. Even RFC 2231, which avoids many of the ambiguities and
problems of RFC 2047 (for a different purpose, but still), does not
state this as exactly for the reverse character set conversion (which
i for one perform _once_ after joining together the chunks, but that
is not written down anywhere and, thus, ...).
--steffen
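
The approach of joining the chunks first and converting the character set only once can be sketched as follows. This is a minimal illustration, not taken from RFC 5545 or from any particular MUA: the helper names fold_naive and unfold, the sample content line and the default limit of 75 octets are assumptions for the example.

```python
def fold_naive(line: bytes, limit: int = 75) -> bytes:
    """Fold at a fixed octet count, as a very simple implementation might.

    The split position ignores UTF-8 boundaries, so a multi-octet
    sequence can end up divided between two physical lines.
    """
    chunks = [line[:limit]]
    rest = line[limit:]
    while rest:
        # Continuation lines lose one octet to the leading SPACE.
        chunks.append(rest[:limit - 1])
        rest = rest[limit - 1:]
    return b"\r\n ".join(chunks)


def unfold(raw: bytes) -> str:
    """Remove CRLF + SPACE/HTAB on the octet level, then decode once."""
    joined = raw.replace(b"\r\n ", b"").replace(b"\r\n\t", b"")
    return joined.decode("utf-8")


if __name__ == "__main__":
    text = "SUMMARY:" + "ä" * 50          # 8 + 100 octets in UTF-8
    folded = fold_naive(text.encode("utf-8"))
    first_line = folded.split(b"\r\n")[0]
    try:
        first_line.decode("utf-8")
    except UnicodeDecodeError:
        print("first physical line alone is not valid UTF-8")
    print(unfold(folded) == text)
```

Because the octets of a UTF-8 multi-octet sequence are all 0x80 or above, they can never be mistaken for CR, LF, SPACE or HTAB, so unfolding on the octet level restores the original sequence unambiguously; decoding each physical line separately is what would fail.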