ASCII and Unicode [was Re: Managing Google Groups headaches]
On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:

> Evidently (and completely inadvertently) this exchange has just illustrated one of the inadmissible assumptions: unicode as a medium is universal in the same way that ASCII used to be

Ironically, your post was not Unicode. Seriously. I am 100% serious.

Your post was sent using a legacy encoding, Windows-1252, also known as CP-1252, which is most certainly *not* Unicode. Whatever software you used to send the message correctly flagged it with a charset header:

Content-Type: text/plain; charset=windows-1252

Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle encodings correctly (or at all!); it screws up the encoding, then sends a reply with no charset line at all. This is one bug that cannot be blamed on Google Groups -- or on Unicode.

> I wrote a number of ellipsis characters ie codepoint 2026 as in:

Actually you didn't. You wrote a number of ellipsis characters, hex byte \x85 (decimal 133), in the CP1252 charset. That happens to be mapped to code point U+2026 in Unicode, but the two are as distinct as ASCII and EBCDIC.

> Somewhere between my sending and your quoting those ellipses became the replacement character FFFD

Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about encodings and character sets. It doesn't just assume things are ASCII; it makes a half-hearted attempt to be charset-aware, and does it badly. I can only imagine that it was written back in the Dark Ages, when there were a lot of different charsets in use but no conventions for specifying which charset was in use. Or perhaps the author was smoking crack while coding.

> Leaving aside whose fault this is (very likely buggy google groups), this mojibaking cannot happen if the assumption "All text is ASCII" were to uniformly hold.

This is incorrect. People forget that ASCII has evolved since the first version of the standard in 1963. There have actually been five versions of the ASCII standard, plus one unpublished version.
(And that's not including the things which are frequently called ASCII but aren't.)

ASCII-1963 didn't even include lowercase letters. It is also missing some graphic characters like braces, and included at least two characters no longer used, the up-arrow and left-arrow. The control characters were also significantly different from today.

ASCII-1965 was unpublished and unused. I don't know the details of what it changed.

ASCII-1967 is a lot closer to the ASCII in use today. It made considerable changes to the control characters, moving, adding, removing, or renaming at least half a dozen control characters. It officially added lowercase letters, braces, and some others. It replaced the up-arrow character with the caret and the left-arrow with the underscore. It was ambiguous, allowing variations and substitutions, e.g.:

- character 33 was permitted to be either the exclamation mark ! or the logical OR symbol |
- consequently character 124 (vertical bar) was always displayed as a broken bar ¦, which explains why even today many keyboards show it that way
- character 35 was permitted to be either the number sign # or the pound sign £
- character 94 could be either a caret ^ or a logical NOT ¬

Even the humble comma could be pressed into service as a cedilla.

ASCII-1968 didn't change any characters, but allowed the use of LF on its own. Previously, you had to use either LF/CR or CR/LF as newline.

ASCII-1977 removed the ambiguities from the 1967 standard.

The most recent version is ASCII-1986 (also known as ANSI X3.4-1986). Unfortunately I haven't been able to find out what changes were made -- I presume they were minor, and didn't affect the character set.

So as you can see, even with actual ASCII, you can have mojibake. It's just not normally called that. But if you are given an arbitrary ASCII file of unknown age, containing code 94, how can you be sure it was intended as a caret rather than a logical NOT symbol? You can't.
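To make the "you can't" concrete, here is a minimal sketch that enumerates every text a 7-bit byte string could stand for under the substitutions listed above. The table holds only the three code positions the post calls ambiguous; it is illustrative, not a transcription of the 1967 standard, and the names AMBIGUOUS_1967 and possible_readings are made up for this example.

```python
from itertools import product

# Alternative readings per code position, copied from the list in the post.
# Illustrative only: this is not a complete or authoritative table.
AMBIGUOUS_1967 = {
    33: ("!", "|"),        # exclamation mark or logical OR
    35: ("#", "\u00a3"),   # number sign or pound sign
    94: ("^", "\u00ac"),   # caret or logical NOT
}


def possible_readings(data):
    """Return every text the 7-bit byte string `data` might have meant."""
    choices = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError("not 7-bit data: byte %d" % byte)
        # Unambiguous positions have exactly one reading.
        choices.append(AMBIGUOUS_1967.get(byte, (chr(byte),)))
    return ["".join(combo) for combo in product(*choices)]


# One byte string, two equally valid readings.
for text in possible_readings(b"x ^ y"):
    print(ascii(text))
```

Nothing in the bytes themselves says which reading was intended; that information has to come from outside the file.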
Then there are at least 30 official variations of ASCII, strictly speaking part of ISO 646. These 7-bit codes were commonly called "ASCII" by their users, despite the differences, e.g. replacing the dollar sign $ with the international currency sign ¤, or replacing the left brace { with the letter s with caron š.

One consequence of this is that the MIME charset name for ASCII text is "US-ASCII", despite the redundancy, because many people expect "ASCII" alone to mean whatever national variation they are used to.

But it gets worse: there are proprietary variations on ASCII which are commonly called "ASCII" but aren't, including dozens of 8-bit so-called "extended ASCII" character sets, which is where the problems *really* pile up. Back in the 1980s and early 1990s people invariably called these "ASCII", no matter that they used 8 bits and contained anything up to 256 characters.

Just because somebody calls something "ASCII",
Re: ASCII and Unicode [was Re: Managing Google Groups headaches]
Steven D'Aprano steve+comp.lang.python at pearwood.info writes:

> Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about encodings and character sets. It doesn't just assume things are ASCII, but makes a half-hearted attempt to be charset-aware, but badly. I can only imagine that it was written back in the Dark Ages

Indeed. The basic codebase probably goes back 20 years. I'm posting this from gmane, just so people don't think I'm a total luddite.

> When transmitting ASCII characters, the networking protocol could include various start and stop bits and parity codes. A single 7-bit ASCII character might be anything up to 12 bits in length on the wire.

Not to mention that some really old hardware used 1.5 stop bits!

-- 
https://mail.python.org/mailman/listinfo/python-list
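For anyone who hasn't met serial framing, here is a rough sketch of how a 7-bit character grows on the wire. The framing is an assumption picked for illustration (one start bit, seven data bits sent least significant bit first, one parity bit, then the stop bits); real links varied, and frame_7bit is a made-up helper, not any library's API.

```python
def frame_7bit(ch, parity="even", stop_bits=1):
    """Return the list of bits sent for one 7-bit character.

    Assumed framing: 1 start bit (0), 7 data bits least significant bit
    first, 1 parity bit, then `stop_bits` stop bits (1). A 1.5 stop bit
    setting is a duration on the line, so a whole-bit list like this
    cannot represent it.
    """
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit character: %r" % ch)
    data = [(code >> i) & 1 for i in range(7)]
    ones = sum(data)
    if parity == "even":
        parity_bit = ones % 2        # make the total count of 1s even
    elif parity == "odd":
        parity_bit = 1 - ones % 2    # make the total count of 1s odd
    else:
        raise ValueError("parity must be 'even' or 'odd'")
    return [0] + data + [parity_bit] + [1] * stop_bits


frame = frame_7bit("A")
print(frame, len(frame))
print(len(frame_7bit("A", stop_bits=2)))
```

Under these assumptions one character takes 10 bits with one stop bit and 11 with two, so the 7 bits of "ASCII" are only part of what travels.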
Re: ASCII and Unicode [was Re: Managing Google Groups headaches]
On Sat, Dec 7, 2013 at 6:00 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:

> - character 33 was permitted to be either the exclamation mark ! or the logical OR symbol |
> - consequently character 124 (vertical bar) was always displayed as a broken bar ¦, which explains why even today many keyboards show it that way
> - character 35 was permitted to be either the number sign # or the pound sign £
> - character 94 could be either a caret ^ or a logical NOT ¬

Yeah, good fun stuff. I first met several of these ambiguities in the OS/2 REXX documentation, which detailed the language's operators by specifying their byte values as well as their characters - for instance, this quote from the docs (yeah, I still have it all here):

    Note: Depending upon your Personal System keyboard and the code page you are using, you may not have the solid vertical bar to select. For this reason, REXX also recognizes the use of the split vertical bar as a logical OR symbol. Some keyboards may have both characters. If so, they are not interchangeable; only the character that is equal to the ASCII value of 124 works as the logical OR. This type of mismatch can also cause the character on your screen to be different from the character on your keyboard.

(The front material on the docs says "(C) Copyright IBM Corp. 1987, 1994. All Rights Reserved.")

It says "ASCII value" where on this list we would be more likely to call it "byte value", and I'd prefer to say "represented by" rather than "equal to", but nonetheless, this is still clearly distinguishing characters and bytes. The language spec is on characters, but ultimately the interpreter is going to be looking at bytes, so when there's a problem, it's byte 124 that's the one defined as logical OR.

Oh, and note the copyright date. The byte/char distinction isn't new.

ChrisA
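A small sketch of that character-versus-byte distinction in Python 3 terms. The choice of cp1252 is an assumption made purely for illustration (an OS/2 machine would have used a different code page), and is_logical_or is a hypothetical stand-in for what an interpreter does when it looks at raw bytes; it is not taken from REXX.

```python
LOGICAL_OR_BYTE = 124  # the byte value the quoted docs single out


def is_logical_or(byte):
    """Hypothetical byte-level check, as an interpreter might perform it."""
    return byte == LOGICAL_OR_BYTE


solid_bar = "\u007c"   # VERTICAL LINE
split_bar = "\u00a6"   # BROKEN BAR

# Two similar-looking characters, two different bytes in cp1252.
solid_bytes = solid_bar.encode("cp1252")
split_bytes = split_bar.encode("cp1252")

print(solid_bytes, is_logical_or(solid_bytes[0]))
print(split_bytes, is_logical_or(split_bytes[0]))

# The split bar has no representation in 7-bit ASCII at all.
try:
    split_bar.encode("ascii")
except UnicodeEncodeError as exc:
    print("no ASCII byte for the split bar:", exc.reason)
```

Whatever the keyboard or the screen shows, only the byte decides which operator the program sees.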
Re: ASCII and Unicode [was Re: Managing Google Groups headaches]
On Sat, Dec 7, 2013 at 1:33 PM, rusi rustompm...@gmail.com wrote:

> That seems to suggest that something is not right with the python mailing list config. No??

If in doubt, blame someone else, eh?

I'd first check what your browser's actually sending. Firebug will help there. See if your form fill-out is encoded as UTF-8 or CP-1252. That's the first step.

ChrisA
Re: ASCII and Unicode [was Re: Managing Google Groups headaches]
On 07/12/2013 02:41, Chris Angelico wrote:

> On Sat, Dec 7, 2013 at 1:33 PM, rusi rustompm...@gmail.com wrote:
>> That seems to suggest that something is not right with the python mailing list config. No??
>
> If in doubt, blame someone else, eh?
>
> I'd first check what your browser's actually sending. Firebug will help there. See if your form fill-out is encoded as UTF-8 or CP-1252. That's the first step.

Looking back through the thread, it looks like:

Roy posted a reply in us-ascii.
rusi replied in windows-1252, adding the '…'.
Roy replied in us-ascii, but with 'Š' in place of '…'.
rusi replied in utf-8, with '�' in place of '…'.
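For reference, a minimal sketch of how the same ellipsis looks as bytes under the two encodings in play, and what comes out when the bytes are decoded with the wrong codec. It shows one plausible way to end up with U+FFFD; it does not claim to reproduce what any particular newsreader or web interface in this thread actually did.

```python
ELLIPSIS = "\u2026"

# The same character is a different byte sequence in each encoding.
cp1252_bytes = ELLIPSIS.encode("cp1252")   # the single byte 0x85
utf8_bytes = ELLIPSIS.encode("utf-8")      # three bytes

# CP-1252 bytes read as UTF-8: 0x85 is not valid there, so a lenient
# decoder substitutes U+FFFD, the replacement character.
replaced = cp1252_bytes.decode("utf-8", errors="replace")

# UTF-8 bytes read as CP-1252: every byte maps to *some* character,
# so the result is classic mojibake rather than an error.
mojibake = utf8_bytes.decode("cp1252")

print(cp1252_bytes, utf8_bytes)
print(ascii(replaced), ascii(mojibake))

# The byte 0x85 under a few other 8-bit code pages.
for codec in ("cp1252", "cp437", "mac_roman", "latin-1"):
    print(codec, ascii(b"\x85".decode(codec)))

# Plain ASCII has no byte 0x85 at all.
try:
    cp1252_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    print("ascii:", exc.reason)
```

The bytes never change in any of these steps; only the codec used to interpret them does, which is why a missing or ignored charset header is enough to cause the damage.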
Re: ASCII and Unicode [was Re: Managing Google Groups headaches]
On Saturday, December 7, 2013 8:11:45 AM UTC+5:30, Chris Angelico wrote:

> On Sat, Dec 7, 2013 at 1:33 PM, rusi wrote:
>> That seems to suggest that something is not right with the python mailing list config. No??
>
> If in doubt, blame someone else, eh?
>
> I'd first check what your browser's actually sending. Firebug will help there. See if your form fill-out is encoded as UTF-8 or CP-1252. That's the first step.

If you give me some tip where to look, I'll do that. But I don't see what this has to do with forms.

Everything in the python archive (not just my posts) shows as Win 1252 [I checked about 6].

Every other page that I checked (most having nothing to do with python list, GG etc.) shows UTF-8 [I checked about 5].

None of these checks involved forms to be filled.
Re: ASCII and Unicode [was Re: Managing Google Groups headaches]
On Sat, Dec 7, 2013 at 2:16 PM, rusi rustompm...@gmail.com wrote:

> On Saturday, December 7, 2013 8:11:45 AM UTC+5:30, Chris Angelico wrote:
>> On Sat, Dec 7, 2013 at 1:33 PM, rusi wrote:
>>> That seems to suggest that something is not right with the python mailing list config. No??
>>
>> If in doubt, blame someone else, eh?
>>
>> I'd first check what your browser's actually sending. Firebug will help there. See if your form fill-out is encoded as UTF-8 or CP-1252. That's the first step.
>
> If you give me some tip where to look, I'll do that. But I don't see what this has to do with forms.

Page encodings specify what comes from the server to your browser. Your post went the other way. Tracing the data going back to the server would tell you how it's encoded.

ChrisA
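Once the outgoing request has been captured, a rough check along these lines would tell the two cases apart. This is only a sketch: the sample payloads are invented, guess_submission_encoding is a hypothetical helper, and the test is a heuristic (bytes that fail strict UTF-8 decoding are reported as some legacy 8-bit encoding, which may or may not be CP-1252).

```python
from urllib.parse import unquote_to_bytes


def guess_submission_encoding(raw):
    """Guess how a percent-encoded form value was encoded.

    Returns "ascii" when the bytes are pure ASCII (nothing can be told),
    "utf-8" when they decode cleanly as UTF-8, otherwise "legacy".
    """
    data = unquote_to_bytes(raw)
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "legacy"


# Invented payloads: an ellipsis sent as UTF-8 and as a single 0x85 byte.
print(guess_submission_encoding("dots%E2%80%A6"))
print(guess_submission_encoding("dots%85"))
print(guess_submission_encoding("plain+text"))
```

The encoding declared on the page you are reading says nothing about this direction; only the bytes in the request do.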