ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread Steven D'Aprano
On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:

 Evidently (and completely inadvertently) this exchange has just
 illustrated one of the inadmissible assumptions:
 
 unicode as a medium is universal in the same way that ASCII used to be

Ironically, your post was not Unicode.

Seriously. I am 100% serious.

Your post was sent using a legacy encoding, Windows-1252, also known as 
CP-1252, which is most certainly *not* Unicode. Whatever software you 
used to send the message correctly flagged it with a charset header:

Content-Type: text/plain; charset=windows-1252

Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle 
encodings correctly (or at all!). It screws up the encoding, then sends a 
reply with no charset line at all. This is one bug that cannot be blamed 
on Google Groups -- or on Unicode.
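
For illustration, here is a minimal sketch of what honouring that header looks like with Python's standard email module. Only the Content-Type line comes from this thread; the message body and the Content-Transfer-Encoding header are made up for the example.

```python
import email

# Hypothetical raw message: only the Content-Type header is from the thread.
raw = (
    b"Content-Type: text/plain; charset=windows-1252\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"wait for it\x85\r\n"
)

msg = email.message_from_bytes(raw)
charset = msg.get_content_charset()      # declared charset, lowercased
payload = msg.get_payload(decode=True)   # body as raw bytes, not yet text
text = payload.decode(charset)           # decode using the declared charset
```

A client that skips the last step, or drops the charset when replying, leaves the next reader guessing what byte 0x85 was meant to be.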


 I wrote a number of ellipsis characters ie codepoint 2026 as in:

Actually you didn't. You wrote a number of ellipsis characters, hex byte 
\x85 (decimal 133), in the CP1252 charset. That happens to be mapped to 
code point U+2026 in Unicode, but the two are as distinct as ASCII and 
EBCDIC.
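
The difference between the code point and the bytes used to carry it can be checked in a few lines of Python. This is just a sketch using the standard codecs; the variable names are mine.

```python
ellipsis = "\u2026"  # HORIZONTAL ELLIPSIS, the Unicode code point

cp1252_bytes = ellipsis.encode("cp1252")  # one byte: 0x85
utf8_bytes = ellipsis.encode("utf-8")     # three bytes: 0xE2 0x80 0xA6

# The same single byte means something else under another legacy charset:
# in Latin-1, 0x85 is the C1 control character U+0085, not an ellipsis.
as_latin1 = cp1252_bytes.decode("latin-1")
```

So "byte 0x85" only means "ellipsis" once you know the charset is CP-1252.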


 Somewhere between my sending and your quoting those ellipses became the
 replacement character FFFD

Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about 
encodings and character sets. It doesn't just assume things are ASCII; 
it makes a half-hearted attempt to be charset-aware, and does it badly. I 
can only imagine that it was written back in the Dark Ages, when there 
were a lot of different charsets in use but no conventions for specifying 
which one was in use. Or perhaps the author was smoking crack while coding.


 Leaving aside whose fault this is (very likely buggy google groups),
 this mojibaking cannot happen if the assumption "All text is ASCII" were
 to uniformly hold.

This is incorrect. People forget that ASCII has evolved since the first 
version of the standard in 1963. There have actually been five versions 
of the ASCII standard, plus one unpublished version. (And that's not 
including the things which are frequently called ASCII but aren't.)

ASCII-1963 didn't even include lowercase letters. It was also missing some 
graphic characters like braces, and included at least two characters no 
longer used, the up-arrow and left-arrow. The control characters were 
also significantly different from today's.

ASCII-1965 was unpublished and unused. I don't know the details of what 
it changed.

ASCII-1967 is a lot closer to the ASCII in use today. It made 
considerable changes to the control characters, moving, adding, removing, 
or renaming at least half a dozen control characters. It officially added 
lowercase letters, braces, and some others. It replaced the up-arrow 
character with the caret and the left-arrow with the underscore. It was 
ambiguous, allowing variations and substitutions, e.g.:

- character 33 was permitted to be either the exclamation 
  mark ! or the logical OR symbol |

- consequently character 124 (vertical bar) was always 
  displayed as a broken bar ¦, which explains why even today
  many keyboards show it that way

- character 35 was permitted to be either the number sign # or 
  the pound sign £

- character 94 could be either a caret ^ or a logical NOT ¬

Even the humble comma could be pressed into service as a cedilla.

ASCII-1968 didn't change any characters, but allowed the use of LF on its 
own. Previously, you had to use either LF/CR or CR/LF as newline.

ASCII-1977 removed the ambiguities from the 1967 standard.

The most recent version is ASCII-1986 (also known as ANSI X3.4-1986). 
Unfortunately I haven't been able to find out what changes were made -- I 
presume they were minor, and didn't affect the character set.

So as you can see, even with actual ASCII, you can have mojibake. It's 
just not normally called that. But if you are given an arbitrary ASCII 
file of unknown age, containing code 94, how can you be sure it was 
intended as a caret rather than a logical NOT symbol? You can't.

Then there are at least 30 official variations of ASCII, strictly 
speaking part of ISO-646. These 7-bit codes were commonly called ASCII 
by their users, despite the differences, e.g. replacing the dollar sign $ 
with the international currency sign ¤, or replacing the left brace 
{ with the letter s with caron š.
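
To make the ambiguity concrete, here is a small sketch. The tables below are hypothetical: they encode only the two substitutions mentioned above and are not complete or authoritative definitions of any ISO 646 variant.

```python
# Illustrative only: each table lists just the substitutions named in the
# text, keyed by 7-bit code. Everything else falls back to US-ASCII.
US_ASCII = {}
CURRENCY_VARIANT = {0x24: "\u00a4"}  # position of $ shows the currency sign
CARON_VARIANT = {0x7B: "\u0161"}     # position of { shows s with caron


def decode_7bit(data, overrides):
    """Decode 7-bit data, applying one variant's character substitutions."""
    out = []
    for b in data:
        if b > 0x7F:
            raise ValueError(f"byte {b:#x} is not 7-bit")
        out.append(overrides.get(b, chr(b)))
    return "".join(out)


data = b"cost: $5 {x}"
as_us = decode_7bit(data, US_ASCII)
as_currency = decode_7bit(data, CURRENCY_VARIANT)
as_caron = decode_7bit(data, CARON_VARIANT)
```

The same twelve bytes give three different texts, and nothing in the bytes themselves says which reading was intended.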

One consequence of this is that the MIME charset name for ASCII text is 
"US-ASCII", despite the redundancy, because many people expect "ASCII" 
alone to mean whatever national variation they are used to.

But it gets worse: there are proprietary variations on ASCII which are 
commonly called "ASCII" but aren't, including dozens of 8-bit so-called 
"extended ASCII" character sets, which is where the problems *really* 
pile up. Back in the 1980s and early 1990s people invariably called 
these "ASCII" even though they used 8 bits and contained anything up 
to 256 characters.

Just because somebody calls something ASCII, 

Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread Gene Heskett
On Friday 06 December 2013 14:30:06 Steven D'Aprano did opine:

 On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:
  Evidently (and completely inadvertently) this exchange has just
  illustrated one of the inadmissible assumptions:
  
  unicode as a medium is universal in the same way that ASCII used to
  be
 
 Ironically, your post was not Unicode.
 
 Seriously. I am 100% serious.
 
 Your post was sent using a legacy encoding, Windows-1252, also known as
 CP-1252, which is most certainly *not* Unicode. Whatever software you
 used to send the message correctly flagged it with a charset header:
 
 Content-Type: text/plain; charset=windows-1252
 
 Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle
 encodings correctly (or at all!), it screws up the encoding then sends a
 reply with no charset line at all. This is one bug that cannot be blamed
 on Google Groups -- or on Unicode.
 
  I wrote a number of ellipsis characters ie codepoint 2026 as in:
 Actually you didn't. You wrote a number of ellipsis characters, hex byte
 \x85 (decimal 133), in the CP1252 charset. That happens to be mapped to
 code point U+2026 in Unicode, but the two are as distinct as ASCII and
 EBCDIC.
 
  Somewhere between my sending and your quoting those ellipses became
  the replacement character FFFD
 
 Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about
 encodings and character sets. It doesn't just assume things are ASCII,
 but makes a half-hearted attempt to be charset-aware, but badly. I can
 only imagine that it was written back in the Dark Ages where there were
 a lot of different charsets in use but no conventions for specifying
 which charset was in use. Or perhaps the author was smoking crack while
 coding.
 
  Leaving aside whose fault this is (very likely buggy google groups),
  this mojibaking cannot happen if the assumption All text is ASCII
  were to uniformly hold.
 
 This is incorrect. People forget that ASCII has evolved since the first
 version of the standard in 1963. There have actually been five versions
 of the ASCII standard, plus one unpublished version. (And that's not
 including the things which are frequently called ASCII but aren't.)
 
 ASCII-1963 didn't even include lowercase letters. It is also missing
 some graphic characters like braces, and included at least two
 characters no longer used, the up-arrow and left-arrow. The control
 characters were also significantly different from today.
 
 ASCII-1965 was unpublished and unused. I don't know the details of what
 it changed.
 
 ASCII-1967 is a lot closer to the ASCII in use today. It made
 considerable changes to the control characters, moving, adding,
 removing, or renaming at least half a dozen control characters. It
 officially added lowercase letters, braces, and some others. It
 replaced the up-arrow character with the caret and the left-arrow with
 the underscore. It was ambiguous, allowing variations and
 substitutions, e.g.:
 
 - character 33 was permitted to be either the exclamation
   mark ! or the logical OR symbol |
 
 - consequently character 124 (vertical bar) was always
   displayed as a broken bar ¦, which explains why even today
   many keyboards show it that way
 
 - character 35 was permitted to be either the number sign # or
   the pound sign £
 
 - character 94 could be either a caret ^ or a logical NOT ¬
 
 Even the humble comma could be pressed into service as a cedilla.
 
 ASCII-1968 didn't change any characters, but allowed the use of LF on
 its own. Previously, you had to use either LF/CR or CR/LF as newline.
 
 ASCII-1977 removed the ambiguities from the 1967 standard.
 
 The most recent version is ASCII-1986 (also known as ANSI X3.4-1986).
 Unfortunately I haven't been able to find out what changes were made --
 I presume they were minor, and didn't affect the character set.
 
 So as you can see, even with actual ASCII, you can have mojibake. It's
 just not normally called that. But if you are given an arbitrary ASCII
 file of unknown age, containing code 94, how can you be sure it was
 intended as a caret rather than a logical NOT symbol? You can't.
 
 Then there are at least 30 official variations of ASCII, strictly
 speaking part of ISO-646. These 7-bit codes were commonly called ASCII
 by their users, despite the differences, e.g. replacing the dollar sign
 $ with the international currency sign ¤, or replacing the left brace
 { with the letter s with caron š.
 
 One consequence of this is that the MIME type for ASCII text is called
 US ASCII, despite the redundancy, because many people expect ASCII
 alone to mean whatever national variation they are used to.
 
 But it gets worse: there are proprietary variations on ASCII which are
 commonly called ASCII but aren't, including dozens of 8-bit so-called
 extended ASCII character sets, which is where the problems *really*
 pile up. Invariably back in the 1980s and early 1990s people used to
 call these 

Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread Roy Smith
Steven D'Aprano steve+comp.lang.python at pearwood.info writes:

 Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about 
 encodings and character sets. It doesn't just assume things are ASCII, 
 but makes a half-hearted attempt to be charset-aware, but badly. I can 
 only imagine that it was written back in the Dark Ages

Indeed.  The basic codebase probably goes back 20 years.  I'm posting this
from gmane, just so people don't think I'm a total luddite.

 When transmitting ASCII characters, the networking protocol could include 
 various start and stop bits and parity codes. A single 7-bit ASCII 
 character might be anything up to 12 bits in length on the wire.

Not to mention that some really old hardware used 1.5 stop bits!
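
The framing arithmetic is easy to sketch. The default values below (one start bit, one parity bit, two stop bits) are assumptions picked for illustration, not a description of any particular piece of hardware.

```python
def frame_bits(data_bits=7, start_bits=1, parity_bits=1, stop_bits=2.0):
    """Total bit times needed to send one character on an async serial line."""
    return start_bits + data_bits + parity_bits + stop_bits


def chars_per_second(baud, **framing):
    """Characters per second at a given line rate, for the given framing."""
    return baud / frame_bits(**framing)


two_stop = frame_bits()               # 1 + 7 + 1 + 2
odd_stop = frame_bits(stop_bits=1.5)  # 1 + 7 + 1 + 1.5
rate = chars_per_second(110)          # 110 baud with the default framing
```

With these assumed settings a 7-bit character costs 11 bit times on the wire, or 10.5 with 1.5 stop bits.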


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread Chris Angelico
On Sat, Dec 7, 2013 at 6:00 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 - character 33 was permitted to be either the exclamation
   mark ! or the logical OR symbol |

 - consequently character 124 (vertical bar) was always
   displayed as a broken bar ¦, which explains why even today
   many keyboards show it that way

 - character 35 was permitted to be either the number sign # or
   the pound sign £

 - character 94 could be either a caret ^ or a logical NOT ¬

Yeah, good fun stuff. I first met several of these ambiguities in the
OS/2 REXX documentation, which detailed the language's operators by
specifying their byte values as well as their characters - for
instance, this quote from the docs (yeah, I still have it all here):


Note:   Depending upon your Personal System keyboard and the code page
you are using, you may not have the solid vertical bar to select. For
this reason, REXX also recognizes the use of the split vertical bar as
a logical OR symbol. Some keyboards may have both characters. If so,
they are not interchangeable; only the character that is equal to the
ASCII value of 124 works as the logical OR. This type of mismatch can
also cause the character on your screen to be different from the
character on your keyboard.

(The front material on the docs says (C) Copyright IBM Corp. 1987,
1994. All Rights Reserved.)

It says "ASCII value" where on this list we would be more likely to
call it "byte value", and I'd prefer to say "represented by" rather
than "equal to", but nonetheless, this is still clearly distinguishing
characters and bytes. The language spec is on characters, but
ultimately the interpreter is going to be looking at bytes, so when
there's a problem, it's byte 124 that's the one defined as logical OR.
Oh, and note the copyright date. The byte/char distinction isn't new.

ChrisA


Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread rusi
On Saturday, December 7, 2013 12:30:18 AM UTC+5:30, Steven D'Aprano wrote:
 On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:

  Evidently (and completely inadvertently) this exchange has just
  illustrated one of the inadmissible assumptions:
  unicode as a medium is universal in the same way that ASCII used to be

 Ironically, your post was not Unicode.

 Seriously. I am 100% serious.

 Your post was sent using a legacy encoding, Windows-1252, also known as 
 CP-1252, which is most certainly *not* Unicode. Whatever software you 
 used to send the message correctly flagged it with a charset header:

 Content-Type: text/plain; charset=windows-1252

 Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle 
 encodings correctly (or at all!), it screws up the encoding then sends a 
 reply with no charset line at all. This is one bug that cannot be blamed 
 on Google Groups -- or on Unicode.

  I wrote a number of ellipsis characters ie codepoint 2026 as in:

 Actually you didn't. You wrote a number of ellipsis characters, hex byte 
 \x85 (decimal 133), in the CP1252 charset. That happens to be mapped to 
 code point U+2026 in Unicode, but the two are as distinct as ASCII and 
 EBCDIC.

  Somewhere between my sending and your quoting those ellipses became the
  replacement character FFFD

 Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about 
 encodings and character sets. It doesn't just assume things are ASCII, 
 but makes a half-hearted attempt to be charset-aware, but badly. I can 
 only imagine that it was written back in the Dark Ages where there were a 
 lot of different charsets in use but no conventions for specifying which 
 charset was in use. Or perhaps the author was smoking crack while coding.

  Leaving aside whose fault this is (very likely buggy google groups),
  this mojibaking cannot happen if the assumption All text is ASCII were
  to uniformly hold.

 This is incorrect. People forget that ASCII has evolved since the first 
 version of the standard in 1963. There have actually been five versions 
 of the ASCII standard, plus one unpublished version. (And that's not 
 including the things which are frequently called ASCII but aren't.)

 ASCII-1963 didn't even include lowercase letters. It is also missing some 
 graphic characters like braces, and included at least two characters no 
 longer used, the up-arrow and left-arrow. The control characters were 
 also significantly different from today.

 ASCII-1965 was unpublished and unused. I don't know the details of what 
 it changed.

 ASCII-1967 is a lot closer to the ASCII in use today. It made 
 considerable changes to the control characters, moving, adding, removing, 
 or renaming at least half a dozen control characters. It officially added 
 lowercase letters, braces, and some others. It replaced the up-arrow 
 character with the caret and the left-arrow with the underscore. It was 
 ambiguous, allowing variations and substitutions, e.g.:

 - character 33 was permitted to be either the exclamation 
   mark ! or the logical OR symbol |

 - consequently character 124 (vertical bar) was always 
   displayed as a broken bar ¦, which explains why even today
   many keyboards show it that way

 - character 35 was permitted to be either the number sign # or 
   the pound sign £

 - character 94 could be either a caret ^ or a logical NOT ¬

 Even the humble comma could be pressed into service as a cedilla.

 ASCII-1968 didn't change any characters, but allowed the use of LF on its 
 own. Previously, you had to use either LF/CR or CR/LF as newline.

 ASCII-1977 removed the ambiguities from the 1967 standard.

 The most recent version is ASCII-1986 (also known as ANSI X3.4-1986). 
 Unfortunately I haven't been able to find out what changes were made -- I 
 presume they were minor, and didn't affect the character set.

 So as you can see, even with actual ASCII, you can have mojibake. It's 
 just not normally called that. But if you are given an arbitrary ASCII 
 file of unknown age, containing code 94, how can you be sure it was 
 intended as a caret rather than a logical NOT symbol? You can't.

 Then there are at least 30 official variations of ASCII, strictly 
 speaking part of ISO-646. These 7-bit codes were commonly called ASCII 
 by their users, despite the differences, e.g. replacing the dollar sign $ 
 with the international currency sign ¤, or replacing the left brace 
 { with the letter s with caron š.

 One consequence of this is that the MIME type for ASCII text is called 
 US ASCII, despite the redundancy, because many people expect ASCII 
 alone to mean whatever national variation they are used to.

 But it gets worse: there are proprietary variations on ASCII which are 
 commonly called ASCII but aren't, including dozens of 8-bit so-called 
 extended ASCII character sets, which is where the problems *really* 
 pile up. Invariably back in the 1980s and early 1990s people used 

Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread Chris Angelico
On Sat, Dec 7, 2013 at 1:33 PM, rusi rustompm...@gmail.com wrote:
 That seems to suggest that something is not right with the python
 mailing list config. No??

If in doubt, blame someone else, eh?

I'd first check what your browser's actually sending. Firebug will
help there. See if your form fill-out is encoded as UTF-8 or CP-1252.
That's the first step.

ChrisA


Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread MRAB

On 07/12/2013 02:41, Chris Angelico wrote:

On Sat, Dec 7, 2013 at 1:33 PM, rusi rustompm...@gmail.com wrote:

That seems to suggest that something is not right with the python
mailing list config. No??


If in doubt, blame someone else, eh?

I'd first check what your browser's actually sending. Firebug will
help there. See if your form fill-out is encoded as UTF-8 or CP-1252.
That's the first step.


Looking back through the thread, it looks like:

Roy posted a reply in us-ascii.

rusi replied in windows-1252, adding the '…'.

Roy replied in us-ascii, but with 'Š' in place of '…'.

rusi replied in utf-8, with '�' in place of '…'
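
One way a lone CP-1252 ellipsis byte can end up as U+FFFD can be sketched in Python. This shows a general mechanism only; it is not a reconstruction of what any of the clients in this thread actually did, and it does not explain the 'Š' step.

```python
sent = "\u2026".encode("cp1252")  # b'\x85', as in the windows-1252 post

# Byte 0x85 is outside ASCII, and on its own it is not valid UTF-8 either,
# so a decoder asked to "replace" errors substitutes U+FFFD.
as_ascii = sent.decode("ascii", errors="replace")
as_utf8 = sent.decode("utf-8", errors="replace")

# Once the text is re-sent as UTF-8, the replacement character is real data.
resent = as_ascii.encode("utf-8")
```

After that last step the original byte is gone, and no amount of charset guessing downstream will bring the ellipsis back.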



Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread rusi
On Saturday, December 7, 2013 8:11:45 AM UTC+5:30, Chris Angelico wrote:
 On Sat, Dec 7, 2013 at 1:33 PM, rusi  wrote:
  That seems to suggest that something is not right with the python
  mailing list config. No??

 If in doubt, blame someone else, eh?

 I'd first check what your browser's actually sending. Firebug will
 help there. See if your form fill-out is encoded as UTF-8 or CP-1252.
 That's the first step.

If you give me a tip on where to look, I'll do that.
But I don't see what this has to do with forms.

Everything in the python archive (not just my posts) shows as Win 1252
[I checked about 6]

Every other page that I checked (most nothing to do with python list,
GG etc) shows UTF-8. [I checked about 5]

None of the pages I checked had forms to be filled.


Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

2013-12-06 Thread Chris Angelico
On Sat, Dec 7, 2013 at 2:16 PM, rusi rustompm...@gmail.com wrote:
 On Saturday, December 7, 2013 8:11:45 AM UTC+5:30, Chris Angelico wrote:
 On Sat, Dec 7, 2013 at 1:33 PM, rusi  wrote:
  That seems to suggest that something is not right with the python
  mailing list config. No??

 If in doubt, blame someone else, eh?

 I'd first check what your browser's actually sending. Firebug will
 help there. See if your form fill-out is encoded as UTF-8 or CP-1252.
 That's the first step.

 If you give me a tip on where to look, I'll do that.
 But I don't see what this has to do with forms.


Page encodings specify what comes from the server to your browser.
Your post went the other way. Tracing the data going back to the
server would tell you how it's encoded.
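
As a rough sketch of what to look for in the traced request: the same character, percent-encoded the way a form submission would carry it, looks quite different depending on the encoding used. This uses urllib.parse.quote from the standard library; what a given browser actually sends is exactly the thing that has to be checked.

```python
from urllib.parse import quote

# The ellipsis as it would appear percent-encoded in a submitted form,
# under each of the two candidate encodings.
as_utf8 = quote("\u2026", encoding="utf-8")
as_cp1252 = quote("\u2026", encoding="cp1252")
```

Three escaped bytes (%E2%80%A6) would point to UTF-8, while a single %85 would point to CP-1252.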

ChrisA