On 07/08/11 17:57, Benjamin R. Haskell wrote:
On Sat, 6 Aug 2011, Groups munged Tony Mechelynck's mail into:
:set list lcs=eol:ś,tab:\|_,nbsp:~,conceal:*
And he followed up:
...and for some reason that f???ing bl??dy st??id googlegroups
interface changed my Pilcrow mark to an s-acute. Well, the exact
character used there is irrelevant in this case but still, I don't
like it. The copy in my "Sent" folder is in 8bit ISO-8859-1 with the
correct Pilcrow mark; after the [me (SMTP) relay.skynet.be (ESMTP)
googlegroups.com (SMTP) gmail.com (POP3) me] round-trip it comes back
in quoted-printable UTF-8 as =C5=9B (equal Charlie Pantafayf equal
Noveniner Bravo) which means U+015B LATIN SMALL LETTER S WITH ACUTE
instead of the 0xB6 (U+00B6 PILCROW SIGN) which I had sent. Ah, why
couldn't Google simply understand that Latin1 0xB6 means Unicode U+00B6?
You don't need iconv to know that. Ah, Google pisses me off. >:-(
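For reference, the conversion described above can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are made up for illustration, and it says nothing about what Groups does internally.

```python
import quopri

# The byte that was sent, in the ISO-8859-1 copy of the message.
sent = b"\xb6"
sent_char = sent.decode("iso-8859-1")        # Latin-1 maps byte N to U+00NN
expected_utf8 = sent_char.encode("utf-8")    # what a correct conversion yields

# The quoted-printable sequence that actually came back.
received = quopri.decodestring(b"=C5=9B")
received_char = received.decode("utf-8")

print(f"sent:     U+{ord(sent_char):04X}  UTF-8 bytes {expected_utf8.hex()}")
print(f"received: U+{ord(received_char):04X}  UTF-8 bytes {received.hex()}")
```

A faithful Latin-1 to UTF-8 conversion of 0xB6 would give the bytes C2 B6, not C5 9B.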
In both this thread and the last time I discussed this¹, it appears that
the only charset that survives roundtripping to Groups when using
codepoints outside of ASCII is UTF-8.
Also as before, though, it's recipient-dependent. ZyX's response² to the
initial, munged mail seems to have it correctly quoted as:
:set list lcs=eol:¶,tab:\|_,nbsp:~,conceal:*
In the Groups web interface, all of the broken characters are replaced
(for me, using a default charset of UTF-8 everywhere) by the three
characters:
ï¿½
That means that { å, æ, ø, «, » } in the old thread and { ¶ } in the new
thread were all replaced by ï¿½.
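A three-character replacement like this is what you would see if a byte that is invalid as UTF-8 were turned into U+FFFD REPLACEMENT CHARACTER, and the UTF-8 encoding of that character were later displayed as Latin-1. The sketch below only illustrates that mechanism; whether the web interface does exactly this is an assumption.

```python
# A lone 0xB6 is not valid UTF-8, so a lenient decoder substitutes U+FFFD.
lenient = b"\xb6".decode("utf-8", errors="replace")

# U+FFFD occupies three bytes in UTF-8 ...
utf8_bytes = lenient.encode("utf-8")

# ... and each of those bytes is a printable character in ISO-8859-1.
shown_as_latin1 = utf8_bytes.decode("iso-8859-1")

print(f"U+{ord(lenient):04X}", utf8_bytes.hex(), shown_as_latin1)
```

The three resulting characters are i-diaeresis, inverted question mark and one-half.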
In this message of yours (which I received in quoted-printable UTF-8)
all these characters arrived (AFAICT) correct: a-ball, ae-ligature,
o-bar, open-French-quote, close-French-quote, Pilcrow-mark, and, at the
end, i-diaeresis, Spanish-inverted-question-mark, one-half.
ZyX appears to have received the old thread correctly, too. His response
there³ has them correctly quoted, but Ben Fritz's response⁴ indicates
that the erroneously converted characters were simply absent.
All that said, it's unclear how 0xB6 was misinterpreted as 0xC5,0x9B...
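One way to look for an explanation is to ask which single-byte charsets decode 0xB6 to U+015B. The sketch below tries a short, arbitrary list of Python codec names (the list and the function name are my own choices, not anything from the thread). A match would only be a hypothesis: nothing here confirms what Groups actually did with the message.

```python
# Candidate charsets to try; an arbitrary, hypothetical selection.
CANDIDATES = ["iso8859_1", "iso8859_2", "iso8859_15", "cp1252"]


def charsets_mapping(raw, target, candidates=CANDIDATES):
    """Return the candidate codecs under which `raw` decodes to `target`."""
    hits = []
    for name in candidates:
        try:
            if raw.decode(name) == target:
                hits.append(name)
        except UnicodeDecodeError:
            # The byte is undefined in this charset; not a match.
            pass
    return hits


# Which of the candidates turn byte 0xB6 into s-acute (U+015B)?
print(charsets_mapping(b"\xb6", "\u015b"))
```

In ISO-8859-2 (Latin-2), 0xB6 is s-acute, so reading the Latin-1 body as Latin-2 somewhere along the relay chain would produce exactly this substitution.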
But, alas. Unless you have good reason to stick to explicit Latin-1,
you're probably better off using UTF-8. In the current HTML specs⁵, for
example, even stating that something is ISO-8859-1 is now
*intentionally* treated as CP1252 (Microsoft's version of Latin-1). So,
the number of places in which using ISO-8859-1 instead of UTF-8 will
bite you is only going to increase.
The only difference between ISO-8859-1 and Windows-1252 is that in the
former, 0x80 to 0x9F are non-printing control characters (which I don't
use), while in the latter most of them are printable characters (for
which I use UTF-8 if I need them: in fact, my mailer is set to fall back
to UTF-8 if the message contains characters not supported by the charset
in which I would otherwise send it). In ISO-8859-15 (another common
replacement for Latin1) 0x80 to 0x9F are the same nonprinting controls,
but some of 0xA0 to 0xBF are /different/ printing characters, to wit,
the Euro sign €, the upper- and lowercase s-caron Š š, the upper- and
lowercase z-caron Ž ž, the French oe and OE ligatures œ Œ, and the
uppercase Y-diaeresis Ÿ.
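The differences between these charsets can be listed mechanically by decoding every high byte under two codecs and comparing. This is a sketch using Python's bundled codecs (the function name is made up); bytes that a codec leaves undefined are shown as U+FFFD.

```python
def differing_bytes(codec_a, codec_b, start=0x80, stop=0x100):
    """Map each byte value that decodes differently to its two characters."""
    diffs = {}
    for value in range(start, stop):
        raw = bytes([value])
        a = raw.decode(codec_a, errors="replace")
        b = raw.decode(codec_b, errors="replace")
        if a != b:
            diffs[value] = (a, b)
    return diffs


latin9 = differing_bytes("iso8859_1", "iso8859_15")
win1252 = differing_bytes("iso8859_1", "cp1252")

for value, (a, b) in latin9.items():
    print(f"0x{value:02X}: {a!r} in Latin-1, {b!r} in Latin-9")

print("cp1252 differs at:", ", ".join(f"0x{v:02X}" for v in win1252))
```

With Python's tables, the Latin-9 differences all fall in 0xA0 to 0xBF and the Windows-1252 differences all fall in 0x80 to 0x9F.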
One advantage of Latin1 over UTF-8 is that it uses one byte rather than
two for every codepoint in the range [U+0080-U+00FF]. That may or may
not be much of an advantage depending on the proportion of non-ASCII
characters in a "Western-text" message. IOW it would be "least"
advantageous for English text.
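To put numbers on that, one can simply encode the same text both ways and compare lengths. A small sketch follows; the function name and sample strings are arbitrary, and the text must fit in Latin-1 for the comparison to work at all.

```python
def size_comparison(text):
    """Return (bytes in ISO-8859-1, bytes in UTF-8) for the same text."""
    latin1_size = len(text.encode("iso-8859-1"))
    utf8_size = len(text.encode("utf-8"))
    return latin1_size, utf8_size


for sample in ("plain English text", "déjà vu"):
    latin1_size, utf8_size = size_comparison(sample)
    print(f"{sample!r}: Latin-1 {latin1_size} bytes, UTF-8 {utf8_size} bytes")
```

The UTF-8 overhead is exactly one byte per character in [U+0080-U+00FF], so it is zero for pure ASCII.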
I'll send this reply in UTF-8, just to see if it makes a difference. I
also checked my character-encoding preferences, and changed the
"encoding to use when replying" from ISO-8859-1 to "whatever the sender
used" (subject, in both cases, to UTF-8 fallback if the message text
doesn't fit). If it isn't good enough I'll change it again.
As for HTML specs, last time I checked they didn't apply to email, and
it's email which gives me problems; with HTML I usually have no problem,
except when the page is badly set up, let's say a page sent in some
bizarre charset with no charset mentioned in an HTTP Content-Type header
and also not in any <meta http-equiv="Content-Type"> element.
Oh, and about your reference 5, I thought the normative authority for
HTML was the W3C, in whose Standards I don't find what your whatwg page
displays, and sometimes even the opposite, see for instance items C030
and C076 under "Character Model for the World Wide Web (latest
revision)" which I reached from "HTML for User Agents": namely,
http://www.w3.org/TR/charmod/#C030 and http://www.w3.org/TR/charmod/#C076
Best regards,
Tony.
--
hundred-and-one symptoms of being an internet addict:
153. You find yourself staring at your "inbox" waiting for new e-mail
to arrive.