On Tue, 9 Aug 2011, Tony Mechelynck wrote:
On 07/08/11 17:57, Benjamin R. Haskell wrote:
That means that, in the old thread { å, æ, ø, «, » } and in the new
thread { ¶ } were all replaced by �.
In this message of yours (which I received in quoted-printable UTF-8)
all these characters arrived (AFAICT) correct: a-ball, ae-ligature,
o-bar, open-French-quote, close-French-quote, Pilcrow-mark, and, at
the end, i-diaeresis, Spanish-inverted-question-mark, one-half.
Yep. As input.
All that said, it's unclear how 0xB6 was misinterpreted as
0xC5,0x9B... But, alas. Unless you have good reason to stick to
explicit Latin-1, you're probably better off using UTF-8. In the
current HTML specs⁵, for example, even stating that something is
ISO-8859-1 is now *intentionally* treated as CP1252 (Microsoft's
version of Latin-1). So, the number of places in which using
ISO-8859-1 instead of UTF-8 will bite you is only going to increase.
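
One possible explanation, offered as an assumption rather than anything confirmed in this thread: if some hop decoded the byte 0xB6 as ISO-8859-2 instead of ISO-8859-1, it would become U+015B (s with acute), and the UTF-8 encoding of that character is exactly 0xC5,0x9B. A minimal Python sketch of that hypothetical path:

```python
# Hypothetical reconstruction: decode 0xB6 with the wrong charset
# (ISO-8859-2 instead of ISO-8859-1), then re-encode as UTF-8.
raw = b"\xb6"

as_latin1 = raw.decode("iso-8859-1")  # intended: PILCROW SIGN, U+00B6
as_latin2 = raw.decode("iso-8859-2")  # mistaken: U+015B, s with acute

reencoded = as_latin2.encode("utf-8")

# Print code points and bytes in hex so the output is ASCII-only.
print(hex(ord(as_latin1)), hex(ord(as_latin2)), reencoded.hex())
```

Whether a Latin-2 decode actually happened anywhere along the way is a guess; the sketch only shows that the byte values are consistent with it.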
The only difference between ISO-8859-1 and Windows-1252 is that in the
former, 0x80 to 0x9F are non-printing control characters (which I
don't use), while in the latter most of them are printable characters
(for which I use UTF-8 if I need them: in fact, my mailer is set to
fall back to UTF-8 if the message contains characters not supported by
the charset in which I would otherwise send it). In ISO-8859-15
(another common replacement for Latin1) 0x80 to 0x9F are the same
nonprinting controls, but some of 0xA0 to 0xBF are /different/
printing characters, to wit, the Euro sign €, the French oe and OE
ligatures œ Œ, the uppercase Y-diaeresis Ÿ, and the upper- and
lowercase S-caron Š š and Z-caron Ž ž.
Right, of course. I was thinking -15 when writing -1.
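
The differences between these three charsets can be checked byte by byte. The following is a rough sketch using Python's bundled codecs (the variable names are mine, and Python's "cp1252" table is assumed to be a fair stand-in for Windows-1252):

```python
# Compare how each single byte decodes in ISO-8859-1 vs ISO-8859-15.
differences = {}
for value in range(0x100):
    byte = bytes([value])
    latin1 = byte.decode("iso-8859-1")
    latin9 = byte.decode("iso-8859-15")
    if latin1 != latin9:
        differences[value] = (latin1, latin9)

# Count the bytes in 0x80..0x9F that cp1252 maps to something other
# than the C1 control character ISO-8859-1 puts at the same position.
cp1252_printable = 0
for value in range(0x80, 0xA0):
    try:
        char = bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        continue  # Python's cp1252 table leaves a few bytes undefined
    if char != chr(value):
        cp1252_printable += 1

print([hex(v) for v in sorted(differences)], cp1252_printable)
```

With Python's tables this reports eight differing bytes between -1 and -15 (0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE) and 27 reassigned bytes in the 0x80 to 0x9F range for cp1252.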
One advantage of Latin1 over UTF-8 is that it uses one byte rather
than two for every codepoint in the range [U+0080-U+00FF]. That may or
may not be much of an advantage depending on the proportion of
non-ASCII characters in a "Western-text" message. IOW it would be
"least" advantageous for English text.
So, pros: possibly, maybe saves a couple of bytes.
Cons: is more likely to be misinterpreted.
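
To put a number on the size trade-off, here is a small sketch; `encoded_sizes` is a hypothetical helper and the sample strings are only illustrations:

```python
def encoded_sizes(text):
    """Return (latin1_bytes, utf8_bytes) for text that fits in Latin-1."""
    return len(text.encode("iso-8859-1")), len(text.encode("utf-8"))

# Pure-ASCII English text: both encodings produce the same bytes.
english = "Unless you have good reason to stick to explicit Latin-1"

# French word with two non-ASCII characters out of three (e-acute, t, e-acute).
french = "\u00e9t\u00e9"

print(encoded_sizes(english))
print(encoded_sizes(french))
```

The UTF-8 overhead is one extra byte per character in [U+0080-U+00FF], so it stays at zero for plain English and grows with the share of accented letters.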
I'll send this reply in UTF-8, just to see if it makes a difference. I
also checked my character-encoding preferences, and changed the
"encoding to use when replying" from ISO-8859-1 to "whatever the
sender used" (subject, in both cases, to UTF-8 fallback if the message
text doesn't fit). If it isn't good enough I'll change it again.
Seems properly encoded.
As for HTML specs, last time I checked they didn't apply to email,
My point wasn't about HTML or email; it was about the outmoded nature of
ISO-8859-n ∀n ∈ { x | x ≥ 1 & x ≤ 15 }. (For all n belonging to the
set { x, where x >= 1 and x <= 15 }, if your font's missing any of those
chars.)
UTF-8 can encode anything in any of those charsets and has fewer
interoperability problems, so it is virtually always preferable (at this
point).
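
The "can encode anything in any of those charsets" part is easy to spot-check. A sketch, assuming Python's bundled `iso8859_N` codecs are faithful to the standards (ISO-8859-12 is skipped because it was never published); `utf8_round_trip_failures` is a hypothetical helper name:

```python
# Parts of ISO-8859 in the range 1..15 that actually exist.
PARTS = [n for n in range(1, 16) if n != 12]

def utf8_round_trip_failures():
    """List (codec, byte) pairs whose character does not survive UTF-8."""
    failures = []
    for n in PARTS:
        codec = f"iso8859_{n}"
        for value in range(0x100):
            try:
                char = bytes([value]).decode(codec)
            except UnicodeDecodeError:
                continue  # byte is unassigned in this part
            if char.encode("utf-8").decode("utf-8") != char:
                failures.append((codec, value))
    return failures

print(utf8_round_trip_failures())
```

An empty list means every assigned byte in every checked part has a UTF-8 representation that decodes back to the same character.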
and it's email which gives me problems; with HTML I usually have no
problem, except when the page is badly set up, let's say a page sent
in some bizarre charset with no charset mentioned in an HTML
Content-Type header and also not in any <meta
http-equiv="Content-Type"> element.
Part of the reason you usually have no problem is that browsers have a
long "tradition" of having to be better at guessing the proper encoding
in the face of bad data (hence HTML is the first major spec [AFAIK] to
break from accepting what's provided as charset).
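
As a rough illustration of what "break from accepting what's provided" means in practice, here is a sketch of a decoder that applies the override. `LABEL_OVERRIDES` and `decode_with_override` are hypothetical names, and only the one override mentioned in this thread is listed; a real browser's table is longer.

```python
# Hypothetical override table: the declared label on the left is decoded
# with the codec on the right instead.
LABEL_OVERRIDES = {"iso-8859-1": "cp1252"}

def decode_with_override(data, declared):
    """Decode bytes, swapping the declared charset if an override applies."""
    label = declared.strip().lower()
    codec = LABEL_OVERRIDES.get(label, label)
    return data.decode(codec)

# Bytes 0x93 and 0x94 are curly quotes in Windows-1252 but C1 control
# characters in ISO-8859-1.
sample = b"\x93quoted\x94"

strict = sample.decode("iso-8859-1")                  # U+0093 ... U+0094
lenient = decode_with_override(sample, "ISO-8859-1")  # U+201C ... U+201D

print([hex(ord(c)) for c in (strict[0], lenient[0])])
```

The point of the override is that pages labelled ISO-8859-1 that contain bytes in 0x80 to 0x9F almost always meant the Windows-1252 characters, not control codes.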
Oh, and about your reference 5, I thought the normative authority for HTML
was the W3C, in whose Standards I don't find what your whatwg page displays,
and sometimes even the opposite, see for instance items C030 and C076 under
"Character Model for the World Wide Web (latest revision)" which I reached
from "HTML for User Agents": namely, http://www.w3.org/TR/charmod/#C030 and
http://www.w3.org/TR/charmod/#C076
Yes, sorry. WHATWG = Web Hypertext Application Technology Working
Group. The current editor, Ian Hickson, is also the current editor of
the HTML5 spec¹, so I mistook it for official.
The official spec and my original link point out² that the character
override is a "willful violation"³ of the specs that you pointed to.
Which also points to the fact that you're only going to have more
problems in the future should you stick with ISO-8859-n.
--
Best,
Ben
¹: HTML5 spec
current: http://www.w3.org/TR/html5/parsing.html
latest draft: http://dev.w3.org/html5/spec/Overview.html
²: § 8.2.2.1 (last ¶, just above the link below)
current: http://www.w3.org/TR/html5/parsing.html#character-encodings-0
latest draft: http://dev.w3.org/html5/spec/Overview.html#character-encodings-0
³: § 1.5.2 "Compliance with other specifications"
http://www.w3.org/TR/html5/introduction.html#compliance-with-other-specifications
--
You received this message from the "vim_dev" maillist.