On Tue, 9 Aug 2011, Tony Mechelynck wrote:

On 07/08/11 17:57, Benjamin R. Haskell wrote:
That means that, in the old thread { å, æ, ø, «, » } and in the new
thread { ¶ } were all replaced by �.

In this message of yours (which I received in quoted-printable UTF-8) all these characters arrived (AFAICT) correct: a-ball, ae-ligature, o-bar, open-French-quote, close-French-quote, Pilcrow-mark, and, at the end, i-diaeresis, Spanish-inverted-question-mark, one-half.

Yep.  As input.


All that said, it's unclear how 0xB6 was misinterpreted as 0xC5,0x9B... But, alas. Unless you have good reason to stick to explicit Latin-1, you're probably better off using UTF-8. In the current HTML specs⁵, for example, even stating that something is ISO-8859-1 is now *intentionally* treated as CP1252 (Microsoft's version of Latin-1). So, the number of places in which using ISO-8859-1 instead of UTF-8 will bite you is only going to increase.
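One possible explanation for the 0xB6 to 0xC5,0x9B case, offered as a guess rather than a diagnosis: if some hop along the way read the byte as ISO-8859-2 instead of ISO-8859-1 and then re-encoded the result as UTF-8, that byte pair is what would come out. A small Python check of that hypothesis (the choice of ISO-8859-2 is an assumption; nothing in the thread confirms which software did the conversion):

```python
# Hypothesis only: a gateway decoded a Latin-1 byte as ISO-8859-2 (Latin-2)
# and then re-encoded the result as UTF-8.
raw = "\u00b6".encode("iso-8859-1")   # the pilcrow as sent in Latin-1: byte 0xB6

misread = raw.decode("iso-8859-2")    # the same byte interpreted as Latin-2
reencoded = misread.encode("utf-8")   # what a UTF-8 re-encode would then emit

print(raw.hex(), repr(misread), reencoded.hex())
```

In Latin-2, 0xB6 is ś (U+015B), whose UTF-8 form is the two bytes 0xC5 0x9B.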

The only difference between ISO-8859-1 and Windows-1252 is that in the former, 0x80 to 0x9F are non-printing control characters (which I don't use), while in the latter most of them are printable characters (for which I use UTF-8 if I need them: in fact, my mailer is set to fall back to UTF-8 if the message contains characters not supported by the charset in which I would otherwise send it). In ISO-8859-15 (another common replacement for Latin1) 0x80 to 0x9F are the same nonprinting controls, but some of 0xA0 to 0xBF are /different/ printing characters, to wit, the Euro sign €, the French oe and OE ligatures œ Œ, the uppercase Y-diaeresis Ÿ, and the upper- and lowercase s-caron Š š and z-caron Ž ž.
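The differences between the three charsets can be checked byte by byte. The following is a minimal sketch using Python's built-in codecs ("cp1252" is Python's name for Windows-1252; the helper `decode_byte` is made up for this illustration):

```python
# Compare how three charsets decode the same single byte.
def decode_byte(value, charset):
    # errors="replace" because cp1252 leaves a few bytes (e.g. 0x81) undefined
    return bytes([value]).decode(charset, errors="replace")

# 0x80-0x9F: C1 controls in Latin-1, mostly printable in Windows-1252
euro_1252 = decode_byte(0x80, "cp1252")        # the Euro sign
ctrl_latin1 = decode_byte(0x80, "iso-8859-1")  # U+0080, a control character

# Positions in 0xA0-0xFF where ISO-8859-15 departs from ISO-8859-1
latin9_diffs = {
    b: (decode_byte(b, "iso-8859-1"), decode_byte(b, "iso-8859-15"))
    for b in range(0xA0, 0x100)
    if decode_byte(b, "iso-8859-1") != decode_byte(b, "iso-8859-15")
}

for b, (l1, l9) in latin9_diffs.items():
    print(f"0x{b:02X}: Latin-1 {l1!r} -> Latin-9 {l9!r}")
```

The loop prints one line per byte value on which the two ISO charsets disagree, all of which fall between 0xA0 and 0xBF.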

Right, of course.  I was thinking -15 when writing -1.


One advantage of Latin1 over UTF-8 is that it uses one byte rather than two for every codepoint in the range [U+0080-U+00FF]. That may or may not be much of an advantage depending on the proportion of non-ASCII characters in a "Western-text" message. IOW it would be "least" advantageous for English text.

So, pros: possibly, maybe saves a couple of bytes.
Cons: is more likely to be misinterpreted.
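To put a rough number on the "couple of bytes", here is a small Python sketch that encodes the same text both ways. The sample strings are made up for illustration; real savings depend entirely on how many non-ASCII characters a message contains.

```python
# Byte cost of the same text in Latin-1 and in UTF-8.
samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Danish": "Bl\u00e5b\u00e6rgr\u00f8d p\u00e5 en s\u00f8ndag",
}

sizes = {}
for label, text in samples.items():
    latin1 = len(text.encode("iso-8859-1"))  # one byte per character
    utf8 = len(text.encode("utf-8"))         # two bytes for U+0080-U+00FF
    sizes[label] = (latin1, utf8)
    print(f"{label}: Latin-1 {latin1} bytes, UTF-8 {utf8} bytes")
```

For pure-ASCII text the two sizes are identical; otherwise UTF-8 costs exactly one extra byte per character in the range U+0080-U+00FF.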


I'll send this reply in UTF-8, just to see if it makes a difference. I also checked my character-encoding preferences, and changed the "encoding to use when replying" from ISO-8859-1 to "whatever the sender used" (subject, in both cases, to UTF-8 fallback if the message text doesn't fit). If it isn't good enough I'll change it again.

Seems properly encoded.


As for HTML specs, last time I checked they didn't apply to email,

My point wasn't about HTML or email; it was about the outmoded nature of ISO-8859-n ∀n ∈ { x | x ≥ 1 & x ≤ 15 } (that is, for all n belonging to the set { x, where x >= 1 and x <= 15 }, in case your font is missing any of those characters).

Since UTF-8 can encode anything in any of those charsets and has fewer interoperability problems, it is virtually always preferable (at this point).


and it's email which gives me problems; with HTML I usually have no problem, except when the page is badly set up, let's say a page sent in some bizarre charset with no charset mentioned in an HTML Content-Type header and also not in any <meta http-equiv="Content-Type"> element.

Part of the reason you usually have no problem is that browsers have a long "tradition" of having to be better at guessing the proper encoding in the face of bad data (hence HTML is the first major spec [AFAIK] to break from accepting what's provided as charset).


Oh, and about your reference 5, I thought the normative authority for HTML was the W3C, in whose Standards I don't find what your whatwg page displays, and sometimes even the opposite, see for instance items C030 and C076 under "Character Model for the World Wide Web (latest revision)" which I reached from "HTML for User Agents": namely, http://www.w3.org/TR/charmod/#C030 and http://www.w3.org/TR/charmod/#C076

Yes, sorry. WHATWG = Web Hypertext Application Technology Working Group. The current editor, Ian Hickson, is also the current editor of the HTML5 spec¹, so I mistook it for official.

The official spec and my original link point out² that the character override is a "willful violation"³ of the specs that you pointed to. Which also points to the fact that you're only going to have more problems in the future should you stick with ISO-8859-n.
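As a rough illustration of what that override means in practice, here is a Python sketch of a decoder that swaps the declared label before decoding. The `OVERRIDES` table is a small illustrative subset rather than the spec's full list, and `decode_like_a_browser` is a hypothetical helper, not anything a real browser exposes.

```python
# Illustrative subset of the label override; not the spec's full table.
OVERRIDES = {
    "iso-8859-1": "cp1252",  # Python's name for Windows-1252
    "us-ascii": "cp1252",
}

def decode_like_a_browser(data: bytes, declared: str) -> str:
    """Decode with the overriding charset when the declared label has one."""
    label = declared.strip().lower()
    return data.decode(OVERRIDES.get(label, label), errors="replace")

body = b"price: \x80 5, \x93quoted\x94"

# Taking the label at face value turns 0x80, 0x93 and 0x94 into C1 controls.
strict = body.decode("iso-8859-1")
# With the override, the same bytes become the Euro sign and curly quotes.
lenient = decode_like_a_browser(body, "ISO-8859-1")

print(repr(strict))
print(lenient)
```

Any sender who labels Windows-1252 text as ISO-8859-1 gets the second behavior in a browser and the first in anything that follows the label strictly, which is the kind of mismatch that UTF-8 avoids.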

--
Best,
Ben

¹: HTML5 spec
current: http://www.w3.org/TR/html5/parsing.html
latest draft: http://dev.w3.org/html5/spec/Overview.html

²: § 8.2.2.1 (last ¶, just above the link below)
current: http://www.w3.org/TR/html5/parsing.html#character-encodings-0
latest draft: http://dev.w3.org/html5/spec/Overview.html#character-encodings-0

³: § 1.5.2 "Compliance with other specifications"
http://www.w3.org/TR/html5/introduction.html#compliance-with-other-specifications

--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
