On 09/08/11 03:38, Benjamin R. Haskell wrote:
On Tue, 9 Aug 2011, Tony Mechelynck wrote:
[...]
The only difference between ISO-8859-1 and Windows-1252 is that in the
former, 0x80 to 0x9F are non-printing control characters (which I
don't use), while in the latter most of them are printable characters
(for which I use UTF-8 if I need them: in fact, my mailer is set to
fall back to UTF-8 if the message contains characters not supported by
the charset in which I would otherwise send it). In ISO-8859-15
(another common replacement for Latin1) 0x80 to 0x9F are the same
nonprinting controls, but some of 0xA0 to 0xBF are /different/
printing characters, to wit, the Euro sign €, the French oe and OE
digraphs œ Œ, the uppercase Y-diaeresis Ÿ, and the upper- and
lowercase s-caron Š š and z-caron Ž ž.
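
The byte ranges described above can be checked with a short Python sketch. It assumes the codec names `latin-1`, `cp1252` and `iso-8859-15` as shipped with CPython; the helper name `differing_bytes` is made up for illustration.

```python
def differing_bytes(codec_a, codec_b):
    """Return the byte values (0x00-0xFF) that the two codecs decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        # errors="replace" turns bytes a codec leaves undefined into U+FFFD
        # instead of raising, so every byte value can be compared.
        a = raw.decode(codec_a, errors="replace")
        b = raw.decode(codec_b, errors="replace")
        if a != b:
            diffs.append(value)
    return diffs


latin1_vs_cp1252 = differing_bytes("latin-1", "cp1252")
latin1_vs_latin9 = differing_bytes("latin-1", "iso-8859-15")

print([hex(v) for v in latin1_vs_cp1252])
print([hex(v) for v in latin1_vs_latin9])
# The characters Latin9 puts at the positions where it departs from Latin1:
print(bytes(latin1_vs_latin9).decode("iso-8859-15"))
```

The first list should stay within 0x80 to 0x9F, and the second within 0xA0 to 0xBF.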

Right, of course. I was thinking -15 when writing -1.

Ah. I only rarely use Latin9 (ISO-8859-15) anyway. When the document you mentioned (and the HTML5 specs you mentioned here in a footnote) says to use Windows-1252, as a willful violation of previous specs, when Latin1 (ISO-8859-1) is requested, there is no big risk of failure, since these nonprinting controls are practically never used in Latin1.



One advantage of Latin1 over UTF-8 is that it uses one byte rather
than two for every codepoint in the range [U+0080-U+00FF]. That may or
may not be much of an advantage depending on the proportion of
non-ASCII characters in a "Western-text" message. IOW it would be
"least" advantageous for English text.

So, pros: possibly, maybe saves a couple of bytes.
Cons: is more likely to be misinterpreted.

For English the balance is probably in favour of UTF-8, but languages like French, Spanish, German, Danish, etc. use comparatively many more "accented" characters above 0x7F.
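
To put rough numbers on the size argument, here is a minimal Python sketch. The two sample sentences and the helper name `encoded_sizes` are illustrative assumptions, not taken from the thread.

```python
def encoded_sizes(text):
    """Return (Latin1 size, UTF-8 size) in bytes for text that fits in Latin1."""
    return len(text.encode("latin-1")), len(text.encode("utf-8"))


english = "The quick brown fox jumps over the lazy dog"
french = "Où est le café près de la forêt ?"

for label, sample in (("English", english), ("French", french)):
    latin1_size, utf8_size = encoded_sizes(sample)
    # Each character in U+0080-U+00FF costs one byte in Latin1 and two in UTF-8,
    # so the difference equals the number of such characters in the sample.
    print(label, latin1_size, utf8_size, utf8_size - latin1_size)
```

For pure-ASCII text the two sizes are identical; for the French sample the overhead is one byte per accented letter.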



I'll send this reply in UTF-8, just to see if it makes a difference. I
also checked my character-encoding preferences, and changed the
"encoding to use when replying" from ISO-8859-1 to "whatever the
sender used" (subject, in both cases, to UTF-8 fallback if the message
text doesn't fit). If it isn't good enough I'll change it again.

Seems properly encoded.
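
The "preferred charset with UTF-8 fallback" behaviour described above could be sketched in Python roughly as follows. This is only an illustration of the idea, not how any particular mailer implements it; the function name `choose_charset` and its defaults are assumptions.

```python
def choose_charset(text, preferred="iso-8859-1", fallback="utf-8"):
    """Return the preferred charset if it can represent the text, else the fallback."""
    try:
        text.encode(preferred)
    except UnicodeEncodeError:
        # At least one character has no representation in the preferred charset.
        return fallback
    return preferred


print(choose_charset("café"))        # fits in Latin1
print(choose_charset("price: 5 €"))  # the Euro sign is not in Latin1
```

Passing the sender's charset as `preferred` gives the "whatever the sender used" variant.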


As for HTML specs, last time I checked they didn't apply to email,

My point wasn't about HTML or email, it was about the outmoded nature of
ISO-8859-n ∀n ∈ { x | x ≥ 1 & x ≤ 15 }. ( for all n belonging to the set
{ x, where x >= 1 and x <= 15 } if your font's missing any of those chars)

UTF-8, since it can encode anything in any of those charsets, but has
fewer interoperability problems, is virtually always preferable (at this
point).


and it's email which gives me problems; with HTML I usually have no
problem, except when the page is badly set up, let's say a page sent
in some bizarre charset with no charset mentioned in an HTML
Content-Type header and also not in any <meta
http-equiv="Content-Type"> element.

Part of the reason you usually have no problem is that browsers have a
long "tradition" of having to be better at guessing the proper encoding
in the face of bad data (hence HTML is the first major spec [AFAIK] to
break from accepting what's provided as charset).
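
As a rough illustration of that override, the following Python sketch treats a declared ISO-8859-1 label as Windows-1252 before decoding. The table `LABEL_OVERRIDES` and the function `decode_like_a_browser` are hypothetical names, and the table holds only the one override discussed in this thread, not a browser's full list.

```python
# Hypothetical, deliberately minimal override table (Python codec names).
LABEL_OVERRIDES = {"iso-8859-1": "cp1252"}


def decode_like_a_browser(raw, declared):
    """Decode raw bytes, replacing the declared charset label where overridden."""
    label = declared.strip().lower()
    codec = LABEL_OVERRIDES.get(label, label)
    return raw.decode(codec, errors="replace")


page = b"\x93quoted\x94"  # curly quotes in Windows-1252, C1 controls in Latin1
print(repr(page.decode("latin-1")))
print(repr(decode_like_a_browser(page, "ISO-8859-1")))
```

Taken literally, the Latin1 label yields two invisible control characters; with the override, the reader sees the curly quotes the author most likely intended.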


Oh, and about your reference 5, I thought the normative authority for
HTML was the W3C, in whose Standards I don't find what your whatwg
page displays, and sometimes even the opposite, see for instance items
C030 and C076 under "Character Model for the World Wide Web (latest
revision)" which I reached from "HTML for User Agents": namely,
http://www.w3.org/TR/charmod/#C030 and http://www.w3.org/TR/charmod/#C076

Yes, sorry. WHATWG = Web Hypertext Application Technology Working Group.
The current editor, Ian Hickson, is also the current editor of the HTML5
spec¹, so I mistook it for official.

The official spec and my original link point out² that the character
override is a "willful violation"³ of the specs that you pointed to.
Which also points to the fact that you're only going to have more
problems in the future should you stick with ISO-8859-n.


Thanks for the links in your footnotes (since they come after the dash-dash-space signature delimiter, my mailer removes them when replying), but they all apply only to HTML5, don't they? When I publish web pages, I use UTF-8 but also HTML 4.01.

For email, I would expect the Content-Type header to be respected (and any translation along the way to be done in such a way as not to corrupt the data, as interpreted at every step according to the Content-Type header accompanying it). Anyway, when a Pilcrow mark (in Latin1) comes back in UTF-8 as an s-acute (which doesn't exist in _either_ ISO-8859-1 _or_ Windows-1252), I wonder how that s-acute could have been injected. I couldn't even imagine any consensus of vendors of email agents and servers which would approve such a "wilful violation" of the standards.

Hm, it seems that ISO-8859-2 (a Central- or East-European Latin encoding AFAICT) has s-acute where ISO-8859-1, ISO-8859-15 and Windows-1252 all have a Pilcrow mark. That still doesn't explain why or how it got in.
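
The observation about the shared byte position can be reproduced with a few lines of Python. This only shows one way a Pilcrow could turn into an s-acute (some step reading the byte as ISO-8859-2 before converting to UTF-8); it is not evidence of what actually happened to the message. The names `PILCROW_BYTE` and `decode_in` are illustrative.

```python
PILCROW_BYTE = "¶".encode("latin-1")  # the single byte 0xB6


def decode_in(codecs, raw=PILCROW_BYTE):
    """Decode the same raw bytes under each named codec."""
    return {name: raw.decode(name) for name in codecs}


readings = decode_in(["iso-8859-1", "iso-8859-15", "cp1252", "iso-8859-2"])
for name, char in readings.items():
    print(name, char, f"U+{ord(char):04X}")

# A hypothetical mislabelled hop: Latin1 bytes read as Latin2, then re-sent as UTF-8.
misread = PILCROW_BYTE.decode("iso-8859-2").encode("utf-8")
print(misread.decode("utf-8"))
```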


Best regards,
Tony.
--
hundred-and-one symptoms of being an internet addict:
154. You fondle your mouse.

--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
