Re: [BUG] Passing special characters to &listchars and &fillchars causes screen corruption

Tony Mechelynck Tue, 09 Aug 2011 14:07:51 -0700

On 09/08/11 03:38, Benjamin R. Haskell wrote:

On Tue, 9 Aug 2011, Tony Mechelynck wrote:

[...]

The only difference between ISO-8859-1 and Windows-1252 is that in the
former, 0x80 to 0x9F are non-printing control characters (which I
don't use), while in the latter most of them are printable characters
(for which I use UTF-8 if I need them: in fact, my mailer is set to
fall back to UTF-8 if the message contains characters not supported by
the charset in which I would otherwise send it). In ISO-8859-15
(another common replacement for Latin1) 0x80 to 0x9F are the same
nonprinting controls, but some of 0xA0 to 0xBF are /different/
printing characters, to wit, the Euro sign €, the French oe and OE
digraphs œ Œ, the uppercase Y-diaeresis Ÿ, and the upper- and
lowercase z-caron Ž ž.
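The comparison above can be checked mechanically. What follows is only a minimal sketch, assuming Python's standard codec aliases (`iso8859_1`, `iso8859_15`, `cp1252`); the function names are made up for illustration and are not from the thread.

```python
# Sketch: compare how single bytes decode under the three charsets discussed.
# Codec names are Python's standard aliases; function names are illustrative.

def latin9_differences():
    """Bytes 0xA0-0xFF whose character differs between Latin1 and Latin9."""
    diffs = {}
    for b in range(0xA0, 0x100):
        raw = bytes([b])
        latin1 = raw.decode("iso8859_1")
        latin9 = raw.decode("iso8859_15")
        if latin1 != latin9:
            diffs[b] = (latin1, latin9)
    return diffs


def cp1252_printables():
    """Bytes 0x80-0x9F that Windows-1252 maps to a printable character."""
    found = {}
    for b in range(0x80, 0xA0):
        try:
            found[b] = bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            pass  # byte left undefined by Python's cp1252 codec
    return found


if __name__ == "__main__":
    for b, (l1, l9) in sorted(latin9_differences().items()):
        print(f"0x{b:02X}: Latin1 {l1!r} -> Latin9 {l9!r}")
    print(len(cp1252_printables()), "of 32 bytes in 0x80-0x9F are printable in cp1252")
```

Besides the characters listed above, the scan also reports the upper- and lowercase s-caron Š š (at 0xA6 and 0xA8), for eight differing positions in all.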


Right, of course. I was thinking -15 when writing -1.

Ah. I only rarely use Latin9 (ISO-8859-15) anyway. When the document you
mentioned (and the HTML5 specs you mentioned here in a footnote) say to
use Windows-1252 as a willful violation of previous specs when Latin1
(ISO-8859-1) is requested, there is no big risk of failure since these
nonprinting controls are practically never used in Latin1.

One advantage of Latin1 over UTF-8 is that it uses one byte rather
than two for every codepoint in the range [U+0080-U+00FF]. That may or
may not be much of an advantage depending on the proportion of
non-ASCII characters in a "Western-text" message. IOW it would be
"least" advantageous for English text.
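To put rough numbers on the size argument, here is a small sketch. The helper name and the sample strings are illustrative assumptions, not taken from the thread.

```python
# Sketch: byte cost of the same text in Latin1 versus UTF-8.
# `encoded_sizes` is a hypothetical helper; it raises UnicodeEncodeError
# if the text contains characters outside Latin1.

def encoded_sizes(text):
    """Return (latin1_bytes, utf8_bytes) for a Latin1-representable string."""
    return len(text.encode("iso8859_1")), len(text.encode("utf-8"))


if __name__ == "__main__":
    for sample in ("plain ASCII", "déjà vu"):
        latin1, utf8 = encoded_sizes(sample)
        print(f"{sample!r}: Latin1 {latin1} bytes, UTF-8 {utf8} bytes")
```

Each character in U+0080-U+00FF costs one extra byte in UTF-8, so the gap grows only with the share of accented characters and is zero for pure ASCII.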


So, pros: possibly, maybe saves a couple of bytes.
Cons: is more likely to be misinterpreted.

For English the balance is probably in favour of UTF-8, but languages
like French, Spanish, German, Danish, etc. use comparatively much more
"accented" characters above 0x7F.

I'll send this reply in UTF-8, just to see if it makes a difference. I
also checked my character-encoding preferences, and changed the
"encoding to use when replying" from ISO-8859-1 to "whatever the
sender used" (subject, in both cases, to UTF-8 fallback if the message
text doesn't fit). If it isn't good enough I'll change it again.


Seems properly encoded.

As for HTML specs, last time I checked they didn't apply to email,


My point wasn't about HTML or email, it was about the outmoded nature of
ISO-8859-n ∀n ∈ { x | x ≥ 1 & x ≤ 15 }. ( for all n belonging to the set
{ x, where x >= 1 and x <= 15 } if your font's missing any of those chars)

UTF-8, since it can encode anything in any of those charsets, but has
fewer interoperability problems, is virtually always preferable (at this
point).

and it's email which gives me problems; with HTML I usually have no
problem, except when the page is badly set up, let's say a page sent
in some bizarre charset with no charset mentioned in an HTML
Content-Type header and also not in any <meta
http-equiv="Content-Type"> element.


Part of the reason you usually have no problem is that browsers have a
long "tradition" of having to be better at guessing the proper encoding
in the face of bad data (hence HTML is the first major spec [AFAIK] to
break from accepting what's provided as charset).

Oh, and about your reference 5, I thought the normative authority for
HTML was the W3C, in whose Standards I don't find what your whatwg
page displays, and sometimes even the opposite, see for instance items
C030 and C076 under "Character Model for the World Wide Web (latest
revision)" which I reached from "HTML for User Agents": namely,
http://www.w3.org/TR/charmod/#C030 and http://www.w3.org/TR/charmod/#C076


Yes, sorry. WHATWG = Web Hypertext Application Technology Working Group.
The current editor, Ian Hickson, is also the current editor of the HTML5
spec¹, so I mistook it for official.

The official spec and my original link point out² that the character
override is a "willful violation"³ of the specs that you pointed to.
Which also points to the fact that you're only going to have more
problems in the future should you stick with ISO-8859-n.

Thanks for the links in your footnotes (but since they are after the
dash-dash-space my mailer removes them when replying); but they all
apply only to HTML5, don't they? When I publish web pages, I use UTF-8
but also HTML 4.01.

For email, I would expect that the Content-Type header be respected (and
that any translation along the way be done in such a way as not to
corrupt the data as interpreted at every step according to the
Content-Type header accompanying it); and anyway, when a Pilcrow mark
(in Latin1) comes back in UTF-8 as an s-acute (which doesn't exist in
_either_ ISO-8859-1 _or_ Windows-1252), I wonder how that s-acute could
have been injected. I couldn't even imagine any consensus of vendors of
email agents and servers which would approve such a "wilful violation"
of the standards.

Hm, it seems that ISO-8859-2 (a Central- or East-European Latin encoding
AFAICT) has s-acute where ISO-8859-1, ISO-8859-15 and Windows-1252 all
have a Pilcrow mark. Still doesn't explain why or how it got in.
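The byte coincidence can be reproduced with a short sketch, again assuming Python's standard codec aliases; `decode_under` is a hypothetical helper. It shows only that the same byte reads differently under ISO-8859-2, not where in the mail path such a reinterpretation happened.

```python
# Sketch: decode the Latin1 pilcrow byte under each charset mentioned above.
# `decode_under` and PILCROW_BYTE are illustrative names, not from the thread.

PILCROW_BYTE = "¶".encode("iso8859_1")  # the single byte 0xB6

CHARSETS = ("iso8859_1", "iso8859_15", "cp1252", "iso8859_2")


def decode_under(raw, charsets=CHARSETS):
    """Map each charset name to the text `raw` decodes to under it."""
    return {name: raw.decode(name) for name in charsets}


if __name__ == "__main__":
    for name, text in decode_under(PILCROW_BYTE).items():
        print(f"{name}: {text!r}")
```

Only the ISO-8859-2 entry comes out as s-acute (U+015B), which would be consistent with some step having treated Latin1 bytes as ISO-8859-2 before converting to UTF-8; that remains a guess.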



Best regards,
Tony.
--
hundred-and-one symptoms of being an internet addict:
154. You fondle your mouse.

--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
