(2011/12/06 17:39), Jukka K. Korpela wrote:
> 2011-12-06 6:54, Leif Halvard Silli wrote:
> 
> >> Yeah, it would be a pity if it had already become a widespread
>> cargo-cult to - all at once - use HTML5 doctype without using UTF-8
>> *and* without using some encoding declaration *and* thus effectively
>> relying on the default locale encoding ... Who does have a data corpus?

I found one: http://rink77.web.fc2.com/html/metatagu.html
It uses the HTML5 doctype, does not declare an encoding, and is encoded in
Shift_JIS, the default encoding of the Japanese locale.
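
To make the "locale default" point concrete, here is a rough Python sketch
(the text is made up, not taken from that page): the same undeclared
Shift_JIS bytes read fine where the locale default is Shift_JIS, but turn
into mojibake where the default is windows-1252.

    # Hypothetical page text, not the actual content of the page above.
    sjis_bytes = "日本語".encode("shift_jis")

    # Under a Japanese locale default the bytes round-trip correctly.
    print(sjis_bytes.decode("shift_jis"))                       # 日本語

    # Under a Western locale default (windows-1252) they become mojibake.
    print(sjis_bytes.decode("windows-1252", errors="replace"))  # “ú–{Œê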

> Since <!doctype html> is the simplest way to put browsers to "standards 
> mode", this would punish authors who have realized that their page works 
> better in "standards mode" but are unaware of a completely different and 
> fairly complex problem. (Basic character encoding issues are of course not 
> that complex to you and me or most people around here; but most authors are 
> more or less confused with them, and I don't think we should add to the 
> confusion.)

I don't think there is any page that works better in "standards mode" than
in the *current* loose mode.

> There's little point in changing the specs to say something very different 
> from what previous HTML specs have said and from actual browser behavior. If 
> the purpose is to make things more exactly defined (a fixed encoding vs. 
> implementation-defined), then I think such exactness is a luxury we cannot 
> afford. Things would be all different if we were designing a document format 
> from scratch, with no existing implementations and no existing usage. If the 
> purpose is UTF-8 evangelism, then it would be just the kind of evangelism 
> that produces angry people, not converts.

Agreed; if we were designing a new spec, there would be no reason to choose
anything other than UTF-8. But HTML has a long history and a great deal of
existing content. We already have HTML*5* pages that don't have an encoding
declaration.

> If there's something that should be added to or modified in the algorithm for 
> determining the character encoding, then I'd say it's error processing. I mean 
> user agent behavior when it detects, after running the algorithm, when 
> processing the document data, that there is a mismatch between them. That is, 
> that the data contains octets or octet sequences that are not allowed in the 
> encoding or that denote noncharacters. Such errors are naturally detected 
> when the user agent processes the octets; the question is what the browser 
> should do then.

Current implementations replace such an invalid octet with a replacement
character. Alternatively, some implementations scan almost the whole page and
use an encoding in which all octets in the page are valid.
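
Both behaviors can be sketched in Python (the byte string is hypothetical,
and the detection step is reduced to a simple scan over a few candidate
encodings):

    # A stray ISO-8859-1 octet in data treated as UTF-8.
    data = b"caf\xe9"

    # Behavior 1: replace the invalid octet with U+FFFD.
    print(data.decode("utf-8", errors="replace"))    # caf�

    # Behavior 2: scan the bytes and pick an encoding in which every
    # octet sequence is valid (a much simplified sketch of detection).
    for candidate in ("utf-8", "shift_jis", "windows-1252"):
        try:
            text = data.decode(candidate)
            break
        except UnicodeDecodeError:
            continue
    print(candidate, text)                           # windows-1252 café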

> When data that is actually in ISO-8859-1 or some similar encoding has been 
> mislabeled as UTF-8 encoded, then, if the data contains octets outside 
> the ASCII, character-level errors are likely to occur. Many ISO-8859-1 octets 
> are just not possible in UTF-8 data. The converse error may also cause 
> character-level errors. And these are not uncommon situations - they seem to 
> occur increasingly often, partly due to cargo cult "use of UTF-8" (when it 
> means declaring UTF-8 but not actually using it, or vice versa), partly due to 
> increased use of UTF-8 combined with ISO-8859-1 encoded data creeping in from 
> somewhere into UTF-8 encoded data.

In such a case, the page should fail to display correctly in the author's own
environment, so the author has a chance to notice the mismatch before
publishing.
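
A small sketch of that author-side failure (the text is made up): the
mislabeling Jukka describes breaks a strict decode immediately, while the
converse error never raises and only shows up as gibberish.

    # ISO-8859-1 bytes served with a UTF-8 label: the strict decode fails,
    # which is exactly the failure the author would see while testing.
    body = "naïve résumé".encode("iso-8859-1")
    try:
        body.decode("utf-8")
    except UnicodeDecodeError as e:
        print("mismatch detected:", e)

    # The converse (UTF-8 bytes read as ISO-8859-1) never raises an error;
    # it silently produces gibberish such as "naÃ¯ve".
    print("naïve".encode("utf-8").decode("iso-8859-1"))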

> From the user's point of view, the character-level errors currently result in 
> some gibberish (e.g., some odd box appearing instead of a character, in one 
> place) or in a total mess (e.g. a large number of non-ASCII characters displayed 
> all wrong). In either case, I think an error should be signalled to the user, 
> together with
> a) automatically trying another encoding, such as the locale default encoding 
> instead of UTF-8 or UTF-8 instead of anything else
> b) suggesting to the user that he should try to view the page using some 
> other encoding, possibly with a menu of encodings offered as part of the 
> error explanation
> c) a combination of the above.

This presumes that the user knows the correct encoding.
But do European users really know the correct encoding of ISO-8859-* pages?
As a Japanese speaker, I imagine it is hard to distinguish an ISO-8859-1 page
from an ISO-8859-2 page.
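
A tiny sketch of why this is hard: every octet is valid in both encodings,
it just maps to a different letter, so there is nothing for the software (or
for a reader unfamiliar with the languages) to detect mechanically.

    # The same octet is valid in both Latin alphabets; only the letter differs.
    octet = b"\xe8"
    print(octet.decode("iso-8859-1"))   # è (Latin-1: e with grave)
    print(octet.decode("iso-8859-2"))   # č (Latin-2: c with caron)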

> Although there are good reasons why browsers usually don't give error 
> messages, this would be a special case. It's about the primary interpretation 
> of the data in the document and about a situation where some data has no 
> interpretation in the assumed encoding - but usually has an interpretation in 
> some other encoding.

Some browsers alert the user to scripting issues.
Why can't they alert the user to an encoding issue as well?

> The current "Character encoding overrides" rules are questionable because 
> they often mask out data errors that would have helped to detect problems 
> that can be solved constructively. For example, if data labeled as ISO-8859-1 
> contains an octet in the 80...9F range, then it may well be the case that the 
> data is actually windows-1252 encoded and the "override" helps everyone. But 
> it may also be the case that the data is in a different encoding and that the 
> "override" therefore results in gibberish shown to the user, with no hint of 
> the cause of the problem.

I don't think such a case exists.
In the character encoding overrides, an encoding is overridden by its
superset, so I can't imagine such a case arising.
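
What the override actually does can be shown with a single octet (a sketch
of the ISO-8859-1 to windows-1252 case Jukka mentions):

    # An 0x93 octet in data labeled ISO-8859-1 is an unprintable C1 control,
    # but treated as windows-1252 (the superset) it is a useful character.
    octet = b"\x93"
    print(repr(octet.decode("iso-8859-1")))    # '\x93'  (C1 control U+0093)
    print(repr(octet.decode("windows-1252")))  # '“'     (left double quotation mark)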

> It would therefore be better to signal a problem to the user, display the 
> page using the windows-1252 encoding but with some instruction or hint on 
> changing the encoding. And a browser should in this process really analyze 
> whether the data can be windows-1252 encoded data that contains only 
> characters permitted in HTML.

Such verification should be done by developer tools, not by production
browsers that are widely used by real users.
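
As a developer-tool check, the analysis Jukka asks for could look roughly
like this. It is only a sketch: "permitted in HTML" is simplified here to
"no C0/C1 control characters other than tab, LF, and CR", which is an
assumption rather than the exact spec rule.

    def looks_like_clean_windows_1252(data: bytes) -> bool:
        # Bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unmapped in windows-1252.
        try:
            text = data.decode("windows-1252")
        except UnicodeDecodeError:
            return False
        # Reject control characters that HTML does not allow (simplified rule).
        return all(ch in "\t\n\r" or not (ord(ch) < 0x20 or 0x7F <= ord(ch) <= 0x9F)
                   for ch in text)

    print(looks_like_clean_windows_1252("“quoted”".encode("windows-1252")))  # True
    print(looks_like_clean_windows_1252(b"\x00binary\x01junk"))              # False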

-- 
NARUSE, Yui  <nar...@airemix.jp>
