(2011/12/06 17:39), Jukka K. Korpela wrote:
> 2011-12-06 6:54, Leif Halvard Silli wrote:
>
>> Yeah, it would be a pity if it had already become a widespread
>> cargo cult to - all at once - use the HTML5 doctype without using UTF-8
>> *and* without using some encoding declaration *and* thus effectively
>> relying on the default locale encoding ... Who does have a data corpus?
I found it: http://rink77.web.fc2.com/html/metatagu.html
It uses the HTML5 doctype, does not declare an encoding, and is encoded in
Shift_JIS, the default encoding of the Japanese locale.

> Since <!doctype html> is the simplest way to put browsers to "standards
> mode", this would punish authors who have realized that their page works
> better in "standards mode" but are unaware of a completely different and
> fairly complex problem. (Basic character encoding issues are of course not
> that complex to you and me or most people around here; but most authors are
> more or less confused with them, and I don't think we should add to the
> confusion.)

I don't think there is a page that works better in "standards mode" than in
the *current* loose mode.

> There's little point in changing the specs to say something very different
> from what previous HTML specs have said and from actual browser behavior.
> If the purpose is to make things more exactly defined (a fixed encoding vs.
> implementation-defined), then I think such exactness is a luxury we cannot
> afford. Things would be all different if we were designing a document
> format from scratch, with no existing implementations and no existing
> usage. If the purpose is UTF-8 evangelism, then it would be just the kind
> of evangelism that produces angry people, not converts.

Agreed: if we were designing a new spec, there would be no reason to choose
anything other than UTF-8. But HTML has a long history and a huge body of
existing content, and we already have HTML*5* pages that have no encoding
declaration.

> If there's something that should be added to or modified in the algorithm
> for determining character encoding, then I'd say it's error processing. I
> mean user agent behavior when it detects, after running the algorithm,
> when processing the document data, that there is a mismatch between them.
> That is, that the data contains octets or octet sequences that are not
> allowed in the encoding or that denote noncharacters. Such errors are
> naturally detected when the user agent processes the octets; the question
> is what the browser should do then.

Current implementations replace such an invalid octet with a replacement
character (U+FFFD). Alternatively, some implementations scan most of the
page and choose an encoding in which all octets in the page are valid; see
the sketch below.

> When data that is actually in ISO-8859-1 or some similar encoding has been
> mislabeled as UTF-8 encoded, then, if the data contains octets outside
> the ASCII range, character-level errors are likely to occur. Many
> ISO-8859-1 octets are just not possible in UTF-8 data. The converse error
> may also cause character-level errors. And these are not uncommon
> situations - they seem to occur increasingly often, partly due to
> cargo-cult "use of UTF-8" (when it means declaring UTF-8 but not actually
> using it, or vice versa), partly due to increased use of UTF-8 combined
> with ISO-8859-1 encoded data creeping in from somewhere into UTF-8 encoded
> data.

In such a case, the page should fail to display correctly in the author's
own environment as well, so the author can notice the mistake.

> From the user's point of view, the character-level errors currently result
> in some gibberish (e.g., some odd box appearing instead of a character, in
> one place) or in a total mess (e.g. a large number of non-ASCII characters
> displayed all wrong).
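To make that concrete, here is a minimal Python sketch of the two behaviors
(the candidate list and the helper names are only illustrative, not any
browser's actual algorithm):

  # Two user-agent strategies for mislabeled data:
  # 1) replace invalid octets with U+FFFD, or
  # 2) scan the octets and pick an encoding in which all of them are valid.

  CANDIDATES = ["utf-8", "windows-1252", "shift_jis"]  # illustrative list

  def decode_with_replacement(data: bytes, declared: str) -> str:
      # Invalid octets become U+FFFD REPLACEMENT CHARACTER (the "odd box").
      return data.decode(declared, errors="replace")

  def sniff_encoding(data: bytes):
      # Return the first candidate encoding in which every octet is valid.
      for enc in CANDIDATES:
          try:
              data.decode(enc)
              return enc
          except UnicodeDecodeError:
              continue
      return None

  # ISO-8859-1 "café" mislabeled as UTF-8: the lone 0xE9 octet is not
  # valid UTF-8, so strict decoding fails and replacement kicks in.
  data = "café".encode("iso-8859-1")             # b'caf\xe9'
  print(decode_with_replacement(data, "utf-8"))  # caf + odd box (U+FFFD)
  print(sniff_encoding(data))                    # windows-1252

The first strategy produces exactly the "odd box" gibberish described
above; the second silently picks a plausible encoding instead.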
> In either case, I think an error should be signalled to the user, together
> with
> a) automatically trying another encoding, such as the locale default
>    encoding instead of UTF-8 or UTF-8 instead of anything else
> b) suggesting to the user that he should try to view the page using some
>    other encoding, possibly with a menu of encodings offered as part of
>    the error explanation
> c) a combination of the above.

This presumes that the user knows the correct encoding. But do European
users really know the correct encoding of ISO-8859-* pages? I, as a
Japanese speaker, imagine that it is hard to distinguish an ISO-8859-1 page
from an ISO-8859-2 page.

> Although there are good reasons why browsers usually don't give error
> messages, this would be a special case. It's about the primary
> interpretation of the data in the document and about a situation where
> some data has no interpretation in the assumed encoding - but usually has
> an interpretation in some other encoding.

Some browsers alert the user about scripting issues; why can't they alert
about an encoding issue?

> The current "Character encoding overrides" rules are questionable because
> they often mask out data errors that would have helped to detect problems
> that can be solved constructively. For example, if data labeled as
> ISO-8859-1 contains an octet in the 80...9F range, then it may well be the
> case that the data is actually windows-1252 encoded and the "override"
> helps everyone. But it may also be the case that the data is in a
> different encoding and that the "override" therefore results in gibberish
> shown to the user, with no hint of the cause of the problem.

I don't think such a case exists. In the character encoding overrides, a
superset overrides a standard set, so I can't imagine that case; the sketch
at the end of this message shows the ISO-8859-1/windows-1252 example.

> It would therefore be better to signal a problem to the user, display the
> page using the windows-1252 encoding but with some instruction or hint on
> changing the encoding. And a browser should in this process really analyze
> whether the data can be windows-1252 encoded data that contains only
> characters permitted in HTML.

Such verification should be done by developer tools, not by production
browsers that are widely used by real users.

-- 
NARUSE, Yui  <nar...@airemix.jp>
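As a rough illustration of the superset point (Python again; the sample
bytes are made up): an octet in the 0x80-0x9F range decodes to an invisible
C1 control character under ISO-8859-1, while the windows-1252 override
yields the printable character the author almost certainly intended.

  # Why overriding an ISO-8859-1 label with windows-1252 is usually safe:
  # ISO-8859-1 maps 0x80-0x9F to invisible C1 controls, while
  # windows-1252 maps most of that range to printable characters.
  data = b"price: 5 \x80"  # 0x80 is the euro sign in windows-1252

  print(repr(data.decode("iso-8859-1")))    # 'price: 5 \x80' (C1 control)
  print(repr(data.decode("windows-1252")))  # 'price: 5 €'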