Re: [whatwg] Default encoding to UTF-8?

Jukka K. Korpela Tue, 06 Dec 2011 00:40:06 -0800

2011-12-06 6:54, Leif Halvard Silli wrote:

Yeah, it would be a pity if it had already become a widespread
cargo-cult to - all at once - use HTML5 doctype without using UTF-8
*and* without using some encoding declaration *and* thus effectively
relying on the default locale encoding ... Who does have a data corpus?

I think we would need to ask search engine developers about that, but what is this proposed change to defaults supposed to achieve? It would break any old page that does not specify the encoding, as soon as the doctype is changed to <!doctype html> or this doctype is added to a page that lacked a doctype.

Since <!doctype html> is the simplest way to put browsers to "standards mode", this would punish authors who have realized that their page works better in "standards mode" but are unaware of a completely different and fairly complex problem. (Basic character encoding issues are of course not that complex to you and me or most people around here; but most authors are more or less confused with them, and I don't think we should add to the confusion.)

There's little point in changing the specs to say something very different from what previous HTML specs have said and from actual browser behavior. If the purpose is to make things more exactly defined (a fixed encoding vs. implementation-defined), then I think such exactness is a luxury we cannot afford. Things would be all different if we were designing a document format from scratch, with no existing implementations and no existing usage. If the purpose is UTF-8 evangelism, then it would be just the kind of evangelism that produces angry people, not converts.

If there's something that should be added to or modified in the algorithm for determining character encoding, then I'd say it's error processing. I mean user agent behavior when it detects, after running the algorithm and while processing the document data, that there is a mismatch between the encoding it determined and the data. That is, the data contains octets or octet sequences that are not allowed in the encoding or that denote noncharacters. Such errors are naturally detected when the user agent processes the octets; the question is what the browser should do then.
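The noncharacter half of that check can be illustrated with a short Python sketch. The function names are hypothetical; the ranges are the Unicode noncharacters (U+FDD0..U+FDEF and the last two code points of each plane). Python's own UTF-8 decoder accepts these code points, so a separate scan is assumed here.

```python
def is_noncharacter(cp: int) -> bool:
    """True for Unicode noncharacters: U+FDD0..U+FDEF and U+nFFFE / U+nFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def find_noncharacters(text: str) -> list[tuple[int, int]]:
    """Return (index, code point) pairs for every noncharacter in decoded text."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if is_noncharacter(ord(ch))]


# The octets EF BF BE are well-formed UTF-8 but denote the noncharacter U+FFFE.
decoded = b"a\xef\xbf\xbeb".decode("utf-8")
print(find_noncharacters(decoded))
```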

When data that is actually in ISO-8859-1 or some similar encoding has been mislabeled as UTF-8 encoded, then, if the data contains octets outside the ASCII range, character-level errors are likely to occur. Many ISO-8859-1 octets are just not possible in UTF-8 data. The converse error may also cause character-level errors. And these are not uncommon situations - they seem to occur increasingly often, partly due to cargo cult "use of UTF-8" (when it means declaring UTF-8 but not actually using it, or vice versa), partly due to increased use of UTF-8 combined with ISO-8859-1 encoded data creeping in from somewhere into UTF-8 encoded data.
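To make the mislabeling case concrete, here is a minimal Python sketch that reports where strict UTF-8 decoding fails. The function name and the `limit` parameter are made up for illustration; a browser's decoder would detect the same errors while streaming.

```python
def find_utf8_errors(data: bytes, limit: int = 5) -> list[int]:
    """Return up to `limit` byte offsets at which strict UTF-8 decoding fails."""
    offsets: list[int] = []
    pos = 0
    while pos < len(data) and len(offsets) < limit:
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is well-formed
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice that was decoded
            offsets.append(pos + exc.start)
            pos += exc.end  # skip the offending octets and continue
    return offsets


mislabeled = "café au lait".encode("iso-8859-1")  # octet E9 followed by a space
print(find_utf8_errors(mislabeled))
print(find_utf8_errors("café au lait".encode("utf-8")))
```

In the first call the lone octet E9 is a UTF-8 lead byte with no continuation bytes after it, which is the kind of character-level error described above.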

From the user's point of view, the character-level errors currently result in some gibberish (e.g., some odd box appearing instead of a character, in one place) or in a total mess (e.g., a large number of non-ASCII characters displayed all wrong). In either case, I think an error should be signalled to the user, together with

a) automatically trying another encoding, such as the locale default encoding instead of UTF-8 or UTF-8 instead of anything else

b) suggesting to the user that he should try to view the page using some other encoding, possibly with a menu of encodings offered as part of the error explanation

c) a combination of the above.
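Option c) might look roughly like the following Python sketch. Everything here is an assumption for illustration: the names `DecodeResult` and `decode_with_fallback`, the fallback order, and the wording of the warnings. It also assumes the declared label is a codec name Python recognizes.

```python
from typing import NamedTuple, Optional


class DecodeResult(NamedTuple):
    text: str
    encoding: str           # the encoding actually used
    warning: Optional[str]  # message to surface to the user, if any


def decode_with_fallback(data: bytes, declared: str,
                         fallbacks: tuple[str, ...] = ("utf-8", "windows-1252")) -> DecodeResult:
    """Try the declared encoding first, then each fallback, and report any switch."""
    candidates = [declared] + [e for e in fallbacks if e.lower() != declared.lower()]
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue
        if enc == declared:
            return DecodeResult(text, enc, None)
        return DecodeResult(text, enc,
                            f"Page declared {declared} but contains octets invalid in it; "
                            f"showing it as {enc}. Other encodings to try: {', '.join(candidates)}")
    # Nothing decoded cleanly: show replacement characters and say so.
    return DecodeResult(data.decode(declared, errors="replace"), declared,
                        f"Page could not be decoded as any of: {', '.join(candidates)}")


result = decode_with_fallback("naïve".encode("windows-1252"), "utf-8")
print(result.encoding, "-", result.warning)
```

The warning string stands in for the error explanation with a menu of encodings; how that is presented is a user interface question that the sketch leaves open.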

Although there are good reasons why browsers usually don't give error messages, this would be a special case. It's about the primary interpretation of the data in the document and about a situation where some data has no interpretation in the assumed encoding - but usually has an interpretation in some other encoding.

The current "Character encoding overrides" rules are questionable because they often mask out data errors that would have helped to detect problems that can be solved constructively. For example, if data labeled as ISO-8859-1 contains an octet in the 80...9F range, then it may well be the case that the data is actually windows-1252 encoded and the "override" helps everyone. But it may also be the case that the data is in a different encoding and that the "override" therefore results in gibberish shown to the user, with no hint of the cause of the problem. It would therefore be better to signal a problem to the user, display the page using the windows-1252 encoding but with some instruction or hint on changing the encoding. And a browser should in this process really analyze whether the data can be windows-1252 encoded data that contains only characters permitted in HTML.
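A rough Python sketch of such an analysis follows. The function name and the three return labels are hypothetical. The set of unassigned octets is the one in Microsoft's code page 1252 table (and in Python's cp1252 codec); treating those octets as a sign of trouble is an assumption of this sketch, not something the current rules require.

```python
C1_RANGE = range(0x80, 0xA0)
# Octets with no character assigned in code page 1252
CP1252_UNASSIGNED = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def classify_latin1_labeled(data: bytes) -> str:
    """Classify data that was labeled ISO-8859-1 by its octets in the 80..9F range."""
    c1_octets = {b for b in data if b in C1_RANGE}
    if not c1_octets:
        return "iso-8859-1"    # no 80..9F octets, the label is plausible
    if c1_octets & CP1252_UNASSIGNED:
        return "suspect"       # cannot be clean windows-1252, warn the user
    return "windows-1252"      # the override is plausible, a hint may still help


print(classify_latin1_labeled("“quoted”".encode("cp1252")))
print(classify_latin1_labeled("Á".encode("utf-8")))  # octets C3 81
print(classify_latin1_labeled("€".encode("utf-8")))  # octets E2 82 AC
```

The last call shows the masking problem: UTF-8 data whose octets in 80..9F all happen to be assigned in windows-1252 passes this check, so it can only be one input to the decision, not the whole of it.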


Yucca
