Re: [whatwg] Character-encoding-related threads
On Fri, 19 Oct 2012, Jukka K. Korpela wrote:
> > Are there any situations that this doesn't handle where it would be legitimate to omit a title element?
>
> Perhaps the simplest case is an HTML document that is only meant to be displayed inside an inline frame and containing, say, just a numeric table. It is not meant to be found and indexed by search engines, it is not supposed to be rendered as a standalone document with a browser top bar (or equivalent) showing its title, etc.

The initial intent of such a document may be to only display it in a frame, but since it's independently addressable, nothing stops a search engine from referencing it, a user from bookmarking it, etc. So I don't think that's an example of where omitting <title> is a good idea.

> The current wording looks OK to me, and to me it says that a title is not needed when the document is not to be used out of context:
>
>   "The title element represents the document's title or name. Authors should use titles that identify their documents even when they are used out of context, for example in a user's history or bookmarks, or in search results."
>   http://www.whatwg.org/specs/web-apps/current-work/#the-title-element

That isn't what that says. Please make sure never to read between the lines when reading a specification.

--
Ian Hickson
http://ln.hixie.ch/
Re: [whatwg] Character-encoding-related threads
2012-10-19 19:33, Ian Hickson wrote:
> On Fri, 19 Oct 2012, Jukka K. Korpela wrote:
> > > Are there any situations that this doesn't handle where it would be legitimate to omit a title element?
> >
> > Perhaps the simplest case is an HTML document that is only meant to be displayed inside an inline frame and containing, say, just a numeric table. It is not meant to be found and indexed by search engines, it is not supposed to be rendered as a standalone document with a browser top bar (or equivalent) showing its title, etc.
>
> The initial intent of such a document may be to only display it in a frame, but since it's independently addressable, nothing stops a search engine from referencing it, a user from bookmarking it, etc. So I don't think that's an example of where omitting <title> is a good idea.

Anyone who bookmarks a document that was not meant to be bookmarked should accept the consequences.

But it seems that it is pointless to present any situations where it would be legitimate to omit a title element, since you are prepared to refute any possible example by presenting how things could be different from the scenario given.

> The title element represents the document's title or name.

Yet you seem to deny, a priori, the possibility that a document does not need a title or a name.

Yucca
Re: [whatwg] Character-encoding-related threads
On Fri, 19 Oct 2012, Jukka K. Korpela wrote:
> 2012-10-19 19:33, Ian Hickson wrote:
> > On Fri, 19 Oct 2012, Jukka K. Korpela wrote:
> > > > Are there any situations that this doesn't handle where it would be legitimate to omit a title element?
> > >
> > > Perhaps the simplest case is an HTML document that is only meant to be displayed inside an inline frame and containing, say, just a numeric table. It is not meant to be found and indexed by search engines, it is not supposed to be rendered as a standalone document with a browser top bar (or equivalent) showing its title, etc.
> >
> > The initial intent of such a document may be to only display it in a frame, but since it's independently addressable, nothing stops a search engine from referencing it, a user from bookmarking it, etc. So I don't think that's an example of where omitting <title> is a good idea.
>
> Anyone who bookmarks a document that was not meant to be bookmarked should accept the consequences.

That doesn't seem like a very user-friendly approach.

> But it seems that it is pointless to present any situations where it would be legitimate to omit a title element, since you are prepared to refute any possible example by presenting how things could be different from the scenario given.

There are definitely cases where it's ok to not have the title. For example, a srcdoc="" document doesn't need a title, since it's not independently addressable. An e-mail has a Subject line, so if its body is HTML, it doesn't need a title. Both these examples are in the spec.

--
Ian Hickson
http://ln.hixie.ch/
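To make the srcdoc case concrete, here is a minimal sketch (the outer title and the framed table data are invented for illustration):

  <!DOCTYPE html>
  <meta charset="utf-8">
  <title>Quarterly figures</title>
  <!-- Illustrative only: the outer page carries the title; the srcdoc
       document below is not independently addressable, so it needs none. -->
  <iframe srcdoc="<table><tr><td>Q1</td><td>42</td></tr></table>"></iframe>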
Re: [whatwg] Character-encoding-related threads
Jukka K. Korpela <jkorp...@cs.tut.fi> wrote on Fri, 19 Oct 2012 20:49:16 +0300:
> Anyone who bookmarks a document that was not meant to be bookmarked should accept the consequences.

What makes the web – and collaboration between entities in general – tremendously useful is that information can be re-used in novel ways the original authors never thought of. A document that is “not meant to be bookmarked” cannot be markedly different from one that is meant to be, under these circumstances.

> Yet you seem to deny, a priori, the possibility that a document does not need a title or a name.

Care to elaborate?

--
Nils Dagsson Moskopp // erlehmann
http://dieweltistgarnichtso.net
Re: [whatwg] Character-encoding-related threads
On Fri, 13 Jul 2012, Jukka K. Korpela wrote:
> 2012-06-29 23:42, Ian Hickson wrote:
> > Currently you need a DOCTYPE, a character encoding declaration, a title, and some content. I'd love to be in a position where the empty string would be a valid document, personally.
>
> Is content really necessary? The validator.nu service accepts the following:
>
>   <!DOCTYPE html><title></title>

It's a SHOULD-level requirement; search the spec for the word "palpable".

> But the title element isn't really needed, and unless I'm mistaken, the current rules allow its omission under some conditions - which cannot be tested algorithmically, so conformance checkers should issue a warning at most about a missing title. It might be better to declare title optional but strongly recommend its use on web or intranet pages (it might be rather irrelevant in other uses of HTML).

That's basically what the spec says -- if there's a higher-level protocol that gives a title, then it's not required. It's only required if there's no way to get a title.

Are there any situations that this doesn't handle where it would be legitimate to omit a title element?

--
Ian Hickson
http://ln.hixie.ch/
Re: [whatwg] Character-encoding-related threads
2012-10-19 2:09, Ian Hickson wrote:
> On Fri, 13 Jul 2012, Jukka K. Korpela wrote:
> > [...] It might be better to declare title optional but strongly recommend its use on web or intranet pages (it might be rather irrelevant in other uses of HTML).
>
> That's basically what the spec says -- if there's a higher-level protocol that gives a title, then it's not required. It's only required if there's no way to get a title.

My point is that the title may be irrelevant, rather than specified using a higher-level protocol.

> Are there any situations that this doesn't handle where it would be legitimate to omit a title element?

Perhaps the simplest case is an HTML document that is only meant to be displayed inside an inline frame and containing, say, just a numeric table. It is not meant to be found and indexed by search engines, it is not supposed to be rendered as a standalone document with a browser top bar (or equivalent) showing its title, etc.

The current wording looks OK to me, and to me it says that a title is not needed when the document is not to be used out of context:

  "The title element represents the document's title or name. Authors should use titles that identify their documents even when they are used out of context, for example in a user's history or bookmarks, or in search results."
  http://www.whatwg.org/specs/web-apps/current-work/#the-title-element

Authors may still wish to use a title element in a document that is to be just shown in an inline frame, but it is comment-like then. I don't think it's something that should be required (even in a "should" clause).

Yucca
Re: [whatwg] Character-encoding-related threads
2012-06-29 23:42, Ian Hickson wrote:
> I consider all boilerplate to be a significant burden. I think there's a huge win to making it trivial to create a Web page. Anything we require makes it less trivial.

It's a win, but I'm not sure about the "huge". When learning HTML, it's an important aspect, and also when typing HTML by hand, but then it's mostly a convenience - and it helps to avoid annoying problems caused e.g. by making a single typo in a DOCTYPE declaration. So <!DOCTYPE html> is really an improvement.

> Currently you need a DOCTYPE, a character encoding declaration, a title, and some content. I'd love to be in a position where the empty string would be a valid document, personally.

Is content really necessary? The validator.nu service accepts the following:

  <!DOCTYPE html><title></title>

I don't think we can get rid of the DOCTYPE anytime soon, as browser vendors are stuck with DOCTYPE sniffing. But the title element isn't really needed, and unless I'm mistaken, the current rules allow its omission under some conditions - which cannot be tested algorithmically, so conformance checkers should issue a warning at most about a missing title. It might be better to declare title optional but strongly recommend its use on web or intranet pages (it might be rather irrelevant in other uses of HTML).

Yucca
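For concreteness, the boilerplate being discussed amounts to a minimal document along these lines (the title and body text are placeholders):

  <!DOCTYPE html>
  <meta charset="utf-8">
  <title>Example page</title>
  <p>Some palpable content.</p>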
Re: [whatwg] Character-encoding-related threads
On Thu, Dec 1, 2011 at 1:28 AM, Faruk Ates <faruka...@me.com> wrote:
> We like to think that “every web developer is surely building things in UTF-8 nowadays” but this is far from true. I still frequently break websites and webapps simply by entering my name (Faruk Ateş).

Firefox 12 whines to the error console when submitting a form using an encoding that cannot represent all of Unicode. Hopefully, after Firefox 12 has been released, this will help Web authors who actually test their sites with the error console open locate forms that can corrupt user input.

> On Wed, 7 Dec 2011, Henri Sivonen wrote:
> > I believe I was implementing exactly what the spec said at the time I implemented that behavior of Validator.nu. I'm particularly convinced that I was following the spec, because I think it's not the optimal behavior. I think pages that don't declare their encoding should always be non-conforming even if they only contain ASCII bytes, because that way templates created by English-oriented (or lorem ipsum -oriented) authors would be caught as non-conforming before non-ASCII text gets filled into them later. Hixie disagreed.
>
> I think it puts an undue burden on authors who are just writing small files with only ASCII. 7-bit clean ASCII is still the second-most used encoding on the Web (after UTF-8), so I don't think it's a small thing.
> http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html

I still think that allowing ASCII-only pages to omit the encoding declaration is the wrong call. I agree with Simon's point about the doctype and reliance on quirks.

Firefox Nightly (14 if all goes well) whines to the error console when the encoding hasn't been declared, and about a bunch of other encoding-declaration-related bad conditions. It also warns about ASCII-only pages, because I didn't want to burn cycles detecting whether a page is ASCII-only, and because I think it's the wrong call not to whine about ASCII-only templates that might get non-ASCII content later. However, I suppressed the message about the lack of an encoding declaration for different-origin frames, because it is so common for ad iframes that contain only images or Flash objects to lack an encoding declaration that not suppressing the message would have made the error console too noisy. It's cheaper to detect whether the message is about to be emitted for a different-origin frame than to detect whether it's about to be emitted for an ASCII-only page. Besides, authors generally are powerless to fix the technical flaws of different-origin embeds.

> On Mon, 19 Dec 2011, Henri Sivonen wrote:
> > Hmm. The HTML spec isn't too clear about when alias resolution happens, so I (incorrectly, I now think) mapped only UTF-16, UTF-16BE and UTF-16LE (ASCII-case-insensitive) to UTF-8 in <meta> without considering aliases at that point. Hixie, was alias resolution supposed to happen first? In Firefox, alias resolution happens after, so <meta charset=iso-10646-ucs-2> is ignored per the non-ASCII superset rule.
>
> Assuming you mean for cases where the spec says things like "If encoding is a UTF-16 encoding, then change the value of encoding to UTF-8", then any alias of UTF-16, UTF-16LE, and UTF-16BE (there aren't any registered currently, but "Unicode" might need to be one) would be considered a match.
> ...
> Currently, iso-10646-ucs-2 is neither an alias for UTF-16 nor an encoding that is overridden in any way. It's its own encoding.

That's not reality in Gecko.

> I hope the above is clear. Let me know if you think the spec is vague on the matter.

Evidently, it's too vague, because I read the spec and implemented something different from what you meant.

--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/
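To make the case concrete, the declaration in question is the one below; the comment summarizes the open question as discussed above, not a statement of what the spec requires:

  <!DOCTYPE html>
  <meta charset="iso-10646-ucs-2">
  <!-- Illustrative only. Is this label resolved through encoding aliases first
       (so that, if it mapped to UTF-16, it would then be overridden to UTF-8),
       or is it compared only against the literal labels UTF-16/UTF-16LE/UTF-16BE
       and otherwise ignored under the non-ASCII-superset rule, as in Gecko? -->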
Re: [whatwg] Character-encoding-related threads
On Mon, 13 Feb 2012, Simon Pieters wrote:
> On Sat, 11 Feb 2012 00:44:22 +0100, Ian Hickson <i...@hixie.ch> wrote:
> > On Wed, 7 Dec 2011, Henri Sivonen wrote:
> > > I believe I was implementing exactly what the spec said at the time I implemented that behavior of Validator.nu. I'm particularly convinced that I was following the spec, because I think it's not the optimal behavior. I think pages that don't declare their encoding should always be non-conforming even if they only contain ASCII bytes, because that way templates created by English-oriented (or lorem ipsum -oriented) authors would be caught as non-conforming before non-ASCII text gets filled into them later. Hixie disagreed.
> >
> > I think it puts an undue burden on authors who are just writing small files with only ASCII. 7-bit clean ASCII is still the second-most used encoding on the Web (after UTF-8), so I don't think it's a small thing.
> > http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html
>
> I think this is like saying that requiring <!DOCTYPE HTML> is an undue burden on authors...

It is. You may recall we tried really hard to make it shorter. At the end of the day, however, <!DOCTYPE HTML> is the best we could do.

> ...on authors who are just writing small files that don't use CSS or happen to not be affected by any quirk.

If you have data showing that this would be as many documents as the ASCII-only documents, then it would be worth considering. In practice, though, I think it would be a very small group of pages, far fewer than the double-digit percentages using ASCII.

> In practice, authors who don't declare their encoding can silence the validator by using entities for their non-ASCII characters, but they will still get bitten by encoding problems as soon as they want to submit forms or resolve URLs with %-escaped stuff in the query component, and so forth, so it seems to me authors would be better off if we said that the encoding cruft is required cruft just like the doctype cruft.

Hm, that's an interesting point. Can we make a list of features that rely on the character encoding and have the spec require an encoding if any of those are used? If the list is long or includes anything that it's unreasonable to expect will not be used in most Web pages, then we should remove this particular hole in the conformance criteria.

--
Ian Hickson
http://ln.hixie.ch/
Re: [whatwg] Character-encoding-related threads
On Mon, 13 Feb 2012 18:22:13 +0100, Ian Hickson <i...@hixie.ch> wrote:
> Hm, that's an interesting point. Can we make a list of features that rely on the character encoding and have the spec require an encoding if any of those are used? If the list is long or includes anything that it's unreasonable to expect will not be used in most Web pages, then we should remove this particular hole in the conformance criteria.

The list starts with <a>, and the moment you do not use UTF-8 (or UTF-16, but you really shouldn't) you can run into problems. I wonder how controversial it is to just require UTF-8 and not accept anything else.

--
Anne van Kesteren
http://annevankesteren.nl/
[whatwg] Character-encoding-related threads
Anne van Kesteren, Mon Feb 13 12:02:53 PST 2012:
> On Mon, 13 Feb 2012 20:46:57 +0100, Anne van Kesteren wrote:
> > The list starts with <a>, and the moment you do not use UTF-8 (or UTF-16, but you really shouldn't) you can run into problems. I wonder how controversial it is to just require UTF-8 and not accept anything else.

Hear, hear!

> I guess one could argue that <a> is already captured by the requirements around URL validation. That would leave <form> and potentially some script-related features. It still seems sensible to me to flag everything that is not labeled as UTF-8,

Indeed. Such a step would make it a must for HTML5-compliant authoring tools to default to UTF-8. It would also positively affect validators - they would have to give mild advice about the simplest way to use UTF-8. (E.g. if a page is US-ASCII, or US-ASCII with entities, then a simple move: just add an encoding declaration.) It is likely to have many, many positive side effects.

> but if we want something intermediate we could start by flagging non-UTF-8 pages that use <form>, and maybe obsolete <form accept-charset> or obsolete any other value than utf-8 (I filed a bug on that feature already to at least restrict it to a single value).

The full way - all pages, regardless of <form> - seems the simplest and best.

--
Leif H Silli
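For reference, the accept-charset attribute discussed above looks like this in markup (the action URL and field name are invented); under the restriction being discussed - a single value, possibly utf-8 only - this is the shape that would remain conforming:

  <form action="/comment" accept-charset="utf-8">
    <input name="text">
  </form>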
Re: [whatwg] Character-encoding-related threads
On Mon, 13 Feb 2012 18:22:13 +0100, Ian Hickson <i...@hixie.ch> wrote:
> > I think this is like saying that requiring <!DOCTYPE HTML> is an undue burden on authors...
>
> It is. You may recall we tried really hard to make it shorter. At the end of the day, however, <!DOCTYPE HTML> is the best we could do.

It is a burden, but it's not significantly difficult or anything.

> > In practice, authors who don't declare their encoding can silence the validator by using entities for their non-ASCII characters, but they will still get bitten by encoding problems as soon as they want to submit forms or resolve URLs with %-escaped stuff in the query component, and so forth, so it seems to me authors would be better off if we said that the encoding cruft is required cruft just like the doctype cruft.
>
> Hm, that's an interesting point. Can we make a list of features that rely on the character encoding and have the spec require an encoding if any of those are used? If the list is long or includes anything that it's unreasonable to expect will not be used in most Web pages, then we should remove this particular hole in the conformance criteria.

The list may well be longer, I haven't checked, but I don't think that matters. The URL-resolving problem is a bad problem because it means links will stop working for users that have a different default encoding, so those users leave and go to a competitor site. The form problem is a bad problem because it means that the database will be filled with content using various different encodings with no knowledge of what is what, so when the author realizes this and fixes it by declaring the encoding, it's already too late: the data is broken and is very hard to repair. Letting authors get themselves into a situation where they have broken data even though it could have been easily prevented seems more like an undue burden to me.

Note that both of these features can be hidden in scripts where validators currently don't even look, so I think it's not a good idea to make the requirement conditional on these features.

--
Simon Pieters
Opera Software
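A sketch of the two failure modes described above, assuming a page that declares no encoding at all (the URL, link text and field values are invented):

  <!-- Illustrative only. With no declared encoding the browser guesses: the
       link resolves to ?q=caf%C3%A9 under a UTF-8 guess but to ?q=caf%E9 under
       a windows-1252 guess, so different users request different URLs. The
       form likewise submits its text in whichever encoding was guessed, so the
       server ends up storing a mix of encodings with no record of which is
       which. -->
  <a href="/search?q=café">Search for café</a>
  <form action="/search">
    <input name="q" value="café">
  </form>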
Re: [whatwg] Character-encoding-related threads
On Sat, 11 Feb 2012 00:44:22 +0100, Ian Hickson <i...@hixie.ch> wrote:
> On Wed, 7 Dec 2011, Henri Sivonen wrote:
> > I believe I was implementing exactly what the spec said at the time I implemented that behavior of Validator.nu. I'm particularly convinced that I was following the spec, because I think it's not the optimal behavior. I think pages that don't declare their encoding should always be non-conforming even if they only contain ASCII bytes, because that way templates created by English-oriented (or lorem ipsum -oriented) authors would be caught as non-conforming before non-ASCII text gets filled into them later. Hixie disagreed.
>
> I think it puts an undue burden on authors who are just writing small files with only ASCII. 7-bit clean ASCII is still the second-most used encoding on the Web (after UTF-8), so I don't think it's a small thing.
> http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html

I think this is like saying that requiring <!DOCTYPE HTML> is an undue burden on authors who are just writing small files that don't use CSS or happen to not be affected by any quirk.

In practice, authors who don't declare their encoding can silence the validator by using entities for their non-ASCII characters, but they will still get bitten by encoding problems as soon as they want to submit forms or resolve URLs with %-escaped stuff in the query component, and so forth, so it seems to me authors would be better off if we said that the encoding cruft is required cruft just like the doctype cruft.

--
Simon Pieters
Opera Software
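An illustration of the entity workaround mentioned above (the name and URL are invented): every byte below is ASCII, so an ASCII-only exemption keeps the validator quiet, yet the form still submits user input in whatever encoding the browser guessed for the page:

  <p>Faruk Ate&#x15F;</p>  <!-- renders as "Ateş" without any non-ASCII bytes -->
  <form action="/register">
    <input name="name">
  </form>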
[whatwg] Character-encoding-related threads
On Mon, 6 Jun 2011, Boris Zbarsky wrote:
> You can detect other effects by seeing what unescape() does in the resulting document, iirc.

Doesn't seem like it:

  http://junkyard.damowmow.com/499
  http://junkyard.damowmow.com/500

In both cases, unescape() is assuming Win1252, even though in one case the encoding is claimed as UTF-8.

> As well as URIs including %-encoded bytes and so forth.

In both cases here, I see URLs getting interpreted as UTF-8, not based on the encoding of the containing page:

  http://junkyard.damowmow.com/501
  http://junkyard.damowmow.com/502

> Also you can detect what charset is used for stylesheets included by the document that don't declare their own charset.

My head hurt too much from setting up the previous two tests to actually test this.

> There are probably other places that use the document encoding. Worth testing some of this stuff.

I'm happy to consider specific tests. Currently, however, it seems like Firefox is the only one with any kind of magic involved in determining the encoding of javascript: URLs at all, and that magic doesn't seem to have as many side effects as one would expect, so I've left it as is.

On Wed, 30 Nov 2011, Faruk Ates wrote:
> My understanding is that all browsers default to Western Latin (ISO-8859-1) encoding by default (for Western-world downloads/OSes) due to legacy content on the web. But how relevant is that still today? Has any browser done any recent research into the need for this? I'm wondering if it might not be good to start encouraging defaulting to UTF-8, and only fall back to Western Latin if it is detected that the content is very old / served by old infrastructure or servers, etc. And of course if the content is served with an explicit encoding of Western Latin.

That is in fact exactly what the spec requires. The way that we detect that the content is very old / served by old infrastructure is that it lacks a character encoding declaration... :-)

On Wed, 30 Nov 2011, L. David Baron wrote:
> I would, however, like to see movement towards defaulting to UTF-8: the current situation makes the Web less world-wide because pages that work for one user don't work for another. I'm just not quite sure how to get from here to there, though, since such changes are likely to make users experience broken content.

One of the ways I have personally been pushing UTF-8 in the specs is by making new formats only support UTF-8.

On Thu, 1 Dec 2011, Sergiusz Wolicki wrote:
> I have read section 4.2.5.5 of the WHATWG HTML spec and I think it is sufficient. It requires that any non-US-ASCII document has an explicit character encoding declaration. It also recommends UTF-8 for all new documents and for authoring tools' default encoding. Therefore, any document conforming to HTML5 should not pose any problem in this area. The default encoding issue is therefore for old stuff. But I have seen a lot of pages, in browsers and in mail, that were tagged with one encoding and encoded in another. Hence, documents without a charset declaration are only one of the reasons for the garbage we see. Therefore, I see no point in trying to fix anything in browsers by changing the ancient defaults (risking compatibility issues). Energy should go into filing bugs against misbehaving authoring tools and into adding proper recommendations and education in HTML guidelines and tutorials.

Indeed.

On Fri, 2 Dec 2011, Henri Sivonen wrote:
> On Thu, Dec 1, 2011 at 8:29 PM, Brett Zamir <bret...@yahoo.com> wrote:
> > How about a Compatibility Mode for the older non-UTF-8 character set approach, specific to page?
>
> That compatibility mode already exists: it's the default mode -- just like the quirks mode is the default for pages that don't have a doctype. You opt out of the quirks mode by saying <!DOCTYPE html>. You opt out of the encoding compatibility mode by saying <meta charset=utf-8>.

Quite.

On Mon, 5 Dec 2011, Darin Adler wrote:
> On Dec 5, 2011, at 4:10 PM, Kornel Lesiński wrote:
> > Could <!DOCTYPE html> be an opt-in to default UTF-8 encoding? It would be nice to minimize the number of declarations a page needs to include.
>
> I like that idea. Maybe it's not too late.

Just configure your server to send back UTF-8 character encoding declarations by default, and you don't need to think about it.

On Wed, 7 Dec 2011, Henri Sivonen wrote:
> If you want to minimize the declarations, you can put the UTF-8 BOM followed by <!DOCTYPE html> at the start of the file.

That is indeed another terse solution.

On Mon, 5 Dec 2011, Sergiusz Wolicki wrote:
> As far as I understand, HTML5 defines US-ASCII to be the default and requires that any other encoding is explicitly declared. I do like this approach.

It's important not to confuse the default for authors (which is indeed ASCII) and the default for browsers (which is a complicated answer, but which defines the processing for bytes