Re: [whatwg] Character-encoding-related threads

2012-10-19 Thread Ian Hickson
On Fri, 19 Oct 2012, Jukka K. Korpela wrote:
  
  Are there any situations that this doesn't handle where it would be 
  legitimate to omit a title element?
 
 Perhaps the simplest case is an HTML document that is only meant to be
 displayed inside an inline frame and that contains, say, just a numeric
 table. It is not meant to be found and indexed by search engines, it is 
 not supposed to be rendered as a standalone document with a browser top 
 bar (or equivalent) showing its title, etc.

The initial intent of such a document may be to only display it in a 
frame, but since it's independently addressable, nothing stops a search 
engine from referencing it, a user from bookmarking it, etc. So I don't 
think that's an example of where omitting title is a good idea.


 The current wording looks OK to me, and to me, it says that a title
 is not needed when the document is not to be used out of context:
 
 The title element represents the document's title or name. Authors 
 should use titles that identify their documents even when they are used 
 out of context, for example in a user's history or bookmarks, or in 
 search results. 
 http://www.whatwg.org/specs/web-apps/current-work/#the-title-element

That isn't what that says. Please make sure never to read between the 
lines when reading a specification.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Character-encoding-related threads

2012-10-19 Thread Jukka K. Korpela

2012-10-19 19:33, Ian Hickson wrote:


On Fri, 19 Oct 2012, Jukka K. Korpela wrote:


Are there any situations that this doesn't handle where it would be
legitimate to omit a title element?


Perhaps the simplest case is an HTML document that is only meant to be
displayed inside an inline frame and that contains, say, just a numeric
table. It is not meant to be found and indexed by search engines, it is
not supposed to be rendered as a standalone document with a browser top
bar (or equivalent) showing its title, etc.


The initial intent of such a document may be to only display it in a
frame, but since it's independently addressable, nothing stops a search
engine from referencing it, a user from bookmarking it, etc. So I don't
think that's an example of where omitting title is a good idea.


Anyone who bookmarks a document that was not meant to be bookmarked 
should accept the consequences.


But it seems that it is pointless to present any situations where it 
would be legitimate to omit a title element, since you are prepared to 
refute any possible example by showing how things could be 
different from the scenario given.



The title element represents the document's title or name.


Yet you seem to deny, a priori, the possibility that a document does not 
need a title or a name.


Yucca




Re: [whatwg] Character-encoding-related threads

2012-10-19 Thread Ian Hickson
On Fri, 19 Oct 2012, Jukka K. Korpela wrote:
 2012-10-19 19:33, Ian Hickson wrote:
  On Fri, 19 Oct 2012, Jukka K. Korpela wrote:

Are there any situations that this doesn't handle where it would 
be legitimate to omit a title element?
   
   Perhaps the simplest case is an HTML document that is only meant to 
   be displayed inside an inline frame and that contains, say, just a 
   numeric table. It is not meant to be found and indexed by search 
   engines, it is not supposed to be rendered as a standalone document 
   with a browser top bar (or equivalent) showing its title, etc.
  
  The initial intent of such a document may be to only display it in a 
  frame, but since it's independently addressable, nothing stops a 
  search engine from referencing it, a user from bookmarking it, etc. So 
  I don't think that's an example of where omitting title is a good 
  idea.
 
 Anyone who bookmarks a document that was not meant to be bookmarked 
 should accept the consequences.

That doesn't seem like a very user-friendly approach.


 But it seems that it is pointless to present any situations where it 
 would be legitimate to omit a title element, since you are prepared to 
 refute any possible example by showing how things could be 
 different from the scenario given.

There are definitely cases where it's OK not to have a title. For 
example, a srcdoc= document doesn't need a title, since it's not 
independently addressable. An e-mail has a Subject line, so if its body 
is HTML, it doesn't need a title. Both of these examples are in the spec.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Character-encoding-related threads

2012-10-19 Thread Nils Dagsson Moskopp
Jukka K. Korpela jkorp...@cs.tut.fi wrote on Fri, 19 Oct 2012
20:49:16 +0300:

 Anyone who bookmarks a document that was not meant to be bookmarked 
 should accept the consequences.

What makes the web – and collaboration between entities in general –
tremendously useful is that information can be re-used in novel ways
the original authors never thought of. A document that is “not meant to
be bookmarked” cannot be markedly different from one that is meant to
be, under these circumstances.

 Yet you seem to deny, a priori, the possibility that a document does
 not need a title or a name.

Care to elaborate?

-- 
Nils Dagsson Moskopp // erlehmann
http://dieweltistgarnichtso.net


Re: [whatwg] Character-encoding-related threads

2012-10-18 Thread Ian Hickson
On Fri, 13 Jul 2012, Jukka K. Korpela wrote:
 2012-06-29 23:42, Ian Hickson wrote:
 
  Currently you need a DOCTYPE, a character encoding declaration, a 
  title, and some content. I'd love to be in a position where the empty 
  string would be a valid document, personally.
 
 Is content really necessary? The validator.nu service accepts the 
 following:
 
 <!DOCTYPE html><title></title>

It's a SHOULD-level requirement; search the spec for the word "palpable".



 But the title element isn't really needed, and unless I'm mistaken, 
 the current rules allow its omission under some conditions - which 
 cannot be tested algorithmically, so conformance checkers should issue 
 at most a warning about a missing title.
 
 It might be better to declare title optional but strongly recommend 
 its use on web or intranet pages (it might be rather irrelevant in other 
 uses of HTML).

That's basically what the spec says -- if there's a higher-level protocol 
that gives a title, then it's not required. It's only required if 
there's no way to get a title.

Are there any situations that this doesn't handle where it would be 
legitimate to omit a title element?

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Character-encoding-related threads

2012-10-18 Thread Jukka K. Korpela

2012-10-19 2:09, Ian Hickson wrote:

 On Fri, 13 Jul 2012, Jukka K. Korpela wrote:
[...]
 It might be better to declare title optional but strongly recommend
 its use on web or intranet pages (it might be rather irrelevant in other
 uses of HTML).

 That's basically what the spec says -- if there's a higher-level protocol
 that gives a title, then it's not required. It's only required if
 there's no way to get a title.

My point is that the title may be irrelevant, rather than specified 
using a higher-level protocol.


 Are there any situations that this doesn't handle where it would be
 legitimate to omit a title element?

Perhaps the simplest case is an HTML document that is only meant to be 
displayed inside an inline frame and that contains, say, just a numeric 
table. It is not meant to be found and indexed by search engines, it is 
not supposed to be rendered as a standalone document with a browser top 
bar (or equivalent) showing its title, etc.


The current wording looks OK to me, and to me, it says that a title 
is not needed when the document is not to be used out of context:


The title element represents the document's title or name. Authors 
should use titles that identify their documents even when they are used 
out of context, for example in a user's history or bookmarks, or in 
search results.

http://www.whatwg.org/specs/web-apps/current-work/#the-title-element

Authors may still wish to use a title element in a document that is to 
be just shown in an inline frame, but it is comment-like then. I don't 
think it's something that should be required (even in a should clause).


Yucca



Re: [whatwg] Character-encoding-related threads

2012-07-13 Thread Jukka K. Korpela

2012-06-29 23:42, Ian Hickson wrote:


I consider all boilerplate to be a significant burden. I think there's a
huge win to making it trivial to create a Web page. Anything we require
makes it less trivial.


It's a win, but I'm not sure about the "huge". When learning HTML, it's 
an important aspect, and also when typing HTML by hand, but then it's 
mostly a convenience - and it helps to avoid annoying problems caused 
e.g. by making a single typo in a DOCTYPE declaration. So 
<!DOCTYPE html> is really an improvement.



Currently you need a DOCTYPE, a character encoding declaration, a title,
and some content. I'd love to be in a position where the empty string
would be a valid document, personally.


Is content really necessary? The validator.nu service accepts the following:

<!DOCTYPE html><title></title>

I don't think we can get rid of DOCTYPE anytime soon, as browser vendors 
are stuck with DOCTYPE sniffing.
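
For comparison, a minimal sketch (placeholder content mine) of the full
boilerplate Ian lists - DOCTYPE, encoding declaration, title, and some
content; the html, head and body tags may be omitted:

  <!DOCTYPE html>
  <meta charset="utf-8">
  <title>Minimal page</title>
  <p>Hello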


But the title element isn't really needed, and unless I'm mistaken, 
the current rules allow its omission under some conditions - which 
cannot be tested algorithmically, so conformance checkers should issue 
at most a warning about a missing title.


It might be better to declare title optional but strongly recommend 
its use on web or intranet pages (it might be rather irrelevant in other 
uses of HTML).


Yucca



Re: [whatwg] Character-encoding-related threads

2012-03-30 Thread Henri Sivonen
On Thu, Dec 1, 2011 at 1:28 AM, Faruk Ates faruka...@me.com wrote:
 We like to think that “every web developer is surely building things in UTF-8 
 nowadays” but this is far from true. I still frequently break websites and 
 webapps simply by entering my name (Faruk Ateş).

Firefox 12 whines to the error console when submitting a form using an
encoding that cannot represent all of Unicode. Hopefully, after Firefox
12 has been released, this will help Web authors who actually test
their sites with the error console open to locate forms that can
corrupt user input.

 On Wed, 7 Dec 2011, Henri Sivonen wrote:

 I believe I was implementing exactly what the spec said at the time I
 implemented that behavior of Validator.nu. I'm particularly convinced
 that I was following the spec, because I think it's not the optimal
 behavior. I think pages that don't declare their encoding should always
 be non-conforming even if they only contain ASCII bytes, because that
 way templates created by English-oriented (or lorem ipsum -oriented)
 authors would be caught as non-conforming before non-ASCII text gets
 filled into them later. Hixie disagreed.

 I think it puts an undue burden on authors who are just writing small
 files with only ASCII. 7-bit clean ASCII is still the second-most used
 encoding on the Web (after UTF-8), so I don't think it's a small thing.

 http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html

I still think that allowing ASCII-only pages to omit the encoding
declaration is the wrong call. I agree with Simon's point about the
doctype and reliance on quirks.

Firefox Nightly (14 if all goes well) whines to the error console when
the encoding hasn't been declared and about a bunch of other encoding
declaration-related bad conditions. It also warns about ASCII-only
pages, because I didn't want to burn cycles detecting whether a page
is ASCII-only and because I think it's the wrong call not to whine
about ASCII-only templates that might get non-ASCII content later.
However, I suppressed the message about the lack of an encoding
declaration for different-origin frames, because it is so common for
ad iframes that contain only images or flash objects to lack an
encoding declaration that not suppressing the message would have made
the error console too noisy. It's cheaper to detect whether the
message is about to be emitted for a different-origin frame than to
detect whether it's about to be emitted for an ASCII-only page.
Besides, authors generally are powerless to fix the technical flaws of
different-origin embeds.

 On Mon, 19 Dec 2011, Henri Sivonen wrote:

 Hmm. The HTML spec isn't too clear about when alias resolution happens,
 so I (incorrectly, I now think) mapped only UTF-16, UTF-16BE and
 UTF-16LE (ASCII-case-insensitive) to UTF-8 in <meta> without considering
 aliases at that point. Hixie, was alias resolution supposed to happen
 first? In Firefox, alias resolution happens after, so
 <meta charset=iso-10646-ucs-2> is ignored per the non-ASCII-superset rule.

 Assuming you mean for cases where the spec says things like "If encoding
 is a UTF-16 encoding, then change the value of encoding to UTF-8", then
 any alias of UTF-16, UTF-16LE, and UTF-16BE (there aren't any registered
 currently, but "Unicode" might need to be one) would be considered a
 match.
...
 Currently, iso-10646-ucs-2 is neither an alias for UTF-16 nor an
 encoding that is overridden in any way. It's its own encoding.

That's not reality in Gecko.

 I hope the above is clear. Let me know if you think the spec is vague on
 the matter.

Evidently, it's too vague, because I read the spec and implemented
something different from what you meant.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/


Re: [whatwg] Character-encoding-related threads

2012-02-13 Thread Ian Hickson
On Mon, 13 Feb 2012, Simon Pieters wrote:
 On Sat, 11 Feb 2012 00:44:22 +0100, Ian Hickson i...@hixie.ch wrote:
  On Wed, 7 Dec 2011, Henri Sivonen wrote:
   
   I believe I was implementing exactly what the spec said at the time 
   I implemented that behavior of Validator.nu. I'm particularly 
   convinced that I was following the spec, because I think it's not 
   the optimal behavior. I think pages that don't declare their 
   encoding should always be non-conforming even if they only contain 
   ASCII bytes, because that way templates created by English-oriented 
   (or lorem ipsum -oriented) authors would be caught as non-conforming 
   before non-ASCII text gets filled into them later. Hixie disagreed.
  
  I think it puts an undue burden on authors who are just writing small 
  files with only ASCII. 7-bit clean ASCII is still the second-most used 
  encoding on the Web (after UTF-8), so I don't think it's a small 
  thing.
  
  http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html
 
 I think this is like saying that requiring <!DOCTYPE HTML> is an undue 
 burden on authors...

It is. You may recall we tried really hard to make it shorter. At the end 
of the day, however, <!DOCTYPE HTML> is the best we could do.


 ...on authors who are just writing small files that don't use CSS or 
 happen to not be affected by any quirk.

If you have data showing that this would be as many documents as the 
ASCII-only documents, then it would be worth considering. In practice 
though I think it would be a very small group of pages, far fewer than 
the double-digit percentages using ASCII.


 In practice, authors who don't declare their encoding can silence the 
 validator by using entities for their non-ASCII characters, but they 
 will still get bitten by encoding problems as soon as they want to 
 submit forms or resolve URLs with %-escaped stuff in the query 
 component, and so forth, so it seems to me authors would be better off 
 if we said that the encoding cruft is required cruft just like the 
 doctype cruft.

Hm, that's an interesting point. Can we make a list of features that rely 
on the character encoding and have the spec require an encoding if any of 
those are used?

If the list is long or includes anything that it's unreasonable to expect 
will not be used in most Web pages, then we should remove this particular 
hole in the conformance criteria.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Character-encoding-related threads

2012-02-13 Thread Anne van Kesteren

On Mon, 13 Feb 2012 18:22:13 +0100, Ian Hickson i...@hixie.ch wrote:

Hm, that's an interesting point. Can we make a list of features that rely
on the character encoding and have the spec require an encoding if any of
those are used?

If the list is long or includes anything that it's unreasonable to expect
will not be used in most Web pages, then we should remove this particular
hole in the conformance criteria.


The list starts with <a>, and the moment you do not use UTF-8 (or UTF-16,  
but you really shouldn't) you can run into problems. I wonder how  
controversial it is to just require UTF-8 and not accept anything else.



--
Anne van Kesteren
http://annevankesteren.nl/


[whatwg] Character-encoding-related threads

2012-02-13 Thread Leif Halvard Silli
Anne van Kesteren, Mon Feb 13 12:02:53 PST 2012:
 On Mon, 13 Feb 2012 20:46:57 +0100, Anne van Kesteren wrote:

 The list starts with <a>, and the moment you do not use UTF-8 (or UTF-16,  
 but you really shouldn't) you can run into problems. I wonder how  
 controversial it is to just require UTF-8 and not accept anything else.

Hear, hear!

 I guess one could argue that <a> is already captured by the requirements  
 around URL validation. That would leave <form> and potentially some  
 script-related features. It still seems sensible to me to flag everything  
 that is not labeled as UTF-8,

Indeed. Such a step would make it a must for HTML5-compliant authoring 
tools to default to UTF-8. It would also positively affect validators - 
they would have to give mild advice about the simplest way to use 
UTF-8. (E.g. if a page is US-ASCII, or US-ASCII with entities, it's a 
simple move: just add an encoding declaration.) It is likely to have 
many, many positive side effects.
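
That simple move, as a sketch on a hypothetical ASCII-only page (the
<meta> line is the addition; every ASCII byte is already valid UTF-8,
so nothing else needs to change):

  <!DOCTYPE html>
  <meta charset="utf-8">
  <title>Cafe menu</title>
  <p>Before the declaration, caf&eacute; needed an entity; with it, the
  author can simply type the character.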

 but if we want something intermediate we  
 could start by flagging non-UTF-8 pages that use <form>, and maybe  
 obsolete <form accept-charset> or obsolete any value other than utf-8  
 (I filed a bug on that feature already to at least restrict it to a 
 single value).

The full way - all pages, regardless of <form> - seems the simplest and 
best.
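
For reference, the single-value form of that attribute would look like
this (sketch; accept-charset fixes the submission encoding regardless
of the page's own encoding):

  <form method="post" action="/submit" accept-charset="utf-8">
    <input name="q">
  </form>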
-- 
Leif H Silli


Re: [whatwg] Character-encoding-related threads

2012-02-13 Thread Simon Pieters

On Mon, 13 Feb 2012 18:22:13 +0100, Ian Hickson i...@hixie.ch wrote:


I think this is like saying that requiring <!DOCTYPE HTML> is an undue
burden on authors...


It is. You may recall we tried really hard to make it shorter. At the end
of the day, however, <!DOCTYPE HTML> is the best we could do.


It is a burden, but it's not significantly difficult or anything.


In practice, authors who don't declare their encoding can silence the
validator by using entities for their non-ASCII characters, but they
will still get bitten by encoding problems as soon as they want to
submit forms or resolve URLs with %-escaped stuff in the query
component, and so forth, so it seems to me authors would be better off
if we said that the encoding cruft is required cruft just like the
doctype cruft.


Hm, that's an interesting point. Can we make a list of features that rely
on the character encoding and have the spec require an encoding if any of
those are used?

If the list is long or includes anything that it's unreasonable to expect
will not be used in most Web pages, then we should remove this particular
hole in the conformance criteria.


The list may well be longer, I haven't checked, but I don't think that  
matters. The URL-resolving problem is a bad problem because it means links  
will stop working for users who have a different default encoding, so  
those users leave and go to a competitor site. The form problem is a bad  
problem because it means that the database will be filled with content  
using various different encodings with no knowledge of what is what, so  
when the author realizes this and fixes it by declaring the encoding,  
it's already too late, the data is broken and is very hard to repair.
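
A sketch of that failure mode (hypothetical page; no encoding is
declared, so the submission encoding follows each user's browser
default):

  <!DOCTYPE html>
  <title>Sign-up</title>
  <form action="/register" method="post">
    <input name="name">
    <button>Send</button>
  </form>

A browser defaulting to Windows-1252 submits a name like Ateş as
name=Ate%26%23351%3B (the unencodable character turns into a numeric
character reference before percent-encoding), while one defaulting to
UTF-8 submits name=Ate%C5%9F - so the database accumulates an
unlabeled mix of both.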


Letting authors get themselves in a situation where they have broken data  
even though it could have been easily prevented seems more like an undue  
burden to me.


Note that both of these features can be hidden in scripts where validators  
currently don't even look, so I think it's not a good idea to make the  
requirement conditional on these features.


--
Simon Pieters
Opera Software


Re: [whatwg] Character-encoding-related threads

2012-02-12 Thread Simon Pieters

On Sat, 11 Feb 2012 00:44:22 +0100, Ian Hickson i...@hixie.ch wrote:


On Wed, 7 Dec 2011, Henri Sivonen wrote:


I believe I was implementing exactly what the spec said at the time I
implemented that behavior of Validator.nu. I'm particularly convinced
that I was following the spec, because I think it's not the optimal
behavior. I think pages that don't declare their encoding should always
be non-conforming even if they only contain ASCII bytes, because that
way templates created by English-oriented (or lorem ipsum -oriented)
authors would be caught as non-conforming before non-ASCII text gets
filled into them later. Hixie disagreed.


I think it puts an undue burden on authors who are just writing small
files with only ASCII. 7-bit clean ASCII is still the second-most used
encoding on the Web (after UTF-8), so I don't think it's a small thing.

http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html


I think this is like saying that requiring <!DOCTYPE HTML> is an undue  
burden on authors who are just writing small files that don't use CSS or  
happen to not be affected by any quirk.


In practice, authors who don't declare their encoding can silence the  
validator by using entities for their non-ASCII characters, but they will  
still get bitten by encoding problems as soon as they want to submit forms  
or resolve URLs with %-escaped stuff in the query component, and so forth,  
so it seems to me authors would be better off if we said that the encoding  
cruft is required cruft just like the doctype cruft.


--
Simon Pieters
Opera Software


[whatwg] Character-encoding-related threads

2012-02-10 Thread Ian Hickson
On Mon, 6 Jun 2011, Boris Zbarsky wrote:
 
 You can detect other effects by seeing what unescape() does in the 
 resulting document, iirc.

Doesn't seem like it:

   http://junkyard.damowmow.com/499
   http://junkyard.damowmow.com/500

In both cases, unescape() is assuming Win1252, even though in one case 
the encoding is claimed as UTF-8.


 As well as URIs including %-encoded bytes and so forth.

In both cases here, I see URLs getting interpreted as UTF-8, not based on 
the encoding of the containing page:

   http://junkyard.damowmow.com/501
   http://junkyard.damowmow.com/502
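
A sketch of the kind of probe involved (assumed shape only, not the
actual junkyard pages):

  <!DOCTYPE html>
  <meta charset="windows-1252">
  <title>Encoding probe</title>
  <script>
    // unescape() decodes %XX as a raw code unit (Latin-1/Win1252-style),
    // whatever encoding the page declares:
    console.log(unescape('%E9') === '\u00E9');              // true
    // %-escapes in URL query components are decoded as UTF-8 by
    // decodeURIComponent():
    console.log(decodeURIComponent('%C3%A9') === '\u00E9'); // true
  </script>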


 Also you can detect what charset is used for stylesheets included by the 
 document that don't declare their own charset.

My head hurt too much from setting up the previous two tests to actually 
test this.


 There are probably other places that use the document encoding.  Worth 
 testing some of this stuff

I'm happy to consider specific tests. Currently however, it seems like 
Firefox is the only one with any kind of magic involved in determining the 
encoding of javascript: URLs at all, and that magic doesn't seem to have 
as many side effects as one would expect, so I've left it as is.


On Wed, 30 Nov 2011, Faruk Ates wrote:

 My understanding is that all browsers default to Western Latin 
 (ISO-8859-1) encoding by default (for Western-world downloads/OSes) due 
 to legacy content on the web. But how relevant is that still today? Has 
 any browser done any recent research into the need for this?
 
 I'm wondering if it might not be good to start encouraging defaulting to 
 UTF-8, and only fallback to Western Latin if it is detected that the 
 content is very old / served by old infrastructure or servers, etc. And 
 of course if the content is served with an explicit encoding of Western 
 Latin.

That is in fact exactly what the spec requires. The way that we detect 
that the content is very old / served by old infrastructure is that it 
lacks a character encoding declaration... :-)


On Wed, 30 Nov 2011, L. David Baron wrote:
 
 I would, however, like to see movement towards defaulting to UTF-8: the 
 current situation makes the Web less world-wide because pages that work 
 for one user don't work for another.
 
 I'm just not quite sure how to get from here to there, though, since 
 such changes are likely to make users experience broken content.

One of the ways I have personally been pushing UTF-8 in the specs is by 
making new formats only support UTF-8.


On Thu, 1 Dec 2011, Sergiusz Wolicki wrote:

 I have read section 4.2.5.5 of the WHATWG HTML spec and I think it is 
 sufficient.  It requires that any non-US-ASCII document has an explicit 
 character encoding declaration. It also recommends UTF-8 for all new 
 documents and for authoring tools' default encoding.  Therefore, any 
 document conforming to HTML5 should not pose any problem in this area.
 
 The default encoding issue is therefore for old stuff.  But I have seen 
 a lot of pages, in browsers and in mail, that were tagged with one 
 encoding and encoded in another.  Hence, documents without a charset 
 declaration are only one of the reasons for the garbage we see. Therefore, I 
 see no point in trying to fix anything in browsers by changing the 
 ancient defaults (risking compatibility issues). Energy should go into 
 filing bugs against misbehaving authoring tools and into adding proper 
 recommendations and education in HTML guidelines and tutorials.

Indeed.


On Fri, 2 Dec 2011, Henri Sivonen wrote:
 On Thu, Dec 1, 2011 at 8:29 PM, Brett Zamir bret...@yahoo.com wrote:
  How about a Compatibility Mode for the older non-UTF-8 character set 
  approach, specific to page?
 
 That compatibility mode already exists: It's the default mode--just like 
 the quirks mode is the default for pages that don't have a doctype. You 
 opt out of the quirks mode by saying <!DOCTYPE html>. You opt out of the 
 encoding compatibility mode by saying <meta charset=utf-8>.

Quite.
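
Both opt-outs together, as a minimal sketch:

  <!DOCTYPE html>           <!-- opts out of quirks mode -->
  <meta charset="utf-8">    <!-- opts out of the legacy encoding default -->
  <title>Example</title>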


On Mon, 5 Dec 2011, Darin Adler wrote:
 On Dec 5, 2011, at 4:10 PM, Kornel Lesiński wrote:
  
   Could <!DOCTYPE html> be an opt-in to default UTF-8 encoding?
  
   It would be nice to minimize the number of declarations a page needs to 
  include.
 
 I like that idea. Maybe it's not too late.

Just configure your server to send back UTF-8 character encoding 
declarations by default, and you don't need to think about it.


On Wed, 7 Dec 2011, Henri Sivonen wrote:
 
 If you want to minimize the declarations, you can put the UTF-8 BOM 
 followed by <!DOCTYPE html> at the start of the file.

That is indeed another terse solution.
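
In byte terms such a file would begin like this (sketch; the BOM itself
is invisible in most editors):

  EF BB BF                 <- UTF-8 byte order mark; it doubles as the
                              encoding declaration
  <!DOCTYPE html>
  <title>Terse page</title>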


On Mon, 5 Dec 2011, Sergiusz Wolicki wrote:
 
 As far as I understand, HTML5 defines US-ASCII to be the default and 
 requires that any other encoding is explicitly declared. I do like this 
 approach.

It's important not to confuse the default for authors (which is indeed 
ASCII) and the default for browsers (which is a complicated answer, but 
which defines the processing for bytes