Re: [XHR2] Avoiding charset dependencies on user settings

2011-09-30 Thread Henri Sivonen
On Thu, Sep 29, 2011 at 11:27 PM, Jonas Sicking jo...@sicking.cc wrote:
 Finally, XHR allows the programmer using XHR to override the MIME
 type, including the charset parameter, so if the person adding new XHR
 code can't change the encoding declarations on legacy data, (s)he can
 override the UTF-8 last resort from JS (and a given repository of
 legacy data pretty often has a self-consistent encoding that the XHR
 programmer can discover ahead of time). I think requiring the person
 adding XHR code to write that line is much better than adding more
 locale and/or user setting-dependent behavior to the Web platform.

 This is certainly a good point, and is likely generally the easiest
 solution for someone rolling out an AJAX version of a new website
 rather than requiring webserver configuration changes. However, it
 still doesn't solve the case where a website uses different encodings
 for different documents, as described above.

If we want to *really* address that problem, I think the right way to
address it in XHR would be to add a way to XHR to override the HTML
last resort encoding so that authors who are dealing with a content
repository migrated partially to UTF-8 can set the last resort to the
legacy encoding they know they have instead of ending up overriding
the whole HTTP Content-Type for the UTF-8 content. (I'm assuming here
that if someone is migrating a site from a legacy encoding to UTF-8,
the UTF-8 parts declare that they are UTF-8. Authors who migrate to
UTF-8 but, even after realizing that legacy encodings suck and UTF-8
rocks, are *still* too clueless to *declare* that they use UTF-8 don't
deserve any further help from browsers, IMO.)
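
To make that concrete, here is a purely hypothetical sketch of what such
an addition could look like from script. Nothing named
lastResortEncoding exists in XHR today; the member name and the URL are
invented for illustration only:

  var xhr = new XMLHttpRequest();
  xhr.open("GET", "/legacy/page.html");
  xhr.responseType = "document";
  // Hypothetical member: consulted only when the response has no HTTP
  // charset, no BOM and no meta declaration.
  xhr.lastResortEncoding = "windows-1251";
  xhr.send();

Declared UTF-8 resources would be unaffected, while the undeclared
legacy part of the repository would get a sane fallback instead of
having its Content-Type overridden wholesale.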

 I'm particularly keen to hear how this will affect locales which do
 not use ASCII by default. Most of the content I personally consume is
 written in English or Swedish. Most of that is generally legible even
 if decoded using the wrong encoding. I'm under the impression that
 that is not the case for, for example, Chinese or Hindi documents. I
 think it would be sad if we went with any particular solution here
 without consulting people from those locales.

The old way of putting Hindi content on the Web relied on
intentionally misencoded downloadable fonts. From the browser's point
of view, such deep legacy text is Windows-1252. Hindi content that
works without misencoded fonts is UTF-8. So I think Hindi isn't
relevant to this thread.

Users in CJK and Cyrillic locales are the ones most hurt by authors
not declaring their encodings (well, actually, readers of CJK and
Cyrillic languages whose browsers are configured for other locales are
hurt *even* more), so I think it would be completely backwards for
browsers to complicate new features in order to enable authors in the
CJK and Cyrillic locales to deploy *new* features and *still* not declare
encodings. Instead, I think we should design new features to make
authors everywhere get their act together and declare their encodings.
(Note that this position is much less extreme than the more
enlightened position that e.g. HTML5 App Cache manifests take: requiring
everyone to use UTF-8 for a new feature so that declarations aren't
needed.)

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/



Re: [XHR2] Avoiding charset dependencies on user settings

2011-09-29 Thread Henri Sivonen
On Thu, Sep 29, 2011 at 3:30 AM, Jonas Sicking jo...@sicking.cc wrote:
 Do we have any guesses or data as to what percentage of existing pages
 would parse correctly with the above suggestion?

I don't have guesses or data, because I think the question is irrelevant.

When XHR is used for retrieving responseXML for legacy text/html, I'm
not expecting legacy data that doesn't have encoding declarations to be
UTF-8 encoded. I want to use UTF-8 for consistency with legacy
responseText and for well-defined behavior. (In the HTML parsing
algorithm at least, we value well-defined behavior over guessing the
author's intent correctly.) When people add responseXML usage for
text/html, I expect them to add encoding declarations (if they are
missing) when they add XHR code that uses responseXML for text/html.

We assume for security purposes that an origin is under the control of
one authority--i.e. that authority can change stuff within the origin.
I'm suggesting that when XHR is used to retrieve text/html data from
the same origin, if the text/html data doesn't already have its
encoding declared, the person exercising the origin's authority to add
XHR should take care of exercising the origin's authority to modify
the text/html resources to add encoding declarations.

XHR can't be used for retrieving different-origin legacy data without
the other origin opting in using CORS. I posit that it's less onerous
for the other origin to declare its encoding than to add CORS support.
Since the other origin needs to participate anyway, I think it's
reasonable to require declaring the encoding to be part of the
participation.

Finally, XHR allows the programmer using XHR to override the MIME
type, including the charset parameter, so if the person adding new XHR
code can't change the encoding declarations on legacy data, (s)he can
override the UTF-8 last resort from JS (and a given repository of
legacy data pretty often has a self-consistent encoding that the XHR
programmer can discover ahead of time). I think requiring the person
adding XHR code to write that line is much better than adding more
locale and/or user setting-dependent behavior to the Web platform.
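
For what it's worth, "that line" is just an overrideMimeType call;
something like the following, with windows-1252 standing in for whatever
encoding the legacy repository is known to use (the URL is illustrative):

  var xhr = new XMLHttpRequest();
  xhr.open("GET", "/legacy/data.html");
  // Overrides both the MIME type and the charset used for decoding, so
  // the UTF-8 last resort never comes into play for this request.
  xhr.overrideMimeType("text/html; charset=windows-1252");
  xhr.send();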

 What outcome do you suggest and why? It seems you aren't suggesting
 doing stuff that involves a parser restart? Are you just arguing
 against UTF-8 as the last resort?

 I'm suggesting that we do the same thing for XHR loading as we do for
 iframe loading, with the exception of not ever restarting the parser.
 The goals are:

 * Parse as much of the HTML on the web as we can.
 * Don't ever restart a network operation, as that significantly
 complicates the progress reporting and can have bad side
 effects since XHR allows arbitrary headers and HTTP methods.

So you suggest scanning the first 1024 bytes heuristically and suggest
varying the last resort encoding.

Would you decode responseText using the same encoding that's used for
responseXML? If yes, that would mean changing the way responseText
decodes in Gecko when there's no declaration.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/



Re: [XHR2] Avoiding charset dependencies on user settings

2011-09-29 Thread Jonas Sicking
On Thu, Sep 29, 2011 at 12:03 AM, Henri Sivonen hsivo...@iki.fi wrote:
 On Thu, Sep 29, 2011 at 3:30 AM, Jonas Sicking jo...@sicking.cc wrote:
 Do we have any guesses or data as to what percentage of existing pages
 would parse correctly with the above suggestion?

 I don't have guesses or data, because I think the question is irrelevant.

 When XHR is used for retrieving responseXML for legacy text/html, I'm
 not expecting legacy data that doesn't have encoding declarations to be
 UTF-8 encoded. I want to use UTF-8 for consistency with legacy
 responseText and for well-defined behavior. (In the HTML parsing
 algorithm at least, we value well-defined behavior over guessing the
 author's intent correctly.) When people add responseXML usage for
 text/html, I expect them to add encoding declarations (if they are
 missing) when they add XHR code that uses responseXML for text/html.

 We assume for security purposes that an origin is under the control of
 one authority--i.e. that authority can change stuff within the origin.
 I'm suggesting that when XHR is used to retrieve text/html data from
 the same origin, if the text/html data doesn't already have its
 encoding declared, the person exercising the origin's authority to add
 XHR should take care of exercising the origin's authority to modify
 the text/html resources to add encoding declarations.

 XHR can't be used for retrieving different-origin legacy data without
 the other origin opting in using CORS. I posit that it's less onerous
 for the other origin to declare its encoding than to add CORS support.
 Since the other origin needs to participate anyway, I think it's
 reasonable to require declaring the encoding to be part of the
 participation.

While I agree that it's generally theoretically possible for a site
administrator to change anything about the site, in reality it's often
quite hard to do. We hear time and again how simply adding
headers to resources in a directory is a complex task, for example in
situations where a website is hosted by a third party.

Adding a charset-indicating header is probably generally easier to do
as it can be done by simply reconfiguring the server. However, I'm not
sure that it's safe to do so in all instances. Adding a
charset-indicating header requires knowing what the charset is for all
documents. If you have a large body of documents served without a
charset-indicating header today, you take advantage of the automatic
detection in browsers. If you add a charset-indicating header, that
will stop happening and so you risk breaking all documents which
aren't using that encoding.

So consider, for example, a website which has traditionally been GB2312
for years but has recently started transitioning to UTF-8. If such a
website were to add a header which indicates that all documents are
encoded in GB2312, then all of a sudden all the UTF-8 documents break.

To do this properly, the website would have to analyze all documents
and either keep a separate database which indicates which documents
have which encoding, or automatically rewrite the documents such that
they all have in-document metas which indicate the correct charset.
The former seems technically very hard to do; the latter seems very
risky since it requires parsing and rewriting HTML.

 Finally, XHR allows the programmer using XHR to override the MIME
 type, including the charset parameter, so if the person adding new XHR
 code can't change the encoding declarations on legacy data, (s)he can
 override the UTF-8 last resort from JS (and a given repository of
 legacy data pretty often has a self-consistent encoding that the XHR
 programmer can discover ahead of time). I think requiring the person
 adding XHR code to write that line is much better than adding more
 locale and/or user setting-dependent behavior to the Web platform.

This is certainly a good point, and is likely generally the easiest
solution for someone rolling out an AJAX version of a new website
rather than requiring webserver configuration changes. However, it
still doesn't solve the case where a website uses different encodings
for different documents, as described above.

 What outcome do you suggest and why? It seems you aren't suggesting
 doing stuff that involves a parser restart? Are you just arguing
 against UTF-8 as the last resort?

 I'm suggesting that we do the same thing for XHR loading as we do for
 iframe loading, with the exception of not ever restarting the parser.
 The goals are:

 * Parse as much of the HTML on the web as we can.
 * Don't ever restart a network operation, as that significantly
 complicates the progress reporting and can have bad side
 effects since XHR allows arbitrary headers and HTTP methods.

 So you suggest scanning the first 1024 bytes heuristically and suggest
 varying the last resort encoding.

 Would you decode responseText using the same encoding that's used for
 responseXML? If yes, that would mean changing the way responseText
 decodes in Gecko when there's no declaration.

Re: [XHR2] Avoiding charset dependencies on user settings

2011-09-28 Thread Anne van Kesteren

On Wed, 28 Sep 2011 03:16:46 +0200, Jonas Sicking jo...@sicking.cc wrote:

 So it sounds like your argument is that we should do meta prescan
 because we can do it without breaking any new ground. Not because it's
 better or was inherently safer before webkit tried it out.


It does seem better to decode resources in the manner they are encoded.



 I'd much rather first debate what behavior we want and if we can try
 if that is safe.

 And we always have the option of only doing HTML parsing when
 .responseType is set to document. That is unlikely to break a lot of
 content. And it saves users resources as it uses less memory.


I think it should have the same behavior as XML. No reason to make it  
harder for HTML.



--
Anne van Kesteren
http://annevankesteren.nl/



Re: [XHR2] Avoiding charset dependencies on user settings

2011-09-28 Thread Henri Sivonen
On Wed, Sep 28, 2011 at 4:16 AM, Jonas Sicking jo...@sicking.cc wrote:
 So it sounds like your argument is that we should do meta prescan
 because we can do it without breaking any new ground. Not because it's
 better or was inherently safer before webkit tried it out.

The outcome I am suggesting is that character encoding determination
for text/html in XHR should be:
 1) HTTP charset
 2) BOM
 3) meta prescan
 4) UTF-8
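
In rough script terms, an illustrative sketch only (not any browser's
implementation; the real meta prescan is the algorithm defined in HTML,
reduced to a crude regexp here):

  // Decide the decoder for a text/html XHR response body.
  // contentType is the Content-Type header value; bytes is a Uint8Array
  // holding (at least the start of) the response body.
  function pickEncoding(contentType, bytes) {
    // 1) charset parameter on the HTTP Content-Type.
    var m = /;\s*charset=["']?([A-Za-z0-9_-]+)/i.exec(contentType || "");
    if (m) return m[1];
    // 2) Byte order mark.
    if (bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) return "UTF-8";
    if (bytes[0] === 0xFE && bytes[1] === 0xFF) return "UTF-16BE";
    if (bytes[0] === 0xFF && bytes[1] === 0xFE) return "UTF-16LE";
    // 3) Prescan of the first 1024 bytes for a meta charset declaration.
    var head = String.fromCharCode.apply(null, bytes.subarray(0, 1024));
    m = /<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_-]+)/i.exec(head);
    if (m) return m[1];
    // 4) Last resort: UTF-8, never a locale- or user-setting-dependent guess.
    return "UTF-8";
  }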

My rationale is:
 * Restarting the parser sucks. Full heuristic detection and
non-prescan meta require restarting.
 * Supporting HTTP charset, BOM and meta prescan means supporting
all the cases where the author is declaring the encoding in a
conforming way.
 * Supporting meta prescan even for responseText is safe to the
extent content is not already broken in WebKit.
 * Not doing even heuristic detection on the first 1024 bytes allows
us to avoid one of the unpredictability and
non-interoperability-inducing legacy flaws that encumber HTML when
loading it into a browsing context.
 * Using a clamped last resort encoding instead of a user setting or
locale-dependent encoding allows us to avoid one of the
unpredictability and non-interoperability-inducing legacy flaws that
encumber HTML when loading it into a browsing context.
 * Using UTF-8 as opposed to Windows-1252 or a user setting or
locale-dependent encoding as the last resort encoding allows the same
encoding to be used in the responseXML and responseText cases without
breaking existing responseText usage that expects UTF-8 (UTF-8 is the
responseText default in Gecko).

What outcome do you suggest and why? It seems you aren't suggesting
doing stuff that involves a parser restart? Are you just arguing
against UTF-8 as the last resort?

 And in any case, it's easy to figure out where the
 data was loaded from after the fact, so debugging doesn't seem any
 harder.

If that counts as not harder, I concede this point.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/



Re: [XHR2] Avoiding charset dependencies on user settings

2011-09-28 Thread Jonas Sicking
On Tue, Sep 27, 2011 at 11:10 PM, Anne van Kesteren ann...@opera.com wrote:
 On Wed, 28 Sep 2011 03:16:46 +0200, Jonas Sicking jo...@sicking.cc wrote:

 So it sounds like your argument is that we should do meta prescan
 because we can do it without breaking any new ground. Not because it's
 better or was inherently safer before webkit tried it out.

 It does seem better to decode resources in the manner they are encoded.

I'm not sure I understand what you're saying here. If you're simply
saying that ideally we should always decode using the correct decoder,
then I agree.

 I'd much rather first debate what behavior we want and if we can try
 if that is safe.

 And we always have the option of only doing HTML parsing when
 .responseType is set to document. That is unlikely to break a lot of
 content. And it saves users resources as it uses less memory.

 I think it should have the same behavior as XML. No reason to make it harder
 for HTML.

"Same as XML" is a matter of definition, though. We're doing all of the
following for XML:

* Using the same charset selection for XHR loading as for iframe loading.
* If we don't find any explicit charset in the HTTP headers or in the
document body, we use UTF-8
* If we don't find any explicit charset in the HTTP header, we look
for an XML PI which contains a charset

It so happens that in XML all three of these are equivalent. For HTML
that is not the case. So which are you suggesting we do (I'm assuming
not the last one :) )?

/ Jonas



Re: [XHR2] Avoiding charset dependencies on user settings

2011-09-28 Thread Jonas Sicking
On Wed, Sep 28, 2011 at 2:54 AM, Henri Sivonen hsivo...@iki.fi wrote:
 On Wed, Sep 28, 2011 at 4:16 AM, Jonas Sicking jo...@sicking.cc wrote:
 So it sounds like your argument is that we should do meta prescan
 because we can do it without breaking any new ground. Not because it's
 better or was inherently safer before webkit tried it out.

 The outcome I am suggesting is that character encoding determination
 for text/html in XHR should be:
  1) HTTP charset
  2) BOM
  3) meta prescan
  4) UTF-8

 My rationale is:
  * Restarting the parser sucks. Full heuristic detection and
 non-prescan meta require restarting.
  * Supporting HTTP charset, BOM and meta prescan means supporting
 all the cases where the author is declaring the encoding in a
 conforming way.
  * Supporting meta prescan even for responseText is safe to the
 extent content is not already broken in WebKit.
  * Not doing even heuristic detection on the first 1024 bytes allows
 us to avoid one of the unpredictability and
 non-interoperability-inducing legacy flaws that encumber HTML when
 loading it into a browsing context.
  * Using a clamped last resort encoding instead of a user setting or
 locale-dependent encoding allows us to avoid one of the
 unpredictability and non-interoperability-inducing legacy flaws that
 encumber HTML when loading it into a browsing context.
  * Using UTF-8 as opposed to Windows-1252 or a user setting or
 locale-dependent encoding as the last resort encoding allows the same
 encoding to be used in the responseXML and responseText cases without
 breaking existing responseText usage that expects UTF-8 (UTF-8 is the
 responseText default in Gecko).

Do we have any guesses or data as to what percentage of existing pages
would parse correctly with the above suggestion? If we only have
guesses, what are those guesses based on?

My concern is leaving large chunks of the web decoded incorrectly with
the above algorithm. My perception was that a very large number of
pages don't declare a charset in any of locations 1-3 proposed above, and
yet aren't encoded in UTF-8.

This article is over a year old, but we still had less
than 50% of the web in UTF-8 at that point.

http://googland.blogspot.com/2010/01/g-unicode-nearing-50-of-web.html

 What outcome do you suggest and why? It seems you aren't suggesting
 doing stuff that involves a parser restart? Are you just arguing
 against UTF-8 as the last resort?

I'm suggesting that we do the same thing for XHR loading as we do for
iframe loading, with the exception of not ever restarting the parser.
The goals are:

* Parse as much of the HTML on the web as we can.
* Don't ever restart a network operation, as that significantly
complicates the progress reporting and can have bad side
effects since XHR allows arbitrary headers and HTTP methods.

/ Jonas



Re: [XHR2] Avoiding charset dependencies on user settings

2011-09-27 Thread Jonas Sicking
On Mon, Sep 26, 2011 at 7:50 AM, Henri Sivonen hsivo...@iki.fi wrote:
 On Mon, Sep 26, 2011 at 12:46 PM, Jonas Sicking jo...@sicking.cc wrote:
 On Fri, Sep 23, 2011 at 1:26 AM, Henri Sivonen hsivo...@iki.fi wrote:
 On Thu, Sep 22, 2011 at 9:54 PM, Jonas Sicking jo...@sicking.cc wrote:
 I agree that there are no legacy requirements on XHR here, however I
 don't think that that is the only thing that we should look at. We
 should also look at what makes the feature the most useful. An extreme
 counter-example would be that we could let XHR refuse to parse any
 HTML page that didn't pass a validator. While this wouldn't break any
 existing content, it would make HTML-in-XHR significantly less useful.

 Applying all the legacy text/html craziness to XHR could break current
 use of XHR to retrieve responseText of text/html resources (assuming
 that we want responseText for text/html to work like responseText for XML
 in the sense that the same character encoding is used for responseText
 and responseXML).

 This doesn't seem to only be a problem when using crazy parts of
 text/html charset detection. Simply looking for meta charset in the
 first 1024 characters will change behavior and could cause page
 breakage.

 Or am I missing something?

 Yes: WebKit already performs the meta prescan for text/html when
 retrieving responseText via XHR even though it doesn't support full
 HTML parsing in XHR (so responseXML is still null).
 http://hsivonen.iki.fi/test/moz/xhr/charset-xhr.html

 Thus, apps broken by the meta prescan would already be broken in
 WebKit (unless, of course, they browser sniff in a very strange way).

 And apps that wouldn't be OK with using UTF-8 as the fallback encoding
 when there's no HTTP-level charset, no BOM and no meta in the first
 1024 bytes would already be broken in Gecko.

So it sounds like your argument is that we should do meta prescan
because we can do it without breaking any new ground. Not because it's
better or was inherently safer before WebKit tried it out.

I'd much rather first debate what behavior we want, and then see whether
we can try that out to check that it is safe.

And we always have the option of only doing HTML parsing when
.responseType is set to document. That is unlikely to break a lot of
content. And it saves users resources as it uses less memory.

 Applying all the legacy text/html craziness to XHR would make data
 loading in programs fail in subtle and hard-to-debug ways depending on
 the browser localization and user settings. At least when loading into
 a browsing context, there's visual feedback of character misdecoding
 and the feedback can be attributed back to a given file. If
 setting-dependent misdecoding happens in the XHR data loading
 machinery of an app, it's much harder to figure out what part of the
 system the problem should be attributed to.

 Could you provide more detail here? How are you imagining this data
 being used such that it's not being displayed to the user?

 I.e. can you describe an application that would break in a non-visual
 way and where it would be harder to detect where the data originated
 from, compared to for example iframe usage?

 If a piece of text came from XHR and got injected into a visible DOM,
 it's not immediately obvious which HTTP response it came from.

But what type of web app would that be? Consider for example a webmail
client. While it might originally show emails in a collapsed state in
a mail-thread view, the data is likely still going to be shown
eventually when the user expands the individual messages. Also, if the
user doesn't expand to see the data, does it really matter that it was
wrongly decoded? And in any case, it's easy to figure out where the
data was loaded from after the fact, so debugging doesn't seem any
harder.

So can you provide a counter example of an app where this wouldn't be the case?

/ Jonas



Re: [XHR2] Avoiding charset dependencies on user settings

2011-09-26 Thread Jonas Sicking
On Fri, Sep 23, 2011 at 1:26 AM, Henri Sivonen hsivo...@iki.fi wrote:
 On Thu, Sep 22, 2011 at 9:54 PM, Jonas Sicking jo...@sicking.cc wrote:
 I agree that there are no legacy requirements on XHR here, however I
 don't think that that is the only thing that we should look at. We
 should also look at what makes the feature the most useful. An extreme
 counter-example would be that we could let XHR refuse to parse any
 HTML page that didn't pass a validator. While this wouldn't break any
 existing content, it would make HTML-in-XHR significantly less useful.

 Applying all the legacy text/html craziness to XHR could break current
 use of XHR to retrieve responseText of text/html resources (assuming
 that we want responseText for text/html to work like responseText for XML
 in the sense that the same character encoding is used for responseText
 and responseXML).

This doesn't seem to be a problem only with the crazy parts of
text/html charset detection. Simply looking for a meta charset in the
first 1024 characters will change behavior and could cause page
breakage.

Or am I missing something?

In fact, the more likely scenario seems to me to be that we would now
get the correct charset for many XHR loads and thus fix more pages
than we break.

 Applying all the legacy text/html craziness to XHR would make data
 loading in programs fail in subtle and hard-to-debug ways depending on
 the browser localization and user settings. At least when loading into
 a browsing context, there's visual feedback of character misdecoding
 and the feedback can be attributed back to a given file. If
 setting-dependent misdecoding happens in the XHR data loading
 machinery of an app, it's much harder to figure out what part of the
 system the problem should be attributed to.

Could you provide more detail here? How are you imagining this data
being used such that it's not being displayed to the user?

I.e. can you describe an application that would break in a non-visual
way and where it would be harder to detect where the data originated
from, compared to for example iframe usage?

/ Jonas



Re: [XHR2] Avoiding charset dependencies on user settings

2011-09-26 Thread Jonas Sicking
On Fri, Sep 23, 2011 at 4:46 AM, Henri Sivonen hsivo...@iki.fi wrote:
 On Fri, Sep 23, 2011 at 11:26 AM, Henri Sivonen hsivo...@iki.fi wrote:
 Applying all the legacy text/html craziness

 Furthermore, applying full legacy text/html craziness involves parser
 restarts for GET requests. With a browsing context, that means
 renavigation, but I really don't want to support parser restarts in
 XHR.

Yeah, I don't see that there's a sane way to replicate this part of
HTML parsing.

/ Jonas



Re: [XHR2] Avoiding charset dependencies on user settings

2011-09-26 Thread Henri Sivonen
On Mon, Sep 26, 2011 at 12:46 PM, Jonas Sicking jo...@sicking.cc wrote:
 On Fri, Sep 23, 2011 at 1:26 AM, Henri Sivonen hsivo...@iki.fi wrote:
 On Thu, Sep 22, 2011 at 9:54 PM, Jonas Sicking jo...@sicking.cc wrote:
 I agree that there are no legacy requirements on XHR here, however I
 don't think that that is the only thing that we should look at. We
 should also look at what makes the feature the most useful. An extreme
 counter-example would be that we could let XHR refuse to parse any
 HTML page that didn't pass a validator. While this wouldn't break any
 existing content, it would make HTML-in-XHR significantly less useful.

 Applying all the legacy text/html craziness to XHR could break current
 use of XHR to retrieve responseText of text/html resources (assuming
 that we want responseText for text/html to work like responseText for XML
 in the sense that the same character encoding is used for responseText
 and responseXML).

 This doesn't seem to only be a problem when using crazy parts of
 text/html charset detection. Simply looking for meta charset in the
 first 1024 characters will change behavior and could cause page
 breakage.

 Or am I missing something?

Yes: WebKit already performs the meta prescan for text/html when
retrieving responseText via XHR even though it doesn't support full
HTML parsing in XHR (so responseXML is still null).
http://hsivonen.iki.fi/test/moz/xhr/charset-xhr.html

Thus, apps broken by the meta prescan would already be broken in
WebKit (unless, of course, they browser sniff in a very strange way).

And apps that wouldn't be OK with using UTF-8 as the fallback encoding
when there's no HTTP-level charset, no BOM and no meta in the first
1024 bytes would already be broken in Gecko.

 Applying all the legacy text/html craziness to XHR would make data
 loading in programs fail in subtle and hard-to-debug ways depending on
 the browser localization and user settings. At least when loading into
 a browsing context, there's visual feedback of character misdecoding
 and the feedback can be attributed back to a given file. If
 setting-dependent misdecoding happens in the XHR data loading
 machinery of an app, it's much harder to figure out what part of the
 system the problem should be attributed to.

 Could you provide more detail here? How are you imagining this data
 being used such that it's not being displayed to the user?

 I.e. can you describe an application that would break in a non-visual
 way and where it would be harder to detect where the data originated
 from, compared to for example iframe usage?

If a piece of text came from XHR and got injected into a visible DOM,
it's not immediately obvious which HTTP response it came from.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/



Re: [XHR2] Avoiding charset dependencies on user settings

2011-09-23 Thread Henri Sivonen
On Thu, Sep 22, 2011 at 9:54 PM, Jonas Sicking jo...@sicking.cc wrote:
 I agree that there are no legacy requirements on XHR here, however I
 don't think that that is the only thing that we should look at. We
 should also look at what makes the feature the most useful. An extreme
 counter-example would be that we could let XHR refuse to parse any
 HTML page that didn't pass a validator. While this wouldn't break any
 existing content, it would make HTML-in-XHR significantly less useful.

Applying all the legacy text/html craziness to XHR could break current
use of XHR to retrieve responseText of text/html resources (assuming
that we want responseText for text/html to work like responseText for XML
in the sense that the same character encoding is used for responseText
and responseXML).

Applying all the legacy text/html craziness to XHR would make data
loading in programs fail in subtle and hard-to-debug ways depending on
the browser localization and user settings. At least when loading into
a browsing context, there's visual feedback of character misdecoding
and the feedback can be attributed back to a given file. If
setting-dependent misdecoding happens in the XHR data loading
machinery of an app, it's much harder to figure out what part of the
system the problem should be attributed to.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/



Re: [XHR2] Avoiding charset dependencies on user settings

2011-09-23 Thread Henri Sivonen
On Fri, Sep 23, 2011 at 11:26 AM, Henri Sivonen hsivo...@iki.fi wrote:
 Applying all the legacy text/html craziness

Furthermore, applying full legacy text/html craziness involves parser
restarts for GET requests. With a browsing context, that means
renavigation, but I really don't want to support parser restarts in
XHR.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/



Re: [XHR2] Avoiding charset dependencies on user settings

2011-09-23 Thread Boris Zbarsky

On 9/23/11 4:26 AM, Henri Sivonen wrote:

 Applying all the legacy text/html craziness to XHR could break current
 use of XHR to retrieve responseText of text/html resources (assuming
 that we want responseText for text/html to work like responseText for XML
 in the sense that the same character encoding is used for responseText
 and responseXML).


I think this is a pretty strong argument in favor of not doing the 
text/html craziness.


-Boris



[XHR2] Avoiding charset dependencies on user settings

2011-09-22 Thread Henri Sivonen
http://dev.w3.org/2006/webapi/XMLHttpRequest-2/#document-response-entity-body
says:
If final MIME type is text/html let document be Document object that
represents the response entity body parsed following the rules set
forth in the HTML specification for an HTML parser with scripting
disabled. [HTML]

Since there's presumably no legacy content using XHR to read
responseXML for text/html (and expecting HTML parsing) and since (in
Gecko at least) responseText for non-XML tries HTTP charset and falls
back on UTF-8, it seems it doesn't make sense to implement full-blown
legacy charset craziness for text/html in XHR.

Specifically, it seems that it makes sense to skip heuristic detection
and to use UTF-8 (as opposed to Windows-1252 or a locale-dependent
value) as the fallback encoding if there's neither meta nor HTTP
charset, since UTF-8 is the pre-existing fallback for responseText and
responseText is already used with text/html.

As it stands, the XHR2 spec defers to a part of HTML that has
legacy-oriented optional features. It seems that it makes sense to
clamp down those options for XHR.
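
As a minimal sketch of the usage pattern being discussed (assuming the
default responseType, where both responseText and responseXML are
available; the URL is illustrative):

  var xhr = new XMLHttpRequest();
  xhr.open("GET", "page.html");
  xhr.onload = function () {
    // With the behavior suggested above, both of these are decoded with
    // the same encoding: the declared charset if there is one, UTF-8
    // otherwise -- no heuristic sniffing, no locale dependency.
    var text = xhr.responseText;
    var doc = xhr.responseXML; // parsed per the HTML parser rules quoted above
  };
  xhr.send();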

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/



Re: [XHR2] Avoiding charset dependencies on user settings

2011-09-22 Thread Jonas Sicking
On Thu, Sep 22, 2011 at 6:33 AM, Henri Sivonen hsivo...@iki.fi wrote:
 http://dev.w3.org/2006/webapi/XMLHttpRequest-2/#document-response-entity-body
 says:
 If final MIME type is text/html let document be Document object that
 represents the response entity body parsed following the rules set
 forth in the HTML specification for an HTML parser with scripting
 disabled. [HTML]

 Since there's presumably no legacy content using XHR to read
 responseXML for text/html (and expecting HTML parsing) and since (in
 Gecko at least) responseText for non-XML tries HTTP charset and falls
 back on UTF-8, it seems it doesn't make sense to implement full-blown
 legacy charset craziness for text/html in XHR.

 Specifically, it seems that it makes sense to skip heuristic detection
 and to use UTF-8 (as opposed to Windows-1252 or a locale-dependent
 value) as the fallback encoding if there's neither meta nor HTTP
 charset, since UTF-8 is the pre-existing fallback for responseText and
 responseText is already used with text/html.

 As it stands, the XHR2 spec defers to a part of HTML that has
 legacy-oriented optional features. It seems that it makes sense to
 clamp down those options for XHR.

I agree that there are no legacy requirements on XHR here, however I
don't think that that is the only thing that we should look at. We
should also look at what makes the feature the most useful. An extreme
counter-example would be that we could let XHR refuse to parse any
HTML page that didn't pass a validator. While this wouldn't break any
existing content, it would make HTML-in-XHR significantly less useful.

It makes sense to me that XHR can load any HTML resource that you
could load through navigation.

The one argument I could see for diverging from the normal HTML
loading algorithm is if it breaks few enough pages that it doesn't
severely limit the usefulness of HTML-in-XHR (in any locale), while
still adding enough pressure on sites to start using explicit charsets
that we accomplish real change.

Unfortunately I don't know how to measure those things though.

/ Jonas