Re: [XHR2] Avoiding charset dependencies on user settings
On Thu, Sep 29, 2011 at 11:27 PM, Jonas Sicking jo...@sicking.cc wrote:
>> Finally, XHR allows the programmer using XHR to override the MIME type, including the charset parameter, so if the person adding new XHR code can't change the encoding declarations on legacy data, (s)he can override the UTF-8 last resort from JS (and a given repository of legacy data pretty often has a self-consistent encoding that the XHR programmer can discover ahead of time). I think requiring the person adding XHR code to write that line is much better than adding more locale- and/or user-setting-dependent behavior to the Web platform.
>
> This is certainly a good point, and is likely generally the easiest solution for someone rolling out an AJAX version of a website, as it avoids requiring webserver configuration changes. However, it still doesn't solve the case where a website uses different encodings for different documents, as described above.

If we want to *really* address that problem, I think the right way to address it in XHR would be to add a way for XHR to override the HTML last-resort encoding, so that authors who are dealing with a content repository partially migrated to UTF-8 can set the last resort to the legacy encoding they know they have, instead of ending up overriding the whole HTTP Content-Type for the UTF-8 content. (I'm assuming here that if someone is migrating a site from a legacy encoding to UTF-8, the UTF-8 parts declare that they are UTF-8. Authors who migrate to UTF-8 but are *still* (after realizing that legacy encodings suck and UTF-8 rocks) too clueless to *declare* that they use UTF-8 don't deserve any further help from browsers, IMO.)

> I'm particularly keen to hear how this will affect locales which do not use ASCII by default. Most of the content I personally consume is written in English or Swedish, most of which is generally legible even if decoded using the wrong encoding. I'm under the impression that that is not the case for, for example, Chinese or Hindi documents. I think it would be sad if we went with any particular solution here without consulting people from those locales.

The old way of putting Hindi content on the Web relied on intentionally misencoded downloadable fonts. From the browser's point of view, such deep legacy text is Windows-1252. Hindi content that works without misencoded fonts is UTF-8. So I think Hindi isn't relevant to this thread.

Users in CJK and Cyrillic locales are the ones most hurt by authors not declaring their encodings (well, actually, readers of CJK and Cyrillic languages whose browsers are configured for other locales are hurt *even* more), so I think it would be completely backwards for browsers to complicate new features in order to enable authors in the CJK and Cyrillic locales to deploy *new* features and *still* not declare encodings. Instead, I think we should design new features to make authors everywhere get their act together and declare their encodings. (Note that this position is much less extreme than the more enlightened position that e.g. HTML5 App Cache manifests take: requiring everyone to use UTF-8 for a new feature so that declarations aren't needed.)

--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/
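[The escape hatch Henri refers to here is XMLHttpRequest's overrideMimeType(). A minimal sketch of forcing the decoder for a repository known to use a single legacy encoding follows; the URL and the choice of GB2312 are illustrative assumptions, not anything prescribed in the thread.]

    // Force decoding of a known-GB2312 legacy repository; the URL is
    // hypothetical. overrideMimeType() must be called before send().
    var xhr = new XMLHttpRequest();
    xhr.open("GET", "/legacy/archive/page.html");
    xhr.overrideMimeType("text/html; charset=gb2312");
    xhr.onload = function () {
      // responseText is now decoded as GB2312 regardless of any
      // last-resort default or user setting.
      console.log(xhr.responseText);
    };
    xhr.send();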
Re: [XHR2] Avoiding charset dependencies on user settings
On Thu, Sep 29, 2011 at 3:30 AM, Jonas Sicking jo...@sicking.cc wrote:
> Do we have any guesses or data as to what percentage of existing pages would parse correctly with the above suggestion?

I don't have guesses or data, because I think the question is irrelevant. When XHR is used for retrieving responseXML for legacy text/html, I'm not expecting legacy data that doesn't have encoding declarations to be UTF-8 encoded. I want to use UTF-8 for consistency with legacy responseText and for well-defined behavior. (In the HTML parsing algorithm at least, we value well-defined behavior over guessing the author's intent correctly.) When people add responseXML usage for text/html, I expect them to add encoding declarations (if they are missing) when they add the XHR code that uses responseXML for text/html.

We assume for security purposes that an origin is under the control of one authority--i.e. that authority can change stuff within the origin. I'm suggesting that when XHR is used to retrieve text/html data from the same origin, if the text/html data doesn't already have its encoding declared, the person exercising the origin's authority to add XHR should take care of exercising the origin's authority to modify the text/html resources to add encoding declarations.

XHR can't be used for retrieving different-origin legacy data without the other origin opting in using CORS. I posit that it's less onerous for the other origin to declare its encoding than to add CORS support. Since the other origin needs to participate anyway, I think it's reasonable to require declaring the encoding to be part of the participation.

Finally, XHR allows the programmer using XHR to override the MIME type, including the charset parameter, so if the person adding new XHR code can't change the encoding declarations on legacy data, (s)he can override the UTF-8 last resort from JS (and a given repository of legacy data pretty often has a self-consistent encoding that the XHR programmer can discover ahead of time). I think requiring the person adding XHR code to write that line is much better than adding more locale- and/or user-setting-dependent behavior to the Web platform.

>> What outcome do you suggest and why? It seems you aren't suggesting doing stuff that involves a parser restart? Are you just arguing against UTF-8 as the last resort?
>
> I'm suggesting that we do the same thing for XHR loading as we do for iframe loading, with the exception of never restarting the parser. The goals are:
> * Parse as much of the HTML on the web as we can.
> * Don't ever restart a network operation, as that significantly complicates progress reporting and can have bad side effects, since XHR allows arbitrary headers and HTTP methods.

So you suggest scanning the first 1024 bytes heuristically and suggest varying the last-resort encoding. Would you decode responseText using the same encoding that's used for responseXML? If yes, that would mean changing the way responseText decodes in Gecko when there's no declaration.

--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/
Re: [XHR2] Avoiding charset dependencies on user settings
On Thu, Sep 29, 2011 at 12:03 AM, Henri Sivonen hsivo...@iki.fi wrote:
> On Thu, Sep 29, 2011 at 3:30 AM, Jonas Sicking jo...@sicking.cc wrote:
>> Do we have any guesses or data as to what percentage of existing pages would parse correctly with the above suggestion?
>
> I don't have guesses or data, because I think the question is irrelevant. When XHR is used for retrieving responseXML for legacy text/html, I'm not expecting legacy data that doesn't have encoding declarations to be UTF-8 encoded. I want to use UTF-8 for consistency with legacy responseText and for well-defined behavior. (In the HTML parsing algorithm at least, we value well-defined behavior over guessing the author's intent correctly.) When people add responseXML usage for text/html, I expect them to add encoding declarations (if they are missing) when they add the XHR code that uses responseXML for text/html.
>
> We assume for security purposes that an origin is under the control of one authority--i.e. that authority can change stuff within the origin. I'm suggesting that when XHR is used to retrieve text/html data from the same origin, if the text/html data doesn't already have its encoding declared, the person exercising the origin's authority to add XHR should take care of exercising the origin's authority to modify the text/html resources to add encoding declarations.
>
> XHR can't be used for retrieving different-origin legacy data without the other origin opting in using CORS. I posit that it's less onerous for the other origin to declare its encoding than to add CORS support. Since the other origin needs to participate anyway, I think it's reasonable to require declaring the encoding to be part of the participation.

While I agree that it's generally theoretically possible for a site administrator to change anything about the site, in reality it's often quite hard to do. We hear time and again how simply adding headers to resources in a directory is a complex task, for example in situations where a website is hosted by a third party.

Adding a charset-indicating header is probably generally easier, as it can be done by simply reconfiguring the server. However, I'm not sure that it's safe to do in all instances. Adding a charset-indicating header requires knowing what the charset is for all documents. If you have a large body of documents served without a charset-indicating header today, you take advantage of the automatic detection in browsers. If you add a charset-indicating header, that detection stops happening, and so you risk breaking all documents which aren't using that encoding.

So consider for example a website which has traditionally been GB2312 for years, but has recently started transitioning to UTF-8. If such a website were to add a header which indicates that all documents are encoded in GB2312, then all of a sudden all UTF-8 documents break. To do this properly, the website would have to analyze all documents and either keep a separate database which indicates which documents have which encoding, or automatically rewrite the documents such that they all have in-document <meta> declarations which indicate the correct charset. The former seems technically very hard to do; the latter seems very risky since it requires parsing HTML and rewriting HTML.
> Finally, XHR allows the programmer using XHR to override the MIME type, including the charset parameter, so if the person adding new XHR code can't change the encoding declarations on legacy data, (s)he can override the UTF-8 last resort from JS (and a given repository of legacy data pretty often has a self-consistent encoding that the XHR programmer can discover ahead of time). I think requiring the person adding XHR code to write that line is much better than adding more locale- and/or user-setting-dependent behavior to the Web platform.

This is certainly a good point, and is likely generally the easiest solution for someone rolling out an AJAX version of a website, as it avoids requiring webserver configuration changes. However, it still doesn't solve the case where a website uses different encodings for different documents, as described above.

>>> What outcome do you suggest and why? It seems you aren't suggesting doing stuff that involves a parser restart? Are you just arguing against UTF-8 as the last resort?
>>
>> I'm suggesting that we do the same thing for XHR loading as we do for iframe loading, with the exception of never restarting the parser. The goals are:
>> * Parse as much of the HTML on the web as we can.
>> * Don't ever restart a network operation, as that significantly complicates progress reporting and can have bad side effects, since XHR allows arbitrary headers and HTTP methods.
>
> So you suggest scanning the first 1024 bytes heuristically and suggest varying the last-resort encoding. Would you decode responseText using the same encoding that's used for responseXML? If yes, that would mean changing the way responseText decodes in Gecko when there's no declaration.
Re: [XHR2] Avoiding charset dependencies on user settings
On Wed, 28 Sep 2011 03:16:46 +0200, Jonas Sicking jo...@sicking.cc wrote:
> So it sounds like your argument is that we should do meta prescan because we can do it without breaking any new ground. Not because it's better or was inherently safer before WebKit tried it out.

It does seem better to decode resources in the manner they are encoded.

> I'd much rather first debate what behavior we want, and then see if we can test whether that is safe. And we always have the option of only doing HTML parsing when .responseType is set to "document". That is unlikely to break a lot of content. And it saves users resources as it uses less memory.

I think it should have the same behavior as XML. No reason to make it harder for HTML.

--
Anne van Kesteren
http://annevankesteren.nl/
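[The opt-in Jonas mentions can be made concrete. A minimal sketch of gating HTML parsing on the responseType = "document" switch; the URL is a hypothetical placeholder.]

    // HTML parsing happens only when the caller opts in explicitly.
    var xhr = new XMLHttpRequest();
    xhr.open("GET", "/fragment.html"); // hypothetical URL
    xhr.responseType = "document";     // opt in to a parsed Document
    xhr.onload = function () {
      var doc = xhr.response;          // a Document parsed as HTML
      console.log(doc.title);
    };
    xhr.send();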
Re: [XHR2] Avoiding charset dependencies on user settings
On Wed, Sep 28, 2011 at 4:16 AM, Jonas Sicking jo...@sicking.cc wrote:
> So it sounds like your argument is that we should do meta prescan because we can do it without breaking any new ground. Not because it's better or was inherently safer before WebKit tried it out.

The outcome I am suggesting is that character encoding determination for text/html in XHR should be:
1) HTTP charset
2) BOM
3) meta prescan
4) UTF-8

My rationale is:
* Restarting the parser sucks. Full heuristic detection and non-prescan meta require restarting.
* Supporting HTTP charset, BOM and meta prescan means supporting all the cases where the author is declaring the encoding in a conforming way.
* Supporting meta prescan even for responseText is safe to the extent content is not already broken in WebKit.
* Not doing even heuristic detection on the first 1024 bytes allows us to avoid one of the unpredictability- and non-interoperability-inducing legacy flaws that encumber HTML when loading it into a browsing context.
* Using a clamped last-resort encoding instead of a user-setting- or locale-dependent encoding allows us to avoid another of those legacy flaws.
* Using UTF-8 (as opposed to Windows-1252 or a user-setting- or locale-dependent encoding) as the last-resort encoding allows the same encoding to be used in the responseXML and responseText cases without breaking existing responseText usage that expects UTF-8 (UTF-8 is the responseText default in Gecko).

What outcome do you suggest and why? It seems you aren't suggesting doing stuff that involves a parser restart? Are you just arguing against UTF-8 as the last resort?

> And in any case, it's easy to figure out where the data was loaded from after the fact, so debugging doesn't seem any harder.

If that counts as "not harder", I concede this point.

--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/
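[To make the proposed priority order easier to follow, an illustrative sketch follows. determineEncoding and naiveMetaPrescan are made-up names, and a real meta prescan follows the HTML spec's prescan algorithm rather than this simplified regex.]

    // Sketch of the order proposed above: HTTP charset, then BOM, then
    // meta prescan of the first 1024 bytes, then a clamped UTF-8 last
    // resort with no heuristic detection. "bytes" is an array of byte
    // values (0-255).
    function naiveMetaPrescan(bytes) {
      var ascii = "";
      for (var i = 0; i < Math.min(bytes.length, 1024); i++) {
        ascii += String.fromCharCode(bytes[i]);
      }
      var m = /<meta[^>]*charset\s*=\s*["']?\s*([A-Za-z0-9._-]+)/i.exec(ascii);
      return m ? m[1] : null;
    }

    function determineEncoding(httpCharset, bytes) {
      if (httpCharset) return httpCharset;              // 1) HTTP charset
      if (bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF)
        return "UTF-8";                                 // 2) BOM
      if (bytes[0] === 0xFE && bytes[1] === 0xFF) return "UTF-16BE";
      if (bytes[0] === 0xFF && bytes[1] === 0xFE) return "UTF-16LE";
      var fromMeta = naiveMetaPrescan(bytes);
      if (fromMeta) return fromMeta;                    // 3) meta prescan
      return "UTF-8";                                   // 4) last resort
    }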
Re: [XHR2] Avoiding charset dependencies on user settings
On Tue, Sep 27, 2011 at 11:10 PM, Anne van Kesteren ann...@opera.com wrote:
> On Wed, 28 Sep 2011 03:16:46 +0200, Jonas Sicking jo...@sicking.cc wrote:
>> So it sounds like your argument is that we should do meta prescan because we can do it without breaking any new ground. Not because it's better or was inherently safer before WebKit tried it out.
>
> It does seem better to decode resources in the manner they are encoded.

I'm not sure I understand what you're saying here. If you're simply saying that ideally we should always decode using the correct decoder, then I agree.

>> I'd much rather first debate what behavior we want, and then see if we can test whether that is safe. And we always have the option of only doing HTML parsing when .responseType is set to "document". That is unlikely to break a lot of content. And it saves users resources as it uses less memory.
>
> I think it should have the same behavior as XML. No reason to make it harder for HTML.

"Same as XML" is a matter of definition, though. We're doing all of the following for XML:
* Using the same charset selection for XHR loading as for iframe loading.
* If we don't find any explicit charset in the HTTP headers or in the document body, we use UTF-8.
* If we don't find any explicit charset in the HTTP header, we look for an XML declaration which contains a charset.

It so happens that in XML all three of these are equivalent. For HTML that is not the case. So which are you suggesting we do (I'm assuming not the last one :) )?

/ Jonas
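[For reference, the last behavior Jonas lists is driven by the encoding pseudo-attribute of the XML declaration. A minimal, illustrative sketch of extracting it follows; real parsers operate on bytes and handle the UTF-16 families, and the function name here is made up.]

    // Extract the encoding pseudo-attribute from an XML declaration,
    // e.g. <?xml version="1.0" encoding="ISO-8859-1"?>. Assumes an
    // ASCII-compatible prefix already decoded to a string.
    function xmlDeclaredEncoding(text) {
      var m = /^<\?xml\s[^>]*encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/.exec(text);
      return m ? m[1] : null;
    }

    xmlDeclaredEncoding('<?xml version="1.0" encoding="ISO-8859-1"?><root/>');
    // returns "ISO-8859-1"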
Re: [XHR2] Avoiding charset dependencies on user settings
On Wed, Sep 28, 2011 at 2:54 AM, Henri Sivonen hsivo...@iki.fi wrote:
>> So it sounds like your argument is that we should do meta prescan because we can do it without breaking any new ground. Not because it's better or was inherently safer before WebKit tried it out.
>
> The outcome I am suggesting is that character encoding determination for text/html in XHR should be:
> 1) HTTP charset
> 2) BOM
> 3) meta prescan
> 4) UTF-8
>
> My rationale is:
> * Restarting the parser sucks. Full heuristic detection and non-prescan meta require restarting.
> * Supporting HTTP charset, BOM and meta prescan means supporting all the cases where the author is declaring the encoding in a conforming way.
> * Supporting meta prescan even for responseText is safe to the extent content is not already broken in WebKit.
> * Not doing even heuristic detection on the first 1024 bytes allows us to avoid one of the unpredictability- and non-interoperability-inducing legacy flaws that encumber HTML when loading it into a browsing context.
> * Using a clamped last-resort encoding instead of a user-setting- or locale-dependent encoding allows us to avoid another of those legacy flaws.
> * Using UTF-8 (as opposed to Windows-1252 or a user-setting- or locale-dependent encoding) as the last-resort encoding allows the same encoding to be used in the responseXML and responseText cases without breaking existing responseText usage that expects UTF-8 (UTF-8 is the responseText default in Gecko).

Do we have any guesses or data as to what percentage of existing pages would parse correctly with the above suggestion? If we only have guesses, what are those guesses based on?

My concern is leaving large chunks of the web decoded incorrectly with the above algorithm. My perception was that a very large number of pages don't declare a charset in locations 1)-3) proposed above, and yet aren't encoded in UTF-8. This article is over a year old at this point, but it shows that less than 50% of the web was UTF-8 even then:
http://googland.blogspot.com/2010/01/g-unicode-nearing-50-of-web.html

> What outcome do you suggest and why? It seems you aren't suggesting doing stuff that involves a parser restart? Are you just arguing against UTF-8 as the last resort?

I'm suggesting that we do the same thing for XHR loading as we do for iframe loading, with the exception of never restarting the parser. The goals are:
* Parse as much of the HTML on the web as we can.
* Don't ever restart a network operation, as that significantly complicates progress reporting and can have bad side effects, since XHR allows arbitrary headers and HTTP methods.

/ Jonas
Re: [XHR2] Avoiding charset dependencies on user settings
On Mon, Sep 26, 2011 at 7:50 AM, Henri Sivonen hsivo...@iki.fi wrote:
> On Mon, Sep 26, 2011 at 12:46 PM, Jonas Sicking jo...@sicking.cc wrote:
>> On Fri, Sep 23, 2011 at 1:26 AM, Henri Sivonen hsivo...@iki.fi wrote:
>>> On Thu, Sep 22, 2011 at 9:54 PM, Jonas Sicking jo...@sicking.cc wrote:
>>>> I agree that there are no legacy requirements on XHR here; however, I don't think that that is the only thing that we should look at. We should also look at what makes the feature the most useful. An extreme counter-example would be that we could let XHR refuse to parse any HTML page that didn't pass a validator. While this wouldn't break any existing content, it would make HTML-in-XHR significantly less useful.
>>>
>>> Applying all the legacy text/html craziness to XHR could break current use of XHR to retrieve responseText of text/html resources (assuming that we want responseText for text/html to work like responseText for XML in the sense that the same character encoding is used for responseText and responseXML).
>>
>> This doesn't seem to only be a problem when using crazy parts of text/html charset detection. Simply looking for a meta charset in the first 1024 bytes will change behavior and could cause page breakage. Or am I missing something?
>
> Yes: WebKit already performs the meta prescan for text/html when retrieving responseText via XHR even though it doesn't support full HTML parsing in XHR (so responseXML is still null).
> http://hsivonen.iki.fi/test/moz/xhr/charset-xhr.html
> Thus, apps broken by the meta prescan would already be broken in WebKit (unless, of course, they browser sniff in a very strange way). And apps that wouldn't be OK with using UTF-8 as the fallback encoding when there's no HTTP-level charset, no BOM and no meta in the first 1024 bytes would already be broken in Gecko.

So it sounds like your argument is that we should do meta prescan because we can do it without breaking any new ground. Not because it's better or was inherently safer before WebKit tried it out.

I'd much rather first debate what behavior we want, and then see if we can test whether that is safe. And we always have the option of only doing HTML parsing when .responseType is set to "document". That is unlikely to break a lot of content. And it saves users resources as it uses less memory.

>>> Applying all the legacy text/html craziness to XHR would make data loading in programs fail in subtle and hard-to-debug ways depending on the browser localization and user settings. At least when loading into a browsing context, there's visual feedback of character misdecoding, and the feedback can be attributed back to a given file. If setting-dependent misdecoding happens in the XHR data loading machinery of an app, it's much harder to figure out what part of the system the problem should be attributed to.
>>
>> Could you provide more detail here? How are you imagining this data being used such that it's not being displayed to the user? I.e. can you describe an application that would break in a non-visual way and where it would be harder to detect where the data originated from compared to, for example, iframe usage?
>
> If a piece of text came from XHR and got injected into a visible DOM, it's not immediately obvious which HTTP response it came from.

But what type of web app would that be? Consider for example a webmail client. While it might originally show emails in a collapsed state in a mail-thread view, the data is likely still going to be shown eventually when the user expands the individual messages.
Also, if the user doesn't expand to see the data, does it really matter that it was wrongly decoded? And in any case, it's easy to figure out where the data was loaded from after the fact, so debugging doesn't seem any harder.

So can you provide a counter-example of an app where this wouldn't be the case?

/ Jonas
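[The WebKit behavior discussed above (meta prescan applied even to responseText) can be probed with a test along the lines of Henri's charset-xhr page. A hedged sketch follows, with a hypothetical test URL serving Windows-1252 bytes whose only encoding declaration is a meta in the first 1024 bytes.]

    // If the meta prescan ran, byte 0x80 in the body decodes as U+20AC
    // (the euro sign) per Windows-1252; under a UTF-8 last resort, a
    // lone 0x80 would decode as U+FFFD instead.
    var xhr = new XMLHttpRequest();
    xhr.open("GET", "/tests/meta-windows-1252.html"); // hypothetical URL
    xhr.onload = function () {
      var honored = xhr.responseText.indexOf("\u20AC") !== -1;
      console.log("meta prescan honored for responseText: " + honored);
    };
    xhr.send();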
Re: [XHR2] Avoiding charset dependencies on user settings
On Fri, Sep 23, 2011 at 1:26 AM, Henri Sivonen hsivo...@iki.fi wrote:
> On Thu, Sep 22, 2011 at 9:54 PM, Jonas Sicking jo...@sicking.cc wrote:
>> I agree that there are no legacy requirements on XHR here; however, I don't think that that is the only thing that we should look at. We should also look at what makes the feature the most useful. An extreme counter-example would be that we could let XHR refuse to parse any HTML page that didn't pass a validator. While this wouldn't break any existing content, it would make HTML-in-XHR significantly less useful.
>
> Applying all the legacy text/html craziness to XHR could break current use of XHR to retrieve responseText of text/html resources (assuming that we want responseText for text/html to work like responseText for XML in the sense that the same character encoding is used for responseText and responseXML).

This doesn't seem to only be a problem when using crazy parts of text/html charset detection. Simply looking for a meta charset in the first 1024 bytes will change behavior and could cause page breakage. Or am I missing something?

In fact, it seems to me a more likely scenario that we would now get the correct charset for many XHR loads and thus fix more pages than it breaks.

> Applying all the legacy text/html craziness to XHR would make data loading in programs fail in subtle and hard-to-debug ways depending on the browser localization and user settings. At least when loading into a browsing context, there's visual feedback of character misdecoding, and the feedback can be attributed back to a given file. If setting-dependent misdecoding happens in the XHR data loading machinery of an app, it's much harder to figure out what part of the system the problem should be attributed to.

Could you provide more detail here? How are you imagining this data being used such that it's not being displayed to the user? I.e. can you describe an application that would break in a non-visual way and where it would be harder to detect where the data originated from compared to, for example, iframe usage?

/ Jonas
Re: [XHR2] Avoiding charset dependencies on user settings
On Fri, Sep 23, 2011 at 4:46 AM, Henri Sivonen hsivo...@iki.fi wrote:
> On Fri, Sep 23, 2011 at 11:26 AM, Henri Sivonen hsivo...@iki.fi wrote:
>> Applying all the legacy text/html craziness
>
> Furthermore, applying full legacy text/html craziness involves parser restarts for GET requests. With a browsing context, that means renavigation, but I really don't want to support parser restarts in XHR.

Yeah, I don't see that there's a sane way to replicate this part of HTML parsing.

/ Jonas
Re: [XHR2] Avoiding charset dependencies on user settings
On Mon, Sep 26, 2011 at 12:46 PM, Jonas Sicking jo...@sicking.cc wrote:
> On Fri, Sep 23, 2011 at 1:26 AM, Henri Sivonen hsivo...@iki.fi wrote:
>> On Thu, Sep 22, 2011 at 9:54 PM, Jonas Sicking jo...@sicking.cc wrote:
>>> I agree that there are no legacy requirements on XHR here; however, I don't think that that is the only thing that we should look at. We should also look at what makes the feature the most useful. An extreme counter-example would be that we could let XHR refuse to parse any HTML page that didn't pass a validator. While this wouldn't break any existing content, it would make HTML-in-XHR significantly less useful.
>>
>> Applying all the legacy text/html craziness to XHR could break current use of XHR to retrieve responseText of text/html resources (assuming that we want responseText for text/html to work like responseText for XML in the sense that the same character encoding is used for responseText and responseXML).
>
> This doesn't seem to only be a problem when using crazy parts of text/html charset detection. Simply looking for a meta charset in the first 1024 bytes will change behavior and could cause page breakage. Or am I missing something?

Yes: WebKit already performs the meta prescan for text/html when retrieving responseText via XHR even though it doesn't support full HTML parsing in XHR (so responseXML is still null).
http://hsivonen.iki.fi/test/moz/xhr/charset-xhr.html

Thus, apps broken by the meta prescan would already be broken in WebKit (unless, of course, they browser sniff in a very strange way). And apps that wouldn't be OK with using UTF-8 as the fallback encoding when there's no HTTP-level charset, no BOM and no meta in the first 1024 bytes would already be broken in Gecko.

>> Applying all the legacy text/html craziness to XHR would make data loading in programs fail in subtle and hard-to-debug ways depending on the browser localization and user settings. At least when loading into a browsing context, there's visual feedback of character misdecoding, and the feedback can be attributed back to a given file. If setting-dependent misdecoding happens in the XHR data loading machinery of an app, it's much harder to figure out what part of the system the problem should be attributed to.
>
> Could you provide more detail here? How are you imagining this data being used such that it's not being displayed to the user? I.e. can you describe an application that would break in a non-visual way and where it would be harder to detect where the data originated from compared to, for example, iframe usage?

If a piece of text came from XHR and got injected into a visible DOM, it's not immediately obvious which HTTP response it came from.

--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/
Re: [XHR2] Avoiding charset dependencies on user settings
On Thu, Sep 22, 2011 at 9:54 PM, Jonas Sicking jo...@sicking.cc wrote:
> I agree that there are no legacy requirements on XHR here; however, I don't think that that is the only thing that we should look at. We should also look at what makes the feature the most useful. An extreme counter-example would be that we could let XHR refuse to parse any HTML page that didn't pass a validator. While this wouldn't break any existing content, it would make HTML-in-XHR significantly less useful.

Applying all the legacy text/html craziness to XHR could break current use of XHR to retrieve responseText of text/html resources (assuming that we want responseText for text/html to work like responseText for XML in the sense that the same character encoding is used for responseText and responseXML).

Applying all the legacy text/html craziness to XHR would make data loading in programs fail in subtle and hard-to-debug ways depending on the browser localization and user settings. At least when loading into a browsing context, there's visual feedback of character misdecoding, and the feedback can be attributed back to a given file. If setting-dependent misdecoding happens in the XHR data loading machinery of an app, it's much harder to figure out what part of the system the problem should be attributed to.

--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/
Re: [XHR2] Avoiding charset dependencies on user settings
On Fri, Sep 23, 2011 at 11:26 AM, Henri Sivonen hsivo...@iki.fi wrote:
> Applying all the legacy text/html craziness

Furthermore, applying full legacy text/html craziness involves parser restarts for GET requests. With a browsing context, that means renavigation, but I really don't want to support parser restarts in XHR.

--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/
Re: [XHR2] Avoiding charset dependencies on user settings
On 9/23/11 4:26 AM, Henri Sivonen wrote:
> Applying all the legacy text/html craziness to XHR could break current use of XHR to retrieve responseText of text/html resources (assuming that we want responseText for text/html to work like responseText for XML in the sense that the same character encoding is used for responseText and responseXML).

I think this is a pretty strong argument in favor of not doing the text/html craziness.

-Boris
[XHR2] Avoiding charset dependencies on user settings
http://dev.w3.org/2006/webapi/XMLHttpRequest-2/#document-response-entity-body says:

> If final MIME type is text/html, let document be a Document object that represents the response entity body parsed following the rules set forth in the HTML specification for an HTML parser with scripting disabled. [HTML]

Since there's presumably no legacy content using XHR to read responseXML for text/html (and expecting HTML parsing), and since (in Gecko at least) responseText for non-XML tries the HTTP charset and falls back on UTF-8, it seems it doesn't make sense to implement full-blown legacy charset craziness for text/html in XHR.

Specifically, it seems that it makes sense to skip heuristic detection and to use UTF-8 (as opposed to Windows-1252 or a locale-dependent value) as the fallback encoding if there's neither a meta nor an HTTP charset, since UTF-8 is the pre-existing fallback for responseText, and responseText is already used with text/html.

As it stands, the XHR2 spec defers to a part of HTML that has legacy-oriented optional features. It seems that it makes sense to clamp down on those options for XHR.

--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/
Re: [XHR2] Avoiding charset dependencies on user settings
On Thu, Sep 22, 2011 at 6:33 AM, Henri Sivonen hsivo...@iki.fi wrote:
> http://dev.w3.org/2006/webapi/XMLHttpRequest-2/#document-response-entity-body says:
>
>> If final MIME type is text/html, let document be a Document object that represents the response entity body parsed following the rules set forth in the HTML specification for an HTML parser with scripting disabled. [HTML]
>
> Since there's presumably no legacy content using XHR to read responseXML for text/html (and expecting HTML parsing), and since (in Gecko at least) responseText for non-XML tries the HTTP charset and falls back on UTF-8, it seems it doesn't make sense to implement full-blown legacy charset craziness for text/html in XHR.
>
> Specifically, it seems that it makes sense to skip heuristic detection and to use UTF-8 (as opposed to Windows-1252 or a locale-dependent value) as the fallback encoding if there's neither a meta nor an HTTP charset, since UTF-8 is the pre-existing fallback for responseText, and responseText is already used with text/html.
>
> As it stands, the XHR2 spec defers to a part of HTML that has legacy-oriented optional features. It seems that it makes sense to clamp down on those options for XHR.

I agree that there are no legacy requirements on XHR here; however, I don't think that that is the only thing that we should look at. We should also look at what makes the feature the most useful. An extreme counter-example would be that we could let XHR refuse to parse any HTML page that didn't pass a validator. While this wouldn't break any existing content, it would make HTML-in-XHR significantly less useful.

It makes sense to me that XHR can load any HTML resource that you could load through navigation.

The one argument I could see for diverging from the normal HTML loading algorithm is if it breaks few enough pages that it doesn't severely limit the usefulness of HTML-in-XHR (in any locale), while still adding enough pressure on sites to start using explicit charsets that we accomplish real change. Unfortunately I don't know how to measure those things, though.

/ Jonas