Re: XHR LC comment: header encoding
On Fri, 05 Feb 2010 23:42:09 +0100, Boris Zbarsky bzbar...@mit.edu wrote:
> On 1/31/10 7:38 AM, Anne van Kesteren wrote:
>> Specifically search for "inflate" and "deflate" throughout the drafts:
>>
>>   "To deflate a DOMString into a byte sequence means to remove from each
>>   code point in the DOMString the higher-order byte and let the
>>   resulting byte (all the lower-order bytes) be the byte sequence."
>
> How about: "To deflate a DOMString into a byte sequence means to create
> a sequence of bytes such that the n-th byte of the sequence is equal to
> the low-order byte of the n-th code point in the original DOMString."
>
>>   "To inflate a byte sequence into a DOMString means to create a code
>>   point for each byte of which the higher-order byte is 0x00 and the
>>   lower-order byte is the byte. The resulting code point sequence is
>>   the DOMString."
>
> "To inflate a byte sequence into a DOMString means to create a DOMString
> such that the n-th code point has 0x00 as the high-order byte and the
> n-th byte of the byte sequence as the low-order byte."
>
> Other than that looks ok, though I still worry about changing behavior
> here...

Thanks, fixed. Hopefully it all works out, and if not we will have to
change the specification again.

--
Anne van Kesteren
http://annevankesteren.nl/
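For readers following along, the two definitions amount to something like
the ECMAScript sketch below. The function names and the array-of-octets
representation of a byte sequence are illustrative, not taken from the
spec:

  // Sketch of the spec's "deflate" and "inflate"; byte sequences are
  // modeled as arrays of integers 0-255.
  function deflate(domString) {
    var bytes = [];
    for (var i = 0; i < domString.length; i++) {
      // Keep only the low-order byte of the n-th code point.
      bytes.push(domString.charCodeAt(i) & 0xFF);
    }
    return bytes;
  }

  function inflate(bytes) {
    // Each byte becomes a code point with 0x00 as its high-order byte.
    return String.fromCharCode.apply(null, bytes);
  }

Note that inflate(deflate(s)) === s only when every code point of s is at
most U+00FF; deflate is lossy above that range, which is why the drafts
pair it with the U+00FF checks discussed elsewhere in this thread.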
Re: XHR LC comment: header encoding
On 1/31/10 7:38 AM, Anne van Kesteren wrote:
> Specifically search for "inflate" and "deflate" throughout the drafts:
>
>   "To deflate a DOMString into a byte sequence means to remove from each
>   code point in the DOMString the higher-order byte and let the resulting
>   byte (all the lower-order bytes) be the byte sequence."

How about: "To deflate a DOMString into a byte sequence means to create a
sequence of bytes such that the n-th byte of the sequence is equal to the
low-order byte of the n-th code point in the original DOMString."

>   "To inflate a byte sequence into a DOMString means to create a code
>   point for each byte of which the higher-order byte is 0x00 and the
>   lower-order byte is the byte. The resulting code point sequence is the
>   DOMString."

"To inflate a byte sequence into a DOMString means to create a DOMString
such that the n-th code point has 0x00 as the high-order byte and the n-th
byte of the byte sequence as the low-order byte."

Other than that looks ok, though I still worry about changing behavior
here...

-Boris
Re: XHR LC comment: header encoding
Anne van Kesteren wrote:
> On Tue, 05 Jan 2010 13:49:55 +0100, Boris Zbarsky bzbar...@mit.edu wrote:
>> Apart from the obvious worry of switching away from a behavior that the
>> vast majority of UAs currently implement, with the ensuing potential
>> for website breakage, sounds fine...
>
> I know... Though Opera not having received bug reports so far on this
> issue gives me some hope, since we have received lots of other bug
> reports on far more minor details starting very early on.
>
> The editor drafts of XHR1 and XHR2 now include the change. This also
> moved things away from being defined in Unicode to a combination of
> bytes and ASCII. Please let me know if you (i.e. anyone reading this
> thread) have any editorial suggestions on my changes or if I missed
> something while making the edits. Specifically search for "inflate" and
> "deflate" throughout the drafts:
>
> http://dev.w3.org/2006/webapi/XMLHttpRequest/
> http://dev.w3.org/2006/webapi/XMLHttpRequest-2/
> ...

I've got a question. You now have several parts where you say something
like:

  "If any code point in method is higher than U+00FF LATIN SMALL LETTER Y
  WITH DIAERESIS or after deflating method it does not match the Method
  token production raise a SYNTAX_ERR exception and terminate these
  steps."

a) the part about U+00FF seems to be redundant with the requirement for
deflate not to lose information, and

b) as "Method token" (actually "token" in HTTP/1.1) does not allow
non-ASCII characters anyway, it appears to be much simpler to just require
conformance to that ABNF.

So this is probably correct, but appears to be way too verbose to me...

Best regards, Julian
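Julian's point (b) amounts to something like the following sketch. The
regex is my own transcription of the RFC 2616 "token" ABNF, and the names
are illustrative, not spec text:

  // token = one or more ASCII characters, excluding CTLs and the
  // HTTP/1.1 separators ()<>@,;:\"/[]?={} SP HT.
  var TOKEN = /^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$/;

  function checkMethod(method) {
    // Conformance to the token ABNF already rules out every code point
    // above U+007F (let alone U+00FF), so a separate U+00FF check adds
    // nothing.
    if (!TOKEN.test(method)) {
      throw new Error("SYNTAX_ERR"); // stand-in for the DOMException
    }
  }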
Re: XHR LC comment: header encoding
On Tue, 05 Jan 2010 13:49:55 +0100, Boris Zbarsky bzbar...@mit.edu wrote:
> Apart from the obvious worry of switching away from a behavior that the
> vast majority of UAs currently implement, with the ensuing potential for
> website breakage, sounds fine...

I know... Though Opera not having received bug reports so far on this
issue gives me some hope, since we have received lots of other bug reports
on far more minor details starting very early on.

The editor drafts of XHR1 and XHR2 now include the change. This also moved
things away from being defined in Unicode to a combination of bytes and
ASCII. Please let me know if you (i.e. anyone reading this thread) have
any editorial suggestions on my changes or if I missed something while
making the edits. Specifically search for "inflate" and "deflate"
throughout the drafts:

http://dev.w3.org/2006/webapi/XMLHttpRequest/
http://dev.w3.org/2006/webapi/XMLHttpRequest-2/

Or review the diff of xhr-source:

http://dev.w3.org/cvsweb/2006/webapi/XMLHttpRequest-2/xhr-source.diff?r1=1.6&r2=1.7&f=h

Kind regards,

--
Anne van Kesteren
http://annevankesteren.nl/
Re: XHR LC comment: header encoding
On Tue, 05 Jan 2010 08:39:26 +0100, Anne van Kesteren ann...@opera.com wrote:
> On Tue, 05 Jan 2010 08:29:53 +0100, Jonas Sicking jo...@sicking.cc wrote:
>> At the very least, throwing if the upper byte is non-zero seems like
>> the right thing to do to prevent silent data loss.
>
> That works for me.

More concretely, this means that combined with the rules coming from HTTP
a SYNTAX_ERR exception would be raised for the value argument if one of
the characters has a code point larger than U+00FF, if the code point is
U+007F, or if the code point is smaller than U+0020 but is not U+0009. If
this is all ok the lower bytes are collected as the new header value.

Does this sound acceptable to implementors?

--
Anne van Kesteren
http://annevankesteren.nl/
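Restated as a sketch (ECMAScript; the helper name is hypothetical, not
spec text):

  // Allowed: U+0009 (tab) and U+0020..U+00FF excluding U+007F (DEL).
  // Rejected: other controls below U+0020, U+007F, anything above U+00FF.
  function validateHeaderValue(value) {
    for (var i = 0; i < value.length; i++) {
      var cp = value.charCodeAt(i);
      var ok = cp === 0x09 || (cp >= 0x20 && cp !== 0x7F && cp <= 0xFF);
      if (!ok) {
        throw new Error("SYNTAX_ERR"); // stand-in for the DOMException
      }
    }
    // After this check, collecting "the lower bytes" (deflating) is
    // lossless, since every code point fits in one byte.
  }

  // validateHeaderValue("text/plain; q=0.5")  -- passes
  // validateHeaderValue("\u0100")             -- throws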
Re: XHR LC comment: header encoding
Anne van Kesteren wrote:
> On Tue, 05 Jan 2010 08:29:53 +0100, Jonas Sicking jo...@sicking.cc wrote:
>> Wouldn't it then be better to throw for any non-ASCII characters? That
>> way we don't restrict ourselves for when (if?) IETF defines an encoding
>> for HTTP headers.
>
> The defined encoding is ISO-8859-1 (unfortunately).

Well, that's debatable, as RFC 2616 wasn't sufficiently precise. What's a
fact is that some HTTP APIs treat them as ISO-8859-1 (the servlet API, for
instance).

HTTPbis currently has:

  "Historically, HTTP has allowed field content with text in the
  ISO-8859-1 [ISO-8859-1] character encoding and supported other
  character sets only through use of [RFC2047] encoding. In practice,
  most HTTP header field values use only a subset of the US-ASCII
  character encoding [USASCII]. Newly defined header fields SHOULD limit
  their field values to US-ASCII characters. Recipients SHOULD treat
  other (obs-text) octets in field content as opaque data." --
  http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p1-messaging-08.html#rfc.section.3.2

>> At the very least, throwing if the upper byte is non-zero seems like
>> the right thing to do to prevent silent data loss.
>
> That works for me.

Sounds good to me as well.

Best regards, Julian
Re: XHR LC comment: header encoding
On 1/5/10 5:03 AM, Anne van Kesteren wrote:
> More concretely, this means that combined with the rules coming from
> HTTP a SYNTAX_ERR exception would be raised for the value argument if
> one of the characters has a code point larger than U+00FF, if the code
> point is U+007F, or if the code point is smaller than U+0020 but is not
> U+0009. If this is all ok the lower bytes are collected as the new
> header value.
>
> Does this sound acceptable to implementors?

Apart from the obvious worry of switching away from a behavior that the
vast majority of UAs currently implement, with the ensuing potential for
website breakage, sounds fine...

-Boris
Re: XHR LC comment: header encoding
On Mon, 07 Dec 2009 16:42:31 +0100, Julian Reschke julian.resc...@gmx.de wrote:
> I think XHR needs to elaborate on how non-ASCII characters in request
> headers are put on the wire, and how non-ASCII characters in response
> headers are transformed back to Javascript characters.

Hmm, yeah. I somehow assumed this was easy because everything was
restricted to the ASCII range. It appears octets higher than 7E can occur
as well per HTTP.

> For request headers, I would assume that the character encoding is
> ISO-8859-1, and if a character can't be encoded using ISO-8859-1, some
> kind of error handling occurs (ignore the character/ignore the
> header/throw?).

From my limited testing it seems Firefox, Chrome, and Internet Explorer
use UTF-8 octets. E.g. "\xFF" in ECMAScript gets transmitted as C3 BF (in
octets). Opera sends "\xFF" as FF.

> For response headers, I'd expect that the octet sequence is decoded
> using ISO-8859-1; so no specific error handling would be needed
> (although the result may be funny when the intended encoding was
> something different).

Firefox, Opera, and Internet Explorer indeed do this. Chrome decodes as
UTF-8 as far as I can tell.

I'd love some implementor feedback on the matter.

--
Anne van Kesteren
http://annevankesteren.nl/
Re: XHR LC comment: header encoding
Anne van Kesteren wrote:
> On Mon, 07 Dec 2009 16:42:31 +0100, Julian Reschke julian.resc...@gmx.de wrote:
>> I think XHR needs to elaborate on how non-ASCII characters in request
>> headers are put on the wire, and how non-ASCII characters in response
>> headers are transformed back to Javascript characters.
>
> Hmm, yeah. I somehow assumed this was easy because everything was
> restricted to the ASCII range. It appears octets higher than 7E can
> occur as well per HTTP.
>
>> For request headers, I would assume that the character encoding is
>> ISO-8859-1, and if a character can't be encoded using ISO-8859-1, some
>> kind of error handling occurs (ignore the character/ignore the
>> header/throw?).
>
> From my limited testing it seems Firefox, Chrome, and Internet Explorer
> use UTF-8 octets. E.g. "\xFF" in ECMAScript gets transmitted as C3 BF
> (in octets). Opera sends "\xFF" as FF.
>
>> For response headers, I'd expect that the octet sequence is decoded
>> using ISO-8859-1; so no specific error handling would be needed
>> (although the result may be funny when the intended encoding was
>> something different).
>
> Firefox, Opera, and Internet Explorer indeed do this. Chrome decodes as
> UTF-8 as far as I can tell.
>
> I'd love some implementor feedback on the matter.
> ...

Thanks for doing the testing. The discrepancy between setting and getting
worries me a lot :-).

From HTTP's point of view, the header field value really is opaque. So you
can put there anything, as long as it fits into the header field ABNF. Of
course that only helps if senders and receivers agree on the encoding. In
my experience, server frameworks (the servlet API, for instance) assume
ISO-8859-1 here (but that probably should be tested).

For XHR 1 I think the resolution should be to leave this
implementation-specific, and advise users not to rely on anything
non-ASCII.

Best regards, Julian
Re: XHR LC comment: header encoding
On 1/4/10 11:17 AM, Julian Reschke wrote:
>>> For request headers, I would assume that the character encoding is
>>> ISO-8859-1, and if a character can't be encoded using ISO-8859-1, some
>>> kind of error handling occurs (ignore the character/ignore the
>>> header/throw?).
>>
>> From my limited testing it seems Firefox, Chrome, and Internet Explorer
>> use UTF-8 octets. E.g. "\xFF" in ECMAScript gets transmitted as C3 BF
>> (in octets). Opera sends "\xFF" as FF.

That's what Gecko does, correct.

>>> For response headers, I'd expect that the octet sequence is decoded
>>> using ISO-8859-1; so no specific error handling would be needed
>>> (although the result may be funny when the intended encoding was
>>> something different).
>>
>> Firefox, Opera, and Internet Explorer indeed do this. Chrome decodes as
>> UTF-8 as far as I can tell.

More precisely, what Gecko does here is to take the raw byte string and
byte-inflate it (by setting the high byte of each 16-bit code unit to 0
and the low byte to the corresponding byte of the given byte string)
before returning it to JS. This happens to more or less match decoding as
ISO-8859-1, but not quite.

> Thanks for doing the testing. The discrepancy between setting and
> getting worries me a lot :-).

In Gecko's case it seems to be an accident, at least historically. The
getter and setter used to both do byte ops only (so byte inflation in the
getter, and dropping the high byte in the setter) until the fix for
https://bugzilla.mozilla.org/show_bug.cgi?id=232493. The review comments
at https://bugzilla.mozilla.org/show_bug.cgi?id=232493#c4 point out the
UTF-8-vs-byte-inflation inconsistency here, but didn't seem to get
addressed...

> From HTTP's point of view, the header field value really is opaque. So
> you can put there anything, as long as it fits into the header field
> ABNF.

True; what does that mean for converting header values to 16-bit code
units in practice? Seems like byte-inflation might be the only reasonable
thing to do...

> Of course that only helps if senders and receivers agree on the
> encoding.

True, but "encoding" here needs to mean more than just encoding of
Unicode, since one can just stick random byte arrays, within the ABNF
restrictions, in the header, right?

-Boris
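The set/get discrepancy can be made concrete with a small sketch. The
helper name is illustrative; the octet values follow the testing reported
above:

  // Gecko's getter behavior, as described: zero the high byte of each
  // 16-bit code unit, i.e. inflate the raw octets into a string.
  function inflate(bytes) {
    return String.fromCharCode.apply(null, bytes);
  }

  // Setter path in Firefox/Chrome/IE at the time: UTF-8 encode the
  // value. "\xFF" (U+00FF) encodes to the two octets C3 BF.
  var sent = [0xC3, 0xBF];

  // Reading those octets back through the byte-inflating getter:
  inflate(sent); // "\u00C3\u00BF" -- not the original "\xFF"

  // Opera's byte-dropping setter sends "\xFF" as the single octet FF,
  // so the same getter round-trips it cleanly:
  inflate([0xFF]); // "\xFF"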
Re: XHR LC comment: header encoding
Boris Zbarsky wrote:
> ...
> Mozilla trunk already does byte _inflation_ when converting from header
> bytes into a JavaScript string. I assume you meant to convert JavaScript
> strings into header bytes via dropping the high byte of each 16-bit code
> unit. However that fails the "preserve as much information as possible"
> test... In particular, as soon as any Unicode character outside the
> U+0000-U+00FF range is used, byte-dropping loses information.
> ...

But what's the alternative? Decide the encoding in each case? The encoding
not being predictable seems to be worse than anything else...

BR, Julian
Re: XHR LC comment: header encoding
On 1/4/10 3:15 PM, Julian Reschke wrote:
> But what's the alternative? Decide the encoding in each case? The
> encoding not being predictable seems to be worse than anything else...

Well, one non-destructive alternative is to encode JS strings as bytes by
converting each 16-bit code unit into a byte pair (in LE or BE order, as
desired). This has the obvious drawback of stuffing null bytes into the
header, as well as not round-tripping with the byte-inflation. But that's
the only non-destructive alternative (well, that and variants like
base64-encoding to get around the null byte thing) I see, given that JS
strings are actually arrays of arbitrary 16-bit integers. In particular,
conversion to UTF-8 is in fact destructive, as is any other conversion
that treats the string as Unicode of some sort.

If we don't have a requirement to preserve any possible JS string via this
API, then we probably have more flexibility...

-Boris
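For concreteness, a sketch of that byte-pair encoding (big-endian order
chosen arbitrarily; the function name is mine, not from any spec):

  // Every 16-bit code unit becomes exactly two octets, so any JS string
  // survives, at the cost of NUL octets appearing in the header value.
  function toBytePairs(str) {
    var bytes = [];
    for (var i = 0; i < str.length; i++) {
      var unit = str.charCodeAt(i);
      bytes.push(unit >> 8, unit & 0xFF); // high octet, then low octet
    }
    return bytes;
  }

  toBytePairs("\xFF");   // [0x00, 0xFF] -- note the embedded null byte
  toBytePairs("\u033A"); // [0x03, 0x3A]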
Re: XHR LC comment: header encoding
On Mon, 04 Jan 2010 21:57:34 +0100, Boris Zbarsky bzbar...@mit.edu wrote:
> If we don't have a requirement to preserve any possible JS string via
> this API, then we probably have more flexibility...

I don't think we have that requirement.

I tested Opera a bit further and it seems to simply remove the first byte
of a 16-bit code unit on setting. So e.g. U+FFFD becomes FD and U+033A
becomes 3A. (This seems to match what you call byte-inflation.)

I personally quite like this. It is very predictable and allows you to
submit any valid HTTP header. If Gecko can switch back to this behavior as
well other browsers are probably willing to follow.

Unless there are strong objections I will define this behavior in the
specification. I.e. byte-inflation for both setting and getting headers.

--
Anne van Kesteren
http://annevankesteren.nl/
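That setter behavior, as a sketch (illustrative name, not spec text):

  // Drop the high byte of each 16-bit code unit on setting, matching the
  // Opera behavior described above.
  function dropHighBytes(str) {
    var bytes = [];
    for (var i = 0; i < str.length; i++) {
      bytes.push(str.charCodeAt(i) & 0xFF);
    }
    return bytes;
  }

  dropHighBytes("\uFFFD"); // [0xFD]
  dropHighBytes("\u033A"); // [0x3A]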
Re: XHR LC comment: header encoding
On Mon, Jan 4, 2010 at 9:51 PM, Anne van Kesteren ann...@opera.com wrote:
> On Mon, 04 Jan 2010 21:57:34 +0100, Boris Zbarsky bzbar...@mit.edu wrote:
>> If we don't have a requirement to preserve any possible JS string via
>> this API, then we probably have more flexibility...
>
> I don't think we have that requirement.
>
> I tested Opera a bit further and it seems to simply remove the first
> byte of a 16-bit code unit on setting. So e.g. U+FFFD becomes FD and
> U+033A becomes 3A. (This seems to match what you call byte-inflation.)
>
> I personally quite like this. It is very predictable and allows you to
> submit any valid HTTP header. If Gecko can switch back to this behavior
> as well other browsers are probably willing to follow.
>
> Unless there are strong objections I will define this behavior in the
> specification. I.e. byte-inflation for both setting and getting headers.

Wouldn't it then be better to throw for any non-ASCII characters? That way
we don't restrict ourselves for when (if?) IETF defines an encoding for
HTTP headers.

At the very least, throwing if the upper byte is non-zero seems like the
right thing to do to prevent silent data loss.

/ Jonas
Re: XHR LC comment: header encoding
On Tue, 05 Jan 2010 08:29:53 +0100, Jonas Sicking jo...@sicking.cc wrote:
> Wouldn't it then be better to throw for any non-ASCII characters? That
> way we don't restrict ourselves for when (if?) IETF defines an encoding
> for HTTP headers.

The defined encoding is ISO-8859-1 (unfortunately).

> At the very least, throwing if the upper byte is non-zero seems like the
> right thing to do to prevent silent data loss.

That works for me.

--
Anne van Kesteren
http://annevankesteren.nl/
XHR LC comment: header encoding
Hi,

I think XHR needs to elaborate on how non-ASCII characters in request
headers are put on the wire, and how non-ASCII characters in response
headers are transformed back to Javascript characters.

For request headers, I would assume that the character encoding is
ISO-8859-1, and if a character can't be encoded using ISO-8859-1, some
kind of error handling occurs (ignore the character/ignore the
header/throw?).

For response headers, I'd expect that the octet sequence is decoded using
ISO-8859-1; so no specific error handling would be needed (although the
result may be funny when the intended encoding was something different).

Best regards, Julian