Re: XHR LC comment: header encoding

2010-02-08 Thread Anne van Kesteren

On Fri, 05 Feb 2010 23:42:09 +0100, Boris Zbarsky bzbar...@mit.edu wrote:

On 1/31/10 7:38 AM, Anne van Kesteren wrote:

Specifically search for inflate and deflate throughout the drafts:



To deflate a DOMString into a byte sequence means to remove from
each code point in the DOMString the higher-order byte and let the
resulting byte (all the lower-order bytes) be the byte sequence.


How about:

   To deflate a DOMString into a byte sequence means to create a
   sequence of bytes such that the n-th byte of the sequence is equal to
   the low-order byte of the n-th code point in the original DOMString.


To inflate a byte sequence into a DOMString means to create a code
point for each byte of which the higher-order byte is 0x00 and the
lower-order byte is the byte. The resulting code point sequence is
the DOMString.


   To inflate a byte sequence into a DOMString means to create a
   DOMString such that the n-th code point has 0x00 as the high-order
   byte and the n-th byte of the byte sequence as the low-order byte.
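
For what it's worth, a minimal ECMAScript sketch of what these two
definitions amount to (the function names are mine, not from the drafts;
a byte sequence is modelled as an array of integers in the range 0x00-0xFF):

   // Illustrative sketch only; not draft text.
   function deflate(domString) {
     // n-th byte = low-order byte of the n-th code unit of the DOMString.
     var bytes = [];
     for (var i = 0; i < domString.length; i++) {
       bytes.push(domString.charCodeAt(i) & 0xFF);
     }
     return bytes;
   }

   function inflate(bytes) {
     // n-th code unit = 0x00 as high-order byte, the n-th byte as low-order byte.
     var result = "";
     for (var i = 0; i < bytes.length; i++) {
       result += String.fromCharCode(bytes[i] & 0xFF);
     }
     return result;
   }

   // inflate(deflate("\u00FF")) === "\u00FF", but information above U+00FF
   // is lost: deflate("\u033A") yields [0x3A].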

Other than that looks ok, though I still worry about changing behavior  
here...


Thanks, fixed. Hopefully it all works out, and if not we will have to  
change the specification again.



--
Anne van Kesteren
http://annevankesteren.nl/



Re: XHR LC comment: header encoding

2010-02-05 Thread Boris Zbarsky

On 1/31/10 7:38 AM, Anne van Kesteren wrote:

Specifically search for inflate and deflate throughout the drafts:


 To deflate a DOMString into a byte sequence means to remove from
 each code point in the DOMString the higher-order byte and let the
 resulting byte (all the lower-order bytes) be the byte sequence.

How about:

  To deflate a DOMString into a byte sequence means to create a
  sequence of bytes such that the n-th byte of the sequence is equal to
  the low-order byte of the n-th code point in the original DOMString.

 To inflate a byte sequence into a DOMString means to create a code
 point for each byte of which the higher-order byte is 0x00 and the
 lower-order byte is the byte. The resulting code point sequence is
 the DOMString.

  To inflate a byte sequence into a DOMString means to create a
  DOMString such that the n-th code point has 0x00 as the high-order
  byte and the n-th byte of the byte sequence as the low-order byte.

Other than that looks ok, though I still worry about changing behavior 
here...


-Boris



Re: XHR LC comment: header encoding

2010-02-01 Thread Julian Reschke

Anne van Kesteren wrote:

On Tue, 05 Jan 2010 13:49:55 +0100, Boris Zbarsky bzbar...@mit.edu wrote:
Apart from the obvious worry of switching away from a behavior that 
the vast majority of UAs currently implement, with the ensuing 
potential for website breakage, sounds fine...


I know... Though Opera not having received bug reports so far on this 
issue gives me some hope, since we have received lots of other bug 
reports on far more minor details starting very early on.


The editor drafts of XHR1 and XHR2 now include the change. This also 
moved things away from being defined in Unicode to a combination of 
bytes and ASCII. Please let me know if you (i.e. anyone reading this 
thread) have any editorial suggestions on my changes or if I missed 
something while making the edits.


Specifically search for inflate and deflate throughout the drafts:

  http://dev.w3.org/2006/webapi/XMLHttpRequest/
  http://dev.w3.org/2006/webapi/XMLHttpRequest-2/
...


I've got a question. You now have several parts where you say something 
like:


If any code point in method is higher than U+00FF LATIN SMALL LETTER Y 
WITH DIAERESIS or after deflating method it does not match the Method 
token production raise a SYNTAX_ERR exception and terminate these steps.


a) the part about U+00FF seems to be redundant with the requirement 
for deflate not to lose information, and


b) as Method token (actually token in HTTP/1.1) does not allow 
non-ASCII characters anyway, it appears to be much simpler to just 
require conformance to that ABNF.


So this is probably correct, but appears to be way too verbose to me...
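
For comparison, a direct check against the token production is short; a
rough sketch (the regular expression is my own transcription of RFC 2616's
token rule, not text from either draft):

   // Rough sketch, not draft text: RFC 2616 "token" = one or more CHARs
   // excluding CTLs and the separators ()<>@,;:\"/[]?={} SP HT.
   var HTTP_TOKEN = /^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$/;

   function isValidMethod(method) {
     // Anything outside US-ASCII (in particular anything above U+00FF)
     // already fails this test, which is the redundancy noted above.
     return HTTP_TOKEN.test(method);
   }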

Best regards, Julian




Re: XHR LC comment: header encoding

2010-01-31 Thread Anne van Kesteren

On Tue, 05 Jan 2010 13:49:55 +0100, Boris Zbarsky bzbar...@mit.edu wrote:
Apart from the obvious worry of switching away from a behavior that the  
vast majority of UAs currently implement, with the ensuing potential for  
website breakage, sounds fine...


I know... Though Opera not having received bug reports so far on this  
issue gives me some hope, since we have received lots of other bug reports  
on far more minor details starting very early on.


The editor drafts of XHR1 and XHR2 now include the change. This also moved  
things away from being defined in Unicode to a combination of bytes and  
ASCII. Please let me know if you (i.e. anyone reading this thread) have  
any editorial suggestions on my changes or if I missed something while  
making the edits.


Specifically search for inflate and deflate throughout the drafts:

  http://dev.w3.org/2006/webapi/XMLHttpRequest/
  http://dev.w3.org/2006/webapi/XMLHttpRequest-2/

Or review the diff of xhr-source:

  
http://dev.w3.org/cvsweb/2006/webapi/XMLHttpRequest-2/xhr-source.diff?r1=1.6&r2=1.7&f=h

Kind regards,


--
Anne van Kesteren
http://annevankesteren.nl/



Re: XHR LC comment: header encoding

2010-01-05 Thread Anne van Kesteren
On Tue, 05 Jan 2010 08:39:26 +0100, Anne van Kesteren ann...@opera.com  
wrote:
On Tue, 05 Jan 2010 08:29:53 +0100, Jonas Sicking jo...@sicking.cc  
wrote:

At the very least, throwing if the upper byte is non-zero seems like
the right thing to do to prevent silent data loss.


That works for me.


More concretely, this means that, combined with the rules coming from HTTP,  
a SYNTAX_ERR exception would be raised for the value argument if one of  
the characters has a code point larger than U+00FF, if the code point is  
U+007F, or if the code point is smaller than U+0020 but is not U+0009. If  
this is all ok, the lower bytes are collected as the new header value.
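
Expressed as a sketch (illustrative only; the helper name and the thrown
Error are mine, not algorithm text from the draft), the proposed check for
the value argument would be roughly:

   // Illustrative sketch of the proposed setRequestHeader() value handling.
   function checkAndDeflateValue(value) {
     var bytes = [];
     for (var i = 0; i < value.length; i++) {
       var c = value.charCodeAt(i);
       if (c > 0xFF ||                    // larger than U+00FF
           c === 0x7F ||                  // U+007F (DEL)
           (c < 0x20 && c !== 0x09)) {    // controls other than U+0009 (HT)
         // A conforming implementation would raise a SYNTAX_ERR exception here.
         throw new Error("SYNTAX_ERR");
       }
       bytes.push(c);                     // the lower byte (c is <= 0xFF here)
     }
     return bytes;
   }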


Does this sound acceptable to implementors?


--
Anne van Kesteren
http://annevankesteren.nl/



Re: XHR LC comment: header encoding

2010-01-05 Thread Julian Reschke

Anne van Kesteren wrote:

On Tue, 05 Jan 2010 08:29:53 +0100, Jonas Sicking jo...@sicking.cc wrote:

Wouldn't it then be better to throw for any non-ASCII characters? That
way we don't restrict ourselves for when (if?) the IETF defines an encoding
for HTTP headers.


The defined encoding is ISO-8859-1 (unfortunately).


Well, that's debatable, as RFC 2616 wasn't sufficiently precise.

What is a fact is that some HTTP APIs treat header field values as 
ISO-8859-1 (the servlet API, for instance).


HTTPbis currently has:

Historically, HTTP has allowed field content with text in the 
ISO-8859-1 [ISO-8859-1] character encoding and supported other character 
sets only through use of [RFC2047] encoding. In practice, most HTTP 
header field values use only a subset of the US-ASCII character encoding 
[USASCII]. Newly defined header fields SHOULD limit their field values 
to US-ASCII characters. Recipients SHOULD treat other (obs-text) octets 
in field content as opaque data. -- 
http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p1-messaging-08.html#rfc.section.3.2



At the very least, throwing if the upper byte is non-zero seems like
the right thing to do to prevent silent data loss.


That works for me.


Sounds good to me as well.

Best regards, Julian



Re: XHR LC comment: header encoding

2010-01-05 Thread Boris Zbarsky

On 1/5/10 5:03 AM, Anne van Kesteren wrote:

More concretely, this means that, combined with the rules coming from
HTTP, a SYNTAX_ERR exception would be raised for the value argument if
one of the characters has a code point larger than U+00FF, if the code
point is U+007F, or if the code point is smaller than U+0020 but is not
U+0009. If this is all ok, the lower bytes are collected as the new
header value.

Does this sound acceptable to implementors?


Apart from the obvious worry of switching away from a behavior that the 
vast majority of UAs currently implement, with the ensuing potential for 
website breakage, sounds fine...


-Boris




Re: XHR LC comment: header encoding

2010-01-04 Thread Anne van Kesteren
On Mon, 07 Dec 2009 16:42:31 +0100, Julian Reschke julian.resc...@gmx.de  
wrote:
I think XHR needs to elaborate on how non-ASCII characters in request  
headers are put on the wire, and how non-ASCII characters in response  
headers are transformed back to Javascript characters.


Hmm yeah. I somehow assumed this was easy because everything was  
restricted to the ASCII range. It appears octets higher than 7E can occur  
as well per HTTP.



For request headers, I would assume that the character encoding is  
ISO-8859-1, and if a character can't be encoded using ISO-8859-1, some  
kind of error handling occurs (ignore the character/ignore the  
header/throw?).


From my limited testing it seems Firefox, Chrome, and Internet Explorer  
use UTF-8 octets. E.g. \xFF in ECMAScript gets transmitted as C3 BF (in  
octets). Opera sends \xFF as FF.



For response headers, I'd expect that the octet sequence is decoded  
using ISO-8859-1; so no specific error handling would be needed  
(although the result may be funny when the intended encoding was  
something different).


Firefox, Opera, and Internet Explorer indeed do this. Chrome decodes as  
UTF-8 as far as I can tell.



I'd love some implementor feedback on the matter.


--
Anne van Kesteren
http://annevankesteren.nl/



Re: XHR LC comment: header encoding

2010-01-04 Thread Julian Reschke

Anne van Kesteren wrote:
On Mon, 07 Dec 2009 16:42:31 +0100, Julian Reschke 
julian.resc...@gmx.de wrote:
I think XHR needs to elaborate on how non-ASCII characters in request 
headers are put on the wire, and how non-ASCII characters in response 
headers are transformed back to Javascript characters.


Hmm yeah. I somehow assumed this was easy because everything was 
restricted to the ASCII range. It appears octets higher than 7E can 
occur as well per HTTP.



For request headers, I would assume that the character encoding is 
ISO-8859-1, and if a character can't be encoded using ISO-8859-1, some 
kind of error handling occurs (ignore the character/ignore the 
header/throw?).


 From my limited testing it seems Firefox, Chrome, and Internet Explorer 
use UTF-8 octets. E.g. \xFF in ECMAScript gets transmitted as C3 BF 
(in octets). Opera sends \xFF as FF.



For response headers, I'd expect that the octet sequence is decoded 
using ISO-8859-1; so no specific error handling would be needed 
(although the result may be funny when the intended encoding was 
something different).


Firefox, Opera, and Internet Explorer indeed do this. Chrome decodes as 
UTF-8 as far as I can tell.



I'd love some implementor feedback on the matter.
...


Thanks for doing the testing. The discrepancy between setting and 
getting worries me a lot :-).


From HTTP's point of view, the header field value really is opaque. So 
you can put anything there, as long as it fits into the header field ABNF.


Of course that only helps if senders and receivers agree on the 
encoding. In my experience, server frameworks (servlet API, for 
instance) assume ISO-8859-1 here (but that probably should be tested).


For XHR 1 I think the resolution should be to leave this 
implementation-specific, and advise users not to rely on anything non-ASCII.


Best regards, Julian




Re: XHR LC comment: header encoding

2010-01-04 Thread Boris Zbarsky

On 1/4/10 11:17 AM, Julian Reschke wrote:

For request headers, I would assume that the character encoding is
ISO-8859-1, and if a character can't be encoded using ISO-8859-1,
some kind of error handling occurs (ignore the character/ignore the
header/throw?).


From my limited testing it seems Firefox, Chrome, and Internet
Explorer use UTF-8 octets. E.g. \xFF in ECMAScript gets transmitted
as C3 BF (in octets). Opera sends \xFF as FF.


That's what Gecko does, correct.


For response headers, I'd expect that the octet sequence is decoded
using ISO-8859-1; so no specific error handling would be needed
(although the result may be funny when the intended encoding was
something different).


Firefox, Opera, and Internet Explorer indeed do this. Chrome decodes
as UTF-8 as far as I can tell.


More precisely, what Gecko does here is to take the raw byte string and 
byte-inflate it (by setting the high byte of each 16-bit code unit to 0 
and the low byte to the corresponding byte of the given byte string) 
before returning it to JS.


This happens to more or less match decoding as ISO-8859-1, but not quite.
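
To make the setter/getter discrepancy concrete, a small sketch (the names
are mine; the UTF-8 step uses the encodeURIComponent/unescape trick purely
for illustration):

   // Illustration only: a UTF-8 setter paired with a byte-inflating getter
   // does not round-trip.
   function utf8Deflate(str) {
     // encodeURIComponent yields UTF-8 percent-escapes; unescape maps each
     // %XX back to a single code unit in the range 0x00-0xFF.
     var byteString = unescape(encodeURIComponent(str));
     var bytes = [];
     for (var i = 0; i < byteString.length; i++) {
       bytes.push(byteString.charCodeAt(i));
     }
     return bytes;
   }

   function byteInflate(bytes) {
     var s = "";
     for (var i = 0; i < bytes.length; i++) {
       s += String.fromCharCode(bytes[i]);
     }
     return s;
   }

   var wire = utf8Deflate("\u00FF");  // [0xC3, 0xBF] on the wire
   byteInflate(wire);                 // "\u00C3\u00BF" ("Ã¿"), not "\u00FF"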


Thanks for doing the testing. The discrepancy between setting and
getting worries me a lot :-).


In Gecko's case it seems to be an accident, at least historically.  The 
getter and setter used to both do byte ops only (so byte inflation in 
the getter, and dropping the high byte in the setter) until the fix for 
https://bugzilla.mozilla.org/show_bug.cgi?id=232493.  The review 
comments at https://bugzilla.mozilla.org/show_bug.cgi?id=232493#c4 
point out the UTF-8-vs-byte-inflation inconsistency here, but that didn't 
seem to get addressed...



 From HTTP's point of view, the header field value really is opaque. So
you can put anything there, as long as it fits into the header field ABNF.


True; what does that mean for converting header values to 16-bit code 
units in practice?  Seems like byte-inflation might be the only 
reasonable thing to do...



Of course that only helps if senders and receivers agree on the
encoding.


True, but encoding here needs to mean more than just encoding of 
Unicode, since one can just stick random byte arrays, within the ABNF 
restrictions, in the header, right?


-Boris



Re: XHR LC comment: header encoding

2010-01-04 Thread Julian Reschke

Boris Zbarsky wrote:

...
Mozilla trunk already does byte _inflation_ when converting from header 
bytes into a JavaScript string.  I assume you meant to convert 
JavaScript strings into header bytes via dropping the high byte of each 
16-bit code unit.  However that fails the "preserve as much information 
as possible" test...  In particular, as soon as any Unicode character 
outside the U+0000-U+00FF range is used, byte-dropping loses information.

...


But what's the alternative? Decide the encoding in each case? The 
encoding not being predictable seems to be worse than anything else...


BR, Julian



Re: XHR LC comment: header encoding

2010-01-04 Thread Boris Zbarsky

On 1/4/10 3:15 PM, Julian Reschke wrote:

But what's the alternative? Decide the encoding in each case? The
encoding not being predictable seems to be worse than anything else...


Well, one non-destructive alternative is to encode JS strings as bytes 
by converting each 16-bit code unit into a byte pair (in LE or BE order, 
as desired).  This has the obvious drawback of stuffing null bytes into 
the header, as well as not round-tripping with the byte-inflation.


But that's the only non-destructive alternative (well, that and variants 
like base64-encoding to get around the null byte thing) I see, given 
that JS strings are actually arrays of arbitrary 16-bit integers.  In 
particular, conversion to UTF-8 is in fact destructive, as is any other 
conversion that treats the string as Unicode of some sort.
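
A sketch of that byte-pair alternative, purely for illustration (big-endian
order chosen arbitrarily; the function name is mine):

   // Lossless but impractical: every 16-bit code unit becomes two bytes.
   function codeUnitsToBytePairs(str) {
     var bytes = [];
     for (var i = 0; i < str.length; i++) {
       var u = str.charCodeAt(i);
       bytes.push(u >> 8, u & 0xFF);  // high byte, then low byte
     }
     return bytes;
   }

   codeUnitsToBytePairs("A\u00FF");   // [0x00, 0x41, 0x00, 0xFF] -- note the
                                      // 0x00 bytes, which an HTTP header field
                                      // value cannot carry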


If we don't have a requirement to preserve any possible JS string via 
this API, then we probably have more flexibility..


-Boris



Re: XHR LC comment: header encoding

2010-01-04 Thread Anne van Kesteren

On Mon, 04 Jan 2010 21:57:34 +0100, Boris Zbarsky bzbar...@mit.edu wrote:
If we don't have a requirement to preserve any possible JS string via  
this API, then we probably have more flexibility..


I don't think we have that requirement.

I tested Opera a bit further and it seems to simply remove the first byte  
of a 16-bit code unit on setting. So e.g. U+FFFD becomes FD and U+033A  
becomes 3A. (This seems to match what you call byte-inflation.) I  
personally quite like this. It is very predictable and allows you to  
submit any valid HTTP header.


If Gecko can switch back to this behavior as well, other browsers are  
probably willing to follow. Unless there are strong objections I will  
define this behavior in the specification, i.e. dropping the high byte  
when setting headers and byte-inflating when getting them.



--
Anne van Kesteren
http://annevankesteren.nl/



Re: XHR LC comment: header encoding

2010-01-04 Thread Jonas Sicking
On Mon, Jan 4, 2010 at 9:51 PM, Anne van Kesteren ann...@opera.com wrote:
 On Mon, 04 Jan 2010 21:57:34 +0100, Boris Zbarsky bzbar...@mit.edu wrote:

 If we don't have a requirement to preserve any possible JS string via this
 API, then we probably have more flexibility..

 I don't think we have that requirement.

 I tested Opera a bit further and it seems to simply remove the first byte of
 a 16-bit code unit on setting. So e.g. U+FFFD becomes FD and U+033A becomes
 3A. (This seems to match what you call byte-inflation.) I personally quite
 like this. It is very predictable and allows you to submit any valid HTTP
 header.

 If Gecko can switch back to this behavior as well, other browsers are
 probably willing to follow. Unless there are strong objections I will define
 this behavior in the specification, i.e. dropping the high byte when setting
 headers and byte-inflating when getting them.

Wouldn't it then be better to throw for any non-ASCII characters? That
way we don't restrict ourselves for when (if?) the IETF defines an encoding
for HTTP headers.

At the very least, throwing if the upper byte is non-zero seems like
the right thing to do to prevent silent data loss.

/ Jonas



Re: XHR LC comment: header encoding

2010-01-04 Thread Anne van Kesteren

On Tue, 05 Jan 2010 08:29:53 +0100, Jonas Sicking jo...@sicking.cc wrote:

Wouldn't it then be better to throw for any non-ASCII characters? That
way we don't restrict ourselves for when (if?) the IETF defines an encoding
for HTTP headers.


The defined encoding is ISO-8859-1 (unfortunately).



At the very least, throwing if the upper byte is non-zero seems like
the right thing to do to prevent silent data loss.


That works for me.


--
Anne van Kesteren
http://annevankesteren.nl/



XHR LC comment: header encoding

2009-12-07 Thread Julian Reschke

Hi,

I think XHR needs to elaborate on how non-ASCII characters in request 
headers are put on the wire, and how non-ASCII characters in response 
headers are transformed back to Javascript characters.


For request headers, I would assume that the character encoding is 
ISO-8859-1, and if a character can't be encoded using ISO-8859-1, some 
kind of error handling occurs (ignore the character/ignore the 
header/throw?).


For response headers, I'd expect that the octet sequence is decoded 
using ISO-8859-1; so no specific error handling would be needed 
(although the result may be funny when the intended encoding was 
something different).


Best regards, Julian