Re: Should Decode accept U+FFFE or U+FFFF (and other Unicode non-characters)?

2011-07-15 Thread Allen Wirfs-Brock

On Jul 14, 2011, at 10:38 PM, Jeff Walden wrote:

 Reraising this issue...
 
 To briefly repeat: Decode, called by decodeURI{,Component}, says to reject 
 %ab%cd%ef sequences whose octets [do] not contain a valid UTF-8 encoding of 
 a Unicode code point.  It appears browsers interpret this requirement as: 
 reject overlong UTF-8 sequences, and otherwise reject only unpaired or 
 mispaired surrogate code points.  Is this exactly what ES5 requires?  And 
 if it is, should it be?  Firefox has also treated otherwise-valid-looking 
 encodings of U+FFFE and U+FFFF as specifying that the replacement character 
 U+FFFD be used.  And the rationale for rejecting U+FFF{E,F} also seems to 
 apply to the non-character range [U+FDD0, U+FDEF] and U+xyFFF{E,F}.  Table 21 
 seems to say only malformed encodings and bad surrogates should be rejected, 
 but valid encoding of a code point is arguably unclear.

I haven't swapped back in my technical understanding of the subtleties of UTF-8 
encodings yet today, so I'm not yet prepared to try to provide a technical 
response.  But I think I can speak to the intent of the spec (or at least the 
ES5 version):

1) these are legacy functions that have been in browser JS implementations at 
least since ES3 days.  We didn't want to change them in any incompatible way.
2) As with RegExp and other similar issues, browser reality (well, legacy 
browser reality, maybe not newbies) is more important than what the spec. 
actually says.  If browsers all do something different from the spec., then the 
spec. should be updated accordingly. However, for ES5 we didn't do any deep 
analysis of this browser reality, so we might have missed something.
3) The intent is pretty clearly stated in the last paragraph of the note that 
includes Table 21 (BTW, since the table is in a note it isn't normative).  It 
essentially says to throw an exception when decoding anything that RFC 3629 says 
is not a valid UTF-8 encoding. 


I would prioritize #3 after #1 and #2.  If there is consistent behavior in all 
major browsers that dates from before ES5, then that is the behavior that should be 
followed (and the spec. updated if necessary). If there is disagreement among 
those legacy browsers, then I would simply follow the ES5 spec. unless it does 
something that is contrary to RFC 3629.  If it does, then we need to think 
about whether we have a spec. bug.
 
 At least one person interested in Firefox's decoding implementation argues 
 that not rejecting or replacing U+FFF{E,F} is a potential security 
 vulnerability because those code points (particularly U+FFFE) might confuse 
 code into interpreting a sequence of code points with the wrong endianness.  
 I find the argument unpersuasive and the potential harm too speculative 
 (particularly as no other browser replaces or rejects U+FFF{E,F}).  But the 
 point's been raised, and it's at least somewhat plausible, so I'd like to see 
 it conclusively addressed.


It's just a transformation from one JS string to another. It can't do anything 
that hand-written JS code couldn't do.  How would this be any more of a problem 
than simply providing the code points that the bogus sequence would be 
incorrectly interpreted as?  That said, #3 above does say that the intent is 
to reject anything that is not valid UTF-8. 
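To make that point concrete, here is a minimal sketch (variable names are illustrative, and I'm assuming a host whose decodeURIComponent accepts the sequence, as the test262 tests expect): whatever string Decode could hand back is already constructible with a plain string literal.

```javascript
// Decode maps one JS string to another, so any code point it could emit is
// already available to hand-written code without the decoder's help.
var viaLiteral = "\uFFFE"; // no decoder involved

var viaDecode;
try {
  viaDecode = decodeURIComponent("%EF%BF%BE"); // UTF-8 bytes EF BF BE = U+FFFE
} catch (e) {
  viaDecode = null; // an implementation taking the strict RFC 3629 reading throws
}

// Wherever the decode succeeds, the result is indistinguishable from the literal.
```

So rejecting U+FFFE in Decode denies the attacker nothing that a string literal doesn't already provide.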


Allen
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Should Decode accept U+FFFE or U+FFFF (and other Unicode non-characters)?

2011-07-14 Thread Jeff Walden

Reraising this issue...

To briefly repeat: Decode, called by decodeURI{,Component}, says to reject %ab%cd%ef sequences whose octets 
[do] not contain a valid UTF-8 encoding of a Unicode code point.  It appears browsers interpret 
this requirement as: reject overlong UTF-8 sequences, and otherwise reject only unpaired or mispaired 
surrogate code points.  Is this exactly what ES5 requires?  And if it is, should it be?  Firefox 
has also treated otherwise-valid-looking encodings of U+FFFE and U+FFFF as specifying that the replacement 
character U+FFFD be used.  And the rationale for rejecting U+FFF{E,F} also seems to apply to the 
non-character range [U+FDD0, U+FDEF] and U+xyFFF{E,F}.  Table 21 seems to say only malformed encodings and bad 
surrogates should be rejected, but valid encoding of a code point is arguably unclear.
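The observed interpretation described above can be probed directly from a console (a sketch; the helper name is illustrative, and I'm assuming behavior as seen in current engines):

```javascript
// Probe how a host's decodeURIComponent treats the borderline sequences:
// well-formed multi-byte, overlong, lone surrogate, and noncharacter.
function tryDecode(s) {
  try {
    return decodeURIComponent(s);
  } catch (e) {
    return e.name; // "URIError" for rejected sequences
  }
}

tryDecode("%E4%B8%AD"); // "\u4E2D" -- well-formed three-byte sequence
tryDecode("%C0%AF");    // "URIError" -- overlong encoding of U+002F
tryDecode("%ED%A0%80"); // "URIError" -- lone surrogate U+D800
tryDecode("%EF%BF%BE"); // "\uFFFE" -- noncharacter, yet accepted
```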

At least one person interested in Firefox's decoding implementation argues that not 
rejecting or replacing U+FFF{E,F} is a potential security vulnerability 
because those code points (particularly U+FFFE) might confuse code into interpreting a 
sequence of code points with the wrong endianness.  I find the argument unpersuasive and 
the potential harm too speculative (particularly as no other browser replaces or rejects 
U+FFF{E,F}).  But the point's been raised, and it's at least somewhat plausible, so I'd 
like to see it conclusively addressed.

A last note: two test262 tests directly exercise the Decode algorithm 
and expect that these two characters decode to U+FFF{E,F}.  (I think at a 
glance they might also allow throwing, though it's not clear to me that's 
intentional.)

http://hg.ecmascript.org/tests/test262/file/b4690e1408ee/test/suite/sputnik_converted/15_Native/15.1_The_Global_Object/15.1.3_URI_Handling_Function_Properties/15.1.3.1_decodeURI/S15.1.3.1_A2.4_T1.js
http://hg.ecmascript.org/tests/test262/file/b4690e1408ee/test/suite/sputnik_converted/15_Native/15.1_The_Global_Object/15.1.3_URI_Handling_Function_Properties/15.1.3.2_decodeURIComponent/S15.1.3.2_A2.4_T1.js

Jeff


Should Decode accept U+FFFE or U+FFFF (and other Unicode non-characters)?

2009-10-08 Thread Jeff Walden

I was looking at how SpiderMonkey decodes URI-encoded strings, specifically to update it to reject 
overlong UTF-8 sequences per ES5 (breaking change from ES3 that should generally be agreed to have 
been necessary, not to mention that existing implementations were loose and strict inconsistently). 
 After SpiderMonkey made that change I noticed some non-standard extra behavior: U+FFFE and U+FFFF 
decode to the replacement character.  ES5 doesn't say to do this -- the decode table categorizes 
only [0xD800, 0xDFFF] as invalid (when not in a surrogate pair) and resulting in a URIError.  
(Prose in Decode says If Octets does not contain a valid UTF-8 encoding of a Unicode code 
point, which might be interpretable as saying that the UTF-8 encoding of U+FFFE 
isn't valid and therefore a URIError must be thrown, if you squinted.)
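The SpiderMonkey extension described above can be sketched on top of the host decoder (an assumption-laden illustration: the wrapper name is mine, and I'm assuming the host's decodeURIComponent implements the ES5 Decode algorithm):

```javascript
// Malformed sequences and lone surrogates still throw URIError here; only
// U+FFFE/U+FFFF are additionally mapped to the replacement character U+FFFD.
function decodeWithReplacement(uriComponent) {
  return decodeURIComponent(uriComponent).replace(/[\uFFFE\uFFFF]/g, "\uFFFD");
}

decodeWithReplacement("%EF%BF%BE"); // "\uFFFD" rather than "\uFFFE"
```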

U+FFFF is not a valid Unicode character, and U+FFFE conceivably could confuse 
Unicode decoders into decoding with the wrong endianness under the right 
circumstances.  Theoretically, at least.  Might it make sense to throw a 
URIError upon encountering them (and perhaps also the non-characters [U+FDD0, 
U+FDEF], and maybe even the code points which are >= 0xFFFE mod 0x10000 as well)?
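A hypothetical predicate for that full set (the function name is mine, not anything in the spec): U+FDD0..U+FDEF plus the last two code points of each of the 17 planes, i.e. those congruent to 0xFFFE or 0xFFFF mod 0x10000.

```javascript
// Assumes codePoint is a valid Unicode code point (<= 0x10FFFF).
// The mask test checks whether the low 16 bits are 0xFFFE or 0xFFFF.
function isNoncharacter(codePoint) {
  return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF) ||
         (codePoint & 0xFFFE) === 0xFFFE;
}

isNoncharacter(0xFFFE);  // true
isNoncharacter(0x1FFFF); // true -- last code point of plane 1
isNoncharacter(0xFDD0);  // true
isNoncharacter(0x0041);  // false
```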

Jeff