Re: [whatwg] Stripping newlines from URI attributes

2009-07-30 Thread Anne van Kesteren
On Thu, 30 Jul 2009 02:49:01 +0200, Kartikaya Gupta
<lists.wha...@stakface.com> wrote:
> It seems that most browsers do some sort of newline and tab removal from
> URI attributes. For example, if you have
>
> <img src="foo
> bar.jpg">
>
> browsers will still render the image called foobar.jpg despite the
> CRLF pair in the middle of the src attribute. The behavior actually
> seems a bit more complex; quote from one of my co-workers who
> investigated this:

Any chance you could also check whether this applies to CSS, XMLHttpRequest, 
HTTP Location, etc.? So far I've found that browsers use the same URL processor 
everywhere (though sometimes the URL character encoding flag is set to UTF-8 
and cannot be changed). As such it would be nice to know if that is still true 
here or whether this is a pre-processing step specific to HTML attribute values.


-- 
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Stripping newlines from URI attributes

2009-07-30 Thread Elliotte Rusty Harold
On Wed, Jul 29, 2009 at 5:49 PM, Kartikaya
Gupta <lists.wha...@stakface.com> wrote:
> It seems that most browsers do some sort of newline and tab removal from URI
> attributes. For example, if you have
>
> <img src="foo
> bar.jpg">
>
> browsers will still render the image called foobar.jpg despite the CRLF
> pair in the middle of the src attribute. The behavior actually seems a bit
> more complex; quote from one of my co-workers who investigated this:
>
> This behavior doesn't seem to be specced anywhere as far as I can tell.
> Assuming the WEBADDRESSES spec referred to in HTML5 is the one at
> http://www.w3.org/html/wg/href/draft.html that only says to trim
> leading/trailing whitespace and url-encode the rest. This doesn't seem to
> match existing behavior, so it should probably be updated.

How weird. Frankly how insane. While I can believe that some browsers
act like this, I would be quite surprised to find that they were
compatible with each other. Indeed your tests seem to show they
aren't.

This is an area where we should not attempt (and probably simply
cannot) maintain compatibility with existing browsers. They're just
too broken.

-- 
Elliotte Rusty Harold
elh...@ibiblio.org


Re: [whatwg] Stripping newlines from URI attributes

2009-07-30 Thread Philip Taylor
On Thu, Jul 30, 2009 at 2:37 PM, Elliotte Rusty
Harold <elh...@ibiblio.org> wrote:
> On Wed, Jul 29, 2009 at 5:49 PM, Kartikaya
> Gupta <lists.wha...@stakface.com> wrote:
>> It seems that most browsers do some sort of newline and tab removal from URI
>> attributes. For example, if you have
>>
>> <img src="foo
>> bar.jpg">
>>
>> browsers will still render the image called foobar.jpg despite the CRLF
>> pair in the middle of the src attribute.
> [...]
>
> This is an area where we should not attempt (and probably simply
> cannot) maintain compatibility with existing browsers. They're just
> too broken.

We should attempt to maintain compatibility with existing content, and
whitespace in URI attributes seems very common in existing content,
e.g.:

http://www.topdogphotos.com/photo-gallery/gallery11.html (newlines in
a href, img src)

http://www.sprig.com/coyuchi_george_or_thor_hooded_baby_towel (tabs
and &#xD;&#xA; in img src)

and loads more.

-- 
Philip Taylor
exc...@gmail.com


Re: [whatwg] Stripping newlines from URI attributes

2009-07-30 Thread Alex Henrie
On Wed, Jul 29, 2009 at 6:49 PM, Kartikaya Gupta
<lists.wha...@stakface.com> wrote:

> This behavior doesn't seem to be specced anywhere as far as I can tell.
> Assuming the WEBADDRESSES spec referred to in HTML5 is the one at
> http://www.w3.org/html/wg/href/draft.html that only says to trim
> leading/trailing whitespace and url-encode the rest. This doesn't seem to
> match existing behavior, so it should probably be updated.


RFC 3986, which is referenced in the Web addresses specification, states: "In
some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may have to
be added to break a long URI across lines. The whitespace should be ignored
when the URI is extracted." Firefox's behavior appears to be consistent with
this.

-Alex


[whatwg] Stripping newlines from URI attributes

2009-07-29 Thread Kartikaya Gupta
It seems that most browsers do some sort of newline and tab removal from URI 
attributes. For example, if you have

<img src="foo
bar.jpg">

browsers will still render the image called foobar.jpg despite the CRLF pair 
in the middle of the src attribute. The behavior actually seems a bit more 
complex; quote from one of my co-workers who investigated this:

 <img id='bar' width=288 height=48 foo="abc
 def" src="http://m.theglobeandmail.com/image-
 server/img//rO0ABXQAS2Z7aHR0cDovL2JldGEuaW1hZ2VzLnRoZWdsb2JlYW5kbWFpbC5jb20vaW1hZ2VzL21v
 YmlsZS9nYW1fZmxhZy5wbmd9dDBmMjg4dA==.png" alt=img />

 <script type=text/javascript>
 alert( document.getElementById('bar').getAttribute('src').indexOf('\n') );
 alert( document.getElementById('bar').src.indexOf('\n') );
 </script>

 Firefox and Safari both generate two alerts, 36 and -1.

 It seems Mozilla ignores 0x09, 0x0a, 0x0d anywhere in the URI,
 whereas WebKit seems to ignore 0x09, 0x0a, 0x0d only in the path.

 Try putting a CRLF inside the authority, and
 alert( document.getElementById('bar').src.indexOf('\n') );

 will return something other than -1 in Safari, although Safari will still
 fetch the image. Firefox seems to return -1 all the time.

 Opera is like Firefox.
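
For illustration, the two behaviors reported above could be modeled roughly as follows. This is only a sketch under assumptions: the function names are made up, the regular expression is a simplification rather than any browser's actual parser, "path" is approximated as everything after the authority, and the sample URL is a shortened stand-in for the one in the test.

```javascript
// Hypothetical helpers for illustration; not taken from any browser's source.
const TAB_LF_CR = /[\t\n\r]/g; // 0x09, 0x0a, 0x0d

// Mozilla/Opera-like model: drop tab, LF and CR anywhere in the URI.
function stripEverywhere(uri) {
  return uri.replace(TAB_LF_CR, '');
}

// WebKit-like model: leave scheme and authority untouched and drop
// tab, LF and CR only from what follows the authority.
function stripInPathOnly(uri) {
  const m = /^([a-z][a-z0-9+.-]*:\/\/[^\/]*)([\s\S]*)$/i.exec(uri);
  if (!m) {
    // No "scheme://authority" prefix: treat the whole value as a path.
    return uri.replace(TAB_LF_CR, '');
  }
  return m[1] + m[2].replace(TAB_LF_CR, '');
}

// CRLF inside the path, as in the test above (shortened stand-in URL).
const inPath = 'http://m.theglobeandmail.com/image-\r\nserver/img.png';
console.log(inPath.indexOf('\n'));                   // 36
console.log(stripEverywhere(inPath).indexOf('\n'));  // -1
console.log(stripInPathOnly(inPath).indexOf('\n'));  // -1

// CRLF inside the authority: only the "strip everywhere" model removes it.
const inAuthority = 'http://m.theglobe\r\nandmail.com/img.png';
console.log(stripEverywhere(inAuthority).indexOf('\n'));        // -1
console.log(stripInPathOnly(inAuthority).indexOf('\n') !== -1); // true
```

The 36 matches the first alert because the CR sits at index 35 and the LF at index 36 of the raw attribute value.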

This behavior doesn't seem to be specced anywhere as far as I can tell. 
Assuming the WEBADDRESSES spec referred to in HTML5 is the one at 
http://www.w3.org/html/wg/href/draft.html that only says to trim 
leading/trailing whitespace and url-encode the rest. This doesn't seem to match 
existing behavior, so it should probably be updated.
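
To make the mismatch concrete, here is a rough sketch contrasting the draft's rule as summarized above (trim leading/trailing whitespace, percent-encode the rest) with the stripping that browsers appear to do. Both function names are hypothetical, and the whitespace set used for trimming is an assumption rather than a quotation from the draft.

```javascript
// Hypothetical helper: the draft's rule as summarized above.
// Assumed whitespace set: space, tab, LF, FF, CR.
function trimAndEncode(value) {
  return value
    .replace(/^[ \t\n\f\r]+|[ \t\n\f\r]+$/g, '')
    .replace(/[ \t\n\f\r]/g, (c) =>
      '%' + c.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0'));
}

// Hypothetical helper: the behavior observed in browsers for tab, LF and CR.
function stripTabsAndNewlines(value) {
  return value.replace(/[\t\n\r]/g, '');
}

console.log(trimAndEncode('foo\r\nbar.jpg'));        // foo%0D%0Abar.jpg
console.log(stripTabsAndNewlines('foo\r\nbar.jpg')); // foobar.jpg
```

Under the first rule the image request would go to a different resource name than the foobar.jpg that browsers actually fetch.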

On a related note, I was wondering if all these spin-off specs could be 
listed somewhere easy to find; it took me a while to locate the web addresses 
one and I had to use google to find it. Putting a list at, say, 
http://www.whatwg.org/specs/ would be handy; or even better, the references 
section in the HTML5 spec could list them.

Thanks,
kats