Re: [whatwg] Stripping newlines from URI attributes
On Thu, 30 Jul 2009 02:49:01 +0200, Kartikaya Gupta lists.wha...@stakface.com wrote:

> It seems that most browsers do some sort of newline and tab removal from URI attributes. For example, if you have
>
> <img src="foo
> bar.jpg">
>
> browsers will still render the image called foobar.jpg despite the CRLF pair in the middle of the src attribute. The behavior actually seems a bit more complex; quote from one of my co-workers who investigated this:

Any chance you could also check whether this applies to CSS, XMLHttpRequest, HTTP Location, etc.? So far I've found that browsers use the same URL processor everywhere (though sometimes the URL character encoding flag is set to UTF-8 and cannot be changed). As such it would be nice to know if that is still true here or whether this is a pre-processing step specific to HTML attribute values.

--
Anne van Kesteren
http://annevankesteren.nl/
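One way to probe the question above, whether the stripping lives in the URL processor itself rather than in HTML attribute handling, is to call a URL parser directly with no HTML involved. The following is a minimal sketch, assuming a present-day JavaScript runtime that provides the `URL` constructor (an API that postdates this thread); the example.com addresses are placeholders, not taken from the discussion.

```javascript
// Assumption: a runtime with the global URL constructor (current browsers,
// Node.js). No HTML parsing is involved, so any stripping observed here
// comes from the URL parser itself.
const base = "http://example.com/dir/"; // placeholder base address

// CRLF in the middle of a relative reference, as in the img src example.
const inPath = new URL("foo\r\nbar.jpg", base);
console.log(inPath.pathname); // "/dir/foobar.jpg"

// Tab and CRLF inside the authority.
const inHost = new URL("http://exam\tple\r\n.com/x.png");
console.log(inHost.hostname); // "example.com"
```

If both values come back without tab or newline characters, the removal is happening in the parser rather than in an HTML-specific step. This says nothing about how the browsers of 2009 behaved.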
Re: [whatwg] Stripping newlines from URI attributes
On Wed, Jul 29, 2009 at 5:49 PM, Kartikaya Gupta lists.wha...@stakface.com wrote:

> It seems that most browsers do some sort of newline and tab removal from URI attributes. For example, if you have
>
> <img src="foo
> bar.jpg">
>
> browsers will still render the image called foobar.jpg despite the CRLF pair in the middle of the src attribute. The behavior actually seems a bit more complex; quote from one of my co-workers who investigated this:
>
> This behavior doesn't seem to be specced anywhere as far as I can tell. Assuming the WEBADDRESSES spec referred to in HTML5 is the one at http://www.w3.org/html/wg/href/draft.html that only says to trim leading/trailing whitespace and url-encode the rest. This doesn't seem to match existing behavior, so it should probably be updated.

How weird. Frankly, how insane. While I can believe that some browsers act like this, I would be quite surprised to find that they were compatible with each other. Indeed your tests seem to show they aren't.

This is an area where we should not attempt (and probably simply cannot) maintain compatibility with existing browsers. They're just too broken.

--
Elliotte Rusty Harold
elh...@ibiblio.org
Re: [whatwg] Stripping newlines from URI attributes
On Thu, Jul 30, 2009 at 2:37 PM, Elliotte Rusty Harold elh...@ibiblio.org wrote:

> On Wed, Jul 29, 2009 at 5:49 PM, Kartikaya Gupta lists.wha...@stakface.com wrote:
>> It seems that most browsers do some sort of newline and tab removal from URI attributes. For example, if you have
>>
>> <img src="foo
>> bar.jpg">
>>
>> browsers will still render the image called foobar.jpg despite the CRLF pair in the middle of the src attribute.
>
> [...]
>
> This is an area where we should not attempt (and probably simply cannot) maintain compatibility with existing browsers. They're just too broken.

We should attempt to maintain compatibility with existing content, and whitespace in URI attributes seems very common in existing content, e.g.:

http://www.topdogphotos.com/photo-gallery/gallery11.html (newlines in a href, img src)
http://www.sprig.com/coyuchi_george_or_thor_hooded_baby_towel (tabs and &#xD;&#xA; in img src)

and loads more.

--
Philip Taylor
exc...@gmail.com
Re: [whatwg] Stripping newlines from URI attributes
On Wed, Jul 29, 2009 at 6:49 PM, Kartikaya Gupta lists.wha...@stakface.com wrote:

> This behavior doesn't seem to be specced anywhere as far as I can tell. Assuming the WEBADDRESSES spec referred to in HTML5 is the one at http://www.w3.org/html/wg/href/draft.html that only says to trim leading/trailing whitespace and url-encode the rest. This doesn't seem to match existing behavior, so it should probably be updated.

RFC 3986, which is referenced in the Web addresses specification, states:

"In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may have to be added to break a long URI across lines. The whitespace should be ignored when the URI is extracted."

Firefox's behavior appears to be consistent with this.

-Alex
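Read literally, the RFC 3986 passage covers spaces as well as line breaks and tabs, which is broader than the tab/CR/LF removal reported for browsers. A minimal sketch of that literal reading follows; `extractUri` is a hypothetical helper name and the example.org address is a placeholder.

```javascript
// Hypothetical helper: ignore ALL whitespace when extracting a URI from
// surrounding text, per the literal wording of the RFC 3986 passage.
// Note: JavaScript's \s also matches form feed, vertical tab and Unicode
// space characters, which is broader than the RFC's list.
function extractUri(text) {
  return text.replace(/\s+/g, "");
}

const wrapped = "http://example.org/a/very/long/\r\n    path/to/image.jpg";
console.log(extractUri(wrapped)); // "http://example.org/a/very/long/path/to/image.jpg"

// Unlike tab/CR/LF-only stripping, an interior space is removed too.
console.log(extractUri("foo bar.jpg")); // "foobar.jpg"
```

The second call shows where this reading departs from the WEBADDRESSES draft discussed in the thread, which trims leading/trailing whitespace and url-encodes the rest.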
[whatwg] Stripping newlines from URI attributes
It seems that most browsers do some sort of newline and tab removal from URI attributes. For example, if you have

<img src="foo
bar.jpg">

browsers will still render the image called foobar.jpg despite the CRLF pair in the middle of the src attribute. The behavior actually seems a bit more complex; quote from one of my co-workers who investigated this:

<img id='bar' width="288" height="48" foo="abc
def" src="http://m.theglobeandmail.com/image-
server/img//rO0ABXQAS2Z7aHR0cDovL2JldGEuaW1hZ2VzLnRoZWdsb2JlYW5kbWFpbC5jb20vaW1hZ2VzL21v
YmlsZS9nYW1fZmxhZy5wbmd9dDBmMjg4dA==.png" alt="img" />
<script type="text/javascript">
alert( document.getElementById('bar').getAttribute('src').indexOf('\n') );
alert( document.getElementById('bar').src.indexOf('\n') );
</script>

Firefox and Safari both generate two alerts, 36 and -1.

It seems Mozilla ignores 0x09, 0x0a, 0x0d anywhere in the URI, whereas WebKit seems to ignore 0x09, 0x0a, 0x0d only in the path. Try putting a CRLF inside the authority and

alert( document.getElementById('bar').src.indexOf('\n') );

will return a value other than -1 in Safari, but Safari will still fetch the image. Firefox seems to return -1 all the time. Opera is like Firefox.

This behavior doesn't seem to be specced anywhere as far as I can tell. Assuming the WEBADDRESSES spec referred to in HTML5 is the one at http://www.w3.org/html/wg/href/draft.html that only says to trim leading/trailing whitespace and url-encode the rest. This doesn't seem to match existing behavior, so it should probably be updated.

On a related note, I was wondering if all these spin-off specs could be listed somewhere easy to find; it took me a while to locate the web addresses one and I had to use Google to find it. Putting a list at, say, http://www.whatwg.org/specs/ would be handy; or even better, the references section in the HTML5 spec could list them.

Thanks,
kats
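The Firefox/Opera behavior described above (0x09, 0x0a and 0x0d dropped anywhere in the URI) can be sketched as a plain string operation. This is an illustration of the observed result, not any browser's actual code; `stripTabsAndNewlines` is a hypothetical name, and the URL is a shortened stand-in for the one in the test case.

```javascript
// Hypothetical helper mirroring the observed Firefox/Opera result:
// tab (0x09), LF (0x0A) and CR (0x0D) are removed wherever they occur.
function stripTabsAndNewlines(value) {
  return value.replace(/[\t\n\r]/g, "");
}

// Shortened stand-in for the test URL, with a CRLF after "image-".
const raw = "http://m.theglobeandmail.com/image-\r\nserver/img.png";
const cleaned = stripTabsAndNewlines(raw);

console.log(raw.indexOf("\n"));     // 36, like getAttribute('src') in the test
console.log(cleaned.indexOf("\n")); // -1, like the resolved .src in the test
console.log(cleaned);               // "http://m.theglobeandmail.com/image-server/img.png"
```

The two indexes line up with the 36 and -1 alerts: the raw attribute value keeps the line break, while the resolved address does not. Reproducing the WebKit variant would require limiting the removal to the path component, which this sketch does not attempt.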