Re: Proposal: Document.parse() [AKA: Implied Context Parsing]
E4H doesn't address all the use cases of Document.parse(). It doesn't solve the problem of existing templating libraries constructing DOM fragments from processed templates. E4H (or something similar) would be great, but I think it's a mistake to make it mutually exclusive with Document.parse().

On Tue, Jun 5, 2012 at 11:24 PM, Ian Hickson i...@hixie.ch wrote:
> On Mon, 4 Jun 2012, Adam Barth wrote:
> > > http://www.hixie.ch/specs/e4h/strawman
> > > Who wants to be first to implement it?
> >
> > Doesn't e4h have the same security problems as e4x?
>
> As written it did, yes (specifically, if you can inject content into an XML file you can cause it to run JS under your control in your origin with content from the other origin). However, as Anne and you have said, it's easy to fix, either by using an XML-incompatible syntax or using CORS to disable it. Since we have to disable it in Workers anyway, I'd go with disabling it when there's no CORS. Strawman has been updated accordingly.
>
> On Tue, 5 Jun 2012, Anne van Kesteren wrote:
> > A (bigger?) problem with E4H/H4E is that TC39 does not like it:
> > http://lists.w3.org/Archives/Public/public-script-coord/2011OctDec/thread.html#msg33
>
> What matters is what implementors want to do.

The TC-39 spec process isn't the problem here. TC-39 is composed of implementors, and they are clearly stating a preference for quasis.

> --
> Ian Hickson               U+1047E                )\._.,--,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Re: Proposal: Document.parse() [AKA: Implied Context Parsing]
On Tue, Jun 5, 2012 at 12:58 AM, Anne van Kesteren ann...@annevk.nl wrote:
> On Tue, Jun 5, 2012 at 2:10 AM, Adam Barth w...@adambarth.com wrote:
> > Doesn't e4h have the same security problems as e4x?
>
> If you mean http://code.google.com/p/doctype-mirror/wiki/ArticleE4XSecurity I guess that would depend on how we define it.

By the way, it occurs to me that we can solve these security problems if we restrict the syntax to only working when executing inline or via <script crossorigin src="...">. If the script has appropriate CORS headers, then it doesn't matter if we leak its contents because they're already readable by the document executing the script.

Adam
Re: Proposal: Document.parse() [AKA: Implied Context Parsing]
On Tue, Jun 5, 2012 at 2:10 AM, Adam Barth w...@adambarth.com wrote:
> Doesn't e4h have the same security problems as e4x?

If you mean http://code.google.com/p/doctype-mirror/wiki/ArticleE4XSecurity I guess that would depend on how we define it.

A (bigger?) problem with E4H/H4E is that TC39 does not like it:
http://lists.w3.org/Archives/Public/public-script-coord/2011OctDec/thread.html#msg33

I'm not as optimistic as them that quasis just solve this (not compile time, nobody has actually written out the safehtml definition), but TC39 being against it does not help.

--
Anne — Opera Software
http://annevankesteren.nl/
http://www.opera.com/
Re: Proposal: Document.parse() [AKA: Implied Context Parsing]
On Tue, Jun 5, 2012 at 11:02 AM, Adam Barth w...@adambarth.com wrote:
> On Tue, Jun 5, 2012 at 2:10 AM, Adam Barth w...@adambarth.com wrote:
> If you mean http://code.google.com/p/doctype-mirror/wiki/ArticleE4XSecurity I guess that would depend on how we define it.
>
> By the way, it occurs to me that we can solve these security problems if we restrict the syntax to only working when executing inline or via <script crossorigin src="...">. If the script has appropriate CORS headers, then it doesn't matter if we leak its contents because they're already readable by the document executing the script.

It would also have to be disabled for workers until we have DOM access there...

--
Anne — Opera Software
http://annevankesteren.nl/
http://www.opera.com/
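The restrictions discussed in this sub-thread (inline scripts, or external scripts loaded with the crossorigin attribute and passing CORS, and never in workers) could be summarised as a single check. This is only a rough sketch: the function name and the fields of the `script` object are hypothetical and do not come from the strawman or any spec.

```javascript
// Hypothetical policy check for whether E4H-style literals would be enabled.
// All field names (inWorker, inline, crossoriginAttr, corsCheckPassed) are
// illustrative assumptions, not part of any specification.
function e4hLiteralsAllowed(script) {
  // Workers have no DOM access, so the syntax would be disabled there.
  if (script.inWorker) return false;
  // Inline script text is already part of the executing document.
  if (script.inline) return true;
  // External script: allowed only when requested with the crossorigin
  // attribute and the CORS check succeeded, so its contents are already
  // readable by the document and nothing new is leaked.
  return Boolean(script.crossoriginAttr && script.corsCheckPassed);
}

console.log(e4hLiteralsAllowed({ inline: true, inWorker: false }));
console.log(e4hLiteralsAllowed({ inline: false, crossoriginAttr: false }));
```

Under these assumptions a plain cross-origin script without CORS falls through to the last line and is rejected, which is the case the E4X security article is concerned with.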
Re: Proposal: Document.parse() [AKA: Implied Context Parsing]
On Mon, 4 Jun 2012, Adam Barth wrote:
> > http://www.hixie.ch/specs/e4h/strawman
> > Who wants to be first to implement it?
>
> Doesn't e4h have the same security problems as e4x?

As written it did, yes (specifically, if you can inject content into an XML file you can cause it to run JS under your control in your origin with content from the other origin). However, as Anne and you have said, it's easy to fix, either by using an XML-incompatible syntax or using CORS to disable it. Since we have to disable it in Workers anyway, I'd go with disabling it when there's no CORS. Strawman has been updated accordingly.

On Tue, 5 Jun 2012, Anne van Kesteren wrote:
> A (bigger?) problem with E4H/H4E is that TC39 does not like it:
> http://lists.w3.org/Archives/Public/public-script-coord/2011OctDec/thread.html#msg33

What matters is what implementors want to do.

--
Ian Hickson               U+1047E                )\._.,--,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Re: Proposal: Document.parse() [AKA: Implied Context Parsing]
On Fri, 25 May 2012, Rafael Weinstein wrote:
> Now's the time to raise objections to UA's adding support for this feature.

For the record, I very much object to Document.parse(). I think it's a terrible API. We should IMHO resolve the use case of "generate a DOM tree from script" using a much more robust solution that has compile-time syntax checking and so forth, rather than relying on the super-hacky "concatenate a bunch of strings and then parse them" solution that authors are forced to use today.

innerHTML and document.write() are abominations unto computer science, and we are doing nobody any favours by continuing the platform down this road. They lead to programming styles that are rife with injection bugs (XSS), they are extremely difficult to debug and maintain, and they are terribly complicated to implement compared to more structured alternatives.

The core reasons for these problems, IMHO, are two-fold:

1. Lack of compile-time syntax checking, which leads to typos not being caught and thus programmer intent not being faithfully represented, and

2. Putting markup syntax and data at the same level, instead of separating them as with other features in JS.

For example, this kind of bug is easy to introduce and hard to spot or debug:

   var heading = '<h1>Hello</h1>';
   // ...
   div.innerHTML = '<h1>' + heading + '</h1>';

Even worse are things like typos:

   tr.innerHTML = '<td>' + c1 + '</td><td>' + c2 + '</td><dt>' + c3 + '</td>;

Compile-time syntax checking makes this a non-issue. Making data variables be qualitatively different than the syntax also solves problems, e.g.:

   var title = "I hate </p> tags.";
   // ...
   div.innerHTML = "<p>Today's topic is: " + title + "</p>"; // oops, not escaped

There have been several alternative proposals; my personal favourite is Anne's E4H solution, basically E4X but simplified just for HTML, which I've written a strawman spec for here:

   http://www.hixie.ch/specs/e4h/strawman

I'm happy to write a more serious spec for this if this is something anyone is interested in implementing.

The above examples become much easier to debug. The first one results in very ugly markup visible in the output of the page rather than in the weird spacing:

   var heading = '<h1>Hello</h1>';
   // ...
   div.appendChild(<h1>{heading}</h1>);

The second results in a compile-time syntax error so would be caught even before the code is reviewed:

   tr.appendChild(<><td>{c1}</td><td>{c2}</td><dt>{c3}</td></>);

The third becomes a non-issue because you don't need to escape text to keep it from being mistaken for markup [1]:

   var title = "I hate </p> tags.";
   // ...
   div.innerHTML = <p>Today's topic is: {title}</p>;

Other proposed solutions include Element.create(), which is less verbose than the DOM but still more verbose than innerHTML or E4H; and quasistrings, which still suffer from lack of compile-time checking and mix markup with data, but at least would be more structured than raw strings and could offer better injection protection.

[1] (This is not the same as auto-escaping strings in other contexts. For example, E4H doesn't propose to have CSS literals, so a string embedded in a style="" attribute wouldn't be automagically safe.)

--
Ian Hickson               U+1047E                )\._.,--,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
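The third innerHTML example in the message above can be reproduced without a DOM. What follows is a minimal sketch and not part of the original message: `escapeHtml` is a hypothetical helper, and the code shows the manual escaping that string concatenation requires today, not E4H itself.

```javascript
// Hypothetical helper: escape the characters that are significant in HTML
// text content and quoted attribute values. "&" must be replaced first so
// the entities produced by the later replacements are not escaped again.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

const title = 'I hate </p> tags.';

// Markup and data at the same level: the "</p>" inside the data would close
// the paragraph early once this string is parsed as HTML.
const unsafe = "<p>Today's topic is: " + title + '</p>';

// Data kept distinct from markup by escaping it before concatenation.
const safe = "<p>Today's topic is: " + escapeHtml(title) + '</p>';

console.log(unsafe);
console.log(safe);
```

Nothing in the language forces the call to `escapeHtml`, which is the point being made about markup and data sharing one level: forgetting it produces a string that still parses, just not as intended.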
Re: Proposal: Document.parse() [AKA: Implied Context Parsing]
Just to be clear: what you are objecting to is the addition of formal API for this. You're generally supportive of adding a template element whose contents would parse the way we're discussing here -- and given that, a webdev could trivially polyfill Document.parse().

I.e. you're ok with the approach of the parser picking a context element based on the contents of markup, but against giving webdevs the impression that innerHTML is good practice, by adding more API in that direction?

Put another way, though you're not happy with adding the API, are you willing to set that aside and help spec the parser changes required for both this and template element (assuming the remaining issues with template can be agreed upon)?

FWIW, I agree with Hixie in principle, but disagree in practice. I think innerHTML is generally to be avoided, but I feel that adding Document.parse() improves the situation by making some current uses (which aren't likely to go away) less hacky.

Also, I'm not as worried about webdevs taking the wrong message from us adding API. My feeling is that they just do what works best for them and don't think much about what we are or are not encouraging.

Also, I'm highly supportive of the goal of allowing HTML literals in script. I fully agree that better load (compile) time feedback would be beneficial to authors here.
Re: Proposal: Document.parse() [AKA: Implied Context Parsing]
On Mon, 4 Jun 2012, Rafael Weinstein wrote:
> Just to be clear: what you are objecting to is the addition of formal API for this. You're generally supportive of adding a template element whose contents would parse the way we're discussing here -- and given that, a webdev could trivially polyfill Document.parse().

Sure.

> I.e. you're ok with the approach of the parser picking a context element based on the contents of markup, but against giving webdevs the impression that innerHTML is good practice, by adding more API in that direction?

Right.

> Put another way, though you're not happy with adding the API, are you willing to set that aside and help spec the parser changes required for both this and template element (assuming the remaining issues with template can be agreed upon)?

I think template is important. If implementing that happens to make it easier for a script to implement a bad practice, so be it. (See my e-mail on the template thread for comments on that subject.)

> FWIW, I agree with Hixie in principle, but disagree in practice. I think innerHTML is generally to be avoided, but I feel that adding Document.parse() improves the situation by making some current uses (which aren't likely to go away) less hacky.

If we want to make things less hacky, let's actually make them less hacky, not introduce more APIs that suck.

> Also, I'm not as worried with webdevs taking the wrong message from us adding API. My feeling is that they just do what works best for them and don't think much about what we are or are not encouraging.

I strongly disagree on that. Whether consciously or not, we set the standard for what is good practice. I've definitely seen authors look at the standards community for leadership.

Just look at how authors adopted XHTML's syntax, even in the absence of actually using XHTML. It was such a tidal wave that we ended up actually changing HTML's conformance criteria to ignore the extra characters rather than say they were invalid. Why? Because XHTML was what the W3C was working on, so it must have been good, even though objectively it really added no semantics (literally nothing, the language was defined by deferring to HTML4) and the syntax changes were a net negative.

> Also, I'm highly supportive of the goal of allowing HTML literals in script. I fully agree that better load (compile) time feedback would be beneficial to authors here.

Let's do it! As far as I can tell, the impact on a JS parser would be pretty minimal.

   http://www.hixie.ch/specs/e4h/strawman

Who wants to be first to implement it?

--
Ian Hickson               U+1047E                )\._.,--,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Re: Proposal: Document.parse() [AKA: Implied Context Parsing]
On Mon, Jun 4, 2012 at 4:38 PM, Ian Hickson i...@hixie.ch wrote:
> Let's do it! As far as I can tell, the impact on a JS parser would be pretty minimal.
>
> http://www.hixie.ch/specs/e4h/strawman
>
> Who wants to be first to implement it?

Doesn't e4h have the same security problems as e4x?

Adam
Re: Proposal: Document.parse() [AKA: Implied Context Parsing]
On Fri, 25 May 2012 09:01:43 +0200, Rafael Weinstein rafa...@google.com wrote:
> Ok, so from consensus on earlier threads, here's the full API semantics. Now's the time to raise objections to UA's adding support for this feature.
>
> 1) The Document interface is extended to include a new method:
>
>    DocumentFragment parse (DOMString markup);
>
>    which:
>    - Invokes the fragment parsing algorithm with markup and an empty context element,
>    - Unmarks all scripts in the returned fragment node as "already started"
>    - Returns the fragment node
>
> 2) The fragment parsing algorithm's context element is now optional. Its behavior is similar to the case of a known context element, but the tokenizer is simply set to the data state
>
> 3) Resetting the insertion mode appropriately now sets the mode to Implied Context if parsing a fragment and no context element is set, and aborts.
>
> 4) A new Implied Context insertion mode is defined which
>    - Ignores doctype, end tag tokens
>    - Handles comment and character tokens as if in body
>    - Handles the following start tags as if in body (which is as if in head): style, script, link, meta
>    - Handles any other start tag by selecting a context element, resetting the insertion mode appropriately and reprocessing the token.
>
> 5) A new selecting a context element algorithm is defined which takes a start tag as input and outputs an element. The element's identity is as follows:
>    - If start tag is tbody, thead, tfoot, caption or colgroup return table
>    - if start tag is tr, return tbody
>    - if start tag is col return colgroup
>    - if start tag is td or td return tr
>    - if start tag is head or body return html
>    - if start tag is rp or rt return ruby

I think ruby is better handled by always making rp and rt generate implied end tags in the fragment case (maybe even when parsing normally, too). Making the context element ruby still doesn't make rt parse right, because the spec currently looks for ruby on the *stack* (and the context element isn't on the stack). Also, the ruby base is allowed to include markup, so this would fail:

   ruby.appendChild(document.parse('<span>foo</span><rt>bar<rt>baz'));

>    - if start tag is a defined SVG localName (case insensitive) return svg

Except those that conflict with HTML?

>    - if start tag is a defined MathML localName (case insensitive) return math

(Making the context element svg or math doesn't do anything currently: https://www.w3.org/Bugs/Public/show_bug.cgi?id=16635 )

>    - otherwise, return body

--
Simon Pieters
Opera Software
Re: Proposal: Document.parse() [AKA: Implied Context Parsing]
On Fri, May 25, 2012 at 12:32 AM, Simon Pieters sim...@opera.com wrote:
> On Fri, 25 May 2012 09:01:43 +0200, Rafael Weinstein rafa...@google.com wrote:
> > - if start tag is a defined SVG localName (case insensitive) return svg
>
> Except those that conflict with HTML?

Yes. Thank you. Item 5 should be:

5) A new selecting a context element algorithm is defined which takes a start tag as input and outputs an element. The element's identity is as follows:
   - If start tag is tbody, thead, tfoot, caption or colgroup return table
   - if start tag is tr, return tbody
   - if start tag is col return colgroup
   - if start tag is td or th return tr
   - if start tag is head or body return html
   - if start tag is rp or rt return ruby
   - if start tag is a defined HTML localName (case insensitive) return body
   - if start tag is a defined SVG localName (case insensitive) return svg
   - if start tag is a defined MathML localName (case insensitive) return math
   - otherwise, return body
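The revised item 5 reads naturally as a lookup function. The sketch below is one possible reading of it, not anything proposed in the thread: the function name and the three name tables are hypothetical and deliberately incomplete, it uses "td or th" as corrected elsewhere in this thread, and it returns the context element's name as a string rather than an element.

```javascript
// Hypothetical, deliberately tiny name tables. A real implementation would
// use the complete HTML, SVG and MathML element lists.
const HTML_NAMES = new Set(['a', 'div', 'p', 'span', 'title']);
const SVG_NAMES = new Set(['a', 'circle', 'g', 'path', 'rect', 'title']);
const MATHML_NAMES = new Set(['mi', 'mn', 'mo', 'mrow']);

// Returns the name of the context element to use for a given start tag,
// following the order of the rules in the revised item 5.
function selectContextElement(startTag) {
  const name = startTag.toLowerCase(); // matching is case insensitive
  if (['tbody', 'thead', 'tfoot', 'caption', 'colgroup'].includes(name)) return 'table';
  if (name === 'tr') return 'tbody';
  if (name === 'col') return 'colgroup';
  if (name === 'td' || name === 'th') return 'tr';
  if (name === 'head' || name === 'body') return 'html';
  if (name === 'rp' || name === 'rt') return 'ruby';
  // HTML is checked before SVG, so names defined in both (such as "a" or
  // "title") resolve to an HTML context.
  if (HTML_NAMES.has(name)) return 'body';
  if (SVG_NAMES.has(name)) return 'svg';
  if (MATHML_NAMES.has(name)) return 'math';
  return 'body';
}

console.log(selectContextElement('td'));
console.log(selectContextElement('circle'));
console.log(selectContextElement('a'));
```

Putting the HTML check ahead of the SVG check is what answers the "except those that conflict with HTML?" question: with these tables, `a` yields `body` while `circle` yields `svg`.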
Re: Proposal: Document.parse() [AKA: Implied Context Parsing]
On Fri, May 25, 2012 at 3:01 AM, Rafael Weinstein rafa...@google.com wrote:
> - if start tag is td or td

typo: th