Re: [whatwg] Sandboxing to accommodate user generated content.

Lachlan Hunt Tue, 17 Jun 2008 13:11:45 -0700

Frode Børli wrote:

I have been reading up on past discussions on sandboxing content, and
I feel that it is generally agreed on that there should be some
mechanism for marking content as "user generated". The discussion
mainly appears to be focused on implementation. Please read my
implementation notes at the end of this message on how we can include
this function safely for both HTML 4 and HTML 5 browsers, and still
allow HTML 4 browsers to function properly.


My main arguments for having this feature (in one form or another) in
the browser are:

- It is future proof. Changes to browsers (for example adding
expression support to css) will never again require old sanitizers to
be updated.

If the sanitiser uses a whitelist based approach that forbids everything
by default, and then only allows known elements and attributes; and in
the case of the style attribute, known properties and values that are
safe, then that would also be the case.
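
As a rough illustration of the whitelist idea for the style attribute, a
minimal sketch might look like the following. The ALLOWED_STYLE table and
the sanitizeStyle name are hypothetical, and the patterns are far from
complete; the point is only that anything not explicitly listed is
dropped, so a newly invented property or value syntax is rejected without
the sanitiser needing an update.

```javascript
// Hypothetical allowlist: property name -> pattern the whole value must match.
const ALLOWED_STYLE = {
  "color": /^(#[0-9a-f]{3}|#[0-9a-f]{6}|[a-z]+)$/i,
  "font-weight": /^(normal|bold)$/i,
  "text-align": /^(left|right|center|justify)$/i,
};

// Keep only declarations whose property AND value are both known to be safe.
// Everything else is forbidden by default.
function sanitizeStyle(styleText) {
  const kept = [];
  for (const decl of String(styleText).split(";")) {
    const colon = decl.indexOf(":");
    if (colon === -1) continue; // not a "property: value" pair
    const prop = decl.slice(0, colon).trim().toLowerCase();
    const value = decl.slice(colon + 1).trim();
    const known = Object.prototype.hasOwnProperty.call(ALLOWED_STYLE, prop);
    if (known && ALLOWED_STYLE[prop].test(value)) {
      kept.push(prop + ": " + value);
    }
  }
  return kept.join("; ");
}
```

With this sketch, sanitizeStyle("color: red; width: expression(alert(1))")
keeps only "color: red", because neither the width property nor the
expression() syntax appears in the table.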

- It does not require much skill and effort from the web developer to
safely sanitize user content.
- Security bugs are fixed by browser vendors, and not by each web developer.

Note that sandboxing doesn't entirely remove the need for sanitising
user generated content on the server, it's just an extra line of defence
in case something slips through.

The suggested solution of using an attribute on an <iframe> element
for storing the user generated content has several problems;

1: The use of src= as a fallback means that style information will be
lost and stylesheets must be loaded again.

This is not a major problem. If it uses the same stylesheet, which can
be cached by the browser, then at worst it results in a 304 Not Modified
response.

2: The use of src= yields problems with iframe heights (since the
src-url must be hosted on another server javascript cannot fix this)
and HTML 4 browsers have no other method of adjusting the iframe
height according to the content.

In recent browsers that support cross-document messaging (Opera 9,
Safari 3, Firefox 3 and IE 8), you could include a script within the
comment page that calculates its own height and sends a message to the
parent page with the info. In older browsers, just set the height to a
reasonable minimum and let the user scroll. Sure, it's not perfect, but
it's called graceful degradation.
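
A minimal sketch of that messaging approach, assuming postMessage with a
target origin and a plain string message. The two origins, the "height:"
message format and the function names are all hypothetical; the window
and iframe are passed in as arguments so the logic stays easy to follow.

```javascript
// Hypothetical origins for the embedding site and the comment host.
const PARENT_ORIGIN = "https://blog.example";
const COMMENT_ORIGIN = "https://comments.example";

// In the comment page (inside the iframe): measure our own height and
// tell the parent page about it.
function reportHeight(win) {
  const height = win.document.documentElement.scrollHeight;
  win.parent.postMessage("height:" + height, PARENT_ORIGIN);
}

// In the parent page: resize the iframe, but only for well-formed
// messages that really come from the comment host.
function handleHeightMessage(event, iframe) {
  if (event.origin !== COMMENT_ORIGIN) return false;
  const match = /^height:(\d+)$/.exec(String(event.data));
  if (!match) return false;
  iframe.style.height = match[1] + "px";
  return true;
}

// Browser wiring (shown as comments, since it needs a real page):
//   window.addEventListener("load", () => reportHeight(window));
//   window.addEventListener("message", (e) => handleHeightMessage(e, frame));
```

Checking event.origin in the parent matters here: without it, any page
that can obtain a reference to the window could send a height message.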

3: If you have a page that lists 60 comments on a blog, then the user
agent would have to contact the server 60 times to fetch each comment.
This again means that perl/php scripts have to be invoked 60 times for
one page view - that is 61 separate database connections and session
initializations.

You could always concatenate all of the comments into a single file,
reducing it down to 1 request.

4: For the fallback method of using src= for HTML 4 browsers to
actually work, the fallback documents must be hosted on a separate
domain name. This again means that a website using HTTPS must purchase
and maintain two certificates.


I don't see that as a show stopper.

My solution:

If we add a new element <htmlarea></htmlarea>, old browsers will run
scripts, while new browsers will stop scripts and this is a major
problem.

If HTML 5 browsers require everything between <htmlarea></htmlarea> to
be html entity escaped, that is < and > must be replaced with &lt; and
&gt; respectively. If this is not done, HTML 5 browsers will issue a
severe warning and refuse to display the page. Developers will quickly
learn.

Draconian error handling is something we really want to avoid,
particularly when such an error can be triggered by failing to
handle user generated content properly.

HTML 4 browsers will never run scripts (since it will only see plain
text). HTML 5 browsers will display rich text. It would be completely
secure for both HTML 4 and HTML 5 browsers.

A simple Javascript could clean up the HTML markup for HTML 4 browsers..


In a separate mail, you wrote:

<data>

&lt;user supplied input&gt;

</data>

Then this will be secure both for HTML 4 and HTML 5 browsers. HTML 4
browsers will display html, while HTML 5 browsers will display
correctly formatted code. A simple javascript like this (untested)
would make the data tags readable for HTML 4 browsers:

var els = document.getElementsByTagName("DATA");
for(e in els) els[e].innerHTML =
els[e].innerHTML.replace(/<&#91;^>&#93;*>/g, "").replace(/\n/g,
"<br>");

At first, I had no idea what that script was trying to do. But AFAICT,
you were trying to use this regex: /<[^>]*>/g, which would theoretically
match "<foo>". But, in this context, even with the corrected regex, the
script is entirely useless.

It wouldn't work, for example, with <foo bar=">" baz="xxx">. It is also
pointless because the inner HTML that you're running the regex on is
supposed to have all < and > escaped, so nothing would be matched
anyway.
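
Both failure modes can be seen by running the corrected regex on plain
strings. This is only a sketch; stripTags is a hypothetical name for the
replace call from the quoted script.

```javascript
// The corrected regex from the quoted script, applied to a plain string.
const stripTags = (html) => html.replace(/<[^>]*>/g, "");

// Case 1: a ">" inside an attribute value ends the match too early,
// so part of the tag is left behind in the output.
const leftover = stripTags('<foo bar=">" baz="xxx">text');

// Case 2: properly escaped content contains no "<" at all,
// so the regex matches nothing and the string comes back unchanged.
const escaped = "&lt;b&gt;bold&lt;/b&gt;";
const untouched = stripTags(escaped) === escaped;

console.log(JSON.stringify(leftover)); // the stray remainder of the tag
console.log(untouched);
```

In the first case the match stops at the ">" inside the bar attribute,
leaving the remainder of the tag (" baz="xxx">) in front of the text.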

A problem with this approach is that developers might forget to escape
tags, therefore I think browsers should display a security warning
message if the character < or > is encountered inside a <data> tag.

If a developer forgot to escape the markup at all, then a user could
enter "</data><script>...</script>" and do anything they wanted.
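
To make the dependency on escaping concrete, here is a minimal sketch of
the kind of escaping the page author would have to apply to every piece
of user input before writing it inside the element. The escapeHtml name
is hypothetical, and a real application would normally use the escaping
routine provided by its server-side framework.

```javascript
// Replace the characters that could close the container element or
// start new markup. "&" must be handled first so that the entities
// produced by the later replacements are not escaped a second time.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// The attack string from above, with illustrative script content.
const attack = "</data><script>alert(1)</script>";
const safe = escapeHtml(attack);
console.log(safe);
```

After escaping, the string contains no literal < or >, so it cannot
close the element early. Skip this one call and the attack string goes
through verbatim, which is the failure described above.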


--
Lachlan Hunt - Opera Software
http://lachy.id.au/
http://www.opera.com/
