"When in doubt, refuse the temptation to guess."  So maybe I should
just leave it out of WebHelpers for now in favor of this recipe?

I started to add support for safe tags, but took it out because it
made the code significantly more complicated and still didn't handle
pathological or unbalanced cases.  Genshi doesn't handle
unbalanced cases at all; it just raises an error if the input is not
well-formed XML.  But we can't expect users to input well-formed XML,
or even to know what that means.  My intention was to support the
limited markup some newspaper sites allow (<i>, <a>, and a few
others).  But given that this is not all that urgent, I was reluctant
to make the basic code more complex for it.
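For the record, the whitelist approach I was experimenting with looks
roughly like this.  This is a rough sketch, not the actual WebHelpers
code: it uses the stdlib parser (Python 3's html.parser here; Python 2
had it as the HTMLParser module), and the class name is just for
illustration.  It keeps a handful of safe tags, drops everything else,
and shows both problems at once: script *content* still gets through as
data, and nothing rebalances broken markup.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

# Hypothetical whitelist; real code would need a vetted list.
SAFE_TAGS = frozenset(['a', 'b', 'br', 'em', 'i', 'strong'])

class SafeTagStripper(HTMLParser):
    """Pass whitelisted tags through (minus their attributes) and drop
    all other tags.  Tag *content* is always kept, entities are left
    unresolved, and unbalanced input is not repaired."""

    def __init__(self):
        # convert_charrefs=False so entity/charref handlers fire (3.4+).
        HTMLParser.__init__(self, convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SAFE_TAGS:
            self.parts.append('<%s>' % tag)  # attributes deliberately dropped

    def handle_endtag(self, tag):
        if tag in SAFE_TAGS:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        self.parts.append(data)  # note: script/style contents arrive here too

    def handle_entityref(self, name):
        self.parts.append('&%s;' % name)  # leave &lt; etc. as-is

    def handle_charref(self, name):
        self.parts.append('&#%s;' % name)

def sanitize(text):
    parser = SafeTagStripper()
    parser.feed(text)
    parser.close()
    return ''.join(parser.parts)
```

So `sanitize('x <script>bad()</script> y')` keeps the `bad()` text, which
is exactly the "NEFARIOUS CODE" leak discussed below; fixing that means
tracking which element you're inside, and that's where the complexity
started to snowball.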

--Mike

On Sun, Jun 7, 2009 at 11:10 PM, Mark T.<mark.t...@gmail.com> wrote:
>
> Hi Mike,
>
> I had a need for HTML "sanitizing" in the past.  When I went searching
> for existing code to handle the task, I found Genshi provided quite a
> bit of functionality.  However, the terminology they (and, I
> believe, Rails and other libraries) use is a little different.  Loosely:
>
>  HTML sanitizing: remove some or all HTML elements from a string,
> using whitelists of allowed HTML elements and HTML attributes
>
>  Text serialization: produce plain text from markup streams
>
>
> The Genshi docs have a pretty good example of filtering a stream to
> both sanitize the contents of some HTML tags and then render it to
> plaintext:
>
> http://genshi.edgewall.org/wiki/Documentation/0.5.x/streams.html#serialization
>
>
> Since the WebHelpers library is quite widely used, I thought it might
> be more consistent to adopt the terminology other libraries use, and
> possibly to provide the same two distinct types of functionality.
>
>
> For the curious, in my own code I performed sanitizing and
> serialization a little differently from the Genshi doc examples.  It
> may not be efficient, but it worked for my purposes:
>
>    from genshi.input import HTML
>    from genshi.filters import HTMLSanitizer
>
>    safe_tags = frozenset(['a', 'b', 'br', 'em', ...])
>    safe_attrs = frozenset(['align', 'alt', ...])
>    safe_schemes = frozenset(['ftp', 'http', 'https', 'mailto', None])
>
>    def sanitize_html(content_str):
>        markup = HTML(content_str) | HTMLSanitizer(safe_tags, safe_attrs,
>                                                   safe_schemes)
>        return markup.render('html')
>
>    def serialize_to_plaintext(content_str):
>        markup = HTML(content_str)
>        return markup.render('text')
>
>
> Best,
> Mark
>
>
> On May 31, 4:17 pm, Mike Orr <sluggos...@gmail.com> wrote:
>> I put an HTML sanitizing helper in WebHelpers dev.  It's
>> webhelpers.html.converters.sanitize(), defined in
>> webhelpers.html.render.  I'm not sure I'm satisfied with it, though.
>> It strips all tags but leaves their content.
>>
>> This would handle:
>>     I <i>really</i> like <script language="javascript"></script> steak!
>> =>
>>    I really like steak!
>>
>> On the other hand it lets this through:
>>     I <i>really</i> like <script language="javascript">NEFARIOUS
>> CODE</script> steak!
>> =>
>>     I really like NEFARIOUS CODE steak!
>>
>> I'm not sure whether that can be exploited.  It also doesn't resolve
>> HTML entities.  Should it?  The Javascript may have entities or raw
>> <'s meant for comparisons.
>>
>> The HTML parser (Python's HTMLParser) lets raw <'s surrounded by
>> whitespace through:
>>     A < B
>> =>
>>     A < B
>>
>> But raises a fit if it looks like an unfinished tag:
>>     A <B
>> =>
>>     HTMLParser.HTMLParseError: EOF in middle of construct, at line 1, column 3
>>
>> This means we can't make a converter that handles all pathological
>> input without significant work.
>>
>> I could strip the tag *and* the content, which would remove the
>> embedded Javascript but make users wonder where their <i> content
>> went, potentially leading to unreadable text.
>>
>> PHP's strip_tags just strips the tags but leaves the content, so maybe
>> that's enough?
>>
>> http://us2.php.net/manual/en/function.strip-tags.php
>>
>> The manpage has this caveat:
>>
>>     Because strip_tags() does not actually validate the HTML, partial
>>     or broken tags can result in the removal of more text/data than expected.
>>
>> I've implemented several patches in the WebHelpers tip.  I'll
>> probably release the beta in a few days, although I'd like to give
>> it a proper manual before the final release.  But I'm still learning
>> how to set that up with Sphinx.
>>
>> --
>> Mike Orr <sluggos...@gmail.com>
>
>



-- 
Mike Orr <sluggos...@gmail.com>

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"pylons-devel" group.
To post to this group, send email to pylons-devel@googlegroups.com
To unsubscribe from this group, send email to 
pylons-devel+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/pylons-devel?hl=en
-~----------~----~----~----~------~----~------~--~---
