There are two discussions here that are getting conflated: WHEN to
"clean" and HOW to clean. I have yet to find a good, comprehensive
way to do the latter (more below), but right here I'm responding to the
former.
Christopher Schultz wrote:
> If you /are/ capturing text you will be using that /can/ contain HTML
> markup, then cleaning it as it comes in is still a mistake. Let's say
> you have a bug in your cleansing code. In that case, bad stuff gets into
> your database where it's hard to root out and fix.
If that data is hard to find, then you haven't cleanly defined your DB
schema.
WHEN to do the cleaning is not a question of security and
maintainability, but a question of amortizing clock cycles to try to get
responses out to browsers as quickly as possible. There is no reason to
clean the same piece of text with the same algorithm more than once, so
why not do it on the input side? If you find a bug in your cleansing
code, then once you change it, re-run it ONCE on all the potentially
dangerous text blocks. Those should map directly to columns in your DB.
If you can't look at your DB schema and tell me which columns are
displayed without escaping their contents, then something is wrong.
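
To make the "re-run it ONCE" idea concrete, here is a minimal sketch in
Python using sqlite3. Everything named in it is hypothetical and made up
for illustration: the tables and columns (comment.body,
profile.signature), the UNESCAPED_COLUMNS map, and the strip_script
stand-in cleaner. It also assumes the cleaner is safe to apply to text
that has already been cleaned.

```python
import re
import sqlite3

# Hypothetical map: table -> (key column, columns displayed without escaping).
# This is the "look at your schema and tell me which columns" list.
UNESCAPED_COLUMNS = {
    "comment": ("id", ["body"]),
    "profile": ("id", ["signature"]),
}


def strip_script(text):
    """Stand-in cleaner for illustration only; NOT a real sanitizer."""
    return re.sub(r"(?is)<script.*?</script>", "", text)


def reclean_all(conn, clean):
    """Run `clean` once over every stored value; return rows changed."""
    changed = 0
    # Identifiers come from the trusted constant above, never from users.
    for table, (key, columns) in UNESCAPED_COLUMNS.items():
        for column in columns:
            rows = conn.execute(
                f"SELECT {key}, {column} FROM {table}"
            ).fetchall()
            for row_id, text in rows:
                if text is None:
                    continue
                cleaned = clean(text)
                if cleaned != text:
                    conn.execute(
                        f"UPDATE {table} SET {column} = ? WHERE {key} = ?",
                        (cleaned, row_id),
                    )
                    changed += 1
    conn.commit()
    return changed
```

Running reclean_all a second time with the same cleaner should change
nothing, which is the point: the work is paid for once per change to
the filter, not once per page view.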
> I agree with Leon: cleaning input is not usually a good idea. Cleaning
> output is where the real money is -- from a security and maintainability
> standpoint.
I'd be happy to change my mind if you can suggest any other reason to
redo that work more often than the filtering module (or the data that
backs it) changes.
The acknowledgment that said algorithm also needs backing data leads us
right back to the question of HOW.
I believe all filtering efforts will eventually come down to "What
tags/attributes are OK?" (among other critical questions, like "What
values for attributes are OK?"). If you're stuck in the "what
tags/attributes are NOT OK" world, then we need a different
discussion: white lists vs. black lists.
So, does anyone have a good list of "safe" tags/attributes that should
be allowed through (assuming the attribute values also pass muster)?
Here are my (woefully incomplete) lists. There is also a crossover
table, allowed_xhtml_tag_attribute_map (not shown), that links the
allowable combinations of the two:
allowed_xhtml_tag: a b blockquote br cite del div em font h1 h2 h3 h4
h5 h6 i img ins li ol p pre span strong sub sup table td th tr u ul
allowed_xhtml_attribute: alt border cite class color href name src
style title
For example, I already know I need to add caption and tbody to the first
table, but I've been delaying more by-hand tweaks in hopes of finding a
more systematic way to fill the tables. I've yet to find it. Any
suggestions?
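
In the meantime, here is roughly how I picture the tables being applied,
as a sketch in Python on top of the standard html.parser module. The tag
list is the one above. The ALLOWED_ATTRS map is a hypothetical stand-in
for the crossover table (which I haven't shown), and the "URL attributes
must start with http://, https:// or mailto:" rule is just one assumed
answer to the attribute-value question. It drops relative URLs, so treat
it as a starting point rather than a finished filter.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = set(
    "a b blockquote br cite del div em font h1 h2 h3 h4 h5 h6 i img ins "
    "li ol p pre span strong sub sup table td th tr u ul".split()
)

# Hypothetical stand-in for allowed_xhtml_tag_attribute_map.
ALLOWED_ATTRS = {
    "a": {"href", "name", "title"},
    "img": {"src", "alt", "title", "border"},
    "blockquote": {"cite"},
    "font": {"color"},
}

URL_ATTRS = {"href", "src", "cite"}
SAFE_SCHEMES = ("http://", "https://", "mailto:")  # assumed value policy
VOID_TAGS = {"br", "img"}


class WhitelistFilter(HTMLParser):
    """Re-emit only whitelisted tags/attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text content still passes
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue
            value = value or ""
            if name in URL_ATTRS and not value.strip().lower().startswith(
                SAFE_SCHEMES
            ):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        closing = " /" if tag in VOID_TAGS else ""
        self.out.append(f"<{tag}{''.join(kept)}{closing}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def clean(text):
    f = WhitelistFilter()
    f.feed(text)
    f.close()  # flush any buffered trailing text
    return "".join(f.out)
```

Note that this only decides what gets through; it does nothing about
unbalanced or badly nested tags.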
-Dale Newfield
[EMAIL PROTECTED]
P.S.: the "tagsoup parse" suggestion is also good because it guarantees
that anything you do reflect back to users is valid XHTML (and so won't
screw up other parts of your page with illegally nested/unbalanced tags).
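
To illustrate the balancing point, here is a small sketch in Python. It
is not TagSoup (that is a Java library); it is a stand-in built on the
standard html.parser module, with hypothetical names (Balancer,
balance). It drops all attributes for brevity, and it only guarantees
that every tag it emits gets closed in order, not that the result is
valid XHTML.

```python
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr"}


class Balancer(HTMLParser):
    """Re-emit tags so that every opened tag is closed, in order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []  # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            self.out.append(f"<{tag} />")
            return
        self.open_tags.append(tag)
        self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray close tag with no matching open: drop it
        # Close everything opened since the matching tag, then the tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def balance(text):
    b = Balancer()
    b.feed(text)
    b.close()
    # Close whatever the input left open.
    while b.open_tags:
        b.out.append(f"</{b.open_tags.pop()}>")
    return "".join(b.out)
```

So balance("<b><i>x</b>") comes back as "<b><i>x</i></b>", and a stray
"</div>" in user text can no longer close one of the page's own divs.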