On 3/16/07, Dale Newfield <[EMAIL PROTECTED]> wrote:
There are two discussions here that are getting convoluted: WHEN to
"clean" and HOW to clean. I still have yet to find a good comprehensive
way to do the latter (more below), but right here I'm responding to the
former.
Christopher Schultz wrote:
> If you /are/ capturing text you will be using that /can/ contain HTML
> markup, then cleaning it as it comes in is still a mistake. Let's say
> you have a bug in your cleansing code. In that case, bad stuff gets into
> your database where it's hard to root out and fix.
If that data is hard to find then you haven't cleanly defined your DB
schema.
There are more kinds of persistent storage between heaven and hell than
an RDBMS. And even with an RDBMS, have you ever tried to update something
like 1,000,000 rows of an in-production DB under traffic?
WHEN to do the cleaning is not a question of security and
maintainability, but a question of amortizing clock cycles to try to get
responses out to browsers as quickly as possible. There is no reason to
clean the same piece of text with the same algorithm more than once, so
why not do it on the input side? If you find a bug in your cleansing
code, then once you change it, re-run it ONCE on all the potentially
dangerous text blocks. Those should map directly to columns in your DB.
If you can't look at your DB schema and tell me which columns are
displayed without escaping their contents, then something is wrong.
If you CAN look at the DB and say what is displayed where, then there
is something wrong with your application design. Or you are a decoding
genius if you can work out in your head exactly what N levels of
abstraction do with each chunk of data. But that's probably too theoretical.
As for
"There is no reason to clean the same piece of text with the same
algorithm more than once, so why not do it on the input side?"
There are many. First of all, if you clean on output the user data
remains untouched. Altering it could raise legal issues, especially if
your filtering does a bit too much. Then, encoding is cheaper than regexp
filtering. Much cheaper. And you have to encode anyway, since you want to
deliver valid HTML, don't you?
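
To make the "encode on output" point concrete, here is a minimal sketch of what such an encoder could look like in Java. The class and method names (HtmlEncoder, encode) are made up for illustration and do not come from any framework discussed in this thread. It assumes the text ends up in HTML element content or a quoted attribute value; other contexts (JavaScript, URLs, CSS) need their own encoders.

```java
// Hypothetical helper for illustration only; not part of any library.
final class HtmlEncoder {
    private HtmlEncoder() {}

    /**
     * Replaces the five characters that are significant in HTML text and
     * quoted attribute values with character references. A single pass
     * over the input, no regular expressions involved.
     */
    static String encode(String raw) {
        if (raw == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

The stored text is never modified; only the copy written into the response is encoded, so whatever the user typed stays intact in the database.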
> I agree with Leon: cleaning input is not usually a good idea. Cleaning
> output is where the real money is -- from a security and maintainability
> standpoint.
I'd be happy to change my mind if you can suggest any other reason
to re-do that work more frequently than changes to the filtering module
/ data that backs the filtering module.
1. Avoiding content destruction through bugs in your filtering module.
2. Avoiding DoS exposure, since filtering, especially with regexp, is
very expensive.
3. You have to encode the output HTML anyway, so why do something twice?
4. Updates in the filtering/encoding logic can be applied on the fly
since you don't have to change any data.
And I assume there are some more :-)
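
Points 1 and 4 above can be sketched roughly like this. Everything here is hypothetical: CommentPage, submit, setEncoder and render are illustrative names, and the in-memory list merely stands in for whatever persistent storage you use. The idea is only that raw input is stored as-is and the encoder is applied at render time, so swapping the encoder fixes every page without touching stored data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: raw text is stored untouched, encoding happens on output.
final class CommentPage {
    // Stands in for the database; holds exactly what the user submitted.
    private final List<String> rawComments = new ArrayList<>();
    private UnaryOperator<String> encoder;

    CommentPage(UnaryOperator<String> encoder) {
        this.encoder = encoder;
    }

    /** Stores the user's text without filtering or modification. */
    void submit(String userText) {
        rawComments.add(userText);
    }

    /** Swaps the output encoder; stored data is not rewritten. */
    void setEncoder(UnaryOperator<String> encoder) {
        this.encoder = encoder;
    }

    /** Applies the current encoder to every comment while rendering. */
    String render() {
        StringBuilder html = new StringBuilder("<ul>");
        for (String raw : rawComments) {
            html.append("<li>").append(encoder.apply(raw)).append("</li>");
        }
        return html.append("</ul>").toString();
    }
}
```

If the first encoder turns out to be buggy, a call to setEncoder with the corrected one takes effect on the next render(), with no migration over a million rows.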
regards
Leon
-Dale Newfield
[EMAIL PROTECTED]