Hi Max,

I ran into a similar situation in the past: a single table of Content Items that evolved from simple strings into elaborate objects with branching/staging, localization, history, and cached processing. The solution I implemented involved three tables:

1. Content Items: a placeholder for the item and its primary data (id, name, current version, access level, etc.)
2. Content Item Instances: the item value is kept here (id, item_id, version, value, etc.)
3. Content Items Metadata: the caching lives here (id, item_id, key, value, creation date)
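
For concreteness, here's roughly what that schema might look like (a sketch, not my exact production schema -- column names and types are illustrative, and I've called the metadata column 'cache_key' since KEY is reserved in MySQL):

    <?php
    // Illustrative schema for the three tables (MySQL flavor, via PDO).
    // Tune column types and indexes to your own queries.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    $pdo->exec("CREATE TABLE content_items (
        id              INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name            VARCHAR(255) NOT NULL,
        current_version INT UNSIGNED NOT NULL DEFAULT 1,
        access_level    TINYINT UNSIGNED NOT NULL DEFAULT 0
    )");

    $pdo->exec("CREATE TABLE content_item_instances (
        id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        item_id INT UNSIGNED NOT NULL,
        version INT UNSIGNED NOT NULL,
        value   TEXT NOT NULL,
        UNIQUE KEY item_version (item_id, version)
    )");

    $pdo->exec("CREATE TABLE content_item_metadata (
        id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        item_id    INT UNSIGNED NOT NULL,
        cache_key  VARCHAR(64) NOT NULL,
        value      TEXT NOT NULL,
        created_at DATETIME NOT NULL,
        UNIQUE KEY item_key (item_id, cache_key)
    )");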

When a Content Item is updated, you increment its version number and add a new instance with the new value and version number (be wary of the potential race condition here: two concurrent writers can read the same current version). That gives you a full item history in the Content Item Instances table.
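
In PDO terms the update might look something like this sketch; a SELECT ... FOR UPDATE inside a transaction is one way to close that race (assumes InnoDB or another engine with row locks):

    <?php
    function update_content_item(PDO $pdo, $item_id, $new_value)
    {
        $pdo->beginTransaction();
        try {
            // Lock the item row so two concurrent writers can't
            // both read the same current_version.
            $stmt = $pdo->prepare(
                'SELECT current_version FROM content_items
                 WHERE id = ? FOR UPDATE');
            $stmt->execute(array($item_id));
            $version = (int) $stmt->fetchColumn() + 1;

            $stmt = $pdo->prepare(
                'INSERT INTO content_item_instances (item_id, version, value)
                 VALUES (?, ?, ?)');
            $stmt->execute(array($item_id, $version, $new_value));

            $stmt = $pdo->prepare(
                'UPDATE content_items SET current_version = ? WHERE id = ?');
            $stmt->execute(array($version, $item_id));

            $pdo->commit();
        } catch (Exception $e) {
            $pdo->rollBack();
            throw $e;
        }
    }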

When an item is accessed, you check its Metadata to see whether a cached value exists for the particular processing key. If it does, and the creation date is later than the update date of those processors (I use a single site-wide update date for simplicity), show the cached value. Otherwise, build the cache entry and store it in the Metadata table.
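
In code the read path looks something like the following sketch. Here process() stands in for whatever chain of manipulations the key names, and $processors_updated_at is that site-wide update date as a Unix timestamp:

    <?php
    function get_processed_value(PDO $pdo, $item_id, $cache_key,
                                 $processors_updated_at)
    {
        $stmt = $pdo->prepare(
            'SELECT value, created_at FROM content_item_metadata
             WHERE item_id = ? AND cache_key = ?');
        $stmt->execute(array($item_id, $cache_key));
        $row = $stmt->fetch(PDO::FETCH_ASSOC);

        // Fresh cache: created after the processors last changed.
        if ($row && strtotime($row['created_at']) > $processors_updated_at) {
            return $row['value'];
        }

        // Miss or stale: fetch the current instance's raw value...
        $stmt = $pdo->prepare(
            'SELECT i.value
             FROM content_items c
             JOIN content_item_instances i
               ON i.item_id = c.id AND i.version = c.current_version
             WHERE c.id = ?');
        $stmt->execute(array($item_id));
        $raw = $stmt->fetchColumn();

        // ...run the manipulations (hypothetical helper)...
        $processed = process($raw, $cache_key);

        // ...and store the result. REPLACE relies on the unique
        // (item_id, cache_key) index to overwrite a stale entry.
        $stmt = $pdo->prepare(
            'REPLACE INTO content_item_metadata
                 (item_id, cache_key, value, created_at)
             VALUES (?, ?, ?, NOW())');
        $stmt->execute(array($item_id, $cache_key, $processed));

        return $processed;
    }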


Some notes:

You might be able to store the value directly in the Content Items table if you don't have to deal with i18n, staging, branching and the like. In my case I needed multiple instances per Content Item to begin with, and the history was a simple bonus extension to that implementation. You'd still need something like Table 2 above, but you wouldn't have to check it every time you need the raw value for a Content Item.
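
With the value stored directly on Content Items, the common read collapses to a single-table lookup (a sketch, assuming a 'value' column added to Table 1):

    <?php
    // Raw value straight from Content Items -- no join, no instance lookup.
    function get_raw_value(PDO $pdo, $item_id)
    {
        $stmt = $pdo->prepare('SELECT value FROM content_items WHERE id = ?');
        $stmt->execute(array($item_id));
        return $stmt->fetchColumn();
    }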

Likewise, my metadata situation is a bit more complicated, as I actually store the caches on a per-instance basis. I still have Table 3, but I also have a fourth table, Content Item Instance Metadata, which is where the caches live.

You mentioned that you have many tables in your database that need this treatment. You may need to experiment a bit to see whether it is more efficient to triple the number of tables in your database (ouch) or to add something like a 'table_name' field to Tables 2 and 3 above and use them for all your data. I suspect this latter route may lead to some table size issues, however, as those tables tend to fill up rather rapidly.
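
If you went the shared-tables route, it would amount to something like this (again just a sketch):

    <?php
    // One instances table and one metadata table serving every
    // source table in the database, keyed by a discriminator column.
    $pdo->exec("ALTER TABLE content_item_instances
                ADD COLUMN table_name VARCHAR(64) NOT NULL,
                ADD INDEX by_source (table_name, item_id)");

    $pdo->exec("ALTER TABLE content_item_metadata
                ADD COLUMN table_name VARCHAR(64) NOT NULL,
                ADD INDEX by_source (table_name, item_id)");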


I've been meaning to write this up for a while; if the ideas seem useful I can prepare a more rigorous treatment. Let me know what you eventually decide to use; I'd be interested to hear how your particular requirements shape the solution.

-- dann


Hello buds,

It's been a while since I've been active on the list, but I figured I'd give a holler and see if anyone had any suggestions for an application design problem I've run into.

I have a large number of text fields across many tables in a fairly large database, all of which can be manipulated in any number of ways. Some common manipulations would be scrubbing strings for display on the web (XHTML compliance and XSS avoidance), censoring of "bad" words, rich-text formatting, etc.

All in all, once you mix and match the various text manipulations, you end up with a large number of versions of the same chunk of text, and you need access to all of them based on a plethora of variables such as user options, access interface, etc. On top of that, some fields can be edited, and I'd like to keep copies of the entire revision history, which adds another level of complexity.

Originally I thought of some sort of memory caching solution, but the main goal here is to come up with a scalable solution, and there are currently a few gigabytes of text that this would apply to, so a memory cache would probably need to expire entries. It's possible that I could have some mixture of a short-term memory cache and a long-term disk cache, as disk/database space isn't a large concern.

Another issue is manipulation function versioning: e.g., when a new word is added to the censor function, you want to purge the cached copies of all the censored text created by the previous version.

Maybe I'm just over-complicating the entire thing, but doing this sort of manipulation on a high-traffic site seems like a gigantic duplication of CPU-intensive work that could (and should) be avoided.

I've come up with a lot of solutions, but none of them seem very elegant.
I'm trying to avoid a lot of excess DB queries and SQL joins.

I've done some searching around, and it seems that those who have solved this problem haven't discussed it publicly. I thought maybe someone dealing with locale on a large scale might have come up with a good solution, but since locale data is mostly static, it doesn't seem to apply in most cases.

So has anyone dealt with something similar, or is there an obvious solution that I'm missing? I'd be interested in hearing some of the more seasoned NYPHPers' opinions.

Thanks for any advice in advance!
-Max


