Hi Max,

I ran into a similar situation in the past: a single table of Content Items that evolved from simple strings into elaborate objects with branching/staging, localization, history, and cached processing. The solution I implemented involved three tables:

1. Content Items: a placeholder for the item and its primary data (id, name, current version, access level, etc.)
2. Content Item Instances: the item value is kept here (id, item_id, version, value, etc.)
3. Content Items Metadata: the caching lives here (id, item_id, key, value, creation date)
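
For concreteness, here's roughly what that schema might look like (a sketch, not my exact production schema -- column names and types are illustrative, and I've called the metadata column 'cache_key' since KEY is reserved in MySQL):

    <?php
    // Illustrative schema for the three tables (MySQL flavor, via PDO).
    // Tune column types and indexes to your own queries.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    $pdo->exec("CREATE TABLE content_items (
        id              INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name            VARCHAR(255) NOT NULL,
        current_version INT UNSIGNED NOT NULL DEFAULT 1,
        access_level    TINYINT UNSIGNED NOT NULL DEFAULT 0
    )");

    $pdo->exec("CREATE TABLE content_item_instances (
        id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        item_id INT UNSIGNED NOT NULL,
        version INT UNSIGNED NOT NULL,
        value   TEXT NOT NULL,
        UNIQUE KEY item_version (item_id, version)
    )");

    $pdo->exec("CREATE TABLE content_item_metadata (
        id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        item_id    INT UNSIGNED NOT NULL,
        cache_key  VARCHAR(64) NOT NULL,
        value      TEXT NOT NULL,
        created_at DATETIME NOT NULL,
        UNIQUE KEY item_key (item_id, cache_key)
    )");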

When a Content Item is updated, you increment its version number and add a new instance with the new value and version number (be wary of the potential race condition here: two concurrent writers can read the same current version). That gives you a full item history in the Content Item Instances table.
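
In PDO terms the update might look something like this sketch; a SELECT ... FOR UPDATE inside a transaction is one way to close that race (assumes InnoDB or another engine with row locks):

    <?php
    function update_content_item(PDO $pdo, $item_id, $new_value)
    {
        $pdo->beginTransaction();
        try {
            // Lock the item row so two concurrent writers can't
            // both read the same current_version.
            $stmt = $pdo->prepare(
                'SELECT current_version FROM content_items
                 WHERE id = ? FOR UPDATE');
            $stmt->execute(array($item_id));
            $version = (int) $stmt->fetchColumn() + 1;

            $stmt = $pdo->prepare(
                'INSERT INTO content_item_instances (item_id, version, value)
                 VALUES (?, ?, ?)');
            $stmt->execute(array($item_id, $version, $new_value));

            $stmt = $pdo->prepare(
                'UPDATE content_items SET current_version = ? WHERE id = ?');
            $stmt->execute(array($version, $item_id));

            $pdo->commit();
        } catch (Exception $e) {
            $pdo->rollBack();
            throw $e;
        }
    }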

When an item is accessed, you check its Metadata to see whether a cached value exists for the particular processing key. If it does, and the creation date is later than the update date of those processors (I use a single site-wide update date for simplicity), show the cached value. Otherwise, build the cache entry and store it in the Metadata table.
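
In code the read path looks something like the following sketch. Here process() stands in for whatever chain of manipulations the key names, and $processors_updated_at is that site-wide update date as a Unix timestamp:

    <?php
    function get_processed_value(PDO $pdo, $item_id, $cache_key,
                                 $processors_updated_at)
    {
        $stmt = $pdo->prepare(
            'SELECT value, created_at FROM content_item_metadata
             WHERE item_id = ? AND cache_key = ?');
        $stmt->execute(array($item_id, $cache_key));
        $row = $stmt->fetch(PDO::FETCH_ASSOC);

        // Fresh cache: created after the processors last changed.
        if ($row && strtotime($row['created_at']) > $processors_updated_at) {
            return $row['value'];
        }

        // Miss or stale: fetch the current instance's raw value...
        $stmt = $pdo->prepare(
            'SELECT i.value
             FROM content_items c
             JOIN content_item_instances i
               ON i.item_id = c.id AND i.version = c.current_version
             WHERE c.id = ?');
        $stmt->execute(array($item_id));
        $raw = $stmt->fetchColumn();

        // ...run the manipulations (hypothetical helper)...
        $processed = process($raw, $cache_key);

        // ...and store the result. REPLACE relies on the unique
        // (item_id, cache_key) index to overwrite a stale entry.
        $stmt = $pdo->prepare(
            'REPLACE INTO content_item_metadata
                 (item_id, cache_key, value, created_at)
             VALUES (?, ?, ?, NOW())');
        $stmt->execute(array($item_id, $cache_key, $processed));

        return $processed;
    }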


Some notes:

You might be able to store the value directly in the Content Items table if you don't have to deal with i18n, staging, branching and the like. In my case I needed multiple instances per Content Item to begin with, and the history was a simple bonus extension to that implementation. You'd still need something like Table 2 above, but you wouldn't have to check it every time you need the raw value for a Content Item.
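
With the value stored directly on Content Items, the common read collapses to a single-table lookup (a sketch, assuming a 'value' column added to Table 1):

    <?php
    // Raw value straight from Content Items -- no join, no instance lookup.
    function get_raw_value(PDO $pdo, $item_id)
    {
        $stmt = $pdo->prepare('SELECT value FROM content_items WHERE id = ?');
        $stmt->execute(array($item_id));
        return $stmt->fetchColumn();
    }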

Likewise, my metadata situation is a bit more complicated, as I actually store the caches on a per-instance basis. I still have Table 3, but I also have a fourth table, Content Item Instance Metadata, which is where the caches live.

You mentioned that you have many tables in your database that need this treatment. You may need to experiment a bit to see whether it is more efficient to triple the number of tables in your database (ouch) or to add something like a 'table_name' field to Tables 2 and 3 above and use them for all your data. I suspect this latter route may lead to some table size issues, however, as those tables tend to fill up rather rapidly.
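
If you went the shared-tables route, it would amount to something like this (again just a sketch):

    <?php
    // One instances table and one metadata table serving every
    // source table in the database, keyed by a discriminator column.
    $pdo->exec("ALTER TABLE content_item_instances
                ADD COLUMN table_name VARCHAR(64) NOT NULL,
                ADD INDEX by_source (table_name, item_id)");

    $pdo->exec("ALTER TABLE content_item_metadata
                ADD COLUMN table_name VARCHAR(64) NOT NULL,
                ADD INDEX by_source (table_name, item_id)");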


I've been meaning to write this up for a while; if the ideas seem useful I can prepare a more rigorous treatment. Let me know what you eventually decide to use; I'd be interested to hear how your particular requirements shape the solution.

-- dann


Hello buds,

It's been a while since I've been active on the list, but I figured I'd give a holler and see if anyone had any suggestions for an application design problem I've run into.

I have a large number of text fields across many tables in a fairly large database, all of which can be manipulated in any number of ways. Some common manipulations would be scrubbing strings for display on the web (XHTML compliance and XSS avoidance), censoring of "bad" words, rich-text formatting, etc.

All in all, once you mix and match the various text manipulations, you end up with a large number of versions of the same chunk of text, and you need access to all of them based on a plethora of variables such as user options, access interface, etc. On top of that, some fields can be edited, and I'd like to keep copies of the entire revision history, which adds another level of complexity.

Originally I thought of some sort of memory caching solution, but the main goal here is to come up with a scalable solution, and there are currently a few gigabytes of text that this would apply to, so a memory cache would probably need to expire entries. It's possible that I could have some mixture of a short-term memory cache and a long-term disk cache, as disk/database space isn't a large concern.

Another issue is manipulation function versioning: e.g., when a new word is added to the censor function, you want to purge the cached copies of all the censored text created by the previous version.

Maybe I'm just over-complicating the entire thing, but doing this sort of manipulation on a high-traffic site seems like a gigantic duplication of CPU-intensive work that could (and should) be avoided.

I've come up with a lot of solutions, but none of them seem very elegant.
I'm trying to avoid a lot of excess DB queries and SQL joins.

I've done some searching around, and it seems that those who have solved this problem haven't discussed it publicly. I thought maybe someone dealing with locale on a large scale might have come up with a good solution, but since locale data is mostly static, it doesn't seem to apply in most cases.

So has anyone dealt with something similar, or is there an obvious solution that I'm missing? I'd be interested in hearing some of the more seasoned NYPHPers' opinions.

Thanks for any advice in advance!
-Max


