[PHP] heavy parsing of text, storing both versions
Hi all,

I'm building a CMS that does heavy parsing of an HTML-shorthand plain text into XHTML Strict, in a similar way to Textile (http://www.textism.com/tools/textile/).

The problem is that this conversion might take place on 2-3 columns of text, plus an unlimited number of other fields (my CMS has user-defined data models), and since users will need to edit this text at a later date, I need to do one of the following:

1. Parse the text on demand into HTML -- the parsing script is too heavy/slow for this.

2. Store both the plain (shorthand HTML) text and the parsed XHTML version of each field -- the problem with this being that I'm storing double the data in the database... combine this with versioning of each 'page', and I'm going to be storing a LOT of data in the DB. 100 articles x 3 versions each x 500 words x 6 chars per word = 900,000 chars; add a whole bunch of XHTML to this, and it's looking pretty huge. Double the articles or versions, and it's scary :) It also means I need two columns for each field (input and parsed), which makes the MySQL tables a lot more complex, etc.

3. Write a reverse set of functions which converts the XHTML back to the shorthand on demand for editing -- this seems great, but I don't like the idea of maintaining two functions for such a beast.

Has anyone got any further ideas?

---
Justin French
http://indent.com.au

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
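[Editorial sketch] The storage estimate in option 2 can be worked through in a few lines of PHP. The article, version, word and character counts come from the post; the 25% markup overhead for the XHTML copy is purely an assumed figure for illustration, not something measured.

```php
<?php
// Figures taken from the post above.
$articles           = 100;
$versionsPerArticle = 3;
$wordsPerVersion    = 500;
$charsPerWord       = 6;

// Size of the shorthand text alone.
$shorthandChars = $articles * $versionsPerArticle * $wordsPerVersion * $charsPerWord;

// ASSUMPTION: the parsed XHTML is roughly 25% larger than the shorthand.
// This ratio is hypothetical; measure it on real content before relying on it.
$markupOverhead = 0.25;
$xhtmlChars     = (int) round($shorthandChars * (1 + $markupOverhead));

// Option 2 stores both copies.
$totalChars = $shorthandChars + $xhtmlChars;

printf("shorthand: %s chars\n", number_format($shorthandChars));
printf("xhtml:     %s chars\n", number_format($xhtmlChars));
printf("both:      %s chars\n", number_format($totalChars));
```

Under these assumptions, storing both copies comes to roughly two million characters, which is on the order of 2 MB for single-byte text.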
Re: [PHP] heavy parsing of text, storing both versions
Justin French wrote:
> Hi all, I'm building a CMS that does heavy parsing of an HTML-shorthand
> plain text into XHTML Strict, in a similar way to Textile
> (http://www.textism.com/tools/textile/).
>
> 1. Parse the text on demand into HTML -- the parsing script is too
> heavy/slow for this.
>
> 2. Store both the plain (shorthand HTML) text and the parsed XHTML version
> of each field -- the problem with this being that I'm storing double the
> data in the database...
>
> 3. Write a reverse set of functions which converts the XHTML back to the
> shorthand on demand for editing -- this seems great, but I don't like the
> idea of maintaining two functions for such a beast.

Well, you pretty much listed all of the options. Personally, I'd probably go with #2 because hard drive space is cheap. But... if the process is really that intensive and you're really that concerned about space, then I'd do #3. It doesn't seem like it'd be that hard to maintain, as you're just reversing everything, and how often do you expect it to change?

Sorry I can't offer a better option. :)

--
---John Holmes...

Amazon Wishlist: www.amazon.com/o/registry/3BEXC84AB3A5E/
php|architect: The Magazine for PHP Professionals www.phparch.com
Re: [PHP] heavy parsing of text, storing both versions
On Fri, Feb 20, 2004 at 10:35:11AM +1100, Justin French wrote:
> 1. Parse the text on demand into HTML -- the parsing script is too
> heavy/slow for this.
>
> 2. Store both the plain (shorthand HTML) text and the parsed XHTML version
> of each field -- the problem with this being that I'm storing double the
> data in the database... combine this with versioning of each 'page', and
> I'm going to be storing a LOT of data in the DB.
> <snip>
> 3. Write a reverse set of functions which converts the XHTML back to the
> shorthand on demand for editing -- this seems great, but I don't like the
> idea of maintaining two functions for such a beast.
>
> Has anyone got any further ideas?

4. Store the plain (shorthand HTML) text and, when users 'save' changes, generate a static page containing the transformed XHTML version. You will have the processing overhead once (when the data is changed), and every other time visitors get static files.

It sounds like #3 would be quite difficult. Going from HTML -> XHTML, you know what the end result will look like. Going the other way, you won't know for sure what the users originally entered when they authored the content. I'm assuming this isn't a 1-to-1 transformation, so that these:

    <b>Some bold text</b>
    <B>Some bold text</B>
    <b>Some bold text</B>

will all get turned into:

    <strong>Some bold text</strong>

If you turn the <strong> text back into <b>, then it's not clear which of the three options you should use.

Unless I'm misunderstanding...

joel

--
[ joel boonstra | gospelcom.net ]
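[Editorial sketch] The many-to-one problem described above can be shown with a toy converter. The function shorthand_to_xhtml() below is a hypothetical stand-in that handles only bold tags; it is not the poster's parser, whose rules are not given in the thread.

```php
<?php
// HYPOTHETICAL stand-in for the real parser: rewrites <b>...</b>,
// in any letter case, to <strong>...</strong>.
function shorthand_to_xhtml(string $text): string
{
    return preg_replace('~<b>(.*?)</b>~is', '<strong>$1</strong>', $text);
}

// The three spellings from the message above.
$inputs = [
    '<b>Some bold text</b>',
    '<B>Some bold text</B>',
    '<b>Some bold text</B>',
];

$outputs = array_map('shorthand_to_xhtml', $inputs);

// All three inputs collapse to a single XHTML string, so no reverse
// function can tell which spelling the author originally typed.
$distinctOutputs = array_unique($outputs);
echo count($distinctOutputs), "\n";   // prints 1
echo $outputs[0], "\n";               // prints <strong>Some bold text</strong>
```

A reverse function would therefore have to pick one canonical shorthand form, and the text a user gets back for editing may differ from what they typed.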
Re: [PHP] heavy parsing of text, storing both versions
joel boonstra wrote:
> On Fri, Feb 20, 2004 at 10:35:11AM +1100, Justin French wrote:
> > 1. Parse the text on demand into HTML -- the parsing script is too
> > heavy/slow for this.
> >
> > 2. Store both the plain (shorthand HTML) text and the parsed XHTML
> > version of each field -- the problem with this being that I'm storing
> > double the data in the database... combine this with versioning of each
> > 'page', and I'm going to be storing a LOT of data in the DB.
> > <snip>
> > 3. Write a reverse set of functions which converts the XHTML back to the
> > shorthand on demand for editing -- this seems great, but I don't like
> > the idea of maintaining two functions for such a beast.
> >
> > Has anyone got any further ideas?
>
> 4. Store the plain (shorthand HTML) text and, when users 'save' changes,
> generate a static page containing the transformed XHTML version. You will
> have the processing overhead once (when the data is changed), and every
> other time visitors get static files.

Isn't that just an alternate version of #2? You're still duplicating the data and taking up storage space. Again, I wouldn't really be worried about this, but that's the issue presented in #2. Sure, static files would probably be faster, but that doesn't answer the issue of when/how to do the conversion.

--
---John Holmes...
Re: [PHP] heavy parsing of text, storing both versions
On Thu, Feb 19, 2004 at 09:15:35PM -0500, John W. Holmes wrote:
> > > 2. Store both the plain (shorthand HTML) text and the parsed XHTML
> > > version of each field -- the problem with this being that I'm storing
> > > double the data in the database... combine this with versioning of
> > > each 'page', and I'm going to be storing a LOT of data in the DB.
> > > <snip>
> >
> > 4. Store the plain (shorthand HTML) text and, when users 'save' changes,
> > generate a static page containing the transformed XHTML version. You
> > will have the processing overhead once (when the data is changed), and
> > every other time visitors get static files.
>
> Isn't that just an alternate version of #2? You're still duplicating the
> data and taking up storage space. Again, I wouldn't really be worried
> about this, but that's the issue presented in #2. Sure, static files would
> probably be faster, but that doesn't answer the issue of when/how to do
> the conversion.

Version #2 involved an identical database structure, or multiple database fields, or some sort of redundant data storage that mirrors the HTML database structure. The bigger problem, to me, seemed to be the complexity introduced into the database, not the extra storage space required. Publishing static files solves the script-run problem (the parser only runs once, at save time) and lets the database remain as originally planned.

The point is that the XHTML version is only necessary for display on the finished webpage, and the simple HTML version is only necessary for editing in the administrative interface. Publishing static XHTML files eliminates the need for database access on each page request (after all, the content isn't going to change with each request, is it?), and keeping the HTML in the database lets the admin interface be as interactive and dynamic as necessary.

Just my $.02, though -- I'm not going to end up maintaining this, so the best answer is the one that works best for the OP.

joel

--
[ joel boonstra | gospelcom.net ]
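[Editorial sketch] Option 4, as described in the thread, amounts to running the parser inside the admin "save" handler and writing the result to a file the web server can serve directly. The sketch below assumes a slug-based file name and a trivial placeholder parser; parse_shorthand(), publish_page() and the file layout are all hypothetical names and choices, not details from the thread.

```php
<?php
// HYPOTHETICAL stand-in for the slow shorthand-to-XHTML parser.
// It only escapes the text and wraps it in a paragraph.
function parse_shorthand(string $shorthand): string
{
    return '<p>' . htmlspecialchars($shorthand, ENT_QUOTES, 'UTF-8') . '</p>';
}

// Intended to be called from the save handler, after the shorthand itself
// has been stored in the database. Returns the path of the published file.
function publish_page(string $slug, string $shorthand, string $publicDir): string
{
    // Restrict the slug so it cannot point outside $publicDir.
    if (!preg_match('/\A[a-z0-9-]+\z/', $slug)) {
        throw new InvalidArgumentException("Invalid slug: $slug");
    }

    $target = $publicDir . '/' . $slug . '.html';
    $tmp    = $target . '.tmp';

    // The expensive parse happens here, once per save.
    if (file_put_contents($tmp, parse_shorthand($shorthand)) === false) {
        throw new RuntimeException("Could not write $tmp");
    }

    // Write to a temporary file first, then move it into place, so a
    // visitor is unlikely to be served a half-written page.
    if (!rename($tmp, $target)) {
        throw new RuntimeException("Could not move $tmp to $target");
    }

    return $target;
}

// Usage: publish into the system temp directory for demonstration.
$path = publish_page('about-us', 'Hello & welcome', sys_get_temp_dir());
echo file_get_contents($path), "\n";   // prints <p>Hello &amp; welcome</p>
```

With this arrangement the database keeps a single shorthand column per field, and the XHTML copy lives only on disk, which is the trade-off argued for in the last message.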