Solprovider - I've managed to use the aggregate files piece (http://solprovider.com/lenya/aggregatefiles) that you wrote months back, and it's great.

Here at Queen's University, we're using Lenya as our Campus-wide CMS, and this aggregation is a critical piece of an integrated strategy.
Every new publication that comes on board is armed with a "Queen's Default" publication to work from. This publication comes with the basic Queen's template built-in, as well as many of the features we've developed.


One such feature is what we're calling "tagging-based newsAggregation" (it works somewhat similarly to Flickr-style tagging).

There are a few components to this feature.
First, there is a pipeline in each publication, matching the URL $pubname/siteAggregator.xml, that produces an XML dump of the publication's content.
Next, we have a custom doctype called "newsAggregator" that is passed that dump ($pubname/siteAggregator.xml).
Each instance of a newsAggregator uses its own keywords metadata (a comma-separated string) to find other pages on the site that contain those keywords in their metadata.
For each match, we print the page's title, followed by its meta description and a link. The doctype also creates an associated RSS feed of the found items.
The end result is an intuitive way to create RSS feeds and aggregation (it's only a simple keyword filter at the moment, but it could be extended with boolean matching).
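
In case it helps to see the shape of it, the wiring looks roughly like this (a simplified sketch only; the matcher patterns, stylesheet names, and the sitetree source below are stand-ins for our actual ones):

  <!-- Site-wide dump: walks the sitetree and pulls each page's metadata. -->
  <map:match pattern="siteAggregator.xml">
    <map:generate type="file" src="content/authoring/sitetree.xml"/>
    <map:transform type="xslt" src="xslt/siteAggregator.xsl"/>
    <map:serialize type="xml"/>
  </map:match>

  <!-- newsAggregator doctype: pulls in the dump and filters it against
       this document's own keywords metadata. -->
  <map:match pattern="**.newsAggregator">
    <map:generate src="cocoon:/siteAggregator.xml"/>
    <map:transform type="xslt" src="xslt/newsAggregator2xhtml.xsl"/>
    <map:serialize type="html"/>
  </map:match>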

For a simple example, check out this site: http://lenya.adv.queensu.ca/queens_centre/news.html
You'll notice that this publication has two instances of a newsAggregator (the "news" tab and the "events" tab), each using different filtering conditions and generating its own RSS feed.
  
Now, on a small site, this technique works quite well. It's clean, easy to use, and quite powerful (well-formed RSS feeds are being created by people who couldn't even tell you what RSS stands for).

However, on a larger site, this technique inevitably hurts performance.

I'm looking for a way to cache the dump pipeline, so that we don't have to re-aggregate the entire site every time a newsAggregator doctype is accessed. I've looked into a few options and would appreciate some advice on which to go forward with.

One technique would be to set a timed cache on just that pipeline; however, it looks like the expires cache (as per the Cocoon docs) is only available as of Cocoon 2.1.9, and my IT dept is set on using Cocoon 2.1.7 for now (is 2.1.9 even compatible with Lenya?).
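
For reference, what I was picturing is something like the expires pipeline described for 2.1.9 (a sketch only; we haven't been able to try it on 2.1.7, and the class and parameter names below are as I understand them from the Cocoon docs):

  <!-- Declare the expires pipeline implementation among the sitemap components... -->
  <map:pipes default="caching">
    <map:pipe name="expires"
              src="org.apache.cocoon.components.pipeline.impl.ExpiresCachingProcessingPipeline"/>
  </map:pipes>

  <!-- ...then have the dump pipeline opt into it with a TTL in seconds. -->
  <map:pipeline type="expires">
    <map:parameter name="cache-expires" value="3600"/>
    <map:match pattern="siteAggregator.xml">
      <map:generate type="file" src="content/authoring/sitetree.xml"/>
      <map:transform type="xslt" src="xslt/siteAggregator.xsl"/>
      <map:serialize type="xml"/>
    </map:match>
  </map:pipeline>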

A second technique I've considered would be to use the File Generator (which caches very nicely) to bring in an XML site dump as a file; this file would have to be generated outside of the pipeline, and would be a precondition for the newsAggregator pipeline executing (there's a rough sketch of this after the sub-options below).
This leads me to consider two sub-options:
i) Run a scheduled process that calls the $pubname/siteAggregator.xml URL, then takes the XML output and writes it to a file in the publication's work directory.
I'm not entirely sure how Cocoon's scheduler works, but I suppose I could have a shell script on a cron job doing a curl fetch. I'd love to do it internally in Cocoon if I could.
ii) Somehow leverage Lucene's site dump and use that instead.
I haven't used Lucene yet, so I'm not really sure how to use it in this context. Am I correct in assuming that Lucene has a cron job that generates a dump on a prescribed schedule?
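
To make the File Generator idea concrete, here is roughly what I have in mind for sub-option i (nothing built yet; the paths, the work-directory location, and the cron line are placeholders):

  <!-- newsAggregator reads a pre-generated dump via the File Generator, which
       participates in Cocoon's cache keyed on the file's last-modified time. -->
  <map:match pattern="**.newsAggregator">
    <map:generate type="file"
                  src="context://lenya/pubs/mypub/work/siteAggregator.xml"/>
    <map:transform type="xslt" src="xslt/newsAggregator2xhtml.xsl"/>
    <map:serialize type="html"/>
  </map:match>

  <!-- The dump file itself would be refreshed outside the request pipeline,
       e.g. by a cron job along these lines:
         0 * * * * curl -s -o /path/to/lenya/pubs/mypub/work/siteAggregator.xml \
                        http://lenya.adv.queensu.ca/mypub/siteAggregator.xml
       though as mentioned I'd prefer to drive the refresh from inside Cocoon. -->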

Any advice would be greatly appreciated,
Thanks,

-Tim

-- 
Tim Hannigan                           
Manager, Electronic Communications                
Marketing & Communications      
Queen's University
Kingston ON     K7L 3N6
Phone:  613-533-6000 ext. 74126
Fax:            613-533-6652


