On 8/16/06, Tim Hannigan <[EMAIL PROTECTED]> wrote:
> Solprovider - I've managed to use the aggregate files piece
> (http://solprovider.com/lenya/aggregatefiles) that you
> wrote months back and it's great.
Thanks.  It is nice to have one's work appreciated.

[Summary: We tag documents and dynamically produce a report of
documents with specific tags.  Producing the report is
processing-intensive and we would like to use caching to improve
performance. Lenya-1.2.4.]

> I'm looking for a way to cache the dump pipeline (so that we don't have
> to aggregate the entire site each time a newsAggregator doctype is
> accessed).  I've looked into a few options and would like some advice
> on which to go forward with.
>
> One technique would be to set a timed cache on just that pipeline;
> however, it looks like the expires cache (per the Cocoon docs) is only
> available as of Cocoon 2.1.9, and my IT dept is set on using Cocoon
> 2.1.7 for now (I'm not even sure that 2.1.9 is compatible with Lenya?).
>
> A second technique I've considered would be to use the File Generator
> (which caches very nicely) to bring in an XML site dump as a file; this
> file would have to be generated outside of this pipeline, and would be
> a precondition for the newsAggregator pipeline executing.
> This leads me to consider two sub-options:
> i) Run a scheduled process that calls the $pubname/siteAggregator.xml
> URL, then takes the XML output and writes it to a file in the
> publication's work directory.  I'm not entirely sure how Cocoon's
> scheduler works, but I suppose I could have a shell script on a cron
> job doing a curl.  I'd love to do it internally in Cocoon if I could.
> ii) Somehow leverage Lucene's site dump and use that instead.
> I haven't used Lucene yet, so I'm not really sure how to use it in this
> context.  Am I correct to assume that the Lucene setup has a cron job
> that generates a dump on a prescribed timeline?
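An aside on the 2.1.9 expires cache, since you mention it: going from
the Cocoon docs, it is wired up roughly like the sketch below.  The pipe
name and the cache-expires parameter are as documented there; the match
content is a placeholder.  It really is not available in 2.1.7:

  <!-- Declare the expires pipeline implementation (Cocoon 2.1.9+). -->
  <map:pipes default="caching">
    <map:pipe name="expires"
        src="org.apache.cocoon.components.pipeline.impl.ExpiresCachingProcessingPipeline"/>
  </map:pipes>

  <!-- Cache every match in this pipeline for one hour (3600 seconds). -->
  <map:pipeline type="expires">
    <map:parameter name="cache-expires" value="3600"/>
    <map:match pattern="*/siteAggregator.xml">
      <map:generate src="cocoon:/{1}/dump.xml"/>
      <map:serialize type="xml"/>
    </map:match>
  </map:pipeline>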

This breaks into three functions:
1. Cache the results.
2. Use the cache if it exists.
3. Delete the cache on a schedule.

The first two functions are built into publication-sitemap.xmap in
Lenya-1.2.  In 1.2.4 they were disabled by adding "disabled" to the
pipeline's match pattern.

Another example is at:
http://solprovider.com/lenya/cache

That version also handles not caching pages when the visitor is logged
in or when there is a query string.  None of the expanded functionality
matters in your case, but it shows the important lines from the
standard publication-sitemap.xmap.  See the "Check Cache" and "Create
Cache" commented sections.

map:read is easy.  The SourceWritingTransformer (declared as
"write-source" in the sitemap) is more complicated; it is documented at:
http://cocoon.apache.org/2.1/userdocs/sourcewriting-transformer.html
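The gist is that the transformer acts on instructions embedded in the
stream it receives, so addSourceTags.xsl has to wrap the page in markup
like this (element names and namespace are per that Cocoon page; the
target path is whatever you pass in):

  <source:write xmlns:source="http://apache.org/cocoon/source/1.0">
    <source:source>work/cache/news.html</source:source>
    <source:fragment>
      <!-- the page content to be written goes here -->
    </source:fragment>
  </source:write>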

You may want to change the cache directory.  You may need custom
addSourceTags.xsl and removeSourceTags.xsl.  Or maybe it will just
work.
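If you do end up writing your own, here is one self-consistent pair,
using a hypothetical <cached> wrapper element of my own invention.  It
keeps a second copy of the content outside the source:write block so
there is still something to serialize after the transformer replaces
that block with its result report; the files shipped with Lenya differ
in detail:

  <!-- addSourceTags.xsl (sketch): emit the page once inside a
       source:write instruction for the cache file, and once for the
       response itself. -->
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:source="http://apache.org/cocoon/source/1.0">
    <xsl:param name="file"/>
    <xsl:template match="/">
      <cached>
        <source:write>
          <source:source><xsl:value-of select="$file"/></source:source>
          <source:fragment><xsl:copy-of select="node()"/></source:fragment>
        </source:write>
        <xsl:copy-of select="node()"/>
      </cached>
    </xsl:template>
  </xsl:stylesheet>

  <!-- removeSourceTags.xsl (sketch): drop the transformer's result
       report and unwrap <cached>, leaving only the page content. -->
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:source="http://apache.org/cocoon/source/1.0">
    <xsl:template match="source:*"/>
    <xsl:template match="cached">
      <xsl:apply-templates/>
    </xsl:template>
    <xsl:template match="@*|node()">
      <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
    </xsl:template>
  </xsl:stylesheet>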

---
#3 may require thought.  I have not used Lenya's Scheduler; my few
attempts did not work, and I did not put much effort into it.  Maybe
someone else can assist with it.

#3 can be solved easily with a cron job that just deletes the files
from the cache (assuming you are using a real operating system).  That
should take a shell programmer almost 30 seconds.  Once the files are
deleted, the "Check Cache" step fails and "Create Cache" is called.
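For example, a crontab entry along these lines would do it.  The path
is a placeholder for your publication's cache directory; the second,
commented-out line is only relevant if you pursue your option (i) and
pre-generate the dump instead:

  # Clear the aggregation cache at the top of every hour.
  0 * * * * rm -f /path/to/yourpub/work/cache/*

  # Option (i) variant: pre-generate the site dump instead of deleting it.
  # 0 * * * * curl -s http://localhost:8080/lenya/yourpub/siteAggregator.xml > /path/to/yourpub/work/cache/siteAggregator.xml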

solprovider
