John D. Hardin writes:
> On Tue, 22 Jan 2008, George Georgalis wrote:
> 
> > On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote:
> >
> > >Neither am I. Another thing to consider is that the fraction of
> > >defined rules that actually hit and affect the score is rather
> > >small. The greatest optimization would be to not test REs you know
> > >will fail; but how do you do *that*?
> > 
> > Thanks for all the follow-ups on my inquiry. I'm glad the topic
> > is/was considered, and it looks like there is some room for
> > development, but I now realize it is not as simple as I thought it
> > might have been. In answer to the above question, maybe the tests
> > need their own scoring? E.g. fast tests with big spam scores get a
> > higher test score than slow tests with low spam scores.
> > 
> > Maybe if there were some way to establish a hierarchy at startup
> > which groups rule processing into nodes: some nodes finish
> > quickly, some have dependencies, some are negative, etc.
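
In rough Python, that scoring idea might look something like the
sketch below. The rule names, scores, and per-rule costs are invented
for illustration, and negative-scoring rules are ignored for
simplicity:

    import re

    # Hypothetical rule table: (name, compiled regex, spam score,
    # average match cost in milliseconds).
    RULES = [
        ("FAST_HIGH", re.compile(r"viagra"),          3.0, 0.1),
        ("SLOW_LOW",  re.compile(r"(?:\w+\s+){200}"), 0.5, 5.0),
    ]

    def scan(body, threshold=5.0):
        # Evaluate rules in descending score-per-millisecond order,
        # so cheap, high-impact tests run first, and stop as soon as
        # the message is already over the spam threshold.
        total = 0.0
        for name, rx, score, cost in sorted(
                RULES, key=lambda r: r[2] / r[3], reverse=True):
            if rx.search(body):
                total += score
            if total >= threshold:
                break
        return total
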
> 
> Loren mentioned to me in a private email: "common subexpressions".
> 
> It would be theoretically possible to analyze all the rules in a given
> set (e.g. body rules) to extract common subexpressions and develop a
> processing/pruning tree based on that. You'd probably gain some
> performance scanning messages, but at the cost of how much
> startup/compiling time?
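
As a crude illustration of one cheap variant of that pruning (not how
sa-compile actually works): pull a required literal prefix out of each
rule's regex at startup, then skip the full regex match whenever that
literal is absent from the message. The literal extraction below is
deliberately naive; a real implementation would need a proper regex
parser to find genuinely common subexpressions.

    import re

    METACHARS = set(r"\^$.|?*+()[]{}")

    def required_literal(pattern):
        # Leading characters that must appear verbatim in any match;
        # stop at the first regex metacharacter. Returns "" when no
        # literal prefix can be extracted (no pruning possible).
        literal = ""
        for ch in pattern:
            if ch in METACHARS:
                break
            literal += ch
        return literal.lower()

    def compile_rules(patterns):
        # All rules compiled case-insensitively here, so the
        # lowercased literal test below stays consistent.
        return [(re.compile(p, re.I), required_literal(p))
                for p in patterns]

    def matching_rules(body, compiled):
        lowered = body.lower()
        for rx, lit in compiled:
            # A cheap substring test prunes rules that can't possibly
            # hit; only the survivors pay for a full regex match.
            if lit and lit not in lowered:
                continue
            if rx.search(body):
                yield rx.pattern

The win depends entirely on how often the cheap substring test
actually prunes a rule on real traffic.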

I experimented with this concept in my sa-compile work, but I could
not achieve any speedup on real-world mixed spam/ham datasets.

Feel free to give it a try though ;)

--j.
