John D. Hardin writes:
> On Tue, 22 Jan 2008, George Georgalis wrote:
>
> > On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote:
> >
> > > Neither am I. Another thing to consider is the fraction of defined
> > > rules that actually hit and affect the score is rather small. The
> > > greatest optimization would be to not test REs you know will fail;
> > > but how do you do *that*?
> >
> > thanks for all the follow-ups on my inquiry. I'm glad the topic is/was
> > considered and it looks like there is some room for development, but
> > I now realize it is not as simple as I thought it might have been.
> > In answer to the above question, maybe the tests need their own
> > scoring? e.g. fast tests with big spam scores get a higher test score
> > than slow tests with low spam scores.
> >
> > maybe if there were some way to establish a hierarchy at startup
> > which groups rule processing into nodes. some nodes finish
> > quickly, some have dependencies, some are negative, etc.
>
> Loren mentioned to me in a private email: "common subexpressions".
>
> It would be theoretically possible to analyze all the rules in a given
> set (e.g. body rules) to extract common subexpressions and develop a
> processing/pruning tree based on that. You'd probably gain some
> performance scanning messages, but at the cost of how much
> startup/compiling time?
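To make the "tests need their own scoring" suggestion concrete, here is a
minimal sketch in Python (SpamAssassin itself is Perl, and the rule names,
scores, and per-rule timings below are invented for illustration): rank
each rule by the spam points it can contribute per second of CPU time, so
a scanner that short-circuits at a threshold runs the cheap, high-impact
tests first.

    # Hypothetical rules: (name, spam score, measured avg cost in seconds).
    # None of these are real SpamAssassin rules or timings.
    rules = [
        ("FAST_HIGH_SCORE", 3.5, 0.0001),
        ("FAST_LOW_SCORE",  0.1, 0.0001),
        ("SLOW_HIGH_SCORE", 3.0, 0.0100),
        ("SLOW_LOW_SCORE",  0.2, 0.0200),
    ]

    # "Test score" = spam points contributed per second of CPU time;
    # higher means the rule should be tried earlier.
    def priority(rule):
        name, score, cost = rule
        return abs(score) / cost

    for name, score, cost in sorted(rules, key=priority, reverse=True):
        print(f"{name}: {abs(score) / cost:,.0f} points/sec")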
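One simple, concrete form of the common-subexpression/pruning idea is a
literal prefilter: if every regex is known to require some literal
substring, a single cheap substring scan over the message can prune away
the REs that cannot possibly match, and only the survivors are executed.
Automatically extracting required literals (or shared subexpressions)
from arbitrary REs is exactly the startup/compile-time analysis being
asked about; the Python sketch below sidesteps that by declaring the
literals by hand, and the rules themselves are invented:

    import re

    # Hypothetical rules, each paired with a literal substring that any
    # match must contain (declared by hand here; deriving these from the
    # REs automatically is the hard, startup-time part).
    rules = {
        "VIAGRA_OBFU": (re.compile(r"v[i1]agra", re.I), "agra"),
        "LOTTERY_WIN": (re.compile(r"you\s+have\s+won", re.I), "won"),
        "DEAR_FRIEND": (re.compile(r"^dear\s+friend", re.I | re.M), "friend"),
    }

    def matching_rules(body):
        lowered = body.lower()
        hits = []
        for name, (regex, literal) in rules.items():
            # Cheap substring test first; skip the RE entirely if its
            # required literal is absent from the message.
            if literal in lowered and regex.search(body):
                hits.append(name)
        return hits

    # "w0n" defeats the LOTTERY_WIN literal, so that RE never runs.
    print(matching_rules("Dear friend, you have w0n our lottery!"))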
I experimented with this concept in my sa-compile work, but I couldn't
achieve any speedup on real-world mixed spam/ham datasets. Feel free to
give it a try though ;)

--j.