Jim Maul writes:
> Justin Mason wrote:
> > John D. Hardin writes:
> >> On Tue, 22 Jan 2008, George Georgalis wrote:
> >>
> >>> On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote:
> >>>
> >>>> Neither am I. Another thing to consider is the fraction of defined
> >>>> rules that actually hit and affect the score is rather small. The
> >>>> greatest optimization would be to not test REs you know will fail;
> >>>> but how do you do *that*?
> >>> thanks for all the followups on my inquiry. I'm glad the topic is/was
> >>> considered and it looks like there is some room for development, but
> >>> I now realize it is not as simple as I thought it might have been.
> >>> In answer to above question, maybe the tests need their own scoring?
> >>> eg fast tests and with big spam scores get a higher test score than
> >>> slow tests with low spam scores.
> >>>
> >>> maybe if there was some way to establish a hierarchy at startup
> >>> which groups rule processing into nodes. some nodes finish
> >>> quickly, some have dependencies, some are negative, etc.
> >> Loren mentioned to me in a private email: "common subexpressions".
> >>
> >> It would be theoretically possible to analyze all the rules in a given
> >> set (e.g. body rules) to extract common subexpressions and develop a
> >> processing/pruning tree based on that. You'd probably gain some
> >> performance scanning messages, but at the cost of how much
> >> startup/compiling time?
> >
> > I experimented with this concept in my sa-compile work, but I could
> > achieve any speedup on real-world mixed spam/ham datasets.
> >
> > Feel free to give it a try though ;)
> >
> > --j.
> >
> >
>
> You do mean *couldn't* achieve any speedup, correct?
yep
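
For readers wondering what the "common subexpression" pruning idea discussed above might look like in practice, here is a minimal sketch in Python (SpamAssassin itself is Perl). The rule names and patterns are hypothetical, and the "longest literal run" heuristic is deliberately crude; it assumes each pattern contains a required literal substring, which real rules with alternation would not guarantee.

```python
import re
from collections import defaultdict

# Hypothetical body rules: name -> regex source (not real SpamAssassin rules).
RULES = {
    "VIAGRA_OBFU":   r"v[i1]agra\s+pills",
    "VIAGRA_PRICE":  r"viagra.{0,20}\$\d+",
    "LOTTERY_WIN":   r"you\s+have\s+won\s+.{0,30}lottery",
    "LOTTERY_CLAIM": r"claim\s+your\s+lottery\s+prize",
}

def longest_literal(pattern):
    """Crude common-subexpression extraction: longest run of literal letters.
    Only safe when that run is a required substring of any match."""
    runs = re.findall(r"[a-z]{3,}", pattern)
    return max(runs, key=len) if runs else ""

# Build the pruning index once at startup: shared literal -> rules using it.
INDEX = defaultdict(list)
for name, pat in RULES.items():
    INDEX[longest_literal(pat)].append((name, re.compile(pat, re.I)))

def scan(body):
    """Run only the rules whose cheap literal pre-test actually hits."""
    hits = []
    lowered = body.lower()
    for literal, rules in INDEX.items():
        if literal and literal not in lowered:
            continue  # one substring check prunes the whole group of REs
        for name, rx in rules:
            if rx.search(body):
                hits.append(name)
    return hits

print(scan("You have WON the national lottery! Claim your lottery prize now."))
# ['LOTTERY_WIN', 'LOTTERY_CLAIM']
```

The trade-off the thread describes shows up directly here: the index costs extra work at startup, and whether the per-message substring checks pay for themselves depends on how many rule groups can actually be skipped on real mixed spam/ham traffic.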