hi David -- While I doubt it'd have quite the performance gains that A-C can offer, Regexp::Assemble certainly sounds like something worth trying... the coderef trick, in particular, is very nifty.
--j. David Landgren writes: > Justin Mason wrote: > > There's an interesting discussion on my weblog at > > http://taint.org/2006/07/07/184022a.html , regarding the > > recently-contributed "trie" and Aho-Corasick regexp optimizations, which > > have been added to perl 5.9.x. > > > > These could provide a way to get *serious* speedups in SpamAssassin. > > Note that these work really well on fixed text. Once you start adding > STAR, PLUS and CURLY the gain diminishes. > > > If anyone is interested in working on speedups, this would be a really > > great place to do it; body rules are by far the biggest CPU time sink in > > our code. > > One source of slowness is that a lot of SA pattern writers use > (?=[abcd]) zwla's in the belief, I think, to supply more information to > the engine, but even in older perls this was always a loss. You have to > have a something like 95-99% of strings not matching /[abcd]/ for the > benefit to kick in, at least that has been my experience. > > I think the biggest problem is the fact that the scanning algorithm > consists of applying one pattern after the other to see what it triggers. > > I wrote Regexp::Assemble to allow one to take a large slab of patterns > and concoct a single expression (also structured as a trie, > incidentally) as a result. The benefit is that all the common paths are > shared, so all the expensive curlies and stars are visited only one. > > For instance, with /a+b+c+/ and /a+c+e+c+/ the result is > /a+(c+e+|b+)c+/, so if you are give aaaafc, you don't have to try the > first pattern, fail and then the second pattern and fail. As soon as the > alternation in the middle of the resulting pattern above fails, there's > no backtracking and so you stop. When you have three thousand patterns > instead of two, this starts to become very interesting. > > Regexp::Assemble also has a mode whereby you can build up a hash of > pattern => coderef pairs, assemble all the patterns into a single > pattern, and then feed that the body text. > > This then means that your entire scan of the body just becomes a single > //g pattern match, and each time something matches, you can go back to > the dispatch table and call out the action according to what just > matched. The more patterns you have, the more efficient it becomes. > > I'd be happy to explain this in more detail if this sounds like > something you want to explore. > > David > -- > Much of the propaganda that passes for news in our own society is given > to immobilising and pacifying people and diverting them from the idea > that they can confront power. -- John Pilger