> -----Original Message-----
> From: Ted Mittelstaedt [mailto:t...@ipinc.net]
> Sent: 2009-10-10 02:40
> To: Marc Perkel
> Cc: users@spamassassin.apache.org
> Subject: Re: SA needs a new paradigm for rule structure
>
> Marc Perkel wrote:
> > I've brought this idea up over the years but I'll try to explain it in a
> > different way. Maybe we can do this with a lot of meta rules.
> >
> > What we need are rules that combine a lot of simple rules into concepts
> > and then combine those rules into rules that score - and score big. As
> > an example, let's take a standard Nigerian scam email.
> >
> > From <> reply to:
> >
> > [I don't know you] Dear stranger, I am mr, ms. mrs. my name is
> >
> > [I am connected] I am a soldier in Iraq, I am the daughter of an
> > african president, I work at a bank in Hong Kong
> >
> > [I have money] I have the sum of 56 million dollars USD
> >
> > [the money is hot] no beneficiaries, sneak it out of the country,
> > oppressive regime
> >
> > [transfer to your account] splitting the funds, wire to your account
> >
> > [I need your information] name, address, account number
> >
> > [I want you to contact me] by email, phone
> >
> > [keep this a secret] confidential discretion
> >
> > So - we create a lot of simple rules with no points with key words and
> > phrases and then combine these rules using meta rules to get these
> > concepts. That way we have a meta rule like, "they don't know me" "that
> > are talking about transferring millions" "they want my information"
> > "they are talking about hot money". Then you combine those concepts into
> > rules that can definitively determine it is spam.
> >
> > And - I am still looking for someone who might do Bayesian or some other
> > automatic system that looks for rule combinations and increases scores
> > based on that.
> >
>
> I know that it seems like the idea of building up "meta" rules with
> a lot of small rules will give you a more accurate hit rate, but
> this is one of those non-intuitive things that can be shown by
> statistical mathematics, that is that the concept won't work. Or
> rather, it won't work any better than the existing paradigm.
>
> In other words, the current system of assigning little points to
> a lot of little rules will yield the same result for any given
> set of spam messages as organizing all
> these small rules into groups that have bigger point values.
>
> The only thing the organization does is help humans understand
> what is going on better. This is because how humans think about
> math like statistics is a lot different than how a computer
> works with mathematics like statistics.
>
> Ted
I thought I remembered from a few years back that Bayesian chains had a 10% increase in capture rate over straight Bayes rules. I would think that this is similar.

The problem with meta rules is that they can be fooled by a single change. Hit 4 out of 5 and you don't get the 7.0 score because the spammer changed one single thing. But with single rules, at least those 4 things would have scored. You would need to constantly tweak the meta rules.

I like the idea, and have thought about it before. I understand Ted's point on the statistics. I think it can be made better, but not with the current SA code. And I know the old quote from JM, "All code samples are always welcome." :-) So I hope to one day get something written to try.

--Chris
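
For illustration, the 4-out-of-5 case can be sketched in a few lines of Python. This is not SA code, just a rough model of the scoring behaviour. The sub-rule names, the 1.0 per-rule scores, and the `threshold_meta_score` variant are all made up for the example; only the 7.0 meta score comes from the message above.

```python
# Hypothetical sub-rule names and per-rule scores, invented for illustration.
SUB_RULES = {
    "DEAR_STRANGER": 1.0,
    "HAS_MONEY": 1.0,
    "HOT_MONEY": 1.0,
    "WIRE_TRANSFER": 1.0,
    "KEEP_SECRET": 1.0,
}
META_SCORE = 7.0  # the "big" score awarded by the combined rule


def additive_score(hits):
    """Sum the individual scores of every sub-rule that hit."""
    return sum(SUB_RULES[name] for name in hits)


def strict_meta_score(hits):
    """Score only when every sub-rule hit (a logical AND of all of them)."""
    return META_SCORE if set(SUB_RULES) <= set(hits) else 0.0


def threshold_meta_score(hits, needed=4):
    """Score when at least `needed` sub-rules hit, so one miss is tolerated."""
    matched = len(set(hits) & set(SUB_RULES))
    return META_SCORE if matched >= needed else 0.0


# The spammer changed one thing, so only four of the five sub-rules hit.
four_of_five = ["DEAR_STRANGER", "HAS_MONEY", "HOT_MONEY", "WIRE_TRANSFER"]
print(additive_score(four_of_five))        # 4.0
print(strict_meta_score(four_of_five))     # 0.0
print(threshold_meta_score(four_of_five))  # 7.0
```

The strict AND version drops to zero on a single miss, while the individual rules still contribute 4.0. A count-based threshold is one possible way to keep the big score without that brittleness; whether it can be expressed cleanly in SA meta rules should be checked against the SA configuration docs rather than taken from this sketch.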