Re: SpamAssassin Ruleset Generation
On Tue, 2009-10-06 at 13:50 -0700, an anonymous Nabble user wrote:

> Other than the sought rules, are all the rules manually generated?

Actually, as has been said, I believe all stock rules are manually written. There are some third-party rule-sets out there that are auto-generated -- not limited to Sought.

> Are there any statistics on how frequently new rules/regexes are adopted by SpamAssassin? Who are the people who write them? Any details related to it?

Somehow this raises the question -- why? Why are you asking? What are you ultimately interested in?

And of course, did you even consider digging through the SVN repo, reading some docs on the wiki, and asking Google? Most of this should be pretty easy to find out if you're willing to read some.

-- 
char *t=\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4; main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;il;i++){ i%8? c=1: (c=*++x); c128 (s+=h); if (!(h=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
SpamAssassin Ruleset Generation
I have a question about understanding how rulesets are generated for SpamAssassin. For example, consider this rule in 20_drugs.cf:

    header SUBJECT_DRUG_GAP_C Subject =~ /\bc.{0,2}i.{0,2}a.{0,2}l.{0,2}i.{0,2}s\b/i
    describe SUBJECT_DRUG_GAP_C Subject contains a gappy version of 'cialis'

Who generated the regular expression /\bc.{0,2}i.{0,2}a.{0,2}l.{0,2}i.{0,2}s\b/i?

a. Is it done manually, with people writing regexes and checking how efficiently they capture spam?
b. Is there an algorithm that takes a large corpus of spam and then comes up with these regexes on its own?
c. Is it a combination of (a) and (b)?

I know scores for rules are generated using a neural network trained with error back propagation:
http://wiki.apache.org/spamassassin/HowScoresAreAssigned

But how are the rules themselves generated?

Thnx

-- 
View this message in context: http://www.nabble.com/SpamAssassin-Ruleset-Generation-tp25773508p25773508.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.
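[Editor's illustration] To see what the quoted rule actually catches, here is a minimal sketch that runs the same regular expression against a few subjects. It uses Python's re module as a stand-in for the Perl regex engine SpamAssassin uses; for this particular pattern the two should behave the same, but that is an assumption. The function name subject_hits and the sample subjects are made up for illustration.

```python
import re

# Pattern copied from the SUBJECT_DRUG_GAP_C rule quoted above.
# The /i flag on the rule corresponds to re.IGNORECASE here.
GAPPY_CIALIS = re.compile(
    r"\bc.{0,2}i.{0,2}a.{0,2}l.{0,2}i.{0,2}s\b", re.IGNORECASE
)

def subject_hits(subject: str) -> bool:
    """Return True if the subject contains a gappy spelling of 'cialis'."""
    return GAPPY_CIALIS.search(subject) is not None

# Hypothetical subjects, invented for this example.
samples = [
    "Cheap C-i-a-l-i-s here",
    "c i a l i s at low prices",
    "Meeting notes for Tuesday",
    "Ask our specialist",
]

for s in samples:
    print(subject_hits(s), "|", s)
```

Each `.{0,2}` lets up to two arbitrary characters sit between consecutive letters, and the `\b` anchors keep the pattern from firing in the middle of a longer word such as "specialist".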
Re: SpamAssassin Ruleset Generation
On Tue, 6 Oct 2009 11:08:28 -0700 (PDT) poifgh abhinav.pat...@gmail.com wrote:

> I have a question about understanding how rulesets are generated for ...
>
> a. Is it done manually, with people writing regexes and checking how efficiently they capture spam?
> b. Is there an algorithm that takes a large corpus of spam and then comes up with these regexes on its own?
> c. Is it a combination of (a) and (b)?

The optional sought rules are autogenerated, the rest are manual.
Re: SpamAssassin Ruleset Generation
RW-15 wrote:

> On Tue, 6 Oct 2009 11:08:28 -0700 (PDT) poifgh abhinav.pat...@gmail.com wrote:
>
>> I have a question about understanding how rulesets are generated for ...
>>
>> a. Is it done manually, with people writing regexes and checking how efficiently they capture spam?
>> b. Is there an algorithm that takes a large corpus of spam and then comes up with these regexes on its own?
>> c. Is it a combination of (a) and (b)?
>
> The optional sought rules are autogenerated, the rest are manual.

Thnx - What are optional sought rules?
Re: SpamAssassin Ruleset Generation
poifgh wrote:

> RW-15 wrote:
>
>> On Tue, 6 Oct 2009 11:08:28 -0700 (PDT) poifgh abhinav.pat...@gmail.com wrote:
>>
>>> I have a question about understanding how rulesets are generated for ...
>>>
>>> a. Is it done manually, with people writing regexes and checking how efficiently they capture spam?
>>> b. Is there an algorithm that takes a large corpus of spam and then comes up with these regexes on its own?
>>> c. Is it a combination of (a) and (b)?
>>
>> The optional sought rules are autogenerated, the rest are manual.
>
> Thnx - What are optional sought rules?

http://www.google.com/search?q=spamassassin+sought

-- 
Bowie
Re: SpamAssassin Ruleset Generation
Bowie Bailey wrote:

> http://www.google.com/search?q=spamassassin+sought

:-D - Thnx
Re: SpamAssassin Ruleset Generation
poifgh wrote:

> Bowie Bailey wrote:
>
>> http://www.google.com/search?q=spamassassin+sought
>
> :-D - Thnx

Other than the sought rules, are all the rules manually generated? Are there any statistics on how frequently new rules/regexes are adopted by SpamAssassin? Who are the people who write them? Any details related to it?

thnx
Re: SpamAssassin Ruleset Generation
Hi,

> Other than the sought rules, are all the rules manually generated? Are there any statistics on how frequently new rules/regexes are adopted by SpamAssassin? Who are the people who write them? Any details related to

Information on Justin Mason's SOUGHT rules is here:
http://taint.org/2007/08/15/004348a.html

Use sa-update to update your SA rules once or twice per day with the new stuff.

His ongoing development work is here:
http://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/sandbox/jm/?sortby=date

HTH,
Alex
Re: SpamAssassin Ruleset Generation
On Tue, 6 Oct 2009, poifgh wrote:

> Other than the sought rules, are all the rules manually generated? Are there any statistics on how frequently new rules/regexes are adopted by SpamAssassin? Who are the people who write them? Any details related to it?

Most of the rules are manually written by contributors such as myself. Some meta rules are generated by various means from existing rules - for example, the ADVANCE_FEE rules are generated using genetic algorithms to find effective combinations of simpler subrules that were manually generated.

New rules are added whenever a contributor works on them, and this is generally based on when they have time to do so, when they have new ideas, and when new forms of spam appear. Indirect contributors will post rules to the users list, and a contributor may add them to the rules sandbox for testing and eventual inclusion in the base ruleset.

The CREDITS file in the sources should list all of the contributors. Some contributors may not have added their names to that file, though.

-- 
John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
5 days since a sunspot last seen - EPA blames CO2 emissions
Re: SpamAssassin Ruleset Generation
poifgh wrote:

> I have a question about understanding how rulesets are generated for SpamAssassin. For example, consider this rule in 20_drugs.cf:
>
>     header SUBJECT_DRUG_GAP_C Subject =~ /\bc.{0,2}i.{0,2}a.{0,2}l.{0,2}i.{0,2}s\b/i
>     describe SUBJECT_DRUG_GAP_C Subject contains a gappy version of 'cialis'
>
> Who generated the regular expression /\bc.{0,2}i.{0,2}a.{0,2}l.{0,2}i.{0,2}s\b/i?

Man, that's a good question. I wrote a large chunk of the rules in 20_drugs.cf, but not that one. (I wrote the stuff near the bottom that uses meta rules, i.e. __DRUGS_ERECTILE1 through DRUGS_MANYKINDS, originally distributed as a separate set called antidrug.cf.)

As I recall, there were 2 other people making drug rules, but it's been a LONG time, and I forget who did it. Those rules were written in the 2004-2006 time frame, when pharmacy spams were just hammering the heck outa everyone.

> a. Is it done manually, with people writing regexes and checking how efficiently they capture spam?

Yes. Many hours of reading spams, studying them, testing various regex tweaks, checking for false positives, etc. mass-check is your friend for this kind of stuff.

One post from when I was developing this as a stand-alone set:
http://mail-archives.apache.org/mod_mbox/spamassassin-users/200404.mbox/%3c6.0.0.22.0.20040428132346.029d9...@opal.evi-inc.com%3e

Note: the comcast link mentioned in that message should be considered DEAD. The antidrug set is no longer maintained separately from the mainline ruleset, and hasn't been for years.

If you want to break the rules down a bit, here are some tips:

The rules are in general designed to detect common methods of obscuring text by inserting spaces, punctuation, etc. between letters, and possibly substituting similar-looking characters for some of the letters (W4R3Z style, etc.).

The simplest way to think of it is in groupings. You end up using a repeating pattern of (some representation of a character)(some kind of gap sequence)(character)(gap)...etc.
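[Editor's illustration] The (character)(gap)(character)(gap) structure described above can be sketched as a small pattern builder. This is only an illustration of the idea, not how the stock rules were produced (the thread says those were written by hand). The helper name gappy_pattern and the look-alike classes in SUBS are hypothetical, and Python's re module stands in for Perl regexes.

```python
import re

# Hypothetical look-alike classes, chosen only for illustration.
SUBS = {
    "a": "[a4@]",
    "i": "[i1!|l]",
}

def gappy_pattern(word, gap=r".{0,2}", subs=None):
    """Build (char)(gap)(char)(gap)... for `word`, anchored on word boundaries.

    Each letter becomes either its substitution class from `subs` or the
    escaped literal letter; the `gap` sub-pattern is placed between letters.
    """
    subs = subs or {}
    units = [subs.get(ch.lower(), re.escape(ch)) for ch in word]
    return r"\b" + gap.join(units) + r"\b"

# With the default gap and no substitutions this is a plain gappy match.
plain = gappy_pattern("cialis")
print(plain)

# With substitutions, digit and symbol look-alikes are accepted too.
fancy = re.compile(gappy_pattern("cialis", subs=SUBS), re.IGNORECASE)
for text in ["buy C1AL1S now", "c.i.@.l.i.s", "commercial use"]:
    print(bool(fancy.search(text)), "|", text)
```

With the `.{0,2}` gap and no substitutions, the builder reproduces the regex of the SUBJECT_DRUG_GAP_C rule quoted at the top of this message; swapping in another gap or substitution table changes how aggressive the pattern is.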
.{0,2} is a gap sequence, although not one I prefer. I prefer [_\W]{0,3} in most cases because it's a bit less FP-prone, but it risks missing things that use small lower-case letters as gaps.

You also get replacements for characters in some of those, like [A4] instead of just A. Or, more elaborately: [a4\xe0-\...@]

So this mess:

    body __DRUGS_ERECTILE1 /(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[ij1!|l\xEC\xED\xEE\xEF][_\W]{0,3}[a40\xe0-\...@][_\W]{0,3}[xyz]?[gj][_\W]{0,3}r[_\W]{0,3}[a40\xe0-\...@][_\W]{0,3}x?[_\W]{0,3}(?:\b|\s)/i

could be broken down:

    (?:\b|\s)                  - preamble, detecting space or word boundary
    [_\W]{0,3}                 - gap
    (?:\\\/|V)                 - V
    [_\W]{0,3}                 - gap
    [ij1!|l\xEC\xED\xEE\xEF]   - I
    [_\W]{0,3}                 - gap
    [a40\xe0-\...@]            - A
    [_\W]{0,3}                 - gap
    [xyz]?[gj]                 - G (with optional extra garbage before it)
    [_\W]{0,3}                 - gap
    r                          - just R :-)
    [_\W]{0,3}                 - gap
    [a40\xe0-\...@]            - A
    [_\W]{0,3}                 - gap
    x?                         - optional garbage
    [_\W]{0,3}                 - gap
    (?:\b|\s)                  - suffix, detecting space or word boundary

Which detects weird spacings and substitutions in the word Viagra.

> But how are the rules themselves generated?

Mostly meatware, except the sought rules others have mentioned.

> Thnx
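[Editor's illustration] The trade-off between the two gap styles mentioned in this message can be shown with a small comparison. This is a sketch under assumptions: Python's re module stands in for Perl regexes, the target word is the 'cialis' example from earlier in the thread, and the sample strings are invented. It is not a substitute for running mass-check against real mail.

```python
import re

WORD = "cialis"
LOOSE_GAP = r".{0,2}"        # any character may fill the gap, letters included
STRICT_GAP = r"[_\W]{0,3}"   # only underscores and non-word characters

def build(gap: str) -> re.Pattern:
    """Join the letters of WORD with the given gap sub-pattern."""
    return re.compile(r"\b" + gap.join(WORD) + r"\b", re.IGNORECASE)

loose, strict = build(LOOSE_GAP), build(STRICT_GAP)

# Hypothetical test strings: two obfuscated spellings, one spelling that
# uses lower-case letters as gaps, and one innocent phrase.
samples = [
    "c.i.a.l.i.s",
    "c_i_a_l_i_s",
    "cxixaxlxixs",
    "The capital is Paris",
]

for text in samples:
    print(f"{text!r:26} loose={bool(loose.search(text))} "
          f"strict={bool(strict.search(text))}")
```

The loose gap also fires on the innocent phrase, because "capital is" contains c, i, a, l, i, s in order with at most two characters between them. The strict gap avoids that false positive but misses the spelling padded with lower-case letters, which is the risk described above.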