Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
On Thu, Mar 5, 2009 at 00:23, decoder deco...@own-hero.net wrote:

decoder wrote: Justin Mason wrote: So you're volunteering to code it up, then? ;)

I was planning to do at least some brainstorming + experiments as to what learning methods would seem suitable and how well the method performs, whenever I have time again. Unless someone else did that already?

Ok, I did some short experiments: I've built an SVM classifier from a large mail corpus (8226 mails: 5414 ham, 2812 spam) and did a 5-fold cross-validation. The resulting classifier has an accuracy of over 99%, so it performs as well as the regular system. I then applied it to a set of 202 false negatives that I had collected, and 69 of these are recognized as spam by the SVM. As a second test, I pulled 2707 mails from one of my other inboxes and applied the classifier; the accuracy was again over 99% (and this is only ham).

From my point of view, the results show that this approach has potential. It is highly accurate with respect to the current system, and additionally it outperformed the current system on several false negatives. This system has other advantages over the common one: it allows everybody to train the whole spam filter (not only Bayes) on the kind of spam that one receives, i.e. it is more adaptive than the common system. Any opinions on this are greatly welcome. Maybe we should try to come up with a proof-of-concept plugin for SA?

Thanks for doing this! A couple of q's:

1. I can offer a bigger ham/spam corpus if you'd like to test against that as well; corpora from multiple contributors can sometimes expose training set bias.

2. can you test it on spam that scored less than 10 points when it arrived? low-scoring spam is, of course, more useful to hit than stuff that scored highly on the existing rules.

3. does it give an indication of confidence in its results? or just a binary spam/ham decision?

4. hey, if you're writing an SVM plugin, it might be worth making one that _also_ supports body text tokens, similarly to the existing Bayes plugin. ;)

5. btw, one particularly tricky part of dealing with user-trainable dbs is supporting expiry of old tokens. but that can be deferred until later anyway.

--j.
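decoder's setup (mining the SA headers for triggered rules and feeding them to an SVM as features) can be sketched roughly like this. The header format follows SpamAssassin's stock X-Spam-Status convention, but the rule vocabulary and the vectorizer are hypothetical stand-ins for whatever the PoC actually does:

```python
import re

def rules_from_header(header_value):
    """Extract the list of rule names from an X-Spam-Status header value.

    Assumes the stock SpamAssassin format, e.g.
    'Yes, score=12.1 required=5.0 tests=BAYES_99,URIBL_BLACK autolearn=no'
    """
    m = re.search(r'tests=([A-Z0-9_,\s]+)', header_value)
    if not m:
        return []
    return [t.strip() for t in m.group(1).split(',') if t.strip()]

def to_vector(rule_names, vocabulary):
    """Binary feature vector over a fixed rule vocabulary (SVM input)."""
    hits = set(rule_names)
    return [1 if rule in hits else 0 for rule in vocabulary]

# Hypothetical vocabulary of rules seen during training
vocab = ['BAYES_99', 'RCVD_IN_XBL', 'URIBL_BLACK', 'ALL_TRUSTED']
hdr = 'Yes, score=12.1 required=5.0 tests=BAYES_99,URIBL_BLACK autolearn=no'
print(to_vector(rules_from_header(hdr), vocab))  # [1, 0, 1, 0]
```

Each mail then becomes one fixed-length vector, which is what a standard SVM package such as libsvm expects as training input.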
Re: Bye Bye Bayes
On 04.03.09 06:17, John Hardin wrote: I used to have a couple of users who treated their Trash folder as long-term read-message storage. After reading most messages they'd move them to Trash, and _never_ _purge_ _it_. I couldn't break them of this habit, even after purging their Trash folder from the server a couple of times. (Oops! Disk failure! Well, that was trash, you can afford to lose that.)

We set up courier's imap server to remove files after being in trash for more than 7 days... Luckily, we documented that a long time ago, so they cannot complain...

-- Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Boost your system's speed by 500% - DEL C:\WINDOWS\*.*
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
Justin Mason wrote: Thanks for doing this! A couple of q's:

1. I can offer a bigger ham/spam corpus if you'd like to test against that as well; corpora from multiple contributors can sometimes expose training set bias.

That would be cool :) Is this corpus already processed by spamassassin (i.e. has SA headers)? My PoC code currently mines only the headers to find out what rules are triggered.

2. can you test it on spam that scored less than 10 points when it arrived? low-scoring spam is, of course, more useful to hit than stuff that scored highly on the existing rules.

Things like that should be easily possible. I need to check if I have enough mails to do a sufficiently reliable test here.

3. does it give an indication of confidence in its results? or just a binary spam/ham decision?

I'm currently working only with a binary classifier. However, libsvm supports probability estimates and regression (and to my knowledge, internally, most SVM algorithms relax the classification output to real values and then use the sign to determine the classification; this can also be seen as some sort of confidence value).

4. hey, if you're writing an SVM plugin, it might be worth making one that _also_ supports body text tokens, similarly to the existing Bayes plugin. ;)

This would surely be possible somehow, but we'd first have to come up with a good representation of the problem for an SVM. I wouldn't want to mix this with the current experiment either, as these two things represent somewhat different data. One of the problems with text tokens is that there can always be new ones (which would increase the dimension of the problem and hence require the whole SVM to be remodeled), so a system as performant as Bayes might not work directly.

5. btw, one particularly tricky part of dealing with user-trainable dbs is supporting expiry of old tokens. but that can be deferred until later anyway.

I guess this is a question of implementation :)

Best regards, Chris
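The point about the sign of a relaxed real-valued SVM output can be illustrated with a toy linear model. The weights, bias and feature layout below are invented for illustration and are not taken from libsvm:

```python
def decision_value(features, weights, bias):
    """Linear SVM decision function f(x) = w.x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def classify(features, weights, bias):
    """The sign gives the class; |f(x)| can serve as a crude confidence."""
    f = decision_value(features, weights, bias)
    label = 'spam' if f > 0 else 'ham'
    return label, abs(f)

# Hypothetical weights for three rule features
# (say BAYES_99, URIBL_BLACK, ALL_TRUSTED) and a hypothetical bias
w, b = [2.0, 1.5, -3.0], -1.0
print(classify([1, 1, 0], w, b))  # ('spam', 2.5)
print(classify([0, 0, 1], w, b))  # ('ham', 4.0)
```

A real implementation would take the trained weights from the SVM model; the mapping from |f(x)| to a calibrated probability is what libsvm's probability-estimate option does more carefully.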
Re: SpamAssassin Doesn't Appear to be working
In this setup I am led to believe that amavisd-new handles the SA config, but I did not see a process for spamd, so I enabled it in rc.conf.

There is no need for a spamd process in this setup - think of the amavisd process as an equivalent of spamd (in that it calls the SpamAssassin library of perl modules), but it speaks a different protocol: amavisd speaks SMTP, spamd speaks the spamc/spamd protocol.

Ahh ok, I understand that portion... I was not aware.

The most likely reason for the absence of X-Spam-* header fields is that the recipient was not considered local - check your setting of @local_domains_maps (or %local_domains or @local_domains_acl). X-Spam-* header fields are not inserted for outbound mail (i.e. when the recipient is not considered local). Check the log (possibly at elevated log level) to make sure.

Not sure what I am looking for in the log. I can see the rejection, but I am not sure what to look for relative to the X-Spam headers. I am logging at level 4; I am going to bump it up to 5. I have this in the config:

@local_domains_maps = ( read_hash("/usr/local/etc/postfix/virtual_domains") ); # using hash

so my local domains should be recognized.

thanks, jason

-- View this message in context: http://www.nabble.com/SpamAssassin-Doesn%27t-Appear-to-be-working-tp22341459p22350544.html Sent from the SpamAssassin - Users mailing list archive at Nabble.com.
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
Marc Perkel wrote: Good work so far, but it sounds like you need to throw more data at it. Also, even though you indicate over 99% accuracy, can you break that down better? 99.9% is 10 times as accurate as 99%.

What do you mean by more data? Of course, some additional data might help. One should consider that _most_ of the SA rules are designed to score on spam. For an SVM, you can use more general data like "Mail has property XYZ" although you don't know what this property means (ham or spam) or if it is even suitable to classify anything. This is of course an advantage.

With respect to the numbers: I repeated the experiments today with slight modifications to provide a more solid setup. The input is again the dataset I used yesterday. In one run, I permute the dataset, then split it (2/3 training vs. 1/3 testing, not stratified). Then the training set is used to train an SVM, which is applied to the 1/3 testing set and additionally to my false negatives set. The SVM outputs an accuracy value, but I wrote a tool that calculates precision and recall by hand, because these values are more interesting:

1 - Precision = False Positive Rate (which is an important factor in SA)
1 - Recall = False Negative Rate (or, consider recall as the detection rate)

I ran this 5 times; the output is attached as a text file, where you will see the exact numbers :) Taking the mean over the 5 runs:

False positive rate: 0.37908199952036 %
Detection rate: 99.18104855859372 %
Detection rate on false negatives (my SA has 0% on this set): 31.7821782178218 %

One should consider that my dataset might not be 100% accurate. It is combined from my inbox and my spam folder. Of course my spam folder is unlikely to contain ham, but it is surely possible that I forgot to delete one or another false negative from my inbox. I'm looking forward to getting Justin's set :)

Also - when it identifies messages, do the numbers on the spam scores go up and ham go down? If so, that makes it more solid and starves the middle. I'm encouraged that the initial results are good.

What do you mean by that question? I don't really understand it :)

My feeling is that if this works, it will work better if we have more informational tokens. For example - is the from address a freemail address? Does the message contain a freemail address? By themselves these wouldn't score points. But spam coming from yahoo, hotmail, gmail, etc. is a different kind of spam than spam coming from spambots. Maybe country tokens from the received lines would be useful. Maybe names of banks in the message would be useful. For example, Bank of America + Nigeria = spam.

Yes, this is exactly what I meant above. These tokens are of limited use for SA currently, but an SVM might be able to use them :)

Cheers, Chris

Reading dataset... Permutating... Splitting and outputting... Training...
* optimization finished, #iter = 449
nu = 0.144606, obj = -529.640159, rho = -2.227729
nSV = 802, nBSV = 785, Total nSV = 802
Predicting test set... Accuracy = 99.2706% (2722/2742) (classification)
Predicting false negative set... Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on test set: Precision: 99.8896856039713 %, Recall: 99.01585565883 %
Results on false negative set: Precision: 100 %, Recall: 31.6831683168317 %
=
Reading dataset... Permutating... Splitting and outputting... Training...
* optimization finished, #iter = 466
nu = 0.147031, obj = -539.132218, rho = -2.297470
nSV = 817, nBSV = 791, Total nSV = 817
Predicting test set... Accuracy = 99.2706% (2722/2742) (classification)
Predicting false negative set... Accuracy = 32.1782% (65/202) (classification)
Evaluating results...
Results on test set: Precision: 99.6613995485327 %, Recall: 99.2134831460674 %
Results on false negative set: Precision: 100 %, Recall: 32.1782178217822 %
=
Reading dataset... Permutating... Splitting and outputting... Training...
* optimization finished, #iter = 454
nu = 0.146568, obj = -535.034660, rho = -2.187959
nSV = 814, nBSV = 793, Total nSV = 814
Predicting test set... Accuracy = 99.2341% (2721/2742) (classification)
Predicting false negative set... Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on test set: Precision: 99.3834080717489 %, Recall: 99.4391475042064 %
Results on false negative set: Precision: 100 %, Recall: 31.6831683168317 %
=
Reading dataset... Permutating... Splitting and outputting... Training...
* optimization finished, #iter = 447
nu = 0.144391, obj = -530.359839, rho = -2.219816
nSV = 802, nBSV = 781, Total nSV = 802
Predicting test set... Accuracy = 99.2341% (2721/2742) (classification)
Predicting false negative set... Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on
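For readers following the numbers: with spam as the positive class, these quantities come straight from confusion-matrix counts. Strictly speaking, 1 - precision is the false *discovery* rate (the share of spam-flagged mail that is actually ham) rather than the textbook false positive rate, but it is the quantity the post cares about. A sketch with made-up counts:

```python
def metrics(tp, fp, fn, tn):
    """Precision and recall with spam as the positive class."""
    precision = tp / (tp + fp)   # 1 - precision: ham among mail flagged spam
    recall = tp / (tp + fn)      # the thread's "detection rate"
    return precision, recall

# Hypothetical counts, not the thread's actual data
p, r = metrics(tp=1420, fp=3, fn=17, tn=1302)
print('%.3f %% of flagged mail was ham' % (100 * (1 - p)))
print('%.3f %% of spam detected' % (100 * r))
```

This matches the way the attached output reports results: precision and recall on the test set, plus recall alone on the false-negative set (precision there is trivially 100% since the set contains only spam).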
how to make a custom ruleset
Dear all, I found that a lot of spam is using the recipient email address as the sender (from a...@internux.co.id to a...@internux.co.id, or from i...@apache.org to i...@apache.org). Since mail we send to ourselves usually has a very low score, I hope it is safe to give this a BIG score (probably 2 or 3). Is there a hint on how to make this custom ruleset?
Re: how to make a custom ruleset
On Thu, 2009-03-05 at 21:31 +0800, Adi Nugroho wrote: I found that a lot of spam is using recipient email address as the sender. (from a...@internux.co.id to a...@internux.co.id, or from i...@apache.org to i...@apache.org).

The only disadvantage is that you'll label test messages as spam.

Since if we mail to our self, usually we have very low score, I hope it is save to give a BIG score (probably 2 or 3). Is there a hint how to make this custom rule set?

Use a meta rule:

describe SELF Trap mail with forged sender the same as recipient
header SELF From =~ /\...@my.address/i
header SELF To =~ /\...@my.address/i
meta SELF 5.0

This will work for a domain where internal mail is *not* scanned by SA.

Martin
Re: how to make a custom ruleset
On Thu, 2009-03-05 at 21:31 +0800, Adi Nugroho wrote: Dear all, I found that a lot of spam is using recipient email address as the sender. (from a...@internux.co.id to a...@internux.co.id, or from i...@apache.org to i...@apache.org). Since if we mail to our self, usually we have very low score, I hope it is save to give a BIG score (probably 2 or 3). Is there a hint how to make this custom rule set?

Here's one way. I'm sure there will be many holes in this approach.

1. Define and publish SPF policies for your network.

2. Create a rule like this:

header __OUR_DOMAIN_FROM From:addr =~ /example\.com/
header __OUR_DOMAIN_ENVELOPE EnvelopeFrom:addr =~ /example\.com/
meta OUR_DOMAIN (__OUR_DOMAIN_FROM || __OUR_DOMAIN_ENVELOPE) && SPF_FAIL
describe OUR_DOMAIN claims to be from our domain but fails SPF
score OUR_DOMAIN 2.5

-- Daniel J McDonald, CCIE #2495, CISSP #78281, CNX Austin Energy http://www.austinenergy.com
Re: how to make a custom ruleset
On Thu, March 5, 2009 14:31, Adi Nugroho wrote: I found that a lot of spam is using recipient email address as the sender. (from a...@internux.co.id to a...@internux.co.id, or from i...@apache.org to i...@apache.org).

all this happens on domains that have no SPF, and/or don't test SPF in the MTA; when SPF is used properly this will soon go away

Since if we mail to our self, usually we have very low score, I hope it is save to give a BIG score (probably 2 or 3).

you know where you send from (IP-wise, SMTP auth), so there should be no problem making a wall on this

Is there a hint how to make this custom rule set?

enable SPF / DKIM, test SPF / DKIM, problem solved

http://www.openspf.org/
http://www.dkim.org/

-- http://localhost/ 100% uptime and 100% mirrored :)
Re: NOTICE: mail delivery status.
I added this to my local.cf; is this syntax OK, or will this block 'Club', 'Casino' and 'Vegas' if used separately?

header FROM_CASINO From:name =~ /\Vegas Club Casino\b/i
describe FROM_CASINO Casino Club Casino filter 04/03/09
score FROM_CASINO 10.0

I got the following response from my MTA:

Reporting-MTA: dns; mail.removed.be
X-Postfix-Queue-ID: A047E104436
X-Postfix-Sender: rfc822; ru...@removed.be
Arrival-Date: Thu, 5 Mar 2009 11:35:51 +0100 (CET)
Final-Recipient: rfc822; hendr...@mail.removed.be
Original-Recipient: rfc822; t...@removed.be
Action: failed
Status: 5.0.0
Diagnostic-Code: X-Postfix; can't create user output file. Command output:
[14547] warn: Unrecognized escape \V passed through in regex; marked by -- HERE in m/(?i)\V -- HERE egas Club Casino\b/ at /usr/lib/perl5/vendor_perl/5.8.5/Mail/SpamAssassin/Conf/Parser.pm line 1173.
[14547] warn: Unrecognized escape \V passed through at /etc/mail/spamassassin/local.cf, rule FROM_CASINO, line 1.
Re: NOTICE: mail delivery status.
On Thu, March 5, 2009 16:01, Geert Batsleer wrote: header FROM_CASINO From:name =~ /\Vegas Club Casino\b/i

header FROM_CASINO From:name =~ /\bVegas Club Casino\b/i

-- http://localhost/ 100% uptime and 100% mirrored :)
Re: NOTICE: mail delivery status.
On 05.03.09 16:01, Geert Batsleer wrote: I added this to my local.cf, is this syntax OK or will this block 'Club', 'Casino' and 'Vegas' if used separate?

header FROM_CASINO From:name =~ /\Vegas Club Casino\b/i
describe FROM_CASINO Casino Club Casino filter 04/03/09
score FROM_CASINO 10.0

\V is unknown to me, and apparently to perl too. Also, a score of 10 is too much. Combined with BAYES_99 (standard score 3.5) and the standard required score of 5.0, 1.505 should be enough even for cases where the sender has correct SPF, DKIM or whatever tests that give a small negative score.

But what do you mean by "if used separate"? Defining a score for each word separately would work, although this way it's more reliable (the words vegas, club, casino can appear in From: lines of many mails). Using ReplaceTags could help you if anyone starts obfuscating that...

[14547] warn: Unrecognized escape \V passed through at /etc/mail/spamassassin/local.cf, rule FROM_CASINO, line 1.

-- Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
LSD will make your ECS screen display 16.7 million colors
Re: how to make a custom ruleset
On Thursday 05 March 2009 22:28:23 Martin Gregorie wrote:

describe SELF Trap mail with forged sender the same as recipient
header SELF From =~ /\...@my.address/i
header SELF To =~ /\...@my.address/i
meta SELF 5.0

Dear Martin, Thank you for the rule... I made a file self.cf in /etc/mail/spamassassin:

describe SELF Trap mail with forged sender the same as recipient
header SELF From =~ /\...@my.address/i
header SELF To =~ /\...@my.address/i
meta SELF 5.0
score SELF 3.0

But all mail is identified as SELF :D Did I misunderstand something?
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
decoder wrote: Marc Perkel wrote: Good work so far, but it sounds like you need to throw more data at it. Also, even though you indicate over 99% accuracy, can you break that down better? 99.9% is 10 times as accurate as 99%.

What do you mean by more data? Of course, some additional data might help. One should consider that _most_ of the SA rules are designed to score on spam. For an SVM, you can use more general data like "Mail has property XYZ" although you don't know what this property means (ham or spam) or if it is even suitable to classify anything. This is of course an advantage.

With respect to the numbers: I repeated the experiments today with slight modifications to provide a more solid setup. The input is again the dataset I used yesterday. In one run, I permute the dataset, then split it (2/3 training vs. 1/3 testing, not stratified). Then the training set is used to train an SVM, which is applied to the 1/3 testing set and additionally to my false negatives set. The SVM outputs an accuracy value, but I wrote a tool that calculates precision and recall by hand, because these values are more interesting: 1 - Precision = False Positive Rate (which is an important factor in SA); 1 - Recall = False Negative Rate (or, consider recall as the detection rate).

I ran this 5 times; the output is attached as a text file, where you will see the exact numbers :) Taking the mean over the 5 runs:

False positive rate: 0.37908199952036 %
Detection rate: 99.18104855859372 %
Detection rate on false negatives (my SA has 0% on this set): 31.7821782178218 %

One should consider that my dataset might not be 100% accurate. It is combined from my inbox and my spam folder. Of course my spam folder is unlikely to contain ham, but it is surely possible that I forgot to delete one or another false negative from my inbox. I'm looking forward to getting Justin's set :)

Also - when it identifies messages, do the numbers on the spam scores go up and ham go down? If so, that makes it more solid and starves the middle. I'm encouraged that the initial results are good.

What do you mean by that question? I don't really understand it :)

My feeling is that if this works, it will work better if we have more informational tokens. For example - is the from address a freemail address? Does the message contain a freemail address? By themselves these wouldn't score points. But spam coming from yahoo, hotmail, gmail, etc. is a different kind of spam than spam coming from spambots. Maybe country tokens from the received lines would be useful. Maybe names of banks in the message would be useful. For example, Bank of America + Nigeria = spam.

Yes, this is exactly what I meant above. These tokens are of limited use for SA currently, but an SVM might be able to use them :)

Cheers, Chris

I suppose what I was thinking was that you still used the SA result, but added to or subtracted from the SA result based on your SVM code, sort of the way bayes does. Or are you letting the SVM make the final determination? In my SA processing I'm used to getting numbers back and processing differently based on the grade of spam/ham. I was envisioning that this new process would increase the accuracy and starve the middle, pushing the result into bigger ham/spam numbers.
Re: how to make a custom ruleset
On Thu, March 5, 2009 16:27, Adi Nugroho wrote:

describe SELF Trap mail with forged sender the same as recipient
header SELF From =~ /\...@my.address/i
header SELF_TO To =~ /\...@my.address/i
meta SELF 5.0

ups

header SELF_FROM From =~ /\...@my.address/i
header SELF_TO To =~ /\...@my.address/i
meta SELF (SELF_FROM && SELF_TO)
describe SELF Trap mail with forged sender the same as recipient
score SELF 3.0

But all mail identified as SELF :D

then it works

Did I misunderstood something?

nope, make sure NO_RELAYS or ALL_TRUSTED have higher scores than SELF, eg:

score NO_RELAYS -3.1

or

score ALL_TRUSTED -3.1

-- http://localhost/ 100% uptime and 100% mirrored :)
Re: Some emails pass spamassassin unprocessed
Karsten Bräckelmann-2 wrote: Mentioning some numbers is good, though too qualitative. How many mails is that per day, in absolute numbers?

Maybe 1-10 a day in total. I have several email accounts there and it happens with all of them, although not in a serious amount per account.

Karsten Bräckelmann-2 wrote: Of course, never forget that there's a default max-size per message. Unless told otherwise, spamc won't scan mail that's larger than 500 kByte. Probably not your issue, though, unless (most of) the unprocessed mail you're talking about actually *is* large.

The problem is not size related. The spam emails that pass are often only one-line text mails. My settings do not specify a max-size, since the default value seems to suit me.

Karsten Bräckelmann-2 wrote: Also, this is the spamc default of safe fallback. That means, if there is any issue in communicating with the daemon, the message will be passed along unprocessed. But let's re-schedule that for later. (See follow-up post.)

The logfiles do not report any problems. I ran procmail with VERBOSE=yes for some time now and it shows that even the unprocessed emails get passed to spamc.

Karsten Bräckelmann-2 wrote: Unrelated: Are you using mbox or maildir storage? The $MAILDIR hints you actually are using maildir format, though the junk folder is an mbox file! With mbox, you seriously should add locking to any delivering recipe.

Maildir is being used as maildir storage. I just use the mbox file for detected spam emails instead of forgetting them right away. Thanks for the hint with locking, I added the colon to the line concerning the mbox write access.

Karsten Bräckelmann-2 wrote: Yes. Check your logs. If you are running out of spamd children, you'll see something like this in the logs: prefork: child states. One state-indicating char per child. B is busy, I means idle. Idle processes are ready to take a message. If you're seeing too many busy children, your server can't handle the load. In that case, you /can/ increase the number of children, if you have plenty of RAM. Limiting the parallel resource usage by locking (in procmail) can help, too. And it definitely would be worth investigating more, like how long the children take for processing a message. Too long scan times can amplify this. guenther

I checked the logs and that does not seem to be the cause of the problem. My log shows 'II' in most cases and rarely more than 3 threads running. My max threads is 5, I think. I'll try to add locking to the spamc call now in all the .procmailrc files. I will come back to you =) Many thanks for all the hints so far!
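For reference, the locking guenther suggests is conventionally written as a local lockfile on the spamc recipe, so parallel procmail instances queue up instead of piling onto spamd at once. A sketch of such a .procmailrc recipe (the lockfile name and spamc path are illustrative):

# Serialize SpamAssassin scans with a local lockfile (the second ':').
:0fw: spamc.lock
| /usr/bin/spamc

The trailing ':' after the flags is what enables locking; without a name procmail derives one, but an explicit shared lockfile ensures all recipes contend on the same lock.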
Re: SpamAssassin Doesn't Appear to be working
Got it. I tweaked the settings, set $sa_tag_level_deflt = ; and the header now shows... I feel better even if it was working before.
Re: how to make a custom ruleset
On Thu, 5 Mar 2009, Benny Pedersen wrote:

header SELF_FROM From =~ /\...@my.address/i
header SELF_TO To =~ /\...@my.address/i

Are you sure you want to give 1 point to each of those cases in addition to whatever points the meta adds? If not, then they should be named __SELF_FROM and __SELF_TO.

-- John Hardin KA7OHZ  http://www.impsec.org/~jhardin/
jhar...@impsec.org  FALaholic #11174  pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
Failure to plan ahead on someone else's part does not constitute an emergency on my part. -- David W. Barts in a.s.r
---
3 days until Daylight Saving Time begins in U.S. - Spring Forward
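Putting the thread's corrections together (double-underscore sub-rule names so they score nothing on their own, as John suggests, plus explicit meta operands), the rule might end up looking like the sketch below. The address pattern is a placeholder for your own address, not the poster's actual rule:

header   __SELF_FROM  From =~ /myuser\@my\.address/i
header   __SELF_TO    To =~ /myuser\@my\.address/i
meta     SELF         (__SELF_FROM && __SELF_TO)
describe SELF         Trap mail with forged sender the same as recipient
score    SELF         3.0

As noted earlier in the thread, this only makes sense where internal self-addressed mail is not scanned by SA, or where NO_RELAYS / ALL_TRUSTED carry enough negative score to offset it.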
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
Marc Perkel wrote: I suppose what I was thinking was that you still used the SA result but added or subtracted from the SA result based on your SVM code, sort of the way bayes does. Or are you letting SVM make the final determination?

At the moment, I am only using the SVM answer. What you finally do with it is the next step. You can use it like a normal rule and give it a score, of course. You could also use only the SVM, but I think I'll go for the scoring idea :) It would also be possible to use an SVM model that supports confidence/probabilities. At the moment I was only evaluating the precision/recall for this method alone, without any scoring.

Chris
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
John Hardin wrote: Would there be any benefit to having an offline version - i.e. something that evaluates the log or a corpus to generate new meta rules, that could be added onto the default ruleset? For instance:

cron @ 0200:
sa_meta_eval /etc/mail/spamassassin/metarules.cf
/etc/init.d/spamassassin restart

This is definitely a good idea. You can create the SVM model offline from a logfile only, if it includes the rules that scored and the ham/spam status. However, you cannot generate meta rules with SVMs; for that purpose you need a different learning algorithm (for example bayes, or decision trees). SVM classification is very cheap, though, so once you've created the model offline, you can use it online really quickly with a plugin.

Cheers, Chris
Re: how to make a custom ruleset
On Thu, March 5, 2009 17:31, John Hardin wrote: header SELF_FROM From =~ /\...@my.address/i header SELF_TO To =~ /\...@my.address/i Are you sure you want to give 1 point to each of those cases in addition to whatever points the meta adds?

it was not me that made the rules, I just edited them :)

If not, then they should be named __SELF_FROM and __SELF_TO

sure. when do you stop CCing me?

-- http://localhost/ 100% uptime and 100% mirrored :)
Re: 2 + 2 != 4 - Spamassassin needs a new paradigm
On Thu, Mar 5, 2009 at 11:12, decoder deco...@own-hero.net wrote:

Justin Mason wrote: Thanks for doing this! A couple of q's: 1. I can offer a bigger ham/spam corpus if you'd like to test against that as well; corpora from multiple contributors can sometimes expose training set bias.

That would be cool :) Is this corpus already processed by spamassassin (i.e. has SA headers)? My PoC code currently mines only the headers to find out what rules are triggered.

yep, it is. OK, let me take a look later on tonight and see if I can make up a tarball for you...

2. can you test it on spam that scored less than 10 points when it arrived? low-scoring spam is, of course, more useful to hit than stuff that scored highly on the existing rules.

Things like that should be easily possible. I need to check if I have enough mails to do a sufficiently reliable test here.

cool.

3. does it give an indication of confidence in its results? or just a binary spam/ham decision?

I'm currently working only with a binary classifier. However, libsvm supports probability estimates and regression (and to my knowledge, internally, most SVM algorithms relax the classification output to real values and then use the sign to determine the classification; this can also be seen as some sort of confidence value).

yep, that should work.

4. hey, if you're writing an SVM plugin, it might be worth making one that _also_ supports body text tokens, similarly to the existing Bayes plugin. ;)

This would surely be possible somehow, but we'd first have to come up with a good representation of the problem for an SVM. I wouldn't want to mix this with the current experiment either, as these two things represent somewhat different data. One of the problems with text tokens is that there can always be new ones (which would increase the dimension of the problem and hence require the whole SVM to be remodeled), so a system as performant as Bayes might not work directly.

interesting, I hadn't thought of that angle.

--j.
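A common workaround for the growing-dimension problem with text tokens (not raised in the thread itself) is the hashing trick: hash each token into a fixed number of buckets, so the feature space never changes size and the SVM never needs remodeling when new tokens appear. A minimal sketch, with the bucket count chosen arbitrarily:

```python
import hashlib

N_BUCKETS = 2 ** 16   # fixed dimensionality, arbitrary choice here

def hash_bucket(token):
    """Map an arbitrary token to a stable bucket index in [0, N_BUCKETS)."""
    digest = hashlib.md5(token.encode('utf-8')).hexdigest()
    return int(digest, 16) % N_BUCKETS

def hashed_counts(tokens):
    """Sparse {bucket: count} vector; new tokens never widen the space."""
    vec = {}
    for tok in tokens:
        b = hash_bucket(tok)
        vec[b] = vec.get(b, 0) + 1
    return vec

v = hashed_counts(['viagra', 'viagra', 'meeting'])
print(sum(v.values()))  # 3 counts, spread over at most 2 buckets
```

The price is occasional collisions between unrelated tokens, which in practice is tolerable when N_BUCKETS is large relative to the vocabulary.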
Re: Dealing with low scoring spam - tighter MTA integration
--On Thursday, March 05, 2009 7:43 AM +0100 Andrzej Adam Filip a...@onet.eu wrote: What I would like to see is an option to make SpamAssassin produce weighted scores based on the subset of all tests capable of working on the subset of the final data available *before* the message headers & body are transferred in the SMTP session.

Before you get the DATA part, you only have the EHLO and envelope. Not a real need for a full-blown SA scan at that point. What rules would you apply that couldn't be done with a simple Perl function? (For lurkers, MIMEDefang allows one to write a Sendmail milter in Perl, by providing a C-to-Perl translation layer.)
Re: Dealing with low scoring spam - tighter MTA integration
Kenneth Porter sh...@sewingwitch.com wrote: --On Thursday, March 05, 2009 7:43 AM +0100 Andrzej Adam Filip a...@onet.eu wrote: What I would like to see is an option to make SpamAssassin produce weighted scores based on the subset of all tests capable of working on the subset of the final data available *before* the message headers & body are transferred in the SMTP session.

Before you get the DATA part, you only have the EHLO and envelope.

At RCPT TO: stage there are available:
* the connecting client IP address (last mail hop), so a big part of DNSBL and DNSWL tests *CAN* be used
* the envelope sender, for SPF based tests
* the envelope sender and envelope recipient, for auto white/black listing (producing some kind of grey-listing for the first attempt from a source of unknown reputation)

Not a real need for a full-blown SA scan at that point.

I try hard to preach that the SA methodology of creating a spam score based on weighted tests *CAN* be applied at this point too. I would like to apply such tests in a milter (MIMEDefang) that uses SA anyway in my installation.

What rules would you apply that couldn't be done with a simple Perl function?

SA is not a simple set of perl functions? ;-) Delivering such functionality via SA would assure keeping the weights in sync with changing spamming patterns. Some spammers are smart, and many spammers are smart enough to follow, so the quality of the maintenance team and maintenance methodology does make a difference.

(For lurkers, MIMEDefang allows one to write a Sendmail milter in Perl, by providing a C-to-Perl translation layer.)

-- [plen: Andrew] Andrzej Adam Filip : a...@onet.eu You can't have everything. Where would you put it? -- Steven Wright
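Andrzej's point, that SA's weighted-score methodology can run on pre-DATA facts alone, might look like this in miniature. The test names mirror SA naming conventions, but the weights and threshold are invented for illustration:

```python
# Hypothetical weights for tests needing only the connecting IP and the
# envelope, i.e. facts already available at RCPT TO time.
PRE_DATA_SCORES = {
    'RCVD_IN_XBL':   3.0,   # client IP listed in a DNSBL
    'RCVD_IN_DNSWL': -2.0,  # client IP on a DNS whitelist
    'SPF_FAIL':      1.5,   # envelope sender fails SPF
}
REJECT_THRESHOLD = 4.0      # invented cutoff for this sketch

def pre_data_verdict(hits):
    """Sum weighted pre-DATA test hits, SA-style, and decide."""
    score = sum(PRE_DATA_SCORES[h] for h in hits)
    return score, ('reject' if score >= REJECT_THRESHOLD else 'continue')

print(pre_data_verdict(['RCVD_IN_XBL', 'SPF_FAIL']))  # (4.5, 'reject')
print(pre_data_verdict(['RCVD_IN_DNSWL']))            # (-2.0, 'continue')
```

The appeal of doing this inside SA rather than in ad-hoc milter code is exactly the point made above: the weights would then be maintained alongside the rest of the ruleset as spam patterns change.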
Re: Dealing with low scoring spam - tighter MTA integration
Andrzej Adam Filip wrote:
> At the RCPT TO: stage there are available:
> * the connecting client IP address (last mail hop), so a big part of the DNSBL and DNSWL tests *CAN* be used
> * the envelope sender, for SPF-based tests
> * the envelope sender and envelope recipient, for automatic white/black listing (producing some kind of grey-listing for the first attempt from a source of unknown reputation)

Are you thinking that it might be good to tie this in to the SpamAssassin AWL score? So a sender with an existing low AWL might be allowed through even if the sending host appears on one or two DNSBLs?

And you're missing the possibility of doing reverse DNS lookups, too.

James.

-- 
E-mail: james@     | A: Because people don't normally read bottom to top.
aprilcottage.co.uk | Q: Why is top-posting such a bad thing?
                   | A: Top-posting.
                   | Q: What is the most annoying thing in e-mail and usenet?
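Both the DNSBL tests listed above and the reverse DNS lookup mentioned here hinge on the same reversed-octet name construction. A minimal Python sketch; the zone `zen.spamhaus.org` is just a familiar example, and no network call is made:

```python
def reversed_octets(ip):
    """'203.0.113.5' -> '5.113.0.203'"""
    return ".".join(reversed(ip.split(".")))

def dnsbl_name(ip, zone):
    """Name to resolve against a DNS blocklist; an A record means 'listed'."""
    return reversed_octets(ip) + "." + zone

def ptr_name(ip):
    """Name queried under the hood by a reverse DNS (PTR) lookup for IPv4."""
    return reversed_octets(ip) + ".in-addr.arpa"

print(dnsbl_name("203.0.113.5", "zen.spamhaus.org"))
# 5.113.0.203.zen.spamhaus.org
print(ptr_name("203.0.113.5"))
# 5.113.0.203.in-addr.arpa
```

Resolving either name (e.g. with `socket.gethostbyname` / `socket.gethostbyaddr`) is then a single call, which is why these tests are cheap enough to run at the RCPT TO: stage.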
Re: Dealing with low scoring spam - tighter MTA integration
James Wilkinson sa-u...@aprilcottage.co.uk wrote:
> Are you thinking that it might be good to tie this in to the SpamAssassin AWL score? So a sender with an existing low AWL might be allowed through even if the sending host appears on one or two DNSBLs?

I want a platform allowing many people to contribute small improvements, e.g. white-listing based on the combination of sender address and ASN (or routing prefix).

> And you're missing the possibility of doing reverse DNS lookups, too.

I have considered it an obvious derivative of the connecting client IP address.

-- 
[pl>en: Andrew] Andrzej Adam Filip : a...@onet.eu
Seek simplicity -- and distrust it. -- Alfred North Whitehead
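The sender-address-plus-ASN white-listing idea could be keyed like this. A toy in-memory sketch; the class and method names are invented (not an existing SA interface), and a real deployment would need persistence plus an ASN lookup source:

```python
from collections import defaultdict

class SenderAsnReputation:
    """Toy reputation store keyed by (envelope sender, origin ASN)."""

    def __init__(self):
        self._scores = defaultdict(float)  # (sender, asn) -> cumulative score

    def record(self, sender, asn, delta):
        """Adjust reputation: negative delta for ham-like behaviour."""
        self._scores[(sender.lower(), asn)] += delta

    def score(self, sender, asn):
        """0.0 means 'unknown source', i.e. a candidate for grey-listing."""
        return self._scores[(sender.lower(), asn)]

rep = SenderAsnReputation()
rep.record("alice@example.org", 64500, -2.0)   # seen sending ham via AS64500
print(rep.score("Alice@example.org", 64500))   # -2.0 (lookup is case-insensitive)
print(rep.score("alice@example.org", 64501))   # 0.0 -> unknown pair, grey-list
```

Keying on the (sender, ASN) pair rather than the bare address makes the entry harder to forge: a spammer would have to send from the same network neighbourhood as the legitimate sender.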
Re: how to make a custom ruleset
On Mar 5, 2009, at 7:28, Martin Gregorie mar...@gregorie.org wrote:
> On Thu, 2009-03-05 at 21:31 +0800, Adi Nugroho wrote:
>> I found that a lot of spam is using the recipient email address as the sender (from a...@internux.co.id to a...@internux.co.id, or from i...@apache.org to i...@apache.org).
> The only disadvantage is that you'll label test messages as spam.

If you allow address delimiters this is trivial to get around: just have them email their test to user+t...@example.com
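The plus-address workaround above only helps if the sender==recipient rule compares raw strings; a rule that normalizes subaddresses away would still catch the test message. A small sketch of that normalization, assuming '+' as the delimiter (the RFC 5233-style convention used by Sendmail/Postfix setups that enable it):

```python
def strip_plus_tag(addr):
    """Drop a '+tag' extension from the local part: user+test@x -> user@x."""
    local, sep, domain = addr.partition("@")
    base = local.split("+", 1)[0]
    return base + sep + domain

print(strip_plus_tag("user+test@example.com"))  # user@example.com
print(strip_plus_tag("user@example.com"))       # user@example.com
```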
Re: Dealing with low scoring spam - tighter MTA integration
--On Thursday, March 05, 2009 10:31 PM +0100 Andrzej Adam Filip a...@onet.eu wrote:
> I try hard to preach that the SA methodology of creating a spam score from weighted tests *CAN* be applied at this point too. I would like to apply such tests in a milter (MIMEDefang) that uses SA anyway in my installation.

A cheap way of doing it would be to construct an artificial message from the information available. One would probably want to use a custom set of rules (i.e. strip out most of the normal rules that assume a full set of headers and a regular body).

> At the RCPT TO: stage there are available:
> * the connecting client IP address (last mail hop), so a big part of the DNSBL and DNSWL tests *CAN* be used
> * the envelope sender, for SPF-based tests
> * the envelope sender and envelope recipient, for automatic white/black listing (producing some kind of grey-listing for the first attempt from a source of unknown reputation)

Instead of running all of SA, perhaps you could just invoke the individual plugins from their Perl entry points. I'm not familiar enough with SA's architecture to know how practical that is, though.
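The "artificial message" trick above can be sketched in a few lines: wrap the pre-DATA session data in a minimal RFC 5322-shaped message so that header-oriented tests (DNSBL on the Received IP, SPF on the envelope sender) have something to chew on. The header layout here is an illustrative guess, not what any existing milter emits:

```python
def synthetic_message(client_ip, helo, mail_from, rcpt_to):
    """Build a minimal headers-only message from envelope-stage data."""
    headers = [
        f"Received: from {helo} ([{client_ip}]) by localhost; pre-DATA probe",
        f"Return-Path: <{mail_from}>",
        f"From: <{mail_from}>",
        f"To: <{rcpt_to}>",
        "Subject: (pre-DATA probe)",
    ]
    return "\r\n".join(headers) + "\r\n\r\n"  # empty body

msg = synthetic_message("203.0.113.5", "mail.example.net",
                        "sender@example.net", "user@example.com")
```

The resulting string could then be fed to a scanner restricted to network and envelope rules, as the post suggests.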
Re: how to make a custom ruleset
On Thursday 05 March 2009 23:44:39 Benny Pedersen wrote:
> ups
>
> header SELF_FROM From =~ /\...@my.address/i
> header SELF_TO To =~ /\...@my.address/i
> meta SELF (SELF_FROM SELF_TO)
> describe SELF Trap mail with forged sender the same as recipient
> score SELF 3.0

I have tried the above syntax but it failed: no mail is identified as SELF. Is there a howto about this kind of ruleset?
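One likely reason the quoted ruleset fails to lint: SpamAssassin meta rules need an explicit boolean operator between subrule names, so `(SELF_FROM SELF_TO)` is invalid where `(SELF_FROM && SELF_TO)` is not. A sketch of a corrected version; the address is a placeholder to replace with your real one, and the `__` prefix keeps the subrules from scoring on their own:

```
header   __SELF_FROM  From =~ /user\@example\.com/i
header   __SELF_TO    To =~ /user\@example\.com/i
meta     SELF         (__SELF_FROM && __SELF_TO)
describe SELF         Trap mail with forged sender the same as recipient
score    SELF         3.0
```

Running `spamassassin --lint` after editing is the quickest way to catch this class of syntax error.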