Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-05 Thread Justin Mason
On Thu, Mar 5, 2009 at 00:23, decoder deco...@own-hero.net wrote:
 decoder wrote:
 Justin Mason wrote:
 So you're volunteering to code it up, then? ;)

 I was planning to do at least some brainstorming and experiments as to what
 learning methods would seem suitable and how well each method performs,
 whenever I have time again. Unless someone else did that already?


 Ok, I did some short experiments: I built an SVM classifier from a large
 mail corpus (8226 mails: 5414 ham, 2812 spam) and did a 5-fold cross
 validation. The resulting classifier has an accuracy of over 99%, so it
 performs as well as the regular system. I then applied it to a set of 202
 false negatives that I collected, and 69 of these are recognized as spam by
 the SVM. As a second test, I pulled 2707 mails from one of my other inboxes
 and applied the classifier; the accuracy was again over 99% (and this inbox
 is ham only).

 From my point of view, the results show that this approach has potential. It
 is highly accurate with respect to the current system, but additionally
 outperformed it on several false negatives.


 There are other advantages that this system has over the common system: It
 allows everybody to train the whole spamfilter (not only Bayes) to the kind
 of spam that one receives, i.e. it is more adaptive than the common system.


 Any opinions on this are very welcome. Maybe we should try to come up
 with a proof-of-concept plugin for SA?

Thanks for doing this!  A couple of q's:

1. I can offer a bigger ham/spam corpus if you'd like to test against
that as well; corpora from multiple contributors can sometimes expose
training set bias.

2. can you test it on spam that scored less than 10 points when it arrived?
low-scoring spam is, of course, more useful to hit than stuff that scored
highly on the existing rules.

3. does it give an indication of confidence in its results? or just a
binary spam/ham decision?

4. hey, if you're writing an SVM plugin, it might be worth making one
that _also_ supports body text tokens, similarly to the existing Bayes
plugin. ;)

5. btw one particularly tricky part of dealing with user-trainable
dbs is supporting expiry of old tokens.  but that can be deferred until
later anyway.

--j.


Re: Bye Bye Bayes

2009-03-05 Thread Matus UHLAR - fantomas
On 04.03.09 06:17, John Hardin wrote:
 I used to have a couple of users who treated their Trash folder as 
 long-term read-message storage. After reading most messages they'd move 
 them to Trash, and _never_ _purge_ _it_. I couldn't break them of this 
 habit, even after purging their Trash folder from the server a couple of 
 times. (Oops! Disk failure! Well, that was trash, you can afford to lose 
 that.)

We set up courier's imap server to remove files after they have been in the
trash for more than 7 days... Luckily, we documented that a long time ago, so
they cannot complain...

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Boost your system's speed by 500% - DEL C:\WINDOWS\*.*


Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-05 Thread decoder

Justin Mason wrote:


 Thanks for doing this!  couple of q's:

 1. I can offer a bigger ham/spam corpus if you'd like to test against
 that as well; corpora from multiple contributors can sometimes expose
 training set bias.

That would be cool :) Is this corpus already processed by spamassassin
(i.e. has SA headers)?

My PoC code currently mines only the headers to find out which rules were
triggered.
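
To illustrate the header-mining step, here is a minimal Python sketch. It is not the PoC code itself. It assumes the usual X-Spam-Status layout with a comma-separated tests= list; the function names, the sample header and the rule list are made up for the example.

```python
import re


def triggered_rules(status_header):
    """Return the set of rule names in the tests= field of an X-Spam-Status value."""
    # Capture everything after "tests=" up to the next "key=" field or the
    # end of the header; the list may be folded across continuation lines.
    match = re.search(r"tests=(.*?)(?=\s+\w+=|$)", status_header, re.DOTALL)
    if match is None:
        return set()
    names = (name.strip() for name in match.group(1).split(","))
    # "tests=none" means that no rule fired.
    return {name for name in names if name and name != "none"}


def feature_vector(status_header, rule_index):
    """Encode the header as a 0/1 vector over a fixed, ordered list of rules."""
    hits = triggered_rules(status_header)
    return [1 if rule in hits else 0 for rule in rule_index]


# Hypothetical example header and rule list, for illustration only.
EXAMPLE = ("Yes, score=12.3 required=5.0 tests=BAYES_99,HTML_MESSAGE,\n"
           "\tMIME_HTML_ONLY autolearn=no version=3.2.5")
RULES = ["BAYES_99", "HTML_MESSAGE", "MIME_HTML_ONLY", "SPF_PASS"]

if __name__ == "__main__":
    print(feature_vector(EXAMPLE, RULES))
```

One such vector per mail, together with the ham/spam label, would be the kind of input an SVM trainer expects.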

 2. can you test it on spam that scored less than 10 points when it arrived?
 low-scoring spam is, of course, more useful to hit than stuff that scored
 highly on the existing rules.

Things like that should be easily possible. I need to check whether I have
enough mails to do a sufficiently reliable test here.


 3. does it give an indication of confidence in its results? or just a
 binary spam/ham decision?

I'm currently working only with a binary classifier. However, libsvm
supports probability estimates and regression. (To my knowledge, most SVM
algorithms internally relax the classification output to real values and
then use the sign to determine the class; this value can also be seen as
some sort of confidence value.)
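
A small Python sketch of the "sign plus magnitude" idea, assuming a linear model for simplicity. The weights and bias are invented for the example and do not come from any trained model; libsvm's own probability estimates are a separate mechanism and are not shown here.

```python
def decision_value(weights, bias, features):
    """Real-valued SVM output for a linear model: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(weights, bias, features):
    """Return (label, confidence): the sign picks the class, and the
    magnitude (distance from zero) serves as a rough confidence value."""
    value = decision_value(weights, bias, features)
    label = "spam" if value > 0 else "ham"
    return label, abs(value)


# Hypothetical weights for three rule features, for illustration only.
WEIGHTS = [2.0, -1.0, 0.5]
BIAS = -1.0

if __name__ == "__main__":
    print(classify(WEIGHTS, BIAS, [1, 0, 1]))
    print(classify(WEIGHTS, BIAS, [0, 1, 0]))
```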



 4. hey, if you're writing an SVM plugin, it might be worth making one
 that _also_ supports body text tokens, similarly to the existing Bayes
 plugin. ;)

This would surely be possible somehow, but we'd first have to come up with
a good representation of the problem for an SVM. I also wouldn't want to mix
this with the current experiment, as the two represent different data.

One of the problems with text tokens is that there can always be new ones,
which would increase the dimension of the problem and hence require the
whole SVM to be remodeled. So a system as performant as Bayes might not
work directly.


 5. btw one particularly tricky part of dealing with user-trainable
 dbs is supporting expiry of old tokens.  but that can be deferred until
 later anyway.

I guess this is a question of implementation :)




Best regards,


Chris


smime.p7s
Description: S/MIME Cryptographic Signature


Re: SpamAssassin Doesn't Appear to be working

2009-03-05 Thread JasonHirsh



 In this setup I am led to believe that Amavisd-new handles the SA config,
 but I did not see a process for spamd, so I enabled it in rc.conf.
 
 There is no need for a spamd process in this setup - think of the amavisd
 process as an equivalent of spamd (in that it calls the SpamAssassin library
 of Perl modules) that speaks a different protocol: amavisd speaks SMTP,
 spamd speaks the spamc/spamd protocol.
 
 
Ahh ok, I understand that portion now. I was not aware of that.



 
 The most likely reason for absence of X-Spam-* header fields is that
 the recipient was not considered local - check your setting of 
 @local_domains_maps (or %local_domains or @local_domains_acl).
 X-Spam-* header fields are not inserted for outbound mail (i.e. when
 recipient is not considered local). Check the log (possibly at elevated
 log level) to make sure.
 
 

I'm not sure what I should be looking for in the log. I can see the
rejection, but not what to look for relative to the X-Spam headers. I am
logging at level 4; I am going to bump it up to 5.

I have this in the config:

 @local_domains_maps = ( read_hash("/usr/local/etc/postfix/virtual_domains") ); # using hash

so my local domains should be recognized


thanks


jason

 


-- 
View this message in context: 
http://www.nabble.com/SpamAssassin-Doesn%27t-Appear-to-be-working-tp22341459p22350544.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.


Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-05 Thread decoder

Marc Perkel wrote:


 Good work so far but sounds like you need to throw more data at it.
 Also even though you indicate over 99% accuracy can you break that
 down better? 99.9% is 10 times as accurate as 99%.

What do you mean by more data? Of course, some additional data might
help. One should consider that _most_ of the SA rules are designed to
score on spam. For an SVM, you can use more general data like "Mail has
property XYZ", even though you don't know what this property indicates (ham
or spam) or whether it is suitable for classifying anything at all. This is
of course an advantage.



With respect to the numbers:

I repeated the experiments today with slight modifications to provide a 
more solid setup:


The input is again the dataset I used yesterday. In each run, I randomly
permute the dataset, then split it (2/3 training vs. 1/3 testing, not
stratified). The training set is then used to train an SVM, which is applied
to the 1/3 testing set and additionally to my false negatives set.
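
The shuffle-and-split step can be sketched in a few lines of Python. This is an illustration only, not the tool used for the runs; the function name and the seed parameter are made up. With 8226 mails it yields 5484 training and 2742 test mails, which is consistent with the 2742 test mails shown in the attached output.

```python
import random


def shuffle_and_split(samples, train_fraction=2 / 3, seed=None):
    """Randomly permute the samples, then cut them into a training part and
    a test part. The split is not stratified: the ham/spam ratio of the two
    parts is left to chance."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, test = shuffle_and_split(range(8226), seed=1)
    print(len(train), len(test))
```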


The SVM outputs an accuracy value, but I wrote a tool that calculates
precision and recall by hand, because these values are more interesting:


1 - Precision = False Positive Rate (which is an important factor in SA)
1 - Recall = False Negative Rate (or, consider recall as the detection rate)
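
For reference, a minimal Python sketch of how precision and recall follow from the raw counts (spam being the positive class). It is not the tool mentioned above. One caveat on terminology: 1 - precision is the share of spam verdicts that were actually ham, which is the quantity reported here as the false positive rate; elsewhere "false positive rate" often means FP / (FP + TN), so the two should not be mixed up when comparing numbers.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN), spam = positive."""
    flagged = true_pos + false_pos      # everything the classifier called spam
    actual = true_pos + false_neg       # everything that really is spam
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # False negative set from the attached output: 64 of 202 spam mails were
    # caught, and the set contains no ham, so there are no false positives.
    p, r = precision_recall(true_pos=64, false_pos=0, false_neg=202 - 64)
    print(f"Precision: {p * 100} %  Recall: {r * 100} %")
```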


I ran this 5 times; the output is attached as a text file, where you can
see the exact numbers :)


Taking the mean over the 5 runs:


False positive rate: 0.37908199952036 %
Detection Rate: 99.18104855859372 %

Detection Rate on False Negatives (my SA has 0% on this set): 
31.7821782178218 %



One should consider that my dataset might not be 100% accurate. It is
combined from my inbox and my spam folder. Of course my spam folder is
unlikely to contain ham, but it is quite possible that I forgot to
delete one or another false negative from my inbox. I'm looking forward
to getting Justin's set :)





 Also - when it identifies messages do the numbers on the spam scores
 go up and ham goes down? If so that makes it more solid and starves
 the middle. I'm encouraged that the initial results are good.

What do you mean by that question? I don't really understand it :)


 My feeling is that if this works that it will work better if we have
 more informational tokens. For example - is the from address a
 freemail address. Does the message contain a freemail address. By
 themselves these wouldn't score points. But spam coming from yahoo,
 hotmail, gmail, etc. is a different kind of spam than spam coming from
 spambots. Maybe country tokens from the received lines would be
 useful. Maybe names of banks in the message would be useful. For
 example Bank of America + Nigeria = spam.

Yes, this is exactly what I meant above. These tokens are of limited use
for SA currently, but an SVM might be able to use them :)



Cheers,


Chris

Reading dataset...
Permutating...
Splitting and outputting...
Training...
*
optimization finished, #iter = 449
nu = 0.144606
obj = -529.640159, rho = -2.227729
nSV = 802, nBSV = 785
Total nSV = 802
Predicting test set...
Accuracy = 99.2706% (2722/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.8896856039713 %
Recall: 99.01585565883 %

Results on false negative set:
Precision: 100 %
Recall: 31.6831683168317 %

=

Reading dataset...
Permutating...
Splitting and outputting...
Training...
*
optimization finished, #iter = 466
nu = 0.147031
obj = -539.132218, rho = -2.297470
nSV = 817, nBSV = 791
Total nSV = 817
Predicting test set...
Accuracy = 99.2706% (2722/2742) (classification)
Predicting false negative set...
Accuracy = 32.1782% (65/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.6613995485327 %
Recall: 99.2134831460674 %

Results on false negative set:
Precision: 100 %
Recall: 32.1782178217822 %

=

Reading dataset...
Permutating...
Splitting and outputting...
Training...
*
optimization finished, #iter = 454
nu = 0.146568
obj = -535.034660, rho = -2.187959
nSV = 814, nBSV = 793
Total nSV = 814
Predicting test set...
Accuracy = 99.2341% (2721/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.3834080717489 %
Recall: 99.4391475042064 %

Results on false negative set:
Precision: 100 %
Recall: 31.6831683168317 %

=

Reading dataset...
Permutating...
Splitting and outputting...
Training...
*
optimization finished, #iter = 447
nu = 0.144391
obj = -530.359839, rho = -2.219816
nSV = 802, nBSV = 781
Total nSV = 802
Predicting test set...
Accuracy = 99.2341% (2721/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on 

how to make a custom ruleset

2009-03-05 Thread Adi Nugroho
Dear all,

I found that a lot of spam is using recipient email address as the sender.
(from a...@internux.co.id to a...@internux.co.id, or from i...@apache.org to 
i...@apache.org).

Since mail we send to ourselves usually gets a very low score, I hope it is
safe to give this a BIG score (probably 2 or 3).

Is there a hint how to make this custom rule set?



Re: how to make a custom ruleset

2009-03-05 Thread Martin Gregorie
On Thu, 2009-03-05 at 21:31 +0800, Adi Nugroho wrote:
 I found that a lot of spam is using recipient email address as the sender.
 (from a...@internux.co.id to a...@internux.co.id, or from i...@apache.org to 
 i...@apache.org).
 
The only disadvantage is that you'll label test messages as spam.

 Since mail we send to ourselves usually gets a very low score, I hope it is
 safe to give this a BIG score (probably 2 or 3).
 
 Is there a hint how to make this custom rule set?
 
Use a meta rule:

describe SELF Trap mail with forged sender the same as recipient
header   SELF From =~ /\...@my.address/i
header   SELF To =~ /\...@my.address/i
meta SELF 5.0

This will work for a domain where internal mail is *not* scanned by SA. 


Martin



Re: how to make a custom ruleset

2009-03-05 Thread Daniel J McDonald
On Thu, 2009-03-05 at 21:31 +0800, Adi Nugroho wrote:
 Dear all,
 
 I found that a lot of spam is using recipient email address as the sender.
 (from a...@internux.co.id to a...@internux.co.id, or from i...@apache.org to 
 i...@apache.org).
 
 Since mail we send to ourselves usually gets a very low score, I hope it is
 safe to give this a BIG score (probably 2 or 3).
 
 Is there a hint how to make this custom rule set?

Here's one way.  I'm sure there will be many holes in this approach.

1. Define and publish SPF policies for your network.
2. Create a rule like this:

header __OUR_DOMAIN_FROM      From:addr =~ /\@example\.com$/i
header __OUR_DOMAIN_ENVELOPE  EnvelopeFrom:addr =~ /\@example\.com$/i

meta OUR_DOMAIN (__OUR_DOMAIN_FROM || __OUR_DOMAIN_ENVELOPE) && SPF_FAIL
describe OUR_DOMAIN claims to be from our domain but fails SPF
score OUR_DOMAIN 2.5

-- 
Daniel J McDonald, CCIE #2495, CISSP #78281, CNX
Austin Energy
http://www.austinenergy.com



Re: how to make a custom ruleset

2009-03-05 Thread Benny Pedersen

On Thu, March 5, 2009 14:31, Adi Nugroho wrote:
 I found that a lot of spam is using recipient email address as the
 sender. (from a...@internux.co.id to a...@internux.co.id, or from
 i...@apache.org to i...@apache.org).

all this happens on domains that have no spf and/or do not test spf in
the mta; when spf is used properly this will soon go away

 Since mail we send to ourselves usually gets a very low score, I hope
 it is safe to give this a BIG score (probably 2 or 3).

you know where you send from (ip-wise, smtp auth), so there should be no
problem building a wall against this

 Is there a hint how to make this custom rule set?

enable spf / dkim, test spf / dkim, problem solved

http://www.openspf.org/
http://www.dkim.org/

-- 
http://localhost/ 100% uptime and 100% mirrored :)



Re: NOTICE: mail delivery status.

2009-03-05 Thread Geert Batsleer
I added this to my local.cf. Is this syntax OK, or will it also block 'Club',
'Casino' and 'Vegas' if they are used separately?


header FROM_CASINO  From:name =~ /\Vegas Club Casino\b/i
describe FROM_CASINO   Casino Club Casino filter 04/03/09
score FROM_CASINO   10.0

I got the following response from my MTA


Reporting-MTA: dns; mail.removed.be
X-Postfix-Queue-ID: A047E104436
X-Postfix-Sender: rfc822; ru...@removed.be
Arrival-Date: Thu,  5 Mar 2009 11:35:51 +0100 (CET)

Final-Recipient: rfc822; hendr...@mail.removed.be
Original-Recipient: rfc822; t...@removed.be
Action: failed
Status: 5.0.0
Diagnostic-Code: X-Postfix; can't create user output file. Command output:
[14547] warn: Unrecognized escape \V passed through in regex; marked by <--
HERE in m/(?i)\V <-- HERE egas Club Casino\b/ at
/usr/lib/perl5/vendor_perl/5.8.5/Mail/SpamAssassin/Conf/Parser.pm line
1173. [14547] warn: Unrecognized escape \V passed through at
/etc/mail/spamassassin/local.cf, rule FROM_CASINO, line 1.


Re: NOTICE: mail delivery status.

2009-03-05 Thread Benny Pedersen

On Thu, March 5, 2009 16:01, Geert Batsleer wrote:

 header FROM_CASINO  From:name =~ /\Vegas Club Casino\b/i

header FROM_CASINO  From:name =~ /\bVegas Club Casino\b/i
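
On the question of whether the single words would be blocked: with \b on both sides the pattern only matches the whole phrase. A quick way to convince yourself is the Python sketch below. Python's re module is used here only because \b behaves the same way as in Perl for this pattern; the test names are invented.

```python
import re

# Same pattern as the corrected rule: the whole phrase, delimited by word
# boundaries, matched case-insensitively.
FROM_CASINO = re.compile(r"\bVegas Club Casino\b", re.IGNORECASE)


def hits(from_name):
    """True if the display name would trigger the rule."""
    return FROM_CASINO.search(from_name) is not None


if __name__ == "__main__":
    for name in ["Vegas Club Casino", "vegas club casino promotions",
                 "Club", "Casino Royale", "Las Vegas Club Casinos"]:
        print(f"{name!r}: {hits(name)}")
```

Only the first two names match; "Club" or "Casino" on their own do not, and neither does "Casinos", because the trailing \b requires the word to end there.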

-- 
http://localhost/ 100% uptime and 100% mirrored :)



Re: NOTICE: mail delivery status.

2009-03-05 Thread Matus UHLAR - fantomas
On 05.03.09 16:01, Geert Batsleer wrote:
 I added this to my local.cf. Is this syntax OK, or will it also block 'Club',
 'Casino' and 'Vegas' if they are used separately?
 
 
 header FROM_CASINO  From:name =~ /\Vegas Club Casino\b/i
 describe FROM_CASINO   Casino Club Casino filter 04/03/09
 score FROM_CASINO   10.0

\V is unknown to me, and apparently to perl too.

Also, a score of 10 is too much. Combined with BAYES_99, which has a standard
score of 3.5, and the standard required score of 5.0, a score of 1.505 should
be enough, even for cases where the sender has correct SPF, DKIM or whatever
other tests give a small negative score.
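
The arithmetic behind that suggestion, as a short Python sketch: 3.5 + 1.505 = 5.005, which just clears the required 5.0 and leaves a margin of 0.005 for small negative scores. The SPF_PASS value of -0.001 is an assumed example of such a small negative score, not a value taken from this thread.

```python
REQUIRED_SCORE = 5.0

# BAYES_99 and the threshold are the values quoted above; SPF_PASS is an
# assumed example of a rule with a small negative score.
SCORES = {"BAYES_99": 3.5, "FROM_CASINO": 1.505, "SPF_PASS": -0.001}


def total_score(hit_rules):
    """Sum the scores of all rules that hit on a message."""
    return sum(SCORES[rule] for rule in hit_rules)


def is_spam(hit_rules):
    """A message is tagged as spam once the total reaches the required score."""
    return total_score(hit_rules) >= REQUIRED_SCORE


if __name__ == "__main__":
    for rules in (["BAYES_99", "FROM_CASINO"],
                  ["BAYES_99", "FROM_CASINO", "SPF_PASS"],
                  ["FROM_CASINO"]):
        print(rules, round(total_score(rules), 3), is_spam(rules))
```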

But what do you mean by "if used separate"?
Defining a score for each word separately would work, although matching the
whole phrase is more reliable (the words vegas, club and casino can appear in
the From: lines of many mails).

Using ReplaceTags could help you if anyone starts obfuscating that...

 1173. [14547] warn: Unrecognized escape \V passed through at
 /etc/mail/spamassassin/local.cf, rule FROM_CASINO, line 1.

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
LSD will make your ECS screen display 16.7 million colors


Re: how to make a custom ruleset

2009-03-05 Thread Adi Nugroho
On Thursday 05 March 2009 22:28:23 Martin Gregorie wrote:
 describe SELF Trap mail with forged sender the same as recipient
 header   SELF From =~ /\...@my.address/i
 header   SELF To =~ /\...@my.address/i
 meta SELF 5.0

Dear Martin,

Thank you for the rule...

I made a file self.cf in /etc/mail/spamassassin:

describe SELF Trap mail with forged sender the same as recipient
header   SELF From =~ /\...@my.address/i
header   SELF To =~ /\...@my.address/i
meta SELF 5.0
score SELF 3.0

But all mail is identified as SELF :D

Did I misunderstand something?


Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-05 Thread Marc Perkel



decoder wrote:

Also - when it identifies messages do the numbers on the spam scores 
go up and ham goes down? If so that makes it more solid and starves 
the middle. I'm encouraged that the initial results are good.

What do you mean by that question, I don't really understand it :)




I suppose what I was thinking was that you still used the SA result but 
added or subtracted from the SA result based on your SVM code, sort of 
the way bayes does. Or are you letting SVM make the final determination?


In my SA processing I'm used to getting numbers back and processing mail
differently based on the grade of spam/ham. I was envisioning that this
new process would increase the accuracy and starve the middle, pushing
the results toward bigger ham/spam numbers.


Re: how to make a custom ruleset

2009-03-05 Thread Benny Pedersen

On Thu, March 5, 2009 16:27, Adi Nugroho wrote:

 describe SELF Trap mail with forged sender the same as recipient
 header   SELF From =~ /\...@my.address/i
 header   SELF_TO To =~ /\...@my.address/i
 meta SELF 5.0

oops

header SELF_FROM From =~ /\...@my.address/i
header SELF_TO To =~ /\...@my.address/i
meta SELF (SELF_FROM && SELF_TO)
describe SELF Trap mail with forged sender the same as recipient
score SELF 3.0

 But all mail is identified as SELF :D

then it works

 Did I misunderstand something?

nope, make sure NO_RELAYS or ALL_TRUSTED have higher (negative) scores than SELF

eg:
score NO_RELAYS -3.1
or
score ALL_TRUSTED -3.1

-- 
http://localhost/ 100% uptime and 100% mirrored :)



Re: Some emails pass spamassassin unprocessed

2009-03-05 Thread Monky



Karsten Bräckelmann-2 wrote:
 
 Mentioning some numbers is good, though too qualitative. How many mails
 is that per day, in absolute numbers?
 
Maybe 1-10 a day in total. I have several email accounts there and it
happens with all of them although not in a serious amount per account.


Karsten Bräckelmann-2 wrote:
 
 Of course, never forget that there's a default max-size per message.
 Unless told otherwise, spamc won't scan mail that's larger than 500
 kByte. Probably not your issue, though, unless (most) of the unprocessed
 mail you're talking about actually *is* large.
 
The problem is not size related. The spam emails that pass often are only
one line text mails. My settings do not specify a max-size since the default
value seems to suit me.


Karsten Bräckelmann-2 wrote:
 
 Also, this is the spamc default of safe fallback. That means, if there
 is any issue in communicating with the daemon, the message will be
 passed along unprocessed. But let's re-schedule that for later. (See
 follow-up post.)
 
The logfiles do not report any problems. I ran procmail with VERBOSE=yes for
some time now and it shows that even the unprocessed emails get passed to
spamc.


Karsten Bräckelmann-2 wrote:
 
 Unrelated: Are you using mbox or maildir storage? The $MAILDIR hints you
 actually are using maildir format, though the junk folder is an mbox
 file!
 With mbox, you seriously should add locking to any delivering recipe.
 
Maildir is being used for mail storage. I just use the mbox file to keep
detected spam emails instead of discarding them right away. Thanks for the
hint about locking; I added the colon to the recipe that writes to the mbox
file.


Karsten Bräckelmann-2 wrote:
 
 Yes. Check your logs. If you are running out of spamd children, you'll
 see something like this in the logs:
   prefork: child states 
 
 One state-indicating char per child. B is busy, I means idle. Idle
 processes are ready to take a message. If you're seeing too many busy
 children, your server can't handle the load.
 
 In that case, you /can/ increase the number of children, if you got
 plenty of RAM. Limiting the parallel resource usage by locking (in
 procmail) can help, too. And it definitely would be worth investigating
 more, like how long the children take for processing a message. Too long
 scan times can amplify this.
 
   guenther
 
I checked the logs and that does not seem to be the cause of the problem. My
log shows 'II' in most cases and rarely more than 3 threads running. My max
threads is 5 I think.
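
For anyone who wants to check a larger log the same way, here is a small Python sketch that summarizes a child-states string using only the two states described above (B = busy, I = idle). The function names and the sample strings are made up; extracting the string from the actual log line is left out, since the exact line format is not shown in this thread.

```python
from collections import Counter


def summarize_child_states(states):
    """Count busy (B) and idle (I) children in a state string such as 'BII'."""
    counts = Counter(states)
    return {"busy": counts["B"], "idle": counts["I"]}


def all_busy(states):
    """True if there is at least one child and none of them is idle."""
    return bool(states) and set(states) == {"B"}


if __name__ == "__main__":
    for sample in ["II", "BIBBB", "BBBBB"]:
        print(sample, summarize_child_states(sample), all_busy(sample))
```

A log that mostly shows 'II', as here, means children were available whenever a message arrived.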

I will now try adding locking to the spamc call in all the .procmailrc
files. I will get back to you =)
Many thanks for all the hints so far!
-- 
View this message in context: 
http://www.nabble.com/Some-emails-pass-spamassassin-unprocessed-tp22119041p22355105.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.



Re: SpamAssassin Doesn't Appear to be working

2009-03-05 Thread JasonHirsh


Got it  tweaked the settings


set

$sa_tag_level_deflt = ;


and the header now shows... I feel better even if it was working before 


-- 
View this message in context: 
http://www.nabble.com/SpamAssassin-Doesn%27t-Appear-to-be-working-tp22341459p22355132.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.


Re: how to make a custom ruleset

2009-03-05 Thread John Hardin

On Thu, 5 Mar 2009, Benny Pedersen wrote:


header SELF_FROM From =~ /\...@my.address/i
header SELF_TO To =~ /\...@my.address/i


Are you sure you want to give 1 point to each of those cases in addition 
to whatever points the meta adds?


If not, then they should be named __SELF_FROM and __SELF_TO
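
To make the difference concrete, here is a rough Python model of the scoring. It is a sketch and not how SpamAssassin is implemented: it assumes that a rule without an explicit score line counts 1 point, as mentioned above, and that rules whose names start with a double underscore never add to the total. The helper names are invented.

```python
DEFAULT_SCORE = 1.0  # assumed score of a rule that has no explicit "score" line


def evaluate(from_matches, to_matches, prefix=""):
    """Return the set of rules that hit; SELF is the meta of both sub-rules."""
    hits = set()
    if from_matches:
        hits.add(prefix + "SELF_FROM")
    if to_matches:
        hits.add(prefix + "SELF_TO")
    if from_matches and to_matches:
        hits.add("SELF")
    return hits


def message_score(hits, explicit_scores):
    """Add up the scores of all hits, skipping __-prefixed sub-rules."""
    total = 0.0
    for rule in hits:
        if rule.startswith("__"):
            continue
        total += explicit_scores.get(rule, DEFAULT_SCORE)
    return total


if __name__ == "__main__":
    scores = {"SELF": 3.0}
    print(message_score(evaluate(True, True), scores))         # 3.0 + 1 + 1
    print(message_score(evaluate(True, True, "__"), scores))   # 3.0 only
    print(message_score(evaluate(True, False), scores))        # 1 point for From alone
```

With the plain names a message hitting both sub-rules gets 5 points instead of the intended 3, and a message matching only one of them still gets 1 point.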

--
 John Hardin KA7OHZhttp://www.impsec.org/~jhardin/
 jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Failure to plan ahead on someone else's part does not constitute
  an emergency on my part. -- David W. Barts in a.s.r
---
 3 days until Daylight Saving Time begins in U.S. - Spring Forward


Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-05 Thread decoder

Marc Perkel wrote:


 I suppose what I was thinking was that you still used the SA result
 but added or subtracted from the SA result based on your SVM code,
 sort of the way bayes does. Or are you letting SVM make the final
 determination?

At the moment, I am only using the SVM answer. What you finally do with
it is the next step. You can of course use it like a normal rule and give
it a score. You could also rely on the SVM alone, but I think I'll go for
the scoring idea :) It would also be possible to use an SVM model that
supports confidence/probabilities. So far I have only evaluated the
precision/recall of this method on its own, without any scoring.




Chris





Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-05 Thread decoder

John Hardin wrote:
Would there be any benefit to having an offline version - i.e. 
something that evaluates the log or a corpus to generate new meta 
rules, that could be added onto the default ruleset? For instance:


cron @ 0200:
sa_meta_eval > /etc/mail/spamassassin/metarules.cf
/etc/init.d/spamassassin restart




This is definitely a good idea. You can create the SVM model offline from
a logfile alone, provided it includes the rules that hit and the ham/spam
status. However, you cannot generate meta rules with SVMs; for that
purpose you need a different learning algorithm (for example Bayes or
decision trees).


However, SVM classification is very cheap, so once you have created the model
offline, you can use it online really quickly with a plugin.




Cheers,



Chris




Re: how to make a custom ruleset

2009-03-05 Thread Benny Pedersen

On Thu, March 5, 2009 17:31, John Hardin wrote:
 header SELF_FROM From =~ /\...@my.address/i
 header SELF_TO To =~ /\...@my.address/i

 Are you sure you want to give 1 point to each of those cases in
 addition to whatever points the meta adds?

it was not me that made the rules, I just edited them :)

 If not, then they should be named __SELF_FROM and __SELF_TO

sure

when will you stop CC'ing me?

-- 
http://localhost/ 100% uptime and 100% mirrored :)



Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-05 Thread Justin Mason
On Thu, Mar 5, 2009 at 11:12, decoder deco...@own-hero.net wrote:
 Justin Mason wrote:

 Thanks for doing this!  couple of q's:

 1. I can offer a bigger ham/spam corpus if you'd like to test against
 that as well;
 corpora from multiple contributors can sometimes expose training set bias.


 That would be cool :) Is this corpus already processed by spamassassin (i.e.
 has SA headers)?

 My poc code currently mines only the headers to find out what rules are
 triggered.

yep, it is.  OK, let me take a look later on tonight and see if I can make
up a tarball for you...

 2. can you test it on spam that scored less than 10 points when it
 arrived?
 low-scoring spam is, of course, more useful to hit than stuff that scored
 highly
 on the existing rules.

 Things like that should be possible easily. I need to check if I have enough
 mails to
 do a sufficiently reliable test here.

cool.

 3. does it give an indication of confidence in its results? or just a
 binary spam/ham
 decision?

 I'm currently working only with a binary classifier. However, libsvm
 supports
 probability estimates and regression (and to my knowledge, internally, most
 SVM algorithms relax classification output to real values and then use the
 sign
 to determine the classification, this can also be seen as some sort of
 confidence value)

yep, that should work.

 4. hey, if you're writing an SVM plugin, it might be worth making one
 that _also_
 supports body text tokens, similarly to the existing Bayes plugin. ;)


 This would surely be possible somehow, but we'd first have to come up with a
 good
 representation of the problem for an SVM. I wouldn't want to mix this either
 with the
 current experiment, as these two things somehow represent different data.

 One of the problems with text tokens is that there can always be new ones
 (which would
 increase the dimension of the problem and hence require the whole SVM to be
 remodeled,
 so, a system as performant as bayes might not work directly.)

interesting, I hadn't thought of that angle.

--j.


Re: Dealing with low scoring spam - tighter MTA integration

2009-03-05 Thread Kenneth Porter
--On Thursday, March 05, 2009 7:43 AM +0100 Andrzej Adam Filip 
a...@onet.eu wrote:



What I would like to see is an option to make SpamAssassin produce
weighted scores based on the subset of all tests that can work on the subset
of the final data available *before* the message headers & body are
transferred in the SMTP session.


Before you get the DATA part, you only have the EHLO and envelope. Not a 
real need for a full-blown SA scan at that point. What rules would you 
apply that couldn't be done with a simple Perl function? (For lurkers, 
MIMEDefang allows one to write a Sendmail milter in Perl, by providing a 
C-to-Perl translation layer.)





Re: Dealing with low scoring spam - tighter MTA integration

2009-03-05 Thread Andrzej Adam Filip
Kenneth Porter sh...@sewingwitch.com wrote:

 --On Thursday, March 05, 2009 7:43 AM +0100 Andrzej Adam Filip
 a...@onet.eu wrote:

 What I would like to see is an option to make SpamAssassin produce
 weighted scores based on the subset of all tests that can work on the subset
 of the final data available *before* the message headers & body are
 transferred in the SMTP session.

 Before you get the DATA part, you only have the EHLO and envelope. 

At the RCPT TO: stage the following are available:
* connecting client IP address (last mail hop),
  so a big part of the DNSBL and DNSWL tests *CAN* be used
* envelope sender for SPF-based tests
* envelope sender and envelope recipient for auto white/black listing
  (producing some kind of greylisting for a first attempt from a source
  of unknown reputation)

 Not a real need for a full-blown SA scan at that point.

I try hard to preach that the SA methodology of creating a spam score based
on weighted tests *CAN* be applied at this point too.
I would like to apply such tests in a milter (MIMEDefang) that uses SA
anyway in my installation.
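
As a rough sketch of what weighted scoring on envelope-only data could look like: the test names, weights, and threshold below are all invented for illustration, and the lookups themselves (DNSBL, DNSWL, SPF, sender reputation) are assumed to have been done elsewhere and passed in as plain values. This is not SpamAssassin's or MIMEDefang's API.

```python
# Invented weights for illustration; real values would need tuning on a corpus.
WEIGHTS = {
    "dnsbl": 3.0,        # connecting client IP is on a blocklist
    "dnswl": -2.0,       # connecting client IP is on a whitelist
    "spf_fail": 2.5,     # envelope sender fails SPF
    "known_good": -1.5,  # sender/recipient pair has a good history
}
THRESHOLD = 5.0  # hypothetical reject threshold


def envelope_score(dnsbl_hit, dnswl_hit, spf_result, sender_known_good):
    """Sum the weights of every test that fires for this envelope."""
    score = 0.0
    if dnsbl_hit:
        score += WEIGHTS["dnsbl"]
    if dnswl_hit:
        score += WEIGHTS["dnswl"]
    if spf_result == "fail":
        score += WEIGHTS["spf_fail"]
    if sender_known_good:
        score += WEIGHTS["known_good"]
    return score


def should_reject(dnsbl_hit, dnswl_hit, spf_result, sender_known_good):
    """Reject at RCPT TO only when the combined score reaches the threshold."""
    score = envelope_score(dnsbl_hit, dnswl_hit, spf_result, sender_known_good)
    return score >= THRESHOLD


if __name__ == "__main__":
    print(envelope_score(True, False, "fail", False))
    print(should_reject(True, False, "fail", True))
```

No single test decides the outcome here: a sender with a good history can offset a blocklist hit, which is the property of weighted tests being argued for.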

 What rules would you apply that couldn't be done with a simple Perl
 function?

SA is not a simple set of perl functions? ;-)

Delivering such functionality via SA would keep the weights in sync
with changing spamming patterns. Some spammers are smart, and many
are smart enough to follow them, so the quality of the maintenance team
and of the maintenance methodology does make a difference.

 (For lurkers, MIMEDefang allows one to write a Sendmail milter in
 Perl, by providing a C-to-Perl translation layer.)

-- 
[plen: Andrew] Andrzej Adam Filip : a...@onet.eu
You can't have everything.  Where would you put it?
  -- Steven Wright


Re: Dealing with low scoring spam - tighter MTA integration

2009-03-05 Thread James Wilkinson
Andrzej Adam Filip wrote:
 At the RCPT TO: stage the following are available:
 * connecting client IP address (last mail hop),
   so a big part of the DNSBL and DNSWL tests *CAN* be used
 * envelope sender for SPF-based tests
 * envelope sender and envelope recipient for auto white/black listing
   (producing some kind of greylisting for a first attempt from a source
   of unknown reputation)

Are you thinking that it might be good to tie this in to the
SpamAssassin AWL score? So a sender with an existing low AWL might be
allowed through even if the sending host gets on one or two DNSBLs?

And you’re missing the possibility of doing reverse DNS lookups, too.

James.

-- 
E-mail: james@ | A: Because people don’t normally read bottom to top.
aprilcottage.co.uk | Q: Why is top-posting such a bad thing?
   | A: Top-posting.
   | Q: What is the most annoying thing in e-mail and usenet?


Re: Dealing with low scoring spam - tighter MTA integration

2009-03-05 Thread Andrzej Adam Filip
James Wilkinson sa-u...@aprilcottage.co.uk wrote:

 Andrzej Adam Filip wrote:
 At the RCPT TO: stage the following are available:
 * connecting client IP address (last mail hop),
   so a big part of the DNSBL and DNSWL tests *CAN* be used
 * envelope sender for SPF-based tests
 * envelope sender and envelope recipient for auto white/black listing
   (producing some kind of greylisting for a first attempt from a source
   of unknown reputation)

 Are you thinking that it might be good to tie this in to the
 SpamAssassin AWL score? So a sender with an existing low AWL might be
 allowed through even if the sending host gets on one or two DNSBLs?

I want a platform allowing many people to contribute
small improvements, e.g. white-listing based on a combination
of sender address and ASN (or routing prefix).

 And you’re missing the possibility of doing reverse DNS lookups, too.

I considered it an obvious derivative of the connecting client IP address.

-- 
[plen: Andrew] Andrzej Adam Filip : a...@onet.eu
Seek simplicity -- and distrust it.
  -- Alfred North Whitehead


Re: how to make a custom ruleset

2009-03-05 Thread LuKreme

On Mar 5, 2009, at 7:28, Martin Gregorie mar...@gregorie.org wrote:

On Thu, 2009-03-05 at 21:31 +0800, Adi Nugroho wrote:
I found that a lot of spam is using recipient email address as the sender
(from a...@internux.co.id to a...@internux.co.id, or from i...@apache.org to
i...@apache.org).


The only disadvantage is that you'll label test messages as spam.


If you allow address delimiters this is trivial to get around: just
have them email their test to user+t...@example.com
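
A small sketch of why the delimiter trick works, assuming the rule in question simply compares the From and To addresses. The function name `same_sender_and_recipient` and the address `user+test@example.com` are hypothetical stand-ins chosen for illustration; whether the tagged address is delivered to the base mailbox depends on the MTA configuration.

```python
from email.utils import parseaddr


def same_sender_and_recipient(from_header, to_header):
    """True when the From and To headers carry the same address (case-insensitive)."""
    sender = parseaddr(from_header)[1].lower()
    recipient = parseaddr(to_header)[1].lower()
    return bool(sender) and sender == recipient


if __name__ == "__main__":
    # Forged sender equal to the recipient: the check fires.
    print(same_sender_and_recipient("User <user@example.com>", "user@example.com"))
    # Test mail sent to a tagged address: the strings differ, so it does not fire.
    print(same_sender_and_recipient("user@example.com", "user+test@example.com"))
```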









Re: Dealing with low scoring spam - tighter MTA integration

2009-03-05 Thread Kenneth Porter
--On Thursday, March 05, 2009 10:31 PM +0100 Andrzej Adam Filip 
a...@onet.eu wrote:



I try hard to preach that the SA methodology of creating a spam score based
on weighted tests *CAN* be applied at this point too.
I would like to apply such tests in a milter (MIMEDefang) that uses SA
anyway in my installation.


A cheap way of doing it would be to construct an artificial message from 
the information available. One would probably want to use a custom set of 
rules (i.e. strip out most of the normal rules that assume a full set of 
headers and a regular body).
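
A minimal sketch of building such an artificial message, using only Python's standard `email` package. The header layout is an assumption: the `Received` line format, the host name `mx.example.com`, and the `X-Envelope-To` header are made up for illustration, and whether a given scanner would read them the way it reads real headers would have to be checked.

```python
from email.message import EmailMessage


def build_stub_message(client_ip, helo_name, mail_from, rcpt_to):
    """Build a body-less message carrying only envelope-derived headers."""
    msg = EmailMessage()
    msg["Return-Path"] = f"<{mail_from}>"
    # Hypothetical Received line describing the last hop.
    msg["Received"] = f"from {helo_name} ([{client_ip}]) by mx.example.com with ESMTP"
    # Hypothetical header recording the envelope recipient.
    msg["X-Envelope-To"] = rcpt_to
    return msg


if __name__ == "__main__":
    stub = build_stub_message(
        "192.0.2.10", "mail.example.org", "sender@example.org", "rcpt@example.net"
    )
    print(stub.as_string())
```

The resulting text could then be handed to whatever scanner is in use; that step is deliberately left out here because it depends on the installation.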



At the RCPT TO: stage the following are available:
* connecting client IP address (last mail hop),
  so a big part of the DNSBL and DNSWL tests *CAN* be used
* envelope sender for SPF-based tests
* envelope sender and envelope recipient for auto white/black listing
  (producing some kind of greylisting for a first attempt from a source
  of unknown reputation)


Instead of running all of SA, perhaps you could just invoke the individual 
plugins from their Perl entry points. I'm not familiar enough with SA's 
architecture to know how practical that is, though.


Re: how to make a custom ruleset

2009-03-05 Thread Adi Nugroho
On Thursday 05 March 2009 23:44:39 Benny Pedersen wrote:
 ups

 header SELF_FROM From =~ /\...@my.address/i
 header SELF_TO To =~ /\...@my.address/i
 meta SELF (SELF_FROM && SELF_TO)
 describe SELF Trap mail with forged sender the same as recipient
 score SELF 3.0

I have tried the above syntax, but it failed:
no mail was identified as SELF.

Is there a howto about this ruleset?