Re: Bayes autolearn questions

2014-09-09 Thread Axb

On 09/09/2014 03:50 PM, Alex Regan wrote:

Hi,


Did you understand that all
tokens are learned, regardless of whether they have been seen before?


That doesn't really matter from a user perspective, though, right? I
mean, if tokens that have already been learned are learned again, the
net result is zero.


Very much not zero. Each token has several values associated with it:
  # ham
  # spam
  time-stamp

So each time it's learned its respective ham/spam counter is incremented
which indicates how spammy or hammy a given token is and its time-stamp is
updated indicating how "fresh" a token is. The bayes expiry process removes
"stale" tokens when it does its job to prune the database down to size.


Ah, yes, of course. I knew about that, but somehow didn't put it
together with this.

I would like to know why, after training similar messages a number of
times, new similar messages still get the same bayes score.

I'd also like to figure out how many more times a message needs to be
re-trained before its score reflects the desired classification.

I have a particular FN that frequently scores bayes50, sometimes lower.
A few dozen similar messages are properly tagged as spam every day, yet
still score bayes50. I pull them out of the quarantine and keep
training them as spam, but a few still get through every day.

Is there any particular analysis I can do on one of the FNs that can
tell me how far off the bayes50 is from becoming bayes99 in a similar
message?

Hopefully that's clear. I understand there's a large number of variables
involved here, and I would think the fewer tokens a message has, the
harder it should be to sway the classifier, but it's frustrating to see
bayes50 so repeatedly...


You could add

report BAYES_HT _HAMMYTOKENS(50)_
report BAYES_ST _SPAMMYTOKENS(50)_

to your local.cf to add a token report and see which tokens are being
seen.
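An alternative, if you'd rather have the token details in a message
header than in the report template (the header names below are just
examples, not a SpamAssassin default):

  add_header all Bayes-HammyTokens _HAMMYTOKENS(50)_
  add_header all Bayes-SpammyTokens _SPAMMYTOKENS(50)_

Either way, the _HAMMYTOKENS(n)_ / _SPAMMYTOKENS(n)_ template tags
expand to the n strongest hammy/spammy tokens found in the message.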




Re: Bayes autolearn questions

2014-09-09 Thread Alex Regan

Hi,


Did you understand that all
tokens are learned, regardless of whether they have been seen before?


That doesn't really matter from a user perspective, though, right? I
mean, if tokens that have already been learned are learned again, the
net result is zero.


Very much not zero. Each token has several values associated with it:
  # ham
  # spam
  time-stamp

So each time it's learned its respective ham/spam counter is incremented
which indicates how spammy or hammy a given token is and its time-stamp is
updated indicating how "fresh" a token is. The bayes expiry process removes
"stale" tokens when it does its job to prune the database down to size.


Ah, yes, of course. I knew about that, but somehow didn't put it 
together with this.


I would like to know why, after training similar messages a number of
times, new similar messages still get the same bayes score.


I'd also like to figure out how many more times a message needs to be
re-trained before its score reflects the desired classification.


I have a particular FN that frequently scores bayes50, sometimes lower.
A few dozen similar messages are properly tagged as spam every day, yet
still score bayes50. I pull them out of the quarantine and keep
training them as spam, but a few still get through every day.


Is there any particular analysis I can do on one of the FNs that can
tell me how far off the bayes50 is from becoming bayes99 in a similar
message?


Hopefully that's clear. I understand there's a large number of variables
involved here, and I would think the fewer tokens a message has, the
harder it should be to sway the classifier, but it's frustrating to see
bayes50 so repeatedly...


Thanks,
Alex


Re: Bayes autolearn questions

2014-09-09 Thread Alex Regan

Hi,


Please use plain-text rather than HTML. In particular with that really
bad indentation format of quoting.


It doesn't seem possible with gmail directly any longer, so I've set up
thunderbird for this. Maybe it is possible, but I couldn't find it after
clicking around in the obvious places.


It's possible.  A little googling reveals how:

When composing a message (or reply), click the little downward-facing triangle on the 
bottom right of the compose box (next to the trash can).  From the pop-up menu, click 
"plain text mode."

Haven't tried it personally, but seems like it should work as advertised.


That looks like it, thanks. I figured it would be with the rest of the 
fonts and formatting section. It's a per-message thing, though. It was 
just easier to set up Thunderbird anyway.


Thanks,
Alex


Re: Bayes autolearn questions

2014-09-08 Thread Amir Caspi
On Sep 8, 2014, at 7:17 PM, Alex Regan  wrote:

>> Please use plain-text rather than HTML. In particular with that really
>> bad indentation format of quoting.
> 
> It doesn't seem possible with gmail directly any longer, so I've set up
> thunderbird for this. Maybe it is possible, but I couldn't find it after
> clicking around in the obvious places.

It's possible.  A little googling reveals how:

When composing a message (or reply), click the little downward-facing triangle 
on the bottom right of the compose box (next to the trash can).  From the 
pop-up menu, click "plain text mode."

Haven't tried it personally, but seems like it should work as advertised.

--- Amir




Re: Bayes autolearn questions

2014-09-08 Thread David B Funk

On Mon, 8 Sep 2014, Alex Regan wrote:


Did you understand that the number of previously not seen tokens has
absolutely nothing to do with auto-learning?


Yes, that was a mistake.


Did you understand that all
tokens are learned, regardless of whether they have been seen before?


That doesn't really matter from a user perspective, though, right? I mean, if 
tokens that have already been learned are learned again, the net
result is zero.


Very much not zero. Each token has several values associated with it:
 # ham
 # spam
 time-stamp

So each time it's learned its respective ham/spam counter is incremented
which indicates how spammy or hammy a given token is and its time-stamp is
updated indicating how "fresh" a token is. The bayes expiry process removes
"stale" tokens when it does its job to prune the database down to size.

Thus learning a token multiple times increases its weight and keeps it
"fresh" so it is kept as an active/relevant piece of info.

--
Dave Funk  University of Iowa
College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center
Sys_admin/Postmaster/cell_adminIowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{


Re: Bayes autolearn questions

2014-09-08 Thread Alex Regan

Hi,


Please use plain-text rather than HTML. In particular with that really
bad indentation format of quoting.


It doesn't seem possible with gmail directly any longer, so I've set up
thunderbird for this. Maybe it is possible, but I couldn't find it after
clicking around in the obvious places.



X-Spam-MyReport: Tokens: new, 47; hammy, 7; neutral, 54; spammy, 16.

Isn't that sufficient for auto-learning this message as spam?

 
That's clearly referring to the _TOKEN_ data in the custom header, is it
not?


Yes. Burning the candle at both ends. Really overworked.


Sorry to hear. Nonetheless, did you take the time to really understand
my explanations? It seems you sometimes didn't in the past, and I am not
happy to waste my time on other people's problems if they aren't
following thoroughly.


Yes, always. It may not be immediately, but the time you give up to do 
this is not lost on me. My brain sometimes goes faster than I can 
explain myself properly. I make too many assumptions about what people 
understand about me, my abilities, and my comprehension of a topic.



Learning is not limited to new tokens. All tokens are learned,
regardless of their current (h|sp)ammyness.

Still, the number of (new) tokens is not a condition for auto-learning.
That header shows some more or less nice information, but in this
context absolutely irrelevant information.


I understood "new" to mean the tokens that have not been seen before, and
would be learned if the other conditions were met.


Well, yes. So what?

Did you understand that the number of previously not seen tokens has
absolutely nothing to do with auto-learning?


Yes, that was a mistake.


Did you understand that all
tokens are learned, regardless of whether they have been seen before?


That doesn't really matter from a user perspective, though, right? I 
mean, if tokens that have already been learned are learned again, the
net result is zero.



This whole part is entirely unrelated to auto-learning and your original
question.


Yes, I see that, and much of it comes down to my not explaining myself
properly originally. I really only meant to tie it in with the tokens that
would have been learned had it been determined that autolearning would
take place.


I understand now that all the tokens are learned always anyway.


As I have mentioned before in this thread: It is NOT the message's
reported total score that must exceed the threshold. The auto-learning
discriminator uses an internally calculated score using the respective
non-Bayes scoreset.


Very helpful, thanks. Is there a way to see more about how it makes that
decision on a particular message?


   spamassassin -D learn

Unsurprisingly, the -D debug option shows information on that decision.
In this case limiting debug output to the 'learn' area comes in handy,
eliminating the noise.

The output includes the important details like auto-learn decision with
human readable explanation, score computed for autolearn as well as head
and body points.


It's been a long time since I've gone through the debug output for bayes
info, but I have done that. Now I'll have a somewhat better
understanding of what it means, and can start to improve my overall
understanding of the bayes component of spamassassin.


Hopefully others also benefited from this crazy thread as much as I did.

Thanks,
Alex



Re: Bayes autolearn questions

2014-09-06 Thread Karsten Bräckelmann
Please use plain-text rather than HTML. In particular with that really
bad indentation format of quoting.


On Sat, 2014-09-06 at 17:22 -0400, Alex wrote:
> On Thu, Sep 4, 2014 at 1:44 PM, Karsten Bräckelmann wrote:
> > On Wed, 2014-09-03 at 23:50 -0400, Alex wrote:
> >
> > > > > I looked in the quarantined message, and according to the _TOKEN_
> > > > > header I've added:
> > > > >
> > > > > X-Spam-MyReport: Tokens: new, 47; hammy, 7; neutral, 54; spammy, 16.
> > > > >
> > > > > Isn't that sufficient for auto-learning this message as spam?
> > 
> > That's clearly referring to the _TOKEN_ data in the custom header, is it
> > not?
> 
> Yes. Burning the candle at both ends. Really overworked.

Sorry to hear. Nonetheless, did you take the time to really understand
my explanations? It seems you sometimes didn't in the past, and I am not
happy to waste my time on other people's problems if they aren't
following thoroughly.


> > > > That has absolutely nothing to do with auto-learning. Where did you get
> > > > the impression it might?
> > >
> > > If the conditions for autolearning had been met, I understood that it
> > > would be those new tokens that would be learned.
> >
> > Learning is not limited to new tokens. All tokens are learned,
> > regardless of their current (h|sp)ammyness.
> >
> > Still, the number of (new) tokens is not a condition for auto-learning.
> > That header shows some more or less nice information, but in this
> > context absolutely irrelevant information.
> 
> I understood "new" to mean the tokens that have not been seen before, and
> would be learned if the other conditions were met.

Well, yes. So what?

Did you understand that the number of previously not seen tokens has
absolutely nothing to do with auto-learning? Did you understand that all
tokens are learned, regardless of whether they have been seen before?

This whole part is entirely unrelated to auto-learning and your original
question.


> > Auto-learning in a nutshell: Take all tests hit. Drop some of them with
> > certain tflags, like the BAYES_xx rules. For the remaining rules, look
> > up their scores in the non-Bayes scoreset 0 or 1. Sum up those scores to
> > a total, and compare with the auto-learn threshold values. For spam,
> > also check there are at least 3 points each by header and body rules.
> > Finally, if all that matches, learn.
> 
> Is it important to understand how those three points are achieved or
> calculated?

In most cases, no, I guess. That distinction is usually easy to make
based on the rule's type: header vs body-ish rule definitions.

If the re-calculated total score in scoreset 0 or 1 exceeds the
auto-learn threshold but the message still is not learned -- then it is
important. Unless you trust the auto-learn discriminator not to cheat on
you.


> > > Okay, of course I understood the difference between points and tokens.
> > > Since the points were over the specified threshold, I thought those
> > > new tokens would have been added.
> >
> > As I have mentioned before in this thread: It is NOT the message's
> > reported total score that must exceed the threshold. The auto-learning
> > discriminator uses an internally calculated score using the respective
> > non-Bayes scoreset.
> 
> Very helpful, thanks. Is there a way to see more about how it makes that
> decision on a particular message?

  spamassassin -D learn

Unsurprisingly, the -D debug option shows information on that decision.
In this case limiting debug output to the 'learn' area comes in handy,
eliminating the noise.

The output includes the important details like auto-learn decision with
human readable explanation, score computed for autolearn as well as head
and body points.
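
For example, with the message saved to a file (msg.eml is just a
placeholder name; debug output goes to stderr):

  spamassassin -D learn < msg.eml > /dev/null

Look for the auto-learn decision lines in the resulting debug output.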





Re: Bayes autolearn questions

2014-09-06 Thread Alex
Hi,

On Thu, Sep 4, 2014 at 1:44 PM, Karsten Bräckelmann wrote:

> On Wed, 2014-09-03 at 23:50 -0400, Alex wrote:
>
> > > > I looked in the quarantined message, and according to the _TOKEN_
> > > > header I've added:
> > > >
> > > > X-Spam-MyReport: Tokens: new, 47; hammy, 7; neutral, 54; spammy, 16.
> > > >
> > > > Isn't that sufficient for auto-learning this message as spam?
> 
> That's clearly referring to the _TOKEN_ data in the custom header, is it
> not?
>

Yes. Burning the candle at both ends. Really overworked.


> > > That has absolutely nothing to do with auto-learning. Where did you get
> > > the impression it might?
> >
> > If the conditions for autolearning had been met, I understood that it
> > would be those new tokens that would be learned.
>
> Learning is not limited to new tokens. All tokens are learned,
> regardless of their current (h|sp)ammyness.
>
> Still, the number of (new) tokens is not a condition for auto-learning.
> That header shows some more or less nice information, but in this
> context absolutely irrelevant information.
>

I understood "new" to mean the tokens that have not been seen before, and
would be learned if the other conditions were met.


> Auto-learning in a nutshell: Take all tests hit. Drop some of them with
> certain tflags, like the BAYES_xx rules. For the remaining rules, look
> up their scores in the non-Bayes scoreset 0 or 1. Sum up those scores to
> a total, and compare with the auto-learn threshold values. For spam,
> also check there are at least 3 points each by header and body rules.
> Finally, if all that matches, learn.
>

Is it important to understand how those three points are achieved or
calculated?


> > Okay, of course I understood the difference between points and tokens.
> > Since the points were over the specified threshold, I thought those
> > new tokens would have been added.
>
> As I have mentioned before in this thread: It is NOT the message's
> reported total score that must exceed the threshold. The auto-learning
> discriminator uses an internally calculated score using the respective
> non-Bayes scoreset.
>

Very helpful, thanks. Is there a way to see more about how it makes that
decision on a particular message?

Thanks,
Alex


Re: Bayes autolearn questions

2014-09-04 Thread Karsten Bräckelmann
On Wed, 2014-09-03 at 23:50 -0400, Alex wrote:

> > > I looked in the quarantined message, and according to the _TOKEN_
> > > header I've added:
> > > 
> > > X-Spam-MyReport: Tokens: new, 47; hammy, 7; neutral, 54; spammy, 16.
> > > 
> > > Isn't that sufficient for auto-learning this message as spam?

That's clearly referring to the _TOKEN_ data in the custom header, is it
not?

> > That has absolutely nothing to do with auto-learning. Where did you get
> > the impression it might?
> 
> If the conditions for autolearning had been met, I understood that it
> would be those new tokens that would be learned.

Learning is not limited to new tokens. All tokens are learned,
regardless of their current (h|sp)ammyness.

Still, the number of (new) tokens is not a condition for auto-learning.
That header shows some more or less nice information, but in this
context absolutely irrelevant information.


Auto-learning in a nutshell: Take all tests hit. Drop some of them with
certain tflags, like the BAYES_xx rules. For the remaining rules, look
up their scores in the non-Bayes scoreset 0 or 1. Sum up those scores to
a total, and compare with the auto-learn threshold values. For spam,
also check there are at least 3 points each by header and body rules.
Finally, if all that matches, learn.
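
Purely as an illustration of that flow -- this is not SpamAssassin's
actual code, and the rule data below is made up -- a sketch:

  # Sketch of the auto-learn spam decision described above (illustrative only).
  # Each hit: (rule name, score from the non-Bayes scoreset, tflags, rule kind).
  hits = [
      ("KAM_LINKBAIT", 5.0, set(),     "body"),
      ("RCVD_IN_PSBL", 2.3, {"net"},   "header"),
      ("BAYES_50",     0.8, {"learn"}, "body"),    # dropped: 'learn' tflag
  ]

  def autolearn_as_spam(hits, threshold=9.0):
      # drop rules whose tflags exclude them from the auto-learn score
      usable = [h for h in hits
                if not ({"learn", "userconf", "noautolearn"} & h[2])]
      total = sum(score for _, score, _, _ in usable)
      head = sum(score for _, score, _, kind in usable if kind == "header")
      body = sum(score for _, score, _, kind in usable if kind == "body")
      # spam: threshold met AND at least 3 points each from header and body
      return total >= threshold and head >= 3.0 and body >= 3.0

  print(autolearn_as_spam(hits))  # False: total 7.3, header points only 2.3

Which also shows how a message can score well past required_hits
overall and still not be auto-learned.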


> Okay, of course I understood the difference between points and tokens.
> Since the points were over the specified threshold, I thought those
> new tokens would have been added.

As I have mentioned before in this thread: It is NOT the message's
reported total score that must exceed the threshold. The auto-learning
discriminator uses an internally calculated score using the respective
non-Bayes scoreset.





Re: Bayes autolearn questions

2014-09-03 Thread Alex
Hi,

> > However, spam with scores greater than 9.0 aren't being autolearned:
>
> http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html
>
> > Sep  2 21:01:51 mail01 amavis[25938]: (25938-10)
> > header_edits_for_quar:  ->
> > , Yes, score=16.519 tag=-200 tag2=5 kill=5
> > tests=[BAYES_50=0.8, KAM_LAZY_DOMAIN_SECURITY=1, KAM_LINKBAIT=5,
> > LOC_DOT_SUBJ=0.1, LOC_SHORT=3.1, RCVD_IN_BL_SPAMCOP_NET=1.347,
> > RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_PSBL=2.3,
> > RCVD_IN_UCEPROTECT1=0.01, RCVD_IN_UCEPROTECT2=0.01, RDNS_NONE=0.793,
> > RELAYCOUNTRY_CN=0.1, RELAYCOUNTRY_HIGH=0.5, SAGREY=0.01] autolearn=no
> > autolearn_force=no
> >
> > I've re-read the autolearn section of the docs,
>
> The one I linked to above?

Yes, and the FAQ entry regarding reasons why autolearn doesn't work.

> I looked in the quarantined message, and according to the _TOKEN_
> header I've added:
>
> X-Spam-MyReport: Tokens: new, 47; hammy, 7; neutral, 54; spammy, 16.
>
> Isn't that sufficient for auto-learning this message as spam?
>
> That has absolutely nothing to do with auto-learning. Where did you get
> the impression it might?

If the conditions for autolearning had been met, I understood that it would
be those new tokens that would be learned.

> > I just wanted to be sure this is just a case of not enough new points
> > (tokens?) for the message to be learned, and that I wasn't doing
> > something wrong.
>
> Points: aka score, used in the context of per-rule (per-test) and
> overall score classifying a message based on the required_score setting.
>
> Token: think of it as "word" used by the Bayesian classifier sub-system.
> In practice, it is more complicated than simply space separated words.
> Context (e.g. headers) and case might be taken into account, too.

Okay, of course I understood the difference between points and tokens.
Since the points were over the specified threshold, I thought those new
tokens would have been added.

I'll continue reading and experimenting. Posting very late again. Thanks
guys for your help, as always.

Thanks,
Alex


Re: Bayes autolearn questions

2014-09-02 Thread Karsten Bräckelmann
On Tue, 2014-09-02 at 21:16 -0600, LuKreme wrote:
> On 02 Sep 2014, at 20:50 , Karsten Bräckelmann  wrote:
> > On Tue, 2014-09-02 at 20:22 -0600, LuKreme wrote:

> >> I believe the score threshold is the base score WITHOUT bayes.
> >> 
> >> Try running the email through with a -D flag and see what you get.
> >> 
> >> (And that is only a partial answer, the threshold number ignores
> >> certain classes of tests beyond bayes,but I don't remember which ones.
> >> It's unfortunate that the learn_threshold_spam uses a number that
> >> appears to be related to the spam score, because it isn't.
> > 
> > It is. Using the accompanying, non-Bayes score-set. To avoid direct
> > Bayes self-feeding, and other rules indirect self-feeding due to Bayes-
> > enabled scores.
> > 
> > BTW, if one knows of that mysterious (bayes_auto_) learn_threshold_spam
> > you mentioned, one found the AutoLearnThreshold doc mentioning exactly
> > that: Bayes auto-learning is based on non-Bayes scores.
> 
> But that is not the case. You can have a score without bayes that
> exceeds the threshold and still have the message not auto-learned.

True.

I chose not to repeat myself by highlighting the details and mentioning
the constraint on header and body rules' points. See my other post to
this thread half an hour earlier. And the docs.





Re: Bayes autolearn questions

2014-09-02 Thread LuKreme

On 02 Sep 2014, at 20:50 , Karsten Bräckelmann  wrote:

> On Tue, 2014-09-02 at 20:22 -0600, LuKreme wrote:
>> On 02 Sep 2014, at 19:11 , Alex  wrote:
>> 
>>> However, spam with scores greater than 9.0 aren't being autolearned:
>> 
>> I believe the score threshold is the base score WITHOUT bayes.
>> 
>> Try running the email through with a -D flag and see what you get.
>> 
>> (And that is only a partial answer: the threshold number ignores
>> certain classes of tests beyond bayes, but I don't remember which ones.
>> It's unfortunate that the learn_threshold_spam uses a number that
>> appears to be related to the spam score, because it isn't.)
> 
> It is. Using the accompanying, non-Bayes score-set. To avoid direct
> Bayes self-feeding, and other rules indirect self-feeding due to Bayes-
> enabled scores.
> 
> BTW, if one knows of that mysterious (bayes_auto_) learn_threshold_spam
> you mentioned, one found the AutoLearnThreshold doc mentioning exactly
> that: Bayes auto-learning is based on non-Bayes scores.

But that is not the case. You can have a score without bayes that exceeds the
threshold and still have the message not auto-learned.


-- 
'They're the cream!' Rincewind sighed. 'Cohen, they're the cheese.'



Re: Bayes autolearn questions

2014-09-02 Thread Karsten Bräckelmann
On Tue, 2014-09-02 at 20:22 -0600, LuKreme wrote:
> On 02 Sep 2014, at 19:11 , Alex  wrote:
> 
> > However, spam with scores greater than 9.0 aren't being autolearned:
> 
> I believe the score threshold is the base score WITHOUT bayes.
> 
> Try running the email through with a -D flag and see what you get.
> 
> (And that is only a partial answer: the threshold number ignores
> certain classes of tests beyond bayes, but I don't remember which ones.
> It's unfortunate that the learn_threshold_spam uses a number that
> appears to be related to the spam score, because it isn't.)

It is. Using the accompanying, non-Bayes score-set. To avoid direct
Bayes self-feeding, and other rules indirect self-feeding due to Bayes-
enabled scores.

BTW, if one knows of that mysterious (bayes_auto_) learn_threshold_spam
you mentioned, one found the AutoLearnThreshold doc mentioning exactly
that: Bayes auto-learning is based on non-Bayes scores.





Re: Bayes autolearn questions

2014-09-02 Thread LuKreme

On 02 Sep 2014, at 19:11 , Alex  wrote:

> However, spam with scores greater than 9.0 aren't being autolearned:

I believe the score threshold is the base score WITHOUT bayes.

Try running the email through with a -D flag and see what you get.

(And that is only a partial answer: the threshold number ignores certain
classes of tests beyond bayes, but I don't remember which ones. It's unfortunate
that the learn_threshold_spam uses a number that appears to be related to the
spam score, because it isn't.)
 
-- 
It's like a cow's opinion. It just doesn't matter. It's moo



Re: Bayes autolearn questions

2014-09-02 Thread Karsten Bräckelmann
On Tue, 2014-09-02 at 21:11 -0400, Alex wrote:
> I have a spamassassin-3.4 system with the following bayes config:
> 
> required_hits 5.0
> rbl_timeout 8
> use_bayes 1
> bayes_auto_learn 1
> bayes_auto_learn_on_error 1
> bayes_auto_learn_threshold_spam 9.0
> bayes_expiry_max_db_size 950
> bayes_auto_expire 0
> 
> However, spam with scores greater than 9.0 aren't being autolearned:

http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html
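
Those bayes_auto_learn_threshold_* settings are implemented by the
AutoLearnThreshold plugin, so it must be loaded for them to take effect.
Normally it already is, via one of the stock .pre files:

  loadplugin Mail::SpamAssassin::Plugin::AutoLearnThreshold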


> Sep  2 21:01:51 mail01 amavis[25938]: (25938-10)
> header_edits_for_quar:  ->
> , Yes, score=16.519 tag=-200 tag2=5 kill=5
> tests=[BAYES_50=0.8, KAM_LAZY_DOMAIN_SECURITY=1, KAM_LINKBAIT=5,
> LOC_DOT_SUBJ=0.1, LOC_SHORT=3.1, RCVD_IN_BL_SPAMCOP_NET=1.347,
> RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_PSBL=2.3,
> RCVD_IN_UCEPROTECT1=0.01, RCVD_IN_UCEPROTECT2=0.01, RDNS_NONE=0.793,
> RELAYCOUNTRY_CN=0.1, RELAYCOUNTRY_HIGH=0.5, SAGREY=0.01] autolearn=no
> autolearn_force=no
> 
> I've re-read the autolearn section of the docs,

The one I linked to above?

> and don't see any reason why this 16-point email wouldn't have any new
> tokens to be learned?

Rules with certain tflags are ignored when determining whether a message
should be trained upon. Most notably here BAYES_xx.

Moreover, the auto-learning decision occurs using scores from either
scoreset 0 or 1, that is using scores of a non-Bayes scoreset. IOW the
message's score of 16 is irrelevant, since the auto-learn algorithm uses
different scores per rule.

Next safety net is requiring at least 3 points each from header and body
rules, unless autolearn_force is enabled. Which it is not in your
sample.

Either of those could have prevented auto-learning.
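
If you want to check those per-rule scores yourself: the score lines in
the rules files carry four values, one per scoreset (0-3), and sets 0
and 1 are the non-Bayes ones. A sketch, with an example rule and a path
that will vary by installation:

  # score RULENAME <set0> <set1> <set2> <set3>
  grep '^score RCVD_IN_PSBL' \
    /var/lib/spamassassin/*/updates_spamassassin_org/*.cf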


Also, according to your wording, you seem to think in terms of (number
of) "new tokens to be learned". Which has nothing in common with
auto-learning.

(Even worse, "new tokens" would strongly apply to random gibberish
strings, hapaxes in Bayes context. Which are commonly ignored in Bayes
classification.)


> I looked in the quarantined message, and according to the _TOKEN_
> header I've added:
> 
> X-Spam-MyReport: Tokens: new, 47; hammy, 7; neutral, 54; spammy, 16.
> 
> Isn't that sufficient for auto-learning this message as spam?

That has absolutely nothing to do with auto-learning. Where did you get
the impression it might?


> I just wanted to be sure this is just a case of not enough new points
> (tokens?) for the message to be learned, and that I wasn't doing
> something wrong.

Points: aka score, used in the context of per-rule (per-test) and
overall score classifying a message based on the required_score setting.

Token: think of it as "word" used by the Bayesian classifier sub-system.
In practice, it is more complicated than simply space separated words.
Context (e.g. headers) and case might be taken into account, too.





Bayes autolearn questions

2014-09-02 Thread Alex
Hi,

I have a spamassassin-3.4 system with the following bayes config:

required_hits 5.0
rbl_timeout 8
use_bayes 1
bayes_auto_learn 1
bayes_auto_learn_on_error 1
bayes_auto_learn_threshold_spam 9.0
bayes_expiry_max_db_size 950
bayes_auto_expire 0

However, spam with scores greater than 9.0 aren't being autolearned:

Sep  2 21:01:51 mail01 amavis[25938]: (25938-10) header_edits_for_quar:
<bmu011...@bmu-011.hichina.com> -> , Yes, score=16.519
tag=-200 tag2=5 kill=5 tests=[BAYES_50=0.8, KAM_LAZY_DOMAIN_SECURITY=1,
KAM_LINKBAIT=5, LOC_DOT_SUBJ=0.1, LOC_SHORT=3.1,
RCVD_IN_BL_SPAMCOP_NET=1.347, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_PSBL=2.3,
RCVD_IN_UCEPROTECT1=0.01, RCVD_IN_UCEPROTECT2=0.01, RDNS_NONE=0.793,
RELAYCOUNTRY_CN=0.1, RELAYCOUNTRY_HIGH=0.5, SAGREY=0.01] autolearn=no
autolearn_force=no

I've re-read the autolearn section of the docs, and don't see any reason
why this 16-point email wouldn't have any new tokens to be learned?

I looked in the quarantined message, and according to the _TOKEN_ header
I've added:

X-Spam-MyReport: Tokens: new, 47; hammy, 7; neutral, 54; spammy, 16.

Isn't that sufficient for auto-learning this message as spam?

I just wanted to be sure this is just a case of not enough new points
(tokens?) for the message to be learned, and that I wasn't doing
something wrong.

Thanks,
Alex