Please use plain text rather than HTML. In particular, spare us that
really bad indentation style of quoting.


On Sat, 2014-09-06 at 17:22 -0400, Alex wrote:
> On Thu, Sep 4, 2014 at 1:44 PM, Karsten Bräckelmann wrote:
> > On Wed, 2014-09-03 at 23:50 -0400, Alex wrote:
> >
> > > > > I looked in the quarantined message, and according to the _TOKEN_
> > > > > header I've added:
> > > > >
> > > > > X-Spam-MyReport: Tokens: new, 47; hammy, 7; neutral, 54; spammy, 16.
> > > > >
> > > > > Isn't that sufficient for auto-learning this message as spam?
> >             ^^^^
> > That's clearly referring to the _TOKEN_ data in the custom header, is it
> > not?
> 
> Yes. Burning the candle at both ends. Really overworked.

Sorry to hear that. Nonetheless, did you take the time to really
understand my explanations? It seems you sometimes didn't in the past,
and I am not happy to waste my time on other people's problems if they
aren't following along thoroughly.


> > > > That has absolutely nothing to do with auto-learning. Where did you get
> > > > the impression it might?
> > >
> > > If the conditions for autolearning had been met, I understood that it
> > > would be those new tokens that would be learned.
> >
> > Learning is not limited to new tokens. All tokens are learned,
> > regardless of their current (h|sp)ammyness.
> >
> > Still, the number of (new) tokens is not a condition for auto-learning.
> > That header shows some more or less nice information, but in this
> > context absolutely irrelevant information.
> 
> I understood "new" to mean the tokens that have not been seen before, and
> would be learned if the other conditions were met.

Well, yes. So what?

Did you understand that the number of previously unseen tokens has
absolutely nothing to do with auto-learning? Did you understand that all
tokens are learned, regardless of whether they have been seen before?
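
If you want to see that effect, compare the Bayes database stats before
and after learning a message (standard sa-learn invocations; the file
name is made up):

  sa-learn --dump magic
  sa-learn --spam msg.eml
  sa-learn --dump magic

The ntokens line grows by the number of previously unseen tokens, and
the counts of the already known tokens are bumped as well.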

This whole part is entirely unrelated to auto-learning and your original
question.


> > Auto-learning in a nutshell: Take all the tests hit. Drop those with
> > certain tflags, like the BAYES_xx rules. For the remaining rules, look
> > up their scores in the non-Bayes scoreset 0 or 1. Sum up those scores
> > to a total and compare it with the auto-learn threshold values. For
> > spam, also check that there are at least 3 points each from header and
> > body rules. Finally, if all that matches, learn.
> 
> Is it important to understand how those three points are achieved or
> calculated?

In most cases, no, I guess. Though the distinction is usually easy to
make based on the rule's type: header vs body-ish rule definitions.
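
For instance (made-up rule names, standard rule definition syntax):

  header HYPOTHETICAL_SUBJ  Subject =~ /example/i
  body   HYPOTHETICAL_BODY  /example/i

The former contributes to the head points, the latter to the body
points.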

If the re-calculated total score in scoreset 0 or 1 exceeds the
auto-learn threshold but the message still is not auto-learned -- then
it is important. Unless you trust the auto-learn discriminator not to
cheat on you.
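
To make that nutshell concrete, this is roughly the discriminator's
logic. A simplified Python sketch only, not the actual Perl from
Mail::SpamAssassin; the tflags list and the threshold defaults below
are assumptions:

  # Sketch of the auto-learn decision described above. Rule data
  # structures and thresholds are illustrative, not SA internals.
  EXCLUDED_TFLAGS = {"learn", "userconf", "noautolearn"}

  def autolearn(hits, scoreset, ham_thresh=0.1, spam_thresh=12.0):
      # hits: (name, rule_type, tflags) for every test that fired
      # scoreset: name -> score from the non-Bayes scoreset 0 or 1
      total = head = body = 0.0
      for name, rule_type, tflags in hits:
          if EXCLUDED_TFLAGS & set(tflags):
              continue                        # drops BAYES_xx et al
          score = scoreset.get(name, 0.0)     # NOT the reported score
          total += score
          if rule_type == "header":
              head += score
          elif rule_type == "body":
              body += score
      if total <= ham_thresh:
          return "learn as ham"
      # spam additionally needs 3+ points each from header and body
      if total >= spam_thresh and head >= 3.0 and body >= 3.0:
          return "learn as spam"
      return "no auto-learning"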


> > > Okay, of course I understood the difference between points and tokens.
> > > Since the points were over the specified threshold, I thought those
> > > new tokens would have been added.
> >
> > As I have mentioned before in this thread: It is NOT the message's
> > reported total score that must exceed the threshold. The auto-learning
> > discriminator uses an internally calculated score using the respective
> > non-Bayes scoreset.
> 
> Very helpful, thanks. Is there a way to see more about how it makes that
> decision on a particular message?

  spamassassin -D learn

Unsurprisingly, the -D debug option shows information on that decision.
In this case, limiting debug output to the 'learn' area comes in handy,
eliminating the noise.

The output includes the important details: the auto-learn decision with
a human-readable explanation, the score computed for auto-learning, as
well as the head and body points.
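
For example, to run a stored message through the full chain and keep
only the debug output (debug goes to stderr, the processed message to
stdout; the file name is made up):

  spamassassin -D learn < msg.eml > /dev/null

The relevant lines should be the ones prefixed 'dbg: learn:'.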


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
