Please use plain text rather than HTML, in particular to avoid that
really bad indented style of quoting.
On Sat, 2014-09-06 at 17:22 -0400, Alex wrote:
> On Thu, Sep 4, 2014 at 1:44 PM, Karsten Bräckelmann wrote:
> > On Wed, 2014-09-03 at 23:50 -0400, Alex wrote:
> >
> > > > > I looked in the quarantined message, and according to the _TOKEN_
> > > > > header I've added:
> > > > >
> > > > > X-Spam-MyReport: Tokens: new, 47; hammy, 7; neutral, 54; spammy, 16.
> > > > >
> > > > > Isn't that sufficient for auto-learning this message as spam?
> >              ^^^^
> > That's clearly referring to the _TOKEN_ data in the custom header, is it
> > not?
>
> Yes. Burning the candle at both ends. Really overworked.

Sorry to hear. Nonetheless, did you take the time to really understand
my explanations? It seems you sometimes didn't in the past, and I am not
happy to waste my time on other people's problems if they aren't
following thoroughly.

> > > > That has absolutely nothing to do with auto-learning. Where did
> > > > you get the impression it might?
> > >
> > > If the conditions for autolearning had been met, I understood that
> > > it would be those new tokens that would be learned.
> >
> > Learning is not limited to new tokens. All tokens are learned,
> > regardless of their current (h|sp)ammyness.
> >
> > Still, the number of (new) tokens is not a condition for
> > auto-learning. That header shows some more or less nice information,
> > but in this context absolutely irrelevant information.
>
> I understood "new" to mean the tokens that have not been seen before,
> and would be learned if the other conditions were met.

Well, yes. So what?

Did you understand that the number of previously unseen tokens has
absolutely nothing to do with auto-learning? Did you understand that all
tokens are learned, regardless of whether they have been seen before?

This whole part is entirely unrelated to auto-learning and your original
question.

> > Auto-learning in a nutshell: Take all tests hit. Drop some of them
> > with certain tflags, like the BAYES_xx rules. For the remaining
> > rules, look up their scores in the non-Bayes scoreset 0 or 1. Sum up
> > those scores to a total, and compare with the auto-learn threshold
> > values. For spam, also check there are at least 3 points each by
> > header and body rules. Finally, if all that matches, learn.
>
> Is it important to understand how those three points are achieved or
> calculated?

In most cases, no, I guess. Though that really is just a distinction
usually easy to make based on the rule's type: header vs body-ish rule
definitions.

If the re-calculated total score in scoreset 0 or 1 exceeds the
auto-learn threshold, but the message still is not auto-learned -- then
it is important. Unless you trust the auto-learn discriminator to not
cheat on you.

> > > Okay, of course I understood the difference between points and
> > > tokens. Since the points were over the specified threshold, I
> > > thought those new tokens would have been added.
> >
> > As I have mentioned before in this thread: It is NOT the message's
> > reported total score that must exceed the threshold. The
> > auto-learning discriminator uses an internally calculated score
> > using the respective non-Bayes scoreset.
>
> Very helpful, thanks. Is there a way to see more about how it makes
> that decision on a particular message?

  spamassassin -D learn

Unsurprisingly, the -D debug option shows information on that decision.
In this case, limiting debug output to the 'learn' area comes in handy,
eliminating the noise.

The output includes the important details, like the auto-learn decision
with a human-readable explanation, and the score computed for auto-learn
as well as head and body points.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
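PS: For the archives, the auto-learn decision described above can be
sketched roughly like this. This is a simplified illustration in Python,
not SpamAssassin's actual (Perl) code; the tflags handling, rule data
and the 12.0 threshold value are illustrative assumptions only:

```python
# Rough sketch of the spam side of the auto-learn decision, as described
# above: drop rules with certain tflags (e.g. BAYES_xx), sum the
# remaining rules' scores from the non-Bayes scoreset, compare that
# total (NOT the message's reported score) against the auto-learn
# threshold, and require at least 3 points each from header and body
# rules. Names and data structures here are hypothetical.

def should_autolearn_spam(hits, nonbayes_scores, spam_threshold=12.0):
    """hits: list of (rule_name, rule_type, tflags) for rules the
    message hit. nonbayes_scores: rule scores from scoreset 0 or 1."""
    total = head_points = body_points = 0.0
    for name, rule_type, tflags in hits:
        # Drop rules excluded from the learning score, like BAYES_xx.
        if {"learn", "noautolearn", "userconf"} & set(tflags):
            continue
        # Look the rule up in the non-Bayes scoreset.
        score = nonbayes_scores.get(name, 0.0)
        total += score
        if rule_type == "header":
            head_points += score
        elif rule_type == "body":
            body_points += score
    # Learn as spam only if the recomputed total exceeds the threshold
    # AND header and body rules contribute at least 3 points each.
    return total >= spam_threshold and head_points >= 3.0 and body_points >= 3.0
```

The point being: the total is recomputed over the non-Bayes scoreset,
so it can differ from the score reported in the message headers.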