Many thanks for all the previous replies on this mailing list about my problems with autolearn=spam. I have taken your remarks into account and, first of all, I have fed my Bayesian databases. Here is the output of the sa-learn --dump magic command:

0.000          0          3          0  non-token data: bayes db version
0.000          0        272          0  non-token data: nspam
0.000          0        245          0  non-token data: nham
0.000          0      21292          0  non-token data: ntokens
0.000          0 1109767086          0  non-token data: oldest atime
0.000          0 1110286647          0  non-token data: newest atime
0.000          0 1110365778          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

And here is my local.cf (the Bayesian filtering options are left at their default values, which is why Bayes is on):

score MISSING_SUBJECT 15.0
score NIGERIAN_BODY1 15.0
bayes_file_mode 0770
skip_rbl_checks 0
use_razor2 1
use_dcc 1
use_pyzor 1
ok_languages en pl
ok_locales en

As a reminder, I have prepared a script which generates my own spam messages and sends them to my mail server. This server sits on a local network, not on the Internet, because I am only testing SpamAssassin. Here is the detailed analysis of my example spam:

Content analysis details:   (41.3 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.3 FROM_NO_LOWER          From address has no lower-case characters
-2.8 ALL_TRUSTED            Did not pass through any untrusted hosts
 0.8 AMATEUR_PORN           BODY: Possible porn - Amateur Porn
 1.3 MILLION_USD            BODY: Talks about millions of dollars
 0.5 SUBJ_2_CREDIT          BODY: Contains 'subject to credit approval'
 0.8 DEAR_FRIEND            BODY: Dear Friend? That's not very dear!
 0.4 US_DOLLARS_3           BODY: Mentions millions of $ ($NN,NNN,NNN.NN)
 0.5 BODY_ENHANCEMENT       BODY: Information on growing body parts
 0.6 PORN_URL_MISC          URI: URL uses words/phrases which indicate porn (misc)
 0.0 DRUGS_ERECTILE         Refers to an erectile drug
  15 MISSING_SUBJECT        Missing Subject: header
 0.5 UPPERCASE_75_100       message body is 75-100% uppercase
 0.5 NIGERIAN_BODY2         Message body looks like a Nigerian spam message 2+
  15 NIGERIAN_BODY1         Message body looks like a Nigerian spam message 1+
 1.4 INVALID_MSGID          Message-Id is not valid, according to RFC 2822
 1.4 NIGERIAN_BODY4         Message body looks like a Nigerian spam message 4+
 2.3 LONGWORDS              Long string of long words
 1.9 NIGERIAN_BODY3         Message body looks like a Nigerian spam message 3+

And here are the headers of the above-mentioned spam:

X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.0.2 (2004-11-16) on kronos
X-Spam-Level: ******************************************
X-Spam-Status: Yes, score=42.5 required=5.0 tests=ALL_TRUSTED,AMATEUR_PORN,
    BAYES_99,DEAR_FRIEND,DRUGS_ERECTILE,FROM_NO_LOWER,INVALID_MSGID,
    LONGWORDS,MILLION_USD,MISSING_SUBJECT,NIGERIAN_BODY1,NIGERIAN_BODY2,
    NIGERIAN_BODY3,NIGERIAN_BODY4,PORN_URL_MISC,SUBJ_2_CREDIT,
    UPPERCASE_50_75,US_DOLLARS_3 autolearn=no version=3.0.2
X-Spam-Report:
    *  0.4 FROM_NO_LOWER From address has no lower-case characters
    * -3.3 ALL_TRUSTED Did not pass through any untrusted hosts
    *  1.7 AMATEUR_PORN BODY: Possible porn - Amateur Porn
    *  2.8 MILLION_USD BODY: Talks about millions of dollars
    *  0.1 SUBJ_2_CREDIT BODY: Contains 'subject to credit approval'
    *  0.1 DEAR_FRIEND BODY: Dear Friend? That's not very dear!
    *  0.4 US_DOLLARS_3 BODY: Mentions millions of $ ($NN,NNN,NNN.NN)
    *  1.6 PORN_URL_MISC URI: URL uses words/phrases which indicate porn (misc)
    *  1.9 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
    *      [score: 1.0000]
    *  0.2 DRUGS_ERECTILE Refers to an erectile drug
    *   15 MISSING_SUBJECT Missing Subject: header
    *  0.0 UPPERCASE_50_75 message body is 50-75% uppercase
    *  0.6 NIGERIAN_BODY2 Message body looks like a Nigerian spam message 2+
    *   15 NIGERIAN_BODY1 Message body looks like a Nigerian spam message 1+
    *  1.1 INVALID_MSGID Message-Id is not valid, according to RFC 2822
    *  2.7 NIGERIAN_BODY4 Message body looks like a Nigerian spam message 4+
    *  2.3 LONGWORDS Long string of long words
    *  0.1 NIGERIAN_BODY3 Message body looks like a Nigerian spam message 3+

My questions are:

My main question:

1) I still cannot see a mail whose header contains autolearn=spam. It seems that this spam should be fed to the databases as spam, because:
   - it has more than 3 points from the header and more than 3 points from the body
   - its score is more than 12 points (bayes_auto_learn_threshold_spam 12.0)
   However, if the score of a mail is less than 0.1, auto-learning works correctly (the header shows autolearn=ham). So I suppose that auto-learning of spam does not work properly (?)

And the other ones:

2) There are differences between the scores of the tests in the Content analysis details and in the header (see above). For example, the FROM_NO_LOWER test has 1.3 pts in the Content analysis details and 0.4 in the header; the BAYES_99 test does not appear in the Content analysis details at all, but it does appear in the header. Why?

3) I added the following lines to local.cf:

   rewrite_subject 1
   subject_tag *****SPAM*****
   use_terse_report 0
   auto_learn 1

   Now, if I run spamassassin -D --lint, I get these messages:

   config: SpamAssassin failed to parse line, skipping: rewrite_subject 1
   config: SpamAssassin failed to parse line, skipping: subject_tag *****SPAM*****
   config: SpamAssassin failed to parse line, skipping: use_terse_report 0
   config: SpamAssassin failed to parse line, skipping: auto_learn 1
   config: SpamAssassin failed to parse line, skipping: rewrite_subject 1
   config: SpamAssassin failed to parse line, skipping: subject_tag *****SPAM*****
   config: SpamAssassin failed to parse line, skipping: use_terse_report 0

   What does it mean?

Regards,
Mirek Wasik
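
Sketch for question 1: the auto-learn expectation stated there can be written as a small check. This only encodes the conditions as they are described in this post (more than 3 header points, more than 3 body points, total above bayes_auto_learn_threshold_spam). It is not SpamAssassin's actual implementation, which may count points differently. The function name, its arguments and the header/body split used in the example call are hypothetical.

```python
def expected_autolearn_as_spam(head_points, body_points, total,
                               threshold_spam=12.0, min_part_points=3.0):
    """Return True if the conditions listed in question 1 all hold.

    head_points / body_points: points contributed by header and body rules.
    total: the overall message score.
    threshold_spam: value of bayes_auto_learn_threshold_spam (12.0 in the post).
    """
    return (head_points > min_part_points
            and body_points > min_part_points
            and total > threshold_spam)


# Hypothetical split of the 42.5 points into header and body parts.
print(expected_autolearn_as_spam(head_points=16.5, body_points=9.0, total=42.5))
```

If this check says "yes" while the header still shows autolearn=no, then the points used for the auto-learn decision are probably not the ones shown in the report, which is worth checking in the debug output.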

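
Sketch for question 2: to list exactly which tests differ between the two reports, both can be parsed and compared. The snippet below is a minimal illustration that assumes each row starts with a score followed by the rule name, with an optional leading "*" as in X-Spam-Report. The helper names parse_scores and diff_scores are made up for this example, and only a few rows from the reports above are included.

```python
import re

# "<score> <RULE_NAME>" at the start of a row; the leading "*" is optional.
ROW = re.compile(r"^\s*\*?\s*(-?\d+(?:\.\d+)?)\s+([A-Z][A-Z0-9_]+)\b")


def parse_scores(report_text):
    """Map rule name -> score for every row that looks like a report line."""
    scores = {}
    for line in report_text.splitlines():
        m = ROW.match(line)
        if m:
            scores[m.group(2)] = float(m.group(1))
    return scores


def diff_scores(a, b):
    """Return {rule: (score_in_a, score_in_b)} where they differ; None = absent."""
    out = {}
    for rule in sorted(set(a) | set(b)):
        if a.get(rule) != b.get(rule):
            out[rule] = (a.get(rule), b.get(rule))
    return out


# A few rows copied from the two reports in this post.
details = """\
 1.3 FROM_NO_LOWER From address has no lower-case characters
-2.8 ALL_TRUSTED Did not pass through any untrusted hosts
 0.4 US_DOLLARS_3 BODY: Mentions millions of $ ($NN,NNN,NNN.NN)
"""
header = """\
* 0.4 FROM_NO_LOWER From address has no lower-case characters
* -3.3 ALL_TRUSTED Did not pass through any untrusted hosts
* 0.4 US_DOLLARS_3 BODY: Mentions millions of $ ($NN,NNN,NNN.NN)
* 1.9 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
"""

for rule, (d, h) in diff_scores(parse_scores(details), parse_scores(header)).items():
    print(rule, "details:", d, "header:", h)
```

Running this on the full reports would show every rule whose score changes, plus rules such as BAYES_99 that appear in only one of them.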