Hello,

I set up SpamAssassin the other week on my inbox mail server, and so far it has been running well. Now I want to train my Bayes database with some mails I have stored (200+ each of spam and ham, which should be enough according to the documentation).

Here is the version I am using:

[foo@mailcollect ~]$ sa-learn --version
SpamAssassin version 3.4.3

I have spamassassin running as a systemd service:
/system.slice/spamassassin.service
+-645 /usr/bin/perl -T -w /usr/bin/spamd -c -m5 -H --razor-log-file=sys-syslog
+-652 spamd child
+-653 spamd child

And the platform I am running on:

[foo@mailcollect ~]$ grep -e PRETTY_NAME /etc/os-release
PRETTY_NAME="Fedora 31 (Thirty One)"
[foo@mailcollect ~]$ rpm -qi spamassassin
Name        : spamassassin
Version     : 3.4.3
Release     : 2.fc31
Architecture: x86_64

The problem is this: when I initially ran sa-learn on my mailboxes, the speed was fairly good.

For example:

+ /usr/bin/sa-learn --no-sync --progress --ham /var/spool/fetchmail/Maildir/.Congstar
 92% [=================================    ]   5.21 msgs/sec 00m09s DONE
Learned tokens from 45 message(s) (49 message(s) examined)

This is just a small Maildir; I have other, much bigger ones (including my spam Maildir, which contains 2000+ messages).

Now, if I run sa-learn again on the same folder (the manual says "SpamAssassin remembers which mail messages it has learnt already, and will not re-learn those messages again, unless you use the --forget option.", so I think this is OK to do), it gets absurdly slow, taking over 2 minutes for the same directory with 45 mails.

+ /usr/bin/sa-learn --no-sync --progress --ham /var/spool/fetchmail/Maildir/.Congstar
 92% [=============================        ]   0.30 msgs/sec 02m40s DONE
Learned tokens from 0 message(s) (49 message(s) examined)

Now imagine this for a folder with over 2k messages (of which I have several).
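
As a back-of-the-envelope estimate, assuming the 0.30 msgs/sec rate from the second run also holds for larger folders (it may well not), the runtime for a 2000-message folder could be sketched like this. The helper names are made up for illustration.

```python
def estimated_seconds(messages: int, msgs_per_sec: float) -> float:
    """Estimated runtime in seconds at a constant processing rate."""
    return messages / msgs_per_sec


def format_duration(seconds: float) -> str:
    """Format a number of seconds as e.g. '1h51m07s'."""
    total = int(round(seconds))
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"


# Sanity check against the 49-message folder (sa-learn reported 02m40s).
print(format_duration(estimated_seconds(49, 0.30)))    # 0h02m43s
# Extrapolation to a folder with 2000 messages.
print(format_duration(estimated_seconds(2000, 0.30)))  # 1h51m07s
```

So a single 2000-message folder would take close to two hours at that rate.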

I am not sure why this is. I ran sa-learn with debugging enabled to see whether I could spot anything, and it looks like it spends ~3s on each message updating the TxRep database (which I enabled in SpamAssassin with "loadplugin Mail::SpamAssassin::Plugin::TxRep"):

Jan  8 23:49:50.745 [308] dbg: TxRep: forgetting a message
Jan  8 23:49:50.746 [308] dbg: auto-whitelist: db-based ec300f7aa9c95003b94439831b843605e9a94660@sa_generated|ip=none scores 2/-40
Jan  8 23:49:50.746 [308] dbg: check: tagrun - tag TXREPMSG_ID is now ready, value: -20.0
Jan  8 23:49:50.746 [308] dbg: TxRep: reputation: -20.000, count: 2, weight: 1.0, delta: -20.000, MSG_ID: ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
Jan  8 23:49:52.202 [308] dbg: TxRep: forgetting stored score -20.000 of message ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
Jan  8 23:49:52.203 [308] dbg: TxRep: active, ec300f7aa9c95003b94439831b843605e9a94660@sa_generated pre-score: ?, autolearn score: -20, IP: 93.191.162.21, address: nore...@congstarnews.de (unsigned)
Jan  8 23:49:52.209 [308] dbg: TxRep: reputation: none, count: 0, learning: -20, MSG_ID: ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
Jan  8 23:49:52.209 [308] dbg: auto-whitelist: add_score: new count: 1, new totscore: 20
Jan  8 23:49:53.710 [308] dbg: auto-whitelist: DB addr list: untie-ing and unlocking
Jan  8 23:49:53.715 [308] dbg: auto-whitelist: DB addr list: file locked, breaking lock
Jan  8 23:49:53.716 [308] dbg: locker: safe_unlock: unlink /var/spool/fetchmail/.spamassassin/tx-reputation.lock

Note the timestamps. This happens for each of the 49 messages, whenever it wants to forget a score. That also explains why it was so much faster initially: it didn't know the score yet, so there was nothing to forget. Adding the new score also seems slowish.
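
To make the gaps between timestamps easier to spot, something like the following Python sketch could be used. It assumes one log entry per line in the "Mon D HH:MM:SS.mmm [pid] dbg: ..." layout of the excerpt, and it ignores midnight rollover. The function names and the 1-second threshold are made up for illustration, and the sample is an abbreviated subset of the log above. A gap only shows that time passed between two debug statements; attributing it to the earlier statement is my interpretation.

```python
import re

# Matches e.g. "Jan  8 23:49:50.745 [308] dbg: TxRep: forgetting a message"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s+\[\d+\]\s+dbg:\s+(.*)$"
)


def parse(line):
    """Return (seconds_since_midnight, message) or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    h, mi, s, ms, text = m.groups()
    return int(h) * 3600 + int(mi) * 60 + int(s) + int(ms) / 1000.0, text


def slow_steps(lines, threshold=1.0):
    """List (gap_seconds, message) for entries followed by a gap >= threshold."""
    entries = [e for e in map(parse, lines) if e is not None]
    result = []
    for (t0, text0), (t1, _) in zip(entries, entries[1:]):
        gap = t1 - t0
        if gap >= threshold:
            result.append((round(gap, 3), text0))
    return result


SAMPLE = """\
Jan  8 23:49:50.745 [308] dbg: TxRep: forgetting a message
Jan  8 23:49:50.746 [308] dbg: TxRep: reputation: -20.000, count: 2, weight: 1.0, delta: -20.000, MSG_ID: ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
Jan  8 23:49:52.202 [308] dbg: TxRep: forgetting stored score -20.000 of message ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
Jan  8 23:49:52.209 [308] dbg: auto-whitelist: add_score: new count: 1, new totscore: 20
Jan  8 23:49:53.710 [308] dbg: auto-whitelist: DB addr list: untie-ing and unlocking
Jan  8 23:49:53.716 [308] dbg: locker: safe_unlock: unlink /var/spool/fetchmail/.spamassassin/tx-reputation.lock
"""

for gap, text in slow_steps(SAMPLE.splitlines()):
    print(f"{gap:6.3f}s after: {text[:60]}")
```

On this sample it reports two gaps of roughly 1.5s each, one after the reputation lookup and one after add_score, which together account for most of the ~3s per message.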

What's going on here? These are the file sizes of my databases:

[mageta@mailcollect ~]$ ls -lh .spamassassin/
total 45M
-rw------- 1 mageta mail  61K Jan  9 00:23 bayes_journal
-rw------- 1 mageta mail 4.7M Jan  8 23:39 bayes_seen
-rw------- 1 mageta mail  41M Jan  8 23:39 bayes_toks
-rw------- 1 mageta mail  11M Jan  9 00:23 tx-reputation
-rw------- 1 mageta mail    4 Jan  9 00:23 tx-reputation.mutex
-rw-r--r-- 1 mageta mail 2.7K Jan  9 00:09 user_prefs
[mageta@mailcollect ~]$ file .spamassassin/tx-reputation
.spamassassin/tx-reputation: Berkeley DB (Hash, version 9, native byte-order)
[mageta@mailcollect ~]$ file .spamassassin/bayes_toks
.spamassassin/bayes_toks: Berkeley DB (Hash, version 9, native byte-order)

Any ideas? Can I fix this somehow? Should I file a bug report? This makes sa-learn pretty much unusable for me at the moment. I have let it run once over everything I have, so I should be good for now (which is great!), but rerunning it would tie up one CPU on my server for _hours_.


best regards,
- Benjamin
