sa-learn absurdly slow on re-iterating over mailboxes (TxRep involved)

Benjamin Block Wed, 08 Jan 2020 15:49:34 -0800

Hello,

I set up SpamAssassin the other week on my inbox mail server, and so far it's been running well. Now I wanted to try to train my Bayes database with some mails I have stored (200+ each of spam and ham, which should be enough according to the documentation).


Here is the version I am using:

[foo@mailcollect ~]$ sa-learn --version
SpamAssassin version 3.4.3

I have spamassassin running as a systemd service:
/system.slice/spamassassin.service

+-645 /usr/bin/perl -T -w /usr/bin/spamd -c -m5 -H --razor-log-file=sys-syslog

+-652 spamd child
+-653 spamd child

And the platform I am running on:

[foo@mailcollect ~]$ grep -e PRETTY_NAME /etc/os-release
PRETTY_NAME="Fedora 31 (Thirty One)"
[foo@mailcollect ~]$ rpm -qi spamassassin
Name        : spamassassin
Version     : 3.4.3
Release     : 2.fc31
Architecture: x86_64

The problem is this: when I initially used sa-learn on my mailboxes, it was reasonably fast.


For example:

+ /usr/bin/sa-learn --no-sync --progress --ham /var/spool/fetchmail/Maildir/.Congstar

 92% [=================================    ]   5.21 msgs/sec 00m09s DONE
Learned tokens from 45 message(s) (49 message(s) examined)

This is just a small Maildir; I have other, much bigger ones (including my spam Maildir, which contains 2000+ messages).

Now, if I run sa-learn again on the same folder (the manual says "SpamAssassin remembers which mail messages it has learnt already, and will not re-learn those messages again, unless you use the --forget option.", so I think this is OK to do), it gets absurdly slow, taking over 2 minutes for the same directory with 45 mails.

+ /usr/bin/sa-learn --no-sync --progress --ham /var/spool/fetchmail/Maildir/.Congstar

 92% [=============================        ]   0.30 msgs/sec 02m40s DONE
Learned tokens from 0 message(s) (49 message(s) examined)

Now imagine this for a folder with over 2k messages (of which I have several).
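As a rough back-of-the-envelope check, the two progress lines above can be turned into a per-message cost and extrapolated to a bigger folder. This is only a sketch: it assumes the re-run cost scales linearly with the number of messages, which I have not verified, and the function names are made up for illustration. The elapsed times (9 s and 2 m 40 s) and the count of 49 examined messages are taken from the sa-learn output above; the 2000-message folder size is the approximate figure mentioned earlier.

```python
def seconds_per_message(elapsed_seconds, examined):
    """Average wall-clock cost per examined message."""
    return elapsed_seconds / examined


def estimate_hours(n_messages, per_message_seconds):
    """Linear extrapolation (an assumption) to a folder of n_messages."""
    return n_messages * per_message_seconds / 3600


# Numbers read off the two sa-learn progress lines above.
first_run = seconds_per_message(9, 49)            # initial learning run
rerun = seconds_per_message(2 * 60 + 40, 49)      # second run on the same folder

print(f"first run: {first_run:.2f} s/msg, re-run: {rerun:.2f} s/msg")
print(f"re-run is about {rerun / first_run:.0f}x slower per message")
print(f"estimated re-run of a 2000-message folder: {estimate_hours(2000, rerun):.1f} h")
```

Under that linear assumption a single 2000-message folder already lands well above an hour, which is in line with the "hours" I see when re-running everything.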

I am not sure why this is. I ran sa-learn with debug enabled to see whether I could spot something, and it looks like it spends ~3s on each message updating the TxRep database (which I enabled in spamassassin with "loadplugin Mail::SpamAssassin::Plugin::TxRep"):


Jan  8 23:49:50.745 [308] dbg: TxRep: forgetting a message
Jan  8 23:49:50.746 [308] dbg: auto-whitelist: db-based ec300f7aa9c95003b94439831b843605e9a94660@sa_generated|ip=none scores 2/-40
Jan  8 23:49:50.746 [308] dbg: check: tagrun - tag TXREPMSG_ID is now ready, value: -20.0
Jan  8 23:49:50.746 [308] dbg: TxRep: reputation: -20.000, count: 2, weight: 1.0, delta: -20.000, MSG_ID: ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
Jan  8 23:49:52.202 [308] dbg: TxRep: forgetting stored score -20.000 of message ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
Jan  8 23:49:52.203 [308] dbg: TxRep: active, ec300f7aa9c95003b94439831b843605e9a94660@sa_generated pre-score: ?, autolearn score: -20, IP: 93.191.162.21, address: nore...@congstarnews.de (unsigned)
Jan  8 23:49:52.209 [308] dbg: TxRep: reputation: none, count: 0, learning: -20, MSG_ID: ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
Jan  8 23:49:52.209 [308] dbg: auto-whitelist: add_score: new count: 1, new totscore: 20
Jan  8 23:49:53.710 [308] dbg: auto-whitelist: DB addr list: untie-ing and unlocking
Jan  8 23:49:53.715 [308] dbg: auto-whitelist: DB addr list: file locked, breaking lock
Jan  8 23:49:53.716 [308] dbg: locker: safe_unlock: unlink /var/spool/fetchmail/.spamassassin/tx-reputation.lock

You see the timestamps. This happens for each of the 49 messages, whenever it wants to forget a score. That also explains why it was so much faster initially: when it didn't know the score yet, it didn't have anything to forget. Adding the new score also seems slowish.
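To make the timestamp gaps easier to spot than by eyeballing, here is a minimal sketch of a script that computes the delay before each debug line. The regular expression, the function names, the hard-coded year (the log carries none) and the 1-second threshold are my own choices for illustration; the sample lines are a shortened subset of the excerpt above.

```python
import re
from datetime import datetime

# Prefix of an sa-learn debug line, e.g. "Jan  8 23:49:50.745 [308] dbg: ..."
LINE_RE = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2}\.\d{3})\s+\[\d+\]\s+dbg:\s+(.*)$"
)


def parse(lines):
    """Yield (timestamp, message) for every line that looks like a debug line."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            month, day, clock, msg = m.groups()
            # The year is an arbitrary placeholder; only differences are used.
            ts = datetime.strptime(f"2020 {month} {day} {clock}", "%Y %b %d %H:%M:%S.%f")
            yield ts, msg


def gaps(lines):
    """Return (seconds, message) pairs: time that passed before each line was logged."""
    entries = list(parse(lines))
    return [
        ((cur[0] - prev[0]).total_seconds(), cur[1])
        for prev, cur in zip(entries, entries[1:])
    ]


SAMPLE = """\
Jan  8 23:49:50.745 [308] dbg: TxRep: forgetting a message
Jan  8 23:49:50.746 [308] dbg: TxRep: reputation: -20.000, count: 2, weight: 1.0, delta: -20.000
Jan  8 23:49:52.202 [308] dbg: TxRep: forgetting stored score -20.000 of message
Jan  8 23:49:52.209 [308] dbg: auto-whitelist: add_score: new count: 1, new totscore: 20
Jan  8 23:49:53.710 [308] dbg: auto-whitelist: DB addr list: untie-ing and unlocking
Jan  8 23:49:53.716 [308] dbg: locker: safe_unlock: unlink
""".splitlines()

# Report every step that took longer than one second (threshold chosen arbitrarily).
slow = [(secs, msg) for secs, msg in gaps(SAMPLE) if secs > 1.0]
for secs, msg in slow:
    print(f"{secs:6.3f}s before: {msg}")
```

Note that each gap is attributed to the line logged after the pause, so the time was actually spent somewhere between the previous debug message and that one. On the excerpt this singles out two pauses of roughly 1.5 s each: one before "forgetting stored score" and one after "add_score".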


What's going on here? These are the file sizes of my databases:

[mageta@mailcollect ~]$ ls -lh .spamassassin/
total 45M
-rw------- 1 mageta mail  61K Jan  9 00:23 bayes_journal
-rw------- 1 mageta mail 4.7M Jan  8 23:39 bayes_seen
-rw------- 1 mageta mail  41M Jan  8 23:39 bayes_toks
-rw------- 1 mageta mail  11M Jan  9 00:23 tx-reputation
-rw------- 1 mageta mail    4 Jan  9 00:23 tx-reputation.mutex
-rw-r--r-- 1 mageta mail 2.7K Jan  9 00:09 user_prefs
[mageta@mailcollect ~]$ file .spamassassin/tx-reputation
.spamassassin/tx-reputation: Berkeley DB (Hash, version 9, native byte-order)

[mageta@mailcollect ~]$ file .spamassassin/bayes_toks
.spamassassin/bayes_toks: Berkeley DB (Hash, version 9, native byte-order)

Any ideas? Can I fix this somehow? Should I file a bug report? This makes sa-learn pretty unusable for me atm. I have let it run once for everything I have, so I should be good for now (which is great!), but letting it rerun will tie up one CPU on my server for _hours_ now.



best regards,
- Benjamin
