Re: Workflow for adding new ham/spam to existing site-wide database?

Matus UHLAR - fantomas Thu, 18 Mar 2021 09:38:51 -0700

>On Wed, 17 Mar 2021 10:42:14 -0400 Kris Deugau wrote:


>> My own experience has been that accumulating blobs of ham/spam and
>> just repeatedly running sa-learn over those works just fine.  It
>> also reduces the incidence of tokens from somewhat rarer mail
>> automatically expiring out of Bayes, leading to FPs and FNs.

On 17.03.21 22:01, RW wrote:
>It won't do that by default. You would need to have something removing
>the signature hashes from the database.

On Thu, 18 Mar 2021 14:01:28 +0100 Matus UHLAR - fantomas wrote:

>oh, yes, it does:
>
>       bayes_auto_expire             (default: 1)


On 18.03.21 16:09, RW wrote:

>I meant that sa-learn will ignore mail that's already been trained. So,
>by default, rerunning it over a corpus that has already been trained won't
>prevent any tokens from expiring.


Aha - yes, correct.

Also, re-training over a huge file takes time just to parse it, so I usually
split old trained mailboxes into one per year or similar.
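
The per-year split can be scripted. Below is a minimal sketch in Python using
the standard mailbox module. The function name, the "trained" file prefix and
the "undated" bucket are made up for illustration, and it assumes the Date
header is reliable enough to bucket on.

```python
import mailbox
from collections import Counter
from email.utils import parsedate_to_datetime
from pathlib import Path


def split_mbox_by_year(src, out_dir, prefix="trained"):
    """Copy each message in `src` into <out_dir>/<prefix>-<year>.mbox.

    Messages whose Date header is missing or unparsable go to
    <prefix>-undated.mbox. Returns a Counter of messages per bucket.
    The source mailbox is only read, never modified.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    boxes = {}          # bucket name -> open output mbox
    counts = Counter()  # bucket name -> number of messages copied
    source = mailbox.mbox(str(src), create=False)
    try:
        for msg in source:
            try:
                year = str(parsedate_to_datetime(str(msg.get("Date"))).year)
            except (TypeError, ValueError):
                year = "undated"
            if year not in boxes:
                boxes[year] = mailbox.mbox(str(out_dir / f"{prefix}-{year}.mbox"))
            boxes[year].add(msg)
            counts[year] += 1
    finally:
        for box in boxes.values():
            box.close()  # close() also flushes pending writes to disk
        source.close()
    return counts
```

Each resulting file can then be fed to "sa-learn --spam --mbox" (or "--ham")
on its own.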

>Redis does support ageing-out signatures, but I don't see why you would
>want to retrain on old mail at the expense of losing tokens from
>new mail. You'll also end up with a database where very old emails will
>have been trained many times and recent, more relevant, FPs & FNs will
>only have been trained once.


I have already encountered a case where an (apparently poorly trained) Bayes
database failed to classify ham/spam properly, and training on multiple mails
didn't change its results (BAYES_50 nearly all the time).

Dropping the Bayes DB and re-training it on the old corpus made it work like
a charm (training on one single spam now usually pushes new mail from
BAYES_50 to BAYES_999).

Keeping the old corpus made a lot of sense there, especially for targeted
phish, which is quite rare and highly unique.
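
Something along these lines would do such a rebuild. This is a sketch, not
the exact commands I ran: it assumes the corpus is kept as mbox files (the
file names in the usage line are placeholders), and it only prints the
commands unless told to run them. Note that "sa-learn --clear" wipes the
existing Bayes database, so take a backup first if in doubt.

```python
import shlex
import subprocess


def retrain_commands(spam_mboxes, ham_mboxes):
    """Build the sa-learn command lines for a from-scratch retrain."""
    cmds = [["sa-learn", "--clear"]]  # drop the existing Bayes database
    cmds += [["sa-learn", "--spam", "--mbox", str(p)] for p in spam_mboxes]
    cmds += [["sa-learn", "--ham", "--mbox", str(p)] for p in ham_mboxes]
    return cmds


def run(cmds, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in cmds:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failure


if __name__ == "__main__":
    # Placeholder file names; substitute the real corpus files.
    run(retrain_commands(["spam-2020.mbox"], ["ham-2020.mbox"]))
```

Keeping the dry run as the default makes it harder to clear the database by
accident.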

--
Matus UHLAR - fantomas, [email protected] ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
WinError #99999: Out of error messages.
