This is a blend of a not-entirely-sure documentation bug report and
questions.  I am using 3.4.5.

I used to use BAYES.  To train it, I sorted ham that landed in spam
folders back to where it should have gone, and sorted spam that landed
in ham folders to "spam.manual".  I had a cron job that ran sa-learn
over each folder once a day, with the appropriate --spam or --ham
argument.  This worked reasonably well, even though there are a vast
number of messages; most are not new, and the relearning process tended
to just pick up the new or re-filed ones.
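
For reference, the nightly job amounts to a pair of crontab entries
along these lines (the maildir paths are made up for illustration;
since most messages are already learned, the daily rescan mostly just
picks up new ones):

```
# Hypothetical crontab entries for daily retraining.
# Ham that was rescued from spam folders:
15 3 * * *  sa-learn --ham  $HOME/Maildir/.ham/cur
# Spam re-filed by hand into spam.manual:
30 3 * * *  sa-learn --spam $HOME/Maildir/.spam.manual/cur
```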

Recently I enabled TXREP, and I'm generally very happy with it.
I did run sa-learn on a few messages that were misclassified: both
ham that scored above 1 and low-scoring spam.

I received advice that Bayes is difficult to use correctly, in terms of
training and keeping the database in good shape, and I had seen some
misclassifications, so I decided to clear out my Bayes db and retrain,
by which I mean running sa-learn over my current set of ham/spam.

I was surprised by two aspects of this (note that I am only 98% sure I
interpreted things right):

  With TXREP enabled, sa-learn seems to cause a full re-evaluation of
  each message.  On reflection this makes a lot of sense, because the
  foundation of TXREP is moving scores towards the learned average.

  Because of TXREP's re-evaluation, without "-L", sa-learn causes RBL
  queries to be made for each message scanned, and the rate of queries
  is very high.  After doing this, I found that I was blocked by URIBL.

Therefore:

  1) The sa-learn man page documents -L as follows:

       -L, --local
           Do not perform any network accesses while learning details about
           the mail messages.  This will speed up the learning process, but
           may result in a slightly lower accuracy.

           Note that this is currently ignored, as current versions of
           SpamAssassin will not perform network access while learning; but
           future versions may.

  and while I haven't quite proved it, the second paragraph seems
  wrong.

  2) The web page at

    https://cwiki.apache.org/confluence/display/SPAMASSASSIN/TxRep

  says to use sa-learn and doesn't caution about -L.  (Of course, if one
  manually trains a few errant messages it doesn't matter.)

  3) sa-learn does not document that it is no longer just for Bayes,
  but a general interface to mechanisms that learn.  (There's also no
  "sa-learn --methods" to show the current list.)  Many of the sa-learn
  options seem to really be about Bayes only, and some seem to be
  higher-level.

  4) TXREP applies a penalty, txrep_learn_penalty, when learning spam,
  default 20.  If the user says a message is spam by calling sa-learn,
  then I don't understand why it isn't just treated as having scored 20.
  Likewise txrep_learn_bonus, with the message treated as scoring -20.
  That would seem to avoid much processing and also potentially huge
  amounts of RBL traffic.  (I've added -L to my script that calls
  sa-learn.)
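
  For completeness, the knobs involved look like this in local.cf (the
  penalty default of 20 is from the TXREP docs; I have not verified the
  bonus default on my version):

```
# local.cf -- TXREP learning adjustments
txrep_learn_penalty 20   # applied toward the reputation when learning spam
txrep_learn_bonus   20   # applied in the other direction when learning ham
                         # (check your version's actual default)
```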

  5) It's very nice to have URIBL_BLOCKED, which is how I noticed.
  Thanks to whoever added that, and to URIBL for providing the BL.  I'm
  sorry my machine generated excessive queries (and I'm glad the block
  expired after a few weeks of not making any).


I'm curious how others see this, and whether anyone else has had
trouble with DNSBL blocks from running sa-learn with TXREP.

Greg
