Bart Schaefer <[email protected]> writes:
> On Sat, Sep 14, 2013 at 1:07 PM, Harry Putnam <[email protected]> wrote:
>>
>> 1) Does it matter that I have autolearn turned off in spamassassin
>> conf file 'local.cf' while doing my sandbox work?
>
> No, it doesn't. In fact it's probably better that way because SA
> won't waste time updating the bayes database with the mis-classified
> stuff that will have to be backed out later.
>
>> 2) I've derived the mbox files of pure ham and pure spam by running
>> mixed mail thru SA, so SA has already seen this mail.
>
> That definitely doesn't make any difference *IF* you disabled
> auto-learning in the previous step. It shouldn't make any difference
> even if autolearning was on, because sa-learn will discard the tokens
> from the first pass on each message before re-learning, but it'll be
> somewhat faster if that's not necessary.
Thanks for the confirmations.
Since last post, I've sort of started over by clearing out
~/.spamassassin where the db is kept. Reduced procmailrc to a spam
and a ham mbox.
I ran about 700 fresh mixed messages thru, then went into the ham findings
and peeled out the 60-70 percent spam into a pure spam mbox.
Ran enough more mixed mail to gather an equivalent mbox of pure ham.
I ran those two under sa-learn --spam and then --ham, about 450 msgs
each.
I was a little disappointed to find that, after that, SA is still
misidentifying spam as ham. After the learning sessions I ran unseen
mixed mail thru and found that 50% or worse of it is mis-classified.
Is it just not enough learning yet or should I see more improvement
than I have? If the latter then I'm probably doing something wrong.
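To put a firmer number on that 50% figure, something like the following
could be used. It is only a sketch: it assumes each folder is a plain
mbox, so that messages can be counted by their "From " separator lines,
and the function names count_msgs and miss_pct are made up here; they
are not part of SA or procmail.

```shell
#!/bin/sh
# Count messages in an mbox by counting "From " separator lines.
# Assumes a well-formed mbox in which body lines that start with
# "From " have been escaped (">From ").
count_msgs() {
    grep -c '^From ' "$1"
}

# Percentage of all spam that ended up filed as ham.
#   $1 = spam messages found by hand in the ham folder
#   $2 = spam messages correctly filed in the spam folder
miss_pct() {
    awk -v missed="$1" -v caught="$2" 'BEGIN {
        total = missed + caught
        if (total == 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * missed / total
    }'
}
```

For example, miss_pct "$(count_msgs spam_in_ham)" "$(count_msgs spam_.in)"
would give the percentage for one run, where spam_in_ham is a
hypothetical mbox holding the spam picked out of ham.in by hand.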
So, can you review the summary that follows and tell me if you think I
should be seeing better results?
1) rm -rf ~/.spamassassin
2) run a few mails thru procmail/SA with:
cat 5mixedMboxMsgs| formail -e -s procmail -m ${sandbox}/trc
This recreates ~/.spamassassin
the rc file (trc above) has this:
------- 8<--------- 8<---=--- --------- --------
# -*- shell-script -*-
PATH=/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin
SHELL=/bin/sh
MAILDIR=/home/reader/projects/reader/proc/spool
LOGFILE=/home/reader/projects/reader/proc/log/log
ORGMAIL=/home/reader/projects/reader/proc/spool/$LOGNAME
DEFAULT=$ORGMAIL
VERBOSE=YES
LOG="START
"
TRAP='formail -XMessage-Id: && date +"%b %d %T%nSTOP"'
PSCRIPTS="/home/reader/projects/perl"
SCRIPTS="/home/reader/scripts/"
MAILARC="/home/reader/proc/spool"
:0fw
| /usr/bin/spamc
:0:
* ^X-Spam-Status: Yes
spam_.in
:0:
ham.in
------- --------- ---=--- --------- --------
3) run 700 mixed messages thru the sandbox command shown above
4) Using mutt, I went thru the resulting `ham' mbox and picked out all
the spam
5) put the remaining ham into an all-ham file, then ran enough more
mixed mail to capture a few hundred more all-ham messages.
6) Ran sa-learn --mbox --spam purespam
[..] sa-learn --mbox --ham pureham
(Approximately 450 msgs of each)
I could see the tokens file inside ~/.spamassassin had grown quite a bit
following those runs.
7) run 700 fresh mixed messages thru the sandbox.
I see SA's ability to tell the difference has improved very little
.. maybe 5-10% (roughly)
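One check that might be worth making after step 6 is whether the
database really holds the messages that were learned. As far as I know,
SA does not use Bayes results at all until it has learned at least 200
spam and 200 ham (the bayes_min_spam_num and bayes_min_ham_num
defaults), and sa-learn --dump magic reports the current counts. The
sketch below pulls the two counts out of that output. The function name
bayes_counts is made up, and it assumes the usual "non-token data:
nspam" and "non-token data: nham" lines with the count in the third
column.

```shell
#!/bin/sh
# Read `sa-learn --dump magic` output on stdin and print the learned
# message counts.
# Assumption: the count is the third whitespace-separated field on the
# "non-token data: nspam" and "non-token data: nham" lines.
bayes_counts() {
    awk '
        /non-token data: nspam/ { nspam = $3 }
        /non-token data: nham/  { nham  = $3 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            # 200 is the default minimum for both spam and ham.
            if (nspam < 200 || nham < 200)
                print "bayes not yet active (need 200 of each by default)"
        }
    '
}
```

It would be run as: sa-learn --dump magic | bayes_counts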
Is this result about par for the course? Do I need to run more mail,
pull out spam/ham and run more sa-learn sessions? And if that is the
case, can anyone take a good guess at how much is enough?