On Mon, 2004-03-01 at 06:59, Steven Dickenson wrote:
> John Hardin wrote:
> > All:
> > 
> > Here's what we're doing to allow Microsoft Outlook users to
> > train a global SA Bayes database:
> 
> Seems like a bit much for a normal user to follow easily.

My beta group includes some fairly technically-challenged users, and
the only problem we've had so far was one instance of someone copying
the HAM folder to the SPAM directory on the spamassassin machine. (We
get so few false positives, y'know... :)

I cleaned that up and changed the script (attached) to be more picky
(and also to do --mbox imports) and emphasized where to put things in
the README.
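
For illustration only, here is a minimal sketch of what "more picky"
amounts to: an exported folder is learned from only when its name agrees
with the directory the PST was dropped in. The helper name
learn_flag_for is hypothetical and is not part of the attached script;
the *-ham / *-spam name patterns mirror the globs the script uses.

```shell
#!/bin/bash
# Hypothetical helper, not part of the attached script.
# Prints the sa-learn flag for an exported mbox, or returns non-zero
# when the folder name does not match the drop directory.
learn_flag_for() {
	local dir=$1 mbox=$2
	case "$dir" in
		*ham)
			# ham directory: accept only folders named "...-ham" (any case)
			case "$mbox" in
				*-[Hh][Aa][Mm]) echo "--ham" ;;
				*) return 1 ;;
			esac
			;;
		*spam)
			# spam directory: accept only folders named "...-spam" (any case)
			case "$mbox" in
				*-[Ss][Pp][Aa][Mm]) echo "--spam" ;;
				*) return 1 ;;
			esac
			;;
		*)
			# any other directory is not a training drop point
			return 1
			;;
	esac
}
```

With this, a HAM folder dropped into the spam directory (the mistake
described above) is skipped rather than learned as spam:
learn_flag_for /home/spamd/spam "Personal Folders-HAM" returns non-zero
and prints nothing.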

After they do it a couple of times, it shouldn't be too difficult to
grasp.

> Your solution seems to be more oriented 
> towards people using PST files.

It is.

--
John Hardin  KA7OHZ                           
Internal Systems Administrator/Guru               voice: (425) 672-1304
Apropos Retail Management Systems, Inc.             fax: (425) 672-0192
-----------------------------------------------------------------------
  Failure to plan ahead on someone else's part does not constitute an
  emergency on my part.
                                  - David W. Barts in a.s.r
-----------------------------------------------------------------------
 Today: ICQ Corp goes away - have you installed Jabber yet?
#!/bin/bash

#
# Train spamassassin global bayes filter
#

# extract messages from .PST files
for DIR in /home/spamd/spam /home/spamd/ham
do
	if [ -d "$DIR" ]
	then
		cd $DIR || continue
	else
		continue
	fi
	[ -d export ] || mkdir export
	unset MSGTYPE LEARN
	case $DIR in
		*ham)
			MSGTYPE='[Hh][Aa][Mm]'
			LEARN='--ham'
			;;
		*spam)
			MSGTYPE='[Ss][Pp][Aa][Mm]'
			LEARN='--spam'
			;;
		*)
			echo "$0: $DIR not supported"
			continue
			;;
	esac
	for PST in *.[Pp][Ss][Tt]
	do
		unset LEARNED
		if [ -s "$PST" ]
		then
			echo "Processing $PST"
			rm -rf export/*
			# quote $PST so file names containing spaces survive word splitting
			/usr/local/bin/readpst -o export "$PST"
			mv -f "$PST" "${DIR}.old"
			cd export || continue
			for MBOX in *-$MSGTYPE
			do
				if [ -s "$MBOX" ]
				then
					echo "Learning $LEARN from $PST/$MBOX"
					/usr/bin/sa-learn $LEARN -C /etc/mail/spamassassin --mbox "$MBOX"
					LEARNED=1
				fi
			done
			cd $DIR
			rm -rf export/*
			if [ -z "$LEARNED" ]
			then
				echo "$0: NOTICE! Didn't find any mail folders in $PST to learn from..."
			fi
		fi
	done
done

# only process properly-formatted saved messages
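# anything file(1) does not report as mail text (directories excepted) is moved aside;
# bail out if a cd fails so files in some other directory are never moved
cd /home/spamd/spam || exit 1
file * | grep -vi "mail text" | grep -vi "directory" | sed -e 's/:.*//' -e 's/\*//' | xargs -r -I {} mv {} /home/spamd/invalid-format-spam/
cd /home/spamd/ham || exit 1
file * | grep -vi "mail text" | grep -vi "directory" | sed -e 's/:.*//' -e 's/\*//' | xargs -r -I {} mv {} /home/spamd/invalid-format-ham/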

# educate SpamAssassin
cd /home/spamd
echo "Learning spams"
/usr/bin/sa-learn --spam -C /etc/mail/spamassassin -L /home/spamd/spam
echo "Learning hams"
/usr/bin/sa-learn --ham  -C /etc/mail/spamassassin -L /home/spamd/ham
echo "Bayes Statistics:"
# Report status
/usr/bin/sa-learn --dump magic

# archive old messages
# this may need to be revisited
find /home/spamd/spam -type f -mtime +10 | xargs -r -I {} mv -f {} /home/spamd/spam.old/
find /home/spamd/ham  -type f -mtime +10 | xargs -r -I {} mv -f {} /home/spamd/ham.old/

gzip -9f /home/spamd/spam.old/* /home/spamd/ham.old/* >/dev/null 2>&1
