> Thank you. I don't know why, but the correction as inserted
> by me does not appear to work. Could you please send me a corrected
> sb_imapfilter.py after all?
Attached. If this doesn't work, let me know (it did appear to fix the
problem here).
=Tony.Meyer
--
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
#!/usr/bin/env python
"""An IMAP filter. An IMAP message box is scanned and all non-scored
messages are scored and (where necessary) filtered.
Usage:
sb_imapfilter [options]
note: option values with spaces in them must be enclosed
in double quotes
options:
-p dbname : pickled training database filename
-d dbname : dbm training database filename
-t : train contents of spam folder and ham folder
-c : classify inbox
-h : display this message
-v : verbose mode
-P : security option to prompt for imap password,
rather than look in options["imap", "password"]
-e y/n : expunge/purge messages on exit (y) or not (n)
-i debuglvl : a somewhat mysterious imaplib debugging level
(4 is a good level, and suitable for bug reports)
-l minutes : period of time between filtering operations
-b : Launch a web browser showing the user interface.
-o section:option:value :
set [section, option] in the options database
to value
Examples:
Classify inbox, with dbm database
sb_imapfilter -c -d bayes.db
Train Spam and Ham, then classify inbox, with dbm database
sb_imapfilter -t -c -d bayes.db
Train Spam and Ham only, with pickled database
sb_imapfilter -t -p bayes.db
Warnings:
o We never delete mail, unless you use the -e/purge option, but we do
mark a lot as deleted, and your mail client might remove that for
you. We try to only mark as deleted once the moved/altered message
is correctly saved, but things might go wrong. We *strongly*
recommend that you try this script out on mail that you can recover
from somewhere else, at least at first.
"""
from __future__ import generators
todo = """
o IMAP supports authentication via other methods than the plain-text
password method that we are using at the moment. Neither of the
servers I have access to offer any alternative method, however. If
someone's does, then it would be nice to offer this.
Thanks to #1169939 we now support CRAM_MD5 if available. It'd still
be good to support others, though.
o Usernames should be able to be literals as well as quoted strings.
This might help if the username/password has special characters like
accented characters.
o Suggestions?
"""
# This module is part of the SpamBayes project, which is Copyright 2002-5
# The Python Software Foundation and is covered by the Python Software
# Foundation license.
__author__ = "Tony Meyer <[EMAIL PROTECTED]>, Tim Stone"
__credits__ = "All the SpamBayes folk. The original filter design owed " \
"much to isbg by Roger Binns (http://www.rogerbinns.com/isbg)."
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
# If we are running as a frozen application, then chances are that
# output is just lost. We'd rather log this, like sb_server and Outlook
# log, so that the user can pull up the output if possible. We could just
# rely on the user piping the output appropriately, but would rather have
# more control. The sb_server tray application only does this if not
# running in a console window, but we do it whenever we are frozen.
import os
import sys
if hasattr(sys, "frozen"):
# We want to move to logging module later, so for now, we
# hack together a simple logging strategy.
try:
import win32api
except ImportError:
if sys.platform == "win32":
# Fall back to CWD, but warn user.
status = "Warning: your log is stored in the current " \
"working directory. We recommend installing " \
"the pywin32 extensions, so that the log is " \
"stored in the Windows temp directory."
temp_dir = os.getcwd()
else:
# Try for a /tmp directory.
if os.path.isdir("/tmp"):
temp_dir = "/tmp"
status = "Log file opened in /tmp"
else:
status = "Warning: your log is stored in the current " \
"working directory. If this does not suit you " \
"please let the [email protected] crowd know " \
"so that an alternative can be arranged."
# Without this, temp_dir would be unbound below.
temp_dir = os.getcwd()
else:
temp_dir = win32api.GetTempPath()
status = "Log file opened in " + temp_dir
for i in range(3,0,-1):
try:
os.unlink(os.path.join(temp_dir, "SpamBayesIMAP%d.log" % (i+1)))
except os.error:
pass
try:
os.rename(
os.path.join(temp_dir, "SpamBayesIMAP%d.log" % i),
os.path.join(temp_dir, "SpamBayesIMAP%d.log" % (i+1))
)
except os.error:
pass
# Open this log, as unbuffered, so crashes still get written.
sys.stdout = open(os.path.join(temp_dir,"SpamBayesIMAP1.log"), "wt", 0)
sys.stderr = sys.stdout
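# Illustrative sketch only (not part of the original attachment): the log
# rotation performed above, restated as a standalone helper so that the
# order of operations is easier to follow. The name rotate_logs and its
# parameters are hypothetical; the file naming and the number of older
# logs kept mirror the loop above.
import os

def rotate_logs(directory, base="SpamBayesIMAP", keep=3):
    """Shift base1.log -> base2.log -> ... and return the path of base1.log.

    Works from the oldest log down, so that each rename target has
    already been cleared by the time it is needed.
    """
    for i in range(keep, 0, -1):
        older = os.path.join(directory, "%s%d.log" % (base, i + 1))
        newer = os.path.join(directory, "%s%d.log" % (base, i))
        try:
            os.unlink(older)
        except OSError:
            pass  # No log in that slot yet.
        try:
            os.rename(newer, older)
        except OSError:
            pass  # Nothing to shift into that slot.
    return os.path.join(directory, "%s1.log" % (base,))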
import socket
import re
import time
import getopt
import types
import thread
import traceback
import email
import email.Parser
from getpass import getpass
from email.Utils import parsedate
try:
import cStringIO as StringIO
except ImportError:
import StringIO
from spambayes import Stats
from spambayes import message
from spambayes.Options import options, get_pathname_option, optionsPathname
from spambayes import tokenizer, storage, Dibbler
from spambayes.UserInterface import UserInterfaceServer
from spambayes.ImapUI import IMAPUserInterface, LoginFailure
from spambayes.Version import get_current_version
from imaplib import IMAP4
from imaplib import Time2Internaldate
try:
if options["imap", "use_ssl"]:
from imaplib import IMAP4_SSL as BaseIMAP
else:
from imaplib import IMAP4 as BaseIMAP
except ImportError:
from imaplib import IMAP4 as BaseIMAP
class BadIMAPResponseError(Exception):
"""An IMAP command returned a non-"OK" response."""
def __init__(self, command, response):
self.command = command
self.response = response
def __str__(self):
return "The command '%s' failed to give an OK response.\n%s" % \
(self.command, self.response)
class IMAPSession(BaseIMAP):
'''A class extending the IMAP4 class, with a few optimizations'''
timeout = 60 # seconds
def __init__(self, server, port, debug=0, do_expunge=False):
# There's a tricky situation where if use_ssl is False, but we
# try to connect to an IMAP-over-SSL server, we will just hang
# forever, waiting for a response that will never come. To
# get past this, just for the welcome message, we install a
# timeout on the connection. Normal service is then returned.
# This only applies when we are not using SSL.
if not hasattr(self, "ssl"):
readline = self.readline
self.readline = self.readline_timeout
try:
BaseIMAP.__init__(self, server, port)
except (BaseIMAP.error, socket.gaierror, socket.error):
print "Cannot connect to server %s on port %s" % (server, port)
if not hasattr(self, "ssl"):
print "If you are connecting to an SSL server, please " \
"ensure that you have the 'Use SSL' option enabled."
self.connected = False
else:
self.connected = True
if not hasattr(self, "ssl"):
self.readline = readline
self.debug = debug
self.do_expunge = do_expunge
self.server = server
self.port = port
self.logged_in = False
# For efficiency, we remember which folder we are currently
# in, and only send a select command to the IMAP server if
# we want to *change* folders. This functionality is used by
# both IMAPMessage and IMAPFolder.
self.current_folder = None
# We override the base read so that we only read a certain amount
# of data at a time. OS X and Python have problems with getting
# large amounts of memory at a time, so maybe this will be a way we
# can work around that (I don't know, and don't have a mac to test,
# but we need to try something).
self._read = self.read
self.read = self.safe_read
def readline_timeout(self):
"""Read line from remote, possibly timing out."""
st_time = time.time()
self.sock.setblocking(False)
buffer = []
while True:
if (time.time() - st_time) > self.timeout:
if options["globals", "verbose"]:
print >> sys.stderr, "IMAP Timing out"
break
try:
data = self.sock.recv(1)
except socket.error, e:
if e[0] == 10035:
# 10035 is WSAEWOULDBLOCK (Winsock): nothing to receive
# yet, so keep going.
continue
raise
if not data:
break
if data == '\n':
break
buffer.append(data)
self.sock.setblocking(True)
return "".join(buffer)
def login(self, username, pwd):
"""Log in to the IMAP server, catching invalid username/password."""
assert self.connected, "Must be connected before logging in."
if 'AUTH=CRAM-MD5' in self.capabilities:
login_func = self.login_cram_md5
args = (username, pwd)
description = "MD5"
else:
login_func = BaseIMAP.login # superclass login
args = (self, username, pwd)
description = "plain-text"
try:
login_func(*args)
except BaseIMAP.error, e:
msg = "The username (%s) and/or password (sent in %s) may " \
"be incorrect." % (username, description)
raise LoginFailure(msg)
self.logged_in = True
def logout(self):
"""Log off from the IMAP server, possibly expunging.
Note that most, if not all, of the expunging is probably done in
SelectFolder, rather than here, for purposes of speed."""
# We may never have logged in, in which case we do nothing.
if self.connected and self.logged_in and self.do_expunge:
# Expunge messages from the ham, spam and unsure folders.
for fol in ["spam_folder",
"unsure_folder",
"ham_folder"]:
folder_name = options["imap", fol]
if folder_name:
self.select(folder_name)
self.expunge()
# Expunge messages from the ham and spam training folders.
for fol_list in ["ham_train_folders",
"spam_train_folders",]:
for fol in options["imap", fol_list]:
self.select(fol)
self.expunge()
BaseIMAP.logout(self) # superclass logout
def check_response(self, command, IMAP_response):
"""A utility function to check the response from IMAP commands.
Raises BadIMAPResponseError if the response is not OK. Returns
the data segment of the response otherwise."""
response, data = IMAP_response
if response != "OK":
raise BadIMAPResponseError(command, IMAP_response)
return data
def SelectFolder(self, folder):
"""A method to point ensuing IMAP operations at a target folder.
This is essentially a wrapper around the IMAP select command, which
ignores the command if the folder is already selected."""
if self.current_folder != folder:
if self.current_folder != None and self.do_expunge:
# It is faster to do close() than a single
# expunge when we log out (because expunge returns
# a list of all the deleted messages which we don't do
# anything with).
self.close()
self.current_folder = None
if folder == "":
# This is Python bug #845560 - if the empty string is
# passed, we get a traceback, not just an 'invalid folder'
# error, so raise our own error.
raise BadIMAPResponseError("Cannot have empty string as " \
"folder name in select", "")
# We *always* use SELECT and not EXAMINE, because this
# speeds things up considerably.
response = self.select(folder, None)
data = self.check_response("select %s" % (folder,), response)
self.current_folder = folder
return data
number_re = re.compile(r"{\d+}")
folder_re = re.compile(r"\(([\w\\ ]*)\) ")
def folder_list(self):
"""Return an alphabetical list of all folders available on the
server."""
response = self.list()
try:
all_folders = self.check_response("list", response)
except BadIMAPResponseError:
# We want to keep going, so just print out a warning, and
# return an empty list.
print "Could not retrieve folder list."
return []
folders = []
for fol in all_folders:
# Sigh. Some servers may give us back the folder name as a
# literal, so we need to crunch this out.
if isinstance(fol, types.TupleType):
m = self.number_re.search(fol[0])
if not m:
# Something is wrong here! Skip this folder.
continue
fol = '%s"%s"' % (fol[0][:m.start()], fol[1])
m = self.folder_re.search(fol)
if not m:
# Something is not good with this folder, so skip it.
continue
name_attributes = fol[:m.end()-1]
# IMAP is a truly odd protocol. The delimiter is
# only the delimiter for this particular folder - each
# folder *may* have a different delimiter
self.folder_delimiter = fol[m.end()+1:m.end()+2]
# A bit of a hack, but we really need to know if this is
# the case.
if self.folder_delimiter == ',':
print "WARNING: Your imap server uses a comma as the " \
"folder delimiter. This may cause unpredictable " \
"errors."
folders.append(fol[m.end()+4:].strip('"'))
folders.sort()
return folders
# A flag can have any character in the ascii range 32-126 except for
# (){ %*"\
FLAG_CHARS = ""
for i in range(32, 127):
if not chr(i) in ['(', ')', '{', ' ', '%', '*', '"', '\\']:
FLAG_CHARS += chr(i)
FLAG = r"\\?[" + re.escape(FLAG_CHARS) + r"]+"
# The empty flag set "()" doesn't match, so that extract_fetch_data()
# returns data["FLAGS"] == None
FLAGS_RE = re.compile(r"(FLAGS) (\((" + FLAG + r" )*(" + FLAG + r")\))")
INTERNALDATE_RE = re.compile(r"(INTERNALDATE) (\"\d{1,2}\-[A-Za-z]{3,3}\-" +
r"\d{2,4} \d{2,2}\:\d{2,2}\:\d{2,2} " +
r"[\+\-]\d{4,4}\")")
RFC822_RE = re.compile(r"(RFC822) (\{[\d]+\})")
BODY_PEEK_RE = re.compile(r"(BODY\[\]) (\{[\d]+\})")
RFC822_HEADER_RE = re.compile(r"(RFC822.HEADER) (\{[\d]+\})")
UID_RE = re.compile(r"(UID) ([\d]+)")
FETCH_RESPONSE_RE = re.compile(r"([0-9]+) \(([" + \
re.escape(FLAG_CHARS) + r"\"\{\}\(\)\\ ]*)\)?")
LITERAL_RE = re.compile(r"^\{[\d]+\}$")
def _extract_fetch_data(self, response):
"""This does the real work of extracting the data, for each message
number.
"""
# We support the following FETCH items:
# FLAGS
# INTERNALDATE
# RFC822
# UID
# RFC822.HEADER
# BODY.PEEK
# All others are ignored.
if isinstance(response, types.StringTypes):
response = (response,)
data = {}
expected_literal = None
for part in response:
# We ignore parentheses by themselves, for convenience.
if part == ')':
continue
if expected_literal:
# This should be a literal of a certain size.
key, expected_size = expected_literal
## if len(part) != expected_size:
## raise BadIMAPResponseError(\
## "FETCH response (wrong size literal %d != %d)" % \
## (len(part), expected_size), response)
data[key] = part
expected_literal = None
continue
# The first item will always be the message number.
mo = self.FETCH_RESPONSE_RE.match(part)
if mo:
data["message_number"] = mo.group(1)
rest = mo.group(2)
else:
raise BadIMAPResponseError("FETCH response", response)
for r in [self.FLAGS_RE, self.INTERNALDATE_RE, self.RFC822_RE,
self.UID_RE, self.RFC822_HEADER_RE, self.BODY_PEEK_RE]:
mo = r.search(rest)
if mo is not None:
if self.LITERAL_RE.match(mo.group(2)):
# The next element will be a literal.
expected_literal = (mo.group(1),
int(mo.group(2)[1:-1]))
else:
data[mo.group(1)] = mo.group(2)
return data
def extract_fetch_data(self, response):
"""Extract data from the response given to an IMAP FETCH command.
The data is put into a dictionary, which is returned, where the
keys are the fetch items.
"""
# There may be more than one message number in the response, so
# handle separately.
if isinstance(response, types.StringTypes):
response = (response,)
data = {}
for msg in response:
msg_data = self._extract_fetch_data(msg)
if msg_data:
# Maybe there are two about the same message number!
num = msg_data["message_number"]
if num in data:
data[num].update(msg_data)
else:
data[num] = msg_data
return data
# Maximum amount of data that will be read at any one time.
MAXIMUM_SAFE_READ = 4096
def safe_read(self, size):
"""Read data from remote, but in manageable sizes."""
data = []
while size > 0:
if size < self.MAXIMUM_SAFE_READ:
to_collect = size
else:
to_collect = self.MAXIMUM_SAFE_READ
data.append(self._read(to_collect))
size -= self.MAXIMUM_SAFE_READ
return "".join(data)
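# Illustrative sketch only (not part of the original attachment): the
# chunked-read idea behind safe_read(), written as a free function so that
# it can be tried against an in-memory stream instead of an IMAP server.
# read_in_chunks and its parameters are hypothetical names. Unlike
# safe_read(), this version subtracts the number of bytes actually
# returned and stops at end of stream; that is a choice made for the
# sketch, not the attachment's behaviour. It assumes Python 2.6+ or 3
# (bytes literals), which is newer than the 2.2 this module supports.

def read_in_chunks(read, size, chunk_size=4096):
    """Collect up to `size` bytes by calling read() with small requests."""
    pieces = []
    while size > 0:
        piece = read(min(size, chunk_size))
        if not piece:
            break  # End of stream: nothing more to collect.
        pieces.append(piece)
        size -= len(piece)
    return b"".join(pieces)

# Possible usage (with io imported):
#   read_in_chunks(io.BytesIO(b"x" * 10000).read, 9000)
# gathers 9000 bytes in requests of at most 4096 bytes each.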
class IMAPMessage(message.SBHeaderMessage):
def __init__(self):
message.Message.__init__(self)
self.folder = None
self.previous_folder = None
self.rfc822_command = "(BODY.PEEK[])"
self.rfc822_key = "BODY[]"
self.got_substance = False
self.invalid = False
self.could_not_retrieve = False
self.imap_server = None
def extractTime(self):
"""When we create a new copy of a message, we need to specify
a timestamp for the message, if we can't get the information
from the IMAP server itself. If the message has a valid date
header we use that. Otherwise, we use the current time."""
message_date = self["Date"]
if message_date is not None:
parsed_date = parsedate(message_date)
if parsed_date is not None:
try:
return Time2Internaldate(time.mktime(parsed_date))
except ValueError:
# Invalid dates can cause mktime() to raise a
# ValueError, for example:
# >>> time.mktime(parsedate("Mon, 06 May 0102 10:51:16 -0100"))
# Traceback (most recent call last):
# File "<interactive input>", line 1, in ?
# ValueError: year out of range
# (Why this person is getting mail from almost two
# thousand years ago is another question <wink>).
# In any case, we just pass and use the current date.
pass
except OverflowError:
pass
return Time2Internaldate(time.time())
def get_full_message(self):
"""Retrieve the RFC822 message from the IMAP server and return a
new IMAPMessage object that has the same details as this message,
but also has the substance."""
if self.got_substance:
return self
assert self.id, "Cannot get substance of message without an id"
assert self.uid, "Cannot get substance of message without a UID"
assert self.imap_server, "Cannot do anything without IMAP connection"
# First, try to select the folder that the message is in.
try:
self.imap_server.SelectFolder(self.folder.name)
except BadIMAPResponseError:
# Can't select the folder, so getting the substance will not
# work.
self.could_not_retrieve = True
print >>sys.stderr, "Could not select folder %s for message " \
"%s (uid %s)" % (self.folder.name, self.id, self.uid)
return self
# Now try to fetch the substance of the message.
try:
response = self.imap_server.uid("FETCH", self.uid,
self.rfc822_command)
except MemoryError:
# Really big messages can trigger a MemoryError here.
# The problem seems to be line 311 (Python 2.3) of socket.py,
# which has "return "".join(buffers)". This has also caused
# problems with Mac OS X 10.3, which apparently is very stingy
# with memory (the malloc calls fail!). The problem then is
# line 301 of socket.py which does
# "data = self._sock.recv(recv_size)".
# We want to handle this gracefully, although we can't really
# do what we do later, and rewrite the message, since we can't
# load it in the first place. Maybe an elegant solution would
# be to get the message in parts, or just use the first X
# characters for classification. For now, we just carry on,
# warning the user and ignoring the message.
self.could_not_retrieve = True
print >>sys.stderr, "MemoryError with message %s (uid %s)" % \
(self.id, self.uid)
return self
command = "uid fetch %s" % (self.uid,)
response_data = self.imap_server.check_response(command, response)
data = self.imap_server.extract_fetch_data(response_data)
# The data will be a dictionary - hopefully with only one element,
# but maybe more than one. The key is the message number, which we
# do not have (we use the UID instead). So we look through the
# message and use the first data of the right type we find.
rfc822_data = None
for msg_data in data.itervalues():
if self.rfc822_key in msg_data:
rfc822_data = msg_data[self.rfc822_key]
break
if rfc822_data is None:
raise BadIMAPResponseError("FETCH response", response_data)
try:
new_msg = email.message_from_string(rfc822_data, IMAPMessage)
# We use a general 'except' because the email package doesn't
# always return email.Errors (it can return a TypeError, for
# example) if the email is invalid. In any case, we want
# to keep going, and not crash, because we might leave the
# user's mailbox in a bad state if we do. Better to soldier on.
except:
# Yikes! Barry set this to return at this point, which
# would work ok for training (IIRC, that's all he's
# using it for), but for filtering, what happens is that
# the message ends up blank, but ok, so the original is
# flagged to be deleted, and a new (almost certainly
# unsure) message, *with only the spambayes headers* is
# created. The nice solution is still to do what sb_server
# does and have a X-Spambayes-Exception header with the
# exception data and then the original message.
self.invalid = True
text, details = message.insert_exception_header(
rfc822_data, self.id)
self.invalid_content = text
self.got_substance = True
# Print the exception and a traceback.
print >>sys.stderr, details
return self
new_msg.folder = self.folder
new_msg.previous_folder = self.previous_folder
new_msg.rfc822_command = self.rfc822_command
new_msg.rfc822_key = self.rfc822_key
new_msg.imap_server = self.imap_server
new_msg.uid = self.uid
new_msg.setId(self.id)
new_msg.got_substance = True
if not new_msg.has_key(options["Headers", "mailid_header_name"]):
new_msg[options["Headers", "mailid_header_name"]] = self.id
if options["globals", "verbose"]:
sys.stdout.write(chr(8) + "*")
return new_msg
def MoveTo(self, dest):
'''Note that message should move to another folder. No move is
carried out until Save() is called, for efficiency.'''
if self.previous_folder is None:
self.previous_folder = self.folder
self.folder = dest
def as_string(self, unixfrom=False):
# Basically the same as the parent class's except that we handle
# the case where the data was unparsable, so we haven't done any
# filtering, and we are not actually a proper email.Message object.
# We also don't mangle the from line; the server must take care of
# this.
if self.invalid:
return self._force_CRLF(self.invalid_content)
else:
return message.SBHeaderMessage.as_string(self, unixfrom,
mangle_from_=False)
recent_re = re.compile(r"\\Recent ?| ?\\Recent")
def Save(self):
"""Save message to IMAP server.
We can't actually update the message with IMAP, so what we do is
create a new message and delete the old one."""
assert self.folder is not None,\
"Can't save a message that doesn't have a folder."
assert self.id, "Can't save a message that doesn't have an id."
assert self.imap_server, "Can't do anything without IMAP connection."
response = self.imap_server.uid("FETCH", self.uid,
"(FLAGS INTERNALDATE)")
command = "fetch %s (flags internaldate)" % (self.uid,)
response_data = self.imap_server.check_response(command, response)
data = self.imap_server.extract_fetch_data(response_data)
# The data will be a dictionary - hopefully with only one element,
# but maybe more than one. The key is the message number, which we
# do not have (we use the UID instead). So we look through the
# message and use the last data of the right type we find.
msg_time = self.extractTime()
flags = None
for msg_data in data.itervalues():
if "INTERNALDATE" in msg_data:
msg_time = msg_data["INTERNALDATE"]
if "FLAGS" in msg_data:
flags = msg_data["FLAGS"]
# The \Recent flag can be fetched, but cannot be stored
# We must remove it from the list if it is there.
flags = self.recent_re.sub("", flags)
# We try to save with flags and time, then with just the
# time, then with the flags and the current time, then with just
# the current time. The first should work, but the first three
# sometimes (due to the quirky IMAP server) fail.
for flgs, tme in [(flags, msg_time),
(None, msg_time),
(flags, Time2Internaldate(time.time())),
(None, Time2Internaldate(time.time()))]:
try:
response = self.imap_server.append(self.folder.name, flgs, tme,
self.as_string())
except BaseIMAP.error:
continue
try:
self.imap_server.check_response("", response)
except BadIMAPResponseError:
pass
else:
break
else:
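# Every attempt failed. Note that as_string must be called, and
# that BadIMAPResponseError takes (command, response).
command = "append %s %s %s %s" % (self.folder.name, flgs, tme,
self.as_string())
raise BadIMAPResponseError(command, "")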
if self.previous_folder is None:
self.imap_server.SelectFolder(self.folder.name)
else:
self.imap_server.SelectFolder(self.previous_folder.name)
self.previous_folder = None
response = self.imap_server.uid("STORE", self.uid, "+FLAGS.SILENT",
"(\\Deleted \\Seen)")
command = "set %s to be deleted and seen" % (self.uid,)
self.imap_server.check_response(command, response)
# Not all IMAP servers immediately offer the new message, but
# we need to find it to get the new UID. We need to wait until
# the server offers up an EXISTS command, so we no-op until that
# is the case.
# See [ 941596 ] sb_imapfilter.py not adding headers / moving messages
# We use the recent() function, which no-ops if necessary. We try
# 100 times, and then give up. If a message arrives independently,
# and we are told about it before our message, then this could
# cause trouble, but that would be one weird server.
for i in xrange(100):
response = self.imap_server.recent()
data = self.imap_server.check_response("recent", response)
if data[0] is not None:
break
else:
raise BadIMAPResponseError("Cannot find saved message", "")
# We need to update the UID, as it will have changed.
# Although we don't use the UID to keep track of messages, we do
# have to use it for IMAP operations.
self.imap_server.SelectFolder(self.folder.name)
search_string = "(UNDELETED HEADER %s \"%s\")" % \
(options["Headers", "mailid_header_name"],
self.id.replace('\\',r'\\').replace('"',r'\"'))
response = self.imap_server.uid("SEARCH", search_string)
data = self.imap_server.check_response("search " + search_string,
response)
new_id = data[0]
# See [ 870799 ] imap trying to fetch invalid message UID
# It seems that although the save gave a "NO" response to the
# first save, the message was still saved (without the flags,
# probably). This really isn't good behaviour on the server's
# part, but, as usual, we try and deal with it. So, if we get
# more than one undeleted message with the same SpamBayes id,
# delete all of them apart from the last one, and use that.
multiple_ids = new_id.split()
for id_to_remove in multiple_ids[:-1]:
response = self.imap_server.uid("STORE", id_to_remove,
"+FLAGS.SILENT",
"(\\Deleted \\Seen)")
command = "silently delete and make seen %s" % (id_to_remove,)
self.imap_server.check_response(command, response)
if multiple_ids:
new_id = multiple_ids[-1]
else:
# Let's hope it doesn't, but, just in case, if the search
# turns up empty, we make the assumption that the new message
# is the last one with a recent flag.
response = self.imap_server.uid("SEARCH", "RECENT")
data = self.imap_server.check_response("search recent",
response)
new_id = data[0]
if new_id.find(' ') > -1:
ids = new_id.split(' ')
new_id = ids[-1]
# Ok, now we're in trouble if we still haven't found it.
# We make a huge assumption that the new message is the one
# with the highest UID (they are sequential, so this will be
# ok as long as another message hasn't also arrived).
if new_id == "":
response = self.imap_server.uid("SEARCH", "ALL")
data = self.imap_server.check_response("search all",
response)
new_id = data[0]
if new_id.find(' ') > -1:
ids = new_id.split(' ')
new_id = ids[-1]
self.uid = new_id
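# Illustrative sketch only (not part of the original attachment): why
# Save() strips \Recent before re-appending a message. The server sets
# \Recent itself, so a client can fetch the flag but cannot store it.
# The pattern is the same one as recent_re above; strip_recent is a
# hypothetical helper name.
import re

_recent_re = re.compile(r"\\Recent ?| ?\\Recent")

def strip_recent(flags):
    """Return an IMAP flag-list string with any Recent flag removed."""
    return _recent_re.sub("", flags)

# Possible usage: strip_recent(r"(\Seen \Recent)") gives r"(\Seen)", and a
# flag list holding only \Recent collapses to "()".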
class IMAPFolder(object):
def __init__(self, folder_name, imap_server, stats):
self.name = folder_name
self.imap_server = imap_server
self.stats = stats
# Unique names for cached messages - see _generate_id below.
self.lastBaseMessageName = ''
self.uniquifier = 2
def __cmp__(self, obj):
"""Two folders are equal if their names are equal."""
if obj is None:
# __cmp__ returns 0 for "equal", so returning False here would
# make a folder compare equal to None.
return 1
return cmp(self.name, obj.name)
def __iter__(self):
"""Iterate through the messages in this IMAP folder."""
for key in self.keys():
yield self[key]
def keys(self):
'''Returns *uids* for all the messages in the folder not
marked as deleted.'''
self.imap_server.SelectFolder(self.name)
response = self.imap_server.uid("SEARCH", "UNDELETED")
data = self.imap_server.check_response("search undeleted", response)
if data[0]:
return data[0].split(' ')
else:
return []
custom_header_id_re = re.compile(re.escape(\
options["Headers", "mailid_header_name"]) + "\:\s*(\d+(?:\-\d)?)",
re.IGNORECASE)
message_id_re = re.compile("Message-ID\: ?\<([^\n\>]+)\>",
re.IGNORECASE)
def __getitem__(self, key):
"""Return message matching the given *uid*.
The messages returned have no substance (so this should be
reasonably quick, even with large messages). You need to call
get_full_message() on the returned message to get the substance of
the message from the server."""
self.imap_server.SelectFolder(self.name)
# Using RFC822.HEADER.LINES would be better here, but it seems
# that not all servers accept it, even though it is in the RFC
response = self.imap_server.uid("FETCH", key, "RFC822.HEADER")
response_data = self.imap_server.check_response(\
"fetch %s rfc822.header" % (key,), response)
data = self.imap_server.extract_fetch_data(response_data)
# The data will be a dictionary - hopefully with only one element,
# but maybe more than one. The key is the message number, which we
# do not have (we use the UID instead). So we look through the
# message and use the first data of the right type we find.
headers = None
for msg_data in data.itervalues():
if "RFC822.HEADER" in msg_data:
headers = msg_data["RFC822.HEADER"]
break
if headers is None:
raise BadIMAPResponseError("FETCH response", response_data)
# Create a new IMAPMessage object, which will be the return value.
msg = IMAPMessage()
msg.folder = self
msg.uid = key
msg.imap_server = self.imap_server
# We use the MessageID header as the ID for the message, as long
# as it is available, and if not, we add our own.
# Search for our custom id first, for backwards compatibility.
for id_header_re in [self.custom_header_id_re, self.message_id_re]:
mo = id_header_re.search(headers)
if mo:
msg.setId(mo.group(1))
break
else:
msg.setId(self._generate_id())
# Unfortunately, we now have to re-save this message, so that
# our id is stored on the IMAP server. The vast majority of
# messages have Message-ID headers, from what I can tell, so
# we should only rarely have to do this. It's less often than
# with the previous solution, anyway!
msg = msg.get_full_message()
msg.Save()
if options["globals", "verbose"]:
sys.stdout.write(".")
return msg
# Lifted straight from sb_server.py (under the name getNewMessageName)
def _generate_id(self):
# The message id is the time it arrived, with a uniquifier
# appended if two arrive within one clock tick of each other.
messageName = "%10.10d" % long(time.time())
if messageName == self.lastBaseMessageName:
messageName = "%s-%d" % (messageName, self.uniquifier)
self.uniquifier += 1
else:
self.lastBaseMessageName = messageName
self.uniquifier = 2
return messageName
def Train(self, classifier, isSpam):
"""Train folder as spam/ham."""
num_trained = 0
for msg in self:
if msg.GetTrained() == (not isSpam):
msg = msg.get_full_message()
if msg.could_not_retrieve:
# Something went wrong, and we couldn't even get
# an invalid message, so just skip this one.
# Annoyingly, we'll try to do it every time the
# script runs, but hopefully the user will notice
# the errors and move it soon enough.
continue
msg.delSBHeaders()
classifier.unlearn(msg.tokenize(), not isSpam)
if isSpam:
old_class = options["Headers", "header_ham_string"]
else:
old_class = options["Headers", "header_spam_string"]
# Once the message has been untrained, its training memory
# should reflect that on the off chance that for some
# reason the training breaks.
msg.RememberTrained(None)
else:
old_class = None
if msg.GetTrained() is None:
msg = msg.get_full_message()
if msg.could_not_retrieve:
continue
saved_headers = msg.currentSBHeaders()
msg.delSBHeaders()
classifier.learn(msg.tokenize(), isSpam)
num_trained += 1
msg.RememberTrained(isSpam)
self.stats.RecordTraining(not isSpam, old_class=old_class)
if isSpam:
move_opt_name = "move_trained_spam_to_folder"
else:
move_opt_name = "move_trained_ham_to_folder"
if options["imap", move_opt_name] != "":
# We need to restore the SpamBayes headers.
for header, value in saved_headers.items():
msg[header] = value
msg.MoveTo(IMAPFolder(options["imap", move_opt_name],
self.imap_server, self.stats))
msg.Save()
return num_trained

    def Filter(self, classifier, spamfolder, unsurefolder, hamfolder):
        count = {}
        count["ham"] = 0
        count["spam"] = 0
        count["unsure"] = 0
        for msg in self:
            if msg.GetClassification() is None:
                msg = msg.get_full_message()
                if msg.could_not_retrieve:
                    # Something went wrong, and we couldn't even get
                    # an invalid message, so just skip this one.
                    # Annoyingly, we'll try to do it every time the
                    # script runs, but hopefully the user will notice
                    # the errors and move it soon enough.
                    continue
                (prob, clues) = classifier.spamprob(msg.tokenize(),
                                                    evidence=True)
                # Add headers and remember classification.
                msg.delSBHeaders()
                msg.addSBHeaders(prob, clues)
                self.stats.RecordClassification(prob)

                cls = msg.GetClassification()
                if cls == options["Headers", "header_ham_string"]:
                    if hamfolder:
                        msg.MoveTo(hamfolder)
                    # Otherwise, we leave ham alone.
                    count["ham"] += 1
                elif cls == options["Headers", "header_spam_string"]:
                    msg.MoveTo(spamfolder)
                    count["spam"] += 1
                else:
                    msg.MoveTo(unsurefolder)
                    count["unsure"] += 1
                msg.Save()
        return count

class IMAPFilter(object):
    def __init__(self, classifier, stats):
        self.spam_folder = None
        self.unsure_folder = None
        self.ham_folder = None
        self.classifier = classifier
        self.imap_server = None
        self.stats = stats

    def Train(self):
        assert self.imap_server, "Cannot do anything without IMAP server."

        if options["globals", "verbose"]:
            t = time.time()

        total_trained = 0
        for is_spam, option_name in [(False, "ham_train_folders"),
                                     (True, "spam_train_folders")]:
            training_folders = options["imap", option_name]
            for fol in training_folders:
                # Select the folder to make sure it exists
                try:
                    self.imap_server.SelectFolder(fol)
                except BadIMAPResponseError:
                    print "Skipping %s, as it cannot be selected." % (fol,)
                    continue

                if options['globals', 'verbose']:
                    print " Training %s folder %s" % \
                          (["ham", "spam"][is_spam], fol)
                folder = IMAPFolder(fol, self.imap_server, self.stats)
                num_trained = folder.Train(self.classifier, is_spam)
                total_trained += num_trained
                if options['globals', 'verbose']:
                    print "\n %s trained." % (num_trained,)

        if total_trained:
            self.classifier.store()

        if options["globals", "verbose"]:
            print "Training took %.4f seconds, %s messages were trained." \
                  % (time.time() - t, total_trained)

    def Filter(self):
        assert self.imap_server, "Cannot do anything without IMAP server."
        if not self.spam_folder:
            self.spam_folder = IMAPFolder(options["imap", "spam_folder"],
                                          self.imap_server, self.stats)
        if not self.unsure_folder:
            self.unsure_folder = IMAPFolder(options["imap",
                                                    "unsure_folder"],
                                            self.imap_server, self.stats)
        ham_folder_name = options["imap", "ham_folder"]
        if ham_folder_name and not self.ham_folder:
            self.ham_folder = IMAPFolder(ham_folder_name, self.imap_server,
                                         self.stats)

        if options["globals", "verbose"]:
            t = time.time()

        count = {}
        count["ham"] = 0
        count["spam"] = 0
        count["unsure"] = 0

        # Select the ham, spam and unsure folders to make sure they exist.
        try:
            self.imap_server.SelectFolder(self.spam_folder.name)
        except BadIMAPResponseError:
            print "Cannot select spam folder. Please check configuration."
            sys.exit(-1)
        try:
            self.imap_server.SelectFolder(self.unsure_folder.name)
        except BadIMAPResponseError:
            print "Cannot select unsure folder. Please check configuration."
            sys.exit(-1)
        if self.ham_folder:
            try:
                self.imap_server.SelectFolder(self.ham_folder.name)
            except BadIMAPResponseError:
                print "Cannot select ham folder. Please check configuration."
                sys.exit(-1)

        for filter_folder in options["imap", "filter_folders"]:
            # Select the folder to make sure it exists.
            try:
                self.imap_server.SelectFolder(filter_folder)
            except BadIMAPResponseError:
                print "Cannot select %s, skipping." % (filter_folder,)
                continue

            folder = IMAPFolder(filter_folder, self.imap_server, self.stats)
            subcount = folder.Filter(self.classifier, self.spam_folder,
                                     self.unsure_folder, self.ham_folder)
            for key in count.keys():
                count[key] += subcount.get(key, 0)

        if options["globals", "verbose"]:
            if count is not None:
                print "\nClassified %s ham, %s spam, and %s unsure." % \
                      (count["ham"], count["spam"], count["unsure"])
            print "Classifying took %.4f seconds." % (time.time() - t,)

def run(force_UI=False):
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hbPtcvl:e:i:d:p:o:')
    except getopt.error, msg:
        print >>sys.stderr, str(msg) + '\n\n' + __doc__
        sys.exit()

    doTrain = False
    doClassify = False
    doExpunge = options["imap", "expunge"]
    imapDebug = 0
    sleepTime = 0
    promptForPass = False
    launchUI = False
    servers = ""
    usernames = ""

    for opt, arg in opts:
        if opt == '-h':
            print >>sys.stderr, __doc__
            sys.exit()
        elif opt == "-b":
            launchUI = True
        elif opt == '-t':
            doTrain = True
        elif opt == '-P':
            promptForPass = True
        elif opt == '-c':
            doClassify = True
        elif opt == '-v':
            options["globals", "verbose"] = True
        elif opt == '-e':
            if arg == 'y':
                doExpunge = True
            else:
                doExpunge = False
        elif opt == '-i':
            imapDebug = int(arg)
        elif opt == '-l':
            # -l is given in minutes; time.sleep() wants seconds.
            sleepTime = int(arg) * 60
        elif opt == '-o':
            options.set_from_cmdline(arg, sys.stderr)
    bdbname, useDBM = storage.database_type(opts)

    # Let the user know what they are using...
    v = get_current_version()
    print "%s.\n" % (v.get_long_version("SpamBayes IMAP Filter"),)

    if options["globals", "verbose"]:
        print "Loading database %s..." % (bdbname),

    classifier = storage.open_storage(bdbname, useDBM)
    message_db = message.open_storage(*message.database_type())

    if options["globals", "verbose"]:
        print "Done."

    if options["imap", "server"]:
        servers = options["imap", "server"]
        usernames = options["imap", "username"]
        if not promptForPass:
            pwds = options["imap", "password"]
    else:
        pwds = None
        if not launchUI and not force_UI:
            print "You need to specify both a server and a username."
            sys.exit()

    if promptForPass:
        pwds = []
        for i in xrange(len(usernames)):
            pwds.append(getpass("Enter password for %s:" % (usernames[i],)))

    servers_data = []
    # pwds may be None (no server configured), so fall back to an empty list.
    for server, username, password in zip(servers, usernames, pwds or []):
        if server.find(':') > -1:
            server, port = server.split(':', 1)
            port = int(port)
        else:
            if options["imap", "use_ssl"]:
                port = 993
            else:
                port = 143
        servers_data.append((server, port, username, password))

    # Load stats manager.
    stats = Stats.Stats(options, message_db)

    imap_filter = IMAPFilter(classifier, stats)

    # Web interface. We have changed the rules about this many times.
    # With 1.0.x, the rule is that the interface is served if we are
    # not classifying or training. However, this runs into the problem
    # that if we run with -l, we might still want to edit the options,
    # and we don't want to start a separate instance, because then the
    # database is accessed from two processes.
    # With 1.1.x, the rule is that the interface is also served if the
    # -l option is used, which means it is only not served if we are
    # doing a one-off classification/train. In that case, there would
    # probably not be enough time to get to the interface and interact
    # with it (and we don't want it to die halfway through!), and we
    # don't want to slow classification/training down, either.
    if sleepTime or not (doClassify or doTrain):
        imaps = []
        for server, port, username, password in servers_data:
            if server == "":
                imaps.append(None)
            else:
                imaps.append(IMAPSession(server, port, imapDebug, doExpunge))

        httpServer = UserInterfaceServer(options["html_ui", "port"])
        httpServer.register(IMAPUserInterface(classifier, imaps, pwds,
                                              IMAPSession, stats=stats))
        launchBrowser = launchUI or options["html_ui", "launch_browser"]
        if sleepTime:
            # Run in a separate thread, as we have more work to do.
            thread.start_new_thread(Dibbler.run, (),
                                    {"launchBrowser": launchBrowser})
        else:
            Dibbler.run(launchBrowser=launchBrowser)
    if doClassify or doTrain:
        imaps = []
        for server, port, username, password in servers_data:
            imaps.append(((server, port, imapDebug, doExpunge),
                          username, password))

        # In order to make working with multiple servers easier, we
        # allow the user to have separate configuration files for each
        # server. These may specify different folders to watch, different
        # spam/unsure folders, or any other options (e.g. thresholds).
        # For each server we use the default (global) options, and load
        # the specific options on top. To facilitate this, we use a
        # restore point for the options with just the default (global)
        # options.
        # XXX What about when we are running with -l and change options
        # XXX via the web interface? We need to handle that, really.
        options.set_restore_point()
        while True:
            for (server, port, imapDebug, doExpunge), username, password in imaps:
                imap = IMAPSession(server, port, imapDebug, doExpunge)
                if options["globals", "verbose"]:
                    print "Account: %s:%s" % (imap.server, imap.port)
                if imap.connected:
                    # As above, we load a separate configuration file
                    # for each server, if it exists. We look for a
                    # file in the optionsPathname directory, with the
                    # name server.name.ini or .spambayes_server_name_rc
                    # XXX While 1.1 is in alpha these names can be
                    # XXX changed if desired. Please let Tony know!
                    basedir = os.path.dirname(optionsPathname)
                    fn1 = os.path.join(basedir, imap.server + ".ini")
                    fn2 = os.path.join(basedir,
                                       imap.server.replace(".", "_") + \
                                       "_rc")
                    for fn in (fn1, fn2):
                        if os.path.exists(fn):
                            options.merge_file(fn)

                    try:
                        imap.login(username, password)
                    except LoginFailure, e:
                        print str(e)
                        continue

                    imap_filter.imap_server = imap

                    if doTrain:
                        if options["globals", "verbose"]:
                            print "Training"
                        imap_filter.Train()
                    if doClassify:
                        if options["globals", "verbose"]:
                            print "Classifying"
                        imap_filter.Filter()

                    imap.logout()
                    options.revert_to_restore_point()
                else:
                    # Failed to connect. This may be a temporary problem,
                    # so just continue on and try again. If we are only
                    # running once we will end, otherwise we'll try again
                    # in sleepTime seconds.
                    # XXX Maybe we should log this error message?
                    pass

            if sleepTime:
                time.sleep(sleepTime)
            else:
                break

if __name__ == '__main__':
    run()
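
For anyone reading the server handling in run() on its own: an entry in options["imap", "server"] may be either "host" or "host:port", and when no port is given the script falls back to 993 if options["imap", "use_ssl"] is set and 143 otherwise. Below is a minimal standalone sketch of just that rule. It is not part of sb_imapfilter.py; the helper name split_server and its use_ssl argument are made up here for illustration, standing in for the options lookup.

```python
def split_server(spec, use_ssl=False):
    """Split 'host' or 'host:port' into a (host, port) tuple.

    Hypothetical helper mirroring the loop in run(): an explicit port
    always wins; otherwise use the usual IMAP defaults (993 for SSL,
    143 for plain connections).
    """
    if ':' in spec:
        # Split on the first colon only, as run() does.
        host, port = spec.split(':', 1)
        return host, int(port)
    if use_ssl:
        return spec, 993
    return spec, 143


if __name__ == '__main__':
    # Example hostnames are placeholders, not real servers.
    for spec, ssl in [("mail.example.com", False),
                      ("mail.example.com", True),
                      ("mail.example.com:1143", True)]:
        print(split_server(spec, ssl))
```

Note that an explicit port is used as given even when use_ssl is on, so a non-standard SSL port has to be spelled out in the server option.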