[spambayes-dev] Standalone SpamBayes classifier for websites

skip Mon, 14 May 2007 06:05:11 -0700

CC'ing Richard Jones (Roundup guru) and Reimar Bauer (MoinMoin guru).
Reimar, I don't seem to have Marian Neagul's email handy.  Can you forward
this to him?


I've been trying (rather unsuccessfully) to figure out how to integrate a
SpamBayes classifier into Roundup.  Basically I know zilch about Roundup's
code.  You need to score form submissions (the easy part), save them for
later retraining and allow misclassified submissions to be reinjected into
the website (the hard parts).  I had similar problems when I tried to
incorporate SpamBayes into MoinMoin.  These sites generally treat all
submissions as valid.  Presuming we have a SpamBayes training database and
classifier we can talk to, it's a fairly easy task to score a submission and
reject it if it looks like spam.  Alas, if the submission is scored as spam,
Roundup and MoinMoin have no convenient way to save the submission yet keep
it sequestered so it doesn't turn up on the web.

It occurred to me yesterday that the SpamBayes POP3 proxy and IMAP filter
solve the storage and classification problems for the specific case where
you're speaking one of those two email protocols.  The only catch is that
they are tied to POP3 and IMAP.  Instead of email I need some other way to
get a "message" into and out of the classifier/database manager.  Given an
arbitrary form submission I should be able to convert it to a MIME message
(file uploads map to attachments) and hand it off to a standalone SpamBayes
server for scoring and storage.  If the submission is originally marked as
spam (or unsure) but is later deemed okay, I should be able to convert the
MIME message back into the necessary bits for resubmission.  If the
submission is originally marked as ham but is later deemed to be spam the
regular Roundup or MoinMoin facility for deleting tickets, pages or
attachments would get rid of it.
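
A minimal sketch of the round trip described above, using only Python's
standard email package.  The function names and the X-Form-Field header are
hypothetical, invented here to mark which MIME parts carry form fields;
neither Roundup nor MoinMoin defines them.  Text fields become text/plain
parts and file uploads become attachments.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def form_to_mime(fields, uploads=()):
    """Pack a form submission into a single multipart/mixed message.

    fields:  dict mapping field name -> text value
    uploads: iterable of (filename, bytes) pairs
    """
    msg = EmailMessage()
    msg.make_mixed()
    for name, value in fields.items():
        part = EmailMessage()
        part.set_content(value)
        # Hypothetical header: records which form field this part came from.
        part["X-Form-Field"] = name
        msg.attach(part)
    for filename, data in uploads:
        msg.add_attachment(data, maintype="application",
                           subtype="octet-stream", filename=filename)
    return msg


def mime_to_form(raw):
    """Unpack bytes produced by form_to_mime() back into (fields, uploads)."""
    msg = message_from_bytes(raw, policy=policy.default)
    fields, uploads = {}, []
    for part in msg.iter_parts():
        name = part["X-Form-Field"]
        if name is not None:
            text = part.get_content()
            # set_content() always terminates text with a newline; drop it.
            if text.endswith("\n"):
                text = text[:-1]
            fields[str(name)] = text
        else:
            uploads.append((part.get_filename(), part.get_content()))
    return fields, uploads
```

The serialized form (msg.as_bytes()) is what would be handed to the scoring
server and kept in its store.  Note that the email package normalizes line
endings in text parts, so a field value that ends in a newline will not
survive the round trip byte for byte.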

Alas, sb_server.py and sb_imapfilter.py don't seem to share a lot of code
(save for using Dibbler to build the web user interface).  Is that true?  It
seems the user interface, classifier bits and storage should be essentially
identical.  All that should differ between them is the way you
transmit messages to and from external systems:

                                               sink
                                                 ^
                                                 |
                +------------------+        +----------+
                |                  |        |          |
                |    Core          |<------>| Protocol |
                |     Server       |        |  Adapter |
                |                  |        |          |
                +------------------+        +----------+
                      ^                          ^
                      |                          |
                      v                        source
                 web & msg storage

For POP3 the source would be the email client and the sink would be the real
POP3 server.  For IMAP the source and sink would be the IMAP server.  For
websites the source and sink would be the web site (Roundup, MoinMoin, etc).
The data sent from the protocol adapter to the core server would be MIME
messages.  The data sent to the protocol adapter would be simply score info
(ham, spam, unsure, perhaps raw scores).
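
The split in the diagram could be sketched roughly as follows.  This is not
existing SpamBayes code: CoreServer, WebsiteAdapter, score_fn and the cutoff
values are all assumptions made for illustration, and score_fn stands in for
whatever classifier the real core server would wrap.  MIME messages flow
from the adapter to the core; only the verdict flows back.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class CoreServer:
    """Scores MIME messages and keeps every one for later retraining."""
    score_fn: Callable[[bytes], float]  # stand-in for the real classifier
    ham_cutoff: float = 0.20            # illustrative thresholds
    spam_cutoff: float = 0.90
    store: Dict[int, Tuple[bytes, str]] = field(default_factory=dict)

    def submit(self, raw_message: bytes) -> Tuple[int, str, float]:
        """Score a message, store it, and return (id, verdict, score)."""
        prob = self.score_fn(raw_message)
        if prob < self.ham_cutoff:
            verdict = "ham"
        elif prob >= self.spam_cutoff:
            verdict = "spam"
        else:
            verdict = "unsure"
        msg_id = len(self.store) + 1
        self.store[msg_id] = (raw_message, verdict)
        return msg_id, verdict, prob

    def release(self, msg_id: int) -> bytes:
        """Mark a stored message as ham and return it for reinjection."""
        raw, _ = self.store[msg_id]
        self.store[msg_id] = (raw, "ham")
        return raw


class WebsiteAdapter:
    """Protocol adapter for a website, which is both source and sink."""

    def __init__(self, core: CoreServer):
        self.core = core

    def handle_submission(self, raw_message: bytes) -> bool:
        """Return True if the site should publish the submission now."""
        _, verdict, _ = self.core.submit(raw_message)
        return verdict == "ham"
```

A POP3 or IMAP adapter would wrap the same CoreServer and differ only in how
it obtains messages and where it delivers them.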

Any ideas on the shortest route to a core server that provides the user,
training and storage interfaces?  Start from scratch?  Rip the POP3 stuff
out of sb_server.py?  Rip the IMAP stuff out of sb_imapfilter.py?  I'd
really hate to reinvent the wheel since we seem to have two wheels already.
Once that core server is available, adapting to different environments
should be possible by plugging in specific protocol adapters.

Thx,

Skip

_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev
