[EMAIL PROTECTED] wrote on Friday, February 02, 2007 9:23 AM -0600:

> Seth Goodman wrote:
>
> > The word salad they use to drown out significant clues generally
> > fails, but if they throw enough words at it, they sometimes dilute
> > the spam clues sufficiently. The fact that they throw hundreds of
> > "noise" words at the filters for every spam clue they want to hide
> > and Bayesian filters still catch half or three-quarters of it
> > shows how powerful the Bayesian approach really is....
>
> Hmmm... Could we do something to measure the amount of word salad
> without penalizing large non-image emails?

That's a very interesting idea: a meta-analysis after tokenizing. To restate the hypothesis you imply: spam using word salad may have a different percentage of tokens that are significant clues than non-spam email. Taking this further, there may also be differences in the total number of distinct tokens generated, and in how many of those tokens come from words versus synthetic tokens. So in general, try to make use of any correlation between spamminess and meta-information like total number of tokens generated, total number of word tokens generated, number of significant clues and number of non-significant clues. A very cool general extension to Bayesian classification.

I don't know how you'd put this meta-information into a form that Spambayes could make use of. Let's see, the database tells you how many times a given token appears in the ham/spam training sets. From this you calculate a spam probability that is combined with the results of other tokens to give an overall spam probability. For a numeric value token, you want to calculate a spam probability of the numeric value with respect to the values in the ham/spam training sets. It's a different calculation, but it is still probably amenable to using a chi-square distribution so you can combine it with other clues.

> > - zombie hosts tend to be weak on SMTP etiquette, so one clue is
> > that they often fail to wait when asked; making the SMTP client
> > wait for 30 seconds before sending the "connect banner" often
> > tricks impatient zombies into spewing, and you can then hang up;
>
> Yeah, but this is a job for postgrey and other similar tools.

Yes, sendmail/exim/qmail, but we're completely in agreement on the location. My point to the OP was that the MTA is the best place to make spam filtering more effective by cutting down on the amount of spam post-acceptance filters have to process.
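
A minimal sketch of the meta-information idea from the first part of this reply, under stated assumptions. Rather than computing a spam probability directly from a numeric value, it swaps in a simpler technique: each numeric value is bucketed into a discrete synthetic token, which could then be counted in the ham/spam training sets like any other token. Nothing here is Spambayes code; the function name `meta_tokens`, the `meta:` prefix, the 0.1/0.9 cutoffs and the bucket counts are all hypothetical.

```python
# Hypothetical cutoffs: a token counts as a "significant clue" when its
# trained spam probability is at or beyond one of these values.
HAM_CUTOFF = 0.1
SPAM_CUTOFF = 0.9


def meta_tokens(tokens, spam_prob, unknown_prob=0.5, buckets=10):
    """Return synthetic tokens describing a message's token statistics.

    tokens    -- iterable of token strings produced by some tokenizer
    spam_prob -- dict mapping token -> spam probability from training
    """
    distinct = set(tokens)
    if not distinct:
        return ["meta:empty"]

    # Count distinct tokens whose probability is far from neutral.
    significant = sum(
        1 for t in distinct
        if not HAM_CUTOFF < spam_prob.get(t, unknown_prob) < SPAM_CUTOFF
    )

    # Bucket the significant-clue fraction with integer arithmetic so the
    # result is a small set of discrete, trainable tokens.
    clue_bucket = min(significant * buckets // len(distinct), buckets - 1)

    # Bucket the distinct-token count on a rough log2 scale, so large
    # messages are described rather than penalized outright.
    size_bucket = len(distinct).bit_length()

    return [
        f"meta:clue-fraction:{clue_bucket}",
        f"meta:distinct-log2:{size_bucket}",
    ]
```

A message with two strong clues diluted by eight neutral words lands in a low `clue-fraction` bucket; whether that bucket actually correlates with spam is something only training data could show.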
The example was meant to show the kind of behavioral clues that suggest an SMTP client may not be a legitimate mail host and that the connection should be refused. I was suggesting that doing the MTA part a little better has far greater return than anything you do later. I suspect that the best rejection criterion for image spam is the identity of the SMTP client (a zombie host), and that's hard to do once a message is delivered to a user mailbox.

After giving a few examples, I realized that the decision process is similar to the one used in a post-acceptance spam filter, so perhaps MTAs could make use of Bayesian classification to make better decisions.

The current state of the art (OK, bleeding edge) is to use a reputation system that accumulates reputation (hamminess) for each of several possible sender identity types, identity qualification methods and qualification results. For example, there are three common identities available at SMTP envelope time: connecting IP address, connecting hostname, and SMTP MAIL FROM address (domain part only). Because of the prevalence of forgery, you attempt to qualify each identity using a hierarchy of possible methods. Common methods to qualify an identity are SPF and forward/reverse DNS. Each qualification method can produce results of pass, fail or unknown. The tuple of (identity, qualification method, qualification result) forms an atom in the database and holds a reputation score. There are also behavioral clues from the connecting SMTP client which are useful when there is no reputation data. Finally, there is a time component so the data remains current.

Every time a connecting MTA offers a message, the receiving MTA must make a trinary decision analogous to what Spambayes does: accept all messages from this sender (whitelist), deny all messages from this sender (blacklist), or allow the sender to present messages but filter each one for content (unsure).
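
The reputation scheme described above can be sketched roughly as follows. This is an illustration under assumptions, not any particular MTA's implementation: the class names, the 30-day half-life, the minimum-evidence threshold and the 0.95/0.05 cutoffs are all hypothetical, and an atom's score is simplified to decayed ham and spam counts.

```python
import time
from dataclasses import dataclass

HALF_LIFE = 30 * 24 * 3600.0  # assumed: evidence halves in weight every 30 days
MIN_EVIDENCE = 5.0            # assumed: below this the sender counts as unknown


@dataclass
class Atom:
    """Reputation for one (identity, method, result) tuple."""
    ham: float = 0.0
    spam: float = 0.0
    updated: float = 0.0


def _weight(age_seconds):
    # Exponential decay supplies the time component.
    return 0.5 ** (age_seconds / HALF_LIFE)


class ReputationDB:
    def __init__(self):
        self.atoms = {}

    def record(self, key, is_spam, now=None):
        """Add one observation for key = (identity, method, result)."""
        now = time.time() if now is None else now
        atom = self.atoms.setdefault(key, Atom(updated=now))
        factor = _weight(now - atom.updated)
        atom.ham *= factor
        atom.spam *= factor
        atom.updated = now
        if is_spam:
            atom.spam += 1.0
        else:
            atom.ham += 1.0

    def decide(self, keys, now=None):
        """Return 'accept', 'reject' or 'filter' for the given atoms."""
        now = time.time() if now is None else now
        ham = spam = 0.0
        for key in keys:
            atom = self.atoms.get(key)
            if atom is None:
                continue
            factor = _weight(now - atom.updated)
            ham += atom.ham * factor
            spam += atom.spam * factor
        total = ham + spam
        if total < MIN_EVIDENCE:
            return "filter"   # no usable reputation: inspect content
        ratio = ham / total
        if ratio >= 0.95:
            return "accept"   # whitelist
        if ratio <= 0.05:
            return "reject"   # blacklist
        return "filter"       # unsure


db = ReputationDB()
key = ("domain:example.org", "spf", "pass")  # hypothetical atom
for _ in range(10):
    db.record(key, is_spam=False, now=0.0)
print(db.decide([key], now=0.0))
```

Behavioral clues, which the text notes matter most when there is no reputation data, are left out of this sketch; they would feed into the "filter" branch.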
The quality of the decisions is particularly important for senders with no reputation, as that is where most spam comes from, yet that group also includes infrequent senders with real messages. Sender in this context means the mail host or the domain that bounces go to, not the mailbox address of the author.

--
Seth Goodman
_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev