Hi everyone, this is Vicki. I am applying to GSoC 2011 for JAMES-1216 [JIRA]. I have a question about updating Bayesian Analysis to 3.0: since the container switched from Avalon to Spring, do some interfaces need to be rewritten? Does anyone know how this should be done?
I have attached my proposal and the figure here. I think it will serve as a starting point for me to contribute to the James community. Any feedback is welcome.
_______________________________________________________________________________________________________________

*Proposal Title:* Design and implement machine-learning filters and categorization for anti-spam in the mail system
*Student Name:* Yu Fu
*Student E-mail:* [email protected]
*Organization/Project:* The Apache Software Foundation
*Assigned Mentor:* Eric Charles, Robert Burrell Donkin

*Proposal Abstract:* Use the K-nearest neighbors machine-learning method to implement an intelligent mail-filtering feature, strengthening anti-spam capability beyond the current Bayesian Analysis method.

*Detailed Description:*

*1. Understand James Anti-Spam Features*

The current James anti-spam support has three parts, which I have marked in red in the figure below. The figure also shows how the basic concepts relate to each other.

· The first part, *fastfail* in the SMTPServer, rejects a message at the SMTP level once a configured hit score is reached. The server runs in the background of a machine, listening on a specific TCP port. At this level it filters out non-existent users, applies a DSN filter, rejects domains with invalid MX records, and ignores duplicated recipients. IP addresses that send mail to the configured recipients are blacklisted for a configured time. The port and hit score can be set in smtpserver.xml.

· *SpamAssassin*, a mailet, classifies messages as spam or not against an internal (configurable) score threshold. Usually a message will only be considered spam if it matches multiple criteria; matching just a single test will not usually be enough to reach the threshold.

· *BayesianAnalysis*, also a mailet, classifies mail as spam or not based on Bayesian probability. It relies on training data built from users' judgments.
Users manually judge a message as spam and forward it to [email protected]; conversely, if they judge it as non-spam, they forward it to [email protected]. The BayesianAnalysisFeeder learns from this training set and builds a predictive model based on Bayesian probability. A table in the database maintains the word frequencies of the corpus; every 10 minutes a thread in BayesianAnalysis checks and updates this table. The correct approach is to send the original spam or non-spam message as an attachment to another message addressed to the feeder, in order to avoid bias from the current sender's email headers.

*2. Motivation*

The Naive Bayes approach, based on Bayesian probability, is the most popular method. But for overly common phrases or keywords, Naive Bayes does not work well. Consider the phrase "Click Here to Unsubscribe", whose words occur 10 times as often in spam as in good mail: P(click|spam), P(here|spam), and P(unsubscribe|spam) are each about 10 times higher, and multiplied together they contribute a factor of about 1000. The bias from such hot keywords heavily hurts the accuracy of the filter. I want to implement K-nearest neighbor, which has shown much better experimental results than our current approach [1], and whose training is very fast.

This project will be the starting point for me to contribute to the James open-source community. I treat the anti-spam filter as a practical issue rather than a theoretical problem: in this community we need to win with abundant practice and strong analytical reasoning, and I enjoy implementing different approaches to find the most efficient and simplest method.

*3. Project Scope*

The project will be based on the James Server 3.0 framework, and it will restrict the categories to spam/not-spam.

· Upgrade Bayesian Analysis from 2.3 to 3.0.
· Implement "TF-IDF" and "binary vector" feature selection.
· Implement the K-nearest neighbor algorithm to serve as the new anti-spam filter.

*4. Approach*

My idea is to use the K-nearest neighbor algorithm to classify mail as spam or non-spam.
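To make the idea concrete, here is a minimal sketch of KNN classification over binary term vectors. All names, the toy vocabulary, and the labels are illustrative and not part of James; it uses cosine similarity and a simple majority vote rather than the full pipeline described later.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch only: encode each mail as a binary term vector over a
// fixed vocabulary, then label a new mail by a majority vote of its k nearest
// training vectors under cosine similarity.
public class BinaryKnn {

    // 1 if the mail contains the vocabulary term, 0 otherwise
    static double[] toVector(Set<String> mailTerms, List<String> vocabulary) {
        double[] v = new double[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            v[i] = mailTerms.contains(vocabulary.get(i)) ? 1.0 : 0.0;
        }
        return v;
    }

    // cosine similarity; 0 if either vector is all zeros
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // majority label among the k training vectors most similar to the query
    static String classify(double[] query, Map<double[], String> training, int k) {
        return training.entrySet().stream()
                .sorted((e1, e2) -> Double.compare(
                        cosine(query, e2.getKey()), cosine(query, e1.getKey())))
                .limit(k)
                .collect(Collectors.groupingBy(Map.Entry::getValue, Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }

    public static void main(String[] args) {
        List<String> vocab = List.of("click", "unsubscribe", "meeting", "agenda");
        Map<double[], String> training = new LinkedHashMap<>();
        training.put(toVector(Set.of("click", "unsubscribe"), vocab), "spam");
        training.put(toVector(Set.of("meeting", "agenda"), vocab), "ham");
        training.put(toVector(Set.of("click"), vocab), "spam");

        double[] query = toVector(Set.of("click", "unsubscribe"), vocab);
        System.out.println(classify(query, training, 2)); // prints "spam"
    }
}
```

In a real mailet the vocabulary and training vectors would come from the shared training tables rather than being hard-coded.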
By computing distances between mails in a word-vector space, KNN achieves better experimental accuracy [1].

*Feature Selection*

Both "TF-IDF" (case-insensitive) and "binary vector" (case-insensitive) feature selection will be implemented.

*TF-IDF*: The tf-idf weight (term frequency-inverse document frequency) is a statistical measure used to evaluate how important a word is to a document <http://en.wikipedia.org/wiki/Document> in a collection or corpus <http://en.wikipedia.org/wiki/Text_corpus>. The importance increases proportionally <http://en.wikipedia.org/wiki/Proportionality_(mathematics)> to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines <http://en.wikipedia.org/wiki/Search_engine> as a central tool in scoring and ranking a document's relevance <http://en.wikipedia.org/wiki/Relevance_(information_retrieval)> given a user query <http://en.wikipedia.org/wiki/Information_retrieval>. It can therefore avoid the bias of the Bayesian approach. The TF-IDF parameters will be tuned by experiment, and can also be configured manually by users.

Weighted *TF-IDF*: on top of TF-IDF, we can assign different weights according to a word's position in the mail structure. My assumption is that words in the message header are more important than those in the body.

*Binary Vector*: maintains, for each training example, whether each feature (term) exists or not (1 or 0). The main variable here is the size of the vector, which we will choose with the available memory in mind.

*Variables*: vector size, ratio for TF-IDF, weight for word position.

*Feature Extraction*

I will use mutual information to reduce the dimensionality: the mutual information of each word with the class variable is used to select important features.
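The tf-idf weighting described above can be sketched in a few lines. This is a minimal, assumed implementation (class and corpus are mine, not James code) using the common formula weight(t, d) = tf(t, d) * ln(N / df(t)) over lower-cased token lists:

```java
import java.util.List;

// Illustrative tf-idf sketch: tf = occurrences of the term in the document,
// df = number of documents in the corpus containing the term, N = corpus size.
public class TfIdf {

    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(term::equals).count();
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        if (tf == 0 || df == 0) {
            return 0.0; // term absent from the document or the whole corpus
        }
        return tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("click", "here", "to", "unsubscribe"),
                List.of("meeting", "agenda", "attached"),
                List.of("click", "the", "link"));

        // "unsubscribe" appears in 1 of 3 docs -> weight 1 * ln(3)
        System.out.println(tfIdf("unsubscribe", corpus.get(0), corpus));
        // "click" appears in 2 of 3 docs -> lower weight, 1 * ln(1.5)
        System.out.println(tfIdf("click", corpus.get(0), corpus));
    }
}
```

The weighted variant would simply multiply the result by a position factor (e.g. a larger factor for header terms than body terms), tuned by experiment.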
The mutual information of a word captures the amount of information gained in classifying documents by knowing the presence or absence of that word [3].

*KNN*: this method still builds a predictive model from training data, and it can share the training data already collected for Bayesian Analysis. The k-nearest neighbors algorithm (k-NN) is a method for classifying <http://en.wikipedia.org/wiki/Statistical_classification> objects based on the closest training examples in the feature space <http://en.wikipedia.org/wiki/Feature_space>. It is among the simplest of all machine learning <http://en.wikipedia.org/wiki/Machine_learning> algorithms: an object is classified by a majority vote of its neighbors and assigned to the class most common among its k nearest neighbors (k is a positive integer <http://en.wikipedia.org/wiki/Integer>, typically small). Here k = 2, since we only need to classify into spam and non-spam.

KNN implementation:

· Method 1: Create a "binary vector" that records whether each feature (term) exists (1 or 0) in each training example, and build the same binary vector for each unclassified mail. Then compute the cosine distance from the target vector to the sampled training vectors, order the samples by distance, choose a heuristically optimal number of nearest neighbors (here 2) based on RMSE <http://en.wikipedia.org/wiki/RMSE> estimated by cross-validation, and take an inverse-distance-weighted vote among those nearest neighbors to decide which class the unclassified mail belongs to.

· Method 2: Same as Method 1, but using "TF-IDF" vectors instead of binary vectors.

Enhanced features for the Bayesian method:
1. Filtering on certain words based on the Bayesian probability of single words.
2. Filtering on all the words based on the combined probability of all the words.
3.
Filtering on rare words based on their Bayesian probability.

*Schedule*

· 1 week: upgrade the Bayesian Analysis feature to 3.0.
· 2 weeks: implement the parser and feature-selection methods (3 weeks implementation and 1 week testing).
· 3 weeks: implement the KNN method (2 weeks implementation and 1 week testing).
· 2 weeks: tune the predictive model with different parameters.
· 1 week: finish the documentation and the guide for usage and tuning.

[1] Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model. irlab.csie.ntu.edu.tw/~chean/perWeb/projects/AI/AI_final.pdf
[2] An Empirical Performance Comparison of Machine Learning Methods for Spam E-Mail Categorization. http://www.computer.org/portal/web/csdl/doi/10.1109/ICHIS.2004.21
[3] Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.1259&rep=rep1&type=pdf

--
Yu Fu
[email protected]
443-388-6654
