Thanks Eric and Robert. I have updated my proposal again as follows: 1. Feature Selection: I will consider other approaches here too. 2. Linked to JAMES-1216. 3. Fixed spelling and writing errors.
Attached:

Proposal Title: Design and Implement Machine Learning Filters and Categorization for Anti-Spam Functions in the Mail System
Student Name: Yu Fu
Student E-mail: [email protected]
Organization/Project: The Apache Software Foundation
Assigned Mentor: Eric Charles, Robert Burrell Donkin

Proposal Abstract:
Use advanced machine learning methods, K-nearest neighbors (KNN) and SVM, to implement an intelligent mail filtering feature, adding and testing pre-processing, feature selection and ML algorithms, in order to strengthen the anti-spam capability beyond the current Bayesian analysis method.

Detailed Description:

1. Understand James Anti-Spam Features

The current James anti-spam system has three parts, which I marked in red in the figure below. The figure also shows the overall relationships between the basic concepts.

• The first part, fastfail in the SMTPServer, rejects a message at the SMTP level once a configured number of hits is reached. It operates in the background of a machine, listening on a specific TCP port. At this level, the server filters non-existent users, applies the DSN filter, rejects domains with invalid MX records, and ignores duplicated recipients. IP addresses which send emails to the configured recipients also get blacklisted for a configured time. The port and hit score can be configured in smtpserver.xml.

• SpamAssassin in the Mailet is designed to classify messages as spam or not against an internal (configurable) score threshold. Usually a message will only be considered spam if it matches multiple criteria; matching just a single test will not usually be enough to reach the threshold.

• BayesianAnalysis in the Mailet uses Bayesian probability to classify mail as spam or not spam. It relies on training data coming from the users' judgment: users manually judge a message as spam and send it to [email protected], or, if it is not spam, to [email protected]. BayesianAnalysisFeeder learns from this training dataset and builds predictive models based on Bayesian probability. A table in the database maintains the corpus frequencies for keywords, and every 10 minutes a thread in BayesianAnalysis checks and updates that table. The correct approach is to send the original spam or non-spam as an attachment to another message sent to the feeder, in order to avoid bias from the current sender's email headers.

2. Motivation

The Naive Bayes approach, based on Bayesian probability, is the most popular method. But for overly popular phrases or keywords, Naive Bayes does not work well. Consider the phrase "Click Here to Unsubscribe", whose words each occur 10 times as often in spam as in good mail: P(click|spam), P(here|spam) and P(unsubscribe|spam) are each 10 times higher. Multiplied together under the independence assumption, they contribute a factor of 1000, even though the three words carry essentially one piece of evidence. The bias from such hot keywords heavily hurts the accuracy of the filter.
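In formula form (a sketch of the standard naive Bayes odds calculation; the per-word factor of 10 is the hypothetical figure from the example above):

    \frac{P(\mathrm{spam} \mid \mathrm{click}, \mathrm{here}, \mathrm{unsubscribe})}
         {P(\mathrm{ham} \mid \mathrm{click}, \mathrm{here}, \mathrm{unsubscribe})}
    \propto \prod_{w} \frac{P(w \mid \mathrm{spam})}{P(w \mid \mathrm{ham})}
    = 10 \times 10 \times 10 = 1000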
I want to implement K-nearest neighbor and Light SVM, which show much better experimental results [1] than our current approach. The measurements are filtering accuracy and the efficiency of building the model on training data and evaluating it on test data. This project will be the starting point for me to contribute to the James open source community. Beyond anti-spam functions, and as a long-term consideration, I plan to design intelligent ways to filter mail directly with a "Priority" rank or mark, as already implemented by Google in the Gmail system.

Intelligent email filtering is a practical issue more than a theoretical technique. I can contribute strong, scalable analytical reasoning and design foresight to this community. My aim here is to apply machine learning methods in the James mail system efficiently and easily.

3. Project Scope

The project will be based on the James Server 3.0 framework, and for now it will restrict the categories to spam/not spam.

• Implement tokenization and stemming pre-processing.
• Implement "TF-IDF" and "binary vector" feature selection.
• Implement the K-nearest neighbor algorithm to power the new anti-spam filter.
• Implement the Light SVM algorithm to power the anti-spam filter.

4. Approach

My idea is to use the K-nearest neighbor algorithm to classify mail as spam or non-spam. Computing distances between documents in a term space, KNN shows better experimental performance in accuracy and efficiency [1].

Tokenization and stemming pre-processing

This part will be implemented integrated with the Mime4J parser (see the sketch after the Feature Selection section).
Tokenization: separate text into individual words. For example, the sentence "I'm working on the proposal" will turn into "I'm", "working", "on", "the", "proposal".
Stop word removal: remove common words that are usually not useful for text classification, such as "a", "the", "I", "he", "she", "is", "are", etc.
Stemming: normalize words derived from the same root. For example, "attending" turns into "attend" and "teacher" turns into "teach".

Feature Selection

Both "TF-IDF" (case insensitive) and "binary vector" (case insensitive) feature selection will be implemented.

TF-IDF: The tf-idf weight (term frequency-inverse document frequency) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance to a user query, so it can avoid the bias in the Bayesian approach. The TF-IDF parameters will be tuned by experiment, and can also be configured manually by users.

Weighted TF-IDF: Based on TF-IDF, we can set different weights depending on a word's position in the mail structure. My assumption is that words in the message header are more important than those in the body.

Binary vector: maintains, for each training example, whether each feature (term) exists or not (1 or 0). The main variable here is the size of the vector, which we will choose with the available memory size in mind.

Input variables: vector size, ratio for TF-IDF, and weights for word positions.
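As an illustration of the pre-processing and TF-IDF steps above, here is a minimal Java sketch. It is not an existing James or Mime4J API: the class Preprocessor, its methods, the toy stop-word list and the one-rule stemmer are hypothetical stand-ins for what the real implementation would do.

    import java.util.*;

    // Illustrative sketch only: tokenization, stop-word removal and TF-IDF.
    public class Preprocessor {
        private static final Set<String> STOP_WORDS =
                Set.of("a", "the", "i", "he", "she", "is", "are", "to", "on");

        // Tokenize: lower-case, split on non-letters, drop stop words, stem.
        public static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            for (String t : text.toLowerCase().split("[^a-z']+")) {
                if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                    tokens.add(stem(t));
                }
            }
            return tokens;
        }

        // Toy stemmer: strips two common suffixes ("attending" -> "attend",
        // "teacher" -> "teach"); a real implementation would use Porter stemming.
        private static String stem(String word) {
            if (word.endsWith("ing") && word.length() > 5) return word.substring(0, word.length() - 3);
            if (word.endsWith("er") && word.length() > 4)  return word.substring(0, word.length() - 2);
            return word;
        }

        // TF-IDF weight of a term in one document, given the whole corpus.
        public static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
            double tf = Collections.frequency(doc, term) / (double) doc.size();
            long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
            double idf = Math.log(corpus.size() / (1.0 + docsWithTerm));
            return tf * idf;
        }
    }

With this sketch, tokenize("I'm working on the proposal") yields ["i'm", "work", "proposal"] after stop-word removal and stemming.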
Feature extraction

Good ideas do not always scale. In my approach, good feature construction will ensure efficiency in model building and accuracy in model prediction. Several approaches will be tested for this project, including mutual information [3], PCA [6] and Linear Discriminant Analysis (LDA) [7], which are used for dimensionality reduction before classification. The mutual information of each word with the class variable has been used to select important features: it captures the amount of information gained in classifying documents by knowing the presence or absence of the word. PCA is the simplest of the true eigenvector-based multivariate analyses. LDA assumes each class is Gaussian distributed with different means but the same covariance; it is probably a good idea to plot some of the data beforehand to make sure this assumption is reasonable.

SVM

Support Vector Machine is a powerful machine learning technique that is able to work with high-dimensional and noisy data. SVM uses kernel functions to map data points into a higher dimension and classifies the data with a hyperplane, the decision surface in that high-dimensional space. I will use LibSVM [5] for the implementation.

SVM method steps:
1) Use folding to select the top terms, or a certain number of terms, by document frequency.
2) Create "TF-IDF" vectors for each training example.
3) Train an SVM to obtain a predictive model.

K-Nearest Neighbor (KNN)

This method still needs training data to feed the predictive model, and it can share the training data with BayesianAnalysis. The k-nearest neighbors algorithm (KNN) classifies objects based on the closest training examples in the feature space. It is among the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small).

KNN will use the same spam/non-spam training data provided manually by users, but the KNN model will be efficient to train. The complexity of computing the model depends on two variables, k and the number of keywords [4]. For our case k is 2, since we only classify spam/non-spam. Building the model is O(N*N), since the distance between each pair of examples is computed, where N is the training set size.

KNN method implementation (a sketch follows this list):
• Method 1: Create a "binary vector" that records whether each feature (term) exists or not (1 or 0) for each training example. Create a binary vector for each unclassified mail too, then use cosine distance to find the top 2 closest examples in the training set and determine which class the unclassified mail belongs to. In detail: compute the distance (cosine, or alternatively Euclidean) from the target mail to the sampled training examples, order the samples by distance, choose the heuristically optimal 2 nearest neighbors based on the RMSE obtained by cross-validation, and calculate an inverse distance weighted vote over the k nearest neighbors.
• Method 2: Use "TF-IDF" vectors instead of the binary vectors of Method 1.
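To make Method 1 concrete, here is a minimal Java sketch of the binary-vector KNN classifier, assuming mails have already been converted to term vectors by the pre-processing above. BinaryKnn and its members are hypothetical names, not part of James, and the vote is a simple majority rather than the inverse distance weighting described above, for brevity.

    import java.util.*;

    // Sketch of KNN "Method 1": binary term vectors compared by cosine similarity.
    public class BinaryKnn {

        private static class Example {
            final double[] vector;
            final boolean spam;
            Example(double[] vector, boolean spam) { this.vector = vector; this.spam = spam; }
        }

        private final List<Example> training = new ArrayList<>();
        private final int k;

        public BinaryKnn(int k) { this.k = k; }

        public void addTrainingMail(double[] vector, boolean spam) {
            training.add(new Example(vector, spam));
        }

        // Majority vote among the k most similar training mails.
        public boolean isSpam(double[] query) {
            long spamVotes = training.stream()
                    .sorted(Comparator.comparingDouble((Example e) -> -cosine(e.vector, query)))
                    .limit(k)
                    .filter(e -> e.spam)
                    .count();
            return spamVotes * 2 > k;   // strict majority of the k neighbours
        }

        // Cosine similarity; for 0/1 vectors this is |A and B| / sqrt(|A| * |B|).
        private static double cosine(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na += a[i] * a[i];
                nb += b[i] * b[i];
            }
            return dot / (Math.sqrt(na * nb) + 1e-12);
        }
    }

With new BinaryKnn(2), as proposed above, a mail is flagged only when both nearest neighbours are spam; the inverse-distance-weighted vote from the method description would replace the simple count here.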
Schedule

1 week: understand the Mailets API
1 week: implement pre-processing
1 week: implement feature selection on the training dataset
2 weeks: implement the KNN method (1 week implementation, 1 week testing)
2 weeks: implement the Light SVM method (1 week implementation, 1 week testing)
2 weeks: tune the predictive model with different parameters
1 week: finish the documentation and guide for using and tuning

My Bio

Yu Fu, PhD student at the University of Maryland, Baltimore County. I have more than two years of research experience in data mining. I am currently designing a new secure data mining algorithm for encrypted databases, funded by NSF. I have implemented regression trees, KNN and other data mining algorithms on top of the Weka API to support my experiments, and I have three publications on preserving privacy and security in data mining.

References

[1] Anti-Spam Filter Based on Naive Bayes, SVM, and KNN model. irlab.csie.ntu.edu.tw/~chean/perWeb/projects/AI/AI_final.pdf
[2] An Empirical Performance Comparison of Machine Learning Methods for Spam E-Mail Categorization. http://www.computer.org/portal/web/csdl/doi/10.1109/ICHIS.2004.21
[3] Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.1259&rep=rep1&type=pdf
[4] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-and-optimality-of-knn-1.html#tab:knncomp
[5] LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[6] Jolliffe, I. T. (1986). Principal Component Analysis. Springer-Verlag. doi:10.1007/b98835. ISBN 978-0-387-95442-4.
[7] McLachlan, G. (2004). Discriminant Analysis and Statistical Pattern Recognition. Wiley Interscience.

--
Yu Fu
[email protected]
443-388-6654
