[
https://issues.apache.org/jira/browse/JAMES-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025023#comment-13025023
]
Vicki Fu commented on JAMES-1216:
---------------------------------
Proposal Title: Design and implement machine learning filters and
categorization for anti spam functions in the mail system
https://issues.apache.org/jira/browse/JAMES-1216
Student Name: Yu Fu
Student E-mail: [email protected]
Organization/Project: The Apache Software Foundation
Assigned Mentor: Eric Charles, Robert Burrell Donkin
Proposal Abstract: Use advanced machine learning methods, k-nearest
neighbors and SVM, to implement an intelligent mail-filtering feature, and add
and test pre-processing, feature selection, and the ML algorithms themselves, in
order to strengthen the anti-spam capability beyond the current Bayesian
analysis method.
Detailed Description:
1. Understand James Anti-Spam Features
The current James anti-spam system has three parts, which I have marked in red
in the figure below. The figure also shows how the basic concepts relate to one
another.
The first part, fastfail in the SMTPServer, rejects a message at the SMTP level
once a configured hit count is reached. It operates in the background of a
machine, listening on a specific TCP port. At this level the server filters out
non-existent users, applies a DSN filter, rejects domains with invalid MX
records, and ignores duplicated recipients. IP addresses that send mail to the
configured recipients are blacklisted for a configured time. The port and the
hit score can be set in smtpserver.xml.
SpamAssassin in the Mailet is designed to classify the messages as spam or not
with an internal (configurable) score threshold. Usually a message will only be
considered as spam if it matches multiple criteria; matching just a single test
will not usually be enough to reach the threshold.
BayesianAnalysis in the Mailet uses Bayesian probability to classify mail as
spam or not spam. It relies on training data coming from the users' judgment:
users manually judge a message as spam and send it to [email protected], or,
conversely, if it is not spam, send it to [email protected].
BayesianAnalysisFeeder learns from this training dataset and builds a
predictive model based on Bayesian probability. A dedicated database table
maintains the corpus frequency for each keyword, and every 10 minutes a thread
in BayesianAnalysis checks and updates that table. The correct approach is to
send the original spam or non-spam message as an attachment to another message
addressed to the feeder, in order to avoid bias from the current sender's email
header.
2. Motivation
The Naive Bayes approach, based on Bayesian probability, is the most popular
method, but it does not work well for overly popular phrases or keywords.
Consider the phrase "Click Here to Unsubscribe", where each word occurs 10
times as often in spam as in good mail: P(click|spam) is 10 times higher and
P(unsubscribe|spam) is 10 times higher, so multiplying them together gives a
factor of 100 even though the two words carry essentially the same evidence.
The bias from such hot keywords heavily hurts the accuracy of the filter.
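The inflation can be seen in a few lines of arithmetic (illustrative Python
sketch with made-up probabilities; the actual mailet code would be Java):

```python
# Illustrative arithmetic only: how Naive Bayes' independence assumption
# compounds correlated evidence. All probabilities are made up.
p_click_spam, p_click_ham = 0.10, 0.01   # "click" appears 10x as often in spam
p_unsub_spam, p_unsub_ham = 0.10, 0.01   # "unsubscribe" likewise

lr_click = p_click_spam / p_click_ham    # likelihood ratio of ~10
lr_unsub = p_unsub_spam / p_unsub_ham    # likelihood ratio of ~10

# Naive Bayes multiplies the ratios as if the words were independent,
# although "click" and "unsubscribe" almost always co-occur, so the
# same evidence is counted twice:
combined = lr_click * lr_unsub
```

Because the two words nearly always appear together, the second ratio adds no
new evidence, yet the combined odds are squared.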
I want to implement k-nearest neighbors and Light SVM, which show much better
experimental results [1] than our current approach. The measurements are
filtering accuracy and the efficiency of building the model on the training
data and evaluating it on the test data.
This project will be the starting point for my contributions to the James open
source community. Beyond anti-spam functions, and for the longer term, I plan
to design intelligent ways to filter mail directly with a "Priority" rank or
mark, as already implemented by Google in Gmail. Intelligent email filtering is
a practical issue more than a theoretical technique. I can contribute strong,
scalable analytical reasoning and forward-looking design to this community. My
aim here is to apply machine learning methods in the James mail system
efficiently and easily.
3. Project Scope
The project will be based on the James Server 3.0 framework and will initially
restrict the categories to spam/not spam.
Implement tokenization and stemming pre-processing.
Implement "TF-IDF" and "binary vector" feature selection.
Implement the k-nearest neighbors algorithm to power the new anti-spam filter.
Implement the Light SVM algorithm to power the anti-spam filter.
4. Approach
My idea is to use the k-nearest neighbors algorithm to classify mail as spam or
non-spam. By computing distances between documents in word space, KNN shows
better experimental accuracy and efficiency [1].
Tokenization and stemming pre-processing
This part will be implemented in integration with the Mime4J parser.
Tokenization: separates text into individual words. For example, the sentence
"I'm working on the proposal" turns into "I'm", "working", "on", "the",
"proposal".
Stop-word removal: removes common words that are usually not useful for text
classification, such as "a", "the", "I", "he", "she", "is", "are", etc.
Stemming: normalizes words derived from the same root. For example,
"attending" turns into "attend" and "teacher" turns into "teach".
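The three pre-processing steps can be sketched as follows (illustrative Python;
the real implementation would live in a Java mailet and use a proper Porter
stemmer, and the stop list here is a made-up sample):

```python
import re

# Made-up sample stop list; a real system would use a much fuller one.
STOP_WORDS = {"a", "the", "i", "i'm", "he", "she", "is", "are", "on"}

def tokenize(text):
    """Separate text into individual lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Drop common words that carry little signal for classification."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Toy suffix-stripping stemmer; a real system would use Porter's algorithm."""
    for suffix in ("ing", "er", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token
```

With these, "I'm working on the proposal" tokenizes to "i'm", "working", "on",
"the", "proposal"; stop-word removal leaves "working", "proposal"; and stemming
maps "attending" to "attend" and "teacher" to "teach".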
Feature Selection
Both "TF-IDF" (case-insensitive) and "binary vector" (case-insensitive)
feature selection will be implemented.
TF-IDF: The tf-idf weight (term frequency-inverse document frequency) is a
statistical measure of how important a word is to a document in a collection or
corpus. The importance increases proportionally with the number of times the
word appears in the document, but is offset by the word's frequency across the
corpus. Variations of the tf-idf weighting scheme are often used by search
engines as a central tool in scoring and ranking a document's relevance to a
user query, so it can avoid the bias seen in the Bayesian approach. The TF-IDF
parameters will be tuned by experiment and can also be configured manually by
users.
Weighted TF-IDF: Building on TF-IDF, we can assign different weights according
to a word's position in the mail structure. My working assumption is that words
in the message header are more important than words in the body.
Binary vector: maintains, for each training example, whether each feature
(term) exists or not (1 or 0). The main variable here is the size of the
vector, which we will choose with the available memory in mind.
Input variables: vector size, ratio for TF-IDF, and weights for word position.
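A minimal sketch of the two representations, assuming documents have already
been tokenized (Python for brevity; the real code would be Java, and the
function names are my own):

```python
import math

def build_vocab(docs, size):
    """Pick the `size` most frequent terms as the vector dimensions
    (this is the vector-size input variable)."""
    counts = {}
    for doc in docs:
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
    # Most frequent first; ties broken alphabetically for determinism.
    return sorted(counts, key=lambda t: (-counts[t], t))[:size]

def binary_vector(doc, vocab):
    """1 if the term occurs in the document, else 0."""
    present = set(doc)
    return [1 if t in present else 0 for t in vocab]

def tfidf_vector(doc, vocab, docs):
    """Term frequency scaled by inverse document frequency."""
    n = len(docs)
    vec = []
    for t in vocab:
        tf = doc.count(t) / len(doc)
        df = sum(1 for d in docs if t in d)       # document frequency
        idf = math.log(n / df) if df else 0.0
        vec.append(tf * idf)
    return vec
```

Note the useful property that a term occurring in every document gets idf
log(1) = 0, so "too popular" words contribute nothing to the vector.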
Feature Extraction
Good ideas do not always scale. In my approach, good feature construction
ensures efficiency in model building and accuracy in model prediction. Several
approaches will be tested in this project, including mutual information [3],
PCA [6], and Linear Discriminant Analysis [7], all used for dimensionality
reduction before classification. The mutual information between each word and
the class variable has been used to select important features: it captures the
amount of information gained in classifying documents by knowing the presence
or absence of the word. PCA is the simplest of the true eigenvector-based
multivariate analyses. LDA assumes each class is Gaussian-distributed with
different means but the same covariance, so it is probably a good idea to plot
some of the data beforehand to make sure this assumption is reasonable.
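The mutual-information criterion can be sketched as follows (Python for
brevity; `docs` holds the term sets of training mails and `labels` their
spam/ham classes, both hypothetical names of my own):

```python
import math

def mutual_information(docs, labels, term):
    """MI between the presence/absence of `term` and the class label.
    High MI means knowing whether the term occurs tells us a lot
    about whether the mail is spam."""
    n = len(docs)
    mi = 0.0
    for present in (True, False):
        for cls in set(labels):
            # Joint probability of (term presence, class).
            joint = sum(1 for d, y in zip(docs, labels)
                        if ((term in d) == present) and y == cls) / n
            p_pres = sum(1 for d in docs if (term in d) == present) / n
            p_cls = labels.count(cls) / n
            if joint > 0:
                mi += joint * math.log(joint / (p_pres * p_cls))
    return mi
```

A term that perfectly predicts the class reaches the maximum MI of log 2 (with
two equally likely classes), while a term scattered across both classes scores
lower, so ranking terms by MI keeps the most discriminative features.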
SVM
Support Vector Machines are a powerful machine learning technique able to cope
with high-dimensional and noisy data. An SVM uses a kernel function to map data
points into a higher-dimensional space and separates them there with a
hyperplane, which corresponds to a complex decision surface in the original
space. I will use LibSVM [5] for the implementation.
The SVM method involves:
1) Using cross-validation folds, select the top terms by document frequency
(up to a configured number).
2) Create a "TF-IDF" vector for each training example.
3) Use the SVM to train on these vectors and obtain a predictive model.
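In practice step 3 would delegate to LibSVM. Purely to illustrate what
"train a model" means here, the following is a minimal linear SVM trained by
sub-gradient descent on the hinge loss (a Python sketch with made-up toy
parameters, not the LibSVM API):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=200):
    """Sub-gradient descent on the L2-regularized hinge loss.
    X holds feature vectors (e.g. tf-idf), y holds +1 (spam) / -1 (ham)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:
                # Margin violated: step toward classifying xi correctly.
                w = [wj + eta * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += eta * yi
            else:
                # Margin satisfied: only apply the regularization shrinkage.
                w = [wj - eta * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if dot(w, x) + b >= 0 else -1
```

The resulting (w, b) is the "predictive model": classification of a new mail
is just the sign of w.x + b, so prediction is cheap even though training
iterates over the corpus.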
K-Nearest Neighbor (KNN)
This method still needs training data to feed the predictive model, and it can
share the training data already collected for BayesianAnalysis. The k-nearest
neighbors algorithm (kNN) classifies objects based on the closest training
examples in the feature space. It is among the simplest of all machine learning
algorithms: an object is classified by a majority vote of its neighbors and
assigned to the class most common among its k nearest neighbors (k is a
positive integer, typically small). KNN will use the same spam/non-spam
training data provided manually by users, but the KNN model can be built
efficiently.
The complexity of computing the model depends on two variables, k and the
number of keywords [4]. In our case there are only two classes, spam and
non-spam. Building the model is O(N*N), since it computes the distance between
each pair of training vectors, where N is the number of training examples.
The KNN method involves:
Method 1: Create a "binary vector" that records, for each training example,
whether each feature (term) exists or not (1 or 0). Create a binary vector for
each unclassified mail as well, then use cosine distance to find the top 2
closest training examples and decide which class the unclassified mail belongs
to: compute the distance from the target to each training example, order the
examples by distance, choose the 2 nearest neighbors (the choice of k can be
tuned against cross-validated RMSE), and take an inverse-distance-weighted vote
among them.
Method 2: Use "TF-IDF" vectors instead of binary vectors in Method 1.
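Method 1 can be sketched as follows (Python for brevity; the proposal takes
the top 2 neighbors, but an odd k avoids voting ties, so this sketch defaults
to k = 3 and uses a simple majority vote, of which the distance-weighted vote
is a straightforward extension):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def knn_classify(vector, training, k=3):
    """training: list of (vector, label) pairs, e.g. binary or tf-idf
    vectors labeled "spam"/"ham". Majority vote of the k most similar."""
    ranked = sorted(training,
                    key=lambda item: cosine_similarity(vector, item[0]),
                    reverse=True)
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)
```

Note there is no training phase at all: the "model" is the stored vectors, and
each classification scans them, which is why feature selection matters for
keeping the vectors (and the scan) small.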
Schedule
1 week: understand the Mailets API
1 week: implement pre-processing
1 week: implement feature selection on the training dataset
2 weeks: implement the KNN method (1 week implementation, 1 week testing)
2 weeks: implement the Light SVM method (1 week implementation, 1 week testing)
2 weeks: tune the predictive model with different parameters
1 week: finish the documentation and the usage and tuning guide
My Bio
Yu Fu, PhD student at the University of Maryland, Baltimore County
I have more than two years of research experience in data mining. I am
currently designing a new secure data mining algorithm for encrypted databases,
funded by the NSF. I have implemented regression trees, KNN, and other data
mining algorithms on top of the Weka API to support my experiments, and I have
three publications on privacy-preserving and secure data mining.
[1] Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model.
irlab.csie.ntu.edu.tw/~chean/perWeb/projects/AI/AI_final.pdf
[2] An Empirical Performance Comparison of Machine Learning Methods for Spam
E-Mail Categorization.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICHIS.2004.21
[3] Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.1259&rep=rep1&type=pdf
[4] Introduction to Information Retrieval, By Christopher D. Manning, Prabhakar
Raghavan & Hinrich Schütze
http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-and-optimality-of-knn-1.html#tab:knncomp
[5] LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[6] Jolliffe, I. T. (1986). Principal Component Analysis. Springer-Verlag. pp.
487. doi:10.1007/b98835. ISBN 978-0-387-95442-4.
[7] McLachlan, G. J. (2004). Discriminant Analysis and Statistical Pattern
Recognition. Wiley Interscience.
> [gsoc2011] Design and implement machine learning filters and categorization
> for mail
> ------------------------------------------------------------------------------------
>
> Key: JAMES-1216
> URL: https://issues.apache.org/jira/browse/JAMES-1216
> Project: JAMES Server
> Issue Type: New Feature
> Reporter: Eric Charles
> Assignee: Eric Charles
> Labels: gsoc2011
>
> Context: Anti-spam functionality based on SpamAssassin is available at James
> (base on mailets http://james.apache.org/mailet). Bayesian mailets are also
> available, but not completely integrated/documented. Nothing is available to
> automatically categorize mail traffic per user.
> Task: We are willing to align the existing implementation with any modern
> anti-spam solution based on powerfull machine learning implementation (such
> as apache mahout). We are also willing to extend the machine learning usage
> to some mail categorization (spam vs not-spam is a first category, we can
> extend it to any additional category we can imagine). The implementation can
> partially occur while spooling the mails and/or when mail is stored in
> mailbox.
> Related discussions: See also discussions on mail intelligent mining on
> http://markmail.org/message/2bodrwvdvtfq3f2v (mahout related) and
> http://markmail.org/thread/pksl6csyvoeo27yh (hama related).
> Mentor: eric at apache dot org & [fill in mentor]
> Complexity: high