[
https://issues.apache.org/jira/browse/JAMES-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025023#comment-13025023
]
Vicki Fu commented on JAMES-1216:
---------------------------------
Proposal Title: Design and implement machine learning filters and
categorization for anti spam functions in the mail system
https://issues.apache.org/jira/browse/JAMES-1216
Student Name: Yu Fu
Student E-mail: [email protected]
Organization/Project: The Apache Software Foundation
Assigned Mentor: Eric Charles, Robert Burrell Donkin
Proposal Abstract: Use advanced machine learning methods, k-nearest
neighbors and SVM, to implement an intelligent mail-filtering feature, and add
and test pre-processing, feature selection, and the ML algorithms themselves, in
order to strengthen the anti-spam capability beyond the current Bayesian
analysis method.
Detailed Description:
1. Understand James Anti-Spam Features
The current James anti-spam system has three parts, which I have marked in red
in the figure below. The figure also shows how the basic concepts relate to one
another.
The first part, fastfail in the SMTPServer, rejects a message at the SMTP level
once a configured hit count is reached. It operates in the background of a
machine, listening on a specific TCP port. At this level the server filters out
non-existent users, applies a DSN filter, rejects domains with invalid MX
records, and ignores duplicated recipients. IP addresses that send mail to the
configured recipients are blacklisted for a configured time. The port and the
hit score can be set in smtpserver.xml.
SpamAssassin in the Mailet is designed to classify the messages as spam or not
with an internal (configurable) score threshold. Usually a message will only be
considered as spam if it matches multiple criteria; matching just a single test
will not usually be enough to reach the threshold.
BayesianAnalysis in the Mailet uses Bayesian probability to classify mail as
spam or not spam. It relies on training data coming from the users' judgment:
users manually judge a message as spam and send it to [email protected], or,
conversely, if it is not spam, send it to [email protected].
BayesianAnalysisFeeder learns from this training dataset and builds a
predictive model based on Bayesian probability. A dedicated database table
maintains the corpus frequency for each keyword, and every 10 minutes a thread
in BayesianAnalysis checks and updates that table. The correct approach is to
send the original spam or non-spam message as an attachment to another message
addressed to the feeder, in order to avoid bias from the current sender's email
header.
2. Motivation
The Naive Bayes approach, based on Bayesian probability, is the most popular
method, but it does not work well for overly popular phrases or keywords.
Consider the phrase "Click Here to Unsubscribe", where each word occurs 10
times as often in spam as in good mail: P(click|spam) is 10 times higher and
P(unsubscribe|spam) is 10 times higher, so multiplying them together gives a
factor of 100 even though the two words carry essentially the same evidence.
The bias from such hot keywords heavily hurts the accuracy of the filter.
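The inflation can be seen in a few lines of arithmetic (illustrative Python
sketch with made-up probabilities; the actual mailet code would be Java):

```python
# Illustrative arithmetic only: how Naive Bayes' independence assumption
# compounds correlated evidence. All probabilities are made up.
p_click_spam, p_click_ham = 0.10, 0.01   # "click" appears 10x as often in spam
p_unsub_spam, p_unsub_ham = 0.10, 0.01   # "unsubscribe" likewise

lr_click = p_click_spam / p_click_ham    # likelihood ratio of ~10
lr_unsub = p_unsub_spam / p_unsub_ham    # likelihood ratio of ~10

# Naive Bayes multiplies the ratios as if the words were independent,
# although "click" and "unsubscribe" almost always co-occur, so the
# same evidence is counted twice:
combined = lr_click * lr_unsub
```

Because the two words nearly always appear together, the second ratio adds no
new evidence, yet the combined odds are squared.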
I want to implement k-nearest neighbors and Light SVM, which show much better
experimental results [1] than our current approach. The measurements are
filtering accuracy and the efficiency of building the model on the training
data and evaluating it on the test data.
This project will be the starting point for my contributions to the James open
source community. Beyond anti-spam functions, and for the longer term, I plan
to design intelligent ways to filter mail directly with a "Priority" rank or
mark, as already implemented by Google in Gmail. Intelligent email filtering is
a practical issue more than a theoretical technique. I can contribute strong,
scalable analytical reasoning and forward-looking design to this community. My
aim here is to apply machine learning methods in the James mail system
efficiently and easily.
3. Project Scope
The project will be based on the James Server 3.0 framework and will initially
restrict the categories to spam/not spam.
Implement tokenization and stemming pre-processing.
Implement "TF-IDF" and "binary vector" feature selection.
Implement the k-nearest neighbors algorithm to power the new anti-spam filter.
Implement the Light SVM algorithm to power the anti-spam filter.
4. Approach
My idea is to use the k-nearest neighbors algorithm to classify mail as spam or
non-spam. By computing distances between documents in word space, KNN shows
better experimental accuracy and efficiency [1].
Tokenization and stemming pre-processing
This part will be implemented in integration with the Mime4J parser.
Tokenization: separates text into individual words. For example, the sentence
"I'm working on the proposal" turns into "I'm", "working", "on", "the",
"proposal".
Stop-word removal: removes common words that are usually not useful for text
classification, such as "a", "the", "I", "he", "she", "is", "are", etc.
Stemming: normalizes words derived from the same root. For example,
"attending" turns into "attend" and "teacher" turns into "teach".
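The three pre-processing steps can be sketched as follows (illustrative Python;
the real implementation would live in a Java mailet and use a proper Porter
stemmer, and the stop list here is a made-up sample):

```python
import re

# Made-up sample stop list; a real system would use a much fuller one.
STOP_WORDS = {"a", "the", "i", "i'm", "he", "she", "is", "are", "on"}

def tokenize(text):
    """Separate text into individual lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Drop common words that carry little signal for classification."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Toy suffix-stripping stemmer; a real system would use Porter's algorithm."""
    for suffix in ("ing", "er", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token
```

With these, "I'm working on the proposal" tokenizes to "i'm", "working", "on",
"the", "proposal"; stop-word removal leaves "working", "proposal"; and stemming
maps "attending" to "attend" and "teacher" to "teach".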
Feature Selection
Both "TF-IDF" (case-insensitive) and "binary vector" (case-insensitive)
feature selection will be implemented.
TF-IDF: The tf-idf weight (term frequency-inverse document frequency) is a
statistical measure of how important a word is to a document in a collection or
corpus. The importance increases proportionally with the number of times the
word appears in the document, but is offset by the word's frequency across the
corpus. Variations of the tf-idf weighting scheme are often used by search
engines as a central tool in scoring and ranking a document's relevance to a
user query, so it can avoid the bias seen in the Bayesian approach. The TF-IDF
parameters will be tuned by experiment and can also be configured manually by
users.
Weighted TF-IDF: Building on TF-IDF, we can assign different weights according
to a word's position in the mail structure. My working assumption is that words
in the message header are more important than words in the body.
Binary vector: maintains, for each training example, whether each feature
(term) exists or not (1 or 0). The main variable here is the size of the
vector, which we will choose with the available memory in mind.
Input variables: vector size, ratio for TF-IDF, and weights for word position.
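A minimal sketch of the two representations, assuming documents have already
been tokenized (Python for brevity; the real code would be Java, and the
function names are my own):

```python
import math

def build_vocab(docs, size):
    """Pick the `size` most frequent terms as the vector dimensions
    (this is the vector-size input variable)."""
    counts = {}
    for doc in docs:
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
    # Most frequent first; ties broken alphabetically for determinism.
    return sorted(counts, key=lambda t: (-counts[t], t))[:size]

def binary_vector(doc, vocab):
    """1 if the term occurs in the document, else 0."""
    present = set(doc)
    return [1 if t in present else 0 for t in vocab]

def tfidf_vector(doc, vocab, docs):
    """Term frequency scaled by inverse document frequency."""
    n = len(docs)
    vec = []
    for t in vocab:
        tf = doc.count(t) / len(doc)
        df = sum(1 for d in docs if t in d)       # document frequency
        idf = math.log(n / df) if df else 0.0
        vec.append(tf * idf)
    return vec
```

Note the useful property that a term occurring in every document gets idf
log(1) = 0, so "too popular" words contribute nothing to the vector.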
Feature Extraction
Good ideas do not always scale. In my approach, good feature construction
ensures efficiency in model building and accuracy in model prediction. Several
approaches will be tested in this project, including mutual information [3],
PCA [6], and Linear Discriminant Analysis [7], all used for dimensionality
reduction before classification. The mutual information between each word and
the class variable has been used to select important features: it captures the
amount of information gained in classifying documents by knowing the presence
or absence of the word. PCA is the simplest of the true eigenvector-based
multivariate analyses. LDA assumes each class is Gaussian-distributed with
different means but the same covariance, so it is probably a good idea to plot
some of the data beforehand to make sure this assumption is reasonable.
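The mutual-information criterion can be sketched as follows (Python for
brevity; `docs` holds the term sets of training mails and `labels` their
spam/ham classes, both hypothetical names of my own):

```python
import math

def mutual_information(docs, labels, term):
    """MI between the presence/absence of `term` and the class label.
    High MI means knowing whether the term occurs tells us a lot
    about whether the mail is spam."""
    n = len(docs)
    mi = 0.0
    for present in (True, False):
        for cls in set(labels):
            # Joint probability of (term presence, class).
            joint = sum(1 for d, y in zip(docs, labels)
                        if ((term in d) == present) and y == cls) / n
            p_pres = sum(1 for d in docs if (term in d) == present) / n
            p_cls = labels.count(cls) / n
            if joint > 0:
                mi += joint * math.log(joint / (p_pres * p_cls))
    return mi
```

A term that perfectly predicts the class reaches the maximum MI of log 2 (with
two equally likely classes), while a term scattered across both classes scores
lower, so ranking terms by MI keeps the most discriminative features.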
SVM
Support Vector Machines are a powerful machine learning technique able to cope
with high-dimensional and noisy data. An SVM uses a kernel function to map data
points into a higher-dimensional space and separates them there with a
hyperplane, which corresponds to a complex decision surface in the original
space. I will use LibSVM [5] for the implementation.
The SVM method involves:
1) Using cross-validation folds, select the top terms by document frequency
(up to a configured number).
2) Create a "TF-IDF" vector for each training example.
3) Use the SVM to train on these vectors and obtain a predictive model.
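In practice step 3 would delegate to LibSVM. Purely to illustrate what
"train a model" means here, the following is a minimal linear SVM trained by
sub-gradient descent on the hinge loss (a Python sketch with made-up toy
parameters, not the LibSVM API):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=200):
    """Sub-gradient descent on the L2-regularized hinge loss.
    X holds feature vectors (e.g. tf-idf), y holds +1 (spam) / -1 (ham)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:
                # Margin violated: step toward classifying xi correctly.
                w = [wj + eta * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += eta * yi
            else:
                # Margin satisfied: only apply the regularization shrinkage.
                w = [wj - eta * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if dot(w, x) + b >= 0 else -1
```

The resulting (w, b) is the "predictive model": classification of a new mail
is just the sign of w.x + b, so prediction is cheap even though training
iterates over the corpus.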
K-Nearest Neighbor (KNN)
This method still needs training data to feed the predictive model, and it can
share the training data already collected for BayesianAnalysis. The k-nearest
neighbors algorithm (kNN) classifies objects based on the closest training
examples in the feature space. It is among the simplest of all machine learning
algorithms: an object is classified by a majority vote of its neighbors and
assigned to the class most common among its k nearest neighbors (k is a
positive integer, typically small). KNN will use the same spam/non-spam
training data provided manually by users, but the KNN model can be built
efficiently.
The complexity of computing the model depends on two variables, k and the
number of keywords [4]. In our case there are only two classes, spam and
non-spam. Building the model is O(N*N), since it computes the distance between
each pair of training vectors, where N is the number of training examples.
The KNN method involves:
Method 1: Create a "binary vector" that records, for each training example,
whether each feature (term) exists or not (1 or 0). Create a binary vector for
each unclassified mail as well, then use cosine distance to find the top 2
closest training examples and decide which class the unclassified mail belongs
to: compute the distance from the target to each training example, order the
examples by distance, choose the 2 nearest neighbors (the choice of k can be
tuned against cross-validated RMSE), and take an inverse-distance-weighted vote
among them.
Method 2: Use "TF-IDF" vectors instead of binary vectors in Method 1.
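Method 1 can be sketched as follows (Python for brevity; the proposal takes
the top 2 neighbors, but an odd k avoids voting ties, so this sketch defaults
to k = 3 and uses a simple majority vote, of which the distance-weighted vote
is a straightforward extension):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def knn_classify(vector, training, k=3):
    """training: list of (vector, label) pairs, e.g. binary or tf-idf
    vectors labeled "spam"/"ham". Majority vote of the k most similar."""
    ranked = sorted(training,
                    key=lambda item: cosine_similarity(vector, item[0]),
                    reverse=True)
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)
```

Note there is no training phase at all: the "model" is the stored vectors, and
each classification scans them, which is why feature selection matters for
keeping the vectors (and the scan) small.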
Schedule
1 week: understand the Mailets API
1 week: implement pre-processing
1 week: implement feature selection on the training dataset
2 weeks: implement the KNN method (1 week implementation, 1 week testing)
2 weeks: implement the Light SVM method (1 week implementation, 1 week testing)
2 weeks: tune the predictive model with different parameters
1 week: finish the documentation and the usage and tuning guide
My Bio
Yu Fu, PhD student at the University of Maryland, Baltimore County
I have more than two years of research experience in data mining. I am
currently designing a new secure data mining algorithm for encrypted databases,
funded by the NSF. I have implemented regression trees, KNN, and other data
mining algorithms on top of the Weka API to support my experiments, and I have
three publications on privacy-preserving and secure data mining.
[1] Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model.
irlab.csie.ntu.edu.tw/~chean/perWeb/projects/AI/AI_final.pdf
[2] An Empirical Performance Comparison of Machine Learning Methods for Spam
E-Mail Categorization.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICHIS.2004.21
[3] Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.1259&rep=rep1&type=pdf
[4] Introduction to Information Retrieval, By Christopher D. Manning, Prabhakar
Raghavan & Hinrich Schütze
http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-and-optimality-of-knn-1.html#tab:knncomp
[5] LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[6] Jolliffe, I. T. (1986). Principal Component Analysis. Springer-Verlag. pp.
487. doi:10.1007/b98835. ISBN 978-0-387-95442-4.
[7] McLachlan, G. J. (2004). Discriminant Analysis and Statistical Pattern
Recognition. Wiley Interscience.
> [gsoc2011] Design and implement machine learning filters and categorization
> for mail
> ------------------------------------------------------------------------------------
>
> Key: JAMES-1216
> URL: https://issues.apache.org/jira/browse/JAMES-1216
> Project: JAMES Server
> Issue Type: New Feature
> Reporter: Eric Charles
> Assignee: Eric Charles
> Labels: gsoc2011
>
> Context: Anti-spam functionality based on SpamAssassin is available at James
> (base on mailets http://james.apache.org/mailet). Bayesian mailets are also
> available, but not completely integrated/documented. Nothing is available to
> automatically categorize mail traffic per user.
> Task: We are willing to align the existing implementation with any modern
> anti-spam solution based on powerfull machine learning implementation (such
> as apache mahout). We are also willing to extend the machine learning usage
> to some mail categorization (spam vs not-spam is a first category, we can
> extend it to any additional category we can imagine). The implementation can
> partially occur while spooling the mails and/or when mail is stored in
> mailbox.
> Related discussions: See also discussions on mail intelligent mining on
> http://markmail.org/message/2bodrwvdvtfq3f2v (mahout related) and
> http://markmail.org/thread/pksl6csyvoeo27yh (hama related).
> Mentor: eric at apache dot org & [fill in mentor]
> Complexity: high