Thanks Eric and Robert. I have updated my proposal again as follows: 1. Feature Selection: I will consider other approaches here too. 2. Linked to JAMES-1216. 3. Fixed spelling and writing errors.
Attached:

Proposal Title: Design and Implement Machine Learning Filters and Categorization for Anti-Spam Functions in the Mail System
Student Name: Yu Fu
Student E-mail: [email protected]
Organization/Project: The Apache Software Foundation
Assigned Mentor: Eric Charles, Robert Burrell Donkin

Proposal Abstract:
Use advanced machine learning methods, K-nearest neighbors (KNN) and SVM, to implement an intelligent mail filtering feature, adding and testing pre-processing, feature selection and ML algorithms, in order to strengthen the anti-spam capability beyond the current Bayesian analysis method.

Detailed Description:

1. Understand James Anti-Spam Features

The current James anti-spam system has three parts, which I marked in red in the figure below. The figure also shows the overall relationships between the basic concepts.

• The first part, fastfail in the SMTPServer, rejects a message at the SMTP level once a configured number of hits is reached. It operates in the background of a machine, listening on a specific TCP port. At this level, the server filters non-existent users, applies the DSN filter, rejects domains with invalid MX records, and ignores duplicated recipients. IP addresses which send emails to the configured recipients also get blacklisted for a configured time. The port and hit score can be configured in smtpserver.xml.

• SpamAssassin in the Mailet is designed to classify messages as spam or not against an internal (configurable) score threshold. Usually a message will only be considered spam if it matches multiple criteria; matching just a single test will not usually be enough to reach the threshold.

• BayesianAnalysis in the Mailet uses Bayesian probability to classify mail as spam or not spam. It relies on training data coming from the users' judgment: users manually judge a message as spam and send it to [email protected], or, if it is not spam, to [email protected]. BayesianAnalysisFeeder learns from this training dataset and builds predictive models based on Bayesian probability. A table in the database maintains the corpus frequencies for keywords, and every 10 minutes a thread in BayesianAnalysis checks and updates that table. The correct approach is to send the original spam or non-spam as an attachment to another message sent to the feeder, in order to avoid bias from the current sender's email headers.

2. Motivation

The Naive Bayes approach, based on Bayesian probability, is the most popular method. But for overly popular phrases or keywords, Naive Bayes does not work well. Consider the phrase "Click Here to Unsubscribe", whose words each occur 10 times as often in spam as in good mail: P(click|spam), P(here|spam) and P(unsubscribe|spam) are each 10 times higher. Multiplied together under the independence assumption, they contribute a factor of 1000, even though the three words carry essentially one piece of evidence. The bias from such hot keywords heavily hurts the accuracy of the filter.
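In formula form (a sketch of the standard naive Bayes odds calculation; the per-word factor of 10 is the hypothetical figure from the example above):

    \frac{P(\mathrm{spam} \mid \mathrm{click}, \mathrm{here}, \mathrm{unsubscribe})}
         {P(\mathrm{ham} \mid \mathrm{click}, \mathrm{here}, \mathrm{unsubscribe})}
    \propto \prod_{w} \frac{P(w \mid \mathrm{spam})}{P(w \mid \mathrm{ham})}
    = 10 \times 10 \times 10 = 1000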
I want to implement K-nearest neighbor and Light SVM, which show much better experimental results [1] than our current approach. The measurements are filtering accuracy and the efficiency of building the model on training data and evaluating it on test data. This project will be the starting point for me to contribute to the James open source community. Beyond anti-spam functions, and as a long-term consideration, I plan to design intelligent ways to filter mail directly with a "Priority" rank or mark, as already implemented by Google in the Gmail system.

Intelligent email filtering is a practical issue more than a theoretical technique. I can contribute strong, scalable analytical reasoning and design foresight to this community. My aim here is to apply machine learning methods in the James mail system efficiently and easily.

3. Project Scope

The project will be based on the James Server 3.0 framework, and for now it will restrict the categories to spam/not spam.

• Implement tokenization and stemming pre-processing.
• Implement "TF-IDF" and "binary vector" feature selection.
• Implement the K-nearest neighbor algorithm to power the new anti-spam filter.
• Implement the Light SVM algorithm to power the anti-spam filter.

4. Approach

My idea is to use the K-nearest neighbor algorithm to classify mail as spam or non-spam. Computing distances between documents in a term space, KNN shows better experimental performance in accuracy and efficiency [1].

Tokenization and stemming pre-processing

This part will be implemented integrated with the Mime4J parser (see the sketch after the Feature Selection section).
Tokenization: separate text into individual words. For example, the sentence "I'm working on the proposal" will turn into "I'm", "working", "on", "the", "proposal".
Stop word removal: remove common words that are usually not useful for text classification, such as "a", "the", "I", "he", "she", "is", "are", etc.
Stemming: normalize words derived from the same root. For example, "attending" turns into "attend" and "teacher" turns into "teach".

Feature Selection

Both "TF-IDF" (case insensitive) and "binary vector" (case insensitive) feature selection will be implemented.

TF-IDF: The tf-idf weight (term frequency-inverse document frequency) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance to a user query, so it can avoid the bias in the Bayesian approach. The TF-IDF parameters will be tuned by experiment, and can also be configured manually by users.

Weighted TF-IDF: Based on TF-IDF, we can set different weights depending on a word's position in the mail structure. My assumption is that words in the message header are more important than those in the body.

Binary vector: maintains, for each training example, whether each feature (term) exists or not (1 or 0). The main variable here is the size of the vector, which we will choose with the available memory size in mind.

Input variables: vector size, ratio for TF-IDF, and weights for word positions.
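As an illustration of the pre-processing and TF-IDF steps above, here is a minimal Java sketch. It is not an existing James or Mime4J API: the class Preprocessor, its methods, the toy stop-word list and the one-rule stemmer are hypothetical stand-ins for what the real implementation would do.

    import java.util.*;

    // Illustrative sketch only: tokenization, stop-word removal and TF-IDF.
    public class Preprocessor {
        private static final Set<String> STOP_WORDS =
                Set.of("a", "the", "i", "he", "she", "is", "are", "to", "on");

        // Tokenize: lower-case, split on non-letters, drop stop words, stem.
        public static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            for (String t : text.toLowerCase().split("[^a-z']+")) {
                if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                    tokens.add(stem(t));
                }
            }
            return tokens;
        }

        // Toy stemmer: strips two common suffixes ("attending" -> "attend",
        // "teacher" -> "teach"); a real implementation would use Porter stemming.
        private static String stem(String word) {
            if (word.endsWith("ing") && word.length() > 5) return word.substring(0, word.length() - 3);
            if (word.endsWith("er") && word.length() > 4)  return word.substring(0, word.length() - 2);
            return word;
        }

        // TF-IDF weight of a term in one document, given the whole corpus.
        public static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
            double tf = Collections.frequency(doc, term) / (double) doc.size();
            long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
            double idf = Math.log(corpus.size() / (1.0 + docsWithTerm));
            return tf * idf;
        }
    }

With this sketch, tokenize("I'm working on the proposal") yields ["i'm", "work", "proposal"] after stop-word removal and stemming.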
Feature extraction

Good ideas do not always scale. In my approach, good feature construction will ensure efficiency in model building and accuracy in model prediction. Several approaches will be tested for this project, including mutual information [3], PCA [6] and Linear Discriminant Analysis (LDA) [7], which are used for dimensionality reduction before classification. The mutual information of each word with the class variable has been used to select important features: it captures the amount of information gained in classifying documents by knowing the presence or absence of the word. PCA is the simplest of the true eigenvector-based multivariate analyses. LDA assumes each class is Gaussian distributed with different means but the same covariance; it is probably a good idea to plot some of the data beforehand to make sure this assumption is reasonable.

SVM

Support Vector Machine is a powerful machine learning technique that is able to work with high-dimensional and noisy data. SVM uses kernel functions to map data points into a higher dimension and classifies the data with a hyperplane, the decision surface in that high-dimensional space. I will use LibSVM [5] for the implementation.

SVM method steps:
1) Use folding to select the top terms, or a certain number of terms, by document frequency.
2) Create "TF-IDF" vectors for each training example.
3) Train an SVM to obtain a predictive model.

K-Nearest Neighbor (KNN)

This method still needs training data to feed the predictive model, and it can share the training data with BayesianAnalysis. The k-nearest neighbors algorithm (KNN) classifies objects based on the closest training examples in the feature space. It is among the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small).

KNN will use the same spam/non-spam training data provided manually by users, but the KNN model will be efficient to train. The complexity of computing the model depends on two variables, k and the number of keywords [4]. For our case k is 2, since we only classify spam/non-spam. Building the model is O(N*N), since the distance between each pair of examples is computed, where N is the training set size.

KNN method implementation (a sketch follows this list):
• Method 1: Create a "binary vector" that records whether each feature (term) exists or not (1 or 0) for each training example. Create a binary vector for each unclassified mail too, then use cosine distance to find the top 2 closest examples in the training set and determine which class the unclassified mail belongs to. In detail: compute the distance (cosine, or alternatively Euclidean) from the target mail to the sampled training examples, order the samples by distance, choose the heuristically optimal 2 nearest neighbors based on the RMSE obtained by cross-validation, and calculate an inverse distance weighted vote over the k nearest neighbors.
• Method 2: Use "TF-IDF" vectors instead of the binary vectors of Method 1.
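To make Method 1 concrete, here is a minimal Java sketch of the binary-vector KNN classifier, assuming mails have already been converted to term vectors by the pre-processing above. BinaryKnn and its members are hypothetical names, not part of James, and the vote is a simple majority rather than the inverse distance weighting described above, for brevity.

    import java.util.*;

    // Sketch of KNN "Method 1": binary term vectors compared by cosine similarity.
    public class BinaryKnn {

        private static class Example {
            final double[] vector;
            final boolean spam;
            Example(double[] vector, boolean spam) { this.vector = vector; this.spam = spam; }
        }

        private final List<Example> training = new ArrayList<>();
        private final int k;

        public BinaryKnn(int k) { this.k = k; }

        public void addTrainingMail(double[] vector, boolean spam) {
            training.add(new Example(vector, spam));
        }

        // Majority vote among the k most similar training mails.
        public boolean isSpam(double[] query) {
            long spamVotes = training.stream()
                    .sorted(Comparator.comparingDouble((Example e) -> -cosine(e.vector, query)))
                    .limit(k)
                    .filter(e -> e.spam)
                    .count();
            return spamVotes * 2 > k;   // strict majority of the k neighbours
        }

        // Cosine similarity; for 0/1 vectors this is |A and B| / sqrt(|A| * |B|).
        private static double cosine(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na += a[i] * a[i];
                nb += b[i] * b[i];
            }
            return dot / (Math.sqrt(na * nb) + 1e-12);
        }
    }

With new BinaryKnn(2), as proposed above, a mail is flagged only when both nearest neighbours are spam; the inverse-distance-weighted vote from the method description would replace the simple count here.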
Schedule

1 week: understand the Mailets API
1 week: implement pre-processing
1 week: implement feature selection on the training dataset
2 weeks: implement the KNN method (1 week implementation, 1 week testing)
2 weeks: implement the Light SVM method (1 week implementation, 1 week testing)
2 weeks: tune the predictive model with different parameters
1 week: finish the documentation and guide for using and tuning

My Bio

Yu Fu, PhD student at the University of Maryland, Baltimore County. I have more than two years of research experience in data mining. I am currently designing a new secure data mining algorithm for encrypted databases, funded by NSF. I have implemented regression trees, KNN and other data mining algorithms on top of the Weka API to support my experiments, and I have three publications on preserving privacy and security in data mining.

References

[1] Anti-Spam Filter Based on Naive Bayes, SVM, and KNN model. irlab.csie.ntu.edu.tw/~chean/perWeb/projects/AI/AI_final.pdf
[2] An Empirical Performance Comparison of Machine Learning Methods for Spam E-Mail Categorization. http://www.computer.org/portal/web/csdl/doi/10.1109/ICHIS.2004.21
[3] Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.1259&rep=rep1&type=pdf
[4] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-and-optimality-of-knn-1.html#tab:knncomp
[5] LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[6] Jolliffe, I. T. (1986). Principal Component Analysis. Springer-Verlag. doi:10.1007/b98835. ISBN 978-0-387-95442-4.
[7] McLachlan, G. (2004). Discriminant Analysis and Statistical Pattern Recognition. Wiley Interscience.

--
Yu Fu
[email protected]
443-388-6654
