Hi everyone, this is Vicki. The deadline for proposal submission is getting close. Thanks to Eric and Robert for the valuable comments. I have some updates:
- Implemented tokenization and stemming pre-processing.
- Implemented the SVMlight algorithm to strengthen the anti-spam filter.
- Considered an "out of the box" design for a Priority Mail filter.

I have attached the proposal below:

Proposal Title: Design and Implement Machine Learning Filters and Categorization for Anti-Spam in the Mail System
Student Name: Yu Fu
Student E-mail: [email protected]
Organization/Project: The Apache Software Foundation
Assigned Mentors: Eric Charles, Robert Burrell Donkin

Proposal Abstract:
Use the advanced machine learning methods k-nearest neighbors and SVM to implement an intelligent mail-filtering feature, strengthening anti-spam capability beyond the current Bayesian analysis method. The filter will be implemented and tested with pre-processing, feature selection, and the ML algorithms.

Detailed Description:

1. Understand James Anti-Spam Features

The current James anti-spam pipeline has three parts, which I have marked in red in the figure below. The figure also shows the relationships between the basic concepts.

- The first part, fastfail in the SMTPServer, rejects a message at the SMTP level once a configured number of hits is reached. It operates in the background of a machine, listening on a specific TCP port. At this level the server filters out non-existent users, applies the DSN filter, rejects domains with an invalid MX record, and ignores duplicated recipients. IP addresses that send email to the configured recipients are blacklisted for a configured time. The port and hit score can be set in smtpserver.xml.
- The SpamAssassin mailet classifies messages as spam or not against an internal (configurable) score threshold. Usually a message is considered spam only if it matches multiple criteria; matching a single test is not normally enough to reach the threshold.
- The BayesianAnalysis mailet classifies messages as spam or not based on Bayesian probability. It relies on training data built from the users' judgments.
Users manually judge a message as spam and send it to [email protected], or judge it as non-spam and send it to [email protected]. BayesianAnalysisFeeder learns from this training dataset and builds a predictive model on Bayesian probability. A table in the database maintains the corpus frequency for keywords; every 10 minutes a thread in BayesianAnalysis checks and updates the table. The correct approach is to forward the original spam or non-spam message as an attachment to the feeder, to avoid bias from the current sender's email headers.

2. Motivation

The naive Bayes approach, based on Bayesian probability, is the most popular method, but it does not work well for overly popular phrases or keywords. Consider the phrase "Click Here to Unsubscribe", whose tokens each occur 10 times as often in spam as in good mail: P(click|spam), P(here|spam), and P(unsubscribe|spam) are each 10 times higher, and multiplying them together gives a factor of 1000. The bias from these hot keywords badly hurts the accuracy of the filter. I want to implement k-nearest neighbor and SVMlight, which show much better experimental results [1] than our current approach. The measurements are the accuracy of the filtering and the efficiency of building the model on training data and testing it on test data.

This project will be my starting point for contributing to the James open-source community. Beyond anti-spam, in the longer term I plan to design an intelligent way to filter mail directly with a "Priority" rank or mark, as Google already does in Gmail. Intelligent email filtering is more a practical issue than a theoretical technique. I can contribute strong and scalable analysis, reasoning capability, and design foresight to this community. My aim is to bring machine learning methods into the James mail system efficiently and easily.
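As a sanity check on the arithmetic in the motivation above, a few lines of Java show how naive Bayes multiplies per-token likelihood ratios; the 10x ratios are illustrative, not measured from any corpus:

```java
public class BayesBias {
    // Product of per-token likelihood ratios P(token|spam) / P(token|ham).
    // Naive Bayes assumes tokens are independent, so correlated tokens from
    // one stock phrase all multiply into the final score.
    static double phraseFactor(double... ratios) {
        double product = 1.0;
        for (double r : ratios) {
            product *= r;
        }
        return product;
    }

    public static void main(String[] args) {
        // Assume "click", "here", and "unsubscribe" are each 10x as frequent
        // in spam as in ham (illustrative values).
        System.out.println(phraseFactor(10.0, 10.0, 10.0)); // prints 1000.0
    }
}
```

Three correlated tokens from a single boilerplate phrase thus contribute a factor of 1000 on their own, which is the bias the KNN and SVM approaches are meant to avoid.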
3. Project Scope

The project will be based on the James Server 3.0 framework, and for now it restricts the categories to spam/not spam.

- Implement tokenization and stemming pre-processing.
- Implement TF-IDF and binary-vector feature selection.
- Implement the k-nearest neighbor algorithm for the new anti-spam filter.
- Implement the SVMlight algorithm for the anti-spam filter.

4. Approach

My idea is to use the k-nearest neighbor algorithm to classify mail as spam or non-spam. By computing distances between documents in term space, KNN shows better experimental performance on both accuracy and efficiency [1].

Tokenization and stemming pre-processing

This part will be implemented integrated with the Mime4J parser.
- Tokenization: separate text into individual words. For example, the sentence "I am working on the proposal" becomes "I", "am", "working", "on", "the", "proposal".
- Stop-word removal: remove common words that are usually not useful for text classification, such as "a", "the", "I", "he", "she", "is", "are", etc.
- Stemming: normalize words derived from the same root. For example, "attending" becomes "attend" and "teacher" becomes "teach".

Feature Selection

Both TF-IDF (case-insensitive) and binary-vector (case-insensitive) feature selection will be implemented.

TF-IDF: the tf-idf weight (term frequency-inverse document frequency) is a statistical measure of how important a word is to a document in a collection or corpus <http://en.wikipedia.org/wiki/Text_corpus>. The importance increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool for scoring and ranking a document's relevance to a user query. It can therefore avoid the bias of the Bayesian approach.
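A minimal sketch of the pre-processing steps above, assuming a small illustrative stop-word list and a toy suffix-stripping stemmer (a real implementation would use the Porter stemmer and Mime4J for message parsing):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class Preprocessor {
    // Small illustrative stop-word list; the real filter would use a fuller one.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "the", "i", "he", "she", "is", "are", "on", "am"));

    /** Tokenize, drop stop words, and stem the remaining terms. */
    public static List<String> preprocess(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            terms.add(stem(token));
        }
        return terms;
    }

    // Toy stemmer: strips a few common suffixes. Stands in for Porter stemming.
    static String stem(String word) {
        if (word.endsWith("ing") && word.length() > 5) return word.substring(0, word.length() - 3);
        if (word.endsWith("er") && word.length() > 4) return word.substring(0, word.length() - 2);
        if (word.endsWith("s") && word.length() > 3) return word.substring(0, word.length() - 1);
        return word;
    }
}
```

For example, preprocess("I am working on the proposal") keeps only the content terms and yields ["work", "proposal"], matching the tokenization, stop-word, and stemming examples above.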
The TF-IDF variables will be tuned by experiment and can also be manually configured by users.

Weighted TF-IDF: based on TF-IDF, we can set different weights according to a word's position in the mail structure. My assumption is that words in the message header are more important than words in the body.

Binary vector: it records whether each feature (term) exists (1) or not (0) for each training example. The main variable here is the size of the vector, which we will set with the available memory in mind.

Input variables: vector size, ratio for TF-IDF, and weights for word positions.

Feature extraction

I will use mutual information to reduce the dimensionality. The mutual information of each word with the class variable is used to select important features: it captures the amount of information gained in classifying documents by knowing the presence or absence of the word [3].

SVMlight

The support vector machine is a powerful machine learning technique that can work with high-dimensional and noisy data. SVMs use kernel functions to map data points into a higher dimension and classify them with a hyperplane, a complex decision surface in that high-dimensional space. Determining the right parameters for such functions is not only computationally expensive; the resulting models are also susceptible to overfitting when the number of training examples is small, due to their large VC dimension. I will use an optimized SVM approach, SVMlight [5], which improves on a traditional SVM with a fast optimization algorithm, a "shrinking" heuristic, and caching of kernel evaluations.

The SVM method:
1) Use folding to select the top (or a certain number of) terms by document frequency.
2) Create a TF-IDF vector for each training example.
3) Use the SVM to train and obtain a predictive model.
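The TF-IDF and binary-vector features above could be sketched as follows; the idf smoothing and the method names are my own choices for illustration, not part of the proposal:

```java
import java.util.List;

public class TfIdf {
    /** tf-idf weight of a term in one document, given the whole corpus. */
    public static double weight(String term, List<String> doc, List<List<String>> corpus) {
        long count = doc.stream().filter(term::equals).count();
        double tf = (double) count / doc.size();
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        // Smoothed idf so a term appearing in every document still gets a finite weight.
        double idf = Math.log((double) corpus.size() / (1 + docsWithTerm)) + 1;
        return tf * idf;
    }

    /** Binary presence/absence (1/0) vector over a fixed vocabulary. */
    public static int[] binaryVector(List<String> doc, List<String> vocabulary) {
        int[] v = new int[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            v[i] = doc.contains(vocabulary.get(i)) ? 1 : 0;
        }
        return v;
    }
}
```

The vocabulary passed to binaryVector is where the configurable vector size comes in: it would hold only the terms kept by the mutual-information feature extraction.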
I will employ the SVMlight code, implemented in C by Thorsten Joachims [5].

K-Nearest Neighbor (KNN)

This method still needs training data to feed the predictive model, and it can share the training data from BayesianAnalysis. The k-nearest neighbors algorithm (kNN) is a method for classifying <http://en.wikipedia.org/wiki/Statistical_classification> objects based on the closest training examples in the feature space. It is among the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors and assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small).

KNN will use the same spam/non-spam training data provided manually by users, but the KNN model can be trained efficiently. The complexity of computing the model depends on two variables, k and the number of keywords [4]. In our case there are only two classes, since we just need to classify spam/non-spam. Building the model is O(N*N), since it computes the distance between each pair of nodes, where N is the configured vector size.

KNN method:
- Method 1: Create a binary vector recording whether each feature (term) exists (1) or not (0) for each training example. Create a binary vector for each unclassified mail too, then use cosine distance to find the top k closest training examples and from them determine which class the unclassified mail belongs to. Compute the distance from the target to the sampled examples, order the samples by distance, choose a heuristically optimal k based on RMSE <http://en.wikipedia.org/wiki/RMSE> from cross-validation, and calculate an inverse-distance-weighted vote over the k nearest neighbors.
- Method 2: Use TF-IDF vectors instead of the binary vectors in Method 1.
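Method 1 might look roughly like this in Java. The class and method names are hypothetical, and for brevity the vote is a plain majority over the k most cosine-similar neighbors rather than the inverse-distance-weighted vote described above:

```java
import java.util.Arrays;

public class KnnSketch {
    /** Cosine similarity between two term vectors (binary or tf-idf). */
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Classify a mail vector by majority vote of its k most similar training vectors. */
    public static boolean isSpam(double[] mail, double[][] train, boolean[] spamLabel, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort training indices by descending similarity to the unclassified mail.
        // (Recomputing cosine in the comparator is simple but inefficient; a real
        // implementation would cache the similarities first.)
        Arrays.sort(idx, (i, j) -> Double.compare(cosine(mail, train[j]), cosine(mail, train[i])));
        int spamVotes = 0;
        for (int n = 0; n < k; n++) {
            if (spamLabel[idx[n]]) spamVotes++;
        }
        return spamVotes * 2 > k;  // strict majority of the k neighbors
    }
}
```

With two spammy training vectors and one ham vector, a mail close to the spam vectors is classified as spam for k = 2, since both nearest neighbors vote spam.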
Schedule

1 week: understand the Mailets API
1 week: implement pre-processing
1 week: implement feature selection on the training dataset
2 weeks: implement the KNN method (1 week implementation, 1 week testing)
2 weeks: implement the SVMlight method (1 week implementation, 1 week testing)
2 weeks: tune the predictive models with different parameters
1 week: finish the documentation and the guide for usage and tuning

My Bio

Yu Fu, PhD student at the University of Maryland, Baltimore County. I have more than two years of research experience in data mining. I am currently designing new secure data mining algorithms over encrypted databases, funded by NSF. I have implemented regression trees, KNN, and other data mining algorithms on the Weka API to support my experiments, and I have three publications on privacy preservation and security in data mining.

[1] Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model. irlab.csie.ntu.edu.tw/~chean/perWeb/projects/AI/AI_final.pdf
[2] An Empirical Performance Comparison of Machine Learning Methods for Spam E-Mail Categorization. http://www.computer.org/portal/web/csdl/doi/10.1109/ICHIS.2004.21
[3] Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.1259&rep=rep1&type=pdf
[4] Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-and-optimality-of-knn-1.html#tab:knncomp
[5] SVMlight: http://svmlight.joachims.org

Yu Fu
[email protected]
443-388-6654
