Hi everyone, this is Vicki. I am applying to GSoC 2011 for JAMES-1216 [JIRA]. I have a question about updating Bayesian Analysis to 3.0: since the container switched from Avalon to Spring, do some interfaces need to be rewritten? Does anyone know how this should be done?
I have attached my proposal and the figure here. I think it will serve as a starting point for me to contribute to the James community. Any feedback is welcome.
_______________________________________________________________________________________________________________

*Proposal Title:* Design and implement machine-learning filters and categorization for anti-spam in the mail system
*Student Name:* Yu Fu
*Student E-mail:* [email protected]
*Organization/Project:* The Apache Software Foundation
*Assigned Mentor:* Eric Charles, Robert Burrell Donkin

*Proposal Abstract:* Use the K-nearest neighbors machine-learning method to implement an intelligent mail-filtering feature, strengthening anti-spam capability beyond the current Bayesian Analysis method.

*Detailed Description:*

*1. Understand James Anti-Spam Features*

The current James anti-spam support has three parts, which I have marked in red in the figure below. The figure also shows how the basic concepts relate to each other.

· The first part, *fastfail* in the SMTPServer, rejects a message at the SMTP level once a configured hit score is reached. The server runs in the background of a machine, listening on a specific TCP port. At this level it filters out non-existent users, applies a DSN filter, rejects domains with invalid MX records, and ignores duplicated recipients. IP addresses that send mail to the configured recipients are blacklisted for a configured time. The port and hit score can be set in smtpserver.xml.

· *SpamAssassin*, a mailet, classifies messages as spam or not against an internal (configurable) score threshold. Usually a message will only be considered spam if it matches multiple criteria; matching just a single test will not usually be enough to reach the threshold.

· *BayesianAnalysis*, also a mailet, classifies mail as spam or not based on Bayesian probability. It relies on training data built from users' judgments.
Users manually judge a message as spam and forward it to [email protected]; conversely, if they judge it as non-spam, they forward it to [email protected]. The BayesianAnalysisFeeder learns from this training set and builds a predictive model based on Bayesian probability. A table in the database maintains the word frequencies of the corpus; every 10 minutes a thread in BayesianAnalysis checks and updates this table. The correct approach is to send the original spam or non-spam message as an attachment to another message addressed to the feeder, in order to avoid bias from the current sender's email headers.

*2. Motivation*

The Naive Bayes approach, based on Bayesian probability, is the most popular method. But for overly common phrases or keywords, Naive Bayes does not work well. Consider the phrase "Click Here to Unsubscribe", whose words occur 10 times as often in spam as in good mail: P(click|spam), P(here|spam), and P(unsubscribe|spam) are each about 10 times higher, and multiplied together they contribute a factor of about 1000. The bias from such hot keywords heavily hurts the accuracy of the filter. I want to implement K-nearest neighbor, which has shown much better experimental results than our current approach [1], and whose training is very fast.

This project will be the starting point for me to contribute to the James open-source community. I treat the anti-spam filter as a practical issue rather than a theoretical problem: in this community we need to win with abundant practice and strong analytical reasoning, and I enjoy implementing different approaches to find the most efficient and simplest method.

*3. Project Scope*

The project will be based on the James Server 3.0 framework, and it will restrict the categories to spam/not-spam.

· Upgrade Bayesian Analysis from 2.3 to 3.0.
· Implement "TF-IDF" and "binary vector" feature selection.
· Implement the K-nearest neighbor algorithm to serve as the new anti-spam filter.

*4. Approach*

My idea is to use the K-nearest neighbor algorithm to classify mail as spam or non-spam.
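To make the idea concrete, here is a minimal sketch of KNN classification over binary term vectors. All names, the toy vocabulary, and the labels are illustrative and not part of James; it uses cosine similarity and a simple majority vote rather than the full pipeline described later.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch only: encode each mail as a binary term vector over a
// fixed vocabulary, then label a new mail by a majority vote of its k nearest
// training vectors under cosine similarity.
public class BinaryKnn {

    // 1 if the mail contains the vocabulary term, 0 otherwise
    static double[] toVector(Set<String> mailTerms, List<String> vocabulary) {
        double[] v = new double[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            v[i] = mailTerms.contains(vocabulary.get(i)) ? 1.0 : 0.0;
        }
        return v;
    }

    // cosine similarity; 0 if either vector is all zeros
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // majority label among the k training vectors most similar to the query
    static String classify(double[] query, Map<double[], String> training, int k) {
        return training.entrySet().stream()
                .sorted((e1, e2) -> Double.compare(
                        cosine(query, e2.getKey()), cosine(query, e1.getKey())))
                .limit(k)
                .collect(Collectors.groupingBy(Map.Entry::getValue, Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }

    public static void main(String[] args) {
        List<String> vocab = List.of("click", "unsubscribe", "meeting", "agenda");
        Map<double[], String> training = new LinkedHashMap<>();
        training.put(toVector(Set.of("click", "unsubscribe"), vocab), "spam");
        training.put(toVector(Set.of("meeting", "agenda"), vocab), "ham");
        training.put(toVector(Set.of("click"), vocab), "spam");

        double[] query = toVector(Set.of("click", "unsubscribe"), vocab);
        System.out.println(classify(query, training, 2)); // prints "spam"
    }
}
```

In a real mailet the vocabulary and training vectors would come from the shared training tables rather than being hard-coded.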
By computing distances between mails in a word-vector space, KNN achieves better experimental accuracy [1].

*Feature Selection*

Both "TF-IDF" (case-insensitive) and "binary vector" (case-insensitive) feature selection will be implemented.

*TF-IDF*: The tf-idf weight (term frequency-inverse document frequency) is a statistical measure used to evaluate how important a word is to a document <http://en.wikipedia.org/wiki/Document> in a collection or corpus <http://en.wikipedia.org/wiki/Text_corpus>. The importance increases proportionally <http://en.wikipedia.org/wiki/Proportionality_(mathematics)> to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines <http://en.wikipedia.org/wiki/Search_engine> as a central tool in scoring and ranking a document's relevance <http://en.wikipedia.org/wiki/Relevance_(information_retrieval)> given a user query <http://en.wikipedia.org/wiki/Information_retrieval>. It can therefore avoid the bias of the Bayesian approach. The TF-IDF parameters will be tuned by experiment, and can also be configured manually by users.

Weighted *TF-IDF*: on top of TF-IDF, we can assign different weights according to a word's position in the mail structure. My assumption is that words in the message header are more important than those in the body.

*Binary Vector*: maintains, for each training example, whether each feature (term) exists or not (1 or 0). The main variable here is the size of the vector, which we will choose with the available memory in mind.

*Variables*: vector size, ratio for TF-IDF, weight for word position.

*Feature Extraction*

I will use mutual information to reduce the dimensionality: the mutual information of each word with the class variable is used to select important features.
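The tf-idf weighting described above can be sketched in a few lines. This is a minimal, assumed implementation (class and corpus are mine, not James code) using the common formula weight(t, d) = tf(t, d) * ln(N / df(t)) over lower-cased token lists:

```java
import java.util.List;

// Illustrative tf-idf sketch: tf = occurrences of the term in the document,
// df = number of documents in the corpus containing the term, N = corpus size.
public class TfIdf {

    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(term::equals).count();
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        if (tf == 0 || df == 0) {
            return 0.0; // term absent from the document or the whole corpus
        }
        return tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("click", "here", "to", "unsubscribe"),
                List.of("meeting", "agenda", "attached"),
                List.of("click", "the", "link"));

        // "unsubscribe" appears in 1 of 3 docs -> weight 1 * ln(3)
        System.out.println(tfIdf("unsubscribe", corpus.get(0), corpus));
        // "click" appears in 2 of 3 docs -> lower weight, 1 * ln(1.5)
        System.out.println(tfIdf("click", corpus.get(0), corpus));
    }
}
```

The weighted variant would simply multiply the result by a position factor (e.g. a larger factor for header terms than body terms), tuned by experiment.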
The mutual information of a word captures the amount of information gained in classifying documents by knowing the presence or absence of that word [3].

*KNN*: this method still builds a predictive model from training data, and it can share the training data already collected for Bayesian Analysis. The k-nearest neighbors algorithm (k-NN) is a method for classifying <http://en.wikipedia.org/wiki/Statistical_classification> objects based on the closest training examples in the feature space <http://en.wikipedia.org/wiki/Feature_space>. It is among the simplest of all machine learning <http://en.wikipedia.org/wiki/Machine_learning> algorithms: an object is classified by a majority vote of its neighbors and assigned to the class most common among its k nearest neighbors (k is a positive integer <http://en.wikipedia.org/wiki/Integer>, typically small). Here k = 2, since we only need to classify into spam and non-spam.

KNN implementation:

· Method 1: Create a "binary vector" that records whether each feature (term) exists (1 or 0) in each training example, and build the same binary vector for each unclassified mail. Then compute the cosine distance from the target vector to the sampled training vectors, order the samples by distance, choose a heuristically optimal number of nearest neighbors (here 2) based on RMSE <http://en.wikipedia.org/wiki/RMSE> estimated by cross-validation, and take an inverse-distance-weighted vote among those nearest neighbors to decide which class the unclassified mail belongs to.

· Method 2: Same as Method 1, but using "TF-IDF" vectors instead of binary vectors.

Enhanced features for the Bayesian method:
1. Filtering on certain words based on the Bayesian probability of single words.
2. Filtering on all the words based on the combined probability of all the words.
3.
Filtering on rare words based on their Bayesian probability.

*Schedule*

· 1 week: upgrade the Bayesian Analysis feature to 3.0.
· 2 weeks: implement the parser and feature-selection methods (3 weeks implementation and 1 week testing).
· 3 weeks: implement the KNN method (2 weeks implementation and 1 week testing).
· 2 weeks: tune the predictive model with different parameters.
· 1 week: finish the documentation and the guide for usage and tuning.

[1] Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model. irlab.csie.ntu.edu.tw/~chean/perWeb/projects/AI/AI_final.pdf
[2] An Empirical Performance Comparison of Machine Learning Methods for Spam E-Mail Categorization. http://www.computer.org/portal/web/csdl/doi/10.1109/ICHIS.2004.21
[3] Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.1259&rep=rep1&type=pdf

--
Yu Fu
[email protected]
443-388-6654
