Hi Vicky,

Welcome!

Hope you will enjoy your time with the James community, and especially with Robert and me, who will mentor this project.

Following the container/database access changes between 2.3 and 3.0, there are things that sometimes need to be tuned.

I should look at the Bayesian mailets in James 3.0 before the end of this week.
In the meantime, you can try to build James from source (http://james.apache.org/server/3/dev-build.html), configure it (launch jConsole and go to the MBeans tab), and run it with the standard configuration (use your favorite mail client to send/receive mail).

Come back to this mailing list if you encounter any issues with this (the documentation is not 100% in line with the source trunk; we are working on it).

If everything goes well, you can also try the Bayesian mailets (http://wiki.apache.org/james/Bayesian_Analysis), and you may hit exceptions with SQL queries against the embedded Derby database. Those mailets still use the SQL queries you will find in sqlResources.xml (the section "org.apache.james.util.bayesian.JDBCBayesianAnalyzer"). In 2.3 all database access was done via this file; in 3.0 we tend to implement database access via OpenJPA.

For the project, we will move the needed classes to a sandbox (see http://markmail.org/message/ichoc67sz4sjybig).

Tks again,

PS: I left some comments on Google Melange regarding your application (we are in the last mile today).


On 7/04/2011 11:40, Yu Fu wrote:
Hi Everyone,
This is Vicki. I am applying to GSoC 2011 for JAMES-1216 [JIRA].
I have a question about updating Bayesian Analysis to 3.0. Because the
container switched from Avalon to Spring, do some interfaces need to be
rewritten? Does anyone know how this should be done?

I have attached my proposal and the figure here. I think it should act as
the starting point for me to contribute to the James community. Any
feedback is welcome.

_______________________________________________________________________________________________________________

*Proposal Title: *Design and implement machine learning filters and
categorization for anti-spam in the mail system
*Student Name: *Yu Fu
*Student E-mail: *[email protected] <mailto:[email protected]>
*Organization/Project*: The Apache Software Foundation
*Assigned Mentor*: Eric Charles, Robert Burrell Donkin
Proposal Abstract: Use the K-nearest neighbors machine learning method
to implement an intelligent mail filtering feature, strengthening
anti-spam capability beyond the current Bayesian Analysis method.
*Detailed Description:*


*1. Understand James Anti-Spam Features*

The current James anti-spam system has three parts, which I have marked
in red in the figure below. The figure also shows the relationships
between the basic concepts.

·The first part, *fastfail* in the SMTPServer, rejects a message at the
SMTP level if a configured number of hits is reached. It operates in the
background of a machine, listening on a specific TCP port. At this
level, the server filters non-existent users, applies a DSN filter,
rejects domains with invalid MX records, and ignores duplicated
recipients. IP addresses that send emails to the configured recipients
will also be blacklisted for the configured time. The port and the hit
score can be configured in smtpserver.xml.

·*SpamAssassin* in the mailet pipeline classifies messages as spam or
not against an internal (configurable) score threshold. Usually a
message will only be considered spam if it matches multiple criteria;
matching just a single test will not usually be enough to reach the
threshold.

·*BayesianAnalysis* in the mailet pipeline is based on Bayesian
probability to classify mail as spam or not. It relies on training data
from the users’ judgment. Users need to manually judge a message as spam
and send it to [email protected] <mailto:[email protected]>;
conversely, if they judge it as non-spam, they send it to
[email protected] <mailto:[email protected]>.
BayesianAnalysisFeeder learns from this training dataset and builds a
predictive model based on Bayesian probability. A table in the database
maintains the corpus frequency for keywords. Every 10 minutes a thread
in BayesianAnalysis checks and updates the table. The correct approach
is to send the original spam or non-spam as an attachment to another
message sent to the feeder, in order to avoid bias from the current
sender’s email header.

*2. Motivation*

The Naive Bayes approach, which is based on Bayesian probability, is the
most popular method. But for some overly popular phrases or keywords,
Naive Bayes does not work well. Consider the phrase “Click Here to
Unsubscribe” occurring 10 times as often in spam as in good mail:
P(click|spam), P(here|spam), and P(unsubscribe|spam) are each 10 times
higher, and multiplied together they give a factor of 1000. The bias of
these hot keywords will heavily hurt the accuracy of the filter.
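
To make the arithmetic concrete, here is a minimal illustrative sketch (plain Java, not James code; the class and method names are my own assumptions) of how the independence assumption multiplies the per-word likelihood ratios:

```java
// Illustrative sketch (not James code): how naive Bayes over-counts
// correlated tokens. Assume each of "click", "here", "unsubscribe"
// appears 10x as often in spam as in good mail.
public class NaiveBayesBias {

    // Combined likelihood ratio P(tokens|spam) / P(tokens|good) under
    // the naive independence assumption: the per-token ratios multiply.
    static double combinedRatio(double... perTokenRatios) {
        double product = 1.0;
        for (double r : perTokenRatios) {
            product *= r;
        }
        return product;
    }

    public static void main(String[] args) {
        // Three correlated tokens, each with ratio 10, are treated as
        // independent evidence: 10 * 10 * 10 = 1000.
        System.out.println(combinedRatio(10, 10, 10)); // prints 1000.0
    }
}
```

Since the three words almost always occur together, the true evidence is closer to a single factor of 10, not 1000.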

I want to implement K-nearest neighbor, which has much better
experimental results [1] than our current approach, and the training
time of KNN is very fast.

This project will be the starting point for me to contribute to the
James open source community. I treat the anti-spam filter as a practical
issue rather than a theoretical problem. In this community we need to
win with hands-on practice and strong analytical reasoning. I enjoy
implementing different approaches to find the most efficient and simple
method.

*3. Project Scope*
The project will be based on the James Server 3.0 framework, and it will
restrict the categories to spam/not spam.

·Upgrade Bayesian Analysis from 2.3 to 3.0

·Implement “TF-IDF” and “binary Vector” feature selection.

·Implement the K-nearest neighbor algorithm to serve as the new anti-spam filter.


*4. Approach*

My idea is to use the K-nearest neighbor algorithm to classify mail as
spam or non-spam. By computing distances between text documents in
feature space, KNN shows better experimental performance on the
accuracy measurement [1].

*Feature Selection*

Both the “TF-IDF” (case-insensitive) and the “binary vector”
(case-insensitive) feature selections will be implemented.

*TF-IDF*: The tf–idf weight (term frequency–inverse document frequency)
is a statistical measure used to evaluate how important a word is to
a document <http://en.wikipedia.org/wiki/Document> in a collection or
corpus <http://en.wikipedia.org/wiki/Text_corpus>. The importance
increases proportionally
<http://en.wikipedia.org/wiki/Proportionality_(mathematics)> to the
number of times a word appears in the document, but is offset by the
frequency of the word in the corpus. Variations of the tf–idf weighting
scheme are often used by search engines
<http://en.wikipedia.org/wiki/Search_engine> as a central tool in
scoring and ranking a document's relevance
<http://en.wikipedia.org/wiki/Relevance_(information_retrieval)> given a
user query <http://en.wikipedia.org/wiki/Information_retrieval>. So it
can avoid the bias of the Bayesian approach. The TF-IDF parameters will
be tuned by experiment, and can also be configured manually by users.
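
As a rough sketch of the weighting (illustrative Java only; `TfIdf` and its methods are hypothetical names, not an existing James API):

```java
import java.util.List;

// Illustrative TF-IDF sketch over tokenized mails; names are assumptions.
public class TfIdf {

    // Term frequency: occurrences of term / total terms in the document.
    static double tf(String term, List<String> doc) {
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // Inverse document frequency: log(N / docs containing the term).
    // Returns 0 if the term never occurs, to avoid division by zero.
    static double idf(String term, List<List<String>> corpus) {
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        if (docsWithTerm == 0) return 0.0;
        return Math.log((double) corpus.size() / docsWithTerm);
    }

    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }
}
```

Note that a term appearing in every mail gets idf = log(1) = 0, so overly common phrases such as “Click Here to Unsubscribe” are down-weighted automatically.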

Weighted *TF-IDF*: based on TF-IDF, we can assign different weights
depending on a word's position in the mail structure. My assumption is
that words in the message header are more important than those in the
body.

*Binary Vector*: maintains, for each training example, whether each
feature (term) exists or not (1 or 0). The main variable here is the
vector size, which we will set with memory constraints in mind.
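
A minimal sketch of building such a vector (illustrative Java; class and vocabulary names are assumptions):

```java
import java.util.List;
import java.util.Set;

// Illustrative binary feature vector: v[i] = 1 iff the i-th vocabulary
// term occurs in the mail. The vocabulary size is the memory-bounded
// variable mentioned above.
public class BinaryVector {

    static int[] toVector(Set<String> mailTerms, List<String> vocabulary) {
        int[] v = new int[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            v[i] = mailTerms.contains(vocabulary.get(i)) ? 1 : 0;
        }
        return v;
    }
}
```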

*Variables*: vector size, TF-IDF ratio, weight for word position.

*Feature extraction*: I will use mutual information in order to reduce
dimensionality. The mutual information of each word with the class
variable has been used to select important features: it captures the
amount of information gained in classifying documents by knowing the
presence or absence of the word [3].
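
A hedged sketch of computing this mutual information from document counts (illustrative Java, not James code):

```java
// Illustrative mutual information between a term's presence and the
// spam/ham class, computed from document counts; names are assumptions.
public class MutualInfo {

    // n11: docs with term & spam,    n10: docs with term & ham,
    // n01: docs without term & spam, n00: docs without term & ham.
    static double mi(double n11, double n10, double n01, double n00) {
        double n = n11 + n10 + n01 + n00;
        return cell(n11, n11 + n10, n11 + n01, n)
             + cell(n10, n11 + n10, n10 + n00, n)
             + cell(n01, n01 + n00, n11 + n01, n)
             + cell(n00, n01 + n00, n10 + n00, n);
    }

    // One cell: P(w,c) * log2(P(w,c) / (P(w) * P(c))), with 0*log(0) = 0.
    static double cell(double joint, double wMarginal, double cMarginal, double n) {
        if (joint == 0) return 0;
        return (joint / n) * (Math.log((n * joint) / (wMarginal * cMarginal)) / Math.log(2));
    }
}
```

A term perfectly correlated with the class gets 1 bit of mutual information; a term independent of the class gets 0, so it can be dropped from the feature set.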

*KNN:* this method still feeds the predictive model with training data,
and can share the training data used by Bayesian Analysis. The k-nearest
neighbor algorithm (k-NN) is a method for classifying
<http://en.wikipedia.org/wiki/Statistical_classification> objects based
on the closest training examples in the feature space
<http://en.wikipedia.org/wiki/Feature_space>. The k-nearest neighbor
algorithm is amongst the simplest of all machine learning
<http://en.wikipedia.org/wiki/Machine_learning> algorithms: an object is
classified by a majority vote of its neighbors, with the object being
assigned to the class most common amongst its k nearest neighbors (k is
a positive integer <http://en.wikipedia.org/wiki/Integer>, typically
small). Here k = 2, since we only need to classify mail as spam or
not-spam.

The KNN method will be implemented in two ways:

·Method 1: Create a “binary vector” which maintains, for each training
example, whether each feature (term) exists or not (1 or 0). For each
unclassified mail, create a binary vector too, then use cosine distance
to find the top 2 closest training examples and determine which class
the unclassified mail belongs to: compute the distance from the target
mail to the sampled training data, order the samples according to the
calculated distances, choose the heuristically optimal 2 nearest
neighbors based on RMSE <http://en.wikipedia.org/wiki/RMSE> via
cross-validation, and calculate an inverse-distance-weighted vote among
the k nearest neighbors.

·Method 2: Use “TF-IDF” vectors instead of the binary vector in Method 1.
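
The two methods above could be sketched as follows (illustrative Java; `KnnClassifier` is a hypothetical name, and the vectors may be binary or TF-IDF weighted):

```java
import java.util.Arrays;

// Illustrative KNN sketch for Method 1/2: rank training mails by cosine
// similarity to the query vector, then take a majority vote among the
// k most similar ones. Not an existing James class.
public class KnnClassifier {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // labels[i] is true for spam; returns true if a strict majority of
    // the k most similar training vectors are labeled spam.
    static boolean classify(double[] query, double[][] train, boolean[] labels, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort indices by descending cosine similarity to the query.
        Arrays.sort(idx, (i, j) -> Double.compare(cosine(query, train[j]), cosine(query, train[i])));
        int spamVotes = 0;
        for (int i = 0; i < k; i++) {
            if (labels[idx[i]]) spamVotes++;
        }
        return spamVotes * 2 > k;
    }
}
```

With only two classes, an odd k avoids ties in the vote; the final k would be tuned by cross-validation as described in Method 1.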

Enhanced features for the Bayesian method:

   1. Filtering for certain words based on the Bayesian probability of
      single words.
   2. Filtering for all the words based on the combination probability
      of all the words.
   3. Filtering for the rare words based on the Bayesian probability.
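
Item 2 could be sketched as follows, assuming word independence (illustrative Java; the combining formula is the one commonly used in Bayesian spam filtering, not necessarily the exact James implementation):

```java
// Illustrative sketch of combining per-word spam probabilities into one
// message score; names are assumptions, not James code.
public class CombinedProbability {

    // wordProbs[i] = P(spam | word i). Assuming word independence:
    // P(spam | all words) = prod(p) / (prod(p) + prod(1 - p)).
    static double combine(double... wordProbs) {
        double spam = 1.0, ham = 1.0;
        for (double p : wordProbs) {
            spam *= p;
            ham *= (1 - p);
        }
        return spam / (spam + ham);
    }
}
```

For example, two mildly spammy words (0.9 each) combine to a score well above 0.9, which is exactly the over-counting that the KNN approach aims to avoid for correlated words.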

Schedule

1 week to upgrade the Bayesian Analysis feature to 3.0.

4 weeks to implement the parser and feature selection methods (3 weeks
implementation and 1 week testing).

3 weeks to implement the KNN method (2 weeks implementation and 1 week
testing).

2 weeks to tune the predictive model with different parameters.

1 week to finish the documentation and a guide for usage and tuning.

[1] Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model.
http://irlab.csie.ntu.edu.tw/~chean/perWeb/projects/AI/AI_final.pdf

[2] An Empirical Performance Comparison of Machine Learning Methods for
Spam E-Mail Categorization.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICHIS.2004.21

[3] Text Categorization Using Weight Adjusted k-Nearest Neighbor
Classification.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.1259&rep=rep1&type=pdf





--
Yu Fu
[email protected] <mailto:[email protected]>
443-388-6654




---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
