Re: [jira] [Issue Comment Edited] (JAMES-1216) [gsoc2011] Design and implement machine learning filters and categorization for mail

Eric Charles Mon, 25 Apr 2011 01:41:25 -0700

Hi Vicki,

Sorry to hear this, but your research defense comes first.

Yes, we will continue the AI project, and don't hesitate to answer/contribute on this mailing list with ideas, questions, ...


Good luck with your defense :)
Tks,

- Eric

On 25/04/2011 06:22, Vicki Fu wrote:

Hi Robert and Eric,
I am sorry to say that I cannot participate in this GSoC because my
mentor suggested that I focus on my research defense first.
I am sorry for giving up this opportunity to contribute to JAMES. Since
the AI project is on track,
maybe I can help later with recommendation or keyword filters to
pre-read mail or news.
Thank you for all of your suggestions on my proposal. I will be happy
if anyone uses my proposal to continue this project.
Thanks.
Vicki


On Wed, Apr 6, 2011 at 4:56 PM, Robert Burrell Donkin (JIRA)
<[email protected]>  wrote:


    [ https://issues.apache.org/jira/browse/JAMES-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016521#comment-13016521 ]

Robert Burrell Donkin edited comment on JAMES-1216 at 4/6/11 8:56 PM:
----------------------------------------------------------------------

Feature Selection
-------------------------
Feature extraction from emails may potentially result in a large number of 
features, and so high dimensionality.

For some algorithms, this may have undesirable performance consequences. For 
example, k-nearest neighbour implementations typically hold all training data 
in memory during classification, and compute distances between the test point 
and each training point. To understand this trade-off, it would be useful to 
estimate how memory and computational complexity scale with the number of 
features, and relate this to desired mail throughput.
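
As a rough illustration of the estimate suggested above, here is a minimal 
sketch of a cost model for a brute-force k-nearest neighbour classifier. The 
function names, the dense 8-byte storage, and the example numbers (50,000 
training mails, a budget of 1e9 operations per second) are all assumptions made 
for illustration, not measurements from James or from any particular library.

```python
def knn_cost(n_train, n_features, bytes_per_value=8):
    """Rough cost model for brute-force k-NN over dense feature vectors.

    Assumptions (hypothetical, for illustration only): every training
    vector is stored densely, and classifying one mail compares it with
    every training vector, touching each feature once.
    """
    memory_bytes = n_train * n_features * bytes_per_value
    ops_per_mail = n_train * n_features
    return memory_bytes, ops_per_mail


def max_throughput(ops_per_mail, ops_per_second):
    """Upper bound on mails classified per second for a compute budget."""
    return ops_per_second / ops_per_mail


if __name__ == "__main__":
    # Example numbers are made up; substitute real corpus sizes.
    for n_features in (1_000, 10_000, 100_000):
        mem, ops = knn_cost(n_train=50_000, n_features=n_features)
        rate = max_throughput(ops, ops_per_second=1e9)
        print(f"{n_features:>7} features: {mem / 2**30:.2f} GiB, "
              f"{ops:.1e} ops/mail, <= {rate:.1f} mails/s")
```

Under these assumptions both memory and per-mail work grow linearly with the 
number of features, so the achievable throughput falls in inverse proportion. 
Sparse storage or an index structure would change the constants.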

A strong GSoC application should probably consider feature selection, so that 
it can be factored into the design even if time does not allow a full 
implementation.

"An Introduction To Variable and Feature Selection" by Guyon and Elisseeff; 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.3593&rep=rep1&type=pdf
"Fast Binary Feature Selection with Conditional Mutual Information" by Fleuret; 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.60.8398&rep=rep1&type=pdf
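
To make the idea of feature selection concrete, here is a minimal sketch that 
ranks binary features by their mutual information with the class label and 
keeps the top k. This is a plain filter-style ranking chosen for brevity, not 
the conditional mutual information method from the Fleuret paper, which also 
accounts for redundancy between already selected features. The function names 
and the row-of-0/1-values input format are assumptions for illustration.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels):
    """Mutual information I(X; Y) in bits between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    px = Counter(feature_values)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def top_k_features(rows, labels, k):
    """Return indices of the k features with the highest MI with the label.

    rows is a list of equal-length lists of 0/1 feature values (a
    hypothetical input format); ties are broken by lower feature index.
    """
    n_features = len(rows[0])
    scores = [
        (mutual_information([row[j] for row in rows], labels), j)
        for j in range(n_features)
    ]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]
```

For example, with rows [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]] and labels 
[1, 1, 0, 0], feature 0 matches the label exactly and is the one selected for 
k = 1.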

      was (Author: robertburrelldonkin):
    Feature Selection
-------------------------
Feature extraction from emails may potentially result in a large number of 
features, and so high dimensionality. For some algorithms

[gsoc2011] Design and implement machine learning filters and categorization for 
mail
------------------------------------------------------------------------------------

                 Key: JAMES-1216
                 URL: https://issues.apache.org/jira/browse/JAMES-1216
             Project: JAMES Server
          Issue Type: New Feature
            Reporter: Eric Charles
            Assignee: Eric Charles
              Labels: gsoc2011

Context: Anti-spam functionality based on SpamAssassin is available in James 
(based on mailets http://james.apache.org/mailet). Bayesian mailets are also 
available, but not completely integrated/documented. Nothing is available to 
automatically categorize mail traffic per user.
Task: We want to align the existing implementation with a modern 
anti-spam solution based on a powerful machine learning implementation (such as 
Apache Mahout). We also want to extend the machine learning usage to 
mail categorization (spam vs not-spam is a first category, and we can extend 
it to any additional category we can imagine). The implementation can partially 
occur while spooling the mails and/or when mail is stored in the mailbox.
Related discussions: See also discussions on intelligent mail mining on 
http://markmail.org/message/2bodrwvdvtfq3f2v (Mahout related) and 
http://markmail.org/thread/pksl6csyvoeo27yh (Hama related).
Mentor: eric at apache dot org & [fill in mentor]
Complexity: high


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
