Hi Robert and Eric,

I am sorry to say that I cannot participate in this GSoC, because my mentor has advised me to focus on my research defense first. I am sorry to give up this opportunity to contribute to JAMES.

Since the AI project is on track, maybe I can help later with recommendation, or with keyword filters for pre-reading mail or news.

Thank you for all of your suggestions on my proposal. I would be happy for anyone to use my proposal to continue this project.

Thanks.
Vicki

On Wed, Apr 6, 2011 at 4:56 PM, Robert Burrell Donkin (JIRA) wrote:
>
> [ https://issues.apache.org/jira/browse/JAMES-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016521#comment-13016521 ]
>
> Robert Burrell Donkin edited comment on JAMES-1216 at 4/6/11 8:56 PM:
> ----------------------------------------------------------------------
>
> Feature Selection
> -------------------------
> Feature extraction from emails may potentially result in a large number of
> features, and so high dimensionality.
>
> For some algorithms, this may have undesirable performance consequences. For
> example, k-nearest neighbour implementations typically hold all training data
> in memory during classification, and compute distances between the test
> point and each training point. To understand this trade-off, it would be
> useful to estimate how memory and computation complexity scales with the
> number of features, and relate this to desired mail throughput.
>
> A strong GSOC application should probably consider feature selection, so that
> it can be factored into the design even if time does not allow a full
> implementation.
>
> "An Introduction To Variable and Feature Selection" by Guyon and Elisseeff;
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.3593&rep=rep1&type=pdf
> "Fast Binary Feature Selection with Conditional Mutual Information" by
> Fleuret;
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.60.8398&rep=rep1&type=pdf
>
> was (Author: robertburrelldonkin):
> Feature Selection
> -------------------------
> Feature extraction from emails may potentially result in a large number of
> features, and so high dimensionality.
> For some algorithms
>
>> [gsoc2011] Design and implement machine learning filters and categorization
>> for mail
>> ------------------------------------------------------------------------------------
>>
>> Key: JAMES-1216
>> URL: https://issues.apache.org/jira/browse/JAMES-1216
>> Project: JAMES Server
>> Issue Type: New Feature
>> Reporter: Eric Charles
>> Assignee: Eric Charles
>> Labels: gsoc2011
>>
>> Context: Anti-spam functionality based on SpamAssassin is available at James
>> (based on mailets http://james.apache.org/mailet). Bayesian mailets are also
>> available, but not completely integrated/documented. Nothing is available to
>> automatically categorize mail traffic per user.
>> Task: We are willing to align the existing implementation with any modern
>> anti-spam solution based on powerful machine learning implementation (such
>> as apache mahout). We are also willing to extend the machine learning usage
>> to some mail categorization (spam vs not-spam is a first category, we can
>> extend it to any additional category we can imagine). The implementation can
>> partially occur while spooling the mails and/or when mail is stored in
>> mailbox.
>> Related discussions: See also discussions on mail intelligent mining on
>> http://markmail.org/message/2bodrwvdvtfq3f2v (mahout related) and
>> http://markmail.org/thread/pksl6csyvoeo27yh (hama related).
>> Mentor: eric at apache dot org & [fill in mentor]
>> Complexity: high
>
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>

--
Vicki Fu
