Hi Vicki,
Sorry to hear this, but your research defense comes first.
Yes, we will continue the AI project, so don't hesitate to
answer or contribute on this mailing list with ideas, questions, ...
Good luck with your defense :)
Tks,
- Eric
On 25/04/2011 06:22, Vicki Fu wrote:
Hi Robert and Eric,
I am sorry to say that I cannot participate in this GSoC because my
mentor suggested that I focus on my research defense first.
I am sorry to give up this opportunity to contribute to JAMES. Since
the AI project is on track,
maybe I can help later with recommendation or keyword filters to
pre-read mail or news.
Thank you for all of your suggestions on my proposal. I would be happy
for anyone to use my proposal to continue this project.
Thanks.
Vicki
On Wed, Apr 6, 2011 at 4:56 PM, Robert Burrell Donkin (JIRA)
<[email protected]> wrote:
[
https://issues.apache.org/jira/browse/JAMES-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016521#comment-13016521
]
Robert Burrell Donkin edited comment on JAMES-1216 at 4/6/11 8:56 PM:
----------------------------------------------------------------------
Feature Selection
-------------------------
Feature extraction from emails may potentially result in a large number of
features, and so high dimensionality.
For some algorithms, this may have undesirable performance consequences. For
example, k-nearest neighbour implementations typically hold all training data
in memory during classification, and compute distances between the test point
and each training point. To understand this trade-off, it would be useful to
estimate how memory and computational complexity scale with the number of
features, and relate this to the desired mail throughput.
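A rough back-of-the-envelope sketch of that estimate follows. It assumes a
brute-force k-nearest neighbour search over dense feature vectors with a
Euclidean distance; the function names, the 8 bytes per value, the training
set size and the operations-per-second figure are all illustrative
assumptions, not measurements from James or Mahout.

```python
def knn_cost(n_train, n_features, bytes_per_value=8):
    """Rough cost of brute-force k-NN over dense vectors (assumed model)."""
    # Every training vector is held in memory: one value per feature.
    memory_bytes = n_train * n_features * bytes_per_value
    # Squared Euclidean distance: about one subtract, one multiply and
    # one add per feature, for each training point, per classified mail.
    ops_per_mail = 3 * n_train * n_features
    return memory_bytes, ops_per_mail


def mails_per_second(ops_per_mail, ops_per_second):
    """Upper bound on throughput if classification were the only cost."""
    return ops_per_second / ops_per_mail


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 training mails, 1e9 simple ops per second.
    for n_features in (1_000, 10_000, 100_000):
        mem, ops = knn_cost(n_train=10_000, n_features=n_features)
        rate = mails_per_second(ops, ops_per_second=1e9)
        print(f"{n_features:>7} features: {mem / 1e6:.0f} MB, ~{rate:.1f} mails/s")
```

Under these assumptions both memory and per-mail work grow linearly with the
number of features, so a tenfold cut in features buys roughly a tenfold gain
in throughput. Sparse representations or index structures would change the
constants, and possibly the scaling.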
A strong GSoC application should probably consider feature selection, so that
it can be factored into the design even if time does not allow a full
implementation.
"An Introduction to Variable and Feature Selection" by Guyon and Elisseeff;
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.3593&rep=rep1&type=pdf
"Fast Binary Feature Selection with Conditional Mutual Information" by Fleuret;
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.60.8398&rep=rep1&type=pdf
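As a minimal sketch of what a feature selection step could look like, the
snippet below ranks tokens by their mutual information with the class label
and keeps the top k. This is a plain filter-style ranking, simpler than the
conditional mutual information approach in the Fleuret paper above, and it is
not taken from either paper. The function names, the set-of-tokens mail
representation and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(docs, labels, token):
    """Mutual information (in bits) between token presence and class label.

    `docs` is a list of token sets, `labels` the matching class labels.
    """
    n = len(docs)
    joint = Counter((token in doc, label) for doc, label in zip(docs, labels))
    present = Counter(token in doc for doc in docs)
    classes = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((present[x] / n) * (classes[y] / n)))
    return mi


def select_features(docs, labels, k):
    """Return the k tokens with the highest mutual information score."""
    vocabulary = set().union(*docs)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(
        vocabulary,
        key=lambda t: (-mutual_information(docs, labels, t), t),
    )
    return ranked[:k]


# Toy data (hypothetical): each mail is reduced to a set of tokens.
docs = [
    {"cheap", "pills"},
    {"cheap", "offer"},
    {"meeting", "offer"},
    {"meeting", "notes"},
]
labels = ["spam", "spam", "ham", "ham"]
top_tokens = select_features(docs, labels, k=2)
```

Ranking each token independently ignores redundancy between tokens, which is
the weakness that conditional methods try to address.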
was (Author: robertburrelldonkin):
Feature Selection
-------------------------
Feature extraction from emails may potentially result in a large number of
features, and so high dimensionality. For some algorithms
[gsoc2011] Design and implement machine learning filters and categorization for
mail
------------------------------------------------------------------------------------
Key: JAMES-1216
URL: https://issues.apache.org/jira/browse/JAMES-1216
Project: JAMES Server
Issue Type: New Feature
Reporter: Eric Charles
Assignee: Eric Charles
Labels: gsoc2011
Context: Anti-spam functionality based on SpamAssassin is available in James
(based on mailets, http://james.apache.org/mailet). Bayesian mailets are also
available, but not completely integrated or documented. Nothing is available to
automatically categorize mail traffic per user.
Task: We would like to align the existing implementation with a modern
anti-spam solution based on a powerful machine learning implementation (such as
Apache Mahout). We would also like to extend the machine learning usage to
mail categorization (spam vs. not-spam is a first category; we can extend
it to any additional category we can imagine). The implementation can partially
occur while spooling the mails and/or when mail is stored in the mailbox.
Related discussions: See also the discussions on intelligent mail mining at
http://markmail.org/message/2bodrwvdvtfq3f2v (Mahout related) and
http://markmail.org/thread/pksl6csyvoeo27yh (Hama related).
Mentor: eric at apache dot org & [fill in mentor]
Complexity: high