Introduction
Hi all,

I've been following Hadoop and the Mahout project for a while now and I thought I should introduce myself. I'm Daniel Nee, a master's student at University College London studying Computational Statistics and Machine Learning. Before that I did my undergraduate degree in Computer Science at the University of Southampton.

I'm keen to help out the project wherever I can, whether that is coding up new algorithms, improving documentation/descriptions of the techniques, etc.

One algorithm I am interested in working on is a map/reduce version of fitting a Gaussian Mixture Model (GMM) via EM. I noticed at least one person was interested in implementing this for GSoC. I am happy to provide code/ideas if they decide to take it further, whether as part of GSoC or as their own project. Due to commitments to my master's I cannot participate in GSoC myself. The standard non-parallel EM algorithm for GMMs is pretty straightforward to implement, and I think I have a pretty good idea of how to implement a map/reduce version.

As a final note, I was wondering if anyone else will be attending the 2nd UK Hadoop User Group (http://huguk.eventwax.com/hadoop-user-group-uk-2)?

Dan
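For readers unfamiliar with the idea, here is a minimal sketch of how one EM iteration for a GMM could be phrased in map/reduce terms. It is an illustration only, not Daniel's design and not Mahout code: it assumes one-dimensional data, runs in a single process, and all function names (`map_point`, `combine`, `m_step`, `fit_gmm`) are hypothetical. The map step computes each point's responsibilities and emits per-component sufficient statistics, the reduce step sums them, and the M-step turns the sums into new parameters. A real implementation would probably work in log space to avoid underflow when a point is far from every component.

```python
import math
from functools import reduce


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def map_point(x, params):
    """Map (E-step) for one point: emit sufficient statistics per component.

    params is a list of (weight, mean, variance) tuples, one per component.
    """
    weighted = [w * gaussian_pdf(x, m, v) for (w, m, v) in params]
    total = sum(weighted)  # assumed non-zero; see the underflow caveat above
    resp = [p / total for p in weighted]
    # For each component: (responsibility, resp * x, resp * x^2)
    return [(r, r * x, r * x * x) for r in resp]


def combine(a, b):
    """Reduce: add two lists of per-component statistics element-wise."""
    return [(a0 + b0, a1 + b1, a2 + b2)
            for (a0, a1, a2), (b0, b1, b2) in zip(a, b)]


def m_step(stats, n_points):
    """Turn the summed statistics into new (weight, mean, variance) tuples."""
    params = []
    for (r, rx, rxx) in stats:
        mean = rx / r
        var = max(rxx / r - mean * mean, 1e-6)  # floor keeps variance positive
        params.append((r / n_points, mean, var))
    return params


def fit_gmm(data, params, iterations=50):
    """Run a fixed number of EM iterations; each one is a map plus a reduce."""
    for _ in range(iterations):
        stats = reduce(combine, (map_point(x, params) for x in data))
        params = m_step(stats, len(data))
    return params
```

Because `combine` is associative and commutative, the same function could in principle serve as a combiner on each mapper as well as the reducer, so only a small fixed-size record per component needs to cross the network in each iteration.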
Re: GSoC, EM or SVM?
Hi,

I decided to go with the mixture model for EM. I have modified my proposal and submitted it on both the GSoC website and the Apache wiki.

Best Regards
Yifan

2009/4/1 Yifan Wang heavens...@gmail.com:
I will choose Mixture Model for the EM implementation.
Yifan

2009/4/1 Ted Dunning ted.dunn...@gmail.com:
Yifan, EM is a highly non-specific term and covers a huge range of very different algorithms. For example, pLSI, HMMs, and mixture models can all be estimated using EM. What exactly did you mean to address with an EM implementation?

On Wed, Apr 1, 2009 at 1:05 PM, Grant Ingersoll gsing...@apache.org wrote:
Hi Yifan, I think both are good candidates, although AIUI, SVM is a bit harder to parallelize, so maybe it would make sense to focus on EM. Of course, we don't have to be distributed, so you could propose a non-distributed SVM implementation as a first cut and then work on the distributed part as the project develops.

...

For EM, it is a generalization of the k-means algorithm, and we already have k-means in the Mahout library.
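The remark that EM generalizes k-means can be made concrete for the mixture-model case. The sketch below is an illustration under stated assumptions (one-dimensional data, equal component weights, a single shared variance), and the function names are hypothetical. Under those assumptions the EM responsibilities are a soft version of the k-means nearest-centroid rule, and they approach the hard assignment as the shared variance shrinks.

```python
import math


def responsibilities(x, means, var):
    """Soft (EM-style) assignment for equal-weight, shared-variance Gaussians."""
    # Use log-densities and subtract the maximum for numerical stability;
    # the shared normalising constant cancels and is omitted.
    logs = [-(x - m) ** 2 / (2.0 * var) for m in means]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]


def nearest_centroid(x, means):
    """Hard (k-means-style) assignment: index of the closest centroid."""
    return min(range(len(means)), key=lambda k: (x - means[k]) ** 2)


means = [0.0, 5.0]
soft = responsibilities(2.0, means, var=1.0)    # point is shared between clusters
sharp = responsibilities(2.0, means, var=0.01)  # nearly all mass on one cluster
hard = nearest_centroid(2.0, means)             # k-means picks a single cluster
```

With a large variance the point at 2.0 keeps some responsibility for both components; with a small variance almost all of it goes to the component that `nearest_centroid` selects.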
Re: Introduction
Hi Daniel,

I am Yifan. Glad to see someone has the same idea here. I have submitted a GSoC 2009 proposal for the Mixture Model via EM. I have basic knowledge of the EM algorithm for mixture models. I hope to discuss the algorithm with you in the future.

Best Regards
Yifan

-----Original Message-----
From: Daniel Nee [mailto:nee.dan...@googlemail.com]
Sent: 2 April 2009 16:53
To: mahout-dev@lucene.apache.org
Subject: Introduction

Hi all,

I've been following Hadoop and the Mahout project for a while now and I thought I should introduce myself. I'm Daniel Nee, a master's student at University College London studying Computational Statistics and Machine Learning. Before that I did my undergraduate degree in Computer Science at the University of Southampton.

I'm keen to help out the project wherever I can, whether that is coding up new algorithms, improving documentation/descriptions of the techniques, etc.

One algorithm I am interested in working on is a map/reduce version of fitting a Gaussian Mixture Model (GMM) via EM. I noticed at least one person was interested in implementing this for GSoC. I am happy to provide code/ideas if they decide to take it further, whether as part of GSoC or as their own project. Due to commitments to my master's I cannot participate in GSoC myself. The standard non-parallel EM algorithm for GMMs is pretty straightforward to implement, and I think I have a pretty good idea of how to implement a map/reduce version.

As a final note, I was wondering if anyone else will be attending the 2nd UK Hadoop User Group (http://huguk.eventwax.com/hadoop-user-group-uk-2)?

Dan
Re: Introduction
Having you guys work together is entirely in keeping with, and compatible with, both open source ideas and Google Summer of Code ideas. So, Daniel, don't imagine that this idea is taken. Your suggestions and code (parallel or sequential) are highly valued.

2009/4/2 Yifan Wang heavens...@gmail.com:
I am Yifan. Glad to see someone has the same idea here. I have submitted a GSoC 2009 proposal for the Mixture Model via EM.
Re: [VOTE] Mahout 0.1
+1

-Yonik

On Sat, Mar 28, 2009 at 5:49 AM, Grant Ingersoll gsing...@apache.org wrote:
[Take 2. I fixed the NOTICE file, but did not change the artifact generation issue for now.]

Please review and vote for releasing Mahout 0.1. This is our first release and is all new code.

The artifacts are located in: http://people.apache.org/~gsingers/staging-repo/mahout/org/apache/mahout/

The mahout directory contains a tarball/zip of the whole project (for building from source). The core, examples and taste-web directories contain the artifacts for each of those components. The other directories contain various dependencies and artifacts.

Thanks,
Grant