Introduction

2009-04-02 Thread Daniel Nee
Hi all,

I've been following Hadoop and the Mahout project for a while now and
I thought I should introduce myself. I'm Daniel Nee, I am a master's
student at University College London studying Computational Statistics
and Machine Learning. Before that I did my undergraduate in Computer
Science at the University of Southampton. I'm keen to help out the
project wherever I can, whether that is coding up new algorithms,
improving documentation/descriptions of the techniques, etc.

One algorithm I am interested in working on is a map/reduce version of
fitting a Gaussian Mixture Model (GMM) via EM. I noticed at least one
person was interested in implementing this for GSoC. I am happy to
provide code/ideas if they do decide to take it further, whether as
part of GSoC or as their own project. Due to commitments to my
master's I cannot participate in GSoC myself. The standard
non-parallel EM algorithm for GMMs is pretty straightforward to
implement and I think I have a pretty good idea on how to implement a
map/reduce version.
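
For illustration only (this is not code from the thread or from Mahout), here is a minimal single-machine sketch of how one EM iteration for a 1-D GMM might split into map and reduce steps. It assumes the usual decomposition: each mapper computes responsibilities for its points and emits per-component sufficient statistics, the reducer sums them, and the M-step turns the sums into new parameters. The names map_point, combine, m_step and em_iteration, and the toy data, are hypothetical.

```python
import math
from functools import reduce


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def map_point(x, params):
    """Map step (E-step for one point).

    params is a list of (weight, mean, variance) tuples, one per component.
    Emits, per component, the sufficient statistics (r, r*x, r*x^2), where r
    is the responsibility of that component for x. Assumes x is not so far
    from every component that all densities underflow to zero.
    """
    weighted = [w * gaussian_pdf(x, m, v) for (w, m, v) in params]
    total = sum(weighted)
    return [(p / total, p / total * x, p / total * x * x) for p in weighted]


def combine(a, b):
    """Reduce step: sufficient statistics simply add, component by component."""
    return [(a0 + b0, a1 + b1, a2 + b2)
            for (a0, a1, a2), (b0, b1, b2) in zip(a, b)]


def m_step(stats, n_points, min_var=1e-6):
    """M-step: turn summed statistics into new (weight, mean, variance) tuples.

    min_var is a floor that keeps a component's variance from collapsing to zero.
    """
    params = []
    for r_sum, rx_sum, rxx_sum in stats:
        mean = rx_sum / r_sum
        var = max(rxx_sum / r_sum - mean * mean, min_var)
        params.append((r_sum / n_points, mean, var))
    return params


def em_iteration(data, params):
    """One full EM iteration expressed as map -> reduce -> M-step."""
    stats = reduce(combine, (map_point(x, params) for x in data))
    return m_step(stats, len(data))


# Toy data: two well-separated clusters around -2 and +2.
data = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]
params = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]
for _ in range(20):
    params = em_iteration(data, params)
```

Because the statistics are plain sums, the combine function could also serve as a combiner on each mapper, so only one small tuple per component would need to cross the network per iteration.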

As a final note, I was wondering if anyone else will be attending the
2nd UK Hadoop User Group
(http://huguk.eventwax.com/hadoop-user-group-uk-2)?

Dan


Re: gsoc , EM or SVM?

2009-04-02 Thread Yifan Wang
Hi

I decided to go with the mixture model for EM.
I have modified my proposal and submitted it on both the GSoC website and the Apache wiki.

Best Regards
Yifan

2009/4/1 Yifan Wang heavens...@gmail.com:
 I will choose Mixture Model for the EM implementation.

 Yifan

 2009/4/1 Ted Dunning ted.dunn...@gmail.com:
 Yifan,

 EM is a highly non-specific term and covers a huge range of very different
 algorithms.  For example, pLSI, HMMs, and mixture models can all be
 estimated using EM.

 What exactly did you mean to address with an EM implementation?

 On Wed, Apr 1, 2009 at 1:05 PM, Grant Ingersoll gsing...@apache.org wrote:

 Hi Yifan,

 I think both are good candidates, although AIUI, SVM is a bit harder to
 parallelize, so maybe it would make sense to focus on EM.  Of course, we
 don't have to be distributed, so you could propose a non-distributed SVM
 implementation as a first cut and then work on the distributed part as the
 project develops.

 ...


 For EM, it is a generalization of the k-means algorithm, and we already
 have k-means in the Mahout library.






Re: Introduction

2009-04-02 Thread Yifan Wang
Hi Daniel

I am Yifan. Glad to see someone has the same idea here.
I have submitted a GSoC 2009 proposal for the mixture model
via EM. I have basic knowledge of mixture models and the
EM algorithm, and I hope to discuss the algorithm with you
in the future.

Best Regards
Yifan



-----Original Message-----
From: Daniel Nee [mailto:nee.dan...@googlemail.com]
Sent: 2009-04-02 16:53
To: mahout-dev@lucene.apache.org
Subject: Introduction

Hi all,

I've been following Hadoop and the Mahout project for a while now and
I thought I should introduce myself. I'm Daniel Nee, I am a master's
student at University College London studying Computational Statistics
and Machine Learning. Before that I did my undergraduate in Computer
Science at the University of Southampton. I'm keen to help out the
project wherever I can, whether that is coding up new algorithms,
improving documentation/descriptions of the techniques, etc.

One algorithm I am interested in working on is a map/reduce version of
fitting a Gaussian Mixture Model (GMM) via EM. I noticed at least one
person was interested in implementing this for GSoC. I am happy to
provide code/ideas if they do decide to take it further, whether as
part of GSoC or as their own project. Due to commitments to my
master's I cannot participate in GSoC myself. The standard
non-parallel EM algorithm for GMMs is pretty straightforward to
implement and I think I have a pretty good idea on how to implement a
map/reduce version.

As a final note, I was wondering if anyone else will be attending the
2nd UK Hadoop User Group
(http://huguk.eventwax.com/hadoop-user-group-uk-2)?

Dan



Re: Introduction

2009-04-02 Thread Ted Dunning
Having you guys work together is entirely in keeping with both
open source and Google Summer of Code ideas.

So, Daniel, don't imagine that this idea is taken.  Your suggestions and
code (parallel or sequential) are highly valued.

2009/4/2 Yifan Wang heavens...@gmail.com


 I am Yifan. Glad to see someone has the same idea here.
 I have submitted a GSoC 2009 proposal for the mixture model
 via EM.


Re: [VOTE] Mahout 0.1

2009-04-02 Thread Yonik Seeley
+1

-Yonik

On Sat, Mar 28, 2009 at 5:49 AM, Grant Ingersoll gsing...@apache.org wrote:
 [Take 2.  I fixed the NOTICE file, but did not change the artifact
 generation issue for now.]

 Please review and vote for releasing Mahout 0.1.  This is our first release
 and is all new code.

 The artifacts are located in:
 http://people.apache.org/~gsingers/staging-repo/mahout/org/apache/mahout/

 The mahout directory contains a tarball/zip of the whole project (for
 building from source)
 The core, examples and taste-web directories contain the artifacts for each
 of those components.
 The other directories contain various dependencies and artifacts.


 Thanks,
 Grant