Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread RJ Nowling
with LSH. https://issues.apache.org/jira/browse/SPARK-2966 If you have designed the standardized clustering algorithms API, please let me know. Best, Yu Ishikawa

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread Jeremy Freeman

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread RJ Nowling

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-13 Thread Yu Ishikawa
, please let me know. Best, Yu Ishikawa

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-12 Thread RJ Nowling

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-19 Thread Jeremy Freeman
Hi RJ, that sounds like a great idea. I'd be happy to look over what you put together. -- Jeremy

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-18 Thread RJ Nowling
, if useful. -- Jeremy

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-17 Thread Jeremy Freeman
work on this piece and / or have you use this as a jumping off point, if useful. -- Jeremy

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-10 Thread RJ Nowling
I went ahead and created JIRAs. JIRA for Hierarchical Clustering: https://issues.apache.org/jira/browse/SPARK-2429 JIRA for Standardized Clustering APIs: https://issues.apache.org/jira/browse/SPARK-2430 Before submitting a PR for the standardized API, I want to implement a few clustering

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-10 Thread Nick Pentreath
Might be worth checking out scikit-learn and Mahout to get some broad ideas.

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-09 Thread RJ Nowling
Thanks everyone for the input. So it seems what people want is: * Implement MiniBatch KMeans and Hierarchical KMeans (divide-and-conquer approach, look at the DecisionTree implementation as a reference) * Restructure 3 KMeans clustering algorithm implementations to prevent code duplication and

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-09 Thread Nick Pentreath
Cool, seems like a good initiative. Adding a couple of extra high-quality clustering implementations will be great. I'd say it would make most sense to submit a PR for the Standardised API first, agree on that with everyone, and then build on it for the specific implementations.

Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
Hi all, MLlib currently has one clustering algorithm implementation, KMeans. It would benefit from having implementations of other clustering algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical Clustering, and Affinity Propagation. I recently submitted a PR [1] for a MiniBatch

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
I would say for big data applications the most useful would be hierarchical k-means with backtracking and the ability to support k nearest centroids.

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
Thanks, Hector! Your feedback is useful.

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Dmitriy Lyubimov
Hector, could you share the references for hierarchical k-means? Thanks.

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Sandy Ryza
Having a common framework for clustering makes sense to me. While we should be careful about what algorithms we include, having solid implementations of minibatch clustering and hierarchical clustering seems like a worthwhile goal, and we should reuse as much code and APIs as reasonable.

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
No idea, never looked it up. Always just implemented it as doing k-means again on each cluster. FWIW, standard k-means with Euclidean distance has problems too with some dimensionality reduction methods. Swapping out the distance metric with negative dot product or cosine may help. Other more useful
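The metric swap mentioned above can be sketched as a pluggable distance function in the assignment step of k-means. This is only an illustration in plain Python, not MLlib code; the function names (`euclidean`, `negative_dot`, `cosine_distance`, `nearest_centroid`) are hypothetical and chosen for this sketch.

```python
import math

def euclidean(a, b):
    # Standard Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_dot(a, b):
    # Negative dot product: a larger dot product means "closer".
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity; zero vectors are treated as maximally distant.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def nearest_centroid(point, centroids, distance=euclidean):
    # Index of the centroid that minimizes the chosen distance.
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

# Hypothetical data: the point lies in the direction of centroid 0
# but is closer to centroid 1 in Euclidean terms.
centroids = [[1.0, 0.0], [8.0, 8.0]]
point = [10.0, 1.0]
by_euclidean = nearest_centroid(point, centroids, euclidean)
by_cosine = nearest_centroid(point, centroids, cosine_distance)
by_negative_dot = nearest_centroid(point, centroids, negative_dot)
```

Only the assignment step is shown. The arithmetic mean is the minimizer for squared Euclidean distance, so a full implementation with another metric would probably need a different centroid update as well (for example, re-normalizing centroids when using cosine).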

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Dmitriy Lyubimov
Sure. A more interesting problem here is choosing k at each level. Kernel methods seem to be most promising.

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
The scikit-learn implementation may be of interest: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward It's a bottom-up approach. The pair of clusters to merge is chosen to minimize variance. Their code is under a BSD license so it can be used

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
K doesn't matter much. I've tried anything from 2^10 to 10^3 and the performance doesn't change much as measured by precision @ K (see table 1, http://machinelearning.wustl.edu/mlpapers/papers/weston13). Although 10^3 kmeans did outperform 2^10 hierarchical SVD slightly in terms of the metrics,

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
No, I was thinking more top-down: assuming a distributed k-means system already exists, recursively apply the k-means algorithm to data already partitioned by the previous level of k-means. I haven't been much of a fan of bottom-up approaches like HAC, mainly because they assume there is already a
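The top-down idea described above (run k-means, then run it again inside each resulting cluster) can be sketched as follows. This is a single-machine illustration in plain Python under stated assumptions, not the distributed MLlib implementation discussed in the thread; all names (`kmeans`, `hierarchical_kmeans`, the dict-based tree nodes) are hypothetical, and the inner k-means is a basic Lloyd iteration with random initialization.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    # Component-wise mean of a non-empty list of vectors.
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def assign(points, centroids):
    # Label each point with the index of its nearest centroid.
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, iterations=20, seed=0):
    # Basic Lloyd iterations; assumes len(points) >= k.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = mean(members)
    return centroids, assign(points, centroids)

def hierarchical_kmeans(points, k, depth):
    # Build a tree by recursively re-clustering each partition.
    node = {"centroid": mean(points), "size": len(points), "children": []}
    if depth == 0 or len(points) <= k:
        return node
    _, labels = kmeans(points, k)
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        # Skip empty clusters and splits that did not separate anything.
        if members and len(members) < len(points):
            node["children"].append(hierarchical_kmeans(members, k, depth - 1))
    return node

# Hypothetical data: two well-separated groups of three points each.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
tree = hierarchical_kmeans(data, k=2, depth=1)
```

Each level only touches the points that fell into its parent cluster, which is what makes the approach a natural fit for data already partitioned by the previous level.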

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Evan R. Sparks
If you're thinking along these lines, have a look at the DecisionTree implementation in MLlib. It uses the same idea and is optimized to prevent multiple passes over the data by computing several splits at each level of tree building. The tradeoff is increased model state and computation per pass

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
Yeah, if one were to replace the objective function in the decision tree with minimizing the variance of the leaf nodes, it would be a hierarchical clusterer.

Re: Contributing to MLlib on GLM

2014-07-07 Thread xwei

Re: Contributing to MLlib

2014-07-03 Thread salexln

process for contributing to mllib

2014-07-02 Thread Eustache DIEMERT
Hi there, I just created an issue [1] for MLlib on JIRA. I also want to contribute a fix. Is it a good idea to submit a PR on GitHub [2]? Should I also mention the issue on this list? Thanks, Eustache [1] https://issues.apache.org/jira/browse/SPARK-2341 [2]

Re: process for contributing to mllib

2014-07-02 Thread Reynold Xin
Yes, it would be great to mention the JIRA ticket number on the pull request. Thanks!

Re: Contributing to MLlib

2014-07-02 Thread salexln
guys??? anyone???

Re: Contributing to MLlib

2014-07-02 Thread Evan R. Sparks

Re: Contributing to MLlib

2014-07-02 Thread salexln
Thanks for the response! That's exactly the way I wanted to implement it :) I will create a JIRA ticket and a pull request.

Re: Contributing to MLlib

2014-07-02 Thread salexln
I opened a JIRA (https://issues.apache.org/jira/browse/SPARK-2344) and a pull request for this (https://github.com/salexln/spark/pull/1)

Re: Contributing to MLlib

2014-07-02 Thread RJ Nowling

Contributing to MLlib

2014-06-30 Thread salexln

Re: Contributing to MLlib on GLM

2014-06-30 Thread Gang Bai

Re: Contributing to MLlib on GLM

2014-06-27 Thread 白刚

Re: Contributing to MLlib on GLM

2014-06-26 Thread xwei
Yes, that's what we did: adding two gradient functions to Gradient.scala and creating PoissonRegression and GammaRegression using these gradients. We made a PR on this.

Re: Contributing to MLlib on GLM

2014-06-25 Thread Sung Hwan Chung
Well, as you said, MLlib already supports GLM in a sense, except that it only supports two link functions: identity (linear regression) and logit (logistic regression). It should not be too hard to add other link functions, as all you have to do is add a different gradient function for Poisson/Gamma,
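To illustrate why adding a link function mostly comes down to adding a gradient: for a GLM with its canonical link, the per-example gradient of the negative log-likelihood is (mean(w·x) - y) * x, where the mean function is the inverse link. The sketch below shows this for the identity, logit, and log (Poisson) links. It is plain Python with hypothetical names, not the interface of MLlib's Gradient.scala, and it does not cover the Gamma case, whose commonly used log link is not canonical.

```python
import math

def dot(w, x):
    # Inner product of the weight vector and the feature vector.
    return sum(wi * xi for wi, xi in zip(w, x))

def identity_mean(eta):
    # Inverse of the identity link (linear regression).
    return eta

def logistic_mean(eta):
    # Inverse of the logit link (logistic regression).
    return 1.0 / (1.0 + math.exp(-eta))

def poisson_mean(eta):
    # Inverse of the log link (Poisson regression).
    return math.exp(eta)

def glm_gradient(w, x, y, inverse_link):
    # Per-example gradient of the negative log-likelihood for a
    # canonical-link GLM: (predicted mean - label) * features.
    residual = inverse_link(dot(w, x)) - y
    return [residual * xi for xi in x]

# Hypothetical example: zero weights, one observation.
w = [0.0, 0.0]
x = [1.0, 2.0]
grad_linear = glm_gradient(w, x, 3.0, identity_mean)
grad_logistic = glm_gradient(w, x, 1.0, logistic_mean)
grad_poisson = glm_gradient(w, x, 3.0, poisson_mean)
```

With this shape, the optimizer stays the same across models and only the `inverse_link` argument changes.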

Contributing to MLlib on GLM

2014-06-17 Thread Xiaokai Wei
Hi, I am an intern at PalantirTech and we are building some stuff on top of MLlib. In particular, GLM is of great interest to us. Though GeneralizedLinearModel in MLlib 1.0.0 has some important GLMs such as Logistic Regression, Linear Regression, some other important GLMs like Poisson Regression

Re: Contributing to MLlib on GLM

2014-06-17 Thread Sandy Ryza
Hi Xiaokai, I think MLlib is definitely interested in supporting additional GLMs. I'm not aware of anybody working on this at the moment. -Sandy

Re: Contributing to MLlib on GLM

2014-06-17 Thread Andrew Ash
Hi Xiaokai, also take a look through Xiangrui's slides from HadoopSummit a few weeks back: http://www.slideshare.net/xrmeng/m-llib-hadoopsummit The roadmap starting at slide 51 will probably be interesting to you. Andrew