Re: [Edit] Approach for Clustering Data

2014-02-18 Thread Bikash Gupta
Ted/Peter, Thanks for the response. This is exactly what I am trying to achieve. Maybe I was not able to put my questions clearly. I am clustering on a few variables of Customer/User (except their customer_id/user_id) and storing the customer_id/user_id list in a separate place. Question) What is

Re: [Edit] Approach for Clustering Data

2014-02-18 Thread Suneel Marthi
On Tuesday, February 18, 2014 3:37 AM, Bikash Gupta bikash.gupt...@gmail.com wrote: Ted/Peter, Thanks for the response. This is exactly what I am trying to achieve. Maybe I was not able to put my questions clearly. I am clustering on a few variables of Customer/User (except their

Re: [Edit] Approach for Clustering Data

2014-02-18 Thread Bikash Gupta
Suneel, Thanks for the information. I am using 0.7 packaged with CDH. On Tue, Feb 18, 2014 at 2:14 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: On Tuesday, February 18, 2014 3:37 AM, Bikash Gupta bikash.gupt...@gmail.com wrote: Ted/Peter, Thanks for the response. This is

Re: [Edit] Approach for Clustering Data

2014-02-18 Thread Ted Dunning
Bikash, Don't use that version. Use a more recent release. We can't help that Cloudera has an old version. On Tue, Feb 18, 2014 at 1:26 AM, Bikash Gupta bikash.gupt...@gmail.com wrote: Suneel, Thanks for the information. I am using 0.7 packaged with CDH. On Tue, Feb 18, 2014 at 2:14

Re: [Edit] Approach for Clustering Data

2014-02-18 Thread Bikash Gupta
Yeah Ted, it seems there is a major change in 0.9. In 0.9 I found that clusteredPoint data is written as Pair<Key,Vector> rather than only Vector. It's good. Thanks to everyone for answering an unframed question correctly :) On Tue, Feb 18, 2014 at 7:36 PM, Ted Dunning ted.dunn...@gmail.com

RE: seqdumper output?

2014-02-18 Thread Allen, Ronald L.
Hello again, and sorry to bother you with this once again, I'm having a bit of trouble. My CSV files are just full of numbers (doubles). Each line looks something like this: 2.4135,1.1120. I'm not sure if this makes a big difference. But when I try to do step #2, I can't seem to figure out

Re: [Edit] Approach for Clustering Data

2014-02-18 Thread Sean Owen
FYI, CDH5 includes version 0.8 + patches. But 0.9 should work fine with CDH4. You do have to build with the Hadoop 2.x profile, as usual. On Tue, Feb 18, 2014 at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote: Bikash, Don't use that version. Use a more recent release. We can't help that

Re: [Edit] Approach for Clustering Data

2014-02-18 Thread Bikash Gupta
Thanks Sean. I will check how to support 0.9 with CDH4. However 0.9 has solved my problem. On Tue, Feb 18, 2014 at 7:45 PM, Sean Owen sro...@gmail.com wrote: FYI, CDH5 includes version 0.8 + patches. But 0.9 should work fine with CDH4. You do have to build with the Hadoop 2.x profile, as

Mahout 0.8, Hadoop 1.2.1 and Lucene version

2014-02-18 Thread Stamatis Rapanakis
I am trying to run an example and get the following error: Feb 18, 2014 4:31:28 PM org.apache.hadoop.mapred.LocalJobRunner$Job run WARNING: job_local_0001 *java.lang.NoSuchFieldError: LUCENE_43* at org.apache.mahout.common.lucene.AnalyzerUtils.createAnalyzer(AnalyzerUtils.java:35) at

Re: Mahout 0.8, Hadoop 1.2.1 and Lucene version

2014-02-18 Thread Suneel Marthi
You definitely don't have to mess with the Hadoop source. On Tuesday, February 18, 2014 10:28 AM, Stamatis Rapanakis stamrapana...@gmail.com wrote: I am trying to run an example and get the following error: Feb 18, 2014 4:31:28 PM org.apache.hadoop.mapred.LocalJobRunner$Job run WARNING:

Re: reduce is too slow in StreamingKmeans

2014-02-18 Thread Suneel Marthi
Streaming KMeans runs with a single reducer that runs Ball KMeans, hence the slow performance that you have been experiencing. How did you come up with -km 63000? Given that you would like 10000 clusters (= k) and have 2,000,000 datapoints (= n), k * ln(n) = 10000 * ln(2 * 10^6) = 145087
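The k * ln(n) estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic, not part of Mahout; the function name estimate_km is made up for illustration. Note that the quoted result of 145087 corresponds to k = 10000 with n = 2,000,000.

```python
import math


def estimate_km(k: int, n: int) -> int:
    """Rough size for the -km option, using the k * ln(n) rule quoted above.

    k: number of final clusters wanted
    n: number of input data points
    (estimate_km is a hypothetical helper name, not a Mahout API.)
    """
    if k <= 0 or n <= 0:
        raise ValueError("k and n must be positive")
    return round(k * math.log(n))


if __name__ == "__main__":
    # 10000 clusters over 2,000,000 points, as in the message above
    print(estimate_km(10000, 2_000_000))
```

Under this rule the -km 63000 value asked about is well below the estimate for these inputs.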

Apache Mahout 0.9 released

2014-02-18 Thread Suneel Marthi
The Apache Mahout PMC is pleased to announce the release of Mahout 0.9. Mahout's goal is to build scalable machine learning libraries focused primarily in the areas of collaborative filtering (recommenders), clustering and classification (known collectively as the 3Cs), as well as the necessary

Mahout on Spark?

2014-02-18 Thread Ying Liao
I just wonder what the future of Mahout is. We are seeing new stuff from 0xdata and Skytree, and Spark is also designed for in-memory iterative analysis. What about Mahout? Will Mahout run on top of Spark in the future? Thanks, Ying Liao

Re: Mahout on Spark?

2014-02-18 Thread Harshit Bapna
I am very eager to know the same from the community. Thanks for bringing it up. --Harshit On Tue, Feb 18, 2014 at 1:08 PM, Ying Liao yliao...@gmail.com wrote: I just wonder what the future of Mahout is. We are seeing new stuff from 0xdata and Skytree, and Spark is also designed for in-memory

Re: Mahout on Spark?

2014-02-18 Thread Mohit Singh
In general, if you are interested in machine learning, I think there is already a machine-learning-specific initiative on Spark called MLbase (http://www.mlbase.org/) and GraphX (http://amplab.github.io/graphx/) for GraphLab-style ML. On Tue, Feb 18, 2014 at 1:14 PM, Harshit Bapna

Re: Mahout on Spark?

2014-02-18 Thread Nick Pentreath
Spark provides a lower-level ML library called MLlib. MLI / MLBase is built on top of this and includes some high-level abstractions similar in nature to distributed matrices / dataframes. But it's still pretty new and rough at this point (https://github.com/amplab/MLI). MLlib already provides (

Re: Mahout on Spark?

2014-02-18 Thread Ted Dunning
On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath nick.pentre...@gmail.com wrote: My (admittedly heavily biased) view is Spark is a superior platform overall for ML. If the two communities can work together to leverage the strengths of Spark, and the large amount of good stuff in Mahout (as well

Re: Mahout on Spark?

2014-02-18 Thread Dmitriy Lyubimov
Yes, this is a popular initiative. On Tue, Feb 18, 2014 at 1:08 PM, Ying Liao yliao...@gmail.com wrote: I just wonder what the future of Mahout is. We are seeing new stuff from 0xdata and Skytree, and Spark is also designed for in-memory iterative analysis. What about Mahout? Will Mahout run on

Re: Mahout on Spark?

2014-02-18 Thread Nick Pentreath
I know the Spark/MLlib devs can occasionally be quite set in their ways of doing certain things, but we'd welcome as many Mahout devs as possible to work together. It may be too late, but perhaps a GSoC project to look at a port of some stuff like the co-occurrence recommender and streaming k-means?

Re: Mahout on Spark?

2014-02-18 Thread Sebastian Schelter
I'm also convinced that Spark is a superior platform for executing distributed ML algorithms. We've had a discussion about a change from Hadoop to another platform some time ago, but at that point in time it was not clear which of the upcoming dataflow processing systems (Spark, Hyracks,