Re: getting the cluster elements from kmeans run

2015-02-11 Thread Suneel Marthi
KMeansModel only returns the cluster centroids.
To get the # of elements in each cluster, try calling kmeans.predict() on each 
of the points in the data used to build the model.
See 
https://github.com/OryxProject/oryx/blob/master/oryx-app-mllib/src/main/java/com/cloudera/oryx/app/mllib/kmeans/KMeansUpdate.java

Look at the fetchClusterCountsFromModel() method.
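
A minimal Scala sketch of that approach (the Java version is analogous; clusterSizes is just an illustrative name):

  import org.apache.spark.mllib.clustering.KMeansModel
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  // Assign every point to its nearest centroid, then count how many
  // points landed on each cluster id.
  def clusterSizes(model: KMeansModel, data: RDD[Vector]): Map[Int, Long] =
    data.map(p => model.predict(p)).countByValue().toMap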


 From: Harini Srinivasan har...@us.ibm.com
 To: user@spark.apache.org 
 Sent: Wednesday, February 11, 2015 12:36 PM
 Subject: getting the cluster elements from kmeans run
   
Hi, 

Is there a way to get the elements of each cluster after running k-means 
clustering? I am using the Java version.



thanks 




Re: K-Means final cluster centers

2015-02-05 Thread Suneel Marthi
There's a kMeansModel.clusterCenters() available if you are looking to get the 
centers from KMeansModel.
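
For example, a quick sketch (assuming an existing RDD[Vector] named data):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vector

  // k = 5 clusters, 20 max iterations; both values are arbitrary here.
  val model = KMeans.train(data, 5, 20)
  val centers: Array[Vector] = model.clusterCenters  // one Vector per cluster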

  From: SK skrishna...@gmail.com
 To: user@spark.apache.org 
 Sent: Thursday, February 5, 2015 5:35 PM
 Subject: K-Means final cluster centers
   
Hi,

I am trying to get the final cluster centers after running the KMeans
algorithm in MLlib in order to characterize the clusters. But the
KMeansModel does not have any public method to retrieve this info. There
appears to be only a private method called clusterCentersWithNorm. I guess
I could call predict() to get the final cluster assignment for the dataset
and write my own code to compute the means based on this final assignment.
But I would like to know if there is a way to get this info from the MLlib API
directly after running KMeans?

thanks 






  

Re: Row similarities

2015-01-17 Thread Suneel Marthi
Andrew, you would be better off using Mahout's RowSimilarityJob for what you are 
trying to accomplish:

 1. It does give you pair-wise distances.
 2. You can specify the distance measure you are looking to use.
 3. There's the old MapReduce impl and the Spark DSL impl, per your preference.
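
If you'd rather stay in Spark, a brute-force sketch of pair-wise distances follows (not RowSimilarityJob, just the naive cartesian approach, so it generates O(n^2) pairs and is only practical for modest row counts; pairwiseDistances is an illustrative name):

  import org.apache.spark.mllib.linalg.{Vector, Vectors}
  import org.apache.spark.rdd.RDD

  // Euclidean distance between every unordered pair of indexed rows.
  def pairwiseDistances(rows: RDD[(Long, Vector)]): RDD[((Long, Long), Double)] =
    rows.cartesian(rows)
      .filter { case ((i, _), (j, _)) => i < j }  // keep each pair once
      .map { case ((i, u), (j, v)) =>
        ((i, j), math.sqrt(Vectors.sqdist(u, v)))  // sqdist = squared distance
      }

  // usage: pairwiseDistances(vectors.zipWithIndex.map(_.swap))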

  From: Andrew Musselman andrew.mussel...@gmail.com
 To: Reza Zadeh r...@databricks.com 
Cc: user user@spark.apache.org 
 Sent: Saturday, January 17, 2015 11:29 AM
 Subject: Re: Row similarities
   
Thanks Reza, interesting approach.  I think what I actually want is to 
calculate pair-wise distance, on second thought.  Is there a pattern for that?


On Jan 16, 2015, at 9:53 PM, Reza Zadeh r...@databricks.com wrote:


You can use K-means with a suitably large k. Each cluster should correspond to 
rows that are similar to one another.
On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman andrew.mussel...@gmail.com 
wrote:

What's a good way to calculate similarities between all vector-rows in a matrix 
or RDD[Vector]?

I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm going 
down a good path to transpose a matrix in order to run that.
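
(For reference, the columnSimilarities call mentioned above is a one-liner on a RowMatrix; it returns cosine similarities between columns as an upper-triangular CoordinateMatrix, which is why row similarities would indeed need a transpose first. A sketch, assuming rows: RDD[Vector]:)

  import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, RowMatrix}

  val mat = new RowMatrix(rows)
  val sims: CoordinateMatrix = mat.columnSimilarities()  // cosine, upper triangular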





  

Re: Clustering text data with MLlib

2014-12-29 Thread Suneel Marthi
Here's the Streaming KMeans from Spark 1.2:
http://spark.apache.org/docs/latest/mllib-clustering.html#examples-1
Streaming KMeans still needs an initial 'k' to be specified; it then progresses 
to come up with an optimal 'k', IIRC.
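
A minimal sketch along the lines of that example (assumes an existing StreamingContext named ssc and a directory of whitespace-delimited vector text files; the path is made up):

  import org.apache.spark.mllib.clustering.StreamingKMeans
  import org.apache.spark.mllib.linalg.Vectors

  val training = ssc.textFileStream("/data/kmeans/train")
    .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

  val model = new StreamingKMeans()
    .setK(10)                     // 'k' still has to be chosen up front
    .setDecayFactor(1.0)          // 1.0 = weight all past data equally
    .setRandomCenters(3, 0.0)     // dimension 3, initial center weight 0.0

  model.trainOn(training)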

  From: Sean Owen so...@cloudera.com
 To: jatinpreet jatinpr...@gmail.com 
Cc: user@spark.apache.org user@spark.apache.org 
 Sent: Monday, December 29, 2014 6:25 AM
 Subject: Re: Clustering text data with MLlib
   
You can try several values of k, apply some evaluation metric to the
clustering, and then use that to decide what k is best, or at least
pretty good. If it's a completely unsupervised problem, the metrics
you can use tend to be some function of the inter-cluster and
intra-cluster distances (good clustering means points are near to
things in their own cluster and far from things in other clusters).
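
For example, with MLlib you could sweep k and compare the within-set sum of
squared errors via KMeansModel.computeCost, then pick the 'elbow' where the
cost stops dropping sharply (a rough sketch; sweepK is an illustrative name):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  // Train one model per candidate k and report its WSSSE.
  def sweepK(data: RDD[Vector], ks: Seq[Int]): Seq[(Int, Double)] =
    ks.map { k =>
      val model = KMeans.train(data, k, 20)  // 20 iterations per k
      (k, model.computeCost(data))
    }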

If it's a supervised problem, you can bring in things like purity or
mutual information, but I don't think that's the case here. You would
have to implement these metrics yourself.

You can consider clustering algorithms that do not depend on k, like
say DBSCAN. Although this has its own different hyperparameter to
pick. Again you'd have to implement it yourself.

What you describe sounds like topic modeling using LDA. This still
requires you to pick a number of topics, but lets documents belong to
several topics. Maybe that's more like what you want. This isn't in
Spark per se but there is some work done on it
(https://issues.apache.org/jira/browse/SPARK-1405) and Sandy has
written up some text on doing this in Spark.

Finally there is the Hierarchical Dirichlet process which does allow
for the number of topics to be learned dynamically. This is relatively
advanced.

Finally finally, maybe someone can remind me of the streaming k-means
variant that tries to pick k dynamically too. I am not finding what
I'm thinking of but think this exists.

On Mon, Dec 29, 2014 at 10:55 AM, jatinpreet jatinpr...@gmail.com wrote:
 Hi,

 I wish to cluster a set of textual documents into undefined number of
 classes. The clustering algorithm provided in MLlib i.e. K-means requires me
 to give a pre-defined number of classes.

 Is there any algorithm which is intelligent enough to identify how many
 classes should be made based on the input documents? I want to utilize the
 speed and agility of Spark in the process.

 Thanks,
 Jatin



 -
 Novice Big Data Programmer



   

Re: K-means faster on Mahout then on Spark

2014-03-25 Thread Suneel Marthi
Mahout does have a k-means which can be executed in both MapReduce and 
iterative modes.

Sent from my iPhone

 On Mar 25, 2014, at 9:25 AM, Prashant Sharma scrapco...@gmail.com wrote:
 
 I think Mahout uses FuzzyKMeans, which is a different algorithm and is not 
 iterative. 
 
 Prashant Sharma
 
 
 On Tue, Mar 25, 2014 at 6:50 PM, Egor Pahomov pahomov.e...@gmail.com wrote:
 Hi, I'm running a benchmark which compares Mahout and SparkML. So far I have 
 the following results for k-means:
 Number of iterations = 10, number of elements = 1000, Mahout time = 602, Spark time = 138
 Number of iterations = 40, number of elements = 1000, Mahout time = 1917, Spark time = 330
 Number of iterations = 70, number of elements = 1000, Mahout time = 3203, Spark time = 388
 Number of iterations = 10, number of elements = 1, Mahout time = 1235, Spark time = 2226
 Number of iterations = 40, number of elements = 1, Mahout time = 2755, Spark time = 6388
 Number of iterations = 70, number of elements = 1, Mahout time = 4107, Spark time = 10967
 Number of iterations = 10, number of elements = 10, Mahout time = 7070, Spark time = 25268
 
 Times are in seconds. It runs on a YARN cluster with about 40 machines. Elements 
 for clustering are randomly generated. When I changed the persistence level 
 from MEMORY_ONLY to MEMORY_AND_DISK, Spark started to work faster on big data.
 
 What am I missing?
 
 See my benchmarking code in the attachment.
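 
 (For reference, the persistence change described above is roughly this, assuming
 an RDD[Vector] named points:)
 
   import org.apache.spark.mllib.clustering.KMeans
   import org.apache.spark.storage.StorageLevel
 
   // Cache the input across k-means iterations; allowing spill to disk
   // avoids recomputation when the data does not fit in memory.
   points.persist(StorageLevel.MEMORY_AND_DISK)
   val model = KMeans.train(points, 10, 10)  // k = 10, 10 iterations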
 
 
 -- 
 Sincerely yours
 Egor Pakhomov
 Scala Developer, Yandex