Re: How to run kmeans after pca?

2014-09-30 Thread st553
Thanks for your response, Burak; it was very helpful.

I am noticing that if I run PCA before KMeans, the KMeans algorithm actually
takes longer to run than if I had just run KMeans without PCA. I was hoping
that running PCA first would speed KMeans up.

I have followed the steps you've outlined, but I'm wondering if I need to
cache/persist the RDD[Vector] rows of the RowMatrix returned after
multiplying. Something like:

val newData: RowMatrix = data.multiply(bcPrincipalComponents.value)
val cachedRows = newData.rows.persist()
val model = KMeans.train(cachedRows, k, maxIterations)  // k, maxIterations defined elsewhere
cachedRows.unpersist()

It doesn't seem intuitive to me that a lower-dimensional version of my data
set would take longer for KMeans... unless I'm missing something?

Thanks!







Re: How to run kmeans after pca?

2014-09-30 Thread Evan R. Sparks
Caching after doing the multiply is a good idea. Keep in mind that during
the first iteration of KMeans, the cached rows haven't yet been
materialized, so that iteration is doing both the multiply and the first
pass of KMeans at once. To isolate which part is slow, you can run an action
such as cachedRows.count() to force the projected rows to be materialized
before you run KMeans.
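For instance (a rough sketch reusing the names from your snippet; k and
maxIterations are placeholders):

val cachedRows = newData.rows.persist()
cachedRows.count()  // action: forces the multiply to run and the cache to fill
val model = KMeans.train(cachedRows, k, maxIterations)  // now times only KMeans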

Also, KMeans is optimized to run quickly on both sparse and dense data. The
result of PCA is going to be dense, but if your input data has #nnz ~=
size(pca data), performance might be about the same. (I haven't actually
verified this last point.)
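If you want to check, here is a rough sketch for estimating the average #nnz
per row of your input (assuming data is a RowMatrix over possibly sparse
vectors):

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}

val avgNnz = data.rows.map {
  case sv: SparseVector => sv.values.length.toDouble  // stored entries only
  case dv: DenseVector  => dv.values.count(_ != 0).toDouble
}.mean()
// compare avgNnz to the number of principal components you keep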

Finally, speed partially depends on how much data you have relative to
scheduler overheads. If your input data is small, the cost of distributing
your tasks can exceed the time spent actually computing; this usually
manifests as stages taking about the same amount of time even though the
datasets you pass have different dimensionality.



How to run kmeans after pca?

2014-09-17 Thread st553
I would like to reduce the dimensionality of my data before running kmeans.
The problem I'm having is that RowMatrix.computePrincipalComponents() and
RowMatrix.computeSVD() return local matrices, whereas KMeans.train()
requires an RDD[Vector]. Does MLlib provide a way to do this conversion?
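
For reference, the approach discussed in the replies above: RowMatrix.multiply
takes a local Matrix and returns another RowMatrix, whose rows field is
exactly the RDD[Vector] that KMeans.train() needs. A rough sketch (assuming
rows: RDD[Vector] holds the original data; the k values are placeholders):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)                        // rows: RDD[Vector]
val pc: Matrix = mat.computePrincipalComponents(20)  // local n x 20 matrix
val projected: RowMatrix = mat.multiply(pc)          // distributed projection
val model = KMeans.train(projected.rows.cache(), 10, 20)  // k = 10, 20 iterations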


