SPARK-7879 <https://issues.apache.org/jira/browse/SPARK-7879> seems to address your use case: running KMeans on a DataFrame and having the results added as an additional column.
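
Here is a rough sketch of what that could look like, assuming a build that includes the spark.ml KMeans estimator from that ticket (I believe it is targeted at 1.5, so it will not work on 1.4). The helper name clusterAndAnnotate, the featureCols and k parameters, and the "features" and "cluster" column names are placeholders of mine, not anything from your code. VectorAssembler builds the vector column from named numeric columns, which also avoids the hand-written getDouble calls.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

// Hypothetical helper: featureCols and k are placeholders supplied by the caller.
def clusterAndAnnotate(df: DataFrame, featureCols: Array[String], k: Int): DataFrame = {
  // Pack the chosen numeric columns into a single vector column named "features".
  val assembler = new VectorAssembler()
    .setInputCols(featureCols)
    .setOutputCol("features")
  val withFeatures = assembler.transform(df)

  // Fit KMeans on the vector column; predictions go into a new "cluster" column.
  val kmeans = new KMeans()
    .setK(k)
    .setFeaturesCol("features")
    .setPredictionCol("cluster")
  val model = kmeans.fit(withFeatures)

  // transform keeps every original column and appends the cluster assignment.
  model.transform(withFeatures)
}
```

Since transform returns the input columns plus the prediction column, the domain information and the cluster id end up in the same row with no join needed.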

On Wed, Jul 1, 2015 at 5:53 PM, Eric Friedman <eric.d.fried...@gmail.com> wrote:

> In preparing a DataFrame (spark 1.4) to use with MLlib's kmeans.train
> method, is there a cleaner way to create the Vectors than this?
>
> data.map{r => Vectors.dense(r.getDouble(0), r.getDouble(3),
> r.getDouble(4), r.getDouble(5), r.getDouble(6))}
>
> Second, once I train the model and call predict on my vectorized dataset,
> what's the best way to relate the cluster assignments back to the original
> data frame?
>
> That is, I started with df1, which has a bunch of domain information in
> each row and also the doubles I use to cluster. I vectorize the doubles
> and then train on them. I use the resulting model to predict clusters for
> the vectors. I'd like to look at the original domain information in light
> of the clusters to which they are now assigned.
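
If you need to stay on 1.4 with the RDD-based MLlib API, one possible approach is to predict inside a map over the original rows, so that each Row stays paired with its cluster id. This is only a sketch: clusterRows, idx, k and maxIter are names I made up for illustration, and it assumes every indexed column really is a non-null Double, as in your snippet.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper: idx lists the positions of the Double columns to cluster on.
def clusterRows(df1: DataFrame, idx: Seq[Int], k: Int, maxIter: Int): RDD[(Row, Int)] = {
  // Build a dense vector from the selected columns of one row.
  val toVector = (r: Row) => Vectors.dense(idx.map(i => r.getDouble(i)).toArray)

  // Train on the vectorized rows; cache because KMeans makes several passes.
  val vectors = df1.rdd.map(toVector).cache()
  val model = KMeans.train(vectors, k, maxIter)

  // Predict row by row so the original Row travels with its cluster assignment.
  df1.rdd.map(r => (r, model.predict(toVector(r))))
}
```

For the columns in your example the call would be something like clusterRows(df1, Seq(0, 3, 4, 5, 6), k, maxIter), with k and maxIter chosen by you. The result is an RDD of (original row, cluster id) pairs, so the domain fields can be inspected per cluster directly.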