Hi Matt,

I see. You could use the trained model to predict the cluster id for each
training point. That gives you a dataset containing your original input
data plus the associated cluster id for each data point. You can then group
this dataset by cluster id and aggregate over the original 5 features,
e.g., take the mean for numerical features or the most frequent value for
categorical features.

The exact aggregation is use-case dependent.
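
As a rough illustration of that group-and-aggregate step, here is a minimal
sketch in plain Python (not Spark, so it runs standalone). The field names
and sample rows are made up for the example, and the "cluster" value stands
in for the id the trained model would predict. In Spark the same idea would
be model.transform(...) to add the prediction column (named "prediction" by
default), followed by a groupBy on that column and an agg over the features.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical rows: original features plus the predicted cluster id.
rows = [
    {"age": 20.0, "spend": 10.0, "channel": "web", "cluster": 0},
    {"age": 30.0, "spend": 30.0, "channel": "web", "cluster": 0},
    {"age": 40.0, "spend": 20.0, "channel": "store", "cluster": 0},
    {"age": 60.0, "spend": 100.0, "channel": "store", "cluster": 1},
    {"age": 70.0, "spend": 200.0, "channel": "store", "cluster": 1},
]


def summarize_clusters(rows, numeric_cols, categorical_cols, cluster_col="cluster"):
    """Group rows by cluster id and aggregate each original feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cluster_col]].append(row)

    summary = {}
    for cluster_id, members in groups.items():
        profile = {"size": len(members)}
        # Numerical features: mean over the cluster members.
        for col in numeric_cols:
            profile[col] = mean(r[col] for r in members)
        # Categorical features: the most frequent value in the cluster.
        for col in categorical_cols:
            counts = Counter(r[col] for r in members)
            profile[col] = counts.most_common(1)[0][0]
        summary[cluster_id] = profile
    return summary


summary = summarize_clusters(rows, ["age", "spend"], ["channel"])
for cluster_id in sorted(summary):
    print(cluster_id, summary[cluster_id])
```

Mean and mode are only one possible choice here; a median, or the full
value distribution per cluster, may describe a cluster better.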

I hope this helps,
Christoph

On 01.03.2018 at 21:40, "Matt Hicks" <m...@outr.com> wrote:

Thanks for the response Christoph.

I'm converting large amounts of data into clustering training data, and I'm
having a hard time working out how to map the clusters (in code) back to
the original format so I can properly understand the dominant values in
each cluster.

For example, if I have five fields of data and I've trained ten clusters of
data I'd like to output the five fields of data as represented by each of
the ten clusters.



On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabo...@gmail.com wrote:

> Hi Matt,
>
> The clusters are defined by their centroids / cluster centers. All the
> points belonging to a certain cluster are closer to its centroid than to
> the centroids of any other cluster.
>
> What I typically do is convert the cluster centers back to the original
> input format or, if that is not possible, take the point nearest to the
> cluster center and use it as a representative of the whole cluster.
>
> Can you be a little bit more specific about your use-case?
>
> Best,
> Christoph
>
> On 01.03.2018 at 20:53, "Matt Hicks" <m...@outr.com> wrote:
>
> I'm using K-Means clustering for a project right now, and it's working
> very well. However, I'd like to determine from the clusters which
> distinguishing features define each cluster, so I can explain the
> "reasons" data fits into a specific cluster.
>
> Is there a proper way to do this in Spark ML?
>
>
