I see. So it's basically like dummy variables in regression.
Thanks, Sean.
On Jul 11, 2014, at 10:11 AM, Sean Owen so...@cloudera.com wrote:
Since you can't define your own distance function, you will need to
convert these to numeric dimensions. 1-of-n encoding can work OK,
depending on your use case. A dimension that takes on 3 categorical
values becomes 3 dimensions, all of which are 0 except for the one that
has value 1.
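
For illustration only (this is not from the thread), a minimal plain-Python sketch of the 1-of-n encoding described above. The helper names (`build_category_index`, `one_hot`, `encode_record`) and the sample records are hypothetical; the sketch assumes each record has a few numeric features plus a single categorical one, and it does not call Spark at all.

```python
# Hypothetical helpers illustrating 1-of-n (one-hot) encoding; not an MLlib API.

def build_category_index(values):
    """Map each distinct categorical value to a column position.

    Values are sorted so the mapping is deterministic across runs.
    """
    return {value: position for position, value in enumerate(sorted(set(values)))}


def one_hot(value, index):
    """Return a vector of len(index) zeros with a single 1.0 at the value's position."""
    vector = [0.0] * len(index)
    vector[index[value]] = 1.0
    return vector


def encode_record(numeric_features, category, index):
    """Append the one-hot block for `category` to the numeric features."""
    return list(numeric_features) + one_hot(category, index)


# Made-up sample data: two numeric features and one categorical feature.
records = [
    ([1.5, 20.0], "red"),
    ([0.3, 35.0], "green"),
    ([2.1, 18.0], "blue"),
]

index = build_category_index(category for _, category in records)
encoded = [encode_record(numeric, category, index) for numeric, category in records]

for row in encoded:
    print(row)
```

Each encoded row is purely numeric, so it could in principle be turned into the vectors that a k-means implementation expects. Because k-means relies on Euclidean distance, the numeric columns may need rescaling so they do not swamp the 0/1 columns; whether that matters depends on the data.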
On Fri, Jul 11, 2014 at 3:07 PM, Wen Phan wen.p...@mac.com wrote:
Hi Folks,
Does anyone have experience or recommendations on incorporating categorical
features (attributes) into k-means clustering in Spark? In other words, I
want to cluster on a set of attributes that include categorical variables.
I know I could probably implement some custom code to parse and calculate my
own similarity function, but I wanted to reach out before I did so. I’d
also prefer to take advantage of the k-means|| initialization feature
of the model in MLlib, so an MLlib-based implementation would be preferred.
Thanks in advance.
Best,
-Wen