Categorical Features for K-Means Clustering

2014-07-11 Thread Wen Phan
Hi Folks,

Does anyone have experience or recommendations on incorporating categorical 
features (attributes) into k-means clustering in Spark?  In other words, I want 
to cluster on a set of attributes that includes categorical variables.

I know I could probably implement some custom code to parse and calculate my 
own similarity function, but I wanted to reach out before I did so.  I’d also 
prefer to take advantage of the k-means|| initialization feature of the 
model in MLlib, so an MLlib-based implementation would be preferred.

Thanks in advance.

Best,

-Wen




Re: Categorical Features for K-Means Clustering

2014-07-11 Thread Wen Phan
I see.  So it is basically like using dummy variables in a regression.  
Thanks, Sean.

On Jul 11, 2014, at 10:11 AM, Sean Owen so...@cloudera.com wrote:

 Since you can't define your own distance function, you will need to
 convert these to numeric dimensions. 1-of-n encoding can work OK,
 depending on your use case: a dimension that takes on 3 categorical
 values becomes 3 dimensions, all of which are 0 except one that has
 value 1.
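
The 1-of-n encoding described above can be sketched in plain Python as follows. This is only an illustration, not code from the thread: the helper names fit_categories and encode_row and the sample rows are hypothetical, and the sketch assumes the rows are tuples mixing numeric and categorical values. The resulting all-numeric vectors are the kind of input a Euclidean k-means implementation such as MLlib's expects.

```python
from typing import Dict, List, Sequence


def fit_categories(rows: Sequence[Sequence], categorical_cols: Sequence[int]) -> Dict[int, List]:
    """Collect the sorted distinct values seen in each categorical column."""
    return {c: sorted({row[c] for row in rows}) for c in categorical_cols}


def encode_row(row: Sequence, categories: Dict[int, List]) -> List[float]:
    """Pass numeric columns through as floats; expand categorical columns to 1-of-n."""
    out: List[float] = []
    for i, value in enumerate(row):
        if i in categories:
            # One output dimension per known level: 1.0 for the match, 0.0 elsewhere.
            out.extend(1.0 if value == level else 0.0 for level in categories[i])
        else:
            out.append(float(value))
    return out


# Hypothetical sample data: one numeric column and one 3-valued categorical column.
rows = [(1.5, "red"), (2.0, "green"), (0.5, "blue"), (3.0, "red")]
categories = fit_categories(rows, categorical_cols=[1])
encoded = [encode_row(r, categories) for r in rows]

for original, vector in zip(rows, encoded):
    print(original, "->", vector)
```

Each row grows from 2 columns to 4, because the single 3-valued categorical column becomes 3 dimensions. Two rows that differ only in that category end up a squared Euclidean distance of 2 apart, so whether this works well depends on how that compares with the scale of the numeric columns.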
 


