You can try hashing to control the feature dimension. MLlib's k-means implementation can handle sparse data efficiently if the number of features is not huge.

-Xiangrui
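For illustration, here is a minimal sketch of the hashing idea in plain Python. It is not MLlib's implementation: the function name hash_features, the num_buckets parameter, the choice of CRC32 as the hash, and the sample row are all assumptions made for this example. Each (attribute, value) pair is hashed into one of a fixed number of buckets, so the vector width stays constant however many distinct category values appear, and only the non-zero entries are stored.

```python
import zlib
from collections import defaultdict


def hash_features(record, num_buckets=1 << 18):
    """Hash categorical (attribute, value) pairs into a fixed-width sparse vector.

    Returns a dict mapping bucket index -> count. Only non-zero entries
    are stored, so the result stays small even when num_buckets is large.
    """
    vec = defaultdict(float)
    for attr, value in record.items():
        # Include the attribute name so that equal values in different
        # columns (e.g. city="A" and zip="A") hash to different keys.
        key = f"{attr}={value}".encode("utf-8")
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        idx = zlib.crc32(key) % num_buckets
        # Distinct keys may collide in one bucket; their counts then add up.
        vec[idx] += 1.0
    return dict(vec)


# Hypothetical row using the categorical attributes mentioned in the thread.
row = {"city": "Boston", "zip": "02134", "satisfaction_level": "high"}
sparse = hash_features(row, num_buckets=1024)
```

The resulting index/value pairs could then be wrapped in an MLlib sparse vector of size num_buckets before clustering. Fewer buckets mean less memory but more collisions, so the bucket count is the knob that trades data size against fidelity of the category representation. MLlib also ships a HashingTF transformer that applies the same idea to sequences of terms.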
On Tue, Jun 16, 2015 at 2:44 PM, Rex X <dnsr...@gmail.com> wrote:
> Hi Sujit,
>
> That's a good point. But 1-hot encoding will grow our data from
> terabytes to petabytes, because we have tens of categorical attributes, and
> some of them contain thousands of categorical values.
>
> Is there any way to strike a good balance between data size and the right
> representation of categories?
>
> -Rex
>
> On Tue, Jun 16, 2015 at 1:27 PM, Sujit Pal <sujitatgt...@gmail.com> wrote:
>>
>> Hi Rexx,
>>
>> In general (i.e., not Spark-specific), it's best to convert categorical data
>> to 1-hot encoding rather than integers - that way the algorithm doesn't use
>> the ordering implicit in the integer representation.
>>
>> -sujit
>>
>> On Tue, Jun 16, 2015 at 1:17 PM, Rex X <dnsr...@gmail.com> wrote:
>>>
>>> Is it necessary to convert categorical data into integers?
>>>
>>> Any tips would be greatly appreciated!
>>>
>>> -Rex
>>>
>>> On Sun, Jun 14, 2015 at 10:05 AM, Rex X <dnsr...@gmail.com> wrote:
>>>>
>>>> For clustering analysis, we need a way to measure distances.
>>>>
>>>> The data contains different levels of measurement -
>>>> binary / categorical (nominal), counts (ordinal), and ratio (scale).
>>>>
>>>> To be concrete, consider working with the attributes
>>>> city, zip, satisfaction_level, price
>>>>
>>>> Meanwhile, real data usually also contains string attributes,
>>>> for example, book titles. The distance between two strings can be measured
>>>> by minimum edit distance.
>>>>
>>>> SPSS provides Two-Step Cluster, which can handle both ratio scale
>>>> and ordinal numbers.
>>>>
>>>> What is the right algorithm for hierarchical clustering analysis with all
>>>> four kinds of attributes above in MLlib?
>>>>
>>>> If we cannot find the right metric to measure the distance, an alternative
>>>> solution is to do a topological data analysis (e.g., linkage, etc.). Can
>>>> we do this kind of analysis with GraphX?
>>>>
>>>> -Rex