This kind of data is normally called categorical. If there is an ordering to the data, then it is called ordinal.
Categorical data should generally be encoded using the 1-of-n encoding that you suggest for clustering, so that distances make sense and are symmetrical: every pair of distinct categories ends up the same distance apart. This representation is a bit redundant, but that isn't a problem for Mahout's clustering or classification algorithms. You might want to use 1-of-(n-1) encoding for some other kinds of classifiers, such as logistic regression without regularization.

On Thu, Feb 14, 2013 at 8:34 AM, misterblinky <[email protected]> wrote:

> I'm clustering (non-textual) data. Some of the features in my vectors
> represent discrete values or "types" such that, for example, one feature
> may have the range of values 0="red", 1="blue", 2="green", 3="yellow".
>
> I could also have characterized the same data as 4 features where the
> value of the feature was either 0 or 1, where 1 would imply color blue.
>
> One distinction between these two approaches is that the first approach
> creates dense vectors, whereas the second approach creates sparse vectors.
>
> My question is, from the point of view of accurate clusters, is it better
> to characterize the type values one way or the other? A follow up is, for
> the recommended approach to characterizing the data in a vector, (if it's
> possible to generalize) what would be the suggested cluster alg and
> measurement?
>
> I am new to this, so feel free to be basic in your response!
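
To make the point about distances concrete, here is a minimal sketch in plain Python (not Mahout code) comparing the two encodings from the question. The helper names one_of_n, integer_code, and euclidean are made up for this illustration, and Euclidean distance is only an assumed choice of measure.

```python
import math

# Category order taken from the question: 0="red", 1="blue", 2="green", 3="yellow".
COLORS = ["red", "blue", "green", "yellow"]


def integer_code(value, categories=COLORS):
    """Single-feature encoding: the category's index as one number."""
    return [float(categories.index(value))]


def one_of_n(value, categories=COLORS):
    """1-of-n encoding: one 0/1 feature per category, exactly one set to 1."""
    if value not in categories:
        raise ValueError("unknown category: %r" % (value,))
    return [1.0 if c == value else 0.0 for c in categories]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# With integer codes the distance depends on the arbitrary numbering,
# so "yellow" looks three times farther from "red" than "blue" does.
int_red_blue = euclidean(integer_code("red"), integer_code("blue"))
int_red_yellow = euclidean(integer_code("red"), integer_code("yellow"))

# With 1-of-n codes every pair of distinct colors is equally far apart.
hot_red_blue = euclidean(one_of_n("red"), one_of_n("blue"))
hot_red_yellow = euclidean(one_of_n("red"), one_of_n("yellow"))

print(int_red_blue, int_red_yellow)
print(hot_red_blue, hot_red_yellow)
```

With the integer coding, the numbering imposes an ordering that the colors do not have; the 1-of-n vectors avoid that at the cost of the redundancy mentioned above.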
