This kind of data is normally called categorical.  If there is an ordering
to the data, then it is called ordinal.

For clustering, categorical data should generally be encoded using the
1 of n encoding that you suggest (your second approach), so that distances
make good sense and are symmetrical: every pair of distinct categories ends
up equally far apart.  This representation is a bit redundant, but that
isn't a problem for Mahout's clustering or classification algorithms.  You
might want to use 1 of n-1 encoding for some other kinds of classifiers,
such as logistic regression without regularization, where the redundancy
leaves the coefficients not uniquely determined.
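
To make the distance argument concrete, here is a small sketch in plain
Python.  It is not Mahout code, and the function names are made up for
illustration.  It compares the Euclidean distance from "red" to "blue" and
from "red" to "yellow" under the integer coding from your first approach,
under 1 of n, and under 1 of n-1 (where I arbitrarily drop the last
category).

```python
import math

# Categories taken from the example in the question.
COLORS = ["red", "blue", "green", "yellow"]

def integer_code(value, categories=COLORS):
    """Single feature holding the category index (the first approach)."""
    return [float(categories.index(value))]

def one_of_n(value, categories=COLORS):
    """One feature per category; exactly one of them is 1 (the second approach)."""
    return [1.0 if c == value else 0.0 for c in categories]

def one_of_n_minus_1(value, categories=COLORS):
    """Like one_of_n, but the last category is dropped and becomes all zeros."""
    return one_of_n(value, categories)[:-1]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

for encode in (integer_code, one_of_n, one_of_n_minus_1):
    red_blue = euclidean(encode("red"), encode("blue"))
    red_yellow = euclidean(encode("red"), encode("yellow"))
    print(encode.__name__, round(red_blue, 3), round(red_yellow, 3))
```

With the integer coding, red is three times as far from yellow as it is
from blue, which is only an artifact of the numbering.  With 1 of n, both
distances are sqrt(2).  With 1 of n-1, the dropped category sits closer to
every other category than they sit to each other, which is why I would not
use it for distance-based clustering.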

On Thu, Feb 14, 2013 at 8:34 AM, misterblinky <[email protected]> wrote:

> I'm clustering (non-textual) data. Some of the features in my vectors
> represent discrete values or "types" such that, for example, one feature
> may have the range of values 0="red", 1="blue", 2="green", 3="yellow".
>
> I could also have characterized the same data as 4 features where the
> value of the feature was either 0 or 1, where 1 would imply color blue.
>
> One distinction between these two approaches is that the first approach
> creates dense vectors, whereas the second approach creates sparse vectors.
>
> My question is, from the point of view of accurate clusters, is it better
> to characterize the type values one way or the other? A follow up is, for
> the recommended approach to characterizing the data in a vector, (if it's
> possible to generalize) what would be the suggested cluster alg and
> measurement?
>
> I am new to this, so feel free to be basic in your response!
>
>
