I'm clustering (non-textual) data. Some of the features in my vectors represent discrete values or "types". For example, one feature may take the values 0="red", 1="blue", 2="green", 3="yellow".
I could also have characterized the same data as 4 binary features, one per color, where each feature is either 0 or 1 and a 1 marks the color that applies (for example, a 1 in the second feature would mean blue). One distinction between these two approaches is that the first creates dense vectors, whereas the second creates sparse vectors. My question is: from the point of view of accurate clusters, is it better to characterize the type values one way or the other? As a follow-up, for the recommended way of characterizing the data in a vector, and if it's possible to generalize, what clustering algorithm and distance measure would you suggest? I am new to this, so feel free to be basic in your response!
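
To make the two representations concrete, here is a minimal sketch of what I mean, assuming NumPy and plain Euclidean distance (the distance is only my assumption for illustration, and the helper names `integer_encode`, `one_hot_encode`, and `euclidean` are made up for this post):

```python
import numpy as np

# Color codes as in the question: 0="red", 1="blue", 2="green", 3="yellow".
COLORS = ["red", "blue", "green", "yellow"]

def integer_encode(names):
    """First approach: one dense feature holding the color code."""
    return np.array([[COLORS.index(n)] for n in names], dtype=float)

def one_hot_encode(names):
    """Second approach: four 0/1 features, a 1 marking the color."""
    codes = [COLORS.index(n) for n in names]
    return np.eye(len(COLORS))[codes]

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

samples = ["red", "blue", "yellow"]
as_int = integer_encode(samples)   # shape (3, 1)
as_hot = one_hot_encode(samples)   # shape (3, 4)

# Distances from "red" to "blue" and from "red" to "yellow".
int_red_blue = euclidean(as_int[0], as_int[1])
int_red_yellow = euclidean(as_int[0], as_int[2])
hot_red_blue = euclidean(as_hot[0], as_hot[1])
hot_red_yellow = euclidean(as_hot[0], as_hot[2])

print("integer codes:", int_red_blue, int_red_yellow)
print("one-hot      :", hot_red_blue, hot_red_yellow)
```

With the integer codes, red ends up three times as far from yellow as from blue, purely because of how I numbered the colors. With the 0/1 features, every pair of different colors is the same distance apart. I don't know which of these behaviors a clustering algorithm actually wants, which is really what I'm asking.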
