I'm not sure what you mean by "raw text data". I am doing the usual preprocessing: removing stop words, stemming, etc., and then computing TF-IDF vectors before trying to cluster them.
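
To make that pipeline concrete, here is a rough sketch in plain Python of the steps I mean. It is only an illustration, not my actual code: the stop list is a tiny made-up one, `crude_stem` is a hypothetical stand-in for a real stemmer such as Porter, and the weighting is plain tf * log(N/df) with L2 normalization, which may differ from what a given library does.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop list, not the one used on the real data.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
        # L2-normalize so Euclidean k-means behaves like cosine clustering.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors
```

The resulting vectors are sparse and as wide as the vocabulary, which is the high-dimensional representation I then hand to k-means.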

2011/7/14 Fernando Fernández <[email protected]>

> Hi vcaky,
>
> Are you using raw text data with k-means? It's usual to obtain some lower
> dimension and dense representation of the documents using Singular Value
> Decomposition and such techniques, and working with that representation
> instead. You may want to take a look at SVD algorithms in mahout.
>
> Best,
> Fernando.
>
> 2011/7/14 Vckay <[email protected]>
>
> > I am clustering some real world text data using K-Means. I recently came
> > across Kernel K-Means and wanted to know if someone who has had experience
> > with Kernels could comment on their appropriateness for text data, i.e,
> > Would using a Kernel boost k-means quality? ( I know this is rather general
> > but it is sort of hard to figure out if my high dimensional real world data
> > is linearly separable.) If so, are there any Kernel's with "practically
> > accepted" parameters?
> >
> > Thanks
> > VC
