I'm sorry, I mean the TF-IDF vectors you have computed. Another common step is to reduce the dimensionality of those vectors using SVD, so that you end up with only a few hundred dense dimensions that capture almost all of the useful information in your original data, and then use these dense vectors for clustering / classifying the documents.
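To make the pipeline concrete, here is a rough sketch of what I mean. It uses Python/scikit-learn rather than Mahout just for illustration, and the toy corpus, the number of SVD components and the number of clusters are made up for the example; with Mahout you would use its own SVD jobs on your TF-IDF vectors instead.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans

# Toy corpus -- replace with your own stop-word-filtered, stemmed documents.
docs = [
    "mahout kmeans clustering of text documents",
    "singular value decomposition reduces dimensionality",
    "tf idf vectors are sparse and high dimensional",
    "kernel methods may help when data is not linearly separable",
]

# 1. Sparse, high-dimensional TF-IDF vectors.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# 2. SVD (a.k.a. LSA) to get a few dense dimensions. A few hundred is
#    typical for a real corpus; kept tiny here because the corpus is tiny.
svd = TruncatedSVD(n_components=2, random_state=0)
X_dense = svd.fit_transform(X)

# Re-normalizing the reduced vectors makes Euclidean k-means behave
# more like cosine-distance clustering, which usually suits text better.
X_dense = Normalizer(copy=False).fit_transform(X_dense)

# 3. Cluster the dense representation instead of the raw TF-IDF vectors.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X_dense)
print(labels)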
2011/7/14 Vckay <[email protected]>

> Not too sure what you mean by "raw text data", I am doing the usual: remove
> stop words, stem etc. and then computing TF-IDF vectors before trying to
> cluster them.
>
>
> 2011/7/14 Fernando Fernández <[email protected]>
>
> > Hi Vckay,
> >
> > Are you using raw text data with k-means? It's usual to obtain some lower
> > dimension and dense representation of the documents using Singular Value
> > Decomposition and such techniques, and working with that representation
> > instead. You may want to take a look at SVD algorithms in mahout.
> >
> > Best,
> > Fernando.
> >
> > 2011/7/14 Vckay <[email protected]>
> >
> > > I am clustering some real world text data using K-Means. I recently came
> > > across Kernel K-Means and wanted to know if someone who has had experience
> > > with Kernels could comment on their appropriateness for text data, i.e.,
> > > would using a Kernel boost k-means quality? (I know this is rather general
> > > but it is sort of hard to figure out if my high dimensional real world data
> > > is linearly separable.) If so, are there any Kernels with "practically
> > > accepted" parameters?
> > >
> > > Thanks
> > > VC
> > >
> > >
