I'm sorry, I mean the TF-IDF vectors you have computed. Another common
step is to reduce the dimensionality of those vectors using SVD, so that
you end up with dense vectors of only a few hundred dimensions that still
carry almost all the useful information in your original data, and then
use these dense vectors for clustering / classifying the documents.
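
For what it's worth, here is a minimal sketch of that pipeline in Python
with scikit-learn (just to illustrate the idea; the 200 components and 10
clusters are placeholder values I picked, and Mahout's own SVD jobs would
play the same role at scale):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

docs = [...]  # your corpus: one string per document

# Sparse, high-dimensional TF-IDF vectors.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Reduce to a few hundred dense dimensions (n_components must be smaller
# than the vocabulary size); re-normalizing afterwards makes k-means'
# Euclidean distance behave like cosine similarity.
lsa = make_pipeline(TruncatedSVD(n_components=200), Normalizer(copy=False))
X_dense = lsa.fit_transform(X)

km = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = km.fit_predict(X_dense)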

2011/7/14 Vckay <[email protected]>

> Not too sure what you mean by "raw text data". I am doing the usual:
> removing stop words, stemming, etc., and then computing TF-IDF vectors
> before trying to cluster them.
>
>
> 2011/7/14 Fernando Fernández <[email protected]>
>
> > Hi Vckay,
> >
> > Are you using raw text data with k-means? It's common to obtain a
> > lower-dimensional, dense representation of the documents using Singular
> > Value Decomposition and similar techniques, and to work with that
> > representation instead. You may want to take a look at the SVD
> > algorithms in Mahout.
> >
> > Best,
> > Fernando.
> >
> > 2011/7/14 Vckay <[email protected]>
> >
> > > I am clustering some real-world text data using k-means. I recently
> > > came across kernel k-means and wanted to know if someone who has
> > > experience with kernels could comment on their appropriateness for
> > > text data, i.e., would using a kernel boost k-means quality? (I know
> > > this is rather general, but it is hard to figure out whether my
> > > high-dimensional real-world data is linearly separable.) If so, are
> > > there any kernels with "practically accepted" parameters?
> > >
> > > Thanks
> > > VC
> > >
> >
>
