Hey, I want to cluster a set of documents using a bag-of-words approach (e.g. with K-means). However, since my documents are automatically generated by aggregating text snippets, they differ hugely in size.
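(For reference, the kind of pipeline I mean is roughly the following sketch; the documents and the cluster count are just placeholders, the real corpus and vocabulary are of course much larger.)

```python
# Rough sketch of the bag-of-words + K-means setup (placeholder documents,
# arbitrary cluster count).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

docs = [
    "aggregated snippet text of a small document",
    "another small document built from a few snippets",
    "a very large document aggregated from many many snippets ...",
]

vectorizer = CountVectorizer()          # plain bag-of-words counts
X = vectorizer.fit_transform(docs)      # sparse doc-term matrix

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)              # cluster assignment per document
```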
Concretely, some document vectors have only 50 words with a count greater than 0 (each with a small count), while a small number of document vectors have 1,000,000 words with a count greater than 0 (with counts following a power-law-like distribution). Even with tf-idf normalization, the sparse document vectors (small docs) end up with large scores for a few words, while the densely filled document vectors (large docs) end up with small scores for a huge number of words. I call this a highly heterogeneous input data set (I don't know if that is the right term) and expect it to be a known problem in other domains as well. For example, in NLP, when clustering similar terms on the basis of a term-document matrix, some terms occur just a few times in a few documents, while a small number of terms occur very often in a huge number of documents. People have proposed PPMI and smoothing to get better results; however, the papers I have read do not explicitly discuss the heterogeneity problem or how it affects the output (e.g., the clusters or the similarity calculation).

Does anyone have a hint which normalization/clustering approach is promising in the presence of such heterogeneity, or can point me to relevant papers? I have thought about (a) extracting sparse clusters (Sparse PCA), (b) splitting large documents by sampling, and (c) smoothing the doc-term matrix before clustering using SVD, PLSA, or LDA (to make the small docs more densely filled). However, I am looking for a more well-founded approach to this kind of problem, or a good resource. Can anyone point me to a good paper covering it? That would be of great help.

Thanks a lot,
Chris
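P.S.: To make option (c) a bit more concrete, this is roughly the pipeline I have in mind (just a sketch; the function name `cluster_lsa` and all parameter values are arbitrary choices of mine, not recommendations):

```python
# Sketch of option (c): tf-idf, then truncated SVD (LSA-style smoothing),
# then re-normalization so document length drops out, then K-means.
# n_components must be (much) smaller than the vocabulary size of the corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

def cluster_lsa(docs, n_components=100, n_clusters=20):
    # sublinear_tf=True replaces raw counts with 1 + log(tf), which already
    # dampens the power-law counts of the huge documents a bit
    tfidf = TfidfVectorizer(sublinear_tf=True, norm="l2")
    X = tfidf.fit_transform(docs)

    # project the sparse doc-term matrix onto a dense low-rank space and
    # re-normalize, so sparse (small) and dense (large) docs become comparable
    lsa = make_pipeline(TruncatedSVD(n_components=n_components, random_state=0),
                        Normalizer(copy=False))
    X_lsa = lsa.fit_transform(X)

    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(X_lsa)
```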
