We are trying to run k-means on some product titles so that we can cluster similar products together, e.g. "nike flex sneaker size 9" with "nike flex sneaker size 8". It works fine for most titles, but it turns out that a lot of them are very short (particularly after filtering stopwords), so I end up with many 1-word or 2-word titles. Somehow these get lumped together into one huge cluster whose members have no similarity to each other at all. I traced some specific examples from this cluster, and it seems that the algorithm is indeed doing what it's supposed to do.
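To make the symptom concrete, here is a minimal sketch in plain Python. It is not our actual pipeline: the stopword list, the sample titles, and the helper names `tokenize` and `cosine` are all made up for illustration. It only shows how stopword filtering can leave 1-word or 2-word titles whose bag-of-words vectors share no terms with each other.

```python
import math
from collections import Counter

# Hypothetical stopword list and sample titles, for illustration only.
STOPWORDS = {"the", "a", "for", "and", "of", "in", "with", "new"}
TITLES = [
    "nike flex sneaker size 9",
    "nike flex sneaker size 8",
    "the new kettle",
    "lamp for the desk",
    "a mug",
]

def tokenize(title):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in title.lower().split() if t not in STOPWORDS]

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# One term-count vector per title.
vectors = [Counter(tokenize(t)) for t in TITLES]

# Indices of titles with at most two tokens left after filtering.
short = [i for i, v in enumerate(vectors) if sum(v.values()) <= 2]

# Pairwise similarity among the short titles only.
for i in short:
    for j in short:
        if i < j:
            print(TITLES[i], "|", TITLES[j], "|", cosine(vectors[i], vectors[j]))
```

In this toy data the three short titles have pairwise cosine similarity 0, while the two sneaker titles score about 0.8, which matches what I see in the big cluster: the members have nothing in common with each other.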
Has anybody had similar experience clustering particularly short "documents"? More generally, are there any tricks to force these members to "jump" out and join another cluster? (I do see other, smaller clusters whose members share matching words.)

Thanks,
Yang
