Hi,

Do you mean that you're running K-Means directly on tf-idf bag-of-words vectors? I think your results are expected, because sparse, high-dimensional bag-of-words vectors generally have very little overlap with one another. The similarity between most pairs of vectors is expected to be very close to zero. Those that do end up in the same cluster likely share a lot of boilerplate text (assuming the training data comes from crawled news articles, they likely have similar menus and header/footer text).
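To make the overlap point concrete, here is a minimal sketch in plain Python (not Spark). The four toy documents and the helper names `tfidf_vectors` and `cosine` are made up for illustration, and the idf uses the smoothed form log((n + 1) / (df + 1)). Even documents on the same topic come out nearly orthogonal, because short texts rarely share terms beyond stop words.

```python
import math
from collections import Counter

# Hypothetical toy corpus: two posts about space, two about hockey.
docs = [
    "the shuttle launch was delayed by weather",
    "nasa plans another orbit mission next year",
    "the goalie stopped every shot in the playoff game",
    "hockey fans cheered as the team won the series",
]

def tfidf_vectors(texts):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [t.split() for t in texts]
    n = len(tokenized)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf (an assumption of this sketch).
    idf = {term: math.log((n + 1) / (count + 1)) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = tfidf_vectors(docs)

# Similarity for every distinct pair of documents.
pairs = {
    (i, j): cosine(vectors[i], vectors[j])
    for i in range(len(docs))
    for j in range(i + 1, len(docs))
}

for (i, j), sim in sorted(pairs.items()):
    print(f"doc {i} vs doc {j}: cosine similarity = {sim:.4f}")
```

With vectors this close to orthogonal, every point is roughly the same distance from every centroid, so K-Means has little signal to split on.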
I would suggest you try some dimensionality reduction on the tf-idf vectors first. You have many options to choose from (LSA, LDA, doc2vec, etc.).

Other than that, this isn't a Spark question.

Asher Krim
Senior Software Engineer

On Fri, Mar 24, 2017 at 9:37 PM, Reth RM <reth.ik...@gmail.com> wrote:

> Hi,
>
> I am using spark k mean for clustering records that consist of news
> documents, vectors are created by applying tf-idf. Dataset that I am using
> for testing right now is the gold-truth classified
> http://qwone.com/~jason/20Newsgroups/
>
> Issue is all the documents are getting assigned to same cluster and others
> just have the vector(doc) picked as cluster center(skewed clustering). What
> could be the possible reasons for the issue, any suggestions? Should I be
> retuning the epsilon?