Jatin,

One approach to determining K would be to sample the data set and run PCA
on the sample. Then evaluate how many of the resulting
eigenvalue/eigenvector pairs you need before you hit diminishing returns in
cumulative explained variance (equivalently, reconstruction error). That
number provides a reasonably good value for K to use in KMeans.
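To make the idea concrete, here is a minimal sketch in plain NumPy rather than Spark MLlib (the 90% variance threshold and the toy data are assumptions for illustration, not a fixed rule):

```python
import numpy as np

# Sketch: pick K from PCA's cumulative explained variance.
# The 0.9 cutoff below is an illustrative assumption.
rng = np.random.default_rng(0)
# Toy data: 200 samples in 10 dims, most variance in 3 directions.
base = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10))
X = base + 0.05 * rng.normal(size=(200, 10))

Xc = X - X.mean(axis=0)                      # center before PCA
eigvals = np.linalg.svd(Xc, compute_uv=False) ** 2
ratios = np.cumsum(eigvals) / eigvals.sum()  # cumulative variance explained
k = int(np.searchsorted(ratios, 0.9) + 1)    # smallest K covering 90%
print(k)
```

You would then feed that K into KMeans; in Spark the analogous steps run through MLlib's PCA and KMeans on RDDs/DataFrames instead of a local matrix.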

With recent releases of Spark and MLlib you don't have to sample; you can
run PCA at scale on the full data set, though that may be overkill for what
you need.

As Sean mentioned, there may be other algorithms that would be more
effective for your use case. LDA is good for topic modeling, but in
practice its results can be noisy unless the pipeline does some
parsing/processing of the text ahead of training.
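As a rough illustration of that kind of pre-processing, here is a stdlib-only Python sketch (the stopword list, token pattern, and length cutoff are all assumptions you would tune for your corpus):

```python
import re

# Illustrative text clean-up ahead of topic modeling: lowercase,
# keep alphabetic tokens, drop stopwords and very short tokens.
# The stopword set here is a tiny placeholder, not a real list.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def tokenize(doc):
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

docs = ["The quick brown fox...", "Spark MLlib is great for clustering!"]
corpus = [tokenize(d) for d in docs]
print(corpus)
```

The cleaned token lists are what you would hand to the vectorizer/LDA stage.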

Word2Vec (also in Spark MLlib) can be an interesting alternative for topic
modeling; you may want to take a look at this tutorial/case study:
http://www.yseam.com/blog/WV.html


On Mon, Dec 29, 2014 at 2:55 AM, jatinpreet <jatinpr...@gmail.com> wrote:
>
> Hi,
>
> I wish to cluster a set of textual documents into undefined number of
> classes. The clustering algorithm provided in MLlib i.e. K-means requires
> me
> to give a pre-defined number of classes.
>
> Is there any algorithm which is intelligent enough to identify how many
> classes should be made based on the input documents. I want to utilize the
> speed and agility of Spark in the process.
>
> Thanks,
> Jatin
>
>
>
> -----
> Novice Big Data Programmer
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Clustering-text-data-with-MLlib-tp20883.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
