I am using the wholeTextFiles API to load a bunch of text files. I cache
this object, map each file's text content to a TF-IDF vector, and then run
k-means on these vectors. After training, the k-means model predicts the
cluster ID of each training document, but it takes only the list<vectors>
of the training data as input. The question is: how do I map these
predictions back to the entries of the wholeTextFiles object?

Use case
 Input:  A set of text files in a directory. Process the text files and
cluster them with k-means.
 Output: Get the cluster membership of each text file, read its content
from the wholeTextFiles object, and write it to the directory for its
cluster ID.
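
One way the use case above might be wired together is sketched below. It
rests on several assumptions that are not stated in this post: it uses
PySpark with the RDD-based pyspark.mllib API, tokenizes on whitespace,
and writes the output from the driver to a local filesystem. The names
cluster_output_path and cluster_text_files are hypothetical helpers made
up for illustration. The idea is that every RDD is derived from the
cached wholeTextFiles RDD through element-wise transformations, so the
predictions can be zipped back onto the (path, content) pairs.

```python
import os


def cluster_output_path(output_dir, cluster_id, source_path):
    """Hypothetical helper: build <output_dir>/<cluster_id>/<file name>."""
    return os.path.join(output_dir, str(cluster_id),
                        os.path.basename(source_path))


def cluster_text_files(sc, input_dir, output_dir, k):
    """Sketch: cluster the files in input_dir, copy each into a cluster dir."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.feature import HashingTF, IDF

    # RDD of (path, content) pairs; cached because it is traversed twice.
    files = sc.wholeTextFiles(input_dir).cache()

    # Assumption: plain whitespace tokenization is good enough here.
    tokens = files.values().map(lambda text: text.split())

    tf = HashingTF().transform(tokens)
    tf.cache()
    tfidf = IDF().fit(tf).transform(tf)

    model = KMeans.train(tfidf, k, maxIterations=20)
    cluster_ids = model.predict(tfidf)

    # zip pairs elements by position. This relies on cluster_ids being
    # derived from files only through element-wise transformations.
    assignments = files.zip(cluster_ids)

    # Assumption: output goes to the driver's local filesystem.
    for (path, content), cluster_id in assignments.toLocalIterator():
        target = cluster_output_path(output_dir, cluster_id, path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "w", encoding="utf-8") as out:
            out.write(content)
```

With an existing SparkContext sc, this could be called as
cluster_text_files(sc, "docs", "clusters", 5). An alternative that avoids
zip would be to carry the path alongside each vector through the whole
pipeline, at the cost of splitting keys and vectors apart again before
training.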
