Hey, my question is about the following situation:
Suppose we have 100,000 files, where every row in each file is a list of features.

The task is: for each file, cluster the features in that file and write each
feature together with its cluster assignment to a new file. So we have to
generate 100,000 more files by applying clustering to each file individually.

Can I do it this way: get an RDD of the list of files and apply a map over it.
Inside the mapper function, which handles a single file, create another
SparkContext and use MLlib KMeans to produce that file's clustered output.
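
To make the question concrete, this is roughly the per-file clustering step I
have in mind, sketched in PySpark. The file names, the comma-separated row
format, and k=10 are just placeholders/assumptions on my part:

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="per-file-kmeans")

# One input file; assuming each row is a comma-separated list of
# numeric feature values (placeholder path).
points = sc.textFile("features_00001.txt") \
           .map(lambda line: [float(x) for x in line.split(",")])

# Cluster this single file's features with MLlib KMeans (k is a placeholder).
model = KMeans.train(points, k=10, maxIterations=20)

# Write every row back out together with the cluster it was assigned to.
points.map(lambda p: ",".join(str(x) for x in p) + "\t" + str(model.predict(p))) \
      .saveAsTextFile("features_00001_clustered")  # placeholder output path

sc.stop()

The part I am unsure about is how to drive this step for all 100,000 files,
i.e. whether each mapper can safely create its own SparkContext like this.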

Please suggest an appropriate way to tackle this problem.

Thanks,
Rahul Kumar Bhojwani
3rd year, B.Tech
Computer Science Engineering
National Institute Of Technology, Karnataka
9945197359
