Hey,

My question is about the following situation: suppose we have 100,000 files, where every row of a file is a list of features.
The task is to cluster the features within each file and, for every row, write the corresponding cluster alongside it in a new output file. So we end up generating 100,000 more files by applying clustering to each file individually.

Can I do it this way: take an RDD of the list of files and apply a map over it, where the mapper function handling each file creates another SparkContext and uses MLlib KMeans to produce the clustered output file? I have pasted a rough sketch of what I mean at the end of this mail.

Please suggest the appropriate method to tackle this problem.

Thanks,
Rahul Kumar Bhojwani
3rd year, B.Tech Computer Science Engineering
National Institute Of Technology, Karnataka
9945197359
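Rough sketch of the structure I have in mind (everything here is a placeholder just to make the sketch self-contained: the file paths, the comma-separated parsing, the number of partitions, and k = 10 are all made up, and the local scikit-learn KMeans inside the mapper only stands in for the step I am actually asking about, i.e. whether that step could instead be a second SparkContext running MLlib KMeans on that one file's data):

from pyspark import SparkContext


def cluster_one_file(path):
    # Local stand-in for MLlib KMeans; this is the step I would like to
    # replace with a second SparkContext + MLlib KMeans, if that is allowed.
    from sklearn.cluster import KMeans

    # Read one file: one feature vector per row, comma-separated values assumed.
    with open(path) as f:
        rows = [[float(x) for x in line.split(",")] for line in f if line.strip()]

    # Arbitrary k, just so the sketch runs end to end.
    labels = KMeans(n_clusters=10).fit_predict(rows)

    # Write each row together with its cluster id into a new file.
    with open(path + ".clustered", "w") as out:
        for row, label in zip(rows, labels):
            out.write(",".join(str(v) for v in row) + "," + str(label) + "\n")


if __name__ == "__main__":
    sc = SparkContext(appName="per-file-clustering")
    # Placeholder list of the 100,000 input files.
    paths = ["features_%05d.txt" % i for i in range(100000)]
    # Distribute the file list and cluster each file inside the mapper.
    sc.parallelize(paths, 1000).foreach(cluster_one_file)
    sc.stop()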