Hi, suppose I have some data of the form k,(x,y), where everything is a number. For each key value k I want to do kmeans clustering on all of the corresponding (x,y) points. For each key there are few enough points that I'm happy to use a traditional (non-mapreduce) kmeans implementation. The challenge is that I have a lot of different keys, so I want to use Hadoop/Spark to spread the clustering over multiple machines. With Hadoop streaming and Python the reducer would be easy:

    import sys

    pts = []
    current_k = None
    for line in sys.stdin:
        # assuming tab-separated "k<TAB>x<TAB>y" records from the shuffle
        k, x, y = line.rstrip("\n").split("\t")
        if k == current_k:
            pts.append((float(x), float(y)))
        else:
            if current_k is not None:
                pass  # do kmeans clustering on pts
            current_k = k
            pts = [(float(x), float(y))]  # keep the first point of the new key
(and obviously run kmeans for the final key as well). How do I express this in Spark? The function f passed to reduce/reduceByKey needs to be associative (all of the examples I've seen just add values), so I can't see how to collect all of a key's points together for clustering. Ideally I'd like to run this iteratively, changing the number of clusters for kmeans, which is why Spark is appealing. Given how easy this is in Hadoop streaming, I feel like it should be easy in Spark as well, but I can't figure out a way. Thanks :)
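For what it's worth, the shape I have in mind is something like the sketch below -- I genuinely don't know whether this is the right or idiomatic way to do it in Spark, which is really the question. I'm assuming the points for any one key fit in memory on a single worker, making up the input path and "k,x,y" csv format, hard-coding num_clusters=3, and using scipy's kmeans as the "traditional" implementation (assuming numpy/scipy are installed on the workers):

    import numpy as np
    from scipy.cluster.vq import kmeans
    from pyspark import SparkContext

    def cluster_points(pts, num_clusters):
        # pts: list of (x, y) tuples for a single key; ordinary in-memory kmeans
        centroids, distortion = kmeans(np.array(pts), num_clusters)
        return centroids

    sc = SparkContext(appName="per-key-kmeans")

    points = (sc.textFile("k_x_y.csv")          # made-up path, one "k,x,y" per line
                .map(lambda line: line.split(","))
                .map(lambda t: (t[0], (float(t[1]), float(t[2])))))

    # pull each key's points together on one worker, then cluster them locally;
    # num_clusters is the knob I'd want to vary between iterative runs
    clusters = points.groupByKey() \
                     .mapValues(lambda pts: cluster_points(list(pts), num_clusters=3))

Is groupByKey plus mapValues the right tool here, or is there something better when each key's points comfortably fit on one machine?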