groupByKey will give you a PairRDD where, for each key k, you have an Iterable over all of the corresponding (x,y) points. You can then call mapValues and apply your clustering to each group of points, yielding a result R. You end up with a PairRDD of (k,R) pairs, and the clustering for the different keys of course happens in parallel.
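For example, a minimal PySpark sketch of that approach (untested; the input path, the "k,x,y" line format, the cluster count, and the little numpy-based local kmeans are just placeholders for illustration):

import numpy as np
from pyspark import SparkContext

def local_kmeans(points, n_clusters=3, n_iter=20):
    # plain, non-distributed kmeans over one key's points
    pts = np.array(list(points), dtype=float)
    k = min(n_clusters, len(pts))
    centers = pts[np.random.choice(len(pts), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        dists = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        for c in range(k):
            if (labels == c).any():
                centers[c] = pts[labels == c].mean(axis=0)
    return centers.tolist()

if __name__ == "__main__":
    sc = SparkContext(appName="per-key-kmeans")
    # input lines look like "k,x,y"
    pairs = (sc.textFile("hdfs:///path/to/points.csv")
               .map(lambda line: line.split(","))
               .map(lambda t: (t[0], (float(t[1]), float(t[2])))))
    # groupByKey -> (k, Iterable[(x, y)]); mapValues clusters each group
    clusters = pairs.groupByKey().mapValues(local_kmeans)
    for key, centers in clusters.collect():
        print(key, centers)

To vary the number of clusters between runs you can wrap the call in a lambda, e.g. .mapValues(lambda p: local_kmeans(p, n_clusters=5)), or swap in any library kmeans you like, since each group is just an ordinary local collection at that point.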
On Thu, Jul 31, 2014 at 12:00 PM, Greg <g...@zooniverse.org> wrote:
> Hi, suppose I have some data of the form
>
>     k,(x,y)
>
> which are all numbers. For each key value (k) I want to do kmeans clustering
> on all corresponding (x,y) points. For each key value I have few enough
> points that I'm happy to use a traditional (non-mapreduce) kmeans
> implementation. The challenge is that I have a lot of different keys, so I
> want to use Hadoop/Spark to help split the clustering up over multiple
> computers. With Hadoop streaming and Python the code would be easy:
>
>     import sys
>
>     pts = []
>     current_k = None
>     for line in sys.stdin:
>         k, x, y = line.strip().split(",")
>         if k == current_k:
>             pts.append((x, y))
>         else:
>             if current_k is not None:
>                 pass  # do kmeans clustering on pts
>             current_k = k
>             pts = [(x, y)]
>
> (and obviously run kmeans for the final key as well)
>
> How do I express this in Spark? The function f for both filter and
> filterByKey needs to be transitive (all of the examples I've seen are just
> adding values). Ideally I'd like to be able to run this iteratively,
> changing the number of clusters for kmeans (so Spark would be nice). Given
> how easy this is to do in Hadoop, I feel like this should be easy in Spark
> as well, but I can't figure out a way.
>
> thanks :)
