groupByKey will give you a PairRDD where, for each key k, you have an
Iterable over all the corresponding (x,y) points. You can then call
mapValues and apply your clustering to each group of points to yield a
result R, leaving you with a PairRDD of (k,R) pairs. This all happens
in parallel.
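
For concreteness, here is a minimal PySpark sketch of that groupByKey +
mapValues pattern. The small_kmeans helper, the sample data, and the use of
SciPy's kmeans2 are just illustrative assumptions; any in-memory k-means
would work in its place:

import numpy as np
from scipy.cluster.vq import kmeans2          # any ordinary, in-memory k-means will do
from pyspark import SparkContext

sc = SparkContext(appName="per-key-kmeans")

# An RDD of (k, (x, y)) pairs; in practice this would be parsed from your input.
points = sc.parallelize([("a", (1.0, 2.0)), ("a", (1.2, 1.9)), ("a", (5.0, 5.1)),
                         ("b", (0.0, 0.1)), ("b", (8.0, 8.2)), ("b", (7.9, 8.1))])

def small_kmeans(xy_iterable, n_clusters=2):
    """Cluster one key's points with a traditional (non-distributed) k-means."""
    pts = np.array(list(xy_iterable))
    centroids, _labels = kmeans2(pts, n_clusters, minit="points")
    return centroids

# Group all (x, y) for each key, then cluster each group independently in parallel.
results = points.groupByKey().mapValues(small_kmeans)   # PairRDD of (k, centroids)

print(results.collect())

To change the number of clusters on a later run, you could call mapValues
with a different setting, e.g.
points.groupByKey().mapValues(lambda v: small_kmeans(v, n_clusters=3)).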

On Thu, Jul 31, 2014 at 12:00 PM, Greg <g...@zooniverse.org> wrote:
> Hi, suppose I have some data of the form
> k,(x,y)
> which are all numbers. For each key value (k) I want to do kmeans clustering
> on all corresponding (x,y) points. For each key value I have few enough
> points that I'm happy to use a traditional (non-mapreduce) kmeans
> implementation. The challenge is that I have a lot of different keys so I
> want to use Hadoop/Spark to help split the clustering up over multiple
> computers. With Hadoop-streaming and Python the code would be easy:
> import sys
>
> pts = []
> current_k = None
> for line in sys.stdin:
>     k, x, y = line.strip().split(",")
>     if k == current_k:
>         pts.append((float(x), float(y)))
>     else:
>         if current_k is not None:
>             pass  # do kmeans clustering on pts
>         current_k = k
>         pts = [(float(x), float(y))]
>
> (and obviously run kmeans for the final key as well)
> How do I express this in Spark? The function f for both filter and
> filterByKey needs to be transitive (all of the examples I've seen just
> add values). Ideally I'd like to be able to run this iteratively,
> changing the number of clusters for kmeans (so Spark would be nice). Given
> how easy this is to do in Hadoop, I feel like it should be easy in Spark
> as well, but I can't figure out a way.
>
> thanks :)
