The former: groupByKey returns a single new RDD; the shuffle only changes which partition (and therefore which node) each tuple lives on. Check the PairRDDFunctions docs (http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions):

def groupByKey(): RDD[(K, Seq[V])]
Group the values for each key in the RDD into a single sequence.

On Wednesday, March 19, 2014 at 9:32 AM, Adrian Mocanu wrote:
> When you partition via groupByKey, tuples (parts of the RDD) are moved from
> some node to another node based on key (hash partitioning).
> Do the tuples remain part of 1 RDD as before but moved to different nodes, or
> does this shuffling create, say, several RDDs which will have parts of the
> original RDD?
>
> Thanks
> -Adrian
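
To make the answer concrete, here is a toy model in plain Scala (no Spark dependency) of what hash partitioning plus grouping does. It is only a sketch, not Spark's implementation: HashPartitionSketch, partitionFor and groupByKeySketch are hypothetical names invented for this illustration, and the "partitions" are just in-memory maps standing in for nodes. The point it tries to show is that each tuple is routed to a partition by its key's hash, yet the result is still one dataset made up of those partitions.

```scala
// Toy model only: all names below are hypothetical and not part of Spark.
object HashPartitionSketch {

  // Route a key to a partition using a non-negative modulus of its hash code.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Returns ONE dataset, modeled as a vector of partitions.
  // Each partition maps a key to all values seen for that key.
  def groupByKeySketch[K, V](
      records: Seq[(K, V)],
      numPartitions: Int
  ): Vector[Map[K, Seq[V]]] = {
    // "Shuffle": bucket every tuple by the partition its key hashes to.
    val byPartition = records.groupBy { case (k, _) => partitionFor(k, numPartitions) }

    // Within each partition, collect the values for each key.
    Vector.tabulate(numPartitions) { p =>
      byPartition
        .getOrElse(p, Seq.empty[(K, V)])
        .groupBy(_._1)
        .map { case (k, kvs) => k -> kvs.map(_._2) }
    }
  }

  def main(args: Array[String]): Unit = {
    val records = Seq("a" -> 1, "b" -> 2, "a" -> 3)
    val grouped = groupByKeySketch(records, numPartitions = 2)
    grouped.zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i: $part")
    }
  }
}
```

All values for a given key end up in the same partition, and the caller still holds a single result object, which mirrors the "single new RDD" behaviour described above.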