That would constitute a major change in Spark's architecture. It's not happening anytime soon.
On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar <loni...@gmail.com> wrote:

> Possibly in the future, if and when the Spark architecture allows workers
> to launch Spark jobs (the functions passed to the transformation or action
> APIs of an RDD), it will be possible to have an RDD of RDDs.
>
> On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar <loni...@gmail.com> wrote:
>
>> A similar question was asked before:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html
>>
>> Here is one of the reasons why I think RDD[RDD[T]] is not possible:
>>
>> - An RDD is only a handle to the actual data partitions. It holds a
>> reference/pointer to the SparkContext object (sc) and a list of
>> partitions.
>> - The SparkContext is an object in the Spark application/driver
>> program's JVM. Likewise, the list of partitions lives in the driver
>> program's JVM. Each partition holds, in effect, a remote reference to the
>> partition data on the worker JVMs.
>> - The functions passed to an RDD's transformations and actions execute
>> in the workers' JVMs on different nodes. For example, in "rdd.map { x =>
>> x*x }", the function computing "x*x" runs on the JVMs of the worker nodes
>> where the partitions of the RDD reside. Those JVMs have no access to
>> "sc", since it exists only in the driver's JVM.
>> - Thus, for your RDD of RDDs, "outerRDD.map { innerRDD =>
>> innerRDD.filter { x => x > 0 } }", the worker nodes cannot execute the
>> filter on innerRDD, because the code on the worker has no access to "sc"
>> and cannot launch a Spark job.
>>
>> Hope this helps. You should consider List[RDD] or some other collection
>> instead.
>>
>> -Kiran
>>
>> On Tue, Jun 9, 2015 at 2:25 AM, ping yan <sharon...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> The problem I am looking at is as follows:
>>>
>>> - I read a log file of multiple users in as an RDD.
>>>
>>> - I'd like to group the above RDD into multiple RDDs by userId (the
>>> key).
>>>
>>> - My processEachUser() function then takes in each per-user RDD and
>>> calls RDD.map or DataFrame operations on it. (I already have the
>>> function coded, so I am reluctant to work with the ResultIterable object
>>> that comes out of rdd.groupByKey()...)
>>>
>>> I've searched the mailing list and googled "RDD of RDDs", and it seems
>>> it isn't a thing at all.
>>>
>>> The few choices left seem to be: 1) groupByKey() and then work with the
>>> ResultIterable object; 2) groupByKey() and then write each group to a
>>> file, and read the groups back as individual RDDs to process.
>>>
>>> Has anyone got a better idea, or had a similar problem before?
>>>
>>> Thanks!
>>> Ping
>>>
>>> --
>>> Ping Yan
>>> Ph.D. in Management
>>> Dept. of Management Information Systems
>>> University of Arizona
>>> Tucson, AZ 85721
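For reference, a minimal sketch of the List[RDD] workaround Kiran suggests, in Scala. The names (logs, splitByUser) and the pair-RDD input shape are illustrative assumptions, not from the thread:

    import org.apache.spark.rdd.RDD

    object PerUserRdds {
      // Build one RDD per user from a (userId, logLine) pair RDD.
      // Assumes the set of distinct userIds is small enough to
      // collect() to the driver.
      def splitByUser(logs: RDD[(String, String)]): List[(String, RDD[String])] = {
        logs.cache() // each per-user filter rescans logs, so keep it in memory
        val userIds = logs.keys.distinct().collect()
        userIds.toList.map { id =>
          // filter is lazy; no job runs until an action is called
          // on the per-user RDD, and it runs from the driver, so no
          // SparkContext is needed on the workers
          id -> logs.filter { case (user, _) => user == id }.values
        }
      }
    }

Each filter pass scans the whole dataset, so this only pays off for a handful of users. With many users, groupByKey() (choice 1 in Ping's list) avoids the repeated scans, at the cost of materializing each user's records as an Iterable on a single executor.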