Yes, true. That's why I said "if and when". But hopefully I have given a correct explanation of why an RDD of RDDs is not possible.

On 09-Jun-2015 10:22 pm, "Mark Hamstra" <m...@clearstorydata.com> wrote:
> That would constitute a major change in Spark's architecture. It's not
> happening anytime soon.
>
> On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar <loni...@gmail.com> wrote:
>
>> Possibly in the future, if and when the Spark architecture allows workers
>> to launch Spark jobs (the functions passed to the transformation or
>> action APIs of an RDD), it will be possible to have an RDD of RDDs.
>>
>> On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar <loni...@gmail.com> wrote:
>>
>>> A similar question was asked before:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html
>>>
>>> Here is one of the reasons why I think RDD[RDD[T]] is not possible:
>>>
>>>    - An RDD is only a handle to the actual data partitions. It has a
>>>    reference/pointer to the *SparkContext* object (*sc*) and a list of
>>>    partitions.
>>>    - The *SparkContext* is an object in the Spark application/driver
>>>    program's JVM. Similarly, the list of partitions is also in the JVM
>>>    of the driver program. Each partition contains a kind of "remote
>>>    reference" to the partition data on the worker JVMs.
>>>    - The functions passed to RDD transformations and actions execute
>>>    in the workers' JVMs on different nodes. For example, in "*rdd.map {
>>>    x => x*x }*", the function performing "*x*x*" runs on the JVMs of
>>>    the worker nodes where the partitions of the RDD reside. These JVMs
>>>    do not have access to "*sc*", since it exists only in the driver's
>>>    JVM.
>>>    - Thus, in the case of your *RDD of RDDs*: in *outerRDD.map {
>>>    innerRDD => innerRDD.filter { x => x > 0 } }*, the worker nodes will
>>>    not be able to execute the *filter* on *innerRDD*, as the code on
>>>    the worker does not have access to "sc" and cannot launch a Spark
>>>    job.
>>>
>>> Hope it helps. You need to consider List[RDD] or some other collection
>>> (a sketch of that approach follows at the end of this thread).
>>>
>>> -Kiran
>>>
>>> On Tue, Jun 9, 2015 at 2:25 AM, ping yan <sharon...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> The problem I am looking at is as follows:
>>>>
>>>> - I read in a log file of multiple users as an RDD.
>>>>
>>>> - I'd like to group the above RDD into *multiple RDDs* by userId (the
>>>> key).
>>>>
>>>> - My processEachUser() function then takes in one RDD per individual
>>>> user and calls RDD.map or DataFrame operations on it. (I already have
>>>> the function coded; I am therefore reluctant to work with the
>>>> ResultIterable object coming out of rdd.groupByKey() ...)
>>>>
>>>> I've searched the mailing list and googled "RDD of RDDs", and it seems
>>>> like it isn't a thing at all.
>>>>
>>>> A few choices left seem to be: 1) groupByKey() and then work with the
>>>> ResultIterable object (also sketched at the end of this thread); 2)
>>>> groupByKey() and then write each group into a file, and read the files
>>>> back as individual RDDs to process.
>>>>
>>>> Anyone got a better idea or had a similar problem before?
>>>>
>>>> Thanks!
>>>> Ping
>>>>
>>>> --
>>>> Ping Yan
>>>> Ph.D. in Management
>>>> Dept. of Management Information Systems
>>>> University of Arizona
>>>> Tucson, AZ 85721
>>>>
>>>
>>
>
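To make the List[RDD] suggestion concrete, here is a minimal sketch in Scala
for Ping's use case. It assumes the log lines parse into (userId, logLine)
pairs and that processEachUser() is the existing per-user function; both the
parsing and the function body below are placeholders, not the actual code.

    // Paste into spark-shell, where "sc" is already defined. Assumes log
    // lines look like "userId,rest-of-line"; the parsing and the body of
    // processEachUser() are placeholders for the actual format and logic.
    import org.apache.spark.rdd.RDD

    val logs: RDD[(String, String)] =
      sc.textFile("hdfs:///logs/input").map { line =>
        val Array(userId, rest) = line.split(",", 2)
        (userId, rest)
      }
    logs.cache() // each per-user filter below rescans this RDD

    def processEachUser(userId: String, events: RDD[String]): Unit =
      println(s"$userId: ${events.count()} events") // placeholder logic

    // Collect only the distinct keys to the driver, then build one RDD per
    // user. The resulting list of (userId, RDD) pairs lives on the driver,
    // so no RDD is ever nested inside another RDD.
    val userIds: Array[String] = logs.keys.distinct().collect()
    val perUser: List[(String, RDD[String])] =
      userIds.toList.map(id => (id, logs.filter(_._1 == id).values))

    perUser.foreach { case (id, events) => processEachUser(id, events) }

Note the trade-off: this launches one Spark job per user, so it is only
practical for a modest number of distinct userIds; caching the parsed RDD at
least avoids re-reading the input file on every filter.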
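And a sketch of option 1 from Ping's list, reusing the *logs* pair RDD from
the sketch above: group on the workers and rewrite the per-user logic to take
a plain local collection instead of an RDD. In Scala the grouped values come
back as an Iterable (the counterpart of pyspark's ResultIterable), so no
nested RDD, and no access to *sc*, is needed on the workers.

    // Option 1 from Ping's list, reusing the "logs" pair RDD defined above.
    val grouped: RDD[(String, Iterable[String])] = logs.groupByKey()

    grouped.foreach { case (userId, events) =>
      // "events" is an ordinary Iterable held in one worker JVM; the code
      // here must not reference sc or create RDDs, but it does not need to.
      println(s"$userId: ${events.size} events") // placeholder per-user logic
    }

This runs as a single job and scales better when there are many users, but
each user's events must then fit in the memory of a single worker.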