Re: RDD of RDDs
Thanks very much for the detailed explanations. I suspected there was no architectural support for the notion of an RDD of RDDs, but my understanding of Spark, and of distributed computing in general, is not deep enough to see why, so this really helps!

I ended up going with List[RDD]. The collection of unique users in my dataset is not too bad (2,000 or so), so I simply put each user into its own RDD:

    for user in users:
        userrdd = bigrdd.filter(lambda x: x[userid_pos] == user)

Thanks for helping out!

Ping
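For concreteness, here is a minimal Scala sketch of the same per-user split. Ping's snippet above is PySpark, so this is only an illustrative translation; the record layout, the userIdPos parameter, and the caching of bigRdd are assumptions, not part of her code.

    import org.apache.spark.rdd.RDD

    // bigRdd: RDD[Array[String]] parsed from the log file (assumed layout).
    // Caching helps because every per-user filter re-scans the whole dataset.
    def splitByUser(bigRdd: RDD[Array[String]],
                    users: Seq[String],
                    userIdPos: Int): Map[String, RDD[Array[String]]] = {
      bigRdd.cache()
      users.map { user =>
        user -> bigRdd.filter(row => row(userIdPos) == user)
      }.toMap
    }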
Re: RDD of RDDs
Possibly in the future, if and when Spark's architecture allows workers to launch Spark jobs from within the functions passed to RDD transformation or action APIs, it will be possible to have an RDD of RDDs.
Re: RDD of RDDs
A similar question was asked before: http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html

Here is one of the reasons why I think RDD[RDD[T]] is not possible:

- An RDD is only a handle to the actual data partitions. It holds a reference to the SparkContext object (sc) and a list of partitions.
- The SparkContext is an object in the JVM of the Spark application/driver program. Similarly, the list of partitions lives in the driver's JVM; each partition is essentially a remote reference to the partition data on the worker JVMs.
- The functions passed to an RDD's transformations and actions execute in the workers' JVMs on different nodes. For example, in rdd.map { x => x*x }, the function computing x*x runs on the JVMs of the worker nodes where the partitions of the RDD reside. Those JVMs do not have access to sc, since it exists only in the driver's JVM.
- Thus, in the case of your RDD of RDDs, outerRDD.map { innerRDD => innerRDD.filter { x => x*x } }, the worker nodes would not be able to execute the filter on innerRDD, because the code running in the worker has no access to sc and cannot launch a Spark job.

Hope it helps. You need to consider List[RDD] or some other collection.

-Kiran
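To illustrate the suggested alternative, here is a minimal Scala sketch that keeps the inner RDDs in an ordinary driver-side collection, so every nested job is launched from the JVM that actually holds sc. The data and the squaring logic are placeholders.

    import org.apache.spark.rdd.RDD

    // A plain Scala List of RDD handles, living in the driver's JVM.
    val innerRdds: List[RDD[Int]] =
      List(sc.parallelize(1 to 5), sc.parallelize(6 to 10))

    // The outer map runs on the driver, so each inner transformation/action
    // below can legitimately use `sc` and trigger its own Spark job.
    val squaredCounts: List[Long] = innerRdds.map { rdd =>
      rdd.map(x => x * x).count()
    }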
Re: Rdd of Rdds
Replicating my answer to another question asked today:

Here is one of the reasons why I think RDD[RDD[T]] is not possible:

* An RDD is only a handle to the actual data partitions. It holds a reference to the SparkContext object (sc) and a list of partitions.
* The SparkContext is an object in the JVM of the Spark application/driver program. Similarly, the list of partitions lives in the driver's JVM; each partition is essentially a remote reference to the partition data on the worker JVMs.
* The functions passed to an RDD's transformations and actions execute in the workers' JVMs on different nodes. For example, in rdd.map { x => x*x }, the function computing x*x runs on the JVMs of the worker nodes where the partitions of the RDD reside. Those JVMs do not have access to sc, since it exists only in the driver's JVM.
* Thus, in the case of your RDD of RDDs, outerRDD.map { innerRDD => innerRDD.filter { x => x*x } }, the worker nodes would not be able to execute the filter on innerRDD, because the code running in the worker has no access to sc and cannot launch a Spark job.

Hope it helps. You need to consider List[RDD] or some other collection.

Possibly in the future, if and when Spark's architecture allows workers to launch Spark jobs from within the functions passed to RDD transformation or action APIs, it will be possible to have an RDD of RDDs.
Re: RDD of RDDs
That would constitute a major change in Spark's architecture. It's not happening anytime soon.

On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar loni...@gmail.com wrote:
Possibly in the future, if and when Spark's architecture allows workers to launch Spark jobs from within the functions passed to RDD transformation or action APIs, it will be possible to have an RDD of RDDs.
Re: RDD of RDDs
Yes, true. That's why I said "if and when". But hopefully I have given a correct explanation of why an RDD of RDDs is not possible.

On 09-Jun-2015 10:22 pm, Mark Hamstra m...@clearstorydata.com wrote:
That would constitute a major change in Spark's architecture. It's not happening anytime soon.
RDD of RDDs
Hi,

The problem I am looking at is as follows:

- I read in a log file of multiple users as an RDD.
- I'd like to group the above RDD into multiple RDDs by userId (the key).
- My processEachUser() function then takes the RDD corresponding to each individual user and calls RDD.map or DataFrame operations on it. (I already have the function coded, so I am reluctant to rework it around the ResultIterable object coming out of rdd.groupByKey()...)

I've searched the mailing list and googled "RDD of RDDs", and it seems it isn't a thing at all. The few choices left seem to be: 1) groupByKey() and then work with the ResultIterable object; 2) groupByKey() and then write each group to a file, and read the files back as individual RDDs to process.

Anyone got a better idea, or had a similar problem before?

Thanks!
Ping

--
Ping Yan
Ph.D. in Management
Dept. of Management Information Systems
University of Arizona
Tucson, AZ 85721
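For reference, a minimal Scala sketch of option 1 above: group once and process each user's records inside a single job, rather than creating one RDD per user. The (userId, logLine) record shape and the error-counting logic stand in for the real processEachUser() and are assumptions only.

    import org.apache.spark.rdd.RDD

    // logRdd: RDD[(String, String)] of (userId, logLine) pairs (assumed shape).
    def perUserSummary(logRdd: RDD[(String, String)]): RDD[(String, Int)] =
      logRdd.groupByKey().mapValues { lines =>
        // `lines` is an Iterable[String] local to one task, not an RDD,
        // so plain Scala collection operations replace RDD.map here.
        lines.count(_.contains("ERROR"))   // placeholder per-user logic
      }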
Re: How to merge a RDD of RDDs into one uber RDD
You can also use the join function of RDD. This is actually a kind of append function that adds up all the RDDs and creates one uber RDD.
Re: How to merge a RDD of RDDs into one uber RDD
Thank you for the response; I will certainly try that out. For now I changed my code so that the first map, files.map, became files.flatMap, which I guess does something similar to what you are suggesting: it gives me a List of elements (in this case LabeledPoints, though I could also produce RDDs), which I then turn into one mega RDD. The original problem seems to be gone (I no longer get the NPE), but further down I am now getting an IndexOutOfBounds error, so I am trying to figure out whether the original problem is manifesting itself as a new one.

Regards,
-Ravi
Re: How to merge a RDD of RDDs into one uber RDD
I think you mean union(). Yes, you could also simply make an RDD for each file, and use SparkContext.union() to put them together.
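For illustration, a minimal Scala sketch of the SparkContext.union() approach described above: build one RDD per input file on the driver and merge them. The file paths and the use of textFile are placeholder assumptions.

    // paths: file names held on the driver, not inside an RDD.
    val paths: Seq[String] = Seq("hdfs:///logs/part-0001", "hdfs:///logs/part-0002")

    // One RDD handle per file, created on the driver where `sc` is available.
    val perFileRdds = paths.map(p => sc.textFile(p))

    // Merge them into a single "uber" RDD; union does not shuffle any data.
    val uberRdd = sc.union(perFileRdds)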
Re: How to merge a RDD of RDDs into one uber RDD
An RDD cannot contain elements of type RDD (i.e., you cannot nest RDDs within RDDs; in fact, I don't think it makes any sense). Rather than having an RDD of file names, I suggest collecting those file-name strings back onto the driver as a Scala array, and then from there making an array of RDDs, which you can fold over to merge them.
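A minimal Scala sketch of the collect-then-fold approach just described; the RDD of file names and the use of textFile are assumptions for illustration.

    // fileNamesRdd: RDD[String] of paths, e.g. produced by an earlier job.
    val fileNames: Array[String] = fileNamesRdd.collect()   // back on the driver

    // Build one RDD per file on the driver, then fold them into a single RDD.
    val merged = fileNames
      .map(path => sc.textFile(path))
      .reduce((a, b) => a.union(b))   // or foldLeft(sc.emptyRDD[String])(_ union _)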
Rdd of Rdds
Hello,

I would like to parallelize my work on multiple RDDs I have. I wanted to know whether Spark can support a foreach on an RDD of RDDs. Here's a Java example:

    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("testapp");
        sparkConf.setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        List<String> list = Arrays.asList(new String[] {"1", "2", "3"});
        JavaRDD<String> rdd = sc.parallelize(list);

        List<String> list1 = Arrays.asList(new String[] {"a", "b", "c"});
        JavaRDD<String> rdd1 = sc.parallelize(list1);

        List<JavaRDD<String>> rddList = new ArrayList<JavaRDD<String>>();
        rddList.add(rdd);
        rddList.add(rdd1);

        JavaRDD<JavaRDD<String>> rddOfRdds = sc.parallelize(rddList);
        System.out.println(rddOfRdds.count());

        rddOfRdds.foreach(new VoidFunction<JavaRDD<String>>() {
            @Override
            public void call(JavaRDD<String> t) throws Exception {
                System.out.println(t.count());
            }
        });
    }

From this code I'm getting a NullPointerException on the inner count() call:

    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure:
    Task 1.0:0 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost:
    java.lang.NullPointerException
        org.apache.spark.rdd.RDD.count(RDD.scala:861)
        org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:365)
        org.apache.spark.api.java.JavaRDD.count(JavaRDD.scala:29)

Help will be appreciated.

Thanks,
Tomer
Re: Rdd of Rdds
No, there's no such thing as an RDD of RDDs in Spark. Here though, why not just operate on an RDD of Lists? Or a List of RDDs? Usually one of these two is the right approach whenever you feel inclined to operate on an RDD of RDDs.
Re: Rdd of Rdds
Another approach could be to create artificial keys for each RDD and convert them to PairRDDs. Your first RDD then becomes a JavaPairRDD<Integer, String> rdd1 with values (1, "1"), (1, "2"), and so on, and the second becomes rdd2 with (2, "a"), (2, "b"), (2, "c"). You can union the two RDDs, then groupByKey, countByKey, etc., and maybe achieve what you are trying to do. Sorry, this is just a hypothesis, as I am not entirely sure what you are trying to achieve. Ideally, I would think hard about whether multiple RDDs are indeed needed, just as Sean pointed out.

Best Regards,
Sonal
Nube Technologies
http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
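For illustration, a minimal Scala sketch of the tagging approach described above, which keeps everything in a single keyed RDD instead of nesting RDDs. The sample data simply mirrors the hypothetical rdd1/rdd2 values above.

    // Tag each source RDD with an artificial key, then combine them.
    val rdd1 = sc.parallelize(Seq("1", "2", "3")).map(v => (1, v))
    val rdd2 = sc.parallelize(Seq("a", "b", "c")).map(v => (2, v))

    val combined = rdd1.union(rdd2)   // RDD[(Int, String)]

    // Per-source operations now become per-key operations.
    val countsPerSource = combined.countByKey()    // Map(1 -> 3, 2 -> 3)
    val groupedPerSource = combined.groupByKey()   // RDD[(Int, Iterable[String])]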
Re: Rdd of Rdds
On Wednesday, October 22, 2014 9:06 AM, Sean Owen so...@cloudera.com wrote:
No, there's no such thing as an RDD of RDDs in Spark. Here though, why not just operate on an RDD of Lists? Or a List of RDDs? Usually one of these two is the right approach whenever you feel inclined to operate on an RDD of RDDs.

Depending on one's needs, one could also consider the matrix (RDD[Vector]) operations provided by MLlib, such as https://spark.apache.org/docs/latest/mllib-statistics.html
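For illustration, a minimal Scala sketch of the RDD[Vector] route mentioned above, using MLlib's column statistics when the per-group data is numeric and fixed-width; the feature values here are placeholders.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    // Each row is one observation; all rows live in a single RDD[Vector].
    val observations = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0, 100.0),
      Vectors.dense(2.0, 20.0, 200.0),
      Vectors.dense(3.0, 30.0, 300.0)
    ))

    // Column-wise summary statistics computed in one pass.
    val summary = Statistics.colStats(observations)
    println(summary.mean)      // per-column means
    println(summary.variance)  // per-column variances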