Re: RDD of RDDs

2015-06-10 Thread ping yan
Thanks much for the detailed explanations. I suspected the architecture
didn't support the notion of an RDD of RDDs, but my understanding of Spark
and of distributed computing in general isn't deep enough to see why, so
this really helps!

I ended up going with a List[RDD]. The number of unique users in my
dataset is not too bad - 2,000 or so - so I simply put each user's records
into its own RDD by doing:

for user in users:
    userrdd = bigrdd.filter(lambda x: x[userid_pos] == user)
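
A slightly fuller sketch of that driver-side loop, for reference
(processEachUser and userid_pos are the names from the original question
below; user_rdds and results are illustrative). Note the u=user default
argument, which pins each lambda to its own user value, since RDD.filter is
lazy and would otherwise capture the loop variable:

    user_rdds = {}
    for user in users:
        # keep only this user's records; bind the loop variable via a default arg
        user_rdds[user] = bigrdd.filter(lambda x, u=user: x[userid_pos] == u)

    # apply the existing per-user processing to each small RDD, from the driver
    results = {user: processEachUser(rdd) for user, rdd in user_rdds.items()}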

Thanks for helping out!
Ping

On Tue, Jun 9, 2015 at 1:17 AM kiran lonikar loni...@gmail.com wrote:

 A similar question was asked before:
 http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html

 Here is one of the reasons why I think RDD[RDD[T]] is not possible:

    - An RDD is only a handle to the actual data partitions. It has a
    reference/pointer to the *SparkContext* object (*sc*) and a list of
    partitions.
    - The *SparkContext* is an object in the Spark Application/Driver
    Program's JVM. Similarly, the list of partitions is also in the JVM of the
    driver program. Each partition contains a kind of remote reference to the
    partition data on the worker JVMs.
    - The functions passed to an RDD's transformations and actions execute in
    the workers' JVMs on different nodes. For example, in *rdd.map { x =>
    x * x }*, the function computing *x * x* runs on the JVMs of the worker
    nodes where the partitions of the RDD reside. These JVMs do not have
    access to *sc*, since it exists only in the driver's JVM.
    - Thus, in the case of your *RDD of RDDs*, *outerRDD.map { innerRDD =>
    innerRDD.filter { x => x > 0 } }*, the worker nodes will not be able to
    execute the *filter* on *innerRDD*, as the code on the worker does not
    have access to sc and cannot launch a Spark job.


 Hope it helps. You need to consider a List[RDD] or some other driver-side
 collection instead.

 -Kiran

 On Tue, Jun 9, 2015 at 2:25 AM, ping yan sharon...@gmail.com wrote:

 Hi,


 The problem I am looking at is as follows:

 - I read in a log file of multiple users as an RDD.

 - I'd like to group the above RDD into *multiple RDDs* by userId (the
 key).

 - My processEachUser() function then takes the RDD for each individual
 user and calls RDD.map or DataFrame operations on it. (I already have the
 function coded, so I am reluctant to work with the ResultIterable object
 coming out of rdd.groupByKey() ...)

 I've searched the mailing list and googled "RDD of RDDs", and it seems
 like it isn't a thing at all.

 A few choices left seem to be: 1) groupByKey() and then work with the
 ResultIterable object; 2) groupByKey(), then write each group to a file
 and read the files back as individual RDDs to process...

 Anyone got a better idea or had a similar problem before?


 Thanks!
 Ping






 --
 Ping Yan
 Ph.D. in Management
 Dept. of Management Information Systems
 University of Arizona
 Tucson, AZ 85721





Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
Possibly in the future, if and when the Spark architecture allows workers to
launch Spark jobs (from the functions passed to an RDD's transformation or
action APIs), it will be possible to have an RDD of RDDs.

On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar loni...@gmail.com wrote:

 A similar question was asked before:
 http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html

 Here is one of the reasons why I think RDD[RDD[T]] is not possible:

    - An RDD is only a handle to the actual data partitions. It has a
    reference/pointer to the *SparkContext* object (*sc*) and a list of
    partitions.
    - The *SparkContext* is an object in the Spark Application/Driver
    Program's JVM. Similarly, the list of partitions is also in the JVM of the
    driver program. Each partition contains a kind of remote reference to the
    partition data on the worker JVMs.
    - The functions passed to an RDD's transformations and actions execute in
    the workers' JVMs on different nodes. For example, in *rdd.map { x =>
    x * x }*, the function computing *x * x* runs on the JVMs of the worker
    nodes where the partitions of the RDD reside. These JVMs do not have
    access to *sc*, since it exists only in the driver's JVM.
    - Thus, in the case of your *RDD of RDDs*, *outerRDD.map { innerRDD =>
    innerRDD.filter { x => x > 0 } }*, the worker nodes will not be able to
    execute the *filter* on *innerRDD*, as the code on the worker does not
    have access to sc and cannot launch a Spark job.


 Hope it helps. You need to consider a List[RDD] or some other driver-side
 collection instead.

 -Kiran

 On Tue, Jun 9, 2015 at 2:25 AM, ping yan sharon...@gmail.com wrote:

 Hi,


 The problem I am looking at is as follows:

 - I read in a log file of multiple users as an RDD.

 - I'd like to group the above RDD into *multiple RDDs* by userId (the
 key).

 - My processEachUser() function then takes the RDD for each individual
 user and calls RDD.map or DataFrame operations on it. (I already have the
 function coded, so I am reluctant to work with the ResultIterable object
 coming out of rdd.groupByKey() ...)

 I've searched the mailing list and googled "RDD of RDDs", and it seems
 like it isn't a thing at all.

 A few choices left seem to be: 1) groupByKey() and then work with the
 ResultIterable object; 2) groupByKey(), then write each group to a file
 and read the files back as individual RDDs to process...

 Anyone got a better idea or had a similar problem before?


 Thanks!
 Ping






 --
 Ping Yan
 Ph.D. in Management
 Dept. of Management Information Systems
 University of Arizona
 Tucson, AZ 85721





Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
A similar question was asked before:
http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html

Here is one of the reasons why I think RDD[RDD[T]] is not possible:

   - An RDD is only a handle to the actual data partitions. It has a
   reference/pointer to the *SparkContext* object (*sc*) and a list of
   partitions.
   - The *SparkContext* is an object in the Spark Application/Driver
   Program's JVM. Similarly, the list of partitions is also in the JVM of the
   driver program. Each partition contains a kind of remote reference to the
   partition data on the worker JVMs.
   - The functions passed to an RDD's transformations and actions execute in
   the workers' JVMs on different nodes. For example, in *rdd.map { x =>
   x * x }*, the function computing *x * x* runs on the JVMs of the worker
   nodes where the partitions of the RDD reside. These JVMs do not have access
   to *sc*, since it exists only in the driver's JVM.
   - Thus, in the case of your *RDD of RDDs*, *outerRDD.map { innerRDD =>
   innerRDD.filter { x => x > 0 } }*, the worker nodes will not be able to
   execute the *filter* on *innerRDD*, as the code on the worker does not
   have access to sc and cannot launch a Spark job.


Hope it helps. You need to consider a List[RDD] or some other driver-side
collection instead.
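
As a rough sketch of that driver-side alternative in PySpark (assuming a
SparkContext sc and a pair RDD keyed_rdd of (key, record) tuples; all names
here are illustrative, not from this thread):

    # collect the distinct keys back to the driver, where sc lives
    keys = keyed_rdd.keys().distinct().collect()

    # a plain driver-side list of RDDs, one per key
    per_key_rdds = [keyed_rdd.filter(lambda kv, k=k: kv[0] == k).values() for k in keys]

    # every transformation/action below is issued from the driver, never from a worker
    counts = [rdd.count() for rdd in per_key_rdds]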

-Kiran

On Tue, Jun 9, 2015 at 2:25 AM, ping yan sharon...@gmail.com wrote:

 Hi,


 The problem I am looking at is as follows:

 - I read in a log file of multiple users as an RDD.

 - I'd like to group the above RDD into *multiple RDDs* by userId (the
 key).

 - My processEachUser() function then takes the RDD for each individual
 user and calls RDD.map or DataFrame operations on it. (I already have the
 function coded, so I am reluctant to work with the ResultIterable object
 coming out of rdd.groupByKey() ...)

 I've searched the mailing list and googled "RDD of RDDs", and it seems
 like it isn't a thing at all.

 A few choices left seem to be: 1) groupByKey() and then work with the
 ResultIterable object; 2) groupByKey(), then write each group to a file
 and read the files back as individual RDDs to process...

 Anyone got a better idea or had a similar problem before?


 Thanks!
 Ping






 --
 Ping Yan
 Ph.D. in Management
 Dept. of Management Information Systems
 University of Arizona
 Tucson, AZ 85721




Re: Rdd of Rdds

2015-06-09 Thread lonikar
Replicating my answer to another question asked today:

Here is one of the reasons why I think RDD[RDD[T]] is not possible:
   * An RDD is only a handle to the actual data partitions. It has a
reference/pointer to the /SparkContext/ object (/sc/) and a list of
partitions.
   * The SparkContext is an object in the Spark Application/Driver Program's
JVM. Similarly, the list of partitions is also in the JVM of the driver
program. Each partition contains a kind of remote reference to the
partition data on the worker JVMs.
   * The functions passed to an RDD's transformations and actions execute in
the workers' JVMs on different nodes. For example, in *rdd.map { x => x * x
}*, the function computing *x * x* runs on the JVMs of the worker nodes
where the partitions of the RDD reside. These JVMs do not have access to
sc, since it exists only in the driver's JVM.
   * Thus, in the case of your RDD of RDDs, *outerRDD.map { innerRDD =>
innerRDD.filter { x => x > 0 } }*, the worker nodes will not be able to
execute the filter on innerRDD, as the code on the worker does not have
access to sc and cannot launch a Spark job.

Hope it helps. You need to consider a List[RDD] or some other driver-side
collection instead.

Possibly in the future, if and when the Spark architecture allows workers to
launch Spark jobs (from the functions passed to an RDD's transformation or
action APIs), it will be possible to have an RDD of RDDs.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-tp17025p23217.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: RDD of RDDs

2015-06-09 Thread Mark Hamstra
That would constitute a major change in Spark's architecture.  It's not
happening anytime soon.

On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar loni...@gmail.com wrote:

 Possibly in the future, if and when the Spark architecture allows workers to
 launch Spark jobs (from the functions passed to an RDD's transformation or
 action APIs), it will be possible to have an RDD of RDDs.

 On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar loni...@gmail.com wrote:

 A similar question was asked before:
 http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html

 Here is one of the reasons why I think RDD[RDD[T]] is not possible:

    - An RDD is only a handle to the actual data partitions. It has a
    reference/pointer to the *SparkContext* object (*sc*) and a list of
    partitions.
    - The *SparkContext* is an object in the Spark Application/Driver
    Program's JVM. Similarly, the list of partitions is also in the JVM of the
    driver program. Each partition contains a kind of remote reference to the
    partition data on the worker JVMs.
    - The functions passed to an RDD's transformations and actions execute in
    the workers' JVMs on different nodes. For example, in *rdd.map { x =>
    x * x }*, the function computing *x * x* runs on the JVMs of the worker
    nodes where the partitions of the RDD reside. These JVMs do not have
    access to *sc*, since it exists only in the driver's JVM.
    - Thus, in the case of your *RDD of RDDs*, *outerRDD.map { innerRDD =>
    innerRDD.filter { x => x > 0 } }*, the worker nodes will not be able to
    execute the *filter* on *innerRDD*, as the code on the worker does not
    have access to sc and cannot launch a Spark job.


 Hope it helps. You need to consider a List[RDD] or some other driver-side
 collection instead.

 -Kiran

 On Tue, Jun 9, 2015 at 2:25 AM, ping yan sharon...@gmail.com wrote:

 Hi,


 The problem I am looking at is as follows:

 - I read in a log file of multiple users as an RDD.

 - I'd like to group the above RDD into *multiple RDDs* by userId (the
 key).

 - My processEachUser() function then takes the RDD for each individual
 user and calls RDD.map or DataFrame operations on it. (I already have the
 function coded, so I am reluctant to work with the ResultIterable object
 coming out of rdd.groupByKey() ...)

 I've searched the mailing list and googled "RDD of RDDs", and it seems
 like it isn't a thing at all.

 A few choices left seem to be: 1) groupByKey() and then work with the
 ResultIterable object; 2) groupByKey(), then write each group to a file
 and read the files back as individual RDDs to process...

 Anyone got a better idea or had a similar problem before?


 Thanks!
 Ping






 --
 Ping Yan
 Ph.D. in Management
 Dept. of Management Information Systems
 University of Arizona
 Tucson, AZ 85721






Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
Yes, true. That's why I said "if and when".

But hopefully I have given a correct explanation of why an RDD of RDDs is
not possible.
On 09-Jun-2015 10:22 pm, Mark Hamstra m...@clearstorydata.com wrote:

 That would constitute a major change in Spark's architecture.  It's not
 happening anytime soon.

 On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar loni...@gmail.com wrote:

 Possibly in the future, if and when the Spark architecture allows workers to
 launch Spark jobs (from the functions passed to an RDD's transformation or
 action APIs), it will be possible to have an RDD of RDDs.

 On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar loni...@gmail.com wrote:

 A similar question was asked before:
 http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html

 Here is one of the reasons why I think RDD[RDD[T]] is not possible:

    - An RDD is only a handle to the actual data partitions. It has a
    reference/pointer to the *SparkContext* object (*sc*) and a list of
    partitions.
    - The *SparkContext* is an object in the Spark Application/Driver
    Program's JVM. Similarly, the list of partitions is also in the JVM of the
    driver program. Each partition contains a kind of remote reference to the
    partition data on the worker JVMs.
    - The functions passed to an RDD's transformations and actions execute in
    the workers' JVMs on different nodes. For example, in *rdd.map { x =>
    x * x }*, the function computing *x * x* runs on the JVMs of the worker
    nodes where the partitions of the RDD reside. These JVMs do not have
    access to *sc*, since it exists only in the driver's JVM.
    - Thus, in the case of your *RDD of RDDs*, *outerRDD.map { innerRDD =>
    innerRDD.filter { x => x > 0 } }*, the worker nodes will not be able to
    execute the *filter* on *innerRDD*, as the code on the worker does not
    have access to sc and cannot launch a Spark job.


 Hope it helps. You need to consider a List[RDD] or some other driver-side
 collection instead.

 -Kiran

 On Tue, Jun 9, 2015 at 2:25 AM, ping yan sharon...@gmail.com wrote:

 Hi,


 The problem I am looking at is as follows:

 - I read in a log file of multiple users as an RDD.

 - I'd like to group the above RDD into *multiple RDDs* by userId (the
 key).

 - My processEachUser() function then takes the RDD for each individual
 user and calls RDD.map or DataFrame operations on it. (I already have the
 function coded, so I am reluctant to work with the ResultIterable object
 coming out of rdd.groupByKey() ...)

 I've searched the mailing list and googled "RDD of RDDs", and it seems
 like it isn't a thing at all.

 A few choices left seem to be: 1) groupByKey() and then work with the
 ResultIterable object; 2) groupByKey(), then write each group to a file
 and read the files back as individual RDDs to process...

 Anyone got a better idea or had a similar problem before?


 Thanks!
 Ping






 --
 Ping Yan
 Ph.D. in Management
 Dept. of Management Information Systems
 University of Arizona
 Tucson, AZ 85721







RDD of RDDs

2015-06-08 Thread ping yan
Hi,


The problem I am looking at is as follows:

- I read in a log file of multiple users as an RDD.

- I'd like to group the above RDD into *multiple RDDs* by userId (the key).

- My processEachUser() function then takes the RDD for each individual
user and calls RDD.map or DataFrame operations on it. (I already have the
function coded, so I am reluctant to work with the ResultIterable object
coming out of rdd.groupByKey() ...)

I've searched the mailing list and googled "RDD of RDDs", and it seems like
it isn't a thing at all.

A few choices left seem to be: 1) groupByKey() and then work with the
ResultIterable object; 2) groupByKey(), then write each group to a file
and read the files back as individual RDDs to process...
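
(A minimal sketch of choice 1, for concreteness - assuming the records are
tuples with the user id at index userid_pos, and a hypothetical
processUserRecords function that accepts a plain Python iterable rather than
an RDD:

    grouped = bigrdd.map(lambda x: (x[userid_pos], x)).groupByKey()

    # each value is a ResultIterable of one user's records (not an RDD), and the
    # function below runs on the workers, so it must not touch sc
    results = grouped.mapValues(lambda records: processUserRecords(list(records)))
)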

Anyone got a better idea or had a similar problem before?


Thanks!
Ping






-- 
Ping Yan
Ph.D. in Management
Dept. of Management Information Systems
University of Arizona
Tucson, AZ 85721


Re: How to merge a RDD of RDDs into one uber RDD

2015-01-07 Thread Raghavendra Pandey
You can also use the join function of RDD. It is actually a kind of append
function that adds up all the RDDs and creates one uber RDD.

On Wed, Jan 7, 2015, 14:30 rkgurram rkgur...@gmail.com wrote:

 Thank you for the response; I will certainly try that out.

 For now I changed my code so that the first map, files.map, became
 files.flatMap, which I guess does something similar to what you are
 suggesting: it gives me a List[] of elements (in this case LabeledPoints,
 though I could also do RDDs), which I then turned into one mega RDD. The
 original problem seems to be gone - I no longer get the NPE - but further
 down I am now getting an IndexOutOfBoundsException, so I am trying to
 figure out whether the original problem is manifesting itself in a new
 form.


 Regards
 -Ravi




 --
  View this message in context:
  http://apache-spark-user-list.1001560.n3.nabble.com/How-to-merge-a-RDD-of-RDDs-into-one-uber-RDD-tp20986p21012.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: How to merge a RDD of RDDs into one uber RDD

2015-01-07 Thread rkgurram
Thank you for the response; I will certainly try that out.

For now I changed my code so that the first map, files.map, became
files.flatMap, which I guess does something similar to what you are
suggesting: it gives me a List[] of elements (in this case LabeledPoints,
though I could also do RDDs), which I then turned into one mega RDD. The
original problem seems to be gone - I no longer get the NPE - but further
down I am now getting an IndexOutOfBoundsException, so I am trying to
figure out whether the original problem is manifesting itself in a new form.


Regards
-Ravi




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-merge-a-RDD-of-RDDs-into-one-uber-RDD-tp20986p21012.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to merge a RDD of RDDs into one uber RDD

2015-01-07 Thread Sean Owen
I think you mean union(). Yes, you could also simply make an RDD for each
file, and use SparkContext.union() to put them together.
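
A minimal PySpark sketch of that suggestion (the paths and the sc handle are
illustrative, not from this thread):

    paths = ["logs/part-00000.txt", "logs/part-00001.txt"]  # illustrative file names
    rdds = [sc.textFile(p) for p in paths]                  # one RDD per file, built on the driver
    uber = sc.union(rdds)                                   # SparkContext.union merges them into one RDD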

On Wed, Jan 7, 2015 at 9:51 AM, Raghavendra Pandey 
raghavendra.pan...@gmail.com wrote:

 You can also use the join function of RDD. It is actually a kind of append
 function that adds up all the RDDs and creates one uber RDD.

 On Wed, Jan 7, 2015, 14:30 rkgurram rkgur...@gmail.com wrote:

 Thank you for the response; I will certainly try that out.

 For now I changed my code so that the first map, files.map, became
 files.flatMap, which I guess does something similar to what you are
 suggesting: it gives me a List[] of elements (in this case LabeledPoints,
 though I could also do RDDs), which I then turned into one mega RDD. The
 original problem seems to be gone - I no longer get the NPE - but further
 down I am now getting an IndexOutOfBoundsException, so I am trying to
 figure out whether the original problem is manifesting itself in a new
 form.


 Regards
 -Ravi




 --
  View this message in context:
  http://apache-spark-user-list.1001560.n3.nabble.com/How-to-merge-a-RDD-of-RDDs-into-one-uber-RDD-tp20986p21012.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: How to merge a RDD of RDDs into one uber RDD

2015-01-06 Thread k.tham
An RDD cannot contain elements of type RDD (i.e., you can't nest RDDs within
RDDs; in fact, I don't think it would make any sense).

I suggest that, rather than having an RDD of file names, you collect those
file-name strings back to the driver as a Scala array, and from there build
an array of RDDs, which you can then fold over and merge.
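
The same idea rendered in PySpark rather than Scala (filenames_rdd and sc
are assumed; a rough sketch only):

    from functools import reduce

    filenames = filenames_rdd.collect()                 # bring the file names back to the driver
    rdds = [sc.textFile(name) for name in filenames]    # one RDD per file
    merged = reduce(lambda a, b: a.union(b), rdds)      # fold over the list, merging pairwise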



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-merge-a-RDD-of-RDDs-into-one-uber-RDD-tp20986p21007.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Rdd of Rdds

2014-10-22 Thread Tomer Benyamini
Hello,

I would like to parallelize my work across the multiple RDDs I have. I
wanted to know whether Spark can support a foreach on an RDD of RDDs.
Here's a Java example:

public static void main(String[] args) {

    SparkConf sparkConf = new SparkConf().setAppName("testapp");
    sparkConf.setMaster("local");

    JavaSparkContext sc = new JavaSparkContext(sparkConf);

    List<String> list = Arrays.asList(new String[] {"1", "2", "3"});
    JavaRDD<String> rdd = sc.parallelize(list);

    List<String> list1 = Arrays.asList(new String[] {"a", "b", "c"});
    JavaRDD<String> rdd1 = sc.parallelize(list1);

    List<JavaRDD<String>> rddList = new ArrayList<JavaRDD<String>>();
    rddList.add(rdd);
    rddList.add(rdd1);

    JavaRDD<JavaRDD<String>> rddOfRdds = sc.parallelize(rddList);
    System.out.println(rddOfRdds.count());

    rddOfRdds.foreach(new VoidFunction<JavaRDD<String>>() {

        @Override
        public void call(JavaRDD<String> t) throws Exception {
            System.out.println(t.count());
        }

    });
}

From this code I'm getting a NullPointerException on the internal count method:

Exception in thread main org.apache.spark.SparkException: Job
aborted due to stage failure: Task 1.0:0 failed 1 times, most recent
failure: Exception failure in TID 1 on host localhost:
java.lang.NullPointerException

org.apache.spark.rdd.RDD.count(RDD.scala:861)

org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:365)

org.apache.spark.api.java.JavaRDD.count(JavaRDD.scala:29)

Help will be appreciated.

Thanks,
Tomer

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Rdd of Rdds

2014-10-22 Thread Sean Owen
No, there's no such thing as an RDD of RDDs in Spark.
Here though, why not just operate on an RDD of Lists? or a List of RDDs?
Usually one of these two is the right approach whenever you feel
inclined to operate on an RDD of RDDs.
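
A small PySpark sketch of the two shapes (sc is assumed; the data mirrors the
quoted example below):

    # an RDD of lists: each element of the outer RDD is an ordinary Python list
    rdd_of_lists = sc.parallelize([["1", "2", "3"], ["a", "b", "c"]])
    print(rdd_of_lists.map(len).collect())              # [3, 3]

    # a list of RDDs: a driver-side list whose elements are RDDs
    list_of_rdds = [sc.parallelize(["1", "2", "3"]), sc.parallelize(["a", "b", "c"])]
    print([r.count() for r in list_of_rdds])            # [3, 3]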

On Wed, Oct 22, 2014 at 3:58 PM, Tomer Benyamini tomer@gmail.com wrote:
 Hello,

 I would like to parallelize my work across the multiple RDDs I have. I
 wanted to know whether Spark can support a foreach on an RDD of RDDs.
 Here's a Java example:

 public static void main(String[] args) {

     SparkConf sparkConf = new SparkConf().setAppName("testapp");
     sparkConf.setMaster("local");

     JavaSparkContext sc = new JavaSparkContext(sparkConf);

     List<String> list = Arrays.asList(new String[] {"1", "2", "3"});
     JavaRDD<String> rdd = sc.parallelize(list);

     List<String> list1 = Arrays.asList(new String[] {"a", "b", "c"});
     JavaRDD<String> rdd1 = sc.parallelize(list1);

     List<JavaRDD<String>> rddList = new ArrayList<JavaRDD<String>>();
     rddList.add(rdd);
     rddList.add(rdd1);

     JavaRDD<JavaRDD<String>> rddOfRdds = sc.parallelize(rddList);
     System.out.println(rddOfRdds.count());

     rddOfRdds.foreach(new VoidFunction<JavaRDD<String>>() {

         @Override
         public void call(JavaRDD<String> t) throws Exception {
             System.out.println(t.count());
         }

     });
 }

 From this code I'm getting a NullPointerException on the internal count 
 method:

 Exception in thread main org.apache.spark.SparkException: Job
 aborted due to stage failure: Task 1.0:0 failed 1 times, most recent
 failure: Exception failure in TID 1 on host localhost:
 java.lang.NullPointerException

 org.apache.spark.rdd.RDD.count(RDD.scala:861)

 
 org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:365)

 org.apache.spark.api.java.JavaRDD.count(JavaRDD.scala:29)

 Help will be appreciated.

 Thanks,
 Tomer

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Rdd of Rdds

2014-10-22 Thread Sonal Goyal
Another approach could be to create an artificial key for each RDD and
convert them to pair RDDs. So your first RDD becomes a
JavaPairRDD<Integer, String> rdd1 with values (1, "1"), (1, "2") and so on,
and the second RDD becomes rdd2 with (2, "a"), (2, "b"), (2, "c").

You can then union the two RDDs, groupByKey, countByKey, etc., and maybe
achieve what you are trying to do. Sorry, this is just a hypothesis, as I am
not entirely sure what you are trying to achieve. Ideally, I would think
hard about whether multiple RDDs are indeed needed, just as Sean pointed out.
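
A rough PySpark sketch of that keying idea, mirroring the two example RDDs
(sc is assumed):

    rdd1 = sc.parallelize(["1", "2", "3"]).map(lambda v: (1, v))   # artificial key 1
    rdd2 = sc.parallelize(["a", "b", "c"]).map(lambda v: (2, v))   # artificial key 2

    combined = rdd1.union(rdd2)        # a single pair RDD instead of an RDD of RDDs
    print(combined.countByKey())       # counts per original RDD: {1: 3, 2: 3}
    grouped = combined.groupByKey()    # (key, iterable of values) for each original RDD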

Best Regards,
Sonal
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal



On Wed, Oct 22, 2014 at 8:35 PM, Sean Owen so...@cloudera.com wrote:

 No, there's no such thing as an RDD of RDDs in Spark.
 Here though, why not just operate on an RDD of Lists? or a List of RDDs?
 Usually one of these two is the right approach whenever you feel
 inclined to operate on an RDD of RDDs.

 On Wed, Oct 22, 2014 at 3:58 PM, Tomer Benyamini tomer@gmail.com
 wrote:
  Hello,
 
  I would like to parallelize my work across the multiple RDDs I have. I
  wanted to know whether Spark can support a foreach on an RDD of RDDs.
  Here's a Java example:
 
  public static void main(String[] args) {

      SparkConf sparkConf = new SparkConf().setAppName("testapp");
      sparkConf.setMaster("local");

      JavaSparkContext sc = new JavaSparkContext(sparkConf);

      List<String> list = Arrays.asList(new String[] {"1", "2", "3"});
      JavaRDD<String> rdd = sc.parallelize(list);

      List<String> list1 = Arrays.asList(new String[] {"a", "b", "c"});
      JavaRDD<String> rdd1 = sc.parallelize(list1);

      List<JavaRDD<String>> rddList = new ArrayList<JavaRDD<String>>();
      rddList.add(rdd);
      rddList.add(rdd1);

      JavaRDD<JavaRDD<String>> rddOfRdds = sc.parallelize(rddList);
      System.out.println(rddOfRdds.count());

      rddOfRdds.foreach(new VoidFunction<JavaRDD<String>>() {

          @Override
          public void call(JavaRDD<String> t) throws Exception {
              System.out.println(t.count());
          }

      });
  }
 
  From this code I'm getting a NullPointerException on the internal count
 method:
 
  Exception in thread main org.apache.spark.SparkException: Job
  aborted due to stage failure: Task 1.0:0 failed 1 times, most recent
  failure: Exception failure in TID 1 on host localhost:
  java.lang.NullPointerException
 
  org.apache.spark.rdd.RDD.count(RDD.scala:861)
 
 
  org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:365)
 
  org.apache.spark.api.java.JavaRDD.count(JavaRDD.scala:29)
 
  Help will be appreciated.
 
  Thanks,
  Tomer
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Rdd of Rdds

2014-10-22 Thread Michael Malak
On Wednesday, October 22, 2014 9:06 AM, Sean Owen so...@cloudera.com wrote:

 No, there's no such thing as an RDD of RDDs in Spark.
 Here though, why not just operate on an RDD of Lists? or a List of RDDs?
 Usually one of these two is the right approach whenever you feel
 inclined to operate on an RDD of RDDs.


Depending on one's needs, one could also consider the matrix (RDD[Vector])
operations provided by MLlib, such as those described at
https://spark.apache.org/docs/latest/mllib-statistics.html
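
For example, the column statistics described on that page operate on a single
RDD of Vectors with no nesting (a minimal PySpark sketch; the numbers are
illustrative and sc is assumed):

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.stat import Statistics

    rows = sc.parallelize([Vectors.dense([1.0, 10.0]), Vectors.dense([2.0, 20.0])])
    summary = Statistics.colStats(rows)          # per-column summary over the whole RDD
    print(summary.mean(), summary.variance())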

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org