Working with many RDDs in parallel?

2014-08-18 Thread David Tinker
Hi All.

I need to create a lot of RDDs starting from a set of roots and count the
rows in each. Something like this:

final JavaSparkContext sc = new JavaSparkContext(conf);
List<String> roots = ...
Map<String, Object> res = sc.parallelize(roots).mapToPair(
        new PairFunction<String, String, Long>() {
            public Tuple2<String, Long> call(String root) throws Exception {
                // ... create RDD based on root from sc somehow ...
                return new Tuple2<String, Long>(root, rdd.count());
            }
        }).countByKey();

This fails with a message about JavaSparkContext not being serializable.

Is there a way to get at the context inside of the map function, or should I
be doing something else entirely?

Thanks
David


Re: Working with many RDDs in parallel?

2014-08-18 Thread Sean Owen
You won't be able to use RDDs inside of an RDD operation. I imagine your
immediate problem is that the code you've elided references 'sc', so 'sc'
gets captured by the PairFunction and serialized with it, but a
SparkContext can't be serialized.

If you want to play it this way, parallelize across the roots in Java
itself. That is, just use an ExecutorService to launch a bunch of
operations on RDDs in parallel from the driver. There's no reason you
can't do that, although I suppose there are upper limits to what makes
sense on your cluster; 1000 RDD count()s at once isn't a good idea, for
example.
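Roughly, something like this (an untested sketch: loadRdd stands in for
however you actually build each per-root RDD, and the pool size of 8 is
arbitrary):

import java.util.*;
import java.util.concurrent.*;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelCounts {

    // Placeholder: build the per-root RDD however you normally would.
    static JavaRDD<String> loadRdd(JavaSparkContext sc, String root) {
        throw new UnsupportedOperationException("placeholder");
    }

    public static Map<String, Long> countAll(final JavaSparkContext sc,
            List<String> roots) throws Exception {
        // A small fixed pool bounds how many Spark jobs run at the same time.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        Map<String, Future<Long>> futures = new LinkedHashMap<String, Future<Long>>();
        for (final String root : roots) {
            futures.put(root, pool.submit(new Callable<Long>() {
                public Long call() {
                    // Each count() is a separate job submitted from a driver thread.
                    return loadRdd(sc, root).count();
                }
            }));
        }
        Map<String, Long> counts = new LinkedHashMap<String, Long>();
        for (Map.Entry<String, Future<Long>> e : futures.entrySet()) {
            counts.put(e.getKey(), e.getValue().get());
        }
        pool.shutdown();
        return counts;
    }
}

The SparkContext itself is thread-safe for job submission, so calling it
from several driver threads like this is fine; it's only closures shipped
to the executors that must not reference it.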

It may be the case that you don't really need a bunch of RDDs at all,
but can operate on an RDD of pairs of Strings (roots) and
something-elses, all at once.
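For instance, if each root's rows can be tagged with their root and loaded
into one big pair RDD, a single countByKey() gives every count in one job.
A sketch (loadAllRows is a hypothetical method that produces (root, row)
pairs for all roots at once):

import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CountByRoot {

    // Hypothetical: load every row of every root as (root, row) pairs in one RDD.
    static JavaPairRDD<String, String> loadAllRows(JavaSparkContext sc) {
        throw new UnsupportedOperationException("placeholder");
    }

    public static Map<String, ?> countPerRoot(JavaSparkContext sc) {
        // One job, one pass: counts for all roots at once. (The value type
        // returned by countByKey() has differed across Spark versions, hence
        // the wildcard.)
        return loadAllRows(sc).countByKey();
    }
}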





Re: Working with many RDDs in parallel?

2014-08-18 Thread David Tinker
Hmm, I thought as much. I am using Cassandra with the Spark connector. What
I really need is an RDD created from a query against Cassandra of the form
"where partition_key = :id", where :id is taken from a list. Some grouping
of the ids would be one way to partition this.
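For what it's worth, here is a rough sketch of that per-id query using the
connector's Java API. The keyspace, table and column names are made up, the
CassandraJavaUtil package has moved between connector versions, and whether
where() accepts a partition-key predicate (rather than only clustering or
indexed columns) depends on the connector version, so treat this as an
assumption to verify rather than a recipe:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import org.apache.spark.api.java.JavaSparkContext;

public class PerIdCount {

    // One RDD per id; if the connector accepts the predicate, the where
    // clause is pushed down to Cassandra so only that partition is read.
    static long countFor(JavaSparkContext sc, String id) {
        return javaFunctions(sc)
                .cassandraTable("my_keyspace", "my_table")   // assumed names
                .where("partition_key = ?", id)
                .count();
    }
}

Each countFor() call is its own Spark job, so the calls could be fanned out
across a thread pool as Sean suggested, or the ids could be grouped so that
fewer, larger queries are issued.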





