Working with many RDDs in parallel?
Hi All. I need to create a lot of RDDs starting from a set of roots and count the rows in each. Something like this:

    final JavaSparkContext sc = new JavaSparkContext(conf);
    List<String> roots = ...
    Map<String, Object> res = sc.parallelize(roots)
        .mapToPair(new PairFunction<String, String, Long>() {
            public Tuple2<String, Long> call(String root) throws Exception {
                ... create RDD based on root from sc somehow ...
                return new Tuple2<String, Long>(root, rdd.count());
            }
        })
        .countByKey();

This fails with a message about JavaSparkContext not being serializable. Is there a way to get at the context inside of the map function, or should I be doing something else entirely?

Thanks
David
Re: Working with many RDDs in parallel?
You won't be able to use RDDs inside of an RDD operation. I imagine your immediate problem is that the code you've elided references 'sc', which gets captured by the PairFunction and serialized, but it can't be.

If you want to play it this way, parallelize across the roots in Java instead. That is, just use an ExecutorService to launch a bunch of operations on RDDs in parallel. There's no reason you can't do that, although I suppose there are upper limits to what makes sense on your cluster; 1000 RDD count()s at once isn't a good idea, for example.

It may be the case that you don't really need a bunch of RDDs at all, but can operate on one RDD of pairs of Strings (roots) and something-elses, all at once.

On Mon, Aug 18, 2014 at 2:31 PM, David Tinker <david.tin...@gmail.com> wrote:
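The ExecutorService approach above can be sketched as follows. To keep the sketch self-contained, the hypothetical countForRoot() method stands in for "build an RDD from a root and call count() on it" — in real code each task would submit a Spark job from the driver, and this stand-in is not from the thread:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: drive many independent count jobs from the driver with an
// ExecutorService, rather than nesting RDD operations inside an RDD.
public class ParallelRootCounts {

    // Hypothetical placeholder for "create an RDD based on root and count it".
    static long countForRoot(String root) {
        return root.length(); // stand-in for rdd.count()
    }

    static Map<String, Long> countAll(List<String> roots, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per root; each task runs on the driver and
            // may itself trigger a Spark job.
            Map<String, Future<Long>> futures = new LinkedHashMap<>();
            for (final String root : roots) {
                futures.put(root, pool.submit(new Callable<Long>() {
                    public Long call() {
                        return countForRoot(root);
                    }
                }));
            }
            // Collect the results as they complete.
            Map<String, Long> counts = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Long>> e : futures.entrySet()) {
                counts.put(e.getKey(), e.getValue().get());
            }
            return counts;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countAll(Arrays.asList("a", "bb", "ccc"), 2));
        // → {a=1, bb=2, ccc=3}
    }
}
```

Bounding the thread pool size is also how you avoid the "1000 count()s at once" problem: only `threads` jobs are in flight at a time.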
Re: Working with many RDDs in parallel?
Hmm, I thought as much. I am using Cassandra with the Spark connector. What I really need is an RDD created from a query against Cassandra of the form "where partition_key = :id", where :id is taken from a list. Some grouping of the ids would be a way to partition this.

On Mon, Aug 18, 2014 at 3:42 PM, Sean Owen <so...@cloudera.com> wrote:

--
http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration
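One way to read the "grouping of the ids" idea is to batch the id list so that each batch backs a single unit of work (for example one query, or one element of a parallelized RDD of batches) instead of one RDD per id. A minimal sketch of just the batching step, in plain Java, with the actual Cassandra query left out; the class and method names are illustrative, not from the thread:

```java
import java.util.*;

// Splits a list of partition-key ids into fixed-size batches.
public class IdBatcher {
    static <T> List<List<T>> batches(List<T> ids, int size) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += size) {
            // Copy the sublist so each batch is independent of the source list.
            out.add(new ArrayList<>(ids.subList(i, Math.min(i + size, ids.size()))));
        }
        return out;
    }

    public static void main(String[] args) {
        // Five ids in batches of two → [[id1, id2], [id3, id4], [id5]]
        System.out.println(batches(Arrays.asList("id1", "id2", "id3", "id4", "id5"), 2));
    }
}
```

Each batch could then be handed to a worker, following the ExecutorService pattern from the earlier reply, so the number of concurrent queries is the number of batches rather than the number of ids.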