Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs
I believe it is a generalization of some classes inside GraphX, where there was (and is) a need to keep data indexed for random access within each RDD partition.

On Thu, Apr 16, 2015 at 5:00 PM, Evo Eftimov evo.efti...@isecc.com wrote:

> Can somebody from Databricks shed more light on this Indexed RDD library?
>
> https://github.com/amplab/spark-indexedrdd
>
> It seems to come from the AMPLab, and most of the Databricks guys are from there.
>
> What is especially interesting is whether Point Lookup (and the other primitives) can work from within a function (e.g. map) running on executors on worker nodes.
AMP Lab Indexed RDD - question for Data Bricks AMP Labs
Can somebody from Databricks shed more light on this Indexed RDD library?

https://github.com/amplab/spark-indexedrdd

It seems to come from the AMPLab, and most of the Databricks guys are from there.

What is especially interesting is whether Point Lookup (and the other primitives) can work from within a function (e.g. map) running on executors on worker nodes.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/AMP-Lab-Indexed-RDD-question-for-Data-Bricks-AMP-Labs-tp22532.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
RE: AMP Lab Indexed RDD - question for Data Bricks AMP Labs
Thanks, but we need a firm statement, preferably from somebody at the Spark vendor Databricks, that answers the specific question I posed and assesses/confirms whether this is a production-ready, production-quality library that can be used for general-purpose RDDs, not just in the context of GraphX.

From: Koert Kuipers [mailto:ko...@tresata.com]
Sent: Thursday, April 16, 2015 10:31 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs

I believe it is a generalization of some classes inside GraphX, where there was (and is) a need to keep data indexed for random access within each RDD partition.
Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs
I'm the primary author of IndexedRDD. To answer your questions:

1. Operations on an IndexedRDD partition can only be performed from a task operating on that partition, since doing otherwise would require decentralized coordination between workers, which is difficult in Spark. If you want to perform cross-partition lookups, you'll have to do all the lookups in a batch step as follows:

    val a = IndexedRDD(...)
    val b = sc.parallelize(...)

    // Perform an operation on b that produces some keys to look up in a
    val lookups: RDD[Long] = b.map(...)

    // Repartition the desired keys to their appropriate partitions in a
    // and do local lookups, returning the corresponding values
    val results = a.innerJoin(lookups.map(k => (k, ()))) { (id, v, unit) => v }

2. IndexedRDD originated from GraphX but can be used for general operations as long as they fit within Spark's batch-oriented programming model.

By the way, a new version of IndexedRDD is about to be released. If you decide to use IndexedRDD, I'd suggest trying that out, since it provides a cleaner interface, more predictable performance, and support for arbitrary key types:

https://github.com/amplab/spark-indexedrdd/pull/4

Ankur
http://www.ankurdave.com/

On Thu, Apr 16, 2015 at 2:34 PM, Evo Eftimov evo.efti...@isecc.com wrote:

> Thanks, but we need a firm statement, preferably from somebody at the Spark vendor Databricks, that answers the specific question I posed and assesses/confirms whether this is a production-ready, production-quality library that can be used for general-purpose RDDs, not just in the context of GraphX.
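
The batch step described in point 1 of the reply above (route the requested keys to the partitions that own them, then do only local lookups) can be modeled without Spark. The following is a minimal sketch in plain Scala, not the IndexedRDD implementation: `BatchLookupSketch`, `partitionOf`, `buildIndex`, and `batchLookup` are hypothetical names, and a `Vector` of in-memory hash maps stands in for the partitions of an indexed dataset.

```scala
object BatchLookupSketch {
  // Hypothetical stand-in for one partition of an indexed dataset.
  type Partition[V] = Map[Long, V]

  // Non-negative modulo decides which partition owns a key.
  def partitionOf(key: Long, numPartitions: Int): Int =
    java.lang.Math.floorMod(key, numPartitions.toLong).toInt

  // Build one hash map per partition from a flat list of entries.
  def buildIndex[V](entries: Seq[(Long, V)], numPartitions: Int): Vector[Partition[V]] = {
    val grouped = entries.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty[(Long, V)]).toMap)
  }

  // Batch step: group the requested keys by owning partition (the "repartition"),
  // then answer each group with hash lookups local to that partition.
  // Keys without a match are dropped, as in an inner join.
  def batchLookup[V](index: Vector[Partition[V]], keys: Seq[Long]): Map[Long, V] = {
    val keysByPartition = keys.distinct.groupBy(k => partitionOf(k, index.length))
    keysByPartition.toSeq.flatMap { case (pid, ks) =>
      val local = index(pid)
      ks.flatMap(k => local.get(k).map(v => k -> v))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val index = buildIndex(
      Seq(1L -> "one", 2L -> "two", 5L -> "five", 8L -> "eight"),
      numPartitions = 3)
    // Keys derived from another dataset, mirroring `b.map(...)` in the reply.
    val derivedKeys = Seq(10L, 4L, 16L, 198L).map(_ / 2)
    println(batchLookup(index, derivedKeys))
  }
}
```

In this model no partition ever reads another partition's map, which is the constraint the reply describes; the price is that lookups have to be collected up front rather than issued one at a time from inside a running task.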
Indexed RDD
Hi,

I'm trying to implement a custom RDD that essentially works as a distributed hash table, i.e. the key space is split up into partitions, and within a partition an element can be looked up efficiently by its key. However, the RDD lookup() function (in PairRDDFunctions) is implemented in a way that iterates through all elements of a partition to find the matching ones.

Is there a better way to do what I want to do, short of just implementing new methods on the custom RDD?

Thanks,
Akshat
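
To make the structure in the question concrete, here is a minimal sketch in plain Scala (no Spark) of a hash-partitioned key-value index: the key's hash selects a partition, and a hash map inside that partition answers the lookup without scanning its elements. `PartitionedIndex`, `partitionOf`, and `lookup` are hypothetical names chosen for illustration; this is not Spark's or IndexedRDD's implementation.

```scala
// A Spark-free model of a distributed hash table's layout.
final class PartitionedIndex[K, V](numPartitions: Int, entries: Seq[(K, V)]) {
  require(numPartitions > 0, "numPartitions must be positive")

  // Non-negative modulo of the key's hash code picks the owning partition.
  def partitionOf(key: K): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // One immutable hash map per partition, built once up front.
  private val partitions: Vector[Map[K, V]] = {
    val grouped = entries.groupBy { case (k, _) => partitionOf(k) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty[(K, V)]).toMap)
  }

  // Point lookup: route to one partition, then do a hash lookup inside it
  // instead of iterating over every element of the partition.
  def lookup(key: K): Option[V] = partitions(partitionOf(key)).get(key)

  // Number of entries held by each partition.
  def partitionSizes: Vector[Int] = partitions.map(_.size)
}

object PartitionedIndexDemo {
  def main(args: Array[String]): Unit = {
    val index = new PartitionedIndex[Long, String](4, Seq(1L -> "a", 2L -> "b", 7L -> "g"))
    println(index.lookup(7L))
    println(index.lookup(42L))
  }
}
```

The design choice is the per-partition map: routing by hash narrows the search to one partition, and the map removes the linear scan within it. Keeping such a map alive across Spark stages is the part that needs a custom RDD or a library, which the sketch does not attempt.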