I'm the primary author of IndexedRDD. To answer your questions:

1. Operations on an IndexedRDD partition can only be performed from a task operating on that partition, since doing otherwise would require decentralized coordination between workers, which is difficult in Spark. If you want to perform cross-partition lookups, you'll have to do all the lookups in a batch step as follows:

    val a = IndexedRDD(...)
    val b = sc.parallelize(...)

    // Perform an operation on b that produces some keys to look up in a
    val lookups: RDD[Long] = b.map(...)

    // Repartition the desired keys to their appropriate partitions in a and
    // do local lookups, returning the corresponding values
    val results = a.innerJoin(lookups.map(k => (k, ()))) { (id, v, unit) => v }

2. IndexedRDD originated from GraphX but can be used for general operations as long as they fit within Spark's batch-oriented programming model.

By the way, a new version of IndexedRDD is about to be released. If you decide to use IndexedRDD, I'd suggest trying that out, since it provides a cleaner interface, more predictable performance, and support for arbitrary key types: https://github.com/amplab/spark-indexedrdd/pull/4

Ankur <http://www.ankurdave.com/>

On Thu, Apr 16, 2015 at 2:34 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:

> Thanks but we need a firm statement and preferably from somebody from the
> spark vendor Data Bricks including answer to the specific question posed by
> me and assessment/confirmation whether this is a production ready / quality
> library which can be used for general purpose RDDs not just inside the
> context of graphx
>
> *From:* Koert Kuipers [mailto:ko...@tresata.com]
> *Sent:* Thursday, April 16, 2015 10:31 PM
> *To:* Evo Eftimov
> *Cc:* user@spark.apache.org
> *Subject:* Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs
>
> i believe it is a generalization of some classes inside graphx, where
> there was/is a need to keep stuff indexed for random access within each rdd
> partition
>
> On Thu, Apr 16, 2015 at 5:00 PM, Evo Eftimov <evo.efti...@isecc.com>
> wrote:
>
> Can somebody from Data Bricks shed more light on this Indexed RDD library
>
> https://github.com/amplab/spark-indexedrdd
>
> It seems to come from AMP Labs and most of the Data Bricks guys are from
> there
>
> What is especially interesting is whether the Point Lookup (and the other
> primitives) can
> work from within a function (e.g. map) running on executors
> on worker nodes