Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs

Ankur Dave Thu, 16 Apr 2015 21:14:05 -0700

I'm the primary author of IndexedRDD. To answer your questions:

1. Operations on an IndexedRDD partition can only be performed from a task
operating on that partition, since doing otherwise would require
decentralized coordination between workers, which is difficult in Spark. If
you want to perform cross-partition lookups, you'll have to do all the
lookups in a batch step as follows:


val a = IndexedRDD(...)
val b = sc.parallelize(...)
// Perform an operation on b that produces some keys to look up in a
val lookups: RDD[Long] = b.map(...)
// Repartition the desired keys to their appropriate partitions in a and do
// local lookups, returning the corresponding values
val results = a.innerJoin(lookups.map(k => (k, ()))) { (id, v, unit) => v }
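
The routing idea behind the batch step above (send each key to the partition that owns it, then look it up locally) can be sketched without Spark using plain Scala collections. This is only an illustration, not the IndexedRDD implementation: the names BatchLookupSketch, partitionOf, buildPartitions and batchLookup, and the modulo partitioner, are all hypothetical.

```scala
// Hypothetical stand-in for a partitioned, indexed dataset: each "partition"
// is a plain Map, and keys are assigned to partitions by a modulo function.
object BatchLookupSketch {
  val numPartitions = 4

  // Assumed partitioner: non-negative modulo of the key.
  def partitionOf(key: Long): Int =
    (((key % numPartitions) + numPartitions) % numPartitions).toInt

  // Build one local index (a Map) per partition from (key, value) pairs.
  def buildPartitions[V](data: Seq[(Long, V)]): Vector[Map[Long, V]] = {
    val grouped = data.groupBy { case (k, _) => partitionOf(k) }
    Vector.tabulate(numPartitions)(p => grouped.getOrElse(p, Seq.empty).toMap)
  }

  // Batch lookup: group the requested keys by owning partition, then do the
  // lookups against that partition's local index only. Keys that are not
  // present are dropped, mirroring the semantics of an inner join.
  def batchLookup[V](parts: Vector[Map[Long, V]], keys: Seq[Long]): Map[Long, V] = {
    val byPartition: Map[Int, Seq[Long]] = keys.distinct.groupBy(partitionOf)
    val found = for {
      (p, ks) <- byPartition.toSeq
      k <- ks
      v <- parts(p).get(k)
    } yield k -> v
    found.toMap
  }

  def main(args: Array[String]): Unit = {
    val parts = buildPartitions(Seq(1L -> "a", 2L -> "b", 5L -> "e", 8L -> "h"))
    // Key 7 is absent from the data, so it does not appear in the result.
    println(batchLookup(parts, Seq(2L, 5L, 7L)))
  }
}
```

In this sketch no partition ever reads another partition's index; all cross-partition work happens in the single grouping step, which plays the role of the shuffle in the Spark version.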

2. IndexedRDD originated from GraphX but can be used for general operations
as long as they fit within Spark's batch-oriented programming model.

By the way, a new version of IndexedRDD is about to be released. If you
decide to use IndexedRDD I'd suggest trying that out, since it provides a
cleaner interface, more predictable performance, and support for arbitrary
key types: https://github.com/amplab/spark-indexedrdd/pull/4

Ankur <http://www.ankurdave.com/>

On Thu, Apr 16, 2015 at 2:34 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:

> Thanks but we need a firm statement and preferably from somebody from the
> spark vendor Data Bricks including answer to the specific question posed by
> me and assessment/confirmation whether this is a production ready / quality
> library which can be used for general purpose RDDs not just inside the
> context of graphx
>
>
>
> *From:* Koert Kuipers [mailto:ko...@tresata.com]
> *Sent:* Thursday, April 16, 2015 10:31 PM
> *To:* Evo Eftimov
> *Cc:* user@spark.apache.org
> *Subject:* Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs
>
>
>
> i believe it is a generalization of some classes inside graphx, where
> there was/is a need to keep stuff indexed for random access within each rdd
> partition
>
>
>
> On Thu, Apr 16, 2015 at 5:00 PM, Evo Eftimov <evo.efti...@isecc.com>
> wrote:
>
> Can somebody from Data Bricks shed more light on this Indexed RDD library
>
> https://github.com/amplab/spark-indexedrdd
>
> It seems to come from AMP Labs and most of the Data Bricks guys are from
> there
>
> What is especially interesting is whether the Point Lookup (and the other
> primitives) can work from within a function (e.g. map) running on executors
> on worker nodes
>
>
>
