Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Koert Kuipers
I believe it is a generalization of some classes inside GraphX, where there
was/is a need to keep data indexed for random access within each RDD
partition.

On Thu, Apr 16, 2015 at 5:00 PM, Evo Eftimov evo.efti...@isecc.com wrote:

 Can somebody from Databricks shed more light on this Indexed RDD library?

 https://github.com/amplab/spark-indexedrdd

 It seems to come from AMPLab, and most of the Databricks guys are from
 there.

 What is especially interesting is whether the Point Lookup (and the other
 primitives) can work from within a function (e.g. map) running on executors
 on worker nodes.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/AMP-Lab-Indexed-RDD-question-for-Data-Bricks-AMP-Labs-tp22532.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Evo Eftimov
Can somebody from Databricks shed more light on this Indexed RDD library?

https://github.com/amplab/spark-indexedrdd

It seems to come from AMPLab, and most of the Databricks guys are from
there.

What is especially interesting is whether the Point Lookup (and the other
primitives) can work from within a function (e.g. map) running on executors
on worker nodes.






RE: AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Evo Eftimov
Thanks, but we need a firm statement, preferably from somebody at the Spark
vendor, Databricks. It should include an answer to the specific question I
posed and an assessment/confirmation of whether this is a production-ready
library that can be used for general-purpose RDDs, not just inside the
context of GraphX.

 


 



Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Ankur Dave
I'm the primary author of IndexedRDD. To answer your questions:

1. Operations on an IndexedRDD partition can only be performed from a task
operating on that partition, since doing otherwise would require
decentralized coordination between workers, which is difficult in Spark. If
you want to perform cross-partition lookups, you'll have to do all the
lookups in a batch step as follows:

val a = IndexedRDD(...)
val b = sc.parallelize(...)
// Perform an operation on b that produces some keys to look up in a
val lookups: RDD[Long] = b.map(...)
// Repartition the desired keys to their appropriate partitions in a and do
// local lookups, returning the corresponding values
val results = a.innerJoin(lookups.map(k => (k, ()))) { (id, v, unit) => v }

2. IndexedRDD originated from GraphX but can be used for general operations
as long as they fit within Spark's batch-oriented programming model.
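
To make the batch step in point 1 concrete, here is a minimal plain-Scala model of the idea: keys are first routed to the partition that owns them, and each lookup is then a local hash probe. It does not use Spark or the IndexedRDD API; `BatchLookupSketch`, `partitionOf`, `buildIndex`, and `batchLookup` are hypothetical names chosen for illustration, and the modulus-of-hash placement is an assumption, not necessarily how IndexedRDD places keys.

```scala
// Plain-Scala model of a batch lookup; no Spark required.
// All names are hypothetical and are not part of the IndexedRDD API.
object BatchLookupSketch {
  type Partition[V] = Map[Long, V]

  // Assumed placement rule: non-negative modulus of the key's hash code.
  def partitionOf(key: Long, numPartitions: Int): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Split the data into one hash map per partition (the "index").
  def buildIndex[V](data: Seq[(Long, V)], numPartitions: Int): Vector[Partition[V]] = {
    val grouped = data.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions)(p => grouped.getOrElse(p, Seq.empty).toMap)
  }

  // Batch step: group the requested keys by owning partition, then probe each
  // partition's map locally. Missing keys are dropped, as in an inner join.
  def batchLookup[V](index: Vector[Partition[V]], keys: Seq[Long]): Map[Long, V] = {
    val routed: Map[Int, Seq[Long]] =
      keys.distinct.groupBy(k => partitionOf(k, index.size))
    val hits = for {
      (p, ks) <- routed.toSeq
      k <- ks
      v <- index(p).get(k)
    } yield k -> v
    hits.toMap
  }

  def main(args: Array[String]): Unit = {
    val index = buildIndex(Seq(1L -> "a", 2L -> "b", 3L -> "c", 10L -> "j"), numPartitions = 3)
    println(batchLookup(index, Seq(2L, 10L, 99L)))
  }
}
```

The grouping step plays the role of the repartitioning done by the join: no partition ever needs to reach into another partition's data, which is why the pattern fits Spark's batch model.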

By the way, a new version of IndexedRDD is about to be released. If you
decide to use IndexedRDD I'd suggest trying that out, since it provides a
cleaner interface, more predictable performance, and support for arbitrary
key types: https://github.com/amplab/spark-indexedrdd/pull/4

Ankur http://www.ankurdave.com/






Indexed RDD

2014-09-16 Thread Akshat Aranya
Hi,

I'm trying to implement a custom RDD that essentially works as a
distributed hash table, i.e. the key space is split up into partitions and
within a partition, an element can be looked up efficiently by the key.
However, the RDD lookup() function (in PairRDDFunctions) is implemented in
a way that iterates through all elements of a partition to find the matching
ones.  Is there a better way to do what I want, short of just
implementing new methods on the custom RDD?

Thanks,
Akshat
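
One possible approach, sketched here with core Spark APIs only, is to hash-partition the pairs and collapse each partition into a single in-memory Map, so that a lookup becomes a hash probe on the one partition that owns the key. This is a sketch under assumptions rather than a recommended implementation: `PartitionIndexSketch`, `buildIndex`, and `lookup` are hypothetical names, the key and value types are fixed to `Long` and `String` for brevity, each partition's data is assumed to fit in memory as a Map, and the three-argument `SparkContext.runJob(rdd, func, partitions)` overload is assumed to be available in the Spark version in use.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper; not part of Spark or IndexedRDD.
object PartitionIndexSketch {
  // Hash-partition the pairs, then turn each partition into exactly one Map.
  // Duplicate keys keep only one value, because toMap overwrites earlier entries.
  def buildIndex(pairs: RDD[(Long, String)],
                 partitioner: HashPartitioner): RDD[Map[Long, String]] =
    pairs
      .partitionBy(partitioner)
      .mapPartitions(iter => Iterator(iter.toMap), preservesPartitioning = true)
      .cache()

  // Run a job on only the partition that owns the key and probe its Map.
  def lookup(index: RDD[Map[Long, String]],
             partitioner: HashPartitioner,
             key: Long): Option[String] = {
    val pid = partitioner.getPartition(key)
    val probe = (maps: Iterator[Map[Long, String]]) => maps.next().get(key)
    index.sparkContext.runJob(index, probe, Seq(pid)).head
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-index-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val partitioner = new HashPartitioner(4)
      val index = buildIndex(sc.parallelize(Seq(1L -> "a", 2L -> "b", 3L -> "c")), partitioner)
      println(lookup(index, partitioner, 2L))  // key present
      println(lookup(index, partitioner, 42L)) // key absent
    } finally {
      sc.stop()
    }
  }
}
```

Each lookup still launches a Spark job, so this only removes the linear scan inside the partition; it does not make single-key lookups low-latency.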