Re: Why is RDD lookup slow?

2015-02-20 Thread shahab
Thank you all. Just changing the RDD to a Map structure saved me approx. 1 second. Yes, I will check out IndexedRDD to see if it has better performance. Best, /Shahab On Thu, Feb 19, 2015 at 6:38 PM, Burak Yavuz wrote: > If your dataset is large, there is a Spark Package called IndexedRDD > optimi

Re: Why is RDD lookup slow?

2015-02-19 Thread Burak Yavuz
If your dataset is large, there is a Spark Package called IndexedRDD optimized for lookups. Feel free to check that out. Burak On Feb 19, 2015 7:37 AM, "Ilya Ganelin" wrote: > Hi Shahab - if your data structures are small enough a broadcasted Map is > going to provide faster lookup. Lookup withi

Re: Why is RDD lookup slow?

2015-02-19 Thread Sean Owen
RDDs are not Maps. lookup() does a linear scan -- parallel by partition, but still linear. Yes, it is not supposed to be an O(1) lookup data structure. It'd be much nicer to broadcast the relatively small data set as a Map and look it up fast, locally. On Thu, Feb 19, 2015 at 3:29 PM, shahab wrote: >
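
The contrast described above can be sketched in plain Python. This is only a model of the two strategies, not Spark code: the names `scan_lookup`, `build_map`, and `partitions` are hypothetical, and the partitions are ordinary lists standing in for an RDD's partitions. It assumes the data set is small enough to fit in memory as a single map.

```python
# Plain-Python model of the two lookup strategies; no Spark involved.
# All names here are hypothetical and chosen only for illustration.
from typing import Dict, List, Tuple

Partition = List[Tuple[int, str]]


def scan_lookup(partitions: List[Partition], key: int) -> List[str]:
    """Linear scan: inspect every record in every partition for the key."""
    return [value for part in partitions for (k, value) in part if k == key]


def build_map(partitions: List[Partition]) -> Dict[int, str]:
    """Collect all pairs into one dict, as one would do before
    broadcasting a small data set. Duplicate keys keep the last value."""
    return {k: v for part in partitions for (k, v) in part}


partitions = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
local_map = build_map(partitions)

print(scan_lookup(partitions, 3))  # scans all four records
print(local_map.get(3))            # single hash lookup
```

The scan touches every record on each call, while the dict pays the construction cost once and then answers each lookup in roughly constant time. Note that the dict keeps only one value per key, whereas the scan returns every matching value.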

Re: Why is RDD lookup slow?

2015-02-19 Thread Ilya Ganelin
Hi Shahab - if your data structures are small enough a broadcasted Map is going to provide faster lookup. Lookup within an RDD is an O(m) operation where m is the size of the partition. For RDDs with multiple partitions, executors can operate on it in parallel so you get some improvement for larger

Why is RDD lookup slow?

2015-02-19 Thread shahab
Hi, I am doing lookups on cached RDDs [(Int,String)], and I noticed that the lookup is relatively slow (30-100 ms). I even tried this on one machine with a single partition, but it made no difference! The RDDs are not large at all, 3-30 MB. Is this expected behaviour? Should I use other data structures, li