Thank you all. Just changing the RDD to a Map structure saved me approx. 1
second.
Yes, I will check out IndexedRDD to see if it has better performance.
best,
/Shahab
On Thu, Feb 19, 2015 at 6:38 PM, Burak Yavuz wrote:
If your dataset is large, there is a Spark Package called IndexedRDD
optimized for lookups. Feel free to check that out.
Burak
On Feb 19, 2015 7:37 AM, "Ilya Ganelin" wrote:
> Hi Shahab - if your data structures are small enough a broadcasted Map is
> going to provide faster lookup. Lookup withi
RDDs are not Maps. lookup() does a linear scan -- parallel by
partition, but still linear. Yes, it is not supposed to be an O(1) lookup
data structure. It'd be much nicer to broadcast the relatively small
data set as a Map and look it up fast, locally.
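
To illustrate the difference, here is a minimal sketch in plain Python. It does not use Spark at all: the partitions are modelled as ordinary lists of (key, value) pairs, and the helper names (scan_lookup, build_local_map) and the sample data are made up for this example. It only shows why a scan costs time proportional to the data size while a local Map lookup does not.

```python
from typing import Dict, List, Tuple

# A "partition" here is just a list of (key, value) pairs (a model, not Spark).
Partition = List[Tuple[int, str]]


def scan_lookup(partitions: List[Partition], key: int) -> List[str]:
    """Mimic the linear scan: test every pair in every partition."""
    return [v for part in partitions for (k, v) in part if k == key]


def build_local_map(partitions: List[Partition]) -> Dict[int, str]:
    """Gather all pairs into one local dict (later duplicates of a key win)."""
    return {k: v for part in partitions for (k, v) in part}


# Hypothetical sample data split across two partitions.
partitions = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]

local_map = build_local_map(partitions)

scanned = scan_lookup(partitions, 3)  # visits all pairs: linear in the data size
hashed = local_map.get(3)             # a single hash probe: O(1) on average
```

In Spark itself, the corresponding steps would presumably be rdd.collectAsMap() on the driver followed by sc.broadcast(...), with tasks reading the map through the broadcast's .value. That only makes sense when the data fits comfortably in memory on every executor, as it should for 3-30 MB.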
On Thu, Feb 19, 2015 at 3:29 PM, shahab wrote:
>
Hi Shahab - if your data structures are small enough a broadcasted Map is
going to provide faster lookup. Lookup within an RDD is an O(m) operation
where m is the size of the partition. For RDDs with multiple partitions,
executors can operate on it in parallel so you get some improvement for
larger
Hi,
I am doing lookups on cached RDDs of type RDD[(Int, String)], and I noticed
that the lookup is relatively slow: 30-100 ms. I even tried this on one
machine with a single partition, but it made no difference!
The RDDs are not large at all, 3-30 MB.
Is this expected behaviour? Should I use other data structures, li