Alternative to Large Broadcast Variables

Hemminger Jeff Fri, 28 Aug 2015 03:40:00 -0700

Hi,

I am working on a Spark application that is using of a large (~3G)
broadcast variable as a lookup table. The application refines the data in
this lookup table in an iterative manner. So this large variable is
broadcast many times during the lifetime of the application process.


>From what I have observed perhaps 60% of the execution time is spent
waiting for the variable to broadcast in each iteration. My reading of a
Spark performance article[1] suggests that the time spent broadcasting will
increase with the number of nodes I add.

My question for the group - what would you suggest as an alternative to
broadcasting a large variable like this?

One approach I have considered is segmenting my RDD and adding a copy of
the lookup table for each X number of values to process. So, for example,
if I have a list of 1 million entries to process (eg, RDD[Entry]), I could
split this into segments of 100K entries, with a copy of the lookup table,
and make that an RDD[(Lookup, Array[Entry]).

Another solution I am looking at it is making the lookup table an RDD
instead of a broadcast variable. Perhaps I could use an IndexedRDD[2] to
improve performance. One issue with this approach is that I would have to
rewrite my application code to use two RDDs so that I do not reference the
lookup RDD in the from within the closure of another RDD.

Any other recommendations?

Jeff


[1]
http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/mosharaf-spark-bc-report-spring10.pdf

[2]https://github.com/amplab/spark-indexedrdd

Alternative to Large Broadcast Variables

Reply via email to