Hi, I am working on a Spark application that is using of a large (~3G) broadcast variable as a lookup table. The application refines the data in this lookup table in an iterative manner. So this large variable is broadcast many times during the lifetime of the application process.
>From what I have observed perhaps 60% of the execution time is spent waiting for the variable to broadcast in each iteration. My reading of a Spark performance article[1] suggests that the time spent broadcasting will increase with the number of nodes I add. My question for the group - what would you suggest as an alternative to broadcasting a large variable like this? One approach I have considered is segmenting my RDD and adding a copy of the lookup table for each X number of values to process. So, for example, if I have a list of 1 million entries to process (eg, RDD[Entry]), I could split this into segments of 100K entries, with a copy of the lookup table, and make that an RDD[(Lookup, Array[Entry]). Another solution I am looking at it is making the lookup table an RDD instead of a broadcast variable. Perhaps I could use an IndexedRDD[2] to improve performance. One issue with this approach is that I would have to rewrite my application code to use two RDDs so that I do not reference the lookup RDD in the from within the closure of another RDD. Any other recommendations? Jeff [1] http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/mosharaf-spark-bc-report-spring10.pdf [2]https://github.com/amplab/spark-indexedrdd