Hi. You can use a broadcast variable to make data available on all the nodes in your cluster, and that data can live longer than just the current distributed task.
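To make that concrete, here is a minimal sketch in Scala (the lookup map, app name, and data are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastSketch")   // hypothetical app name
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A lookup structure we want available on every executor.
    // In practice this would be large; here it is a toy map.
    val countryNames = Map("is" -> "Iceland", "de" -> "Germany")
    val bc = sc.broadcast(countryNames)

    val codes = sc.parallelize(Seq("is", "de", "is"))
    // Read the broadcast data inside the closure via .value.
    // The map is shipped to each executor once, not with every task.
    val names = codes.map(code => bc.value.getOrElse(code, "unknown"))
    println(names.collect().mkString(", "))

    spark.stop()
  }
}
```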
For example, if you need to access a large structure in multiple sub-tasks, instead of shipping that structure again and again with each sub-task, you can broadcast it once and access the data inside the operation (map, flatMap, etc.) through the broadcast variable's .value method. See: https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables

Note, however, that you should treat a broadcast variable as a read-only structure, as it is not synced between workers after it is broadcast. Also, to be broadcast, your data must be serializable.

If the data you are trying to broadcast is a distributed RDD (and thus presumably large), perhaps what you need instead is some form of join operation (or cogroup)?

Regards,
Gylfi.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Passing-Broadcast-variable-as-parameter-tp23760p23898.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.