Hi Imran,

If I understood you correctly, you are suggesting that I simply call broadcast again from the driver program. This is exactly what I am hoping will work, as I have the broadcast data wrapped up and I am indeed (re)broadcasting the wrapper whenever the underlying data changes. However, the documentation seems to suggest that one cannot re-broadcast. Is my understanding accurate?
Thanks
NB

On Mon, May 18, 2015 at 6:24 PM, Imran Rashid <iras...@cloudera.com> wrote:

> Rather than "updating" the broadcast variable, can't you simply create a
> new one? When the old one can be gc'ed in your program, it will also get
> gc'ed from Spark's cache (and all executors).
>
> I think this will make your code *slightly* more complicated, as you need
> to add in another layer of indirection for which broadcast variable to use,
> but not too bad. E.g., from
>
>     var myBroadcast = sc.broadcast(...)
>     (0 to 20).foreach { iteration =>
>       // ... some RDD operations that involve myBroadcast ...
>       myBroadcast.update(...) // wrong! don't update a broadcast variable
>     }
>
> instead do something like:
>
>     def oneIteration(myRDD: RDD[...], myBroadcastVar: Broadcast[...]): Unit = {
>       ...
>     }
>
>     var myBroadcast = sc.broadcast(...)
>     (0 to 20).foreach { iteration =>
>       oneIteration(myRDD, myBroadcast)
>       // create a NEW broadcast here, with whatever you need to update it
>       // (note: reassign the outer var; redeclaring it with `var` inside
>       // the loop would only shadow it)
>       myBroadcast = sc.broadcast(...)
>     }
>
> On Sat, May 16, 2015 at 2:01 AM, N B <nb.nos...@gmail.com> wrote:
>
>> Thanks Ayan. Can we rebroadcast after updating in the driver?
>>
>> Thanks
>> NB.
>>
>> On Fri, May 15, 2015 at 6:40 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Broadcast variables are shipped to the executors used by a
>>> transformation the first time they are accessed in that transformation.
>>> They will NOT be updated subsequently, even if the value has changed.
>>> However, the new value will be shipped to any new executor that comes
>>> into play after the value has changed. For this reason, changing the
>>> value of a broadcast variable is not a good idea, as it can create
>>> inconsistency within the cluster. From the documentation:
>>>
>>> In addition, the object v should not be modified after it is broadcast
>>> in order to ensure that all nodes get the same value of the broadcast
>>> variable.
>>>
>>> On Sat, May 16, 2015 at 10:39 AM, N B <nb.nos...@gmail.com> wrote:
>>>
>>>> Thanks Ilya.
>>>> Does one have to call broadcast again once the underlying
>>>> data is updated in order to make the changes visible on all nodes?
>>>>
>>>> Thanks
>>>> NB
>>>>
>>>> On Fri, May 15, 2015 at 5:29 PM, Ilya Ganelin <ilgan...@gmail.com> wrote:
>>>>
>>>>> The broadcast variable is like a pointer. If the underlying data
>>>>> changes then the changes will be visible throughout the cluster.
>>>>>
>>>>> On Fri, May 15, 2015 at 5:18 PM NB <nb.nos...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Once a broadcast variable is created using sparkContext.broadcast(),
>>>>>> can it ever be updated again? The use case is for something like the
>>>>>> underlying lookup data changing over time.
>>>>>>
>>>>>> Thanks
>>>>>> NB
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Broadcast-variables-can-be-rebroadcast-tp22908.html
>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
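[Editor's note: the "create a new broadcast instead of mutating the old one" pattern discussed in this thread can be sketched without a cluster. The snippet below is a minimal Spark-free mock: `MockBroadcast`, `broadcast`, and `oneIteration` are hypothetical stand-ins for illustration, not the real Spark API. The key point it demonstrates is the layer of indirection Imran describes: each iteration only sees whichever immutable broadcast the driver-side `var` currently points to, and "updating" means creating a new broadcast and letting the old one be dropped.]

```scala
// Driver-side sketch of the "new broadcast per update" pattern.
// MockBroadcast is an illustrative stand-in, NOT Spark's Broadcast class.
final case class MockBroadcast[T](value: T) {
  // In real Spark, Broadcast.unpersist() would drop the executor-side copies.
  def unpersist(): Unit = ()
}

object RebroadcastSketch {
  // Stand-in for sc.broadcast(...): wraps an immutable value.
  def broadcast[T](v: T): MockBroadcast[T] = MockBroadcast(v)

  // One "iteration" only sees the broadcast it was handed, never a mutation.
  def oneIteration(data: Seq[Int], bc: MockBroadcast[Set[Int]]): Seq[Int] =
    data.filter(bc.value.contains)

  def run(): Seq[Int] = {
    val data = Seq(1, 2, 3, 4, 5)
    var lookup = broadcast(Set(1, 2))        // initial broadcast
    val first = oneIteration(data, lookup)   // sees {1, 2} -> Seq(1, 2)
    lookup.unpersist()                       // release the stale value
    lookup = broadcast(Set(4, 5))            // NEW broadcast with updated data
    val second = oneIteration(data, lookup)  // sees {4, 5} -> Seq(4, 5)
    first ++ second
  }
}
```

Note that `lookup` is reassigned, not redeclared: a second `var lookup = ...` inside a loop body would shadow the outer variable and the update would never be observed, which is the subtle bug the pattern has to avoid.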