A while ago we changed it so the task gets broadcasted too, so I think the two are fairly similar.
On Mon, Sep 23, 2019 at 8:17 PM, Dhrubajyoti Hati < dhruba.w...@gmail.com > wrote: > > I was wondering if anyone could help with this question. > > On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti Hati, < dhruba. work@ gmail. com > ( dhruba.w...@gmail.com ) > wrote: > > >> Hi, >> >> >> I have a question regarding passing a dictionary from driver to executors >> in spark on yarn. This dictionary is needed in an udf. I am using pyspark. >> >> >> As I understand this can be passed in two ways: >> >> >> 1. Broadcast the variable and then use it in the udfs >> >> >> 2. Pass the dictionary in the udf itself, in something like this: >> >> >> def udf1(col1, dict): >> .. >> def udf 1 _ fn (dict): >> return udf(lambda col_ data : udf1( col_data, dict )) >> >> >> df.withColumn("column_new", udf 1 _ fn (dict)("old_column")) >> >> >> Well I have tested with both the ways and it works both ways. >> >> >> Now I am wondering what is fundamentally different between the two. I >> understand how broadcast work but I am not sure how the data is passed >> across in the 2nd way. Is the dictionary passed to each executor every >> time when new task is running on that executor or they are passed only >> once. Also how the data is passed to the python processes. They are python >> udfs so I think they are executed natively in python.(Plz correct me if I >> am wrong). So the data will be serialised and passed to python. >> >> So in summary my question is which will be better/efficient way to write >> the whole thing and why? >> >> >> Thank you! >> >> >> R egards, >> Dhrub >> > >