I was wondering if anyone could help with this question.

On Fri, 20 Sep 2019, 11:52 AM Dhrubajyoti Hati, <dhruba.w...@gmail.com> wrote:
> Hi,
>
> I have a question regarding passing a dictionary from the driver to the
> executors in Spark on YARN. The dictionary is needed in a UDF. I am using
> PySpark.
>
> As I understand it, this can be done in two ways:
>
> 1. Broadcast the variable and then use it in the UDFs.
>
> 2. Pass the dictionary into the UDF itself, something like this (renaming
>    the parameter so it does not shadow the built-in `dict`):
>
>     def udf1(col1, lookup):
>         ...
>
>     def udf1_fn(lookup):
>         return udf(lambda col_data: udf1(col_data, lookup))
>
>     df.withColumn("column_new", udf1_fn(lookup)("old_column"))
>
> I have tested both ways and both work.
>
> Now I am wondering what is fundamentally different between the two. I
> understand how broadcast works, but I am not sure how the data is passed
> across in the second way. Is the dictionary shipped to each executor every
> time a new task runs on that executor, or is it shipped only once? Also,
> how is the data passed to the Python processes? These are Python UDFs, so
> I think they are executed natively in Python (please correct me if I am
> wrong), so the data will be serialized and passed to Python.
>
> So in summary, my question is: which is the better/more efficient way to
> write this, and why?
>
> Thank you!
>
> Regards,
> Dhrub
>
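For what it's worth, the mechanics of the second approach can be sketched without Spark at all. PySpark serializes Python UDFs (via cloudpickle) and ships them with each task, so a dictionary captured in the function's closure travels inside every serialized task payload, whereas a broadcast variable is shipped to each executor once and referenced from there. The sketch below uses only the standard library (`pickle` plus `functools.partial`, which is picklable, unlike a lambda); the names `udf1` and `lookup` are just illustrative stand-ins, not Spark APIs:

```python
import pickle
from functools import partial

# Module-level function standing in for the UDF body.
def udf1(col_data, lookup):
    return lookup.get(col_data, "missing")

lookup = {"a": 1, "b": 2}

# Approach 2: bake the dict into the function object itself.
# When the "task" is serialized, the dict rides along in the payload.
task_fn = partial(udf1, lookup=lookup)
payload = pickle.dumps(task_fn)

# Deserializing on the "executor" side recovers both function and data.
restored = pickle.loads(payload)
print(restored("a"))  # -> 1

# The bare function pickles by reference only, so its payload is much
# smaller -- the difference is the captured dictionary.
bare = pickle.dumps(udf1)
print(len(payload) > len(bare))  # -> True
```

This is why, for a large dictionary, capturing it in the UDF closure re-serializes and re-ships it with every task, while a broadcast variable amortizes that cost to one transfer per executor.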