Re: Broadcast join data reuse
The broadcast table doesn't seem to be reused across multiple actions. For example:

    val small_df_bc = broadcast(small_df)
    big_df1.join(small_df_bc, Seq("id")).write.parquet("/test1")
    big_df2.join(small_df_bc, Seq("id")).write.parquet("/test2")

In the Spark web UI we can see that the small DataFrame has been distributed twice. How can we make that happen only once? Thanks a million.
Re: Broadcast join data reuse
Hi Tyson,

The broadcast variable should remain in memory on the executors and be reused unless you unpersist it, destroy it, or it goes out of context.

Hope this helps.

Thanks,
Ankur

On Wed, Jun 10, 2020 at 5:28 PM wrote:
> We have a case where data that is small enough to be broadcast is joined
> with multiple tables in a single plan. Looking at the physical plan, I do
> not see anything that indicates whether the broadcast is done only once,
> i.e., that the BroadcastExchange is reused and the data is not
> redistributed from scratch. Could someone with insight into the physical
> plan strategy for such a case confirm whether previously broadcast data is
> reused, or whether subsequent BroadcastExchange steps are done from scratch?
>
> Thanks and best regards,
>
> Tyson
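
For reference, the lifecycle described in the reply above (reuse until unpersist or destroy) can be sketched for an explicit broadcast variable created with SparkContext.broadcast. This is only an illustration under stated assumptions: the lookup map, the RDD, and the object name are hypothetical, and an explicit broadcast variable is a different mechanism from the broadcast() join hint in org.apache.spark.sql.functions, so the thread does not confirm that the same behavior applies to broadcast joins.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastVariableLifecycle {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("broadcast-variable-lifecycle")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical small lookup table, shipped to executors as a broadcast variable.
    val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
    val ids = sc.parallelize(1 to 4)

    // Two separate actions read the same broadcast variable.
    val labels = ids.map(i => lookup.value.getOrElse(i, "?")).collect()
    val hits = ids.filter(i => lookup.value.contains(i)).count()
    println(s"labels=${labels.mkString(",")} hits=$hits")

    // unpersist() drops the cached copies on the executors; the value is
    // sent again if the variable is used afterwards.
    lookup.unpersist()

    // destroy() releases all data and metadata; the variable cannot be used
    // after this call.
    lookup.destroy()

    spark.stop()
  }
}
```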
Broadcast join data reuse
We have a case where data that is small enough to be broadcast is joined with multiple tables in a single plan. Looking at the physical plan, I do not see anything that indicates whether the broadcast is done only once, i.e., that the BroadcastExchange is reused and the data is not redistributed from scratch. Could someone with insight into the physical plan strategy for such a case confirm whether previously broadcast data is reused, or whether subsequent BroadcastExchange steps are done from scratch?

Thanks and best regards,
Tyson
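
One way to check this for a given query is to print the executed physical plan and look for a ReusedExchange node, which Spark uses to mark an exchange that reuses the result of an identical one in the same plan. The sketch below is a minimal illustration, not the poster's actual job: the tables, column names, and object name are hypothetical, and adaptive query execution is switched off here only so that the static plan can be inspected before the query runs. Whether ReusedExchange actually appears depends on the Spark version, the spark.sql.exchange.reuse setting, and whether the two broadcast subplans are identical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object ReusedExchangeCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("reused-exchange-check")
      .master("local[*]")
      // Assumption: disable AQE so the plan is fully built before execution.
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the small table and the two large tables.
    val small = Seq((1, "a"), (2, "b")).toDF("id", "label")
    val big1 = spark.range(0, 1000).selectExpr("cast(id as int) as id", "id * 2 as v1")
    val big2 = spark.range(0, 1000).selectExpr("cast(id as int) as id", "id * 3 as v2")

    // Both joins broadcast the same small table within a single plan.
    val joined1 = big1.join(broadcast(small), Seq("id")).select("id", "label")
    val joined2 = big2.join(broadcast(small), Seq("id")).select("id", "label")
    val combined = joined1.union(joined2)

    // Inspect the physical plan text for BroadcastExchange and ReusedExchange.
    val planText = combined.queryExecution.executedPlan.toString
    println(planText)
    println(s"ReusedExchange present: ${planText.contains("ReusedExchange")}")

    spark.stop()
  }
}
```

The same check can be done interactively with combined.explain(). Note that this only covers reuse inside one plan; separate actions, such as two independent write calls, are planned as separate queries.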