Re: Broadcast join data reuse

2020-06-15 Thread gypsysunny
The broadcast table doesn't seem to be reused across multiple actions.
e.g.
val small_df_bc = broadcast(small_df)
big_df1.join(small_df_bc, Seq("id")).write.parquet("/test1")
big_df2.join(small_df_bc, Seq("id")).write.parquet("/test2")

The Spark web UI shows that the small DataFrame has been distributed twice.

So how can we make it happen only once?

Thanks a million.






Re: Broadcast join data reuse

2020-06-11 Thread Ankur Srivastava
Hi Tyson,

The broadcast variable should remain in the executors' memory and be reused
unless you unpersist it, destroy it, or it goes out of scope.
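
As a rough sketch of that lifecycle, assuming an explicit broadcast variable
created with sparkContext.broadcast (a different mechanism from the
broadcast() join hint, and not necessarily how the planner treats a
BroadcastExchange), the same handle can serve several actions until it is
destroyed. The object name, the tiny in-memory tables, and the UDF-based
lookup below are all made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object BroadcastReuseSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would get its master from spark-submit.
    val spark = SparkSession.builder()
      .appName("broadcast-reuse-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the small table and the two big tables.
    val smallDf = Seq((1, "a"), (2, "b")).toDF("id", "label")
    val bigDf1  = Seq(1, 2, 3).toDF("id")
    val bigDf2  = Seq(2, 3, 4).toDF("id")

    // Collect the small table to the driver once and wrap it in a broadcast variable.
    val lookup   = smallDf.as[(Int, String)].collect().toMap
    val lookupBc = spark.sparkContext.broadcast(lookup)

    // The UDF closes over the broadcast handle, not over the map itself.
    val labelOf = udf((id: Int) => lookupBc.value.get(id))

    // Two separate actions read the same broadcast variable.
    // Dropping rows with a null label approximates an inner join on "id".
    val out1 = bigDf1.withColumn("label", labelOf(col("id"))).filter(col("label").isNotNull)
    val out2 = bigDf2.withColumn("label", labelOf(col("id"))).filter(col("label").isNotNull)
    out1.show()
    out2.show()

    // Release the broadcast data once no further job needs it.
    lookupBc.destroy()
    spark.stop()
  }
}
```

This only works when the collected map fits comfortably in driver and
executor memory, and the filter on non-null labels is a stand-in for an
inner join rather than a full replacement for one.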

Hope this helps.

Thanks
Ankur

On Wed, Jun 10, 2020 at 5:28 PM  wrote:

> We have a case where data that is small enough to be broadcast is joined
> with multiple tables in a single plan. Looking at the physical plan, I do
> not see anything that indicates whether the data is broadcast only once,
> i.e., whether the BroadcastExchange is reused and the data is not
> redistributed from scratch. Could someone with insight into the physical
> plan strategy for such a case confirm whether previously broadcast data is
> reused or whether subsequent BroadcastExchange steps are done from scratch?
>
>
>
> Thanks and best regards,
>
> Tyson
>