Re: Broadcast join data reuse

2020-06-15 Thread gypsysunny
The broadcast table does not seem to be reused across multiple actions.
e.g.
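import org.apache.spark.sql.functions.broadcast

// Mark the small DataFrame for broadcast, then join it in two separate actions.
val small_df_bc = broadcast(small_df)
big_df1.join(small_df_bc, Seq("id")).write.parquet("/test1")
big_df2.join(small_df_bc, Seq("id")).write.parquet("/test2")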

We can tell from the Spark web UI that the small DataFrame has been distributed twice.

So how can we make it happen only once?

thanks a million.
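
One possible workaround, sketched here under assumptions rather than confirmed by anyone in this thread: the broadcast() function only marks the DataFrame with a hint, so, as far as I understand it, each action plans and runs its own broadcast. Collecting the small table once and shipping it as an explicit broadcast variable lets both writes share one copy. The object name ManualBroadcastLookup, the helper enrich, the column names id and value, the sample rows, and the /tmp output paths are all hypothetical.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object ManualBroadcastLookup {
  // Adds a "value" column by looking each id up in the broadcast map.
  // Rows with no match are dropped, which mimics an inner join on a unique key.
  def enrich(df: DataFrame, lookup: Broadcast[Map[Long, String]]): DataFrame = {
    // Returning an Option makes the UDF produce null for missing keys.
    val lookupValue = udf((id: Long) => lookup.value.get(id))
    df.withColumn("value", lookupValue(col("id"))).filter(col("value").isNotNull)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("manual-bc").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for small_df, big_df1 and big_df2.
    val smallDf = Seq((1L, "a"), (2L, "b")).toDF("id", "value")
    val bigDf1  = Seq(1L, 2L, 3L).toDF("id")
    val bigDf2  = Seq(2L, 3L, 4L).toDF("id")

    // Collect the small table once on the driver and broadcast it once.
    val lookup = spark.sparkContext.broadcast(smallDf.as[(Long, String)].collect().toMap)

    // Two separate actions share the same broadcast variable.
    enrich(bigDf1, lookup).write.mode("overwrite").parquet("/tmp/test1")
    enrich(bigDf2, lookup).write.mode("overwrite").parquet("/tmp/test2")

    spark.stop()
  }
}
```

This only fits when the small table comfortably fits in driver memory, and the null filter stands in for an inner join on a unique key; it does not reproduce other join types.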






Re: Broadcast join data reuse

2020-06-11 Thread Ankur Srivastava
Hi Tyson,

The broadcast variable should remain in memory on the executors and be reused
unless you unpersist it, destroy it, or it goes out of scope.
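
A minimal sketch of that lifecycle, assuming a local SparkSession and a made-up lookup map: the object name BroadcastLifecycle and the sample keys are hypothetical. It covers explicit broadcast variables created with SparkContext.broadcast, not the broadcast() join hint on a DataFrame.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastLifecycle {
  // Runs two separate actions against one broadcast variable and returns both results.
  def run(spark: SparkSession): (Seq[String], Seq[String]) = {
    val sc = spark.sparkContext

    // Hypothetical lookup table standing in for the small data set.
    val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b", 3 -> "c"))

    // Both actions read the same broadcast variable.
    val first  = sc.parallelize(Seq(1, 2)).map(k => lookup.value(k)).collect().toSeq
    val second = sc.parallelize(Seq(2, 3)).map(k => lookup.value(k)).collect().toSeq

    lookup.unpersist() // drop the executor copies; they are re-sent if the variable is used again
    lookup.destroy()   // release all state; the variable cannot be used after this call
    (first, second)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("bc-lifecycle").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```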

Hope this helps.

Thanks
Ankur

On Wed, Jun 10, 2020 at 5:28 PM  wrote:

> We have a case where data that is small enough to be broadcast is joined
> with multiple tables in a single plan. Looking at the physical plan, I do
> not see anything that indicates whether the broadcast is done only once,
> i.e., that the BroadcastExchange is reused and the data is not
> redistributed from scratch. Could someone with insight into the physical
> plan strategy for such a case confirm whether previously broadcast data is
> reused, or whether subsequent BroadcastExchange steps are done from scratch?
>
>
>
> Thanks and best regards,
>
> Tyson
>


Broadcast join data reuse

2020-06-10 Thread tcondie
We have a case where data that is small enough to be broadcast is joined
with multiple tables in a single plan. Looking at the physical plan, I do
not see anything that indicates whether the broadcast is done only once,
i.e., that the BroadcastExchange is reused and the data is not
redistributed from scratch. Could someone with insight into the physical
plan strategy for such a case confirm whether previously broadcast data is
reused, or whether subsequent BroadcastExchange steps are done from scratch?
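
One way to check this for a given query is to print the physical plan and look for a ReusedExchange node, which is how Spark marks an exchange that is shared rather than recomputed. The sketch below assumes a local SparkSession; the object name BroadcastReuseCheck and the sample data are hypothetical. Whether the node appears is not guaranteed: it depends on the Spark version, on spark.sql.exchange.reuse being enabled, and on adaptive query execution, which may only resolve reuse at run time.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastReuseCheck {
  // Builds one query that joins the same broadcast side twice and
  // returns the text of its physical plan.
  def planText(spark: SparkSession): String = {
    import spark.implicits._

    // Hypothetical data standing in for the small table and two larger ones.
    val small = Seq((1L, "a"), (2L, "b")).toDF("id", "value")
    val big1  = spark.range(0, 1000).toDF("id")
    val big2  = spark.range(500, 1500).toDF("id")

    val joined = big1.join(broadcast(small), Seq("id"))
      .union(big2.join(broadcast(small), Seq("id")))

    joined.queryExecution.executedPlan.toString
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("bc-reuse").getOrCreate()
    val plan = planText(spark)
    println(plan)
    // Only reports what the plan text contains; it does not prove reuse at run time.
    println(s"plan mentions ReusedExchange: ${plan.contains("ReusedExchange")}")
    spark.stop()
  }
}
```

The same text is available interactively through explain() on the joined DataFrame, and the SQL tab of the web UI shows the plan that actually ran.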

 

Thanks and best regards,

Tyson