I am trying to do a broadcast join on two tables. The size of the
smaller table will vary based upon the parameters but the size of the
larger table is close to 2 TB. What I have noticed is that if I don't
set spark.sql.autoBroadcastJoinThreshold to 10G, some of these
operations do a SortMergeJoin instead of a broadcast join. But the
smaller table shouldn't be anywhere near that big. I wrote the
smaller table to an S3 folder and it took only 12.6 MB of space. I
also ran some operations on the smaller table so that its shuffle size
appears on the Spark History Server, and the size in memory seemed to
be about 150 MB, nowhere near 10G. Also, if I force a broadcast join
on the smaller table, it takes a long time to broadcast, which leads
me to think the table might be larger than 150 MB. What would be a
good way to figure out the actual size that Spark sees when deciding
whether it crosses spark.sql.autoBroadcastJoinThreshold?
