There are two distinct parts here: optimisation and execution. Spark does
not have a Cost Based Optimizer (CBO) yet, but that does not matter for
now.
When we do such an operation, say a full outer join between the (s) and
(t) DataFrames below, we see:

scala> val rs = s.join(t, s("time_id") === t("time_id"), "fullouter")
I do not think so. From what I understand, Spark will still use Catalyst
to plan the join. A DataFrame always has an RDD underneath, but that does
not mean calling an action through the RDD will force a less optimal path.
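A minimal sketch of how you can check this yourself (assuming a running
SparkSession and two DataFrames s and t that each have a time_id column;
the names here are just the ones from the example above):

```scala
// Catalyst plans the join when the DataFrame is defined, not when the
// action runs, so the plan is the same whether we count the DataFrame
// or the RDD derived from it.
val rs = s.join(t, s("time_id") === t("time_id"), "fullouter")

// Print the parsed, analyzed, optimized, and physical plans that
// Catalyst produced for this join:
rs.explain(true)

// Converting to an RDD does not discard that plan. The RDD's lineage
// is built on top of the physical plan printed above, so this count
// still executes the Catalyst-optimized join:
val rsRdd = rs.rdd
println(rsRdd.count())
```

What you do lose by switching to the RDD API is any further optimisation
of the operations you apply *after* the conversion; the work up to and
including the join itself still goes through Catalyst.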
On Sun, Aug 14, 2016 at 3:04 PM, mayur bhole wrote:
> HI All,
>
> Lets say, we have
>
> val df = bigTableA.join(bigTableB,b
Hi All,

Let's say we have:

val df = bigTableA.join(bigTableB, bigTableA("A") === bigTableB("A"), "left")
val rddFromDF = df.rdd
println(rddFromDF.count)
My understanding is that Spark will convert all DataFrame operations
before "rddFromDF.count" into the equivalent RDD operations, as we are
not performing