Hi,

I thought I understood RDDs and DataFrames, but one noob thing is bugging
me (because I'm seeing weird errors involving joins):

*What does Spark do when you pass a big DataFrame as an argument to a
function?*

Are these DataFrames included in the function's closure, so that each big
argument DataFrame gets shipped to every node instead of respecting the
locality of the DataFrame's distributed data?

Or, are only the *references* to these distributed objects included in the
closure, and not the objects themselves? I was assuming the latter, but I'm
not sure anymore.

For example, I wrote a function called myLeftJoin(df1, df2, Seq(columns))
that joins df1 and df2 on multiple columns present in both; I don't want the
default inner join, I want a left join, and I don't want the join columns
repeated in the output. Inside the function I build an SQL statement and
execute it. Will this construct be inefficient? Could it exhaust memory
somehow because df1 and df2 are passed into the function?
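
In case it helps, here is a simplified sketch of the function (assuming
Spark 2.x and the spark-shell's built-in SparkSession `spark`; the temp view
names are just placeholders, the real version is a bit longer):

import org.apache.spark.sql.DataFrame

def myLeftJoin(df1: DataFrame, df2: DataFrame, cols: Seq[String]): DataFrame = {
  df1.createOrReplaceTempView("left_t")
  df2.createOrReplaceTempView("right_t")

  // equality condition over all the shared key columns
  val onClause = cols.map(c => s"left_t.$c = right_t.$c").mkString(" AND ")

  // keep every column from df1, but only the non-key columns from df2,
  // so the join columns don't appear twice in the result
  val rightCols = df2.columns.filterNot(cols.contains).map(c => s"right_t.$c")
  val selectList = ("left_t.*" +: rightCols).mkString(", ")

  spark.sql(s"SELECT $selectList FROM left_t LEFT JOIN right_t ON $onClause")
}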

I have had instances where calling a function in spark-shell produced the
error "unable to acquire xxxx bytes of memory", but pasting the same
function body directly into spark-shell, without wrapping it in a function,
did not produce the error. So, does calling a function in Spark add some
memory overhead somehow?
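
A toy version of what I mean, just to show the two shapes of code (the real
DataFrames are much bigger, and `key1` / `extra` are made-up column names):

import org.apache.spark.sql.functions.lit

// toy inputs standing in for my real (large) DataFrames
val df1 = spark.range(100).withColumnRenamed("id", "key1")
val df2 = spark.range(100).withColumnRenamed("id", "key1").withColumn("extra", lit(1))

// variant A: going through the function -- this is where I saw
// "unable to acquire xxxx bytes of memory" on the real data
val joinedA = myLeftJoin(df1, df2, Seq("key1"))
joinedA.count()

// variant B: the same logic pasted directly at the prompt -- no error
df1.createOrReplaceTempView("left_t")
df2.createOrReplaceTempView("right_t")
val joinedB = spark.sql(
  "SELECT left_t.*, right_t.extra FROM left_t LEFT JOIN right_t ON left_t.key1 = right_t.key1")
joinedB.count()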

Thanks for any clarifications!

Kristina
