Agreed, since they can easily be replaced by .flatMap (to do the explosion)
and .select (to rename the output columns).
Cheng
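A minimal sketch of the replacement described above, assuming a Spark 2.x session; the `Doc` case class, the sample data, and the output column names are hypothetical and only serve the illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FlatMapInsteadOfExplode {
  // Hypothetical record type, used only for illustration.
  case class Doc(id: Int, text: String)

  def wordsWithLengths(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Hypothetical sample data.
    val docs = Seq(Doc(1, "hello big world"), Doc(2, "spark")).toDS()
    docs
      // flatMap does the explosion: each Doc yields zero or more tuples.
      .flatMap(d => d.text.split(" ").map(w => (d.id, w, w.length)).toSeq)
      // Tuple columns come out as _1, _2, _3; select renames them.
      .select($"_1".as("id"), $"_2".as("word"), $"_3".as("length"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("flatmap-sketch").getOrCreate()
    try wordsWithLengths(spark).show() finally spark.stop()
  }
}
```

This only works when the row type is known at compile time, which is the limitation koert raises further down the thread.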
On 5/25/16 12:30 PM, Reynold Xin wrote:
Based on this discussion I'm thinking we should deprecate the two
explode functions.
On Wednesday, May 25, 2016, Koert Kuipers <ko...@tresata.com> wrote:
wenchen,
that definition of explode seems identical to flatMap, so you don't
need it either?
michael,
i didn't know about the column expression version of explode, that
makes sense. i will experiment with that instead.
On Wed, May 25, 2016 at 3:03 PM, Wenchen Fan
<wenc...@databricks.com> wrote:
I think we only need this version:
`def explode[B : Encoder](f: A => TraversableOnce[B]): Dataset[B]`
For the untyped one, `df.select(explode($"arrayCol").as("item"))`
should be the best choice.
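For reference, a small sketch of the column expression version in use, assuming a Spark 2.x session; the sample data and the `id` column are hypothetical, while `arrayCol` and `item` follow the names used in the message above:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.explode

object ExplodeColumnSketch {
  def explodeItems(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Hypothetical sample data with one array-typed column.
    val df = Seq((1, Seq("a", "b")), (2, Seq("c"))).toDF("id", "arrayCol")
    // One output row per array element; "id" is carried along unchanged.
    df.select($"id", explode($"arrayCol").as("item"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("explode-sketch").getOrCreate()
    try explodeItems(spark).show() finally spark.stop()
  }
}
```

No type parameters or encoders are involved here, because the explosion is expressed as a column expression rather than a Scala function.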
On Wed, May 25, 2016 at 11:55 AM, Michael Armbrust
<mich...@databricks.com> wrote:
These APIs predate Datasets / encoders, so that is why
they are Row instead of objects. We should probably
rethink that.
Honestly, I usually end up using the column expression
version of explode now that it exists (i.e.
explode($"arrayCol").as("Item")). It would be great to
understand more why you are using these instead.
On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers
<ko...@tresata.com> wrote:
we currently have 2 explode definitions in Dataset:
def explode[A <: Product : TypeTag](input: Column*)(f: Row => TraversableOnce[A]): DataFrame
def explode[A, B : TypeTag](inputColumn: String, outputColumn: String)(f: A => TraversableOnce[B]): DataFrame
1) the separation of the functions into their own
argument lists is nice, but unfortunately scala's type
inference doesn't handle this well, meaning that the
generic types always have to be explicitly provided. i
assume this was done to allow the "input" to be a
varargs in the first method, and then kept the same in
the second for reasons of symmetry.
2) i am surprised the first definition returns a
DataFrame. this seems to suggest DataFrame usage (so
DataFrame to DataFrame), but there is no way to
specify the output column names, which limits its
usability for DataFrames. i frequently end up using
the first definition for DataFrames anyhow because of
the need to return more than 1 column (and the data
has columns unknown at compile time that i need to
carry along making flatMap on Dataset
clumsy/unusable), but relying on the output columns
being called _1 and _2 and renaming them afterwards
seems like an anti-pattern.
3) using Row objects isn't very pretty. why not f: A
=> TraversableOnce[B] or something like that for the
first definition? how about:
def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output: Seq[Column])(f: A => TraversableOnce[B]): DataFrame
best,
koert
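One possible way to cover the use case in point 2 (several named output columns, with the remaining columns carried along without knowing them at compile time) is to combine a UDF with the column expression version of explode. This is only a sketch under assumptions: the sample data, the `wordsWithLength` UDF, and the column names `pair`, `word`, and `length` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{explode, udf}

object MultiColumnExplodeSketch {
  def tokenize(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Hypothetical input; only the "text" column is used by the explosion.
    val df = Seq((1, "x", "hello big world"), (2, "y", "spark")).toDF("id", "tag", "text")

    // Hypothetical UDF: returns an array of structs with fields _1 and _2.
    val wordsWithLength = udf((s: String) => s.split(" ").map(w => (w, w.length)).toSeq)

    df
      // One row per struct; all existing columns are carried along.
      .withColumn("pair", explode(wordsWithLength($"text")))
      // Name the struct fields explicitly instead of relying on _1 and _2.
      .withColumn("word", $"pair._1")
      .withColumn("length", $"pair._2")
      .drop("pair")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("multi-col-sketch").getOrCreate()
    try tokenize(spark).show() finally spark.stop()
  }
}
```

The `_1` and `_2` names still appear, but only inside the intermediate struct column, which is dropped before the result is returned.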