oh yes, this was by accident, it should have gone to dev

On Wed, May 25, 2016 at 4:20 PM, Reynold Xin <r...@databricks.com> wrote:
> Created JIRA ticket: https://issues.apache.org/jira/browse/SPARK-15533
>
> @Koert - Please keep the API feedback coming. One thing - in the future, can
> you send API feedback to the dev@ list instead of user@?
>
> On Wed, May 25, 2016 at 1:05 PM, Cheng Lian <l...@databricks.com> wrote:
>
>> Agree, since they can be easily replaced by .flatMap (to do the explosion)
>> and .select (to rename output columns).
>>
>> Cheng
>>
>> On 5/25/16 12:30 PM, Reynold Xin wrote:
>>
>> Based on this discussion I'm thinking we should deprecate the two explode
>> functions.
>>
>> On Wednesday, May 25, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> wenchen,
>>> that definition of explode seems identical to flatMap, so you don't need
>>> it either?
>>>
>>> michael,
>>> i didn't know about the column expression version of explode; that makes
>>> sense. i will experiment with that instead.
>>>
>>> On Wed, May 25, 2016 at 3:03 PM, Wenchen Fan <wenc...@databricks.com> wrote:
>>>
>>>> I think we only need this version:
>>>> `def explode[B : Encoder](f: A => TraversableOnce[B]): Dataset[B]`
>>>>
>>>> For the untyped one, `df.select(explode($"arrayCol").as("item"))` should
>>>> be the best choice.
>>>>
>>>> On Wed, May 25, 2016 at 11:55 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>
>>>>> These APIs predate Datasets / encoders, so that is why they use Row
>>>>> instead of objects. We should probably rethink that.
>>>>>
>>>>> Honestly, I usually end up using the column expression version of
>>>>> explode now that it exists (i.e. explode($"arrayCol").as("Item")).
>>>>> It would be great to understand more about why you are using these instead.
>>>>>
>>>>> On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>
>>>>>> we currently have 2 explode definitions in Dataset:
>>>>>>
>>>>>> def explode[A <: Product : TypeTag](input: Column*)(f: Row => TraversableOnce[A]): DataFrame
>>>>>>
>>>>>> def explode[A, B : TypeTag](inputColumn: String, outputColumn: String)(f: A => TraversableOnce[B]): DataFrame
>>>>>>
>>>>>> 1) the separation of the functions into their own argument lists is
>>>>>> nice, but unfortunately scala's type inference doesn't handle this well,
>>>>>> meaning the generic types always have to be provided explicitly. i
>>>>>> assume this was done to allow "input" to be varargs in the first
>>>>>> method, and then kept the same in the second for symmetry.
>>>>>>
>>>>>> 2) i am surprised the first definition returns a DataFrame. this seems
>>>>>> to suggest DataFrame usage (so DataFrame to DataFrame), but there is no
>>>>>> way to specify the output column names, which limits its usability for
>>>>>> DataFrames. i frequently end up using the first definition for DataFrames
>>>>>> anyway, because of the need to return more than 1 column (and the data
>>>>>> has columns unknown at compile time that i need to carry along, making
>>>>>> flatMap on Dataset clumsy/unusable), but relying on the output columns
>>>>>> being called _1 and _2 and renaming them afterwards seems like an
>>>>>> anti-pattern.
>>>>>>
>>>>>> 3) using Row objects isn't very pretty. why not f: A => TraversableOnce[B]
>>>>>> or something like that for the first definition? how about:
>>>>>>
>>>>>> def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output: Seq[Column])(f: A => TraversableOnce[B]): DataFrame
>>>>>>
>>>>>> best,
>>>>>> koert
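[Editor's note: the equivalence the thread keeps circling back to — that the typed explode is just flatMap over the element sequence — can be sketched with plain Scala collections. This is a minimal sketch of the semantics only: Order, id, and items are hypothetical names, and no Spark dependency is used.]

```scala
// Sketch: "exploding" a sequence-valued field is a flatMap that emits
// one output row per element, carrying the parent's other fields along.
// Hypothetical types; plain Scala collections, not Spark.
case class Order(id: Int, items: Seq[String])

val orders = Seq(
  Order(1, Seq("a", "b")),
  Order(2, Seq("c"))
)

// The same shape a typed Dataset.explode / Dataset.flatMap would produce:
// columns _1 (the carried-along id) and _2 (the exploded item).
val exploded: Seq[(Int, String)] =
  orders.flatMap(o => o.items.map(item => (o.id, item)))
```

On an actual DataFrame, the column-expression form discussed above produces the same result while letting you name the output column and carry other columns, e.g. something like `df.select($"id", explode($"items").as("item"))` — which is the `.flatMap`/`.select` replacement Cheng Lian describes.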