Re: feedback on dataset api explode

Cheng Lian Wed, 25 May 2016 13:26:38 -0700

Agree, since they can be easily replaced by .flatMap (to do the explosion) and .select (to rename output columns).
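
A minimal sketch of that replacement, assuming Spark 2.x run in local mode. The Doc case class, the column names, and the object name are hypothetical and used only for illustration; they do not come from the thread. The typed path uses flatMap for the explosion and select to rename the tuple columns _1 and _2; the untyped path uses the column-expression explode mentioned further down the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

object ExplodeAlternatives {
  // Hypothetical record type, defined at object level so an Encoder can be derived.
  case class Doc(id: Int, words: Seq[String])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")              // assumption: local test session
      .appName("explode-alternatives")
      .getOrCreate()
    import spark.implicits._

    val docs = Seq(Doc(1, Seq("a", "b")), Doc(2, Seq("c"))).toDS()

    // Typed: flatMap emits one (id, word) tuple per array element,
    // then select renames the default tuple columns _1 and _2.
    val typed = docs
      .flatMap(d => d.words.map(w => (d.id, w)))
      .select($"_1".as("id"), $"_2".as("word"))

    // Untyped: the column-expression explode, with the output column named via as.
    val untyped = docs.toDF()
      .select($"id", explode($"words").as("word"))

    typed.show()
    untyped.show()
    spark.stop()
  }
}
```

Both variants yield one row per array element with columns id and word. The untyped form keeps working when the other columns are not known at compile time, since they can be carried along in the select.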


Cheng


On 5/25/16 12:30 PM, Reynold Xin wrote:

Based on this discussion I'm thinking we should deprecate the two explode functions.

On Wednesday, May 25, 2016, Koert Kuipers <ko...@tresata.com> wrote:


    wenchen,
    that definition of explode seems identical to flatMap, so you don't
    need it either?

    michael,
    i didn't know about the column expression version of explode, that
    makes sense. i will experiment with that instead.

    On Wed, May 25, 2016 at 3:03 PM, Wenchen Fan
    <wenc...@databricks.com> wrote:

        I think we only need this version:
        `def explode[B : Encoder](f: A => TraversableOnce[B]): Dataset[B]`

        For untyped one, `df.select(explode($"arrayCol").as("item"))`
        should be the best choice.

        On Wed, May 25, 2016 at 11:55 AM, Michael Armbrust
        <mich...@databricks.com> wrote:

            These APIs predate Datasets / encoders, so that is why
            they are Row instead of objects.  We should probably
            rethink that.

            Honestly, I usually end up using the column expression
            version of explode now that it exists (i.e.
            explode($"arrayCol").as("Item")). It would be great to
            understand more why you are using these instead.

            On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers
            <ko...@tresata.com> wrote:

                we currently have 2 explode definitions in Dataset:

                 def explode[A <: Product : TypeTag](input: Column*)(f: Row => TraversableOnce[A]): DataFrame

                 def explode[A, B : TypeTag](inputColumn: String, outputColumn: String)(f: A => TraversableOnce[B]): DataFrame

                1) the separation of the functions into their own
                argument lists is nice, but unfortunately scala's type
                inference doesn't handle this well, meaning that the
                generic types always have to be explicitly provided. i
                assume this was done to allow the "input" to be a
                varargs in the first method, and then kept the same in
                the second for reasons of symmetry.

                2) i am surprised the first definition returns a
                DataFrame. this seems to suggest DataFrame usage (so
                DataFrame to DataFrame), but there is no way to
                specify the output column names, which limits its
                usability for DataFrames. i frequently end up using
                the first definition for DataFrames anyhow because of
                the need to return more than 1 column (and the data
                has columns unknown at compile time that i need to
                carry along making flatMap on Dataset
                clumsy/unusable), but relying on the output columns
                being called _1 and _2 and renaming them afterwards
                seems like an anti-pattern.

                3) using Row objects isn't very pretty. why not f: A
                => TraversableOnce[B] or something like that for the
                first definition? how about:
                 def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output: Seq[Column])(f: A => TraversableOnce[B]): DataFrame

                best,
                koert
