oh yes, this was by accident, it should have gone to dev

On Wed, May 25, 2016 at 4:20 PM, Reynold Xin <r...@databricks.com> wrote:
> Created JIRA ticket: https://issues.apache.org/jira/browse/SPARK-15533
>
> @Koert - Please keep the API feedback coming. One thing - in the future, can
> you send API feedback to the dev@ list instead of user@?
>
> On Wed, May 25, 2016 at 1:05 PM, Cheng Lian <l...@databricks.com> wrote:
>
>> Agree, since they can be easily replaced by .flatMap (to do the explosion)
>> and .select (to rename output columns).
>>
>> Cheng
>>
>> On 5/25/16 12:30 PM, Reynold Xin wrote:
>>
>> Based on this discussion I'm thinking we should deprecate the two explode
>> functions.
>>
>> On Wednesday, May 25, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> wenchen,
>>> that definition of explode seems identical to flatMap, so you don't need
>>> it either?
>>>
>>> michael,
>>> i didn't know about the column expression version of explode; that makes
>>> sense. i will experiment with that instead.
>>>
>>> On Wed, May 25, 2016 at 3:03 PM, Wenchen Fan <wenc...@databricks.com> wrote:
>>>
>>>> I think we only need this version:
>>>> `def explode[B : Encoder](f: A => TraversableOnce[B]): Dataset[B]`
>>>>
>>>> For the untyped one, `df.select(explode($"arrayCol").as("item"))` should
>>>> be the best choice.
>>>>
>>>> On Wed, May 25, 2016 at 11:55 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>
>>>>> These APIs predate Datasets / encoders, so that is why they use Row
>>>>> instead of objects. We should probably rethink that.
>>>>>
>>>>> Honestly, I usually end up using the column expression version of
>>>>> explode now that it exists (i.e. explode($"arrayCol").as("Item")).
>>>>> It would be great to understand more about why you are using these instead.
>>>>>
>>>>> On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>
>>>>>> we currently have 2 explode definitions in Dataset:
>>>>>>
>>>>>> def explode[A <: Product : TypeTag](input: Column*)(f: Row => TraversableOnce[A]): DataFrame
>>>>>>
>>>>>> def explode[A, B : TypeTag](inputColumn: String, outputColumn: String)(f: A => TraversableOnce[B]): DataFrame
>>>>>>
>>>>>> 1) the separation of the functions into their own argument lists is
>>>>>> nice, but unfortunately scala's type inference doesn't handle this well,
>>>>>> meaning the generic types always have to be provided explicitly. i
>>>>>> assume this was done to allow "input" to be varargs in the first
>>>>>> method, and then kept the same in the second for symmetry.
>>>>>>
>>>>>> 2) i am surprised the first definition returns a DataFrame. this seems
>>>>>> to suggest DataFrame usage (so DataFrame to DataFrame), but there is no
>>>>>> way to specify the output column names, which limits its usability for
>>>>>> DataFrames. i frequently end up using the first definition for DataFrames
>>>>>> anyway, because of the need to return more than 1 column (and the data
>>>>>> has columns unknown at compile time that i need to carry along, making
>>>>>> flatMap on Dataset clumsy/unusable), but relying on the output columns
>>>>>> being called _1 and _2 and renaming them afterwards seems like an
>>>>>> anti-pattern.
>>>>>>
>>>>>> 3) using Row objects isn't very pretty. why not f: A => TraversableOnce[B]
>>>>>> or something like that for the first definition? how about:
>>>>>>
>>>>>> def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output: Seq[Column])(f: A => TraversableOnce[B]): DataFrame
>>>>>>
>>>>>> best,
>>>>>> koert
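[Editor's note: the equivalence the thread keeps circling back to — that the typed explode is just flatMap over the element sequence — can be sketched with plain Scala collections. This is a minimal sketch of the semantics only: Order, id, and items are hypothetical names, and no Spark dependency is used.]

```scala
// Sketch: "exploding" a sequence-valued field is a flatMap that emits
// one output row per element, carrying the parent's other fields along.
// Hypothetical types; plain Scala collections, not Spark.
case class Order(id: Int, items: Seq[String])

val orders = Seq(
  Order(1, Seq("a", "b")),
  Order(2, Seq("c"))
)

// The same shape a typed Dataset.explode / Dataset.flatMap would produce:
// columns _1 (the carried-along id) and _2 (the exploded item).
val exploded: Seq[(Int, String)] =
  orders.flatMap(o => o.items.map(item => (o.id, item)))
```

On an actual DataFrame, the column-expression form discussed above produces the same result while letting you name the output column and carry other columns, e.g. something like `df.select($"id", explode($"items").as("item"))` — which is the `.flatMap`/`.select` replacement Cheng Lian describes.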