Re: Encoding categorical variables

Woo Jae Jung Fri, 28 Oct 2016 13:33:06 -0700

+Vatsan for his thoughts as well!

On Fri, Oct 28, 2016 at 1:29 PM, Woo Jae Jung <wj...@pivotal.io> wrote:


> Also agree that double-quoted column names are not ideal.  In addition to
> the net-new features described in this thread, it'd be nice to see
> non-double-quoted output as default behavior in the
> existing create_indicator_variables() function.
>
> Thanks,
> Woo
>
> On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung <wj...@pivotal.io> wrote:
>
>> I like the one-hot encoded feature.  Another variant of this idea would
>> be an "all other" variable (distinct from the reference class) that
>> contains occurrences of the less frequent category types.  In both of these
>> scenarios, the threshold for 'less frequent' could be user-supplied.
>>
>> Thanks,
>> Woo
>>
>> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <rahulri...@gmail.com>
>> wrote:
>>
>>> An alternative to dropping is to assign the less frequent values to the
>>> reference, i.e., all one-hot encoded features will be 0.
>>> Also important to note: total runtime will increase with this option
>>> since we'll have to compute the exact frequency distribution.
>>>
>>> Another suggested change is to call this function 'one_hot_encoding'
>>> since that is the output here (similar to sklearn's OneHotEncoder
>>> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
>>> We can keep the current name as a deprecated alias till 2.0 is released.
>>>
>>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <fmcquil...@pivotal.io>
>>> wrote:
>>>
>>> > Jarrod,
>>> >
>>> > Just trying to write up detailed requirements.  How would you see this
>>> > one working?
>>> >
>>> > "2) Option to dummy code only the top n most frequently occurring
>>> > values in any column"
>>> >
>>> > With 1 column I can picture it, you would drop the rows with the less
>>> > frequently occurring values and end up with a smaller table.  But what
>>> > if you are encoding multiple columns?  Would you want a per-column
>>> > specification of n, i.e., top 3 values for column x, top 10 values for
>>> > column y?  If you did this then your result set might include low
>>> > frequency values for column x (not in top 3) because they appear in
>>> > rows whose column y value is in the top 10 - this might be confusing.
>>> >
>>> > Frank
>>> >
>>> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fmcquil...@pivotal.io>
>>> > wrote:
>>> >
>>> >> great, thanks for the additional information
>>> >>
>>> >> Frank
>>> >>
>>> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jvawd...@pivotal.io>
>>> >> wrote:
>>> >>
>>> >>> IMO
>>> >>>
>>> >>> 1) Option to define resulting column names. Please see pdltools
>>> >>> implementation - the ability to pass in a function is especially
>>> >>> useful (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>> >>> 2) Option to dummy code only the top n most frequently occurring
>>> >>> values in any column
>>> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>>> >>> pivotcol_val2 ...) instead of values in column names + secondary
>>> >>> mapping table
>>> >>> 4) Option to exclude original column from results table
>>> >>>
>>> >>> (1) & (2) are much higher priority than (3) & (4).
>>> >>>
>>> >>> Agreed that these could also be applied to Pivoting (especially 1).
>>> >>>
>>> >>>
>>> >>>
>>> >>> Jarrod Vawdrey
>>> >>> Sr. Data Scientist
>>> >>> Data Science & Engineering | Pivotal
>>> >>> (650) 315-8905
>>> >>> https://pivotal.io/
>>> >>>
>>> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fmcquil...@pivotal.io>
>>> >>> wrote:
>>> >>>
>>> >>> > Thanks for those suggestions, Jarrod.  They all sound pretty
>>> >>> > useful - would you mind taking a crack at numbering them 1,2,3...
>>> >>> > etc, in the order of priority as you see it?
>>> >>> >
>>> >>> > Also it seems like some of these could be applied to the Pivot
>>> >>> > function as well, e.g., UDF for column naming.
>>> >>> >
>>> >>> > Frank
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jvawd...@pivotal.io>
>>> >>> > wrote:
>>> >>> >
>>> >>> >> Hey Frank,
>>> >>> >>
>>> >>> >> How are special character values handled today? It is often not
>>> >>> >> ideal to end up with column names that require double quotes to
>>> >>> >> call due to downstream scripts.
>>> >>> >>
>>> >>> >> A couple of features that would be useful:
>>> >>> >>
>>> >>> >> * Option to define resulting column names. Please see pdltools
>>> >>> >> implementation - the ability to pass in a function is especially
>>> >>> >> useful (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>> >>> >> * Option to dummy code only the top n most frequently occurring
>>> >>> >> values in any column
>>> >>> >> * Option to exclude original column from results table
>>> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>>> >>> >> pivotcol_val2 ...) instead of values in column names + secondary
>>> >>> >> mapping table
>>> >>> >>
>>> >>> >> Thank you
>>> >>> >>
>>> >>> >> Jarrod Vawdrey
>>> >>> >> Sr. Data Scientist
>>> >>> >> Data Science & Engineering | Pivotal
>>> >>> >> (650) 315-8905
>>> >>> >> https://pivotal.io/
>>> >>> >>
>>> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <fmcquil...@pivotal.io>
>>> >>> >> wrote:
>>> >>> >>
>>> >>> >>> For the module encoding categorical variables
>>> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>>> >>> >>> does anyone have any suggestions on improvements that we could
>>> >>> >>> make?
>>> >>> >>>
>>> >>> >>> Here is a video on how encoding categorical variables works for
>>> >>> >>> those not familiar with it
>>> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>>> >>> >>>
>>> >>> >>
>>> >>> >>
>>> >>> >
>>> >>>
>>> >>
>>> >>
>>> >
>>>
>>
>>
>
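
For anyone following the thread, the top-n variants discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MADlib code and not a proposed signature: the names one_hot_top_n and safe_name, the rare parameter, and the "<column>_other" column name are all hypothetical. It shows the two alternatives to dropping rows (infrequent values folded into the all-zero reference, or into an "all other" indicator) together with output column names that need no double quotes.

```python
import re
from collections import Counter


def safe_name(column, value):
    """Build a lowercase identifier that needs no double quotes in SQL."""
    raw = f"{column}_{value}".lower()
    name = re.sub(r"[^a-z0-9_]+", "_", raw).strip("_")
    # Unquoted identifiers may not start with a digit; prefix if needed.
    return f"c_{name}" if name[:1].isdigit() else name


def one_hot_top_n(rows, column, n, rare="reference"):
    """One-hot encode the n most frequent values of `column`.

    rare="reference": infrequent values get all-zero indicators.
    rare="other":     infrequent values set a `<column>_other` indicator.
    (Both option names are hypothetical, chosen for this sketch.)
    """
    if rare not in ("reference", "other"):
        raise ValueError("rare must be 'reference' or 'other'")
    # Exact frequency count: this is the extra pass over the data.
    counts = Counter(row[column] for row in rows)
    # Descending count, then value, so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    top = [value for value, _ in ranked[:n]]
    encoded = []
    for row in rows:
        out = dict(row)  # keep the original columns, no rows are dropped
        for value in top:
            out[safe_name(column, value)] = int(row[column] == value)
        if rare == "other":
            out[f"{column}_other"] = int(row[column] not in top)
        encoded.append(out)
    return encoded


rows = [{"color": c} for c in
        ["Red", "Red", "Blue", "Light Green", "Blue", "Red"]]
encoded = one_hot_top_n(rows, "color", n=2, rare="other")
print(encoded[3])
# {'color': 'Light Green', 'color_red': 0, 'color_blue': 0, 'color_other': 1}
```

Because no rows are dropped in either mode, a per-column n (say, top 3 for column x and top 10 for column y) can be handled by calling the function once per column without the cross-column confusion raised earlier in the thread. Note that the sketch does not guard against name collisions (two distinct values that sanitize to the same identifier, or a value literally named "other"); a real implementation would need the secondary mapping table suggested above or a similar check.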
