Re: Encoding categorical variables

2016-11-08 Thread Frank McQuillan
Here is the JIRA with attached requirements doc.
https://issues.apache.org/jira/browse/MADLIB-1038

Please put your comments in the JIRA.  There are still some outstanding
questions to be puzzled out.

Frank

On Fri, Oct 28, 2016 at 3:04 PM, Frank McQuillan 
wrote:

> Yes thanks Vatsan we have been looking at that.
>
> On Fri, Oct 28, 2016 at 2:39 PM, Srivatsan R  wrote:
>
>> You guys may have already seen this, but linking just in case:
>> http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
>>
>> On Fri, Oct 28, 2016 at 1:32 PM, Woo Jae Jung  wrote:
>>
>> > +Vatsan for his thoughts as well!
>> >
>> > On Fri, Oct 28, 2016 at 1:29 PM, Woo Jae Jung  wrote:
>> >
>> >> Also agree that double-quoted column names are not ideal.  In addition
>> to
>> >> the net-new features described in this thread, it'd be nice to see
>> >> non-double-quoted output as default behavior in the
>> >> existing create_indicator_variables() function.
>> >>
>> >> Thanks,
>> >> Woo
>> >>
>> >> On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung 
>> wrote:
>> >>
>> >>> I like the one-hot encoded feature.  Another variant of this idea
>> would
>> >>> be an "all other" variable (distinct from the reference class) that
>> >>> contains occurrences of the less frequent category types.  In both of
>> these
>> >>> scenarios, the threshold for 'less frequent' could be user-supplied.
>> >>>
>> >>> Thanks,
>> >>> Woo
>> >>>
>> >>> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer 
>> >>> wrote:
>> >>>
>>  An alternative to dropping is to assign the less frequent values to
>> the
>>  reference i.e. all one-hot encoded features will be 0.
>>  Also important to note: total runtime will increase with this option
>>  since
>>  we'll have to compute the exact frequency distribution.
>> 
>>  Another suggested change is to call this function 'one_hot_encoding'
>>  since
>>  that is the output here (similar to sklearn's OneHotEncoder
>>  >  eprocessing.OneHotEncoder.html>).
>>  We can keep the current name as a deprecated alias till 2.0 is
>> released.
>> 
>>  On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <
>>  fmcquil...@pivotal.io>
>>  wrote:
>> 
>>  > Jarrod,
>>  >
>>  > Just trying to write up detailed requirements.  How would you see
>>  this one
>>  > working?
>>  >
>>  > "2) Option to dummy code only the top n most frequently occurring
>>  values in
>>  > any column"
>>  >
>>  > With 1 column I can picture it, you would drop the rows with the
>> less
>>  > frequently occurring values and end up with a smaller table.  But
>>  what if
>>  > you are encoding multiple rows?Would you want a per row
>>  specification
>>  > of n? i.e., top 3 values for column x, top 10 values for column y?
>>  If you
>>  > did this then your result set might include low frequency values
>> for
>>  column
>>  > x (not in top 3) because they are in the top 10 for column y - this
>>  might
>>  > be confusing.
>>  >
>>  > Frank
>>  >
>>  > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <
>>  fmcquil...@pivotal.io>
>>  > wrote:
>>  >
>>  >> great, thanks for the additional information
>>  >>
>>  >> Frank
>>  >>
>>  >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <
>> jvawd...@pivotal.io
>>  >
>>  >> wrote:
>>  >>
>>  >>> IMO
>>  >>>
>>  >>> 1) Option to define resulting column names. Please see pdltools
>>  >>> implementation - the ability to pass in a function is especially
>>  useful (
>>  >>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>  >>> 2) Option to dummy code only the top n most frequently occurring
>>  values
>>  >>> in
>>  >>> any column
>>  >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>>  >>> pivotcol_val2
>>  >>> ...) instead of values in column names + secondary mapping table
>>  >>> 4) Option to exclude original column from results table
>>  >>>
>>  >>> (1) & (2) are much higher priority than (3) & (4).
>>  >>>
>>  >>> Agreed that these could also be applied to Pivoting (especially
>> 1).
>>  >>>
>>  >>>
>>  >>>
>>  >>> Jarrod Vawdrey
>>  >>> Sr. Data Scientist
>>  >>> Data Science & Engineering | Pivotal
>>  >>> (650) 315-8905
>>  >>> https://pivotal.io/
>>  >>>
>>  >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <
>>  fmcquil...@pivotal.io>
>>  >>> wrote:
>>  >>>
>>  >>> > Thanks for those suggestions, Jarrod.  They all sound pretty
>>  useful -
>>  >>> > would you mind taking a crack at numbering them 1,2,3... etc,
>> in
>>  the
>>  >>> order
>>  >>> > of priority as you see it?
>>  >>> >
>>  >>> > Also it seems like some of these could be applied 

Re: Encoding categorical variables

2016-10-28 Thread Frank McQuillan
Yes, thanks Vatsan, we have been looking at that.

On Fri, Oct 28, 2016 at 2:39 PM, Srivatsan R  wrote:

> You guys may have already seen this, but linking just in case:
> http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
>
> On Fri, Oct 28, 2016 at 1:32 PM, Woo Jae Jung  wrote:
>
> > +Vatsan for his thoughts as well!
> >
> > On Fri, Oct 28, 2016 at 1:29 PM, Woo Jae Jung  wrote:
> >
> >> Also agree that double-quoted column names are not ideal.  In addition
> to
> >> the net-new features described in this thread, it'd be nice to see
> >> non-double-quoted output as default behavior in the
> >> existing create_indicator_variables() function.
> >>
> >> Thanks,
> >> Woo
> >>
> >> On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung  wrote:
> >>
> >>> I like the one-hot encoded feature.  Another variant of this idea would
> >>> be an "all other" variable (distinct from the reference class) that
> >>> contains occurrences of the less frequent category types.  In both of
> these
> >>> scenarios, the threshold for 'less frequent' could be user-supplied.
> >>>
> >>> Thanks,
> >>> Woo
> >>>
> >>> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer 
> >>> wrote:
> >>>
>  An alternative to dropping is to assign the less frequent values to
> the
>  reference i.e. all one-hot encoded features will be 0.
>  Also important to note: total runtime will increase with this option
>  since
>  we'll have to compute the exact frequency distribution.
> 
>  Another suggested change is to call this function 'one_hot_encoding'
>  since
>  that is the output here (similar to sklearn's OneHotEncoder
>    eprocessing.OneHotEncoder.html>).
>  We can keep the current name as a deprecated alias till 2.0 is
> released.
> 
>  On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <
>  fmcquil...@pivotal.io>
>  wrote:
> 
>  > Jarrod,
>  >
>  > Just trying to write up detailed requirements.  How would you see
>  this one
>  > working?
>  >
>  > "2) Option to dummy code only the top n most frequently occurring
>  values in
>  > any column"
>  >
>  > With 1 column I can picture it, you would drop the rows with the
> less
>  > frequently occurring values and end up with a smaller table.  But
>  what if
>  > you are encoding multiple rows?Would you want a per row
>  specification
>  > of n? i.e., top 3 values for column x, top 10 values for column y?
>  If you
>  > did this then your result set might include low frequency values for
>  column
>  > x (not in top 3) because they are in the top 10 for column y - this
>  might
>  > be confusing.
>  >
>  > Frank
>  >
>  > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <
>  fmcquil...@pivotal.io>
>  > wrote:
>  >
>  >> great, thanks for the additional information
>  >>
>  >> Frank
>  >>
>  >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <
> jvawd...@pivotal.io
>  >
>  >> wrote:
>  >>
>  >>> IMO
>  >>>
>  >>> 1) Option to define resulting column names. Please see pdltools
>  >>> implementation - the ability to pass in a function is especially
>  useful (
>  >>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>  >>> 2) Option to dummy code only the top n most frequently occurring
>  values
>  >>> in
>  >>> any column
>  >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>  >>> pivotcol_val2
>  >>> ...) instead of values in column names + secondary mapping table
>  >>> 4) Option to exclude original column from results table
>  >>>
>  >>> (1) & (2) are much higher priority than (3) & (4).
>  >>>
>  >>> Agreed that these could also be applied to Pivoting (especially
> 1).
>  >>>
>  >>>
>  >>>
>  >>> Jarrod Vawdrey
>  >>> Sr. Data Scientist
>  >>> Data Science & Engineering | Pivotal
>  >>> (650) 315-8905
>  >>> https://pivotal.io/
>  >>>
>  >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <
>  fmcquil...@pivotal.io>
>  >>> wrote:
>  >>>
>  >>> > Thanks for those suggestions, Jarrod.  They all sound pretty
>  useful -
>  >>> > would you mind taking a crack at numbering them 1,2,3... etc, in
>  the
>  >>> order
>  >>> > of priority as you see it?
>  >>> >
>  >>> > Also it seems like some of these could be applied to the Pivot
>  >>> function as
>  >>> > well, e.g., UDF for column naming.
>  >>> >
>  >>> > Frank
>  >>> >
>  >>> >
>  >>> >
>  >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <
>  jvawd...@pivotal.io>
>  >>> > wrote:
>  >>> >
>  >>> >> Hey Frank,
>  >>> >>
>  >>> >> How are special character values handled today? It is often not
>  ideal
>  >

Re: Encoding categorical variables

2016-10-28 Thread Srivatsan R
You guys may have already seen this, but linking just in case:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

On Fri, Oct 28, 2016 at 1:32 PM, Woo Jae Jung  wrote:

> +Vatsan for his thoughts as well!
>
> On Fri, Oct 28, 2016 at 1:29 PM, Woo Jae Jung  wrote:
>
>> Also agree that double-quoted column names are not ideal.  In addition to
>> the net-new features described in this thread, it'd be nice to see
>> non-double-quoted output as default behavior in the
>> existing create_indicator_variables() function.
>>
>> Thanks,
>> Woo
>>
>> On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung  wrote:
>>
>>> I like the one-hot encoded feature.  Another variant of this idea would
>>> be an "all other" variable (distinct from the reference class) that
>>> contains occurrences of the less frequent category types.  In both of these
>>> scenarios, the threshold for 'less frequent' could be user-supplied.
>>>
>>> Thanks,
>>> Woo
>>>
>>> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer 
>>> wrote:
>>>
 An alternative to dropping is to assign the less frequent values to the
 reference i.e. all one-hot encoded features will be 0.
 Also important to note: total runtime will increase with this option
 since
 we'll have to compute the exact frequency distribution.

 Another suggested change is to call this function 'one_hot_encoding'
 since
 that is the output here (similar to sklearn's OneHotEncoder
 ).
 We can keep the current name as a deprecated alias till 2.0 is released.

 On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <
 fmcquil...@pivotal.io>
 wrote:

 > Jarrod,
 >
 > Just trying to write up detailed requirements.  How would you see
 this one
 > working?
 >
 > "2) Option to dummy code only the top n most frequently occurring
 values in
 > any column"
 >
 > With 1 column I can picture it, you would drop the rows with the less
 > frequently occurring values and end up with a smaller table.  But
 what if
 > you are encoding multiple rows?Would you want a per row
 specification
 > of n? i.e., top 3 values for column x, top 10 values for column y?
 If you
 > did this then your result set might include low frequency values for
 column
 > x (not in top 3) because they are in the top 10 for column y - this
 might
 > be confusing.
 >
 > Frank
 >
 > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <
 fmcquil...@pivotal.io>
 > wrote:
 >
 >> great, thanks for the additional information
 >>
 >> Frank
 >>
 >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey
 >> wrote:
 >>
 >>> IMO
 >>>
 >>> 1) Option to define resulting column names. Please see pdltools
 >>> implementation - the ability to pass in a function is especially
 useful (
 >>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
 >>> 2) Option to dummy code only the top n most frequently occurring
 values
 >>> in
 >>> any column
 >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
 >>> pivotcol_val2
 >>> ...) instead of values in column names + secondary mapping table
 >>> 4) Option to exclude original column from results table
 >>>
 >>> (1) & (2) are much higher priority than (3) & (4).
 >>>
 >>> Agreed that these could also be applied to Pivoting (especially 1).
 >>>
 >>>
 >>>
 >>> Jarrod Vawdrey
 >>> Sr. Data Scientist
 >>> Data Science & Engineering | Pivotal
 >>> (650) 315-8905
 >>> https://pivotal.io/
 >>>
 >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <
 fmcquil...@pivotal.io>
 >>> wrote:
 >>>
 >>> > Thanks for those suggestions, Jarrod.  They all sound pretty
 useful -
 >>> > would you mind taking a crack at numbering them 1,2,3... etc, in
 the
 >>> order
 >>> > of priority as you see it?
 >>> >
 >>> > Also it seems like some of these could be applied to the Pivot
 >>> function as
 >>> > well, e.g., UDF for column naming.
 >>> >
 >>> > Frank
 >>> >
 >>> >
 >>> >
 >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <
 jvawd...@pivotal.io>
 >>> > wrote:
 >>> >
 >>> >> Hey Frank,
 >>> >>
 >>> >> How are special character values handled today? It is often not
 ideal
 >>> to
 >>> >> end up with column names that require double quotes to call due
 to
 >>> >> downstream scripts.
 >>> >>
 >>> >> A couple of features that would be useful
 >>> >>
 >>> >> * Option to define resulting column names. Please see pdltools
 >>> >> implementation - the ability to pass in a function is especially
 >>> useful (
 >>> >> http://pivotalsoftware.github.io/PDLTool

Re: Encoding categorical variables

2016-10-28 Thread Woo Jae Jung
+Vatsan for his thoughts as well!

On Fri, Oct 28, 2016 at 1:29 PM, Woo Jae Jung  wrote:

> Also agree that double-quoted column names are not ideal.  In addition to
> the net-new features described in this thread, it'd be nice to see
> non-double-quoted output as default behavior in the
> existing create_indicator_variables() function.
>
> Thanks,
> Woo
>
> On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung  wrote:
>
>> I like the one-hot encoded feature.  Another variant of this idea would
>> be an "all other" variable (distinct from the reference class) that
>> contains occurrences of the less frequent category types.  In both of these
>> scenarios, the threshold for 'less frequent' could be user-supplied.
>>
>> Thanks,
>> Woo
>>
>> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer 
>> wrote:
>>
>>> An alternative to dropping is to assign the less frequent values to the
>>> reference i.e. all one-hot encoded features will be 0.
>>> Also important to note: total runtime will increase with this option
>>> since
>>> we'll have to compute the exact frequency distribution.
>>>
>>> Another suggested change is to call this function 'one_hot_encoding'
>>> since
>>> that is the output here (similar to sklearn's OneHotEncoder
>>> >> eprocessing.OneHotEncoder.html>).
>>> We can keep the current name as a deprecated alias till 2.0 is released.
>>>
>>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan
>>> wrote:
>>>
>>> > Jarrod,
>>> >
>>> > Just trying to write up detailed requirements.  How would you see this
>>> one
>>> > working?
>>> >
>>> > "2) Option to dummy code only the top n most frequently occurring
>>> values in
>>> > any column"
>>> >
>>> > With 1 column I can picture it, you would drop the rows with the less
>>> > frequently occurring values and end up with a smaller table.  But what
>>> if
>>> > you are encoding multiple rows?Would you want a per row
>>> specification
>>> > of n? i.e., top 3 values for column x, top 10 values for column y?  If
>>> you
>>> > did this then your result set might include low frequency values for
>>> column
>>> > x (not in top 3) because they are in the top 10 for column y - this
>>> might
>>> > be confusing.
>>> >
>>> > Frank
>>> >
>>> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <
>>> fmcquil...@pivotal.io>
>>> > wrote:
>>> >
>>> >> great, thanks for the additional information
>>> >>
>>> >> Frank
>>> >>
>>> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey 
>>> >> wrote:
>>> >>
>>> >>> IMO
>>> >>>
>>> >>> 1) Option to define resulting column names. Please see pdltools
>>> >>> implementation - the ability to pass in a function is especially
>>> useful (
>>> >>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>> >>> 2) Option to dummy code only the top n most frequently occurring
>>> values
>>> >>> in
>>> >>> any column
>>> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>>> >>> pivotcol_val2
>>> >>> ...) instead of values in column names + secondary mapping table
>>> >>> 4) Option to exclude original column from results table
>>> >>>
>>> >>> (1) & (2) are much higher priority than (3) & (4).
>>> >>>
>>> >>> Agreed that these could also be applied to Pivoting (especially 1).
>>> >>>
>>> >>>
>>> >>>
>>> >>> Jarrod Vawdrey
>>> >>> Sr. Data Scientist
>>> >>> Data Science & Engineering | Pivotal
>>> >>> (650) 315-8905
>>> >>> https://pivotal.io/
>>> >>>
>>> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <
>>> fmcquil...@pivotal.io>
>>> >>> wrote:
>>> >>>
>>> >>> > Thanks for those suggestions, Jarrod.  They all sound pretty
>>> useful -
>>> >>> > would you mind taking a crack at numbering them 1,2,3... etc, in
>>> the
>>> >>> order
>>> >>> > of priority as you see it?
>>> >>> >
>>> >>> > Also it seems like some of these could be applied to the Pivot
>>> >>> function as
>>> >>> > well, e.g., UDF for column naming.
>>> >>> >
>>> >>> > Frank
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <
>>> jvawd...@pivotal.io>
>>> >>> > wrote:
>>> >>> >
>>> >>> >> Hey Frank,
>>> >>> >>
>>> >>> >> How are special character values handled today? It is often not
>>> ideal
>>> >>> to
>>> >>> >> end up with column names that require double quotes to call due to
>>> >>> >> downstream scripts.
>>> >>> >>
>>> >>> >> A couple of features that would be useful
>>> >>> >>
>>> >>> >> * Option to define resulting column names. Please see pdltools
>>> >>> >> implementation - the ability to pass in a function is especially
>>> >>> useful (
>>> >>> >> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>> >>> >> * Option to dummy code only the top n most frequently occurring
>>> >>> values in
>>> >>> >> any column
>>> >>> >> * Option to exclude original column from results table
>>> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>>> >>> >> pivotcol_val2 ...) instead of values in column names + secondary
>>> >>> mapp

Re: Encoding categorical variables

2016-10-28 Thread Woo Jae Jung
Also agree that double-quoted column names are not ideal.  In addition to
the net-new features described in this thread, it'd be nice to see
non-double-quoted output as default behavior in the
existing create_indicator_variables() function.
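
As a rough illustration of the non-double-quoted naming idea, here is a
minimal Python sketch that turns a column name and a category value into an
identifier that can be used unquoted in SQL.  The helper name
safe_column_name and the "column_value" naming scheme are assumptions made
for this example; they are not part of create_indicator_variables().

```python
import re

def safe_column_name(column, value):
    """Build a lowercase identifier from a column name and a category value.

    Hypothetical helper: anything outside [a-z0-9_] becomes an underscore,
    so the result never needs double quotes.
    """
    raw = f"{column}_{value}".lower()
    name = re.sub(r"[^a-z0-9_]+", "_", raw).strip("_")
    # Unquoted identifiers cannot start with a digit, so add a prefix.
    if not name or name[0].isdigit():
        name = "_" + name
    return name

print(safe_column_name("color", "Dark Blue!"))  # color_dark_blue
print(safe_column_name("2016", "Q1"))           # _2016_q1
```

Note that this kind of sanitizing can map two different values (for example
"a b" and "a-b") to the same name, and it does not handle reserved words or
identifier length limits, so a real implementation would also need a
collision check.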

Thanks,
Woo

On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung  wrote:

> I like the one-hot encoded feature.  Another variant of this idea would be
> an "all other" variable (distinct from the reference class) that contains
> occurrences of the less frequent category types.  In both of these
> scenarios, the threshold for 'less frequent' could be user-supplied.
>
> Thanks,
> Woo
>
> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer  wrote:
>
>> An alternative to dropping is to assign the less frequent values to the
>> reference i.e. all one-hot encoded features will be 0.
>> Also important to note: total runtime will increase with this option since
>> we'll have to compute the exact frequency distribution.
>>
>> Another suggested change is to call this function 'one_hot_encoding' since
>> that is the output here (similar to sklearn's OneHotEncoder
>> > preprocessing.OneHotEncoder.html>).
>> We can keep the current name as a deprecated alias till 2.0 is released.
>>
>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan 
>> wrote:
>>
>> > Jarrod,
>> >
>> > Just trying to write up detailed requirements.  How would you see this
>> one
>> > working?
>> >
>> > "2) Option to dummy code only the top n most frequently occurring
>> values in
>> > any column"
>> >
>> > With 1 column I can picture it, you would drop the rows with the less
>> > frequently occurring values and end up with a smaller table.  But what
>> if
>> > you are encoding multiple rows?Would you want a per row
>> specification
>> > of n? i.e., top 3 values for column x, top 10 values for column y?  If
>> you
>> > did this then your result set might include low frequency values for
>> column
>> > x (not in top 3) because they are in the top 10 for column y - this
>> might
>> > be confusing.
>> >
>> > Frank
>> >
>> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan
>> > wrote:
>> >
>> >> great, thanks for the additional information
>> >>
>> >> Frank
>> >>
>> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey 
>> >> wrote:
>> >>
>> >>> IMO
>> >>>
>> >>> 1) Option to define resulting column names. Please see pdltools
>> >>> implementation - the ability to pass in a function is especially
>> useful (
>> >>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >>> 2) Option to dummy code only the top n most frequently occurring
>> values
>> >>> in
>> >>> any column
>> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>> >>> pivotcol_val2
>> >>> ...) instead of values in column names + secondary mapping table
>> >>> 4) Option to exclude original column from results table
>> >>>
>> >>> (1) & (2) are much higher priority than (3) & (4).
>> >>>
>> >>> Agreed that these could also be applied to Pivoting (especially 1).
>> >>>
>> >>>
>> >>>
>> >>> Jarrod Vawdrey
>> >>> Sr. Data Scientist
>> >>> Data Science & Engineering | Pivotal
>> >>> (650) 315-8905
>> >>> https://pivotal.io/
>> >>>
>> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <
>> fmcquil...@pivotal.io>
>> >>> wrote:
>> >>>
>> >>> > Thanks for those suggestions, Jarrod.  They all sound pretty useful
>> -
>> >>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
>> >>> order
>> >>> > of priority as you see it?
>> >>> >
>> >>> > Also it seems like some of these could be applied to the Pivot
>> >>> function as
>> >>> > well, e.g., UDF for column naming.
>> >>> >
>> >>> > Frank
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <
>> jvawd...@pivotal.io>
>> >>> > wrote:
>> >>> >
>> >>> >> Hey Frank,
>> >>> >>
>> >>> >> How are special character values handled today? It is often not
>> ideal
>> >>> to
>> >>> >> end up with column names that require double quotes to call due to
>> >>> >> downstream scripts.
>> >>> >>
>> >>> >> A couple of features that would be useful
>> >>> >>
>> >>> >> * Option to define resulting column names. Please see pdltools
>> >>> >> implementation - the ability to pass in a function is especially
>> >>> useful (
>> >>> >> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >>> >> * Option to dummy code only the top n most frequently occurring
>> >>> values in
>> >>> >> any column
>> >>> >> * Option to exclude original column from results table
>> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>> >>> >> pivotcol_val2 ...) instead of values in column names + secondary
>> >>> mapping
>> >>> >> table
>> >>> >>
>> >>> >> Thank you
>> >>> >>
>> >>> >> Jarrod Vawdrey
>> >>> >> Sr. Data Scientist
>> >>> >> Data Science & Engineering | Pivotal
>> >>> >> (650) 315-8905
>> >>> >> https://pivotal.io/
>> >>> >>
>> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuil

Re: Encoding categorical variables

2016-10-28 Thread Woo Jae Jung
I like the one-hot encoded feature.  Another variant of this idea would be
an "all other" variable (distinct from the reference class) that contains
occurrences of the less frequent category types.  In both of these
scenarios, the threshold for 'less frequent' could be user-supplied.
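
A minimal Python sketch of the "all other" variant described above, assuming
a user-supplied count threshold.  The function name encode_with_other, the
min_count parameter and the is_* column names are hypothetical, and the
sketch does not model a separate reference class.

```python
from collections import Counter

def encode_with_other(values, min_count):
    """One-hot encode values, pooling rare categories into 'is_other'."""
    counts = Counter(values)
    # Categories seen at least min_count times get their own indicator.
    frequent = sorted(v for v, c in counts.items() if c >= min_count)
    rows = []
    for v in values:
        row = {f"is_{cat}": int(v == cat) for cat in frequent}
        # Every less frequent category shares one indicator column.
        row["is_other"] = int(v not in frequent)
        rows.append(row)
    return rows

rows = encode_with_other(["a", "a", "b", "a", "c", "b"], min_count=2)
print(rows[4])  # the rare value "c" lands in is_other
```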

Thanks,
Woo

On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer  wrote:

> An alternative to dropping is to assign the less frequent values to the
> reference i.e. all one-hot encoded features will be 0.
> Also important to note: total runtime will increase with this option since
> we'll have to compute the exact frequency distribution.
>
> Another suggested change is to call this function 'one_hot_encoding' since
> that is the output here (similar to sklearn's OneHotEncoder
>  OneHotEncoder.html>).
> We can keep the current name as a deprecated alias till 2.0 is released.
>
> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan 
> wrote:
>
> > Jarrod,
> >
> > Just trying to write up detailed requirements.  How would you see this
> one
> > working?
> >
> > "2) Option to dummy code only the top n most frequently occurring values
> in
> > any column"
> >
> > With 1 column I can picture it, you would drop the rows with the less
> > frequently occurring values and end up with a smaller table.  But what if
> > you are encoding multiple rows?Would you want a per row specification
> > of n? i.e., top 3 values for column x, top 10 values for column y?  If
> you
> > did this then your result set might include low frequency values for
> column
> > x (not in top 3) because they are in the top 10 for column y - this might
> > be confusing.
> >
> > Frank
> >
> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan 
> > wrote:
> >
> >> great, thanks for the additional information
> >>
> >> Frank
> >>
> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey 
> >> wrote:
> >>
> >>> IMO
> >>>
> >>> 1) Option to define resulting column names. Please see pdltools
> >>> implementation - the ability to pass in a function is especially
> useful (
> >>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> >>> 2) Option to dummy code only the top n most frequently occurring values
> >>> in
> >>> any column
> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
> >>> pivotcol_val2
> >>> ...) instead of values in column names + secondary mapping table
> >>> 4) Option to exclude original column from results table
> >>>
> >>> (1) & (2) are much higher priority than (3) & (4).
> >>>
> >>> Agreed that these could also be applied to Pivoting (especially 1).
> >>>
> >>>
> >>>
> >>> Jarrod Vawdrey
> >>> Sr. Data Scientist
> >>> Data Science & Engineering | Pivotal
> >>> (650) 315-8905
> >>> https://pivotal.io/
> >>>
> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <
> fmcquil...@pivotal.io>
> >>> wrote:
> >>>
> >>> > Thanks for those suggestions, Jarrod.  They all sound pretty useful -
> >>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
> >>> order
> >>> > of priority as you see it?
> >>> >
> >>> > Also it seems like some of these could be applied to the Pivot
> >>> function as
> >>> > well, e.g., UDF for column naming.
> >>> >
> >>> > Frank
> >>> >
> >>> >
> >>> >
> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey
> >>> > wrote:
> >>> >
> >>> >> Hey Frank,
> >>> >>
> >>> >> How are special character values handled today? It is often not
> ideal
> >>> to
> >>> >> end up with column names that require double quotes to call due to
> >>> >> downstream scripts.
> >>> >>
> >>> >> A couple of features that would be useful
> >>> >>
> >>> >> * Option to define resulting column names. Please see pdltools
> >>> >> implementation - the ability to pass in a function is especially
> >>> useful (
> >>> >> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> >>> >> * Option to dummy code only the top n most frequently occurring
> >>> values in
> >>> >> any column
> >>> >> * Option to exclude original column from results table
> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
> >>> >> pivotcol_val2 ...) instead of values in column names + secondary
> >>> mapping
> >>> >> table
> >>> >>
> >>> >> Thank you
> >>> >>
> >>> >> Jarrod Vawdrey
> >>> >> Sr. Data Scientist
> >>> >> Data Science & Engineering | Pivotal
> >>> >> (650) 315-8905
> >>> >> https://pivotal.io/
> >>> >>
> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <
> >>> fmcquil...@pivotal.io>
> >>> >> wrote:
> >>> >>
> >>> >>> For the module encoding categorical variables
> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
> >>> >>> does anyone have any suggestions on improvements that we could
> make?
> >>> >>>
> >>> >>> Here is a video on how encoding categorical variables works for
> >>> those not
> >>> >>> familiar with it
> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL6
> >>>

Re: Encoding categorical variables

2016-10-28 Thread Rahul Iyer
An alternative to dropping is to assign the less frequent values to the
reference class, i.e., all one-hot encoded features will be 0 for those rows.
Also important to note: total runtime will increase with this option, since
we'll have to compute the exact frequency distribution.
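
A minimal Python sketch of this alternative, assuming a top-n cutoff.  The
function name encode_top_n is hypothetical.  The Counter pass over all values
is the exact frequency computation mentioned above, and values outside the
top n keep their row but get all-zero indicators.

```python
from collections import Counter

def encode_top_n(values, n):
    """One-hot encode only the n most frequent values; the rest become all 0."""
    counts = Counter(values)  # full pass: exact frequency distribution
    top = [value for value, _ in counts.most_common(n)]
    encoded = [[int(v == cat) for cat in top] for v in values]
    return encoded, top

encoded, top = encode_top_n(["x", "y", "x", "z", "x", "y"], n=2)
print(top)         # categories that received an indicator column
print(encoded[3])  # the rare value "z" falls into the reference: all zeros
```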

Another suggested change is to call this function 'one_hot_encoding' since
that is the output here (similar to sklearn's OneHotEncoder).
We can keep the current name as a deprecated alias till 2.0 is released.

On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan 
wrote:

> Jarrod,
>
> Just trying to write up detailed requirements.  How would you see this one
> working?
>
> "2) Option to dummy code only the top n most frequently occurring values in
> any column"
>
> With 1 column I can picture it, you would drop the rows with the less
> frequently occurring values and end up with a smaller table.  But what if
> you are encoding multiple columns?  Would you want a per-column specification
> of n? i.e., top 3 values for column x, top 10 values for column y?  If you
> did this then your result set might include low frequency values for column
> x (not in top 3) because they are in the top 10 for column y - this might
> be confusing.
>
> Frank
>
> On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan 
> wrote:
>
>> great, thanks for the additional information
>>
>> Frank
>>
>> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey 
>> wrote:
>>
>>> IMO
>>>
>>> 1) Option to define resulting column names. Please see pdltools
>>> implementation - the ability to pass in a function is especially useful (
>>> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>> 2) Option to dummy code only the top n most frequently occurring values
>>> in
>>> any column
>>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>>> pivotcol_val2
>>> ...) instead of values in column names + secondary mapping table
>>> 4) Option to exclude original column from results table
>>>
>>> (1) & (2) are much higher priority than (3) & (4).
>>>
>>> Agreed that these could also be applied to Pivoting (especially 1).
>>>
>>>
>>>
>>> Jarrod Vawdrey
>>> Sr. Data Scientist
>>> Data Science & Engineering | Pivotal
>>> (650) 315-8905
>>> https://pivotal.io/
>>>
>>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan 
>>> wrote:
>>>
>>> > Thanks for those suggestions, Jarrod.  They all sound pretty useful -
>>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
>>> order
>>> > of priority as you see it?
>>> >
>>> > Also it seems like some of these could be applied to the Pivot
>>> function as
>>> > well, e.g., UDF for column naming.
>>> >
>>> > Frank
>>> >
>>> >
>>> >
>>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey 
>>> > wrote:
>>> >
>>> >> Hey Frank,
>>> >>
>>> >> How are special character values handled today? It is often not ideal
>>> to
>>> >> end up with column names that require double quotes to call due to
>>> >> downstream scripts.
>>> >>
>>> >> A couple of features that would be useful
>>> >>
>>> >> * Option to define resulting column names. Please see pdltools
>>> >> implementation - the ability to pass in a function is especially
>>> useful (
>>> >> http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>> >> * Option to dummy code only the top n most frequently occurring
>>> values in
>>> >> any column
>>> >> * Option to exclude original column from results table
>>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>>> >> pivotcol_val2 ...) instead of values in column names + secondary
>>> mapping
>>> >> table
>>> >>
>>> >> Thank you
>>> >>
>>> >> Jarrod Vawdrey
>>> >> Sr. Data Scientist
>>> >> Data Science & Engineering | Pivotal
>>> >> (650) 315-8905
>>> >> https://pivotal.io/
>>> >>
>>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <
>>> fmcquil...@pivotal.io>
>>> >> wrote:
>>> >>
>>> >>> For the module encoding categorical variables
>>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__d
>>> >>> ata__prep.html
>>> >>> does anyone have any suggestions on improvements that we could make?
>>> >>>
>>> >>> Here is a video on how encoding categorical variables works for
>>> those not
>>> >>> familiar with it
>>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL6
>>> >>> 2pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>>> >>>
>>> >>
>>> >>
>>> >
>>>
>>
>>
>


Re: Encoding categorical variables

2016-10-28 Thread Frank McQuillan
Jarrod,

Just trying to write up detailed requirements.  How would you see this one
working?

"2) Option to dummy code only the top n most frequently occurring values in
any column"

With one column I can picture it: you would drop the rows with the less
frequently occurring values and end up with a smaller table.  But what if
you are encoding multiple columns?  Would you want a per-column specification
of n, e.g., top 3 values for column x and top 10 values for column y?  If you
did this, then your result set might still include low-frequency values for
column x (not in its top 3) because those rows hold a top-10 value for
column y.  This might be confusing.
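
To make the per-column case concrete, here is a rough sketch in plain Python
(not MADlib SQL).  It assumes a per-column n and, instead of dropping rows,
gives the less frequent values all-zero indicators, so the row count never
changes and the cross-column confusion above does not arise.  The function
name `top_n_one_hot` and the `{column: n}` argument are made up for
illustration only.

```python
from collections import Counter


def top_n_one_hot(rows, top_n):
    """Encode only the top-n values of each column; keep every row.

    rows  : list of dicts, one dict per table row
    top_n : hypothetical per-column spec, e.g. {"x": 3, "y": 10}
    """
    # One exact frequency count per column (this is the extra pass over the data).
    kept = {}
    for col, n in top_n.items():
        counts = Counter(row[col] for row in rows)
        kept[col] = [value for value, _ in counts.most_common(n)]

    encoded = []
    for row in rows:
        out = {}
        for col, values in kept.items():
            for value in values:
                # Values outside the top n simply get 0 in every indicator.
                out[f"{col}_{value}"] = 1 if row[col] == value else 0
        encoded.append(out)
    return encoded, kept


rows = [
    {"x": "a", "y": "p"},
    {"x": "a", "y": "q"},
    {"x": "a", "y": "p"},
    {"x": "b", "y": "q"},
    {"x": "b", "y": "p"},
    {"x": "c", "y": "r"},
]
encoded, kept = top_n_one_hot(rows, {"x": 1, "y": 2})
```

With this approach the last row (x = "c", y = "r") stays in the output with
every indicator set to 0, rather than being dropped.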

Frank



Re: Encoding categorical variables

2016-10-19 Thread Frank McQuillan
great, thanks for the additional information

Frank



Re: Encoding categorical variables

2016-10-19 Thread Jarrod Vawdrey
IMO

1) Option to define resulting column names. Please see pdltools
implementation - the ability to pass in a function is especially useful (
http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
2) Option to dummy code only the top n most frequently occurring values in
any column
3) Option to create numeric column names (E.g. pivotcol_val1, pivotcol_val2
...) instead of values in column names + secondary mapping table
4) Option to exclude original column from results table

(1) & (2) are much higher priority than (3) & (4).

Agreed that these could also be applied to Pivoting (especially 1).
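
As an illustration of (1), a user-supplied naming function could look roughly
like the Python sketch below.  It is only an assumption about what such a
function might do, not the pdltools or MADlib behavior: it folds the column
name and value into lower case and replaces anything outside letters, digits
and underscores, so the result never needs double quotes.  The name
`safe_column_name` and the "c_" prefix are hypothetical.

```python
import re


def safe_column_name(column, value):
    """Build an identifier that can be used without double quotes."""
    raw = f"{column}_{value}".lower()
    # Collapse every run of other characters into a single underscore.
    name = re.sub(r"[^a-z0-9_]+", "_", raw).strip("_")
    # An unquoted identifier cannot start with a digit (or be empty).
    if not name or name[0].isdigit():
        name = "c_" + name
    # PostgreSQL truncates identifiers longer than 63 bytes by default.
    return name[:63]
```

For example, `safe_column_name("color", "Light Blue")` gives
`color_light_blue`.  Two different values can map to the same name under a
scheme like this, so a real implementation would also need a collision check.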



Jarrod Vawdrey
Sr. Data Scientist
Data Science & Engineering | Pivotal
(650) 315-8905
https://pivotal.io/



Re: Encoding categorical variables

2016-10-19 Thread Frank McQuillan
Thanks for those suggestions, Jarrod.  They all sound pretty useful - would
you mind taking a crack at numbering them 1,2,3... etc, in the order of
priority as you see it?

Also it seems like some of these could be applied to the Pivot function as
well, e.g., UDF for column naming.

Frank





Re: Encoding categorical variables

2016-10-14 Thread Jarrod Vawdrey
Hey Frank,

How are values containing special characters handled today? It is often not
ideal to end up with column names that must be double-quoted, because
downstream scripts then have to quote them as well.

A couple of features that would be useful

* Option to define resulting column names. Please see pdltools
implementation - the ability to pass in a function is especially useful (
http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
* Option to dummy code only the top n most frequently occurring values in
any column
* Option to exclude original column from results table
* Option to create numeric column names (E.g. pivotcol_val1, pivotcol_val2
...) instead of values in column names + secondary mapping table
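
To show what the last two options might look like together, here is a small
Python sketch (an illustration only, not an existing MADlib or pdltools
interface).  It assumes numbered names of the form pivotcol_val1,
pivotcol_val2, ... plus a separate mapping from each name back to the
original value, and a flag for keeping or excluding the original column.
The names `build_mapping`, `encode` and `keep_original` are hypothetical.

```python
def build_mapping(rows, column):
    """Secondary mapping table: (numbered column name, original value)."""
    distinct = sorted({row[column] for row in rows})
    return [(f"{column}_val{i}", value) for i, value in enumerate(distinct, start=1)]


def encode(rows, column, mapping, keep_original=False):
    """Add one 0/1 indicator per mapped value; optionally drop the source column."""
    encoded = []
    for row in rows:
        if keep_original:
            out = dict(row)
        else:
            out = {k: v for k, v in row.items() if k != column}
        for name, value in mapping:
            out[name] = 1 if row[column] == value else 0
        encoded.append(out)
    return encoded


rows = [
    {"id": 1, "pivotcol": "N/A"},
    {"id": 2, "pivotcol": "light blue"},
    {"id": 3, "pivotcol": "N/A"},
]
mapping = build_mapping(rows, "pivotcol")
encoded = encode(rows, "pivotcol", mapping)
```

Because the generated names never contain the values themselves, special
characters such as "/" or spaces only ever appear in the mapping table.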

Thank you

Jarrod Vawdrey
Sr. Data Scientist
Data Science & Engineering | Pivotal
(650) 315-8905
https://pivotal.io/

On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan 
wrote:

> For the module encoding categorical variables
> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
> does anyone have any suggestions on improvements that we could make?
>
> Here is a video on how encoding categorical variables works for those not
> familiar with it
> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
>