+Vatsan for his thoughts as well!

On Fri, Oct 28, 2016 at 1:29 PM, Woo Jae Jung <wj...@pivotal.io> wrote:
> Also agree that double-quoted column names are not ideal. In addition to
> the net-new features described in this thread, it'd be nice to see
> non-double-quoted output as default behavior in the
> existing create_indicator_variables() function.
>
> Thanks,
> Woo
>
> On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung <wj...@pivotal.io> wrote:
>
>> I like the one-hot encoded feature. Another variant of this idea would
>> be an "all other" variable (distinct from the reference class) that
>> contains occurrences of the less frequent category types. In both of these
>> scenarios, the threshold for 'less frequent' could be user-supplied.
>>
>> Thanks,
>> Woo
>>
>> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <rahulri...@gmail.com>
>> wrote:
>>
>>> An alternative to dropping is to assign the less frequent values to the
>>> reference i.e. all one-hot encoded features will be 0.
>>> Also important to note: total runtime will increase with this option
>>> since we'll have to compute the exact frequency distribution.
>>>
>>> Another suggested change is to call this function 'one_hot_encoding'
>>> since that is the output here (similar to sklearn's OneHotEncoder
>>> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
>>> We can keep the current name as a deprecated alias till 2.0 is released.
>>>
>>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <fmcquil...@pivotal.io>
>>> wrote:
>>>
>>> > Jarrod,
>>> >
>>> > Just trying to write up detailed requirements. How would you see this
>>> > one working?
>>> >
>>> > "2) Option to dummy code only the top n most frequently occurring
>>> > values in any column"
>>> >
>>> > With 1 column I can picture it, you would drop the rows with the less
>>> > frequently occurring values and end up with a smaller table. But what
>>> > if you are encoding multiple rows? Would you want a per row
>>> > specification of n?
>>> > i.e., top 3 values for column x, top 10 values for column y? If you
>>> > did this then your result set might include low frequency values for
>>> > column x (not in top 3) because they are in the top 10 for column y -
>>> > this might be confusing.
>>> >
>>> > Frank
>>> >
>>> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fmcquil...@pivotal.io>
>>> > wrote:
>>> >
>>> >> great, thanks for the additional information
>>> >>
>>> >> Frank
>>> >>
>>> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jvawd...@pivotal.io>
>>> >> wrote:
>>> >>
>>> >>> IMO
>>> >>>
>>> >>> 1) Option to define resulting column names. Please see pdltools
>>> >>> implementation - the ability to pass in a function is especially
>>> >>> useful
>>> >>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>> >>> 2) Option to dummy code only the top n most frequently occurring
>>> >>> values in any column
>>> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>>> >>> pivotcol_val2 ...) instead of values in column names + secondary
>>> >>> mapping table
>>> >>> 4) Option to exclude original column from results table
>>> >>>
>>> >>> (1) & (2) are much higher priority than (3) & (4).
>>> >>>
>>> >>> Agreed that these could also be applied to Pivoting (especially 1).
>>> >>>
>>> >>> Jarrod Vawdrey
>>> >>> Sr. Data Scientist
>>> >>> Data Science & Engineering | Pivotal
>>> >>> (650) 315-8905
>>> >>> https://pivotal.io/
>>> >>>
>>> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan
>>> >>> <fmcquil...@pivotal.io> wrote:
>>> >>>
>>> >>> > Thanks for those suggestions, Jarrod. They all sound pretty
>>> >>> > useful - would you mind taking a crack at numbering them
>>> >>> > 1,2,3... etc, in the order of priority as you see it?
>>> >>> >
>>> >>> > Also it seems like some of these could be applied to the Pivot
>>> >>> > function as well, e.g., UDF for column naming.
>>> >>> >
>>> >>> > Frank
>>> >>> >
>>> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey
>>> >>> > <jvawd...@pivotal.io> wrote:
>>> >>> >
>>> >>> >> Hey Frank,
>>> >>> >>
>>> >>> >> How are special character values handled today? It is often not
>>> >>> >> ideal to end up with column names that require double quotes to
>>> >>> >> call due to downstream scripts.
>>> >>> >>
>>> >>> >> A couple of features that would be useful
>>> >>> >>
>>> >>> >> * Option to define resulting column names. Please see pdltools
>>> >>> >> implementation - the ability to pass in a function is especially
>>> >>> >> useful
>>> >>> >> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>> >>> >> * Option to dummy code only the top n most frequently occurring
>>> >>> >> values in any column
>>> >>> >> * Option to exclude original column from results table
>>> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>>> >>> >> pivotcol_val2 ...) instead of values in column names + secondary
>>> >>> >> mapping table
>>> >>> >>
>>> >>> >> Thank you
>>> >>> >>
>>> >>> >> Jarrod Vawdrey
>>> >>> >> Sr. Data Scientist
>>> >>> >> Data Science & Engineering | Pivotal
>>> >>> >> (650) 315-8905
>>> >>> >> https://pivotal.io/
>>> >>> >>
>>> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan
>>> >>> >> <fmcquil...@pivotal.io> wrote:
>>> >>> >>
>>> >>> >>> For the module encoding categorical variables
>>> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>>> >>> >>> does anyone have any suggestions on improvements that we could
>>> >>> >>> make?
>>> >>> >>>
>>> >>> >>> Here is a video on how encoding categorical variables works for
>>> >>> >>> those not familiar with it
>>> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
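
To make the options discussed in this thread concrete, here is a rough sketch in plain Python of how "top n" encoding could behave for a single column. It is only an illustration under assumptions, not MADlib code: the names one_hot_top_n, safe_name, the rare argument and the "_all_other" suffix are all made up for this example. It shows the three proposals for less frequent values (drop the rows, fold them into the reference so every indicator is 0, or flag them in a separate "all other" column) and generates column names that would not need double quotes.

```python
import re
from collections import Counter


def safe_name(column, value):
    """Build a lowercase name that should not need double quotes in SQL.

    Hypothetical helper: runs of characters outside [a-z0-9_] become "_".
    Two values can still collide after cleaning; that is not handled here.
    """
    cleaned = re.sub(r"[^a-z0-9_]+", "_", str(value).lower()).strip("_")
    return f"{column}_{cleaned}"


def one_hot_top_n(rows, column, top_n, rare="reference"):
    """One-hot encode only the top_n most frequent values of one column.

    rows is a list of dicts. rare controls the less frequent values:
      "reference": keep the row, all indicator columns are 0
      "other":     keep the row, set an extra <column>_all_other flag
      "drop":      leave the row out of the result
    """
    if rare not in ("reference", "other", "drop"):
        raise ValueError(f"unknown rare option: {rare!r}")

    # The exact frequency distribution is needed, which costs a full pass.
    counts = Counter(row[column] for row in rows)
    # Sort by descending count, then by value so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    names = {value: safe_name(column, value) for value, _ in ranked[:top_n]}
    other_name = f"{column}_all_other"

    encoded = []
    for row in rows:
        value = row[column]
        if rare == "drop" and value not in names:
            continue
        out = dict(row)  # the original column is kept in this sketch
        for kept_value, name in names.items():
            out[name] = int(value == kept_value)
        if rare == "other":
            out[other_name] = int(value not in names)
        encoded.append(out)
    return encoded
```

Because the function handles one column per call, a per-column n falls out naturally (call it once per column with a different top_n). With rare="drop" on several columns, though, rows removed for one column disappear for all of them, which is one reason the "reference" or "other" behavior may be easier to reason about.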