[jira] [Updated] (FLINK-31189) Allow special handle of less frequent values in StringIndexer
[ https://issues.apache.org/jira/browse/FLINK-31189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated FLINK-31189:
-----------------------------------
    Labels: pull-request-available  (was: )

> Allow special handle of less frequent values in StringIndexer
> -------------------------------------------------------------
>
>                 Key: FLINK-31189
>                 URL: https://issues.apache.org/jira/browse/FLINK-31189
>             Project: Flink
>          Issue Type: Improvement
>          Components: Library / Machine Learning
>            Reporter: Fan Hong
>            Priority: Major
>              Labels: pull-request-available
>
> Real-world datasets often contain categorical features with millions of
> distinct values, some of which may only appear a few times. To maximize the
> performance of certain algorithms, it is important to treat these less
> frequent values properly. A popular approach is to put them to a special
> index, as is done in sklearn's OneHotEncoder [1].
>
> [1]
> https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
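To make the "special index" idea in the description concrete, here is a minimal sketch in plain Python. It is not Flink ML's StringIndexer API and not the implementation proposed in the pull request; the function names and the `min_frequency` parameter are hypothetical, loosely modeled on the option of the same name in sklearn's OneHotEncoder. It assumes that values seen fewer than `min_frequency` times, as well as values never seen during fitting, all share one extra index placed after the frequent ones.

```python
from collections import Counter
from typing import Dict, Iterable, List


def fit_string_index(values: Iterable[str], min_frequency: int) -> Dict[str, int]:
    """Assign indices 0..k-1 to values seen at least `min_frequency` times.

    Hypothetical helper: frequent values are ordered by descending count,
    with ties broken alphabetically so the result is deterministic.
    """
    counts = Counter(values)
    frequent = sorted(
        (v for v, c in counts.items() if c >= min_frequency),
        key=lambda v: (-counts[v], v),
    )
    return {v: i for i, v in enumerate(frequent)}


def transform(values: Iterable[str], index: Dict[str, int]) -> List[int]:
    """Map each value to its index.

    Infrequent or unseen values all share the single extra index len(index).
    """
    special = len(index)
    return [index.get(v, special) for v in values]


data = ["a", "b", "a", "c", "a", "b", "d"]
index = fit_string_index(data, min_frequency=2)
encoded = transform(data, index)
```

In this toy input, "a" and "b" meet the threshold and get their own indices, while "c" and "d" collapse into the shared index, which keeps the index space small when a column has a long tail of rare values.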
[jira] [Updated] (FLINK-31189) Allow special handle of less frequent values in StringIndexer
[ https://issues.apache.org/jira/browse/FLINK-31189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fan Hong updated FLINK-31189:
-----------------------------
    Description:
Real-world datasets often contain categorical features with millions of distinct values, some of which may only appear a few times. To maximize the performance of certain algorithms, it is important to treat these less frequent values properly. A popular approach is to put them to a special index, as is done in sklearn's OneHotEncoder [1].

[1] https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

  was:
In real-world datasets, categorical features may have millions of distinct values, while some of them may only occur few times. Special handling of less frequent values can bring performance increase in some algorithms.

One

> Allow special handle of less frequent values in StringIndexer
> -------------------------------------------------------------
>
>                 Key: FLINK-31189
>                 URL: https://issues.apache.org/jira/browse/FLINK-31189
>             Project: Flink
>          Issue Type: Improvement
>          Components: Library / Machine Learning
>            Reporter: Fan Hong
>            Priority: Major
>
> Real-world datasets often contain categorical features with millions of
> distinct values, some of which may only appear a few times. To maximize the
> performance of certain algorithms, it is important to treat these less
> frequent values properly. A popular approach is to put them to a special
> index, as is done in sklearn's OneHotEncoder [1].
>
> [1]
> https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
[jira] [Updated] (FLINK-31189) Allow special handle of less frequent values in StringIndexer
[ https://issues.apache.org/jira/browse/FLINK-31189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fan Hong updated FLINK-31189:
-----------------------------
    Summary: Allow special handle of less frequent values in StringIndexer  (was: Allow ignore less frequent values in StringIndexer)

> Allow special handle of less frequent values in StringIndexer
> -------------------------------------------------------------
>
>                 Key: FLINK-31189
>                 URL: https://issues.apache.org/jira/browse/FLINK-31189
>             Project: Flink
>          Issue Type: Improvement
>          Components: Library / Machine Learning
>            Reporter: Fan Hong
>            Priority: Major
>
> In real-world datasets, categorical features may have millions of distinct
> values, while some of them may only occur few times. Special handling of less
> frequent values can bring performance increase in some algorithms.
>
> One