[jira] [Updated] (FLINK-31189) Allow special handle of less frequent values in StringIndexer

2023-03-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-31189:
---
Labels: pull-request-available  (was: )

> Allow special handle of less frequent values in StringIndexer
> -
>
> Key: FLINK-31189
> URL: https://issues.apache.org/jira/browse/FLINK-31189
> Project: Flink
> Issue Type: Improvement
> Components: Library / Machine Learning
> Reporter: Fan Hong
> Priority: Major
> Labels: pull-request-available
>
> Real-world datasets often contain categorical features with millions of 
> distinct values, some of which may only appear a few times. To maximize the 
> performance of certain algorithms, it is important to treat these less 
> frequent values properly. A popular approach is to map them all to a single 
> special index, as is done in sklearn's OneHotEncoder [1].
>  
> [1] 
> https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
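
The approach described above can be illustrated with a minimal sketch in plain
Python. This is not the Flink ML StringIndexer implementation or its API: the
function name build_index, the min_frequency threshold (named after the
similar OneHotEncoder parameter in sklearn), and the choice to order frequent
values by descending count are all assumptions made for illustration.

```python
from collections import Counter


def build_index(values, min_frequency):
    """Map each distinct string to an integer index.

    Hypothetical helper: values seen fewer than `min_frequency` times
    all share one extra index instead of getting an index each.
    """
    counts = Counter(values)
    # Frequent values get their own index, most frequent first
    # (ties broken alphabetically so the result is deterministic).
    frequent = sorted(
        (v for v, c in counts.items() if c >= min_frequency),
        key=lambda v: (-counts[v], v),
    )
    index = {v: i for i, v in enumerate(frequent)}
    # Every remaining value collapses into a single shared index.
    infrequent_index = len(frequent)
    for v in counts:
        index.setdefault(v, infrequent_index)
    return index, infrequent_index


colors = ["red", "red", "red", "blue", "blue", "teal", "plum"]
index, infrequent_index = build_index(colors, min_frequency=2)
encoded = [index[c] for c in colors]
print(index)    # {'red': 0, 'blue': 1, 'teal': 2, 'plum': 2}
print(encoded)  # [0, 0, 0, 1, 1, 2, 2]
```

With this scheme the number of distinct indices depends on the number of
frequent values rather than on the full vocabulary, which is the property the
issue is asking for.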



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-31189) Allow special handle of less frequent values in StringIndexer

2023-02-22 Thread Fan Hong (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Hong updated FLINK-31189:
-
Description: 
Real-world datasets often contain categorical features with millions of 
distinct values, some of which may only appear a few times. To maximize the 
performance of certain algorithms, it is important to treat these less frequent 
values properly. A popular approach is to map them all to a single special 
index, as is done in sklearn's OneHotEncoder [1].

 

[1] 
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

  was:
In real-world datasets, categorical features may have millions of distinct 
values, while some of them may only occur few times. Special handling of less 
frequent values can bring performance increase in some algorithms.

 

One  


> Allow special handle of less frequent values in StringIndexer
> -
>
> Key: FLINK-31189
> URL: https://issues.apache.org/jira/browse/FLINK-31189
> Project: Flink
> Issue Type: Improvement
> Components: Library / Machine Learning
> Reporter: Fan Hong
> Priority: Major
>
> Real-world datasets often contain categorical features with millions of 
> distinct values, some of which may only appear a few times. To maximize the 
> performance of certain algorithms, it is important to treat these less 
> frequent values properly. A popular approach is to map them all to a single 
> special index, as is done in sklearn's OneHotEncoder [1].
>  
> [1] 
> https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html





[jira] [Updated] (FLINK-31189) Allow special handle of less frequent values in StringIndexer

2023-02-22 Thread Fan Hong (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Hong updated FLINK-31189:
-
Summary: Allow special handle of less frequent values in StringIndexer  
(was: Allow ignore less frequent values in StringIndexer)

> Allow special handle of less frequent values in StringIndexer
> -
>
> Key: FLINK-31189
> URL: https://issues.apache.org/jira/browse/FLINK-31189
> Project: Flink
> Issue Type: Improvement
> Components: Library / Machine Learning
> Reporter: Fan Hong
> Priority: Major
>
> In real-world datasets, categorical features may have millions of distinct 
> values, while some of them may only occur few times. Special handling of less 
> frequent values can bring performance increase in some algorithms.
>  
> One  


