[ 
https://issues.apache.org/jira/browse/SPARK-20619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang updated SPARK-20619:
--------------------------------
    Description: 
StringIndexer maps labels to numbers in descending order of label 
frequency. Other orderings (e.g., alphabetical) may be needed in 
feature ETL, because the ordering affects the results of one-hot 
encoding and RFormula. This issue proposes adding a parameter, 
stringOrderType, that supports the following four options:

   - 'freq_desc': descending order by label frequency (most frequent label 
     assigned 0)
   - 'freq_asc': ascending order by label frequency (least frequent label 
     assigned 0)
   - 'alphabet_desc': descending alphabetical order
   - 'alphabet_asc': ascending alphabetical order
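
To make the four options concrete, here is a minimal plain-Python sketch of 
the orderings. It is an illustration only, not Spark's implementation: the 
helper names order_labels and index_labels are hypothetical, and breaking 
frequency ties alphabetically is an assumption, since the description does 
not say how ties are resolved.

```python
from collections import Counter


def order_labels(labels, string_order_type="freq_desc"):
    """Return the distinct labels in index order (position i receives index i).

    Hypothetical helper for illustration; option names follow the proposal.
    """
    counts = Counter(labels)
    if string_order_type == "freq_desc":
        # Most frequent label first; ties broken alphabetically (assumption).
        return sorted(counts, key=lambda s: (-counts[s], s))
    if string_order_type == "freq_asc":
        # Least frequent label first; ties broken alphabetically (assumption).
        return sorted(counts, key=lambda s: (counts[s], s))
    if string_order_type == "alphabet_desc":
        return sorted(counts, reverse=True)
    if string_order_type == "alphabet_asc":
        return sorted(counts)
    raise ValueError(f"unsupported stringOrderType: {string_order_type}")


def index_labels(labels, string_order_type="freq_desc"):
    """Map each distinct label to its numeric index under the chosen ordering."""
    ordered = order_labels(labels, string_order_type)
    return {label: index for index, label in enumerate(ordered)}


if __name__ == "__main__":
    data = ["a", "b", "c", "a", "a", "c"]  # a appears 3 times, c twice, b once
    for option in ("freq_desc", "freq_asc", "alphabet_desc", "alphabet_asc"):
        print(option, index_labels(data, option))
```

With this sample column, 'freq_desc' assigns 0 to "a" while 'freq_asc' 
assigns 0 to "b", which is why the choice matters for anything downstream 
that treats one index as the reference category.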

  was:
StringIndexer maps labels to numbers according to the descending order of label 
frequency. Other types of ordering (e.g., alphabetical) may be needed in 
feature ETL, for example, in one-hot encoding. Propose to support alphabetic 
order, and ascending order of label frequency. For example, add a parameter 
stringOrderType to control how string is ordered which supports four options:

   - 'freq_desc': descending order by label frequency (most frequent label 
assigned 0)
   - 'freq_asc': ascending order by label frequency (least frequent label 
assigned 0)
   - 'alphabet_desc': descending alphabetical order
   - 'alphabet_asc': ascending alphabetical order


> StringIndexer supports multiple ways of label ordering
> ------------------------------------------------------
>
>                 Key: SPARK-20619
>                 URL: https://issues.apache.org/jira/browse/SPARK-20619
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Wayne Zhang
>
> StringIndexer maps labels to numbers in descending order of label 
> frequency. Other orderings (e.g., alphabetical) may be needed in feature 
> ETL, because the ordering affects the results of one-hot encoding and 
> RFormula. This issue proposes adding a parameter, stringOrderType, that 
> supports the following four options:
>    - 'freq_desc': descending order by label frequency (most frequent label 
> assigned 0)
>    - 'freq_asc': ascending order by label frequency (least frequent label 
> assigned 0)
>    - 'alphabet_desc': descending alphabetical order
>    - 'alphabet_asc': ascending alphabetical order


