[jira] [Assigned] (SPARK-16957) Use weighted midpoints for split values.

2017-05-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-16957:
-

Assignee: Yan Facai (颜发才)

> Use weighted midpoints for split values.
> 
>
> Key: SPARK-16957
> URL: https://issues.apache.org/jira/browse/SPARK-16957
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Vladimir Feinberg
> Assignee: Yan Facai (颜发才)
> Priority: Trivial
> Fix For: 2.3.0
>
>
> We should use weighted split points rather than the actual continuous 
> binned feature values. For instance, in a dataset containing binary features 
> (fed in as continuous ones), our splits are selected as {{x <= 0.0}} 
> and {{x > 0.0}}. For any real data with some smoothness qualities, this is 
> asymptotically bad compared to the approach of R's gbm. The split point 
> should be a weighted midpoint of the values of the two "innermost" feature 
> bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split 
> should be at {{0.75}}.
> Example:
> {code}
> +--------+--------+-----+-----+
> |feature0|feature1|label|count|
> +--------+--------+-----+-----+
> |     0.0|     0.0|  0.0|   23|
> |     1.0|     0.0|  0.0|    2|
> |     0.0|     0.0|  1.0|    2|
> |     0.0|     1.0|  0.0|    7|
> |     1.0|     0.0|  1.0|   23|
> |     0.0|     1.0|  1.0|   18|
> |     1.0|     1.0|  1.0|    7|
> |     1.0|     1.0|  0.0|   18|
> +--------+--------+-----+-----+
> DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
>   If (feature 0 <= 0.0)
>    If (feature 1 <= 0.0)
>     Predict: -0.56
>    Else (feature 1 > 0.0)
>     Predict: 0.29333
>   Else (feature 0 > 0.0)
>    If (feature 1 <= 0.0)
>     Predict: 0.56
>    Else (feature 1 > 0.0)
>     Predict: -0.29333
> {code}
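
To make the arithmetic in the description concrete, here is a minimal sketch of one reading of "weighted split point" that reproduces the quoted 0.75 figure: the threshold is placed between the two adjacent bin values, at a fraction of the gap equal to the left bin's share of the samples. The function name `weighted_split` and its parameters are hypothetical and are not part of the Spark or gbm API. The formula is an assumption chosen to match the example in the issue, not a confirmed description of either implementation.

```python
def weighted_split(left_value, left_count, right_value, right_count):
    """Hypothetical helper: threshold between two adjacent bin values.

    Assumption: the threshold sits at a fraction of the gap equal to the
    left bin's share of the samples, which matches the issue's example.
    """
    total = left_count + right_count
    if total <= 0:
        raise ValueError("at least one sample is required")
    fraction = left_count / total
    return left_value + (right_value - left_value) * fraction


# The issue's example: 30 samples at x = 0 and 10 samples at x = 1.
print(weighted_split(0.0, 30, 1.0, 10))  # 0.75

# With equal counts this reduces to the plain midpoint.
print(weighted_split(0.0, 10, 1.0, 10))  # 0.5
```

Under this reading, a threshold of 0.0 on a binary feature would move to 0.75, so unseen values between 0 and 0.75 would fall on the same side as the more populous {{x = 0}} bin.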



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16957) Use weighted midpoints for split values.

2017-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16957:


Assignee: Apache Spark

> Use weighted midpoints for split values.
> 
>
> Key: SPARK-16957
> URL: https://issues.apache.org/jira/browse/SPARK-16957
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Vladimir Feinberg
> Assignee: Apache Spark
> Priority: Trivial
>
> Just like R's gbm, we should be using weighted split points rather than the 
> actual continuous binned feature values. For instance, in a dataset 
> containing binary features (that are fed in as continuous ones), our splits 
> are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some 
> smoothness qualities, this is asymptotically bad compared to GBM's approach. 
> The split point should be a weighted midpoint of the values of the two 
> "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, 
> the above split should be at {{0.75}}.
> Example:
> {code}
> +--------+--------+-----+-----+
> |feature0|feature1|label|count|
> +--------+--------+-----+-----+
> |     0.0|     0.0|  0.0|   23|
> |     1.0|     0.0|  0.0|    2|
> |     0.0|     0.0|  1.0|    2|
> |     0.0|     1.0|  0.0|    7|
> |     1.0|     0.0|  1.0|   23|
> |     0.0|     1.0|  1.0|   18|
> |     1.0|     1.0|  1.0|    7|
> |     1.0|     1.0|  0.0|   18|
> +--------+--------+-----+-----+
> DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
>   If (feature 0 <= 0.0)
>    If (feature 1 <= 0.0)
>     Predict: -0.56
>    Else (feature 1 > 0.0)
>     Predict: 0.29333
>   Else (feature 0 > 0.0)
>    If (feature 1 <= 0.0)
>     Predict: 0.56
>    Else (feature 1 > 0.0)
>     Predict: -0.29333
> {code}






[jira] [Assigned] (SPARK-16957) Use weighted midpoints for split values.

2017-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16957:


Assignee: (was: Apache Spark)

> Use weighted midpoints for split values.
> 
>
> Key: SPARK-16957
> URL: https://issues.apache.org/jira/browse/SPARK-16957
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Vladimir Feinberg
> Priority: Trivial
>
> Just like R's gbm, we should be using weighted split points rather than the 
> actual continuous binned feature values. For instance, in a dataset 
> containing binary features (that are fed in as continuous ones), our splits 
> are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some 
> smoothness qualities, this is asymptotically bad compared to GBM's approach. 
> The split point should be a weighted midpoint of the values of the two 
> "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, 
> the above split should be at {{0.75}}.
> Example:
> {code}
> +--------+--------+-----+-----+
> |feature0|feature1|label|count|
> +--------+--------+-----+-----+
> |     0.0|     0.0|  0.0|   23|
> |     1.0|     0.0|  0.0|    2|
> |     0.0|     0.0|  1.0|    2|
> |     0.0|     1.0|  0.0|    7|
> |     1.0|     0.0|  1.0|   23|
> |     0.0|     1.0|  1.0|   18|
> |     1.0|     1.0|  1.0|    7|
> |     1.0|     1.0|  0.0|   18|
> +--------+--------+-----+-----+
> DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
>   If (feature 0 <= 0.0)
>    If (feature 1 <= 0.0)
>     Predict: -0.56
>    Else (feature 1 > 0.0)
>     Predict: 0.29333
>   Else (feature 0 > 0.0)
>    If (feature 1 <= 0.0)
>     Predict: 0.56
>    Else (feature 1 > 0.0)
>     Predict: -0.29333
> {code}


