[jira] [Created] (SPARK-29566) Imputer should support single-column input/output

2019-10-23 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29566:


 Summary: Imputer should support single-column input/output
 Key: SPARK-29566
 URL: https://issues.apache.org/jira/browse/SPARK-29566
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


Imputer should support single-column input/output
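
A minimal PySpark sketch of the intended usage, assuming Imputer gains 
single-column {{inputCol}}/{{outputCol}} params alongside the existing 
{{inputCols}}/{{outputCols}} (the single-column param names are an assumption 
here, not the final API):
{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.feature import Imputer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (float("nan"),)], ["value"])

# Hypothetical single-column params, mirroring the existing multi-column API.
imputer = Imputer(strategy="mean", inputCol="value", outputCol="value_imputed")
model = imputer.fit(df)
model.transform(df).show()
{code}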



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29565) OneHotEncoder should support single-column input/output

2019-10-23 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29565:


 Summary: OneHotEncoder should support single-column input/output
 Key: SPARK-29565
 URL: https://issues.apache.org/jira/browse/SPARK-29565
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


Current feature algorithms 
({color:#5a6e5a}QuantileDiscretizer/Binarizer/Bucketizer/StringIndexer{color}) 
are designed to support both single-column & multi-column input/output.

There are already some internal utils (like 
{color:#c7a65d}checkSingleVsMultiColumnParams{color}) for this.

For OneHotEncoder, it is reasonable to support single-column input/output as well.
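
A rough sketch of the desired single-column usage, assuming OneHotEncoder gains 
{{inputCol}}/{{outputCol}} params handled via the same single-vs-multi column 
checks as Bucketizer/StringIndexer (the param names are assumptions here):
{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.0,), (1.0,), (2.0,)], ["category"])

# Hypothetical single-column params; the existing multi-column form would
# remain OneHotEncoder(inputCols=[...], outputCols=[...]).
encoder = OneHotEncoder(inputCol="category", outputCol="category_vec")
model = encoder.fit(df)
model.transform(df).show()
{code}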

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29093) Remove automatically generated param setters in _shared_params_code_gen.py

2019-10-23 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-29093:


Assignee: Huaxin Gao

> Remove automatically generated param setters in _shared_params_code_gen.py
> --
>
> Key: SPARK-29093
> URL: https://issues.apache.org/jira/browse/SPARK-29093
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Major
>
> The main differences between the Scala and Python sides come from the 
> automatically generated param setters in _shared_params_code_gen.py.
> To keep them in sync, we should remove those setters from _shared_.py and add 
> the corresponding setters manually.
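
As a rough illustration of the direction described above (the class below and 
the choice of param are only examples, not the actual patch), a hand-written 
setter on a concrete class might look like:
{code:python}
from pyspark.ml.param.shared import HasMaxIter

class MyEstimatorLike(HasMaxIter):
    """Toy class: the shared mixin keeps the param and getter, while the
    setter is written out by hand on the concrete class, as on the Scala side."""

    def setMaxIter(self, value):
        """Sets the value of :py:attr:`maxIter`."""
        return self._set(maxIter=value)

est = MyEstimatorLike()
print(est.setMaxIter(10).getMaxIter())  # 10
{code}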



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29093) Remove automatically generated param setters in _shared_params_code_gen.py

2019-10-23 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957601#comment-16957601
 ] 

zhengruifeng commented on SPARK-29093:
--

[~huaxingao] Thanks!

> Remove automatically generated param setters in _shared_params_code_gen.py
> --
>
> Key: SPARK-29093
> URL: https://issues.apache.org/jira/browse/SPARK-29093
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> The main differences between the Scala and Python sides come from the 
> automatically generated param setters in _shared_params_code_gen.py.
> To keep them in sync, we should remove those setters from _shared_.py and add 
> the corresponding setters manually.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29232) RandomForestRegressionModel does not update the parameter maps of the DecisionTreeRegressionModels underneath

2019-10-22 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-29232:


Assignee: Huaxin Gao

> RandomForestRegressionModel does not update the parameter maps of the 
> DecisionTreeRegressionModels underneath
> -
>
> Key: SPARK-29232
> URL: https://issues.apache.org/jira/browse/SPARK-29232
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Jiaqi Guo
>Assignee: Huaxin Gao
>Priority: Major
>
> We trained a RandomForestRegressionModel and tried to access the trees. Even 
> though each DecisionTreeRegressionModel is correctly built with the proper 
> parameters from the random forest, its parameter map is not updated and still 
> contains only the default values.
> For example, if a RandomForestRegressor was trained with a maxDepth of 12, 
> extractParamMap on a tree still returns the default values, with maxDepth=5. 
> Calling depth on the DecisionTreeRegressionModel itself returns the correct 
> value of 12, though.
> This creates issues when we want to access each individual tree and build the 
> trees back up for the random forest estimator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29232) RandomForestRegressionModel does not update the parameter maps of the DecisionTreeRegressionModels underneath

2019-10-22 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-29232.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26154
[https://github.com/apache/spark/pull/26154]

> RandomForestRegressionModel does not update the parameter maps of the 
> DecisionTreeRegressionModels underneath
> -
>
> Key: SPARK-29232
> URL: https://issues.apache.org/jira/browse/SPARK-29232
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Jiaqi Guo
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> We trained a RandomForestRegressionModel and tried to access the trees. Even 
> though each DecisionTreeRegressionModel is correctly built with the proper 
> parameters from the random forest, its parameter map is not updated and still 
> contains only the default values.
> For example, if a RandomForestRegressor was trained with a maxDepth of 12, 
> extractParamMap on a tree still returns the default values, with maxDepth=5. 
> Calling depth on the DecisionTreeRegressionModel itself returns the correct 
> value of 12, though.
> This creates issues when we want to access each individual tree and build the 
> trees back up for the random forest estimator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29489) ml.evaluation support log-loss

2019-10-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-29489:


Assignee: zhengruifeng

> ml.evaluation support log-loss
> --
>
> Key: SPARK-29489
> URL: https://issues.apache.org/jira/browse/SPARK-29489
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> {color:#5a6e5a}log-loss (aka logistic loss or cross-entropy loss) is one of 
> the most widely used metrics in classification tasks. It is already 
> implemented in popular libraries such as scikit-learn.
> {color}
> {color:#5a6e5a}However, it is still missing in ml.evaluation.
> {color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29489) ml.evaluation support log-loss

2019-10-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-29489.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26135
[https://github.com/apache/spark/pull/26135]

> ml.evaluation support log-loss
> --
>
> Key: SPARK-29489
> URL: https://issues.apache.org/jira/browse/SPARK-29489
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.0.0
>
>
> {color:#5a6e5a}log-loss (aka logistic loss or cross-entropy loss) is one of 
> the most widely used metrics in classification tasks. It is already 
> implemented in popular libraries such as scikit-learn.
> {color}
> {color:#5a6e5a}However, it is still missing in ml.evaluation.
> {color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23578) Add multicolumn support for Binarizer

2019-10-16 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-23578:


Assignee: zhengruifeng

> Add multicolumn support for Binarizer
> -
>
> Key: SPARK-23578
> URL: https://issues.apache.org/jira/browse/SPARK-23578
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Teng Peng
>Assignee: zhengruifeng
>Priority: Minor
>
> [SPARK-20542] added an API so that Bucketizer can bin multiple columns. 
> Based on this change, multicolumn support is added for Binarizer.
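
A short sketch of the intended multi-column usage, assuming Binarizer follows 
the Bucketizer precedent with {{inputCols}}/{{outputCols}}/{{thresholds}} 
params (the param names are assumed from SPARK-20542, not confirmed here):
{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.feature import Binarizer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.1, 0.8), (0.6, 0.3)], ["f1", "f2"])

# Assumed multi-column params, mirroring Bucketizer's multi-column API.
binarizer = Binarizer(thresholds=[0.5, 0.5],
                      inputCols=["f1", "f2"],
                      outputCols=["f1_bin", "f2_bin"])
binarizer.transform(df).show()
{code}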



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23578) Add multicolumn support for Binarizer

2019-10-16 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-23578.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26064
[https://github.com/apache/spark/pull/26064]

> Add multicolumn support for Binarizer
> -
>
> Key: SPARK-23578
> URL: https://issues.apache.org/jira/browse/SPARK-23578
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Teng Peng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> [SPARK-20542] added an API so that Bucketizer can bin multiple columns. 
> Based on this change, multicolumn support is added for Binarizer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29489) ml.evaluation support log-loss

2019-10-16 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29489:


 Summary: ml.evaluation support log-loss
 Key: SPARK-29489
 URL: https://issues.apache.org/jira/browse/SPARK-29489
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


{color:#5a6e5a}log-loss (aka logistic loss or cross-entropy loss) is one of the 
most widely used metrics in classification tasks. It is already implemented in 
popular libraries such as scikit-learn.
{color}

{color:#5a6e5a}However, it is still missing in ml.evaluation.
{color}
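
For reference, a minimal NumPy sketch of the metric itself (the usual 
definition, matching e.g. sklearn.metrics.log_loss up to label-encoding 
details); how it is exposed in ml.evaluation is left to the eventual PR:
{code:python}
import numpy as np

def log_loss(labels, probabilities, eps=1e-15):
    """Mean negative log-likelihood of the true class."""
    probs = np.clip(np.asarray(probabilities, dtype=float), eps, 1 - eps)
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class
    return float(-np.mean(np.log(picked)))

# Two samples, three classes; true labels are 0 and 2.
print(log_loss([0, 2], [[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]]))  # ~0.434
{code}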



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29381) Add 'private' _XXXParams classes for classification & regression

2019-10-15 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951657#comment-16951657
 ] 

zhengruifeng commented on SPARK-29381:
--

[~huaxingao]  Hi, I think we need another PR to add 'private' classes like 
'_LinearSVCParams'/'_LinearRegressionParams'. Sorry for the late response.
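
A rough sketch of the intended pattern (the class name and the mixins chosen 
below are illustrative, not the actual patch): a 'private' params mixin shared 
by the estimator and its model, mirroring the Scala side.
{code:python}
from pyspark.ml.param.shared import HasMaxIter, HasRegParam, HasFitIntercept
from pyspark.ml.wrapper import JavaEstimator, JavaModel

class _LinearSVCParams(HasRegParam, HasMaxIter, HasFitIntercept):
    """Params shared by LinearSVC and LinearSVCModel (illustrative subset)."""

class LinearSVC(JavaEstimator, _LinearSVCParams):
    ...

class LinearSVCModel(JavaModel, _LinearSVCParams):
    ...
{code}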

> Add 'private' _XXXParams classes for classification & regression
> 
>
> Key: SPARK-29381
> URL: https://issues.apache.org/jira/browse/SPARK-29381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> ping [~huaxingao]  would you like to work on this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29377) parity between scala ml tuning and python ml tuning

2019-10-14 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-29377:


Assignee: Huaxin Gao

> parity between scala ml tuning and python ml tuning
> ---
>
> Key: SPARK-29377
> URL: https://issues.apache.org/jira/browse/SPARK-29377
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29377) parity between scala ml tuning and python ml tuning

2019-10-14 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-29377.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26057
[https://github.com/apache/spark/pull/26057]

> parity between scala ml tuning and python ml tuning
> ---
>
> Key: SPARK-29377
> URL: https://issues.apache.org/jira/browse/SPARK-29377
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29380) RFormula avoid repeated 'first' jobs to get vector size

2019-10-12 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-29380:


Assignee: zhengruifeng

> RFormula avoid repeated 'first' jobs to get vector size
> ---
>
> Key: SPARK-29380
> URL: https://issues.apache.org/jira/browse/SPARK-29380
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> In the current impl, {{RFormula}} will trigger one {{first}} job to get the 
> vector size if the size cannot be obtained from the {{AttributeGroup}}.
> This can be optimized by getting the first row lazily and reusing it for each 
> vector column.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29380) RFormula avoid repeated 'first' jobs to get vector size

2019-10-12 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-29380.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26052
[https://github.com/apache/spark/pull/26052]

> RFormula avoid repeated 'first' jobs to get vector size
> ---
>
> Key: SPARK-29380
> URL: https://issues.apache.org/jira/browse/SPARK-29380
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> In the current impl, {{RFormula}} will trigger one {{first}} job to get the 
> vector size if the size cannot be obtained from the {{AttributeGroup}}.
> This can be optimized by getting the first row lazily and reusing it for each 
> vector column.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29116) Refactor py classes related to DecisionTree

2019-10-12 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-29116.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25929
[https://github.com/apache/spark/pull/25929]

> Refactor py classes related to DecisionTree
> ---
>
> Key: SPARK-29116
> URL: https://issues.apache.org/jira/browse/SPARK-29116
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> 1, Like the Scala side, move related classes to a separate file 'tree.py'
> 2, add method 'predictLeaf' in 'DecisionTreeModel' & 'TreeEnsembleModel'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29116) Refactor py classes related to DecisionTree

2019-10-12 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-29116:


Assignee: Huaxin Gao

> Refactor py classes related to DecisionTree
> ---
>
> Key: SPARK-29116
> URL: https://issues.apache.org/jira/browse/SPARK-29116
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
>
> 1, Like the Scala side, move related classes to a separate file 'tree.py'
> 2, add method 'predictLeaf' in 'DecisionTreeModel' & 'TreeEnsembleModel'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29381) Add 'private' _XXXParams classes for classification & regression

2019-10-08 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29381:


 Summary: Add 'private' _XXXParams classes for classification & 
regression
 Key: SPARK-29381
 URL: https://issues.apache.org/jira/browse/SPARK-29381
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


ping [~huaxingao]  would you like to work on this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29380) RFormula avoid repeated 'first' jobs to get vector size

2019-10-08 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29380:


 Summary: RFormula avoid repeated 'first' jobs to get vector size
 Key: SPARK-29380
 URL: https://issues.apache.org/jira/browse/SPARK-29380
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


In the current impl, {{RFormula}} will trigger one {{first}} job to get the vector 
size if the size cannot be obtained from the {{AttributeGroup}}.

This can be optimized by getting the first row lazily and reusing it for each 
vector column.
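
A small Python sketch of the idea (the actual change is in the Scala 
{{RFormula}} code; this is only illustrative): run {{first}} at most once and 
reuse the row for every vector column whose metadata lacks a size.
{code:python}
class FirstRowCache:
    """Runs df.first() at most once and reuses the row to infer the sizes of
    all vector columns whose AttributeGroup does not carry a size."""

    def __init__(self, df):
        self._df = df
        self._row = None

    def vector_size(self, col_name, metadata_size=None):
        if metadata_size is not None:   # size already known from AttributeGroup
            return metadata_size
        if self._row is None:           # the single shared 'first' job
            self._row = self._df.first()
        return self._row[col_name].size  # ml.linalg vectors expose .size
{code}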



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend

2019-10-08 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946583#comment-16946583
 ] 

zhengruifeng commented on SPARK-29212:
--

[~zero323]  ??we should remove Java specific mixins, if they don't serve any 
practical value (provide no implementation whatsoever or don't extend other 
{{Java*}} mixins, like {{JavaPredictorParams}}, or have no JVM wrapper specific 
implementation, like {{JavaPredictor}}).??

I am neutral on it; what are your thoughts? [~huaxingao] [~srowen]

 

??As of the second point there is additional consideration here - some 
{{Java*}} classes are considered part of the public API, and this should stay 
as is (these provide crucial information to the end user). ??

I guess we have reached an agreement in related tickets (like _XXXParams in 
features/clustering).

 

??On a side note current approach to ML API requires a lot of boilerplate code. 
Lately I've been playing with [some 
ideas|https://gist.github.com/zero323/ee36bce57ddeac82322e3ab4ef547611], that 
wouldn't require code generation - they have some caveats, but maybe there is 
something there. ??

It looks succinct; I think we may take it into account in the future.

 

> Add common classes without using JVM backend
> 
>
> Key: SPARK-29212
> URL: https://issues.apache.org/jira/browse/SPARK-29212
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> Copied from [https://github.com/apache/spark/pull/25776].
>  
>  Maciej's *Concern*:
> *Use cases for public ML type hierarchy*
>  * Add Python-only Transformer implementations:
>  * 
>  ** I am Python user and want to implement pure Python ML classifier without 
> providing JVM backend.
>  ** I want this classifier to be meaningfully positioned in the existing type 
> hierarchy.
>  ** However I have access only to high level classes ({{Estimator}}, 
> {{Model}}, {{MLReader}} / {{MLReadable}}).
>  * Run time parameter validation for both user defined (see above) and 
> existing class hierarchy,
>  * 
>  ** I am a library developer who provides functions that are meaningful only 
> for specific categories of {{Estimators}} - here classifiers.
>  ** I want to validate that user passed argument is indeed a classifier:
>  *** For built-in objects using "private" type hierarchy is not really 
> satisfying (actually, what is the rationale behind making it "private"? If 
> the goal is Scala API parity, and Scala counterparts are public, shouldn't 
> these be too?).
>  ** For user defined objects I can:
>  *** Use duck typing (on {{setRawPredictionCol}} for classifier, on 
> {{numClasses}} for classification model) but it hardly satisfying.
>  *** Provide parallel non-abstract type hierarchy ({{Classifier}} or 
> {{PythonClassifier}} and so on) and require users to implement such 
> interfaces. That however would require separate logic for checking for 
> built-in and and user-provided classes.
>  *** Provide parallel abstract type hierarchy, register all existing built-in 
> classes and require users to do the same.
>  Clearly these are not satisfying solutions as they require either defensive 
> programming or reinventing the same functionality for different 3rd party 
> APIs.
>  * Static type checking
>  * 
>  ** I am either end user or library developer and want to use PEP-484 
> annotations to indicate components that require classifier or classification 
> model.
>  * 
>  ** Currently I can provide only imprecise annotations, [such 
> as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241]
>  def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
>  or try to narrow things down using structural subtyping:
>  class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, 
> value: str) -> Classifier: ... class Classifier(Protocol, Model): def 
> setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> 
> int: ...
> (...)
>  * First of all nothing in the original API indicated this. On the contrary, 
> the original API clearly suggests that non-Java path is supported, by 
> providing base classes (Params, Transformer, Estimator, Model, ML 
> \{Reader,Writer}, ML\{Readable,Writable}) as well as Java specific 
> implementations (JavaParams, JavaTransformer, JavaEstimator, JavaModel, 
> JavaML\{Reader,Writer}, JavaML
> {Readable,Writable}
> ).
>  * Furthermore authoritative (IMHO) and open Python ML extensions exist 
> (spark-sklearn is one of these, but if I recall correctly spark-deep-learning 
> provides so pure-Python utilities). Personally I've seen quite a lot of 
> private implementations, but that's just anecdotal evidence.
> Let us assume 

[jira] [Assigned] (SPARK-29269) Pyspark ALSModel support getters/setters

2019-10-08 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-29269:


Assignee: Huaxin Gao

> Pyspark ALSModel support getters/setters
> 
>
> Key: SPARK-29269
> URL: https://issues.apache.org/jira/browse/SPARK-29269
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> ping [~huaxingao] , would you like to work on this? This is similar to your 
> previous works. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-29269) Pyspark ALSModel support getters/setters

2019-10-08 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-29269:
-
Comment: was deleted

(was: It seems that I do not have the permission to assign a ticket:
```
JIRAError: JiraError HTTP 403 url: 
https://issues.apache.org/jira/rest/api/latest/issue/SPARK-29269/assignee
 text: You do not have permission to assign issues.

```

[~dongjoon] Could you please help assign this ticket to Huaxin? Thanks!)

> Pyspark ALSModel support getters/setters
> 
>
> Key: SPARK-29269
> URL: https://issues.apache.org/jira/browse/SPARK-29269
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> ping [~huaxingao] , would you like to work on this? This is similar to your 
> previous works. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29269) Pyspark ALSModel support getters/setters

2019-10-08 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946524#comment-16946524
 ] 

zhengruifeng commented on SPARK-29269:
--

It seems that I do not have the permission to assign a ticket:
```
JIRAError: JiraError HTTP 403 url: 
https://issues.apache.org/jira/rest/api/latest/issue/SPARK-29269/assignee
 text: You do not have permission to assign issues.

```

[~dongjoon] Could you please help assign this ticket to Huaxin? Thanks!

> Pyspark ALSModel support getters/setters
> 
>
> Key: SPARK-29269
> URL: https://issues.apache.org/jira/browse/SPARK-29269
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> ping [~huaxingao] , would you like to work on this? This is similar to your 
> previous works. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29269) Pyspark ALSModel support getters/setters

2019-10-08 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-29269.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25947
[https://github.com/apache/spark/pull/25947]

> Pyspark ALSModel support getters/setters
> 
>
> Key: SPARK-29269
> URL: https://issues.apache.org/jira/browse/SPARK-29269
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> ping [~huaxingao] , would you like to work on this? This is similar to your 
> previous works. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29258) parity between ml.evaluator and mllib.metrics

2019-09-26 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-29258.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25940
[https://github.com/apache/spark/pull/25940]

> parity between ml.evaluator and mllib.metrics
> -
>
> Key: SPARK-29258
> URL: https://issues.apache.org/jira/browse/SPARK-29258
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> 1, expose {{BinaryClassificationMetrics.numBins}} in 
> {{BinaryClassificationEvaluator}}
> 2, expose {{RegressionMetrics.throughOrigin}} in {{RegressionEvaluator}}
> 3, add metric {{explainedVariance}} in {{RegressionEvaluator}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29269) Pyspark ALSModel support getters/setters

2019-09-26 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29269:


 Summary: Pyspark ALSModel support getters/setters
 Key: SPARK-29269
 URL: https://issues.apache.org/jira/browse/SPARK-29269
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


ping [~huaxingao] , would you like to work on this? This is similar to your 
previous works. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29142) Pyspark clustering models support column setters/getters/predict

2019-09-26 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-29142.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25859
[https://github.com/apache/spark/pull/25859]

> Pyspark clustering models support column setters/getters/predict
> 
>
> Key: SPARK-29142
> URL: https://issues.apache.org/jira/browse/SPARK-29142
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> Unlike the regression/classification models, clustering models do not share a 
> common base class, so we need to add the setters/getters/predict one by one.
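
As an illustration of the kind of API being added (method names below follow 
the reg/clf models and are assumptions here), e.g. for KMeansModel:
{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense(0.0, 0.0),), (Vectors.dense(9.0, 9.0),)], ["features"])

model = KMeans(k=2, seed=1).fit(df)
model.setPredictionCol("cluster")               # assumed column setter
print(model.predict(Vectors.dense(0.5, 0.5)))   # assumed single-sample predict
{code}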



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend

2019-09-26 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939082#comment-16939082
 ] 

zhengruifeng commented on SPARK-29212:
--

[~zero323] I had not noticed the base hierarchy without JVM backend in 
SPARK-28985, and thank you for pointing it out.

I guess we have reached some consensus on:

1, add base classes without JVM backend, and make the JVM classes extend them 
(maybe limited to the classes modified in SPARK-28985 at first);

2, rename private class names following PEP-8.

 

 

> Add common classes without using JVM backend
> 
>
> Key: SPARK-29212
> URL: https://issues.apache.org/jira/browse/SPARK-29212
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> copyed from [https://github.com/apache/spark/pull/25776.]
>  
> Maciej's *Concern*:
> *Use cases for public ML type hierarchy*
>  * Add Python-only Transformer implementations:
>  ** I am Python user and want to implement pure Python ML classifier without 
> providing JVM backend.
>  ** I want this classifier to be meaningfully positioned in the existing type 
> hierarchy.
>  ** However I have access only to high level classes ({{Estimator}}, 
> {{Model}}, {{MLReader}} / {{MLReadable}}).
>  * Run time parameter validation for both user defined (see above) and 
> existing class hierarchy,
>  ** I am a library developer who provides functions that are meaningful only 
> for specific categories of {{Estimators}} - here classifiers.
>  ** I want to validate that user passed argument is indeed a classifier:
>  *** For built-in objects using "private" type hierarchy is not really 
> satisfying (actually, what is the rationale behind making it "private"? If 
> the goal is Scala API parity, and Scala counterparts are public, shouldn't 
> these be too?).
>  ** For user defined objects I can:
>  *** Use duck typing (on {{setRawPredictionCol}} for classifier, on 
> {{numClasses}} for classification model) but it hardly satisfying.
>  *** Provide parallel non-abstract type hierarchy ({{Classifier}} or 
> {{PythonClassifier}} and so on) and require users to implement such 
> interfaces. That however would require separate logic for checking for 
> built-in and and user-provided classes.
>  *** Provide parallel abstract type hierarchy, register all existing built-in 
> classes and require users to do the same.
> Clearly these are not satisfying solutions as they require either defensive 
> programming or reinventing the same functionality for different 3rd party 
> APIs.
>  * Static type checking
>  ** I am either end user or library developer and want to use PEP-484 
> annotations to indicate components that require classifier or classification 
> model.
>  ** Currently I can provide only imprecise annotations, [such 
> as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241]
> def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
> or try to narrow things down using structural subtyping:
> class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, 
> value: str) -> Classifier: ... class Classifier(Protocol, Model): def 
> setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> 
> int: ...
>  
> Maciej's *Proposal*:
> {code:java}
> Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e.
> class ClassifierParams: ...
> class Predictor(Estimator,PredictorParams):
> def setLabelCol(self, value): ...
> def setFeaturesCol(self, value): ...
> def setPredictionCol(self, value): ...
> class Classifier(Predictor, ClassifierParams):
> def setRawPredictionCol(self, value): ...
> class PredictionModel(Model,PredictorParams):
> def setFeaturesCol(self, value): ...
> def setPredictionCol(self, value): ...
> def numFeatures(self): ...
> def predict(self, value): ...
> and JVM interop should extend from this hierarchy, i.e.
> class JavaPredictionModel(PredictionModel): ...
> In other words it should be consistent with existing approach, where we have 
> ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and 
> Java* variants are their subclasses.
>  {code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29258) parity between ml.evaluator and mllib.metrics

2019-09-26 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29258:


 Summary: parity between ml.evaluator and mllib.metrics
 Key: SPARK-29258
 URL: https://issues.apache.org/jira/browse/SPARK-29258
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


1, expose {{BinaryClassificationMetrics.numBins}} in 
{{BinaryClassificationEvaluator}}

2, expose {{RegressionMetrics.throughOrigin}} in {{RegressionEvaluator}}

3, add metric {{explainedVariance}} in {{RegressionEvaluator}}
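
Hypothetical usage once the params/metric are exposed; the names below 
({{numBins}}, {{throughOrigin}}, the explained-variance metric) are taken from 
the mllib.metrics counterparts and are assumptions about the eventual API:
{code:python}
from pyspark.ml.evaluation import BinaryClassificationEvaluator, RegressionEvaluator

# Assumed params mirroring mllib's BinaryClassificationMetrics.numBins and
# RegressionMetrics.throughOrigin, plus an explained-variance metric name.
bin_eval = BinaryClassificationEvaluator(metricName="areaUnderROC", numBins=1000)
reg_eval = RegressionEvaluator(metricName="var", throughOrigin=False)
{code}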



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend

2019-09-25 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938205#comment-16938205
 ] 

zhengruifeng commented on SPARK-29212:
--

[~zero323] Would you like to help work on this?

> Add common classes without using JVM backend
> 
>
> Key: SPARK-29212
> URL: https://issues.apache.org/jira/browse/SPARK-29212
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> copyed from [https://github.com/apache/spark/pull/25776.]
>  
> Maciej's *Concern*:
> *Use cases for public ML type hierarchy*
>  * Add Python-only Transformer implementations:
>  ** I am Python user and want to implement pure Python ML classifier without 
> providing JVM backend.
>  ** I want this classifier to be meaningfully positioned in the existing type 
> hierarchy.
>  ** However I have access only to high level classes ({{Estimator}}, 
> {{Model}}, {{MLReader}} / {{MLReadable}}).
>  * Run time parameter validation for both user defined (see above) and 
> existing class hierarchy,
>  ** I am a library developer who provides functions that are meaningful only 
> for specific categories of {{Estimators}} - here classifiers.
>  ** I want to validate that user passed argument is indeed a classifier:
>  *** For built-in objects using "private" type hierarchy is not really 
> satisfying (actually, what is the rationale behind making it "private"? If 
> the goal is Scala API parity, and Scala counterparts are public, shouldn't 
> these be too?).
>  ** For user defined objects I can:
>  *** Use duck typing (on {{setRawPredictionCol}} for classifier, on 
> {{numClasses}} for classification model) but it hardly satisfying.
>  *** Provide parallel non-abstract type hierarchy ({{Classifier}} or 
> {{PythonClassifier}} and so on) and require users to implement such 
> interfaces. That however would require separate logic for checking for 
> built-in and and user-provided classes.
>  *** Provide parallel abstract type hierarchy, register all existing built-in 
> classes and require users to do the same.
> Clearly these are not satisfying solutions as they require either defensive 
> programming or reinventing the same functionality for different 3rd party 
> APIs.
>  * Static type checking
>  ** I am either end user or library developer and want to use PEP-484 
> annotations to indicate components that require classifier or classification 
> model.
>  ** Currently I can provide only imprecise annotations, [such 
> as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241]
> def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
> or try to narrow things down using structural subtyping:
> class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, 
> value: str) -> Classifier: ... class Classifier(Protocol, Model): def 
> setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> 
> int: ...
>  
> Maciej's *Proposal*:
> {code:java}
> Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e.
> class ClassifierParams: ...
> class Predictor(Estimator,PredictorParams):
> def setLabelCol(self, value): ...
> def setFeaturesCol(self, value): ...
> def setPredictionCol(self, value): ...
> class Classifier(Predictor, ClassifierParams):
> def setRawPredictionCol(self, value): ...
> class PredictionModel(Model,PredictorParams):
> def setFeaturesCol(self, value): ...
> def setPredictionCol(self, value): ...
> def numFeatures(self): ...
> def predict(self, value): ...
> and JVM interop should extend from this hierarchy, i.e.
> class JavaPredictionModel(PredictionModel): ...
> In other words it should be consistent with existing approach, where we have 
> ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and 
> Java* variants are their subclasses.
>  {code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend

2019-09-23 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935678#comment-16935678
 ] 

zhengruifeng commented on SPARK-29212:
--

It seems useful to implement some algorithms in pure Python (like wrapping 
scikit-learn as a pyspark.ml algorithm).

I personally think [~zero323]'s proposal is reasonable, although pyspark.ml is 
currently mostly there to wrap the Scala side.

I had a discussion with [~huaxingao] and [~srowen]; I guess they are fairly 
neutral on it.

What do you think of this? [~holden.ka...@gmail.com] [~bryanc]

> Add common classes without using JVM backend
> 
>
> Key: SPARK-29212
> URL: https://issues.apache.org/jira/browse/SPARK-29212
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> copyed from [https://github.com/apache/spark/pull/25776.]
>  
> Maciej's *Concern*:
> *Use cases for public ML type hierarchy*
>  * Add Python-only Transformer implementations:
>  ** I am Python user and want to implement pure Python ML classifier without 
> providing JVM backend.
>  ** I want this classifier to be meaningfully positioned in the existing type 
> hierarchy.
>  ** However I have access only to high level classes ({{Estimator}}, 
> {{Model}}, {{MLReader}} / {{MLReadable}}).
>  * Run time parameter validation for both user defined (see above) and 
> existing class hierarchy,
>  ** I am a library developer who provides functions that are meaningful only 
> for specific categories of {{Estimators}} - here classifiers.
>  ** I want to validate that user passed argument is indeed a classifier:
>  *** For built-in objects using "private" type hierarchy is not really 
> satisfying (actually, what is the rationale behind making it "private"? If 
> the goal is Scala API parity, and Scala counterparts are public, shouldn't 
> these be too?).
>  ** For user defined objects I can:
>  *** Use duck typing (on {{setRawPredictionCol}} for classifier, on 
> {{numClasses}} for classification model) but it hardly satisfying.
>  *** Provide parallel non-abstract type hierarchy ({{Classifier}} or 
> {{PythonClassifier}} and so on) and require users to implement such 
> interfaces. That however would require separate logic for checking for 
> built-in and and user-provided classes.
>  *** Provide parallel abstract type hierarchy, register all existing built-in 
> classes and require users to do the same.
> Clearly these are not satisfying solutions as they require either defensive 
> programming or reinventing the same functionality for different 3rd party 
> APIs.
>  * Static type checking
>  ** I am either end user or library developer and want to use PEP-484 
> annotations to indicate components that require classifier or classification 
> model.
>  ** Currently I can provide only imprecise annotations, [such 
> as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241]
> def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
> or try to narrow things down using structural subtyping:
> class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, 
> value: str) -> Classifier: ... class Classifier(Protocol, Model): def 
> setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> 
> int: ...
>  
> Maciej's *Proposal*:
> {code:java}
> Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e.
> class ClassifierParams: ...
> class Predictor(Estimator,PredictorParams):
> def setLabelCol(self, value): ...
> def setFeaturesCol(self, value): ...
> def setPredictionCol(self, value): ...
> class Classifier(Predictor, ClassifierParams):
> def setRawPredictionCol(self, value): ...
> class PredictionModel(Model,PredictorParams):
> def setFeaturesCol(self, value): ...
> def setPredictionCol(self, value): ...
> def numFeatures(self): ...
> def predict(self, value): ...
> and JVM interop should extend from this hierarchy, i.e.
> class JavaPredictionModel(PredictionModel): ...
> In other words it should be consistent with existing approach, where we have 
> ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and 
> Java* variants are their subclasses.
>  {code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29212) Add common classes without using JVM backend

2019-09-23 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29212:


 Summary: Add common classes without using JVM backend
 Key: SPARK-29212
 URL: https://issues.apache.org/jira/browse/SPARK-29212
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


Copied from [https://github.com/apache/spark/pull/25776].

 

Maciej's *Concern*:

*Use cases for public ML type hierarchy*
 * Add Python-only Transformer implementations:

 ** I am Python user and want to implement pure Python ML classifier without 
providing JVM backend.
 ** I want this classifier to be meaningfully positioned in the existing type 
hierarchy.
 ** However I have access only to high level classes ({{Estimator}}, {{Model}}, 
{{MLReader}} / {{MLReadable}}).
 * Run time parameter validation for both user defined (see above) and existing 
class hierarchy,

 ** I am a library developer who provides functions that are meaningful only 
for specific categories of {{Estimators}} - here classifiers.
 ** I want to validate that user passed argument is indeed a classifier:
 *** For built-in objects using "private" type hierarchy is not really 
satisfying (actually, what is the rationale behind making it "private"? If the 
goal is Scala API parity, and Scala counterparts are public, shouldn't these be 
too?).
 ** For user defined objects I can:
 *** Use duck typing (on {{setRawPredictionCol}} for classifier, on 
{{numClasses}} for classification model) but it hardly satisfying.
 *** Provide parallel non-abstract type hierarchy ({{Classifier}} or 
{{PythonClassifier}} and so on) and require users to implement such interfaces. 
That however would require separate logic for checking for built-in and and 
user-provided classes.
 *** Provide parallel abstract type hierarchy, register all existing built-in 
classes and require users to do the same.
Clearly these are not satisfying solutions as they require either defensive 
programming or reinventing the same functionality for different 3rd party APIs.

 * Static type checking

 ** I am either end user or library developer and want to use PEP-484 
annotations to indicate components that require classifier or classification 
model.

 ** Currently I can provide only imprecise annotations, [such 
as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241]
def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
or try to narrow things down using structural subtyping:
class Classifier(Protocol, Estimator[M]):
    def setRawPredictionCol(self, value: str) -> Classifier: ...

class Classifier(Protocol, Model):
    def setRawPredictionCol(self, value: str) -> Model: ...
    def numClasses(self) -> int: ...

 

Maciej's *Proposal*:
{code:java}
Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e.
class ClassifierParams: ...

class Predictor(Estimator, PredictorParams):
    def setLabelCol(self, value): ...
    def setFeaturesCol(self, value): ...
    def setPredictionCol(self, value): ...

class Classifier(Predictor, ClassifierParams):
    def setRawPredictionCol(self, value): ...

class PredictionModel(Model, PredictorParams):
    def setFeaturesCol(self, value): ...
    def setPredictionCol(self, value): ...
    def numFeatures(self): ...
    def predict(self, value): ...
and JVM interop should extend from this hierarchy, i.e.
class JavaPredictionModel(PredictionModel): ...
In other words it should be consistent with existing approach, where we have 
ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and Java* 
variants are their subclasses.
 {code}
 

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29144) Binarizer handle sparse vectors incorrectly with negative threshold

2019-09-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-29144:
-
Summary: Binarizer handle sparse vectors incorrectly with negative 
threshold  (was: Binarizer handel sparse vector incorrectly with negative 
threshold)

> Binarizer handle sparse vectors incorrectly with negative threshold
> ---
>
> Key: SPARK-29144
> URL: https://issues.apache.org/jira/browse/SPARK-29144
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: zhengruifeng
>Priority: Minor
>
> the processing of sparse vectors is wrong if threshold < 0:
> {code:java}
> scala> val data = Seq((0, Vectors.sparse(3, Array(1), Array(0.5))), (1, 
> Vectors.dense(Array(0.0, 0.5, 0.0
> data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((0,(3,[1],[0.5])), 
> (1,[0.0,0.5,0.0]))
> scala> val df = data.toDF("id", "feature")
> df: org.apache.spark.sql.DataFrame = [id: int, feature: vector]
> scala> val binarizer: Binarizer = new 
> Binarizer().setInputCol("feature").setOutputCol("binarized_feature").setThreshold(-0.5)
> binarizer: org.apache.spark.ml.feature.Binarizer = binarizer_1c07ac2ae3c8
> scala> binarizer.transform(df).show()
> +---+-+-+
> | id|  feature|binarized_feature|
> +---+-+-+
> |  0|(3,[1],[0.5])|[0.0,1.0,0.0]|
> |  1|[0.0,0.5,0.0]|[1.0,1.0,1.0]|
> +---+-+-+
> {code}
> expected outputs of the above two input vectors should be the same.
>  
> To deal with sparse vectors with threshold < 0, we have two options:
> 1, return 1 for non-active items, but this will convert sparse vectors to 
> dense ones
> 2, throw an exception like what Scikit-Learn's 
> [Binarizer|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html]
>  does:
> {code:java}
> import numpy as np
> from scipy.sparse import csr_matrix
> from sklearn.preprocessing import Binarizer
> row = np.array([0, 0, 1, 2, 2, 2])
> col = np.array([0, 2, 2, 0, 1, 2])
> data = np.array([1, 2, 3, 4, 5, 6])
> a = csr_matrix((data, (row, col)), shape=(3, 3))
> binarizer = Binarizer(threshold=-1.0)
> binarizer.transform(a)
> Traceback (most recent call last):  File "", 
> line 1, in 
> binarizer.transform(a)  File 
> "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py",
>  line 1874, in transform
> return binarize(X, threshold=self.threshold, copy=copy)  File 
> "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py",
>  line 1774, in binarize
> raise ValueError('Cannot binarize a sparse matrix with threshold 
> 'ValueError: Cannot binarize a sparse matrix with threshold < 0 {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29144) Binarizer handel sparse vector incorrectly with negative threshold

2019-09-18 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932272#comment-16932272
 ] 

zhengruifeng commented on SPARK-29144:
--

I prefer option 2, and will send a PR for this.

> Binarizer handel sparse vector incorrectly with negative threshold
> --
>
> Key: SPARK-29144
> URL: https://issues.apache.org/jira/browse/SPARK-29144
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: zhengruifeng
>Priority: Minor
>
> the processing of sparse vectors is wrong if threshold < 0:
> {code:java}
> scala> val data = Seq((0, Vectors.sparse(3, Array(1), Array(0.5))), (1, 
> Vectors.dense(Array(0.0, 0.5, 0.0
> data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((0,(3,[1],[0.5])), 
> (1,[0.0,0.5,0.0]))
> scala> val df = data.toDF("id", "feature")
> df: org.apache.spark.sql.DataFrame = [id: int, feature: vector]
> scala> val binarizer: Binarizer = new 
> Binarizer().setInputCol("feature").setOutputCol("binarized_feature").setThreshold(-0.5)
> binarizer: org.apache.spark.ml.feature.Binarizer = binarizer_1c07ac2ae3c8
> scala> binarizer.transform(df).show()
> +---+-+-+
> | id|  feature|binarized_feature|
> +---+-+-+
> |  0|(3,[1],[0.5])|[0.0,1.0,0.0]|
> |  1|[0.0,0.5,0.0]|[1.0,1.0,1.0]|
> +---+-+-+
> {code}
> expected outputs of the above two input vectors should be the same.
>  
> To deal with sparse vectors with threshold < 0, we have two options:
> 1, return 1 for non-active items, but this will convert sparse vectors to 
> dense ones
> 2, throw an exception like what Scikit-Learn's 
> [Binarizer|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html]
>  does:
> {code:java}
> import numpy as np
> from scipy.sparse import csr_matrix
> from sklearn.preprocessing import Binarizer
> row = np.array([0, 0, 1, 2, 2, 2])
> col = np.array([0, 2, 2, 0, 1, 2])
> data = np.array([1, 2, 3, 4, 5, 6])
> a = csr_matrix((data, (row, col)), shape=(3, 3))
> binarizer = Binarizer(threshold=-1.0)
> binarizer.transform(a)
> Traceback (most recent call last):  File "", 
> line 1, in 
> binarizer.transform(a)  File 
> "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py",
>  line 1874, in transform
> return binarize(X, threshold=self.threshold, copy=copy)  File 
> "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py",
>  line 1774, in binarize
> raise ValueError('Cannot binarize a sparse matrix with threshold 
> 'ValueError: Cannot binarize a sparse matrix with threshold < 0 {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29144) Binarizer handel sparse vector incorrectly with negative threshold

2019-09-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-29144:
-
Description: 
the handling of sparse vectors is wrong if threshold < 0:
{code:java}
scala> val data = Seq((0, Vectors.sparse(3, Array(1), Array(0.5))), (1, 
Vectors.dense(Array(0.0, 0.5, 0.0
data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((0,(3,[1],[0.5])), 
(1,[0.0,0.5,0.0]))

scala> val df = data.toDF("id", "feature")
df: org.apache.spark.sql.DataFrame = [id: int, feature: vector]

scala> val binarizer: Binarizer = new 
Binarizer().setInputCol("feature").setOutputCol("binarized_feature").setThreshold(-0.5)
binarizer: org.apache.spark.ml.feature.Binarizer = binarizer_1c07ac2ae3c8

scala> binarizer.transform(df).show()
+---+-+-+
| id|  feature|binarized_feature|
+---+-+-+
|  0|(3,[1],[0.5])|[0.0,1.0,0.0]|
|  1|[0.0,0.5,0.0]|[1.0,1.0,1.0]|
+---+-+-+
{code}
expected outputs of the above two input vectors should be the same.

 

To deal with sparse vectors with threshold < 0, we have two options:

1, return 1 for non-active items, but this will convert sparse vectors to dense 
ones

2, throw an exception like what Scikit-Learn's 
[Binarizer|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html]
 does:
{code:java}
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import Binarizer

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
a = csr_matrix((data, (row, col)), shape=(3, 3))
binarizer = Binarizer(threshold=-1.0)
binarizer.transform(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    binarizer.transform(a)
  File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1874, in transform
    return binarize(X, threshold=self.threshold, copy=copy)
  File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1774, in binarize
    raise ValueError('Cannot binarize a sparse matrix with threshold < 0')
ValueError: Cannot binarize a sparse matrix with threshold < 0 {code}
 

  was:
the process on sparse vector is wrong if thread<0:
{code:java}
scala> val data = Seq((0, Vectors.sparse(3, Array(1), Array(0.5))), (1, 
Vectors.dense(Array(0.0, 0.5, 0.0
data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((0,(3,[1],[0.5])), 
(1,[0.0,0.5,0.0]))

scala> val df = data.toDF("id", "feature")
df: org.apache.spark.sql.DataFrame = [id: int, feature: vector]

scala> val binarizer: Binarizer = new 
Binarizer().setInputCol("feature").setOutputCol("binarized_feature").setThreshold(-0.5)
binarizer: org.apache.spark.ml.feature.Binarizer = binarizer_1c07ac2ae3c8

scala> binarizer.transform(df).show()
+---+-+-+
| id|  feature|binarized_feature|
+---+-+-+
|  0|(3,[1],[0.5])|[0.0,1.0,0.0]|
|  1|[0.0,0.5,0.0]|[1.0,1.0,1.0]|
+---+-+-+
{code}
expected outputs of the above two input vectors should be the same.

 

To deal with sparse vectors with threshold < 0, we have two options:

1, return 1 for non-active items, but this will convert sparse vectors to dense 
ones

2, throw an exception like what Scikit-Learn's 
[Binarizer|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.htm]
 does:
{code:java}
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import Binarizer

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
a = csr_matrix((data, (row, col)), shape=(3, 3))
binarizer = Binarizer(threshold=-1.0)
binarizer.transform(a)
Traceback (most recent call last):  File "", 
line 1, in 
binarizer.transform(a)  File 
"/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py",
 line 1874, in transform
return binarize(X, threshold=self.threshold, copy=copy)  File 
"/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py",
 line 1774, in binarize
raise ValueError('Cannot binarize a sparse matrix with threshold 
'ValueError: Cannot binarize a sparse matrix with threshold < 0 {code}
 


> Binarizer handles sparse vectors incorrectly with negative threshold
> --
>
> Key: SPARK-29144
> URL: https://issues.apache.org/jira/browse/SPARK-29144
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: zhengruifeng
>Priority: Minor
>
> the handling of sparse vectors is wrong if threshold < 0:
> {code:java}
> scala> val data = Seq((0, 

[jira] [Created] (SPARK-29144) Binarizer handles sparse vectors incorrectly with negative threshold

2019-09-18 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29144:


 Summary: Binarizer handles sparse vectors incorrectly with negative 
threshold
 Key: SPARK-29144
 URL: https://issues.apache.org/jira/browse/SPARK-29144
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.4.0, 2.3.0, 2.2.0, 2.1.0, 2.0.0
Reporter: zhengruifeng


the handling of sparse vectors is wrong if threshold < 0:
{code:java}
scala> val data = Seq((0, Vectors.sparse(3, Array(1), Array(0.5))), (1, 
Vectors.dense(Array(0.0, 0.5, 0.0
data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((0,(3,[1],[0.5])), 
(1,[0.0,0.5,0.0]))

scala> val df = data.toDF("id", "feature")
df: org.apache.spark.sql.DataFrame = [id: int, feature: vector]

scala> val binarizer: Binarizer = new 
Binarizer().setInputCol("feature").setOutputCol("binarized_feature").setThreshold(-0.5)
binarizer: org.apache.spark.ml.feature.Binarizer = binarizer_1c07ac2ae3c8

scala> binarizer.transform(df).show()
+---+-+-+
| id|  feature|binarized_feature|
+---+-+-+
|  0|(3,[1],[0.5])|[0.0,1.0,0.0]|
|  1|[0.0,0.5,0.0]|[1.0,1.0,1.0]|
+---+-+-+
{code}
expected outputs of the above two input vectors should be the same.

 

To deal with sparse vectors with threshold < 0, we have two options:

1, return 1 for non-active items, but this will convert sparse vectors to dense 
ones

2, throw an exception like what Scikit-Learn's 
[Binarizer|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.htm]
 does:
{code:java}
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import Binarizer

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
a = csr_matrix((data, (row, col)), shape=(3, 3))
binarizer = Binarizer(threshold=-1.0)
binarizer.transform(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    binarizer.transform(a)
  File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1874, in transform
    return binarize(X, threshold=self.threshold, copy=copy)
  File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1774, in binarize
    raise ValueError('Cannot binarize a sparse matrix with threshold < 0')
ValueError: Cannot binarize a sparse matrix with threshold < 0 {code}
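As an illustration, a minimal sketch of option 2 on the scala side (an assumption for discussion, not an actual patch): reject sparse vectors when the threshold is negative, and keep the sparse fast path otherwise.
{code:java}
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors}

// sketch of option 2: dense vectors are binarized as usual, while sparse
// vectors with a negative threshold are rejected, mirroring scikit-learn
def binarize(v: Vector, threshold: Double): Vector = v match {
  case dv: DenseVector =>
    Vectors.dense(dv.values.map(x => if (x > threshold) 1.0 else 0.0))
  case sv: SparseVector =>
    require(threshold >= 0.0,
      s"Cannot binarize a sparse vector with threshold < 0, got $threshold")
    // with threshold >= 0 the implicit zeros stay zero, so sparsity is preserved
    val idx = sv.indices.zip(sv.values).filter(_._2 > threshold).map(_._1)
    Vectors.sparse(sv.size, idx, Array.fill(idx.length)(1.0))
}
{code}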
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23578) Add multicolumn support for Binarizer

2019-09-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reopened SPARK-23578:
--

This ticket is for Binarizer, not Bucketizer.

> Add multicolumn support for Binarizer
> -
>
> Key: SPARK-23578
> URL: https://issues.apache.org/jira/browse/SPARK-23578
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Teng Peng
>Priority: Minor
>
> [Spark-20542] added an API so that Bucketizer can bin multiple columns. 
> Based on this change, multi-column support can be added for Binarizer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23578) Add multicolumn support for Binarizer

2019-09-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-23578.
--
Resolution: Duplicate

> Add multicolumn support for Binarizer
> -
>
> Key: SPARK-23578
> URL: https://issues.apache.org/jira/browse/SPARK-23578
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Teng Peng
>Priority: Minor
>
> [Spark-20542] added an API so that Bucketizer can bin multiple columns. 
> Based on this change, multi-column support can be added for Binarizer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29143) Pyspark feature models support column setters/getters

2019-09-18 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29143:


 Summary: Pyspark feature models support column setters/getters
 Key: SPARK-29143
 URL: https://issues.apache.org/jira/browse/SPARK-29143
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29142) Pyspark clustering models support column setters/getters/predict

2019-09-18 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29142:


 Summary: Pyspark clustering models support column 
setters/getters/predict
 Key: SPARK-29142
 URL: https://issues.apache.org/jira/browse/SPARK-29142
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


Unlike the reg/clf models, clustering models do not share a common base class, so 
we need to add the setters/getters one by one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29118) Avoid redundant computation in GMM.transform && GLR.transform

2019-09-17 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-29118:
-
Description: 
In SPARK-27944, the computation for output columns with empty name is skipped.

Now, I find that we can further optimize:

1, GMM.transform by directly obtaining the prediction(double) from its 
probability prediction(vector), like what ProbabilisticClassificationModel and 
ClassificationModel do.

2, GLR.transform by obtaining the prediction(double) from its link 
prediction(double)

  was:
In SPARK-27944, the computation for output columns with empty name is skipped.

Now, I find that we can furthermore optimize GMM.transform by directly 
obtaining the prediction(double) from its probabilty prediction(vector), like 
what ProbabilisticClassificationModel and ClassificationModel do.


> Avoid redundant computation in GMM.transform && GLR.transform
> -
>
> Key: SPARK-29118
> URL: https://issues.apache.org/jira/browse/SPARK-29118
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> In SPARK-27944, the computation for output columns with empty name is skipped.
> Now, I find that we can further optimize:
> 1, GMM.transform by directly obtaining the prediction(double) from its 
> probability prediction(vector), like what ProbabilisticClassificationModel and 
> ClassificationModel do.
> 2, GLR.transform by obtaining the prediction(double) from its link 
> prediction(double)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29118) Avoid redundant computation in GMM.transform && GLR.transform

2019-09-17 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-29118:
-
Summary: Avoid redundant computation in GMM.transform && GLR.transform  
(was: Avoid redundant computation in GMM.transform)

> Avoid redundant computation in GMM.transform && GLR.transform
> -
>
> Key: SPARK-29118
> URL: https://issues.apache.org/jira/browse/SPARK-29118
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> In SPARK-27944, the computation for output columns with empty name is skipped.
> Now, I find that we can further optimize GMM.transform by directly 
> obtaining the prediction(double) from its probability prediction(vector), like 
> what ProbabilisticClassificationModel and ClassificationModel do.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29118) Avoid redundant computation in GMM.transform

2019-09-17 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29118:


 Summary: Avoid redundant computation in GMM.transform
 Key: SPARK-29118
 URL: https://issues.apache.org/jira/browse/SPARK-29118
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


In SPARK-27944, the computation for output columns with empty name is skipped.

Now, I find that we can further optimize GMM.transform by directly 
obtaining the prediction(double) from its probability prediction(vector), like 
what ProbabilisticClassificationModel and ClassificationModel do.
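For illustration, a rough sketch (an assumption for discussion, not the actual patch) of deriving the hard prediction directly from the probability vector:
{code:java}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// same rule as ProbabilisticClassificationModel: the hard prediction is just
// the index of the largest entry of the probability vector
val probabilityToPrediction = udf { probability: Vector => probability.argmax.toDouble }

// assuming `transformed` already contains the "probability" column produced by GMM:
// val withPrediction = transformed.withColumn("prediction", probabilityToPrediction(col("probability")))
{code}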



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29116) Refactor py classes related to DecisionTree

2019-09-17 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931179#comment-16931179
 ] 

zhengruifeng commented on SPARK-29116:
--

friendly ping [~huaxingao] , are you willing to work on this?

> Refactor py classes related to DecisionTree
> ---
>
> Key: SPARK-29116
> URL: https://issues.apache.org/jira/browse/SPARK-29116
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> 1, Like the scala side, move related classes to a separate file 'tree.py'
> 2, add method 'predictLeaf' in 'DecisionTreeModel' & 'TreeEnsembleModel'



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29116) Refactor py classes related to DecisionTree

2019-09-17 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29116:


 Summary: Refactor py classes related to DecisionTree
 Key: SPARK-29116
 URL: https://issues.apache.org/jira/browse/SPARK-29116
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


1, Like the scala side, move related classes to a separate file 'tree.py'

2, add method 'predictLeaf' in 'DecisionTreeModel' & 'TreeEnsembleModel'



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22796) Add multiple column support to PySpark QuantileDiscretizer

2019-09-17 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931125#comment-16931125
 ] 

zhengruifeng commented on SPARK-22796:
--

[~huaxingao]   https://issues.apache.org/jira/browse/SPARK-22797 is now 
resolved, so you can continue now

> Add multiple column support to PySpark QuantileDiscretizer
> --
>
> Key: SPARK-22796
> URL: https://issues.apache.org/jira/browse/SPARK-22796
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22797) Add multiple column support to PySpark Bucketizer

2019-09-17 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-22797.
--
Resolution: Done

> Add multiple column support to PySpark Bucketizer
> -
>
> Key: SPARK-22797
> URL: https://issues.apache.org/jira/browse/SPARK-22797
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: zhengruifeng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29094) Add extractInstances method

2019-09-16 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-29094.
--
Resolution: Duplicate

> Add extractInstances method
> ---
>
> Key: SPARK-29094
> URL: https://issues.apache.org/jira/browse/SPARK-29094
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29095) add extractInstances

2019-09-16 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29095:


 Summary: add extractInstances
 Key: SPARK-29095
 URL: https://issues.apache.org/jira/browse/SPARK-29095
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


There was a method extractLabeledPoints for ML algs to transform a dataset into an RDD 
of LabeledPoints.

Now more and more algs support sample weighting and extractLabeledPoints is 
less used, so we should support extracting weights in the common methods.
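A rough sketch of what such a helper could look like (an assumption for discussion, not the merged code; Instance here is a local stand-in for the class used internally by ML):
{code:java}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.{col, lit}

// local stand-in for the Instance class used internally by ML
case class Instance(label: Double, weight: Double, features: Vector)

// reads label, an optional weight column (defaulting to 1.0) and features
def extractInstances(dataset: Dataset[_], labelCol: String, featuresCol: String,
    weightCol: Option[String]): RDD[Instance] = {
  val w = weightCol.filter(_.nonEmpty).map(col).getOrElse(lit(1.0))
  dataset.select(col(labelCol).cast("double"), w.cast("double"), col(featuresCol))
    .rdd.map { case Row(label: Double, weight: Double, features: Vector) =>
      Instance(label, weight, features)
    }
}
{code}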



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29094) Add extractInstances method

2019-09-16 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29094:


 Summary: Add extractInstances method
 Key: SPARK-29094
 URL: https://issues.apache.org/jira/browse/SPARK-29094
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng






--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29093) Remove automatically generated param setters in _shared_params_code_gen.py

2019-09-16 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29093:


 Summary: Remove automatically generated param setters in 
_shared_params_code_gen.py
 Key: SPARK-29093
 URL: https://issues.apache.org/jira/browse/SPARK-29093
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


The main difference between the scala and py sides comes from the automatically 
generated param setters in _shared_params_code_gen.py.

To make them in sync, we should remove those setters in _shared_.py, and add 
the corresponding setters manually.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28985) Pyspark ClassificationModel and RegressionModel support column setters/getters/predict

2019-09-11 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927498#comment-16927498
 ] 

zhengruifeng commented on SPARK-28985:
--

[~huaxingao] You can refer to my old prs 
[https://github.com/apache/spark/pull/16171] and 
[https://github.com/apache/spark/pull/25662] if you want to take it over. 
Thanks!

> Pyspark ClassificationModel and RegressionModel support column 
> setters/getters/predict
> --
>
> Key: SPARK-28985
> URL: https://issues.apache.org/jira/browse/SPARK-28985
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> 1, add common abstract classes like JavaClassificationModel & 
> JavaProbabilisticClassificationModel
> 2, add column setters/getters, and predict method
> 3, update the test suites to verify newly added functions



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9612) Add instance weight support for GBTs

2019-09-06 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924090#comment-16924090
 ] 

zhengruifeng commented on SPARK-9612:
-

https://issues.apache.org/jira/browse/SPARK-19591 is now resolved by 
[~imatiach] 

[~dbtsai]  Will you go on working on this?

> Add instance weight support for GBTs
> 
>
> Key: SPARK-9612
> URL: https://issues.apache.org/jira/browse/SPARK-9612
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: DB Tsai
>Priority: Minor
>  Labels: bulk-closed
>
> GBT support for instance weights could be handled by:
> * sampling data before passing it to trees
> * passing weights to trees (requiring weight support for trees first, but 
> probably better in the end)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28968) Add HasNumFeatures in the scala side

2019-09-06 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-28968.
--
Resolution: Resolved

> Add HasNumFeatures in the scala side
> 
>
> Key: SPARK-28968
> URL: https://issues.apache.org/jira/browse/SPARK-28968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> In the py side, HasNumFeatures is provided and inherited by 'HashingTF' and 
> 'FeatureHasher'.
> It is reasonable to also add HasNumFeatures in the scala side.
> Since '1<<18' is used by default in all places, we should add it as a default 
> in the param trait.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28985) Pyspark ClassificationModel and RegressionModel support column setters/getters/predict

2019-09-05 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-28985:
-
Description: 
1, add common abstract classes like JavaClassificationModel & 
JavaProbabilisticClassificationModel

2, add column setters/getters, and predict method

3, update the test suites to verify newly added functions

  was:
1, add common abstract classes like ClassificationModel & 
ProbabilisticClassificationModel

2, add column setters/getters, and predict method

3, update the test suites to verify newly added functions


> Pyspark ClassificationModel and RegressionModel support column 
> setters/getters/predict
> --
>
> Key: SPARK-28985
> URL: https://issues.apache.org/jira/browse/SPARK-28985
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> 1, add common abstract classes like JavaClassificationModel & 
> JavaProbabilisticClassificationModel
> 2, add column setters/getters, and predict method
> 3, update the test suites to verify newly added functions



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-05 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923163#comment-16923163
 ] 

zhengruifeng commented on SPARK-28927:
--

[~JerryHouse]  As to AUC, which impl do you use? BinaryClassificationEvaluator 
or BinaryClassificationMetrics?

If you use BinaryClassificationMetrics, you may try to set numBins=0 to avoid 
down-sampling, so we can then see whether the score is stable.
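For example (illustrative only; `scoreAndLabels` stands for whatever RDD of (score, label) pairs you evaluate on):
{code:java}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// numBins = 0 disables down-sampling of the curve, so the AUC is exact
// val metrics = new BinaryClassificationMetrics(scoreAndLabels, numBins = 0)
// println(metrics.areaUnderROC())
{code}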

Moreover, could you please provide a (small) dataframe to reproduce?

 

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Priority: Major
> Attachments: image-2019-09-02-11-55-33-596.png
>
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happened sometimes.  And we also found that the AUC metric was 
> not stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration. AUC varied from 0.60 to 0.67 
> which was not stable for production environment. 
> Dataset capacity: ~12 billion ratings
> Here is the our code:
> val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, 
> 

[jira] [Updated] (SPARK-28958) pyspark.ml function parity

2019-09-05 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-28958:
-
Description: 
I looked into the hierarchy of both py and scala sides, and found that they are 
quite different, which damages the parity and makes the codebase hard to maintain.

The main inconvenience is that most models in pyspark do not support any param 
getters and setters.

In the py side, I think we need to do:

1, remove setters generated by _shared_params_code_gen.py;

2, add common abstract classes like the scala side, such as 
JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier;

3, for each alg, add its param trait, such as LinearSVCParams;

4, since sharedParam do not have setters, we need to add them in right places;

Unfortunately, I notice that if we do 1 (remove setters generated by 
_shared_params_code_gen.py), all algs 
(classification/regression/clustering/features/fpm/recommendation) need to be 
modified in one batch.

The scala side also needs some small improvements, but I think they can be left 
alone at first

  was:
I looked into the hierarchy of both py and scala sides, and found that they are 
quite different, which damage the parity and make the codebase hard to maintain.

The main inconvenience is that most models in pyspark do not support any param 
getters and setters.


In the py side, I think we need to do:

1, remove setters generated by _shared_params_code_gen.py;

2, add common abstract classes like the side side, such as 
JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier;

3, for each alg, add its param trait, such as LinearSVCParams;

4, since sharedParam do not have setters, we need to add them in right places;


Unfortunately, I notice that if we do 1 (remove setters generated by 
_shared_params_code_gen.py), all algs 
(classification/regression/clustering/features/fpm/recommendation) need to be 
modified in one batch.


The scala side also need some small improvements, but I think they can be leave 
alone at first, to avoid a lot of MiMa Failures.


> pyspark.ml function parity
> --
>
> Key: SPARK-28958
> URL: https://issues.apache.org/jira/browse/SPARK-28958
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
> Attachments: ML_SYNC.pdf
>
>
> I looked into the hierarchy of both py and scala sides, and found that they 
> are quite different, which damages the parity and makes the codebase hard to 
> maintain.
> The main inconvenience is that most models in pyspark do not support any 
> param getters and setters.
> In the py side, I think we need to do:
> 1, remove setters generated by _shared_params_code_gen.py;
> 2, add common abstract classes like the scala side, such as 
> JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier;
> 3, for each alg, add its param trait, such as LinearSVCParams;
> 4, since sharedParam do not have setters, we need to add them in right places;
> Unfortunately, I notice that if we do 1 (remove setters generated by 
> _shared_params_code_gen.py), all algs 
> (classification/regression/clustering/features/fpm/recommendation) need to be 
> modified in one batch.
> The scala side also needs some small improvements, but I think they can be 
> left alone at first



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28985) Pyspark ClassificationModel and RegressionModel support column setters/getters/predict

2019-09-05 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-28985:


 Summary: Pyspark ClassificationModel and RegressionModel support 
column setters/getters/predict
 Key: SPARK-28985
 URL: https://issues.apache.org/jira/browse/SPARK-28985
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


1, add common abstract classes like ClassificationModel & 
ProbabilisticClassificationModel

2, add column setters/getters, and predict method

3, update the test suites to verify newly added functions



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28969) OneVsRestModel in the py side should not set WeightCol and Classifier

2019-09-04 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-28969:
-
Parent: SPARK-28958
Issue Type: Sub-task  (was: Improvement)

> OneVsRestModel in the py side should not set WeightCol and Classifier
> -
>
> Key: SPARK-28969
> URL: https://issues.apache.org/jira/browse/SPARK-28969
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> 'WeightCol' and 'Classifier' can only be set in the estimator.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28969) OneVsRestModel in the py side should not set WeightCol and Classifier

2019-09-04 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922994#comment-16922994
 ] 

zhengruifeng commented on SPARK-28969:
--

friendly ping [~huaxingao]

> OneVsRestModel in the py side should not set WeightCol and Classifier
> -
>
> Key: SPARK-28969
> URL: https://issues.apache.org/jira/browse/SPARK-28969
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> 'WeightCol' and 'Classifier' can only be set in the estimator.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28969) OneVsRestModel in the py side should not set WeightCol and Classifier

2019-09-03 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-28969:


 Summary: OneVsRestModel in the py side should not set WeightCol 
and Classifier
 Key: SPARK-28969
 URL: https://issues.apache.org/jira/browse/SPARK-28969
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


'WeightCol' and 'Classifier' can only be set in the estimator.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28968) Add HasNumFeatures in the scala side

2019-09-03 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-28968:


 Summary: Add HasNumFeatures in the scala side
 Key: SPARK-28968
 URL: https://issues.apache.org/jira/browse/SPARK-28968
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


In the py side, HasNumFeatures is provided and inherited by 'HashingTF' and 
'FeatureHasher'.

It is reasonable to also add HasNumFeatures in the scala side.

Since '1<<18' is used by default in all places, we should add it as a default 
in the param trait.
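A minimal sketch of what the shared trait could look like (an assumption, mirroring other shared params; not necessarily the final signature):
{code:java}
import org.apache.spark.ml.param.{IntParam, Params, ParamValidators}

trait HasNumFeatures extends Params {

  // number of features, with 1 << 18 as the shared default
  final val numFeatures: IntParam = new IntParam(this, "numFeatures",
    "Number of features. Should be greater than 0.", ParamValidators.gt(0))

  setDefault(numFeatures -> (1 << 18))

  final def getNumFeatures: Int = $(numFeatures)
}
{code}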



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28958) pyspark.ml function parity

2019-09-03 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-28958:
-
Attachment: ML_SYNC.pdf

> pyspark.ml function parity
> --
>
> Key: SPARK-28958
> URL: https://issues.apache.org/jira/browse/SPARK-28958
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
> Attachments: ML_SYNC.pdf
>
>
> I looked into the hierarchy of both py and scala sides, and found that they 
> are quite different, which damages the parity and makes the codebase hard to 
> maintain.
> The main inconvenience is that most models in pyspark do not support any 
> param getters and setters.
> In the py side, I think we need to do:
> 1, remove setters generated by _shared_params_code_gen.py;
> 2, add common abstract classes like the scala side, such as 
> JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier;
> 3, for each alg, add its param trait, such as LinearSVCParams;
> 4, since sharedParam do not have setters, we need to add them in right places;
> Unfortunately, I notice that if we do 1 (remove setters generated by 
> _shared_params_code_gen.py), all algs 
> (classification/regression/clustering/features/fpm/recommendation) need to be 
> modified in one batch.
> The scala side also needs some small improvements, but I think they can be 
> left alone at first, to avoid a lot of MiMa Failures.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28958) pyspark.ml function parity

2019-09-03 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-28958:


 Summary: pyspark.ml function parity
 Key: SPARK-28958
 URL: https://issues.apache.org/jira/browse/SPARK-28958
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


I looked into the hierarchy of both py and scala sides, and found that they are 
quite different, which damages the parity and makes the codebase hard to maintain.

The main inconvenience is that most models in pyspark do not support any param 
getters and setters.


In the py side, I think we need to do:

1, remove setters generated by _shared_params_code_gen.py;

2, add common abstract classes like the scala side, such as 
JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier;

3, for each alg, add its param trait, such as LinearSVCParams;

4, since sharedParam do not have setters, we need to add them in right places;


Unfortunately, I notice that if we do 1 (remove setters generated by 
_shared_params_code_gen.py), all algs 
(classification/regression/clustering/features/fpm/recommendation) need to be 
modified in one batch.


The scala side also needs some small improvements, but I think they can be left 
alone at first, to avoid a lot of MiMa Failures.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28372) Document Spark WEB UI

2019-09-02 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921151#comment-16921151
 ] 

zhengruifeng commented on SPARK-28372:
--

[~smilegator] I think we may need to add a subtask for streaming, as 
[~planga82] suggested.

> Document Spark WEB UI
> -
>
> Key: SPARK-28372
> URL: https://issues.apache.org/jira/browse/SPARK-28372
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, Web UI
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> Spark web UIs are being used to monitor the status and resource consumption 
> of your Spark applications and clusters. However, we do not have the 
> corresponding document. It is hard for end users to use and understand them. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28373) Document JDBC/ODBC Server page

2019-09-02 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921138#comment-16921138
 ] 

zhengruifeng commented on SPARK-28373:
--

[~planga82] Thanks!:D

> Document JDBC/ODBC Server page
> --
>
> Key: SPARK-28373
> URL: https://issues.apache.org/jira/browse/SPARK-28373
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Web UI
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> !https://user-images.githubusercontent.com/5399861/60809590-9dcf2500-a1bd-11e9-826e-33729bb97daf.png|width=1720,height=503!
>  
> [https://github.com/apache/spark/pull/25062] added a new column CLOSE TIME 
> and EXECUTION TIME. It is hard to understand the difference. We need to 
> document them; otherwise, it is hard for end users to understand them
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28373) Document JDBC/ODBC Server page

2019-09-01 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920599#comment-16920599
 ] 

zhengruifeng commented on SPARK-28373:
--

[~smilegator] [~yumwang]  I am afraid I have no time to do it this week.  

[~planga82]  Could you please take it over?

> Document JDBC/ODBC Server page
> --
>
> Key: SPARK-28373
> URL: https://issues.apache.org/jira/browse/SPARK-28373
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Web UI
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> !https://user-images.githubusercontent.com/5399861/60809590-9dcf2500-a1bd-11e9-826e-33729bb97daf.png|width=1720,height=503!
>  
> [https://github.com/apache/spark/pull/25062] added a new column CLOSE TIME 
> and EXECUTION TIME. It is hard to understand the difference. We need to 
> document them; otherwise, it is hard for end users to understand them
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28858) add tree-based transformation in the py side

2019-08-23 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-28858:


 Summary: add tree-based transformation in the py side
 Key: SPARK-28858
 URL: https://issues.apache.org/jira/browse/SPARK-28858
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


Expose the newly added tree-based transformation on the py side



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28780) Delete the incorrect setWeightCol method in LinearSVCModel

2019-08-20 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-28780:


 Summary: Delete the incorrect setWeightCol method in LinearSVCModel
 Key: SPARK-28780
 URL: https://issues.apache.org/jira/browse/SPARK-28780
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.4.0, 2.3.0, 2.2.0, 3.0.0
Reporter: zhengruifeng


1, the weightCol is only used in training, and should not be settable on 
LinearSVCModel;

2, the method 'def setWeightCol(value: Double): this.type = set(threshold, 
value)' is wrongly defined, since the value should be a String, and weightCol 
rather than threshold should be set.
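For reference, the weight column is meant to be set on the estimator before fitting (illustrative snippet; `train` is assumed to be a DataFrame with a "weight" column):
{code:java}
import org.apache.spark.ml.classification.LinearSVC

// weightCol is a fit-time param: set it on the estimator, not on the model
// val model = new LinearSVC().setWeightCol("weight").fit(train)
{code}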



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28542) Document Stages page

2019-08-18 Thread zhengruifeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910145#comment-16910145
 ] 

zhengruifeng commented on SPARK-28542:
--

[~planga82]  Just go ahead! Thanks!

> Document Stages page
> 
>
> Key: SPARK-28542
> URL: https://issues.apache.org/jira/browse/SPARK-28542
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28373) Document JDBC/ODBC Server page

2019-08-13 Thread zhengruifeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905937#comment-16905937
 ] 

zhengruifeng commented on SPARK-28373:
--

[~yumwang]  I have just created a page in 
https://issues.apache.org/jira/browse/SPARK-28538; you can add the related doc 
to it. Thanks.

> Document JDBC/ODBC Server page
> --
>
> Key: SPARK-28373
> URL: https://issues.apache.org/jira/browse/SPARK-28373
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Web UI
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> !https://user-images.githubusercontent.com/5399861/60809590-9dcf2500-a1bd-11e9-826e-33729bb97daf.png|width=1720,height=503!
>  
> [https://github.com/apache/spark/pull/25062] added a new column CLOSE TIME 
> and EXECUTION TIME. It is hard to understand the difference. We need to 
> document them; otherwise, it is hard for end users to understand them
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28543) Document Spark Jobs page

2019-08-13 Thread zhengruifeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905935#comment-16905935
 ] 

zhengruifeng commented on SPARK-28543:
--

[~planga82] I have just created a page in 
https://issues.apache.org/jira/browse/SPARK-28538; you can add the related doc 
to it.

> Document Spark Jobs page
> 
>
> Key: SPARK-28543
> URL: https://issues.apache.org/jira/browse/SPARK-28543
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28579) MaxAbsScaler avoids conversion to breeze.vector

2019-07-31 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28579:


 Summary: MaxAbsScaler avoids conversion to breeze.vector
 Key: SPARK-28579
 URL: https://issues.apache.org/jira/browse/SPARK-28579
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


In the current impl, MaxAbsScaler converts each vector to a breeze.vector during 
transformation.

This conversion should be skipped.
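A rough sketch of the idea (an assumption only, and it ignores the sparse fast path the real transformer would keep): scale directly on ml vectors instead of round-tripping through breeze.
{code:java}
import org.apache.spark.ml.linalg.{Vector, Vectors}

// maxAbs(i) is the per-feature max absolute value collected at fit time
def scaleByMaxAbs(v: Vector, maxAbs: Array[Double]): Vector = {
  val scaled = Array.tabulate(v.size) { i =>
    if (maxAbs(i) == 0.0) 0.0 else v(i) / maxAbs(i)
  }
  Vectors.dense(scaled)
}
{code}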



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28514) Remove the redundant transformImpl method in RF & GBT

2019-07-25 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28514:


 Summary: Remove the redundant transformImpl method in RF & GBT
 Key: SPARK-28514
 URL: https://issues.apache.org/jira/browse/SPARK-28514
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


1, In GBTClassifier & RandomForestClassifier, the real transform methods 
inherit from ProbabilisticClassificationModel which can deal with multiple output 
columns.

The transformImpl method, which deals with only one column - predictionCol, 
completely does nothing. This is quite confusing.

 

2, In GBTRegressor & RandomForestRegressor, the transformImpl does exactly what 
the superclass PredictionModel does (except model broadcasting), so it can be 
removed.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28499) Optimize MinMaxScaler

2019-07-24 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-28499:
-
Description: 
The current impl of MinMaxScaler has a few small places to be optimized:

1, avoid calling param getters in the udf.

If I remember correctly, there were some tickets and PRs about this: calling a 
param getter in a udf or map function will significantly slow down the 
computation.

2, for a constant dim, the transformed value is also a constant value, which 
can be precomputed.

3, for a usual dim (the i-th), the value is updated by

values(i) = (values(i) - minArray(i)) / range(i) * scale + $(min)

here, we can precompute  scale / range, so that a division can be skipped.

  was:
current impl of MinMaxScaler has some small places to be optimized:

1, avoid call param getter in udf.

If I remember correctly, there was some tickets and prs about this, calling 
param getter in udf or map function, will significantly slow down the 
computation.

2, for a constant dim, the transformed value is also a constant value, which 
can be precomputed.

3, for a usual dim (i-th), the value is update by

values(i) = (values(i) - minArray(i)) / range(i) * scale + $(min)

here, we can precompute range * scale, so that a division can be skipped.


> Optimize MinMaxScaler
> -
>
> Key: SPARK-28499
> URL: https://issues.apache.org/jira/browse/SPARK-28499
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> The current impl of MinMaxScaler has a few small places to be optimized:
> 1, avoid calling param getters in the udf.
> If I remember correctly, there were some tickets and PRs about this: calling a 
> param getter in a udf or map function will significantly slow down the 
> computation.
> 2, for a constant dim, the transformed value is also a constant value, which 
> can be precomputed.
> 3, for a usual dim (the i-th), the value is updated by
> values(i) = (values(i) - minArray(i)) / range(i) * scale + $(min)
> here, we can precompute  scale / range, so that a division can be skipped.
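An illustrative sketch of the precomputation (an assumption, not the merged patch): ratios(i) = scale / range(i) is computed once, so the per-row update only needs a multiplication.
{code:java}
// minArray / maxArray are the per-feature min and max collected at fit time;
// lower / upper correspond to $(min) and $(max)
def precomputeRatios(minArray: Array[Double], maxArray: Array[Double],
    lower: Double, upper: Double): Array[Double] = {
  val scale = upper - lower
  Array.tabulate(minArray.length) { i =>
    val range = maxArray(i) - minArray(i)
    if (range != 0.0) scale / range else 0.0
  }
}

// per-row update: values(i) = (values(i) - minArray(i)) * ratios(i) + lower
{code}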



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28499) Optimize MinMaxScaler

2019-07-24 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28499:


 Summary: Optimize MinMaxScaler
 Key: SPARK-28499
 URL: https://issues.apache.org/jira/browse/SPARK-28499
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


The current impl of MinMaxScaler has a few small places to be optimized:

1, avoid calling param getters in the udf.

If I remember correctly, there were some tickets and PRs about this: calling a 
param getter in a udf or map function will significantly slow down the 
computation.

2, for a constant dim, the transformed value is also a constant value, which 
can be precomputed.

3, for a usual dim (the i-th), the value is updated by

values(i) = (values(i) - minArray(i)) / range(i) * scale + $(min)

here, we can precompute range * scale, so that a division can be skipped.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13677) Support Tree-Based Feature Transformation for ML

2019-07-24 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-13677:
-
Description: 
It would be nice to be able to use RF and GBT for feature transformation:
 First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on 
the training set. Then each leaf of each tree in the ensemble is assigned a 
fixed arbitrary feature index in a new feature space. These leaf indices are 
then encoded in a one-hot fashion.

This method was first introduced by 
facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is 
implemented in famous libraries:

sklearn   
[apply|[http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]]

xgboost  
[predict_leaf_index|[https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]]

lightgbm 
[predict_leaf_index|https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_leaf_index]

catboost 
[calc_leaf_index|https://github.com/catboost/tutorials/tree/master/leaf_indexes_calculation]

 

 

Referring to the design of the above impls, I propose the following API:

val model1 : DecisionTreeClassificationModel= ...

model1.setLeafCol("leaves")
 model1.transform(df)

 

val model2 : GBTClassificationModel = ...

model2.getLeafCol
 model2.transform(df)

 

 The detailed design doc: 
[https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing]

  was:
It would be nice to be able to use RF and GBT for feature transformation:
 First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on 
the training set. Then each leaf of each tree in the ensemble is assigned a 
fixed arbitrary feature index in a new feature space. These leaf indices are 
then encoded in a one-hot fashion.

This method was first introduced by 
facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is 
implemented in famous libraries:

sklearn   
[apply|[http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]]

xgboost  
[lpredict_leaf_index|[https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]]

lightgbm 
[predict_leaf_index|https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_leaf_index]

catboost 
[calc_leaf_index|https://github.com/catboost/tutorials/tree/master/leaf_indexes_calculation]

 

 

Refering to the design of above impls, I propose following api:

val model1 : DecisionTreeClassificationModel= ...

model1.setLeafCol("leaves")
 model1.transform(df)

 

val model2 : GBTClassificationModel = ...

model2.getLeafCol
 model2.transform(df)

 

 The detailed design doc: 
[https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing]


> Support Tree-Based Feature Transformation for ML
> 
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Priority: Major
>
> It would be nice to be able to use RF and GBT for feature transformation:
>  First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by 
> facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is 
> implemented in famous libraries:
> sklearn   
> [apply|[http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]]
> xgboost  
> [predict_leaf_index|[https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]]
> lightgbm 
> [predict_leaf_index|https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_leaf_index]
> catboost 
> [calc_leaf_index|https://github.com/catboost/tutorials/tree/master/leaf_indexes_calculation]
>  
>  
> Refering to the design of above impls, I propose following api:
> val model1 : DecisionTreeClassificationModel= ...
> model1.setLeafCol("leaves")
>  model1.transform(df)
>  
> val model2 : GBTClassificationModel = ...
> model2.getLeafCol
>  model2.transform(df)
>  
>  The detailed design doc: 
> [https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13677) Support Tree-Based Feature Transformation for ML

2019-07-24 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-13677:
-
Description: 
It would be nice to be able to use RF and GBT for feature transformation:
 First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on 
the training set. Then each leaf of each tree in the ensemble is assigned a 
fixed arbitrary feature index in a new feature space. These leaf indices are 
then encoded in a one-hot fashion.

This method was first introduced by 
facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is 
implemented in famous libraries:

sklearn   
[apply|[http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]]

xgboost  
[predict_leaf_index|[https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]]

lightgbm 
[predict_leaf_index|https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_leaf_index]

catboost 
[calc_leaf_index|https://github.com/catboost/tutorials/tree/master/leaf_indexes_calculation]

 

 

Referring to the design of the above impls, I propose the following API:

val model1 : DecisionTreeClassificationModel= ...

model1.setLeafCol("leaves")
 model1.transform(df)

 

val model2 : GBTClassificationModel = ...

model2.getLeafCol
 model2.transform(df)

 

 The detailed design doc: 
[https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing]

  was:
It would be nice to be able to use RF and GBT for feature transformation:
 First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on 
the training set. Then each leaf of each tree in the ensemble is assigned a 
fixed arbitrary feature index in a new feature space. These leaf indices are 
then encoded in a one-hot fashion.

This method was first introduced by 
facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is 
implemented in two famous library:
 sklearn 
([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py])
 xgboost 
([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py])

 

api:

val model1 : DecisionTreeClassificationModel= ...

model1.setLeafCol("leaves")
 model1.transform(df)

 

val model2 : GBTClassificationModel = ...

model2.getLeafCol
 model2.transform(df)

 

 

design doc: 
[https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing]


> Support Tree-Based Feature Transformation for ML
> 
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Priority: Major
>
> It would be nice to be able to use RF and GBT for feature transformation:
>  First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by 
> facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is 
> implemented in famous libraries:
> sklearn   
> [apply|[http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]]
> xgboost  
> [lpredict_leaf_index|[https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]]
> lightgbm 
> [predict_leaf_index|https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_leaf_index]
> catboost 
> [calc_leaf_index|https://github.com/catboost/tutorials/tree/master/leaf_indexes_calculation]
>  
>  
> Refering to the design of above impls, I propose following api:
> val model1 : DecisionTreeClassificationModel= ...
> model1.setLeafCol("leaves")
>  model1.transform(df)
>  
> val model2 : GBTClassificationModel = ...
> model2.getLeafCol
>  model2.transform(df)
>  
>  The detailed design doc: 
> [https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28421) SparseVector.apply performance optimization

2019-07-17 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28421:


 Summary: SparseVector.apply performance optimization
 Key: SPARK-28421
 URL: https://issues.apache.org/jira/browse/SPARK-28421
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


The current impl of SparseVector.apply is inefficient:

on each call, a breeze.linalg.SparseVector and a 
breeze.collection.mutable.SparseArray are created internally, and then a 
binary search is used to locate the requested position.

 

This place should be optimized like .ml.SparseMatrix, which uses binary search 
directly, without conversion to breeze.linalg.Matrix.

 

I tested the performance and found that if we avoid the internal conversions, 
a 2.5~5X speedup can be obtained.
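
A minimal sketch of the optimized lookup (a standalone helper for illustration, not the actual patch): binary-search the sparse indices directly, with no breeze conversion on each call.

{code:scala}
// size/indices/values are the fields of a SparseVector
def sparseApply(size: Int, indices: Array[Int], values: Array[Double], i: Int): Double = {
  require(i >= 0 && i < size, s"index $i out of bounds [0, $size)")
  val j = java.util.Arrays.binarySearch(indices, i)
  if (j >= 0) values(j) else 0.0
}
{code}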



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28399) Impl RobustScaler

2019-07-15 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-28399:
-
Issue Type: New Feature  (was: Improvement)

> Impl RobustScaler
> -
>
> Key: SPARK-28399
> URL: https://issues.apache.org/jira/browse/SPARK-28399
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> RobustScaler is a kind of widely-used scaler, which use median/IQR to replace 
> mean/std in StandardScaler. It can produce stable result that are much more 
> robust to outliers. It is already a part of 
> [Scikit-Learn|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler].
> So far, it is now implemented in ML.
> I encounter a practical case that need this feature, and notice that other 
> users also wanted this function in SPARK-17934, so I am to add it in ML.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28399) Impl RobustScaler

2019-07-15 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28399:


 Summary: Impl RobustScaler
 Key: SPARK-28399
 URL: https://issues.apache.org/jira/browse/SPARK-28399
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


RobustScaler is a widely-used scaler which uses the median/IQR to replace the 
mean/std in StandardScaler. It can produce stable results that are much more 
robust to outliers. It is already a part of 
[Scikit-Learn|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler].

So far, it is not implemented in ML.

I encountered a practical case that needs this feature, and noticed that other 
users also wanted this function in SPARK-17934, so I plan to add it to ML.
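
For illustration, a minimal per-feature sketch of the scaling rule, assuming the median and IQR (Q3 - Q1) have already been computed per column:

{code:scala}
// constant features (IQR == 0) are simply mapped to 0.0 in this sketch
def robustScale(x: Double, median: Double, iqr: Double): Double =
  if (iqr != 0.0) (x - median) / iqr else 0.0
{code}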



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27656) Safely register class for GraphX

2019-06-25 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-27656.
--
Resolution: Not A Problem

> Safely register class for GraphX
> 
>
> Key: SPARK-27656
> URL: https://issues.apache.org/jira/browse/SPARK-27656
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.4.3
>Reporter: zhengruifeng
>Priority: Major
>
> GraphX common classes (such as: Edge, EdgeTriplet) are not registered in Kryo 
> by default.
> Users can register those classes via 
> {{GraphXUtils.{color:#ffc66d}registerKryoClasses{color}}}, however, it seems 
> that none graphx-lib impls call it, and users tend to ignore this 
> registration.
> So I prefer to safely register them in \{{KryoSerializer.scala}}, like what  
> SQL and ML do.
>  
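
For context, the existing explicit registration path mentioned above looks like this in user code, applied to the SparkConf before the SparkContext is created:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.graphx.GraphXUtils

val conf = new SparkConf()
  .setAppName("graphx-kryo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// registers the common GraphX classes with Kryo
GraphXUtils.registerKryoClasses(conf)
{code}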



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28159) Make the transform natively in ml framework to avoid extra conversion

2019-06-25 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-28159:
-
Description: 
It has been a long time since ML was released.

However, there are still many TODOs (like in 
[ChiSqSelector.scala|https://github.com/apache/spark/pull/24963/files#diff-9b0bc8a01b34c38958ce45c14f9c5da5]
 {// TODO: Make the transformer natively in ml framework to avoid extra 
conversion.}) about making the transform work natively in the ml framework.

 

I am trying to make ml algs no longer need to convert ml vectors to mllib 
vectors in their transforms.

Including: 
LDA/ChiSqSelector/ElementwiseProduct/HashingTF/IDF/Normalizer/PCA/StandardScaler.
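
As a minimal sketch of the idea, for Normalizer the transform can operate on ml.linalg vectors directly inside the udf, with no ml-to-mllib conversion (simplified, not the actual code):

{code:scala}
import org.apache.spark.ml.linalg.{Vector, Vectors}

def normalize(v: Vector, p: Double): Vector = {
  val norm = Vectors.norm(v, p)
  if (norm != 0.0) Vectors.dense(v.toArray.map(_ / norm)) else v
}
{code}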

 

  was:
It is a long time since ML was released.

However, there are still many TODOs on making transform natively in ml 
framework.


> Make the transform natively in ml framework to avoid extra conversion
> -
>
> Key: SPARK-28159
> URL: https://issues.apache.org/jira/browse/SPARK-28159
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> It is a long time since ML was released.
> However, there are still many TODOs (like in 
> [ChiSqSelector.scala|https://github.com/apache/spark/pull/24963/files#diff-9b0bc8a01b34c38958ce45c14f9c5da5]
>  {// TODO: Make the transformer natively in ml framework to avoid extra 
> conversion.}) on making transform natively in ml framework.
>  
> I try to make ml algs no longer need to convert ml-vector to mllib-vector in 
> transforms.
> Including: 
> LDA/ChiSqSelector/ElementwiseProduct/HashingTF/IDF/Normalizer/PCA/StandardScaler.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28159) Make the transform natively in ml framework to avoid extra conversion

2019-06-25 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28159:


 Summary: Make the transform natively in ml framework to avoid 
extra conversion
 Key: SPARK-28159
 URL: https://issues.apache.org/jira/browse/SPARK-28159
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


It has been a long time since ML was released.

However, there are still many TODOs about making the transform work natively in 
the ml framework.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28154) GMM fix double caching

2019-06-24 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28154:


 Summary: GMM fix double caching
 Key: SPARK-28154
 URL: https://issues.apache.org/jira/browse/SPARK-28154
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.4.0, 2.3.0, 3.0.0
Reporter: zhengruifeng


The intermediate RDD is always cached. We should only cache it when necessary. A sketch of the intended pattern follows below.
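
A minimal sketch of the intended pattern ({{extractInstance}} is illustrative): only persist the internal RDD when the input dataset is not already cached, and unpersist whatever was persisted here.

{code:scala}
import org.apache.spark.storage.StorageLevel

val handlePersistence = dataset.storageLevel == StorageLevel.NONE
val instances = dataset.rdd.map(extractInstance)  // extractInstance is illustrative
if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
try {
  // ... run the EM iterations over `instances` ...
} finally {
  if (handlePersistence) instances.unpersist()
}
{code}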



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28117) LDA and BisectingKMeans cache the input dataset if necessary

2019-06-19 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28117:


 Summary: LDA and BisectingKMeans cache the input dataset if 
necessary
 Key: SPARK-28117
 URL: https://issues.apache.org/jira/browse/SPARK-28117
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


In MLLIB-LDA, the EM solver caches the dataset internally, while the Online 
solver does not.

So in ML-LDA, we need to cache the intermediate dataset if necessary.

 

BisectingKMeans needs this too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13677) Support Tree-Based Feature Transformation for ML

2019-06-19 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-13677:
-
Description: 
It would be nice to be able to use RF and GBT for feature transformation:
 First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on 
the training set. Then each leaf of each tree in the ensemble is assigned a 
fixed arbitrary feature index in a new feature space. These leaf indices are 
then encoded in a one-hot fashion.

This method was first introduced by 
facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is 
implemented in two famous libraries:
 sklearn 
([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py])
 xgboost 
([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py])

 

api:

val model1 : DecisionTreeClassificationModel= ...

model1.setLeafCol("leaves")
 model1.transform(df)

 

val model2 : GBTClassificationModel = ...

model2.getLeafCol
 model2.transform(df)

 

 

design doc: 
[https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing]

  was:
It would be nice to be able to use RF and GBT for feature transformation:
 First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on 
the training set. Then each leaf of each tree in the ensemble is assigned a 
fixed arbitrary feature index in a new feature space. These leaf indices are 
then encoded in a one-hot fashion.

This method was first introduced by 
facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is 
implemented in two famous library:
 sklearn 
([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py])
 xgboost 
([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py])

I have implement it in mllib:


 val model1 : DecisionTreeClassificationModel= ...

model1.setLeafCol("leaves")
model1.transform(df)

val model2 : GBTClassificationModel = ...
model2.transform(df)

 

 

design doc: 
https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing


> Support Tree-Based Feature Transformation for ML
> 
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Priority: Major
>
> It would be nice to be able to use RF and GBT for feature transformation:
>  First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by 
> facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is 
> implemented in two famous library:
>  sklearn 
> ([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py])
>  xgboost 
> ([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py])
>  
> api:
> val model1 : DecisionTreeClassificationModel= ...
> model1.setLeafCol("leaves")
>  model1.transform(df)
>  
> val model2 : GBTClassificationModel = ...
> model2.getLeafCol
>  model2.transform(df)
>  
>  
> design doc: 
> [https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13677) Support Tree-Based Feature Transformation for ML

2019-06-19 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-13677:
-
Priority: Major  (was: Minor)

> Support Tree-Based Feature Transformation for ML
> 
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Priority: Major
>
> It would be nice to be able to use RF and GBT for feature transformation:
>  First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by 
> facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is 
> implemented in two famous library:
>  sklearn 
> ([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py])
>  xgboost 
> ([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py])
> I have implement it in mllib:
>  val model1 : DecisionTreeClassificationModel= ...
> model1.setLeafCol("leaves")
> model1.transform(df)
> val model2 : GBTClassificationModel = ...
> model2.transform(df)
>  
>  
> design doc: 
> https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13677) Support Tree-Based Feature Transformation for ML

2019-06-19 Thread zhengruifeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867507#comment-16867507
 ] 

zhengruifeng commented on SPARK-13677:
--

I closed this ticket since the old PR was based on the mllib API, and at that time 
the impl of trees was being refactored and implemented directly in ml.

I am reopening it now since I have re-designed it on the ml side.

> Support Tree-Based Feature Transformation for ML
> 
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> It would be nice to be able to use RF and GBT for feature transformation:
>  First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by 
> facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is 
> implemented in two famous library:
>  sklearn 
> ([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py])
>  xgboost 
> ([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py])
> I have implement it in mllib:
>  val model1 : DecisionTreeClassificationModel= ...
> model1.setLeafCol("leaves")
> model1.transform(df)
> val model2 : GBTClassificationModel = ...
> model2.transform(df)
>  
>  
> design doc: 
> https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13677) Support Tree-Based Feature Transformation for ML

2019-06-19 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reopened SPARK-13677:
--

update the design

> Support Tree-Based Feature Transformation for ML
> 
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> It would be nice to be able to use RF and GBT for feature transformation:
>  First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by 
> facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is 
> implemented in two famous library:
>  sklearn 
> ([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py])
>  xgboost 
> ([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py])
> I have implement it in mllib:
>  val model1 : DecisionTreeClassificationModel= ...
> model1.setLeafCol("leaves")
> model1.transform(df)
> val model2 : GBTClassificationModel = ...
> model2.transform(df)
>  
>  
> design doc: 
> https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13677) Support Tree-Based Feature Transformation for ML

2019-06-19 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-13677:
-
Description: 
It would be nice to be able to use RF and GBT for feature transformation:
 First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on 
the training set. Then each leaf of each tree in the ensemble is assigned a 
fixed arbitrary feature index in a new feature space. These leaf indices are 
then encoded in a one-hot fashion.

This method was first introduced by 
facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is 
implemented in two famous libraries:
 sklearn 
([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py])
 xgboost 
([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py])

I have implemented it in mllib:


 val model1 : DecisionTreeClassificationModel= ...

model1.setLeafCol("leaves")
model1.transform(df)

val model2 : GBTClassificationModel = ...
model2.transform(df)

 

 

design doc: 
https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing

  was:
It would be nice to be able to use RF and GBT for feature transformation:
First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on 
the training set. Then each leaf of each tree in the ensemble is assigned a 
fixed arbitrary feature index in a new feature space. These leaf indices are 
then encoded in a one-hot fashion.

This method was first introduced by 
facebook(http://www.herbrich.me/papers/adclicksfacebook.pdf), and is 
implemented in two famous library:
sklearn 
(http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py)
xgboost 
(https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)

I have implement it in mllib:

val features : RDD[Vector] = ...
val model1 : RandomForestModel = ...
val transformed1 : RDD[Vector] = model1.leaf(features)

val model2 : GradientBoostedTreesModel = ...
val transformed2 : RDD[Vector] = model2.leaf(features)




> Support Tree-Based Feature Transformation for ML
> 
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> It would be nice to be able to use RF and GBT for feature transformation:
>  First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by 
> facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is 
> implemented in two famous library:
>  sklearn 
> ([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py])
>  xgboost 
> ([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py])
> I have implement it in mllib:
>  val model1 : DecisionTreeClassificationModel= ...
> model1.setLeafCol("leaves")
> model1.transform(df)
> val model2 : GBTClassificationModel = ...
> model2.transform(df)
>  
>  
> design doc: 
> https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier

2019-06-13 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-27018:
-
Component/s: Spark Core

> Checkpointed RDD deleted prematurely when using GBTClassifier
> -
>
> Key: SPARK-27018
> URL: https://issues.apache.org/jira/browse/SPARK-27018
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core
>Affects Versions: 2.2.2, 2.2.3, 2.3.3, 2.4.0
> Environment: OS: Ubuntu Linux 18.10
> Java: java version "1.8.0_201"
> Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
> Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
> Reproducible with a single-node Spark in standalone mode.
> Reproducible with Zepellin or Spark shell.
>  
>Reporter: Piotr Kołaczkowski
>Priority: Major
> Attachments: 
> Fix_check_if_the_next_checkpoint_exists_before_deleting_the_old_one.patch
>
>
> Steps to reproduce:
> {noformat}
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.ml.classification.GBTClassifier
> case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int)
> sc.setCheckpointDir("/checkpoints")
> val trainingData = sc.parallelize(1 to 2426874, 256).map(x => 
> Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF
> val classifier = new GBTClassifier()
>   .setLabelCol("label")
>   .setFeaturesCol("features")
>   .setProbabilityCol("probability")
>   .setMaxIter(100)
>   .setMaxDepth(10)
>   .setCheckpointInterval(2)
> classifier.fit(trainingData){noformat}
>  
> The last line fails with:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 56.0 failed 10 times, most recent failure: Lost task 0.9 in stage 56.0 
> (TID 12058, 127.0.0.1, executor 0): java.io.FileNotFoundException: 
> /checkpoints/191c9209-0955-440f-8c11-f042bdf7f804/rdd-51
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:63)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:61)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem.com$datastax$bdp$fs$hadoop$DseFileSystem$$translateToHadoopExceptions(DseFileSystem.scala:70)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.input(DseFsInputStream.scala:31)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.openUnderlyingDataSource(DseFsInputStream.scala:39)
> at com.datastax.bdp.fs.hadoop.DseFileSystem.open(DseFileSystem.scala:269)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:292)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> 

[jira] [Resolved] (SPARK-27925) Better control numBins of curves in BinaryClassificationMetrics

2019-06-13 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-27925.
--
Resolution: Not A Problem

> Better control numBins of curves in BinaryClassificationMetrics
> ---
>
> Key: SPARK-27925
> URL: https://issues.apache.org/jira/browse/SPARK-27925
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> In case of large datasets with tens of thousands of partitions, the current 
> curve down-sampling method tends to generate many more bins than the value set 
> by #numBins.
> Since in the current impl grouping is done within partitions, that is to say, 
> each partition contains at least one bin.
> A more reasonable way is to bring the grouping op forward into the sort op; 
> then we can directly set the #bins as the #partitions, and regard one 
> partition as one bin.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28045) add missing RankingEvaluator

2019-06-13 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28045:


 Summary: add missing RankingEvaluator
 Key: SPARK-28045
 URL: https://issues.apache.org/jira/browse/SPARK-28045
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


expose RankingEvaluator



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28044) MulticlassClassificationEvaluator support more metrics

2019-06-13 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28044:


 Summary: MulticlassClassificationEvaluator support more metrics
 Key: SPARK-28044
 URL: https://issues.apache.org/jira/browse/SPARK-28044
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


expose more metrics in evaluator:

weightedTruePositiveRate

weightedFalsePositiveRate

weightedFMeasure

truePositiveRateByLabel

falsePositiveRateByLabel

precisionByLabel

recallByLabel

fMeasureByLabel
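
A hypothetical usage sketch once these are exposed (the {{metricLabel}} param used for the *ByLabel metrics is an assumption here):

{code:scala}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("truePositiveRateByLabel")
  .setMetricLabel(1.0)  // which class to report, assumed new param

val tpr = evaluator.evaluate(predictions)
{code}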



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label

2019-06-11 Thread zhengruifeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860922#comment-16860922
 ] 

zhengruifeng edited comment on SPARK-24875 at 6/11/19 10:59 AM:


The scoring dataset is usually much smaller than the training dataset.

If the scoring data is too huge to perform a simple op like countByValue, how 
could you train/evaluate the model?

I doubt whether it is worth applying an approximation.


was (Author: podongfeng):
The  dataset is usually much smaller than the training dataset 
containing ,

if the score data is to huge to perform a simple op like countByValue, how 
could you train the model?

I doubt whether it is worth to apply a approximation.

> MulticlassMetrics should offer a more efficient way to compute count by label
> -
>
> Key: SPARK-24875
> URL: https://issues.apache.org/jira/browse/SPARK-24875
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.1
>Reporter: Antoine Galataud
>Priority: Minor
>
> Currently _MulticlassMetrics_ calls _countByValue_() to get count by 
> class/label
> {code:java}
> private lazy val labelCountByClass: Map[Double, Long] = 
> predictionAndLabels.values.countByValue()
> {code}
> If input _RDD[(Double, Double)]_ is huge (which can be the case with a large 
> test dataset), it will lead to poor execution performance.
> One option could be to allow using _countByValueApprox_ (could require adding 
> an extra configuration param for MulticlassMetrics).
> Note: since there is no equivalent of _MulticlassMetrics_ in new ML library, 
> I don't know how this could be ported there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label

2019-06-11 Thread zhengruifeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860922#comment-16860922
 ] 

zhengruifeng commented on SPARK-24875:
--

The scoring dataset is usually much smaller than the training dataset.

If the scoring data is too huge to perform a simple op like countByValue, how 
could you train the model?

I doubt whether it is worth applying an approximation.

> MulticlassMetrics should offer a more efficient way to compute count by label
> -
>
> Key: SPARK-24875
> URL: https://issues.apache.org/jira/browse/SPARK-24875
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.1
>Reporter: Antoine Galataud
>Priority: Minor
>
> Currently _MulticlassMetrics_ calls _countByValue_() to get count by 
> class/label
> {code:java}
> private lazy val labelCountByClass: Map[Double, Long] = 
> predictionAndLabels.values.countByValue()
> {code}
> If input _RDD[(Double, Double)]_ is huge (which can be the case with a large 
> test dataset), it will lead to poor execution performance.
> One option could be to allow using _countByValueApprox_ (could require adding 
> an extra configuration param for MulticlassMetrics).
> Note: since there is no equivalent of _MulticlassMetrics_ in new ML library, 
> I don't know how this could be ported there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26185) add weightCol in python MulticlassClassificationEvaluator

2019-06-11 Thread zhengruifeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860919#comment-16860919
 ] 

zhengruifeng commented on SPARK-26185:
--

Seems resolved?

> add weightCol in python MulticlassClassificationEvaluator
> -
>
> Key: SPARK-26185
> URL: https://issues.apache.org/jira/browse/SPARK-26185
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-24101 added weightCol in 
> MulticlassClassificationEvaluator.scala. This Jira will add weightCol in 
> python version of MulticlassClassificationEvaluator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25360) Parallelized RDDs of Ranges could have known partitioner

2019-06-11 Thread zhengruifeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860759#comment-16860759
 ] 

zhengruifeng commented on SPARK-25360:
--

But I think it may be worth implementing a direct version of `sc.range` rather than 
`parallelize` + `mapPartitions`, to simplify the lineage.

> Parallelized RDDs of Ranges could have known partitioner
> 
>
> Key: SPARK-25360
> URL: https://issues.apache.org/jira/browse/SPARK-25360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
> We already have the logic to split up the generator, we could expose the same 
> logic as a partitioner. This would be useful when joining a small 
> parallelized collection with a larger collection and other cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


