[jira] [Created] (SPARK-29566) Imputer should support single-column input/output
zhengruifeng created SPARK-29566: Summary: Imputer should support single-column input/output Key: SPARK-29566 URL: https://issues.apache.org/jira/browse/SPARK-29566 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng Imputer should support single-column input/output -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29565) OneHotEncoder should support single-column input/output
zhengruifeng created SPARK-29565: Summary: OneHotEncoder should support single-column input/output Key: SPARK-29565 URL: https://issues.apache.org/jira/browse/SPARK-29565 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng Current feature algs ({color:#5a6e5a}QuantileDiscretizer/Binarizer/Bucketizer/StringIndexer{color}) are designed to support both single-col & multi-col. And there are already some internal utils (like {color:#c7a65d}checkSingleVsMultiColumnParams{color}) for this. For OneHotEncoder, it's reasonable to support single-col. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
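For illustration, a minimal PySpark sketch of what the requested single-column usage could look like next to the existing multi-column API (the single-column param names {{inputCol}}/{{outputCol}} are an assumption, mirroring the other feature algs):

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.0,), (1.0,), (2.0,)], ["category"])

# Multi-column usage, already supported:
multi = OneHotEncoder(inputCols=["category"], outputCols=["category_vec"])

# Proposed single-column usage (hypothetical until this ticket lands):
single = OneHotEncoder(inputCol="category", outputCol="category_vec")
single.fit(df).transform(df).show()
{code}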
[jira] [Assigned] (SPARK-29093) Remove automatically generated param setters in _shared_params_code_gen.py
[ https://issues.apache.org/jira/browse/SPARK-29093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-29093: Assignee: Huaxin Gao > Remove automatically generated param setters in _shared_params_code_gen.py > -- > > Key: SPARK-29093 > URL: https://issues.apache.org/jira/browse/SPARK-29093 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Major > > The main difference between the scala and py sides comes from the automatically > generated param setters in _shared_params_code_gen.py. > To keep them in sync, we should remove those setters in _shared_.py, and add > the corresponding setters manually. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
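A rough sketch of the intended pattern (class names here are illustrative): the shared params mixin keeps only the Param definition and getter, and each estimator declares its setter explicitly instead of relying on the generated one.

{code:python}
from pyspark.ml.param.shared import HasMaxIter

class _MyEstimatorParams(HasMaxIter):
    """Params-only mixin: inherits the maxIter Param and getMaxIter,
    with no auto-generated setter."""
    pass

class MyEstimator(_MyEstimatorParams):
    def setMaxIter(self, value):
        """Setter added manually, matching the Scala-side API."""
        return self._set(maxIter=value)
{code}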
[jira] [Commented] (SPARK-29093) Remove automatically generated param setters in _shared_params_code_gen.py
[ https://issues.apache.org/jira/browse/SPARK-29093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957601#comment-16957601 ] zhengruifeng commented on SPARK-29093: -- [~huaxingao] Thanks! > Remove automatically generated param setters in _shared_params_code_gen.py > -- > > Key: SPARK-29093 > URL: https://issues.apache.org/jira/browse/SPARK-29093 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > The main difference between the scala and py sides comes from the automatically > generated param setters in _shared_params_code_gen.py. > To keep them in sync, we should remove those setters in _shared_.py, and add > the corresponding setters manually. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29232) RandomForestRegressionModel does not update the parameter maps of the DecisionTreeRegressionModels underneath
[ https://issues.apache.org/jira/browse/SPARK-29232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-29232: Assignee: Huaxin Gao > RandomForestRegressionModel does not update the parameter maps of the > DecisionTreeRegressionModels underneath > - > > Key: SPARK-29232 > URL: https://issues.apache.org/jira/browse/SPARK-29232 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: Jiaqi Guo >Assignee: Huaxin Gao >Priority: Major > > We trained a RandomForestRegressionModel, and tried to access the trees. Even > though the DecisionTreeRegressionModel is correctly built with the proper > parameters from random forest, the parameter map is not updated, and still > contains only the default value. > For example, if a RandomForestRegressor was trained with maxDepth of 12, then > accessing the tree information, extractParamMap still returns the default > values, with maxDepth=5. Calling the depth itself of > DecisionTreeRegressionModel returns the correct value of 12 though. > This creates issues when we want to access each individual tree and build the > trees back up for the random forest estimator. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
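The report is against the Scala API; an analogous PySpark sketch of the reported behaviour (data and column names are illustrative) would be:

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import RandomForestRegressor

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([float(i)]), float(i)) for i in range(100)],
    ["features", "label"])

model = RandomForestRegressor(maxDepth=12, numTrees=3).fit(df)
tree = model.trees[0]

print(tree.depth)              # depth actually used when fitting (bounded by maxDepth=12)
print(tree.extractParamMap())  # per the report, still shows the default maxDepth=5
{code}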
[jira] [Resolved] (SPARK-29232) RandomForestRegressionModel does not update the parameter maps of the DecisionTreeRegressionModels underneath
[ https://issues.apache.org/jira/browse/SPARK-29232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-29232. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26154 [https://github.com/apache/spark/pull/26154] > RandomForestRegressionModel does not update the parameter maps of the > DecisionTreeRegressionModels underneath > - > > Key: SPARK-29232 > URL: https://issues.apache.org/jira/browse/SPARK-29232 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: Jiaqi Guo >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > We trained a RandomForestRegressionModel, and tried to access the trees. Even > though the DecisionTreeRegressionModel is correctly built with the proper > parameters from random forest, the parameter map is not updated, and still > contains only the default value. > For example, if a RandomForestRegressor was trained with maxDepth of 12, then > accessing the tree information, extractParamMap still returns the default > values, with maxDepth=5. Calling the depth itself of > DecisionTreeRegressionModel returns the correct value of 12 though. > This creates issues when we want to access each individual tree and build the > trees back up for the random forest estimator. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29489) ml.evaluation support log-loss
[ https://issues.apache.org/jira/browse/SPARK-29489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-29489: Assignee: zhengruifeng > ml.evaluation support log-loss > -- > > Key: SPARK-29489 > URL: https://issues.apache.org/jira/browse/SPARK-29489 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > > {color:#5a6e5a}log-loss (aka logistic loss or cross-entropy loss) is one of > the most widely used metrics in classification tasks. It is already implemented in > popular libraries like sklearn. > {color} > {color:#5a6e5a}However, it is missing from ml.evaluation so far. > {color} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29489) ml.evaluation support log-loss
[ https://issues.apache.org/jira/browse/SPARK-29489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-29489. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26135 [https://github.com/apache/spark/pull/26135] > ml.evaluation support log-loss > -- > > Key: SPARK-29489 > URL: https://issues.apache.org/jira/browse/SPARK-29489 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > Fix For: 3.0.0 > > > {color:#5a6e5a}log-loss (aka logistic loss or cross-entropy loss) is one of > the most widely used metrics in classification tasks. It is already implemented in > popular libraries like sklearn. > {color} > {color:#5a6e5a}However, it is missing from ml.evaluation so far. > {color} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
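For reference, log-loss is simply the mean negative log-probability assigned to the true class; a tiny NumPy sketch (the clipping constant is an assumption to avoid log(0), as most libraries do):

{code:python}
import numpy as np

def log_loss(labels, probabilities, eps=1e-15):
    """labels: (n,) array of true class indices.
    probabilities: (n, k) array of predicted class probabilities."""
    p = np.clip(probabilities, eps, 1.0)
    return float(np.mean(-np.log(p[np.arange(len(labels)), labels])))

y = np.array([0, 1, 1])
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])
print(log_loss(y, probs))  # ~0.28
{code}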
[jira] [Assigned] (SPARK-23578) Add multicolumn support for Binarizer
[ https://issues.apache.org/jira/browse/SPARK-23578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-23578: Assignee: zhengruifeng > Add multicolumn support for Binarizer > - > > Key: SPARK-23578 > URL: https://issues.apache.org/jira/browse/SPARK-23578 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Teng Peng >Assignee: zhengruifeng >Priority: Minor > > [Spark-20542] added an API to Bucketizer that can bin multiple columns. > Based on this change, multicolumn support is added for Binarizer. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23578) Add multicolumn support for Binarizer
[ https://issues.apache.org/jira/browse/SPARK-23578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-23578. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26064 [https://github.com/apache/spark/pull/26064] > Add multicolumn support for Binarizer > - > > Key: SPARK-23578 > URL: https://issues.apache.org/jira/browse/SPARK-23578 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Teng Peng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > [Spark-20542] added an API to Bucketizer that can bin multiple columns. > Based on this change, multicolumn support is added for Binarizer. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
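A minimal PySpark sketch of the multi-column usage this adds (the param names {{inputCols}}/{{outputCols}}/{{thresholds}} are assumed to mirror Bucketizer's multi-column API):

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.feature import Binarizer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.1, 0.8), (0.4, 0.3)], ["f1", "f2"])

# Assumed multi-column params, one threshold per input column:
binarizer = Binarizer(inputCols=["f1", "f2"],
                      outputCols=["f1_bin", "f2_bin"],
                      thresholds=[0.5, 0.5])
binarizer.transform(df).show()
{code}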
[jira] [Created] (SPARK-29489) ml.evaluation support log-loss
zhengruifeng created SPARK-29489: Summary: ml.evaluation support log-loss Key: SPARK-29489 URL: https://issues.apache.org/jira/browse/SPARK-29489 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng {color:#5a6e5a}log-loss (aka logistic loss or cross-entropy loss) is one of the most widely used metrics in classification tasks. It is already implemented in popular libraries like sklearn. {color} {color:#5a6e5a}However, it is missing from ml.evaluation so far. {color} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29381) Add 'private' _XXXParams classes for classification & regression
[ https://issues.apache.org/jira/browse/SPARK-29381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951657#comment-16951657 ] zhengruifeng commented on SPARK-29381: -- [~huaxingao] Hi, I think we need another PR to add 'private' classes like '_LinearSVCParams'/'_LinearRegressionParams'. I am sorry for late response. > Add 'private' _XXXParams classes for classification & regression > > > Key: SPARK-29381 > URL: https://issues.apache.org/jira/browse/SPARK-29381 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > ping [~huaxingao] would you like to work on this? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
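Loosely sketched, the idea is a params-only mixin per algorithm, shared by the estimator and its model so that param definitions and getters live in one place (the concrete mixin composition below is illustrative, not the actual class definition):

{code:python}
from pyspark.ml.param.shared import HasMaxIter, HasRegParam

class _LinearSVCParams(HasRegParam, HasMaxIter):
    """'Private' params-only class, mirroring the Scala-side LinearSVCParams trait."""
    pass

# Both the estimator and the model would then mix it in, e.g.:
#   class LinearSVC(JavaEstimator, _LinearSVCParams, ...): ...
#   class LinearSVCModel(JavaModel, _LinearSVCParams, ...): ...
{code}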
[jira] [Assigned] (SPARK-29377) parity between scala ml tuning and python ml tuning
[ https://issues.apache.org/jira/browse/SPARK-29377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-29377: Assignee: Huaxin Gao > parity between scala ml tuning and python ml tuning > --- > > Key: SPARK-29377 > URL: https://issues.apache.org/jira/browse/SPARK-29377 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29377) parity between scala ml tuning and python ml tuning
[ https://issues.apache.org/jira/browse/SPARK-29377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-29377. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26057 [https://github.com/apache/spark/pull/26057] > parity between scala ml tuning and python ml tuning > --- > > Key: SPARK-29377 > URL: https://issues.apache.org/jira/browse/SPARK-29377 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29380) RFormula avoid repeated 'first' jobs to get vector size
[ https://issues.apache.org/jira/browse/SPARK-29380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-29380: Assignee: zhengruifeng > RFormula avoid repeated 'first' jobs to get vector size > --- > > Key: SPARK-29380 > URL: https://issues.apache.org/jira/browse/SPARK-29380 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > In the current impl, {{RFormula}} will trigger a {{first}} job to get the > vector size if the size cannot be obtained from {{AttributeGroup}}. > This can be optimized by getting the first row lazily and reusing it for each > vector column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29380) RFormula avoid repeated 'first' jobs to get vector size
[ https://issues.apache.org/jira/browse/SPARK-29380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-29380. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26052 [https://github.com/apache/spark/pull/26052] > RFormula avoid repeated 'first' jobs to get vector size > --- > > Key: SPARK-29380 > URL: https://issues.apache.org/jira/browse/SPARK-29380 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > In the current impl, {{RFormula}} will trigger a {{first}} job to get the > vector size if the size cannot be obtained from {{AttributeGroup}}. > This can be optimized by getting the first row lazily and reusing it for each > vector column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
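The change itself is on the Scala side; the underlying idea is just to memoize the first row so that at most one {{first}} job runs no matter how many vector columns need their size resolved. A rough Python-flavoured sketch of the pattern:

{code:python}
class FirstRowCache:
    """Compute dataset.first() lazily and reuse it across columns (sketch only)."""

    def __init__(self, dataset):
        self._dataset = dataset
        self._first_row = None

    def vector_size(self, col_name):
        # Triggers at most one 'first' job, however many columns ask for a size.
        if self._first_row is None:
            self._first_row = self._dataset.first()
        # Assumes the column holds ml Vectors, whose Python classes support len().
        return len(self._first_row[col_name])
{code}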
[jira] [Resolved] (SPARK-29116) Refactor py classes related to DecisionTree
[ https://issues.apache.org/jira/browse/SPARK-29116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-29116. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25929 [https://github.com/apache/spark/pull/25929] > Refactor py classes related to DecisionTree > --- > > Key: SPARK-29116 > URL: https://issues.apache.org/jira/browse/SPARK-29116 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > 1, Like the scala side, move related classes to a separate file 'tree.py' > 2, add method 'predictLeaf' in 'DecisionTreeModel' & 'TreeEnsembleModel' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
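A hedged sketch of how the new leaf prediction could be used from Python once the refactor lands (the {{leafCol}} param name is an assumption, mirroring the Scala side):

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import DecisionTreeRegressor

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0), (Vectors.dense([1.0]), 1.0)],
    ["features", "label"])

model = DecisionTreeRegressor(maxDepth=2).fit(df)

# Assumed API: setting leafCol adds a column with the predicted leaf index,
# backed by the new predictLeaf on DecisionTreeModel / TreeEnsembleModel.
model.setLeafCol("leaf")
model.transform(df).select("features", "prediction", "leaf").show()
{code}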
[jira] [Assigned] (SPARK-29116) Refactor py classes related to DecisionTree
[ https://issues.apache.org/jira/browse/SPARK-29116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-29116: Assignee: Huaxin Gao > Refactor py classes related to DecisionTree > --- > > Key: SPARK-29116 > URL: https://issues.apache.org/jira/browse/SPARK-29116 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Minor > > 1, Like the scala side, move related classes to a separate file 'tree.py' > 2, add method 'predictLeaf' in 'DecisionTreeModel' & 'TreeEnsembleModel' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29381) Add 'private' _XXXParams classes for classification & regression
zhengruifeng created SPARK-29381: Summary: Add 'private' _XXXParams classes for classification & regression Key: SPARK-29381 URL: https://issues.apache.org/jira/browse/SPARK-29381 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng ping [~huaxingao] would you like to work on this? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29380) RFormula avoid repeated 'first' jobs to get vector size
zhengruifeng created SPARK-29380: Summary: RFormula avoid repeated 'first' jobs to get vector size Key: SPARK-29380 URL: https://issues.apache.org/jira/browse/SPARK-29380 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng In the current impl, {{RFormula}} will trigger a {{first}} job to get the vector size if the size cannot be obtained from {{AttributeGroup}}. This can be optimized by getting the first row lazily and reusing it for each vector column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend
[ https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946583#comment-16946583 ] zhengruifeng commented on SPARK-29212: -- [~zero323] ??we should remove Java specific mixins, if they don't serve any practical value (provide no implementation whatsoever or don't extend other {{Java*}} mixins, like {{JavaPredictorParams}}, or have no JVM wrapper specific implementation, like {{JavaPredictor}}).?? I am neutral on it; what are your thoughts? [~huaxingao] [~srowen] ??As of the second point there is additional consideration here - some {{Java*}} classes are considered part of the public API, and this should stay as is (these provide crucial information to the end user). ?? I guess we have reached an agreement in related tickets (like _XXXParams in features/clustering). ??On a side note current approach to ML API requires a lot of boilerplate code. Lately I've been playing with [some ideas|https://gist.github.com/zero323/ee36bce57ddeac82322e3ab4ef547611], that wouldn't require code generation - they have some caveats, but maybe there is something there. ?? It looks succinct; I think we may take it into account in the future. > Add common classes without using JVM backend > > > Key: SPARK-29212 > URL: https://issues.apache.org/jira/browse/SPARK-29212 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > Copied from [https://github.com/apache/spark/pull/25776]. > > Maciej's *Concern*: > *Use cases for public ML type hierarchy* > * Add Python-only Transformer implementations: > * > ** I am Python user and want to implement pure Python ML classifier without > providing JVM backend. > ** I want this classifier to be meaningfully positioned in the existing type > hierarchy. > ** However I have access only to high level classes ({{Estimator}}, > {{Model}}, {{MLReader}} / {{MLReadable}}). > * Run time parameter validation for both user defined (see above) and > existing class hierarchy, > * > ** I am a library developer who provides functions that are meaningful only > for specific categories of {{Estimators}} - here classifiers. > ** I want to validate that user passed argument is indeed a classifier: > *** For built-in objects using "private" type hierarchy is not really > satisfying (actually, what is the rationale behind making it "private"? If > the goal is Scala API parity, and Scala counterparts are public, shouldn't > these be too?). > ** For user defined objects I can: > *** Use duck typing (on {{setRawPredictionCol}} for classifier, on > {{numClasses}} for classification model) but it hardly satisfying. > *** Provide parallel non-abstract type hierarchy ({{Classifier}} or > {{PythonClassifier}} and so on) and require users to implement such > interfaces. That however would require separate logic for checking for > built-in and and user-provided classes. > *** Provide parallel abstract type hierarchy, register all existing built-in > classes and require users to do the same. > Clearly these are not satisfying solutions as they require either defensive > programming or reinventing the same functionality for different 3rd party > APIs. > * Static type checking > * > ** I am either end user or library developer and want to use PEP-484 > annotations to indicate components that require classifier or classification > model.
> * > ** Currently I can provide only imprecise annotations, [such > as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] > def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... > or try to narrow things down using structural subtyping: > class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, > value: str) -> Classifier: ... class Classifier(Protocol, Model): def > setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> > int: ... > (...) > * First of all nothing in the original API indicated this. On the contrary, > the original API clearly suggests that non-Java path is supported, by > providing base classes (Params, Transformer, Estimator, Model, ML > \{Reader,Writer}, ML\{Readable,Writable}) as well as Java specific > implementations (JavaParams, JavaTransformer, JavaEstimator, JavaModel, > JavaML\{Reader,Writer}, JavaML > {Readable,Writable} > ). > * Furthermore authoritative (IMHO) and open Python ML extensions exist > (spark-sklearn is one of these, but if I recall correctly spark-deep-learning > provides so pure-Python utilities). Personally I've seen quite a lot of > private implementations, but that's just anecdotal evidence. > Let us assume
[jira] [Assigned] (SPARK-29269) Pyspark ALSModel support getters/setters
[ https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-29269: Assignee: Huaxin Gao > Pyspark ALSModel support getters/setters > > > Key: SPARK-29269 > URL: https://issues.apache.org/jira/browse/SPARK-29269 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > ping [~huaxingao] , would you like to work on this? This is similar to your > previous works. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-29269) Pyspark ALSModel support getters/setters
[ https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-29269: - Comment: was deleted (was: It seems that I do not have the permission to assign a tickect: ``` JIRAError: JiraError HTTP 403 url: https://issues.apache.org/jira/rest/api/latest/issue/SPARK-29269/assignee text: You do not have permission to assign issues. ``` [~dongjoon] Could you please help assign this ticket to Huaxin? Thanks!) > Pyspark ALSModel support getters/setters > > > Key: SPARK-29269 > URL: https://issues.apache.org/jira/browse/SPARK-29269 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > ping [~huaxingao] , would you like to work on this? This is similar to your > previous works. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29269) Pyspark ALSModel support getters/setters
[ https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946524#comment-16946524 ] zhengruifeng commented on SPARK-29269: -- It seems that I do not have permission to assign a ticket: ``` JIRAError: JiraError HTTP 403 url: https://issues.apache.org/jira/rest/api/latest/issue/SPARK-29269/assignee text: You do not have permission to assign issues. ``` [~dongjoon] Could you please help assign this ticket to Huaxin? Thanks! > Pyspark ALSModel support getters/setters > > > Key: SPARK-29269 > URL: https://issues.apache.org/jira/browse/SPARK-29269 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > ping [~huaxingao] , would you like to work on this? This is similar to your > previous works. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29269) Pyspark ALSModel support getters/setters
[ https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-29269. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25947 [https://github.com/apache/spark/pull/25947] > Pyspark ALSModel support getters/setters > > > Key: SPARK-29269 > URL: https://issues.apache.org/jira/browse/SPARK-29269 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > ping [~huaxingao] , would you like to work on this? This is similar to your > previous works. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29258) parity between ml.evaluator and mllib.metrics
[ https://issues.apache.org/jira/browse/SPARK-29258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-29258. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25940 [https://github.com/apache/spark/pull/25940] > parity between ml.evaluator and mllib.metrics > - > > Key: SPARK-29258 > URL: https://issues.apache.org/jira/browse/SPARK-29258 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > 1, expose {{BinaryClassificationMetrics.numBins}} in > {{BinaryClassificationEvaluator}} > 2, expose {{RegressionMetrics.throughOrigin}} in {{RegressionEvaluator}} > 3, add metric {{explainedVariance}} in {{RegressionEvaluator}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
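Once exposed, evaluator-side usage could look roughly like the following (the metric name for explained variance and the {{numBins}} param/setter are assumptions based on the mllib counterparts):

{code:python}
from pyspark.ml.evaluation import BinaryClassificationEvaluator, RegressionEvaluator

# Assumed: explained variance exposed as a RegressionEvaluator metric.
reg_eval = RegressionEvaluator(labelCol="label", predictionCol="prediction",
                               metricName="var")

# Assumed: numBins exposed as a param, mirroring BinaryClassificationMetrics.numBins.
bin_eval = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",
                                         metricName="areaUnderROC")
bin_eval.setNumBins(1000)
{code}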
[jira] [Created] (SPARK-29269) Pyspark ALSModel support getters/setters
zhengruifeng created SPARK-29269: Summary: Pyspark ALSModel support getters/setters Key: SPARK-29269 URL: https://issues.apache.org/jira/browse/SPARK-29269 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng ping [~huaxingao] , would you like to work on this? This is similar to your previous works. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29142) Pyspark clustering models support column setters/getters/predict
[ https://issues.apache.org/jira/browse/SPARK-29142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-29142. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25859 [https://github.com/apache/spark/pull/25859] > Pyspark clustering models support column setters/getters/predict > > > Key: SPARK-29142 > URL: https://issues.apache.org/jira/browse/SPARK-29142 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > Unlike the reg/clf models, clustering models do not have some common class, > so we need to add them one by one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
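A short sketch of the kind of Python-side additions this covers, using KMeans as the example (the single-sample {{predict}} and the column setters are assumed to mirror the Scala models):

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(Vectors.dense([0.0, 0.0]),),
                            (Vectors.dense([1.0, 1.0]),)], ["features"])

model = KMeans(k=2, seed=1).fit(df)

# Assumed additions: column setters and a local, single-vector predict.
model.setPredictionCol("cluster")
print(model.predict(Vectors.dense([0.1, 0.1])))  # cluster index for one vector
model.transform(df).show()
{code}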
[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend
[ https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939082#comment-16939082 ] zhengruifeng commented on SPARK-29212: -- [~zero323] I had not noticed the base hierarchy without JVM backend in SPARK-28985; thank you for pointing it out. I guess we have reached some consensus on: 1, add base classes without JVM backend, and make the JVM-backed classes extend them (maybe limited to the classes modified in SPARK-28985 at first); 2, rename private class names following PEP-8. > Add common classes without using JVM backend > > > Key: SPARK-29212 > URL: https://issues.apache.org/jira/browse/SPARK-29212 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > copyed from [https://github.com/apache/spark/pull/25776.] > > Maciej's *Concern*: > *Use cases for public ML type hierarchy* > * Add Python-only Transformer implementations: > ** I am Python user and want to implement pure Python ML classifier without > providing JVM backend. > ** I want this classifier to be meaningfully positioned in the existing type > hierarchy. > ** However I have access only to high level classes ({{Estimator}}, > {{Model}}, {{MLReader}} / {{MLReadable}}). > * Run time parameter validation for both user defined (see above) and > existing class hierarchy, > ** I am a library developer who provides functions that are meaningful only > for specific categories of {{Estimators}} - here classifiers. > ** I want to validate that user passed argument is indeed a classifier: > *** For built-in objects using "private" type hierarchy is not really > satisfying (actually, what is the rationale behind making it "private"? If > the goal is Scala API parity, and Scala counterparts are public, shouldn't > these be too?). > ** For user defined objects I can: > *** Use duck typing (on {{setRawPredictionCol}} for classifier, on > {{numClasses}} for classification model) but it hardly satisfying. > *** Provide parallel non-abstract type hierarchy ({{Classifier}} or > {{PythonClassifier}} and so on) and require users to implement such > interfaces. That however would require separate logic for checking for > built-in and and user-provided classes. > *** Provide parallel abstract type hierarchy, register all existing built-in > classes and require users to do the same. > Clearly these are not satisfying solutions as they require either defensive > programming or reinventing the same functionality for different 3rd party > APIs. > * Static type checking > ** I am either end user or library developer and want to use PEP-484 > annotations to indicate components that require classifier or classification > model. > ** Currently I can provide only imprecise annotations, [such > as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] > def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... > or try to narrow things down using structural subtyping: > class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, > value: str) -> Classifier: ... class Classifier(Protocol, Model): def > setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> > int: ... > > Maciej's *Proposal*: > {code:java} > Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e. > class ClassifierParams: ... > class Predictor(Estimator,PredictorParams): > def setLabelCol(self, value): ...
> def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > class Classifier(Predictor, ClassifierParams): > def setRawPredictionCol(self, value): ... > class PredictionModel(Model,PredictorParams): > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > def numFeatures(self): ... > def predict(self, value): ... > and JVM interop should extend from this hierarchy, i.e. > class JavaPredictionModel(PredictionModel): ... > In other words it should be consistent with existing approach, where we have > ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and > Java* variants are their subclasses. > {code} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29258) parity between ml.evaluator and mllib.metrics
zhengruifeng created SPARK-29258: Summary: parity between ml.evaluator and mllib.metrics Key: SPARK-29258 URL: https://issues.apache.org/jira/browse/SPARK-29258 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng 1, expose {{BinaryClassificationMetrics.numBins}} in {{BinaryClassificationEvaluator}} 2, expose {{RegressionMetrics.throughOrigin}} in {{RegressionEvaluator}} 3, add metric {{explainedVariance}} in {{RegressionEvaluator}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend
[ https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938205#comment-16938205 ] zhengruifeng commented on SPARK-29212: -- [~zero323] Would you like to help work on this? > Add common classes without using JVM backend > > > Key: SPARK-29212 > URL: https://issues.apache.org/jira/browse/SPARK-29212 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > copyed from [https://github.com/apache/spark/pull/25776.] > > Maciej's *Concern*: > *Use cases for public ML type hierarchy* > * Add Python-only Transformer implementations: > ** I am Python user and want to implement pure Python ML classifier without > providing JVM backend. > ** I want this classifier to be meaningfully positioned in the existing type > hierarchy. > ** However I have access only to high level classes ({{Estimator}}, > {{Model}}, {{MLReader}} / {{MLReadable}}). > * Run time parameter validation for both user defined (see above) and > existing class hierarchy, > ** I am a library developer who provides functions that are meaningful only > for specific categories of {{Estimators}} - here classifiers. > ** I want to validate that user passed argument is indeed a classifier: > *** For built-in objects using "private" type hierarchy is not really > satisfying (actually, what is the rationale behind making it "private"? If > the goal is Scala API parity, and Scala counterparts are public, shouldn't > these be too?). > ** For user defined objects I can: > *** Use duck typing (on {{setRawPredictionCol}} for classifier, on > {{numClasses}} for classification model) but it hardly satisfying. > *** Provide parallel non-abstract type hierarchy ({{Classifier}} or > {{PythonClassifier}} and so on) and require users to implement such > interfaces. That however would require separate logic for checking for > built-in and and user-provided classes. > *** Provide parallel abstract type hierarchy, register all existing built-in > classes and require users to do the same. > Clearly these are not satisfying solutions as they require either defensive > programming or reinventing the same functionality for different 3rd party > APIs. > * Static type checking > ** I am either end user or library developer and want to use PEP-484 > annotations to indicate components that require classifier or classification > model. > ** Currently I can provide only imprecise annotations, [such > as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] > def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... > or try to narrow things down using structural subtyping: > class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, > value: str) -> Classifier: ... class Classifier(Protocol, Model): def > setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> > int: ... > > Maciej's *Proposal*: > {code:java} > Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e. > class ClassifierParams: ... > class Predictor(Estimator,PredictorParams): > def setLabelCol(self, value): ... > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > class Classifier(Predictor, ClassifierParams): > def setRawPredictionCol(self, value): ... > class PredictionModel(Model,PredictorParams): > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... 
> def numFeatures(self): ... > def predict(self, value): ... > and JVM interop should extend from this hierarchy, i.e. > class JavaPredictionModel(PredictionModel): ... > In other words it should be consistent with existing approach, where we have > ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and > Java* variants are their subclasses. > {code} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend
[ https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935678#comment-16935678 ] zhengruifeng commented on SPARK-29212: -- It seems useful to impl some algs in pure python (like wrap scikit-learn as a pyspark.ml alg) I personally think [~zero323]'s proposal is reasonable, although Pyspark.ML is now mostly there to wrap the Scala side. I had a discussion with [~huaxingao] and [~srowen] , I guess they are fairly neutral on it. How do you think of this? [~holden.ka...@gmail.com] [~bryanc] > Add common classes without using JVM backend > > > Key: SPARK-29212 > URL: https://issues.apache.org/jira/browse/SPARK-29212 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > copyed from [https://github.com/apache/spark/pull/25776.] > > Maciej's *Concern*: > *Use cases for public ML type hierarchy* > * Add Python-only Transformer implementations: > ** I am Python user and want to implement pure Python ML classifier without > providing JVM backend. > ** I want this classifier to be meaningfully positioned in the existing type > hierarchy. > ** However I have access only to high level classes ({{Estimator}}, > {{Model}}, {{MLReader}} / {{MLReadable}}). > * Run time parameter validation for both user defined (see above) and > existing class hierarchy, > ** I am a library developer who provides functions that are meaningful only > for specific categories of {{Estimators}} - here classifiers. > ** I want to validate that user passed argument is indeed a classifier: > *** For built-in objects using "private" type hierarchy is not really > satisfying (actually, what is the rationale behind making it "private"? If > the goal is Scala API parity, and Scala counterparts are public, shouldn't > these be too?). > ** For user defined objects I can: > *** Use duck typing (on {{setRawPredictionCol}} for classifier, on > {{numClasses}} for classification model) but it hardly satisfying. > *** Provide parallel non-abstract type hierarchy ({{Classifier}} or > {{PythonClassifier}} and so on) and require users to implement such > interfaces. That however would require separate logic for checking for > built-in and and user-provided classes. > *** Provide parallel abstract type hierarchy, register all existing built-in > classes and require users to do the same. > Clearly these are not satisfying solutions as they require either defensive > programming or reinventing the same functionality for different 3rd party > APIs. > * Static type checking > ** I am either end user or library developer and want to use PEP-484 > annotations to indicate components that require classifier or classification > model. > ** Currently I can provide only imprecise annotations, [such > as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] > def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... > or try to narrow things down using structural subtyping: > class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, > value: str) -> Classifier: ... class Classifier(Protocol, Model): def > setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> > int: ... > > Maciej's *Proposal*: > {code:java} > Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e. > class ClassifierParams: ... 
> class Predictor(Estimator,PredictorParams): > def setLabelCol(self, value): ... > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > class Classifier(Predictor, ClassifierParams): > def setRawPredictionCol(self, value): ... > class PredictionModel(Model,PredictorParams): > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > def numFeatures(self): ... > def predict(self, value): ... > and JVM interop should extend from this hierarchy, i.e. > class JavaPredictionModel(PredictionModel): ... > In other words it should be consistent with existing approach, where we have > ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and > Java* variants are their subclasses. > {code} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29212) Add common classes without using JVM backend
zhengruifeng created SPARK-29212: Summary: Add common classes without using JVM backend Key: SPARK-29212 URL: https://issues.apache.org/jira/browse/SPARK-29212 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng copyed from [https://github.com/apache/spark/pull/25776.] Maciej's *Concern*: *Use cases for public ML type hierarchy* * Add Python-only Transformer implementations: ** I am Python user and want to implement pure Python ML classifier without providing JVM backend. ** I want this classifier to be meaningfully positioned in the existing type hierarchy. ** However I have access only to high level classes ({{Estimator}}, {{Model}}, {{MLReader}} / {{MLReadable}}). * Run time parameter validation for both user defined (see above) and existing class hierarchy, ** I am a library developer who provides functions that are meaningful only for specific categories of {{Estimators}} - here classifiers. ** I want to validate that user passed argument is indeed a classifier: *** For built-in objects using "private" type hierarchy is not really satisfying (actually, what is the rationale behind making it "private"? If the goal is Scala API parity, and Scala counterparts are public, shouldn't these be too?). ** For user defined objects I can: *** Use duck typing (on {{setRawPredictionCol}} for classifier, on {{numClasses}} for classification model) but it hardly satisfying. *** Provide parallel non-abstract type hierarchy ({{Classifier}} or {{PythonClassifier}} and so on) and require users to implement such interfaces. That however would require separate logic for checking for built-in and and user-provided classes. *** Provide parallel abstract type hierarchy, register all existing built-in classes and require users to do the same. Clearly these are not satisfying solutions as they require either defensive programming or reinventing the same functionality for different 3rd party APIs. * Static type checking ** I am either end user or library developer and want to use PEP-484 annotations to indicate components that require classifier or classification model. ** Currently I can provide only imprecise annotations, [such as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... or try to narrow things down using structural subtyping: class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, value: str) -> Classifier: ... class Classifier(Protocol, Model): def setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> int: ... Maciej's *Proposal*: {code:java} Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e. class ClassifierParams: ... class Predictor(Estimator,PredictorParams): def setLabelCol(self, value): ... def setFeaturesCol(self, value): ... def setPredictionCol(self, value): ... class Classifier(Predictor, ClassifierParams): def setRawPredictionCol(self, value): ... class PredictionModel(Model,PredictorParams): def setFeaturesCol(self, value): ... def setPredictionCol(self, value): ... def numFeatures(self): ... def predict(self, value): ... and JVM interop should extend from this hierarchy, i.e. class JavaPredictionModel(PredictionModel): ... In other words it should be consistent with existing approach, where we have ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and Java* variants are their subclasses. 
{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29144) Binarizer handle sparse vectors incorrectly with negative threshold
[ https://issues.apache.org/jira/browse/SPARK-29144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-29144: - Summary: Binarizer handle sparse vectors incorrectly with negative threshold (was: Binarizer handel sparse vector incorrectly with negative threshold) > Binarizer handle sparse vectors incorrectly with negative threshold > --- > > Key: SPARK-29144 > URL: https://issues.apache.org/jira/browse/SPARK-29144 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: zhengruifeng >Priority: Minor > > the process on sparse vector is wrong if thread<0: > {code:java} > scala> val data = Seq((0, Vectors.sparse(3, Array(1), Array(0.5))), (1, > Vectors.dense(Array(0.0, 0.5, 0.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((0,(3,[1],[0.5])), > (1,[0.0,0.5,0.0])) > scala> val df = data.toDF("id", "feature") > df: org.apache.spark.sql.DataFrame = [id: int, feature: vector] > scala> val binarizer: Binarizer = new > Binarizer().setInputCol("feature").setOutputCol("binarized_feature").setThreshold(-0.5) > binarizer: org.apache.spark.ml.feature.Binarizer = binarizer_1c07ac2ae3c8 > scala> binarizer.transform(df).show() > +---+-+-+ > | id| feature|binarized_feature| > +---+-+-+ > | 0|(3,[1],[0.5])|[0.0,1.0,0.0]| > | 1|[0.0,0.5,0.0]|[1.0,1.0,1.0]| > +---+-+-+ > {code} > expected outputs of the above two input vectors should be the same. > > To deal with sparse vectors with threshold < 0, we have two options: > 1, return 1 for non-active items, but this will convert sparse vectors to > dense ones > 2, throw an exception like what Scikit-Learn's > [Binarizer|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html] > does: > {code:java} > import numpy as np > from scipy.sparse import csr_matrix > from sklearn.preprocessing import Binarizer > row = np.array([0, 0, 1, 2, 2, 2]) > col = np.array([0, 2, 2, 0, 1, 2]) > data = np.array([1, 2, 3, 4, 5, 6]) > a = csr_matrix((data, (row, col)), shape=(3, 3)) > binarizer = Binarizer(threshold=-1.0) > binarizer.transform(a) > Traceback (most recent call last): File "", > line 1, in > binarizer.transform(a) File > "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", > line 1874, in transform > return binarize(X, threshold=self.threshold, copy=copy) File > "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", > line 1774, in binarize > raise ValueError('Cannot binarize a sparse matrix with threshold > 'ValueError: Cannot binarize a sparse matrix with threshold < 0 {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29144) Binarizer handel sparse vector incorrectly with negative threshold
[ https://issues.apache.org/jira/browse/SPARK-29144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932272#comment-16932272 ] zhengruifeng commented on SPARK-29144: -- I prefer option 2, and will send a PR for this. > Binarizer handel sparse vector incorrectly with negative threshold > -- > > Key: SPARK-29144 > URL: https://issues.apache.org/jira/browse/SPARK-29144 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: zhengruifeng >Priority: Minor > > the process on sparse vector is wrong if thread<0: > {code:java} > scala> val data = Seq((0, Vectors.sparse(3, Array(1), Array(0.5))), (1, > Vectors.dense(Array(0.0, 0.5, 0.0 > data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((0,(3,[1],[0.5])), > (1,[0.0,0.5,0.0])) > scala> val df = data.toDF("id", "feature") > df: org.apache.spark.sql.DataFrame = [id: int, feature: vector] > scala> val binarizer: Binarizer = new > Binarizer().setInputCol("feature").setOutputCol("binarized_feature").setThreshold(-0.5) > binarizer: org.apache.spark.ml.feature.Binarizer = binarizer_1c07ac2ae3c8 > scala> binarizer.transform(df).show() > +---+-+-+ > | id| feature|binarized_feature| > +---+-+-+ > | 0|(3,[1],[0.5])|[0.0,1.0,0.0]| > | 1|[0.0,0.5,0.0]|[1.0,1.0,1.0]| > +---+-+-+ > {code} > expected outputs of the above two input vectors should be the same. > > To deal with sparse vectors with threshold < 0, we have two options: > 1, return 1 for non-active items, but this will convert sparse vectors to > dense ones > 2, throw an exception like what Scikit-Learn's > [Binarizer|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html] > does: > {code:java} > import numpy as np > from scipy.sparse import csr_matrix > from sklearn.preprocessing import Binarizer > row = np.array([0, 0, 1, 2, 2, 2]) > col = np.array([0, 2, 2, 0, 1, 2]) > data = np.array([1, 2, 3, 4, 5, 6]) > a = csr_matrix((data, (row, col)), shape=(3, 3)) > binarizer = Binarizer(threshold=-1.0) > binarizer.transform(a) > Traceback (most recent call last): File "", > line 1, in > binarizer.transform(a) File > "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", > line 1874, in transform > return binarize(X, threshold=self.threshold, copy=copy) File > "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", > line 1774, in binarize > raise ValueError('Cannot binarize a sparse matrix with threshold > 'ValueError: Cannot binarize a sparse matrix with threshold < 0 {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29144) Binarizer handel sparse vector incorrectly with negative threshold
[ https://issues.apache.org/jira/browse/SPARK-29144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-29144: - Description: the process on sparse vector is wrong if thread<0: {code:java} scala> val data = Seq((0, Vectors.sparse(3, Array(1), Array(0.5))), (1, Vectors.dense(Array(0.0, 0.5, 0.0 data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((0,(3,[1],[0.5])), (1,[0.0,0.5,0.0])) scala> val df = data.toDF("id", "feature") df: org.apache.spark.sql.DataFrame = [id: int, feature: vector] scala> val binarizer: Binarizer = new Binarizer().setInputCol("feature").setOutputCol("binarized_feature").setThreshold(-0.5) binarizer: org.apache.spark.ml.feature.Binarizer = binarizer_1c07ac2ae3c8 scala> binarizer.transform(df).show() +---+-+-+ | id| feature|binarized_feature| +---+-+-+ | 0|(3,[1],[0.5])|[0.0,1.0,0.0]| | 1|[0.0,0.5,0.0]|[1.0,1.0,1.0]| +---+-+-+ {code} expected outputs of the above two input vectors should be the same. To deal with sparse vectors with threshold < 0, we have two options: 1, return 1 for non-active items, but this will convert sparse vectors to dense ones 2, throw an exception like what Scikit-Learn's [Binarizer|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html] does: {code:java} import numpy as np from scipy.sparse import csr_matrix from sklearn.preprocessing import Binarizer row = np.array([0, 0, 1, 2, 2, 2]) col = np.array([0, 2, 2, 0, 1, 2]) data = np.array([1, 2, 3, 4, 5, 6]) a = csr_matrix((data, (row, col)), shape=(3, 3)) binarizer = Binarizer(threshold=-1.0) binarizer.transform(a) Traceback (most recent call last): File "", line 1, in binarizer.transform(a) File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1874, in transform return binarize(X, threshold=self.threshold, copy=copy) File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1774, in binarize raise ValueError('Cannot binarize a sparse matrix with threshold 'ValueError: Cannot binarize a sparse matrix with threshold < 0 {code} was: the process on sparse vector is wrong if thread<0: {code:java} scala> val data = Seq((0, Vectors.sparse(3, Array(1), Array(0.5))), (1, Vectors.dense(Array(0.0, 0.5, 0.0 data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((0,(3,[1],[0.5])), (1,[0.0,0.5,0.0])) scala> val df = data.toDF("id", "feature") df: org.apache.spark.sql.DataFrame = [id: int, feature: vector] scala> val binarizer: Binarizer = new Binarizer().setInputCol("feature").setOutputCol("binarized_feature").setThreshold(-0.5) binarizer: org.apache.spark.ml.feature.Binarizer = binarizer_1c07ac2ae3c8 scala> binarizer.transform(df).show() +---+-+-+ | id| feature|binarized_feature| +---+-+-+ | 0|(3,[1],[0.5])|[0.0,1.0,0.0]| | 1|[0.0,0.5,0.0]|[1.0,1.0,1.0]| +---+-+-+ {code} expected outputs of the above two input vectors should be the same. 
To deal with sparse vectors with threshold < 0, we have two options: 1, return 1 for non-active items, but this will convert sparse vectors to dense ones 2, throw an exception like what Scikit-Learn's [Binarizer|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.htm] does: {code:java} import numpy as np from scipy.sparse import csr_matrix from sklearn.preprocessing import Binarizer row = np.array([0, 0, 1, 2, 2, 2]) col = np.array([0, 2, 2, 0, 1, 2]) data = np.array([1, 2, 3, 4, 5, 6]) a = csr_matrix((data, (row, col)), shape=(3, 3)) binarizer = Binarizer(threshold=-1.0) binarizer.transform(a) Traceback (most recent call last): File "", line 1, in binarizer.transform(a) File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1874, in transform return binarize(X, threshold=self.threshold, copy=copy) File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1774, in binarize raise ValueError('Cannot binarize a sparse matrix with threshold 'ValueError: Cannot binarize a sparse matrix with threshold < 0 {code} > Binarizer handel sparse vector incorrectly with negative threshold > -- > > Key: SPARK-29144 > URL: https://issues.apache.org/jira/browse/SPARK-29144 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: zhengruifeng >Priority: Minor > > the process on sparse vector is wrong if thread<0: > {code:java} > scala> val data = Seq((0,
[jira] [Created] (SPARK-29144) Binarizer handles sparse vectors incorrectly with negative threshold
zhengruifeng created SPARK-29144: Summary: Binarizer handles sparse vectors incorrectly with negative threshold Key: SPARK-29144 URL: https://issues.apache.org/jira/browse/SPARK-29144 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.4.0, 2.3.0, 2.2.0, 2.1.0, 2.0.0 Reporter: zhengruifeng The processing of sparse vectors is wrong if threshold<0: {code:java} scala> val data = Seq((0, Vectors.sparse(3, Array(1), Array(0.5))), (1, Vectors.dense(Array(0.0, 0.5, 0.0)))) data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((0,(3,[1],[0.5])), (1,[0.0,0.5,0.0])) scala> val df = data.toDF("id", "feature") df: org.apache.spark.sql.DataFrame = [id: int, feature: vector] scala> val binarizer: Binarizer = new Binarizer().setInputCol("feature").setOutputCol("binarized_feature").setThreshold(-0.5) binarizer: org.apache.spark.ml.feature.Binarizer = binarizer_1c07ac2ae3c8 scala> binarizer.transform(df).show() +---+-+-+ | id| feature|binarized_feature| +---+-+-+ | 0|(3,[1],[0.5])|[0.0,1.0,0.0]| | 1|[0.0,0.5,0.0]|[1.0,1.0,1.0]| +---+-+-+ {code} The expected outputs of the above two input vectors should be the same. To deal with sparse vectors with threshold < 0, we have two options: 1, return 1 for non-active items, but this will convert sparse vectors to dense ones 2, throw an exception like what Scikit-Learn's [Binarizer|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html] does: {code:java} import numpy as np from scipy.sparse import csr_matrix from sklearn.preprocessing import Binarizer row = np.array([0, 0, 1, 2, 2, 2]) col = np.array([0, 2, 2, 0, 1, 2]) data = np.array([1, 2, 3, 4, 5, 6]) a = csr_matrix((data, (row, col)), shape=(3, 3)) binarizer = Binarizer(threshold=-1.0) binarizer.transform(a) Traceback (most recent call last): File "", line 1, in binarizer.transform(a) File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1874, in transform return binarize(X, threshold=self.threshold, copy=copy) File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1774, in binarize raise ValueError('Cannot binarize a sparse matrix with threshold 'ValueError: Cannot binarize a sparse matrix with threshold < 0 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-23578) Add multicolumn support for Binarizer
[ https://issues.apache.org/jira/browse/SPARK-23578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reopened SPARK-23578: -- this ticket is for Binarizer not Bucketizer > Add multicolumn support for Binarizer > - > > Key: SPARK-23578 > URL: https://issues.apache.org/jira/browse/SPARK-23578 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Teng Peng >Priority: Minor > > [Spark-20542] added an API that Bucketizer that can bin multiple columns. > Based on this change, a multicolumn support is added for Binarizer. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23578) Add multicolumn support for Binarizer
[ https://issues.apache.org/jira/browse/SPARK-23578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-23578. -- Resolution: Duplicate > Add multicolumn support for Binarizer > - > > Key: SPARK-23578 > URL: https://issues.apache.org/jira/browse/SPARK-23578 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Teng Peng >Priority: Minor > > [Spark-20542] added an API that Bucketizer that can bin multiple columns. > Based on this change, a multicolumn support is added for Binarizer. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29143) Pyspark feature models support column setters/getters
zhengruifeng created SPARK-29143: Summary: Pyspark feature models support column setters/getters Key: SPARK-29143 URL: https://issues.apache.org/jira/browse/SPARK-29143 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29142) Pyspark clustering models support column setters/getters/predict
zhengruifeng created SPARK-29142: Summary: Pyspark clustering models support column setters/getters/predict Key: SPARK-29142 URL: https://issues.apache.org/jira/browse/SPARK-29142 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng Unlike the reg/clf models, the clustering models do not share a common base class, so we need to add these methods to each model one by one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29118) Avoid redundant computation in GMM.transform && GLR.transform
[ https://issues.apache.org/jira/browse/SPARK-29118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-29118: - Description: In SPARK-27944, the computation for output columns with empty name is skipped. Now, I find that we can furthermore optimize 1, GMM.transform by directly obtaining the prediction(double) from its probabilty prediction(vector), like what ProbabilisticClassificationModel and ClassificationModel do. 2, GLR.transform by obtaining the prediction(double) from its link prediction(double) was: In SPARK-27944, the computation for output columns with empty name is skipped. Now, I find that we can furthermore optimize GMM.transform by directly obtaining the prediction(double) from its probabilty prediction(vector), like what ProbabilisticClassificationModel and ClassificationModel do. > Avoid redundant computation in GMM.transform && GLR.transform > - > > Key: SPARK-29118 > URL: https://issues.apache.org/jira/browse/SPARK-29118 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > In SPARK-27944, the computation for output columns with empty name is skipped. > Now, I find that we can furthermore optimize > 1, GMM.transform by directly obtaining the prediction(double) from its > probabilty prediction(vector), like what ProbabilisticClassificationModel and > ClassificationModel do. > 2, GLR.transform by obtaining the prediction(double) from its link > prediction(double) -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
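For the GMM part, a small sketch of the idea (an assumption of the approach, not the actual code): once the probability vector has been computed, the prediction is just its argmax, so the per-cluster densities do not need to be evaluated a second time.
{code:java}
import org.apache.spark.ml.linalg.Vector

// prediction(double) derived directly from the probability prediction(vector)
def predictionFromProbability(probability: Vector): Double =
  probability.argmax.toDouble
{code}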
[jira] [Updated] (SPARK-29118) Avoid redundant computation in GMM.transform && GLR.transform
[ https://issues.apache.org/jira/browse/SPARK-29118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-29118: - Summary: Avoid redundant computation in GMM.transform && GLR.transform (was: Avoid redundant computation in GMM.transform) > Avoid redundant computation in GMM.transform && GLR.transform > - > > Key: SPARK-29118 > URL: https://issues.apache.org/jira/browse/SPARK-29118 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > In SPARK-27944, the computation for output columns with empty name is skipped. > Now, I find that we can furthermore optimize GMM.transform by directly > obtaining the prediction(double) from its probabilty prediction(vector), like > what ProbabilisticClassificationModel and ClassificationModel do. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29118) Avoid redundant computation in GMM.transform
zhengruifeng created SPARK-29118: Summary: Avoid redundant computation in GMM.transform Key: SPARK-29118 URL: https://issues.apache.org/jira/browse/SPARK-29118 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng In SPARK-27944, the computation for output columns with an empty name is skipped. Now, I find that we can further optimize GMM.transform by directly obtaining the prediction(double) from its probability prediction(vector), like what ProbabilisticClassificationModel and ClassificationModel do. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29116) Refactor py classes related to DecisionTree
[ https://issues.apache.org/jira/browse/SPARK-29116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931179#comment-16931179 ] zhengruifeng commented on SPARK-29116: -- friendly ping [~huaxingao] , are you willing to work on this? > Refactor py classes related to DecisionTree > --- > > Key: SPARK-29116 > URL: https://issues.apache.org/jira/browse/SPARK-29116 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > 1, Like the scala side, move related classes to a seperate file 'tree.py' > 2, add method 'predictLeaf' in 'DecisionTreeModel' & 'TreeEnsembleModel' -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29116) Refactor py classes related to DecisionTree
zhengruifeng created SPARK-29116: Summary: Refactor py classes related to DecisionTree Key: SPARK-29116 URL: https://issues.apache.org/jira/browse/SPARK-29116 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng 1, Like the scala side, move related classes to a separate file 'tree.py' 2, add the method 'predictLeaf' in 'DecisionTreeModel' & 'TreeEnsembleModel' -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22796) Add multiple column support to PySpark QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-22796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931125#comment-16931125 ] zhengruifeng commented on SPARK-22796: -- [~huaxingao] https://issues.apache.org/jira/browse/SPARK-22797 is now resolved, you can continue now > Add multiple column support to PySpark QuantileDiscretizer > -- > > Key: SPARK-22796 > URL: https://issues.apache.org/jira/browse/SPARK-22796 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22797) Add multiple column support to PySpark Bucketizer
[ https://issues.apache.org/jira/browse/SPARK-22797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-22797. -- Resolution: Done > Add multiple column support to PySpark Bucketizer > - > > Key: SPARK-22797 > URL: https://issues.apache.org/jira/browse/SPARK-22797 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Assignee: zhengruifeng >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29094) Add extractInstances method
[ https://issues.apache.org/jira/browse/SPARK-29094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-29094. -- Resolution: Duplicate > Add extractInstances method > --- > > Key: SPARK-29094 > URL: https://issues.apache.org/jira/browse/SPARK-29094 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29095) add extractInstances
zhengruifeng created SPARK-29095: Summary: add extractInstances Key: SPARK-29095 URL: https://issues.apache.org/jira/browse/SPARK-29095 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng There was a method extractLabeledPoints for ML algs to transform a dataset into an RDD of LabeledPoints. Now more and more algs support sample weighting and extractLabeledPoints is less used, so we should support extracting the weight in the common methods. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
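A hedged sketch of what a shared extractInstances helper could look like (column handling and names are assumptions; Instance is the existing internal ml.feature.Instance case class, so code like this would have to live inside the ml package):
{code:java}
import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.{col, lit}

def extractInstances(dataset: Dataset[_], labelCol: String,
                     featuresCol: String, weightCol: String): RDD[Instance] = {
  // fall back to a unit weight when no weight column is set
  val w = if (weightCol.nonEmpty) col(weightCol).cast("double") else lit(1.0)
  dataset.select(col(labelCol).cast("double"), w, col(featuresCol)).rdd.map {
    case Row(label: Double, weight: Double, features: Vector) =>
      Instance(label, weight, features)
  }
}
{code}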
[jira] [Created] (SPARK-29094) Add extractInstances method
zhengruifeng created SPARK-29094: Summary: Add extractInstances method Key: SPARK-29094 URL: https://issues.apache.org/jira/browse/SPARK-29094 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29093) Remove automatically generated param setters in _shared_params_code_gen.py
zhengruifeng created SPARK-29093: Summary: Remove automatically generated param setters in _shared_params_code_gen.py Key: SPARK-29093 URL: https://issues.apache.org/jira/browse/SPARK-29093 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng The main difference between the scala and py sides comes from the automatically generated param setters in _shared_params_code_gen.py. To bring them in sync, we should remove those setters in _shared_.py, and add the corresponding setters manually. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28985) Pyspark ClassificationModel and RegressionModel support column setters/getters/predict
[ https://issues.apache.org/jira/browse/SPARK-28985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927498#comment-16927498 ] zhengruifeng commented on SPARK-28985: -- [~huaxingao] You can refer to my old prs [https://github.com/apache/spark/pull/16171] and [https://github.com/apache/spark/pull/25662] if you want to take it over. Thanks! > Pyspark ClassificationModel and RegressionModel support column > setters/getters/predict > -- > > Key: SPARK-28985 > URL: https://issues.apache.org/jira/browse/SPARK-28985 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > 1, add common abstract classes like JavaClassificationModel & > JavaProbabilisticClassificationModel > 2, add column setters/getters, and predict method > 3, update the test suites to verify newly added functions -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9612) Add instance weight support for GBTs
[ https://issues.apache.org/jira/browse/SPARK-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924090#comment-16924090 ] zhengruifeng commented on SPARK-9612: - https://issues.apache.org/jira/browse/SPARK-19591 is now resolved by [~imatiach] [~dbtsai] Will you go on working on this? > Add instance weight support for GBTs > > > Key: SPARK-9612 > URL: https://issues.apache.org/jira/browse/SPARK-9612 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: DB Tsai >Priority: Minor > Labels: bulk-closed > > GBT support for instance weights could be handled by: > * sampling data before passing it to trees > * passing weights to trees (requiring weight support for trees first, but > probably better in the end) -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28968) Add HasNumFeatures in the scala side
[ https://issues.apache.org/jira/browse/SPARK-28968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-28968. -- Resolution: Resolved > Add HasNumFeatures in the scala side > > > Key: SPARK-28968 > URL: https://issues.apache.org/jira/browse/SPARK-28968 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > In the py side, HasNumFeatures is provided and inherited by 'HashingTF' and > 'FeatureHasher'. > It is reasonable to also add HasNumFeatures in the scala side. > Since '1<<18' is used by default in all place, we should add it as a default > into param trait. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28985) Pyspark ClassificationModel and RegressionModel support column setters/getters/predict
[ https://issues.apache.org/jira/browse/SPARK-28985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-28985: - Description: 1, add common abstract classes like JavaClassificationModel & JavaProbabilisticClassificationModel 2, add column setters/getters, and predict method 3, update the test suites to verify newly added functions was: 1, add common abstract classes like ClassificationModel & ProbabilisticClassificationModel 2, add column setters/getters, and predict method 3, update the test suites to verify newly added functions > Pyspark ClassificationModel and RegressionModel support column > setters/getters/predict > -- > > Key: SPARK-28985 > URL: https://issues.apache.org/jira/browse/SPARK-28985 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > 1, add common abstract classes like JavaClassificationModel & > JavaProbabilisticClassificationModel > 2, add column setters/getters, and predict method > 3, update the test suites to verify newly added functions -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923163#comment-16923163 ] zhengruifeng commented on SPARK-28927: -- [~JerryHouse] As to AUC, which impl do you use? BinaryClassificationEvaluator or BinaryClassificationMetrics? If you use BinaryClassificationMetrics, you may try to set numBins=0 to avoid down-sampling, then we can see whether the score is stable. Moreover, could you please provide a (small) dataframe to reproduce? > ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets > with 12 billion instances > --- > > Key: SPARK-28927 > URL: https://issues.apache.org/jira/browse/SPARK-28927 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Priority: Major > Attachments: image-2019-09-02-11-55-33-596.png > > > The stack trace is below: > {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 > BlockManager: Block rdd_10916_493 could not be removed as it was not found on > disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for > task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) > java.lang.ArrayIndexOutOfBoundsException: 6741 at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) > at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:108) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {quote} > This exception happened sometimes. And we also found that the AUC metric was > not stable when evaluating the inner product of the user factors and the item > factors with the same dataset and configuration. AUC varied from 0.60 to 0.67 > which was not stable for production environment. > Dataset capacity: ~12 billion ratings > Here is the our code: > val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, >
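Regarding the numBins suggestion in the comment above, a small sketch (illustrative only) of computing the AUC without down-sampling:
{code:java}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD

// numBins = 0 disables down-sampling: every distinct score is kept as a
// threshold, so the AUC is exact (at the cost of more memory and shuffle)
def exactAUC(scoreAndLabels: RDD[(Double, Double)]): Double =
  new BinaryClassificationMetrics(scoreAndLabels, numBins = 0).areaUnderROC()
{code}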
[jira] [Updated] (SPARK-28958) pyspark.ml function parity
[ https://issues.apache.org/jira/browse/SPARK-28958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-28958: - Description: I looked into the hierarchy of both py and scala sides, and found that they are quite different, which damage the parity and make the codebase hard to maintain. The main inconvenience is that most models in pyspark do not support any param getters and setters. In the py side, I think we need to do: 1, remove setters generated by _shared_params_code_gen.py; 2, add common abstract classes like the side side, such as JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier; 3, for each alg, add its param trait, such as LinearSVCParams; 4, since sharedParam do not have setters, we need to add them in right places; Unfortunately, I notice that if we do 1 (remove setters generated by _shared_params_code_gen.py), all algs (classification/regression/clustering/features/fpm/recommendation) need to be modified in one batch. The scala side also need some small improvements, but I think they can be leave alone at first was: I looked into the hierarchy of both py and scala sides, and found that they are quite different, which damage the parity and make the codebase hard to maintain. The main inconvenience is that most models in pyspark do not support any param getters and setters. In the py side, I think we need to do: 1, remove setters generated by _shared_params_code_gen.py; 2, add common abstract classes like the side side, such as JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier; 3, for each alg, add its param trait, such as LinearSVCParams; 4, since sharedParam do not have setters, we need to add them in right places; Unfortunately, I notice that if we do 1 (remove setters generated by _shared_params_code_gen.py), all algs (classification/regression/clustering/features/fpm/recommendation) need to be modified in one batch. The scala side also need some small improvements, but I think they can be leave alone at first, to avoid a lot of MiMa Failures. > pyspark.ml function parity > -- > > Key: SPARK-28958 > URL: https://issues.apache.org/jira/browse/SPARK-28958 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > Attachments: ML_SYNC.pdf > > > I looked into the hierarchy of both py and scala sides, and found that they > are quite different, which damage the parity and make the codebase hard to > maintain. > The main inconvenience is that most models in pyspark do not support any > param getters and setters. > In the py side, I think we need to do: > 1, remove setters generated by _shared_params_code_gen.py; > 2, add common abstract classes like the side side, such as > JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier; > 3, for each alg, add its param trait, such as LinearSVCParams; > 4, since sharedParam do not have setters, we need to add them in right places; > Unfortunately, I notice that if we do 1 (remove setters generated by > _shared_params_code_gen.py), all algs > (classification/regression/clustering/features/fpm/recommendation) need to be > modified in one batch. > The scala side also need some small improvements, but I think they can be > leave alone at first -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28985) Pyspark ClassificationModel and RegressionModel support column setters/getters/predict
zhengruifeng created SPARK-28985: Summary: Pyspark ClassificationModel and RegressionModel support column setters/getters/predict Key: SPARK-28985 URL: https://issues.apache.org/jira/browse/SPARK-28985 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng 1, add common abstract classes like ClassificationModel & ProbabilisticClassificationModel 2, add column setters/getters, and predict method 3, update the test suites to verify newly added functions -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28969) OneVsRestModel in the py side should not set WeightCol and Classifier
[ https://issues.apache.org/jira/browse/SPARK-28969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-28969: - Parent: SPARK-28958 Issue Type: Sub-task (was: Improvement) > OneVsRestModel in the py side should not set WeightCol and Classifier > - > > Key: SPARK-28969 > URL: https://issues.apache.org/jira/browse/SPARK-28969 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > 'WeightCol' and 'Classifier' can only be set in the estimator. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28969) OneVsRestModel in the py side should not set WeightCol and Classifier
[ https://issues.apache.org/jira/browse/SPARK-28969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922994#comment-16922994 ] zhengruifeng commented on SPARK-28969: -- friendly ping [~huaxingao] > OneVsRestModel in the py side should not set WeightCol and Classifier > - > > Key: SPARK-28969 > URL: https://issues.apache.org/jira/browse/SPARK-28969 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > 'WeightCol' and 'Classifier' can only be set in the estimator. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28969) OneVsRestModel in the py side should not set WeightCol and Classifier
zhengruifeng created SPARK-28969: Summary: OneVsRestModel in the py side should not set WeightCol and Classifier Key: SPARK-28969 URL: https://issues.apache.org/jira/browse/SPARK-28969 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng 'WeightCol' and 'Classifier' can only be set in the estimator. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28968) Add HasNumFeatures in the scala side
zhengruifeng created SPARK-28968: Summary: Add HasNumFeatures in the scala side Key: SPARK-28968 URL: https://issues.apache.org/jira/browse/SPARK-28968 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng In the py side, HasNumFeatures is provided and inherited by 'HashingTF' and 'FeatureHasher'. It is reasonable to also add HasNumFeatures in the scala side. Since '1<<18' is used by default in all places, we should add it as a default in the param trait. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
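A hedged sketch of what the scala-side shared param trait could look like (the doc string, validator and trait placement are assumptions; the default mirrors the current 1<<18):
{code:java}
import org.apache.spark.ml.param.{IntParam, ParamValidators, Params}

private[ml] trait HasNumFeatures extends Params {

  final val numFeatures: IntParam = new IntParam(this, "numFeatures",
    "Number of features. Should be greater than 0.", ParamValidators.gt(0))

  setDefault(numFeatures -> (1 << 18))

  final def getNumFeatures: Int = $(numFeatures)
}
{code}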
[jira] [Updated] (SPARK-28958) pyspark.ml function parity
[ https://issues.apache.org/jira/browse/SPARK-28958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-28958: - Attachment: ML_SYNC.pdf > pyspark.ml function parity > -- > > Key: SPARK-28958 > URL: https://issues.apache.org/jira/browse/SPARK-28958 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > Attachments: ML_SYNC.pdf > > > I looked into the hierarchy of both py and scala sides, and found that they > are quite different, which damage the parity and make the codebase hard to > maintain. > The main inconvenience is that most models in pyspark do not support any > param getters and setters. > In the py side, I think we need to do: > 1, remove setters generated by _shared_params_code_gen.py; > 2, add common abstract classes like the side side, such as > JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier; > 3, for each alg, add its param trait, such as LinearSVCParams; > 4, since sharedParam do not have setters, we need to add them in right places; > Unfortunately, I notice that if we do 1 (remove setters generated by > _shared_params_code_gen.py), all algs > (classification/regression/clustering/features/fpm/recommendation) need to be > modified in one batch. > The scala side also need some small improvements, but I think they can be > leave alone at first, to avoid a lot of MiMa Failures. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28958) pyspark.ml function parity
zhengruifeng created SPARK-28958: Summary: pyspark.ml function parity Key: SPARK-28958 URL: https://issues.apache.org/jira/browse/SPARK-28958 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng I looked into the hierarchy of both the py and scala sides, and found that they are quite different, which damages the parity and makes the codebase hard to maintain. The main inconvenience is that most models in pyspark do not support any param getters and setters. In the py side, I think we need to do: 1, remove setters generated by _shared_params_code_gen.py; 2, add common abstract classes like the scala side, such as JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier; 3, for each alg, add its param trait, such as LinearSVCParams; 4, since shared params do not have setters, we need to add them in the right places; Unfortunately, I notice that if we do 1 (remove setters generated by _shared_params_code_gen.py), all algs (classification/regression/clustering/features/fpm/recommendation) need to be modified in one batch. The scala side also needs some small improvements, but I think they can be left alone at first, to avoid a lot of MiMa failures. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28372) Document Spark WEB UI
[ https://issues.apache.org/jira/browse/SPARK-28372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921151#comment-16921151 ] zhengruifeng commented on SPARK-28372: -- [~smilegator] I think we may need to add a subtask for streaming? As [~planga82] suggested. > Document Spark WEB UI > - > > Key: SPARK-28372 > URL: https://issues.apache.org/jira/browse/SPARK-28372 > Project: Spark > Issue Type: Umbrella > Components: Documentation, Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > Spark web UIs are being used to monitor the status and resource consumption > of your Spark applications and clusters. However, we do not have the > corresponding document. It is hard for end users to use and understand them. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28373) Document JDBC/ODBC Server page
[ https://issues.apache.org/jira/browse/SPARK-28373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921138#comment-16921138 ] zhengruifeng commented on SPARK-28373: -- [~planga82] Thanks!:D > Document JDBC/ODBC Server page > -- > > Key: SPARK-28373 > URL: https://issues.apache.org/jira/browse/SPARK-28373 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > !https://user-images.githubusercontent.com/5399861/60809590-9dcf2500-a1bd-11e9-826e-33729bb97daf.png|width=1720,height=503! > > [https://github.com/apache/spark/pull/25062] added a new column CLOSE TIME > and EXECUTION TIME. It is hard to understand the difference. We need to > document them; otherwise, it is hard for end users to understand them > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28373) Document JDBC/ODBC Server page
[ https://issues.apache.org/jira/browse/SPARK-28373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920599#comment-16920599 ] zhengruifeng commented on SPARK-28373: -- [~smilegator] [~yumwang] I am afraid I have no time to do it this week. [~planga82] Could you please take it over? > Document JDBC/ODBC Server page > -- > > Key: SPARK-28373 > URL: https://issues.apache.org/jira/browse/SPARK-28373 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > !https://user-images.githubusercontent.com/5399861/60809590-9dcf2500-a1bd-11e9-826e-33729bb97daf.png|width=1720,height=503! > > [https://github.com/apache/spark/pull/25062] added a new column CLOSE TIME > and EXECUTION TIME. It is hard to understand the difference. We need to > document them; otherwise, it is hard for end users to understand them > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28858) add tree-based transformation in the py side
zhengruifeng created SPARK-28858: Summary: add tree-based transformation in the py side Key: SPARK-28858 URL: https://issues.apache.org/jira/browse/SPARK-28858 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng Expose the newly added tree-based transformation in the py side -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28780) Delete the incorrect setWeightCol method in LinearSVCModel
zhengruifeng created SPARK-28780: Summary: Delete the incorrect setWeightCol method in LinearSVCModel Key: SPARK-28780 URL: https://issues.apache.org/jira/browse/SPARK-28780 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.4.0, 2.3.0, 2.2.0, 3.0.0 Reporter: zhengruifeng 1, the weightCol is only used in training, and should not be set in LinearSVCModel; 2, the method 'def setWeightCol(value: Double): this.type = set(threshold, value)' is wrongly defined, since value should be a string and weightCol instead of threshold should be set. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28542) Document Stages page
[ https://issues.apache.org/jira/browse/SPARK-28542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910145#comment-16910145 ] zhengruifeng commented on SPARK-28542: -- [~planga82] Just go ahead! Thanks! > Document Stages page > > > Key: SPARK-28542 > URL: https://issues.apache.org/jira/browse/SPARK-28542 > Project: Spark > Issue Type: Sub-task > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28373) Document JDBC/ODBC Server page
[ https://issues.apache.org/jira/browse/SPARK-28373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905937#comment-16905937 ] zhengruifeng commented on SPARK-28373: -- [~yumwang] I had just create a page in https://issues.apache.org/jira/browse/SPARK-28538, you can add the relative doc in it. Thanks. > Document JDBC/ODBC Server page > -- > > Key: SPARK-28373 > URL: https://issues.apache.org/jira/browse/SPARK-28373 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > !https://user-images.githubusercontent.com/5399861/60809590-9dcf2500-a1bd-11e9-826e-33729bb97daf.png|width=1720,height=503! > > [https://github.com/apache/spark/pull/25062] added a new column CLOSE TIME > and EXECUTION TIME. It is hard to understand the difference. We need to > document them; otherwise, it is hard for end users to understand them > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28543) Document Spark Jobs page
[ https://issues.apache.org/jira/browse/SPARK-28543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905935#comment-16905935 ] zhengruifeng commented on SPARK-28543: -- [~planga82] I had just create a page in https://issues.apache.org/jira/browse/SPARK-28538, you can add the relative doc in it. > Document Spark Jobs page > > > Key: SPARK-28543 > URL: https://issues.apache.org/jira/browse/SPARK-28543 > Project: Spark > Issue Type: Sub-task > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28579) MaxAbsScaler avoids conversion to breeze.vector
zhengruifeng created SPARK-28579: Summary: MaxAbsScaler avoids conversion to breeze.vector Key: SPARK-28579 URL: https://issues.apache.org/jira/browse/SPARK-28579 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng In the current impl, MaxAbsScaler converts each vector to a breeze.vector during transformation. This conversion should be skipped. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
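A minimal sketch of the idea (an assumption of the approach, not the actual fix): scale the values on a plain array and rebuild an ml Vector, so no breeze vector is created per row. The maxAbs array stands for the per-column maxima learned during fit.
{code:java}
import org.apache.spark.ml.linalg.{Vector, Vectors}

def scaleByMaxAbs(v: Vector, maxAbs: Array[Double]): Vector = {
  val values = v.toArray.clone()  // work on a copy, keep the input vector intact
  var i = 0
  while (i < values.length) {
    if (maxAbs(i) != 0.0) values(i) /= maxAbs(i)
    i += 1
  }
  Vectors.dense(values)
}
{code}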
[jira] [Created] (SPARK-28514) Remove the redundant transformImpl method in RF & GBT
zhengruifeng created SPARK-28514: Summary: Remove the redundant transformImpl method in RF & GBT Key: SPARK-28514 URL: https://issues.apache.org/jira/browse/SPARK-28514 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng 1, In GBTClassifier & RandomForestClassifier, the real transform methods inherit from ProbabilisticClassificationModel which can deal with multi output columns. The transformImpl method, which deals with only one column - predictionCol, completely does nothing. This is quite confusing. 2, In GBTRegressor & RandomForestRegressor, the transformImpl do exactly what the superclass PredictionModel does (except model broadcasting), so can be removed. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28499) Optimize MinMaxScaler
[ https://issues.apache.org/jira/browse/SPARK-28499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-28499: - Description: current impl of MinMaxScaler has some small places to be optimized: 1, avoid call param getter in udf. If I remember correctly, there was some tickets and prs about this, calling param getter in udf or map function, will significantly slow down the computation. 2, for a constant dim, the transformed value is also a constant value, which can be precomputed. 3, for a usual dim (i-th), the value is update by values(i) = (values(i) - minArray(i)) / range(i) * scale + $(min) here, we can precompute scale / range, so that a division can be skipped. was: current impl of MinMaxScaler has some small places to be optimized: 1, avoid call param getter in udf. If I remember correctly, there was some tickets and prs about this, calling param getter in udf or map function, will significantly slow down the computation. 2, for a constant dim, the transformed value is also a constant value, which can be precomputed. 3, for a usual dim (i-th), the value is update by values(i) = (values(i) - minArray(i)) / range(i) * scale + $(min) here, we can precompute range * scale, so that a division can be skipped. > Optimize MinMaxScaler > - > > Key: SPARK-28499 > URL: https://issues.apache.org/jira/browse/SPARK-28499 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > current impl of MinMaxScaler has some small places to be optimized: > 1, avoid call param getter in udf. > If I remember correctly, there was some tickets and prs about this, calling > param getter in udf or map function, will significantly slow down the > computation. > 2, for a constant dim, the transformed value is also a constant value, which > can be precomputed. > 3, for a usual dim (i-th), the value is update by > values(i) = (values(i) - minArray(i)) / range(i) * scale + $(min) > here, we can precompute scale / range, so that a division can be skipped. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
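A small sketch of point 3 with the constant factored out (variable names are illustrative, not the actual Spark code): with scaleArr(i) = scale / range(i) precomputed once after fitting, the per-element update becomes a single multiply-add and the division leaves the hot loop.
{code:java}
def rescaleInPlace(values: Array[Double], minArr: Array[Double],
                   scaleArr: Array[Double], min: Double): Unit = {
  var i = 0
  while (i < values.length) {
    // equivalent to (values(i) - minArr(i)) / range(i) * scale + min
    values(i) = (values(i) - minArr(i)) * scaleArr(i) + min
    i += 1
  }
}
{code}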
[jira] [Created] (SPARK-28499) Optimize MinMaxScaler
zhengruifeng created SPARK-28499: Summary: Optimize MinMaxScaler Key: SPARK-28499 URL: https://issues.apache.org/jira/browse/SPARK-28499 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng The current impl of MinMaxScaler has some small places to be optimized: 1, avoid calling param getters in a udf. If I remember correctly, there were some tickets and prs about this: calling a param getter in a udf or map function will significantly slow down the computation. 2, for a constant dim, the transformed value is also a constant value, which can be precomputed. 3, for a usual dim (the i-th), the value is updated by values(i) = (values(i) - minArray(i)) / range(i) * scale + $(min) here, we can precompute range * scale, so that a division can be skipped. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13677) Support Tree-Based Feature Transformation for ML
[ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-13677: - Description: It would be nice to be able to use RF and GBT for feature transformation: First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on the training set. Then each leaf of each tree in the ensemble is assigned a fixed arbitrary feature index in a new feature space. These leaf indices are then encoded in a one-hot fashion. This method was first introduced by facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is implemented in famous libraries: sklearn [apply|[http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]] xgboost [predict_leaf_index|[https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]] lightgbm [predict_leaf_index|https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_leaf_index] catboost [calc_leaf_index|https://github.com/catboost/tutorials/tree/master/leaf_indexes_calculation] Refering to the design of above impls, I propose following api: val model1 : DecisionTreeClassificationModel= ... model1.setLeafCol("leaves") model1.transform(df) val model2 : GBTClassificationModel = ... model2.getLeafCol model2.transform(df) The detailed design doc: [https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing] was: It would be nice to be able to use RF and GBT for feature transformation: First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on the training set. Then each leaf of each tree in the ensemble is assigned a fixed arbitrary feature index in a new feature space. These leaf indices are then encoded in a one-hot fashion. This method was first introduced by facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is implemented in famous libraries: sklearn [apply|[http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]] xgboost [lpredict_leaf_index|[https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]] lightgbm [predict_leaf_index|https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_leaf_index] catboost [calc_leaf_index|https://github.com/catboost/tutorials/tree/master/leaf_indexes_calculation] Refering to the design of above impls, I propose following api: val model1 : DecisionTreeClassificationModel= ... model1.setLeafCol("leaves") model1.transform(df) val model2 : GBTClassificationModel = ... model2.getLeafCol model2.transform(df) The detailed design doc: [https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing] > Support Tree-Based Feature Transformation for ML > > > Key: SPARK-13677 > URL: https://issues.apache.org/jira/browse/SPARK-13677 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: zhengruifeng >Priority: Major > > It would be nice to be able to use RF and GBT for feature transformation: > First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on > the training set. Then each leaf of each tree in the ensemble is assigned a > fixed arbitrary feature index in a new feature space. These leaf indices are > then encoded in a one-hot fashion. 
> This method was first introduced by > facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is > implemented in famous libraries: > sklearn > [apply|[http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]] > xgboost > [predict_leaf_index|[https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]] > lightgbm > [predict_leaf_index|https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_leaf_index] > catboost > [calc_leaf_index|https://github.com/catboost/tutorials/tree/master/leaf_indexes_calculation] > > > Refering to the design of above impls, I propose following api: > val model1 : DecisionTreeClassificationModel= ... > model1.setLeafCol("leaves") > model1.transform(df) > > val model2 : GBTClassificationModel = ... > model2.getLeafCol > model2.transform(df) > > The detailed design doc: > [https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13677) Support Tree-Based Feature Transformation for ML
[ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-13677: - Description: It would be nice to be able to use RF and GBT for feature transformation: First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on the training set. Then each leaf of each tree in the ensemble is assigned a fixed arbitrary feature index in a new feature space. These leaf indices are then encoded in a one-hot fashion. This method was first introduced by facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is implemented in famous libraries: sklearn [apply|[http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]] xgboost [lpredict_leaf_index|[https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]] lightgbm [predict_leaf_index|https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_leaf_index] catboost [calc_leaf_index|https://github.com/catboost/tutorials/tree/master/leaf_indexes_calculation] Refering to the design of above impls, I propose following api: val model1 : DecisionTreeClassificationModel= ... model1.setLeafCol("leaves") model1.transform(df) val model2 : GBTClassificationModel = ... model2.getLeafCol model2.transform(df) The detailed design doc: [https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing] was: It would be nice to be able to use RF and GBT for feature transformation: First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on the training set. Then each leaf of each tree in the ensemble is assigned a fixed arbitrary feature index in a new feature space. These leaf indices are then encoded in a one-hot fashion. This method was first introduced by facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is implemented in two famous library: sklearn ([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]) xgboost ([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]) api: val model1 : DecisionTreeClassificationModel= ... model1.setLeafCol("leaves") model1.transform(df) val model2 : GBTClassificationModel = ... model2.getLeafCol model2.transform(df) design doc: [https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing] > Support Tree-Based Feature Transformation for ML > > > Key: SPARK-13677 > URL: https://issues.apache.org/jira/browse/SPARK-13677 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: zhengruifeng >Priority: Major > > It would be nice to be able to use RF and GBT for feature transformation: > First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on > the training set. Then each leaf of each tree in the ensemble is assigned a > fixed arbitrary feature index in a new feature space. These leaf indices are > then encoded in a one-hot fashion. 
> This method was first introduced by > facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is > implemented in famous libraries: > sklearn > [apply|[http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]] > xgboost > [lpredict_leaf_index|[https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]] > lightgbm > [predict_leaf_index|https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_leaf_index] > catboost > [calc_leaf_index|https://github.com/catboost/tutorials/tree/master/leaf_indexes_calculation] > > > Refering to the design of above impls, I propose following api: > val model1 : DecisionTreeClassificationModel= ... > model1.setLeafCol("leaves") > model1.transform(df) > > val model2 : GBTClassificationModel = ... > model2.getLeafCol > model2.transform(df) > > The detailed design doc: > [https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28421) SparseVector.apply performance optimization
zhengruifeng created SPARK-28421: Summary: SparseVector.apply performance optimization Key: SPARK-28421 URL: https://issues.apache.org/jira/browse/SPARK-28421 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng The current impl of SparseVector.apply is inefficient: on each call, a breeze.linalg.SparseVector and a breeze.collection.mutable.SparseArray are created internally, and then a binary search is used to locate the input position. This should be optimized like .ml.SparseMatrix, which uses binary search directly, without conversion to breeze.linalg.Matrix. I tested the performance and found that avoiding the internal conversions yields a 2.5~5X speedup. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
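A rough sketch of the direct lookup suggested above, assuming the usual (size, indices, values) layout of a sparse vector; this illustrates the idea rather than the actual patch.

{code:scala}
// Sketch: look up position i in a sparse vector stored as a sorted `indices`
// array plus a parallel `values` array, without converting to a Breeze vector.
def applySparse(size: Int, indices: Array[Int], values: Array[Double], i: Int): Double = {
  require(i >= 0 && i < size, s"index $i out of bounds [0, $size)")
  val j = java.util.Arrays.binarySearch(indices, i)
  if (j >= 0) values(j) else 0.0   // positions not stored are implicit zeros
}
{code}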
[jira] [Updated] (SPARK-28399) Impl RobustScaler
[ https://issues.apache.org/jira/browse/SPARK-28399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-28399: - Issue Type: New Feature (was: Improvement) > Impl RobustScaler > - > > Key: SPARK-28399 > URL: https://issues.apache.org/jira/browse/SPARK-28399 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > RobustScaler is a widely-used scaler, which uses the median/IQR in place of > the mean/std used by StandardScaler. It produces results that are much more > robust to outliers. It is already a part of > [Scikit-Learn|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler]. > So far, it is not implemented in ML. > I encountered a practical case that needs this feature, and noticed that other > users also wanted this functionality in SPARK-17934, so I plan to add it to ML. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28399) Impl RobustScaler
zhengruifeng created SPARK-28399: Summary: Impl RobustScaler Key: SPARK-28399 URL: https://issues.apache.org/jira/browse/SPARK-28399 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng RobustScaler is a widely-used scaler, which uses the median/IQR in place of the mean/std used by StandardScaler. It produces results that are much more robust to outliers. It is already a part of [Scikit-Learn|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler]. So far, it is not implemented in ML. I encountered a practical case that needs this feature, and noticed that other users also wanted this functionality in SPARK-17934, so I plan to add it to ML. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
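For reference, a minimal sketch of the centering/scaling rule RobustScaler applies (subtract the median, divide by the IQR), shown on a single numeric column; {{df}}, the column name, and the 25%/75% quantile range are illustrative assumptions, not the final API.

{code:scala}
import org.apache.spark.sql.functions.col

// Sketch: robust scaling of one numeric column "x" of a DataFrame `df`.
// The ML transformer would apply the same rule to every feature of a Vector column.
val Array(q1, median, q3) = df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.001)
val iqr = q3 - q1
val scaled = df.withColumn("x_scaled", (col("x") - median) / iqr)
{code}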
[jira] [Resolved] (SPARK-27656) Safely register class for GraphX
[ https://issues.apache.org/jira/browse/SPARK-27656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-27656. -- Resolution: Not A Problem > Safely register class for GraphX > > > Key: SPARK-27656 > URL: https://issues.apache.org/jira/browse/SPARK-27656 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 2.4.3 >Reporter: zhengruifeng >Priority: Major > > GraphX common classes (such as Edge, EdgeTriplet) are not registered in Kryo > by default. > Users can register those classes via > {{GraphXUtils.{color:#ffc66d}registerKryoClasses{color}}}, however, it seems > that no GraphX library implementation calls it, and users tend to ignore this > registration. > So I prefer to safely register them in {{KryoSerializer.scala}}, as > SQL and ML do. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
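A sketch of what the "safe" registration could look like: register the GraphX classes only when graphx is actually on the classpath, in the spirit of the ticket's reference to how SQL and ML classes are handled; the class list and helper name are illustrative.

{code:scala}
import com.esotericsoftware.kryo.Kryo

// Sketch: register GraphX classes in KryoSerializer only if graphx is present,
// so core does not gain a hard dependency on the graphx module.
def registerGraphXClassesSafely(kryo: Kryo): Unit = {
  Seq(
    "org.apache.spark.graphx.Edge",
    "org.apache.spark.graphx.EdgeTriplet"
  ).foreach { name =>
    try {
      kryo.register(Class.forName(name))
    } catch {
      case _: ClassNotFoundException => // graphx not on the classpath; skip
    }
  }
}
{code}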
[jira] [Updated] (SPARK-28159) Make the transform natively in ml framework to avoid extra conversion
[ https://issues.apache.org/jira/browse/SPARK-28159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-28159: - Description: It has been a long time since ML was released. However, there are still many TODOs (like in [ChiSqSelector.scala|https://github.com/apache/spark/pull/24963/files#diff-9b0bc8a01b34c38958ce45c14f9c5da5] {// TODO: Make the transformer natively in ml framework to avoid extra conversion.}) about making transforms native in the ml framework. I am trying to make the ml algorithms no longer need to convert ml vectors to mllib vectors in their transforms, including: LDA/ChiSqSelector/ElementwiseProduct/HashingTF/IDF/Normalizer/PCA/StandardScaler. was: It has been a long time since ML was released. However, there are still many TODOs about making transforms native in the ml framework. > Make the transform natively in ml framework to avoid extra conversion > - > > Key: SPARK-28159 > URL: https://issues.apache.org/jira/browse/SPARK-28159 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > It has been a long time since ML was released. > However, there are still many TODOs (like in > [ChiSqSelector.scala|https://github.com/apache/spark/pull/24963/files#diff-9b0bc8a01b34c38958ce45c14f9c5da5] > {// TODO: Make the transformer natively in ml framework to avoid extra > conversion.}) about making transforms native in the ml framework. > > I am trying to make the ml algorithms no longer need to convert ml vectors to > mllib vectors in their transforms, including: > LDA/ChiSqSelector/ElementwiseProduct/HashingTF/IDF/Normalizer/PCA/StandardScaler. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
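As an illustration of the extra conversion being removed, a sketch of an ElementwiseProduct-style transform done directly on ml.linalg vectors, with no mllib round trip; {{df}} and the column names are assumptions for illustration, not the actual patch.

{code:scala}
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Sketch: scale each element of an ml.linalg Vector natively,
// instead of converting to an mllib vector and back.
val scalingVec = Vectors.dense(2.0, 0.5, 1.0)
val scale = udf { v: Vector =>
  v match {
    case d: DenseVector =>
      Vectors.dense(Array.tabulate(d.size)(i => d(i) * scalingVec(i)))
    case s: SparseVector =>
      Vectors.sparse(s.size, s.indices,
        s.indices.zip(s.values).map { case (i, value) => value * scalingVec(i) })
  }
}
val out = df.withColumn("scaled", scale(col("features")))
{code}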
[jira] [Created] (SPARK-28159) Make the transform natively in ml framework to avoid extra conversion
zhengruifeng created SPARK-28159: Summary: Make the transform natively in ml framework to avoid extra conversion Key: SPARK-28159 URL: https://issues.apache.org/jira/browse/SPARK-28159 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng It has been a long time since ML was released. However, there are still many TODOs about making transforms native in the ml framework. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28154) GMM fix double caching
zhengruifeng created SPARK-28154: Summary: GMM fix double caching Key: SPARK-28154 URL: https://issues.apache.org/jira/browse/SPARK-28154 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.4.0, 2.3.0, 3.0.0 Reporter: zhengruifeng The intermediate rdd is always cached. We should only cache it if necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
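A sketch of the handlePersistence pattern this points at: cache the intermediate RDD only when the input dataset is not already persisted, and release it afterwards; {{dataset}} and the training loop are placeholders, not the actual patch.

{code:scala}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.storage.StorageLevel

// Sketch: only cache the derived RDD if the caller has not cached the input.
val handlePersistence = dataset.storageLevel == StorageLevel.NONE
val instances = dataset.select("features").rdd.map(_.getAs[Vector](0))
if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)

// ... run the EM iterations over `instances` ...

if (handlePersistence) instances.unpersist()
{code}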
[jira] [Created] (SPARK-28117) LDA and BisectingKMeans cache the input dataset if necessary
zhengruifeng created SPARK-28117: Summary: LDA and BisectingKMeans cache the input dataset if necessary Key: SPARK-28117 URL: https://issues.apache.org/jira/browse/SPARK-28117 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng In MLlib LDA, the EM solver caches the dataset internally, while the Online solver does not. So in ML LDA, we need to cache the intermediate dataset if necessary. BisectingKMeans needs this too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13677) Support Tree-Based Feature Transformation for ML
[ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-13677: - Description: It would be nice to be able to use RF and GBT for feature transformation: First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on the training set. Then each leaf of each tree in the ensemble is assigned a fixed arbitrary feature index in a new feature space. These leaf indices are then encoded in a one-hot fashion. This method was first introduced by facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is implemented in two famous library: sklearn ([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]) xgboost ([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]) api: val model1 : DecisionTreeClassificationModel= ... model1.setLeafCol("leaves") model1.transform(df) val model2 : GBTClassificationModel = ... model2.getLeafCol model2.transform(df) design doc: [https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing] was: It would be nice to be able to use RF and GBT for feature transformation: First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on the training set. Then each leaf of each tree in the ensemble is assigned a fixed arbitrary feature index in a new feature space. These leaf indices are then encoded in a one-hot fashion. This method was first introduced by facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is implemented in two famous library: sklearn ([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]) xgboost ([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]) I have implement it in mllib: val model1 : DecisionTreeClassificationModel= ... model1.setLeafCol("leaves") model1.transform(df) val model2 : GBTClassificationModel = ... model2.transform(df) design doc: https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing > Support Tree-Based Feature Transformation for ML > > > Key: SPARK-13677 > URL: https://issues.apache.org/jira/browse/SPARK-13677 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: zhengruifeng >Priority: Major > > It would be nice to be able to use RF and GBT for feature transformation: > First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on > the training set. Then each leaf of each tree in the ensemble is assigned a > fixed arbitrary feature index in a new feature space. These leaf indices are > then encoded in a one-hot fashion. > This method was first introduced by > facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is > implemented in two famous library: > sklearn > ([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]) > xgboost > ([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]) > > api: > val model1 : DecisionTreeClassificationModel= ... > model1.setLeafCol("leaves") > model1.transform(df) > > val model2 : GBTClassificationModel = ... 
> model2.getLeafCol > model2.transform(df) > > > design doc: > [https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13677) Support Tree-Based Feature Transformation for ML
[ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-13677: - Priority: Major (was: Minor) > Support Tree-Based Feature Transformation for ML > > > Key: SPARK-13677 > URL: https://issues.apache.org/jira/browse/SPARK-13677 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: zhengruifeng >Priority: Major > > It would be nice to be able to use RF and GBT for feature transformation: > First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on > the training set. Then each leaf of each tree in the ensemble is assigned a > fixed arbitrary feature index in a new feature space. These leaf indices are > then encoded in a one-hot fashion. > This method was first introduced by > facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is > implemented in two famous library: > sklearn > ([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]) > xgboost > ([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]) > I have implement it in mllib: > val model1 : DecisionTreeClassificationModel= ... > model1.setLeafCol("leaves") > model1.transform(df) > val model2 : GBTClassificationModel = ... > model2.transform(df) > > > design doc: > https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13677) Support Tree-Based Feature Transformation for ML
[ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867507#comment-16867507 ] zhengruifeng commented on SPARK-13677: -- I closed this ticket since the old PR was based on the mllib API, and at that time the tree implementations were being refactored and implemented directly in ml. I am reopening it now since I have re-designed it on the ml side. > Support Tree-Based Feature Transformation for ML > > > Key: SPARK-13677 > URL: https://issues.apache.org/jira/browse/SPARK-13677 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: zhengruifeng >Priority: Minor > > It would be nice to be able to use RF and GBT for feature transformation: > First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on > the training set. Then each leaf of each tree in the ensemble is assigned a > fixed arbitrary feature index in a new feature space. These leaf indices are > then encoded in a one-hot fashion. > This method was first introduced by > facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is > implemented in two famous library: > sklearn > ([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]) > xgboost > ([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]) > I have implement it in mllib: > val model1 : DecisionTreeClassificationModel= ... > model1.setLeafCol("leaves") > model1.transform(df) > val model2 : GBTClassificationModel = ... > model2.transform(df) > > > design doc: > https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-13677) Support Tree-Based Feature Transformation for ML
[ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reopened SPARK-13677: -- update the design > Support Tree-Based Feature Transformation for ML > > > Key: SPARK-13677 > URL: https://issues.apache.org/jira/browse/SPARK-13677 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: zhengruifeng >Priority: Minor > > It would be nice to be able to use RF and GBT for feature transformation: > First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on > the training set. Then each leaf of each tree in the ensemble is assigned a > fixed arbitrary feature index in a new feature space. These leaf indices are > then encoded in a one-hot fashion. > This method was first introduced by > facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is > implemented in two famous library: > sklearn > ([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]) > xgboost > ([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]) > I have implement it in mllib: > val model1 : DecisionTreeClassificationModel= ... > model1.setLeafCol("leaves") > model1.transform(df) > val model2 : GBTClassificationModel = ... > model2.transform(df) > > > design doc: > https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13677) Support Tree-Based Feature Transformation for ML
[ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-13677: - Description: It would be nice to be able to use RF and GBT for feature transformation: First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on the training set. Then each leaf of each tree in the ensemble is assigned a fixed arbitrary feature index in a new feature space. These leaf indices are then encoded in a one-hot fashion. This method was first introduced by facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is implemented in two famous library: sklearn ([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]) xgboost ([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]) I have implement it in mllib: val model1 : DecisionTreeClassificationModel= ... model1.setLeafCol("leaves") model1.transform(df) val model2 : GBTClassificationModel = ... model2.transform(df) design doc: https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing was: It would be nice to be able to use RF and GBT for feature transformation: First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on the training set. Then each leaf of each tree in the ensemble is assigned a fixed arbitrary feature index in a new feature space. These leaf indices are then encoded in a one-hot fashion. This method was first introduced by facebook(http://www.herbrich.me/papers/adclicksfacebook.pdf), and is implemented in two famous library: sklearn (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py) xgboost (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py) I have implement it in mllib: val features : RDD[Vector] = ... val model1 : RandomForestModel = ... val transformed1 : RDD[Vector] = model1.leaf(features) val model2 : GradientBoostedTreesModel = ... val transformed2 : RDD[Vector] = model2.leaf(features) > Support Tree-Based Feature Transformation for ML > > > Key: SPARK-13677 > URL: https://issues.apache.org/jira/browse/SPARK-13677 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: zhengruifeng >Priority: Minor > > It would be nice to be able to use RF and GBT for feature transformation: > First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on > the training set. Then each leaf of each tree in the ensemble is assigned a > fixed arbitrary feature index in a new feature space. These leaf indices are > then encoded in a one-hot fashion. > This method was first introduced by > facebook([http://www.herbrich.me/papers/adclicksfacebook.pdf]), and is > implemented in two famous library: > sklearn > ([http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]) > xgboost > ([https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]) > I have implement it in mllib: > val model1 : DecisionTreeClassificationModel= ... > model1.setLeafCol("leaves") > model1.transform(df) > val model2 : GBTClassificationModel = ... 
> model2.transform(df) > > > design doc: > https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-27018: - Component/s: Spark Core > Checkpointed RDD deleted prematurely when using GBTClassifier > - > > Key: SPARK-27018 > URL: https://issues.apache.org/jira/browse/SPARK-27018 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core >Affects Versions: 2.2.2, 2.2.3, 2.3.3, 2.4.0 > Environment: OS: Ubuntu Linux 18.10 > Java: java version "1.8.0_201" > Java(TM) SE Runtime Environment (build 1.8.0_201-b09) > Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode) > Reproducible with a single-node Spark in standalone mode. > Reproducible with Zepellin or Spark shell. > >Reporter: Piotr Kołaczkowski >Priority: Major > Attachments: > Fix_check_if_the_next_checkpoint_exists_before_deleting_the_old_one.patch > > > Steps to reproduce: > {noformat} > import org.apache.spark.ml.linalg.Vectors > import org.apache.spark.ml.classification.GBTClassifier > case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int) > sc.setCheckpointDir("/checkpoints") > val trainingData = sc.parallelize(1 to 2426874, 256).map(x => > Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF > val classifier = new GBTClassifier() > .setLabelCol("label") > .setFeaturesCol("features") > .setProbabilityCol("probability") > .setMaxIter(100) > .setMaxDepth(10) > .setCheckpointInterval(2) > classifier.fit(trainingData){noformat} > > The last line fails with: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 56.0 failed 10 times, most recent failure: Lost task 0.9 in stage 56.0 > (TID 12058, 127.0.0.1, executor 0): java.io.FileNotFoundException: > /checkpoints/191c9209-0955-440f-8c11-f042bdf7f804/rdd-51 > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:63) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:61) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > com.datastax.bdp.fs.hadoop.DseFileSystem.com$datastax$bdp$fs$hadoop$DseFileSystem$$translateToHadoopExceptions(DseFileSystem.scala:70) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264) > at > com.datastax.bdp.fs.hadoop.DseFsInputStream.input(DseFsInputStream.scala:31) > at > com.datastax.bdp.fs.hadoop.DseFsInputStream.openUnderlyingDataSource(DseFsInputStream.scala:39) > at com.datastax.bdp.fs.hadoop.DseFileSystem.open(DseFileSystem.scala:269) > at > org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:292) > at > org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > at > 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) >
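Judging from the title of the attached patch, the fix direction is to confirm that the newer checkpoint has actually been materialized before removing the previous one. A rough sketch of such a guard (an assumption, not the attached patch itself; the helper name is illustrative):

{code:scala}
import org.apache.hadoop.fs.Path
import org.apache.spark.rdd.RDD

// Sketch: delete the old checkpoint only after verifying the new one exists
// on the checkpoint filesystem, so tasks can still recompute from it otherwise.
def removeOldCheckpoint(oldRdd: RDD[_], newRdd: RDD[_]): Unit = {
  val newConfirmed = newRdd.getCheckpointFile.exists { file =>
    val path = new Path(file)
    path.getFileSystem(newRdd.sparkContext.hadoopConfiguration).exists(path)
  }
  if (newConfirmed) {
    oldRdd.getCheckpointFile.foreach { file =>
      val path = new Path(file)
      path.getFileSystem(oldRdd.sparkContext.hadoopConfiguration).delete(path, true)
    }
  }
}
{code}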
[jira] [Resolved] (SPARK-27925) Better control numBins of curves in BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-27925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-27925. -- Resolution: Not A Problem > Better control numBins of curves in BinaryClassificationMetrics > --- > > Key: SPARK-27925 > URL: https://issues.apache.org/jira/browse/SPARK-27925 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > In the case of large datasets with tens of thousands of partitions, the current > curve down-sampling method tends to generate many more bins than the value set > by numBins. > In the current impl, grouping is done within partitions, that is to say, > each partition contains at least one bin. > A more reasonable way is to bring the grouping op forward into the sort op; > then we can directly set the number of bins to the number of partitions and > regard one partition as one bin. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
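A sketch of the idea in the last paragraph above: shuffle into exactly numBins range partitions during the sort and aggregate one bin per partition; {{scoreAndLabels}} is assumed to be an RDD[(Double, Double)] of (score, label), and this is an illustration rather than the current implementation.

{code:scala}
// Sketch: one bin per sorted-range partition.
val numBins = 1000
val binned = scoreAndLabels
  .sortByKey(ascending = false, numPartitions = numBins)   // range-partitions scores into numBins pieces
  .mapPartitions { iter =>
    var pos = 0L
    var neg = 0L
    var maxScore = Double.NegativeInfinity
    iter.foreach { case (score, label) =>
      if (label > 0.5) pos += 1L else neg += 1L
      if (score > maxScore) maxScore = score
    }
    // one aggregated (threshold, positives, negatives) record per partition/bin
    if (pos + neg > 0) Iterator.single((maxScore, pos, neg)) else Iterator.empty
  }
{code}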
[jira] [Created] (SPARK-28045) add missing RankingEvaluator
zhengruifeng created SPARK-28045: Summary: add missing RankingEvaluator Key: SPARK-28045 URL: https://issues.apache.org/jira/browse/SPARK-28045 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng expose RankingEvaluator -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
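A possible usage shape for the evaluator, assuming it mirrors the existing mllib RankingMetrics (prediction and label columns holding arrays of ids); treat the class and param names as a sketch of the proposal rather than a settled API.

{code:scala}
import org.apache.spark.ml.evaluation.RankingEvaluator

// Sketch: `df` has "prediction" and "label" columns of array<double>
// (recommended ids and relevant ids, respectively).
val evaluator = new RankingEvaluator()
  .setPredictionCol("prediction")
  .setLabelCol("label")
  .setMetricName("meanAveragePrecision")
val meanAP = evaluator.evaluate(df)
{code}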
[jira] [Created] (SPARK-28044) MulticlassClassificationEvaluator support more metrics
zhengruifeng created SPARK-28044: Summary: MulticlassClassificationEvaluator support more metrics Key: SPARK-28044 URL: https://issues.apache.org/jira/browse/SPARK-28044 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng expose more metrics in evaluator: weightedTruePositiveRate weightedFalsePositiveRate weightedFMeasure truePositiveRateByLabel falsePositiveRateByLabel precisionByLabel recallByLabel fMeasureByLabel -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
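A usage sketch for the weighted and per-label metrics listed above, on a {{predictions}} DataFrame with "prediction" and "label" columns; the {{metricLabel}} param used to pick the label of interest is an assumption for illustration, not a confirmed API.

{code:scala}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Sketch: a weighted metric needs only the metric name...
val weightedFpr = new MulticlassClassificationEvaluator()
  .setMetricName("weightedFalsePositiveRate")
  .evaluate(predictions)

// ...while the *ByLabel metrics also need the label to evaluate
// (param name assumed for illustration).
val recallOfClass1 = new MulticlassClassificationEvaluator()
  .setMetricName("recallByLabel")
  .setMetricLabel(1.0)
  .evaluate(predictions)
{code}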
[jira] [Comment Edited] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label
[ https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860922#comment-16860922 ] zhengruifeng edited comment on SPARK-24875 at 6/11/19 10:59 AM: The dataset to evaluate is usually much smaller than the training dataset. If the scored data is too huge to perform a simple op like countByValue, how could you train/evaluate the model? I doubt whether it is worth applying an approximation. was (Author: podongfeng): The dataset to evaluate is usually much smaller than the training dataset. If the scored data is too huge to perform a simple op like countByValue, how could you train the model? I doubt whether it is worth applying an approximation. > MulticlassMetrics should offer a more efficient way to compute count by label > - > > Key: SPARK-24875 > URL: https://issues.apache.org/jira/browse/SPARK-24875 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.1 >Reporter: Antoine Galataud >Priority: Minor > > Currently _MulticlassMetrics_ calls _countByValue_() to get count by > class/label > {code:java} > private lazy val labelCountByClass: Map[Double, Long] = > predictionAndLabels.values.countByValue() > {code} > If input _RDD[(Double, Double)]_ is huge (which can be the case with a large > test dataset), it will lead to poor execution performance. > One option could be to allow using _countByValueApprox_ (could require adding > an extra configuration param for MulticlassMetrics). > Note: since there is no equivalent of _MulticlassMetrics_ in new ML library, > I don't know how this could be ported there. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label
[ https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860922#comment-16860922 ] zhengruifeng commented on SPARK-24875: -- The dataset to evaluate is usually much smaller than the training dataset. If the scored data is too huge to perform a simple op like countByValue, how could you train the model? I doubt whether it is worth applying an approximation. > MulticlassMetrics should offer a more efficient way to compute count by label > - > > Key: SPARK-24875 > URL: https://issues.apache.org/jira/browse/SPARK-24875 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.1 >Reporter: Antoine Galataud >Priority: Minor > > Currently _MulticlassMetrics_ calls _countByValue_() to get count by > class/label > {code:java} > private lazy val labelCountByClass: Map[Double, Long] = > predictionAndLabels.values.countByValue() > {code} > If input _RDD[(Double, Double)]_ is huge (which can be the case with a large > test dataset), it will lead to poor execution performance. > One option could be to allow using _countByValueApprox_ (could require adding > an extra configuration param for MulticlassMetrics). > Note: since there is no equivalent of _MulticlassMetrics_ in new ML library, > I don't know how this could be ported there. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
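For completeness, a sketch of the approximate variant proposed in the ticket; whether the trade-off is worthwhile is exactly what the comment above questions. {{predictionAndLabels}} is the RDD[(Double, Double)] from the quoted snippet, and reading the estimate through {{initialValue}} is an assumption about where the timed-out result is exposed.

{code:scala}
// Sketch: approximate label counts with a 10s time bound and 95% confidence.
val approx = predictionAndLabels.values.countByValueApprox(timeout = 10000L, confidence = 0.95)
// Each BoundedDouble carries a mean plus confidence bounds; take the means.
val approxLabelCounts: Map[Double, Long] =
  approx.initialValue.map { case (label, bounded) => label -> bounded.mean.toLong }.toMap
{code}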
[jira] [Commented] (SPARK-26185) add weightCol in python MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-26185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860919#comment-16860919 ] zhengruifeng commented on SPARK-26185: -- Seems resolved? > add weightCol in python MulticlassClassificationEvaluator > - > > Key: SPARK-26185 > URL: https://issues.apache.org/jira/browse/SPARK-26185 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > https://issues.apache.org/jira/browse/SPARK-24101 added weightCol in > MulticlassClassificationEvaluator.scala. This Jira will add weightCol in > python version of MulticlassClassificationEvaluator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25360) Parallelized RDDs of Ranges could have known partitioner
[ https://issues.apache.org/jira/browse/SPARK-25360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860759#comment-16860759 ] zhengruifeng commented on SPARK-25360: -- But I think it may be worth implementing a direct version of `sc.range`, rather than building it on `parallelize().mapPartitions()`, to simplify the lineage. > Parallelized RDDs of Ranges could have known partitioner > > > Key: SPARK-25360 > URL: https://issues.apache.org/jira/browse/SPARK-25360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Trivial > > We already have the logic to split up the generator; we could expose the same > logic as a partitioner. This would be useful when joining a small > parallelized collection with a larger collection and other cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
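A sketch of what exposing the slicing logic as a partitioner might look like. The partition of a value in {{sc.range(start, end, step, numSlices)}} can be computed arithmetically, so a join keyed by the range values would not need to shuffle that side; the class name and the contiguous-slice rule below are assumptions for illustration, not existing API.

{code:scala}
import org.apache.spark.Partitioner

// Sketch: slice i of a parallelized range of `length` elements covers
// indices [i*length/numSlices, (i+1)*length/numSlices); getPartition is the
// arithmetic inverse of those boundaries.
class RangeSlicePartitioner(start: Long, step: Long, length: Long, numSlices: Int)
  extends Partitioner {

  override def numPartitions: Int = numSlices

  override def getPartition(key: Any): Int = {
    val idx = (key.asInstanceOf[Long] - start) / step   // position within the range
    (((idx + 1) * numSlices - 1) / length).toInt
  }
}
{code}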