[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14172037#comment-14172037 ] Christoph Sawade commented on SPARK-3702:
-
Okay. I will follow it.

Standardize MLlib classes for learners, models
--
Key: SPARK-3702
URL: https://issues.apache.org/jira/browse/SPARK-3702
Project: Spark
Issue Type: Sub-task
Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Blocker

Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. Goals:
* give intuitive structure to API, both for developers and for generated documentation
* support meta-algorithms (e.g., boosting)
* support generic functionality (e.g., evaluation)
* reduce code duplication across classes

[Design doc for class hierarchy | https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153173#comment-14153173 ] Christoph Sawade commented on SPARK-3702:
-
Great initiative. I really appreciate the attempt to standardize and identify common interfaces. Currently, I have three issues:

** Abstraction of Multilabel **
The distinction between classification and regression seems natural, and the abstraction of multi-label also makes sense to me. The simplest multi-label approach I can think of is a collection of binary classifiers. Do you plan to also support mixtures of multi-labels (regression / multinomial classification)? If so, does it make sense to distinguish between ``MultilabelClassifier`` and ``MultilabelRegressor``? Isn't it then just a list of Estimators?

** Model-based vs. memory-based **
I am wondering whether it is worthwhile to distinguish between memory-based predictions (e.g., k-nearest neighbour, kernel machines, ...) and model-based predictions (decision trees, NN, Naive Bayes, GLMs). More generally, how does k-nearest neighbour fit into this framework?

** Model vs. Estimator Abstraction **
Currently, the main distinction is between classification and regression. However, many methods are similar because they share the same modelling approach rather than the same prediction type. I am wondering how these functional similarities can be reflected in the hierarchy. I tried to follow a bottom-up approach and applied these abstractions to different learning methods. Here are two examples:

Decision trees are trained with some recursive algorithm such as ID3 or C4.5, and the prediction is obtained by traversing the tree. The difference between classification and regression plays a rather minor role. So, intuitively, there is a DecisionTree estimator that can be, e.g., ID3 or C4.5.
Then, the DecisionTreeClassifier is a DecisionTree with classification criteria; it returns a DecisionTree.Model (the tree) with a predictClass function (Classifier.Model?). The DecisionTreeRegressor is a DecisionTree with regression criteria; it returns a DecisionTree.Model with a predictScore function (Regressor.Model?). Formally, it looks like
- DecisionTree extends Estimator
- DecisionTreeClassifier extends DecisionTree with Classifier
- DecisionTreeRegressor extends DecisionTree with Regressor
- DecisionTree.Model extends Transformer
- DecisionTreeClassifier.Model extends DecisionTree.Model with Classifier.Model
- DecisionTreeRegressor.Model extends DecisionTree.Model with Regressor.Model

Methods like LogReg, SVM, RidgeRegression, ... maintain a weight vector (one could probably summarize them as GLMs). The inner product with the example vector naturally results in a regression score for each prediction; a binary classification is then derived by thresholding that score. The underlying optimization problem is in all cases a sum over loss functions plus a regularization term (regularized empirical risk minimization) and can be solved by different solvers, e.g., SGD, LBFGS, ...
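The decision-tree part of this hierarchy could be sketched in Scala roughly as follows. All names here are hypothetical illustrations of the comment above, not actual MLlib classes, and the nested `Estimator`/`Model` naming is flattened for brevity:

```scala
// Hypothetical sketch of the DecisionTree hierarchy discussed above;
// none of these names are actual MLlib types.
object DecisionTreeSketch {
  type Features = Vector[Double]

  trait Transformer { def predict(x: Features): Double }
  trait ClassifierModel extends Transformer { def predictClass(x: Features): Double }
  trait RegressorModel extends Transformer { def predictScore(x: Features): Double }

  // The trained tree is shared by classification and regression;
  // only the training criterion and the leaf interpretation differ.
  sealed trait Tree
  final case class Leaf(value: Double) extends Tree
  final case class Split(feature: Int, threshold: Double, left: Tree, right: Tree) extends Tree

  abstract class DecisionTreeModel(tree: Tree) extends Transformer {
    // Traversal logic lives once, here.
    def predict(x: Features): Double = {
      def walk(t: Tree): Double = t match {
        case Leaf(v)              => v
        case Split(f, thr, l, r)  => if (x(f) <= thr) walk(l) else walk(r)
      }
      walk(tree)
    }
  }

  class DecisionTreeClassifierModel(tree: Tree)
      extends DecisionTreeModel(tree) with ClassifierModel {
    // Assumes a leaf stores the positive-class fraction.
    def predictClass(x: Features): Double = if (predict(x) >= 0.5) 1.0 else 0.0
  }

  class DecisionTreeRegressorModel(tree: Tree)
      extends DecisionTreeModel(tree) with RegressorModel {
    def predictScore(x: Features): Double = predict(x)
  }
}
```

Under this sketch, the classifier and regressor variants differ only in how the leaf value is interpreted, which matches the observation that the classification/regression split plays a minor role for trees.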
So to exploit this structure, I would expect something like this:
- RegularizedEmpiricalRiskMinimizer extends Estimator
// LogisticRegression and SupportVectorMachine could be an automatic selection between the binomial and multinomial version
- BinomialLogisticRegression extends RegularizedEmpiricalRiskMinimizer
- MultinomialLogisticRegression extends RegularizedEmpiricalRiskMinimizer
- BinomialSupportVectorMachine extends RegularizedEmpiricalRiskMinimizer
- RidgeRegression extends RegularizedEmpiricalRiskMinimizer
- LinearModel extends Transformer
- BinomialLinearModel extends LinearModel with Classifier.Model
- MultinomialLinearModel extends LinearModel with Classifier.Model
- BinomialLogisticRegression.Model extends BinomialLinearModel with ProbabilisticClassificationModel
- MultinomialLogisticRegression.Model extends MultinomialLinearModel with ProbabilisticClassificationModel
- BinomialSupportVectorMachine.Model extends BinomialLinearModel // actually it is a binomial linear model
- RidgeRegression.Model extends LinearModel // actually it is a linear model

So isn't the Classifier.Model more a trait than an abstract class? Perhaps I just missed something, but I think it is helpful to consider the interfaces for specific instances. I am really interested in discussing the pros/cons.
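The point that Classifier.Model behaves more like a trait (mixin) than an abstract class can be illustrated with a small Scala sketch. The names follow the proposed hierarchy above but are hypothetical, not actual MLlib code:

```scala
// Hypothetical sketch; names follow the proposed hierarchy in the
// comment and are not actual MLlib types.
object LinearModelSketch {
  type Features = Vector[Double]

  trait Transformer { def predict(x: Features): Double }
  // As a trait, the classifier interface can be mixed into any scoring model.
  trait ClassifierModel { def predictClass(x: Features): Double }
  trait ProbabilisticClassificationModel extends ClassifierModel {
    def predictProbability(x: Features): Double
  }

  // Shared linear scoring: the inner product of weights and features.
  class LinearModel(val weights: Features) extends Transformer {
    def predict(x: Features): Double =
      weights.zip(x).map { case (w, v) => w * v }.sum
  }

  // RidgeRegression.Model: "actually it is a linear model".
  class RidgeRegressionModel(weights: Features) extends LinearModel(weights)

  // BinomialLogisticRegression.Model: the same linear model plus the
  // probabilistic-classifier mixin.
  class BinomialLogisticModel(weights: Features)
      extends LinearModel(weights) with ProbabilisticClassificationModel {
    def predictProbability(x: Features): Double =
      1.0 / (1.0 + math.exp(-predict(x)))
    def predictClass(x: Features): Double =
      if (predictProbability(x) >= 0.5) 1.0 else 0.0
  }
}
```

Because ClassifierModel is a mixin here, the ridge and logistic variants share the LinearModel scoring code and differ only in the traits stacked on top, which is exactly the reuse the proposed hierarchy is after.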
[jira] [Comment Edited] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153173#comment-14153173 ] Christoph Sawade edited comment on SPARK-3702 at 9/30/14 1:59 PM:
(content identical to the preceding comment; only the list formatting was changed)
[jira] [Created] (SPARK-3251) Clarify learning interfaces
Christoph Sawade created SPARK-3251:
---
Summary: Clarify learning interfaces
Key: SPARK-3251
URL: https://issues.apache.org/jira/browse/SPARK-3251
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.1.0, 1.1.1
Reporter: Christoph Sawade

** Make threshold mandatory **
Currently, the output of predict for an example is either the score or the class. This side effect is caused by clearThreshold. To clarify that behaviour, three different types of predict (predictScore, predictClass, predictProbability) were introduced; the threshold is no longer optional.

** Clarify classification interfaces **
Currently, some functionality is spread over multiple models. In order to clarify the structure and simplify the implementation of more complex models (like multinomial logistic regression), two new classes are introduced:
- BinaryClassificationModel: for all models that derive a binary classification from a single weight vector. It comprises the thresholding functionality to derive a prediction from a score, and basically captures SVMModel and LogisticRegressionModel.
- ProbabilisticClassificationModel: this trait defines the interface for models that return a calibrated confidence score (aka probability).

** Misc **
- some renaming
- add test for probabilistic output
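The proposed split of predict into three explicit methods, with a mandatory threshold, could look roughly like the following Scala sketch. This is a hypothetical illustration of the issue description, not the actual code from the associated patch:

```scala
// Hypothetical illustration of the proposal: three explicit predict
// variants and a mandatory threshold (no clearThreshold side effect).
// Not actual MLlib code.
object ThresholdSketch {
  type Features = Vector[Double]

  trait BinaryClassificationModel {
    def threshold: Double                        // mandatory, never cleared
    def predictScore(x: Features): Double        // raw margin
    def predictProbability(x: Features): Double  // calibrated score in [0, 1]
    def predictClass(x: Features): Double =      // always a label, never a score
      if (predictProbability(x) >= threshold) 1.0 else 0.0
  }

  class LogisticRegressionModel(weights: Features, val threshold: Double = 0.5)
      extends BinaryClassificationModel {
    def predictScore(x: Features): Double =
      weights.zip(x).map { case (w, v) => w * v }.sum
    def predictProbability(x: Features): Double =
      1.0 / (1.0 + math.exp(-predictScore(x)))
  }
}
```

With this shape, each predict variant has one unambiguous return type, so callers no longer need to know whether a threshold was cleared to interpret the result.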
[jira] [Commented] (SPARK-3251) Clarify learning interfaces
[ https://issues.apache.org/jira/browse/SPARK-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14112231#comment-14112231 ] Christoph Sawade commented on SPARK-3251:
-
https://github.com/apache/spark/pull/2137