[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models

2014-10-14 Thread Christoph Sawade (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172037#comment-14172037 ]

Christoph Sawade commented on SPARK-3702:
-

Okay. I will follow it.

 Standardize MLlib classes for learners, models
 --

 Key: SPARK-3702
 URL: https://issues.apache.org/jira/browse/SPARK-3702
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Blocker

 Summary: Create a class hierarchy for learning algorithms and the models 
 those algorithms produce.
 Goals:
 * give intuitive structure to API, both for developers and for generated 
 documentation
 * support meta-algorithms (e.g., boosting)
 * support generic functionality (e.g., evaluation)
 * reduce code duplication across classes
 [Design doc for class hierarchy | https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models

2014-09-30 Thread Christoph Sawade (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153173#comment-14153173 ]

Christoph Sawade commented on SPARK-3702:
-

Great initiative. I really appreciate the attempt to standardize and identify 
common interfaces. Currently, I have three issues:

** Abstraction of Multilabel **
The distinction between classification and regression seems natural, and the 
abstraction of multi-label also makes sense to me. The simplest multi-label 
approach I can think of is a collection of binary classifiers. Do you also 
plan to support mixtures of multi-labels (regression / multinomial 
classification)? If so, does it make sense to distinguish between 
``MultilabelClassifier`` and ``MultilabelRegressor``? Isn't it then just a 
list of Estimators?

** Model-based vs. memory-based **
I am wondering whether it is worth distinguishing between memory-based (e.g., 
k-nearest neighbour, kernel machines, ...) and model-based predictions 
(decision trees, NN, Naive Bayes, GLMs). Or, more generally, how does 
k-nearest neighbour fit into that framework?
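As a thought experiment, a memory-based method does fit an Estimator/Model split if the fitted model simply closes over the training set. A minimal Scala sketch; every name here (LabeledPoint, Estimator, Model, KNNClassifier) is hypothetical, not actual MLlib API:

```scala
// Hypothetical sketch: a memory-based learner in an Estimator/Model pattern.
case class LabeledPoint(label: Double, features: Array[Double])

trait Model { def predict(features: Array[Double]): Double }
trait Estimator { def fit(data: Seq[LabeledPoint]): Model }

// A k-NN "model" is just the training set plus k: fitting stores the data,
// and prediction does the actual work.
class KNNClassifier(k: Int) extends Estimator {
  def fit(data: Seq[LabeledPoint]): Model = new Model {
    def predict(features: Array[Double]): Double = {
      def dist(a: Array[Double], b: Array[Double]): Double =
        math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
      // take the k nearest points and return the majority label
      val nearest = data.sortBy(p => dist(p.features, features)).take(k)
      nearest.groupBy(_.label).maxBy(_._2.size)._1
    }
  }
}
```

Under that reading, k-NN is an Estimator whose fit is trivial and whose Model carries the data, so memory-based methods would not need a separate branch in the hierarchy.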

** Model vs. Estimator Abstraction **
Currently, the main distinction is between classification and regression. 
However, many methods are similar because they share the same modelling 
approach rather than the same prediction type. I am wondering how these 
functional similarities can be reflected in the hierarchy. I tried to follow 
a bottom-up approach and applied these abstractions to different learning 
methods. Here are two examples:

Decision trees are trained with some recursive algorithm such as ID3 or C4.5, 
and the prediction is obtained by traversing the tree. The difference between 
classification and regression plays a rather minor role. So, intuitively, 
there is a DecisionTree estimator that can be, e.g., ID3 or C4.5. The 
DecisionTreeClassifier is then a DecisionTree with classification criteria; 
it returns a DecisionTree.Model (the tree) with a predictClass function 
(Classifier.Model?). The DecisionTreeRegressor is a DecisionTree with 
regression criteria; it returns a DecisionTree.Model with a predictScore 
function (Regressor.Model?). Formally, it looks like

  DecisionTree extends Estimator
  DecisionTreeClassifier extends DecisionTree with Classifier
  DecisionTreeRegressor extends DecisionTree with Regressor

  DecisionTree.Model extends Transformer
  DecisionTreeClassifier.Model extends DecisionTree.Model with Classifier.Model
  DecisionTreeRegressor.Model extends DecisionTree.Model with Regressor.Model
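The extends/with relations above can be made concrete. Here is a compilable Scala sketch of that skeleton, with the Model traits nested in companion objects so that names like DecisionTree.Model resolve; every identifier is illustrative, not actual MLlib code:

```scala
// Hypothetical skeleton of the hierarchy sketched above.
trait Transformer
trait Estimator

trait Classifier extends Estimator
object Classifier {
  trait Model extends Transformer { def predictClass(features: Array[Double]): Double }
}

trait Regressor extends Estimator
object Regressor {
  trait Model extends Transformer { def predictScore(features: Array[Double]): Double }
}

trait DecisionTree extends Estimator
object DecisionTree { trait Model extends Transformer }

// Estimators combine the shared modelling with the prediction type by mixin.
class DecisionTreeClassifier extends DecisionTree with Classifier
class DecisionTreeRegressor extends DecisionTree with Regressor

// One concrete leaf to show the mixins compose: a stub tree with one split.
class StubTreeClassifierModel extends DecisionTree.Model with Classifier.Model {
  def predictClass(features: Array[Double]): Double =
    if (features.head > 0.5) 1.0 else 0.0 // hypothetical single-split tree
}
```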

Methods like logistic regression, SVMs, ridge regression, ... maintain a 
weight vector (one could probably summarize them as GLMs). The inner product 
with the example vector naturally yields a regression score for each 
prediction; a binary classification is then derived by thresholding that 
score. The underlying optimization problem for all of them consists of a sum 
over loss functions plus a regularization term (regularized empirical risk 
minimization), which can be solved by different solvers, e.g., SGD, L-BFGS, 
... So, to exploit this structure, I would expect something like this:

  RegularizedEmpiricalRiskMinimizer extends Estimator
  // LogisticRegression and SupportVectorMachine could be an automatic selection between the binomial and multinomial version
  BinomialLogisticRegression extends RegularizedEmpiricalRiskMinimizer
  MultinomialLogisticRegression extends RegularizedEmpiricalRiskMinimizer
  BinomialSupportVectorMachine extends RegularizedEmpiricalRiskMinimizer
  RidgeRegression extends RegularizedEmpiricalRiskMinimizer

  LinearModel extends Transformer
  BinomialLinearModel extends LinearModel with Classifier.Model
  MultinomialLinearModel extends LinearModel with Classifier.Model
  BinomialLogisticRegression.Model extends BinomialLinearModel with ProbabilisticClassificationModel
  MultinomialLogisticRegression.Model extends MultinomialLinearModel with ProbabilisticClassificationModel
  BinomialSupportVectorMachine.Model extends BinomialLinearModel // actually it is a binomial linear model
  RidgeRegression.Model extends LinearModel // actually it is a linear model

So isn't Classifier.Model more a trait than an abstract class? Perhaps I just 
missed something, but I think it is helpful to consider the interfaces for 
specific instances. I am really interested in discussing the pros/cons.
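To illustrate the trait-vs-abstract-class point: if the classifier contract is a trait, a linear model can acquire classification behaviour by mixin, while RidgeRegression.Model stays a plain LinearModel. A hypothetical Scala sketch (ClassifierModel stands in for Classifier.Model; all names are illustrative, not actual MLlib API):

```scala
// Why a trait matters: LinearModel is already a class, so the classifier
// contract can only be mixed in if it is a trait, not an abstract class.
trait Transformer
trait ClassifierModel extends Transformer {
  def predictClass(features: Array[Double]): Double
}

class LinearModel(val weights: Array[Double]) extends Transformer {
  // the shared functionality: the inner product with the example vector
  def score(features: Array[Double]): Double =
    weights.zip(features).map { case (w, v) => w * v }.sum
}

// Binomial classification = linear score + threshold, added via mixin.
class BinomialLinearModel(weights: Array[Double], threshold: Double)
    extends LinearModel(weights) with ClassifierModel {
  def predictClass(features: Array[Double]): Double =
    if (score(features) > threshold) 1.0 else 0.0
}
```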

[jira] [Created] (SPARK-3251) Clarify learning interfaces

2014-08-27 Thread Christoph Sawade (JIRA)
Christoph Sawade created SPARK-3251:
---

 Summary:  Clarify learning interfaces
 Key: SPARK-3251
 URL: https://issues.apache.org/jira/browse/SPARK-3251
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.1.0, 1.1.1
Reporter: Christoph Sawade


** Make threshold mandatory
Currently, the output of predict for an example is either the score
or the class. This side effect is caused by clearThreshold. To
clarify that behaviour, three different types of predict (predictScore,
predictClass, predictProbability) were introduced; the threshold is no
longer optional.

** Clarify classification interfaces
Currently, some functionality is spread over multiple models.
In order to clarify the structure and simplify the implementation of
more complex models (like multinomial logistic regression), two new
classes are introduced:
- BinaryClassificationModel: for all models that derive a binary
classification from a single weight vector. It comprises the thresholding
functionality to derive a prediction from a score, and it basically captures
SVMModel and LogisticRegressionModel.
- ProbabilisticClassificationModel: this trait defines the interface for
models that return a calibrated confidence score (aka probability).

** Misc
- some renaming
- add a test for probabilistic output
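A minimal Scala sketch of what the proposed interface might look like, assuming the three explicit predict variants and a constructor-mandated threshold; the class shape and the logistic link for predictProbability are illustrative assumptions, not the actual patch:

```scala
// Hypothetical sketch: three explicit predict methods instead of one
// predict whose meaning silently changes after clearThreshold.
class BinaryClassificationModel(weights: Array[Double], threshold: Double) {
  // raw margin: inner product of weights and features
  def predictScore(features: Array[Double]): Double =
    weights.zip(features).map { case (w, v) => w * v }.sum

  // threshold is a required constructor argument, so the class prediction
  // is always well defined
  def predictClass(features: Array[Double]): Double =
    if (predictScore(features) > threshold) 1.0 else 0.0

  // calibrated confidence via the logistic link, as a logistic regression
  // model would provide it
  def predictProbability(features: Array[Double]): Double =
    1.0 / (1.0 + math.exp(-predictScore(features)))
}
```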



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3251) Clarify learning interfaces

2014-08-27 Thread Christoph Sawade (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112231#comment-14112231 ]

Christoph Sawade commented on SPARK-3251:
-

https://github.com/apache/spark/pull/2137



