[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14172037#comment-14172037 ] Christoph Sawade commented on SPARK-3702:
-
Okay. I will follow it.

Standardize MLlib classes for learners, models
--
Key: SPARK-3702
URL: https://issues.apache.org/jira/browse/SPARK-3702
Project: Spark
Issue Type: Sub-task
Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Blocker

Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. Goals:
* give intuitive structure to API, both for developers and for generated documentation
* support meta-algorithms (e.g., boosting)
* support generic functionality (e.g., evaluation)
* reduce code duplication across classes

[Design doc for class hierarchy | https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153173#comment-14153173 ] Christoph Sawade commented on SPARK-3702:
-
Great initiative. I really appreciate the attempt to standardize and identify common interfaces. Currently, I have three issues:

** Abstraction of Multilabel **
The distinction between classification and regression seems natural, and the abstraction of multi-label also makes sense to me. The simplest multi-label approach I can think of is a collection of binary classifiers. Do you plan to also support mixtures of multi-labels (regression / multinomial classification)? If so, does it make sense to distinguish between ``MultilabelClassifier`` and ``MultilabelRegressor``? Isn't it then just a list of Estimators?

** Model-based vs. memory-based **
I am wondering whether it is worthwhile to distinguish between memory-based predictions (e.g., k-nearest neighbour, kernel machines, ...) and model-based predictions (decision trees, NN, Naive Bayes, GLMs). More generally, how does k-nearest neighbour fit into this framework?

** Model vs. Estimator Abstraction **
Currently, the main distinction is between classification and regression. However, many methods are similar because they share the same modelling approach rather than the same prediction type. I am wondering how these functional similarities can be reflected in the hierarchy. I tried to follow a bottom-up approach and applied these abstractions to different learning methods. Here are two examples:

Decision trees are trained with some recursive algorithm such as ID3 or C4.5, and the prediction is obtained by traversing the tree. The difference between classification and regression plays a rather minor role. So, intuitively, there is a DecisionTree estimator that can be, e.g., ID3 or C4.5.
Then, the DecisionTreeClassifier is a DecisionTree with classification criteria; it returns a DecisionTree.Model (the tree) with a predictClass function (Classifier.Model?). The DecisionTreeRegressor is a DecisionTree with regression criteria; it returns a DecisionTree.Model with a predictScore function (Regressor.Model?). Formally, it looks like
- DecisionTree extends Estimator
- DecisionTreeClassifier extends DecisionTree with Classifier
- DecisionTreeRegressor extends DecisionTree with Regressor
- DecisionTree.Model extends Transformer
- DecisionTreeClassifier.Model extends DecisionTree.Model with Classifier.Model
- DecisionTreeRegressor.Model extends DecisionTree.Model with Regressor.Model

Methods like LogReg, SVM, RidgeRegression, ... maintain a weight vector (one could probably summarize them as GLMs). The inner product with the example vector naturally results in a regression score for each prediction; a binary classification is then derived by thresholding that score. The underlying optimization problem is in all cases a sum over loss functions plus a regularization term (regularized empirical risk minimization) and can be solved by different solvers, e.g., SGD, LBFGS, ...
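The decision-tree part of this hierarchy could be sketched in Scala roughly as follows. All names here are hypothetical illustrations of the comment above, not actual MLlib classes, and the nested `Estimator`/`Model` naming is flattened for brevity:

```scala
// Hypothetical sketch of the DecisionTree hierarchy discussed above;
// none of these names are actual MLlib types.
object DecisionTreeSketch {
  type Features = Vector[Double]

  trait Transformer { def predict(x: Features): Double }
  trait ClassifierModel extends Transformer { def predictClass(x: Features): Double }
  trait RegressorModel extends Transformer { def predictScore(x: Features): Double }

  // The trained tree is shared by classification and regression;
  // only the training criterion and the leaf interpretation differ.
  sealed trait Tree
  final case class Leaf(value: Double) extends Tree
  final case class Split(feature: Int, threshold: Double, left: Tree, right: Tree) extends Tree

  abstract class DecisionTreeModel(tree: Tree) extends Transformer {
    // Traversal logic lives once, here.
    def predict(x: Features): Double = {
      def walk(t: Tree): Double = t match {
        case Leaf(v)              => v
        case Split(f, thr, l, r)  => if (x(f) <= thr) walk(l) else walk(r)
      }
      walk(tree)
    }
  }

  class DecisionTreeClassifierModel(tree: Tree)
      extends DecisionTreeModel(tree) with ClassifierModel {
    // Assumes a leaf stores the positive-class fraction.
    def predictClass(x: Features): Double = if (predict(x) >= 0.5) 1.0 else 0.0
  }

  class DecisionTreeRegressorModel(tree: Tree)
      extends DecisionTreeModel(tree) with RegressorModel {
    def predictScore(x: Features): Double = predict(x)
  }
}
```

Under this sketch, the classifier and regressor variants differ only in how the leaf value is interpreted, which matches the observation that the classification/regression split plays a minor role for trees.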
So to exploit this structure, I would expect something like this:
- RegularizedEmpiricalRiskMinimizer extends Estimator
// LogisticRegression and SupportVectorMachine could be an automatic selection between the binomial and multinomial version
- BinomialLogisticRegression extends RegularizedEmpiricalRiskMinimizer
- MultinomialLogisticRegression extends RegularizedEmpiricalRiskMinimizer
- BinomialSupportVectorMachine extends RegularizedEmpiricalRiskMinimizer
- RidgeRegression extends RegularizedEmpiricalRiskMinimizer
- LinearModel extends Transformer
- BinomialLinearModel extends LinearModel with Classifier.Model
- MultinomialLinearModel extends LinearModel with Classifier.Model
- BinomialLogisticRegression.Model extends BinomialLinearModel with ProbabilisticClassificationModel
- MultinomialLogisticRegression.Model extends MultinomialLinearModel with ProbabilisticClassificationModel
- BinomialSupportVectorMachine.Model extends BinomialLinearModel // actually it is a binomial linear model
- RidgeRegression.Model extends LinearModel // actually it is a linear model

So isn't the Classifier.Model more a trait than an abstract class? Perhaps I just missed something, but I think it is helpful to consider the interfaces for specific instances. I am really interested in discussing the pros/cons.
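The point that Classifier.Model behaves more like a trait (mixin) than an abstract class can be illustrated with a small Scala sketch. The names follow the proposed hierarchy above but are hypothetical, not actual MLlib code:

```scala
// Hypothetical sketch; names follow the proposed hierarchy in the
// comment and are not actual MLlib types.
object LinearModelSketch {
  type Features = Vector[Double]

  trait Transformer { def predict(x: Features): Double }
  // As a trait, the classifier interface can be mixed into any scoring model.
  trait ClassifierModel { def predictClass(x: Features): Double }
  trait ProbabilisticClassificationModel extends ClassifierModel {
    def predictProbability(x: Features): Double
  }

  // Shared linear scoring: the inner product of weights and features.
  class LinearModel(val weights: Features) extends Transformer {
    def predict(x: Features): Double =
      weights.zip(x).map { case (w, v) => w * v }.sum
  }

  // RidgeRegression.Model: "actually it is a linear model".
  class RidgeRegressionModel(weights: Features) extends LinearModel(weights)

  // BinomialLogisticRegression.Model: the same linear model plus the
  // probabilistic-classifier mixin.
  class BinomialLogisticModel(weights: Features)
      extends LinearModel(weights) with ProbabilisticClassificationModel {
    def predictProbability(x: Features): Double =
      1.0 / (1.0 + math.exp(-predict(x)))
    def predictClass(x: Features): Double =
      if (predictProbability(x) >= 0.5) 1.0 else 0.0
  }
}
```

Because ClassifierModel is a mixin here, the ridge and logistic variants share the LinearModel scoring code and differ only in the traits stacked on top, which is exactly the reuse the proposed hierarchy is after.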
[jira] [Comment Edited] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153173#comment-14153173 ] Christoph Sawade edited comment on SPARK-3702 at 9/30/14 1:59 PM:
(content identical to the preceding comment; only the list formatting was changed)
[jira] [Created] (SPARK-3251) Clarify learning interfaces
Christoph Sawade created SPARK-3251:
---
Summary: Clarify learning interfaces
Key: SPARK-3251
URL: https://issues.apache.org/jira/browse/SPARK-3251
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.1.0, 1.1.1
Reporter: Christoph Sawade

** Make threshold mandatory **
Currently, the output of predict for an example is either the score or the class. This side effect is caused by clearThreshold. To clarify that behaviour, three different types of predict (predictScore, predictClass, predictProbability) were introduced; the threshold is no longer optional.

** Clarify classification interfaces **
Currently, some functionality is spread over multiple models. In order to clarify the structure and simplify the implementation of more complex models (like multinomial logistic regression), two new classes are introduced:
- BinaryClassificationModel: for all models that derive a binary classification from a single weight vector. It comprises the thresholding functionality to derive a prediction from a score, and basically captures SVMModel and LogisticRegressionModel.
- ProbabilisticClassificationModel: this trait defines the interface for models that return a calibrated confidence score (aka probability).

** Misc **
- some renaming
- add test for probabilistic output
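The proposed split of predict into three explicit methods, with a mandatory threshold, could look roughly like the following Scala sketch. This is a hypothetical illustration of the issue description, not the actual code from the associated patch:

```scala
// Hypothetical illustration of the proposal: three explicit predict
// variants and a mandatory threshold (no clearThreshold side effect).
// Not actual MLlib code.
object ThresholdSketch {
  type Features = Vector[Double]

  trait BinaryClassificationModel {
    def threshold: Double                        // mandatory, never cleared
    def predictScore(x: Features): Double        // raw margin
    def predictProbability(x: Features): Double  // calibrated score in [0, 1]
    def predictClass(x: Features): Double =      // always a label, never a score
      if (predictProbability(x) >= threshold) 1.0 else 0.0
  }

  class LogisticRegressionModel(weights: Features, val threshold: Double = 0.5)
      extends BinaryClassificationModel {
    def predictScore(x: Features): Double =
      weights.zip(x).map { case (w, v) => w * v }.sum
    def predictProbability(x: Features): Double =
      1.0 / (1.0 + math.exp(-predictScore(x)))
  }
}
```

With this shape, each predict variant has one unambiguous return type, so callers no longer need to know whether a threshold was cleared to interpret the result.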
[jira] [Commented] (SPARK-3251) Clarify learning interfaces
[ https://issues.apache.org/jira/browse/SPARK-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14112231#comment-14112231 ] Christoph Sawade commented on SPARK-3251:
-
https://github.com/apache/spark/pull/2137