[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2018-07-02 Thread Seth Hendrickson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530633#comment-16530633
 ] 

Seth Hendrickson commented on SPARK-24579:
--

Hmm... Am I the only one who cannot see comments on the doc?

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> 
>
> Key: SPARK-24579
> URL: https://issues.apache.org/jira/browse/SPARK-24579
> Project: Spark
>  Issue Type: Epic
>  Components: ML, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen
> Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange 
> between Apache Spark and DL%2FAI Frameworks .pdf
>
>
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark 
> as a unified analytics engine and the rise of AI frameworks like TensorFlow 
> and Apache MXNet (incubating).
> Both big data and AI are indispensable components for driving business 
> innovation, and there have been multiple attempts from both communities to 
> bring them together.
> We have seen efforts from the AI community to implement data solutions for AI 
> frameworks, like tf.data and tf.Transform. However, with 50+ data sources and 
> built-in SQL, DataFrames, and Streaming features, Spark remains the community 
> choice for big data. This is why we have seen many efforts to integrate DL/AI 
> frameworks with Spark to leverage its power, for example, the TFRecords data 
> source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project 
> Hydrogen, this SPIP takes a different angle on Spark + AI unification.
> None of the integrations are possible without exchanging data between Spark 
> and external DL/AI frameworks, and performance matters. However, there is no 
> standard way to exchange data, so implementations and performance 
> optimizations are fragmented. For example, TensorFlowOnSpark uses Hadoop 
> InputFormat/OutputFormat for TensorFlow’s TFRecords to load and save data 
> and passes the RDD records to TensorFlow in Python. And TensorFrames 
> converts Spark DataFrame Rows to/from TensorFlow Tensors using TensorFlow’s 
> Java API. How can we reduce the complexity?
> The proposal here is to standardize the data exchange interface (or format) 
> between Spark and DL/AI frameworks and optimize data conversion from/to this 
> interface. DL/AI frameworks can then leverage Spark to load data from 
> virtually anywhere without spending extra effort building complex data 
> solutions, like reading features from a production data warehouse or 
> streaming model inference. Spark users can use DL/AI frameworks without 
> learning the specific data APIs implemented there. And developers from both 
> sides can work on performance optimizations independently, given that the 
> interface itself doesn’t introduce significant overhead.






[jira] [Commented] (SPARK-23704) PySpark access of individual trees in random forest is slow

2018-06-22 Thread Seth Hendrickson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520832#comment-16520832
 ] 

Seth Hendrickson commented on SPARK-23704:
--

Instead of
{code:python}
model.trees[0].transform(test_feat).select('rowNum', 'probability')
{code}
Can you try
{code:python}
trees = model.trees
trees[0].transform(test_feat).select('rowNum', 'probability')
{code}
And time only the second line? The first line actually calls into the JVM and 
creates new trees in Python.

> PySpark access of individual trees in random forest is slow
> ---
>
> Key: SPARK-23704
> URL: https://issues.apache.org/jira/browse/SPARK-23704
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.1
> Environment: PySpark 2.2.1 / Windows 10
>Reporter: Julian King
>Priority: Minor
>
> Making predictions from a RandomForestClassifier in PySpark is much faster 
> than making predictions from an individual tree contained within the .trees 
> attribute. 
> In fact, the model.transform call without an action is more than 10x slower 
> for an individual tree vs. the model.transform call for the random forest 
> model.
> See 
> [https://stackoverflow.com/questions/49297470/slow-individual-tree-access-for-random-forest-in-pyspark]
>  for an example with timing.
> Ideally:
>  * Getting a prediction from a single tree should be comparable to or faster 
> than getting predictions from the whole forest
>  * Getting all the predictions from all the individual trees should be 
> comparable in speed to getting the predictions from the random forest
>  






[jira] [Resolved] (SPARK-3159) Check for reducible DecisionTree

2018-03-02 Thread Seth Hendrickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seth Hendrickson resolved SPARK-3159.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Check for reducible DecisionTree
> 
>
> Key: SPARK-3159
> URL: https://issues.apache.org/jira/browse/SPARK-3159
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
> Fix For: 2.4.0
>
>
> Improvement: test-time computation
> Currently, pairs of leaf nodes with the same parent can both output the same 
> prediction.  This happens since the splitting criterion (e.g., Gini) is not 
> the same as prediction accuracy/MSE; the splitting criterion can sometimes be 
> improved even when both children would still output the same prediction 
> (e.g., based on the majority label for classification).
> We could check the tree and reduce it if possible after training.
> Note: This happens with scikit-learn as well.
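
For illustration, a self-contained sketch of the reduction check on a toy tree 
ADT (not Spark's internal Node classes, which have private constructors): 
collapse any split whose children are both leaves with the same prediction, 
bottom-up.
{code}
sealed trait Node
case class Leaf(prediction: Double) extends Node
case class Internal(left: Node, right: Node) extends Node

def reduce(node: Node): Node = node match {
  case Internal(l, r) =>
    (reduce(l), reduce(r)) match {
      // Both children predict the same value: the split is redundant.
      case (Leaf(p1), Leaf(p2)) if p1 == p2 => Leaf(p1)
      case (l2, r2) => Internal(l2, r2)
    }
  case leaf => leaf
}
{code}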






[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-02-16 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368109#comment-16368109
 ] 

Seth Hendrickson commented on SPARK-23437:
--

TBH, this seems like a pretty reasonable request. While I agree we do seem to 
tell people that the "standard" practice is to implement as a third-party 
package and then integrate later, I don't see this happening in practice. I 
don't know that we've ever validated that the "implement as a third-party 
package, then merge into Spark later on" approach really works. Perhaps an even 
stronger reason for resisting new algorithms is just lack of reviewer/developer 
support on Spark ML. It's hard to predict if there will be anyone to review the 
PR within a reasonable amount of time, even if the code is well-designed. 
AFAIK, we haven't added any major algos since GeneralizedLinearRegression, 
which must have been a couple of years ago. 

That said, I think this is something to at least consider. We can start by 
discussing what algorithms exist, and why we'd choose a particular one. Strong 
arguments for why we need GPs in Spark ML are also beneficial. The fact that 
there isn't a non-parametric regression algo in Spark carries some weight, but 
we don't write new algorithms just for the sake of filling in gaps - there 
needs to be user demand (which, unfortunately, is often hard to prove). It also 
helps to point to a package that already implements the algo you're proposing, 
but, for example, I don't believe scikit-learn implements the linear-time 
version, so we can't really leverage their experience. Providing more 
information on any/all of these categories will help make a stronger case, and 
I do think GPs can be a useful addition. Thanks for leading the discussion!

> [ML] Distributed Gaussian Process Regression for MLlib
> --
>
> Key: SPARK-23437
> URL: https://issues.apache.org/jira/browse/SPARK-23437
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.1
>Reporter: Valeriy Avanesov
>Priority: Major
>
> Gaussian Process Regression (GP) is a well-known black-box non-linear 
> regression approach [1]. For years the approach remained inapplicable to 
> large samples due to its cubic computational complexity; however, more 
> recent techniques (Sparse GP) have reduced this to linear complexity. The 
> field continues to attract the interest of researchers – several papers 
> devoted to GP were presented at NIPS 2017. 
> Unfortunately, the non-parametric regression techniques shipped with MLlib 
> are restricted to tree-based approaches.
> I propose to create and include an implementation (which I am going to work 
> on) of the so-called robust Bayesian Committee Machine proposed and 
> investigated in [2].
> [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian 
> Processes for Machine Learning (Adaptive Computation and Machine Learning)_. 
> The MIT Press.
> [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian 
> processes. In _Proceedings of the 32nd International Conference on 
> International Conference on Machine Learning - Volume 37_ (ICML'15), Francis 
> Bach and David Blei (Eds.), Vol. 37. JMLR.org 1481-1490.
>  
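
For reference, the combination rule of the robust BCM from [2], as I read it: 
each expert k returns a Gaussian prediction with mean \mu_k(x_*) and variance 
\sigma_k^2(x_*), and \sigma_{**}^2 denotes the prior variance:
{code}
\sigma_{rbcm}^{-2}(x_*) = \sum_{k=1}^{M} \beta_k \, \sigma_k^{-2}(x_*)
                        + \Big(1 - \sum_{k=1}^{M} \beta_k\Big) \, \sigma_{**}^{-2}

\mu_{rbcm}(x_*) = \sigma_{rbcm}^{2}(x_*) \sum_{k=1}^{M} \beta_k \, \sigma_k^{-2}(x_*) \, \mu_k(x_*)

\beta_k = \tfrac{1}{2} \big( \log \sigma_{**}^{2} - \log \sigma_k^{2}(x_*) \big)
{code}
The experts can be trained and queried independently, which is what makes the 
method amenable to a distributed implementation.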






[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2018-01-26 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341213#comment-16341213
 ] 

Seth Hendrickson commented on SPARK-17139:
--

Good catch. Apart from redesigning this patch, I'm not sure I see a way to 
avoid it either.

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.3.0
>
>
> Add model summary to multinomial logistic regression using the same interface 
> as in other ML models.






[jira] [Commented] (SPARK-23138) Add user guide example for multiclass logistic regression summary

2018-01-17 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329783#comment-16329783
 ] 

Seth Hendrickson commented on SPARK-23138:
--

I can submit a PR for this soon.

> Add user guide example for multiclass logistic regression summary
> -
>
> Key: SPARK-23138
> URL: https://issues.apache.org/jira/browse/SPARK-23138
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> We haven't updated the user guide to reflect the multiclass logistic 
> regression summary added in SPARK-17139.






[jira] [Created] (SPARK-23138) Add user guide example for multiclass logistic regression summary

2018-01-17 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-23138:


 Summary: Add user guide example for multiclass logistic regression 
summary
 Key: SPARK-23138
 URL: https://issues.apache.org/jira/browse/SPARK-23138
 Project: Spark
  Issue Type: Documentation
  Components: ML
Affects Versions: 2.3.0
Reporter: Seth Hendrickson


We haven't updated the user guide to reflect the multiclass logistic regression 
summary added in SPARK-17139.
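
For reference, a rough sketch of the kind of snippet the guide needs (assuming 
a training DataFrame named {{training}} with multiclass labels; the summary 
API below is the one added in SPARK-17139):
{code}
import org.apache.spark.ml.classification.LogisticRegression

val model = new LogisticRegression().fit(training)
// As of SPARK-17139, summary also works for multiclass models.
val summary = model.summary
println(s"Accuracy: ${summary.accuracy}")
summary.falsePositiveRateByLabel.zipWithIndex.foreach { case (rate, label) =>
  println(s"label $label: false positive rate $rate")
}
{code}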






[jira] [Created] (SPARK-22993) checkpointInterval param doc should be clearer

2018-01-08 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-22993:


 Summary: checkpointInterval param doc should be clearer
 Key: SPARK-22993
 URL: https://issues.apache.org/jira/browse/SPARK-22993
 Project: Spark
  Issue Type: Documentation
  Components: ML
Affects Versions: 2.3.0
Reporter: Seth Hendrickson
Priority: Trivial


Several algorithms use the shared parameter {{HasCheckpointInterval}} (ALS, 
LDA, GBT), and each of them silently ignores the parameter when the checkpoint 
directory is not set on the SparkContext. This should be documented in the 
param doc.
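
For illustration, a minimal sketch of the behavior the doc should call out 
(ALS shown; assuming a SparkSession named {{spark}}, and the checkpoint 
directory path is hypothetical):
{code}
import org.apache.spark.ml.recommendation.ALS

// checkpointInterval is silently ignored unless a checkpoint directory
// has been set on the SparkContext first:
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

val als = new ALS()
  .setMaxIter(20)
  .setCheckpointInterval(5) // now actually checkpoints every 5 iterations
{code}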






[jira] [Created] (SPARK-22461) Move Spark ML model summaries into a dedicated package

2017-11-06 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-22461:


 Summary: Move Spark ML model summaries into a dedicated package
 Key: SPARK-22461
 URL: https://issues.apache.org/jira/browse/SPARK-22461
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Seth Hendrickson
Priority: Minor


Summaries in ML right now do not adhere to a common abstraction, and are 
usually placed in the same file as the algorithm, which makes these files 
unwieldy. We can and should unify them under one hierarchy, perhaps in a new 
{{summary}} module.
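
For discussion, a hypothetical sketch of what a shared abstraction could look 
like (names are illustrative, not a settled design); each algorithm's summary 
would then only add its model-specific metrics:
{code}
import org.apache.spark.sql.DataFrame

trait TrainingSummary {
  def predictions: DataFrame
  def objectiveHistory: Array[Double]
  def totalIterations: Int = objectiveHistory.length
}

trait ClassificationSummary extends TrainingSummary {
  def labelCol: String
  def predictionCol: String
  def accuracy: Double
}
{code}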






[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related

2017-11-03 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16238137#comment-16238137
 ] 

Seth Hendrickson commented on SPARK-22433:
--

The main problem I see is that we put "r2" in the `RegressionEvaluator` class, 
which can be used for all types of regression - e.g. DecisionTreeRegressor, 
which is nonsensical. Removing it would break compatibility and is probably 
not worth it since the end user is responsible for using the tools 
appropriately anyway. I'm not sure there is much to do here.

AFAIK using r2 on regularized models is a fuzzy area, but I don't think it's 
doing much harm to leave it and I don't think we'd be concerned about our test 
cases. Certainly unit tests don't imply an endorsement of the methodology 
anyway.

> Linear regression R^2 train/test terminology related 
> -
>
> Key: SPARK-22433
> URL: https://issues.apache.org/jira/browse/SPARK-22433
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Teng Peng
>Priority: Minor
>
> Traditional statistics is traditional statistics. Its goals, framework, and 
> terminology are not the same as ML's. However, in linear regression related 
> components, this distinction is not clear, which is reflected in:
> 1. regressionMetric + regressionEvaluator: 
> * R2 shouldn't be there. 
> * A better name would be "regressionPredictionMetric".
> 2. LinearRegressionSuite: 
> * Shouldn't test R2 and residuals on test data. 
> * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization". 
> Linear regression is linear; once a penalty term is added, it is no longer 
> linear. Just call it "LASSO" or "ElasticNet".
> There are more. I am working on correcting them.
> They are not breaking anything, but it does not make one feel good to see 
> this basic distinction blurred.






[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms

2017-09-25 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16179311#comment-16179311
 ] 

Seth Hendrickson commented on SPARK-17136:
--

Ping [~yanboliang]. This is relevant since we have recently been making 
attempts to add features/optimizers/algorithms around linear/logistic 
regression. This would be a good step toward building interfaces that can be 
extended in Spark ML. Could you elaborate on mimicking Spark SQL?

One concern I have is that, under the current proposal, we'd have a parameter 
`setMinimizer` that takes a generic Scala class that can't be easily 
serialized to Python, etc... It wouldn't be compatible. Maybe we could use 
reflection like Spark SQL does, but you'd still have to implement custom 
optimizers in Scala. 

Anyway, I think this, and work related to this, would be really beneficial to 
Spark ML.
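
To make the serialization concern concrete, a hypothetical sketch of the kind 
of param value under discussion (names illustrative, not from the design doc); 
a JVM-only trait like this has no natural Python counterpart:
{code}
import breeze.linalg.{DenseVector => BDV}
import breeze.optimize.DiffFunction

trait Minimizer extends Serializable {
  def minimize(loss: DiffFunction[BDV[Double]], init: BDV[Double]): BDV[Double]
}

// A built-in implementation could simply wrap breeze's LBFGS.
class LBFGSMinimizer(maxIter: Int, tol: Double) extends Minimizer {
  def minimize(loss: DiffFunction[BDV[Double]], init: BDV[Double]): BDV[Double] =
    new breeze.optimize.LBFGS[BDV[Double]](maxIter, 10, tol).minimize(loss, init)
}
{code}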

> Design optimizer interface for ML algorithms
> 
>
> Key: SPARK-17136
> URL: https://issues.apache.org/jira/browse/SPARK-17136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> We should consider designing an interface that allows users to use their own 
> optimizers in some of the ML algorithms, similar to MLlib. 






[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-09-12 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163106#comment-16163106
 ] 

Seth Hendrickson commented on SPARK-19634:
--

Is there a plan for moving the linear algorithms that use the summarizer to 
this new implementation? 
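
For context, a minimal sketch of calling the new DataFrame-based summarizer 
({{org.apache.spark.ml.stat.Summarizer}} in 2.3), assuming a DataFrame {{df}} 
with "features" and "weight" columns, as a point of comparison with the 
RDD-based {{MultivariateOnlineSummarizer}} the linear algorithms still use:
{code}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col

val summarized = df.select(
  Summarizer.metrics("mean", "variance", "count")
    .summary(col("features"), col("weight"))
    .as("summary"))
{code}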

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
> Fix For: 2.3.0
>
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#






[jira] [Updated] (SPARK-21245) Resolve code duplication for classification/regression summarizers

2017-07-26 Thread Seth Hendrickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seth Hendrickson updated SPARK-21245:
-
  Labels: starter  (was: )
Priority: Minor  (was: Major)

> Resolve code duplication for classification/regression summarizers
> --
>
> Key: SPARK-21245
> URL: https://issues.apache.org/jira/browse/SPARK-21245
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Seth Hendrickson
>Priority: Minor
>  Labels: starter
>
> In several places (LogReg, LinReg, SVC) in Spark ML, we collect summary 
> information about training data using {{MultivariateOnlineSummarizer}} and 
> {{MulticlassSummarizer}}. We have the same code appearing in several places 
> (and including test suites). We can eliminate this by creating a common 
> implementation somewhere.






[jira] [Commented] (SPARK-21405) Add LBFGS solver for GeneralizedLinearRegression

2017-07-14 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16087418#comment-16087418
 ] 

Seth Hendrickson commented on SPARK-21405:
--

Good point, Nick. Though conveniently the machinery to deal with this is 
already in place: https://github.com/apache/spark/pull/15930

> Add LBFGS solver for GeneralizedLinearRegression
> 
>
> Key: SPARK-21405
> URL: https://issues.apache.org/jira/browse/SPARK-21405
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>
> GeneralizedLinearRegression in Spark ML currently only allows 4096 features 
> because it uses IRLS, and hence WLS, as an optimizer which relies on 
> collecting the covariance matrix on the driver. GLMs can also be fit by 
> simple gradient-based methods like LBFGS.
> The new API from 
> [SPARK-19762|https://issues.apache.org/jira/browse/SPARK-19762] makes this 
> easy to add. I've already prototyped it, and it works pretty well. This 
> change would allow an arbitrary number of features (up to what can fit on a 
> single node) as in Linear/Logistic regression.
> For reference, other GLM packages also support this - e.g. statsmodels, H2O.






[jira] [Created] (SPARK-21406) Add logLikelihood to GLR families

2017-07-13 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-21406:


 Summary: Add logLikelihood to GLR families
 Key: SPARK-21406
 URL: https://issues.apache.org/jira/browse/SPARK-21406
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.3.0
Reporter: Seth Hendrickson
Priority: Minor


To be able to implement the typical gradient-based aggregator for GLR, we'd 
need to add a {{logLikelihood(y: Double, mu: Double, weight: Double)}} method 
to the GLR {{Family}} class. 

One possible hiccup: the Tweedie family log likelihood is not computationally 
feasible 
[link|http://support.sas.com/documentation/cdl/en/stathpug/67524/HTML/default/viewer.htm#stathpug_hpgenselect_details16.htm].
 H2O gets around this by using the deviance instead. We could leave it 
unimplemented initially.
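
For illustration, a sketch of what this could look like for the Poisson 
family, following the proposed signature (weighted log likelihood of one 
observation y with predicted mean mu; commons-math3 is already a Spark 
dependency):
{code}
import org.apache.commons.math3.special.Gamma

def logLikelihood(y: Double, mu: Double, weight: Double): Double = {
  // log Poisson(y | mu) = y*log(mu) - mu - log(y!)
  weight * (y * math.log(mu) - mu - Gamma.logGamma(y + 1.0))
}
{code}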






[jira] [Commented] (SPARK-21405) Add LBFGS solver for GeneralizedLinearRegression

2017-07-13 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086071#comment-16086071
 ] 

Seth Hendrickson commented on SPARK-21405:
--

cc [~yanboliang] [~actuaryzhang]

I'm happy to work on it, but wanted to get your opinions here. Thoughts?

> Add LBFGS solver for GeneralizedLinearRegression
> 
>
> Key: SPARK-21405
> URL: https://issues.apache.org/jira/browse/SPARK-21405
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>
> GeneralizedLinearRegression in Spark ML currently only allows 4096 features 
> because it uses IRLS, and hence WLS, as an optimizer which relies on 
> collecting the covariance matrix on the driver. GLMs can also be fit by 
> simple gradient-based methods like LBFGS.
> The new API from 
> [SPARK-19762|https://issues.apache.org/jira/browse/SPARK-19762] makes this 
> easy to add. I've already prototyped it, and it works pretty well. This 
> change would allow an arbitrary number of features (up to what can fit on a 
> single node) as in Linear/Logistic regression.
> For reference, other GLM packages also support this - e.g. statsmodels, H2O.






[jira] [Created] (SPARK-21405) Add LBFGS solver for GeneralizedLinearRegression

2017-07-13 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-21405:


 Summary: Add LBFGS solver for GeneralizedLinearRegression
 Key: SPARK-21405
 URL: https://issues.apache.org/jira/browse/SPARK-21405
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Seth Hendrickson


GeneralizedLinearRegression in Spark ML currently only allows 4096 features 
because it uses IRLS, and hence WLS, as an optimizer which relies on collecting 
the covariance matrix on the driver. GLMs can also be fit by simple 
gradient-based methods like LBFGS.

The new API from 
[SPARK-19762|https://issues.apache.org/jira/browse/SPARK-19762] makes this easy 
to add. I've already prototyped it, and it works pretty well. This change would 
allow an arbitrary number of features (up to what can fit on a single node) as 
in Linear/Logistic regression.

For reference, other GLM packages also support this - e.g. statsmodels, H2O.
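
For illustration, a sketch of what this might look like to users, assuming a 
new (hypothetical) "lbfgs" value for the existing {{solver}} param, which 
today only accepts "irls":
{code}
import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
  .setSolver("lbfgs") // hypothetical; would lift the 4096-feature limit
{code}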






[jira] [Created] (SPARK-21245) Resolve code duplication for classification/regression summarizers

2017-06-28 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-21245:


 Summary: Resolve code duplication for classification/regression 
summarizers
 Key: SPARK-21245
 URL: https://issues.apache.org/jira/browse/SPARK-21245
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.2.1
Reporter: Seth Hendrickson


In several places (LogReg, LinReg, SVC) in Spark ML, we collect summary 
information about training data using {{MultivariateOnlineSummarizer}} and 
{{MulticlassSummarizer}}. We have the same code appearing in several places 
(and including test suites). We can eliminate this by creating a common 
implementation somewhere.






[jira] [Commented] (SPARK-21152) Use level 3 BLAS operations in LogisticAggregator

2017-06-23 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061158#comment-16061158
 ] 

Seth Hendrickson commented on SPARK-21152:
--

[~yanboliang] I can do performance testing and post the results for sure. 
Still, do you have any thoughts about the caching issues? I wanted to see if it 
was a deal-breaker before getting so far as conducting exhaustive performance 
tests.

> Use level 3 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-21152
> URL: https://issues.apache.org/jira/browse/SPARK-21152
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Seth Hendrickson
>
> In the logistic regression gradient update, we currently compute the 
> gradient row by row. If we block rows together, we can do a blocked gradient 
> update which leverages the BLAS GEMM operation.
> On high dimensional dense datasets, I've observed ~10x speedups. The problem 
> here, though, is that it likely won't improve the sparse case so we need to 
> keep both implementations around, and this blocked algorithm will require 
> caching a new dataset of type:
> {code}
> BlockInstance(label: Vector, weight: Vector, features: Matrix)
> {code}
> We have avoided caching anything besides the original dataset passed to train 
> in the past because it adds memory overhead if the user has cached this 
> original dataset for other reasons. Here, I'd like to discuss whether we 
> think this patch would be worth the investment, given that it only improves a 
> subset of the use cases.






[jira] [Created] (SPARK-21152) Use level 3 BLAS operations in LogisticAggregator

2017-06-20 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-21152:


 Summary: Use level 3 BLAS operations in LogisticAggregator
 Key: SPARK-21152
 URL: https://issues.apache.org/jira/browse/SPARK-21152
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.1.1
Reporter: Seth Hendrickson


In the logistic regression gradient update, we currently compute the gradient 
row by row. If we block rows together, we can do a blocked gradient update 
which leverages the BLAS GEMM operation.

On high dimensional dense datasets, I've observed ~10x speedups. The problem 
here, though, is that it likely won't improve the sparse case so we need to 
keep both implementations around, and this blocked algorithm will require 
caching a new dataset of type:

{code}
BlockInstance(label: Vector, weight: Vector, features: Matrix)
{code}

We have avoided caching anything besides the original dataset passed to train in 
the past because it adds memory overhead if the user has cached this original 
dataset for other reasons. Here, I'd like to discuss whether we think this 
patch would be worth the investment, given that it only improves a subset of 
the use cases.
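
For illustration, a rough sketch of the core of the blocked update, assuming a 
BlockInstance as above with features packed as a (blockSize x numFeatures) 
matrix and coefficients stored as a (numClasses x numFeatures) matrix; a 
single GEMM then computes the margins for the whole block instead of blockSize 
row-wise updates:
{code}
import org.apache.spark.ml.linalg.DenseMatrix

def blockMargins(features: DenseMatrix, coefficients: DenseMatrix): DenseMatrix = {
  // (blockSize x numFeatures) * (numFeatures x numClasses)
  //   = (blockSize x numClasses), in one GEMM call
  features.multiply(coefficients.transpose)
}
{code}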






[jira] [Commented] (SPARK-21152) Use level 3 BLAS operations in LogisticAggregator

2017-06-20 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056025#comment-16056025
 ] 

Seth Hendrickson commented on SPARK-21152:
--

cc [~dbtsai] [~mlnick] [~srowen]

BTW, I've been working on this. DB, you and I discussed the caching issue in 
the past. Here's a comment from DB for reference:

"In the old mllib implementation, I just decided to have a copy of entire 
standardized dataset and had it cached for simplicity. After talking to couple 
people for their use cases, many times, they're training models on the same 
cached dataset for different regularizations, and then the old mllib will cache 
them again and again which will result pressure on GC and waste some memory 
space."


> Use level 3 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-21152
> URL: https://issues.apache.org/jira/browse/SPARK-21152
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Seth Hendrickson
>
> In the logistic regression gradient update, we currently compute the 
> gradient row by row. If we block rows together, we can do a blocked gradient 
> update which leverages the BLAS GEMM operation.
> On high dimensional dense datasets, I've observed ~10x speedups. The problem 
> here, though, is that it likely won't improve the sparse case so we need to 
> keep both implementations around, and this blocked algorithm will require 
> caching a new dataset of type:
> {code}
> BlockInstance(label: Vector, weight: Vector, features: Matrix)
> {code}
> We have avoided caching anything besides the original dataset passed to train 
> in the past because it adds memory overhead if the user has cached this 
> original dataset for other reasons. Here, I'd like to discuss whether we 
> think this patch would be worth the investment, given that it only improves a 
> subset of the use cases.






[jira] [Commented] (SPARK-20988) Convert logistic regression to new aggregator framework

2017-06-13 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048489#comment-16048489
 ] 

Seth Hendrickson commented on SPARK-20988:
--

I've already started it a bit. Would you mind doing the same thing for 
LinearSVC instead? It should be mostly orthogonal, though I think some of the 
unit tests will need to share code.

> Convert logistic regression to new aggregator framework
> ---
>
> Key: SPARK-20988
> URL: https://issues.apache.org/jira/browse/SPARK-20988
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Use the hierarchy from SPARK-19762 for logistic regression optimization






[jira] [Created] (SPARK-20988) Convert logistic regression to new aggregator framework

2017-06-05 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-20988:


 Summary: Convert logistic regression to new aggregator framework
 Key: SPARK-20988
 URL: https://issues.apache.org/jira/browse/SPARK-20988
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.3.0
Reporter: Seth Hendrickson
Priority: Minor


Use the hierarchy from SPARK-19762 for logistic regression optimization






[jira] [Comment Edited] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-03-27 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944030#comment-15944030
 ] 

Seth Hendrickson edited comment on SPARK-19634 at 3/27/17 10:23 PM:


I'm coming to this a bit late, but I'm finding things a bit hard to follow. 
Reading the design doc, it seems that the original plan was to implement two 
interfaces - an RDD one that provides the same performance as current 
{{MultivariateOnlineSummarizer}} and a data frame interface using UDAF. 

from design doc:
"...In the meantime, there will be a (possibly faster) RDD interface and a 
(more flexible) Dataframe interface."

Now, the PR for this uses {{TypedImperativeAggregate}}. I understand that it 
was pivoted away from UDAF, but the design doc does not reflect that. Also, 
if there is to be an RDD interface, what is the JIRA for it and what will it 
look like?

Also, there are several concerns raised in the design doc about this Catalyst 
aggregate approach being less efficient, and the consensus seemed to be: 
provide an initial API with a "slow" implementation that will be improved upon 
in the future. Is that correct? I'm not that familiar with the Catalyst 
optimizer, but are we sure there is a good way to implement the tree-reduce 
type aggregation, and if so could we document that? I'd prefer to get the 
details hashed out further rather than rushing to provide an API and an 
initial slow implementation; that way we can make sure that we get this 
correct in the long term. I would really appreciate some clarification, and my 
apologies if I have missed any of the details/discussion.


was (Author: sethah):
I'm coming to this a bit late, but I'm finding things a bit hard to follow. 
Reading the design doc, it seems that the original plan was to implement two 
interfaces - an RDD one that provides the same performance as current 
{{MultivariateOnlineSummarizer}} and a data frame interface using UDAF. 

from design doc:
"...In the meantime, there will be a (possibly faster) RDD interface and a 
(more flexible) Dataframe interface."

Now, the PR for this uses {{TypedImperativeAggregate}}. I understand that it 
was pivoted away from UDAF, but the design doc does not reflect that. Also, 
if there is to be an RDD interface, what is the JIRA for it and what will it 
look like?

Also, there are several concerns raised in the design doc about this Catalyst 
aggregate approach being less efficient, and the consensus seemed to be: 
provide an initial API with a "slow" implementation that will be improved upon 
in the future. Is that correct? I'm not that familiar with the Catalyst 
optimizer, but are we sure there is a good way to implement the tree-reduce 
type aggregation, and if so could we document that? If this is still targeted 
at 2.2, why? I'd prefer to get the details hashed out further rather than 
rushing to provide an API and an initial slow implementation; that way we can 
make sure that we get this correct in the long term. I would really appreciate 
some clarification, and my apologies if I have missed any of the 
details/discussion.

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#






[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-03-27 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944030#comment-15944030
 ] 

Seth Hendrickson commented on SPARK-19634:
--

I'm coming to this a bit late, but I'm finding things a bit hard to follow. 
Reading the design doc, it seems that the original plan was to implement two 
interfaces - an RDD one that provides the same performance as current 
{{MultivariateOnlineSummarizer}} and a data frame interface using UDAF. 

from design doc:
"...In the meantime, there will be a (possibly faster) RDD interface and a 
(more flexible) Dataframe interface."

Now, the PR for this uses {{TypedImperativeAggregate}}. I understand that it 
was pivoted away from UDAF, but the design doc does not reflect that. Also, 
if there is to be an RDD interface, what is the JIRA for it and what will it 
look like?

Also, there are several concerns raised in the design doc about this Catalyst 
aggregate approach being less efficient, and the consensus seemed to be: 
provide an initial API with a "slow" implementation that will be improved upon 
in the future. Is that correct? I'm not that familiar with the Catalyst 
optimizer, but are we sure there is a good way to implement the tree-reduce 
type aggregation, and if so could we document that? If this is still targeted 
at 2.2, why? I'd prefer to get the details hashed out further rather than 
rushing to provide an API and an initial slow implementation; that way we can 
make sure that we get this correct in the long term. I would really appreciate 
some clarification, and my apologies if I have missed any of the 
details/discussion.

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#






[jira] [Commented] (SPARK-20083) Change matrix toArray to not create a new array when matrix is already column major

2017-03-27 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943873#comment-15943873
 ] 

Seth Hendrickson commented on SPARK-20083:
--

Yes, that would be the intention. We have to take care to update any existing 
code that relies on {{toArray}} returning a new array when we implement this 
change.

> Change matrix toArray to not create a new array when matrix is already column 
> major
> ---
>
> Key: SPARK-20083
> URL: https://issues.apache.org/jira/browse/SPARK-20083
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> {{toArray}} always creates a new array in column major format, even when the 
> resulting array is the same as the backing values. We should change this to 
> just return a reference to the values array when it is already column major.






[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients

2017-03-24 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15941114#comment-15941114
 ] 

Seth Hendrickson commented on SPARK-17137:
--

I can make a PR for using this inside the MLOR code, but I probably won't have 
time to do performance tests within the next couple of days (since code freeze 
has already passed). [~dbtsai] Do you think we need to do performance tests 
before this patch goes in? 

> Add compressed support for multinomial logistic regression coefficients
> ---
>
> Key: SPARK-17137
> URL: https://issues.apache.org/jira/browse/SPARK-17137
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> For sparse coefficients in MLOR, such as under high L1 regularization, it may 
> be more efficient to store coefficients in compressed format. We can add this 
> option to MLOR and perhaps do some performance tests to verify 
> improvements.






[jira] [Created] (SPARK-20083) Change matrix toArray to not create a new array when matrix is already column major

2017-03-24 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-20083:


 Summary: Change matrix toArray to not create a new array when 
matrix is already column major
 Key: SPARK-20083
 URL: https://issues.apache.org/jira/browse/SPARK-20083
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Seth Hendrickson
Priority: Minor


{{toArray}} always creates a new array in column major format, even when the 
resulting array is the same as the backing values. We should change this to 
just return a reference to the values array when it is already column major.
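
A sketch of the proposed change inside {{DenseMatrix}}, in terms of its 
existing {{values}}, {{numRows}}, {{numCols}}, and {{isTransposed}} fields:
{code}
def toArray: Array[Double] = {
  if (!isTransposed) {
    values // already column major: hand back the backing array, no copy
  } else {
    // stored row major: copy into a new column-major array, as today
    val out = new Array[Double](numRows * numCols)
    var i = 0
    while (i < numRows) {
      var j = 0
      while (j < numCols) {
        out(i + j * numRows) = values(j + i * numCols)
        j += 1
      }
      i += 1
    }
    out
  }
}
{code}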






[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms

2017-03-21 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15935112#comment-15935112
 ] 

Seth Hendrickson commented on SPARK-17136:
--

The reason to support setting them in both places would be backwards 
compatibility mainly. If we still allow users to set {{maxIter}} on the 
estimator then we won't break code that previously did this. Specifying the 
optimizer, either one built into Spark or a custom one, would be optional and 
something mostly advanced users would do. About grid-based CV, this would be a 
point that we need to carefully consider and make sure that we get it right. 
We'd still allow users to search over grids of {{maxIter}}, {{tol}} etc... 
since those params are still there, but additionally users could search over 
different optimizers and optimizers with different parameters themselves. I 
think that could be a bit clunky, but it's open for design discussion, e.g.:

{code}
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.minimizer, Array(new LBFGS(), new OWLQN(), new LBFGSB(lb, ub)))
  .build()
{code}

Yes, there are cases where users could supply conflicting grids, but AFAICT 
this problem already exists, e.g. 

{code}
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.solver, Array("normal", "l-bfgs"))
  .addGrid(lr.maxIter, Array(10, 20)) // maxIter is ignored when solver is 
normal
  .build()
{code}

About your suggestion of mimicking Spark SQL - would you mind elaborating here 
or on the design doc? I'm not as familiar with it, so if you have some design 
in mind it would be great to hear that.



> Design optimizer interface for ML algorithms
> 
>
> Key: SPARK-17136
> URL: https://issues.apache.org/jira/browse/SPARK-17136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> We should consider designing an interface that allows users to use their own 
> optimizers in some of the ML algorithms, similar to MLlib. 






[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2017-03-21 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934960#comment-15934960
 ] 

Seth Hendrickson commented on SPARK-7129:
-

I don't think anyone is working on it, though I'm afraid this task is probably 
not a good use of time, for a couple of reasons. We still don't 
have weight support in trees and there is extremely limited bandwidth of 
reviewers/committers in Spark ML at the moment. Further, there are many more 
important tasks that need to be done in ML so I would rate this as low 
priority, which also means it is less likely to be reviewed or see much 
progress. Finally, given the recent success of things like xgboost/lightGBM, we 
may want to rethink/rewrite the existing boosting framework to see if we can 
get similar performance. If anything, I think we need to think about how we'd 
like to proceed improving the boosting libraries in Spark from an overall point 
of view, but that is a large task that is likely a few releases away. I'd be 
curious to hear others' thoughts of course, but this is the state of things 
AFAIK. I guess I don't see this as a priority, but it could become one given 
enough community interest.

> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.






[jira] [Created] (SPARK-19762) Implement aggregator/loss function hierarchy and apply to linear regression

2017-02-27 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-19762:


 Summary: Implement aggregator/loss function hierarchy and apply to 
linear regression
 Key: SPARK-19762
 URL: https://issues.apache.org/jira/browse/SPARK-19762
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.2.0
Reporter: Seth Hendrickson
Priority: Minor


Creating this subtask as a first step for consolidating ML aggregators. We can 
start by just applying this change to linear regression, to keep the PR more 
manageable in scope.






[jira] [Commented] (SPARK-19747) Consolidate code in ML aggregators

2017-02-26 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885185#comment-15885185
 ] 

Seth Hendrickson commented on SPARK-19747:
--

BTW, I have a rough prototype which at least indicates this is do-able. Still 
some kinks to work out though. I would like to work on this task if that's 
alright.

> Consolidate code in ML aggregators
> --
>
> Key: SPARK-19747
> URL: https://issues.apache.org/jira/browse/SPARK-19747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Many algorithms in Spark ML are posed as optimization of a differentiable 
> loss function over a parameter vector. We implement these by having a loss 
> function accumulate the gradient using an Aggregator class which has methods 
> that amount to a {{seqOp}} and {{combOp}}. So, pretty much every algorithm 
> that obeys this form implements a cost function class and an aggregator 
> class, which are completely separate from one another but share probably 80% 
> of the same code. 
> I think it is important to clean things like this up, and if we can do it 
> properly it will make the code much more maintainable, readable, and bug 
> free. It will also help reduce the overhead of future implementations.
> The design is of course open for discussion, but I think we should aim to:
> 1. Have all aggregators share parent classes, so that they only need to 
> implement the {{add}} function. This is really the only difference in the 
> current aggregators.
> 2. Have a single, generic cost function that is parameterized by the 
> aggregator type. This reduces the many places we implement cost functions and 
> greatly reduces the amount of duplicated code.






[jira] [Created] (SPARK-19747) Consolidate code in ML aggregators

2017-02-26 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-19747:


 Summary: Consolidate code in ML aggregators
 Key: SPARK-19747
 URL: https://issues.apache.org/jira/browse/SPARK-19747
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Seth Hendrickson
Priority: Minor


Many algorithms in Spark ML are posed as optimization of a differentiable loss 
function over a parameter vector. We implement these by having a loss function 
accumulate the gradient using an Aggregator class which has methods that amount 
to a {{seqOp}} and {{combOp}}. So, pretty much every algorithm that obeys this 
form implements a cost function class and an aggregator class, which are 
completely separate from one another but share probably 80% of the same code. 

I think it is important to clean things like this up, and if we can do it 
properly it will make the code much more maintainable, readable, and bug free. 
It will also help reduce the overhead of future implementations.

The design is of course open for discussion, but I think we should aim to:
1. Have all aggregators share parent classes, so that they only need to 
implement the {{add}} function. This is really the only difference in the 
current aggregators.
2. Have a single, generic cost function that is parameterized by the aggregator 
type. This reduces the many places we implement cost functions and greatly 
reduces the amount of duplicated code.
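
For discussion, a hypothetical sketch of the hierarchy (names illustrative; 
Instance is redefined locally to keep the sketch self-contained). Subclasses 
would implement only {{add}}; the {{merge}} (combOp) and loss logic are shared:
{code}
import org.apache.spark.ml.linalg.Vector

case class Instance(label: Double, weight: Double, features: Vector)

trait DifferentiableLossAggregator[Agg <: DifferentiableLossAggregator[Agg]] {
  self: Agg =>

  protected var weightSum: Double = 0.0
  protected var lossSum: Double = 0.0
  protected def dim: Int
  protected lazy val gradientSumArray: Array[Double] = Array.ofDim[Double](dim)

  /** Algorithm-specific seqOp: fold one instance into this buffer. */
  def add(instance: Instance): Agg

  /** Shared combOp: merge another partial aggregator into this one. */
  def merge(other: Agg): Agg = {
    weightSum += other.weightSum
    lossSum += other.lossSum
    var i = 0
    while (i < dim) {
      gradientSumArray(i) += other.gradientSumArray(i)
      i += 1
    }
    this
  }

  def loss: Double = lossSum / weightSum
}
{code}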






[jira] [Created] (SPARK-19746) LogisticAggregator is inefficient in indexing

2017-02-26 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-19746:


 Summary: LogisticAggregator is inefficient in indexing
 Key: SPARK-19746
 URL: https://issues.apache.org/jira/browse/SPARK-19746
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.1.0
Reporter: Seth Hendrickson


The following code occurs in the `LogisticAggregator.add` method, which is a 
performance-critical path.

{code}
val localCoefficients = bcCoefficients.value
features.foreachActive { (index, value) =>
  val stdValue = value / localFeaturesStd(index)
  var j = 0
  while (j < numClasses) {
margins(j) += localCoefficients(index * numClasses + j) * stdValue
j += 1
  }
}
{code}

`localCoefficients(index * numClasses + j)` calls the `apply` method on 
`Vector`, which dispatches to `asBreeze(index * numClasses + j)`, which 
creates a new Breeze vector and then indexes it. This is very inefficient, 
creates a lot of unnecessary garbage, and we can avoid it by indexing the 
underlying array directly.
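
A sketch of the fix, in the same surrounding context as the snippet above: 
unwrap the backing array once, outside the hot loop, so indexing no longer 
goes through `Vector.apply`/`asBreeze` (for a `DenseVector`, `toArray` returns 
the backing array without a copy):
{code}
val localCoefficients: Array[Double] = bcCoefficients.value.toArray
features.foreachActive { (index, value) =>
  val stdValue = value / localFeaturesStd(index)
  var j = 0
  while (j < numClasses) {
    // plain array indexing: no Vector.apply, no Breeze conversion
    margins(j) += localCoefficients(index * numClasses + j) * stdValue
    j += 1
  }
}
{code}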






[jira] [Created] (SPARK-19745) SVCAggregator serializes coefficients

2017-02-26 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-19745:


 Summary: SVCAggregator serializes coefficients
 Key: SPARK-19745
 URL: https://issues.apache.org/jira/browse/SPARK-19745
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Seth Hendrickson


Similar to [SPARK-16008|https://issues.apache.org/jira/browse/SPARK-16008], the 
SVC aggregator captures the coefficients in the class closure, and therefore 
ships them around during optimization. We can prevent this with a bit of 
reorganization of the aggregator class.
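
For illustration, a hypothetical sketch of the reorganization (class name and 
members illustrative): hold only the {{Broadcast}} handle in the closure and 
unwrap it into a {{@transient lazy val}}, so the coefficient values are read 
from the broadcast on each executor rather than serialized with every task 
closure:
{code}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.Vector

class HingeAggregatorSketch(bcCoefficients: Broadcast[Vector]) extends Serializable {
  // Not serialized with the closure; re-materialized lazily per executor.
  @transient private lazy val localCoefficients: Array[Double] =
    bcCoefficients.value.toArray

  def margin(features: Vector): Double = {
    var sum = 0.0
    features.foreachActive { (i, v) => sum += localCoefficients(i) * v }
    sum
  }
}
{code}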






[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-02-14 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15867049#comment-15867049
 ] 

Seth Hendrickson commented on SPARK-18392:
--

I would pretty strongly prefer to focus on adding AND-amplification before 
adding anything else to LSH. That is more of a missing piece of the 
functionality, whereas other things are enhancements. Curious to hear others' 
thoughts on this. 

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest

2017-02-13 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865079#comment-15865079
 ] 

Seth Hendrickson commented on SPARK-9478:
-

[~josephkb] Done. Thanks for your feedback on sampling!

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class 
> weights. Class weights are important when there is imbalanced training data 
> or the evaluation metric of a classifier is imbalanced (e.g. true positive 
> rate at some false positive threshold). 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19591) Add sample weights to decision trees

2017-02-13 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-19591:


 Summary: Add sample weights to decision trees
 Key: SPARK-19591
 URL: https://issues.apache.org/jira/browse/SPARK-19591
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.1.0
Reporter: Seth Hendrickson


Add sample weights to decision trees



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2017-02-08 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858438#comment-15858438
 ] 

Seth Hendrickson commented on SPARK-17139:
--

Seems like a reasonable way to solve a messy problem - so I think we should go 
ahead with it.

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Add model summary to multinomial logistic regression using same interface as 
> in other ML models.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2017-02-07 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857230#comment-15857230
 ] 

Seth Hendrickson commented on SPARK-17139:
--

[~josephkb] Is [this more or less what you had in 
mind|https://gist.github.com/sethah/83c57fd77385979579cb44f3d5730e67]?

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Add model summary to multinomial logistic regression using same interface as 
> in other ML models.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19313) GaussianMixture throws cryptic error when number of features is too high

2017-01-20 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-19313:


 Summary: GaussianMixture throws cryptic error when number of 
features is too high
 Key: SPARK-19313
 URL: https://issues.apache.org/jira/browse/SPARK-19313
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Reporter: Seth Hendrickson
Priority: Minor


The following code fails (assuming a SparkSession named {{spark}} is in scope, 
e.g. in spark-shell):

{code}
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

val df = Seq(
  Vectors.sparse(46400, Array(0, 4), Array(3.0, 8.0)),
  Vectors.sparse(46400, Array(1, 5), Array(4.0, 9.0)))
  .map(Tuple1.apply).toDF("features")
val gm = new GaussianMixture()
gm.fit(df)
{code}

It fails because GMMs allocate an array of size {{numFeatures * numFeatures}}, 
and in this case that product overflows an Int. We should limit the number of 
features appropriately and fail fast with a clear error.
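
To make the arithmetic concrete (the guard shown is hypothetical):

{code}
val numFeatures = 46400
// The covariance allocation needs numFeatures * numFeatures doubles:
// 46400L * 46400L = 2152960000, but Int.MaxValue = 2147483647, so the Int
// product wraps around to a negative number and the allocation fails with
// a cryptic error.
val intended = numFeatures.toLong * numFeatures  //  2152960000
val wrapped  = numFeatures * numFeatures         // -2142007296

// A hypothetical up-front check; with numFeatures = 46400 this throws
// IllegalArgumentException with a clear message instead.
require(numFeatures.toLong * numFeatures <= Int.MaxValue,
  s"GaussianMixture cannot be fit on $numFeatures features: it allocates " +
  s"numFeatures^2 = $intended values per Gaussian, which exceeds Int.MaxValue.")
{code}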



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms

2017-01-11 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819062#comment-15819062
 ] 

Seth Hendrickson commented on SPARK-17136:
--

I'm interested in working on this task including both driving the discussion 
and submitting an initial PR when it is time. I have the beginnings of a design 
document constructed 
[here|https://docs.google.com/document/d/1ynyTwlNw4b6DovG6m8okd3fD2PVZKCEq5rFfsg5Ba1k/edit?usp=sharing],
 and I'd like to open it up for community feedback and input. 

We see requests from time to time for users to plug their own optimizers into 
Spark ML algorithms, which we have not supported. With fairly minimal added 
code, we can make Spark ML optimizers pluggable, which provides a tangible 
benefit to users. Potentially, we can design an API that has benefits beyond 
just that, and I'm interested to hear some of the other needs/wants people 
have.

cc [~dbtsai] [~yanboliang] [~WeichenXu123] [~josephkb] [~srowen]
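
To give a feel for the shape being discussed, a minimal sketch (all names are 
illustrative, not the proposed API; see the design doc for the actual 
discussion):

{code}
// A pluggable optimizer: anything that can minimize a differentiable loss.
trait MLOptimizer {
  /** lossGrad returns (loss, gradient) at the given coefficients. */
  def optimize(
      lossGrad: Array[Double] => (Double, Array[Double]),
      init: Array[Double]): Array[Double]
}

// A toy gradient-descent plug-in, purely to illustrate the interface.
class GradientDescent(stepSize: Double, maxIter: Int) extends MLOptimizer {
  def optimize(
      lossGrad: Array[Double] => (Double, Array[Double]),
      init: Array[Double]): Array[Double] = {
    var coef = init.clone()
    var iter = 0
    while (iter < maxIter) {
      val (_, grad) = lossGrad(coef)
      coef = coef.zip(grad).map { case (c, g) => c - stepSize * g }
      iter += 1
    }
    coef
  }
}

// Usage: minimize f(x) = (x - 3)^2 from x = 0; the result approaches 3.
val opt: MLOptimizer = new GradientDescent(stepSize = 0.1, maxIter = 100)
val result = opt.optimize(
  x => ((x(0) - 3) * (x(0) - 3), Array(2 * (x(0) - 3))), Array(0.0))
{code}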

> Design optimizer interface for ML algorithms
> 
>
> Key: SPARK-17136
> URL: https://issues.apache.org/jira/browse/SPARK-17136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> We should consider designing an interface that allows users to use their own 
> optimizers in some of the ML algorithms, similar to MLlib. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10078) Vector-free L-BFGS

2017-01-06 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805366#comment-15805366
 ] 

Seth Hendrickson commented on SPARK-10078:
--

As a part of [SPARK-17136|https://issues.apache.org/jira/browse/SPARK-17136], I 
am looking into a design for a generic optimizer interface for Spark ML. 
Ideally, this should be abstracted such that, as Yanbo mentioned, users can 
switch between optimizers easily. I don't think adding this to Breeze is 
important, since we hope to add our own interface directly into Spark.

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10078) Vector-free L-BFGS

2017-01-02 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793488#comment-15793488
 ] 

Seth Hendrickson commented on SPARK-10078:
--

[~yanboliang] I was a bit confused by the following comment under new 
requirements for VL-BFGS:

"API consistency with Breeze L-BFGS so we can migrate existing code smoothly."

What existing code are we migrating, and to where/what? Are we planning to 
replace the use of the Breeze LBFGS solvers with this VL-BFGS implementation? 
If so, what about the numerous use cases that do not need to partition by 
features? Thanks!

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18705) Docs for one-pass solver for linear regression with L1 and elastic-net penalties

2016-12-04 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15720892#comment-15720892
 ] 

Seth Hendrickson commented on SPARK-18705:
--

Yeah, I'll do it today :)

> Docs for one-pass solver for linear regression with L1 and elastic-net 
> penalties
> 
>
> Key: SPARK-18705
> URL: https://issues.apache.org/jira/browse/SPARK-18705
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add document for SPARK-17748 at [{{Normal equation solver for weighted least 
> squares}}|http://spark.apache.org/docs/latest/ml-advanced.html#normal-equation-solver-for-weighted-least-squares]
>  session.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17772) Add helper testing methods for instance weighting

2016-11-22 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687500#comment-15687500
 ] 

Seth Hendrickson commented on SPARK-17772:
--

Please do, thanks!

> Add helper testing methods for instance weighting
> -
>
> Key: SPARK-17772
> URL: https://issues.apache.org/jira/browse/SPARK-17772
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
>
> More and more ML algos are accepting instance weights. We keep replicating 
> code to test instance weighting in every test suite, which will get out of 
> hand rather quickly. We can and should implement some generic instance weight 
> test helper methods so that we can reduce duplicated code and standardize 
> these tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest

2016-11-17 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15675770#comment-15675770
 ] 

Seth Hendrickson commented on SPARK-9478:
-

I'm going to work on submitting a PR for adding sample weights for 2.2. That 
PR is for adding class weights, which I think we decided against.

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class 
> weights. Class weights are important when there is imbalanced training data 
> or the evaluation metric of a classifier is imbalanced (e.g. true positive 
> rate at some false positive threshold). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9478) Add sample weights to Random Forest

2016-11-17 Thread Seth Hendrickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seth Hendrickson updated SPARK-9478:

Summary: Add sample weights to Random Forest  (was: Add class weights to 
Random Forest)

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class 
> weights. Class weights are important when there is imbalanced training data 
> or the evaluation metric of a classifier is imbalanced (e.g. true positive 
> rate at some false positive threshold). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18456) Use matrix abstraction for LogisticRegression coefficients during training

2016-11-15 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-18456:


 Summary: Use matrix abstraction for LogisticRegression coefficients 
during training
 Key: SPARK-18456
 URL: https://issues.apache.org/jira/browse/SPARK-18456
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Seth Hendrickson
Priority: Minor


This is a follow-up from 
[SPARK-18060|https://issues.apache.org/jira/browse/SPARK-18060]. The current 
code for logistic regression relies on manually indexing flat arrays of 
column-major coefficients, which is messy and hard to maintain. We can use a 
matrix abstraction instead of a flat array, which will make the code easier 
to read and maintain.
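
A small sketch of the difference (illustrative, not the actual patch):

{code}
import org.apache.spark.ml.linalg.DenseMatrix

val numClasses = 3
val numFeatures = 4
// Flat column-major storage: the coefficient for (class j, feature i) lives
// at index i * numClasses + j, and every call site must know that.
val flat = Array.tabulate(numClasses * numFeatures)(_.toDouble)
val manual = flat(2 * numClasses + 1)   // class 1, feature 2

// The same storage behind a matrix abstraction; DenseMatrix is column-major
// by default, so this wraps the array without copying.
val coefMat = new DenseMatrix(numClasses, numFeatures, flat)
assert(coefMat(1, 2) == manual)         // the intent (class, feature) is explicit
{code}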



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2016-11-14 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15665233#comment-15665233
 ] 

Seth Hendrickson commented on SPARK-18392:
--

Thank you for clarifying, I see it now.

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18321) ML 2.1 QA: API: Java compatibility, docs

2016-11-11 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658288#comment-15658288
 ] 

Seth Hendrickson commented on SPARK-18321:
--

So I generated the API docs for 2.0 and 2.1 and looked at everything that had 
changed. I didn't find any problems with the type signatures. Again, the 
biggest items are LSH and the new clustering summaries; the other changes are 
mostly params that were added or edited. 

If anyone has other suggestions of what to do here, please let me know. I am 
reasonably sure there are no major Java incompatibilities based on the evidence 
above. 

> ML 2.1 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-18321
> URL: https://issues.apache.org/jira/browse/SPARK-18321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2016-11-11 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658063#comment-15658063
 ] 

Seth Hendrickson commented on SPARK-18392:
--

[~josephkb] I wasn't sure where to ask this, but I saw you suggested adding a 
self-type reference to the LSH class:

{code}
private[ml] abstract class LSH[T <: LSHModel[T]]
  extends Estimator[T] with LSHParams with DefaultParamsWritable {
  self: Estimator[T] =>
{code}

And I'm not sure I can see why it's needed. What was the intent?
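
For context, a self-type constrains what {{this}} must be without adding a 
supertype. A minimal standalone illustration (nothing to do with the actual 
Spark classes):

{code}
trait Greeter { def greet: String }

// Any concrete class mixing in Loud must also be a Greeter, so Loud can call
// greet without itself extending Greeter.
trait Loud { self: Greeter =>
  def shout: String = greet.toUpperCase
}

class Hello extends Greeter with Loud { def greet = "hello" }
// new Hello().shout == "HELLO"
{code}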

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-18321) ML 2.1 QA: API: Java compatibility, docs

2016-11-10 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654524#comment-15654524
 ] 

Seth Hendrickson commented on SPARK-18321:
--

In the current Spark Java docs here: 
http://spark.apache.org/docs/latest/api/java/, I see some classes showing up 
that are private in Scala, e.g. LogisticAggregator and LogisticCostFun. I 
checked older releases and this problem is not new... 

> ML 2.1 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-18321
> URL: https://issues.apache.org/jira/browse/SPARK-18321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18369) Deprecate runs in Pyspark mllib KMeans

2016-11-10 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654450#comment-15654450
 ] 

Seth Hendrickson commented on SPARK-18369:
--

There is a deprecation note for Python docs, but I realize now that we cannot 
deprecate the method since we can't overload methods in Python. Let's close 
this as no issue.

> Deprecate runs in Pyspark mllib KMeans
> --
>
> Key: SPARK-18369
> URL: https://issues.apache.org/jira/browse/SPARK-18369
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> We should deprecate runs in pyspark mllib kmeans algo as we have done in 
> Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18369) Deprecate runs in Pyspark mllib KMeans

2016-11-10 Thread Seth Hendrickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seth Hendrickson resolved SPARK-18369.
--
Resolution: Not A Problem

> Deprecate runs in Pyspark mllib KMeans
> --
>
> Key: SPARK-18369
> URL: https://issues.apache.org/jira/browse/SPARK-18369
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> We should deprecate runs in pyspark mllib kmeans algo as we have done in 
> Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18320) ML 2.1 QA: API: Python API coverage

2016-11-08 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649171#comment-15649171
 ] 

Seth Hendrickson edited comment on SPARK-18320 at 11/8/16 11:23 PM:


I scanned through the {{@Since("2.1.0")}} tags in ml/mllib. The major things 
that were added were LSH and clustering summaries, which are linked and have 
JIRAs. I made JIRAs for a couple other minor things as well and linked them.


was (Author: sethah):
I scanned through the {{@Since("2.1.0") tags in ml/mllib}}. The major things 
that were added were LSH and clustering summaries, which are linked and have 
JIRAs. I made JIRAs for a couple other minor things as well and linked them.

> ML 2.1 QA: API: Python API coverage
> ---
>
> Key: SPARK-18320
> URL: https://issues.apache.org/jira/browse/SPARK-18320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18320) ML 2.1 QA: API: Python API coverage

2016-11-08 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649171#comment-15649171
 ] 

Seth Hendrickson commented on SPARK-18320:
--

I scanned through the {{@Since("2.1.0") tags in ml/mllib}}. The major things 
that were added were LSH and clustering summaries, which are linked and have 
JIRAs. I made JIRAs for a couple other minor things as well and linked them.

> ML 2.1 QA: API: Python API coverage
> ---
>
> Key: SPARK-18320
> URL: https://issues.apache.org/jira/browse/SPARK-18320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18369) Deprecate runs in Pyspark mllib KMeans

2016-11-08 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-18369:


 Summary: Deprecate runs in Pyspark mllib KMeans
 Key: SPARK-18369
 URL: https://issues.apache.org/jira/browse/SPARK-18369
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Seth Hendrickson
Priority: Minor


We should deprecate runs in pyspark mllib kmeans algo as we have done in Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer

2016-11-08 Thread Seth Hendrickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seth Hendrickson updated SPARK-18366:
-
Component/s: PySpark
 ML

> Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer
> ---
>
> Key: SPARK-18366
> URL: https://issues.apache.org/jira/browse/SPARK-18366
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> We should add the new {{handleInvalid}} param for these transformers to 
> Python to maintain API parity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer

2016-11-08 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-18366:


 Summary: Add handleInvalid to Pyspark for QuantileDiscretizer and 
Bucketizer
 Key: SPARK-18366
 URL: https://issues.apache.org/jira/browse/SPARK-18366
 Project: Spark
  Issue Type: New Feature
Reporter: Seth Hendrickson
Priority: Minor


We should add the new {{handleInvalid}} param for these transformers to Python 
to maintain API parity.
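
For reference, this would mirror the Scala-side usage (a sketch; per the Scala 
API, {{handleInvalid}} accepts "error", "skip", or "keep"):

{code}
import org.apache.spark.ml.feature.Bucketizer

val bucketizer = new Bucketizer()
  .setInputCol("raw")
  .setOutputCol("bucket")
  .setSplits(Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity))
  .setHandleInvalid("keep")  // "error" (default) throws on invalid values,
                             // "skip" filters such rows, "keep" puts them in
                             // an extra bucket
{code}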



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18321) ML 2.1 QA: API: Java compatibility, docs

2016-11-08 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648529#comment-15648529
 ] 

Seth Hendrickson commented on SPARK-18321:
--

I've taken a look at the new LSH additions as well as the clustering 
summaries, which were both added since 2.0. They seem OK.

I'd appreciate some guidance: is the main item here to comb through API docs 
for Java and see that type signatures check out, as well as matching Java and 
Scala APIs? What other tools are there?

> ML 2.1 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-18321
> URL: https://issues.apache.org/jira/browse/SPARK-18321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * comparing with the Scala doc
> * verifying that Java docs are not messed up by Scala type incompatibilities. 
>  Some items to look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (The correctness can be checked in Scala.)
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18316) Spark MLlib, GraphX 2.1 QA umbrella

2016-11-07 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645454#comment-15645454
 ] 

Seth Hendrickson commented on SPARK-18316:
--

Much appreciated [~josephkb]!

> Spark MLlib, GraphX 2.1 QA umbrella
> ---
>
> Key: SPARK-18316
> URL: https://issues.apache.org/jira/browse/SPARK-18316
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX.   *SparkR is separate: [SPARK-18329].*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Check binary API compatibility for Scala/Java
> * Audit new public APIs (from the generated html doc)
> ** Scala
> ** Java compatibility
> ** Python coverage
> * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
> * Performance tests
> * Major new algorithms: MinHash, RandomProjection
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18282) Add model summaries for Python GMM and BisectingKMeans

2016-11-04 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-18282:


 Summary: Add model summaries for Python GMM and BisectingKMeans
 Key: SPARK-18282
 URL: https://issues.apache.org/jira/browse/SPARK-18282
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Seth Hendrickson
Priority: Minor


GaussianMixtureModel and BisectingKMeansModel in Python do not have model 
summaries, but they are implemented in Scala. We should add them for API 
parity before the 2.1 release. After the QA JIRAs are created, this can be 
linked as a subtask.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18276) Some ML training summaries are not copied when {{copy()}} is called.

2016-11-04 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-18276:


 Summary: Some ML training summaries are not copied when {{copy()}} 
is called.
 Key: SPARK-18276
 URL: https://issues.apache.org/jira/browse/SPARK-18276
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Seth Hendrickson
Priority: Minor


GaussianMixture, KMeans, BisectingKMeans, and GeneralizedLinearRegression 
models do not copy their training summaries inside the {{copy}} method. In 
contrast, Linear/Logistic regression models do. They should all be consistent.
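
The inconsistency comes down to whether {{copy}} carries the training summary 
over. A minimal standalone illustration of the pattern (hypothetical names):

{code}
class Model(val uid: String) {
  private var summary: Option[String] = None
  def setSummary(s: Option[String]): this.type = { summary = s; this }
  def hasSummary: Boolean = summary.isDefined

  def copy(): Model = {
    val copied = new Model(uid)
    copied.setSummary(summary)  // the line the inconsistent models omit, so
                                // their copies silently lose the summary
    copied
  }
}
{code}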



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-11-04 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637424#comment-15637424
 ] 

Seth Hendrickson commented on SPARK-18081:
--

No worries, just wanted to check in to see if you had bandwidth to do it.

You can get a preview of the user guide by building the docs with 
{{SKIP_API=1 jekyll build}} from inside the docs directory. For more detail, 
please see [the readme|https://github.com/apache/spark/tree/master/docs].

> Locality Sensitive Hashing (LSH) User Guide
> ---
>
> Key: SPARK-18081
> URL: https://issues.apache.org/jira/browse/SPARK-18081
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-11-04 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636757#comment-15636757
 ] 

Seth Hendrickson commented on SPARK-18081:
--

[~yunn] Do you have a status update on this? It would be great to have this 
for 2.1.

> Locality Sensitive Hashing (LSH) User Guide
> ---
>
> Key: SPARK-18081
> URL: https://issues.apache.org/jira/browse/SPARK-18081
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-11-03 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634354#comment-15634354
 ] 

Seth Hendrickson commented on SPARK-15581:
--

I think the points you mention are very important to get right moving forward. 
We can certainly debate about what should go on the roadmap, but regardless I 
think it would be helpful to maintain a specific subset of JIRAs that we expect 
to get done for the next release cycle. Particularly:

- We should maintain a list of items that we WILL get done for the next 
release, and we should deliver on nearly every one, barring unforeseen 
circumstances. If we don't get some of the items done, we should understand why 
and adjust accordingly until we can reach a list of items that we can 
consistently deliver on.
- The list of items should be small and targeted, and should take into account 
things like committer/reviewer bandwidth. MLlib does not have a ton of active 
committers right now, like SQL might have, and the roadmap should reflect that. 
We need to be realistic.
- We should make every effort to be as specific as possible. Linking to 
umbrella JIRAs hurts us IMO, and we'd be better off listing specific JIRAs. 
Some of the umbrella tickets contain items that are longer term or have little 
interest (nice-to-haves), but realistically won't get implemented (in a timely 
manner). For example, I looked at the tree umbrellas and I see some items that 
are high priority and can be done in one release cycle, but also other items 
that have been around for a long time and seem to have little interest. The 
list should contain only the items that we expect to get done.
- As you say, every item should have a committer linked to it that is capable of 
merging it. They do not have to be the primary reviewer, but they should have 
sufficient expertise such that they feel comfortable merging it after it has 
been appropriately reviewed. One interesting example to be wary of is that 
there seem to be a LOT of tree related items on the roadmap, but Joseph has 
traditionally been the only (at least the main) committer involved in 
tree-related JIRAs. I don't think it's realistic to target all of these tree 
improvements when we have limited committers available to review/merge them. We 
can trim them down to a realistic subset.

I propose a revised roadmap that contains two classifications of items:

1. JIRAs that will be done by the next release
2. JIRAs that will be done at some point before the next major release (e.g. 3.0)

JIRAs that are still up for debate (e.g. adding a factorization machine) should 
not be on the roadmap. That does not mean they will not get done, but they are 
not necessarily "planned" for any particular timeframe. IMO this revised 
roadmap can/will provide a lot more transparency, and appropriately set review 
expectations. If it's on the list of "will do by next minor release," then 
contributors should expect it to be reviewed. What does everyone else think?

Also, I took a bit of time to aggregate lists of specific JIRAs that I think 
fit into the two categories I listed above 
[here|https://docs.google.com/spreadsheets/d/1nNvbGoarRvhsMkYaFiU6midyHrndPBYQTcKKNOF5xcs/edit?usp=sharing]
 (note: does not contain SparkR items). I am not (necessarily) proposing to 
move the list to this google doc, and I understand this is still undergoing 
discussion. I just wanted to provide an example of what the above might look 
like.   

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
> Fix For: 2.1.0
>
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delay in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> 

[jira] [Comment Edited] (SPARK-15581) MLlib 2.1 Roadmap

2016-11-03 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634354#comment-15634354
 ] 

Seth Hendrickson edited comment on SPARK-15581 at 11/3/16 9:28 PM:
---

I think the points you mention are very important to get right moving forward. 
We can certainly debate about what should go on the roadmap, but regardless I 
think it would be helpful to maintain a specific subset of JIRAs that we expect 
to get done for the next release cycle. Particularly:

- We should maintain a list of items that we WILL get done for the next 
release, and we should deliver on nearly every one, barring unforeseen 
circumstances. If we don't get some of the items done, we should understand why 
and adjust accordingly until we can reach a list of items that we can 
consistently deliver on.
- The list of items should be small and targeted, and should take into account 
things like committer/reviewer bandwidth. MLlib does not have a ton of active 
committers right now, like SQL might have, and the roadmap should reflect that. 
We need to be realistic.
- We should make every effort to be as specific as possible. Linking to 
umbrella JIRAs hurts us IMO, and we'd be better off listing specific JIRAs. 
Some of the umbrella tickets contain items that are longer term or have little 
interest (nice-to-haves), but realistically won't get implemented (in a timely 
manner). For example, I looked at the tree umbrellas and I see some items that 
are high priority and can be done in one release cycle, but also other items 
that have been around for a long time and seem to have little interest. The 
list should contain only the items that we expect to get done.
- As you say, every item should have a committer linked to it that is capable 
of merging it. They do not have to be the primary reviewer, but they should 
have sufficient expertise such that they feel comfortable merging it after it 
has been appropriately reviewed. One interesting example to be wary of is that 
there seem to be a LOT of tree related items on the roadmap, but Joseph has 
traditionally been the only (at least the main) committer involved in 
tree-related JIRAs. I don't think it's realistic to target all of these tree 
improvements when we have limited committers available to review/merge them. We 
can trim them down to a realistic subset.

I propose a revised roadmap that contains two classifications of items:

1. JIRAs that will be done by the next release
2. JIRAs that will be done at some point before the next major release (e.g. 
3.0)

JIRAs that are still up for debate (e.g. adding a factorization machine) should 
not be on the roadmap. That does not mean they will not get done, but they are 
not necessarily "planned" for any particular timeframe. IMO this revised 
roadmap can/will provide a lot more transparency, and appropriately set review 
expectations. If it's on the list of "will do by next minor release," then 
contributors should expect it to be reviewed. What does everyone else think?

Also, I took a bit of time to aggregate lists of specific JIRAs that I think 
fit into the two categories I listed above 
[here|https://docs.google.com/spreadsheets/d/1nNvbGoarRvhsMkYaFiU6midyHrndPBYQTcKKNOF5xcs/edit?usp=sharing]
 (note: does not contain SparkR items). I am not (necessarily) proposing to 
move the list to this google doc, and I understand this is still undergoing 
discussion. I just wanted to provide an example of what the above might look 
like.   


was (Author: sethah):
I think the points you mention are very important to get right moving forward. 
We can certainly debate about what should go on the roadmap, but regardless I 
think it would be helpful to maintain a specific subset of JIRAs that we expect 
to get done for the next release cycle. Particularly:

- We should maintain a list of items that we WILL get done for the next 
release, and we should deliver on nearly every one, barring unforeseen 
circumstances. If we don't get some of the items done, we should understand why 
and adjust accordingly until we can reach a list of items that we can 
consistently deliver on.
- The list of items should be small and targeted, and should take into account 
things like committer/reviewer bandwidth. MLlib does not have a ton of active 
committers right now, like SQL might have, and the roadmap should reflect that. 
We need to be realistic.
- We should make every effort to be as specific as possible. Linking to 
umbrella JIRAs hurts us IMO, and we'd be better off listing specific JIRAs. 
Some of the umbrella tickets contain items that are longer term or have little 
interest (nice-to-haves), but realistically won't get implemented (in a timely 
manner). For example, I looked at the tree umbrellas and I see some items that 
are high priority and can be done in one release cycle, but also other items 
that have been around for a 

[jira] [Commented] (SPARK-17138) Python API for multinomial logistic regression

2016-11-03 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633627#comment-15633627
 ] 

Seth Hendrickson commented on SPARK-17138:
--

[~yanboliang] Can you mark this as resolved?

> Python API for multinomial logistic regression
> --
>
> Key: SPARK-17138
> URL: https://issues.apache.org/jira/browse/SPARK-17138
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Once [SPARK-7159|https://issues.apache.org/jira/browse/SPARK-7159] is merged, 
> we should make a Python API for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18253) ML Instrumentation logging requires too much manual implementation

2016-11-03 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-18253:


 Summary: ML Instrumentation logging requires too much manual 
implementation
 Key: SPARK-18253
 URL: https://issues.apache.org/jira/browse/SPARK-18253
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Seth Hendrickson
Priority: Minor


[SPARK-14567|https://issues.apache.org/jira/browse/SPARK-14567] introduced an 
{{Instrumentation}} class for standardized logging of ML training sessions. 
Right now, we manually log individual params for each algorithm, partly because 
we don't want to log all params: some params can be huge, and logging them 
could flood the logs. We should find a more sustainable way of logging params 
in ML algos; the current per-algorithm approach does not scale.
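
To make the direction concrete, here is a toy sketch (not Spark's 
{{Instrumentation}} API; all names illustrative) of logging every param 
automatically while omitting oversized values, rather than hand-picking params 
per algorithm:

{code}
// Toy sketch, names illustrative: log all params by default, but omit
// values that would flood the logs.
case class ParamPair(name: String, value: Any)

def logParams(params: Seq[ParamPair], maxLen: Int = 256): Unit =
  params.foreach { p =>
    val s = p.value.toString
    if (s.length <= maxLen) println(s"training: ${p.name}=$s")
    else println(s"training: ${p.name}=<omitted, ${s.length} chars>")
  }

logParams(Seq(
  ParamPair("maxIter", 100),
  ParamPair("initialModel", Seq.fill(100000)(0.0).mkString(","))))
{code}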



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-10-31 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623661#comment-15623661
 ] 

Seth Hendrickson commented on SPARK-15784:
--

This seems like it fits the framework of a feature transformer. We could 
generate a real-valued feature column using the PIC algorithm, where the values 
are just the components of the pseudo-eigenvector. Alternatively, we could 
pipeline a KMeans clustering on the end, but I think it makes more sense to let 
users do that themselves; that's up for debate, though.
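
A heavily simplified toy of the core iteration (illustration only, breeze for 
brevity; real PIC adds a proper stopping rule): row-normalize an affinity 
matrix, then power-iterate with L1 normalization. The entries of {{v}} are the 
pseudo-eigenvector components proposed as a feature column.

{code}
import breeze.linalg.{sum, DenseMatrix, DenseVector}

val w = DenseMatrix(
  (0.0, 1.0, 0.1),
  (1.0, 0.0, 0.1),
  (0.1, 0.1, 0.0))
// row-normalize the affinity matrix
val rowSums = w * DenseVector.ones[Double](w.cols)
for (i <- 0 until w.rows; j <- 0 until w.cols) w(i, j) /= rowSums(i)

var v = DenseVector.fill(w.rows)(1.0 / w.rows)
for (_ <- 0 until 20) {
  val wv = w * v
  v = wv * (1.0 / sum(wv))  // entries stay nonnegative, so sum is the L1 norm
}
println(v)  // similar vertices get similar components; KMeans could cluster them
{code}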

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18060) Avoid unnecessary standardization in multinomial logistic regression training

2016-10-21 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-18060:


 Summary: Avoid unnecessary standardization in multinomial logistic 
regression training
 Key: SPARK-18060
 URL: https://issues.apache.org/jira/browse/SPARK-18060
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Seth Hendrickson


The MLOR implementation in spark.ml trains the model in the standardized 
feature space by dividing the feature values by the column standard deviations 
in each iteration. We perform this computation many times more than is 
necessary in order to achieve a sequential memory access pattern when computing 
the gradients. We can have both sequential access patterns and reduced 
computation if we use a column major layout for the coefficients.
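
A minimal sketch of the layout argument (toy indices, not the actual Spark 
code): with K classes and J features stored column major, the K coefficients of 
feature j are contiguous, so featureStd(j) is read once and applied with 
sequential writes.

{code}
val numClasses = 3
val numFeatures = 4
// column major: coefficient (feature j, class k) lives at j * numClasses + k
val coef = Array.fill(numClasses * numFeatures)(1.0)
val featureStd = Array(1.0, 2.0, 0.5, 4.0)
for (j <- 0 until numFeatures) {
  val invStd = 1.0 / featureStd(j)      // one load per feature
  var k = 0
  while (k < numClasses) {
    coef(j * numClasses + k) *= invStd  // sequential writes within feature j
    k += 1
  }
}
{code}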



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18036) Decision Trees do not handle edge cases

2016-10-20 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-18036:


 Summary: Decision Trees do not handle edge cases
 Key: SPARK-18036
 URL: https://issues.apache.org/jira/browse/SPARK-18036
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Reporter: Seth Hendrickson
Priority: Minor


Decision trees/GBT/RF do not handle edge cases such as constant features or 
empty features. For example:

{code}
val dt = new DecisionTreeRegressor()
val data = Seq(LabeledPoint(1.0, Vectors.dense(Array.empty[Double]))).toDF()
dt.fit(data)

java.lang.UnsupportedOperationException: empty.max
  at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
  at scala.collection.mutable.ArrayOps$ofInt.max(ArrayOps.scala:234)
  at org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:207)
  at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105)
  at org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:93)
  at org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:46)
  at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
  ... 52 elided

{code}

as well as 

{code}
val dt = new DecisionTreeRegressor()
val data = Seq(LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0))).toDF()
dt.fit(data)

java.lang.UnsupportedOperationException: empty.maxBy
  at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236)
  at scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37)
  at org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:846)
{code}
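
A sketch of the kind of up-front validation a fix might add (an assumption 
about the shape of the fix, not the merged patch):

{code}
// Fail fast with a clear message instead of surfacing empty.max / empty.maxBy
// from deep inside metadata construction or split selection.
def validateTreeInput(numFeatures: Int, numExamples: Long): Unit = {
  require(numFeatures > 0, "DecisionTree requires number of features > 0")
  require(numExamples > 0L, "DecisionTree requires a non-empty training dataset")
}
{code}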



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18019) Log instrumentation in GBTs

2016-10-19 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-18019:


 Summary: Log instrumentation in GBTs
 Key: SPARK-18019
 URL: https://issues.apache.org/jira/browse/SPARK-18019
 Project: Spark
  Issue Type: Sub-task
Reporter: Seth Hendrickson


Sub-task for adding instrumentation to GBTs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17941) Logistic regression test suites should use weights when comparing to glmnet

2016-10-14 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-17941:


 Summary: Logistic regression test suites should use weights when 
comparing to glmnet
 Key: SPARK-17941
 URL: https://issues.apache.org/jira/browse/SPARK-17941
 Project: Spark
  Issue Type: Test
  Components: ML
Reporter: Seth Hendrickson
Priority: Minor


The logistic regression suite currently has many test cases comparing against 
R's glmnet. Both libraries support weights, and to make the testing of weights 
in Spark LOR more robust, we should add weights to all of these test cases. The 
current weight testing is quite minimal.
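
A sketch of the invariance these tests usually lean on (assuming a spark-shell 
session with implicits in scope; data made up): fitting with integer instance 
weights should match fitting on data where each row is replicated weight-many 
times.

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

val weighted = Seq((1.0, 2.0, Vectors.dense(1.0)), (0.0, 1.0, Vectors.dense(2.0)))
  .toDF("label", "weight", "features")
val replicated = Seq((1.0, Vectors.dense(1.0)), (1.0, Vectors.dense(1.0)),
  (0.0, Vectors.dense(2.0))).toDF("label", "features")

val m1 = new LogisticRegression().setWeightCol("weight").fit(weighted)
val m2 = new LogisticRegression().fit(replicated)
// expect m1.coefficients ~== m2.coefficients up to tolerance
{code}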



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17906) MulticlassClassificationEvaluator support target label

2016-10-13 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572154#comment-15572154
 ] 

Seth Hendrickson commented on SPARK-17906:
--

We are adding model summaries that would expose some of this behavior. For 
example, see [https://github.com/apache/spark/pull/15435]. That PR will likely 
expose some of the functionality being requested here.

> MulticlassClassificationEvaluator support target label
> --
>
> Key: SPARK-17906
> URL: https://issues.apache.org/jira/browse/SPARK-17906
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> In practice, I sometimes only focus on the metric of one specific label.
> For example, in CTR prediction, I usually only care about the F1 of the 
> positive class.
> In sklearn, this is supported:
> {code}
> >>> from sklearn.metrics import classification_report
> >>> y_true = [0, 1, 2, 2, 2]
> >>> y_pred = [0, 0, 2, 2, 1]
> >>> target_names = ['class 0', 'class 1', 'class 2']
> >>> print(classification_report(y_true, y_pred, target_names=target_names))
>              precision    recall  f1-score   support
>     class 0       0.50      1.00      0.67         1
>     class 1       0.00      0.00      0.00         1
>     class 2       1.00      0.67      0.80         3
> avg / total       0.70      0.60      0.61         5
> {code}
> Now, ml only supports `weightedXXX`, so I think there may be room for 
> improvement.
> The API may be designed like this:
> {code}
> val dataset = ...
> val evaluator = new MulticlassClassificationEvaluator
> evaluator.setMetricName("f1")
> evaluator.evaluate(dataset)   // weightedF1 of all classes
> evaluator.setTarget(0.0).setMetricName("f1")
> evaluator.evaluate(dataset)   // F1 of class "0"
> {code}
> what's your opinion? [~yanboliang][~josephkb][~sethah][~srowen] 
> If this is useful and acceptable, I'm happy to work on this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17772) Add helper testing methods for instance weighting

2016-10-10 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564174#comment-15564174
 ] 

Seth Hendrickson commented on SPARK-17772:
--

I'm working on this.

> Add helper testing methods for instance weighting
> -
>
> Key: SPARK-17772
> URL: https://issues.apache.org/jira/browse/SPARK-17772
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> More and more ML algos are accepting instance weights. We keep replicating 
> code to test instance weighting in every test suite, which will get out of 
> hand rather quickly. We can and should implement some generic instance weight 
> test helper methods so that we can reduce duplicated code and standardize 
> these tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9478) Add class weights to Random Forest

2016-10-10 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563919#comment-15563919
 ] 

Seth Hendrickson commented on SPARK-9478:
-

I'm going to revive this, and hopefully submit a PR soon.

> Add class weights to Random Forest
> --
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class 
> weights. Class weights are important when there is imbalanced training data 
> or the evaluation metric of a classifier is imbalanced (e.g. true positive 
> rate at some false positive threshold). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2016-10-10 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563729#comment-15563729
 ] 

Seth Hendrickson commented on SPARK-17139:
--

[~WeichenXu123] Status?

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Add model summary to multinomial logistic regression using same interface as 
> in other ML models.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17140) Add initial model to MultinomialLogisticRegression

2016-10-10 Thread Seth Hendrickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seth Hendrickson resolved SPARK-17140.
--
Resolution: Invalid

MultinomialLogisticRegression was eliminated in 
[SPARK-17163|https://issues.apache.org/jira/browse/SPARK-17163].

> Add initial model to MultinomialLogisticRegression
> --
>
> Key: SPARK-17140
> URL: https://issues.apache.org/jira/browse/SPARK-17140
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> We should add initial model support to Multinomial logistic regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17824) QR solver for WeightedLeastSquares

2016-10-07 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556237#comment-15556237
 ] 

Seth Hendrickson commented on SPARK-17824:
--

Thank you for clarifying

> QR solver for WeightedLeastSquares
> --
>
> Key: SPARK-17824
> URL: https://issues.apache.org/jira/browse/SPARK-17824
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Cholesky decomposition is unstable for near-singular and rank-deficient 
> matrices and only works on positive definite matrices, which cannot be 
> guaranteed in all cases; it is often used when the matrix A is very large and 
> sparse because it is faster to compute. QR decomposition has better numerical 
> properties than Cholesky and works on matrices that are not positive 
> definite. Spark MLlib {{WeightedLeastSquares}} currently uses Cholesky 
> decomposition to solve the normal equations; we should also support or move 
> to a QR solver for better stability. I'm preparing to send a PR.
> cc [~dbtsai] [~sethah]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17824) QR solver for WeightedLeastSquares

2016-10-07 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1385#comment-1385
 ] 

Seth Hendrickson edited comment on SPARK-17824 at 10/7/16 3:42 PM:
---

[~yanboliang] Can you please post your design plans? This is almost certainly 
going to conflict with the PR I'm about to send for 
[SPARK-17748|https://issues.apache.org/jira/browse/SPARK-17748]. In that PR, I 
have implemented a pluggable solver for the normal equations, I posted a bit of 
detail on the JIRA. In fact, if it gets merged we will be able to deal with 
singular matrices by running L-BFGS on the normal equations on the driver 
(one-pass). It may not be the most elegant solution, but it is a byproduct of 
implementing the OWL-QN solver. I'd like to hear more about your patch to 
understand how the two fit together, what conflicts there are, and how we need 
to coordinate.

In fact, I may have already written some of the test cases you will need to 
write, so maybe we can share them :)

Thanks!


was (Author: sethah):
[~yanboliang] Can you please post your design plans? This is almost certainly 
going to conflict with the PR I'm about to send for 
[SPARK-17748|https://issues.apache.org/jira/browse/SPARK-17748]. In that PR, I 
have implemented a pluggable solver for the normal equations, I posted a bit of 
detail on the JIRA. In fact, if it gets merged we will be able to deal with 
singular matrices by running L-BFGS on the normal equations on the driver 
(one-pass). It may not be the most elegant solution, but it is a byproduct of 
implementing the OWL-QN solver. I'd like to hear more about your patch to 
understand how the two fit together, what conflicts there are, and how we need 
to coordinate.

Thanks!

> QR solver for WeightedLeastSquares
> --
>
> Key: SPARK-17824
> URL: https://issues.apache.org/jira/browse/SPARK-17824
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Cholesky decomposition is unstable for near-singular and rank-deficient 
> matrices and only works on positive definite matrices, which cannot be 
> guaranteed in all cases; it is often used when the matrix A is very large and 
> sparse because it is faster to compute. QR decomposition has better numerical 
> properties than Cholesky and works on matrices that are not positive 
> definite. Spark MLlib {{WeightedLeastSquares}} currently uses Cholesky 
> decomposition to solve the normal equations; we should also support or move 
> to a QR solver for better stability. I'm preparing to send a PR.
> cc [~dbtsai] [~sethah]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17824) QR solver for WeightedLeastSquares

2016-10-07 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1385#comment-1385
 ] 

Seth Hendrickson commented on SPARK-17824:
--

[~yanboliang] Can you please post your design plans? This is almost certainly 
going to conflict with the PR I'm about to send for 
[SPARK-17748|https://issues.apache.org/jira/browse/SPARK-17748]. In that PR, I 
have implemented a pluggable solver for the normal equations, I posted a bit of 
detail on the JIRA. In fact, if it gets merged we will be able to deal with 
singular matrices by running L-BFGS on the normal equations on the driver 
(one-pass). It may not be the most elegant solution, but it is a byproduct of 
implementing the OWL-QN solver. I'd like to hear more about your patch to 
understand how the two fit together, what conflicts there are, and how we need 
to coordinate.

Thanks!

> QR solver for WeightedLeastSquares
> --
>
> Key: SPARK-17824
> URL: https://issues.apache.org/jira/browse/SPARK-17824
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Cholesky decomposition is unstable for near-singular and rank-deficient 
> matrices and only works on positive definite matrices, which cannot be 
> guaranteed in all cases; it is often used when the matrix A is very large and 
> sparse because it is faster to compute. QR decomposition has better numerical 
> properties than Cholesky and works on matrices that are not positive 
> definite. Spark MLlib {{WeightedLeastSquares}} currently uses Cholesky 
> decomposition to solve the normal equations; we should also support or move 
> to a QR solver for better stability. I'm preparing to send a PR.
> cc [~dbtsai] [~sethah]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17789) Don't force users to set k for KMeans if initial model is set

2016-10-05 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15550601#comment-15550601
 ] 

Seth Hendrickson commented on SPARK-17789:
--

When the model is fit, the initial model may have some number of centers (say, 
5), but k defaults to 1, so the check in the fit method will throw an exception.
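
A toy sketch of the clash (values taken from the comment above, not the exact 
Spark check):

{code}
val initialModelK = 5  // clusterCenters carried by the initial model
val kParam = 1         // default of k when the user never calls setK
// mirrors the train-time failure: throws IllegalArgumentException
require(kParam == initialModelK,
  s"initial model has $initialModelK centers but k is $kParam")
{code}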

> Don't force users to set k for KMeans if initial model is set
> -
>
> Key: SPARK-17789
> URL: https://issues.apache.org/jira/browse/SPARK-17789
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> In the initial implementation of initialModel, we allow users to set the 
> initial model with a KMeansModel that has a different {{k}} than the current 
> model. We throw an error at train time if the two are mismatched. This means 
> that the following code throws a runtime exception:
> {code}
> val kmeansModel = new KMeans().setInitialModel(model).fit(df)
> {code}
> We should discuss this behavior and decide whether we should require users to 
> set both the initial model and k, whether we should alter k when the initial 
> model is set, or whether we should keep the current behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17789) Don't force users to set k for KMeans if initial model is set

2016-10-05 Thread Seth Hendrickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seth Hendrickson updated SPARK-17789:
-
Description: 
In the initial implementation of initialModel, we allow users to set the initial 
model with a KMeansModel that has a different {{k}} than the current model. We 
throw an error at train time if the two are mismatched. This means that the 
following code throws a runtime exception:

{code}
val kmeansModel = new KMeans().setInitialModel(model).fit(df)
{code}

We should discuss this behavior and decide whether we should require users to 
set both the initial model and k, whether we should alter k when the initial 
model is set, or whether we should keep the current behavior.

  was:
In the initial implementation of initialModel, we allow users to set the initial 
model with a KMeansModel that has a different {{k}} than the current model. We 
throw an error at train time if the two are mismatched. This means that the 
following code throws a runtime exception:

{{code}}
val kmeansModel = new KMeans().setInitialModel(model).fit(df)
{{code}}

We should discuss this behavior and decide whether we should require users to 
set both the initial model and k, whether we should alter k when the initial 
model is set, or whether we should keep the current behavior.


> Don't force users to set k for KMeans if initial model is set
> -
>
> Key: SPARK-17789
> URL: https://issues.apache.org/jira/browse/SPARK-17789
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> In the initial implementation of initialModel, we allow users to set the 
> initial model with a KMeansModel that has a different {{k}} than the current 
> model. We throw an error at train time if the two are mismatched. This means 
> that the following code throws a runtime exception:
> {code}
> val kmeansModel = new KMeans().setInitialModel(model).fit(df)
> {code}
> We should discuss this behavior and decide whether we should require users to 
> set both the initial model and k, whether we should alter k when the initial 
> model is set, or whether we should keep the current behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17792) L-BFGS solver for linear regression does not accept general numeric label column types

2016-10-05 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-17792:


 Summary: L-BFGS solver for linear regression does not accept 
general numeric label column types
 Key: SPARK-17792
 URL: https://issues.apache.org/jira/browse/SPARK-17792
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Seth Hendrickson
Priority: Minor


There's a bug in accepting numeric types for linear regression. We cast the 
label to {{DoubleType}} in one spot where we use the normal solver, but not for 
the l-bfgs solver. The following reproduces the problem:

{code}
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.{Vector, DenseVector, Vectors}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types._

val df = Seq(LabeledPoint(1.0, Vectors.dense(1.0))).toDF().withColumn("weight", lit(1.0).cast(LongType))
val lr = new LinearRegression().setSolver("l-bfgs").setWeightCol("weight")
lr.fit(df)
{code}
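
A possible workaround until a fix lands (sketch, reusing {{df}} and {{lr}} from 
the snippet above): cast the offending numeric column to {{DoubleType}} 
yourself.

{code}
import org.apache.spark.sql.functions.col

val fixed = df.withColumn("weight", col("weight").cast(DoubleType))
lr.fit(fixed)  // succeeds once the column is a DoubleType
{code}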



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17792) L-BFGS solver for linear regression does not accept general numeric label column types

2016-10-05 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15550249#comment-15550249
 ] 

Seth Hendrickson commented on SPARK-17792:
--

I'll have a PR shortly.

> L-BFGS solver for linear regression does not accept general numeric label 
> column types
> --
>
> Key: SPARK-17792
> URL: https://issues.apache.org/jira/browse/SPARK-17792
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> There's a bug in accepting numeric types for linear regression. We cast the 
> label to {{DoubleType}} in one spot where we use the normal solver, but not 
> for the l-bfgs solver. The following reproduces the problem:
> {code}
> import org.apache.spark.ml.feature.LabeledPoint
> import org.apache.spark.ml.linalg.{Vector, DenseVector, Vectors}
> import org.apache.spark.ml.regression.LinearRegression
> import org.apache.spark.sql.functions.lit
> import org.apache.spark.sql.types._
> val df = Seq(LabeledPoint(1.0, Vectors.dense(1.0))).toDF().withColumn("weight", lit(1.0).cast(LongType))
> val lr = new LinearRegression().setSolver("l-bfgs").setWeightCol("weight")
> lr.fit(df)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17789) Don't force users to set k for KMeans if initial model is set

2016-10-05 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-17789:


 Summary: Don't force users to set k for KMeans if initial model is 
set
 Key: SPARK-17789
 URL: https://issues.apache.org/jira/browse/SPARK-17789
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Seth Hendrickson
Priority: Minor


In the initial implementation of initialModel, we allow users to set the initial 
model with a KMeansModel that has a different {{k}} than the current model. We 
throw an error at train time if the two are mismatched. This means that the 
following code throws a runtime exception:

{code}
val kmeansModel = new KMeans().setInitialModel(model).fit(df)
{code}

We should discuss this behavior and decide whether we should require users to 
set both the initial model and k, whether we should alter k when the initial 
model is set, or whether we should keep the current behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17772) Add helper testing methods for instance weighting

2016-10-03 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-17772:


 Summary: Add helper testing methods for instance weighting
 Key: SPARK-17772
 URL: https://issues.apache.org/jira/browse/SPARK-17772
 Project: Spark
  Issue Type: Test
  Components: ML
Reporter: Seth Hendrickson
Priority: Minor


More and more ML algos are accepting instance weights. We keep replicating code 
to test instance weighting in every test suite, which will get out of hand 
rather quickly. We can and should implement some generic instance weight test 
helper methods so that we can reduce duplicated code and standardize these 
tests.
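
One possible shape for the shared helper (hypothetical signature, not an 
existing Spark test utility): fit on weighted data and on equivalent replicated 
data, then run a caller-supplied comparison.

{code}
import org.apache.spark.sql.DataFrame

// hypothetical helper: M is the model type produced by the estimator
def testInstanceWeighting[M](
    weighted: DataFrame,
    replicated: DataFrame,
    fit: DataFrame => M)(
    check: (M, M) => Unit): Unit = {
  check(fit(weighted), fit(replicated))
}
{code}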



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties

2016-09-30 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536781#comment-15536781
 ] 

Seth Hendrickson edited comment on SPARK-17748 at 9/30/16 7:16 PM:
---

I am working on this currently. The basic plan is to refactor WLS so that it 
has a pluggable solver for the normal equations. We can implement a new 
interface like 

{code:java}
trait NormalEquationSolver {
  def solve(
  bBar: Double,
  bbBar: Double,
  abBar: DenseVector,
  aaBar: DenseVector,
  aBar: DenseVector): NormalEquationSolution
}
class CholeskySolver extends NormalEquationSolver
class QuasiNewtonSolver extends NormalEquationSolver
{code}

If others have thoughts on the design please comment, otherwise I will continue 
working on this and submit a PR reasonably soon.

cc [~srowen] [~yanboliang] [~dbtsai]


was (Author: sethah):
I am working on this currently. The basic plan is to refactor WLS so that it 
has a pluggable solver for the normal equations. We can implement a new 
interface like 

{code:java}
trait NormalEquationSolver {
  def solve(
  bBar: Double,
  bbBar: Double,
  abBar: DenseVector,
  aaBar: DenseVector,
  aBar: DenseVector): NormalEquationSolution
}
class CholeskySolver extends NormalEquationSolver
class QuasiNewtonSolver extends NormalEquationSolver
{code}

If others have thoughts on the design please comment, otherwise I will continue 
working on this and submit a PR reasonably soon.

cc [~srowen] [~yanboliang]

> One-pass algorithm for linear regression with L1 and elastic-net penalties
> --
>
> Key: SPARK-17748
> URL: https://issues.apache.org/jira/browse/SPARK-17748
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Seth Hendrickson
>
> Currently linear regression uses weighted least squares to solve the normal 
> equations locally on the driver when the dimensionality is small (<4096). 
> Weighted least squares uses a Cholesky decomposition to solve the problem 
> with L2 regularization (which has a closed-form solution). We can support 
> L1/elastic-net penalties by solving the equations locally using the OWL-QN 
> solver.
> Also note that Cholesky does not handle singular covariance matrices, but 
> L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch 
> can also add support for singular covariance matrices by adding L-BFGS as 
> well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties

2016-09-30 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536781#comment-15536781
 ] 

Seth Hendrickson edited comment on SPARK-17748 at 9/30/16 7:16 PM:
---

I am working on this currently. The basic plan is to refactor WLS so that it 
has a pluggable solver for the normal equations. We can implement a new 
interface like 

{code:java}
trait NormalEquationSolver {
  def solve(
  bBar: Double,
  bbBar: Double,
  abBar: DenseVector,
  aaBar: DenseVector,
  aBar: DenseVector): NormalEquationSolution
}
class CholeskySolver extends NormalEquationSolver
class QuasiNewtonSolver extends NormalEquationSolver
{code}

If others have thoughts on the design please comment, otherwise I will continue 
working on this and submit a PR reasonably soon.

cc [~srowen] [~yanboliang]


was (Author: sethah):
I am working on this currently. The basic plan is to refactor WLS so that it 
has a pluggable solver for the normal equations. We can implement a new 
interface like 

{code:java}
trait NormalEquationSolver {
  def solve(
  bBar: Double,
  bbBar: Double,
  abBar: DenseVector,
  aaBar: DenseVector,
  aBar: DenseVector): NormalEquationSolution
}
class CholeskySolver extends NormalEquationSolver
class QuasiNewtonSolver extends NormalEquationSolver
{code}

If others have thoughts on the design please comment, otherwise I will continue 
working on this and submit a PR reasonably soon.

> One-pass algorithm for linear regression with L1 and elastic-net penalties
> --
>
> Key: SPARK-17748
> URL: https://issues.apache.org/jira/browse/SPARK-17748
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Seth Hendrickson
>
> Currently linear regression uses weighted least squares to solve the normal 
> equations locally on the driver when the dimensionality is small (<4096). 
> Weighted least squares uses a Cholesky decomposition to solve the problem 
> with L2 regularization (which has a closed-form solution). We can support 
> L1/elastic-net penalties by solving the equations locally using the OWL-QN 
> solver.
> Also note that Cholesky does not handle singular covariance matrices, but 
> L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch 
> can also add support for singular covariance matrices by adding L-BFGS as 
> well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties

2016-09-30 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536781#comment-15536781
 ] 

Seth Hendrickson commented on SPARK-17748:
--

I am working on this currently. The basic plan is to refactor WLS so that it 
has a pluggable solver for the normal equations. We can implement a new 
interface like 

{code:java}
trait NormalEquationSolver {
  def solve(
  bBar: Double,
  bbBar: Double,
  abBar: DenseVector,
  aaBar: DenseVector,
  aBar: DenseVector): NormalEquationSolution
}
class CholeskySolver extends NormalEquationSolver
class QuasiNewtonSolver extends NormalEquationSolver
{code}

If others have thoughts on the design please comment, otherwise I will continue 
working on this and submit a PR reasonably soon.

> One-pass algorithm for linear regression with L1 and elastic-net penalties
> --
>
> Key: SPARK-17748
> URL: https://issues.apache.org/jira/browse/SPARK-17748
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Seth Hendrickson
>
> Currently linear regression uses weighted least squares to solve the normal 
> equations locally on the driver when the dimensionality is small (<4096). 
> Weighted least squares uses a Cholesky decomposition to solve the problem 
> with L2 regularization (which has a closed-form solution). We can support 
> L1/elastic-net penalties by solving the equations locally using the OWL-QN 
> solver.
> Also note that Cholesky does not handle singular covariance matrices, but 
> L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch 
> can also add support for singular covariance matrices by adding L-BFGS as 
> well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties

2016-09-30 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-17748:


 Summary: One-pass algorithm for linear regression with L1 and 
elastic-net penalties
 Key: SPARK-17748
 URL: https://issues.apache.org/jira/browse/SPARK-17748
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Seth Hendrickson


Currently linear regression uses weighted least squares to solve the normal 
equations locally on the driver when the dimensionality is small (<4096). 
Weighted least squares uses a Cholesky decomposition to solve the problem with 
L2 regularization (which has a closed-form solution). We can support 
L1/elastic-net penalties by solving the equations locally using the OWL-QN 
solver.

Also note that Cholesky does not handle singular covariance matrices, but 
L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch can 
also add support for singular covariance matrices by adding L-BFGS as well.
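
A sketch of the local solve (assumptions: breeze's OWLQN; {{ata}} = X^T W X and 
{{atb}} = X^T W y already aggregated in a single pass over the data): minimize 
f(x) = 0.5 x^T A x - b^T x with an L1 penalty on the driver.

{code}
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
import breeze.optimize.{DiffFunction, OWLQN}

def solveNormalWithOwlqn(ata: BDM[Double], atb: BDV[Double], l1Reg: Double): BDV[Double] = {
  val cost = new DiffFunction[BDV[Double]] {
    override def calculate(x: BDV[Double]): (Double, BDV[Double]) = {
      val ax = ata * x
      (0.5 * (x dot ax) - (atb dot x), ax - atb)  // (objective, gradient)
    }
  }
  // maxIter = 100, memory m = 10, per-coordinate L1 strength, tolerance
  val owlqn = new OWLQN[Int, BDV[Double]](100, 10, (_: Int) => l1Reg, 1e-6)
  owlqn.minimize(cost, BDV.zeros[Double](atb.length))
}
{code}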



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2016-09-23 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15515529#comment-15515529
 ] 

Seth Hendrickson edited comment on SPARK-17134 at 9/23/16 6:09 AM:
---

This makes sense. In my initial testing I found that having to standardize the 
features in every iteration takes a non-trivial amount of time. Still, you 
mentioned the desire to not cache the standardized dataset since it can create 
unnecessary memory overhead. One solution is to allow the users to specify that 
their data has already been standardized, and then we don't have to perform the 
extra divisions in the update method. Alternatively, we could do as you suggest 
above, but store the coefficients in column major order in order to still 
maximize cache hits.

We'll need some testing for both cases to truly understand this.
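
As a small illustration of the level 2 BLAS point (breeze standing in for 
Spark's private BLAS wrapper; numbers made up): with the K x J coefficients 
held as one matrix, the margins of all K classes for an instance come from a 
single gemv instead of K separate dot products.

{code}
import breeze.linalg.{DenseMatrix, DenseVector}

val numClasses = 3
val coefMat = new DenseMatrix(numClasses, 2, Array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))  // column major
val features = DenseVector(1.0, 2.0)
val margins = coefMat * features  // one level 2 BLAS call (gemv)
println(margins)
{code}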


was (Author: sethah):
This makes sense. In my initial testing I found that having to standardize the 
features in every iteration takes a non-trivial amount of time. Still, you 
mentioned the desire to not cache the standardized dataset since it can create 
unnecessary memory overhead. One solution is to allow the users to specify that 
there data has already been standardized, and then we don't have to perform the 
extra divisions in the update method. Alternatively, we could do as you suggest 
above, but store the coefficients in column major order in order to still 
maximize cache hits.

We'll need some testing for both cases to truly understand this.

> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Multinomial logistic regression uses the {{LogisticAggregator}} class for 
> gradient updates. We should look into refactoring MLOR to use level 2 BLAS 
> operations for the updates. Performance testing should be done to show 
> improvements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2016-09-23 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15515529#comment-15515529
 ] 

Seth Hendrickson commented on SPARK-17134:
--

This makes sense. In my initial testing I found that having to standardize the 
features in every iteration takes a non-trivial amount of time. Still, you 
mentioned the desire to not cache the standardized dataset since it can create 
unnecessary memory overhead. One solution is to allow the users to specify that 
their data has already been standardized, and then we don't have to perform the 
extra divisions in the update method. Alternatively, we could do as you suggest 
above, but store the coefficients in column major order in order to still 
maximize cache hits.

We'll need some testing for both cases to truly understand this.

> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Multinomial logistic regression uses the {{LogisticAggregator}} class for 
> gradient updates. We should look into refactoring MLOR to use level 2 BLAS 
> operations for the updates. Performance testing should be done to show 
> improvements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2016-09-21 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510198#comment-15510198
 ] 

Seth Hendrickson commented on SPARK-17134:
--

Hmm, it would be nice to see this versus the old MLOR in the RDD API, just as a 
sanity check. I conducted performance testing against mllib initially, though, 
so there shouldn't be any regressions.

> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Multinomial logistic regression uses the {{LogisticAggregator}} class for 
> gradient updates. We should look into refactoring MLOR to use level 2 BLAS 
> operations for the updates. Performance testing should be done to show 
> improvements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17471) Add compressed method for Matrix class

2016-09-12 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484775#comment-15484775
 ] 

Seth Hendrickson commented on SPARK-17471:
--

[~yanboliang] Do you have any updates on this? We need to make implementing the 
{{compressed}} method for matrices a high priority. I can look into 
implementing it, but I don't want to duplicate work. Thanks!

> Add compressed method for Matrix class
> --
>
> Key: SPARK-17471
> URL: https://issues.apache.org/jira/browse/SPARK-17471
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>
> Vectors in Spark have a {{compressed}} method which selects either a sparse 
> or dense representation, whichever minimizes storage requirements. Matrices 
> should also have this method, which is now explicitly needed in 
> {{LogisticRegression}} since we have implemented multiclass regression.
> The compressed method should also give the option to store row major or 
> column major, and if nothing is specified it should select the layout with 
> the lower storage requirement (for the sparse representation).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17476) Proper handling for unseen labels in logistic regression training.

2016-09-09 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-17476:


 Summary: Proper handling for unseen labels in logistic regression 
training.
 Key: SPARK-17476
 URL: https://issues.apache.org/jira/browse/SPARK-17476
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Seth Hendrickson


Now that logistic regression supports multiclass, it is possible to train on 
data that has {{K}} classes where one or more of the classes never appears in 
the training data. For example,

{code}
(0.0, x1)
(2.0, x2)
...
{code}

Currently, logistic regression assumes that the outcome classes in the above 
dataset have three levels: {{0, 1, 2}}. Since label 1 never appears, it should 
never be predicted. In theory, the coefficients should be zero and the 
intercept should be negative infinity. This can cause problems since we center 
the intercepts after training.

We should discuss whether or not the intercepts actually tend to -infinity in 
practice, and whether or not we should even include them in training. 
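
A toy illustration of the intercept claim (intercept-only multinomial model; 
not Spark code): the fitted intercept for class k behaves like log(count_k / n) 
up to a shared constant, so an unseen class pushes its intercept toward 
negative infinity.

{code}
val counts = Array(1.0, 0.0, 1.0)  // class 1 never appears
val n = counts.sum
val intercepts = counts.map(c => math.log(c / n))  // (log 0.5, -Infinity, log 0.5)
println(intercepts.mkString(", "))
{code}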



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17471) Add compressed method for Matrix class

2016-09-09 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477841#comment-15477841
 ] 

Seth Hendrickson commented on SPARK-17471:
--

[~yanboliang] I guess it can be seen as a duplicate, but really there are two 
separate tasks: 1) add a {{compressed}} method to the matrix library in Spark, 
which is non-trivial; and 2) add a mechanism inside of MLOR to use the 
compressed method, and decide how to deal with flattening the sparse matrix 
into a sparse vector when the binomial family is used.

We can keep the JIRAs separate, or do them both together. I see them as 
separate tasks.

> Add compressed method for Matrix class
> --
>
> Key: SPARK-17471
> URL: https://issues.apache.org/jira/browse/SPARK-17471
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>
> Vectors in Spark have a {{compressed}} method which selects either a sparse 
> or dense representation, whichever minimizes storage requirements. Matrices 
> should also have this method, which is now explicitly needed in 
> {{LogisticRegression}} since we have implemented multiclass regression.
> The compressed method should also give the option to store row major or 
> column major, and if nothing is specified it should select the layout with 
> the lower storage requirement (for the sparse representation).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17471) Add compressed method for Matrix class

2016-09-09 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-17471:


 Summary: Add compressed method for Matrix class
 Key: SPARK-17471
 URL: https://issues.apache.org/jira/browse/SPARK-17471
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Seth Hendrickson


Vectors in Spark have a {{compressed}} method which selects either a sparse or 
dense representation, whichever minimizes storage requirements. Matrices should 
also have this method, which is now explicitly needed in {{LogisticRegression}} 
since we have implemented multiclass regression.

The compressed method should also give the option to store row major or column 
major, and if nothing is specified it should select the layout with the lower 
storage requirement (for the sparse representation).
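
Rough storage arithmetic behind such a heuristic (an assumption about the rule, 
not Spark's exact formula): dense stores numRows * numCols doubles, while CSC 
stores one double plus one int row index per nonzero, plus numCols + 1 int 
column pointers.

{code}
def denseBytes(numRows: Int, numCols: Int): Long = 8L * numRows * numCols
def sparseCscBytes(numCols: Int, nnz: Long): Long = 12L * nnz + 4L * (numCols + 1)

def preferSparse(numRows: Int, numCols: Int, nnz: Long): Boolean =
  sparseCscBytes(numCols, nnz) < denseBytes(numRows, numCols)

println(preferSparse(1000, 1000, 5000))  // true: only 0.5% of entries are nonzero
{code}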



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


