[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20591273 --- Diff: python/pyspark/mllib/tree.py --- @@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo, model.predict(rdd).collect

[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20628796 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala --- @@ -45,146 +43,92 @@ import org.apache.spark.storage.StorageLevel

[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20629011 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala --- @@ -40,151 +39,98 @@ import org.apache.spark.storage.StorageLevel

[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20629126 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala --- @@ -387,7 +386,7 @@ object RandomForest extends Serializable with Logging

[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20629452 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala --- @@ -0,0 +1,178 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20629451 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala --- @@ -40,151 +39,98 @@ import org.apache.spark.storage.StorageLevel

[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63766642 @mengxr Thanks for the updates! Just added a few small comments. Other than those, LGTM --- If your project is set up for it, you can reply to this email and have

[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20676259 --- Diff: python/pyspark/mllib/tree.py --- @@ -181,8 +182,206 @@ def trainRegressor(data, categoricalFeaturesInfo, model.predict(rdd).collect

[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20676265 --- Diff: python/pyspark/mllib/tree.py --- @@ -181,8 +182,206 @@ def trainRegressor(data, categoricalFeaturesInfo, model.predict(rdd).collect

[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3320#discussion_r20676263 --- Diff: python/pyspark/mllib/tree.py --- @@ -181,8 +182,206 @@ def trainRegressor(data, categoricalFeaturesInfo, model.predict(rdd).collect

[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63877350 @davies Thanks for adding this API! I made a few small comments. Other than those, LGTM

[GitHub] spark pull request: [SPARK-4439] [MLlib] add python api for random...

2014-11-20 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3320#issuecomment-63881779 LGTM

[GitHub] spark pull request: [SPARK-4531] [MLlib] cache serialized java obj...

2014-11-20 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3397#issuecomment-63931044 It might be good to cache for decision tree too since it makes a couple of passes through the original RDD (before it creates the TreePoint RDD).

[GitHub] spark pull request: [SPARK-4531] [MLlib] cache serialized java obj...

2014-11-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3397#discussion_r20739035 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -74,10 +74,28 @@ class PythonMLLibAPI extends Serializable

[GitHub] spark pull request: [SPARK-4531] [MLlib] cache serialized java obj...

2014-11-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3397#discussion_r20739110 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -526,10 +515,15 @@ class PythonMLLibAPI extends Serializable

[GitHub] spark pull request: [SPARK-4531] [MLlib] cache serialized java obj...

2014-11-21 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3397#issuecomment-64031697 LGTM @pwendell had questions about whether we should allow the user to specify (in the Python call) whether they want to use caching. CC @mengxr

[GitHub] spark pull request: [SPARK-4562] [MLlib] speedup vector

2014-11-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3420#discussion_r20770274 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -749,7 +759,13 @@ private[spark] object SerDe extends

[GitHub] spark pull request: [SPARK-4562] [MLlib] speedup vector

2014-11-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3420#discussion_r20770273 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -749,7 +759,13 @@ private[spark] object SerDe extends

[GitHub] spark pull request: [SPARK-4562] [MLlib] speedup vector

2014-11-23 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3420#issuecomment-64144537 For the record, I ran some tests with this and confirmed the speedups. This PR puts test time prediction for GLMs at the same speed as the Spark 1.1 release

[GitHub] spark pull request: [SPARK-4562] [MLlib] speedup vector

2014-11-23 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3420#issuecomment-64156125 By the way, my tests were with dense vectors, not sparse.

[GitHub] spark pull request: [MLLIB] [WIP] [SPARK-3702] Standardizing abstr...

2014-11-24 Thread jkbradley
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/3427 [MLLIB] [WIP] [SPARK-3702] Standardizing abstractions and developer API for prediction This is a WIP effort to standardize abstractions and developer API for prediction tasks (classification

[GitHub] spark pull request: [SPARK-3251][MLLIB]: Clarify learning interfac...

2014-11-24 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2137#issuecomment-64165435 @BigCrunsh I just submitted a WIP for the new MLlib API. Apologies for the slow development, but I'd like to try to get your PR in to improve the original MLlib API

[GitHub] spark pull request: [SPARK-4583] [mllib] LogLoss for GradientBoost...

2014-11-24 Thread jkbradley
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/3439 [SPARK-4583] [mllib] LogLoss for GradientBoostedTrees fix + doc updates Currently, the LogLoss used by GradientBoostedTrees has 2 issues: * the gradient (and therefore loss) does not match

[GitHub] spark pull request: [SPARK-4583] [mllib] LogLoss for GradientBoost...

2014-11-25 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3439#discussion_r20885397 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/SquaredError.scala --- @@ -49,18 +48,17 @@ object SquaredError extends Loss

[GitHub] spark pull request: [SPARK-4583] [mllib] LogLoss for GradientBoost...

2014-11-25 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3439#issuecomment-64474382 I just pushed an update which includes: * removing the 1/2 from SquaredError. This also required updating the test suite since it effectively doubles the gradient
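
The remark above that dropping the 1/2 "effectively doubles the gradient" can be checked with a small standalone sketch. This is not the MLlib `SquaredError` code; the function names and the `half` flag are hypothetical, and the loss is assumed to be the plain squared difference between prediction and label.

```python
def squared_error(prediction, label, half=False):
    """Squared-error loss; half=True keeps the 1/2 factor (assumed old form)."""
    scale = 0.5 if half else 1.0
    return scale * (prediction - label) ** 2


def squared_error_gradient(prediction, label, half=False):
    """Derivative of squared_error with respect to the prediction."""
    scale = 1.0 if half else 2.0
    return scale * (prediction - label)


# With the 1/2 factor the gradient is (prediction - label);
# without it the gradient is 2 * (prediction - label).
print(squared_error_gradient(3.0, 1.0, half=True))
print(squared_error_gradient(3.0, 1.0, half=False))
```

Any test that hard-codes expected gradient values under one convention would need its expectations scaled by 2 under the other, which is consistent with the test-suite update mentioned in the comment.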

[GitHub] spark pull request: [SPARK-4604][MLLIB] make MatrixFactorizationMo...

2014-11-25 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3459#discussion_r20901054 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -28,13 +28,16 @@ import

[GitHub] spark pull request: [SPARK-4604][MLLIB] make MatrixFactorizationMo...

2014-11-25 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3459#discussion_r20901060 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -28,13 +28,16 @@ import

[GitHub] spark pull request: [SPARK-4604][MLLIB] make MatrixFactorizationMo...

2014-11-25 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3459#discussion_r20901112 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -28,13 +28,16 @@ import

[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

2014-11-25 Thread jkbradley
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/3461 [SPARK-4580] [SPARK-4610] [mllib] Documentation for tree ensembles + DecisionTree API fix Major changes: * Added documentation for tree ensembles * Added examples for tree ensembles

[GitHub] spark pull request: [SPARK-4583] [mllib] LogLoss for GradientBoost...

2014-11-25 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3439#discussion_r20910282 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LogLoss.scala --- @@ -45,19 +46,21 @@ object LogLoss extends Loss { model

[GitHub] spark pull request: [SPARK-4583] [mllib] LogLoss for GradientBoost...

2014-11-25 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3439#discussion_r20911009 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LogLoss.scala --- @@ -45,19 +46,21 @@ object LogLoss extends Loss { model

[GitHub] spark pull request: [SPARK-4583] [mllib] LogLoss for GradientBoost...

2014-11-25 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3439#issuecomment-64502217 Updated LogLoss. @mengxr @manishamde Thanks for looking at this!

[GitHub] spark pull request: [SPARK-4604][MLLIB] make MatrixFactorizationMo...

2014-11-25 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3459#issuecomment-64503911 @mengxr Except for the imports, LGTM

[GitHub] spark pull request: [SPARK-4604][MLLIB] make MatrixFactorizationMo...

2014-11-25 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3459#discussion_r20912049 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModelSuite.scala --- @@ -0,0 +1,56 @@ +/* + * Licensed

[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

2014-11-25 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3461#issuecomment-64506845 Note: I'm working on updating the decision tree programming guide further too (with more info about parameters).

[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

2014-11-25 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3461#issuecomment-64518822 OK! I think everything's updated, though I'm sure people will have feedback.

[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

2014-12-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3461#discussion_r2669 --- Diff: docs/mllib-decision-tree.md --- @@ -103,36 +106,73 @@ and the resulting `$M-1$` split candidates are considered. ### Stopping rule

[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

2014-12-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3461#discussion_r2725 --- Diff: docs/mllib-decision-tree.md --- @@ -103,36 +106,73 @@ and the resulting `$M-1$` split candidates are considered. ### Stopping rule

[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

2014-12-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3461#discussion_r2857 --- Diff: docs/mllib-decision-tree.md --- @@ -103,36 +106,73 @@ and the resulting `$M-1$` split candidates are considered. ### Stopping rule

[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

2014-12-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3461#discussion_r21112016 --- Diff: docs/mllib-decision-tree.md --- @@ -103,36 +106,73 @@ and the resulting `$M-1$` split candidates are considered. ### Stopping rule

[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

2014-12-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3461#discussion_r21112912 --- Diff: docs/mllib-decision-tree.md --- @@ -103,36 +106,73 @@ and the resulting `$M-1$` split candidates are considered. ### Stopping rule

[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

2014-12-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3461#discussion_r21113406 --- Diff: docs/mllib-decision-tree.md --- @@ -103,36 +106,73 @@ and the resulting `$M-1$` split candidates are considered. ### Stopping rule

[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

2014-12-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3461#discussion_r21113669 --- Diff: docs/mllib-gbt.md --- @@ -0,0 +1,308 @@ +--- +layout: global +title: Gradient-Boosted Trees - MLlib +displayTitle: a href=mllib

[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

2014-12-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3461#discussion_r21113959 --- Diff: docs/mllib-gbt.md --- @@ -0,0 +1,308 @@ +--- +layout: global +title: Gradient-Boosted Trees - MLlib +displayTitle: a href=mllib

[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

2014-12-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3461#discussion_r21114104 --- Diff: docs/mllib-gbt.md --- @@ -0,0 +1,308 @@ +--- +layout: global +title: Gradient-Boosted Trees - MLlib +displayTitle: a href=mllib

[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

2014-12-01 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3461#issuecomment-65124916 @manishamde Thanks for the feedback! I made the fixes, except for the default values for all optional parameters + ensembles section issues. Let me know if you

[GitHub] spark pull request: [SPARK-4710] [mllib] Eliminate MLlib compilati...

2014-12-02 Thread jkbradley
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/3568 [SPARK-4710] [mllib] Eliminate MLlib compilation warnings Renamed StreamingKMeans to StreamingKMeansExample to avoid warning about name conflict with StreamingKMeans class. Added import

[GitHub] spark pull request: [SPARK-4711] [mllib] Programming guide advice ...

2014-12-02 Thread jkbradley
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/3569 [SPARK-4711] [mllib] Programming guide advice on choosing optimizer I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include

[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

2014-12-02 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3461#issuecomment-65352254 @mengxr Sure, that seems like a good solution to the suggestion from @manishamde Will do.

[GitHub] spark pull request: [SPARK-4685] Include all spark.ml and spark.ml...

2014-12-04 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3598#issuecomment-65664523 LGTM in retrospect

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-04 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-65682412 @akopich Thanks for the responses! Follow-ups: (1) Users implementing their own regularizers You're right that this would be nice to have. If we

[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-08 Thread jkbradley
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/3637 [SPARK-4789] [mllib] Standardize ML Prediction APIs This is part (1) of the updates from the WIP PR in [https://github.com/apache/spark/pull/3427] Abstract classes for learning

[GitHub] spark pull request: [MLLIB] [WIP] [SPARK-3702] Standardizing abstr...

2014-12-08 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3427#issuecomment-66177125 I just submitted the first part of this PR: [https://github.com/apache/spark/pull/3637/files]

[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3636#discussion_r21480864 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala --- @@ -27,6 +27,8 @@ import org.apache.spark.rdd.RDD import

[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3636#discussion_r21480867 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala --- @@ -39,6 +41,7 @@ class GradientDescent private[mllib

[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3636#discussion_r21480907 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala --- @@ -182,34 +195,38 @@ object GradientDescent extends Logging

[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3636#discussion_r21480898 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala --- @@ -77,6 +80,14 @@ class GradientDescent private[mllib

[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3636#discussion_r21480909 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala --- @@ -219,4 +236,17 @@ object GradientDescent extends Logging

[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-08 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3636#issuecomment-66178359 @Lewuathe Thanks for the PR! I added some inline comments. One more general comment: When using subsampling (miniBatchFraction < 1.0), testing against

[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-08 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-66203211 The test failure reveals an issue in Spark SQL (ScalaReflection.scala:121 in schemaFor) where it gets confused if the case class includes multiple constructors

[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r21495884 --- Diff: mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-08 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-66208868 @avulanov Nice tests! A few comments: * Computing accuracy: It would be good to test on the original MNIST test set, rather than a subset of the training set

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-08 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66210825 @akopich The test failure seems unrelated (from a Python SQL test). I'll re-run the tests. (2) Regular and Robust in the same class Would

[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r21497595 --- Diff: mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r21498969 --- Diff: mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r21499038 --- Diff: mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request: [SPARK-4791] [sql] Infer schema from case clas...

2014-12-09 Thread jkbradley
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/3646 [SPARK-4791] [sql] Infer schema from case class with multiple constructors Modified ScalaReflection.schemaFor to take primary constructor of Product when there are multiple constructors. Added

[GitHub] spark pull request: [SPARK-4797] Replace breezeSquaredDistance

2014-12-09 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3643#issuecomment-66342802 Hi, it looks like this may be faster for dense vectors but not for sparse. SparseVector.toArray will create a dense vector, making it much slower if the vector
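
The concern above about densifying sparse vectors can be illustrated with a rough sketch of a squared distance computed directly on sparse data. It is not the code from the PR or from MLlib; it assumes each vector is given as a sorted list of indices plus a parallel list of values, and the function name is hypothetical.

```python
def sparse_squared_distance(idx_a, val_a, idx_b, val_b):
    """Squared Euclidean distance between two sparse vectors.

    Each vector is a sorted index list and a parallel value list
    (an assumed layout). Only stored entries are visited, so the cost
    depends on the number of non-zeros, not on the full vector size.
    """
    i = j = 0
    total = 0.0
    # Merge-walk both index lists in increasing index order.
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            diff = val_a[i] - val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            diff = val_a[i]  # the other vector is zero at this index
            i += 1
        else:
            diff = val_b[j]  # the other vector is zero at this index
            j += 1
        total += diff * diff
    # Entries left over in either vector face implicit zeros.
    total += sum(v * v for v in val_a[i:])
    total += sum(v * v for v in val_b[j:])
    return total


# Vectors {0: 1.0, 3: 2.0} and {3: 5.0, 7: 1.0}: 1 + 9 + 1 = 11
print(sparse_squared_distance([0, 3], [1.0, 2.0], [3, 7], [5.0, 1.0]))
```

Converting both inputs to dense arrays first would instead allocate and scan every position, which is the slowdown the comment warns about for high-dimensional sparse vectors.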

[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3636#discussion_r21556478 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala --- @@ -142,7 +154,9 @@ object GradientDescent extends Logging

[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3636#discussion_r21556482 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala --- @@ -155,7 +169,13 @@ object GradientDescent extends Logging

[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3636#discussion_r21556486 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala --- @@ -182,34 +202,40 @@ object GradientDescent extends Logging

[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3636#discussion_r21556490 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala --- @@ -138,6 +138,45 @@ class GradientDescentSuite extends

[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3636#discussion_r21556494 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala --- @@ -138,6 +138,45 @@ class GradientDescentSuite extends

[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-09 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3636#issuecomment-66345003 @Lewuathe Thanks for the updates! I just saw a couple more things, but I think it's almost ready.

[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-09 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-66346654 Question: Do people have preferences for the name of what is currently predictRaw? Possibilities are: ``` predictRaw() predictConfidence() confidences

[GitHub] spark pull request: [SPARK-4736][mllib] [random forest] functions ...

2014-12-09 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3583#issuecomment-66348700 @dikejiang Thanks for the PR! I'm wondering if you'd be interested in a more general API. In the new experimental ML package, I have a PR [https://www.github.com

[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r21559541 --- Diff: mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request: [MLLIB] SPARK-4362: Added classProbabilities m...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3626#discussion_r21563627 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala --- @@ -65,6 +66,25 @@ class NaiveBayesModel private[mllib

[GitHub] spark pull request: [MLLIB] SPARK-4362: Added classProbabilities m...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3626#discussion_r21563623 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala --- @@ -65,6 +66,25 @@ class NaiveBayesModel private[mllib

[GitHub] spark pull request: [MLLIB] SPARK-4362: Added classProbabilities m...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3626#discussion_r21564191 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala --- @@ -65,6 +66,25 @@ class NaiveBayesModel private[mllib

[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r21566135 --- Diff: mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-09 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-66368865 @srowen @Lewuathe Continuing the above inline discussion... Question: Should the typed interface be public? New proposal: Hide the typed interface

[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-09 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-66380629 Oh, apologies for being unclear. I meant this division: * Typed interface: train(RDD[LabeledPoint]), predict(Vector) * SchemaRDD interface: fit(SchemaRDD

[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3636#discussion_r21578119 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala --- @@ -138,6 +138,45 @@ class GradientDescentSuite extends

[GitHub] spark pull request: SPARK-4749: Allow initializing KMeans clusters...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3610#discussion_r21581983 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -353,6 +359,31 @@ object KMeans

[GitHub] spark pull request: SPARK-4749: Allow initializing KMeans clusters...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3610#discussion_r21581989 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala --- @@ -90,6 +90,27 @@ class KMeansSuite extends FunSuite

[GitHub] spark pull request: SPARK-4749: Allow initializing KMeans clusters...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3610#discussion_r21581991 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala --- @@ -90,6 +90,27 @@ class KMeansSuite extends FunSuite

[GitHub] spark pull request: SPARK-4749: Allow initializing KMeans clusters...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3610#discussion_r21581990 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala --- @@ -90,6 +90,27 @@ class KMeansSuite extends FunSuite

[GitHub] spark pull request: SPARK-4749: Allow initializing KMeans clusters...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3610#discussion_r21581986 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -353,6 +359,31 @@ object KMeans

[GitHub] spark pull request: SPARK-4749: Allow initializing KMeans clusters...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3610#discussion_r21581982 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -43,7 +43,8 @@ class KMeans private ( private var runs: Int

[GitHub] spark pull request: SPARK-4749: Allow initializing KMeans clusters...

2014-12-09 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3610#issuecomment-66398376 @nxwhite-str Thanks for the PR! Could you please update the title to start with [SPARK-4749] [mllib] to help with automated tagging?

[GitHub] spark pull request: [SPARK-4494] IDFModel.transform() add support ...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3603#discussion_r21582536 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -174,37 +174,18 @@ class IDFModel private[mllib] (val idf: Vector) extends

[GitHub] spark pull request: [SPARK-4494] IDFModel.transform() add support ...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3603#discussion_r21582540 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/IDFSuite.scala --- @@ -53,6 +53,19 @@ class IDFSuite extends FunSuite

[GitHub] spark pull request: [SPARK-4494] IDFModel.transform() add support ...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3603#discussion_r21582538 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/IDFSuite.scala --- @@ -17,12 +17,10 @@ package org.apache.spark.mllib.feature

[GitHub] spark pull request: [SPARK-4494] IDFModel.transform() add support ...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3603#discussion_r21582546 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/IDFSuite.scala --- @@ -86,6 +101,19 @@ class IDFSuite extends FunSuite

[GitHub] spark pull request: [SPARK-4494] IDFModel.transform() add support ...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3603#discussion_r21582552 --- Diff: python/pyspark/mllib/feature.py --- @@ -220,12 +220,15 @@ def transform(self, dataset): the terms which occur in fewer than

[GitHub] spark pull request: [SPARK-4494] IDFModel.transform() add support ...

2014-12-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3603#discussion_r21582550 --- Diff: python/pyspark/mllib/feature.py --- @@ -212,7 +212,7 @@ class IDFModel(JavaVectorTransformer): Represents an IDF model that can

[GitHub] spark pull request: [SPARK-4494] IDFModel.transform() add support ...

2014-12-09 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3603#issuecomment-66399885 @yu-iskw Thanks for the PR! I added some comments but left a question for @mengxr Also, could you please add the [mllib] tag to the PR title?

[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-10 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-66509244 Thanks everyone for all of the comments! @shivaram No problem, thanks for checking out the design doc! The 2 main use cases you listed are correct

[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...

2014-12-10 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3603#discussion_r21635133 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -174,37 +174,18 @@ class IDFModel private[mllib] (val idf: Vector) extends
