Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3320#discussion_r20591273
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +180,191 @@ def trainRegressor(data, categoricalFeaturesInfo,
model.predict(rdd).collect
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3374#discussion_r20628796
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
@@ -45,146 +43,92 @@ import org.apache.spark.storage.StorageLevel
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3374#discussion_r20629011
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
@@ -40,151 +39,98 @@ import org.apache.spark.storage.StorageLevel
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3374#discussion_r20629126
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala ---
@@ -387,7 +386,7 @@ object RandomForest extends Serializable with Logging
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3374#discussion_r20629452
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala
---
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3374#discussion_r20629451
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
@@ -40,151 +39,98 @@ import org.apache.spark.storage.StorageLevel
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3374#issuecomment-63766642
@mengxr Thanks for the updates! Just added a few small comments. Other
than those, LGTM
---
If your project is set up for it, you can reply to this email and have
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3320#discussion_r20676259
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +182,206 @@ def trainRegressor(data, categoricalFeaturesInfo,
model.predict(rdd).collect
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3320#discussion_r20676265
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +182,206 @@ def trainRegressor(data, categoricalFeaturesInfo,
model.predict(rdd).collect
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3320#discussion_r20676263
--- Diff: python/pyspark/mllib/tree.py ---
@@ -181,8 +182,206 @@ def trainRegressor(data, categoricalFeaturesInfo,
model.predict(rdd).collect
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3320#issuecomment-63877350
@davies Thanks for adding this API! I made a few small comments. Other
than those, LGTM
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3320#issuecomment-63881779
LGTM
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3397#issuecomment-63931044
It might be good to cache for decision tree too since it makes a couple of
passes through the original RDD (before it creates the TreePoint RDD).
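The multiple-pass point can be sketched without Spark: when an input is expensive to recompute and an algorithm (like decision tree training) reads it more than once, materializing it once saves repeated work. A minimal illustrative sketch in plain Python (not Spark's API; `ExpensiveSource` is a hypothetical stand-in for an uncached RDD):

```python
class ExpensiveSource:
    """Stand-in for an uncached dataset that is recomputed on every read."""
    def __init__(self, data):
        self.data = list(data)
        self.loads = 0

    def load(self):
        self.loads += 1          # each call re-reads/recomputes the input
        return list(self.data)

src = ExpensiveSource(range(5))

# Without caching: a two-pass algorithm triggers two recomputations.
n = len(src.load())
total = sum(src.load())
assert src.loads == 2

# With caching: materialize once, then run both passes over the copy.
cached = src.load()
n2, total2 = len(cached), sum(cached)
assert (n2, total2) == (n, total) == (5, 10)
```

In Spark terms this corresponds to calling `persist()` on the input RDD before training and `unpersist()` afterwards.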
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3397#discussion_r20739035
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -74,10 +74,28 @@ class PythonMLLibAPI extends Serializable
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3397#discussion_r20739110
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -526,10 +515,15 @@ class PythonMLLibAPI extends Serializable
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3397#issuecomment-64031697
LGTM
@pwendell had questions about whether we should allow the user to specify (in
the Python call) whether they want to use caching. CC @mengxr
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3420#discussion_r20770274
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -749,7 +759,13 @@ private[spark] object SerDe extends
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3420#discussion_r20770273
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -749,7 +759,13 @@ private[spark] object SerDe extends
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3420#issuecomment-64144537
For the record, I ran some tests with this and confirmed the speedups.
This PR puts test-time prediction for GLMs at the same speed as the Spark 1.1
release
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3420#issuecomment-64156125
By the way, my tests were with dense vectors, not sparse.
GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/3427
[MLLIB] [WIP] [SPARK-3702] Standardizing abstractions and developer API for
prediction
This is a WIP effort to standardize abstractions and developer API for
prediction tasks (classification
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2137#issuecomment-64165435
@BigCrunsh I just submitted a WIP for the new MLlib API. Apologies for the
slow development, but I'd like to try to get your PR in to improve the original
MLlib API
GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/3439
[SPARK-4583] [mllib] LogLoss for GradientBoostedTrees fix + doc updates
Currently, the LogLoss used by GradientBoostedTrees has 2 issues:
* the gradient (and therefore loss) does not match
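Whether a loss and its gradient "match" can be verified with a finite-difference check. A sketch with a plain binomial log loss, `L(y, f) = log(1 + exp(-y*f))` for labels `y ∈ {-1, +1}`; Spark's GBT log loss uses a scaled variant, so the constants here are illustrative, but the checking technique is the same:

```python
import math

def loss(y, f):
    # Binomial log loss with labels y in {-1, +1} and margin f = F(x).
    return math.log1p(math.exp(-y * f))

def gradient(y, f):
    # Analytic derivative of the loss above w.r.t. the prediction f.
    return -y / (1.0 + math.exp(y * f))

def numeric_gradient(y, f, eps=1e-6):
    # Central finite difference: should agree with the analytic form.
    return (loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)

for y in (-1.0, 1.0):
    for f in (-2.0, 0.0, 0.5, 3.0):
        assert abs(gradient(y, f) - numeric_gradient(y, f)) < 1e-6
```

A mismatch between the two functions is exactly the kind of bug this PR fixes.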
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3439#discussion_r20885397
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/loss/SquaredError.scala ---
@@ -49,18 +48,17 @@ object SquaredError extends Loss
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3439#issuecomment-64474382
I just pushed an update which includes:
* removing the 1/2 from SquaredError. This also required updating the test
suite since it effectively doubles the gradient
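Dropping the 1/2 factor changes `L = (1/2)(y - f)^2` into `L = (y - f)^2`, whose derivative with respect to the prediction is exactly twice as large, which is why the test suite had to change. A quick check:

```python
def half_squared_loss_grad(y, f):
    # d/df of 0.5 * (f - y)**2  (the old convention, with the 1/2)
    return f - y

def squared_loss_grad(y, f):
    # d/df of (f - y)**2  (after removing the 1/2)
    return 2.0 * (f - y)

for y, f in [(3.0, 1.0), (0.0, -2.5), (1.5, 1.5)]:
    assert squared_loss_grad(y, f) == 2.0 * half_squared_loss_grad(y, f)
```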
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3459#discussion_r20901054
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
---
@@ -28,13 +28,16 @@ import
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3459#discussion_r20901060
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
---
@@ -28,13 +28,16 @@ import
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3459#discussion_r20901112
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
---
@@ -28,13 +28,16 @@ import
GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/3461
[SPARK-4580] [SPARK-4610] [mllib] Documentation for tree ensembles +
DecisionTree API fix
Major changes:
* Added documentation for tree ensembles
* Added examples for tree ensembles
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3439#discussion_r20910282
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LogLoss.scala ---
@@ -45,19 +46,21 @@ object LogLoss extends Loss {
model
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3439#discussion_r20911009
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LogLoss.scala ---
@@ -45,19 +46,21 @@ object LogLoss extends Loss {
model
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3439#issuecomment-64502217
Updated LogLoss.
@mengxr @manishamde Thanks for looking at this!
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3459#issuecomment-64503911
@mengxr Except for the imports, LGTM
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3459#discussion_r20912049
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModelSuite.scala
---
@@ -0,0 +1,56 @@
+/*
+ * Licensed
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3461#issuecomment-64506845
Note: I'm working on updating the decision tree programming guide further
too (with more info about parameters).
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3461#issuecomment-64518822
OK! I think everything's updated, though I'm sure people will have
feedback.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3461#discussion_r2669
--- Diff: docs/mllib-decision-tree.md ---
@@ -103,36 +106,73 @@ and the resulting `$M-1$` split candidates are
considered.
### Stopping rule
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3461#discussion_r2725
--- Diff: docs/mllib-decision-tree.md ---
@@ -103,36 +106,73 @@ and the resulting `$M-1$` split candidates are
considered.
### Stopping rule
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3461#discussion_r2857
--- Diff: docs/mllib-decision-tree.md ---
@@ -103,36 +106,73 @@ and the resulting `$M-1$` split candidates are
considered.
### Stopping rule
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3461#discussion_r21112016
--- Diff: docs/mllib-decision-tree.md ---
@@ -103,36 +106,73 @@ and the resulting `$M-1$` split candidates are
considered.
### Stopping rule
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3461#discussion_r21112912
--- Diff: docs/mllib-decision-tree.md ---
@@ -103,36 +106,73 @@ and the resulting `$M-1$` split candidates are
considered.
### Stopping rule
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3461#discussion_r21113406
--- Diff: docs/mllib-decision-tree.md ---
@@ -103,36 +106,73 @@ and the resulting `$M-1$` split candidates are
considered.
### Stopping rule
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3461#discussion_r21113669
--- Diff: docs/mllib-gbt.md ---
@@ -0,0 +1,308 @@
+---
+layout: global
+title: Gradient-Boosted Trees - MLlib
+displayTitle: a href=mllib
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3461#discussion_r21113959
--- Diff: docs/mllib-gbt.md ---
@@ -0,0 +1,308 @@
+---
+layout: global
+title: Gradient-Boosted Trees - MLlib
+displayTitle: a href=mllib
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3461#discussion_r21114104
--- Diff: docs/mllib-gbt.md ---
@@ -0,0 +1,308 @@
+---
+layout: global
+title: Gradient-Boosted Trees - MLlib
+displayTitle: a href=mllib
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3461#issuecomment-65124916
@manishamde Thanks for the feedback! I made the fixes, except for the
default values for all optional parameters + ensembles section issues. Let me
know if you
GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/3568
[SPARK-4710] [mllib] Eliminate MLlib compilation warnings
Renamed StreamingKMeans to StreamingKMeansExample to avoid warning about
name conflict with StreamingKMeans class.
Added import
GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/3569
[SPARK-4711] [mllib] Programming guide advice on choosing optimizer
I have heard requests for the docs to include advice about choosing an
optimization method. The programming guide could include
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3461#issuecomment-65352254
@mengxr Sure, that seems like a good solution to the suggestion from
@manishamde
Will do.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3598#issuecomment-65664523
LGTM in retrospect
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-65682412
@akopich Thanks for the responses! Follow-ups:
(1) Users implementing their own regularizers
You're right that this would be nice to have. If we
GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/3637
[SPARK-4789] [mllib] Standardize ML Prediction APIs
This is part (1) of the updates from the WIP PR in
[https://github.com/apache/spark/pull/3427]
Abstract classes for learning
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3427#issuecomment-66177125
I just submitted the first part of this PR:
[https://github.com/apache/spark/pull/3637/files]
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3636#discussion_r21480864
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
---
@@ -27,6 +27,8 @@ import org.apache.spark.rdd.RDD
import
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3636#discussion_r21480867
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
---
@@ -39,6 +41,7 @@ class GradientDescent private[mllib
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3636#discussion_r21480907
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
---
@@ -182,34 +195,38 @@ object GradientDescent extends Logging
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3636#discussion_r21480898
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
---
@@ -77,6 +80,14 @@ class GradientDescent private[mllib
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3636#discussion_r21480909
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
---
@@ -219,4 +236,17 @@ object GradientDescent extends Logging
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3636#issuecomment-66178359
@Lewuathe Thanks for the PR! I added some inline comments. One more
general comment: When using subsampling (miniBatchFraction < 1.0), testing
against
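For context, subsampling here refers to `GradientDescent`'s mini-batch option: each iteration estimates the gradient from a Bernoulli sample of the data rather than the full set, so with a fraction below 1.0 the result is stochastic. An illustrative sketch in plain Python (not Spark's implementation):

```python
import random

def minibatch_gradient(data, grad_fn, fraction, seed=42):
    # Bernoulli-sample roughly a `fraction` of the points, then average
    # the gradient over the sampled points only (illustrative sketch).
    rng = random.Random(seed)
    sampled = [x for x in data if rng.random() < fraction]
    if not sampled:                    # a tiny fraction can sample nothing
        return 0.0
    return sum(grad_fn(x) for x in sampled) / len(sampled)

data = [float(i) for i in range(100)]
grad_fn = lambda x: x                 # e.g. gradient of 0.5 * x**2

full = minibatch_gradient(data, grad_fn, fraction=1.0)  # uses every point
sub = minibatch_gradient(data, grad_fn, fraction=0.1)   # ~10 points
assert full == sum(data) / len(data)                    # deterministic
```

Because `sub` depends on the random sample, tests against exact values only make sense with `fraction=1.0` or a fixed seed, which is the testing pitfall the comment alludes to.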
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3637#issuecomment-66203211
The test failure reveals an issue in Spark SQL (ScalaReflection.scala:121
in schemaFor) where it gets confused if the case class includes multiple
constructors
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3637#discussion_r21495884
--- Diff: mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/1379#issuecomment-66208868
@avulanov Nice tests! A few comments:
* Computing accuracy: It would be good to test on the original MNIST test
set, rather than a subset of the training set
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-66210825
@akopich
The test failure seems unrelated (from a Python SQL test). I'll re-run the
tests.
(2) Regular and Robust in the same class
Would
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3637#discussion_r21497595
--- Diff: mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3637#discussion_r21498969
--- Diff: mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3637#discussion_r21499038
--- Diff: mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF
GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/3646
[SPARK-4791] [sql] Infer schema from case class with multiple constructors
Modified ScalaReflection.schemaFor to take primary constructor of Product
when there are multiple constructors. Added
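The ambiguity being fixed: a class can expose several constructors, and schema inference has to pick one deterministically. A rough Python analogue of "use the primary constructor" (illustrative only; `Point` and `schema_for` are hypothetical, and Scala case classes make the primary constructor explicit in a way Python does not):

```python
import inspect
import math

class Point:
    """`__init__` plays the role of the primary constructor; `from_polar`
    is an auxiliary constructor that schema inference should ignore."""
    def __init__(self, x: float, y: float):
        self.x, self.y = x, y

    @classmethod
    def from_polar(cls, r: float, theta: float):
        return cls(r * math.cos(theta), r * math.sin(theta))

def schema_for(cls):
    # Infer field names and types from the primary constructor only.
    sig = inspect.signature(cls.__init__)
    return {name: p.annotation
            for name, p in sig.parameters.items() if name != "self"}

assert schema_for(Point) == {"x": float, "y": float}
```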
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3643#issuecomment-66342802
Hi, it looks like this may be faster for dense vectors but not for sparse.
SparseVector.toArray will create a dense vector, making it much slower if the
vector
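The cost difference is easy to see in a sketch: operating on a sparse vector through its nonzero entries is O(nnz), while densifying first (what `SparseVector.toArray` does) is O(size) regardless of sparsity. Illustrative Python, not Spark's code:

```python
def sparse_dot(indices, values, dense):
    # Touches only the nonzero entries: O(nnz) work.
    return sum(v * dense[i] for i, v in zip(indices, values))

def densify_then_dot(size, indices, values, dense):
    # Mimics SparseVector.toArray: allocate and fill a full array first,
    # so the cost is O(size) even when nnz is tiny.
    arr = [0.0] * size
    for i, v in zip(indices, values):
        arr[i] = v
    return sum(a * b for a, b in zip(arr, dense))

dense = [1.0] * 1000
assert sparse_dot([3, 500], [2.0, 4.0], dense) == \
       densify_then_dot(1000, [3, 500], [2.0, 4.0], dense) == 6.0
```

Both give the same answer; only the amount of work differs, which is why densifying inside a hot loop hurts sparse inputs.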
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3636#discussion_r21556478
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
---
@@ -142,7 +154,9 @@ object GradientDescent extends Logging
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3636#discussion_r21556482
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
---
@@ -155,7 +169,13 @@ object GradientDescent extends Logging
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3636#discussion_r21556486
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
---
@@ -182,34 +202,40 @@ object GradientDescent extends Logging
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3636#discussion_r21556490
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala
---
@@ -138,6 +138,45 @@ class GradientDescentSuite extends
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3636#discussion_r21556494
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala
---
@@ -138,6 +138,45 @@ class GradientDescentSuite extends
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3636#issuecomment-66345003
@Lewuathe Thanks for the updates! I just saw a couple more things, but I
think it's almost ready.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3637#issuecomment-66346654
Question: Do people have preferences for the name of what is currently
predictRaw? Possibilities are:
```
predictRaw()
predictConfidence()
confidences
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3583#issuecomment-66348700
@dikejiang Thanks for the PR! I'm wondering if you'd be interested in a
more general API. In the new experimental ML package, I have a PR
[https://www.github.com
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3637#discussion_r21559541
--- Diff: mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3626#discussion_r21563627
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -65,6 +66,25 @@ class NaiveBayesModel private[mllib
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3626#discussion_r21563623
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -65,6 +66,25 @@ class NaiveBayesModel private[mllib
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3626#discussion_r21564191
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -65,6 +66,25 @@ class NaiveBayesModel private[mllib
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3637#discussion_r21566135
--- Diff: mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3637#issuecomment-66368865
@srowen @Lewuathe Continuing the above inline discussion...
Question: Should the typed interface be public?
New proposal: Hide the typed interface
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3637#issuecomment-66380629
Oh, apologies for being unclear. I meant this division:
* Typed interface: train(RDD[LabeledPoint]), predict(Vector)
* SchemaRDD interface: fit(SchemaRDD
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3636#discussion_r21578119
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala
---
@@ -138,6 +138,45 @@ class GradientDescentSuite extends
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3610#discussion_r21581983
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -353,6 +359,31 @@ object KMeans
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3610#discussion_r21581989
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -90,6 +90,27 @@ class KMeansSuite extends FunSuite
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3610#discussion_r21581991
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -90,6 +90,27 @@ class KMeansSuite extends FunSuite
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3610#discussion_r21581990
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -90,6 +90,27 @@ class KMeansSuite extends FunSuite
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3610#discussion_r21581986
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -353,6 +359,31 @@ object KMeans
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3610#discussion_r21581982
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -43,7 +43,8 @@ class KMeans private (
private var runs: Int
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3610#issuecomment-66398376
@nxwhite-str Thanks for the PR! Could you please update the title to
start with [SPARK-4749] [mllib] to help with automated tagging?
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3603#discussion_r21582536
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -174,37 +174,18 @@ class IDFModel private[mllib] (val idf: Vector)
extends
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3603#discussion_r21582540
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/feature/IDFSuite.scala ---
@@ -53,6 +53,19 @@ class IDFSuite extends FunSuite
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3603#discussion_r21582538
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/feature/IDFSuite.scala ---
@@ -17,12 +17,10 @@
package org.apache.spark.mllib.feature
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3603#discussion_r21582546
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/feature/IDFSuite.scala ---
@@ -86,6 +101,19 @@ class IDFSuite extends FunSuite
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3603#discussion_r21582552
--- Diff: python/pyspark/mllib/feature.py ---
@@ -220,12 +220,15 @@ def transform(self, dataset):
the terms which occur in fewer than
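The truncated hunk concerns `minDocFreq` in `IDFModel`: terms appearing in fewer than `minDocFreq` documents get an IDF of 0, so they vanish from TF-IDF scores. A sketch of the idea (MLlib documents its IDF as `log((m+1)/(df+1))`, but treat the details here as illustrative):

```python
import math

def idf_vector(doc_freqs, num_docs, min_doc_freq):
    # Terms seen in fewer than min_doc_freq documents get weight 0.0,
    # which zeroes them out of any downstream TF-IDF product.
    return [0.0 if df < min_doc_freq
            else math.log((num_docs + 1.0) / (df + 1.0))
            for df in doc_freqs]

idf = idf_vector([1, 5, 10], num_docs=20, min_doc_freq=2)
assert idf[0] == 0.0            # too rare: filtered out
assert idf[1] > idf[2] > 0.0    # rarer terms score higher
```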
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3603#discussion_r21582550
--- Diff: python/pyspark/mllib/feature.py ---
@@ -212,7 +212,7 @@ class IDFModel(JavaVectorTransformer):
Represents an IDF model that can
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3603#issuecomment-66399885
@yu-iskw Thanks for the PR! I added some comments but left a question for
@mengxr
Also, could you please add the [mllib] tag to the PR title?
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3637#issuecomment-66509244
Thanks everyone for all of the comments!
@shivaram No problem, thanks for checking out the design doc! The 2 main
use cases you listed are correct
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3603#discussion_r21635133
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -174,37 +174,18 @@ class IDFModel private[mllib] (val idf: Vector)
extends