Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17090
Finally, I've done some work related to
[SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968) and have a
potential solution that seems to be pretty good. In this case it should be
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17090
I should note that I've found the performance of "recommend all" to be very
dependent on the number of partitions, since that controls the memory consumption
per task (which can easily
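The partition/memory trade-off described above can be sketched with a back-of-envelope estimate. The block size, user count, and partition counts below are purely illustrative (not from a real run), and the sizing formula is a simplification of what a blocked recommend-all task would materialise:

```scala
// Rough per-task memory for a blocked "recommend all": each task scores a
// (usersPerPartition x itemBlockSize) matrix of Doubles (8 bytes each).
// Numbers below are illustrative only.
def scoreMatrixBytes(numUsers: Long, numPartitions: Int, itemBlockSize: Int): Long = {
  val usersPerPartition = math.ceil(numUsers.toDouble / numPartitions).toLong
  usersPerPartition * itemBlockSize * 8L
}

val fewParts  = scoreMatrixBytes(numUsers = 260000L, numPartitions = 8, itemBlockSize = 4096)
val manyParts = scoreMatrixBytes(numUsers = 260000L, numPartitions = 200, itemBlockSize = 4096)
// With 8 partitions the intermediate matrix is ~1 GB per task; with 200
// partitions it drops to ~40 MB, which is why partitioning matters so much here.
```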
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17090
The performance of #12574 is not better than the existing `mllib`
recommend-all - since it wraps the functionality it's roughly on par.
---
If your project is set up for it, you can reply to
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17090
Fitting into the CV / evaluator is actually fairly straightforward. It's
just that the semantics of `transform` for top-k recommendation must fit into
whatever we decide on for `RankingEval
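Whatever shape the evaluator takes, the core of a ranking metric over top-k `transform` output is small. A minimal precision-at-k, with hypothetical names (the actual evaluator API was still being decided):

```scala
// Precision@k: fraction of the top-k recommended items that are relevant.
// `recommended` is an ordered top-k list; `relevant` is the ground-truth set.
def precisionAtK(recommended: Seq[Int], relevant: Set[Int], k: Int): Double = {
  require(k > 0, "k must be positive")
  val topK = recommended.take(k)
  topK.count(relevant.contains).toDouble / k
}

val p = precisionAtK(recommended = Seq(1, 5, 9, 2), relevant = Set(5, 2, 7), k = 4)
// Two of the four recommended items (5 and 2) are relevant, so p = 0.5.
```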
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17090
@jkbradley do we propose to add further methods to support recommending for
all users (or items) in an input DF? like `recommendForAllUsers(dataset:
DataFrame, num: Int)`?
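In plain collections (standing in for DataFrames), recommending for an arbitrary subset of users amounts to scoring that subset against the item factors and keeping the top `num`. Everything below is a hypothetical sketch, not the actual ALS API:

```scala
// Hypothetical sketch: top-`num` items for a *subset* of users, given
// learned user/item factor vectors. Maps stand in for factor DataFrames.
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def recommendForUserSubset(
    userFactors: Map[Int, Array[Double]],
    itemFactors: Map[Int, Array[Double]],
    users: Seq[Int],
    num: Int): Map[Int, Seq[Int]] =
  users.flatMap { u =>
    userFactors.get(u).map { uf =>
      val ranked = itemFactors.toSeq
        .map { case (item, f) => (item, dot(uf, f)) }
        .sortBy(-_._2)   // highest predicted score first
        .take(num)
        .map(_._1)
      u -> ranked
    }
  }.toMap

val recs = recommendForUserSubset(
  Map(1 -> Array(1.0, 0.0), 2 -> Array(0.0, 1.0)),
  Map(10 -> Array(0.9, 0.1), 11 -> Array(0.1, 0.9)),
  users = Seq(1), num = 1)
```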
GitHub user MLnick opened a pull request:
https://github.com/apache/spark/pull/17102
[SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" usage in ALS
[SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489) added the
ability to skip `NaN` predictions
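Operationally, the "drop" cold-start strategy amounts to filtering out rows whose prediction is `NaN` (users or items unseen at fit time) before evaluation. A toy sketch with plain collections, not the actual ALS internals:

```scala
// Sketch of coldStartStrategy = "drop": predictions for cold-start
// users/items come out as NaN and are filtered before metrics are computed.
case class Prediction(user: Int, item: Int, prediction: Double)

def dropColdStart(preds: Seq[Prediction]): Seq[Prediction] =
  preds.filterNot(_.prediction.isNaN)

val kept = dropColdStart(Seq(
  Prediction(1, 10, 4.5),
  Prediction(2, 11, Double.NaN)  // cold-start user/item: dropped
))
```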
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/12896
Merged to master
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17076
@sethah a quick glance at the screenshots seems to indicate the processing
time went up? Which seems a bit odd. Of course it's a small test so maybe just
noise.
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/17059#discussion_r103421071
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
---
@@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17090
For performance tests, I've been using the MovieLens `ml-latest` dataset
[here](https://grouplens.org/datasets/movielens/). It has `24,404,096` ratings
with `259,137` users and `39,443` movies
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17059
Ok, let me take a look at this. Thanks!
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17090
#12574 is a comprehensive solution that also intends to support
cross-validation as well as recommending for a subset (or any arbitrary set) of
users/items. So it solves
[SPARK-10802](https
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17059
@datumbox you mention there is GC & performance overhead which makes some
sense. Have you run into problems with very large scale (like millions users &
items & ratings)? I did regr
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/17076#discussion_r103187723
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -440,19 +440,9 @@ private class LinearSVCAggregator
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17000
cc @yanboliang - it seems actually similar in effect to the VL-BFGS work
with RDD-based coefficients?
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17000
I'm not totally certain there will be some huge benefit from porting the vector
summary to the UDAF framework. But there are API-level benefits to doing so.
Perhaps there is a way to incorporate
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17000
@ZunwenYou yes I understand that the `sliceAggregate` is different from
SPARK-19634 and more comparable to `treeAggregate`. But I'm not sure, if we
plan to port the vector summary to use `Data
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17034
As commented, we could I guess try to fit the additional tests into
`checkNumericTypes` - but it's specific to AFT so it doesn't seem worth it for now.
So, this LGTM.
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/17034#discussion_r102727229
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala
---
@@ -361,6 +363,36 @@ class AFTSurvivalRegressionSuite
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16971
Yes, my point was that returning null is not very idiomatic in Scala. Better to
return an Option or an empty collection. Option doesn't work for Java compat, so
an empty Array is best in this case I believe
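The empty-Array convention can be illustrated with a toy quantile helper. The function name and the (very naive) quantile computation below are stand-ins for illustration, not the real `approxQuantile` implementation:

```scala
// Returning an empty Array instead of null keeps the result safe for both
// Scala and Java callers (an Option would not translate cleanly to Java).
// The quantile logic here is a deliberately naive stand-in.
def quantilesOrEmpty(values: Seq[Double], probabilities: Seq[Double]): Array[Double] =
  if (values.isEmpty) Array.empty[Double]  // empty, never null
  else {
    val sorted = values.sorted
    probabilities.map { p =>
      val idx = math.min((p * sorted.length).toInt, sorted.length - 1)
      sorted(idx)
    }.toArray
  }

val empty = quantilesOrEmpty(Nil, Seq(0.5))
// Callers can iterate the result unconditionally; no null check needed.
```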
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17016
Merged to master
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17021
Merged to master
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17021
LGTM
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17000
Is the speedup coming mostly from the `MultivariateOnlineSummarizer` stage?
See https://issues.apache.org/jira/browse/SPARK-19634 which is for porting
this operation to use DataFrame UDAF
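The single-pass summarizer stage in question maintains running moments per column. A minimal single-column version in the spirit of `MultivariateOnlineSummarizer` (using Welford's online update; this is a sketch, not the Spark class itself):

```scala
// Online mean/variance via Welford's algorithm: one pass, O(1) state,
// mergeable across partitions in principle (merge omitted for brevity).
final class OnlineSummary {
  private var n = 0L
  private var mean = 0.0
  private var m2 = 0.0    // sum of squared deviations from the running mean

  def add(x: Double): this.type = {
    n += 1
    val delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)
    this
  }

  def count: Long = n
  def avg: Double = mean
  def variance: Double = if (n > 1) m2 / (n - 1) else 0.0  // sample variance
}

val s = new OnlineSummary
Seq(1.0, 2.0, 3.0).foreach(s.add)
```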
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16971#discussion_r102146260
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
---
@@ -78,7 +80,13 @@ object StatFunctions extends Logging
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16971#discussion_r102145908
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16971#discussion_r102145412
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
---
@@ -54,6 +54,8 @@ object StatFunctions extends Logging
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16971#discussion_r102145538
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16971#discussion_r102146144
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
@@ -214,20 +214,29 @@ class DataFrameStatSuite extends QueryTest with
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17000
Just to be clear - this is essentially just splitting an array up into
smaller chunks so that overall communication is more efficient? It would be
good to look at why Spark is not doing a good job
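The chunking idea itself is simple: cut one large array into bounded slices so each piece can be shipped and aggregated separately. A minimal sketch (slice sizing is an assumption; the real proposal's heuristics may differ):

```scala
// Split a large array into at most `numSlices` contiguous chunks so that
// each chunk can be communicated/aggregated independently.
def slice(arr: Array[Double], numSlices: Int): Seq[Array[Double]] = {
  require(numSlices > 0, "numSlices must be positive")
  val size = math.ceil(arr.length.toDouble / numSlices).toInt
  arr.grouped(size).toSeq
}

val slices = slice(Array.tabulate(10)(_.toDouble), numSlices = 3)
// 10 elements in chunks of ceil(10/3) = 4 -> sizes 4, 4, 2.
```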
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16965
cc @sethah @jkbradley
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16966
@Yunni have you verified what performance improvement this gives?
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16966#discussion_r102005885
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -147,6 +148,15 @@ private[ml] abstract class LSHModel[T <: LSHMode
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/12896
jenkins retest this please
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16774
I'd say coming up with a heuristic or algorithm to automatically set the
parallel execution param is going to be pretty challenging, since it depends on
the details of the individual pipeline
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16776#discussion_r101155454
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -63,44 +63,49 @@ final class DataFrameStatFunctions private[sql
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16776#discussion_r101156697
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -58,49 +58,52 @@ final class DataFrameStatFunctions private[sql
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16776#discussion_r101152427
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
@@ -159,16 +159,72 @@ class DataFrameStatSuite extends QueryTest with
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r100933636
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ---
@@ -106,18 +110,21 @@ class TrainValidationSplit @Since("
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r100933844
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ---
@@ -106,18 +110,21 @@ class TrainValidationSplit @Since("
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r100932890
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -100,31 +104,44 @@ class CrossValidator @Since("1.2.0") (@Si
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r100934267
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -51,7 +51,7 @@ private[ml] trait CrossValidatorParams extends
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r100934570
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -100,31 +104,44 @@ class CrossValidator @Since("1.2.0") (@Si
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r100932022
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -100,31 +104,44 @@ class CrossValidator @Since("1.2.0") (@Si
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r100934338
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala ---
@@ -67,6 +67,17 @@ private[ml] trait ValidatorParams extends HasSeed
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r100933608
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ---
@@ -106,18 +110,21 @@ class TrainValidationSplit @Since("
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100927448
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100927378
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
return self.getOrDefault(self.threshold
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100930770
--- Diff: docs/ml-features.md ---
@@ -1558,6 +1558,15 @@ for more details on the API.
{% include_example
java/org/apache/spark/examples/ml
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16715#discussion_r100929903
--- Diff:
examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala
---
@@ -38,40 +39,45 @@ object
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16776#discussion_r100089611
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -63,44 +63,49 @@ final class DataFrameStatFunctions private[sql
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/12135
@gatorsmile it's a good point about the tests. However, this JIRA & PR were
for exposing the multi-column functionality of `approxQuantiles`. The missing
test cases date back to the original im
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/12135#discussion_r99062470
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -75,13 +76,43 @@ final class DataFrameStatFunctions private[sql
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16002
Doesn't seem like a final decision was made here - I'm generally in
agreement with @srowen @sethah that it doesn't really seem worth changing the
current mechanism.
@yanboli
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/12135
LGTM. @zhengruifeng did you manage to add a JIRA for exposing multi-col
support in SparkR?
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16676
ok to test
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16661#discussion_r97499446
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
@@ -272,6 +277,10 @@ class GaussianMixture private
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/12896
Reviving after a hiatus. Updated since tags. I've actually recently come
across a number of users hitting this issue in production who are unable to use
ALS with cross-validation as a r
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16441
@imatiach-msft thanks for this, really great to have GBT in the
classification trait hierarchy, and now usable with binary evaluator metrics!
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16344
jenkins test this please
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16344
jenkins add to whitelist
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/12896
Jenkins retest this please
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16516#discussion_r95542552
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
---
@@ -365,7 +365,7 @@ class LogisticRegression @Since("
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16158#discussion_r91957172
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ---
@@ -123,7 +124,10 @@ class TrainValidationSplit @Since("
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16139#discussion_r90825257
--- Diff: docs/ml-advanced.md ---
@@ -59,17 +59,22 @@ Given $n$ weighted observations $(w_i, a_i, b_i)$:
The number of features for each
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16020
Yes unit tests would be good to add.
Tests may require using event listeners to check the caching of the
intermediate dataset with/without cached initial data. Or at least that is
the
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16037
I'm sure this will be net positive, and _shouldn't_ cause any regression.
Still, we must be certain. @AnthonyTruchet can you provide for posterity the
detailed test results for the vector
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/15795
ok to test
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16020#discussion_r90599078
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
---
@@ -334,10 +334,10 @@ class KMeans @Since("1.5.0") (
v
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16037
By the way this same issue may also impact the `ml` optimizers that use
L-BFGS. We should check the various gradient aggregators for
`LogisticRegression`, `LinearRegression`, `MLP` etc. cc @sethah
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/15831
I'm also generally supportive of (1) - porting the code to `ml` and having
the `mllib` code wrap the `ml` version - this is the approach taken for other
models that have been ported. Of course only
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16037#discussion_r90421008
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +241,27 @@ object LBFGS extends Logging {
val bcW
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90395065
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90395345
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90394053
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90394630
--- Diff:
examples/src/main/scala/org/apache/spark/examples/ml/ApproxSimilarityJoinExample.scala
---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90395495
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90395294
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90395451
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90394871
--- Diff:
examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala ---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90393584
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90395459
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90394571
--- Diff:
examples/src/main/scala/org/apache/spark/examples/ml/ApproxSimilarityJoinExample.scala
---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90393279
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/15795#discussion_r90393263
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
{% include_example python/ml/chisq_selector_example.py
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16037#discussion_r90391752
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +239,25 @@ object LBFGS extends Logging {
val bcW
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16037
ok to test
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16037#discussion_r90388974
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +239,25 @@ object LBFGS extends Logging {
val bcW
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16037
What worries me more, actually, is that the initial vector should be compressed
when sent in the closure. So why is this issue occurring? Is it a problem
with serialization / compression? Or even
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16037
Right ok. So I think the approach of making the zero vector sparse then
calling `toDense` in `seqOp` as @srowen suggested makes most sense.
Currently the gradient vector *must* be dense in
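The sparse-zero-then-densify idea can be shown with a toy vector type. This is a deliberately simplified model of the suggestion (not Spark's actual `Vector` API or `treeAggregate` signature):

```scala
// Toy model of the suggested fix: keep the zero value sparse in the closure
// and densify only inside seqOp, so the serialized task payload stays small.
sealed trait Vec { def toDense: Array[Double] }
case class Sparse(size: Int) extends Vec {
  def toDense: Array[Double] = Array.fill(size)(0.0)  // densified lazily, per task
}
case class Dense(values: Array[Double]) extends Vec {
  def toDense: Array[Double] = values
}

// seqOp accumulates (index, gradientContribution) pairs into a dense buffer.
def seqOp(acc: Vec, x: (Int, Double)): Vec = {
  val dense = acc.toDense
  dense(x._1) += x._2
  Dense(dense)
}

val result = Seq((0, 1.0), (2, 3.0)).foldLeft(Sparse(3): Vec)(seqOp)
// The Sparse zero costs O(1) to ship; density appears only during aggregation.
```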
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16078
@AnthonyTruchet I think in this case it was just confusing to have many PRs
opened against the issue. One option is to either adjust the existing PR with
changes (so that only one PR is open
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16037
This is all a bit confusing - can we highlight which PR is actually to be
reviewed?
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/15817
Sorry for delay - this LGTM. Given it's been around for a while and given
RC2 is likely to be cut, I've gone ahead and merged to master / branch-2.1.
Thanks!
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/15817
Jenkins retest this please
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16020#discussion_r89740159
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala ---
@@ -273,6 +283,7 @@ class BisectingKMeans @Since("
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16020#discussion_r89740085
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
---
@@ -334,10 +334,8 @@ class KMeans @Since("1.5.0") (
val sum
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/16020#discussion_r89740051
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala ---
@@ -255,10 +256,19 @@ class BisectingKMeans @Since("
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16011
As far as I recall, the idea is that the `Bucketizer` can be used
standalone, and because the `QuantileDiscretizer` itself produced the same
thing as a bucketizer, it was used as the model rather
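What the standalone bucketizer does is small enough to sketch: given sorted split points, map each value to the bucket `[splits(i), splits(i + 1))` that contains it. A linear-scan toy version (the real implementation uses binary search and has stricter out-of-range handling):

```scala
// Toy bucketizer: assign `value` to the index of the half-open interval
// [splits(i), splits(i + 1)) that contains it. Assumes splits are sorted
// and cover the value range (e.g. via +/- infinity sentinels).
def bucketIndex(splits: Array[Double], value: Double): Int = {
  require(splits.length >= 2, "need at least one bucket")
  val i = splits.indexWhere(s => value < s)
  if (i == -1) splits.length - 2 else math.max(i - 1, 0)
}

val splits = Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity)
val b = bucketIndex(splits, 3.5)  // falls in [0.0, 10.0) -> bucket 1
```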
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/16011
Typically the estimator Params are copied to the model though. How do you
propose to set the `handleInvalid` param in, say, a pipeline?
On Fri, 25 Nov 2016 at 18:38, Yanbo Liang wrote
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/15817#discussion_r89609989
--- Diff: python/pyspark/ml/feature.py ---
@@ -158,21 +158,28 @@ class Bucketizer(JavaTransformer, HasInputCol,
HasOutputCol, JavaMLReadable, Jav