[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

2017-02-28 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17090 Finally, I've done some work related to [SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968) and have a potential solution that seems to be pretty good. In this case it should be

[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

2017-02-28 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17090 I should note that I've found the performance of "recommend all" to be very dependent on number of partitions since it controls the memory consumption per task (which can easily

[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

2017-02-28 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17090 The performance of #12574 is not better than the existing `mllib` recommend-all - since it wraps the functionality it's roughly on par. --- If your project is set up for it, you can reply to

[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

2017-02-28 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17090 Fitting into the CV / evaluator is actually fairly straightforward. It's just that the semantics of `transform` for top-k recommendation must fit into whatever we decide on for `RankingEval

[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

2017-02-28 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17090 @jkbradley do we propose to add further methods to support recommending for all users (or items) in an input DF? like `recommendForAllUsers(dataset: DataFrame, num: Int)`?

[GitHub] spark pull request #17102: [SPARK-19345][ML][DOC] Add doc for "coldStartStra...

2017-02-28 Thread MLnick
GitHub user MLnick opened a pull request: https://github.com/apache/spark/pull/17102 [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" usage in ALS [SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489) added the ability to skip `NaN` predicti
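A minimal sketch of why skipping `NaN` predictions matters for evaluation, written in plain Python rather than against the Spark API. The `rmse` and `drop_nan` helpers and the sample pairs are hypothetical and only illustrate the idea: one `NaN` prediction (as produced for a user or item unseen during training) turns the whole metric into `NaN`, while dropping those rows first leaves a usable number.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (prediction, label) pairs."""
    squared = [(p - y) ** 2 for p, y in pairs]
    return math.sqrt(sum(squared) / len(squared))

def drop_nan(pairs):
    """Keep only the pairs whose prediction is a real number."""
    return [(p, y) for p, y in pairs if not math.isnan(p)]

# Hypothetical (prediction, label) pairs; the NaN stands in for a
# prediction on a user or item that was not seen during training.
scored = [(3.5, 4.0), (float("nan"), 2.0), (1.0, 1.0)]

print(math.isnan(rmse(scored)))  # True: a single NaN poisons the metric
print(rmse(drop_nan(scored)))    # finite once the NaN rows are dropped
```

In Spark ML the equivalent switch is the `coldStartStrategy` param on ALS named in the PR title; the code above does not call it and makes no claim about how it is implemented.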

[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

2017-02-28 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/12896 Merged to master

[GitHub] spark issue #17076: [SPARK-19745][ML] SVCAggregator captures coefficients in...

2017-02-28 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17076 @sethah a quick glance at the screenshots seems to indicate the processing time went up? Which seems a bit odd. Of course it's a small test so maybe just noise.

[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

2017-02-28 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/17059#discussion_r103421071 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends

[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

2017-02-28 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17090 For performance tests, I've been using the MovieLens `ml-latest` dataset [here](https://grouplens.org/datasets/movielens/). It has `24,404,096` ratings with `259,137` users and `39,443` m

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

2017-02-28 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17059 Ok, let me take a look at this. Thanks!

[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

2017-02-28 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17090 #12574 is a comprehensive solution that also intends to support cross-validation as well as recommending for a subset (or any arbitrary set) of users/items. So it solves [SPARK-10802](https

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

2017-02-27 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17059 @datumbox you mention there is GC & performance overhead which makes some sense. Have you run into problems with very large scale (like millions users & items & ratings)? I did regr

[GitHub] spark pull request #17076: [SPARK-19745][ML] SVCAggregator captures coeffici...

2017-02-27 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/17076#discussion_r103187723 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -440,19 +440,9 @@ private class LinearSVCAggregator

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-23 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17000 cc @yanboliang - it seems actually similar in effect to the VL-BFGS work with RDD-based coefficients?

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-23 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17000 I'm not totally certain there will be some huge benefit with porting vector summary to UDAF framework. But there are API-level benefits to doing so. Perhaps there is a way to incorporat

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-23 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17000 @ZunwenYou yes I understand that the `sliceAggregate` is different from SPARK-19634 and more comparable to `treeAggregate`. But I'm not sure, if we plan to port the vector summary to use `Data

[GitHub] spark issue #17034: [SPARK-19704][ML] AFTSurvivalRegression should support n...

2017-02-23 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17034 As commented we could I guess try to fit in the additional tests into `checkNumericTypes` - but it's specific to AFT so doesn't seem worth it for now. So, this LGTM.

[GitHub] spark pull request #17034: [SPARK-19704][ML] AFTSurvivalRegression should su...

2017-02-23 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/17034#discussion_r102727229 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala --- @@ -361,6 +363,36 @@ class AFTSurvivalRegressionSuite

[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...

2017-02-22 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16971 Yes my point was returning null is not very idiomatic in Scala. Better to return Option or empty collection. Option doesn't work for Java compat, so empty Array is best in this case I be

[GitHub] spark issue #17016: [SPARK-19679][ML] Destroy broadcasted object without blo...

2017-02-22 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17016 Merged to master

[GitHub] spark issue #17021: [SPARK-19694][ML] Add missing 'setTopicDistributionCol' ...

2017-02-22 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17021 Merge to master

[GitHub] spark issue #17021: [SPARK-19694][ML] Add missing 'setTopicDistributionCol' ...

2017-02-22 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17021 LGTM

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-20 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17000 Is the speedup coming mostly from the `MultivariateOnlineSummarizer` stage? See https://issues.apache.org/jira/browse/SPARK-19634 which is for porting this operation to use DataFrame UDAF

[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-20 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102146260 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala --- @@ -78,7 +80,13 @@ object StatFunctions extends Logging

[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-20 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102145908 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala --- @@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql

[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-20 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102145412 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala --- @@ -54,6 +54,8 @@ object StatFunctions extends Logging

[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-20 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102145538 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala --- @@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql

[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-20 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102146144 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala --- @@ -214,20 +214,29 @@ class DataFrameStatSuite extends QueryTest with

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-20 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17000 Just to be clear - this is essentially just splitting an array up into smaller chunks so that overall communication is more efficient? It would be good to look at why Spark is not doing a good job
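The reading in the question above (split an array into smaller chunks and aggregate each chunk separately) can be sketched in plain Python. This is an illustration under assumptions, not the code from the PR: `split_slices` and `slice_sum` are hypothetical names, and the per-slice loop stands in for work that a distributed system could hand to different workers.

```python
def split_slices(vec, num_slices):
    """Split vec into num_slices contiguous chunks of near-equal size."""
    n = len(vec)
    bounds = [n * i // num_slices for i in range(num_slices + 1)]
    return [vec[bounds[i]:bounds[i + 1]] for i in range(num_slices)]

def slice_sum(vectors, num_slices):
    """Element-wise sum of equal-length vectors, reduced one slice at a time."""
    sliced = [split_slices(v, num_slices) for v in vectors]
    result = []
    for s in range(num_slices):
        # Every slice index is reduced independently of the others, so no
        # single reducer ever has to hold or receive a full-length vector.
        chunks = [parts[s] for parts in sliced]
        result.extend(sum(column) for column in zip(*chunks))
    return result

# Hypothetical per-partition partial results of length 5.
partials = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [100, 200, 300, 400, 500]]
print(slice_sum(partials, 2))  # prints [111, 222, 333, 444, 555]
```

The result is the same as summing the whole vectors at once; only the unit of communication changes, which is the efficiency argument the comment asks about.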

[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-20 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16965 cc @sethah @jkbradley

[GitHub] spark issue #16966: [SPARK-18409][ML]LSH approxNearestNeighbors should use a...

2017-02-20 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16966 @Yunni have you verified what performance improvement this gives?

[GitHub] spark pull request #16966: [SPARK-18409][ML]LSH approxNearestNeighbors shoul...

2017-02-20 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16966#discussion_r102005885 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -147,6 +148,15 @@ private[ml] abstract class LSHModel[T <: LSHMode

[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

2017-02-20 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/12896 jenkins retest this please

[GitHub] spark issue #16774: [SPARK-19357][ML][WIP] Adding parallel model evaluation ...

2017-02-15 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16774 I'd say coming up with a heuristic or algorithm to automatically set the parallel execution param is going to be pretty challenging, since it depends on the details of the individual pip

[GitHub] spark pull request #16776: [SPARK-19436][SQL] Add missing tests for approxQu...

2017-02-14 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16776#discussion_r101155454 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala --- @@ -63,44 +63,49 @@ final class DataFrameStatFunctions private[sql

[GitHub] spark pull request #16776: [SPARK-19436][SQL] Add missing tests for approxQu...

2017-02-14 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16776#discussion_r101156697 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala --- @@ -58,49 +58,52 @@ final class DataFrameStatFunctions private[sql

[GitHub] spark pull request #16776: [SPARK-19436][SQL] Add missing tests for approxQu...

2017-02-14 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16776#discussion_r101152427 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala --- @@ -159,16 +159,72 @@ class DataFrameStatSuite extends QueryTest with

[GitHub] spark pull request #16774: [SPARK-19357][ML][WIP] Adding parallel model eval...

2017-02-13 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r100933636 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala --- @@ -106,18 +110,21 @@ class TrainValidationSplit @Since("

[GitHub] spark pull request #16774: [SPARK-19357][ML][WIP] Adding parallel model eval...

2017-02-13 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r100933844 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala --- @@ -106,18 +110,21 @@ class TrainValidationSplit @Since("

[GitHub] spark pull request #16774: [SPARK-19357][ML][WIP] Adding parallel model eval...

2017-02-13 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r100932890 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -100,31 +104,44 @@ class CrossValidator @Since("1.2.0") (@Si

[GitHub] spark pull request #16774: [SPARK-19357][ML][WIP] Adding parallel model eval...

2017-02-13 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r100934267 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -51,7 +51,7 @@ private[ml] trait CrossValidatorParams extends

[GitHub] spark pull request #16774: [SPARK-19357][ML][WIP] Adding parallel model eval...

2017-02-13 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r100934570 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -100,31 +104,44 @@ class CrossValidator @Since("1.2.0") (@Si

[GitHub] spark pull request #16774: [SPARK-19357][ML][WIP] Adding parallel model eval...

2017-02-13 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r100932022 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -100,31 +104,44 @@ class CrossValidator @Since("1.2.0") (@Si

[GitHub] spark pull request #16774: [SPARK-19357][ML][WIP] Adding parallel model eval...

2017-02-13 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r100934338 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala --- @@ -67,6 +67,17 @@ private[ml] trait ValidatorParams extends HasSeed

[GitHub] spark pull request #16774: [SPARK-19357][ML][WIP] Adding parallel model eval...

2017-02-13 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r100933608 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala --- @@ -106,18 +110,21 @@ class TrainValidationSplit @Since("

[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16715#discussion_r100927448 --- Diff: python/pyspark/ml/feature.py --- @@ -120,6 +122,198 @@ def getThreshold(self): return self.getOrDefault(self.threshold

[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16715#discussion_r100927378 --- Diff: python/pyspark/ml/feature.py --- @@ -120,6 +122,198 @@ def getThreshold(self): return self.getOrDefault(self.threshold

[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16715#discussion_r100930770 --- Diff: docs/ml-features.md --- @@ -1558,6 +1558,15 @@ for more details on the API. {% include_example java/org/apache/spark/examples/ml

[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16715#discussion_r100929903 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala --- @@ -38,40 +39,45 @@ object

[GitHub] spark pull request #16776: [SPARK-19436][SQL] Add missing tests for approxQu...

2017-02-08 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16776#discussion_r100089611 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala --- @@ -63,44 +63,49 @@ final class DataFrameStatFunctions private[sql

[GitHub] spark issue #12135: [SPARK-14352][SQL] approxQuantile should support multi c...

2017-02-02 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/12135 @gatorsmile it's a good point about the tests. However this JIRA & PR was for exposing the multi-column functionality of `approxQuantiles`. The missing test cases date back to original im

[GitHub] spark pull request #12135: [SPARK-14352][SQL] approxQuantile should support ...

2017-02-01 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/12135#discussion_r99062470 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala --- @@ -75,13 +76,43 @@ final class DataFrameStatFunctions private[sql

[GitHub] spark issue #16002: [SPARK-18341][ML] Eliminate use of SingularMatrixExcepti...

2017-01-25 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16002 Doesn't seem like a final decision was made here - I'm generally in agreement with @srowen @sethah that it doesn't really seem worth changing the current mechanism. @yanboli

[GitHub] spark issue #12135: [SPARK-14352][SQL] approxQuantile should support multi c...

2017-01-24 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/12135 LGTM. @zhengruifeng did you manage to add a JIRA for exposing multi-col support in SparkR?

[GitHub] spark issue #16676: delete useless var “j”

2017-01-24 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16676 ok to test

[GitHub] spark pull request #16661: [SPARK-19313][ML][MLLIB] GaussianMixture should l...

2017-01-24 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16661#discussion_r97499446 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala --- @@ -272,6 +277,10 @@ class GaussianMixture private

[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

2017-01-24 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/12896 Reviving after a hiatus. Updated since tags. I've actually recently come across a number of users hitting this issue in production and are unable to use ALS with cross-validation as a r

[GitHub] spark issue #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict probabi...

2017-01-19 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16441 @imatiach-msft thanks for this, really great to have GBT in the classification trait hierarchy, and now usable with binary evaluator metrics!

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-18 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16344 jenkins test this please

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-18 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16344 jenkins add to whitelist

[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

2017-01-18 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/12896 Jenkins retest this please

[GitHub] spark pull request #16516: [SPARK-19155][ML] Make some string params of ML a...

2017-01-11 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16516#discussion_r95542552 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -365,7 +365,7 @@ class LogisticRegression @Since("

[GitHub] spark pull request #16158: [SPARK-18724][ML] Add TuningSummary for TrainVali...

2016-12-12 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16158#discussion_r91957172 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala --- @@ -123,7 +124,10 @@ class TrainValidationSplit @Since("

[GitHub] spark pull request #16139: [SPARK-18705][ML][DOC] Update user guide to refle...

2016-12-05 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16139#discussion_r90825257 --- Diff: docs/ml-advanced.md --- @@ -59,17 +59,22 @@ Given $n$ weighted observations $(w_i, a_i, b_i)$: The number of features for each

[GitHub] spark issue #16020: [SPARK-18596][ML] add checking and caching to bisecting ...

2016-12-02 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16020 Yes unit tests would be good to add. Tests may require using event listeners to check the caching of the intermediate dataset with/without cached initial data. Or at least that is the

[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-12-02 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16037 I'm sure this will be net positive, and _shouldn't_ cause any regression. Still, we must be certain. @AnthonyTruchet can you provide for posterity the detailed test results for the vector

[GitHub] spark issue #15795: [SPARK-18081] Add user guide for Locality Sensitive Hash...

2016-12-02 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/15795 ok to test

[GitHub] spark pull request #16020: [SPARK-18596][ML] add checking and caching to bis...

2016-12-01 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16020#discussion_r90599078 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -334,10 +334,10 @@ class KMeans @Since("1.5.0") ( v

[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-12-01 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16037 By the way this same issue may also impact the `ml` optimizers that use L-BFGS. We should check the various gradient aggregators for `LogisticRegression`, `LinearRegression`, `MLP` etc. cc @sethah

[GitHub] spark issue #15831: [SPARK-18385][ML] Make the transformer's natively in ml ...

2016-12-01 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/15831 I'm also generally supportive of (1) - porting the code to `ml` and having the `mllib` code wrap the `ml` version - this is the approach for other models that have been done. Of course only

[GitHub] spark pull request #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-12-01 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16037#discussion_r90421008 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala --- @@ -241,16 +241,27 @@ object LBFGS extends Logging { val bcW

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90395065 --- Diff: docs/ml-features.md --- @@ -1478,3 +1478,139 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90395345 --- Diff: docs/ml-features.md --- @@ -1478,3 +1478,139 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90394053 --- Diff: docs/ml-features.md --- @@ -1478,3 +1478,139 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90394630 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ApproxSimilarityJoinExample.scala --- @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90395495 --- Diff: docs/ml-features.md --- @@ -1478,3 +1478,139 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90395294 --- Diff: docs/ml-features.md --- @@ -1478,3 +1478,139 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90395451 --- Diff: docs/ml-features.md --- @@ -1478,3 +1478,139 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90394871 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala --- @@ -0,0 +1,54 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90393584 --- Diff: docs/ml-features.md --- @@ -1478,3 +1478,139 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90395459 --- Diff: docs/ml-features.md --- @@ -1478,3 +1478,139 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90394571 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ApproxSimilarityJoinExample.scala --- @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90393279 --- Diff: docs/ml-features.md --- @@ -1478,3 +1478,139 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90393263 --- Diff: docs/ml-features.md --- @@ -1478,3 +1478,139 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py

[GitHub] spark pull request #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16037#discussion_r90391752 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala --- @@ -241,16 +239,25 @@ object LBFGS extends Logging { val bcW

[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-30 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16037 ok to test

[GitHub] spark pull request #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16037#discussion_r90388974 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala --- @@ -241,16 +239,25 @@ object LBFGS extends Logging { val bcW

[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-30 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16037 What worries me more actually is that the initial vector when sent in the closure should be compressed. So why is this issue occurring? Is it a problem with serialization / compression? OR even

[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-30 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16037 Right ok. So I think the approach of making the zero vector sparse then calling `toDense` in `seqOp` as @srowen suggested makes most sense. Currently the gradient vector *must* be dense in

[GitHub] spark issue #16078: [SPARK-18471][MLLIB] Fix huge vectors of zero send in cl...

2016-11-30 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16078 @AnthonyTruchet I think in this case it was just confusing to have many PRs opened against the issue. One option is to either adjust the existing PR with changes (so that only one PR is open

[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-30 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16037 This is all a bit confusing - can we highlight which PR is actually to be reviewed?

[GitHub] spark issue #15817: [SPARK-18366][PYSPARK][ML] Add handleInvalid to Pyspark ...

2016-11-30 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/15817 Sorry for delay - this LGTM. Given it's been around for a while and given RC2 is likely to be cut, I've gone ahead and merged to master / branch-2.1. Thanks!

[GitHub] spark issue #15817: [SPARK-18366][PYSPARK][ML] Add handleInvalid to Pyspark ...

2016-11-29 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/15817 Jenkins retest this please

[GitHub] spark pull request #16020: [SPARK-18596][ML] add checking and caching to bis...

2016-11-28 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16020#discussion_r89740159 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala --- @@ -273,6 +283,7 @@ class BisectingKMeans @Since("

[GitHub] spark pull request #16020: [SPARK-18596][ML] add checking and caching to bis...

2016-11-28 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16020#discussion_r89740085 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -334,10 +334,8 @@ class KMeans @Since("1.5.0") ( val sum

[GitHub] spark pull request #16020: [SPARK-18596][ML] add checking and caching to bis...

2016-11-28 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16020#discussion_r89740051 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala --- @@ -255,10 +256,19 @@ class BisectingKMeans @Since("

[GitHub] spark issue #16011: [SPARK-18587][ML] Remove handleInvalid from QuantileDisc...

2016-11-27 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16011 As far as I recall, the idea is that the `Bucketizer` can be used standalone, and because the `QuantileDiscretizer` itself produced the same thing as a bucketizer, it was used as the model rather

[GitHub] spark issue #16011: [SPARK-18587][ML] Remove handleInvalid from QuantileDisc...

2016-11-25 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16011 Typically the estimator Params are copied to the model though. How do you propose to set the handle invalid param in say a pipeline? On Fri, 25 Nov 2016 at 18:38, Yanbo Liang wrote

[GitHub] spark pull request #15817: [SPARK-18366][PYSPARK][ML] Add handleInvalid to P...

2016-11-25 Thread MLnick
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15817#discussion_r89609989 --- Diff: python/pyspark/ml/feature.py --- @@ -158,21 +158,28 @@ class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, Jav
