[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

2017-07-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18513#discussion_r127498459 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala --- @@ -0,0 +1,185 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

2017-07-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18513#discussion_r127557871 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/FeatureHasherSuite.scala --- @@ -0,0 +1,193 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

2017-07-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18513#discussion_r127491688 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/FeatureHasherSuite.scala --- @@ -0,0 +1,193 @@ +/* + * Licensed to the Apache Software

[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

2017-07-14 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/18513 Just to clarify: * If I want to treat a column as categorical that is represented by integers, I'd have to map those integers to strings, right? I believe that's one of your bul
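To illustrate the question above, here is a minimal sketch of the hashing trick in plain Python. It assumes, as the question implies, that string columns are treated as categorical and numeric columns as real-valued. The names `bucket` and `hash_row` and the CRC32 hash are hypothetical stand-ins for illustration only; this is not the code under review in the PR.

```python
import zlib


def bucket(term: str, num_features: int) -> int:
    # Stand-in hash (CRC32) chosen for illustration; an assumption, not Spark's hash.
    return zlib.crc32(term.encode("utf-8")) % num_features


def hash_row(row: dict, num_features: int = 16) -> dict:
    """Return a sparse vector {index: value} for one row."""
    vec = {}
    for col, value in row.items():
        if isinstance(value, str):
            # Categorical: one indicator feature per (column, value) pair.
            idx, x = bucket(f"{col}={value}", num_features), 1.0
        else:
            # Numeric: one feature per column, carrying the raw value.
            idx, x = bucket(col, num_features), float(value)
        vec[idx] = vec.get(idx, 0.0) + x
    return vec


# The same integer code, first as a number, then mapped to a string.
numeric = hash_row({"zip": 94105})
categorical = hash_row({"zip": str(94105)})
```

Under these assumptions the integer column contributes its raw magnitude to a single slot, while the string version contributes an indicator of 1.0 for that particular value, which is why the mapping to strings would be needed.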

[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-07-12 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r127064727 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the

[GitHub] spark issue #18305: [SPARK-20988][ML] Logistic regression uses aggregator hi...

2017-07-12 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/18305 Did we reach a consensus on the broadcast variables? My opinion is that it's probably better in this case not to worry about it, and we can back out the change that destroys them in the test s

[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-07-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r126446952 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the

[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-07-05 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r125759270 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the

[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-07-05 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r125757263 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/loss/RDDLossFunction.scala --- @@ -62,8 +62,8 @@ private[ml] class RDDLossFunction[ val

[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-07-05 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r125681761 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala --- @@ -32,40 +34,45 @@ private[ml] trait

[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-07-05 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r125680954 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the

[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-06-28 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r124615112 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/loss/RDDLossFunction.scala --- @@ -50,7 +50,7 @@ private[ml] class RDDLossFunction[ Agg

[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-06-28 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r124614166 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/DifferentiableLossAggregatorSuite.scala --- @@ -157,4 +160,38 @@ object

[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-06-28 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r124615145 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala --- @@ -38,34 +40,39 @@ private[ml] trait

[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-06-28 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r124614829 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the

[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-06-28 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r124615494 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala --- @@ -38,34 +40,39 @@ private[ml] trait

[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-06-27 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r124384213 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/loss/RDDLossFunction.scala --- @@ -62,8 +62,8 @@ private[ml] class RDDLossFunction[ val

[GitHub] spark issue #18305: [SPARK-20988][ML] Logistic regression uses aggregator hi...

2017-06-23 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/18305 also ping @hhbyyh @yanboliang This is a straightforward follow up to https://github.com/apache/spark/pull/17094. Let me know if I can do anything to make the review easier. --- If your project is

[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...

2017-06-22 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/18118 I'll take a look at the changes in the next few days. In the meantime, you can remove "Please review http://spark.apache.org/contributing.html before opening a pull request." from the

[GitHub] spark issue #18389: [SPARK-14174][ML] Add minibatch kmeans

2017-06-22 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/18389 @zhengruifeng Wasn't there some history on this issue? I thought there was another PR? If that's the case, it's always helpful to post links to discussions, or just to summarize the d

[GitHub] spark pull request #18118: SPARK-20199 : Provided featureSubsetStrategy to G...

2017-06-20 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18118#discussion_r123050801 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala --- @@ -150,11 +154,11 @@ class GBTRegressor @Since("1.4.0") (@Si

[GitHub] spark pull request #18118: SPARK-20199 : Provided featureSubsetStrategy to G...

2017-06-20 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18118#discussion_r123049135 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -136,6 +136,10 @@ class GBTClassifier @Since("

[GitHub] spark pull request #18118: SPARK-20199 : Provided featureSubsetStrategy to G...

2017-06-20 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18118#discussion_r123051195 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala --- @@ -118,11 +119,12 @@ class DecisionTreeRegressor @Since

[GitHub] spark pull request #18118: SPARK-20199 : Provided featureSubsetStrategy to G...

2017-06-20 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18118#discussion_r123040612 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala --- @@ -73,19 +75,21 @@ private[spark] object GradientBoostedTrees

[GitHub] spark pull request #18118: SPARK-20199 : Provided featureSubsetStrategy to G...

2017-06-20 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18118#discussion_r123039522 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala --- @@ -140,6 +140,10 @@ class GBTRegressor @Since("1.4.0") (@Si

[GitHub] spark pull request #18118: SPARK-20199 : Provided featureSubsetStrategy to G...

2017-06-20 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18118#discussion_r123049728 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -192,6 +196,9 @@ object GBTClassifier extends

[GitHub] spark pull request #18118: SPARK-20199 : Provided featureSubsetStrategy to G...

2017-06-20 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18118#discussion_r123040767 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala --- @@ -284,11 +290,13 @@ private[spark] object

[GitHub] spark pull request #18118: SPARK-20199 : Provided featureSubsetStrategy to G...

2017-06-20 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18118#discussion_r123051005 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala --- @@ -319,8 +327,10 @@ private[spark] object GradientBoostedTrees

[GitHub] spark pull request #18118: SPARK-20199 : Provided featureSubsetStrategy to G...

2017-06-20 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18118#discussion_r123050956 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala --- @@ -284,11 +290,13 @@ private[spark] object

[GitHub] spark pull request #18118: SPARK-20199 : Provided featureSubsetStrategy to G...

2017-06-20 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18118#discussion_r123042480 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala --- @@ -49,14 +49,16 @@ import org.apache.spark.rdd.RDD @Since

[GitHub] spark issue #18315: [SPARK-21108] [ML] [WIP] convert LinearSVC to aggregator...

2017-06-16 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/18315 Thanks for this pr @hhbyyh. I think we need to add a test suite for the aggregator, but since https://github.com/apache/spark/pull/18305 needs to be merged first, it's fine to wait. If you wou

[GitHub] spark issue #18305: [SPARK-20988][ML] Logistic regression uses aggregator hi...

2017-06-14 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/18305 cc @VinceShieh @MLnick @srowen

[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-06-14 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15435 ping?? @yanboliang @MLnick

[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-06-14 Thread sethah
GitHub user sethah opened a pull request: https://github.com/apache/spark/pull/18305 [SPARK-20988][ML] Logistic regression uses aggregator hierarchy ## What changes were proposed in this pull request? This change pulls the `LogisticAggregator` class out of

[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-06-13 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17862 @hhbyyh Thanks for doing the extra work to use the new aggregator here. I do think it's better to separate those changes from this one, though. There is actually more that needs to be done fo

[GitHub] spark issue #18118: SPARK-20199 : Provided featureSubsetStrategy to GBTClass...

2017-06-07 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/18118 I don't think there's any point in pinging every day :)

[GitHub] spark issue #17094: [SPARK-19762][ML] Hierarchy for consolidating ML aggrega...

2017-06-03 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17094 @srowen Speaking for myself, I think the other concerns can be issued as follow ups, yes.

[GitHub] spark issue #18151: [SPARK-20929][ML] LinearSVC should use its own threshold...

2017-06-02 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/18151 One minor comment, otherwise LGTM. Thanks for catching this @jkbradley!

[GitHub] spark pull request #18151: [SPARK-20929][ML] LinearSVC should use its own th...

2017-06-02 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18151#discussion_r119901231 --- Diff: python/pyspark/ml/classification.py --- @@ -109,6 +109,10 @@ class LinearSVC(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, Ha

[GitHub] spark issue #17894: [WIP][SPARK-17134][ML] Use level 2 BLAS operations in Lo...

2017-06-01 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17894 @VinceShieh Thanks for posting your results. You tested these on datasets with only 100 samples correct? That's probably not a representative use case of a normal workload... Also, how many cl

[GitHub] spark pull request #18151: [SPARK-20929][ML] LinearSVC should use its own th...

2017-05-31 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18151#discussion_r119472374 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala --- @@ -127,6 +127,27 @@ class LinearSVCSuite extends SparkFunSuite

[GitHub] spark issue #17094: [SPARK-19762][ML] Hierarchy for consolidating ML aggrega...

2017-05-31 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17094 Ok, yes all good points. I think since these are all private apis it gives us room for future changes. For now, I think we can get rid of a lot of code duplication and fill in some testing gaps with

[GitHub] spark pull request #18151: [SPARK-20929][ML] LinearSVC should use its own th...

2017-05-30 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18151#discussion_r119275319 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala --- @@ -127,6 +127,14 @@ class LinearSVCSuite extends SparkFunSuite

[GitHub] spark issue #17094: [SPARK-19762][ML] Hierarchy for consolidating ML aggrega...

2017-05-30 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17094 @MLnick I completely agree about the leaky regularization abstraction. In fact, I think the function composition feature would make it easy to get rid of that problem. Consider: In the

[GitHub] spark issue #18120: [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for ensemb...

2017-05-30 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/18120 cc @BryanCutler. Bryan did some work on https://github.com/apache/spark/pull/17849. It seems even with that patch, we still need to add methods like these, hoping Bryan can confirm. If

[GitHub] spark issue #11974: [SPARK-14174][ML] Accelerate KMeans via Mini-Batch EM

2017-05-25 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/11974 Mini-batching in Spark generally isn't that efficient, since to extract a mini-batch you still need to iterate over the entire dataset - and that means reading it from disk if it doesn'
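The cost described above can be sketched in plain Python. This is an illustration only, not Spark code: `sample_minibatch` is a hypothetical helper, and it assumes the mini-batch is drawn by an independent per-record coin flip, so every record has to be visited even though only a small fraction is kept.

```python
import random


def sample_minibatch(records, fraction, seed=0):
    """Bernoulli-sample a mini-batch; also report how many records were scanned."""
    rng = random.Random(seed)
    scanned = 0
    batch = []
    for rec in records:
        scanned += 1  # every record is touched, whether kept or not
        if rng.random() < fraction:
            batch.append(rec)
    return batch, scanned


# A 1% mini-batch still requires a full pass over the data.
batch, scanned = sample_minibatch(range(100_000), fraction=0.01)
```

The work per mini-batch is proportional to the dataset size rather than to the batch size, which is the inefficiency the comment points at.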

[GitHub] spark issue #17094: [SPARK-19762][ML] Hierarchy for consolidating ML aggrega...

2017-05-25 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17094 cc @srowen also

[GitHub] spark issue #11459: [SPARK-13025] Allow users to set initial model in logist...

2017-05-24 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/11459 This can be closed.

[GitHub] spark issue #13959: [SPARK-14351] [MLlib] [ML] Optimize findBestSplits metho...

2017-05-18 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/13959 Yes, this is a tough issue. Let's wait and see if @jkbradley has thoughts on this issue. If we don't hear anything, then I'd leave it up to @MechCoder on whether to reopen. Thanks,

[GitHub] spark issue #13959: [SPARK-14351] [MLlib] [ML] Optimize findBestSplits metho...

2017-05-18 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/13959 This is fine, but are we not also policing JIRAs? I've argued above that the reason this PR has been inactive is simply lack of interest in this issue. If that's the case, then the JIRA mu

[GitHub] spark issue #17094: [SPARK-19762][ML] Hierarchy for consolidating ML aggrega...

2017-05-18 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17094 Thanks @MLnick! I am happy to discuss splitting this into smaller bits as well, if it can make things easier.

[GitHub] spark issue #13959: [SPARK-14351] [MLlib] [ML] Optimize findBestSplits metho...

2017-05-18 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/13959 The lack of bandwidth in MLlib means that sometimes good code that would make an impact just gets ignored. This is kind of the reality of things. However, if we are going to close the PR simply

[GitHub] spark issue #17094: [SPARK-19762][ML] Hierarchy for consolidating ML aggrega...

2017-05-17 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17094 ping! @MLnick @jkbradley @yanboliang @hhbyyh Is there any interest in this? I actually think this cleanup will be a precursor to several different improvements (adding more optimized

[GitHub] spark issue #17586: [SPARK-20249][ML][PYSPARK] Add summary for LinearSVCMode...

2017-05-16 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17586 @MLnick There was some discussion [here](https://github.com/apache/spark/pull/15435) and also on the JIRA for that pr. We definitely want to design it carefully so it's easy to share code.

[GitHub] spark issue #17910: [SPARK-20669][ML] LoR.family and LDA.optimizer should be...

2017-05-15 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17910 @zhengruifeng In the follow up PR, would you mind changing the logistic regression tests to incorporate `setMaxIter(1)`?

[GitHub] spark issue #12761: [SPARK-14464] [MLLIB] Better support for logistic regres...

2017-05-15 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/12761 @daniel-siegmann-aol Good points, and thanks for following up on this.

[GitHub] spark pull request #17910: [SPARK-20669][ML] LogisticRegression family shoul...

2017-05-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17910#discussion_r116408136 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -2318,8 +2319,8 @@ class LogisticRegressionSuite

[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-05-13 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15435 ping! @jkbradley @yanboliang

[GitHub] spark issue #17894: [SPARK-17134][ML] Use level 2 BLAS operations in Logisti...

2017-05-11 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17894 Would you mind adding `[WIP]` to the title? Without even a benchmark for dense features, this is definitely a work-in-progress.

[GitHub] spark pull request #17910: [SPARK-20669][ML] LogisticRegression family shoul...

2017-05-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17910#discussion_r115813053 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -2318,8 +2319,8 @@ class LogisticRegressionSuite

[GitHub] spark issue #17864: [SPARK-20604][ML] Allow imputer to handle numeric types

2017-05-04 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17864 So the other PR https://github.com/apache/spark/pull/11601 is really long. For reference, I am picking out the relevant discussions to this PR (also someone tell me if there's a better way to

[GitHub] spark pull request #17793: [SPARK-20484][MLLIB] Add documentation to ALS cod...

2017-05-04 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17793#discussion_r114893509 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -910,26 +944,143 @@ object ALS extends DefaultParamsReadable[ALS] with

[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS as optimizer for L...

2017-05-04 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r114879308 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -145,6 +164,15 @@ class LinearSVC @Since("2.2.0") (

[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-05-04 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r114818175 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -982,19 +989,33 @@ class LogisticRegressionModel private

[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

2017-05-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17845#discussion_r114660256 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -389,6 +436,17 @@ class ALSModel private[ml

[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

2017-05-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17845#discussion_r114660148 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -389,6 +436,17 @@ class ALSModel private[ml

[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

2017-05-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17845#discussion_r114661114 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -372,11 +385,45 @@ class ALSModel private[ml] ( num: Int

[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

2017-05-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17845#discussion_r114660366 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -356,6 +356,19 @@ class ALSModel private[ml

[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

2017-05-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17845#discussion_r114658306 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -372,11 +385,45 @@ class ALSModel private[ml] ( num: Int

[GitHub] spark pull request #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recomm...

2017-05-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17742#discussion_r114655727 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -274,46 +275,62 @@ object

[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-05-03 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15435 btw @WeichenXu123 you just have to fix merge conflicts while rebasing. This is always possible. Squashing commits is rarely necessary and rarely good practice for an open PR IMO.

[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-05-03 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15435 ping @jkbradley @srowen. Any hope/interest for 2.2? Probably too late, but wanted to check.

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use midpoints for split values.

2017-05-03 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17556 Thanks @srowen!

[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use midpoints for split valu...

2017-05-02 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r114457816 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1037,7 +1042,8 @@ private[spark] object RandomForest extends Logging

[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use midpoints for split valu...

2017-05-01 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r114134468 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1009,10 +1009,24 @@ private[spark] object RandomForest extends

[GitHub] spark pull request #17793: [SPARK-20484][MLLIB] Add documentation to ALS cod...

2017-04-29 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17793#discussion_r114061940 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -910,26 +944,127 @@ object ALS extends DefaultParamsReadable[ALS] with

[GitHub] spark issue #17793: [SPARK-20484][MLLIB] Add documentation to ALS code

2017-04-28 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17793 btw "You can build just the Spark scaladoc by running build/sbt unidoc from the SPARK_PROJECT_ROOT directory." [Link](https://github.com/apache/spark/tree/master/docs)

[GitHub] spark pull request #17793: [SPARK-20484][MLLIB] Add documentation to ALS cod...

2017-04-28 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17793#discussion_r114009389 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -910,26 +944,127 @@ object ALS extends DefaultParamsReadable[ALS] with

[GitHub] spark pull request #17793: [SPARK-20484][MLLIB] Add documentation to ALS cod...

2017-04-28 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17793#discussion_r114003560 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -910,26 +944,127 @@ object ALS extends DefaultParamsReadable[ALS] with

[GitHub] spark pull request #17793: [SPARK-20484][MLLIB] Add documentation to ALS cod...

2017-04-28 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17793#discussion_r114009897 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -1026,7 +1161,24 @@ object ALS extends DefaultParamsReadable[ALS] with

[GitHub] spark pull request #17793: [SPARK-20484][MLLIB] Add documentation to ALS cod...

2017-04-28 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17793#discussion_r114009801 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -910,26 +944,127 @@ object ALS extends DefaultParamsReadable[ALS] with

[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-04-28 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r113950702 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1231,6 +1295,109 @@ class

[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-04-27 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15435 ping @yanboliang @jkbradley This LGTM.

[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-04-27 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r113857158 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1231,6 +1295,109 @@ class

[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-04-27 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r113857182 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1231,6 +1295,109 @@ class

[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-04-27 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15435#discussion_r113856908 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1070,90 +1096,128 @@ private[classification] class

[GitHub] spark issue #17793: [SPARK-20484][MLLIB] Add documentation to ALS code

2017-04-27 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17793 +1 for this change. I'll try to take a look sometime, but maybe after the QA period. Also cc @MLnick.

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-27 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17556 I don't mind the weighted midpoints. However, if for a continuous feature we find that many points have the exact same value, we are assuming we may find data points in the test set that are

[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-27 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r113855186 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1009,10 +1009,24 @@ private[spark] object RandomForest extends

[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-27 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r113855243 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -138,9 +169,10 @@ class RandomForestSuite extends SparkFunSuite

[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-27 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r113854473 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -112,9 +138,11 @@ class RandomForestSuite extends SparkFunSuite

[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-27 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r113855209 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1037,7 +1051,10 @@ private[spark] object RandomForest extends

[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree

2017-04-27 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17503 I think the benefit of this would be for speed at predict time or for model storage. @srowen the nodes don't have to be equal to be merged, they just have to output the same prediction. Since t

[GitHub] spark pull request #17706: [SPARK-20423][ML] fix MLOR coeffs centering when ...

2017-04-21 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17706#discussion_r112746518 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -1204,6 +1207,9 @@ class LogisticRegressionSuite

[GitHub] spark pull request #17706: [SPARK-20423][ML] fix MLOR coeffs centering when ...

2017-04-21 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17706#discussion_r112745372 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -1204,6 +1207,9 @@ class LogisticRegressionSuite

[GitHub] spark issue #17706: [ML] fix MLOR coeffs centering when reg == 0

2017-04-20 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17706 @WeichenXu123 Thanks for the pr. Is there a JIRA? Why is testing "not applicable"? Seems you are correct on this, but could you please provide a good reference?

[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-04-16 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15435 @WeichenXu123 I made a PR to your branch. Can you check it? I think you'll still need to update the Mima file. Also, this may not make 2.2, so then you'd have to update the since tags.

[GitHub] spark issue #17416: [SPARK-20075][CORE][WIP] Support classifier, packaging i...

2017-04-13 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17416 @srowen Can you confirm what happens when the jars are not found in your local m2 cache? Do you still find the `-models` jar in the ivy2 cache?

[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-13 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17556 Seems like a reasonable change. Just left some minor comments.

[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-13 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r111434055 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -104,6 +104,18 @@ class RandomForestSuite extends SparkFunSuite
