[GitHub] spark pull request #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of P...

2016-06-21 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/13778#discussion_r67894162 --- Diff: python/pyspark/sql/tests.py --- @@ -558,6 +558,18 @@ def check_datatype(datatype): _verify_type(PythonOnlyPoint(1.0, 2.0

[GitHub] spark pull request #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of P...

2016-06-21 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/13778#discussion_r67894797 --- Diff: python/pyspark/sql/tests.py --- @@ -558,6 +558,18 @@ def check_datatype(datatype): _verify_type(PythonOnlyPoint(1.0, 2.0

[GitHub] spark pull request #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of P...

2016-06-21 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/13778#discussion_r67894457 --- Diff: python/pyspark/sql/tests.py --- @@ -558,6 +558,18 @@ def check_datatype(datatype): _verify_type(PythonOnlyPoint(1.0, 2.0

[GitHub] spark pull request #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of P...

2016-06-21 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/13778#discussion_r67894974 --- Diff: python/pyspark/sql/tests.py --- @@ -558,6 +558,18 @@ def check_datatype(datatype): _verify_type(PythonOnlyPoint(1.0, 2.0

[GitHub] spark issue #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of Python-o...

2016-06-20 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/13778 Update: looks like the above is just an issue with the __str__ method of udf-returned UDTs, which is a different bug (a bug that's also pretty harmless). --- If your project is set up for it, you

[GitHub] spark issue #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of Python-o...

2016-06-20 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/13778 Another update: https://gist.github.com/vlad17/cfcd42f30ea2380df4fb0bfa30dda7ce unresolved --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub

[GitHub] spark issue #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of Python-o...

2016-06-20 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/13778 Here's an unresolved example: https://gist.github.com/vlad17/2db8e14972344c693e8a3f03d91c9c8d --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub

[GitHub] spark pull request #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of P...

2016-06-24 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/13778#discussion_r68416995 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -427,8 +427,12 @@ case class MapObjects private

[GitHub] spark issue #13724: [SPARK-15973] [PYSPARK] Fix GroupedData Documentation

2016-06-17 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/13724 The issue is that there are still some references to the old `GroupedData._jdf` attribute - if you click on the the Jenkins link you'll find where (but I pasted below too): Error Message

[GitHub] spark pull request #11443: [SPARK-13244][SQL] Migrates DataFrame to Dataset

2016-08-02 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/11443#discussion_r73192952 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -745,6 +825,80 @@ class DataFrame private[sql

[GitHub] spark pull request #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of P...

2016-06-28 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/13778#discussion_r68789054 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -349,11 +349,12 @@ object MapObjects

[GitHub] spark issue #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of Python-o...

2016-07-07 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/13778 LGTM +1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark pull request #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of P...

2016-07-06 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/13778#discussion_r69758782 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -374,13 +407,15 @@ object MapObjects

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]

2016-08-08 Thread vlad17
GitHub user vlad17 opened a pull request: https://github.com/apache/spark/pull/14547 [SPARK-16718][MLlib] gbm-style treeboost [WIP] ## What changes were proposed in this pull request? This change adds TreeBoost functionality to `GBTClassifer` and `GBTRegressor`. The main

[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]

2016-08-08 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @hhbyyh Would you mind reviewing this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]

2016-08-08 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @sethah Thanks for the FYI. I'm pretty confident that it'll help since now we're directly optimizing the loss function. However, it would be nice to prove this. Unfortunately, the example I linked

[GitHub] spark pull request #11443: [SPARK-13244][SQL] Migrates DataFrame to Dataset

2016-08-01 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/11443#discussion_r73048593 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -745,6 +825,80 @@ class DataFrame private[sql

[GitHub] spark issue #11443: [SPARK-13244][SQL] Migrates DataFrame to Dataset

2016-08-01 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/11443 There may be small bug introduced here. Please see my comment inline: https://github.com/apache/spark/pull/11443/files#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R836 --- If your project is set up

[GitHub] spark pull request #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of P...

2016-06-29 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/13778#discussion_r68979443 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -346,14 +346,38 @@ case class LambdaVariable

[GitHub] spark issue #16495: SPARK-16920: Add a stress test for evaluateEachIteration...

2017-02-04 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/16495 Yes, sorry for my wording. A unit test is indeed an inappropriate place for stress tests. An offline test would be sufficient to verify that an O(N) implementation is an improvement over the O(N^2

[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-08-19 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @sethah Was that coupling not already there beforehand? I didn't really change any of the implementation class' interfaces, I just added the Bernoulli impurity to the existing Impurity framework

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78481687 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -42,18 +42,30 @@ import org.apache.spark.sql.types.DoubleType

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78483990 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala --- @@ -258,11 +258,13 @@ private[spark] object GradientBoostedTrees

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78484048 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala --- @@ -258,11 +258,13 @@ private[spark] object GradientBoostedTrees

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78484507 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -183,24 +191,18 @@ private[ml] trait DecisionTreeParams extends

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78484752 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -220,32 +222,42 @@ private[ml] object TreeClassifierParams { final val

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78482838 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala --- @@ -134,11 +146,15 @@ class GBTRegressor @Since("1.4.0") (@Si

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78482844 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala --- @@ -17,13 +17,13 @@ package org.apache.spark.ml.regression

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78482842 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala --- @@ -38,25 +38,35 @@ import org.apache.spark.sql.{DataFrame, Dataset

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78482846 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -148,11 +154,14 @@ class GBTClassifier @Since("

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78484182 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impurity/ApproxBernoulliImpurity.scala --- @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78484135 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impurity/ApproxBernoulliImpurity.scala --- @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78484559 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impurity/ApproxBernoulliImpurity.scala --- @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78484742 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -183,24 +191,18 @@ private[ml] trait DecisionTreeParams extends

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78484674 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -183,24 +191,18 @@ private[ml] trait DecisionTreeParams extends

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78485063 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -465,33 +497,64 @@ private[ml] trait GBTParams extends TreeEnsembleParams

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78485087 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -465,33 +497,64 @@ private[ml] trait GBTParams extends TreeEnsembleParams

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78484998 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -465,33 +497,64 @@ private[ml] trait GBTParams extends TreeEnsembleParams

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78485041 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -501,36 +564,75 @@ private[ml] trait GBTClassifierParams extends GBTParams

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78481662 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -42,18 +42,30 @@ import org.apache.spark.sql.types.DoubleType

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r78483380 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/DTStatsAggregator.scala --- @@ -33,11 +34,13 @@ private[spark] class DTStatsAggregator

[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-09-12 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @jkbradley I addressed your comments (will be pushing new version after tests run), but I didn't understand what you were referring to in the "test gists" comment. Would you mind

[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-10-09 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @setah do you have any opinion on "loss-based" vs. "auto" or @jkbradley do you feel strongly about this? I think the trade-off is between being explicit vs. possibly confusin

[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-11-01 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @jkbradley it seems I can only deprecate `setImpurity`: the value can't be deprecated since it's used internally, which triggers a fatal warning, and getImpurity has scaladoc shared between other

[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-11-01 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @jkbradley There seems to be more issues with deprecating impurity: [error] [warn] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/ml/classification

[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-10-13 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @jkbradley Re test scripts: `res8: Double = 0.5193104784040287` is the value outputted by `counts.max / counts.sum`. Indeed, it's just a sanity check that the value isn't 1 - i.e., we don't

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-10-13 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r83278169 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/GBTSuiteHelper.scala --- @@ -0,0 +1,271 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-10-13 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r83278154 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/GBTSuiteHelper.scala --- @@ -0,0 +1,271 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-10-13 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r83277784 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/GBTSuiteHelper.scala --- @@ -0,0 +1,271 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-10-13 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r83277146 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala --- @@ -223,15 +278,18 @@ private object GBTClassifierSuite

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-10-13 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r83285393 --- Diff: python/pyspark/ml/regression.py --- @@ -1003,20 +1003,20 @@ class GBTRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol

[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost

2016-10-15 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @sethah You raise good points. Regarding (1), I don't know if it is actually true. I don't want to speak for @jkbradley, but I was just going off of "software engineering intuition&qu

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2017-03-15 Thread vlad17
Github user vlad17 commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r106311077 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impurity/ApproxBernoulliImpurity.scala --- @@ -0,0 +1,155 @@ +/* + * Licensed to the Apache

[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost

2017-07-17 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @HyukjinKwon sorry for the inactivity (I have some free time now). @jkbradley is SPARK-4240 still on the roadmap? I can resume work on this (and the subsequent GBT work) --- If your project is set

[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost

2018-05-10 Thread vlad17
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/14547 @thesuperzapper unfortunately I haven't been able to keep up-to-date with Spark over the past year (first year of grad school has been occupying me). I don't think I can make any contributions right