Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/13778#discussion_r67894162
--- Diff: python/pyspark/sql/tests.py ---
@@ -558,6 +558,18 @@ def check_datatype(datatype):
_verify_type(PythonOnlyPoint(1.0, 2.0
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/13778#discussion_r67894797
--- Diff: python/pyspark/sql/tests.py ---
@@ -558,6 +558,18 @@ def check_datatype(datatype):
_verify_type(PythonOnlyPoint(1.0, 2.0
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/13778#discussion_r67894457
--- Diff: python/pyspark/sql/tests.py ---
@@ -558,6 +558,18 @@ def check_datatype(datatype):
_verify_type(PythonOnlyPoint(1.0, 2.0
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/13778#discussion_r67894974
--- Diff: python/pyspark/sql/tests.py ---
@@ -558,6 +558,18 @@ def check_datatype(datatype):
_verify_type(PythonOnlyPoint(1.0, 2.0
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/13778
Update: looks like the above is just an issue with the `__str__` method of
udf-returned UDTs, which is a different, and pretty harmless, bug.
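A minimal sketch of the kind of display bug described above (toy Python classes, not Spark's actual UDT machinery; `PlainPoint` and `PrintablePoint` are hypothetical names):

```python
# Hypothetical illustration: a point type with no __str__/__repr__ falls back
# to object's default, producing "<...object at 0x...>" even though the field
# values themselves are intact -- the data is fine, only the display is broken.
class PlainPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PrintablePoint(PlainPoint):
    def __str__(self):
        return "PrintablePoint(%r, %r)" % (self.x, self.y)

p, q = PlainPoint(1.0, 2.0), PrintablePoint(1.0, 2.0)
print(str(p))  # unhelpful default repr, e.g. <__main__.PlainPoint object at 0x...>
print(str(q))  # readable: PrintablePoint(1.0, 2.0)
```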
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub.
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/13778
Another update, a still-unresolved example:
https://gist.github.com/vlad17/cfcd42f30ea2380df4fb0bfa30dda7ce
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/13778
Here's an unresolved example:
https://gist.github.com/vlad17/2db8e14972344c693e8a3f03d91c9c8d
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/13778#discussion_r68416995
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
---
@@ -427,8 +427,12 @@ case class MapObjects private
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/13724
The issue is that there are still some references to the old
`GroupedData._jdf` attribute - if you click on the Jenkins link you'll find
where (but I pasted below too):
Error Message
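The failure mode described above can be sketched in plain Python (toy classes, not PySpark's real `GroupedData`): after a refactor renames a private attribute, any leftover reference to the old name fails with `AttributeError` at call time.

```python
# Illustrative only: GroupedDataOld/GroupedDataNew and stale_caller are
# hypothetical names standing in for code before and after the refactor.
class GroupedDataOld:
    def __init__(self, jdf):
        self._jdf = jdf          # old attribute name

class GroupedDataNew:
    def __init__(self, jgd):
        self._jgd = jgd          # renamed during the refactor

def stale_caller(gd):
    return gd._jdf               # still references the old name

stale_caller(GroupedDataOld("handle"))   # works
try:
    stale_caller(GroupedDataNew("handle"))
except AttributeError as e:
    print("broken reference:", e)
```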
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/11443#discussion_r73192952
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -745,6 +825,80 @@ class DataFrame private[sql
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/13778#discussion_r68789054
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
---
@@ -349,11 +349,12 @@ object MapObjects
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/13778
LGTM +1
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/13778#discussion_r69758782
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
---
@@ -374,13 +407,15 @@ object MapObjects
GitHub user vlad17 opened a pull request:
https://github.com/apache/spark/pull/14547
[SPARK-16718][MLlib] gbm-style treeboost [WIP]
## What changes were proposed in this pull request?
This change adds TreeBoost functionality to `GBTClassifier` and
`GBTRegressor`. The main
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@hhbyyh Would you mind reviewing this?
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@sethah Thanks for the FYI. I'm pretty confident that it'll help since now
we're directly optimizing the loss function. However, it would be nice to prove
this. Unfortunately, the example I linked
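The "directly optimizing the loss function" idea can be sketched in plain Python (a toy, not Spark's implementation): boosting fits each new "tree" (here collapsed to a single constant leaf) to the negative gradient of the loss, so every iteration moves the prediction in the direction that reduces training loss.

```python
# Toy gradient boosting on squared loss. The residual y - pred is exactly the
# negative gradient of squared loss, and a depth-0 "tree" is just the mean
# residual; lr is the shrinkage/learning rate.
def boost(ys, rounds=50, lr=0.1):
    pred = [0.0] * len(ys)
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, pred)]   # -gradient of squared loss
        step = sum(resid) / len(resid)              # one-leaf "tree"
        pred = [p + lr * step for p in pred]
    return pred

def sq_loss(pred, ys):
    return sum((p - y) ** 2 for p, y in zip(pred, ys)) / len(ys)

ys = [1.0, 2.0, 3.0, 10.0]
before = sq_loss([0.0] * len(ys), ys)
after = sq_loss(boost(ys), ys)
assert after < before                               # loss strictly decreases
```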
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/11443#discussion_r73048593
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -745,6 +825,80 @@ class DataFrame private[sql
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/11443
There may be a small bug introduced here. Please see my comment inline:
https://github.com/apache/spark/pull/11443/files#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R836
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/13778#discussion_r68979443
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
---
@@ -346,14 +346,38 @@ case class LambdaVariable
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/16495
Yes, sorry for my wording. A unit test is indeed an inappropriate place for
stress tests. An offline test would be sufficient to verify that an O(N)
implementation is an improvement over the O(N^2) one.
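The asymptotic gap at issue can be illustrated generically (this is not the PR's code, just a standard example): a running-total prefix sum is O(N), while recomputing each prefix from scratch is O(N^2), with identical results.

```python
# O(N): one pass with a running total.
def prefix_sums_linear(xs):
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out

# O(N^2): re-sums xs[0..i] from scratch for every i.
def prefix_sums_quadratic(xs):
    return [sum(xs[: i + 1]) for i in range(len(xs))]

xs = list(range(1000))
assert prefix_sums_linear(xs) == prefix_sums_quadratic(xs)
```

An offline stress test would time both on large N; a unit test only needs the small-input equality check.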
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@sethah Was that coupling not already there beforehand? I didn't really
change any of the implementation class' interfaces; I just added the Bernoulli
impurity to the existing Impurity framework
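The point about fitting into the existing framework can be sketched in plain Python (simplified relative to Spark's Impurity trait; function names here are hypothetical): an impurity is just a function of a node's per-class counts, so a Bernoulli (log-loss) impurity can sit beside Gini behind the same interface.

```python
import math

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def bernoulli(counts):
    # Binary log-loss of predicting the node's positive fraction p.
    n = sum(counts)
    p = counts[1] / n
    if p in (0.0, 1.0):
        return 0.0              # pure node: no loss
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

# Both plug into the same counts-based interface.
for impurity in (gini, bernoulli):
    assert impurity([5, 0]) == 0.0      # pure nodes score zero either way
```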
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78481687
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -42,18 +42,30 @@ import org.apache.spark.sql.types.DoubleType
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78483990
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala
---
@@ -258,11 +258,13 @@ private[spark] object GradientBoostedTrees
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78484048
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala
---
@@ -258,11 +258,13 @@ private[spark] object GradientBoostedTrees
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78484507
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---
@@ -183,24 +191,18 @@ private[ml] trait DecisionTreeParams extends
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78484752
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---
@@ -220,32 +222,42 @@ private[ml] object TreeClassifierParams {
final val
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78482838
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala ---
@@ -134,11 +146,15 @@ class GBTRegressor @Since("1.4.0") (@Si
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78482844
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala ---
@@ -17,13 +17,13 @@
package org.apache.spark.ml.regression
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78482842
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala ---
@@ -38,25 +38,35 @@ import org.apache.spark.sql.{DataFrame, Dataset
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78482846
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -148,11 +154,14 @@ class GBTClassifier @Since("
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78484182
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impurity/ApproxBernoulliImpurity.scala
---
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78484135
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impurity/ApproxBernoulliImpurity.scala
---
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78484559
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impurity/ApproxBernoulliImpurity.scala
---
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78484742
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---
@@ -183,24 +191,18 @@ private[ml] trait DecisionTreeParams extends
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78484674
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---
@@ -183,24 +191,18 @@ private[ml] trait DecisionTreeParams extends
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78485063
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---
@@ -465,33 +497,64 @@ private[ml] trait GBTParams extends
TreeEnsembleParams
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78485087
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---
@@ -465,33 +497,64 @@ private[ml] trait GBTParams extends
TreeEnsembleParams
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78484998
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---
@@ -465,33 +497,64 @@ private[ml] trait GBTParams extends
TreeEnsembleParams
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78485041
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---
@@ -501,36 +564,75 @@ private[ml] trait GBTClassifierParams extends
GBTParams
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78481662
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -42,18 +42,30 @@ import org.apache.spark.sql.types.DoubleType
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r78483380
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/DTStatsAggregator.scala ---
@@ -33,11 +34,13 @@ private[spark] class DTStatsAggregator
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@jkbradley I addressed your comments (will be pushing a new version after
tests run), but I didn't understand what you were referring to in the "test
gists" comment. Would you mind
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@sethah do you have any opinion on "loss-based" vs. "auto", or @jkbradley do
you feel strongly about this? I think the trade-off is between being explicit
vs. possibly confusing
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@jkbradley it seems I can only deprecate `setImpurity`: the value can't be
deprecated since it's used internally, which triggers a fatal warning, and
getImpurity has scaladoc shared between other
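The comment above concerns Scala's `@deprecated` annotation under fatal warnings; a rough Python analogy (illustrative only, using the stdlib `warnings` module) shows the same bind: deprecating a setter is easy, but internal code that still calls it now trips the deprecation warning too.

```python
import warnings

# Model and set_impurity are hypothetical names mirroring the situation
# described above, not Spark's actual API.
class Model:
    def set_impurity(self, value):
        warnings.warn("set_impurity is deprecated", DeprecationWarning)
        self._impurity = value

    def fit(self):
        self.set_impurity("variance")   # internal use trips the warning

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    Model().fit()
print(len(caught), "deprecation warning(s) from internal use")
```

Under a warnings-as-errors build (the Scala situation), that internal call would fail the build outright.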
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@jkbradley There seem to be more issues with deprecating impurity:
[error] [warn]
/home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/ml/classification
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@jkbradley Re test scripts:
`res8: Double = 0.5193104784040287` is the value output by `counts.max /
counts.sum`. Indeed, it's just a sanity check that the value isn't 1 - i.e., we
don't
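The sanity check described above can be sketched with toy numbers (hypothetical counts, not the PR's actual test data):

```python
# Majority-class fraction counts.max / counts.sum should be strictly below
# 1.0, i.e. the model's predictions are not collapsed onto a single class.
counts = [519, 481]                       # hypothetical per-class prediction counts
majority_fraction = max(counts) / sum(counts)
assert 0.0 < majority_fraction < 1.0      # not degenerate: both classes predicted
print(majority_fraction)
```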
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r83278169
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/util/GBTSuiteHelper.scala ---
@@ -0,0 +1,271 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r83278154
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/util/GBTSuiteHelper.scala ---
@@ -0,0 +1,271 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r83277784
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/util/GBTSuiteHelper.scala ---
@@ -0,0 +1,271 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r83277146
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala
---
@@ -223,15 +278,18 @@ private object GBTClassifierSuite
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r83285393
--- Diff: python/pyspark/ml/regression.py ---
@@ -1003,20 +1003,20 @@ class GBTRegressor(JavaEstimator, HasFeaturesCol,
HasLabelCol, HasPredictionCol
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@sethah You raise good points.
Regarding (1), I don't know if it is actually true. I don't want to speak
for @jkbradley, but I was just going off of "software engineering intuition"
Github user vlad17 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14547#discussion_r106311077
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impurity/ApproxBernoulliImpurity.scala
---
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@HyukjinKwon sorry for the inactivity (I have some free time now).
@jkbradley is SPARK-4240 still on the roadmap? I can resume work on this (and
the subsequent GBT work)
Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@thesuperzapper unfortunately I haven't been able to keep up to date with
Spark over the past year (the first year of grad school has been occupying me). I
don't think I can make any contributions right