[GitHub] spark pull request #20403: [MINOR][PYTHON] Minor doc correction for 'spark.s...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20403#discussion_r164160358

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1045,9 +1045,10 @@ object SQLConf {
   buildConf("spark.sql.execution.arrow.enabled")
     .internal()
     .doc("Make use of Apache Arrow for columnar data transfers. Currently available " +
--- End diff --

`When true, make use of`

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20332: [SPARK-23138][ML][DOC] Multiclass logistic regression su...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20332 Merged build finished. Test PASSed.
[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20402 LGTM
[GitHub] spark pull request #20402: [SPARK-23223][SQL] Make stacking dataset transfor...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20402#discussion_r164159613

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala ---
@@ -66,7 +54,16 @@ class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) {
   lazy val analyzed: LogicalPlan = {
     SparkSession.setActiveSession(sparkSession)
-    sparkSession.sessionState.analyzer.execute(logical)
+    val plan = sparkSession.sessionState.analyzer.execute(logical)
+    try {
+      sparkSession.sessionState.analyzer.checkAnalysis(plan)
+      EliminateBarriers(plan)
--- End diff --

In the future, we can revisit all the rules we put in the optimizer's `Finish Analysis` batch. It might make sense to introduce a dedicated batch here.
[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20369 **[Test build #86714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86714/testReport)** for PR 20369 at commit [`d311d56`](https://github.com/apache/spark/commit/d311d5639b3af9123e0c6dbe38468f0172e06712).
[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks
Github user jose-torres commented on the issue: https://github.com/apache/spark/pull/20369 The above failure is an unrelated issue which https://github.com/apache/spark/pull/20398 is out to fix.
[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks
Github user jose-torres commented on the issue: https://github.com/apache/spark/pull/20369 retest this please
[GitHub] spark issue #20332: [SPARK-23138][ML][DOC] Multiclass logistic regression su...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20332 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86713/ Test PASSed.
[GitHub] spark issue #20332: [SPARK-23138][ML][DOC] Multiclass logistic regression su...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20332 **[Test build #86713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86713/testReport)** for PR 20332 at commit [`ac7a4ae`](https://github.com/apache/spark/commit/ac7a4aeb1c2f76e25f611c167ab8726069589a3e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #20347: [SPARK-20129][Core] JavaSparkContext should use SparkCon...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20347 @rekhajoshm I think maybe the right resolution here is to do nothing. I haven't heard from @mengxr on his old JIRA about making this change. Thank you for chasing down open JIRAs like this, of course.
[GitHub] spark issue #20410: [SPARK-23234][ML][PYSPARK] Remove default outputCol in p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20410 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86711/ Test FAILed.
[GitHub] spark issue #20410: [SPARK-23234][ML][PYSPARK] Remove default outputCol in p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20410 **[Test build #86711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86711/testReport)** for PR 20410 at commit [`13e4731`](https://github.com/apache/spark/commit/13e4731cf983f9fad386875862b806ff03817e1c).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #20410: [SPARK-23234][ML][PYSPARK] Remove default outputCol in p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20410 Merged build finished. Test FAILed.
[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/292/ Test PASSed.
[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Merged build finished. Test PASSed.
[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20403 **[Test build #86712 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86712/testReport)** for PR 20403 at commit [`4899e33`](https://github.com/apache/spark/commit/4899e33a0608f96c6151f692cee4146f1082a46d).
[GitHub] spark issue #20332: [SPARK-23138][ML][DOC] Multiclass logistic regression su...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20332 **[Test build #86713 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86713/testReport)** for PR 20332 at commit [`ac7a4ae`](https://github.com/apache/spark/commit/ac7a4aeb1c2f76e25f611c167ab8726069589a3e).
[GitHub] spark issue #20332: [SPARK-23138][ML][DOC] Multiclass logistic regression su...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20332 Merged build finished. Test PASSed.
[GitHub] spark issue #20332: [SPARK-23138][ML][DOC] Multiclass logistic regression su...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20332 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/291/ Test PASSed.
[GitHub] spark issue #20410: [SPARK-23234][ML][PYSPARK] Remove default outputCol in p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20410 Merged build finished. Test PASSed.
[GitHub] spark issue #20410: [SPARK-23234][ML][PYSPARK] Remove default outputCol in p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20410 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/290/ Test PASSed.
[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20403 retest this please
[GitHub] spark issue #20410: [SPARK-23234][ML][PYSPARK] Remove default outputCol in p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20410 **[Test build #86711 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86711/testReport)** for PR 20410 at commit [`13e4731`](https://github.com/apache/spark/commit/13e4731cf983f9fad386875862b806ff03817e1c).
[GitHub] spark pull request #20332: [SPARK-23138][ML][DOC] Multiclass logistic regres...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/20332#discussion_r164151869

--- Diff: docs/ml-classification-regression.md ---
@@ -97,10 +97,6 @@ only available on the driver.
 [`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
 provides a summary for a
 [`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel).
-Currently, only binary classification is supported and the
--- End diff --

Done.
[GitHub] spark pull request #20332: [SPARK-23138][ML][DOC] Multiclass logistic regres...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/20332#discussion_r164151796

--- Diff: docs/ml-classification-regression.md ---
@@ -125,7 +117,6 @@ Continuing the earlier example:
 [`LogisticRegressionTrainingSummary`](api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegressionSummary)
 provides a summary for a
 [`LogisticRegressionModel`](api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegressionModel).
--- End diff --

Done.
[GitHub] spark pull request #20332: [SPARK-23138][ML][DOC] Multiclass logistic regres...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/20332#discussion_r164151687

--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/MulticlassLogisticRegressionWithElasticNetExample.scala ---
@@ -49,6 +49,48 @@ object MulticlassLogisticRegressionWithElasticNetExample {
     // Print the coefficients and intercept for multinomial logistic regression
     println(s"Coefficients: \n${lrModel.coefficientMatrix}")
     println(s"Intercepts: \n${lrModel.interceptVector}")
+
+    val trainingSummary = lrModel.summary
+
+    val objectiveHistory = trainingSummary.objectiveHistory
--- End diff --

Done
[GitHub] spark pull request #20332: [SPARK-23138][ML][DOC] Multiclass logistic regres...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/20332#discussion_r164151731

--- Diff: examples/src/main/python/ml/multiclass_logistic_regression_with_elastic_net.py ---
@@ -43,6 +43,43 @@
 # Print the coefficients and intercept for multinomial logistic regression
 print("Coefficients: \n" + str(lrModel.coefficientMatrix))
 print("Intercept: " + str(lrModel.interceptVector))
+
+trainingSummary = lrModel.summary
+
+# Obtain the objective per iteration
+objectiveHistory = trainingSummary.objectiveHistory
+print("objectiveHistory:")
+for objective in objectiveHistory:
+    print(objective)
+
+print("False positive rate by label:")
--- End diff --

Done
[GitHub] spark pull request #20410: [SPARK-23234][ML][PYSPARK] Remove default outputC...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/20410

[SPARK-23234][ML][PYSPARK] Remove default outputCol in python

## What changes were proposed in this pull request?

SPARK-22799 and SPARK-22797 are causing valid Python test failures. The reason is that the Python API applies default params via `set`, so an outputCol value is always explicitly set by the Python API for Bucketizer.

## How was this patch tested?

The previously failing UTs now pass.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-23234

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20410.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20410

commit 13e4731cf983f9fad386875862b806ff03817e1c
Author: Marco Gaido
Date: 2018-01-26T16:07:45Z

    [SPARK-23234][ML][PYSPARK] Remove default outputCol in python
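The default-vs-set distinction behind this PR can be reduced to a minimal pure-Python sketch. This is not PySpark's actual `Params` implementation; the class and method names below are illustrative only. The point is that a default value should be visible to readers without counting as "user-set", and routing defaults through the same code path as explicit `set` calls erases that distinction.

```python
# Minimal sketch of the default-vs-set param distinction described above.
# Class and attribute names are illustrative, not PySpark's real internals.

class Params:
    def __init__(self):
        self._default_param_map = {}   # defaults: advisory, not "user-set"
        self._param_map = {}           # params the user explicitly set

    def set_default(self, name, value):
        self._default_param_map[name] = value

    def set(self, name, value):
        self._param_map[name] = value

    def is_set(self, name):
        # Only explicitly set params count; defaults do not.
        return name in self._param_map

    def get_or_default(self, name):
        if name in self._param_map:
            return self._param_map[name]
        return self._default_param_map[name]


# Intended behaviour: a default outputCol is visible but not "set".
bucketizer = Params()
bucketizer.set_default("outputCol", "Bucketizer_output")
assert not bucketizer.is_set("outputCol")
assert bucketizer.get_or_default("outputCol") == "Bucketizer_output"

# The bug pattern: applying the default via set() makes it look user-set.
buggy = Params()
buggy.set("outputCol", "Bucketizer_output")
assert buggy.is_set("outputCol")  # indistinguishable from a user value
```

Once a default is applied via `set`, downstream code (and tests) can no longer tell whether the user chose the value, which is why the PR removes the Python-side default instead.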
[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20402 **[Test build #86710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86710/testReport)** for PR 20402 at commit [`77c6761`](https://github.com/apache/spark/commit/77c676133ad2ff3cfe1875615c72bba518627383).
[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20402 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/289/ Test PASSed.
[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86705/ Test FAILed.
[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Merged build finished. Test FAILed.
[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20402 Merged build finished. Test PASSed.
[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20403 **[Test build #86705 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86705/testReport)** for PR 20403 at commit [`4899e33`](https://github.com/apache/spark/commit/4899e33a0608f96c6151f692cee4146f1082a46d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20402 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86709/ Test FAILed.
[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20402 Merged build finished. Test FAILed.
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20409 **[Test build #86707 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86707/testReport)** for PR 20409 at commit [`317c4fd`](https://github.com/apache/spark/commit/317c4fd54ecb707b92088c62aebd551805ecae8f).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20409 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86707/ Test FAILed.
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20409 Merged build finished. Test FAILed.
[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/20397 cc @cloud-fan
[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20396 test failures are unrelated and caused by SPARK-23234
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20407 test failures are unrelated and caused by SPARK-23234
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20407 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86703/ Test FAILed.
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20407 Merged build finished. Test FAILed.
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20407 **[Test build #86703 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86703/testReport)** for PR 20407 at commit [`074c342`](https://github.com/apache/spark/commit/074c34245d300901390d2d5ed74bb69e32539b8a).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20397 **[Test build #86708 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86708/testReport)** for PR 20397 at commit [`ce98e09`](https://github.com/apache/spark/commit/ce98e09f72a18776a6fa4c659ea3fa3c6a94801b).
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20409 **[Test build #86707 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86707/testReport)** for PR 20409 at commit [`317c4fd`](https://github.com/apache/spark/commit/317c4fd54ecb707b92088c62aebd551805ecae8f).
[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20397 Merged build finished. Test PASSed.
[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20397 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/288/ Test PASSed.
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20409 Merged build finished. Test PASSed.
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20409 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/287/ Test PASSed.
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20409 retest this please
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20409 Merged build finished. Test FAILed.
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20409 **[Test build #86706 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86706/testReport)** for PR 20409 at commit [`317c4fd`](https://github.com/apache/spark/commit/317c4fd54ecb707b92088c62aebd551805ecae8f).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20409 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86706/ Test FAILed.
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20409 **[Test build #86706 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86706/testReport)** for PR 20409 at commit [`317c4fd`](https://github.com/apache/spark/commit/317c4fd54ecb707b92088c62aebd551805ecae8f).
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20409 Merged build finished. Test PASSed.
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20409 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/286/ Test PASSed.
[GitHub] spark issue #20409: Reset the cache in asNondeterministic to set determinist...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20409 cc @ueshin and @viirya, could you take a look please when you have some time?
[GitHub] spark pull request #20409: Reset the cache in asNondeterministic to set dete...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/20409

Reset the cache in asNondeterministic to set deterministic properly

## What changes were proposed in this pull request?

Reproducer:

```python
from pyspark.sql.functions import udf
f = udf(lambda x: x)
spark.range(1).select(f("id"))  # cache JVM UDF instance.
f = f.asNondeterministic()
spark.range(1).select(f("id"))._jdf.logicalPlan().projectList().head().deterministic()
```

It should return `False` but the current master returns `True`. It seems this is because we cache the JVM UDF instance and then reuse it, even after `deterministic` is changed, once the UDF has been called.

For an easy reproducer, with the diff below:

```diff
diff --git a/python/pyspark/sql/udf.py b/python/pyspark/sql/udf.py
index de96846c5c7..026a78bf547 100644
--- a/python/pyspark/sql/udf.py
+++ b/python/pyspark/sql/udf.py
@@ -180,6 +180,7 @@ class UserDefinedFunction(object):
         wrapper.deterministic = self.deterministic
         wrapper.asNondeterministic = functools.wraps(
             self.asNondeterministic)(lambda: self.asNondeterministic()._wrapped())
+        wrapper._unwrapped = lambda: self
         return wrapper

     def asNondeterministic(self):
```

**Before**

```python
>>> from pyspark.sql.functions import udf
>>> f = udf(lambda x: x)
>>> spark.range(1).select(f("id"))
DataFrame[(id): string]
>>> f._unwrapped()._judf_placeholder.udfDeterministic()
True
>>> ndf = f.asNondeterministic()
>>> ndf.deterministic
False
>>> spark.range(1).select(ndf("id"))
DataFrame[(id): string]
>>> ndf._unwrapped()._judf_placeholder.udfDeterministic()
True
```

**After**

```python
>>> from pyspark.sql.functions import udf
>>> f = udf(lambda x: x)
>>> spark.range(1).select(f("id"))
DataFrame[(id): string]
>>> f._unwrapped()._judf_placeholder.udfDeterministic()
True
>>> ndf = f.asNondeterministic()
>>> ndf.deterministic
False
>>> spark.range(1).select(ndf("id"))
DataFrame[(id): string]
>>> ndf._unwrapped()._judf_placeholder.udfDeterministic()
False
```

## How was this patch tested?

Manually tested. I am not sure if I should add a test with a lot of JVM accesses.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-23233

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20409.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20409

commit 317c4fd54ecb707b92088c62aebd551805ecae8f
Author: hyukjinkwon
Date: 2018-01-26T15:01:22Z

    Reset the cache in asNondeterministic to set deterministic properly
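The stale-cache problem this PR fixes can be reduced to a plain-Python sketch. This is not PySpark's real `UserDefinedFunction`; the class and attribute names below are made up for illustration. The essence: the cached object snapshots the `deterministic` flag at creation time, so flipping the flag afterwards must also invalidate the cache.

```python
# Illustrative sketch of the stale-cache bug described above (not PySpark's
# actual UserDefinedFunction; names are hypothetical).

class Udf:
    def __init__(self, func, deterministic=True):
        self.func = func
        self.deterministic = deterministic
        self._judf_placeholder = None  # stands in for the cached JVM UDF

    def _create_judf(self):
        # The cached object snapshots `deterministic` at creation time.
        return {"deterministic": self.deterministic}

    @property
    def _judf(self):
        if self._judf_placeholder is None:
            self._judf_placeholder = self._create_judf()
        return self._judf_placeholder

    def as_nondeterministic_buggy(self):
        self.deterministic = False      # flag flipped...
        return self                     # ...but the stale cache survives

    def as_nondeterministic_fixed(self):
        self._judf_placeholder = None   # the fix: reset the cache
        self.deterministic = False
        return self


f = Udf(lambda x: x)
_ = f._judf                            # simulate calling the UDF once
assert f.as_nondeterministic_buggy()._judf["deterministic"] is True   # stale

g = Udf(lambda x: x)
_ = g._judf
assert g.as_nondeterministic_fixed()._judf["deterministic"] is False  # fresh
```

Resetting the placeholder forces the next access to rebuild the cached object with the current flag, which mirrors what the PR does on the PySpark side.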
[GitHub] spark issue #20408: [SPARK-23189][Core][UI] Reflect stage level blacklisting...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20408 Can one of the admins verify this patch?
[GitHub] spark pull request #20408: [SPARK-23189][Core][UI] Reflect stage level black...
GitHub user attilapiros opened a pull request: https://github.com/apache/spark/pull/20408 [SPARK-23189][Core][UI] Reflect stage level blacklisting on executor tab ## What changes were proposed in this pull request? The purpose of this PR is to reflect stage-level blacklisting on the executor tab for the currently active stages. After this change, the Status column of the executor tab will show one of the following labels: - "Blacklisted" when the executor is blacklisted at the application level (old flag) - "Dead" when the executor is not blacklisted and not active - "Blacklisted in Stages: [...]" when the executor is active but the blacklistedInStages set is not empty; the comma-separated active stage IDs are listed within the brackets - "Active" when the executor is active and the blacklistedInStages set is empty ## How was this patch tested? Both with unit tests and manually. Manual test Spark was started as: ```bash bin/spark-shell --master "local-cluster[2,1,1024]" --conf "spark.blacklist.enabled=true" --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf "spark.blacklist.application.maxFailedTasksPerExecutor=10" ``` And the job was: ```scala import org.apache.spark.SparkEnv val pairs = sc.parallelize(1 to 1, 10).map { x => if (SparkEnv.get.executorId.toInt == 0) throw new RuntimeException("Bad executor") else { Thread.sleep(10) (x % 10, x) } } val all = pairs.cogroup(pairs) all.collect() ``` UI screenshots of the run: - One executor is blacklisted in the two stages: ![One executor is blacklisted in two stages](https://issues.apache.org/jira/secure/attachment/12907862/multiple_stages_1.png) - One stage completes while the other one is still running: ![One stage completes the other is still running](https://issues.apache.org/jira/secure/attachment/12907862/multiple_stages_2.png) - Both stages are completed: ![Both stages are completed](https://issues.apache.org/jira/secure/attachment/12907862/multiple_stages_3.png) ### Unit tests In AppStatusListenerSuite.scala
both node blacklisting for stage and executor blacklisting for stage are tested. You can merge this pull request into a Git repository by running: $ git pull https://github.com/attilapiros/spark SPARK-23189 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20408.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20408 commit 4f5176be0d3da7794d20895d8a0bfce16d6b8e5c Author: attilapiros Date: 2018-01-25T22:17:11Z initial version
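The label selection described in the PR body is simple precedence logic. A hedged Python sketch (the function name and argument shapes are mine, not the actual `ExecutorsPage` code):

```python
def executor_status_label(is_blacklisted, is_active, blacklisted_in_stages):
    # Precedence mirrors the PR description: application-level blacklisting
    # wins, then dead executors, then stage-level blacklisting, then active.
    if is_blacklisted:
        return "Blacklisted"
    if not is_active:
        return "Dead"
    if blacklisted_in_stages:
        stages = ", ".join(str(s) for s in sorted(blacklisted_in_stages))
        return "Blacklisted in Stages: [%s]" % stages
    return "Active"
```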
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164131001 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,129 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Arrow + +## Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. --- End diff -- > I feel we should discourage the use of toPandas I am not sure that's necessary. 
I think it's reasonable to down-sample/aggregate data in Spark and use `toPandas()` to bring a small dataset to the local driver for further analysis or visualization. Maybe instead we should discourage the use of `toPandas` with large amounts of data?
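A small helper along those lines — down-sample first, then collect. The function name and default fraction are illustrative, not a PySpark API; `sample` and `toPandas` are the real DataFrame methods being wrapped:

```python
def to_local_pandas(df, fraction=0.01, seed=42):
    """Down-sample a potentially large Spark DataFrame before collecting it
    to the driver as a pandas DataFrame, keeping driver memory bounded."""
    return df.sample(withReplacement=False, fraction=fraction, seed=seed).toPandas()
```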
[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20396 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86704/ Test FAILed.
[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20396 **[Test build #86704 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86704/testReport)** for PR 20396 at commit [`8a68f75`](https://github.com/apache/spark/commit/8a68f758a7a41f6c2a9a58f54a982745665be6a6). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20396 Merged build finished. Test FAILed.
[GitHub] spark issue #20405: [SPARK-23229][SQL] Dataset.hint should use planWithBarri...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20405 Merged build finished. Test FAILed.
[GitHub] spark issue #20405: [SPARK-23229][SQL] Dataset.hint should use planWithBarri...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20405 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86702/ Test FAILed.
[GitHub] spark issue #20405: [SPARK-23229][SQL] Dataset.hint should use planWithBarri...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20405 **[Test build #86702 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86702/testReport)** for PR 20405 at commit [`47bb245`](https://github.com/apache/spark/commit/47bb245353202208f2c41634c3796c8e4d2be663). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20167: [SPARK-16501] [MESOS] Allow providing Mesos princ...
Github user rvesse commented on a diff in the pull request: https://github.com/apache/spark/pull/20167#discussion_r164123285 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala --- @@ -71,40 +74,64 @@ trait MesosSchedulerUtils extends Logging { failoverTimeout: Option[Double] = None, frameworkId: Option[String] = None): SchedulerDriver = { val fwInfoBuilder = FrameworkInfo.newBuilder().setUser(sparkUser).setName(appName) -val credBuilder = Credential.newBuilder() + fwInfoBuilder.setHostname(Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse( + conf.get(DRIVER_HOST_ADDRESS))) webuiUrl.foreach { url => fwInfoBuilder.setWebuiUrl(url) } checkpoint.foreach { checkpoint => fwInfoBuilder.setCheckpoint(checkpoint) } failoverTimeout.foreach { timeout => fwInfoBuilder.setFailoverTimeout(timeout) } frameworkId.foreach { id => fwInfoBuilder.setId(FrameworkID.newBuilder().setValue(id).build()) } - fwInfoBuilder.setHostname(Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse( - conf.get(DRIVER_HOST_ADDRESS))) -conf.getOption("spark.mesos.principal").foreach { principal => - fwInfoBuilder.setPrincipal(principal) - credBuilder.setPrincipal(principal) -} -conf.getOption("spark.mesos.secret").foreach { secret => - credBuilder.setSecret(secret) -} -if (credBuilder.hasSecret && !fwInfoBuilder.hasPrincipal) { - throw new SparkException( -"spark.mesos.principal must be configured when spark.mesos.secret is set") -} + conf.getOption("spark.mesos.role").foreach { role => fwInfoBuilder.setRole(role) } val maxGpus = conf.getInt("spark.mesos.gpus.max", 0) if (maxGpus > 0) { fwInfoBuilder.addCapabilities(Capability.newBuilder().setType(Capability.Type.GPU_RESOURCES)) } +val credBuilder = buildCredentials(conf, fwInfoBuilder) if (credBuilder.hasPrincipal) { new MesosSchedulerDriver( scheduler, fwInfoBuilder.build(), masterUrl, credBuilder.build()) } else { new MesosSchedulerDriver(scheduler, fwInfoBuilder.build(), masterUrl) } } + + 
def buildCredentials( + conf: SparkConf, + fwInfoBuilder: Protos.FrameworkInfo.Builder): Protos.Credential.Builder = { +val credBuilder = Credential.newBuilder() +conf.getOption("spark.mesos.principal") + .orElse(Option(conf.getenv("SPARK_MESOS_PRINCIPAL"))) --- End diff -- I am not sure I understand enough of Spark's internals to answer your question. These variables are only necessary for Mesos, and only on the driver, which registers the framework with Mesos. Is it actually possible to submit jobs via REST into a Mesos cluster? Even if it is, the Spark framework must already exist at that point, thereby rendering credentials unnecessary in that scenario. Or am I missing something here?
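For what it's worth, the lookup order in the diff — explicit configuration property first, environment variable as a fallback — can be sketched like this (a pure-Python stand-in for `conf.getOption(...).orElse(Option(conf.getenv(...)))`, not the Scala code):

```python
import os

def mesos_principal(conf, env=None):
    # The explicit property takes precedence over the environment
    # variable; None is returned when neither is set.
    env = os.environ if env is None else env
    return conf.get("spark.mesos.principal") or env.get("SPARK_MESOS_PRINCIPAL")
```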
[GitHub] spark pull request #20303: [SPARK-23128][SQL] A new approach to do adaptive ...
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/20303#discussion_r164122162 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStage.scala --- @@ -0,0 +1,222 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.adaptive + +import scala.concurrent.{ExecutionContext, Future} +import scala.concurrent.duration.Duration + +import org.apache.spark.MapOutputStatistics +import org.apache.spark.broadcast +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.physical.Partitioning +import org.apache.spark.sql.execution._ +import org.apache.spark.sql.execution.exchange._ +import org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate +import org.apache.spark.util.ThreadUtils + +/** + * In adaptive execution mode, an execution plan is divided into multiple QueryStages. Each + * QueryStage is a sub-tree that runs in a single stage. + */ +abstract class QueryStage extends UnaryExecNode { + + var child: SparkPlan + + // Ignore this wrapper for canonicalizing. 
+ override def doCanonicalize(): SparkPlan = child.canonicalized + + override def output: Seq[Attribute] = child.output + + override def outputPartitioning: Partitioning = child.outputPartitioning + + override def outputOrdering: Seq[SortOrder] = child.outputOrdering + + /** + * Execute childStages and wait until all stages are completed. Use a thread pool to avoid + * blocking on one child stage. + */ + def executeChildStages(): Unit = { +// Handle broadcast stages +val broadcastQueryStages: Seq[BroadcastQueryStage] = child.collect { + case bqs: BroadcastQueryStageInput => bqs.childStage +} +val broadcastFutures = broadcastQueryStages.map { queryStage => + Future { queryStage.prepareBroadcast() }(QueryStage.executionContext) +} + +// Submit shuffle stages +val executionId = sqlContext.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY) +val shuffleQueryStages: Seq[ShuffleQueryStage] = child.collect { + case sqs: ShuffleQueryStageInput => sqs.childStage +} +val shuffleStageFutures = shuffleQueryStages.map { queryStage => + Future { +SQLExecution.withExecutionId(sqlContext.sparkContext, executionId) { + queryStage.execute() +} + }(QueryStage.executionContext) +} + +ThreadUtils.awaitResult( + Future.sequence(broadcastFutures)(implicitly, QueryStage.executionContext), Duration.Inf) +ThreadUtils.awaitResult( + Future.sequence(shuffleStageFutures)(implicitly, QueryStage.executionContext), Duration.Inf) + } + + /** + * Before executing the plan in this query stage, we execute all child stages, optimize the plan + * in this stage and determine the reducer number based on the child stages' statistics. Finally + * we do a codegen for this query stage and update the UI with the new plan. + */ + def prepareExecuteStage(): Unit = { +// 1. Execute childStages +executeChildStages() +// It is possible to optimize this stage's plan here based on the child stages' statistics. + +// 2. 
Determine reducer number +val queryStageInputs: Seq[ShuffleQueryStageInput] = child.collect { + case input: ShuffleQueryStageInput => input +} +val childMapOutputStatistics = queryStageInputs.map(_.childStage.mapOutputStatistics) + .filter(_ != null).toArray +if (childMapOutputStatistics.length > 0) { + val exchangeCoordinator = new ExchangeCoordinator( +conf.targetPostShuffleInputSize, +conf.minNumPostShufflePartitions) + + val
[GitHub] spark pull request #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkS...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20404#discussion_r164122074 --- Diff: python/pyspark/sql/session.py --- @@ -225,6 +225,7 @@ def __init__(self, sparkContext, jsparkSession=None): if SparkSession._instantiatedSession is None \ or SparkSession._instantiatedSession._sc._jsc is None: SparkSession._instantiatedSession = self + self._jvm.org.apache.spark.sql.SparkSession.setDefaultSession(self._jsparkSession) --- End diff -- The simplest way I can think of is to just add this in `def stop`: ``` self._jvm.org.apache.spark.sql.SparkSession.clearDefaultSession() ```
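The suggestion keeps the Python-side lifecycle symmetric: set the JVM default session in `__init__`, clear it in `stop`. A toy sketch of that symmetry (the `JvmSparkSession` class is a stand-in for the py4j gateway, not the real API):

```python
class JvmSparkSession:
    """Stand-in for the JVM-side SparkSession companion object."""
    _default = None

    @classmethod
    def setDefaultSession(cls, session):
        cls._default = session

    @classmethod
    def clearDefaultSession(cls):
        cls._default = None


class PySession:
    def __init__(self):
        JvmSparkSession.setDefaultSession(self)  # mirrors the added diff line

    def stop(self):
        JvmSparkSession.clearDefaultSession()    # the suggested cleanup
```

Without the `stop` half, the JVM default session would keep a reference to a stopped session — the leak the review comment is pointing at.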
[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20403 **[Test build #86705 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86705/testReport)** for PR 20403 at commit [`4899e33`](https://github.com/apache/spark/commit/4899e33a0608f96c6151f692cee4146f1082a46d).
[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/285/ Test PASSed.
[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Merged build finished. Test PASSed.
[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20403 Jenkins, retest this please.
[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20396 **[Test build #86704 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86704/testReport)** for PR 20396 at commit [`8a68f75`](https://github.com/apache/spark/commit/8a68f758a7a41f6c2a9a58f54a982745665be6a6).
[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20396 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/284/ Test PASSed.
[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20396 Merged build finished. Test PASSed.
[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86701/ Test FAILed.
[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Merged build finished. Test FAILed.
[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20403 **[Test build #86701 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86701/testReport)** for PR 20403 at commit [`4899e33`](https://github.com/apache/spark/commit/4899e33a0608f96c6151f692cee4146f1082a46d). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20396: [SPARK-23217][ML] Add cosine distance measure to ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20396#discussion_r164116722 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -421,13 +453,220 @@ private[evaluation] object SquaredEuclideanSilhouette { computeSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: Double, _: Double) } -val silhouetteScore = dfWithSquaredNorm - .select(avg( -computeSilhouetteCoefficientUDF( - col(featuresCol), col(predictionCol).cast(DoubleType), col("squaredNorm")) - )) - .collect()(0) - .getDouble(0) +val silhouetteScore = overallScore(dfWithSquaredNorm, + computeSilhouetteCoefficientUDF(col(featuresCol), col(predictionCol).cast(DoubleType), +col("squaredNorm"))) + +bClustersStatsMap.destroy() + +silhouetteScore + } +} + + +/** + * The algorithm which is implemented in this object, instead, is an efficient and parallel + * implementation of the Silhouette using the cosine distance measure. The cosine distance + * measure is defined as `1 - s` where `s` is the cosine similarity between two points. + * + * The total distance of the point `X` to the points `$C_{i}$` belonging to the cluster `$\Gamma$` + * is: + * + * + * $$ + * \sum\limits_{i=1}^N d(X, C_{i} ) = + * \sum\limits_{i=1}^N \Big( 1 - \frac{\sum\limits_{j=1}^D x_{j}c_{ij} }{ \|X\|\|C_{i}\|} \Big) + * = \sum\limits_{i=1}^N 1 - \sum\limits_{i=1}^N \sum\limits_{j=1}^D \frac{x_{j}}{\|X\|} + * \frac{c_{ij}}{\|C_{i}\|} + * = N - \sum\limits_{j=1}^D \frac{x_{j}}{\|X\|} \Big( \sum\limits_{i=1}^N + * \frac{c_{ij}}{\|C_{i}\|} \Big) + * $$ + * + * + * where `$x_{j}$` is the `j`-th dimension of the point `X` and `$c_{ij}$` is the `j`-th dimension + * of the `i`-th point in cluster `$\Gamma$`. 
+ * + * Then, we can define the vector: + * + * + * $$ + * \xi_{X} : \xi_{X i} = \frac{x_{i}}{\|X\|}, i = 1, ..., D + * $$ + * + * + * which can be precomputed for each point and the vector + * + * + * $$ + * \Omega_{\Gamma} : \Omega_{\Gamma i} = \sum\limits_{j=1}^N \xi_{C_{j}i}, i = 1, ..., D + * $$ + * + * + * which can be precomputed too for each cluster `$\Gamma$` by its points `$C_{i}$`. + * + * With these definitions, the numerator becomes: + * + * + * $$ + * N - \sum\limits_{j=1}^D \xi_{X j} \Omega_{\Gamma j} + * $$ + * + * + * Thus the average distance of a point `X` to the points of the cluster `$\Gamma$` is: + * + * + * $$ + * 1 - \frac{\sum\limits_{j=1}^D \xi_{X j} \Omega_{\Gamma j}}{N} + * $$ + * + * + * In the implementation, the precomputed values for the clusters are distributed among the worker + * nodes via broadcasted variables, because we can assume that the clusters are limited in number. + * + * The main strengths of this algorithm are the low computational complexity and the intrinsic + * parallelism. The precomputed information for each point and for each cluster can be computed + * with a computational complexity which is `O(N/W)`, where `N` is the number of points in the + * dataset and `W` is the number of worker nodes. After that, every point can be analyzed + * independently from the others. + * + * For every point we need to compute the average distance to all the clusters. Since the formula + * above requires `O(D)` operations, this phase has a computational complexity which is + * `O(C*D*N/W)` where `C` is the number of clusters (which we assume quite low), `D` is the number + * of dimensions, `N` is the number of points in the dataset and `W` is the number of worker + * nodes. 
+ */ +private[evaluation] object CosineSilhouette extends Silhouette { + + private[this] var kryoRegistrationPerformed: Boolean = false + + private[this] val normalizedFeaturesColName = "normalizedFeatures" + + /** + * This method registers the class + * [[org.apache.spark.ml.evaluation.CosineSilhouette.ClusterStats]] + * for kryo serialization. + * + * @param sc `SparkContext` to be used + */ + def registerKryoClasses(sc: SparkContext): Unit = { --- End diff -- sorry, I am not sure I can get what you mean. Which method is this duplicating? The registration happens once since there is a flag for it... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
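The algebra in the doc comment above — rewriting the summed cosine distances as `N - Σ_j ξ_{Xj} Ω_{Γj}` — can be checked numerically in a few lines. The helper names below are mine; the identity itself is the one from the comment:

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine_distance(x, c):
    # 1 - cosine similarity, as defined in the doc comment
    dot = sum(a * b for a, b in zip(x, c))
    return 1.0 - dot / (norm(x) * norm(c))

def total_distance_via_identity(x, cluster):
    # xi_X: the point scaled to unit length (precomputed per point)
    xi = [v / norm(x) for v in x]
    # Omega_Gamma: per-dimension sum of the normalized cluster points
    # (precomputed per cluster and broadcast in the real implementation)
    omega = [sum(c[j] / norm(c) for c in cluster) for j in range(len(x))]
    return len(cluster) - sum(a * b for a, b in zip(xi, omega))

cluster = [[1.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
x = [2.0, 2.0]
direct = sum(cosine_distance(x, c) for c in cluster)
assert abs(direct - total_distance_via_identity(x, cluster)) < 1e-9
```

The identity is what makes the algorithm `O(D)` per point-cluster pair after the per-cluster sums are precomputed, instead of `O(N*D)`.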
[GitHub] spark pull request #20396: [SPARK-23217][ML] Add cosine distance measure to ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20396#discussion_r164112428 --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala --- @@ -66,16 +66,38 @@ class ClusteringEvaluatorSuite assert(evaluator.evaluate(irisDataset) ~== 0.6564679231 relTol 1e-5) } - test("number of clusters must be greater than one") { -val singleClusterDataset = irisDataset.where($"label" === 0.0) + /* +Use the following python code to load the data and evaluate it using scikit-learn package. --- End diff -- I see, the idea is to make it more copy-pasteable. That's fine. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20396: [SPARK-23217][ML] Add cosine distance measure to ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20396#discussion_r164112835 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -111,6 +129,46 @@ object ClusteringEvaluator } +private[evaluation] abstract class Silhouette { + + /** + * It computes the Silhouette coefficient for a point. + */ + def pointSilhouetteCoefficient( + clusterIds: Set[Double], + pointClusterId: Double, + pointClusterNumOfPoints: Long, + averageDistanceToCluster: (Double) => Double): Double = { +// Here we compute the average dissimilarity of the current point to any cluster of which the +// point is not a member. +// The cluster with the lowest average dissimilarity - i.e. the nearest cluster to the current +// point - is said to be the "neighboring cluster". +val otherClusterIds = clusterIds.filter(_ != pointClusterId) +val neighboringClusterDissimilarity = otherClusterIds.map(averageDistanceToCluster).min + +// adjustment for excluding the node itself from the computation of the average dissimilarity +val currentClusterDissimilarity = if (pointClusterNumOfPoints == 1) { + 0 +} else { + averageDistanceToCluster(pointClusterId) * pointClusterNumOfPoints / +(pointClusterNumOfPoints - 1) +} + +(currentClusterDissimilarity compare neighboringClusterDissimilarity).signum match { --- End diff -- Is this just expressing ... ``` if (currentClusterDissimilarity < neighboringClusterDissimilarity) { ... } else if (currentClusterDissimilarity > neighboringClusterDissimilarity) { } else { ... } ``` That seems more straightforward to my eyes, if that's all it is. This has postfix notation, `signum`, and a match statement.
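The expression under discussion presumably computes the standard silhouette coefficient `s = (b - a) / max(a, b)`. Written with the plain branches srowen suggests (a Python sketch of the textbook formula, not the Spark code):

```python
def point_silhouette(a, b):
    # a: average dissimilarity to the point's own cluster (adjusted)
    # b: average dissimilarity to the neighboring cluster
    if a < b:
        return 1.0 - a / b      # well clustered
    elif a > b:
        return b / a - 1.0      # closer to the neighboring cluster
    else:
        return 0.0              # on the border between the two clusters
```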
[GitHub] spark pull request #20396: [SPARK-23217][ML] Add cosine distance measure to ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20396#discussion_r164111780 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -421,13 +460,220 @@ private[evaluation] object SquaredEuclideanSilhouette { computeSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: Double, _: Double) } -val silhouetteScore = dfWithSquaredNorm - .select(avg( -computeSilhouetteCoefficientUDF( - col(featuresCol), col(predictionCol).cast(DoubleType), col("squaredNorm")) - )) - .collect()(0) - .getDouble(0) +val silhouetteScore = overallScore(dfWithSquaredNorm, + computeSilhouetteCoefficientUDF(col(featuresCol), col(predictionCol).cast(DoubleType), +col("squaredNorm"))) + +bClustersStatsMap.destroy() + +silhouetteScore + } +} + + +/** + * The algorithm which is implemented in this object, instead, is an efficient and parallel + * implementation of the Silhouette using the cosine distance measure. The cosine distance + * measure is defined as `1 - s` where `s` is the cosine similarity between two points. + * + * The total distance of the point `X` to the points `$C_{i}$` belonging to the cluster `$\Gamma$` + * is: + * + * + * $$ + * \sum\limits_{i=1}^N d(X, C_{i} ) = + * \sum\limits_{i=1}^N \Big( 1 - \frac{\sum\limits_{j=1}^D x_{j}c_{ij} }{ \|X\|\|C_{i}\|} \Big) + * = \sum\limits_{i=1}^N 1 - \sum\limits_{i=1}^N \sum\limits_{j=1}^D \frac{x_{j}}{\|X\|} + * \frac{c_{ij}}{\|C_{i}\|} + * = N - \sum\limits_{j=1}^D \frac{x_{j}}{\|X\|} \Big( \sum\limits_{i=1}^N + * \frac{c_{ij}}{\|C_{i}\|} \Big) + * $$ + * + * + * where `$x_{j}$` is the `j`-th dimension of the point `X` and `$c_{ij}$` is the `j`-th dimension + * of the `i`-th point in cluster `$\Gamma$`. 
+ * + * Then, we can define the vector: + * + * + * $$ + * \xi_{X} : \xi_{X i} = \frac{x_{i}}{\|X\|}, i = 1, ..., D + * $$ + * + * + * which can be precomputed for each point and the vector + * + * + * $$ + * \Omega_{\Gamma} : \Omega_{\Gamma i} = \sum\limits_{j=1}^N \xi_{C_{j}i}, i = 1, ..., D + * $$ + * + * + * which can be precomputed too for each cluster `$\Gamma$` by its points `$C_{i}$`. + * + * With these definitions, the numerator becomes: + * + * + * $$ + * N - \sum\limits_{j=1}^D \xi_{X j} \Omega_{\Gamma j} + * $$ + * + * + * Thus the average distance of a point `X` to the points of the cluster `$\Gamma$` is: + * + * + * $$ + * 1 - \frac{\sum\limits_{j=1}^D \xi_{X j} \Omega_{\Gamma j}}{N} + * $$ + * + * + * In the implementation, the precomputed values for the clusters are distributed among the worker + * nodes via broadcasted variables, because we can assume that the clusters are limited in number. + * + * The main strengths of this algorithm are the low computational complexity and the intrinsic + * parallelism. The precomputed information for each point and for each cluster can be computed + * with a computational complexity which is `O(N/W)`, where `N` is the number of points in the + * dataset and `W` is the number of worker nodes. After that, every point can be analyzed + * independently from the others. + * + * For every point we need to compute the average distance to all the clusters. Since the formula + * above requires `O(D)` operations, this phase has a computational complexity which is + * `O(C*D*N/W)` where `C` is the number of clusters (which we assume quite low), `D` is the number + * of dimensions, `N` is the number of points in the dataset and `W` is the number of worker + * nodes. 
+ */ +private[evaluation] object CosineSilhouette extends Silhouette { + + private[this] var kryoRegistrationPerformed: Boolean = false + + private[this] val normalizedFeaturesColName = "normalizedFeatures" + + /** + * This method registers the class + * [[org.apache.spark.ml.evaluation.CosineSilhouette.ClusterStats]] + * for kryo serialization. + * + * @param sc `SparkContext` to be used + */ + def registerKryoClasses(sc: SparkContext): Unit = { +if (!kryoRegistrationPerformed) { + sc.getConf.registerKryoClasses( +Array( + classOf[CosineSilhouette.ClusterStats] +) + ) + kryoRegistrationPerformed = true +} + } + + case class ClusterStats(normalizedFeatureSum: Vector, numOfPoints: Long) + + /** + * The method takes the input dataset and computes the aggregated values + * about
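The Scaladoc quoted in the diff above derives a rearrangement of the cosine silhouette: the average cosine distance from a point `X` to a cluster can be computed from the normalized point `ξ_X = X/‖X‖` and a per-cluster precomputed sum `Ω_Γ = Σ_i C_i/‖C_i‖` as `1 - <ξ_X, Ω_Γ>/N`, avoiding a pass over every pair of points. A minimal pure-Python sketch of that identity (illustrative only, not the Spark implementation; function names are mine):

```python
import math

def normalize(v):
    """Return v / ||v|| (the xi vector from the Scaladoc)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def omega(cluster):
    """Per-cluster Omega vector: component-wise sum of normalized points,
    precomputed once per cluster."""
    normed = [normalize(p) for p in cluster]
    return [sum(col) for col in zip(*normed)]

def avg_cosine_distance(point, cluster_omega, n_points):
    """Average cosine distance from `point` to the cluster, via the
    precomputed Omega: 1 - dot(xi, Omega) / N."""
    xi = normalize(point)
    return 1.0 - sum(a * b for a, b in zip(xi, cluster_omega)) / n_points
```

The value agrees with the brute-force average of `1 - cos(X, C_i)` over all cluster points, but after the one-time `omega` pass each point is scored in `O(D)` instead of `O(N*D)`, which is exactly the parallelism argument made in the quoted documentation.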
[GitHub] spark pull request #20396: [SPARK-23217][ML] Add cosine distance measure to ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20396#discussion_r164112193 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -421,13 +453,220 @@ (same CosineSilhouette diff context as quoted above) + def registerKryoClasses(sc: SparkContext): Unit = { --- End diff -- This duplicates a method in ClusteringEvaluator, right? I wonder if this can happen just once. It's OK if it registers a bunch of classes, not all of which will be used. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20396: [SPARK-23217][ML] Add cosine distance measure to ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20396#discussion_r164111264 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -84,18 +81,39 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str @Since("2.3.0") def setMetricName(value: String): this.type = set(metricName, value) - setDefault(metricName -> "silhouette") + /** + * param for distance measure to be used in evaluation + * (supports `"squaredEuclidean"` (default), `"cosine"`) + * @group param + */ + @Since("2.4.0") + val distanceMeasure: Param[String] = { +val allowedParams = ParamValidators.inArray(Array("squaredEuclidean", "cosine")) --- End diff -- You don't need to change this, but it occurs to me that on lots of the parameters that take discrete values, the error message could reference the same array of values the validator uses, to make sure they're always consistent. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
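The consistency idea in the comment above, deriving both the validator and the error message from the same array of allowed values so the two can never drift apart, can be sketched in a few lines of Python (names are illustrative, not Spark's `Param`/`ParamValidators` API):

```python
# Single source of truth for the allowed values.
SUPPORTED_DISTANCE_MEASURES = ("squaredEuclidean", "cosine")

def validate_distance_measure(value):
    """Validate against SUPPORTED_DISTANCE_MEASURES and build the error
    message from the very same tuple."""
    if value not in SUPPORTED_DISTANCE_MEASURES:
        raise ValueError(
            "distanceMeasure must be one of %s, but got: %s"
            % (", ".join(SUPPORTED_DISTANCE_MEASURES), value))
    return value
```

Adding a new measure then only touches the tuple; the validator and the message stay consistent by construction.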
[GitHub] spark pull request #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkS...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/20404#discussion_r164109814 --- Diff: python/pyspark/sql/session.py --- @@ -225,6 +225,7 @@ def __init__(self, sparkContext, jsparkSession=None): if SparkSession._instantiatedSession is None \ or SparkSession._instantiatedSession._sc._jsc is None: SparkSession._instantiatedSession = self + self._jvm.org.apache.spark.sql.SparkSession.setDefaultSession(self._jsparkSession) --- End diff -- Ohh, I see. My misunderstanding. Let me figure out a way to clear this object. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkS...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20404#discussion_r164109079 --- Diff: python/pyspark/sql/session.py --- @@ -225,6 +225,7 @@ def __init__(self, sparkContext, jsparkSession=None): if SparkSession._instantiatedSession is None \ or SparkSession._instantiatedSession._sc._jsc is None: SparkSession._instantiatedSession = self + self._jvm.org.apache.spark.sql.SparkSession.setDefaultSession(self._jsparkSession) --- End diff -- Actually, it seems not, because we don't call this code path. The stop-and-start logic in PySpark is convoluted, in my humble opinion. Setting the default one fixes an actual issue, and it seems we are okay with it, at least. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkS...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/20404#discussion_r164108216 --- Diff: python/pyspark/sql/session.py --- @@ -225,6 +225,7 @@ def __init__(self, sparkContext, jsparkSession=None): if SparkSession._instantiatedSession is None \ or SparkSession._instantiatedSession._sc._jsc is None: SparkSession._instantiatedSession = self + self._jvm.org.apache.spark.sql.SparkSession.setDefaultSession(self._jsparkSession) --- End diff -- >Btw, shall we clear it when stopping PySpark SparkSession? The JVM SparkSession will clear it when the application is stopped (https://github.com/apache/spark/blob/3e252514741447004f3c18ddd77c617b4e37cfaa/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L961). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
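The bookkeeping being discussed, `getOrCreate` setting a default session and `stop` clearing it, can be modeled without a JVM in a few lines of pure Python. This is a loose illustration of the lifecycle, not the actual PySpark or Scala implementation:

```python
class Session:
    """Toy model of default-session tracking: get_or_create installs a
    class-level default, and stop() clears it (mirroring what
    SparkSession.clearDefaultSession does on the JVM side)."""

    _default = None

    @classmethod
    def get_or_create(cls):
        if cls._default is None:
            cls._default = cls()
        return cls._default

    def stop(self):
        # Only clear the default if this session is the current default,
        # so stopping a stale session cannot evict a newer one.
        if Session._default is self:
            Session._default = None
```

The point of the review thread is that the Python side sets the JVM default but, without an explicit clear on stop, the JVM reference can outlive the Python session; the toy `stop()` shows where that clear would live.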
[GitHub] spark pull request #20383: [SPARK-23200] Reset Kubernetes-specific config on...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/20383#discussion_r164106372 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala --- @@ -53,6 +53,21 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time) "spark.driver.host", "spark.driver.bindAddress", "spark.driver.port", + "spark.kubernetes.driver.pod.name", + "spark.kubernetes.executor.podNamePrefix", + "spark.kubernetes.initcontainer.executor.configmapname", + "spark.kubernetes.initcontainer.executor.configmapkey", + "spark.kubernetes.initcontainer.downloadJarsResourceIdentifier", + "spark.kubernetes.initcontainer.downloadJarsSecretLocation", + "spark.kubernetes.initcontainer.downloadFilesResourceIdentifier", + "spark.kubernetes.initcontainer.downloadFilesSecretLocation", + "spark.kubernetes.initcontainer.remoteJars", + "spark.kubernetes.initcontainer.remoteFiles", + "spark.kubernetes.mountdependencies.jarsDownloadDir", + "spark.kubernetes.mountdependencies.filesDownloadDir", + "spark.kubernetes.initcontainer.executor.stagingServerSecret.name", --- End diff -- I think it will not affect the correctness of the streaming application. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
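The list in the diff above names checkpoint properties that must be re-read from the new environment on recovery, so cluster-specific values (driver host, Kubernetes pod names, init-container settings) do not leak from the old run into the restarted one. A hedged Python sketch of that reload step (a simplified model, not the actual `Checkpoint.scala` logic):

```python
def reload_properties(checkpointed, current, to_reload):
    """Merge a checkpointed conf with the current one: keys listed in
    to_reload are taken from the current conf, or dropped entirely if the
    new environment does not define them."""
    conf = dict(checkpointed)
    for key in to_reload:
        if key in current:
            conf[key] = current[key]
        else:
            # A stale value (e.g. an old pod name) must not survive restart.
            conf.pop(key, None)
    return conf
```

Everything not in the reload list is restored verbatim from the checkpoint, which is why, as the comment says, this does not affect the correctness of the streaming application itself.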
[GitHub] spark pull request #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkS...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/20404#discussion_r164105856 --- Diff: python/pyspark/sql/session.py --- @@ -225,6 +225,7 @@ def __init__(self, sparkContext, jsparkSession=None): if SparkSession._instantiatedSession is None \ or SparkSession._instantiatedSession._sc._jsc is None: SparkSession._instantiatedSession = self + self._jvm.org.apache.spark.sql.SparkSession.setDefaultSession(self._jsparkSession) --- End diff -- Looking at the Scala code, it seems the Scala `getOrCreate` sets the `defaultSession`, so I was thinking it is more proper to set the `defaultSession` here. Does PySpark support multiple sessions? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20407 **[Test build #86703 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86703/testReport)** for PR 20407 at commit [`074c342`](https://github.com/apache/spark/commit/074c34245d300901390d2d5ed74bb69e32539b8a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20407 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20407 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/283/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/20407 [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJoin fallback ## What changes were proposed in this pull request? In JoinStrategies, if no better option is available, the planner currently falls back to BroadcastNestedLoopJoin. This strategy can be very problematic, since it can cause OOM. While generally this is not a big problem, in some applications like the Thriftserver it is an issue, because a failing job can cause the whole application to go into a bad state. Thus, in these cases, it might be useful to be able to disable this behavior and fail only the jobs that can cause it. ## How was this patch tested? Added UT. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mgaido91/spark SPARK-23124 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20407.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20407 commit 074c34245d300901390d2d5ed74bb69e32539b8a Author: Marco Gaido Date: 2018-01-26T12:54:29Z [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJoin fallback --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
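The behavior this PR description proposes can be sketched as a strategy-selection function with a fallback switch: without equi-join keys the planner would normally fall back to a broadcast nested loop join, but when the fallback is disabled the offending query fails fast instead of risking an executor OOM. All names below are illustrative; this is not Spark's planner code, and the PR's actual config key is not shown in the quoted text:

```python
def pick_join_strategy(has_equi_join_keys, allow_bnlj_fallback=True):
    """Toy model of the fallback decision described in the PR summary."""
    if has_equi_join_keys:
        # Equi-joins have safe, scalable strategies available.
        return "SortMergeJoin"
    if allow_bnlj_fallback:
        # Last-resort strategy: can build a huge broadcast and OOM.
        return "BroadcastNestedLoopJoin"
    raise RuntimeError(
        "No viable join strategy and BroadcastNestedLoopJoin fallback is "
        "disabled; failing this job instead of risking an OOM")
```

The point for long-lived applications such as the Thriftserver is the last branch: only the individual query fails, rather than the whole process.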
[GitHub] spark pull request #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkS...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20404#discussion_r164103440 --- Diff: python/pyspark/sql/session.py --- @@ -225,6 +225,7 @@ def __init__(self, sparkContext, jsparkSession=None): if SparkSession._instantiatedSession is None \ or SparkSession._instantiatedSession._sc._jsc is None: SparkSession._instantiatedSession = self + self._jvm.org.apache.spark.sql.SparkSession.setDefaultSession(self._jsparkSession) --- End diff -- `setActiveSession` or `setDefaultSession`? Which one is more proper to set here? Btw, shall we clear it when stopping PySpark SparkSession? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org