[GitHub] spark pull request #20403: [MINOR][PYTHON] Minor doc correction for 'spark.s...

2018-01-26 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20403#discussion_r164160358
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1045,9 +1045,10 @@ object SQLConf {
 buildConf("spark.sql.execution.arrow.enabled")
   .internal()
   .doc("Make use of Apache Arrow for columnar data transfers. 
Currently available " +
--- End diff --

`When true, make use of`
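A minimal usage sketch of the config this doc string describes (assumes an active `SparkSession` named `spark` and PyArrow installed):

```python
# Disabled by default; when true, eligible Spark <-> Pandas conversions
# go through Apache Arrow for the columnar transfer.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = spark.range(100).toPandas()  # can now use Arrow for the transfer
```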


---




[GitHub] spark issue #20332: [SPARK-23138][ML][DOC] Multiclass logistic regression su...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20332
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...

2018-01-26 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20402
  
LGTM


---




[GitHub] spark pull request #20402: [SPARK-23223][SQL] Make stacking dataset transfor...

2018-01-26 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20402#discussion_r164159613
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala ---
@@ -66,7 +54,16 @@ class QueryExecution(val sparkSession: SparkSession, val 
logical: LogicalPlan) {
 
   lazy val analyzed: LogicalPlan = {
 SparkSession.setActiveSession(sparkSession)
-sparkSession.sessionState.analyzer.execute(logical)
+val plan = sparkSession.sessionState.analyzer.execute(logical)
+try {
+  sparkSession.sessionState.analyzer.checkAnalysis(plan)
+  EliminateBarriers(plan)
--- End diff --

In the future, we can revisit all the rules we put in the optimizer's 
`Finish Analysis` batch. It might make sense to introduce a dedicated batch 
here. 


---




[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20369
  
**[Test build #86714 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86714/testReport)**
 for PR 20369 at commit 
[`d311d56`](https://github.com/apache/spark/commit/d311d5639b3af9123e0c6dbe38468f0172e06712).


---




[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks

2018-01-26 Thread jose-torres
Github user jose-torres commented on the issue:

https://github.com/apache/spark/pull/20369
  
The above failure is an unrelated issue which 
https://github.com/apache/spark/pull/20398 is out to fix.


---




[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks

2018-01-26 Thread jose-torres
Github user jose-torres commented on the issue:

https://github.com/apache/spark/pull/20369
  
retest this please


---




[GitHub] spark issue #20332: [SPARK-23138][ML][DOC] Multiclass logistic regression su...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20332
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86713/
Test PASSed.


---




[GitHub] spark issue #20332: [SPARK-23138][ML][DOC] Multiclass logistic regression su...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20332
  
**[Test build #86713 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86713/testReport)**
 for PR 20332 at commit 
[`ac7a4ae`](https://github.com/apache/spark/commit/ac7a4aeb1c2f76e25f611c167ab8726069589a3e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20347: [SPARK-20129][Core] JavaSparkContext should use SparkCon...

2018-01-26 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/20347
  
@rekhajoshm I think maybe the right resolution here is to do nothing. I 
haven't heard back from @mengxr on his old JIRA about making this change. Thank 
you for chasing down open JIRAs like this, of course.


---




[GitHub] spark issue #20410: [SPARK-23234][ML][PYSPARK] Remove default outputCol in p...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20410
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86711/
Test FAILed.


---




[GitHub] spark issue #20410: [SPARK-23234][ML][PYSPARK] Remove default outputCol in p...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20410
  
**[Test build #86711 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86711/testReport)**
 for PR 20410 at commit 
[`13e4731`](https://github.com/apache/spark/commit/13e4731cf983f9fad386875862b806ff03817e1c).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20410: [SPARK-23234][ML][PYSPARK] Remove default outputCol in p...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20410
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20403
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/292/
Test PASSed.


---




[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20403
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20403
  
**[Test build #86712 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86712/testReport)**
 for PR 20403 at commit 
[`4899e33`](https://github.com/apache/spark/commit/4899e33a0608f96c6151f692cee4146f1082a46d).


---




[GitHub] spark issue #20332: [SPARK-23138][ML][DOC] Multiclass logistic regression su...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20332
  
**[Test build #86713 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86713/testReport)**
 for PR 20332 at commit 
[`ac7a4ae`](https://github.com/apache/spark/commit/ac7a4aeb1c2f76e25f611c167ab8726069589a3e).


---




[GitHub] spark issue #20332: [SPARK-23138][ML][DOC] Multiclass logistic regression su...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20332
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20332: [SPARK-23138][ML][DOC] Multiclass logistic regression su...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20332
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/291/
Test PASSed.


---




[GitHub] spark issue #20410: [SPARK-23234][ML][PYSPARK] Remove default outputCol in p...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20410
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20410: [SPARK-23234][ML][PYSPARK] Remove default outputCol in p...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20410
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/290/
Test PASSed.


---




[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...

2018-01-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20403
  
retest this please


---




[GitHub] spark issue #20410: [SPARK-23234][ML][PYSPARK] Remove default outputCol in p...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20410
  
**[Test build #86711 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86711/testReport)**
 for PR 20410 at commit 
[`13e4731`](https://github.com/apache/spark/commit/13e4731cf983f9fad386875862b806ff03817e1c).


---




[GitHub] spark pull request #20332: [SPARK-23138][ML][DOC] Multiclass logistic regres...

2018-01-26 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/20332#discussion_r164151869
  
--- Diff: docs/ml-classification-regression.md ---
@@ -97,10 +97,6 @@ only available on the driver.
 
[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
 provides a summary for a
 
[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel).
-Currently, only binary classification is supported and the
--- End diff --

Done.


---




[GitHub] spark pull request #20332: [SPARK-23138][ML][DOC] Multiclass logistic regres...

2018-01-26 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/20332#discussion_r164151796
  
--- Diff: docs/ml-classification-regression.md ---
@@ -125,7 +117,6 @@ Continuing the earlier example:
 
[`LogisticRegressionTrainingSummary`](api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegressionSummary)
 provides a summary for a
 
[`LogisticRegressionModel`](api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegressionModel).
--- End diff --

Done.


---




[GitHub] spark pull request #20332: [SPARK-23138][ML][DOC] Multiclass logistic regres...

2018-01-26 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/20332#discussion_r164151687
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/MulticlassLogisticRegressionWithElasticNetExample.scala
 ---
@@ -49,6 +49,48 @@ object MulticlassLogisticRegressionWithElasticNetExample 
{
 // Print the coefficients and intercept for multinomial logistic 
regression
 println(s"Coefficients: \n${lrModel.coefficientMatrix}")
 println(s"Intercepts: \n${lrModel.interceptVector}")
+
+val trainingSummary = lrModel.summary
+
+val objectiveHistory = trainingSummary.objectiveHistory
--- End diff --

Done


---




[GitHub] spark pull request #20332: [SPARK-23138][ML][DOC] Multiclass logistic regres...

2018-01-26 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/20332#discussion_r164151731
  
--- Diff: 
examples/src/main/python/ml/multiclass_logistic_regression_with_elastic_net.py 
---
@@ -43,6 +43,43 @@
 # Print the coefficients and intercept for multinomial logistic 
regression
 print("Coefficients: \n" + str(lrModel.coefficientMatrix))
 print("Intercept: " + str(lrModel.interceptVector))
+
+trainingSummary = lrModel.summary
+
+# Obtain the objective per iteration
+objectiveHistory = trainingSummary.objectiveHistory
+print("objectiveHistory:")
+for objective in objectiveHistory:
+print(objective)
+
+print("False positive rate by label:")
--- End diff --

Done


---




[GitHub] spark pull request #20410: [SPARK-23234][ML][PYSPARK] Remove default outputC...

2018-01-26 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/20410

[SPARK-23234][ML][PYSPARK] Remove default outputCol in python

## What changes were proposed in this pull request?

SPARK-22799 and SPARK-22797 are causing valid Python test failures. The 
reason is that the Python API sets its default params via `set`, so an 
`outputCol` value is always explicitly set by the Python API for `Bucketizer`.
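
A minimal sketch of the default-vs-explicitly-set distinction this description relies on, using the param-introspection helpers from `pyspark.ml.param.Params` (illustrative values only):

```python
from pyspark.ml.feature import Bucketizer

b = Bucketizer(splits=[-float("inf"), 0.0, float("inf")],
               inputCol="features", outputCol="buckets")

# isSet() is True only for params set explicitly (constructor or setters);
# a value installed purely as a default leaves isSet() False.
print(b.isSet(b.outputCol))      # True: passed to the constructor above
print(b.hasDefault(b.inputCol))  # False: inputCol has no default value
```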

## How was this patch tested?

The previously failing UTs now pass.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-23234

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20410.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20410


commit 13e4731cf983f9fad386875862b806ff03817e1c
Author: Marco Gaido 
Date:   2018-01-26T16:07:45Z

[SPARK-23234][ML][PYSPARK] Remove default outputCol in python




---




[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20402
  
**[Test build #86710 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86710/testReport)**
 for PR 20402 at commit 
[`77c6761`](https://github.com/apache/spark/commit/77c676133ad2ff3cfe1875615c72bba518627383).


---




[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20402
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/289/
Test PASSed.


---




[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20403
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86705/
Test FAILed.


---




[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20403
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20402
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20403
  
**[Test build #86705 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86705/testReport)**
 for PR 20403 at commit 
[`4899e33`](https://github.com/apache/spark/commit/4899e33a0608f96c6151f692cee4146f1082a46d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20402
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86709/
Test FAILed.


---




[GitHub] spark issue #20402: [SPARK-23223][SQL] Make stacking dataset transforms more...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20402
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20409
  
**[Test build #86707 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86707/testReport)**
 for PR 20409 at commit 
[`317c4fd`](https://github.com/apache/spark/commit/317c4fd54ecb707b92088c62aebd551805ecae8f).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20409
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86707/
Test FAILed.


---




[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20409
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

2018-01-26 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/20397
  
cc @cloud-fan 


---




[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...

2018-01-26 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/20396
  
The test failures are unrelated; they are caused by SPARK-23234.


---




[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...

2018-01-26 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/20407
  
The test failures are unrelated; they are caused by SPARK-23234.


---




[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20407
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86703/
Test FAILed.


---




[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20407
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20407
  
**[Test build #86703 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86703/testReport)**
 for PR 20407 at commit 
[`074c342`](https://github.com/apache/spark/commit/074c34245d300901390d2d5ed74bb69e32539b8a).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20397
  
**[Test build #86708 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86708/testReport)**
 for PR 20397 at commit 
[`ce98e09`](https://github.com/apache/spark/commit/ce98e09f72a18776a6fa4c659ea3fa3c6a94801b).


---




[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20409
  
**[Test build #86707 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86707/testReport)**
 for PR 20409 at commit 
[`317c4fd`](https://github.com/apache/spark/commit/317c4fd54ecb707b92088c62aebd551805ecae8f).


---




[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20397
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20397
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/288/
Test PASSed.


---




[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20409
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20409
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/287/
Test PASSed.


---




[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...

2018-01-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20409
  
retest this please


---




[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20409
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20409
  
**[Test build #86706 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86706/testReport)**
 for PR 20409 at commit 
[`317c4fd`](https://github.com/apache/spark/commit/317c4fd54ecb707b92088c62aebd551805ecae8f).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20409
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86706/
Test FAILed.


---




[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20409
  
**[Test build #86706 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86706/testReport)**
 for PR 20409 at commit 
[`317c4fd`](https://github.com/apache/spark/commit/317c4fd54ecb707b92088c62aebd551805ecae8f).


---




[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20409
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20409
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/286/
Test PASSed.


---




[GitHub] spark issue #20409: Reset the cache in asNondeterministic to set determinist...

2018-01-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20409
  
cc @ueshin and @viirya, could you please take a look when you have some 
time?


---




[GitHub] spark pull request #20409: Reset the cache in asNondeterministic to set dete...

2018-01-26 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/20409

Reset the cache in asNondeterministic to set deterministic properly

## What changes were proposed in this pull request?


Reproducer:

```python
from pyspark.sql.functions import udf
f = udf(lambda x: x)
spark.range(1).select(f("id"))  # cache JVM UDF instance.
f = f.asNondeterministic()

spark.range(1).select(f("id"))._jdf.logicalPlan().projectList().head().deterministic()
```

It should return `False`, but the current master returns `True`. It seems 
this is because we cache the JVM UDF instance once it's called, and then reuse 
it even after `deterministic` is changed.

For an easy reproducer, with the diff below:


```diff
diff --git a/python/pyspark/sql/udf.py b/python/pyspark/sql/udf.py
index de96846c5c7..026a78bf547 100644
--- a/python/pyspark/sql/udf.py
+++ b/python/pyspark/sql/udf.py
@@ -180,6 +180,7 @@ class UserDefinedFunction(object):
 wrapper.deterministic = self.deterministic
 wrapper.asNondeterministic = functools.wraps(
 self.asNondeterministic)(lambda: 
self.asNondeterministic()._wrapped())
+wrapper._unwrapped = lambda: self
 return wrapper

 def asNondeterministic(self):
```


**Before**

```python
>>> from pyspark.sql.functions import udf
>>> f = udf(lambda x: x)
>>> spark.range(1).select(f("id"))
DataFrame[(id): string]
>>> f._unwrapped()._judf_placeholder.udfDeterministic()
True
>>> ndf = f.asNondeterministic()
>>> ndf.deterministic
False
>>> spark.range(1).select(ndf("id"))
DataFrame[(id): string]
>>> ndf._unwrapped()._judf_placeholder.udfDeterministic()
True
```

**After**


```python
>>> from pyspark.sql.functions import udf
>>> f = udf(lambda x: x)
>>> spark.range(1).select(f("id"))
DataFrame[(id): string]
>>> f._unwrapped()._judf_placeholder.udfDeterministic()
True
>>> ndf = f.asNondeterministic()
>>> ndf.deterministic
False
>>> spark.range(1).select(ndf("id"))
DataFrame[(id): string]
>>> ndf._unwrapped()._judf_placeholder.udfDeterministic()
False
```

## How was this patch tested?

Manually tested. I am not sure whether I should add a test that involves 
this many JVM accesses.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-23233

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20409.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20409


commit 317c4fd54ecb707b92088c62aebd551805ecae8f
Author: hyukjinkwon 
Date:   2018-01-26T15:01:22Z

Reset the cache in asNondeterministic to set deterministic properly




---




[GitHub] spark issue #20408: [SPARK-23189][Core][UI] Reflect stage level blacklisting...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20408
  
Can one of the admins verify this patch?


---








[GitHub] spark pull request #20408: [SPARK-23189][Core][UI] Reflect stage level black...

2018-01-26 Thread attilapiros
GitHub user attilapiros opened a pull request:

https://github.com/apache/spark/pull/20408

[SPARK-23189][Core][UI] Reflect stage level blacklisting on executor tab 

## What changes were proposed in this pull request?

The purpose of this PR is to reflect stage-level blacklisting on the 
executor tab for the currently active stages.

After this change, the Status column of the executor tab will show one of 
the following labels (see the sketch after this list):

- "Blacklisted" when the executor is blacklisted at the application level 
(the old flag)
- "Dead" when the executor is not Blacklisted and not Active
- "Blacklisted in Stages: [...]" when the executor is Active but its 
blacklistedInStages set is not empty; the active stage IDs are listed 
comma-separated within the []
- "Active" when the executor is Active and its blacklistedInStages set is empty

## How was this patch tested?

Both with unit tests and manually.

 Manual test

Spark was started as:

```bash
 bin/spark-shell --master "local-cluster[2,1,1024]" --conf 
"spark.blacklist.enabled=true" --conf 
"spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
"spark.blacklist.application.maxFailedTasksPerExecutor=10" 
```

And the job was:
```scala
import org.apache.spark.SparkEnv

val pairs = sc.parallelize(1 to 1, 10).map { x =>
  if (SparkEnv.get.executorId.toInt == 0) throw new RuntimeException("Bad 
executor")
  else  {
Thread.sleep(10)
(x % 10, x)
  }
}
  
val all = pairs.cogroup(pairs)
  
all.collect()
```

UI screenshots about the running:

- One executor is blacklisted in the two stages: 

![One executor is blacklisted in two 
stages](https://issues.apache.org/jira/secure/attachment/12907862/multiple_stages_1.png)

- One stage completes the other one is still running: 

![One stage completes the other is still 
running](https://issues.apache.org/jira/secure/attachment/12907862/multiple_stages_2.png)

- Both stages are completed: 

![Both stages are 
completed](https://issues.apache.org/jira/secure/attachment/12907862/multiple_stages_3.png)

### Unit tests

In AppStatusListenerSuite.scala, both node blacklisting for a stage and 
executor blacklisting for a stage are tested.




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/attilapiros/spark SPARK-23189

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20408.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20408


commit 4f5176be0d3da7794d20895d8a0bfce16d6b8e5c
Author: “attilapiros” 
Date:   2018-01-25T22:17:11Z

inital version




---




[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...

2018-01-26 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/19575#discussion_r164131001
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1640,6 +1640,129 @@ Configuration of Hive is done by placing your 
`hive-site.xml`, `core-site.xml` a
 You may run `./bin/spark-sql --help` for a complete list of all available
 options.
 
+# PySpark Usage Guide for Pandas with Arrow
+
+## Arrow in Spark
+
+Apache Arrow is an in-memory columnar data format that is used in Spark to 
efficiently transfer
+data between JVM and Python processes. This currently is most beneficial 
to Python users that
+work with Pandas/NumPy data. Its usage is not automatic and might require 
some minor
+changes to configuration or code to take full advantage and ensure 
compatibility. This guide will
+give a high-level description of how to use Arrow in Spark and highlight 
any differences when
+working with Arrow-enabled data.
+
+### Ensure PyArrow Installed
+
+If you install PySpark using pip, then PyArrow can be brought in as an 
extra dependency of the
+SQL module with the command `pip install pyspark[sql]`. Otherwise, you 
must ensure that PyArrow
+is installed and available on all cluster nodes. The current supported 
version is 0.8.0.
+You can install using pip or conda from the conda-forge channel. See 
PyArrow
+[installation](https://arrow.apache.org/docs/python/install.html) for 
details.
+
+## Enabling for Conversion to/from Pandas
+
+Arrow is available as an optimization when converting a Spark DataFrame to 
Pandas using the call
+`toPandas()` and when creating a Spark DataFrame from Pandas with 
`createDataFrame(pandas_df)`.
+To use Arrow when executing these calls, users need to first set the Spark 
configuration
+'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default.
--- End diff --

>  I feel we should discourage the use of toPandas

I am not sure that's necessary. I think it's reasonable to 
down-sample/aggregate data in Spark and use `toPandas()` to bring a small 
result to the driver for further analysis or visualization. Maybe we should 
instead discourage the use of `toPandas` with large amounts of data?
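
A hedged sketch of that pattern (assumes an active `SparkSession` named `spark`; the table and column names are illustrative):

```python
# Aggregate on the cluster first; only the small result crosses to the driver.
summary = (spark.table("events")            # illustrative table name
           .groupBy("country")
           .count()
           .orderBy("count", ascending=False)
           .limit(50))

pdf = summary.toPandas()                      # driver-sized data only
pdf.plot(x="country", y="count", kind="bar")  # local plotting (needs matplotlib)
```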


---




[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20396
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86704/
Test FAILed.


---




[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20396
  
**[Test build #86704 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86704/testReport)**
 for PR 20396 at commit 
[`8a68f75`](https://github.com/apache/spark/commit/8a68f758a7a41f6c2a9a58f54a982745665be6a6).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20396
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20405: [SPARK-23229][SQL] Dataset.hint should use planWithBarri...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20405
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20405: [SPARK-23229][SQL] Dataset.hint should use planWithBarri...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20405
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86702/
Test FAILed.


---




[GitHub] spark issue #20405: [SPARK-23229][SQL] Dataset.hint should use planWithBarri...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20405
  
**[Test build #86702 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86702/testReport)**
 for PR 20405 at commit 
[`47bb245`](https://github.com/apache/spark/commit/47bb245353202208f2c41634c3796c8e4d2be663).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20167: [SPARK-16501] [MESOS] Allow providing Mesos princ...

2018-01-26 Thread rvesse
Github user rvesse commented on a diff in the pull request:

https://github.com/apache/spark/pull/20167#discussion_r164123285
  
--- Diff: 
resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala
 ---
@@ -71,40 +74,64 @@ trait MesosSchedulerUtils extends Logging {
   failoverTimeout: Option[Double] = None,
   frameworkId: Option[String] = None): SchedulerDriver = {
 val fwInfoBuilder = 
FrameworkInfo.newBuilder().setUser(sparkUser).setName(appName)
-val credBuilder = Credential.newBuilder()
+
fwInfoBuilder.setHostname(Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse(
+  conf.get(DRIVER_HOST_ADDRESS)))
 webuiUrl.foreach { url => fwInfoBuilder.setWebuiUrl(url) }
 checkpoint.foreach { checkpoint => 
fwInfoBuilder.setCheckpoint(checkpoint) }
 failoverTimeout.foreach { timeout => 
fwInfoBuilder.setFailoverTimeout(timeout) }
 frameworkId.foreach { id =>
   fwInfoBuilder.setId(FrameworkID.newBuilder().setValue(id).build())
 }
-
fwInfoBuilder.setHostname(Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse(
-  conf.get(DRIVER_HOST_ADDRESS)))
-conf.getOption("spark.mesos.principal").foreach { principal =>
-  fwInfoBuilder.setPrincipal(principal)
-  credBuilder.setPrincipal(principal)
-}
-conf.getOption("spark.mesos.secret").foreach { secret =>
-  credBuilder.setSecret(secret)
-}
-if (credBuilder.hasSecret && !fwInfoBuilder.hasPrincipal) {
-  throw new SparkException(
-"spark.mesos.principal must be configured when spark.mesos.secret 
is set")
-}
+
 conf.getOption("spark.mesos.role").foreach { role =>
   fwInfoBuilder.setRole(role)
 }
 val maxGpus = conf.getInt("spark.mesos.gpus.max", 0)
 if (maxGpus > 0) {
   
fwInfoBuilder.addCapabilities(Capability.newBuilder().setType(Capability.Type.GPU_RESOURCES))
 }
+val credBuilder = buildCredentials(conf, fwInfoBuilder)
 if (credBuilder.hasPrincipal) {
   new MesosSchedulerDriver(
 scheduler, fwInfoBuilder.build(), masterUrl, credBuilder.build())
 } else {
   new MesosSchedulerDriver(scheduler, fwInfoBuilder.build(), masterUrl)
 }
   }
+  
+  def buildCredentials(
+  conf: SparkConf, 
+  fwInfoBuilder: Protos.FrameworkInfo.Builder): 
Protos.Credential.Builder = {
+val credBuilder = Credential.newBuilder()
+conf.getOption("spark.mesos.principal")
+  .orElse(Option(conf.getenv("SPARK_MESOS_PRINCIPAL")))
--- End diff --

I am not sure I understand enough of Spark's internals to answer your 
question.

These variables are only necessary for Mesos, and only on the driver, which 
registers the framework with Mesos. Is it actually possible to submit jobs via 
REST into a Mesos cluster? Even if it is, the Spark framework must already 
exist at that point, thereby rendering credentials unnecessary in that 
scenario.

Or am I missing something here?




---




[GitHub] spark pull request #20303: [SPARK-23128][SQL] A new approach to do adaptive ...

2018-01-26 Thread yucai
Github user yucai commented on a diff in the pull request:

https://github.com/apache/spark/pull/20303#discussion_r164122162
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStage.scala
 ---
@@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.adaptive
+
+import scala.concurrent.{ExecutionContext, Future}
+import scala.concurrent.duration.Duration
+
+import org.apache.spark.MapOutputStatistics
+import org.apache.spark.broadcast
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.physical.Partitioning
+import org.apache.spark.sql.execution._
+import org.apache.spark.sql.execution.exchange._
+import 
org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate
+import org.apache.spark.util.ThreadUtils
+
+/**
+ * In adaptive execution mode, an execution plan is divided into multiple 
QueryStages. Each
+ * QueryStage is a sub-tree that runs in a single stage.
+ */
+abstract class QueryStage extends UnaryExecNode {
+
+  var child: SparkPlan
+
+  // Ignore this wrapper for canonicalizing.
+  override def doCanonicalize(): SparkPlan = child.canonicalized
+
+  override def output: Seq[Attribute] = child.output
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
+
+  /**
+   * Execute childStages and wait until all stages are completed. Use a 
thread pool to avoid
+   * blocking on one child stage.
+   */
+  def executeChildStages(): Unit = {
+// Handle broadcast stages
+val broadcastQueryStages: Seq[BroadcastQueryStage] = child.collect {
+  case bqs: BroadcastQueryStageInput => bqs.childStage
+}
+val broadcastFutures = broadcastQueryStages.map { queryStage =>
+  Future { queryStage.prepareBroadcast() }(QueryStage.executionContext)
+}
+
+// Submit shuffle stages
+val executionId = 
sqlContext.sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY)
+val shuffleQueryStages: Seq[ShuffleQueryStage] = child.collect {
+  case sqs: ShuffleQueryStageInput => sqs.childStage
+}
+val shuffleStageFutures = shuffleQueryStages.map { queryStage =>
+  Future {
+SQLExecution.withExecutionId(sqlContext.sparkContext, executionId) 
{
+  queryStage.execute()
+}
+  }(QueryStage.executionContext)
+}
+
+ThreadUtils.awaitResult(
+  Future.sequence(broadcastFutures)(implicitly, 
QueryStage.executionContext), Duration.Inf)
+ThreadUtils.awaitResult(
+  Future.sequence(shuffleStageFutures)(implicitly, 
QueryStage.executionContext), Duration.Inf)
+  }
+
+  /**
+   * Before executing the plan in this query stage, we execute all child 
stages, optimize the plan
+   * in this stage and determine the reducer number based on the child 
stages' statistics. Finally
+   * we do a codegen for this query stage and update the UI with the new 
plan.
+   */
+  def prepareExecuteStage(): Unit = {
+// 1. Execute childStages
+executeChildStages()
+// It is possible to optimize this stage's plan here based on the 
child stages' statistics.
+
+// 2. Determine reducer number
+val queryStageInputs: Seq[ShuffleQueryStageInput] = child.collect {
+  case input: ShuffleQueryStageInput => input
+}
+val childMapOutputStatistics = 
queryStageInputs.map(_.childStage.mapOutputStatistics)
+  .filter(_ != null).toArray
+if (childMapOutputStatistics.length > 0) {
+  val exchangeCoordinator = new ExchangeCoordinator(
+conf.targetPostShuffleInputSize,
+conf.minNumPostShufflePartitions)
+
+  val 

[GitHub] spark pull request #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkS...

2018-01-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20404#discussion_r164122074
  
--- Diff: python/pyspark/sql/session.py ---
@@ -225,6 +225,7 @@ def __init__(self, sparkContext, jsparkSession=None):
 if SparkSession._instantiatedSession is None \
 or SparkSession._instantiatedSession._sc._jsc is None:
 SparkSession._instantiatedSession = self
+
self._jvm.org.apache.spark.sql.SparkSession.setDefaultSession(self._jsparkSession)
--- End diff --

The simplest way I can think of is to just add it in `def stop`:

```
self._jvm.org.apache.spark.sql.SparkSession.clearDefaultSession()
```
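
A hedged sketch of where that one-liner could land in `python/pyspark/sql/session.py` (abbreviated `stop` body; a sketch, not the final patch):

```python
def stop(self):
    """Stop the underlying SparkContext and clear the JVM default session."""
    self._sc.stop()
    # Assumption: clearing the JVM-side default session here keeps it in
    # sync with the Python side once this session is stopped.
    self._jvm.org.apache.spark.sql.SparkSession.clearDefaultSession()
    SparkSession._instantiatedSession = None
```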


---




[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20403
  
**[Test build #86705 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86705/testReport)**
 for PR 20403 at commit 
[`4899e33`](https://github.com/apache/spark/commit/4899e33a0608f96c6151f692cee4146f1082a46d).


---




[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20403
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/285/
Test PASSed.


---




[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20403
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...

2018-01-26 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/20403
  
Jenkins, retest this please.


---




[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20396
  
**[Test build #86704 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86704/testReport)**
 for PR 20396 at commit 
[`8a68f75`](https://github.com/apache/spark/commit/8a68f758a7a41f6c2a9a58f54a982745665be6a6).


---




[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20396
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/284/
Test PASSed.


---




[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20396
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20403
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86701/
Test FAILed.


---




[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20403
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20403: [MINOR][PYTHON] Minor doc correction for 'spark.sql.exec...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20403
  
**[Test build #86701 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86701/testReport)**
 for PR 20403 at commit 
[`4899e33`](https://github.com/apache/spark/commit/4899e33a0608f96c6151f692cee4146f1082a46d).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20396: [SPARK-23217][ML] Add cosine distance measure to ...

2018-01-26 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20396#discussion_r164116722
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -421,13 +453,220 @@ private[evaluation] object 
SquaredEuclideanSilhouette {
   computeSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: 
Double, _: Double)
 }
 
-val silhouetteScore = dfWithSquaredNorm
-  .select(avg(
-computeSilhouetteCoefficientUDF(
-  col(featuresCol), col(predictionCol).cast(DoubleType), 
col("squaredNorm"))
-  ))
-  .collect()(0)
-  .getDouble(0)
+val silhouetteScore = overallScore(dfWithSquaredNorm,
+  computeSilhouetteCoefficientUDF(col(featuresCol), 
col(predictionCol).cast(DoubleType),
+col("squaredNorm")))
+
+bClustersStatsMap.destroy()
+
+silhouetteScore
+  }
+}
+
+
+/**
+ * The algorithm which is implemented in this object, instead, is an 
efficient and parallel
+ * implementation of the Silhouette using the cosine distance measure. The 
cosine distance
+ * measure is defined as `1 - s` where `s` is the cosine similarity 
between two points.
+ *
+ * The total distance of the point `X` to the points `$C_{i}$` belonging 
to the cluster `$\Gamma$`
+ * is:
+ *
+ * 
+ *   $$
+ *   \sum\limits_{i=1}^N d(X, C_{i} ) =
+ *   \sum\limits_{i=1}^N \Big( 1 - \frac{\sum\limits_{j=1}^D x_{j}c_{ij} 
}{ \|X\|\|C_{i}\|} \Big)
+ *   = \sum\limits_{i=1}^N 1 - \sum\limits_{i=1}^N \sum\limits_{j=1}^D 
\frac{x_{j}}{\|X\|}
+ *   \frac{c_{ij}}{\|C_{i}\|}
+ *   = N - \sum\limits_{j=1}^D \frac{x_{j}}{\|X\|} \Big( 
\sum\limits_{i=1}^N
+ *   \frac{c_{ij}}{\|C_{i}\|} \Big)
+ *   $$
+ * 
+ *
+ * where `$x_{j}$` is the `j`-th dimension of the point `X` and `$c_{ij}$` 
is the `j`-th dimension
+ * of the `i`-th point in cluster `$\Gamma$`.
+ *
+ * Then, we can define the vector:
+ *
+ * 
+ *   $$
+ *   \xi_{X} : \xi_{X i} = \frac{x_{i}}{\|X\|}, i = 1, ..., D
+ *   $$
+ * 
+ *
+ * which can be precomputed for each point and the vector
+ *
+ * 
+ *   $$
+ *   \Omega_{\Gamma} : \Omega_{\Gamma i} = \sum\limits_{j=1}^N 
\xi_{C_{j}i}, i = 1, ..., D
+ *   $$
+ * 
+ *
+ * which can be precomputed too for each cluster `$\Gamma$` by its points 
`$C_{i}$`.
+ *
+ * With these definitions, the numerator becomes:
+ *
+ * 
+ *   $$
+ *   N - \sum\limits_{j=1}^D \xi_{X j} \Omega_{\Gamma j}
+ *   $$
+ * 
+ *
+ * Thus the average distance of a point `X` to the points of the cluster 
`$\Gamma$` is:
+ *
+ * 
+ *   $$
+ *   1 - \frac{\sum\limits_{j=1}^D \xi_{X j} \Omega_{\Gamma j}}{N}
+ *   $$
+ * 
+ *
+ * In the implementation, the precomputed values for the clusters are 
distributed among the worker
+ * nodes via broadcasted variables, because we can assume that the 
clusters are limited in number.
+ *
+ * The main strengths of this algorithm are the low computational 
complexity and the intrinsic
+ * parallelism. The precomputed information for each point and for each 
cluster can be computed
+ * with a computational complexity which is `O(N/W)`, where `N` is the 
number of points in the
+ * dataset and `W` is the number of worker nodes. After that, every point 
can be analyzed
+ * independently from the others.
+ *
+ * For every point we need to compute the average distance to all the 
clusters. Since the formula
+ * above requires `O(D)` operations, this phase has a computational 
complexity which is
+ * `O(C*D*N/W)` where `C` is the number of clusters (which we assume quite 
low), `D` is the number
+ * of dimensions, `N` is the number of points in the dataset and `W` is 
the number of worker
+ * nodes.
+ */
+private[evaluation] object CosineSilhouette extends Silhouette {
+
+  private[this] var kryoRegistrationPerformed: Boolean = false
+
+  private[this] val normalizedFeaturesColName = "normalizedFeatures"
+
+  /**
+   * This method registers the class
+   * [[org.apache.spark.ml.evaluation.CosineSilhouette.ClusterStats]]
+   * for kryo serialization.
+   *
+   * @param sc `SparkContext` to be used
+   */
+  def registerKryoClasses(sc: SparkContext): Unit = {
--- End diff --

Sorry, I am not sure I get what you mean. Which method is this 
duplicating? The registration happens only once, since there is a flag for it...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
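

A minimal sketch of the precomputation described in the doc comment quoted above: the normalized vector `$\xi_{X}$` per point, the per-cluster sum `$\Omega_{\Gamma}$`, and the resulting average cosine distance. Plain Scala collections stand in for the distributed DataFrame pipeline, and all names are illustrative assumptions, not the actual Spark internals:

```
// Sketch only: plain collections instead of the distributed pipeline.
object CosineSilhouetteSketch {
  type Point = Array[Double]

  private def norm(v: Point): Double = math.sqrt(v.map(x => x * x).sum)

  // xi_X: the point normalized by its L2 norm, precomputed once per point.
  def xi(p: Point): Point = { val n = norm(p); p.map(_ / n) }

  // Omega_Gamma: element-wise sum of the normalized points of one cluster.
  def omega(cluster: Seq[Point]): Point =
    cluster.map(xi).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

  // Average cosine distance of point X to a cluster of n points:
  // 1 - (xi_X . Omega_Gamma) / n, matching the formula in the doc comment.
  def avgDistance(x: Point, om: Point, n: Long): Double = {
    val dot = xi(x).zip(om).map { case (a, b) => a * b }.sum
    1.0 - dot / n
  }

  def main(args: Array[String]): Unit = {
    val cluster = Seq(Array(1.0, 0.0), Array(1.0, 1.0))
    println(avgDistance(Array(0.0, 1.0), omega(cluster), cluster.size))
  }
}
```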



[GitHub] spark pull request #20396: [SPARK-23217][ML] Add cosine distance measure to ...

2018-01-26 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20396#discussion_r164112428
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala
 ---
@@ -66,16 +66,38 @@ class ClusteringEvaluatorSuite
 assert(evaluator.evaluate(irisDataset) ~== 0.6564679231 relTol 1e-5)
   }
 
-  test("number of clusters must be greater than one") {
-val singleClusterDataset = irisDataset.where($"label" === 0.0)
+  /*
+Use the following Python code to load the data and evaluate it using the 
scikit-learn package.
--- End diff --

I see, the idea is to make it more copy-pasteable. That's fine.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20396: [SPARK-23217][ML] Add cosine distance measure to ...

2018-01-26 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20396#discussion_r164112835
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -111,6 +129,46 @@ object ClusteringEvaluator
 }
 
 
+private[evaluation] abstract class Silhouette {
+
+  /**
+   * It computes the Silhouette coefficient for a point.
+   */
+  def pointSilhouetteCoefficient(
+  clusterIds: Set[Double],
+  pointClusterId: Double,
+  pointClusterNumOfPoints: Long,
+  averageDistanceToCluster: (Double) => Double): Double = {
+// Here we compute the average dissimilarity of the current point to 
any cluster of which the
+// point is not a member.
+// The cluster with the lowest average dissimilarity - i.e. the 
nearest cluster to the current
+// point - is said to be the "neighboring cluster".
+val otherClusterIds = clusterIds.filter(_ != pointClusterId)
+val neighboringClusterDissimilarity = 
otherClusterIds.map(averageDistanceToCluster).min
+
+// adjustment for excluding the node itself from the computation of 
the average dissimilarity
+val currentClusterDissimilarity = if (pointClusterNumOfPoints == 1) {
+  0
+} else {
+  averageDistanceToCluster(pointClusterId) * pointClusterNumOfPoints /
+(pointClusterNumOfPoints - 1)
+}
+
+(currentClusterDissimilarity compare 
neighboringClusterDissimilarity).signum match {
--- End diff --

Is this just expressing ...

```
if (currentClusterDissimilarity < neighboringClusterDissimilarity) {
  ...
} else if (currentClusterDissimilarity > neighboringClusterDissimilarity) {

} else {
  ...
}
```

That seems more straightforward to my eyes, if that's all it is. This version 
has postfix notation, `signum`, and a match statement.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
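

For comparison, a sketch of the if/else form suggested above, under the assumption (mine, since the branch bodies are elided in the skeleton) that the branches implement the textbook silhouette definition `s = (b - a) / max(a, b)`:

```
// Hypothetical rewrite of the signum/match expression, assuming the elided
// branches implement the textbook silhouette s = (b - a) / max(a, b),
// where a is the current-cluster and b the neighboring-cluster dissimilarity.
def silhouette(a: Double, b: Double): Double = {
  if (a < b) {
    1.0 - a / b   // a < b: s = (b - a) / b
  } else if (a > b) {
    b / a - 1.0   // a > b: s = (b - a) / a
  } else {
    0.0           // equal dissimilarities
  }
}
```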



[GitHub] spark pull request #20396: [SPARK-23217][ML] Add cosine distance measure to ...

2018-01-26 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20396#discussion_r164111780
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -421,13 +460,220 @@ private[evaluation] object 
SquaredEuclideanSilhouette {
   computeSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: 
Double, _: Double)
 }
 
-val silhouetteScore = dfWithSquaredNorm
-  .select(avg(
-computeSilhouetteCoefficientUDF(
-  col(featuresCol), col(predictionCol).cast(DoubleType), 
col("squaredNorm"))
-  ))
-  .collect()(0)
-  .getDouble(0)
+val silhouetteScore = overallScore(dfWithSquaredNorm,
+  computeSilhouetteCoefficientUDF(col(featuresCol), 
col(predictionCol).cast(DoubleType),
+col("squaredNorm")))
+
+bClustersStatsMap.destroy()
+
+silhouetteScore
+  }
+}
+
+
+/**
+ * The algorithm which is implemented in this object, instead, is an 
efficient and parallel
+ * implementation of the Silhouette using the cosine distance measure. The 
cosine distance
+ * measure is defined as `1 - s` where `s` is the cosine similarity 
between two points.
+ *
+ * The total distance of the point `X` to the points `$C_{i}$` belonging 
to the cluster `$\Gamma$`
+ * is:
+ *
+ * 
+ *   $$
+ *   \sum\limits_{i=1}^N d(X, C_{i} ) =
+ *   \sum\limits_{i=1}^N \Big( 1 - \frac{\sum\limits_{j=1}^D x_{j}c_{ij} 
}{ \|X\|\|C_{i}\|} \Big)
+ *   = \sum\limits_{i=1}^N 1 - \sum\limits_{i=1}^N \sum\limits_{j=1}^D 
\frac{x_{j}}{\|X\|}
+ *   \frac{c_{ij}}{\|C_{i}\|}
+ *   = N - \sum\limits_{j=1}^D \frac{x_{j}}{\|X\|} \Big( 
\sum\limits_{i=1}^N
+ *   \frac{c_{ij}}{\|C_{i}\|} \Big)
+ *   $$
+ * 
+ *
+ * where `$x_{j}$` is the `j`-th dimension of the point `X` and `$c_{ij}$` 
is the `j`-th dimension
+ * of the `i`-th point in cluster `$\Gamma$`.
+ *
+ * Then, we can define the vector:
+ *
+ * 
+ *   $$
+ *   \xi_{X} : \xi_{X i} = \frac{x_{i}}{\|X\|}, i = 1, ..., D
+ *   $$
+ * 
+ *
+ * which can be precomputed for each point and the vector
+ *
+ * 
+ *   $$
+ *   \Omega_{\Gamma} : \Omega_{\Gamma i} = \sum\limits_{j=1}^N 
\xi_{C_{j}i}, i = 1, ..., D
+ *   $$
+ * 
+ *
+ * which can be precomputed too for each cluster `$\Gamma$` by its points 
`$C_{i}$`.
+ *
+ * With these definitions, the numerator becomes:
+ *
+ * 
+ *   $$
+ *   N - \sum\limits_{j=1}^D \xi_{X j} \Omega_{\Gamma j}
+ *   $$
+ * 
+ *
+ * Thus the average distance of a point `X` to the points of the cluster 
`$\Gamma$` is:
+ *
+ * 
+ *   $$
+ *   1 - \frac{\sum\limits_{j=1}^D \xi_{X j} \Omega_{\Gamma j}}{N}
+ *   $$
+ * 
+ *
+ * In the implementation, the precomputed values for the clusters are 
distributed among the worker
+ * nodes via broadcasted variables, because we can assume that the 
clusters are limited in number.
+ *
+ * The main strengths of this algorithm are the low computational 
complexity and the intrinsic
+ * parallelism. The precomputed information for each point and for each 
cluster can be computed
+ * with a computational complexity which is `O(N/W)`, where `N` is the 
number of points in the
+ * dataset and `W` is the number of worker nodes. After that, every point 
can be analyzed
+ * independently from the others.
+ *
+ * For every point we need to compute the average distance to all the 
clusters. Since the formula
+ * above requires `O(D)` operations, this phase has a computational 
complexity which is
+ * `O(C*D*N/W)` where `C` is the number of clusters (which we assume quite 
low), `D` is the number
+ * of dimensions, `N` is the number of points in the dataset and `W` is 
the number of worker
+ * nodes.
+ */
+private[evaluation] object CosineSilhouette extends Silhouette {
+
+  private[this] var kryoRegistrationPerformed: Boolean = false
+
+  private[this] val normalizedFeaturesColName = "normalizedFeatures"
+
+  /**
+   * This method registers the class
+   * [[org.apache.spark.ml.evaluation.CosineSilhouette.ClusterStats]]
+   * for kryo serialization.
+   *
+   * @param sc `SparkContext` to be used
+   */
+  def registerKryoClasses(sc: SparkContext): Unit = {
+if (!kryoRegistrationPerformed) {
+  sc.getConf.registerKryoClasses(
+Array(
+  classOf[CosineSilhouette.ClusterStats]
+)
+  )
+  kryoRegistrationPerformed = true
+}
+  }
+
+  case class ClusterStats(normalizedFeatureSum: Vector, numOfPoints: Long)
+
+  /**
+   * The method takes the input dataset and computes the aggregated values
+   * about 

[GitHub] spark pull request #20396: [SPARK-23217][ML] Add cosine distance measure to ...

2018-01-26 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20396#discussion_r164112258
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -421,13 +460,220 @@ private[evaluation] object 
SquaredEuclideanSilhouette {
   computeSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: 
Double, _: Double)
 }
 
-val silhouetteScore = dfWithSquaredNorm
-  .select(avg(
-computeSilhouetteCoefficientUDF(
-  col(featuresCol), col(predictionCol).cast(DoubleType), 
col("squaredNorm"))
-  ))
-  .collect()(0)
-  .getDouble(0)
+val silhouetteScore = overallScore(dfWithSquaredNorm,
+  computeSilhouetteCoefficientUDF(col(featuresCol), 
col(predictionCol).cast(DoubleType),
+col("squaredNorm")))
+
+bClustersStatsMap.destroy()
+
+silhouetteScore
+  }
+}
+
+
+/**
+ * The algorithm which is implemented in this object, instead, is an 
efficient and parallel
+ * implementation of the Silhouette using the cosine distance measure. The 
cosine distance
+ * measure is defined as `1 - s` where `s` is the cosine similarity 
between two points.
+ *
+ * The total distance of the point `X` to the points `$C_{i}$` belonging 
to the cluster `$\Gamma$`
+ * is:
+ *
+ * 
+ *   $$
+ *   \sum\limits_{i=1}^N d(X, C_{i} ) =
+ *   \sum\limits_{i=1}^N \Big( 1 - \frac{\sum\limits_{j=1}^D x_{j}c_{ij} 
}{ \|X\|\|C_{i}\|} \Big)
+ *   = \sum\limits_{i=1}^N 1 - \sum\limits_{i=1}^N \sum\limits_{j=1}^D 
\frac{x_{j}}{\|X\|}
+ *   \frac{c_{ij}}{\|C_{i}\|}
+ *   = N - \sum\limits_{j=1}^D \frac{x_{j}}{\|X\|} \Big( 
\sum\limits_{i=1}^N
+ *   \frac{c_{ij}}{\|C_{i}\|} \Big)
+ *   $$
+ * 
+ *
+ * where `$x_{j}$` is the `j`-th dimension of the point `X` and `$c_{ij}$` 
is the `j`-th dimension
+ * of the `i`-th point in cluster `$\Gamma$`.
+ *
+ * Then, we can define the vector:
+ *
+ * 
+ *   $$
+ *   \xi_{X} : \xi_{X i} = \frac{x_{i}}{\|X\|}, i = 1, ..., D
+ *   $$
+ * 
+ *
+ * which can be precomputed for each point and the vector
+ *
+ * 
+ *   $$
+ *   \Omega_{\Gamma} : \Omega_{\Gamma i} = \sum\limits_{j=1}^N 
\xi_{C_{j}i}, i = 1, ..., D
+ *   $$
+ * 
+ *
+ * which can be precomputed too for each cluster `$\Gamma$` by its points 
`$C_{i}$`.
+ *
+ * With these definitions, the numerator becomes:
+ *
+ * 
+ *   $$
+ *   N - \sum\limits_{j=1}^D \xi_{X j} \Omega_{\Gamma j}
+ *   $$
+ * 
+ *
+ * Thus the average distance of a point `X` to the points of the cluster 
`$\Gamma$` is:
+ *
+ * 
+ *   $$
+ *   1 - \frac{\sum\limits_{j=1}^D \xi_{X j} \Omega_{\Gamma j}}{N}
+ *   $$
+ * 
+ *
+ * In the implementation, the precomputed values for the clusters are 
distributed among the worker
+ * nodes via broadcasted variables, because we can assume that the 
clusters are limited in number.
+ *
+ * The main strengths of this algorithm are the low computational 
complexity and the intrinsic
+ * parallelism. The precomputed information for each point and for each 
cluster can be computed
+ * with a computational complexity which is `O(N/W)`, where `N` is the 
number of points in the
+ * dataset and `W` is the number of worker nodes. After that, every point 
can be analyzed
+ * independently from the others.
+ *
+ * For every point we need to compute the average distance to all the 
clusters. Since the formula
+ * above requires `O(D)` operations, this phase has a computational 
complexity which is
+ * `O(C*D*N/W)` where `C` is the number of clusters (which we assume quite 
low), `D` is the number
+ * of dimensions, `N` is the number of points in the dataset and `W` is 
the number of worker
+ * nodes.
+ */
+private[evaluation] object CosineSilhouette extends Silhouette {
+
+  private[this] var kryoRegistrationPerformed: Boolean = false
+
+  private[this] val normalizedFeaturesColName = "normalizedFeatures"
+
+  /**
+   * This method registers the class
+   * [[org.apache.spark.ml.evaluation.CosineSilhouette.ClusterStats]]
+   * for kryo serialization.
+   *
+   * @param sc `SparkContext` to be used
+   */
+  def registerKryoClasses(sc: SparkContext): Unit = {
+if (!kryoRegistrationPerformed) {
+  sc.getConf.registerKryoClasses(
+Array(
+  classOf[CosineSilhouette.ClusterStats]
+)
+  )
+  kryoRegistrationPerformed = true
+}
+  }
+
+  case class ClusterStats(normalizedFeatureSum: Vector, numOfPoints: Long)
+
+  /**
+   * The method takes the input dataset and computes the aggregated values
+   * about 

[GitHub] spark pull request #20396: [SPARK-23217][ML] Add cosine distance measure to ...

2018-01-26 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20396#discussion_r164112193
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -421,13 +453,220 @@ private[evaluation] object 
SquaredEuclideanSilhouette {
   computeSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: 
Double, _: Double)
 }
 
-val silhouetteScore = dfWithSquaredNorm
-  .select(avg(
-computeSilhouetteCoefficientUDF(
-  col(featuresCol), col(predictionCol).cast(DoubleType), 
col("squaredNorm"))
-  ))
-  .collect()(0)
-  .getDouble(0)
+val silhouetteScore = overallScore(dfWithSquaredNorm,
+  computeSilhouetteCoefficientUDF(col(featuresCol), 
col(predictionCol).cast(DoubleType),
+col("squaredNorm")))
+
+bClustersStatsMap.destroy()
+
+silhouetteScore
+  }
+}
+
+
+/**
+ * The algorithm which is implemented in this object, instead, is an 
efficient and parallel
+ * implementation of the Silhouette using the cosine distance measure. The 
cosine distance
+ * measure is defined as `1 - s` where `s` is the cosine similarity 
between two points.
+ *
+ * The total distance of the point `X` to the points `$C_{i}$` belonging 
to the cluster `$\Gamma$`
+ * is:
+ *
+ * 
+ *   $$
+ *   \sum\limits_{i=1}^N d(X, C_{i} ) =
+ *   \sum\limits_{i=1}^N \Big( 1 - \frac{\sum\limits_{j=1}^D x_{j}c_{ij} 
}{ \|X\|\|C_{i}\|} \Big)
+ *   = \sum\limits_{i=1}^N 1 - \sum\limits_{i=1}^N \sum\limits_{j=1}^D 
\frac{x_{j}}{\|X\|}
+ *   \frac{c_{ij}}{\|C_{i}\|}
+ *   = N - \sum\limits_{j=1}^D \frac{x_{j}}{\|X\|} \Big( 
\sum\limits_{i=1}^N
+ *   \frac{c_{ij}}{\|C_{i}\|} \Big)
+ *   $$
+ * 
+ *
+ * where `$x_{j}$` is the `j`-th dimension of the point `X` and `$c_{ij}$` 
is the `j`-th dimension
+ * of the `i`-th point in cluster `$\Gamma$`.
+ *
+ * Then, we can define the vector:
+ *
+ * 
+ *   $$
+ *   \xi_{X} : \xi_{X i} = \frac{x_{i}}{\|X\|}, i = 1, ..., D
+ *   $$
+ * 
+ *
+ * which can be precomputed for each point and the vector
+ *
+ * 
+ *   $$
+ *   \Omega_{\Gamma} : \Omega_{\Gamma i} = \sum\limits_{j=1}^N 
\xi_{C_{j}i}, i = 1, ..., D
+ *   $$
+ * 
+ *
+ * which can be precomputed too for each cluster `$\Gamma$` by its points 
`$C_{i}$`.
+ *
+ * With these definitions, the numerator becomes:
+ *
+ * 
+ *   $$
+ *   N - \sum\limits_{j=1}^D \xi_{X j} \Omega_{\Gamma j}
+ *   $$
+ * 
+ *
+ * Thus the average distance of a point `X` to the points of the cluster 
`$\Gamma$` is:
+ *
+ * 
+ *   $$
+ *   1 - \frac{\sum\limits_{j=1}^D \xi_{X j} \Omega_{\Gamma j}}{N}
+ *   $$
+ * 
+ *
+ * In the implementation, the precomputed values for the clusters are 
distributed among the worker
+ * nodes via broadcasted variables, because we can assume that the 
clusters are limited in number.
+ *
+ * The main strengths of this algorithm are the low computational 
complexity and the intrinsic
+ * parallelism. The precomputed information for each point and for each 
cluster can be computed
+ * with a computational complexity which is `O(N/W)`, where `N` is the 
number of points in the
+ * dataset and `W` is the number of worker nodes. After that, every point 
can be analyzed
+ * independently from the others.
+ *
+ * For every point we need to compute the average distance to all the 
clusters. Since the formula
+ * above requires `O(D)` operations, this phase has a computational 
complexity which is
+ * `O(C*D*N/W)` where `C` is the number of clusters (which we assume quite 
low), `D` is the number
+ * of dimensions, `N` is the number of points in the dataset and `W` is 
the number of worker
+ * nodes.
+ */
+private[evaluation] object CosineSilhouette extends Silhouette {
+
+  private[this] var kryoRegistrationPerformed: Boolean = false
+
+  private[this] val normalizedFeaturesColName = "normalizedFeatures"
+
+  /**
+   * This method registers the class
+   * [[org.apache.spark.ml.evaluation.CosineSilhouette.ClusterStats]]
+   * for kryo serialization.
+   *
+   * @param sc `SparkContext` to be used
+   */
+  def registerKryoClasses(sc: SparkContext): Unit = {
--- End diff --

This duplicates a method in ClusteringEvaluator, right? I wonder if this could 
happen just once. It's OK if it registers a bunch of classes, not all of which 
will be used.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
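

One possible way to avoid the duplication raised here is a single shared helper that registers the classes every silhouette implementation uses. The helper below is a hypothetical sketch, not the actual resolution of this review comment; it assumes both `ClusterStats` case classes from the diffs exist:

```
import org.apache.spark.SparkContext

object SilhouetteKryoRegistration {
  @volatile private var done = false

  // Register everything any silhouette implementation needs, exactly once.
  // It is fine to register classes that end up unused for a given metric.
  def registerAll(sc: SparkContext): Unit = synchronized {
    if (!done) {
      sc.getConf.registerKryoClasses(Array(
        classOf[SquaredEuclideanSilhouette.ClusterStats], // existing code
        classOf[CosineSilhouette.ClusterStats]            // from this PR
      ))
      done = true
    }
  }
}
```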



[GitHub] spark pull request #20396: [SPARK-23217][ML] Add cosine distance measure to ...

2018-01-26 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20396#discussion_r164111264
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -84,18 +81,39 @@ class ClusteringEvaluator @Since("2.3.0") 
(@Since("2.3.0") override val uid: Str
   @Since("2.3.0")
   def setMetricName(value: String): this.type = set(metricName, value)
 
-  setDefault(metricName -> "silhouette")
+  /**
+   * param for distance measure to be used in evaluation
+   * (supports `"squaredEuclidean"` (default), `"cosine"`)
+   * @group param
+   */
+  @Since("2.4.0")
+  val distanceMeasure: Param[String] = {
+val allowedParams = ParamValidators.inArray(Array("squaredEuclidean", 
"cosine"))
--- End diff --

You don't need to change this, but it occurs to me that on lots of the 
parameters that take discrete values, the error message could reference the 
same array of values the validator uses, to make sure they're always consistent.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
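

A sketch of the consistency idea described above: keep the allowed values in one array and reference it from both the validator and the param documentation, so the error message can never drift out of sync. The trait and field names are illustrative:

```
import org.apache.spark.ml.param.{Param, ParamValidators, Params}

// One array is the single source of truth for both the validator and the
// documentation string. Names here are illustrative, not the PR's code.
trait HasDistanceMeasureSketch extends Params {
  private val supported = Array("squaredEuclidean", "cosine")

  val distanceMeasure: Param[String] = new Param[String](
    this, "distanceMeasure",
    s"distance measure for evaluation, one of ${supported.mkString("(", ", ", ")")}",
    ParamValidators.inArray(supported))
}
```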



[GitHub] spark pull request #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkS...

2018-01-26 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20404#discussion_r164109814
  
--- Diff: python/pyspark/sql/session.py ---
@@ -225,6 +225,7 @@ def __init__(self, sparkContext, jsparkSession=None):
 if SparkSession._instantiatedSession is None \
 or SparkSession._instantiatedSession._sc._jsc is None:
 SparkSession._instantiatedSession = self
+
self._jvm.org.apache.spark.sql.SparkSession.setDefaultSession(self._jsparkSession)
--- End diff --

Ohh, I see, my misunderstanding. Let me figure out a way to clear this object.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkS...

2018-01-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20404#discussion_r164109079
  
--- Diff: python/pyspark/sql/session.py ---
@@ -225,6 +225,7 @@ def __init__(self, sparkContext, jsparkSession=None):
 if SparkSession._instantiatedSession is None \
 or SparkSession._instantiatedSession._sc._jsc is None:
 SparkSession._instantiatedSession = self
+
self._jvm.org.apache.spark.sql.SparkSession.setDefaultSession(self._jsparkSession)
--- End diff --

Actually, it seems not, because we don't call this code path. The stop-and-start 
logic in PySpark is convoluted, in my humble opinion. Setting the default one 
fixes an actual issue, and it seems we are okay with it, at least.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkS...

2018-01-26 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20404#discussion_r164108216
  
--- Diff: python/pyspark/sql/session.py ---
@@ -225,6 +225,7 @@ def __init__(self, sparkContext, jsparkSession=None):
 if SparkSession._instantiatedSession is None \
 or SparkSession._instantiatedSession._sc._jsc is None:
 SparkSession._instantiatedSession = self
+
self._jvm.org.apache.spark.sql.SparkSession.setDefaultSession(self._jsparkSession)
--- End diff --

>Btw, shall we clear it when stopping PySpark SparkSession?

The JVM SparkSession will clear it when the application is stopped 
(https://github.com/apache/spark/blob/3e252514741447004f3c18ddd77c617b4e37cfaa/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L961).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20383: [SPARK-23200] Reset Kubernetes-specific config on...

2018-01-26 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20383#discussion_r164106372
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
@@ -53,6 +53,21 @@ class Checkpoint(ssc: StreamingContext, val 
checkpointTime: Time)
   "spark.driver.host",
   "spark.driver.bindAddress",
   "spark.driver.port",
+  "spark.kubernetes.driver.pod.name",
+  "spark.kubernetes.executor.podNamePrefix",
+  "spark.kubernetes.initcontainer.executor.configmapname",
+  "spark.kubernetes.initcontainer.executor.configmapkey",
+  "spark.kubernetes.initcontainer.downloadJarsResourceIdentifier",
+  "spark.kubernetes.initcontainer.downloadJarsSecretLocation",
+  "spark.kubernetes.initcontainer.downloadFilesResourceIdentifier",
+  "spark.kubernetes.initcontainer.downloadFilesSecretLocation",
+  "spark.kubernetes.initcontainer.remoteJars",
+  "spark.kubernetes.initcontainer.remoteFiles",
+  "spark.kubernetes.mountdependencies.jarsDownloadDir",
+  "spark.kubernetes.mountdependencies.filesDownloadDir",
+  "spark.kubernetes.initcontainer.executor.stagingServerSecret.name",
--- End diff --

I think it will not affect the correctness of the streaming application. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
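

For context, a simplified sketch of what the reload list above does on recovery: every checkpointed property is restored, except the listed keys, which are taken from the freshly created SparkConf instead. This illustrates the semantics only and is not the actual Checkpoint code:

```
import org.apache.spark.SparkConf

object CheckpointReloadSketch {
  def reloadedConf(
      checkpointed: Map[String, String],
      current: SparkConf,
      propertiesToReload: Seq[String]): SparkConf = {
    val conf = new SparkConf(loadDefaults = false)
    // Restore everything the old run checkpointed...
    checkpointed.foreach { case (k, v) => conf.set(k, v) }
    // ...then overwrite the cluster-specific keys from the new environment.
    propertiesToReload.foreach { k =>
      current.getOption(k) match {
        case Some(v) => conf.set(k, v) // take the fresh value
        case None    => conf.remove(k) // stale key from the old run: drop it
      }
    }
    conf
  }
}
```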



[GitHub] spark pull request #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkS...

2018-01-26 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20404#discussion_r164105856
  
--- Diff: python/pyspark/sql/session.py ---
@@ -225,6 +225,7 @@ def __init__(self, sparkContext, jsparkSession=None):
 if SparkSession._instantiatedSession is None \
 or SparkSession._instantiatedSession._sc._jsc is None:
 SparkSession._instantiatedSession = self
+
self._jvm.org.apache.spark.sql.SparkSession.setDefaultSession(self._jsparkSession)
--- End diff --

Looking at the Scala code, it seems Scala's `getOrCreate` sets the 
`defaultSession`, so I was thinking it is more proper to set `defaultSession` 
here. Does PySpark support multiple sessions?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
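

For reference, the difference between the two holders can be observed on the JVM side with the public API; a quick spark-shell-style sketch (the sessions are illustrative):

```
// spark-shell-style sketch of the two JVM-side session holders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sketch").master("local[1]").getOrCreate()

// getOrCreate() records the first session as the process-wide default
// and also marks it active for the current thread.
assert(SparkSession.getDefaultSession.contains(spark))
assert(SparkSession.getActiveSession.contains(spark))

// A different session can be made active per thread without
// touching the default.
val other = spark.newSession()
SparkSession.setActiveSession(other)
assert(SparkSession.getDefaultSession.contains(spark))
SparkSession.clearActiveSession()
```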



[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...

2018-01-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20407
  
**[Test build #86703 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86703/testReport)**
 for PR 20407 at commit 
[`074c342`](https://github.com/apache/spark/commit/074c34245d300901390d2d5ed74bb69e32539b8a).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20407
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...

2018-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20407
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/283/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...

2018-01-26 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/20407

[SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJoin fallback

## What changes were proposed in this pull request?

In JoinStrategies, currently, if no better option is available, Spark falls 
back to BroadcastNestedLoopJoin. This strategy can be very problematic, since 
it can cause OOM. While generally this is not a big problem, in some 
applications like the Thriftserver it is an issue, because a failing job can 
put the whole application into a bad state. Thus, in these cases, it might be 
useful to be able to disable this behavior and let only the jobs which would 
cause it fail (a query that triggers the fallback is sketched after this message).

## How was this patch tested?

Added a UT.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-23124

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20407.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20407


commit 074c34245d300901390d2d5ed74bb69e32539b8a
Author: Marco Gaido 
Date:   2018-01-26T12:54:29Z

[SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJoin fallback




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
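

To reproduce the fallback this PR targets, a join with no equality predicate is enough; a small spark-shell-style sketch (the DataFrames are illustrative):

```
// A join with no equi-join keys cannot use a hash-based or sort-merge join,
// so the planner falls back to BroadcastNestedLoopJoin, the behavior this
// PR proposes making optional.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bnlj-sketch").master("local[1]").getOrCreate()
import spark.implicits._

val left = Seq(1, 2, 3).toDF("a")
val right = Seq(2, 3, 4).toDF("b")

val joined = left.join(right, $"a" < $"b") // non-equi condition only
joined.explain() // physical plan shows BroadcastNestedLoopJoin
```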



[GitHub] spark pull request #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkS...

2018-01-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20404#discussion_r164103440
  
--- Diff: python/pyspark/sql/session.py ---
@@ -225,6 +225,7 @@ def __init__(self, sparkContext, jsparkSession=None):
 if SparkSession._instantiatedSession is None \
 or SparkSession._instantiatedSession._sc._jsc is None:
 SparkSession._instantiatedSession = self
+
self._jvm.org.apache.spark.sql.SparkSession.setDefaultSession(self._jsparkSession)
--- End diff --

`setActiveSession` or `setDefaultSession`? Which one is more proper to set 
here?

Btw, shall we clear it when stopping PySpark SparkSession?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


