[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-12-22 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19218
  
What are multiple expressions?


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-12-22 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
@gatorsmile 
I tested manually. When table-level compression is not configured, it always 
takes the session-level compression and ignores the compression of the existing files. 
That seems like a bug; however, table files with multiple compression codecs do not affect 
reading and writing. 
Is it OK to add a test that checks reading and writing when the existing table files 
contain multiple compression codecs?
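
For illustration, a minimal sketch of such a check (it assumes a spark-shell session and a scratch path; it is not the test proposed in this PR):

```
// Write the same directory with two different session-level codecs, then make
// sure the resulting mixed-codec Parquet files can still be read back.
Seq("snappy", "gzip").foreach { codec =>
  spark.conf.set("spark.sql.parquet.compression.codec", codec)
  spark.range(10).write.mode("append").parquet("/tmp/mixed_codec_table")
}
assert(spark.read.parquet("/tmp/mixed_codec_table").count() == 20)
```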


---




[GitHub] spark pull request #19984: [SPARK-22789] Map-only continuous processing exec...

2017-12-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19984


---




[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution

2017-12-22 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/19984
  
Thanks! Merging to master!


---




[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19984
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19984
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85332/
Test PASSed.


---




[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19984
  
**[Test build #85332 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85332/testReport)**
 for PR 19984 at commit 
[`b4f7976`](https://github.com/apache/spark/commit/b4f79762c083735011bf98250c39c263876c8cc8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20034: [SPARK-22846][SQL] Fix table owner is null when c...

2017-12-22 Thread BruceXu1991
Github user BruceXu1991 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20034#discussion_r158577749
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala 
---
@@ -186,7 +186,7 @@ private[hive] class HiveClientImpl(
   /** Returns the configuration for the current session. */
   def conf: HiveConf = state.getConf
 
-  private val userName = state.getAuthenticator.getUserName
+  private val userName = conf.getUser
--- End diff --

Well, with Spark 2.2.1's current implementation,
```
private val userName = state.getAuthenticator.getUserName
```
when the implementation of `state.getAuthenticator` is 
**HadoopDefaultAuthenticator**, which is the default in the Hive conf, the username is 
obtained correctly. 

However, when the implementation of `state.getAuthenticator` is 
**SessionStateUserAuthenticator**, which is what my setup uses, the username will 
be null.

the simplified code below explains the reason:
1) HadoopDefaultAuthenticator
```
public class HadoopDefaultAuthenticator implements HiveAuthenticationProvider {
  // Fields referenced below (elided from the simplified snippet):
  private Configuration conf;
  private String userName;
  private List<String> groupNames;

  @Override
  public String getUserName() {
    return userName;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    UserGroupInformation ugi = null;
    try {
      ugi = Utils.getUGI();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    this.userName = ugi.getShortUserName();
    if (ugi.getGroupNames() != null) {
      this.groupNames = Arrays.asList(ugi.getGroupNames());
    }
  }
}

public class Utils {
  public static UserGroupInformation getUGI() throws LoginException, IOException {
    String doAs = System.getenv("HADOOP_USER_NAME");
    if (doAs != null && doAs.length() > 0) {
      return UserGroupInformation.createProxyUser(doAs, UserGroupInformation.getLoginUser());
    }
    return UserGroupInformation.getCurrentUser();
  }
}
```
This shows that HadoopDefaultAuthenticator obtains the username through 
Utils.getUGI(), so the username comes from HADOOP_USER_NAME or from the login user.

2)  SessionStateUserAuthenticator
```
public class SessionStateUserAuthenticator implements HiveAuthenticationProvider {
  // Elided from the simplified snippet: the session state set via the constructor.
  private SessionState sessionState;

  @Override
  public void setConf(Configuration arg0) {
  }

  @Override
  public String getUserName() {
    return sessionState.getUserName();
  }
}
```
This shows that SessionStateUserAuthenticator gets the username through 
sessionState.getUserName(), which is null. Here is the [instantiation of 
SessionState in 
HiveClientImpl](https://github.com/apache/spark/blob/1cf3e3a26961d306eb17b7629d8742a4df45f339/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L187).
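
For illustration, a minimal sketch of the fallback idea (a hypothetical helper, not the actual patch; the diff above simply switches to `conf.getUser`):

```
import org.apache.hadoop.hive.ql.session.SessionState
import org.apache.hadoop.security.UserGroupInformation

// Hypothetical helper: prefer the authenticator's user name, but fall back to
// the Hadoop UGI when the configured authenticator (e.g.
// SessionStateUserAuthenticator) returns null.
def resolveUserName(state: SessionState): String =
  Option(state.getAuthenticator.getUserName)
    .getOrElse(UserGroupInformation.getCurrentUser.getShortUserName)
```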
 


---




[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20059
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85331/
Test PASSed.


---




[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20059
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20059
  
**[Test build #85331 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85331/testReport)**
 for PR 20059 at commit 
[`f23bf0f`](https://github.com/apache/spark/commit/f23bf0fdf21f21224895a5c35e0d95956a29abf9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20061: [SPARK-22890][TEST] Basic tests for DateTimeOperations

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20061
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20061: [SPARK-22890][TEST] Basic tests for DateTimeOperations

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20061
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85330/
Test PASSed.


---




[GitHub] spark issue #20061: [SPARK-22890][TEST] Basic tests for DateTimeOperations

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20061
  
**[Test build #85330 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85330/testReport)**
 for PR 20061 at commit 
[`e8e4d11`](https://github.com/apache/spark/commit/e8e4d11a504c4169848baeabbec84af2a1b3e6a8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-22 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/19977
  
might have...I'll check the performance.


---




[GitHub] spark pull request #20004: [Spark-22818][SQL] csv escape of quote escape

2017-12-22 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20004#discussion_r158576836
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -148,6 +149,9 @@ class CSVOptions(
 format.setDelimiter(delimiter)
 format.setQuote(quote)
 format.setQuoteEscape(escape)
+if (charToEscapeQuoteEscaping != '\u') {
--- End diff --

Because we always call `format.setQuoteEscape(escape)`, this default value 
becomes wrong. That means that if users set `\u`, the actual quote escape in 
`univocity` is `\`. 

How about using `Option[Char]` as the type of `charToEscapeQuoteEscaping`? We 
would then call `format.setCharToEscapeQuoteEscaping` only if it is defined. 
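
For illustration, a minimal sketch of that `Option[Char]` approach (assumptions: a plain `parameters: Map[String, String]` and hard-coded delimiter/quote/escape values; the real change would live in `CSVOptions`):

```
import com.univocity.parsers.csv.CsvWriterSettings

// Sketch: keep charToEscapeQuoteEscaping as Option[Char] and only push it down
// to univocity when the user actually set the option.
val parameters: Map[String, String] = Map("charToEscapeQuoteEscaping" -> "\\")

val charToEscapeQuoteEscaping: Option[Char] =
  parameters.get("charToEscapeQuoteEscaping").map { s =>
    require(s.length == 1, "charToEscapeQuoteEscaping must be a single character")
    s.charAt(0)
  }

val format = new CsvWriterSettings().getFormat
format.setDelimiter(',')
format.setQuote('"')
format.setQuoteEscape('\\')
// Only override univocity's derived default when the option is defined.
charToEscapeQuoteEscaping.foreach(format.setCharToEscapeQuoteEscaping)
```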


---




[GitHub] spark issue #20004: [Spark-22818][SQL] csv escape of quote escape

2017-12-22 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20004
  
Thank you for your investigation! 

Basically, because the option `escape` is set to `\`, the default value of 
`charToEscapeQuoteEscaping` is actually `\` in effect.

Could you update the doc?



---




[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...

2017-12-22 Thread danielvdende
Github user danielvdende commented on the issue:

https://github.com/apache/spark/pull/20057
  
@gatorsmile could you explain why you have doubts about the feature? Thanks!


---




[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19977
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85329/
Test PASSed.


---




[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19977
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19977
  
**[Test build #85329 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85329/testReport)**
 for PR 19977 at commit 
[`3ec440d`](https://github.com/apache/spark/commit/3ec440de914d8f71c35f30cdccec815b199b5f17).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...

2017-12-22 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20023
  
```
db2 => create table decimals_test(id int, a decimal(31,18), b decimal(31,18))
DB2I  The SQL command completed successfully.
db2 => insert into decimals_test values (1, 2.33, 1.123456789123456789)
DB2I  The SQL command completed successfully.
db2 => select a * b from decimals_test

1
-
SQL0802N  Arithmetic overflow or other arithmetic exception occurred.  SQLSTATE=22003
```

I might not be getting your point. The above is the result I got. Is this your 
scenario 3 or 2?
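
For comparison, roughly the same scenario in Spark SQL (a spark-shell sketch, not taken from the PR; Spark currently returns NULL for this multiplication because the result precision/scale cannot be represented, which is the behavior SPARK-22036 targets):

```
spark.sql("CREATE TABLE decimals_test (id INT, a DECIMAL(31,18), b DECIMAL(31,18)) USING parquet")
spark.sql("INSERT INTO decimals_test VALUES (1, 2.33, 1.123456789123456789)")
// DB2 raises SQL0802N (arithmetic overflow) here; Spark currently yields NULL.
spark.sql("SELECT a * b FROM decimals_test").show(false)
```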


---




[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2017-12-22 Thread timout
Github user timout commented on the issue:

https://github.com/apache/spark/pull/17619
  
That does exactly what it is supposed to do. And you are absolutely right, it is 
related to executors.
I am sorry if that was not clear from my previous explanations.
Let us say we have a Spark Streaming app, i.e. a very long-running app:
the driver, started by Marathon from a Docker image, schedules (in the Mesos sense) 
executors that also run from Docker images (net=HOST); every executor is started from a 
Docker image on some Mesos agent.
So if some recoverable error happens, for instance 
ExecutorLostFailure (executor 40 exited caused by one of the running tasks) 
Reason: Remote RPC client disassociated... (I do not know about other environments, but 
it is relatively frequent in mine), 
the executor will be dead, and after 2 failures the Mesos agent node 
will be added to the MesosCoarseGrainedSchedulerBackend blacklist, so the driver 
will never schedule (in the Mesos sense) an executor on it again. As a result the app will 
starve... and, note, it will not die.
That is exactly what happened with my streaming apps before this patch.

The patch may already be incompatible with master, but I can fix that if 
needed.




---




[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19984
  
**[Test build #85332 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85332/testReport)**
 for PR 19984 at commit 
[`b4f7976`](https://github.com/apache/spark/commit/b4f79762c083735011bf98250c39c263876c8cc8).


---




[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19954
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85327/
Test PASSed.


---




[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19954
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19954
  
**[Test build #85327 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85327/testReport)**
 for PR 19954 at commit 
[`f82c568`](https://github.com/apache/spark/commit/f82c5682565bc0e1bc34ec428faedb53ee5ddecd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20011: [SPARK-20654][core] Add config to limit disk usage of th...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20011
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20011: [SPARK-20654][core] Add config to limit disk usage of th...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20011
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85326/
Test PASSed.


---




[GitHub] spark issue #20011: [SPARK-20654][core] Add config to limit disk usage of th...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20011
  
**[Test build #85326 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85326/testReport)**
 for PR 20011 at commit 
[`931b2d2`](https://github.com/apache/spark/commit/931b2d262aa02880631ca4c693a84fa4c4d12318).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...

2017-12-22 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20057
  
Thank you for your contribution! However, I have doubts about the value of this feature.

If you are interested in the JDBC-related work, could you take 
https://issues.apache.org/jira/browse/SPARK-22731? 

Thanks!


---




[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20059
  
**[Test build #85331 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85331/testReport)**
 for PR 20059 at commit 
[`f23bf0f`](https://github.com/apache/spark/commit/f23bf0fdf21f21224895a5c35e0d95956a29abf9).


---




[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-22 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19977
  
Yes; otherwise, it will introduce a performance regression, right?


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-12-22 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r158575144
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala
 ---
@@ -42,8 +43,15 @@ private[parquet] class ParquetOptions(
* Acceptable values are defined in 
[[shortParquetCompressionCodecNames]].
*/
   val compressionCodecClassName: String = {
-val codecName = parameters.getOrElse("compression",
-  sqlConf.parquetCompressionCodec).toLowerCase(Locale.ROOT)
+// `compression`, `parquet.compression`(i.e., 
ParquetOutputFormat.COMPRESSION), and
+// `spark.sql.parquet.compression.codec`
+// are in order of precedence from highest to lowest.
+val parquetCompressionConf = 
parameters.get(ParquetOutputFormat.COMPRESSION)
+val codecName = parameters
+  .get("compression")
+  .orElse(parquetCompressionConf)
--- End diff --

Yeah, we can submit a separate PR for that issue. The behavior change needs 
to be documented in the Spark SQL docs.


---




[GitHub] spark pull request #20059: [SPARK-22648][K8s] Add documentation covering ini...

2017-12-22 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20059#discussion_r158575137
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -528,51 +579,90 @@ specific to Spark on Kubernetes.
   
 
 
-   spark.kubernetes.driver.limit.cores
-   (none)
-   
- Specify the hard CPU 
[limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
 for the driver pod.
-   
- 
- 
-   spark.kubernetes.executor.limit.cores
-   (none)
-   
- Specify the hard CPU 
[limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
 for each executor pod launched for the Spark Application.
-   
- 
- 
-   spark.kubernetes.node.selector.[labelKey]
-   (none)
-   
- Adds to the node selector of the driver pod and executor pods, with 
key labelKey and the value as the
- configuration's value. For example, setting 
spark.kubernetes.node.selector.identifier to 
myIdentifier
- will result in the driver pod and executors having a node selector 
with key identifier and value
-  myIdentifier. Multiple node selector keys can be added 
by setting multiple configurations with this prefix.
-
-  
- 
-   
spark.kubernetes.driverEnv.[EnvironmentVariableName]
-   (none)
-   
- Add the environment variable specified by 
EnvironmentVariableName to
- the Driver process. The user can specify multiple of these to set 
multiple environment variables.
-   
- 
-  
-
spark.kubernetes.mountDependencies.jarsDownloadDir
-/var/spark-data/spark-jars
-
-  Location to download jars to in the driver and executors.
-  This directory must be empty and will be mounted as an empty 
directory volume on the driver and executor pods.
-
-  
-   
- 
spark.kubernetes.mountDependencies.filesDownloadDir
- /var/spark-data/spark-files
- 
-   Location to download jars to in the driver and executors.
-   This directory must be empty and will be mounted as an empty 
directory volume on the driver and executor pods.
- 
-   
+  spark.kubernetes.driver.limit.cores
+  (none)
+  
+Specify the hard CPU 
[limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
 for the driver pod.
+  
+
+
+  spark.kubernetes.executor.limit.cores
+  (none)
+  
+Specify the hard CPU 
[limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
 for each executor pod launched for the Spark Application.
+  
+
+
+  spark.kubernetes.node.selector.[labelKey]
+  (none)
+  
+Adds to the node selector of the driver pod and executor pods, with 
key labelKey and the value as the
+configuration's value. For example, setting 
spark.kubernetes.node.selector.identifier to 
myIdentifier
+will result in the driver pod and executors having a node selector 
with key identifier and value
+ myIdentifier. Multiple node selector keys can be added 
by setting multiple configurations with this prefix.
+  
+
+
+  
spark.kubernetes.driverEnv.[EnvironmentVariableName]
+  (none)
+  
+Add the environment variable specified by 
EnvironmentVariableName to
+the Driver process. The user can specify multiple of these to set 
multiple environment variables.
+  
+
+
+  spark.kubernetes.mountDependencies.jarsDownloadDir
+  /var/spark-data/spark-jars
+  
+Location to download jars to in the driver and executors.
+This directory must be empty and will be mounted as an empty directory 
volume on the driver and executor pods.
+  
+
+
+  spark.kubernetes.mountDependencies.filesDownloadDir
+  /var/spark-data/spark-files
+  
+Location to download jars to in the driver and executors.
+This directory must be empty and will be mounted as an empty directory 
volume on the driver and executor pods.
+  
+
+
+  spark.kubernetes.mountDependencies.timeout
+  5 minutes
+  
+   Timeout before aborting the attempt to download and unpack dependencies 
from remote locations into the driver and executor pods.
+  
+
+
+  
spark.kubernetes.mountDependencies.maxThreadPoolSize
+  5
+  
+   Maximum size of the thread pool for downloading remote dependencies 
into the driver and executor pods.
--- End diff --

Done.


---


[GitHub] spark pull request #20059: [SPARK-22648][K8s] Add documentation covering ini...

2017-12-22 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20059#discussion_r158575135
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -120,6 +120,57 @@ by their appropriate remote URIs. Also, application 
dependencies can be pre-moun
 Those dependencies can be added to the classpath by referencing them with 
`local://` URIs and/or setting the
 `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles.
 
+### Using Remote Dependencies
+When there are application dependencies hosted in remote locations like 
HDFS or HTTP servers, the driver and executor pods
+need a Kubernetes 
[init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/)
 for downloading
+the dependencies so the driver and executor containers can use them 
locally. This requires users to specify the container
+image for the init-container using the configuration property 
`spark.kubernetes.initContainer.image`. For example, users
+simply add the following option to the `spark-submit` command to specify 
the init-container image:
+
+```
+--conf spark.kubernetes.initContainer.image=
+```
+
+The init-container handles remote dependencies specified in `spark.jars` 
(or the `--jars` option of `spark-submit`) and
+`spark.files` (or the `--files` option of `spark-submit`). It also handles 
remotely hosted main application resources, e.g.,
+the main application jar. The following shows an example of using remote 
dependencies with the `spark-submit` command:
+
+```bash
+$ bin/spark-submit \
+--master k8s://https://: \
+--deploy-mode cluster \
+--name spark-pi \
+--class org.apache.spark.examples.SparkPi \
+--jars https://path/to/dependency1.jar,https://path/to/dependency2.jar
+--files hdfs://host:port/path/to/file1,hdfs://host:port/path/to/file2
+--conf spark.executor.instances=5 \
+--conf spark.kubernetes.driver.docker.image= \
+--conf spark.kubernetes.executor.docker.image= \
+--conf spark.kubernetes.initContainer.image=
+https://path/to/examples.jar
+```
+
+## Secret Management
+In some cases, a Spark application may need to use some credentials, e.g., 
for accessing data on a secured HDFS cluster
--- End diff --

Done.


---




[GitHub] spark issue #20061: [SPARK-22890][TEST] Basic tests for DateTimeOperations

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20061
  
**[Test build #85330 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85330/testReport)**
 for PR 20061 at commit 
[`e8e4d11`](https://github.com/apache/spark/commit/e8e4d11a504c4169848baeabbec84af2a1b3e6a8).


---




[GitHub] spark pull request #20061: [SPARK-22890][TEST] Basic tests for DateTimeOpera...

2017-12-22 Thread wangyum
GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/20061

[SPARK-22890][TEST] Basic tests for DateTimeOperations

## What changes were proposed in this pull request?

Test coverage for `DateTimeOperations`; this is a sub-task of 
[SPARK-22722](https://issues.apache.org/jira/browse/SPARK-22722).
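
For context, the kind of expressions this coverage targets (illustrative spark-shell queries only, not the contents of the actual test file):

```
// Illustrative only: date/time arithmetic with different operand types.
spark.sql("SELECT cast('2017-12-23' AS date) + interval 2 days").show()
spark.sql("SELECT cast('2017-12-23 09:30:00' AS timestamp) - interval 10 minutes").show()
```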

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-22890

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20061.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20061


commit 24b50f0c8371af258ed152363a9ba8148b23d2d2
Author: Yuming Wang 
Date:   2017-12-23T02:39:39Z

Basic tests for DateTimeOperations

commit e8e4d11a504c4169848baeabbec84af2a1b3e6a8
Author: Yuming Wang 
Date:   2017-12-23T02:53:40Z

Append a blank line




---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-12-22 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r158574986
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala
 ---
@@ -42,8 +43,15 @@ private[parquet] class ParquetOptions(
* Acceptable values are defined in 
[[shortParquetCompressionCodecNames]].
*/
   val compressionCodecClassName: String = {
-val codecName = parameters.getOrElse("compression",
-  sqlConf.parquetCompressionCodec).toLowerCase(Locale.ROOT)
+// `compression`, `parquet.compression`(i.e., 
ParquetOutputFormat.COMPRESSION), and
+// `spark.sql.parquet.compression.codec`
+// are in order of precedence from highest to lowest.
+val parquetCompressionConf = 
parameters.get(ParquetOutputFormat.COMPRESSION)
+val codecName = parameters
+  .get("compression")
+  .orElse(parquetCompressionConf)
--- End diff --

If so, Parquet's table-level compression may be overwritten by this PR, and 
that may not be what we want.
Shall I fix it first in another PR?


---




[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20059
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20059
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85325/
Test PASSed.


---




[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20059
  
**[Test build #85325 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85325/testReport)**
 for PR 20059 at commit 
[`fbb2112`](https://github.com/apache/spark/commit/fbb21121447fe131358042ce05454f88d6fb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20059
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85323/
Test PASSed.


---




[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20059
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20059
  
**[Test build #85323 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85323/testReport)**
 for PR 20059 at commit 
[`a6a6660`](https://github.com/apache/spark/commit/a6a666060787cadd832b5bd32281940a6c81a9a9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19954
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19954
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85324/
Test PASSed.


---




[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19954
  
**[Test build #85324 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85324/testReport)**
 for PR 19954 at commit 
[`785b90e`](https://github.com/apache/spark/commit/785b90e52580e9175896b22b00b23f30fbe020ef).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20039: [SPARK-22850][core] Ensure queued events are deli...

2017-12-22 Thread Ngone51
Github user Ngone51 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20039#discussion_r158574199
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala ---
@@ -125,13 +128,39 @@ private[spark] class LiveListenerBus(conf: SparkConf) 
{
 
   /** Post an event to all queues. */
   def post(event: SparkListenerEvent): Unit = {
-if (!stopped.get()) {
-  metrics.numEventsPosted.inc()
-  val it = queues.iterator()
-  while (it.hasNext()) {
-it.next().post(event)
+if (stopped.get()) {
+  return
+}
+
+metrics.numEventsPosted.inc()
+
+// If the event buffer is null, it means the bus has been started and 
we can avoid
+// synchronization and post events directly to the queues. This should 
be the most
+// common case during the life of the bus.
+if (queuedEvents == null) {
+  postToQueues(event)
--- End diff --

What if stop() is called after the null check and before the postToQueues() call?
Do you think we should check stopped.get() in postToQueues()?
Like:
```
private def postToQueues(event: SparkListenerEvent): Unit = {
  if (!stopped.get()) {
    val it = queues.iterator()
    while (it.hasNext()) {
      it.next().post(event)
    }
  }
}
```


---




[GitHub] spark pull request #20039: [SPARK-22850][core] Ensure queued events are deli...

2017-12-22 Thread Ngone51
Github user Ngone51 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20039#discussion_r158573853
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala ---
@@ -149,7 +158,11 @@ private[spark] class LiveListenerBus(conf: SparkConf) {
 }
 
 this.sparkContext = sc
-queues.asScala.foreach(_.start(sc))
+queues.asScala.foreach { q =>
+  q.start(sc)
+  queuedEvents.foreach(q.post)
--- End diff --

OK, **synchronized** can avoid this problem.


---




[GitHub] spark issue #19982: [SPARK-22787] [TEST] [SQL] Add a TPC-H query suite

2017-12-22 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/19982
  
@gatorsmile Any progress on this? 
https://github.com/apache/spark/pull/19982#discussion_r157119941
After thinking about your comment, I came up with collecting metrics for each 
rule, like this:
https://github.com/apache/spark/compare/master...maropu:MetricSpike
Does this conflict with your work, or is it not acceptable? Any comments welcome.


---




[GitHub] spark issue #20011: [SPARK-20654][core] Add config to limit disk usage of th...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20011
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85320/
Test PASSed.


---




[GitHub] spark issue #20011: [SPARK-20654][core] Add config to limit disk usage of th...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20011
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20011: [SPARK-20654][core] Add config to limit disk usage of th...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20011
  
**[Test build #85320 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85320/testReport)**
 for PR 20011 at commit 
[`be3f130`](https://github.com/apache/spark/commit/be3f1307f0edffdc7c9457ec960781cd28b07bf8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20039: [SPARK-22850][core] Ensure queued events are deli...

2017-12-22 Thread Ngone51
Github user Ngone51 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20039#discussion_r158573360
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala ---
@@ -149,7 +158,11 @@ private[spark] class LiveListenerBus(conf: SparkConf) {
 }
 
 this.sparkContext = sc
-queues.asScala.foreach(_.start(sc))
+queues.asScala.foreach { q =>
+  q.start(sc)
+  queuedEvents.foreach(q.post)
--- End diff --

What if stop() is called before all of the queuedEvents are posted to the AsyncEventQueue?
```
/**
 * Stop the listener bus. It will wait until the queued events have been processed, but new
 * events will be dropped.
 */
```
**(The "queued events" mentioned in the description above are not the same as the 
"queuedEvents" here.)**

Since queuedEvents are "posted" before the listeners are installed, can they be treated 
as new events?



---




[GitHub] spark issue #20060: [SPARK-22889][SPARKR] Set overwrite=T when install Spark...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20060
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85328/
Test PASSed.


---




[GitHub] spark issue #20060: [SPARK-22889][SPARKR] Set overwrite=T when install Spark...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20060
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20060: [SPARK-22889][SPARKR] Set overwrite=T when install Spark...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20060
  
**[Test build #85328 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85328/testReport)**
 for PR 20060 at commit 
[`6edbd71`](https://github.com/apache/spark/commit/6edbd710d0b73a7c4170dbb78eb42ace5a8c73a0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20059
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20059
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85321/
Test FAILed.


---




[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20059
  
**[Test build #85321 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85321/testReport)**
 for PR 20059 at commit 
[`c0a659a`](https://github.com/apache/spark/commit/c0a659a3117674dfbd5c078badd653886c15cc8e).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19929
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85322/
Test PASSed.


---




[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19929
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19929
  
**[Test build #85322 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85322/testReport)**
 for PR 19929 at commit 
[`cc309b0`](https://github.com/apache/spark/commit/cc309b0ce2496365afd8c602c282e3d84aeed940).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-22 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/19977
  
I found the optimizer rule can't combine nested concat calls like this:
```
scala> :paste
val df = sql("""
SELECT ((col1 || col2) || (col3 || col4)) col
FROM (
  SELECT
    encode(string(id), 'utf-8') col1,
    encode(string(id + 1), 'utf-8') col2,
    string(id + 2) col3,
    string(id + 3) col4
  FROM range(10)
)
""")

scala> df.explain(true)
== Parsed Logical Plan ==
'Project [concat(concat('col1, 'col2), concat('col3, 'col4)) AS col#4]
+- 'SubqueryAlias __auto_generated_subquery_name
   +- 'Project ['encode('string('id), utf-8) AS col1#0, 'encode('string(('id + 1)), utf-8) AS col2#1, 'string(('id + 2)) AS col3#2, 'string(('id + 3)) AS col4#3]
      +- 'UnresolvedTableValuedFunction range, [10]

== Analyzed Logical Plan ==
col: string
Project [concat(cast(concat(col1#0, col2#1) as string), concat(col3#2, col4#3)) AS col#4]
+- SubqueryAlias __auto_generated_subquery_name
   +- Project [encode(cast(id#9L as string), utf-8) AS col1#0, encode(cast((id#9L + cast(1 as bigint)) as string), utf-8) AS col2#1, cast((id#9L + cast(2 as bigint)) as string) AS col3#2, cast((id#9L + cast(3 as bigint)) as string) AS col4#3]
      +- Range (0, 10, step=1, splits=None)

== Optimized Logical Plan ==
Project [concat(cast(concat(encode(cast(id#9L as string), utf-8), encode(cast((id#9L + 1) as string), utf-8)) as string), cast((id#9L + 2) as string), cast((id#9L + 3) as string)) AS col#4]
+- Range (0, 10, step=1, splits=None)

== Physical Plan ==
*Project [concat(cast(concat(encode(cast(id#9L as string), utf-8), encode(cast((id#9L + 1) as string), utf-8)) as string), cast((id#9L + 2) as string), cast((id#9L + 3) as string)) AS col#4]
+- *Range (0, 10, step=1, splits=4)
```
Do we need to support the optimization in this case, too?
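
For reference, a rough sketch of a rule that flattens nested `Concat` calls (illustrative only; as written it would not see through the implicit cast that sits between the inner and outer concat in the plan above):

```
import org.apache.spark.sql.catalyst.expressions.{Concat, Expression}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative sketch: collapse Concat(Concat(a, b), c) into Concat(a, b, c).
object FlattenConcatSketch extends Rule[LogicalPlan] {
  private def flatten(e: Expression): Seq[Expression] = e match {
    case Concat(children) => children.flatMap(flatten)
    case other => Seq(other)
  }

  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case c: Concat if c.children.exists(_.isInstanceOf[Concat]) =>
      Concat(flatten(c))
  }
}
```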


---




[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19977
  
**[Test build #85329 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85329/testReport)**
 for PR 19977 at commit 
[`3ec440d`](https://github.com/apache/spark/commit/3ec440de914d8f71c35f30cdccec815b199b5f17).


---




[GitHub] spark pull request #20039: [SPARK-22850][core] Ensure queued events are deli...

2017-12-22 Thread Ngone51
Github user Ngone51 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20039#discussion_r158572288
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala ---
@@ -149,7 +158,11 @@ private[spark] class LiveListenerBus(conf: SparkConf) {
 }
 
 this.sparkContext = sc
-queues.asScala.foreach(_.start(sc))
+queues.asScala.foreach { q =>
+  q.start(sc)
+  queuedEvents.foreach(q.post)
--- End diff --

Yes, that's true, it's semantic only.


---




[GitHub] spark pull request #19977: [SPARK-22771][SQL] Concatenate binary inputs into...

2017-12-22 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19977#discussion_r158572201
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2171,7 +2171,8 @@ object functions {
   def base64(e: Column): Column = withExpr { Base64(e.expr) }
 
   /**
-   * Concatenates multiple input string columns together into a single 
string column.
+   * Concatenates multiple input columns together into a single column.
+   * If all inputs are binary, concat returns an output as binary. 
Otherwise, it returns as string.
--- End diff --

ok


---




[GitHub] spark pull request #20053: [SPARK-22873] [CORE] Init lastReportTimestamp wit...

2017-12-22 Thread Ngone51
Github user Ngone51 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20053#discussion_r158572161
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala ---
@@ -112,6 +112,7 @@ private class AsyncEventQueue(val name: String, conf: 
SparkConf, metrics: LiveLi
   private[scheduler] def start(sc: SparkContext): Unit = {
 if (started.compareAndSet(false, true)) {
   this.sc = sc
+  lastReportTimestamp = System.currentTimeMillis()
--- End diff --

But there's no 'last time' for 'first time'.


---




[GitHub] spark issue #20060: [SPARK-22889][SPARKR] Set overwrite=T when install Spark...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20060
  
**[Test build #85328 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85328/testReport)**
 for PR 20060 at commit 
[`6edbd71`](https://github.com/apache/spark/commit/6edbd710d0b73a7c4170dbb78eb42ace5a8c73a0).


---




[GitHub] spark issue #20060: [SPARK-22889][SPARKR] Set overwrite=T when install Spark...

2017-12-22 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/20060
  
cc @felixcheung 


---




[GitHub] spark pull request #20060: [SPARK-22889][SPARKR] Set overwrite=T when instal...

2017-12-22 Thread shivaram
GitHub user shivaram opened a pull request:

https://github.com/apache/spark/pull/20060

[SPARK-22889][SPARKR] Set overwrite=T when install SparkR in tests

## What changes were proposed in this pull request?

Since all CRAN checks go through the same machine, the tests fail if there is an older 
partial download or partial install of Spark left behind. This 
PR overwrites the install files when running tests. This shouldn't affect 
Jenkins, as `SPARK_HOME` is set when running Jenkins tests.

## How was this patch tested?

Test manually by running `R CMD check --as-cran`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/shivaram/spark-1 sparkr-overwrite-cran

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20060.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20060


commit 6edbd710d0b73a7c4170dbb78eb42ace5a8c73a0
Author: Shivaram Venkataraman 
Date:   2017-12-23T00:27:56Z

Set overwrite=T when install SparkR in tests




---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-12-22 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r158571702
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala
 ---
@@ -42,8 +43,15 @@ private[parquet] class ParquetOptions(
* Acceptable values are defined in 
[[shortParquetCompressionCodecNames]].
*/
   val compressionCodecClassName: String = {
-val codecName = parameters.getOrElse("compression",
-  sqlConf.parquetCompressionCodec).toLowerCase(Locale.ROOT)
+// `compression`, `parquet.compression`(i.e., 
ParquetOutputFormat.COMPRESSION), and
+// `spark.sql.parquet.compression.codec`
+// are in order of precedence from highest to lowest.
+val parquetCompressionConf = 
parameters.get(ParquetOutputFormat.COMPRESSION)
+val codecName = parameters
+  .get("compression")
+  .orElse(parquetCompressionConf)
--- End diff --

Could we keep the old behavior here and add this later? We do not want to 
mix multiple issues in the same PR. 


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-12-22 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r158571657
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveOptions.scala 
---
@@ -102,4 +111,18 @@ object HiveOptions {
 "collectionDelim" -> "colelction.delim",
 "mapkeyDelim" -> "mapkey.delim",
 "lineDelim" -> "line.delim").map { case (k, v) => 
k.toLowerCase(Locale.ROOT) -> v }
+
+  def getHiveWriteCompression(tableInfo: TableDesc, sqlConf: SQLConf): 
Option[(String, String)] = {
+tableInfo.getOutputFileFormatClassName.toLowerCase match {
+  case formatName if formatName.endsWith("parquetoutputformat") =>
+val compressionCodec = new 
ParquetOptions(tableInfo.getProperties.asScala.toMap,
+  sqlConf).compressionCodecClassName
+Option((ParquetOutputFormat.COMPRESSION, compressionCodec))
+  case formatName if formatName.endsWith("orcoutputformat") =>
+val compressionCodec = new 
OrcOptions(tableInfo.getProperties.asScala.toMap,
+  sqlConf).compressionCodec
--- End diff --

Yeah, just to make it consistent.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-12-22 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r158571649
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -35,7 +39,7 @@ case class TestData(key: Int, value: String)
 case class ThreeCloumntable(key: Int, value: String, key1: String)
 
 class InsertSuite extends QueryTest with TestHiveSingleton with 
BeforeAndAfter
-with SQLTestUtils {
+with ParquetTest {
--- End diff --

Sounds fine to me. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19929: [SPARK-22629][PYTHON] Add deterministic flag to p...

2017-12-22 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19929#discussion_r158571445
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2075,9 +2075,10 @@ class PandasUDFType(object):
 def udf(f=None, returnType=StringType()):
--- End diff --

Do we need to just add a `deterministic` parameter? Would adding it at the end 
be OK for PySpark, without breaking existing apps? cc @ueshin
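
For comparison, the Scala side exposes this as a builder-style method rather 
than an extra parameter. A hedged sketch, assuming the 2.3-era 
`UserDefinedFunction.asNondeterministic()` API:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object NondeterministicUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Marking the UDF nondeterministic tells the optimizer not to collapse,
    // reorder past filters, or re-evaluate it as if its result were stable.
    val randCol = udf(() => scala.util.Random.nextInt(100)).asNondeterministic()

    Seq(1, 2, 3).toDF("id").select($"id", randCol().as("rand")).show()
    spark.stop()
  }
}
```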


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19929: [SPARK-22629][PYTHON] Add deterministic flag to p...

2017-12-22 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19929#discussion_r158571371
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -58,6 +58,7 @@ class UDFRegistration private[sql] (functionRegistry: 
FunctionRegistry) extends
 | pythonIncludes: ${udf.func.pythonIncludes}
 | pythonExec: ${udf.func.pythonExec}
 | dataType: ${udf.dataType}
--- End diff --

Could you also print out `pythonEvalType`? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19929: [SPARK-22629][PYTHON] Add deterministic flag to p...

2017-12-22 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19929#discussion_r158571355
  
--- Diff: python/pyspark/sql/udf.py ---
@@ -157,5 +158,13 @@ def wrapper(*args):
 wrapper.func = self.func
 wrapper.returnType = self.returnType
 wrapper.evalType = self.evalType
+wrapper.asNondeterministic = self.asNondeterministic
 
 return wrapper
+
+def asNondeterministic(self):
+"""
+Updates UserDefinedFunction to nondeterministic.
--- End diff --

```
"""
Updates UserDefinedFunction to nondeterministic.

.. versionadded:: 2.3
"""
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19929: [SPARK-22629][PYTHON] Add deterministic flag to p...

2017-12-22 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19929#discussion_r158571320
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -434,6 +434,16 @@ def test_udf_with_array_type(self):
 self.assertEqual(list(range(3)), l1)
 self.assertEqual(1, l2)
 
+def test_nondeterministic_udf(self):
+from pyspark.sql.functions import udf
+import random
+udf_random_col = udf(lambda: int(100 * random.random()), 
IntegerType()).asNondeterministic()
+df = 
self.spark.createDataFrame([Row(1)]).select(udf_random_col().alias('RAND'))
+random.seed(1234)
+udf_add_ten = udf(lambda rand: rand + 10, IntegerType())
+[row] = df.withColumn('RAND_PLUS_TEN', 
udf_add_ten('RAND')).collect()
+self.assertEqual(row[0] + 10, row[1])
--- End diff --

Compare the values, since you already set the seed?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19954
  
**[Test build #85327 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85327/testReport)**
 for PR 19954 at commit 
[`f82c568`](https://github.com/apache/spark/commit/f82c5682565bc0e1bc34ec428faedb53ee5ddecd).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20059: [SPARK-22648][Kubernetes] Update documentation to...

2017-12-22 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20059#discussion_r158570531
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -120,6 +120,57 @@ by their appropriate remote URIs. Also, application 
dependencies can be pre-moun
 Those dependencies can be added to the classpath by referencing them with 
`local://` URIs and/or setting the
 `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles.
 
+### Using Remote Dependencies
+When there are application dependencies hosted in remote locations like 
HDFS or HTTP servers, the driver and executor pods
+need a Kubernetes 
[init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/)
 for downloading
+the dependencies so the driver and executor containers can use them 
locally. This requires users to specify the container
+image for the init-container using the configuration property 
`spark.kubernetes.initContainer.image`. For example, users
+simply add the following option to the `spark-submit` command to specify 
the init-container image:
+
+```
+--conf spark.kubernetes.initContainer.image=
+```
+
+The init-container handles remote dependencies specified in `spark.jars` 
(or the `--jars` option of `spark-submit`) and
+`spark.files` (or the `--files` option of `spark-submit`). It also handles 
remotely hosted main application resources, e.g.,
+the main application jar. The following shows an example of using remote 
dependencies with the `spark-submit` command:
+
+```bash
+$ bin/spark-submit \
+--master k8s://https://: \
+--deploy-mode cluster \
+--name spark-pi \
+--class org.apache.spark.examples.SparkPi \
+--jars https://path/to/dependency1.jar,https://path/to/dependency2.jar
+--files hdfs://host:port/path/to/file1,hdfs://host:port/path/to/file2
+--conf spark.executor.instances=5 \
+--conf spark.kubernetes.driver.docker.image= \
+--conf spark.kubernetes.executor.docker.image= \
+--conf spark.kubernetes.initContainer.image=
+https://path/to/examples.jar
+```
+
+## Secret Management
+In some cases, a Spark application may need to use some credentials, e.g., 
for accessing data on a secured HDFS cluster
--- End diff --

I'd rewrite this.

"Kubernetes Secrets can be used to provide credentials for a Spark 
application to access secured services. To mount secrets into a driver 
container, ..."
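
A hedged illustration of what that could look like from the user's side, 
assuming the `spark.kubernetes.{driver,executor}.secrets.[SecretName]` 
properties from the 2.3 Kubernetes backend; the secret name `spark-secret` and 
mount path `/etc/secrets` are made-up placeholders:

```scala
import org.apache.spark.SparkConf

object SecretMountSketch {
  def main(args: Array[String]): Unit = {
    // Mount the pre-created Kubernetes secret `spark-secret` at /etc/secrets
    // in both the driver and executor pods.
    val conf = new SparkConf()
      .set("spark.kubernetes.driver.secrets.spark-secret", "/etc/secrets")
      .set("spark.kubernetes.executor.secrets.spark-secret", "/etc/secrets")
    conf.getAll.foreach { case (k, v) => println(s"$k=$v") }
  }
}
```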


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20059: [SPARK-22648][Kubernetes] Update documentation to...

2017-12-22 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20059#discussion_r158570653
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -528,51 +579,90 @@ specific to Spark on Kubernetes.
   
 
 
-   spark.kubernetes.driver.limit.cores
-   (none)
-   
- Specify the hard CPU 
[limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
 for the driver pod.
-   
- 
- 
-   spark.kubernetes.executor.limit.cores
-   (none)
-   
- Specify the hard CPU 
[limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
 for each executor pod launched for the Spark Application.
-   
- 
- 
-   spark.kubernetes.node.selector.[labelKey]
-   (none)
-   
- Adds to the node selector of the driver pod and executor pods, with 
key labelKey and the value as the
- configuration's value. For example, setting 
spark.kubernetes.node.selector.identifier to 
myIdentifier
- will result in the driver pod and executors having a node selector 
with key identifier and value
-  myIdentifier. Multiple node selector keys can be added 
by setting multiple configurations with this prefix.
-
-  
- 
-   
spark.kubernetes.driverEnv.[EnvironmentVariableName]
-   (none)
-   
- Add the environment variable specified by 
EnvironmentVariableName to
- the Driver process. The user can specify multiple of these to set 
multiple environment variables.
-   
- 
-  
-
spark.kubernetes.mountDependencies.jarsDownloadDir
-/var/spark-data/spark-jars
-
-  Location to download jars to in the driver and executors.
-  This directory must be empty and will be mounted as an empty 
directory volume on the driver and executor pods.
-
-  
-   
- 
spark.kubernetes.mountDependencies.filesDownloadDir
- /var/spark-data/spark-files
- 
-   Location to download jars to in the driver and executors.
-   This directory must be empty and will be mounted as an empty 
directory volume on the driver and executor pods.
- 
-   
+  spark.kubernetes.driver.limit.cores
+  (none)
+  
+Specify the hard CPU 
[limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
 for the driver pod.
+  
+
+
+  spark.kubernetes.executor.limit.cores
+  (none)
+  
+Specify the hard CPU 
[limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
 for each executor pod launched for the Spark Application.
+  
+
+
+  spark.kubernetes.node.selector.[labelKey]
+  (none)
+  
+Adds to the node selector of the driver pod and executor pods, with 
key labelKey and the value as the
+configuration's value. For example, setting 
spark.kubernetes.node.selector.identifier to 
myIdentifier
+will result in the driver pod and executors having a node selector 
with key identifier and value
+ myIdentifier. Multiple node selector keys can be added 
by setting multiple configurations with this prefix.
+  
+
+
+  
spark.kubernetes.driverEnv.[EnvironmentVariableName]
+  (none)
+  
+Add the environment variable specified by 
EnvironmentVariableName to
+the Driver process. The user can specify multiple of these to set 
multiple environment variables.
+  
+
+
+  spark.kubernetes.mountDependencies.jarsDownloadDir
+  /var/spark-data/spark-jars
+  
+Location to download jars to in the driver and executors.
+This directory must be empty and will be mounted as an empty directory 
volume on the driver and executor pods.
+  
+
+
+  spark.kubernetes.mountDependencies.filesDownloadDir
+  /var/spark-data/spark-files
+  
+Location to download jars to in the driver and executors.
+This directory must be empty and will be mounted as an empty directory 
volume on the driver and executor pods.
+  
+
+
+  spark.kubernetes.mountDependencies.timeout
+  5 minutes
+  
+   Timeout before aborting the attempt to download and unpack dependencies 
from remote locations into the driver and executor pods.
+  
+
+
+  
spark.kubernetes.mountDependencies.maxThreadPoolSize
+  5
+  
+   Maximum size of the thread pool for downloading remote dependencies 
into the driver and executor pods.
--- End diff --

I'd clarify that this controls how many downloads happen simultaneously; we 
could even change the name of the config.
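
If it helps, a minimal sketch of that semantics: a fixed-size pool so at most 
`maxThreadPoolSize` fetches run concurrently while the rest queue. This is 
illustrative only, not the init-container's actual code, and the URIs are 
placeholders:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BoundedDownloadSketch {
  def main(args: Array[String]): Unit = {
    val maxThreadPoolSize = 5 // the default shown in the diff above
    // A fixed pool caps concurrency: at most `maxThreadPoolSize` downloads run
    // at the same time; additional tasks wait in the pool's queue.
    val pool = Executors.newFixedThreadPool(maxThreadPoolSize)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    val uris = Seq("https://path/to/dependency1.jar", "hdfs://host:port/path/to/file1")
    val downloads = uris.map { uri =>
      Future { /* fetch `uri` into the configured download dir */ uri }
    }
    // cf. spark.kubernetes.mountDependencies.timeout
    Await.result(Future.sequence(downloads), 5.minutes)
    pool.shutdown()
  }
}
```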

[GitHub] spark issue #20011: [SPARK-20654][core] Add config to limit disk usage of th...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20011
  
**[Test build #85326 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85326/testReport)**
 for PR 20011 at commit 
[`931b2d2`](https://github.com/apache/spark/commit/931b2d262aa02880631ca4c693a84fa4c4d12318).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20011: [SPARK-20654][core] Add config to limit disk usag...

2017-12-22 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20011#discussion_r158569095
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/HistoryServerDiskManager.scala
 ---
@@ -0,0 +1,310 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.history
+
+import java.io.File
+import java.nio.file.Files
+import java.nio.file.attribute.PosixFilePermissions
+import java.util.concurrent.atomic.AtomicLong
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.{HashMap, ListBuffer}
+
+import org.apache.commons.io.FileUtils
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.status.KVUtils._
+import org.apache.spark.util.{Clock, Utils}
+import org.apache.spark.util.kvstore.KVStore
+
+/**
+ * A class used to keep track of disk usage by the SHS, allowing 
application data to be deleted
+ * from disk when usage exceeds a configurable threshold.
+ *
+ * The goal of the class is not to guarantee that usage will never exceed 
the threshold; because of
+ * how application data is written, disk usage may temporarily go higher. 
But, eventually, it
+ * should fall back under the threshold.
+ *
+ * @param conf Spark configuration.
+ * @param path Path where to store application data.
+ * @param listing The listing store, used to persist usage data.
+ * @param clock Clock instance to use.
+ */
+private class HistoryServerDiskManager(
+conf: SparkConf,
+path: File,
+listing: KVStore,
+clock: Clock) extends Logging {
+
+  import config._
+
+  private val appStoreDir = new File(path, "apps")
+  if (!appStoreDir.isDirectory() && !appStoreDir.mkdir()) {
+throw new IllegalArgumentException(s"Failed to create app directory 
($appStoreDir).")
+  }
+
+  private val tmpStoreDir = new File(path, "temp")
+  if (!tmpStoreDir.isDirectory() && !tmpStoreDir.mkdir()) {
+throw new IllegalArgumentException(s"Failed to create temp directory 
($tmpStoreDir).")
+  }
+
+  private val maxUsage = conf.get(MAX_LOCAL_DISK_USAGE)
+  private val currentUsage = new AtomicLong(0L)
+  private val committedUsage = new AtomicLong(0L)
+  private val active = new HashMap[(String, Option[String]), Long]()
+
+  def initialize(): Unit = {
+updateUsage(sizeOf(appStoreDir), committed = true)
+
+// Clean up any temporary stores during start up. This assumes that 
they're leftover from other
+// instances and are not useful.
+tmpStoreDir.listFiles().foreach(FileUtils.deleteQuietly)
+
+// Go through the recorded store directories and remove any that may 
have been removed by
+// external code.
+val orphans = 
listing.view(classOf[ApplicationStoreInfo]).asScala.filter { info =>
+  !new File(info.path).exists()
+}.toSeq
+
+orphans.foreach { info =>
+  listing.delete(info.getClass(), info.path)
+}
+  }
+
+  /**
+   * Lease some space from the store. The leased space is calculated as a 
fraction of the given
+   * event log size; this is an approximation, and doesn't mean the 
application store cannot
+   * outgrow the lease.
+   *
+   * If there's not enough space for the lease, other applications might 
be evicted to make room.
+   * This method always returns a lease, meaning that it's possible for 
local disk usage to grow
+   * past the configured threshold if there aren't enough idle 
applications to evict.
+   *
+   * While the lease is active, the data is written to a temporary 
location, so `openStore()`
+   * will still return `None` for the application.
+   */
+  def lease(eventLogSize: Long, isCompressed: Boolean = false): Lease = {
+val needed = approximateSize(eventLogSize, isCompressed)
+makeRoom(needed)
+
+val perms = 

[GitHub] spark pull request #20059: [SPARK-22648][Kubernetes] Update documentation to...

2017-12-22 Thread foxish
Github user foxish commented on a diff in the pull request:

https://github.com/apache/spark/pull/20059#discussion_r158568498
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -120,6 +120,23 @@ by their appropriate remote URIs. Also, application 
dependencies can be pre-moun
 Those dependencies can be added to the classpath by referencing them with 
`local://` URIs and/or setting the
 `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles.
 
+### Using Remote Dependencies
+When there are application dependencies hosted in remote locations like 
HDFS or HTTP servers, the driver and executor pods need a Kubernetes 
[init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/)
 for downloading the dependencies so the driver and executor containers can use 
them locally. This requires users to specify the container image for the 
init-container using the configuration property 
`spark.kubernetes.initContainer.image`. For example, users simply add the 
following option to the `spark-submit` command to specify the init-container 
image:
--- End diff --

HDFS and HTTP sound good. We can cover GCS elsewhere. Line breaks were for 
ease of reviewing by others (being able to comment on individual lines) and for 
consistency with the rest of the docs. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20011: [SPARK-20654][core] Add config to limit disk usag...

2017-12-22 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/20011#discussion_r158568388
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/HistoryServerDiskManager.scala
 ---
@@ -0,0 +1,310 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.history
+
+import java.io.File
+import java.nio.file.Files
+import java.nio.file.attribute.PosixFilePermissions
+import java.util.concurrent.atomic.AtomicLong
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.{HashMap, ListBuffer}
+
+import org.apache.commons.io.FileUtils
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.status.KVUtils._
+import org.apache.spark.util.{Clock, Utils}
+import org.apache.spark.util.kvstore.KVStore
+
+/**
+ * A class used to keep track of disk usage by the SHS, allowing 
application data to be deleted
+ * from disk when usage exceeds a configurable threshold.
+ *
+ * The goal of the class is not to guarantee that usage will never exceed 
the threshold; because of
+ * how application data is written, disk usage may temporarily go higher. 
But, eventually, it
+ * should fall back under the threshold.
+ *
+ * @param conf Spark configuration.
+ * @param path Path where to store application data.
+ * @param listing The listing store, used to persist usage data.
+ * @param clock Clock instance to use.
+ */
+private class HistoryServerDiskManager(
+conf: SparkConf,
+path: File,
+listing: KVStore,
+clock: Clock) extends Logging {
+
+  import config._
+
+  private val appStoreDir = new File(path, "apps")
+  if (!appStoreDir.isDirectory() && !appStoreDir.mkdir()) {
+throw new IllegalArgumentException(s"Failed to create app directory 
($appStoreDir).")
+  }
+
+  private val tmpStoreDir = new File(path, "temp")
+  if (!tmpStoreDir.isDirectory() && !tmpStoreDir.mkdir()) {
+throw new IllegalArgumentException(s"Failed to create temp directory 
($tmpStoreDir).")
+  }
+
+  private val maxUsage = conf.get(MAX_LOCAL_DISK_USAGE)
+  private val currentUsage = new AtomicLong(0L)
+  private val committedUsage = new AtomicLong(0L)
+  private val active = new HashMap[(String, Option[String]), Long]()
+
+  def initialize(): Unit = {
+updateUsage(sizeOf(appStoreDir), committed = true)
+
+// Clean up any temporary stores during start up. This assumes that 
they're leftover from other
+// instances and are not useful.
+tmpStoreDir.listFiles().foreach(FileUtils.deleteQuietly)
+
+// Go through the recorded store directories and remove any that may 
have been removed by
+// external code.
+val orphans = 
listing.view(classOf[ApplicationStoreInfo]).asScala.filter { info =>
+  !new File(info.path).exists()
+}.toSeq
+
+orphans.foreach { info =>
+  listing.delete(info.getClass(), info.path)
+}
+  }
+
+  /**
+   * Lease some space from the store. The leased space is calculated as a 
fraction of the given
+   * event log size; this is an approximation, and doesn't mean the 
application store cannot
+   * outgrow the lease.
+   *
+   * If there's not enough space for the lease, other applications might 
be evicted to make room.
+   * This method always returns a lease, meaning that it's possible for 
local disk usage to grow
+   * past the configured threshold if there aren't enough idle 
applications to evict.
+   *
+   * While the lease is active, the data is written to a temporary 
location, so `openStore()`
+   * will still return `None` for the application.
+   */
+  def lease(eventLogSize: Long, isCompressed: Boolean = false): Lease = {
+val needed = approximateSize(eventLogSize, isCompressed)
+makeRoom(needed)
+
+val perms = 

[GitHub] spark issue #20059: [SPARK-22648][Kubernetes] Update documentation to cover ...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20059
  
**[Test build #85325 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85325/testReport)**
 for PR 20059 at commit 
[`fbb2112`](https://github.com/apache/spark/commit/fbb21121447fe131358042ce05454f88d6fb).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20059: [SPARK-22648][Kubernetes] Update documentation to cover ...

2017-12-22 Thread liyinan926
Github user liyinan926 commented on the issue:

https://github.com/apache/spark/pull/20059
  
Updated in 
https://github.com/apache/spark/pull/20059/commits/fbb21121447fe131358042ce05454f88d6fb.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20039: [SPARK-22850][core] Ensure queued events are delivered t...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20039
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85317/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20039: [SPARK-22850][core] Ensure queued events are delivered t...

2017-12-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20039
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20059: [SPARK-22648][Kubernetes] Update documentation to...

2017-12-22 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20059#discussion_r158568089
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -120,6 +120,23 @@ by their appropriate remote URIs. Also, application 
dependencies can be pre-moun
 Those dependencies can be added to the classpath by referencing them with 
`local://` URIs and/or setting the
 `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles.
 
+### Using Remote Dependencies
+When there are application dependencies hosted in remote locations like 
HDFS or HTTP servers, the driver and executor pods need a Kubernetes 
[init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/)
 for downloading the dependencies so the driver and executor containers can use 
them locally. This requires users to specify the container image for the 
init-container using the configuration property 
`spark.kubernetes.initContainer.image`. For example, users simply add the 
following option to the `spark-submit` command to specify the init-container 
image:
--- End diff --

Regarding examples, I can add one spark-submit example showing how to use 
remote jars/files over HTTP/HTTPS and HDFS. But GCS requires the connector in the 
init-container, which is non-trivial, and I'm not sure about S3, so I think we 
should avoid covering those.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20039: [SPARK-22850][core] Ensure queued events are delivered t...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20039
  
**[Test build #85317 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85317/testReport)**
 for PR 20039 at commit 
[`2602fa6`](https://github.com/apache/spark/commit/2602fa68424e984f2cd49f79fb54bcf9676ba5fb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18754: [SPARK-21552][SQL] Add DecimalType support to Arr...

2017-12-22 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/18754#discussion_r158566190
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala 
---
@@ -214,6 +216,22 @@ private[arrow] class DoubleWriter(val valueVector: 
Float8Vector) extends ArrowFi
   }
 }
 
+private[arrow] class DecimalWriter(
+val valueVector: DecimalVector,
+precision: Int,
+scale: Int) extends ArrowFieldWriter {
+
+  override def setNull(): Unit = {
+valueVector.setNull(count)
+  }
+
+  override def setValue(input: SpecializedGetters, ordinal: Int): Unit = {
+val decimal = input.getDecimal(ordinal, precision, scale)
+decimal.changePrecision(precision, scale)
--- End diff --

Is it necessary to call `changePrecision` even though `getDecimal` already 
takes the precision/scale as input - is it not guaranteed to return a decimal 
with that scale?
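
For reference, a hedged REPL-style sketch of what the extra call does, based on 
my reading of `org.apache.spark.sql.types.Decimal` (worth double-checking):

```scala
import org.apache.spark.sql.types.Decimal

object ChangePrecisionSketch {
  def main(args: Array[String]): Unit = {
    val d = Decimal("1.1")              // parsed with its natural precision/scale
    val fits = d.changePrecision(10, 4) // rescales in place to (10, 4); returns false on overflow
    println(s"$d fits=$fits")           // expected: 1.1000 fits=true
  }
}
```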


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18754: [SPARK-21552][SQL] Add DecimalType support to Arr...

2017-12-22 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/18754#discussion_r158567491
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1617,7 +1617,7 @@ def to_arrow_type(dt):
 elif type(dt) == DoubleType:
 arrow_type = pa.float64()
 elif type(dt) == DecimalType:
-arrow_type = pa.decimal(dt.precision, dt.scale)
+arrow_type = pa.decimal128(dt.precision, dt.scale)
--- End diff --

Yes, that's the right way - it is now fixed at 128 bits internally. I believe 
the Arrow Java precision/scale limit is the same as Spark's (38/38); I'm not sure 
if pyarrow is the same, but I think so.
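
For reference, the 38/38 mentioned above matches Spark's own bounds; a tiny 
sketch, with the constants assumed from `org.apache.spark.sql.types.DecimalType`:

```scala
import org.apache.spark.sql.types.DecimalType

object DecimalBoundsSketch {
  def main(args: Array[String]): Unit = {
    // Spark caps both precision and scale at 38.
    println(DecimalType.MAX_PRECISION) // 38
    println(DecimalType.MAX_SCALE)     // 38
  }
}
```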


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20045: [Spark-22360][SQL] Add unit tests for Window Specificati...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20045
  
**[Test build #4023 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4023/testReport)**
 for PR 20045 at commit 
[`797c907`](https://github.com/apache/spark/commit/797c90739a4fc57f487d4d73c28ba59bfd598942).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20059: [SPARK-22648][Kubernetes] Update documentation to...

2017-12-22 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20059#discussion_r158567222
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -120,6 +120,23 @@ by their appropriate remote URIs. Also, application 
dependencies can be pre-moun
 Those dependencies can be added to the classpath by referencing them with 
`local://` URIs and/or setting the
 `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles.
 
+### Using Remote Dependencies
+When there are application dependencies hosted in remote locations like 
HDFS or HTTP servers, the driver and executor pods need a Kubernetes 
[init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/)
 for downloading the dependencies so the driver and executor containers can use 
them locally. This requires users to specify the container image for the 
init-container using the configuration property 
`spark.kubernetes.initContainer.image`. For example, users simply add the 
following option to the `spark-submit` command to specify the init-container 
image:
--- End diff --

Do we need to break them into lines? I thought this would be automatically 
wrapped when viewed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19954
  
**[Test build #85324 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85324/testReport)**
 for PR 19954 at commit 
[`785b90e`](https://github.com/apache/spark/commit/785b90e52580e9175896b22b00b23f30fbe020ef).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20059: [SPARK-22648][Kubernetes] Update documentation to...

2017-12-22 Thread foxish
Github user foxish commented on a diff in the pull request:

https://github.com/apache/spark/pull/20059#discussion_r158566909
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -120,6 +120,23 @@ by their appropriate remote URIs. Also, application 
dependencies can be pre-moun
 Those dependencies can be added to the classpath by referencing them with 
`local://` URIs and/or setting the
 `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles.
 
+### Using Remote Dependencies
+When there are application dependencies hosted in remote locations like 
HDFS or HTTP servers, the driver and executor pods need a Kubernetes 
[init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/)
 for downloading the dependencies so the driver and executor containers can use 
them locally. This requires users to specify the container image for the 
init-container using the configuration property 
`spark.kubernetes.initContainer.image`. For example, users simply add the 
following option to the `spark-submit` command to specify the init-container 
image:
--- End diff --

Maybe we should include 2-3 examples of remote file usage - ideally, 
showing that one can use http, hdfs, gcs, s3 in dependencies.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20059: [SPARK-22648][Kubernetes] Update documentation to...

2017-12-22 Thread foxish
Github user foxish commented on a diff in the pull request:

https://github.com/apache/spark/pull/20059#discussion_r158566870
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -120,6 +120,23 @@ by their appropriate remote URIs. Also, application 
dependencies can be pre-moun
 Those dependencies can be added to the classpath by referencing them with 
`local://` URIs and/or setting the
 `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles.
 
+### Using Remote Dependencies
+When there are application dependencies hosted in remote locations like 
HDFS or HTTP servers, the driver and executor pods need a Kubernetes 
[init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/)
 for downloading the dependencies so the driver and executor containers can use 
them locally. This requires users to specify the container image for the 
init-container using the configuration property 
`spark.kubernetes.initContainer.image`. For example, users simply add the 
following option to the `spark-submit` command to specify the init-container 
image:
--- End diff --

This and the text below should be broken up into multiple lines.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20059: [SPARK-22648][Kubernetes] Update documentation to cover ...

2017-12-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20059
  
**[Test build #85323 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85323/testReport)**
 for PR 20059 at commit 
[`a6a6660`](https://github.com/apache/spark/commit/a6a666060787cadd832b5bd32281940a6c81a9a9).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


