[GitHub] spark issue #19511: [SPARK-22293][SQL] Avoid unnecessary traversal in Resolv...
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19511 OK, close it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19511: [SPARK-22293][SQL] Avoid unnecessary traversal in Resolv...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19511 @ConeyLiu I just saw that extra code/logic was added. Maybe close it first?
[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19480 @kiszk Does this look good to you?
[GitHub] spark issue #19269: [SPARK-22026][SQL] data source v2 write path
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19269 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82872/
[GitHub] spark issue #19269: [SPARK-22026][SQL] data source v2 write path
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19269 Merged build finished. Test PASSed.
[GitHub] spark issue #19269: [SPARK-22026][SQL] data source v2 write path
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19269 **[Test build #82872 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82872/testReport)** for PR 19269 at commit [`9e12d9f`](https://github.com/apache/spark/commit/9e12d9ffc4e2368e709b018b6117cbeef743d6f3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19480 cc @rednaxelafx Do you have bandwidth to review this?
[GitHub] spark issue #18747: [SPARK-20822][SQL] Generate code to directly get value f...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/18747 ping @cloud-fan
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19521 `SPARK-15474` is the zero-row case. The case above is zero columns. Are they the same issue?
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19524 Merged build finished. Test PASSed.
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19524 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82871/
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19524 **[Test build #82871 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82871/testReport)** for PR 19524 at commit [`26eeae1`](https://github.com/apache/spark/commit/26eeae140093896155d0b7196cd2de7dd4c6bda6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19522: [SPARK-22249][FOLLOWUP][SQL] Check if list of value for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19522 **[Test build #82873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82873/testReport)** for PR 19522 at commit [`e95bc7b`](https://github.com/apache/spark/commit/e95bc7b395e027aa3d1e719d987b4f5a4461c34b).
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19521 Thank you for review, @gatorsmile .
1. The test case was added at #15898 (SPARK-18457). I guess Parquet returns `null`, but we had better have explicit test cases. I will try to extend that test case for Parquet next time.
2. Thanks for bringing that up. Yes, we can resolve that empty ORC file issue, SPARK-15474 (ORC data source fails to write and read back empty dataframe), with the new ORC source by creating an empty file with the correct schema, not `struct<>`. BTW, I've linked all related ORC issues into [SPARK-20901](https://issues.apache.org/jira/browse/SPARK-20901) and am working on it. You can monitor ORC progress there.
[GitHub] spark pull request #19523: [SPARK-22301][SQL] Add rule to Optimizer for In w...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19523#discussion_r145318964
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
@@ -102,7 +102,8 @@ case class InMemoryTableScanExec(
      case IsNull(a: Attribute) => statsFor(a).nullCount > 0
      case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0
-     case In(_: AttributeReference, list: Seq[Expression]) if list.isEmpty => Literal.FalseLiteral
+     // We rely on the optimizations in org.apache.spark.sql.catalyst.optimizer.OptimizeIn
+     // to be sure that the list cannot be empty
--- End diff --
We can remove this line after we merge this PR https://github.com/apache/spark/pull/19522
[GitHub] spark issue #19522: [SPARK-22249][FOLLOWUP][SQL] Check if list of value for ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19522 ok to test
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19272 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82870/
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19272 Build finished. Test PASSed.
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19272 **[Test build #82870 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82870/testReport)** for PR 19272 at commit [`e522150`](https://github.com/apache/spark/commit/e522150c32ffca9f116197f80530554d3cb68b6c).
* This patch passes all tests.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19521 We can save an empty DataFrame as an ORC table, but we are unable to fetch it from the table.
```Scala
val rddNoCols = sparkContext.parallelize(1 to 10).map(_ => Row.empty)
val dfNoCols = spark.createDataFrame(rddNoCols, StructType(Seq.empty))
dfNoCols.write.format("orc").saveAsTable("t")
spark.sql("select 1 from t").show()
```
This is not related to this upgrade, but you might be interested in this.
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19521 Also LGTM. Regarding the test case you posted, does Parquet return `null` or an empty string?
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user shaneknapp commented on the issue: https://github.com/apache/spark/pull/19524 yeah, this makes sense. i don't think we've officially supported 2.6 in a while, esp for tests and i'd be ok w/removing the backports. this makes for a much clearer exit case. @rxin @sameeragarwal
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19521 Thank you for review, @cloud-fan !
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19521 looks good, no new dependencies introduced, just upgrading. cc @srowen to double check. Thanks!
[GitHub] spark pull request #19523: [SPARK-22301][SQL] Add rule to Optimizer for In w...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19523#discussion_r145306880
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ---
@@ -204,6 +204,7 @@ case class In(value: Expression, list: Seq[Expression]) extends Predicate {
   override def children: Seq[Expression] = value +: list
   lazy val inSetConvertible = list.forall(_.isInstanceOf[Literal])
+  lazy val isListEmpty = list.isEmpty
--- End diff --
Do we really need this val?
[GitHub] spark pull request #19523: [SPARK-22301][SQL] Add rule to Optimizer for In w...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19523#discussion_r145307826
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
@@ -102,7 +102,8 @@ case class InMemoryTableScanExec(
      case IsNull(a: Attribute) => statsFor(a).nullCount > 0
      case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0
-     case In(_: AttributeReference, list: Seq[Expression]) if list.isEmpty => Literal.FalseLiteral
+     // We rely on the optimizations in org.apache.spark.sql.catalyst.optimizer.OptimizeIn
+     // to be sure that the list cannot be empty
--- End diff --
IMHO this comment is not accurate, since in the optimizer we only deal with the case where the attribute is not nullable.
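wzhfy's point that the empty-list shortcut is only safe for non-nullable values can be illustrated with a toy model of the rewrite. This is a hypothetical Python sketch for illustration, not Catalyst's actual rule:

```python
def optimize_empty_in(value_nullable, in_list):
    """Toy model of the empty-list case of an OptimizeIn-style rewrite.

    `x IN ()` can only be constant-folded to false when x is known to be
    non-nullable: if x is NULL, the SQL result is NULL rather than false,
    so the rewrite must not fire for nullable attributes.
    """
    if not in_list and not value_nullable:
        return False  # fold to a literal false
    return None       # rule does not apply; leave the expression unchanged
```

The `None` return models "no rewrite": a nullable attribute with an empty list must keep its three-valued NULL semantics.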
[GitHub] spark pull request #19495: [SPARK-22278][SS] Expose current event time water...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19495
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19495 Merged build finished. Test PASSed.
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19495 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82869/
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19495 **[Test build #82869 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82869/testReport)** for PR 19495 at commit [`ed9d3a2`](https://github.com/apache/spark/commit/ed9d3a2e4fd4b1a0ec0e0ceb34419ede4e2303b8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19269: [SPARK-22026][SQL] data source v2 write path
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19269 **[Test build #82872 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82872/testReport)** for PR 19269 at commit [`9e12d9f`](https://github.com/apache/spark/commit/9e12d9ffc4e2368e709b018b6117cbeef743d6f3).
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19459 Merged build finished. Test PASSed.
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19459 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82866/
[GitHub] spark issue #19515: [SPARK-22287][MESOS] SPARK_DAEMON_MEMORY not honored by ...
Github user pmackles commented on the issue: https://github.com/apache/spark/pull/19515 Hi @ArtRand - Based on my testing and interpretation of the code, `SPARK_DRIVER_MEMORY` has no effect on MesosClusterDispatcher. The heap size always winds up being set to `-Xmx1G` (the default). I went with `SPARK_DAEMON_MEMORY` because that's what is used for the Master/Worker in a standalone cluster, and I think the Dispatcher is pretty similar in nature to those daemons. That said, I don't have a strong opinion, and `SPARK_DISPATCHER_MEMORY` works fine for my use case, so I am happy to switch it in the name of getting this PR accepted.
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19459 **[Test build #82866 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82866/testReport)** for PR 19459 at commit [`81ddfa9`](https://github.com/apache/spark/commit/81ddfa9afa03531edd9ac7b805a09be2be96d88c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Merged build finished. Test PASSed.
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82865/
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19488 **[Test build #82865 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82865/testReport)** for PR 19488 at commit [`ece2062`](https://github.com/apache/spark/commit/ece206223822bde6e36d654a0d34593c405f8ffd).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19524 **[Test build #82871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82871/testReport)** for PR 19524 at commit [`26eeae1`](https://github.com/apache/spark/commit/26eeae140093896155d0b7196cd2de7dd4c6bda6).
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19524 cc @JoshRosen, @holdenk and @shaneknapp, could you take a look and see if it makes sense, please?
[GitHub] spark pull request #19524: [SPARK-22302][INFRA] Remove manual backports for ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/19524 [SPARK-22302][INFRA] Remove manual backports for subprocess and print explicit message for < Python 2.7

## What changes were proposed in this pull request?

Seems there is a mistake from a missing import for `subprocess.call`, which should be used by the Python 2.6 backports for some missing functions in `subprocess`. Reproduction:
```
cd dev && python2.6
```
```
>>> from sparktestsupport import shellutils
>>> shellutils.subprocess_check_call("ls")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sparktestsupport/shellutils.py", line 46, in subprocess_check_call
    retcode = call(*popenargs, **kwargs)
NameError: global name 'call' is not defined
```
For Jenkins logs, please see https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3950/console

Since we dropped Python 2.6.x support, it looks better to remove those workarounds and print an explicit error message instead, to avoid duplicated effort in tracking down the root cause of such failures.

## How was this patch tested?

Manually tested:
```
cd dev && python2.6
```
```
>>> from sparktestsupport import shellutils
[error] Python versions prior to 2.7 are not supported.
```
```
cd dev && python2.7
```
```
>>> from sparktestsupport import shellutils
>>> shellutils.subprocess_check_output("ls")
'README.md\nappveyor-guide.md\nappveyor-install-dependencies.ps1\nchange-scala-version.sh\ncheck-license\ncheckstyle-suppressions.xml\ncheckstyle.xml\ncreate-release\ndeps\ngithub_jira_sync.py\ngithub_jira_sync.pyc\nlint-java\nlint-python\nlint-r\nlint-r-report.log\nlint-r.R\nlint-scala\nmake-distribution.sh\nmerge_spark_pr.py\nmerge_spark_pr.pyc\nmerge_spark_pr_jira.py\nmerge_spark_pr_jira.pyc\nmima\npep8-1.7.0.py\npep8-1.7.0.pyc\npip-sanity-check.py\npip-sanity-check.pyc\nrequirements.txt\nrun-pip-tests\nrun-tests\nrun-tests-jenkins\nrun-tests-jenkins.py\nrun-tests-jenkins.pyc\nrun-tests.py\nrun-tests.pyc\nscalastyle\nsparktestsupport\ntest-dependencies.sh\ntests\ntox.ini\n'
>>> shellutils.subprocess_check_call("ls")
(directory listing of dev/ printed in columns)
0
```
```
cd dev && python3.6
```
```
>>> from sparktestsupport import shellutils
>>> shellutils.subprocess_check_output("ls")
b'README.md\nappveyor-guide.md\nappveyor-install-dependencies.ps1\nchange-scala-version.sh\ncheck-license\ncheckstyle-suppressions.xml\ncheckstyle.xml\ncreate-release\ndeps\ngithub_jira_sync.py\ngithub_jira_sync.pyc\nlint-java\nlint-python\nlint-r\nlint-r-report.log\nlint-r.R\nlint-scala\nmake-distribution.sh\nmerge_spark_pr.py\nmerge_spark_pr.pyc\nmerge_spark_pr_jira.py\nmerge_spark_pr_jira.pyc\nmima\npep8-1.7.0.py\npep8-1.7.0.pyc\npip-sanity-check.py\npip-sanity-check.pyc\nrequirements.txt\nrun-pip-tests\nrun-tests\nrun-tests-jenkins\nrun-tests-jenkins.py\nrun-tests-jenkins.pyc\nrun-tests.py\nrun-tests.pyc\nscalastyle\nsparktestsupport\ntest-dependencies.sh\ntests\ntox.ini\n'
>>> shellutils.subprocess_check_call("ls")
(directory listing of dev/ printed in columns)
```
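The fail-fast behavior exercised above (the `[error] Python versions prior to 2.7 are not supported.` message printed on import) can be sketched as a small version guard. This is an illustrative sketch only; the function name is hypothetical and not the PR's actual code:

```python
import sys

def require_python_27():
    # Print an explicit error instead of letting a missing 2.6-era backport
    # surface later as a confusing NameError deep inside subprocess helpers.
    if sys.version_info[:2] < (2, 7):
        print("[error] Python versions prior to 2.7 are not supported.")
        sys.exit(-1)

# Running the guard at module import time mirrors the behavior shown above.
require_python_27()
```

Running such a guard at the top of the support module turns an obscure runtime failure into an immediate, actionable message.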
[GitHub] spark pull request #19272: [Spark-21842][Mesos] Support Kerberos ticket rene...
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19272#discussion_r145301221
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
@@ -194,6 +198,27 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
      sc.conf.getOption("spark.mesos.driver.frameworkId").map(_ + suffix)
    )
+    // check that the credentials are defined, even though it's likely that auth would have failed
+    // already if you've made it this far
+    if (principal != null && hadoopDelegationCreds.isDefined) {
+      logDebug(s"Principal found ($principal) starting token renewer")
+      val credentialRenewerThread = new Thread {
+        setName("MesosCredentialRenewer")
+        override def run(): Unit = {
+          val rt = MesosCredentialRenewer.getTokenRenewalTime(hadoopDelegationCreds.get, conf)
+          val credentialRenewer =
+            new MesosCredentialRenewer(
+              conf,
+              hadoopDelegationTokenManager.get,
+              MesosCredentialRenewer.getNextRenewalTime(rt),
+              driverEndpoint)
+          credentialRenewer.scheduleTokenRenewal()
+        }
+      }
+
+      credentialRenewerThread.start()
+      credentialRenewerThread.join()
--- End diff --
Yes, I believe so. If you look here (https://github.com/apache/spark/blob/6735433cde44b3430fb44edfff58ef8149c66c13/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L271) it's the same pattern. The thread needs to run as long as the application driver, correct?
[GitHub] spark pull request #19272: [Spark-21842][Mesos] Support Kerberos ticket rene...
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19272#discussion_r145300847
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCredentialRenewer.scala ---
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster.mesos
+
+import java.security.PrivilegedExceptionAction
+import java.util.concurrent.{Executors, TimeUnit}
+
+import scala.collection.JavaConverters._
+import scala.util.Try
+
+import org.apache.hadoop.security.UserGroupInformation
+
+import org.apache.spark.SparkConf
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.deploy.security.HadoopDelegationTokenManager
+import org.apache.spark.internal.Logging
+import org.apache.spark.rpc.RpcEndpointRef
+import org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages.UpdateDelegationTokens
+import org.apache.spark.util.ThreadUtils
+
+
+class MesosCredentialRenewer(
+    conf: SparkConf,
+    tokenManager: HadoopDelegationTokenManager,
+    nextRenewal: Long,
+    de: RpcEndpointRef) extends Logging {
+  private val credentialRenewerThread =
+    Executors.newSingleThreadScheduledExecutor(
--- End diff --
I also changed `AMCredentialRenewer` to the same.
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19272 **[Test build #82870 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82870/testReport)** for PR 19272 at commit [`e522150`](https://github.com/apache/spark/commit/e522150c32ffca9f116197f80530554d3cb68b6c).
[GitHub] spark issue #19519: [SPARK-21840][core] Add trait that allows conf to be dir...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19519 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82863/
[GitHub] spark issue #19519: [SPARK-21840][core] Add trait that allows conf to be dir...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19519 Merged build finished. Test PASSed.
[GitHub] spark issue #19519: [SPARK-21840][core] Add trait that allows conf to be dir...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19519 **[Test build #82863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82863/testReport)** for PR 19519 at commit [`d4466f2`](https://github.com/apache/spark/commit/d4466f269451ef23d9c4d45fcd5b1c03988ac0e7).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19272: [Spark-21842][Mesos] Support Kerberos ticket rene...
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19272#discussion_r145298313
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCredentialRenewer.scala ---
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster.mesos
+
+import java.security.PrivilegedExceptionAction
+import java.util.concurrent.{Executors, TimeUnit}
+
+import scala.collection.JavaConverters._
+import scala.util.Try
+
+import org.apache.hadoop.security.UserGroupInformation
+
+import org.apache.spark.SparkConf
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.deploy.security.HadoopDelegationTokenManager
+import org.apache.spark.internal.Logging
+import org.apache.spark.rpc.RpcEndpointRef
+import org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages.UpdateDelegationTokens
+import org.apache.spark.util.ThreadUtils
+
+
+class MesosCredentialRenewer(
+    conf: SparkConf,
+    tokenManager: HadoopDelegationTokenManager,
+    nextRenewal: Long,
+    de: RpcEndpointRef) extends Logging {
--- End diff --
fixed.
[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19390#discussion_r145298003 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala --- @@ -380,7 +389,8 @@ private[spark] class MesosCoarseGrainedSchedulerBackend( } else { declineOffer( driver, - offer) + offer, --- End diff -- Yeah for now that's fine. I was thinking more specific messages but we can address that on a later ticket.
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19495 Merged build finished. Test PASSed.
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19495 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82864/ Test PASSed.
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19495 **[Test build #82864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82864/testReport)** for PR 19495 at commit [`0d788fe`](https://github.com/apache/spark/commit/0d788fe1d9f5cf430ac548bbda3947b3d879505b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/19269#discussion_r145296142 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Command.scala --- @@ -0,0 +1,114 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.v2 + +import org.apache.spark.{SparkException, TaskContext} +import org.apache.spark.internal.Logging +import org.apache.spark.sql.{Row, SparkSession} +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder} +import org.apache.spark.sql.catalyst.expressions.UnsafeRow +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.execution.SparkPlan +import org.apache.spark.sql.execution.command.RunnableCommand +import org.apache.spark.sql.sources.v2.writer._ +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.Utils + +case class WriteToDataSourceV2Command(writer: DataSourceV2Writer, query: LogicalPlan) + extends RunnableCommand { --- End diff -- I've also observed this issue where the explain output of commands behaves differently than from logical plans, and have a repro at https://issues.apache.org/jira/browse/SPARK-22204
[GitHub] spark pull request #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19437#discussion_r145294156 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala --- @@ -173,6 +178,90 @@ private[mesos] object MesosSchedulerBackendUtil extends Logging { containerInfo } + private def getSecrets(conf: SparkConf, secretConfig: MesosSecretConfig): Seq[Secret] = { +def createValueSecret(data: String): Secret = { + Secret.newBuilder() +.setType(Secret.Type.VALUE) + .setValue(Secret.Value.newBuilder().setData(ByteString.copyFrom(data.getBytes))) +.build() +} + +def createReferenceSecret(name: String): Secret = { + Secret.newBuilder() +.setReference(Secret.Reference.newBuilder().setName(name)) +.setType(Secret.Type.REFERENCE) +.build() +} + +val referenceSecrets: Seq[Secret] = + conf.get(secretConfig.SECRET_NAMES).getOrElse(Nil).map(s => createReferenceSecret(s)) + +val valueSecrets: Seq[Secret] = { + conf.get(secretConfig.SECRET_VALUES).getOrElse(Nil).map(s => createValueSecret(s)) +} + +if (valueSecrets.nonEmpty && referenceSecrets.nonEmpty) { + throw new SparkException("Cannot specify VALUE type secrets and REFERENCE types ones") +} + +if (referenceSecrets.nonEmpty) referenceSecrets else valueSecrets + } + + private def illegalSecretInput(dest: Seq[String], secrets: Seq[Secret]): Boolean = { +if (dest.nonEmpty) { + // make sure there is a one-to-one correspondence between destinations and secrets + if (dest.length != secrets.length) { +return true + } +} +false + } + + def getSecretVolume(conf: SparkConf, secretConfig: MesosSecretConfig): List[Volume] = { +val secrets = getSecrets(conf, secretConfig) +val secretPaths: Seq[String] = + conf.get(secretConfig.SECRET_FILENAMES).getOrElse(Nil) + +if (illegalSecretInput(secretPaths, secrets)) { + throw new SparkException( +s"Need to give equal numbers of secrets and file paths for file-based " + + s"reference secrets got secrets $secrets, and 
paths $secretPaths") +} + +secrets.zip(secretPaths).map { + case (s, p) => +val source = Volume.Source.newBuilder() + .setType(Volume.Source.Type.SECRET) + .setSecret(s) +Volume.newBuilder() + .setContainerPath(p) + .setSource(source) + .setMode(Volume.Mode.RO) + .build +}.toList + } + + def getSecretEnvVar(conf: SparkConf, secretConfig: MesosSecretConfig): + List[Variable] = { --- End diff -- indentation?
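The validation discussed in the diff above enforces two invariants: reference (name-based) and value-based secrets are mutually exclusive, and file-based secrets require exactly one destination path per secret before they are zipped into volumes. A minimal Python sketch of those checks, for illustration only (the real implementation is the Scala `getSecrets`/`getSecretVolume` shown above; the function name here is hypothetical):

```python
def pair_secrets_with_paths(reference_names, literal_values, paths):
    """Illustrative restatement of the checks in the Scala diff above."""
    if reference_names and literal_values:
        # mirrors: "Cannot specify VALUE type secrets and REFERENCE type ones"
        raise ValueError("Cannot mix REFERENCE and VALUE type secrets")
    secrets = reference_names if reference_names else literal_values
    if paths and len(paths) != len(secrets):
        # mirrors illegalSecretInput: destinations and secrets must pair 1:1
        raise ValueError("Need equal numbers of secrets and file paths")
    # one (secret, container path) pair per file-based secret volume
    return list(zip(secrets, paths))
```

As in the Scala code, the paired list is then mapped one-to-one into mount descriptions, so a length mismatch has to fail fast rather than silently drop secrets.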
[GitHub] spark pull request #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19437#discussion_r145294352 --- Diff: docs/running-on-mesos.md --- @@ -501,23 +503,74 @@ See the [configuration page](configuration.html) for information on Spark config spark.mesos.driver.secret.names or spark.mesos.driver.secret.values will be written to the provided file. Paths are relative to the container's work directory. Absolute paths must already exist. Consult the Mesos Secret -protobuf for more information. +protobuf for more information. Example: --- End diff -- I'm not sure what the policy should be on this, but IIUC file-based secrets do require a backend module. Should we mention that here?
[GitHub] spark pull request #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19437#discussion_r145294074 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala --- @@ -17,10 +17,14 @@ package org.apache.spark.scheduler.cluster.mesos --- End diff -- Ah ok. One of these days let's make a "clean up" JIRA and harmonize all of this code. The naming is also all over the place..
[GitHub] spark pull request #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19437#discussion_r145294610 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala --- @@ -122,7 +126,8 @@ private[mesos] object MesosSchedulerBackendUtil extends Logging { .toList } - def containerInfo(conf: SparkConf): ContainerInfo.Builder = { + def buildContainerInfo(conf: SparkConf): + ContainerInfo.Builder = { --- End diff -- indentation?
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18664 Merged build finished. Test PASSed.
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18664 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82862/ Test PASSed.
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18664 **[Test build #82862 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82862/testReport)** for PR 18664 at commit [`e428cbe`](https://github.com/apache/spark/commit/e428cbe2f8e626510e6660b38faff7f03b229e92). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19480 Merged build finished. Test PASSed.
[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19480 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82861/ Test PASSed.
[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19480 **[Test build #82861 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82861/testReport)** for PR 19480 at commit [`bce3616`](https://github.com/apache/spark/commit/bce3616cccfb5c1b1dc7fca32d17764fda142e7d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Merged build finished. Test PASSed.
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82860/ Test PASSed.
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145294433 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,43 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, df, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing the into partitions, converting +to Arrow data, then reading into the JVM to parallelsize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +import os +from tempfile import NamedTemporaryFile +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_schema +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(df) // self.sparkContext.defaultParallelism) # round int up +df_slices = (df[start:start + step] for start in xrange(0, len(df), step)) +arrow_schema = to_arrow_schema(schema) if schema is not None else None +batches = [pa.RecordBatch.from_pandas(df_slice, schema=arrow_schema, preserve_index=False) + for df_slice in df_slices] + +# write batches to temp file, read by JVM (borrowed from context.parallelize) +tempFile = NamedTemporaryFile(delete=False, dir=self._sc._temp_dir) +try: +serializer = ArrowSerializer() +serializer.dump_stream(batches, tempFile) +tempFile.close() +readRDDFromFile = self._jvm.PythonRDD.readRDDFromFile +jrdd = readRDDFromFile(self._jsc, tempFile.name, len(batches)) +finally: +# readRDDFromFile eagerily reads the file so we can delete right after. +os.unlink(tempFile.name) + +# Create the Spark DataFrame, there will be at least 1 batch +schema = from_arrow_schema(batches[0].schema) --- End diff -- H.. still get the same exception with pyarrow 0.4.1 .. 
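The `_createFromPandasWithArrow` code under review slices the pandas DataFrame into one batch per partition, using `-(-len(df) // n)` as an integer ceiling division so the last, possibly shorter, slice is still produced. A small self-contained sketch of just that slicing idiom (plain Python lists stand in for the DataFrame; `slice_into_batches` is an illustrative name, not a Spark API):

```python
def slice_into_batches(data, parallelism):
    # -(-a // b) is integer ceiling division: rounds len(data)/parallelism up,
    # so even a 1-element input yields at least one batch.
    step = -(-len(data) // parallelism)
    return [data[start:start + step] for start in range(0, len(data), step)]

batches = slice_into_batches(list(range(10)), 4)
# 10 rows at parallelism 4 -> step 3 -> batch sizes 3, 3, 3, 1
```

The "there will be at least 1 batch" comment in the diff follows from the round-up: for any non-empty input, `step >= 1` and the first slice is non-empty.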
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19488 **[Test build #82860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82860/testReport)** for PR 19488 at commit [`9c33a0c`](https://github.com/apache/spark/commit/9c33a0cf82aff058f117fd309643006118ea3b08). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19317: [SPARK-22098][CORE] Add new method aggregateByKey...
Github user ConeyLiu commented on a diff in the pull request: https://github.com/apache/spark/pull/19317#discussion_r145294297 --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala --- @@ -180,6 +180,56 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)]) * as in scala.TraversableOnce. The former operation is used for merging values within a * partition, and the latter is used for merging values between partitions. To avoid memory * allocation, both of these functions are allowed to modify and return their first argument + * instead of creating a new U. This method is different from the ordinary "aggregateByKey" + * method, it directly returns a map to the driver, rather than a rdd. This will also perform + * the merging locally on each mapper before sending results to a reducer, similarly to a --- End diff -- thanks for the advice, I'll close it and try `mapPartitions(...).collect` in `NaiveBayes`.
[GitHub] spark pull request #19317: [SPARK-22098][CORE] Add new method aggregateByKey...
Github user ConeyLiu closed the pull request at: https://github.com/apache/spark/pull/19317
[GitHub] spark issue #19511: [SPARK-22293][SQL] Avoid unnecessary traversal in Resolv...
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19511 Hi @gatorsmile, if we can combine the two traversals, this should simplify the code, not complicate it. However, it doesn't bring a big performance improvement. I can close it if this change is unnecessary.
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19459 BTW, https://github.com/apache/spark/pull/19459#discussion_r145034007 looks missed :).
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145293860 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,39 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_schema +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) +arrow_schema = to_arrow_schema(schema) if schema is not None else None +batches = [pa.RecordBatch.from_pandas(pdf_slice, schema=arrow_schema, preserve_index=False) + for pdf_slice in pdf_slices] --- End diff -- However, if we go 1. way, I think we should avoid creating whole batches first. I think falling back might make sense if its cost is cheap.
[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19480 Merged build finished. Test PASSed.
[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19480 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82858/ Test PASSed.
[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19480 **[Test build #82858 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82858/testReport)** for PR 19480 at commit [`95b0ad8`](https://github.com/apache/spark/commit/95b0ad82a68828055c45dbd729c753fae2007a5c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145293209 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,39 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_schema +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) +arrow_schema = to_arrow_schema(schema) if schema is not None else None +batches = [pa.RecordBatch.from_pandas(pdf_slice, schema=arrow_schema, preserve_index=False) + for pdf_slice in pdf_slices] + +# Verify schema, there will be at least 1 batch from pandas.DataFrame +schema_from_arrow = from_arrow_schema(batches[0].schema) +if schema is not None and schema != schema_from_arrow: +raise ValueError("Supplied schema does not match result from Arrow\nsupplied: " + + "%s\n!=\nfrom Arrow: %s" % (str(schema), str(schema_from_arrow))) --- End diff -- I'd prefer 1. if anything becomes too complicated to match the results for now. I guess 2. could be failed for any unexpected reason comparing `createDataFrame` without Arrow so I thought 1. is required as a guarantee and 2. is good to do. I wonder what @ueshin thinks about this.
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19437 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82868/ Test PASSed.
[GitHub] spark issue #19515: [SPARK-22287][MESOS] SPARK_DAEMON_MEMORY not honored by ...
Github user ArtRand commented on the issue: https://github.com/apache/spark/pull/19515 Hello @pmackles, thanks for this. It would be set to the value of `SPARK_DRIVER_MEMORY` by default, correct? What do you think about introducing a new envvar (`SPARK_DISPATCHER_MEMORY`) so that it can be individually configured? Only motivated by the fact that the Dispatcher is a pretty unique beast compared to the other Spark daemons.
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19437 Merged build finished. Test PASSed.
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19437 **[Test build #82868 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82868/testReport)** for PR 19437 at commit [`b2a3675`](https://github.com/apache/spark/commit/b2a36753b41820ec5e2d85a6a29fd8677bc0029a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19495 Merged build finished. Test PASSed.
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19495 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82859/ Test PASSed.
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145292822 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,39 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_schema +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) +arrow_schema = to_arrow_schema(schema) if schema is not None else None +batches = [pa.RecordBatch.from_pandas(pdf_slice, schema=arrow_schema, preserve_index=False) + for pdf_slice in pdf_slices] + +# Verify schema, there will be at least 1 batch from pandas.DataFrame +schema_from_arrow = from_arrow_schema(batches[0].schema) +if schema is not None and schema != schema_from_arrow: +raise ValueError("Supplied schema does not match result from Arrow\nsupplied: " + + "%s\n!=\nfrom Arrow: %s" % (str(schema), str(schema_from_arrow))) --- End diff -- Oh, actually we do have for warning in Python: ```python import warnings warnings.warn("...") ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
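The alternative being discussed above is to emit a warning instead of raising when the supplied schema differs from the one Arrow infers. A minimal sketch of that pattern, assuming a hypothetical `check_schema` helper (not the actual PySpark implementation, which raises a `ValueError` in the diff):

```python
import warnings

def check_schema(expected, actual):
    # Warn-instead-of-raise option discussed above: flag a schema
    # mismatch without failing the conversion. Returns whether the
    # schemas matched (or no expected schema was supplied).
    if expected is not None and expected != actual:
        warnings.warn("Supplied schema does not match result from Arrow: "
                      "%s != %s" % (expected, actual))
        return False
    return True
```

Callers could then fall back to the non-Arrow `createDataFrame` path when this returns `False`, which is the "fall back if its cost is cheap" idea from the earlier comment.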
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19488 LGTM except one comment, thanks for working on it!
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19495 **[Test build #82859 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82859/testReport)** for PR 19495 at commit [`037e036`](https://github.com/apache/spark/commit/037e03658b2102cb1025019b2a22535a5ecb3f50). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19488: [SPARK-22266][SQL] The same aggregate function wa...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19488#discussion_r145292665 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2677,4 +2678,29 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { checkAnswer(df, Row(1, 1, 1)) } } + + test("SRARK-22266: the same aggregate function was calculated multiple times") { +val query = "SELECT a, max(b+1), max(b+1) + 1 FROM testData2 GROUP BY a" +val df = sql(query) +val physical = df.queryExecution.sparkPlan +val aggregates = physical.collect { + case agg : HashAggregateExec => agg +} +aggregates.foreach { agg => --- End diff -- nit: ``` assert(aggregates.length == 1) assert(aggregates.head.aggregateExpressions.size == 1) ``` If the implementation changed and we execute the query with other aggregate implementations, we should make sure this test is not ignored.
[GitHub] spark pull request #19488: [SPARK-22266][SQL] The same aggregate function wa...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19488#discussion_r145292679 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2677,4 +2678,29 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { checkAnswer(df, Row(1, 1, 1)) } } + + test("SRARK-22266: the same aggregate function was calculated multiple times") { +val query = "SELECT a, max(b+1), max(b+1) + 1 FROM testData2 GROUP BY a" +val df = sql(query) +val physical = df.queryExecution.sparkPlan +val aggregates = physical.collect { + case agg : HashAggregateExec => agg +} +aggregates.foreach { agg => + assert (agg.aggregateExpressions.size == 1) +} +checkAnswer(df, Row(1, 3, 4) :: Row(2, 3, 4) :: Row(3, 3, 4) :: Nil) + } + + test("Non-deterministic aggregate functions should not be deduplicated") { +val query = "SELECT a, first_value(b), first_value(b) + 1 FROM testData2 GROUP BY a" +val df = sql(query) +val physical = df.queryExecution.sparkPlan +val aggregates = physical.collect { + case agg : HashAggregateExec => agg +} +aggregates.foreach { agg => --- End diff -- ditto
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19437 Merged build finished. Test PASSed.
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19437 **[Test build #82867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82867/testReport)** for PR 19437 at commit [`770d307`](https://github.com/apache/spark/commit/770d307bd67796b93f20d7dba105905059af7a0e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19437 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82867/ Test PASSed.
[GitHub] spark pull request #19374: [SPARK-22145][MESOS] fix supervise with checkpoin...
Github user ArtRand commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19374#discussion_r145292254

    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala ---
    @@ -135,22 +135,24 @@ private[spark] class MesosClusterScheduler(
       private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", false)
       private val schedulerState = engineFactory.createEngine("scheduler")
       private val stateLock = new Object()
    +  // Keyed by submission id
       private val finishedDrivers =
         new mutable.ArrayBuffer[MesosClusterSubmissionState](retainedDrivers)
       private var frameworkId: String = null
    -  // Holds all the launched drivers and current launch state, keyed by driver id.
    +  // Holds all the launched drivers and current launch state, keyed by submission id.
       private val launchedDrivers = new mutable.HashMap[String, MesosClusterSubmissionState]()
       // Holds a map of driver id to expected slave id that is passed to Mesos for reconciliation.
       // All drivers that are loaded after failover are added here, as we need get the latest
    -  // state of the tasks from Mesos.
    +  // state of the tasks from Mesos. Keyed by task Id.
       private val pendingRecover = new mutable.HashMap[String, SlaveID]()
    -  // Stores all the submitted drivers that hasn't been launched.
    +  // Stores all the submitted drivers that hasn't been launched, keyed by submission id
       private val queuedDrivers = new ArrayBuffer[MesosDriverDescription]()
    -  // All supervised drivers that are waiting to retry after termination.
    +  // All supervised drivers that are waiting to retry after termination, keyed by submission id
       private val pendingRetryDrivers = new ArrayBuffer[MesosDriverDescription]()
       private val queuedDriversState = engineFactory.createEngine("driverQueue")
       private val launchedDriversState = engineFactory.createEngine("launchedDrivers")
       private val pendingRetryDriversState = engineFactory.createEngine("retryList")
    +  private final val RETRY_ID = "-retry-"
    --- End diff --

    Sorry, super nit, maybe `RETRY_SEP`. Feel free to ignore.
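For context on the `RETRY_ID` vs. `RETRY_SEP` naming nit: the constant is a separator embedded in task ids so the scheduler can relate a retried task back to its submission id. A hedged Python sketch of that idea (helper names and the id format are illustrative, not the actual Spark implementation):

```python
RETRY_SEP = "-retry-"  # the reviewer's suggested name for the separator constant

def retry_task_id(submission_id: str, attempt: int) -> str:
    """Build a task id that encodes the retry attempt after the separator."""
    return f"{submission_id}{RETRY_SEP}{attempt}"

def submission_id_of(task_id: str) -> str:
    """Recover the submission id by splitting on the separator; ids without
    a retry suffix come back unchanged."""
    return task_id.split(RETRY_SEP, 1)[0]
```

The suggestion `RETRY_SEP` reads better precisely because the constant is a separator inside an id, not an id itself.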
[GitHub] spark issue #19374: [SPARK-22145][MESOS] fix supervise with checkpointing on...
Github user ArtRand commented on the issue: https://github.com/apache/spark/pull/19374 LGTM
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19495 **[Test build #82869 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82869/testReport)** for PR 19495 at commit [`ed9d3a2`](https://github.com/apache/spark/commit/ed9d3a2e4fd4b1a0ec0e0ceb34419ede4e2303b8).
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19459#discussion_r145291702

    --- Diff: python/pyspark/sql/session.py ---
    @@ -414,6 +415,39 @@ def _createFromLocal(self, data, schema):
                 data = [schema.toInternal(row) for row in data]
             return self._sc.parallelize(data), schema
    +
    +    def _createFromPandasWithArrow(self, pdf, schema):
    +        """
    +        Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting
    +        to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the
    +        data types will be used to coerce the data in Pandas to Arrow conversion.
    +        """
    +        from pyspark.serializers import ArrowSerializer
    +        from pyspark.sql.types import from_arrow_schema, to_arrow_schema
    +        import pyarrow as pa
    +
    +        # Slice the DataFrame into batches
    +        step = -(-len(pdf) // self.sparkContext.defaultParallelism)  # round int up
    +        pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step))
    +        arrow_schema = to_arrow_schema(schema) if schema is not None else None
    +        batches = [pa.RecordBatch.from_pandas(pdf_slice, schema=arrow_schema, preserve_index=False)
    +                   for pdf_slice in pdf_slices]
    +
    +        # Verify schema, there will be at least 1 batch from pandas.DataFrame
    +        schema_from_arrow = from_arrow_schema(batches[0].schema)
    +        if schema is not None and schema != schema_from_arrow:
    +            raise ValueError("Supplied schema does not match result from Arrow\nsupplied: " +
    +                             "%s\n!=\nfrom Arrow: %s" % (str(schema), str(schema_from_arrow)))
    --- End diff --

    @ueshin and @HyukjinKwon after thinking about what to do when the schema is not equal, I have some concerns:

    1. Fall back to `createDataFrame` without Arrow - I implemented this and it works fine, but there is no logging in Python (afaik), so my concern is that the fallback happens silently, causing bad performance that the user will not know the reason for.

    2. Cast types using `astype`, similar to `ArrowPandasSerializer.dump_stream` - The issue I see with that is when there are null values and ints have been promoted to floats. This works fine in `dump_stream` because we are working with a pd.Series, and pyarrow allows us to pass a validity mask, which ignores the filled values. There is no option to pass a mask for a pd.DataFrame, so I believe pyarrow would try to interpret whatever fill values are there and raise an error. I can look into this more, though.
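The slicing step quoted in the diff above relies on the `-(-n // d)` idiom for ceiling division. A standalone sketch of that batching logic, with plain lists standing in for the pandas.DataFrame (an illustration of the idiom, not the actual `_createFromPandasWithArrow` code):

```python
def slice_into_batches(rows, parallelism):
    """Split rows into at most `parallelism` contiguous slices.

    The step size is ceil(len(rows) / parallelism), computed with the
    negate-floor-negate trick, so no math.ceil import is needed and the
    last batch absorbs any remainder.
    """
    step = -(-len(rows) // parallelism)  # equivalent to ceil(len(rows) / parallelism)
    return [rows[start:start + step] for start in range(0, len(rows), step)]
```

Because the step rounds up, 10 rows at parallelism 3 yield batches of sizes 4, 4, and 2 rather than leaving a fourth tiny batch.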
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user brkyvz commented on the issue: https://github.com/apache/spark/pull/19495 LGTM!
[GitHub] spark issue #19523: [SPARK-22301][SQL] Add rule to Optimizer for In with emp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19523 Can one of the admins verify this patch?
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19437 **[Test build #82868 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82868/testReport)** for PR 19437 at commit [`b2a3675`](https://github.com/apache/spark/commit/b2a36753b41820ec5e2d85a6a29fd8677bc0029a).
[GitHub] spark pull request #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user susanxhuynh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19437#discussion_r145287767

    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala ---
    @@ -170,9 +175,119 @@ private[mesos] object MesosSchedulerBackendUtil extends Logging {
           containerInfo.addNetworkInfos(info)
         }
    +
    +    getSecretVolume(conf, secretConfig).foreach { volume =>
    +      if (volume.getSource.getSecret.getReference.isInitialized) {
    +        logInfo(s"Setting reference secret ${volume.getSource.getSecret.getReference.getName}" +
    +          s"on file ${volume.getContainerPath}")
    +      } else {
    +        logInfo(s"Setting secret on file name=${volume.getContainerPath}")
    +      }
    +      containerInfo.addVolumes(volume)
    +    }
    +
         containerInfo
       }
    +
    +  def addSecretEnvVar(
    --- End diff --

    I've removed this method. To be more consistent, I've moved this code back into MesosClusterScheduler. There's a little duplication, because MesosCoarseGrainedSchedulerBackend now has a similar code snippet, but it does avoid the mutation.
[GitHub] spark pull request #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user susanxhuynh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19437#discussion_r145290625

    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala ---
    @@ -170,9 +175,119 @@ private[mesos] object MesosSchedulerBackendUtil extends Logging {
           containerInfo.addNetworkInfos(info)
         }
    +
    +    getSecretVolume(conf, secretConfig).foreach { volume =>
    +      if (volume.getSource.getSecret.getReference.isInitialized) {
    +        logInfo(s"Setting reference secret ${volume.getSource.getSecret.getReference.getName}" +
    +          s"on file ${volume.getContainerPath}")
    +      } else {
    +        logInfo(s"Setting secret on file name=${volume.getContainerPath}")
    +      }
    +      containerInfo.addVolumes(volume)
    +    }
    +
         containerInfo
       }
    +
    +  def addSecretEnvVar(
    +      envBuilder: Environment.Builder,
    +      conf: SparkConf,
    +      secretConfig: MesosSecretConfig): Unit = {
    +    getSecretEnvVar(conf, secretConfig).foreach { variable =>
    +      if (variable.getSecret.getReference.isInitialized) {
    +        logInfo(s"Setting reference secret ${variable.getSecret.getReference.getName}" +
    --- End diff --

    Fixed
[GitHub] spark pull request #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user susanxhuynh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19437#discussion_r145290603

    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala ---
    @@ -170,9 +175,119 @@ private[mesos] object MesosSchedulerBackendUtil extends Logging {
           containerInfo.addNetworkInfos(info)
         }
    +
    +    getSecretVolume(conf, secretConfig).foreach { volume =>
    +      if (volume.getSource.getSecret.getReference.isInitialized) {
    +        logInfo(s"Setting reference secret ${volume.getSource.getSecret.getReference.getName}" +
    +          s"on file ${volume.getContainerPath}")
    +      } else {
    +        logInfo(s"Setting secret on file name=${volume.getContainerPath}")
    +      }
    +      containerInfo.addVolumes(volume)
    +    }
    +
         containerInfo
       }
    +
    +  def addSecretEnvVar(
    +      envBuilder: Environment.Builder,
    +      conf: SparkConf,
    +      secretConfig: MesosSecretConfig): Unit = {
    +    getSecretEnvVar(conf, secretConfig).foreach { variable =>
    +      if (variable.getSecret.getReference.isInitialized) {
    +        logInfo(s"Setting reference secret ${variable.getSecret.getReference.getName}" +
    +          s"on file ${variable.getName}")
    +      } else {
    +        logInfo(s"Setting secret on environment variable name=${variable.getName}")
    +      }
    +      envBuilder.addVariables(variable)
    +    }
    +  }
    +
    +  private def getSecrets(conf: SparkConf, secretConfig: MesosSecretConfig):
    +      Seq[Secret] = {
    --- End diff --

    Fixed