date:20180205

[GitHub] spark pull request #20505: [SPARK-23251][SQL] Add checks for collection elem...

2018-02-05 Thread michalsenkyr

Github user michalsenkyr commented on a diff in the pull request:

https://github.com/apache/spark/pull/20505#discussion_r165903346
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala 
---
@@ -165,11 +165,15 @@ abstract class SQLImplicits extends 
LowPrioritySQLImplicits {
   def newProductSeqEncoder[A <: Product : TypeTag]: Encoder[Seq[A]] = 
ExpressionEncoder()
 
   /** @since 2.2.0 */
-  implicit def newSequenceEncoder[T <: Seq[_] : TypeTag]: Encoder[T] = 
ExpressionEncoder()
+  implicit def newSequenceEncoder[T[_], E : Encoder]
--- End diff --

Looks like we are. I can add new methods and make the old ones not 
implicit. That should fix MiMa. Although that might add to the clutter that's 
already in this class. Is that OK? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for ...

2018-02-05 Thread ueshin

GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/20506

[SPARK-23290][SQL][PYTHON] Use datetime.date for date type when converting 
Spark DataFrame to Pandas DataFrame.

## What changes were proposed in this pull request?

In #18664, there was a change in how `DateType` is returned to users ([line 1968 in dataframe.py](https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968)). This can cause client code that works in Spark 2.2 to fail.
See [SPARK-23290](https://issues.apache.org/jira/browse/SPARK-23290?focusedCommentId=16350917&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16350917) for an example.

This PR changes the conversion to use `datetime.date` for the date type, as Spark 2.2 does.
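
A minimal, stdlib-only sketch of one way the returned type can matter to client code. It uses neither Spark nor pandas: `datetime.datetime` merely stands in for a timestamp-like value, and `is_release_day` is a hypothetical client function, not code from the JIRA report.

```python
import datetime

RELEASE_DAY = datetime.date(2018, 2, 5)


def is_release_day(value):
    """Hypothetical client code written against `datetime.date` values."""
    return value == RELEASE_DAY


# A plain date compares equal, as the client code expects.
as_date = is_release_day(datetime.date(2018, 2, 5))

# A timestamp-like value for the same day does not compare equal to a date.
as_timestamp = is_release_day(datetime.datetime(2018, 2, 5))
```

Client code of this shape silently changes behavior when the element type changes, which is why keeping `datetime.date` preserves compatibility.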

## How was this patch tested?

Tests were modified to fit the new behavior, alongside existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-23290

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20506.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20506


commit 223d0a06a755d3ceb59664b37a87af82f61f2ae4
Author: Takuya UESHIN 
Date:   2018-02-05T06:52:43Z

Use datetime.date for date type when converting Spark DataFrame to Pandas 
DataFrame.

commit 57ab41b90dbdace4dc5ce71421c42cfff27d061c
Author: Takuya UESHIN 
Date:   2018-02-05T07:49:36Z

Modify a test for date type.




---


[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20506
  
**[Test build #87062 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87062/testReport)**
 for PR 20506 at commit 
[`57ab41b`](https://github.com/apache/spark/commit/57ab41b90dbdace4dc5ce71421c42cfff27d061c).


---


[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...

2018-02-05 Thread ueshin

Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/20506
  
cc @BryanCutler @icexelloss @HyukjinKwon @cloud-fan @gatorsmile 


---


[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20506
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20506
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/585/
Test PASSed.


---


[GitHub] spark pull request #20487: [SPARK-23319][TESTS] Explicitly skips PySpark tes...

2018-02-05 Thread felixcheung

Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/20487#discussion_r165911313
  
--- Diff: pom.xml ---
@@ -185,6 +185,10 @@
 2.8
 1.8
 1.0.0
+

[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...

2018-02-05 Thread felixcheung

Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/20373
  
this is targeting master, right?


---


[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20506
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87062/
Test PASSed.


---


[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20506
  
**[Test build #87062 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87062/testReport)**
 for PR 20506 at commit 
[`57ab41b`](https://github.com/apache/spark/commit/57ab41b90dbdace4dc5ce71421c42cfff27d061c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20506
  
Merged build finished. Test PASSed.


---


[GitHub] spark pull request #20495: [SPARK-23327] [SQL] Update the description and te...

2018-02-05 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20495#discussion_r165928335
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1705,10 +1705,12 @@ def unhex(col):
 @ignore_unicode_prefix
 @since(1.5)
 def length(col):
-"""Calculates the length of a string or binary expression.
+"""Computes the character length of a given string or number of bytes 
or a binary string.
--- End diff --

`number of bytes of a binary value`?
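
The distinction the docstring draws can be sketched in plain Python (this is not Spark's `length`; the sample string is an arbitrary assumption): the character length of a string and the byte length of its binary encoding differ as soon as a character needs more than one byte.

```python
# Arbitrary sample text containing one non-ASCII character.
text = "h\u00e9llo"  # "héllo"

# Binary value: the same text encoded as UTF-8.
data = text.encode("utf-8")

char_len = len(text)  # counts characters
byte_len = len(data)  # counts bytes; "é" takes two bytes in UTF-8
```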


---


[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...

2018-02-05 Thread ueshin

GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/20507

[SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to 
handle str type properly in Python 2.

## What changes were proposed in this pull request?

In Python 2, when `pandas_udf` tries to return a string-type value created in the UDF with `".."`, the execution fails. E.g.,

```python
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

df = spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
df.select(str_f(col('id'))).show()
```

raises the following exception:

```
...

java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93)

...
```

It seems that pyarrow ignores the `type` parameter of `pa.Array.from_pandas()` and treats the data as binary type when the requested type is string and the string values are `str` instead of `unicode` in Python 2.

This PR adds a workaround for this case.
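
A rough sketch of the general idea behind such a workaround, not the actual patch: normalise byte strings to text before handing the values on. It is written for Python 3, where `bytes` and `str` play the roles of Python 2's `str` and `unicode`; `to_text` and `normalize_strings` are hypothetical helper names, and UTF-8 is an assumed encoding.

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; leave text and None untouched."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def normalize_strings(values):
    """Return a new list in which every byte string has been decoded."""
    return [to_text(v) for v in values]


mixed = [b"0", "1", None]
normalized = normalize_strings(mixed)
```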

## How was this patch tested?

Added a test and existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-23334

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20507.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20507


commit 47b88734b91a7f9a4335bc3c667640eb4600b8e1
Author: Takuya UESHIN 
Date:   2018-02-05T09:30:20Z

Fix pandas_udf with return type StringType() to handle str type properly.




---


[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20507
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20507
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/586/
Test PASSed.


---


[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20507
  
**[Test build #87063 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87063/testReport)**
 for PR 20507 at commit 
[`47b8873`](https://github.com/apache/spark/commit/47b88734b91a7f9a4335bc3c667640eb4600b8e1).


---


[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...

2018-02-05 Thread ueshin

Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/20507
  
cc @BryanCutler @icexelloss @HyukjinKwon 
Could you help me double-check this?
Since this seems to happen only in a Python 2 environment, Jenkins will skip the tests.
And let me know if you know of a better workaround.


---


[GitHub] spark pull request #20226: [SPARK-23034][SQL] Override `nodeName` for all *S...

2018-02-05 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20226#discussion_r165932670
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala 
---
@@ -86,6 +86,9 @@ case class RowDataSourceScanExec(
 
   def output: Seq[Attribute] = requiredColumnsIndex.map(fullOutput)
 
+  override val nodeName: String =
--- End diff --

`DataSourceScanExec.nodeName` is defined as `s"Scan $relation ${tableIdentifier.map(_.unquotedString).getOrElse("")}"`. Do we really need to override it here?


---


[GitHub] spark issue #20226: [SPARK-23034][SQL] Override `nodeName` for all *ScanExec...

2018-02-05 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20226
  
By default, `simpleString` is defined as `s"$nodeName $argString".trim`. If we override `nodeName` in some node, we should also override `argString`; otherwise we may have duplicated information in `simpleString`, which is used by `explain`.

Can we just change the UI code to put `plan.simpleString` in the plan graph?
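
The duplication concern can be illustrated with a small Python analogue (the class names and the `relation=t1` argument are hypothetical, not Spark code): when the node name is overridden to carry details that the argument string already contains, the combined string repeats them.

```python
class PlanNode:
    """Toy stand-in for a plan node with a name and an argument string."""

    node_name = "Scan"

    def arg_string(self):
        return "relation=t1"

    def simple_string(self):
        # Mirrors the shape `s"$nodeName $argString".trim`.
        return f"{self.node_name} {self.arg_string()}".strip()


class OverriddenNode(PlanNode):
    # Overriding only the name repeats what arg_string() already reports.
    node_name = "Scan relation=t1"


plain = PlanNode().simple_string()
duplicated = OverriddenNode().simple_string()
```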


---


[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20477
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20477
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/587/
Test PASSed.


---


[GitHub] spark pull request #20481: [SPARK-23307][WEBUI]Sort jobs/stages/tasks/querie...

2018-02-05 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20481#discussion_r165934385
  
--- Diff: 
core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ---
@@ -875,8 +875,8 @@ private[spark] class AppStatusListener(
   return
 }
 
-val toDelete = KVUtils.viewToSeq(kvstore.view(classOf[JobDataWrapper]),
-countToDelete.toInt) { j =>
+val view = 
kvstore.view(classOf[JobDataWrapper]).index("completionTime").first(0L)
--- End diff --

use `TaskIndexNames.COMPLETION_TIME`?
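
The diff above appears to iterate the view in `completionTime` order so that the oldest entries are picked for deletion first. A plain-Python sketch of that selection, with a hypothetical `Job` record and `pick_oldest` helper rather than the KVStore API:

```python
from collections import namedtuple

# Hypothetical record; the real wrapper classes carry far more state.
Job = namedtuple("Job", ["job_id", "completion_time"])


def pick_oldest(jobs, count_to_delete):
    """Return the `count_to_delete` jobs with the earliest completion time."""
    ordered = sorted(jobs, key=lambda job: job.completion_time)
    return ordered[:count_to_delete]


jobs = [Job(1, 300), Job(2, 100), Job(3, 200)]
oldest = pick_oldest(jobs, 2)
```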


---


[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20477
  
**[Test build #87064 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87064/testReport)**
 for PR 20477 at commit 
[`a40d18e`](https://github.com/apache/spark/commit/a40d18ea08a62ecafa1d120bb7ce38019ba57869).


---


[GitHub] spark pull request #20481: [SPARK-23307][WEBUI]Sort jobs/stages/tasks/querie...

2018-02-05 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20481#discussion_r165934452
  
--- Diff: 
core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ---
@@ -888,8 +888,8 @@ private[spark] class AppStatusListener(
   return
 }
 
-val stages = KVUtils.viewToSeq(kvstore.view(classOf[StageDataWrapper]),
-countToDelete.toInt) { s =>
+val view = 
kvstore.view(classOf[StageDataWrapper]).index("completionTime").first(0L)
--- End diff --

ditto


---


[GitHub] spark issue #20481: [SPARK-23307][WEBUI]Sort jobs/stages/tasks/queries with ...

2018-02-05 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20481
  
thanks, merging to master/2.3!


---


[GitHub] spark pull request #20481: [SPARK-23307][WEBUI]Sort jobs/stages/tasks/querie...

2018-02-05 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20481


---


[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20507
  
**[Test build #87063 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87063/testReport)**
 for PR 20507 at commit 
[`47b8873`](https://github.com/apache/spark/commit/47b88734b91a7f9a4335bc3c667640eb4600b8e1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20507
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87063/
Test PASSed.


---


[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20507
  
Merged build finished. Test PASSed.


---


[GitHub] spark pull request #20508: [SPARK-23335][SQL] Should not convert to double w...

2018-02-05 Thread caneGuy

GitHub user caneGuy opened a pull request:

https://github.com/apache/spark/pull/20508

[SPARK-23335][SQL] Should not convert to double when there is an Integral value in BinaryArithmetic which will lose precision

## What changes were proposed in this pull request?

For the expression below:

`select conv('',16,10) % 2;`

the result is 0:

```
0: jdbc:hive2://xxx:16> select conv('',16,10) % 2;
+--+--+
| (CAST(conv(, 16, 10) AS DOUBLE) % CAST(CAST(2 AS DECIMAL(20,0)) AS DOUBLE)) |
+--+--+
| 0.0 |
+--+--+
```
This is caused by:

```
case a @ BinaryArithmetic(left @ StringType(), right) =>
  a.makeCopy(Array(Cast(left, DoubleType), right))
case a @ BinaryArithmetic(left, right @ StringType()) =>
  a.makeCopy(Array(left, Cast(right, DoubleType)))
```
This patch fixes this by adding a rule check: when a `BinaryArithmetic` operator has an integral-type operand, we should not convert the value to double.
The result is as below:
```
0: jdbc:hive2://xxx:16> select conv('',16,10) % 2;
+---+--+
| (CAST(CAST(conv(, 16, 10) AS DECIMAL(38,0)) AS DECIMAL(38,0)) % CAST(CAST(2 AS DECIMAL(38,0)) AS DECIMAL(38,0))) |
+---+--+
| 1 |
+---+--+
```
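
The precision loss can be reproduced outside Spark with plain Python arithmetic. This is only an illustration: the integer below is a hypothetical stand-in (the literal in the query above is elided), chosen because it is odd and too large to be represented exactly as a double.

```python
from decimal import Decimal

# Hypothetical large odd integer; not the value from the original query.
value = 2**64 - 1

# Converting to a double rounds to the nearest representable value (2**64),
# which is even, so the remainder is wrong.
as_double = float(value) % 2

# Exact decimal arithmetic keeps every digit, so the remainder is correct.
as_decimal = Decimal(value) % 2
```

The double path yields `0.0` while the decimal path yields `1`, matching the shape of the two query results shown above.
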
## How was this patch tested?
Existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/caneGuy/spark zhoukang/fix-castasdouble

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20508.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20508


commit 1a2c62f6e2725cbbdc44c464c7fc0b9358e064b2
Author: zhoukang 
Date:   2018-02-05T10:52:40Z

[SPARK-MI][SQL] Should not convert to double when there is an Integral 
value in BinaryArithmetic which will loss precison




---


[GitHub] spark issue #20508: [SPARK-23335][SQL] Should not convert to double when the...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20508
  
Can one of the admins verify this patch?


---


[GitHub] spark pull request #20509: [SPARK-23268][SQL][followup] Reorganize packages ...

2018-02-05 Thread cloud-fan

GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/20509

[SPARK-23268][SQL][followup] Reorganize packages in data source V2

## What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/20435.

While reorganizing the packages for streaming data source v2, the top level 
stream read/write support interfaces should not be in the reader/writer 
package, but should be in the `sources.v2` package, to follow the 
`ReadSupport`, `WriteSupport`, etc.



## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark followup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20509.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20509


commit e3f007f5ccf3f1404ad37e40f6d3112933da3c24
Author: Wenchen Fan 
Date:   2018-02-05T10:22:02Z

move streaming read/write support interface to sources.v2 package




---


[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...

2018-02-05 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20509
  
cc @gengliangwang @jose-torres @gatorsmile 


---


[GitHub] spark pull request #20487: [SPARK-23319][TESTS] Explicitly skips PySpark tes...

2018-02-05 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20487#discussion_r165940636
  
--- Diff: pom.xml ---
@@ -185,6 +185,10 @@
 2.8
 1.8
 1.0.0
+

[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20509
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20509
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/588/
Test PASSed.


---


[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20509
  
**[Test build #87065 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87065/testReport)**
 for PR 20509 at commit 
[`e3f007f`](https://github.com/apache/spark/commit/e3f007f5ccf3f1404ad37e40f6d3112933da3c24).


---


[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...

2018-02-05 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20373
  
To me, yup.


---


[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...

2018-02-05 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20373
  
FYI, I am trying to make a minor release of cloudpickle that matches this, to deduplicate our efforts. We put a lot of effort into finding and backporting bug fixes here. :-)


---


[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20509
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20509
  
**[Test build #87065 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87065/testReport)**
 for PR 20509 at commit 
[`e3f007f`](https://github.com/apache/spark/commit/e3f007f5ccf3f1404ad37e40f6d3112933da3c24).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20509
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87065/
Test FAILed.


---


[GitHub] spark issue #18555: [SPARK-21353][CORE]add checkValue in spark.internal.conf...

2018-02-05 Thread heary-cao

Github user heary-cao commented on the issue:

https://github.com/apache/spark/pull/18555
  
cc @HyukjinKwon,@cloud-fan 



---


[GitHub] spark issue #20487: [SPARK-23319][TESTS] Explicitly skips PySpark tests for ...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20487
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20487: [SPARK-23319][TESTS] Explicitly skips PySpark tests for ...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20487
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/589/
Test PASSed.


---


[GitHub] spark issue #20487: [SPARK-23319][TESTS] Explicitly skips PySpark tests for ...

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20487
  
**[Test build #87066 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87066/testReport)**
 for PR 20487 at commit 
[`873b4b9`](https://github.com/apache/spark/commit/873b4b96804ebc41b538a090064218141c0f2589).


---


[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-05 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r165950192
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1557,6 +1557,9 @@ def replace(self, to_replace, value=None, 
subset=None):
 For example, if `value` is a string, and subset contains a 
non-string column,
 then the non-string column is simply ignored.
 
+.. note:: `value` can only be omitted when `to_replace` is a 
dictionary. Otherwise,
+it is required.
--- End diff --

Sure.
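
One common way to express "`value` may be omitted only when `to_replace` is a dictionary" is a sentinel default, which also keeps an explicit `None` distinguishable from "not supplied". The sketch below is a standalone illustration under that assumption; `_NO_VALUE` and `build_replacements` are hypothetical names, not the PySpark implementation.

```python
_NO_VALUE = object()  # hypothetical sentinel meaning "argument not supplied"


def build_replacements(to_replace, value=_NO_VALUE):
    """Return an {old: new} mapping, requiring `value` unless a dict is given."""
    if isinstance(to_replace, dict):
        # A dictionary already carries its own replacement values.
        return dict(to_replace)
    if value is _NO_VALUE:
        raise TypeError(
            "value argument is required when to_replace is not a dictionary"
        )
    # An explicit None is a legitimate replacement value here.
    return {to_replace: value}


from_dict = build_replacements({"Alice": "Bob"})
from_pair = build_replacements(10, value=20)
to_null = build_replacements(10, None)
```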


---


[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-05 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r165951235
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2186,7 +2186,7 @@ def test_replace(self):
 # replace with subset specified with one column replaced, another 
column not in subset
 # stays unchanged.
 row = self.spark.createDataFrame(
-[(u'Alice', 10, 10.0)], schema).replace(10, 20, 
subset=['name', 'age']).first()
+[(u'Alice', 10, 10.0)], schema).replace(10, value=20, 
subset=['name', 'age']).first()
--- End diff --

I don't think it's necessary but let me keep them since at least it tests 
different combinations of valid cases.


---


[GitHub] spark pull request #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4

2018-02-05 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/20510

[SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4

## What changes were proposed in this pull request?

This PR upgrades snappy-java to 1.1.4. Release notes:

- Fix a 1% performance regression when snappy is used in PIE executables.
- Improve compression performance by 5%.
- Improve decompression performance by 20%.

More details:

https://github.com/xerial/snappy-java/blob/master/Milestone.md#snappy-java-114-2017-05-22

## How was this patch tested?

manual tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-23336

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20510.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20510


commit 1055afc107b0c2357449ae3f23bda089480579d9
Author: Yuming Wang 
Date:   2018-02-05T11:59:47Z

Upgrade snappy-java to 1.1.4




---


[GitHub] spark issue #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20510
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20510
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/590/
Test PASSed.


---


[GitHub] spark issue #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20510
  
**[Test build #87067 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87067/testReport)**
 for PR 20510 at commit 
[`1055afc`](https://github.com/apache/spark/commit/1055afc107b0c2357449ae3f23bda089480579d9).


---


[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20499
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/591/
Test PASSed.


---


[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20499
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20499
  
**[Test build #87068 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87068/testReport)**
 for PR 20499 at commit 
[`1849f59`](https://github.com/apache/spark/commit/1849f5948d41d9a0a137a810b8a699755232f7cb).


---


[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20499
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20499
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87068/
Test PASSed.


---


[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20499
  
**[Test build #87068 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87068/testReport)**
 for PR 20499 at commit 
[`1849f59`](https://github.com/apache/spark/commit/1849f5948d41d9a0a137a810b8a699755232f7cb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...

2018-02-05 Thread danielvdende

Github user danielvdende commented on the issue:

https://github.com/apache/spark/pull/20057
  
@Stephan202 thanks for pointing out those docs issues, just pushed the 
changes :-).
@gatorsmile @dongjoon-hyun would you have a chance to take a look at this 
again? 


---


[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...

2018-02-05 Thread Fokko

Github user Fokko commented on the issue:

https://github.com/apache/spark/pull/20057
  
Any idea when this will be merged into master? We could use this since we 
are ditching sqoop.


---


[GitHub] spark pull request #20508: [SPARK-23335][SQL] Should not convert to double w...

2018-02-05 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/20508#discussion_r165968094
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -327,6 +327,14 @@ object TypeCoercion {
   // Skip nodes who's children have not been resolved yet.
   case e if !e.childrenResolved => e
 
+  // For integralType should not convert to double which will cause 
precision loss.
+  case a @ BinaryArithmetic(left @ StringType(), right @ 
IntegralType()) =>
--- End diff --

What will happen if the string value is beyond the range of the long type?
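
To make the two concerns in this thread concrete, here is a small sketch in plain Python. It says nothing about what Spark's coercion rules actually do; the helper names `fits_in_long` and `survives_double` are made up for illustration. It only shows that an integer string can fit in a 64-bit long yet lose precision as a double, and that a string just past the long range is still representable as a double.

```python
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1  # range of a signed 64-bit long


def fits_in_long(text):
    """True if the integer string is within the signed 64-bit range."""
    return LONG_MIN <= int(text) <= LONG_MAX


def survives_double(text):
    """True if converting the integer string to a double loses nothing."""
    return int(float(text)) == int(text)


precise = "9007199254740993"        # 2**53 + 1: fits in long, not exact as double
too_large = "9223372036854775808"   # 2**63: exact as double, outside long

print(fits_in_long(precise), survives_double(precise))
print(fits_in_long(too_large), survives_double(too_large))
```

So converting to double loses precision for large in-range values, while converting to long needs some defined behavior for out-of-range strings; which trade-off the rule should make is the open question here.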


---


[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...

2018-02-05 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20507#discussion_r165968902
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3920,6 +3920,14 @@ def test_vectorized_udf_null_string(self):
 res = df.select(str_f(col('str')))
 self.assertEquals(df.collect(), res.collect())
 
+def test_vectorized_udf_string_in_udf(self):
+from pyspark.sql.functions import pandas_udf, col
+import pandas as pd
+df = self.spark.range(10)
+str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), 
StringType())
+res = df.select(str_f(col('id')))
--- End diff --

How about variable names 'expected' and 'actual'?


---


[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...

2018-02-05 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20507#discussion_r165972212
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3920,6 +3920,14 @@ def test_vectorized_udf_null_string(self):
 res = df.select(str_f(col('str')))
 self.assertEquals(df.collect(), res.collect())
 
+def test_vectorized_udf_string_in_udf(self):
+from pyspark.sql.functions import pandas_udf, col
+import pandas as pd
+df = self.spark.range(10)
+str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), 
StringType())
--- End diff --

Not a big deal. How about `pd.Series(map(str, x))`?
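
A quick check of the suggestion, without pandas or Spark so it runs anywhere: for the integer ids used in this test, `"%s" % i` and `str(i)` produce the same strings. The helper names below are hypothetical and exist only for this comparison.

```python
def format_each(values):
    """String conversion as written in the test: '%s' formatting."""
    return ["%s" % v for v in values]


def str_each(values):
    """String conversion as suggested: map(str, ...), materialized as a list."""
    return list(map(str, values))


ids = list(range(10))
assert format_each(ids) == str_each(ids)
```

One difference worth knowing: `"%s" % v` raises `TypeError` when `v` is a tuple with more than one element, because the tuple is unpacked as format arguments, while `str(v)` does not. That cannot happen with the integer column here.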


---


[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20477
  
**[Test build #87064 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87064/testReport)**
 for PR 20477 at commit 
[`a40d18e`](https://github.com/apache/spark/commit/a40d18ea08a62ecafa1d120bb7ce38019ba57869).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20477
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87064/
Test PASSed.


---


[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20477
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #18555: [SPARK-21353][CORE]add checkValue in spark.internal.conf...

2018-02-05 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18555
  
Hmm, why not address 
https://github.com/apache/spark/pull/18555#discussion_r126293557? I think that 
comment makes sense. 


---


[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...

2018-02-05 Thread gengliangwang

Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/20509
  
The proposal sounds good to me.


---


[GitHub] spark issue #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable logical ...

2018-02-05 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20387
  
On whether to do pushdown in the logical or the physical phase, I don't have a 
strong preference. I think in the logical phase we should try our best to push 
down data-size-reduction operators (like filter, aggregate, limit, etc.) close 
to the bottom of the plan, as that should always be good. We should apply 
pushdown to data sources in the physical phase, as it's not always good and we 
need to consider the cost. Currently it's done in the logical phase because of 
the `computeStats` problem. Eventually we should compute the statistics and 
apply pushdown to data sources in the physical phase.

About how to apply pushdown to data sources, I think `PhysicalOperation` is 
in the right direction and the new pushdown rule also follows it. Generally the 
logical phase is responsible for pushing down data-size-reduction operators 
close to the data source relation, and in the physical phase we collect the 
supported operators (currently only project and filter) above the data 
source relation and apply the pushdown once, so this doesn't need to be 
incremental.

We definitely need to document the contract for ordering and interactions 
between different types of pushdowns. For now we don't need to worry about it, 
as we only support column pruning and filter pushdown, and these two are 
orthogonal: it doesn't matter whether the data source runs project first or 
filter first. Here are some initial thoughts on how to define the contract.

Let's say the Data Source V2 framework supports pushing down required 
columns (column pruning), filter, limit, and aggregate. Semantically, filter, 
limit and aggregate are not exchangeable, so we should respect their order in 
the query. If we have all these operators in a query, how do we tell the data 
source about the order of these operators?

My proposal is, since `DataSourceReader` is mutable (not the plan!), we can 
ask the data source to remember which operators have been pushed down, via the 
order in which these `pushXXX` methods are called. And data source 
implementations should respect the order of pushdown when applying them 
internally.
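
A toy illustration of the proposal above, in Python rather than the real Java/Scala API. Everything here is hypothetical (`RecordingReader`, `push_filter`, `push_limit` are not Data Source V2 names); the point is only that a mutable reader can record the call order of its push methods and replay the operators in that order, and that the order changes the result.

```python
class RecordingReader:
    """Hypothetical mutable reader that remembers pushdown order."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pushed = []  # (operator name, argument), in call order

    def push_filter(self, predicate):
        self._pushed.append(("filter", predicate))

    def push_limit(self, n):
        self._pushed.append(("limit", n))

    def pushed_order(self):
        return [name for name, _ in self._pushed]

    def read(self):
        # Apply the pushed operators in exactly the order they arrived.
        rows = self._rows
        for name, arg in self._pushed:
            if name == "filter":
                rows = [r for r in rows if arg(r)]
            elif name == "limit":
                rows = rows[:arg]
        return rows


def is_even(n):
    return n % 2 == 0


filter_first = RecordingReader(range(10))
filter_first.push_filter(is_even)
filter_first.push_limit(3)

limit_first = RecordingReader(range(10))
limit_first.push_limit(3)
limit_first.push_filter(is_even)

print(filter_first.read())  # filter, then limit
print(limit_first.read())   # limit, then filter
```

Filter-then-limit keeps three even numbers, while limit-then-filter only sees the first three rows, which is why the contract has to pin down the ordering.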

About `PhysicalOperation`, it's pretty simple and we probably need to 
change it a lot when adding more operator pushdown. Another concern is, 
`PhysicalOperation` is used in a lot of places, not only data source pushdown. 
For safety, I wanna keep it unchanged, and start something new for data source 
v2 only.


---


[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...

2018-02-05 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20507#discussion_r165980594
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3920,6 +3920,14 @@ def test_vectorized_udf_null_string(self):
 res = df.select(str_f(col('str')))
 self.assertEquals(df.collect(), res.collect())
 
+def test_vectorized_udf_string_in_udf(self):
+from pyspark.sql.functions import pandas_udf, col
+import pandas as pd
+df = self.spark.range(10)
+str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), 
StringType())
+res = df.select(str_f(col('id')))
--- End diff --

Sure, I'll update it.


---


[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...

2018-02-05 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20507#discussion_r165980572
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3920,6 +3920,14 @@ def test_vectorized_udf_null_string(self):
 res = df.select(str_f(col('str')))
 self.assertEquals(df.collect(), res.collect())
 
+def test_vectorized_udf_string_in_udf(self):
+from pyspark.sql.functions import pandas_udf, col
+import pandas as pd
+df = self.spark.range(10)
+str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), 
StringType())
--- End diff --

Sounds good. I'll take it.


---


[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...

2018-02-05 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20387#discussion_r165981421
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
 ---
@@ -17,17 +17,151 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
+import java.util.UUID
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.apache.spark.sql.{AnalysisException, SaveMode}
+import org.apache.spark.sql.catalyst.TableIdentifier
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
-import org.apache.spark.sql.catalyst.expressions.AttributeReference
-import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
-import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
Expression}
+import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, 
Statistics}
+import org.apache.spark.sql.execution.datasources.DataSourceStrategy
+import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, 
ReadSupport, ReadSupportWithSchema, WriteSupport}
+import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, 
SupportsPushDownCatalystFilters, SupportsPushDownFilters, 
SupportsPushDownRequiredColumns, SupportsReportStatistics}
+import org.apache.spark.sql.sources.v2.writer.DataSourceWriter
+import org.apache.spark.sql.types.StructType
 
 case class DataSourceV2Relation(
-fullOutput: Seq[AttributeReference],
-reader: DataSourceReader)
-  extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder {
+source: DataSourceV2,
+options: Map[String, String],
+path: Option[String] = None,
+table: Option[TableIdentifier] = None,
--- End diff --

Have you considered 
https://github.com/apache/spark/pull/20387#issuecomment-362148217 ?

I feel it's better to define these common options in `DataSourceOptions`, 
so that data source implementations can also get these common options easily.


---


[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20507
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/592/
Test PASSed.


---


[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20507
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20507
  
**[Test build #87069 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87069/testReport)**
 for PR 20507 at commit 
[`06ae568`](https://github.com/apache/spark/commit/06ae568df2088652754c2df66d2f78c8fbdac48d).


---


[GitHub] spark pull request #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for ...

2018-02-05 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20506#discussion_r165980562
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1694,6 +1694,21 @@ def from_arrow_schema(arrow_schema):
  for field in arrow_schema])
 
 
+def _correct_date_of_dataframe_from_arrow(pdf, schema):
+""" Correct date type value to use datetime.date.
+
+Pandas DataFrame created from PyArrow uses datetime64[ns] for date 
type values, but we should
+use datetime.date to keep backward compatibility.
--- End diff --

Shall we say this is to match the behavior when Arrow optimization is disabled?
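
For context on why the two representations are not interchangeable for client code, a stdlib-only sketch. It uses `datetime.datetime` at midnight as a stand-in for a timestamp-typed value (the real column would be a pandas `datetime64[ns]`), and `to_date` is a hypothetical helper, not the function under review.

```python
import datetime


def to_date(value):
    """Return a plain datetime.date for date-like input (hypothetical helper)."""
    if isinstance(value, datetime.datetime):
        # datetime is a subclass of date, so check for it first.
        return value.date()
    return value


midnight = datetime.datetime(2018, 2, 5)  # timestamp-like value
day = datetime.date(2018, 2, 5)           # what Spark 2.2 users received

# A datetime never compares equal to a plain date, even for the same day,
# so equality checks written against datetime.date stop matching.
assert midnight != day
assert to_date(midnight) == day
```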


---


[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20509
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/593/
Test PASSed.


---


[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20509
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20509
  
**[Test build #87070 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87070/testReport)**
 for PR 20509 at commit 
[`613d180`](https://github.com/apache/spark/commit/613d18034e8c43d534a6e0d51c522799be37384a).


---


[GitHub] spark pull request #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for ...

2018-02-05 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20506#discussion_r165987965
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1694,6 +1694,21 @@ def from_arrow_schema(arrow_schema):
  for field in arrow_schema])
 
 
+def _correct_date_of_dataframe_from_arrow(pdf, schema):
+""" Correct date type value to use datetime.date.
+
+Pandas DataFrame created from PyArrow uses datetime64[ns] for date 
type values, but we should
+use datetime.date to keep backward compatibility.
--- End diff --

Maybe we don't need to mention backward compatibility here. I'll update 
it.


---


[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20506
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/594/
Test PASSed.


---


[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20506
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20506
  
**[Test build #87071 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87071/testReport)**
 for PR 20506 at commit 
[`ebdbd8c`](https://github.com/apache/spark/commit/ebdbd8c4a06a4da52fc61b1dc98d6e2f2facdf9c).


---


[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-02-05 Thread DaimonPl

Github user DaimonPl commented on the issue:

https://github.com/apache/spark/pull/16578
  
So if it's not going to be included in `2.3.0`, maybe we could change 
`spark.sql.nestedSchemaPruning.enabled` to default to `true`? I hope that this 
time this PR can be finalized at an early stage of `2.4.0` so there will be 
plenty of time to fix any unforeseen problems.


---


[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20507
  
**[Test build #87069 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87069/testReport)**
 for PR 20507 at commit 
[`06ae568`](https://github.com/apache/spark/commit/06ae568df2088652754c2df66d2f78c8fbdac48d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20507
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87069/
Test PASSed.


---


[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20507
  
Merged build finished. Test PASSed.


---


[GitHub] spark pull request #20167: [SPARK-16501] [MESOS] Allow providing Mesos princ...

2018-02-05 Thread ArtRand

Github user ArtRand commented on a diff in the pull request:

https://github.com/apache/spark/pull/20167#discussion_r165994809
  
--- Diff: 
resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala
 ---
@@ -71,40 +74,64 @@ trait MesosSchedulerUtils extends Logging {
   failoverTimeout: Option[Double] = None,
   frameworkId: Option[String] = None): SchedulerDriver = {
 val fwInfoBuilder = 
FrameworkInfo.newBuilder().setUser(sparkUser).setName(appName)
-val credBuilder = Credential.newBuilder()
+
fwInfoBuilder.setHostname(Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse(
+  conf.get(DRIVER_HOST_ADDRESS)))
 webuiUrl.foreach { url => fwInfoBuilder.setWebuiUrl(url) }
 checkpoint.foreach { checkpoint => 
fwInfoBuilder.setCheckpoint(checkpoint) }
 failoverTimeout.foreach { timeout => 
fwInfoBuilder.setFailoverTimeout(timeout) }
 frameworkId.foreach { id =>
   fwInfoBuilder.setId(FrameworkID.newBuilder().setValue(id).build())
 }
-
fwInfoBuilder.setHostname(Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse(
-  conf.get(DRIVER_HOST_ADDRESS)))
-conf.getOption("spark.mesos.principal").foreach { principal =>
-  fwInfoBuilder.setPrincipal(principal)
-  credBuilder.setPrincipal(principal)
-}
-conf.getOption("spark.mesos.secret").foreach { secret =>
-  credBuilder.setSecret(secret)
-}
-if (credBuilder.hasSecret && !fwInfoBuilder.hasPrincipal) {
-  throw new SparkException(
-"spark.mesos.principal must be configured when spark.mesos.secret 
is set")
-}
+
 conf.getOption("spark.mesos.role").foreach { role =>
   fwInfoBuilder.setRole(role)
 }
 val maxGpus = conf.getInt("spark.mesos.gpus.max", 0)
 if (maxGpus > 0) {
   
fwInfoBuilder.addCapabilities(Capability.newBuilder().setType(Capability.Type.GPU_RESOURCES))
 }
+val credBuilder = buildCredentials(conf, fwInfoBuilder)
 if (credBuilder.hasPrincipal) {
   new MesosSchedulerDriver(
 scheduler, fwInfoBuilder.build(), masterUrl, credBuilder.build())
 } else {
   new MesosSchedulerDriver(scheduler, fwInfoBuilder.build(), masterUrl)
 }
   }
+  
+  def buildCredentials(
+  conf: SparkConf, 
+  fwInfoBuilder: Protos.FrameworkInfo.Builder): 
Protos.Credential.Builder = {
+val credBuilder = Credential.newBuilder()
+conf.getOption("spark.mesos.principal")
+  .orElse(Option(conf.getenv("SPARK_MESOS_PRINCIPAL")))
--- End diff --

I would want to make sure that @susanxhuynh and/or @skonto agree, but I 
think this is probably fine. 
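
The lookup order being agreed to here (Spark conf first, then the environment variable) can be sketched as follows. This is a Python illustration with plain dicts standing in for `SparkConf` and the process environment, not the Scala in the diff. The secret check mirrors the lines the diff removes from the old location; that it is kept in `buildCredentials` is an assumption, since that part of the new method is not shown.

```python
def first_defined(*candidates):
    """Return the first candidate that is not None, like chained orElse."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None


def build_credentials(conf, env):
    """Resolve Mesos credentials from a conf dict and an env dict (sketch)."""
    principal = first_defined(
        conf.get("spark.mesos.principal"),
        env.get("SPARK_MESOS_PRINCIPAL"))
    secret = conf.get("spark.mesos.secret")
    if secret is not None and principal is None:
        raise ValueError(
            "spark.mesos.principal must be configured when "
            "spark.mesos.secret is set")
    credentials = {}
    if principal is not None:
        credentials["principal"] = principal
    if secret is not None:
        credentials["secret"] = secret
    return credentials
```

With this order an explicit `spark.mesos.principal` always wins, and the environment variable only fills in when the conf key is absent.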


---


[GitHub] spark issue #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20510
  
**[Test build #87067 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87067/testReport)**
 for PR 20510 at commit 
[`1055afc`](https://github.com/apache/spark/commit/1055afc107b0c2357449ae3f23bda089480579d9).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20510
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20510
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87067/
Test FAILed.


---


[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20506
  
**[Test build #87071 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87071/testReport)**
 for PR 20506 at commit 
[`ebdbd8c`](https://github.com/apache/spark/commit/ebdbd8c4a06a4da52fc61b1dc98d6e2f2facdf9c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20506
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20506
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87071/
Test PASSed.



[GitHub] spark issue #20487: [SPARK-23319][TESTS] Explicitly skips PySpark tests for ...

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20487
  
**[Test build #87066 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87066/testReport)** for PR 20487 at commit [`873b4b9`](https://github.com/apache/spark/commit/873b4b96804ebc41b538a090064218141c0f2589).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #20487: [SPARK-23319][TESTS] Explicitly skips PySpark tests for ...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20487
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87066/
Test PASSed.



[GitHub] spark issue #20487: [SPARK-23319][TESTS] Explicitly skips PySpark tests for ...

2018-02-05 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20487
  
Merged build finished. Test PASSed.



[GitHub] spark issue #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4

2018-02-05 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/20510
  
Retest this please.



[GitHub] spark issue #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4

2018-02-05 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20510
  
**[Test build #87072 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87072/testReport)** for PR 20510 at commit [`1055afc`](https://github.com/apache/spark/commit/1055afc107b0c2357449ae3f23bda089480579d9).


