[GitHub] spark pull request #20505: [SPARK-23251][SQL] Add checks for collection elem...
Github user michalsenkyr commented on a diff in the pull request: https://github.com/apache/spark/pull/20505#discussion_r165903346 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala --- @@ -165,11 +165,15 @@ abstract class SQLImplicits extends LowPrioritySQLImplicits { def newProductSeqEncoder[A <: Product : TypeTag]: Encoder[Seq[A]] = ExpressionEncoder() /** @since 2.2.0 */ - implicit def newSequenceEncoder[T <: Seq[_] : TypeTag]: Encoder[T] = ExpressionEncoder() + implicit def newSequenceEncoder[T[_], E : Encoder] --- End diff -- Looks like we are. I can add new methods and make the old ones not implicit. That should fix MiMa. Although that might add to the clutter that's already in this class. Is that OK? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for ...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/20506 [SPARK-23290][SQL][PYTHON] Use datetime.date for date type when converting Spark DataFrame to Pandas DataFrame. ## What changes were proposed in this pull request? In #18664, there was a change in how `DateType` is returned to users ([line 1968 in dataframe.py](https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968)). This can cause client code that works in Spark 2.2 to fail. See [SPARK-23290](https://issues.apache.org/jira/browse/SPARK-23290?focusedCommentId=16350917&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16350917) for an example. This PR modifies the conversion to use `datetime.date` for date type, as Spark 2.2 does. ## How was this patch tested? Tests modified to fit the new behavior, plus existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ueshin/apache-spark issues/SPARK-23290 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20506.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20506 commit 223d0a06a755d3ceb59664b37a87af82f61f2ae4 Author: Takuya UESHIN Date: 2018-02-05T06:52:43Z Use datetime.date for date type when converting Spark DataFrame to Pandas DataFrame. commit 57ab41b90dbdace4dc5ce71421c42cfff27d061c Author: Takuya UESHIN Date: 2018-02-05T07:49:36Z Modify a test for date type.
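The compatibility issue comes down to how Spark's internal days-since-epoch integers are surfaced to Python: Spark 2.2 returned `datetime.date` objects, and this PR restores that. A minimal pure-Python sketch of that conversion (names are illustrative, not Spark's actual code):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def days_to_date(days):
    """Convert an internal days-since-epoch integer to a datetime.date,
    the type Spark 2.2 exposed for DateType columns."""
    return EPOCH + timedelta(days=days)

print(days_to_date(0))      # 1970-01-01
print(days_to_date(17567))  # 2018-02-05
```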
[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20506 **[Test build #87062 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87062/testReport)** for PR 20506 at commit [`57ab41b`](https://github.com/apache/spark/commit/57ab41b90dbdace4dc5ce71421c42cfff27d061c).
[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20506 cc @BryanCutler @icexelloss @HyukjinKwon @cloud-fan @gatorsmile
[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20506 Merged build finished. Test PASSed.
[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20506 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/585/ Test PASSed.
[GitHub] spark pull request #20487: [SPARK-23319][TESTS] Explicitly skips PySpark tes...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20487#discussion_r165911313 --- Diff: pom.xml --- @@ -185,6 +185,10 @@ 2.8 1.8 1.0.0 +
[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20373 this is targeting master, right?
[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20506 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87062/ Test PASSed.
[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20506 **[Test build #87062 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87062/testReport)** for PR 20506 at commit [`57ab41b`](https://github.com/apache/spark/commit/57ab41b90dbdace4dc5ce71421c42cfff27d061c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20506 Merged build finished. Test PASSed.
[GitHub] spark pull request #20495: [SPARK-23327] [SQL] Update the description and te...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20495#discussion_r165928335 --- Diff: python/pyspark/sql/functions.py --- @@ -1705,10 +1705,12 @@ def unhex(col): @ignore_unicode_prefix @since(1.5) def length(col): -"""Calculates the length of a string or binary expression. +"""Computes the character length of a given string or number of bytes or a binary string. --- End diff -- `number of bytes of a binary value`?
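The distinction the suggested wording draws — character length for strings versus byte count for binary values — can be seen in plain Python, since the two differ for any non-ASCII text (a sketch of the concept, not Spark's implementation):

```python
s = "héllo"            # 5 characters
b = s.encode("utf-8")  # 'é' occupies 2 bytes in UTF-8

print(len(s))  # character length of the string: 5
print(len(b))  # byte length of the binary value: 6
```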
[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/20507 [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2. ## What changes were proposed in this pull request? In Python 2, when a `pandas_udf` tries to return a string-type value created in the UDF with `".."`, the execution fails. E.g.,

```python
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

df = spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
df.select(str_f(col('id'))).show()
```

raises the following exception:

```
...
java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.(ArrowEvalPythonExec.scala:93)
...
```

It seems pyarrow ignores the `type` parameter for `pa.Array.from_pandas()` and considers the values binary type when the declared type is string type but the values are `str` instead of `unicode` in Python 2. This PR adds a workaround for that case. ## How was this patch tested? Added a test, plus existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ueshin/apache-spark issues/SPARK-23334 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20507.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20507 commit 47b88734b91a7f9a4335bc3c667640eb4600b8e1 Author: Takuya UESHIN Date: 2018-02-05T09:30:20Z Fix pandas_udf with return type StringType() to handle str type properly.
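The root cause is a type-inference ambiguity: in Python 2 a `str` is a byte string, so if the `type` hint is not honored it is indistinguishable from binary data. A Python 3 analogue of the workaround idea — normalizing any byte strings to text before handing the column to Arrow — might look like this (illustrative only; the actual fix lives in PySpark, and `normalize_strings` is a hypothetical helper):

```python
def normalize_strings(values):
    """Decode any byte strings so the column is uniformly text,
    mirroring the Python 2 str -> unicode workaround."""
    return [v.decode("utf-8") if isinstance(v, (bytes, bytearray)) else v
            for v in values]

mixed = ["a", b"b", "c"]
print(normalize_strings(mixed))  # ['a', 'b', 'c']
```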
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20507 Merged build finished. Test PASSed.
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20507 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/586/ Test PASSed.
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20507 **[Test build #87063 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87063/testReport)** for PR 20507 at commit [`47b8873`](https://github.com/apache/spark/commit/47b88734b91a7f9a4335bc3c667640eb4600b8e1).
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20507 cc @BryanCutler @icexelloss @HyukjinKwon Could you help me double-check this? Since this seems to happen only in a Python 2 environment, Jenkins will skip the tests. Also, let me know if you know a better workaround.
[GitHub] spark pull request #20226: [SPARK-23034][SQL] Override `nodeName` for all *S...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20226#discussion_r165932670 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -86,6 +86,9 @@ case class RowDataSourceScanExec( def output: Seq[Attribute] = requiredColumnsIndex.map(fullOutput) + override val nodeName: String = --- End diff -- `DataSourceScanExec.nodeName` is defined as `s"Scan $relation ${tableIdentifier.map(_.unquotedString).getOrElse("")}"`, do we really need to overwrite it here?
[GitHub] spark issue #20226: [SPARK-23034][SQL] Override `nodeName` for all *ScanExec...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20226 By default `simpleString` is defined as `s"$nodeName $argString".trim`. If we overwrite `nodeName` in some node, we should also overwrite `argString`; otherwise we may have duplicated information in `simpleString`, which is used with `explain`. Can we just change the UI code to put `plan.simpleString` in the plan graph?
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20477 Merged build finished. Test PASSed.
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20477 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/587/ Test PASSed.
[GitHub] spark pull request #20481: [SPARK-23307][WEBUI]Sort jobs/stages/tasks/querie...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20481#discussion_r165934385 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -875,8 +875,8 @@ private[spark] class AppStatusListener( return } -val toDelete = KVUtils.viewToSeq(kvstore.view(classOf[JobDataWrapper]), -countToDelete.toInt) { j => +val view = kvstore.view(classOf[JobDataWrapper]).index("completionTime").first(0L) --- End diff -- use `TaskIndexNames.COMPLETION_TIME`?
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20477 **[Test build #87064 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87064/testReport)** for PR 20477 at commit [`a40d18e`](https://github.com/apache/spark/commit/a40d18ea08a62ecafa1d120bb7ce38019ba57869).
[GitHub] spark pull request #20481: [SPARK-23307][WEBUI]Sort jobs/stages/tasks/querie...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20481#discussion_r165934452 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -888,8 +888,8 @@ private[spark] class AppStatusListener( return } -val stages = KVUtils.viewToSeq(kvstore.view(classOf[StageDataWrapper]), -countToDelete.toInt) { s => +val view = kvstore.view(classOf[StageDataWrapper]).index("completionTime").first(0L) --- End diff -- ditto
[GitHub] spark issue #20481: [SPARK-23307][WEBUI]Sort jobs/stages/tasks/queries with ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20481 thanks, merging to master/2.3!
[GitHub] spark pull request #20481: [SPARK-23307][WEBUI]Sort jobs/stages/tasks/querie...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20481
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20507 **[Test build #87063 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87063/testReport)** for PR 20507 at commit [`47b8873`](https://github.com/apache/spark/commit/47b88734b91a7f9a4335bc3c667640eb4600b8e1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20507 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87063/ Test PASSed.
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20507 Merged build finished. Test PASSed.
[GitHub] spark pull request #20508: [SPARK-23335][SQL] Should not convert to double w...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/20508 [SPARK-23335][SQL] Should not convert to double when there is an Integral value in BinaryArithmetic, which will lose precision ## What changes were proposed in this pull request? For the expression below: `select conv('',16,10) % 2;` it will return 0.

```
0: jdbc:hive2://xxx:16> select conv('',16,10) % 2;
+--+--+
| (CAST(conv(, 16, 10) AS DOUBLE) % CAST(CAST(2 AS DECIMAL(20,0)) AS DOUBLE)) |
+--+--+
| 0.0 |
+--+--+
```

It is caused by:

```
case a @ BinaryArithmetic(left @ StringType(), right) =>
  a.makeCopy(Array(Cast(left, DoubleType), right))
case a @ BinaryArithmetic(left, right @ StringType()) =>
  a.makeCopy(Array(left, Cast(right, DoubleType)))
```

This patch fixes it by adding a rule check: when there is an integral type in a BinaryArithmetic operator, we should not convert the value to double. The result is as below:

```
0: jdbc:hive2://xxx:16> select conv('',16,10) % 2;
+---+--+
| (CAST(CAST(conv(, 16, 10) AS DECIMAL(38,0)) AS DECIMAL(38,0)) % CAST(CAST(2 AS DECIMAL(38,0)) AS DECIMAL(38,0))) |
+---+--+
| 1 |
+---+--+
```

## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/caneGuy/spark zhoukang/fix-castasdouble Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20508.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20508 commit 1a2c62f6e2725cbbdc44c464c7fc0b9358e064b2 Author: zhoukang Date: 2018-02-05T10:52:40Z [SPARK-MI][SQL] Should not convert to double when there is an Integral value in BinaryArithmetic which will loss precison
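The precision loss is easy to reproduce outside Spark: a 64-bit value does not fit exactly in a double (53-bit significand), so casting to double before the modulo can flip the parity. A pure-Python sketch (the hex string is an illustrative stand-in for the literal elided in the report above):

```python
big = int("f" * 16, 16)  # 0xffffffffffffffff = 2**64 - 1, an odd number

print(big % 2)           # exact integer arithmetic: 1
print(float(big) % 2)    # double rounds 2**64 - 1 up to 2**64, giving 0.0
```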
[GitHub] spark issue #20508: [SPARK-23335][SQL] Should not convert to double when the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20508 Can one of the admins verify this patch?
[GitHub] spark pull request #20509: [SPARK-23268][SQL][followup] Reorganize packages ...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/20509 [SPARK-23268][SQL][followup] Reorganize packages in data source V2 ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/20435. While reorganizing the packages for streaming data source v2, the top level stream read/write support interfaces should not be in the reader/writer package, but should be in the `sources.v2` package, to follow `ReadSupport`, `WriteSupport`, etc. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark followup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20509.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20509 commit e3f007f5ccf3f1404ad37e40f6d3112933da3c24 Author: Wenchen Fan Date: 2018-02-05T10:22:02Z move streaming read/write support interface to sources.v2 package
[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20509 cc @gengliangwang @jose-torres @gatorsmile
[GitHub] spark pull request #20487: [SPARK-23319][TESTS] Explicitly skips PySpark tes...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20487#discussion_r165940636 --- Diff: pom.xml --- @@ -185,6 +185,10 @@ 2.8 1.8 1.0.0 +
[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20509 Merged build finished. Test PASSed.
[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20509 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/588/ Test PASSed.
[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20509 **[Test build #87065 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87065/testReport)** for PR 20509 at commit [`e3f007f`](https://github.com/apache/spark/commit/e3f007f5ccf3f1404ad37e40f6d3112933da3c24).
[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20373 To me, yup.
[GitHub] spark issue #20373: [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20373 FYI, I am trying to get a minor release of cloudpickle out to match this, to deduplicate our efforts. We put a lot of effort into finding and backporting bug fixes here. :-)
[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20509 Merged build finished. Test FAILed.
[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20509 **[Test build #87065 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87065/testReport)** for PR 20509 at commit [`e3f007f`](https://github.com/apache/spark/commit/e3f007f5ccf3f1404ad37e40f6d3112933da3c24). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20509 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87065/ Test FAILed.
[GitHub] spark issue #18555: [SPARK-21353][CORE]add checkValue in spark.internal.conf...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/18555 cc @HyukjinKwon, @cloud-fan
[GitHub] spark issue #20487: [SPARK-23319][TESTS] Explicitly skips PySpark tests for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20487 Merged build finished. Test PASSed.
[GitHub] spark issue #20487: [SPARK-23319][TESTS] Explicitly skips PySpark tests for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20487 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/589/ Test PASSed.
[GitHub] spark issue #20487: [SPARK-23319][TESTS] Explicitly skips PySpark tests for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20487 **[Test build #87066 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87066/testReport)** for PR 20487 at commit [`873b4b9`](https://github.com/apache/spark/commit/873b4b96804ebc41b538a090064218141c0f2589).
[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r165950192 --- Diff: python/pyspark/sql/dataframe.py --- @@ -1557,6 +1557,9 @@ def replace(self, to_replace, value=None, subset=None): For example, if `value` is a string, and subset contains a non-string column, then the non-string column is simply ignored. +.. note:: `value` can only be omitted when `to_replace` is a dictionary. Otherwise, +it is required. --- End diff -- Sure.
[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r165951235 --- Diff: python/pyspark/sql/tests.py --- @@ -2186,7 +2186,7 @@ def test_replace(self): # replace with subset specified with one column replaced, another column not in subset # stays unchanged. row = self.spark.createDataFrame( -[(u'Alice', 10, 10.0)], schema).replace(10, 20, subset=['name', 'age']).first() +[(u'Alice', 10, 10.0)], schema).replace(10, value=20, subset=['name', 'age']).first() --- End diff -- I don't think it's necessary, but let me keep them since at least they test different combinations of valid cases.
[GitHub] spark pull request #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/20510 [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4 ## What changes were proposed in this pull request? This PR upgrades snappy-java to 1.1.4. Release notes: - Fix a 1% performance regression when snappy is used in PIE executables. - Improve compression performance by 5%. - Improve decompression performance by 20%. More details: https://github.com/xerial/snappy-java/blob/master/Milestone.md#snappy-java-114-2017-05-22 ## How was this patch tested? Manual tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-23336 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20510.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20510 commit 1055afc107b0c2357449ae3f23bda089480579d9 Author: Yuming Wang Date: 2018-02-05T11:59:47Z Upgrade snappy-java to 1.1.4
[GitHub] spark issue #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20510 Merged build finished. Test PASSed.
[GitHub] spark issue #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20510 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/590/ Test PASSed.
[GitHub] spark issue #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20510 **[Test build #87067 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87067/testReport)** for PR 20510 at commit [`1055afc`](https://github.com/apache/spark/commit/1055afc107b0c2357449ae3f23bda089480579d9).
[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20499 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/591/ Test PASSed.
[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20499 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20499 **[Test build #87068 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87068/testReport)** for PR 20499 at commit [`1849f59`](https://github.com/apache/spark/commit/1849f5948d41d9a0a137a810b8a699755232f7cb). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20499 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20499 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87068/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20499 **[Test build #87068 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87068/testReport)** for PR 20499 at commit [`1849f59`](https://github.com/apache/spark/commit/1849f5948d41d9a0a137a810b8a699755232f7cb). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user danielvdende commented on the issue: https://github.com/apache/spark/pull/20057 @Stephan202 thanks for pointing out those docs issues, just pushed the changes :-). @gatorsmile @dongjoon-hyun would you have a chance to take a look at this again?
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user Fokko commented on the issue: https://github.com/apache/spark/pull/20057 Any idea when this will be merged into master? We could use this since we are ditching Sqoop.
[GitHub] spark pull request #20508: [SPARK-23335][SQL] Should not convert to double w...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20508#discussion_r165968094 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---

    @@ -327,6 +327,14 @@ object TypeCoercion {
           // Skip nodes who's children have not been resolved yet.
           case e if !e.childrenResolved => e

    +      // For integralType should not convert to double which will cause precision loss.
    +      case a @ BinaryArithmetic(left @ StringType(), right @ IntegralType()) =>

--- End diff -- What will happen if the string value is beyond the long type range?
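The precision loss the PR is guarding against can be reproduced outside Spark. A minimal Python sketch (the numeric literals are arbitrary examples, not values from the PR):

```python
# Doubles carry a 53-bit significand, so integers beyond 2**53 cannot all be
# represented exactly; casting a wide numeric string to double and back
# changes the value, which is the precision loss being discussed.
big = "12345678901234567890"            # wider than 2**53
assert int(float(big)) != int(big)      # the round trip through double is lossy

small = "42"
assert int(float(small)) == int(small)  # values within 2**53 survive intact
```

This is also why wangyum's question matters: a string wider than the long range cannot be kept exact as a long either, so the chosen target type decides which failure mode (overflow vs. silent precision loss) the user sees.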
[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20507#discussion_r165968902 --- Diff: python/pyspark/sql/tests.py ---

    @@ -3920,6 +3920,14 @@ def test_vectorized_udf_null_string(self):
             res = df.select(str_f(col('str')))
             self.assertEquals(df.collect(), res.collect())

    +    def test_vectorized_udf_string_in_udf(self):
    +        from pyspark.sql.functions import pandas_udf, col
    +        import pandas as pd
    +        df = self.spark.range(10)
    +        str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), StringType())
    +        res = df.select(str_f(col('id')))

--- End diff -- How about variable names 'expected' and 'actual'?
[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20507#discussion_r165972212 --- Diff: python/pyspark/sql/tests.py ---

    @@ -3920,6 +3920,14 @@ def test_vectorized_udf_null_string(self):
             res = df.select(str_f(col('str')))
             self.assertEquals(df.collect(), res.collect())

    +    def test_vectorized_udf_string_in_udf(self):
    +        from pyspark.sql.functions import pandas_udf, col
    +        import pandas as pd
    +        df = self.spark.range(10)
    +        str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), StringType())

--- End diff -- Not a big deal. How about `pd.Series(map(str, x))`?
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20477 **[Test build #87064 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87064/testReport)** for PR 20477 at commit [`a40d18e`](https://github.com/apache/spark/commit/a40d18ea08a62ecafa1d120bb7ce38019ba57869). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20477 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87064/ Test PASSed.
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20477 Merged build finished. Test PASSed.
[GitHub] spark issue #18555: [SPARK-21353][CORE]add checkValue in spark.internal.conf...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18555 Hmm .. why not address https://github.com/apache/spark/pull/18555#discussion_r126293557? I think that comment makes sense.
[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/20509 The proposal sounds good to me.
[GitHub] spark issue #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable logical ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20387 For doing pushdown at the logical or physical phase, I don't have a strong preference. I think at the logical phase we should try our best to push down data-size-reduction operators (like filter, aggregate, limit, etc.) close to the bottom of the plan, as that is always beneficial. We should apply pushdown to data sources at the physical phase, as it's not always beneficial and we need to consider the cost. Currently it's done in the logical phase because of the `computeStats` problem. Eventually we should compute the statistics and apply pushdown to data sources in the physical phase.

About how to apply pushdown to data sources, I think `PhysicalOperation` is in the right direction and the new pushdown rule also follows it. Generally the logical phase is responsible for pushing down data-size-reduction operators close to the data source relation, and in the physical phase we collect supported operators (currently only project and filter) above the data source relation and apply the pushdown once, so this doesn't need to be incremental.

We definitely need to document the contract for ordering and interactions between different types of pushdowns. For now we don't need to worry about it, as we only support column pruning and filter pushdown, and these two are orthogonal: it doesn't matter whether the data source runs project first or filter first.

Here are some initial thoughts on how to define the contract. Let's say the Data Source V2 framework supports pushing down required columns (column pruning), filter, limit, and aggregate. Semantically, filter, limit and aggregate are not interchangeable; we should respect their order in the query. If we have all these operators in a query, how do we tell the data source about their order?
My proposal is, since `DataSourceReader` is mutable (not the plan!), we can ask the data source to remember which operators have been pushed down, via the order in which these `pushXXX` methods are called. And data source implementations should respect the order of pushdown when applying them internally.

About `PhysicalOperation`, it's pretty simple and we would probably need to change it a lot when adding more operator pushdown. Another concern is that `PhysicalOperation` is used in a lot of places, not only data source pushdown. For safety, I want to keep it unchanged and start something new for data source v2 only.
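The "remember the order of `pushXXX` calls" idea can be sketched as a toy reader. This is illustrative Python only, not the actual `DataSourceReader` interface; the class and method names are made up for the sketch:

```python
class RecordingReader:
    """Toy reader that records the order in which pushdown methods
    are invoked, so the source can later replay the operators in the
    same order the query applied them."""

    def __init__(self):
        self.pushed = []  # (operator kind, payload), in call order

    def push_filters(self, filters):
        self.pushed.append(("filter", filters))

    def push_limit(self, n):
        self.pushed.append(("limit", n))

    def push_aggregate(self, agg):
        self.pushed.append(("aggregate", agg))


# A query shaped like `... WHERE id > 5 LIMIT 10` would push in this order:
reader = RecordingReader()
reader.push_filters(["id > 5"])
reader.push_limit(10)
assert [kind for kind, _ in reader.pushed] == ["filter", "limit"]
```

The point of the design is that the mutable reader itself, rather than the immutable plan, carries the pushdown order.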
[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20507#discussion_r165980594 --- Diff: python/pyspark/sql/tests.py ---

    @@ -3920,6 +3920,14 @@ def test_vectorized_udf_null_string(self):
             res = df.select(str_f(col('str')))
             self.assertEquals(df.collect(), res.collect())

    +    def test_vectorized_udf_string_in_udf(self):
    +        from pyspark.sql.functions import pandas_udf, col
    +        import pandas as pd
    +        df = self.spark.range(10)
    +        str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), StringType())
    +        res = df.select(str_f(col('id')))

--- End diff -- Sure, I'll update it.
[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20507#discussion_r165980572 --- Diff: python/pyspark/sql/tests.py ---

    @@ -3920,6 +3920,14 @@ def test_vectorized_udf_null_string(self):
             res = df.select(str_f(col('str')))
             self.assertEquals(df.collect(), res.collect())

    +    def test_vectorized_udf_string_in_udf(self):
    +        from pyspark.sql.functions import pandas_udf, col
    +        import pandas as pd
    +        df = self.spark.range(10)
    +        str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), StringType())

--- End diff -- Sounds good. I'll take it.
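The suggested `map(str, x)` form is a drop-in equivalent of the `"%s" % i` comprehension in the diff. A sketch of the equivalence, with the pyspark/pandas plumbing omitted so plain sequences stand in for the `pd.Series` the UDF would build:

```python
# Stand-in for the column of ids the pandas_udf receives.
ids = range(10)

# Original formulation from the diff ...
expected = ["%s" % i for i in ids]
# ... and the reviewer's suggested equivalent.
actual = list(map(str, ids))

assert actual == expected  # both produce the same string column
```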
[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r165981421 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,17 +17,151 @@ package org.apache.spark.sql.execution.datasources.v2 +import java.util.UUID + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.sql.{AnalysisException, SaveMode} +import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation -import org.apache.spark.sql.catalyst.expressions.AttributeReference -import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} -import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} +import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics} +import org.apache.spark.sql.execution.datasources.DataSourceStrategy +import org.apache.spark.sql.sources.{DataSourceRegister, Filter} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport} +import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics} +import org.apache.spark.sql.sources.v2.writer.DataSourceWriter +import org.apache.spark.sql.types.StructType case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceReader) - extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder { +source: DataSourceV2, +options: Map[String, String], +path: Option[String] = None, +table: Option[TableIdentifier] = None, --- End diff -- have you considered about https://github.com/apache/spark/pull/20387#issuecomment-362148217 ? 
I feel it's better to define these common options in `DataSourceOptions`, so that data source implementations can also get these common options easily.
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20507 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/592/ Test PASSed.
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20507 Merged build finished. Test PASSed.
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20507 **[Test build #87069 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87069/testReport)** for PR 20507 at commit [`06ae568`](https://github.com/apache/spark/commit/06ae568df2088652754c2df66d2f78c8fbdac48d).
[GitHub] spark pull request #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20506#discussion_r165980562 --- Diff: python/pyspark/sql/types.py ---

    @@ -1694,6 +1694,21 @@ def from_arrow_schema(arrow_schema):
             for field in arrow_schema])

    +def _correct_date_of_dataframe_from_arrow(pdf, schema):
    +    """ Correct date type value to use datetime.date.
    +
    +    Pandas DataFrame created from PyArrow uses datetime64[ns] for date type values, but we should
    +    use datetime.date to keep backward compatibility.

--- End diff -- Shall we instead say something like 'to match the behavior when Arrow optimization is disabled'?
[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20509 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/593/ Test PASSed.
[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20509 Merged build finished. Test PASSed.
[GitHub] spark issue #20509: [SPARK-23268][SQL][followup] Reorganize packages in data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20509 **[Test build #87070 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87070/testReport)** for PR 20509 at commit [`613d180`](https://github.com/apache/spark/commit/613d18034e8c43d534a6e0d51c522799be37384a).
[GitHub] spark pull request #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20506#discussion_r165987965 --- Diff: python/pyspark/sql/types.py ---

    @@ -1694,6 +1694,21 @@ def from_arrow_schema(arrow_schema):
             for field in arrow_schema])

    +def _correct_date_of_dataframe_from_arrow(pdf, schema):
    +    """ Correct date type value to use datetime.date.
    +
    +    Pandas DataFrame created from PyArrow uses datetime64[ns] for date type values, but we should
    +    use datetime.date to keep backward compatibility.

--- End diff -- Maybe we don't need to mention backward compatibility here. I'll update it.
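The correction the helper performs amounts to collapsing timestamp-like values back to plain `datetime.date`. A stdlib-only sketch of that conversion (`to_date` is a made-up name; the real helper in the PR works on whole pandas columns, not single values):

```python
import datetime

def to_date(value):
    # Arrow-backed pandas columns hold datetime64[ns]-style timestamps;
    # the non-Arrow path (and Spark 2.2) returned plain datetime.date,
    # so timestamp-like values are truncated to their date component.
    return value.date() if isinstance(value, datetime.datetime) else value

assert to_date(datetime.datetime(2018, 2, 5, 7, 49)) == datetime.date(2018, 2, 5)
assert to_date(datetime.date(2018, 2, 5)) == datetime.date(2018, 2, 5)  # already a date
```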
[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20506 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/594/ Test PASSed.
[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20506 Merged build finished. Test PASSed.
[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20506 **[Test build #87071 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87071/testReport)** for PR 20506 at commit [`ebdbd8c`](https://github.com/apache/spark/commit/ebdbd8c4a06a4da52fc61b1dc98d6e2f2facdf9c).
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on the issue: https://github.com/apache/spark/pull/16578 So if it's not going to be included in `2.3.0`, maybe we could change `spark.sql.nestedSchemaPruning.enabled` to default to `true`? I hope that this time the PR can be finalized at an early stage of `2.4.0`, so there would be plenty of time to fix any unforeseen problems.
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20507 **[Test build #87069 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87069/testReport)** for PR 20507 at commit [`06ae568`](https://github.com/apache/spark/commit/06ae568df2088652754c2df66d2f78c8fbdac48d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20507 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87069/ Test PASSed.
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20507 Merged build finished. Test PASSed.
[GitHub] spark pull request #20167: [SPARK-16501] [MESOS] Allow providing Mesos princ...
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/20167#discussion_r165994809 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala --- @@ -71,40 +74,64 @@ trait MesosSchedulerUtils extends Logging { failoverTimeout: Option[Double] = None, frameworkId: Option[String] = None): SchedulerDriver = { val fwInfoBuilder = FrameworkInfo.newBuilder().setUser(sparkUser).setName(appName) -val credBuilder = Credential.newBuilder() + fwInfoBuilder.setHostname(Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse( + conf.get(DRIVER_HOST_ADDRESS))) webuiUrl.foreach { url => fwInfoBuilder.setWebuiUrl(url) } checkpoint.foreach { checkpoint => fwInfoBuilder.setCheckpoint(checkpoint) } failoverTimeout.foreach { timeout => fwInfoBuilder.setFailoverTimeout(timeout) } frameworkId.foreach { id => fwInfoBuilder.setId(FrameworkID.newBuilder().setValue(id).build()) } - fwInfoBuilder.setHostname(Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse( - conf.get(DRIVER_HOST_ADDRESS))) -conf.getOption("spark.mesos.principal").foreach { principal => - fwInfoBuilder.setPrincipal(principal) - credBuilder.setPrincipal(principal) -} -conf.getOption("spark.mesos.secret").foreach { secret => - credBuilder.setSecret(secret) -} -if (credBuilder.hasSecret && !fwInfoBuilder.hasPrincipal) { - throw new SparkException( -"spark.mesos.principal must be configured when spark.mesos.secret is set") -} + conf.getOption("spark.mesos.role").foreach { role => fwInfoBuilder.setRole(role) } val maxGpus = conf.getInt("spark.mesos.gpus.max", 0) if (maxGpus > 0) { fwInfoBuilder.addCapabilities(Capability.newBuilder().setType(Capability.Type.GPU_RESOURCES)) } +val credBuilder = buildCredentials(conf, fwInfoBuilder) if (credBuilder.hasPrincipal) { new MesosSchedulerDriver( scheduler, fwInfoBuilder.build(), masterUrl, credBuilder.build()) } else { new MesosSchedulerDriver(scheduler, fwInfoBuilder.build(), masterUrl) } } + + 
def buildCredentials( + conf: SparkConf, + fwInfoBuilder: Protos.FrameworkInfo.Builder): Protos.Credential.Builder = { +val credBuilder = Credential.newBuilder() +conf.getOption("spark.mesos.principal") + .orElse(Option(conf.getenv("SPARK_MESOS_PRINCIPAL"))) --- End diff -- I would want to make sure that @susanxhuynh and/or @skonto agree, but I think this is probably fine.
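The `getOption(...).orElse(env)` chain under review resolves the Mesos principal from explicit Spark config first and only falls back to the environment variable. A Python sketch of the same lookup order (`resolve_principal` is a made-up name for illustration):

```python
import os

def resolve_principal(conf):
    # Explicit Spark config wins; SPARK_MESOS_PRINCIPAL is only the fallback,
    # mirroring conf.getOption("spark.mesos.principal").orElse(env) in the diff.
    return conf.get("spark.mesos.principal") or os.environ.get("SPARK_MESOS_PRINCIPAL")

os.environ["SPARK_MESOS_PRINCIPAL"] = "env-principal"
assert resolve_principal({}) == "env-principal"
assert resolve_principal({"spark.mesos.principal": "conf-principal"}) == "conf-principal"
```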
[GitHub] spark issue #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20510 **[Test build #87067 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87067/testReport)** for PR 20510 at commit [`1055afc`](https://github.com/apache/spark/commit/1055afc107b0c2357449ae3f23bda089480579d9). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20510 Merged build finished. Test FAILed.
[GitHub] spark issue #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20510 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87067/ Test FAILed.
[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20506 **[Test build #87071 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87071/testReport)** for PR 20506 at commit [`ebdbd8c`](https://github.com/apache/spark/commit/ebdbd8c4a06a4da52fc61b1dc98d6e2f2facdf9c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20506 Merged build finished. Test PASSed.
[GitHub] spark issue #20506: [SPARK-23290][SQL][PYTHON] Use datetime.date for date ty...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20506 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87071/ Test PASSed.
[GitHub] spark issue #20487: [SPARK-23319][TESTS] Explicitly skips PySpark tests for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20487 **[Test build #87066 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87066/testReport)** for PR 20487 at commit [`873b4b9`](https://github.com/apache/spark/commit/873b4b96804ebc41b538a090064218141c0f2589). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20487: [SPARK-23319][TESTS] Explicitly skips PySpark tests for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20487 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87066/ Test PASSed.
[GitHub] spark issue #20487: [SPARK-23319][TESTS] Explicitly skips PySpark tests for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20487 Merged build finished. Test PASSed.
[GitHub] spark issue #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20510 Retest this please.
[GitHub] spark issue #20510: [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.4
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20510 **[Test build #87072 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87072/testReport)** for PR 20510 at commit [`1055afc`](https://github.com/apache/spark/commit/1055afc107b0c2357449ae3f23bda089480579d9).