[GitHub] [spark] viirya commented on a change in pull request #24675: [SPARK-27803][SQL][PYTHON] Fix column pruning for Python UDF
viirya commented on a change in pull request #24675: [SPARK-27803][SQL][PYTHON] Fix column pruning for Python UDF
URL: https://github.com/apache/spark/pull/24675#discussion_r286786053

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/pythonLogicalOperators.scala
## @@ -38,3 +38,30 @@ case class FlatMapGroupsInPandas(
    */
   override val producedAttributes = AttributeSet(output)
 }
+
+trait BaseEvalPython extends UnaryNode {

Review comment:
Is `producedAttributes` missing from this trait? Previously, `BatchEvalPython` and `ArrowEvalPython` both had it defined.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #24675: [SPARK-27803][SQL][PYTHON] Fix column pruning for Python UDF
viirya commented on a change in pull request #24675: [SPARK-27803][SQL][PYTHON] Fix column pruning for Python UDF
URL: https://github.com/apache/spark/pull/24675#discussion_r286785463

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/pythonLogicalOperators.scala
## @@ -38,3 +38,30 @@ case class FlatMapGroupsInPandas(
    */
   override val producedAttributes = AttributeSet(output)
 }
+
+trait BaseEvalPython extends UnaryNode {
+
+  def udfs: Seq[PythonUDF]
+
+  def resultAttrs: Seq[Attribute]
+
+  override def output: Seq[Attribute] = child.output ++ resultAttrs
+
+  override def references: AttributeSet = AttributeSet(udfs.flatMap(_.references))

Review comment:
If `references` covers only the references in `udfs`, will output attributes from the child that aren't referenced by any of the `udfs` be pruned from `BaseEvalPython`?
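To make the review question above concrete, here is a minimal plain-Python sketch of one simplified way to think about column pruning. It is a toy model, not Catalyst's actual rule: the function `required_from_child` and the column names `a`, `b`, `c`, `udf_out` are hypothetical. It assumes a node must keep every child column that is either in its own `references` or still required by its parent.

```python
def required_from_child(child_output, node_references, parent_required):
    """Return the child columns that survive pruning in this toy model.

    A child column is kept if the node itself references it (for example as
    a UDF input) or if the parent still needs it to pass through.
    """
    needed = set(node_references) | set(parent_required)
    # Preserve the child's original column order.
    return [attr for attr in child_output if attr in needed]


# Hypothetical example: the UDF reads only `a`; the parent wants `b` and the
# UDF result `udf_out`, which is produced by the node, not by the child.
kept = required_from_child(
    child_output=["a", "b", "c"],
    node_references={"a"},
    parent_required={"b", "udf_out"},
)
print(kept)
```

Under this assumption, `b` survives because the parent asks for it, even though no UDF references it, while `c` is dropped. Whether Spark's optimizer behaves this way for `BaseEvalPython` is exactly what the comment asks the author to confirm.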
[GitHub] [spark] dongjoon-hyun edited a comment on issue #24680: [SPARK-26045][BUILD] Leave avro, avro-ipc dependendencies as compile scope even for hadoop-provided usages
dongjoon-hyun edited a comment on issue #24680: [SPARK-26045][BUILD] Leave avro, avro-ipc dependendencies as compile scope even for hadoop-provided usages URL: https://github.com/apache/spark/pull/24680#issuecomment-495068392 I'll leave this PR here since @vanzin's review is requested. We need this in the `master/2.4` branches.
[GitHub] [spark] dongjoon-hyun commented on issue #24640: [SPARK-27770] [SQL] [TEST] Port AGGREGATES.sql [Part 1]
dongjoon-hyun commented on issue #24640: [SPARK-27770] [SQL] [TEST] Port AGGREGATES.sql [Part 1]
URL: https://github.com/apache/spark/pull/24640#issuecomment-495074398

Could you fix the UT failure?
```
[info] - aggregates_part1.sql *** FAILED *** (3 seconds, 720 milliseconds)
```
[GitHub] [spark] pengbo removed a comment on issue #24666: [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page
pengbo removed a comment on issue #24666: [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page URL: https://github.com/apache/spark/pull/24666#issuecomment-495049729 retest this please
[GitHub] [spark] pengbo commented on issue #24666: [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page
pengbo commented on issue #24666: [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page URL: https://github.com/apache/spark/pull/24666#issuecomment-495073870 retest this please
[GitHub] [spark] HyukjinKwon commented on issue #24675: [SPARK-27803][SQL][PYTHON] Fix column pruning for Python UDF
HyukjinKwon commented on issue #24675: [SPARK-27803][SQL][PYTHON] Fix column pruning for Python UDF URL: https://github.com/apache/spark/pull/24675#issuecomment-495073865 makes sense to me.
[GitHub] [spark] cloud-fan commented on issue #24344: [SPARK-27440][SQL] Optimize uncorrelated predicate subquery
cloud-fan commented on issue #24344: [SPARK-27440][SQL] Optimize uncorrelated predicate subquery
URL: https://github.com/apache/spark/pull/24344#issuecomment-495070324

I think @dilipbiswal has a good point here. For non-correlated EXISTS/IN, it's a bad idea to collect all the data of a table to the driver side and do the calculation there.

That said, we should not have a physical version of EXISTS/IN; they always need to be converted to joins (sorry for the back and forth!).

But we do have a chance to optimize non-correlated EXISTS/IN. More generally, if a left semi/anti join has a condition that refers only to attributes from the right side, we can probably turn this join into a filter operator.
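The last point above can be illustrated with a small plain-Python sketch. It is only a toy model over lists of dicts, not Spark code, and all function and column names (`left_semi_join`, `semi_as_filter`, `v`, and so on) are hypothetical. It shows why a semi/anti join whose condition looks only at the right side behaves like a filter: the condition's outcome is the same for every left row, so it can be evaluated once.

```python
def left_semi_join(left, right, cond):
    """Keep each left row for which some right row satisfies cond(l, r)."""
    return [l for l in left if any(cond(l, r) for r in right)]


def left_anti_join(left, right, cond):
    """Keep each left row for which no right row satisfies cond(l, r)."""
    return [l for l in left if not any(cond(l, r) for r in right)]


def semi_as_filter(left, right, right_only_cond):
    """Same result as a semi join when the condition ignores the left row."""
    exists = any(right_only_cond(r) for r in right)  # evaluated once
    return [l for l in left if exists]


def anti_as_filter(left, right, right_only_cond):
    """Same result as an anti join when the condition ignores the left row."""
    exists = any(right_only_cond(r) for r in right)  # evaluated once
    return [l for l in left if not exists]


left = [{"id": 1}, {"id": 2}]
right = [{"v": 5}, {"v": 20}]
big = lambda r: r["v"] > 10  # refers only to the right side

print(left_semi_join(left, right, lambda l, r: big(r)) == semi_as_filter(left, right, big))
print(left_anti_join(left, right, lambda l, r: big(r)) == anti_as_filter(left, right, big))
```

In this model the filter keeps either all left rows or none of them, depending on a single existence check over the right side. How such an existence check would be planned in Spark is not specified in the comment.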
[GitHub] [spark] AmplabJenkins removed a comment on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API
AmplabJenkins removed a comment on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API URL: https://github.com/apache/spark/pull/24559#issuecomment-495068453 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105709/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API
AmplabJenkins removed a comment on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API URL: https://github.com/apache/spark/pull/24559#issuecomment-495068451 Merged build finished. Test PASSed.
[GitHub] [spark] dongjoon-hyun commented on issue #24680: [SPARK-26045][BUILD] Leave avro, avro-ipc dependendencies as compile scope even for hadoop-provided usages
dongjoon-hyun commented on issue #24680: [SPARK-26045][BUILD] Leave avro, avro-ipc dependendencies as compile scope even for hadoop-provided usages URL: https://github.com/apache/spark/pull/24680#issuecomment-495068392 I'll leave this PR here since @vanzin's review is requested. We need this in `master/2.4` branch.
[GitHub] [spark] AmplabJenkins commented on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API
AmplabJenkins commented on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API URL: https://github.com/apache/spark/pull/24559#issuecomment-495068451 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API
AmplabJenkins commented on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API URL: https://github.com/apache/spark/pull/24559#issuecomment-495068453 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105709/ Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API
SparkQA removed a comment on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API URL: https://github.com/apache/spark/pull/24559#issuecomment-495039716 **[Test build #105709 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105709/testReport)** for PR 24559 at commit [`21a5f07`](https://github.com/apache/spark/commit/21a5f074e3b564a353da28901c8d6cb107ec04c2).
[GitHub] [spark] SparkQA commented on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API
SparkQA commented on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API
URL: https://github.com/apache/spark/pull/24559#issuecomment-495068179

**[Test build #105709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105709/testReport)** for PR 24559 at commit [`21a5f07`](https://github.com/apache/spark/commit/21a5f074e3b564a353da28901c8d6cb107ec04c2).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
AmplabJenkins commented on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation. URL: https://github.com/apache/spark/pull/24617#issuecomment-495067333 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105708/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
AmplabJenkins commented on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation. URL: https://github.com/apache/spark/pull/24617#issuecomment-495067331 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
AmplabJenkins removed a comment on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation. URL: https://github.com/apache/spark/pull/24617#issuecomment-495067331 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
AmplabJenkins removed a comment on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation. URL: https://github.com/apache/spark/pull/24617#issuecomment-495067333 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105708/ Test PASSed.
[GitHub] [spark] SparkQA commented on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
SparkQA commented on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
URL: https://github.com/apache/spark/pull/24617#issuecomment-495067035

**[Test build #105708 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105708/testReport)** for PR 24617 at commit [`47d89d3`](https://github.com/apache/spark/commit/47d89d37a196e75173996adc6feb475a5c8ce87b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
SparkQA removed a comment on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation. URL: https://github.com/apache/spark/pull/24617#issuecomment-495038346 **[Test build #105708 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105708/testReport)** for PR 24617 at commit [`47d89d3`](https://github.com/apache/spark/commit/47d89d37a196e75173996adc6feb475a5c8ce87b).
[GitHub] [spark] AmplabJenkins removed a comment on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
AmplabJenkins removed a comment on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-495066701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105710/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
AmplabJenkins removed a comment on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-495066698 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
AmplabJenkins commented on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-495066698 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
AmplabJenkins commented on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-495066701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105710/ Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
SparkQA removed a comment on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-495045176 **[Test build #105710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105710/testReport)** for PR 24671 at commit [`3f79e89`](https://github.com/apache/spark/commit/3f79e89e00f920af959a6b979e736af5a43a93c7).
[GitHub] [spark] SparkQA commented on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
SparkQA commented on issue #24671: [SPARK-27811][Core][Docs]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
URL: https://github.com/apache/spark/pull/24671#issuecomment-495066402

**[Test build #105710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105710/testReport)** for PR 24671 at commit [`3f79e89`](https://github.com/apache/spark/commit/3f79e89e00f920af959a6b979e736af5a43a93c7).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] viirya commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
viirya commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
URL: https://github.com/apache/spark/pull/24682#discussion_r286774379

## File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
## @@ -930,6 +930,33 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }
+  test("support user provided non-nullable avro schema " +

Review comment:
Have we documented this behavior for `avroSchema`?
[GitHub] [spark] viirya commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
viirya commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
URL: https://github.com/apache/spark/pull/24682#discussion_r286774235

## File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
## @@ -930,6 +930,33 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }
+  test("support user provided non-nullable avro schema " +
+    "for nullable catalyst schema without any null record") {

Review comment:
It sounds good to have warning messages for this case, so that users know they are actually writing from a nullable Catalyst schema into a non-nullable Avro schema.
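The behavior discussed above (accept a non-nullable target schema for nullable source data, but warn about it) can be sketched in plain Python. This is a toy model, not the Avro writer's code: the function `write_column` and its parameters are hypothetical, and the sketch assumes that a null value reaching a non-nullable target field makes the write fail.

```python
import warnings


def write_column(values, source_nullable, target_nullable):
    """Toy writer for one column; returns the values it would write."""
    if source_nullable and not target_nullable:
        # Warn up front: the write only succeeds if no null actually occurs.
        warnings.warn(
            "writing a nullable column into a non-nullable target field; "
            "the write will fail if a null value is encountered"
        )
    written = []
    for value in values:
        if value is None and not target_nullable:
            raise ValueError("null value for a non-nullable target field")
        written.append(value)
    return written


# Succeeds (with a warning) because the data contains no nulls.
print(write_column([1, 2, 3], source_nullable=True, target_nullable=False))
```

With this shape, users who pass a stricter schema still get their data written when it happens to contain no nulls, and the warning tells them about the risk before a failing record shows up.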
[GitHub] [spark] zhengruifeng edited a comment on issue #24648: [SPARK-27777][ML] Eliminate uncessary sliding job in AreaUnderCurve
zhengruifeng edited a comment on issue #24648: [SPARK-27777][ML] Eliminate uncessary sliding job in AreaUnderCurve
URL: https://github.com/apache/spark/pull/24648#issuecomment-495060029

@srowen Oh, it is not a pass; my wording was not correct. Sliding needs a separate job to collect the head rows of each partition, and that job can be eliminated.

When the number of points is small, e.g. 1000, the difference is tiny. As shown in the first figure, only 0.8 sec is saved.
![image](https://user-images.githubusercontent.com/7322292/58225023-ac1eca00-7d52-11e9-997e-76821b2594fd.png)

Several reasons can result in more points in the curve:
1. when I want a more accurate score;
2. when we evaluate on a big dataset, the number of points easily exceeds 1000 even if we set `numBins`=1000, since the grouping in the curve is limited to within partitions, i.e. each partition will contain at least one point. In many practical cases there are tens of thousands of partitions, so there are tens of thousands of points.

As shown in the second figure, we set `numBins` to its default value and repartition the input data to 2000 partitions. Then the sliding job cannot be ignored.
![image](https://user-images.githubusercontent.com/7322292/58225172-6f070780-7d53-11e9-96f0-5b773b3e5a28.png)
[GitHub] [spark] zhengruifeng edited a comment on issue #24648: [SPARK-27777][ML] Eliminate uncessary sliding job in AreaUnderCurve
zhengruifeng edited a comment on issue #24648: [SPARK-27777][ML] Eliminate uncessary sliding job in AreaUnderCurve
URL: https://github.com/apache/spark/pull/24648#issuecomment-495060029

@srowen Oh, it is not a pass; my wording was not correct. Sliding needs a separate job to collect the head rows of each partition, and that job can be eliminated.

When the number of points is small, e.g. 1000, the difference is tiny. As shown in the first figure, only 0.8 sec is saved.
![image](https://user-images.githubusercontent.com/7322292/58225023-ac1eca00-7d52-11e9-997e-76821b2594fd.png)

Several reasons can result in more points in the curve:
1. when I want a more accurate score;
2. when we evaluate on a big dataset, the number of points easily exceeds 1000 even if we set `numBins`=1000, since the grouping in the curve is limited to within partitions, i.e. each partition will contain at least one point. In many practical cases there are tens of thousands of partitions, so there are tens of thousands of points.

As shown in the second figure, we set `numBins` to its default value and repartition the input data to 2000 partitions. Then the sliding job takes 12 sec, which is much longer than the computation time of AUC (2 s).
![image](https://user-images.githubusercontent.com/7322292/58225172-6f070780-7d53-11e9-96f0-5b773b3e5a28.png)
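To illustrate the idea in the comment above, here is a plain-Python sketch of computing the area under a curve without a separate step that gathers the head rows of every partition. It is not the PR's implementation; the helper names (`area_under_curve`, `partition_summary`, `auc_from_partitions`) are hypothetical, and partitions are modelled as plain lists of `(x, y)` points sorted by `x`. Each partition reports its local trapezoid area together with its first and last point, and the boundaries between neighbouring partitions are stitched afterwards.

```python
def area_under_curve(points):
    """Trapezoidal area under consecutive (x, y) points."""
    area = 0.0
    prev = None
    for x, y in points:
        if prev is not None:
            px, py = prev
            area += (x - px) * (y + py) / 2.0
        prev = (x, y)
    return area


def partition_summary(part):
    """One pass per partition: (local area, first point, last point)."""
    return area_under_curve(part), part[0], part[-1]


def auc_from_partitions(partitions):
    """Combine per-partition summaries, adding the boundary trapezoids."""
    total = 0.0
    prev_last = None
    for part in partitions:
        if not part:
            continue  # empty partitions contribute nothing
        local, first, last = partition_summary(part)
        if prev_last is not None:
            total += area_under_curve([prev_last, first])
        total += local
        prev_last = last
    return total


curve = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
print(area_under_curve(curve))
print(auc_from_partitions([[(0.0, 0.0)], [], [(0.5, 0.5), (1.0, 1.0)]]))
```

In this sketch the only data that leaves a partition is one number and two points, so the cost of combining results grows with the number of partitions rather than requiring an extra pass over the data.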
[GitHub] [spark] gengliangwang edited a comment on issue #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
gengliangwang edited a comment on issue #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide URL: https://github.com/apache/spark/pull/24682#issuecomment-495059043 This should not be a big concern. The file writing job is almost transactional since Spark follows the `FileCommitProtocol`. If a failure happens during a write, the intermediate output files won't show up in the target path.
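The "almost transactional" behavior described above follows a general write-then-commit pattern, which can be sketched in plain Python. This is not Spark's `FileCommitProtocol` implementation: the function `write_job`, the staging-directory prefix, and the `fail_at` parameter used to simulate a failure are all assumptions made for illustration.

```python
import os
import shutil
import tempfile


def write_job(target_dir, rows, fail_at=None):
    """Write rows to a staging file and move it into place only on success."""
    staging = tempfile.mkdtemp(prefix="_staging_", dir=target_dir)
    try:
        staged_file = os.path.join(staging, "part-00000")
        with open(staged_file, "w") as out:
            for index, row in enumerate(rows):
                if fail_at is not None and index == fail_at:
                    raise RuntimeError("simulated failure during write")
                out.write(f"{row}\n")
        # Commit: the file becomes visible in the target path in one step.
        os.replace(staged_file, os.path.join(target_dir, "part-00000"))
    finally:
        # Abort or cleanup: the staging directory never outlives the job.
        shutil.rmtree(staging, ignore_errors=True)


target = tempfile.mkdtemp()
try:
    write_job(target, ["a", "b", "c"], fail_at=1)
except RuntimeError:
    pass
print(os.listdir(target))  # nothing was committed by the failed job

write_job(target, ["a", "b", "c"])
print(os.listdir(target))
```

A failed job leaves the target path untouched because partially written data only ever exists in the staging directory, which is removed on both success and failure.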
[GitHub] [spark] zhengruifeng commented on issue #24648: [SPARK-27777][ML] Eliminate uncessary sliding job in AreaUnderCurve
zhengruifeng commented on issue #24648: [SPARK-27777][ML] Eliminate uncessary sliding job in AreaUnderCurve
URL: https://github.com/apache/spark/pull/24648#issuecomment-495060029

@srowen Oh, it is not a pass; my wording was not correct. Sliding needs a separate job to collect the head rows of each partition, and that job can be eliminated.

When the number of points is small, e.g. 1000, the difference is tiny. As shown in the first figure, only 0.8 sec is saved.
![image](https://user-images.githubusercontent.com/7322292/58225023-ac1eca00-7d52-11e9-997e-76821b2594fd.png)

Several reasons can result in more points in the curve:
1. when I want a more accurate score;
2. when we evaluate on a big dataset, the number of points easily exceeds 1000 even if we set `numBins`=1000, since the grouping in the curve is limited to within partitions, i.e. each partition will contain at least one point. In many practical cases there are tens of thousands of partitions, so there are tens of thousands of points.

As shown in the second figure, we set `numBins` to its default value and repartition the input data to 2000 partitions. Then the sliding job takes 12 sec, which is much longer than the computation time of AUC (2 s).
![image](https://user-images.githubusercontent.com/7322292/58225172-6f070780-7d53-11e9-96f0-5b773b3e5a28.png)
[GitHub] [spark] gengliangwang commented on issue #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
gengliangwang commented on issue #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide URL: https://github.com/apache/spark/pull/24682#issuecomment-495059043 This should not be a big concern. The file writing job is almost transactional since Spark follows the `FileCommitProtocol`. If failure happens during writes, the existing output file won't show up in the target path. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
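The "almost transactional" behavior described above can be pictured with a small stand-alone sketch. This is not Spark's `FileCommitProtocol` implementation; it only assumes the general pattern of writing to a staging file and moving it to the target path on success. The object name `CommitSketch` and the method `writeWithCommit` are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object CommitSketch {
  // Writes `lines` to a staging file next to `target` and moves it into place
  // only if every line was written. Returns true on commit, false on abort.
  def writeWithCommit(target: Path, lines: Iterator[String]): Boolean = {
    val dir = target.toAbsolutePath.getParent
    val staging = Files.createTempFile(dir, ".staging-", ".tmp")
    try {
      val writer = Files.newBufferedWriter(staging, StandardCharsets.UTF_8)
      try lines.foreach { line => writer.write(line); writer.newLine() }
      finally writer.close()
      // Commit: the output becomes visible under its final name only here.
      Files.move(staging, target, StandardCopyOption.REPLACE_EXISTING)
      true
    } catch {
      case _: Exception =>
        // Abort: discard the partial output so it never shows up at `target`.
        Files.deleteIfExists(staging)
        false
    }
  }
}
```

If the iterator throws halfway through, the target path is left untouched, which mirrors the claim that partial output does not appear in the target path after a failed write.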
[GitHub] [spark] gengliangwang commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
gengliangwang commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide URL: https://github.com/apache/spark/pull/24682#discussion_r286765012 ## File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ## @@ -930,6 +930,33 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils { } } + test("support user provided non-nullable avro schema " + +"for nullable catalyst schema without any null record") { Review comment: @cloud-fan I think this is fine. Otherwise, there is no way for users to write with a non-nullable schema. But should we show warning messages for such case? So that users can be aware of the risk. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
gengliangwang commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide URL: https://github.com/apache/spark/pull/24682#discussion_r286765012 ## File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ## @@ -930,6 +930,33 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils { } } + test("support user provided non-nullable avro schema " + +"for nullable catalyst schema without any null record") { Review comment: @cloud-fan I think this is fine. Otherwise, there is no way for users to write with a non-nullable schema. But should we show warning messages for such case? So that users can be aware of the risk. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun edited a comment on issue #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
dongjoon-hyun edited a comment on issue #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide URL: https://github.com/apache/spark/pull/24682#issuecomment-495054805 I understand the concern about the difference from our default `.schema` option. I believe this is the main reason why we add `.option("avroSchema", ...)`. For Avro, `nullable` column type is `"type": ["int", "null"]` and non-nullable column type is `"type": "int"` explicitly. For ORC/Parquet (DSv1/v2), everything is always nullable by default when reading. So, please don't worry about `.schema` use cases. This is a different option for different use cases. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] JkSelf commented on issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during broadcast join
JkSelf commented on issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during broadcast join URL: https://github.com/apache/spark/pull/21899#issuecomment-495055518 @beliefer Thanks for your work. Before we allocate `newPage` in `val newPage = new Array[Long](newNumWords.toInt)`, we already check the available memory with `ensureAcquireMemory(newNumWords * 8L)`, and we only create `newPage` if there is enough memory. If the memory is sufficient, why is the OOM exception thrown at `val newPage = new Array[Long](newNumWords.toInt)`? Thanks for your help. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] francis0407 commented on a change in pull request #24344: [SPARK-27440][SQL] Optimize uncorrelated predicate subquery
francis0407 commented on a change in pull request #24344: [SPARK-27440][SQL] Optimize uncorrelated predicate subquery
URL: https://github.com/apache/spark/pull/24344#discussion_r286764851

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala ##
@@ -55,6 +55,112 @@ object ExecSubqueryExpression {
   }
 }
 
+/**
+ * Exists is used to test for the existence of any record in a subquery.
+ *
+ * This is the physical copy of Exists to be used inside SparkPlan.
+ */
+case class Exists(
+    plan: BaseSubqueryExec,
+    exprId: ExprId)
+  extends ExecSubqueryExpression {
+
+  override def dataType: DataType = BooleanType
+  override def children: Seq[Expression] = Nil
+  override def nullable: Boolean = false
+  override def toString: String = plan.simpleString(SQLConf.get.maxToStringFields)
+  override def withNewPlan(plan: BaseSubqueryExec): Exists = copy(plan = plan)
+
+  // Whether the subquery returns one or more records
+  @volatile private var result: Boolean = _
+  @volatile private var updated: Boolean = false
+
+  def updateResult(): Unit = {
+    val rows = plan.executeCollect()
+    result = rows.nonEmpty
+    updated = true
+  }
+
+  override def eval(input: InternalRow): Boolean = {
+    require(updated, s"$this has not finished")
+    result
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    require(updated, s"$this has not finished")
+    Literal.create(result, BooleanType).doGenCode(ctx, ev)
+  }
+}
+
+/**
+ * Evaluates to `true` if `values` are returned in the subquery's result set.
+ * If `values` are not found in the subquery's result set, and there are nulls in
+ * `values` or the result set, it should return null.
+ * This is the physical copy of InSubquery to be used inside SparkPlan.
+ */
+case class InSubquery(
+    values: Seq[Literal],
+    plan: BaseSubqueryExec,
+    exprId: ExprId)
+  extends ExecSubqueryExpression {
+  override def dataType: DataType = BooleanType
+  override def children: Seq[Expression] = Nil
+  override def nullable: Boolean = true
+  override def toString: String = plan.simpleString(SQLConf.get.maxToStringFields)
+  override def withNewPlan(plan: BaseSubqueryExec): InSubquery = copy(plan = plan)
+
+  @volatile private var result: Boolean = _
+  @volatile private var isNull: Boolean = false
+  @volatile private var updated: Boolean = false
+
+  def updateResult(): Unit = {
+    val rows = plan.executeCollect()
+    // The semantic of '(a,b) in ((x1, y1), (x2, y2), ...)' is
+    // '(a = x1 and b = y1) or (a = x2 and b = y2) or ...'
+    val expression = rows.map(row => {

Review comment: I have updated this, could you please help check it? cc @dilipbiswal @cloud-fan

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
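The semantics quoted in the diff comment, '(a,b) in ((x1, y1), (x2, y2), ...)' as '(a = x1 and b = y1) or (a = x2 and b = y2) or ...' with a null result when nothing matches but nulls are involved, can be sketched with SQL three-valued logic. The following is a stand-alone illustration and not the code under review; it models NULL as `None`, and the names `InSubquerySketch`, `Tri`, `sqlEquals`, and `in` are hypothetical.

```scala
object InSubquerySketch {
  // SQL three-valued logic: Some(true), Some(false), or None for NULL.
  type Tri = Option[Boolean]

  def and(l: Tri, r: Tri): Tri = (l, r) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  def or(l: Tri, r: Tri): Tri = (l, r) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  // Comparing anything with NULL yields NULL.
  def sqlEquals(l: Option[Any], r: Option[Any]): Tri =
    for (a <- l; b <- r) yield a == b

  // (v1, v2, ...) IN rows: OR over rows of the AND of column-wise equalities.
  def in(values: Seq[Option[Any]], rows: Seq[Seq[Option[Any]]]): Tri =
    rows.map { row =>
      values.zip(row)
        .map { case (v, x) => sqlEquals(v, x) }
        .foldLeft(Some(true): Tri)(and)
    }.foldLeft(Some(false): Tri)(or)
}
```

For example, `in(Seq(Some(1)), Seq(Seq(Some(2)), Seq(None)))` evaluates to `None`: no row matches, but the comparison with the null row is unknown, so the overall result is NULL rather than false.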
[GitHub] [spark] dongjoon-hyun edited a comment on issue #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
dongjoon-hyun edited a comment on issue #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide URL: https://github.com/apache/spark/pull/24682#issuecomment-495054805 I understand the concern about the difference from our default `.schema` option. I believe this is the main reason why we add `.option("avroSchema", ...)`. For Avro, `nullable` column type is `"type": ["int", "null"]` and non-nullable column type is `"type": "int"` explicitly. For ORC/Parquet (DSv1/v2), everything is always nullable by default when reading. So, please don't worry about `.schema` use cases. This is a different use case. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on issue #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
dongjoon-hyun commented on issue #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide URL: https://github.com/apache/spark/pull/24682#issuecomment-495054805 I understand the concern about the difference from our default `.schema` option. I believe this is the main reason why we add `.option("avroSchema", ...)`. For Avro, `nullable` column type is `"type": ["int", "null"]` and non-nullable column type is `"type": "int"` explicitly. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
dongjoon-hyun commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide URL: https://github.com/apache/spark/pull/24682#discussion_r286763767 ## File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ## @@ -930,6 +930,33 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils { } } + test("support user provided non-nullable avro schema " + Review comment: `catalyst` schema is always nullable when we read from the file. This is a special support for `.option("avroSchema", ...)` for Avro. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
dongjoon-hyun commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide URL: https://github.com/apache/spark/pull/24682#discussion_r286763619 ## File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ## @@ -930,6 +930,33 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils { } } + test("support user provided non-nullable avro schema " + +"for nullable catalyst schema without any null record") { Review comment: For `.schema()` option, we always enforce `nullable` by using `dataSchema.asNullable` in `FileTable`. For me, this is a special support for `.option("avroSchema", "")`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #23791: [SPARK-20597][SQL][SS][WIP] KafkaSourceProvider falls back on path as synonym for topic
AmplabJenkins commented on issue #23791: [SPARK-20597][SQL][SS][WIP] KafkaSourceProvider falls back on path as synonym for topic URL: https://github.com/apache/spark/pull/23791#issuecomment-495053513 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
HyukjinKwon commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide URL: https://github.com/apache/spark/pull/24682#discussion_r286763265 ## File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ## @@ -930,6 +930,33 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils { } } + test("support user provided non-nullable avro schema " + Review comment: BTW, note that we don't currently support non-nullable schemas in file format sources because they are turned into nullable ones in the SQL batch code path. Non-nullable can be set in SS though. Both code paths should be matched; it's a long-standing issue. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #23791: [SPARK-20597][SQL][SS][WIP] KafkaSourceProvider falls back on path as synonym for topic
AmplabJenkins removed a comment on issue #23791: [SPARK-20597][SQL][SS][WIP] KafkaSourceProvider falls back on path as synonym for topic URL: https://github.com/apache/spark/pull/23791#issuecomment-463663060 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
HyukjinKwon commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide URL: https://github.com/apache/spark/pull/24682#discussion_r286762917 ## File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ## @@ -930,6 +930,33 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils { } } + test("support user provided non-nullable avro schema " + Review comment: This doesn't quite make sense to me. It looks like, if the catalyst schema is nullable, it should reject a non-nullable avro schema. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on issue #24672: [SPARK-27801] Improve performance of InMemoryFileIndex.listLeafFiles for HDFS directories with many files
wangyum commented on issue #24672: [SPARK-27801] Improve performance of InMemoryFileIndex.listLeafFiles for HDFS directories with many files URL: https://github.com/apache/spark/pull/24672#issuecomment-495052170 Thank you @rrusso2007 @JoshRosen I did a simple benchmark in our production environment (Hadoop version is 2.7.1):
```
19/05/22 19:53:18 WARN InMemoryFileIndex: Elements: 10. default time token: 41, SPARK-27801 time token: 18, SPARK-27807 time token: 30
19/05/22 19:53:29 WARN InMemoryFileIndex: Elements: 20. default time token: 22, SPARK-27801 time token: 10, SPARK-27807 time token: 24
19/05/22 19:53:30 WARN InMemoryFileIndex: Elements: 50. default time token: 47, SPARK-27801 time token: 13, SPARK-27807 time token: 25
19/05/22 19:53:33 WARN InMemoryFileIndex: Elements: 100. default time token: 54, SPARK-27801 time token: 10, SPARK-27807 time token: 30
19/05/22 19:53:42 WARN InMemoryFileIndex: Elements: 200. default time token: 86, SPARK-27801 time token: 19, SPARK-27807 time token: 40
19/05/22 19:53:52 WARN InMemoryFileIndex: Elements: 500. default time token: 254, SPARK-27801 time token: 30, SPARK-27807 time token: 90
19/05/22 19:54:06 WARN InMemoryFileIndex: Elements: 1000. default time token: 507, SPARK-27801 time token: 165, SPARK-27807 time token: 117
19/05/22 19:54:18 WARN InMemoryFileIndex: Elements: 2000. default time token: 1193, SPARK-27801 time token: 114, SPARK-27807 time token: 216
19/05/22 19:54:34 WARN InMemoryFileIndex: Elements: 5000. default time token: 2401, SPARK-27801 time token: 430, SPARK-27807 time token: 565
19/05/22 19:54:56 WARN InMemoryFileIndex: Elements: 1. default time token: 4831, SPARK-27801 time token: 646, SPARK-27807 time token: 1202
19/05/22 19:55:40 WARN InMemoryFileIndex: Elements: 2. default time token: 9121, SPARK-27801 time token: 1535, SPARK-27807 time token: 1920
19/05/22 19:56:45 WARN InMemoryFileIndex: Elements: 4. default time token: 18873, SPARK-27801 time token: 2784, SPARK-27807 time token: 3997
19/05/22 19:58:18 WARN InMemoryFileIndex: Elements: 8. default time token: 33658, SPARK-27801 time token: 6476, SPARK-27807 time token: 8326
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum closed pull request #24679: [SPARK-27807][SQL] Parallel resolve leaf statuses InMemoryFileIndex
wangyum closed pull request #24679: [SPARK-27807][SQL] Parallel resolve leaf statuses InMemoryFileIndex URL: https://github.com/apache/spark/pull/24679 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pengbo removed a comment on issue #24666: [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page
pengbo removed a comment on issue #24666: [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page URL: https://github.com/apache/spark/pull/24666#issuecomment-495044115 Retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pengbo commented on issue #24666: [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page
pengbo commented on issue #24666: [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page URL: https://github.com/apache/spark/pull/24666#issuecomment-495049729 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pengbo removed a comment on issue #24666: [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page
pengbo removed a comment on issue #24666: [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page URL: https://github.com/apache/spark/pull/24666#issuecomment-495042381 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24628: [SPARK-27749][SQL][test-hadoop3.2] hadoop-3.2 support hive-thriftserver
AmplabJenkins removed a comment on issue #24628: [SPARK-27749][SQL][test-hadoop3.2] hadoop-3.2 support hive-thriftserver URL: https://github.com/apache/spark/pull/24628#issuecomment-495046736 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24628: [SPARK-27749][SQL][test-hadoop3.2] hadoop-3.2 support hive-thriftserver
AmplabJenkins removed a comment on issue #24628: [SPARK-27749][SQL][test-hadoop3.2] hadoop-3.2 support hive-thriftserver URL: https://github.com/apache/spark/pull/24628#issuecomment-495046742 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105707/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24628: [SPARK-27749][SQL][test-hadoop3.2] hadoop-3.2 support hive-thriftserver
AmplabJenkins commented on issue #24628: [SPARK-27749][SQL][test-hadoop3.2] hadoop-3.2 support hive-thriftserver URL: https://github.com/apache/spark/pull/24628#issuecomment-495046736 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24628: [SPARK-27749][SQL][test-hadoop3.2] hadoop-3.2 support hive-thriftserver
AmplabJenkins commented on issue #24628: [SPARK-27749][SQL][test-hadoop3.2] hadoop-3.2 support hive-thriftserver URL: https://github.com/apache/spark/pull/24628#issuecomment-495046742 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105707/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #24628: [SPARK-27749][SQL][test-hadoop3.2] hadoop-3.2 support hive-thriftserver
SparkQA removed a comment on issue #24628: [SPARK-27749][SQL][test-hadoop3.2] hadoop-3.2 support hive-thriftserver URL: https://github.com/apache/spark/pull/24628#issuecomment-495024200 **[Test build #105707 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105707/testReport)** for PR 24628 at commit [`a0e52aa`](https://github.com/apache/spark/commit/a0e52aae93fd8c1b3a3b1931b2102943cb0202a4). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #24628: [SPARK-27749][SQL][test-hadoop3.2] hadoop-3.2 support hive-thriftserver
SparkQA commented on issue #24628: [SPARK-27749][SQL][test-hadoop3.2] hadoop-3.2 support hive-thriftserver URL: https://github.com/apache/spark/pull/24628#issuecomment-495046418 **[Test build #105707 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105707/testReport)** for PR 24628 at commit [`a0e52aa`](https://github.com/apache/spark/commit/a0e52aae93fd8c1b3a3b1931b2102943cb0202a4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #24671: [MINOR][DOCS]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
SparkQA commented on issue #24671: [MINOR][DOCS]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-495045176 **[Test build #105710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105710/testReport)** for PR 24671 at commit [`3f79e89`](https://github.com/apache/spark/commit/3f79e89e00f920af959a6b979e736af5a43a93c7). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24671: [MINOR][DOCS]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
AmplabJenkins removed a comment on issue #24671: [MINOR][DOCS]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-495044831 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24671: [MINOR][DOCS]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
AmplabJenkins removed a comment on issue #24671: [MINOR][DOCS]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-495044835 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/10966/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24671: [MINOR][DOCS]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
AmplabJenkins commented on issue #24671: [MINOR][DOCS]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-495044835 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/10966/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24671: [MINOR][DOCS]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
AmplabJenkins commented on issue #24671: [MINOR][DOCS]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-495044831 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on issue #24671: [MINOR][DOCS]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead.
beliefer commented on issue #24671: [MINOR][DOCS]Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. URL: https://github.com/apache/spark/pull/24671#issuecomment-495044178 Retest this please. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pengbo commented on issue #24666: [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page
pengbo commented on issue #24666: [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page URL: https://github.com/apache/spark/pull/24666#issuecomment-495044115 Retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #24647: [SPARK-27776][SQL]Avoid duplicate Java reflection in DataSource.
beliefer commented on a change in pull request #24647: [SPARK-27776][SQL]Avoid duplicate Java reflection in DataSource. URL: https://github.com/apache/spark/pull/24647#discussion_r286753702 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ## @@ -105,6 +105,8 @@ case class DataSource( case _ => cls } } + private def providingInstance = providingClass.getConstructor().newInstance() Review comment: If we add a return type, only `Any` can be used here. Since this method is private, can the return type be omitted? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
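The point in the review comment above, that a reflective `getConstructor().newInstance()` call has no static type more specific than `Any`, can be illustrated with a small Java sketch. This is only an illustration under stated assumptions: `ProvidingInstanceSketch` and `newProvidingInstance` are hypothetical names, `StringBuilder` stands in for whatever class is resolved at runtime, and the code is not taken from the pull request. In Java the corresponding static type is `Object`.

```java
final class ProvidingInstanceSketch {
    // Hypothetical stand-in for the reflective call in the diff: the class is
    // only known at runtime, so it is passed as Class<?>.
    static Object newProvidingInstance(Class<?> providingClass) {
        try {
            // newInstance() on a Constructor<?> returns a wildcard capture, so
            // Object is the most specific static type (Any on the Scala side).
            return providingClass.getConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // Wrap the checked reflection exceptions to keep callers simple.
            throw new IllegalStateException("Cannot instantiate " + providingClass.getName(), e);
        }
    }

    public static void main(String[] args) {
        Object instance = newProvidingInstance(StringBuilder.class);
        // The runtime class stays precise even though the static type is not.
        System.out.println(instance.getClass().getName());
    }
}
```

Because the declared type carries no extra information either way, writing it out or leaving it to inference on a private method is mostly a style question.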
[GitHub] [spark] habren commented on issue #24663: [SPARK-27792][SQL] SkewJoin--handle only skewed keys with broadcastjoin
habren commented on issue #24663: [SPARK-27792][SQL] SkewJoin--handle only skewed keys with broadcastjoin URL: https://github.com/apache/spark/pull/24663#issuecomment-495043133 @viirya Could you please review this pull request? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pengbo removed a comment on issue #24666: [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page
pengbo removed a comment on issue #24666: [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page URL: https://github.com/apache/spark/pull/24666#issuecomment-494819839 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pengbo commented on issue #24666: [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page
pengbo commented on issue #24666: [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page URL: https://github.com/apache/spark/pull/24666#issuecomment-495042381 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sjrand edited a comment on issue #24645: [SPARK-27773][Shuffle] add metrics for number of exceptions caught in ExternalShuffleBlockHandler
sjrand edited a comment on issue #24645: [SPARK-27773][Shuffle] add metrics for number of exceptions caught in ExternalShuffleBlockHandler URL: https://github.com/apache/spark/pull/24645#issuecomment-495040379

On the client (executor) side we were seeing lots of timeouts, e.g.:

```
ERROR [2019-05-16T18:34:57.782Z] org.apache.spark.storage.BlockManager: Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
java.io.IOException: Failed to connect to /:7337
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:250)
    at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:206)
    at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
    at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:300)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
    at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:297)
    at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:271)
    at org.apache.spark.executor.Executor.<init>(Executor.scala:121)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:92)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:222)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: /:7337
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection timed out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
```

And in the NodeManager logs we were seeing lots of `ClosedChannelException` errors from netty, along with the occasional `java.io.IOException: Broken pipe` error. For example:

```
2019-05-16 05:13:17,999 ERROR org.apache.spark.network.server.TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1647907385644, chunkIndex=22}, buffer=FileSegmentManagedBuffer{file=/scratch/hadoop/tmp/nm-local-dir/usercache//appcache/application_1557300039674_635976/blockmgr-0ec1d292-3e75-40bd-afd3-79314f427338/11/shuffle_5_3900_0.data, offset=12387017, length=1235}} to /:35922; closing connection
java.nio.channels.ClosedChannelException
    at org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
```

We confirmed that the
[GitHub] [spark] sjrand commented on issue #24645: [SPARK-27773][Shuffle] add metrics for number of exceptions caught in ExternalShuffleBlockHandler
sjrand commented on issue #24645: [SPARK-27773][Shuffle] add metrics for number of exceptions caught in ExternalShuffleBlockHandler URL: https://github.com/apache/spark/pull/24645#issuecomment-495040379

On the client (executor) side we were seeing lots of timeouts, e.g.:

```
ERROR [2019-05-16T18:34:57.782Z] org.apache.spark.storage.BlockManager: Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
java.io.IOException: Failed to connect to /:7337
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:250)
    at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:206)
    at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
    at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:300)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
    at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:297)
    at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:271)
    at org.apache.spark.executor.Executor.<init>(Executor.scala:121)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:92)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:222)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: /:7337
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection timed out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
```

And in the NodeManager logs we were seeing lots of `ClosedChannelException` errors from netty, along with the occasional `java.io.IOException: Broken pipe` error.

We confirmed that the `shuffle-server` threads were still alive in the NM and took thread dumps, but we weren't able to determine what the issue was. In the end we just restarted the NodeManagers and this fixed the problem. I didn't create a JIRA for this just because I don't think the information I have so far is enough to be actionable. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With
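The pull request this comment belongs to is about adding a metric for the number of exceptions caught in `ExternalShuffleBlockHandler`, which would make server-side errors like the `ClosedChannelException` and `Broken pipe` cases described above visible without digging through NodeManager logs. A minimal Java sketch of that idea follows. It is an assumption-laden illustration, not the PR's code: the class and method names are hypothetical, and it uses the JDK's `LongAdder` rather than whichever metrics library the PR wires into.

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;
import java.util.concurrent.atomic.LongAdder;

final class ExceptionCountingHandlerSketch {
    // Hypothetical counter; LongAdder keeps increments cheap when many
    // network threads report errors concurrently.
    private final LongAdder caughtExceptions = new LongAdder();

    // Hypothetical hook, called whenever the handler catches an error.
    void exceptionCaught(Throwable cause) {
        caughtExceptions.increment();
    }

    // Value that a metrics reporter could poll periodically.
    long caughtExceptionCount() {
        return caughtExceptions.sum();
    }

    public static void main(String[] args) {
        ExceptionCountingHandlerSketch handler = new ExceptionCountingHandlerSketch();
        handler.exceptionCaught(new ClosedChannelException());
        handler.exceptionCaught(new IOException("Broken pipe"));
        System.out.println(handler.caughtExceptionCount());
    }
}
```

A single aggregate counter is the simplest choice; splitting the count by exception class would be a natural extension if the two error types above need to be told apart.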
[GitHub] [spark] SparkQA commented on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API
SparkQA commented on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API URL: https://github.com/apache/spark/pull/24559#issuecomment-495039716 **[Test build #105709 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105709/testReport)** for PR 24559 at commit [`21a5f07`](https://github.com/apache/spark/commit/21a5f074e3b564a353da28901c8d6cb107ec04c2). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API
AmplabJenkins removed a comment on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API URL: https://github.com/apache/spark/pull/24559#issuecomment-495039354 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API
AmplabJenkins removed a comment on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API URL: https://github.com/apache/spark/pull/24559#issuecomment-495039363 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/10965/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API
AmplabJenkins commented on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API URL: https://github.com/apache/spark/pull/24559#issuecomment-495039363 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/10965/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API
AmplabJenkins commented on issue #24559: [SPARK-27658][SQL] Add FunctionCatalog API URL: https://github.com/apache/spark/pull/24559#issuecomment-495039354 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
SparkQA commented on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation. URL: https://github.com/apache/spark/pull/24617#issuecomment-495038346 **[Test build #105708 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105708/testReport)** for PR 24617 at commit [`47d89d3`](https://github.com/apache/spark/commit/47d89d37a196e75173996adc6feb475a5c8ce87b). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
AmplabJenkins removed a comment on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation. URL: https://github.com/apache/spark/pull/24617#issuecomment-495038020 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
AmplabJenkins removed a comment on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation. URL: https://github.com/apache/spark/pull/24617#issuecomment-495038025 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/10964/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
AmplabJenkins commented on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation. URL: https://github.com/apache/spark/pull/24617#issuecomment-495038020 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
AmplabJenkins commented on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation. URL: https://github.com/apache/spark/pull/24617#issuecomment-495038025 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/10964/ Test PASSed.
[GitHub] [spark] rdblue commented on issue #24233: [SPARK-26356][SQL] remove SaveMode from data source v2
rdblue commented on issue #24233: [SPARK-26356][SQL] remove SaveMode from data source v2 URL: https://github.com/apache/spark/pull/24233#issuecomment-495037807 @cloud-fan, I don't recall that conclusion from a sync. Can you quote from the notes that you're talking about? I'm fine fixing this in a follow-up, as long as there's a blocker filed so that this doesn't go into the 3.0 release.
[GitHub] [spark] rdblue commented on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
rdblue commented on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation. URL: https://github.com/apache/spark/pull/24617#issuecomment-495037408 @mccheah, I made the changes you requested. Should be good to go when tests pass.
[GitHub] [spark] jiangxb1987 commented on issue #24605: [SPARK-27711][CORE] Unset InputFileBlockHolder at the end of tasks
jiangxb1987 commented on issue #24605: [SPARK-27711][CORE] Unset InputFileBlockHolder at the end of tasks URL: https://github.com/apache/spark/pull/24605#issuecomment-495034840 Thanks! Merged to master, please manually backport to 2.4!
[GitHub] [spark] jiangxb1987 closed pull request #24605: [SPARK-27711][CORE] Unset InputFileBlockHolder at the end of tasks
jiangxb1987 closed pull request #24605: [SPARK-27711][CORE] Unset InputFileBlockHolder at the end of tasks URL: https://github.com/apache/spark/pull/24605
[GitHub] [spark] jiangxb1987 commented on a change in pull request #24615: [SPARK-27488][CORE] Driver interface to support GPU resources
jiangxb1987 commented on a change in pull request #24615: [SPARK-27488][CORE] Driver interface to support GPU resources URL: https://github.com/apache/spark/pull/24615#discussion_r286741051 ## File path: docs/configuration.md ## @@ -187,6 +187,25 @@ of the most common options to set are: This option is currently supported on YARN, Mesos and Kubernetes. + + spark.driver.resource.{resourceName}.count + 0 + +The number of a particular resource type to use on the driver. +If this is used, you must also specify the +spark.driver.resource.{resourceName}.discoveryScript Review comment: Do we want to mention `spark.driver.resource.{resourceName}.addresses` here?
[GitHub] [spark] mccheah edited a comment on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
mccheah edited a comment on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation. URL: https://github.com/apache/spark/pull/24617#issuecomment-495027809 Looks good, about what I would expect apart from some small changes.
[GitHub] [spark] mccheah commented on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
mccheah commented on issue #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation. URL: https://github.com/apache/spark/pull/24617#issuecomment-495027809 Looks good, about what we would expect apart from some small changes.
[GitHub] [spark] mccheah commented on a change in pull request #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
mccheah commented on a change in pull request #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation. URL: https://github.com/apache/spark/pull/24617#discussion_r286740813 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateTableExec.scala ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.v2 + +import scala.collection.JavaConverters._ + +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.catalog.v2.{Identifier, TableCatalog} +import org.apache.spark.sql.catalog.v2.expressions.Transform +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException +import org.apache.spark.sql.catalyst.expressions.Attribute +import org.apache.spark.sql.execution.LeafExecNode +import org.apache.spark.sql.types.StructType + +case class CreateTableExec( +catalog: TableCatalog, +identifier: Identifier, +tableSchema: StructType, +partitioning: Seq[Transform], +tableProperties: Map[String, String], +ignoreIfExists: Boolean) extends LeafExecNode { + + override protected def doExecute(): RDD[InternalRow] = { +def create(): Unit = { + catalog.createTable(identifier, tableSchema, partitioning.toArray, tableProperties.asJava) +} + +if (!catalog.tableExists(identifier)) { + if (ignoreIfExists) { Review comment: I think this can be simplified a bit:
```
try {
  create()
} catch {
  case e: TableAlreadyExistsException if ignoreIfExists =>
    logInfo("Table was created concurrently. Ignoring.", e)
}
```
This removes the need to have two branches that both call `create()` and differ only in that one has a try-catch clause.
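The guarded-catch pattern suggested in this review comment can be tried outside Spark. The following is only a minimal sketch under stated assumptions: `InMemoryCatalog`, this local `TableAlreadyExistsException`, and `CreateTableSketch` are hypothetical stand-ins invented for illustration, not Spark's `TableCatalog` API, and the sketch leaves out the surrounding `tableExists` check from the diff. It shows only that a pattern guard (`if ignoreIfExists`) swallows the exception when the flag is set and lets it propagate otherwise.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Spark's TableAlreadyExistsException.
final class TableAlreadyExistsException(name: String)
  extends Exception(s"Table $name already exists")

// Hypothetical in-memory catalog; not Spark's TableCatalog.
final class InMemoryCatalog {
  private val tables = mutable.Set.empty[String]

  def tableExists(name: String): Boolean = tables.contains(name)

  def createTable(name: String): Unit = {
    // mutable.Set.add returns false when the element was already present.
    if (!tables.add(name)) throw new TableAlreadyExistsException(name)
  }
}

object CreateTableSketch {
  /** Returns true if this call created the table, false if an existing table was tolerated. */
  def createTable(catalog: InMemoryCatalog, name: String, ignoreIfExists: Boolean): Boolean = {
    try {
      catalog.createTable(name)
      true
    } catch {
      // The guard means the exception is only handled when ignoreIfExists is true;
      // otherwise no case matches and it propagates to the caller.
      case e: TableAlreadyExistsException if ignoreIfExists =>
        println(s"Table $name already existed. Ignoring: ${e.getMessage}")
        false
    }
  }
}
```

A single `try` with a guard keeps one call site for the create operation, which is the simplification the comment asks for; whether the real operator should also keep its up-front existence check is a separate question that this sketch does not address.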
[GitHub] [spark] mccheah commented on a change in pull request #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation.
mccheah commented on a change in pull request #24617: [SPARK-27732][SQL] Add v2 CreateTable implementation. URL: https://github.com/apache/spark/pull/24617#discussion_r286739814 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateTableExec.scala ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.v2 + +import scala.collection.JavaConverters._ + +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.catalog.v2.{Identifier, TableCatalog} +import org.apache.spark.sql.catalog.v2.expressions.Transform +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException +import org.apache.spark.sql.catalyst.expressions.Attribute +import org.apache.spark.sql.execution.LeafExecNode +import org.apache.spark.sql.types.StructType + +case class CreateTableExec( +catalog: TableCatalog, +identifier: Identifier, +tableSchema: StructType, +partitioning: Seq[Transform], +tableProperties: Map[String, String], +ignoreIfExists: Boolean) extends LeafExecNode { + + override protected def doExecute(): RDD[InternalRow] = { +def create(): Unit = { + catalog.createTable(identifier, tableSchema, partitioning.toArray, tableProperties.asJava) +} + +if (!catalog.tableExists(identifier)) { + if (ignoreIfExists) { +try { + create() +} catch { + case _: TableAlreadyExistsException => +// ignore the table that was created after checking existence Review comment: It might be worth adding a simple log at INFO level, including the exception, to indicate that there was a concurrent create. I'm naturally wary of swallowing exceptions without logging them.
[GitHub] [spark] cloud-fan commented on a change in pull request #24675: [SPARK-27803][SQL] fix column pruning for python UDF
cloud-fan commented on a change in pull request #24675: [SPARK-27803][SQL] fix column pruning for python UDF URL: https://github.com/apache/spark/pull/24675#discussion_r286738983 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ## @@ -226,22 +214,4 @@ object ExtractPythonUDFs extends Rule[LogicalPlan] with PredicateHelper { } } } - - // Split the original FilterExec to two FilterExecs. Only push down the first few predicates - // that are all deterministic. - private def trySplitFilter(plan: LogicalPlan): LogicalPlan = { Review comment: Quote from the PR description: > There are some hacks in the ExtractPythonUDFs rule, to duplicate the column pruning and filter pushdown logic. However, it has some bugs as demonstrated in the new test case (only column pruning is broken). This PR removes the hacks and re-applies the column pruning and filter pushdown rules explicitly.
[GitHub] [spark] cloud-fan commented on a change in pull request #24675: [SPARK-27803][SQL] fix column pruning for python UDF
cloud-fan commented on a change in pull request #24675: [SPARK-27803][SQL] fix column pruning for python UDF URL: https://github.com/apache/spark/pull/24675#discussion_r286738848 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -1143,6 +1143,8 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper { case _: Repartition => true case _: ScriptTransformation => true case _: Sort => true +case _: BatchEvalPython => true Review comment: This defines the nodes that we can push filters through.
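To make the idea of "nodes that we can push filters through" concrete, here is a toy sketch. Everything in it is hypothetical: the `Plan` ADT, `EvalPython`, and `PushDownSketch` are simplified stand-ins invented for illustration and are not Catalyst's classes or the actual `PushDownPredicate` rule. It assumes a filter is described only by the set of column names it references, and that a filter may move below a Python-evaluation node only when it does not reference any attribute that node produces.

```scala
// Toy logical plan; hypothetical stand-ins, not Catalyst's LogicalPlan hierarchy.
sealed trait Plan
final case class Scan(columns: Set[String]) extends Plan
final case class Sort(child: Plan) extends Plan
final case class Limit(n: Int, child: Plan) extends Plan
final case class EvalPython(resultAttrs: Set[String], child: Plan) extends Plan
final case class Filter(references: Set[String], child: Plan) extends Plan

object PushDownSketch {
  // Mirrors the shape of the allow-list in the diff: only listed node types qualify.
  def canPushThrough(p: Plan): Boolean = p match {
    case _: Sort => true
    case _: EvalPython => true
    case _ => false
  }

  // Pushes a Filter below its child when the child is on the allow-list and,
  // for EvalPython, when the filter does not depend on the UDF results.
  def pushDown(plan: Plan): Plan = plan match {
    case Filter(refs, s @ Sort(child)) if canPushThrough(s) =>
      Sort(pushDown(Filter(refs, child)))
    case Filter(refs, e @ EvalPython(attrs, child))
        if canPushThrough(e) && refs.intersect(attrs).isEmpty =>
      EvalPython(attrs, pushDown(Filter(refs, child)))
    case other => other
  }
}
```

In this sketch a filter on column `a` moves below `EvalPython(Set("udf_out"), ...)`, while a filter on `udf_out` stays above it, and a filter above `Limit` is left alone because `Limit` is not on the allow-list.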
[GitHub] [spark] cloud-fan commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
cloud-fan commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide URL: https://github.com/apache/spark/pull/24682#discussion_r286738681 ## File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ## @@ -930,6 +930,33 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils { } } + test("support user provided non-nullable avro schema " + +"for nullable catalyst schema without any null record") { Review comment: Do Parquet/ORC have the same behavior? It seems better to forbid this up front; otherwise we need to do a null check at runtime, which may fail a long-running query midway.
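The alternative raised here, rejecting the mismatch before the write starts instead of checking for nulls row by row, could look roughly like the sketch below. It is an illustration only: `Field` and `NullabilityCheckSketch` are hypothetical names, fields are matched by name, and nothing here reflects how the Avro data source actually validates schemas.

```scala
// Hypothetical, simplified field description: a name plus a nullability flag.
final case class Field(name: String, nullable: Boolean)

object NullabilityCheckSketch {
  /**
   * Returns the names of fields that are nullable in the source (Catalyst-like)
   * schema but non-nullable in the user-provided target schema. An empty result
   * means the write can proceed without per-row null checks.
   */
  def incompatibleFields(source: Seq[Field], target: Seq[Field]): Seq[String] = {
    val targetByName = target.map(f => f.name -> f).toMap
    source.collect {
      case f if f.nullable && targetByName.get(f.name).exists(!_.nullable) => f.name
    }
  }
}
```

A caller could fail fast when the returned list is non-empty, which trades the flexibility tested in the diff (nullable schema with no actual null records) for an error at planning time instead of partway through a long-running job.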
[GitHub] [spark] cloud-fan commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
cloud-fan commented on a change in pull request #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide URL: https://github.com/apache/spark/pull/24682#discussion_r286738727 ## File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ## @@ -930,6 +930,33 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils { } } + test("support user provided non-nullable avro schema " + +"for nullable catalyst schema without any null record") { Review comment: cc @dongjoon-hyun @gengliangwang
[GitHub] [spark] SparkQA commented on issue #24628: [SPARK-27749][SQL][test-hadoop3.2] hadoop-3.2 support hive-thriftserver
SparkQA commented on issue #24628: [SPARK-27749][SQL][test-hadoop3.2] hadoop-3.2 support hive-thriftserver URL: https://github.com/apache/spark/pull/24628#issuecomment-495024200 **[Test build #105707 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105707/testReport)** for PR 24628 at commit [`a0e52aa`](https://github.com/apache/spark/commit/a0e52aae93fd8c1b3a3b1931b2102943cb0202a4).
[GitHub] [spark] AmplabJenkins removed a comment on issue #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
AmplabJenkins removed a comment on issue #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide URL: https://github.com/apache/spark/pull/24682#issuecomment-495023973 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/105706/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide
AmplabJenkins removed a comment on issue #24682: [SPARK-27762][SQL] [FOLLOWUP] Add behavior change for Avro writer in migration guide URL: https://github.com/apache/spark/pull/24682#issuecomment-495023966 Merged build finished. Test PASSed.