[GitHub] peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query
peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query
URL: https://github.com/apache/spark/pull/23531#discussion_r251307934

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -1530,6 +1650,18 @@ class Analyzer(
 object GlobalAggregates extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
     case Project(projectList, child) if containsAggregates(projectList) =>
+
+      def traversePlanAndCheck(plan: LogicalPlan): Unit =

Review comment: Ok, then I will try to move all the checks into that rule.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] httfighter commented on issue #23649: [SPARK-26726] The amount of memory used by the broadcast variable is …
httfighter commented on issue #23649: [SPARK-26726] The amount of memory used by the broadcast variable is …
URL: https://github.com/apache/spark/pull/23649#issuecomment-458029276

@vanzin Thank you for your review! I have added the test case.
[GitHub] httfighter commented on a change in pull request #23649: [SPARK-26726] The amount of memory used by the broadcast variable is …
httfighter commented on a change in pull request #23649: [SPARK-26726] The amount of memory used by the broadcast variable is …
URL: https://github.com/apache/spark/pull/23649#discussion_r251307769

## File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala

@@ -995,6 +996,30 @@ private[spark] class AppStatusListener(
     }
   }

+  def updateBroadcastBlock(event: SparkListenerBlockUpdated, broadcast: BroadcastBlockId): Unit = {

Review comment: Thank you for your review! I have submitted a new change.
[GitHub] felixcheung commented on issue #23514: [SPARK-24902][K8s] Add PV integration tests
felixcheung commented on issue #23514: [SPARK-24902][K8s] Add PV integration tests
URL: https://github.com/apache/spark/pull/23514#issuecomment-458027865

@shaneknapp ^^
[GitHub] felixcheung commented on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0
felixcheung commented on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0
URL: https://github.com/apache/spark/pull/23657#issuecomment-458026691

probably a good idea - arrow moves quickly; 0.10 is kinda "dated"
[GitHub] AmplabJenkins removed a comment on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise
AmplabJenkins removed a comment on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise
URL: https://github.com/apache/spark/pull/23669#issuecomment-458023244

Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7479/
Test PASSed.
[GitHub] AmplabJenkins commented on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise
AmplabJenkins commented on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise
URL: https://github.com/apache/spark/pull/23669#issuecomment-458023240

Merged build finished. Test PASSed.
[GitHub] AmplabJenkins removed a comment on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise
AmplabJenkins removed a comment on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise
URL: https://github.com/apache/spark/pull/23669#issuecomment-458023240

Merged build finished. Test PASSed.
[GitHub] AmplabJenkins commented on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise
AmplabJenkins commented on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise
URL: https://github.com/apache/spark/pull/23669#issuecomment-458023244

Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7479/
Test PASSed.
[GitHub] srowen commented on a change in pull request #23580: [SPARK-26660]Add warning logs when broadcasting large task binary
srowen commented on a change in pull request #23580: [SPARK-26660] Add warning logs when broadcasting large task binary
URL: https://github.com/apache/spark/pull/23580#discussion_r251302189

## File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

@@ -1162,6 +1162,10 @@ private[spark] class DAGScheduler(
       partitions = stage.rdd.partitions
     }

+    if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB) {

Review comment: Yeah, I should have looked at the test logs, my fault. 1000 = KB, 1024 = KiB. This should be correct unless the constant is misnamed.
[GitHub] SparkQA commented on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise
SparkQA commented on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise
URL: https://github.com/apache/spark/pull/23669#issuecomment-458022899

**[Test build #101748 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101748/testReport)** for PR 23669 at commit [`ddd3b56`](https://github.com/apache/spark/commit/ddd3b56421d170bef6f5de67f4049e6a28bba6a1).
[GitHub] maropu commented on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise
maropu commented on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise
URL: https://github.com/apache/spark/pull/23669#issuecomment-458022193

It seems `ElementAt` could have the same fix. The change is trivial, so I think it might be better to include the `ElementAt` fix in this PR, too.
[GitHub] maropu opened a new pull request #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise
maropu opened a new pull request #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise
URL: https://github.com/apache/spark/pull/23669

## What changes were proposed in this pull request?

In master, `GetMapValue` nullable is always true:
https://github.com/apache/spark/blob/cf133e611020ed178f90358464a1b88cdd9b7889/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L371

But, if the input key is foldable, we could make its nullability more precise. This fix is the same as SPARK-26637 (#23566).

## How was this patch tested?

Added tests in `ComplexTypeSuite`.
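The nullability tightening proposed above can be sketched as a standalone rule. This is a simplified, hypothetical model, not Catalyst's actual API: `MapMeta`, `getMapValueNullable`, and `keyKnownPresent` are illustrative names, and the real analysis of when a foldable key is known to be present is more involved.

```scala
// Hypothetical, simplified model of the rule; not Catalyst's actual API.
// A lookup result can only be declared non-nullable when:
//   1. the map's value type forbids nulls (valueContainsNull = false), and
//   2. the key is foldable (a constant) and known to be present in the map,
//      since a missing key still yields null at runtime.
case class MapMeta(valueContainsNull: Boolean)

def getMapValueNullable(meta: MapMeta, keyFoldable: Boolean, keyKnownPresent: Boolean): Boolean =
  if (keyFoldable && keyKnownPresent) meta.valueContainsNull
  else true
```

With a non-foldable key the result stays nullable regardless of the value type, because the key may be absent at runtime.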
[GitHub] HyukjinKwon edited a comment on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0
HyukjinKwon edited a comment on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0
URL: https://github.com/apache/spark/pull/23657#issuecomment-458019236

For PyArrow 0.12.0, I don't think so, but it will run with the Arrow 0.12.0 + PyArrow 0.8.0 combination. It needs a manual upgrade. Currently, IIRC, the version is 0.8.0 to test the minimum PyArrow version. I was thinking about bumping up the minimal PyArrow version at Spark 3.0 so that we can reduce such if-else branches and reduce such overhead.
[GitHub] HyukjinKwon commented on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0
HyukjinKwon commented on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0
URL: https://github.com/apache/spark/pull/23657#issuecomment-458019236

For PyArrow 0.12.0, I don't think so, but it will run with the Arrow 0.12.0 + PyArrow 0.8.0 combination. It needs a manual upgrade. Currently, IIRC, the version is 0.8.0 to test minimum versions. I was thinking about bumping up the minimal PyArrow version at Spark 3.0 so that we can reduce such if-else branches and reduce such overhead.
[GitHub] eatoncys commented on issue #23010: [SPARK-26012][SQL]Null and '' values should not cause dynamic partition failure of string types
eatoncys commented on issue #23010: [SPARK-26012][SQL] Null and '' values should not cause dynamic partition failure of string types
URL: https://github.com/apache/spark/pull/23010#issuecomment-458018613

@cloud-fan Could you review it again? Thanks.
[GitHub] maropu commented on a change in pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary
maropu commented on a change in pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary
URL: https://github.com/apache/spark/pull/23668#discussion_r251298706

## File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

@@ -1162,7 +1162,7 @@ private[spark] class DAGScheduler(
       partitions = stage.rdd.partitions
     }

-    if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB) {
+    if (taskBinaryBytes.length > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1000) {

Review comment: yea, I think 1024 is better.
[GitHub] eatoncys removed a comment on issue #23010: [SPARK-26012][SQL]Null and '' values should not cause dynamic partition failure of string types
eatoncys removed a comment on issue #23010: [SPARK-26012][SQL] Null and '' values should not cause dynamic partition failure of string types
URL: https://github.com/apache/spark/pull/23010#issuecomment-457778864

ping @cloud-fan
[GitHub] beliefer commented on a change in pull request #23650: [SPARK-26728]Make rdd.unpersist blocking configurable
beliefer commented on a change in pull request #23650: [SPARK-26728] Make rdd.unpersist blocking configurable
URL: https://github.com/apache/spark/pull/23650#discussion_r251297717

## File path: core/src/main/scala/org/apache/spark/rdd/RDD.scala

@@ -209,13 +210,14 @@ abstract class RDD[T: ClassTag](
    */
   def cache(): this.type = persist()

+  private val unpersistBlocking = conf.get(RDD_UNPERSIST_BLOCKING)
   /**
    * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
    *
    * @param blocking Whether to block until all blocks are deleted.
    * @return This RDD.
    */
-  def unpersist(blocking: Boolean = true): this.type = {
+  def unpersist(blocking: Boolean = unpersistBlocking): this.type = {

Review comment: Maybe this PR is a temporary solution for avoiding the RPC issue. We cannot prove the measure is effective before the blocking behavior of unpersist is discussed in detail.
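The pattern in the diff above is a config-driven default for a method parameter. A minimal, self-contained sketch of that idea (the `Conf`/`FakeRDD` classes and the config key are illustrative stand-ins, not Spark's actual classes):

```scala
// Illustrative stand-in for a config store; not Spark's SparkConf.
class Conf(settings: Map[String, String]) {
  def getBoolean(key: String, default: Boolean): Boolean =
    settings.get(key).map(_.toBoolean).getOrElse(default)
}

// Illustrative stand-in for RDD: the default of `blocking` is read from the
// config once, instead of being hardcoded to true.
class FakeRDD(conf: Conf) {
  private val unpersistBlocking =
    conf.getBoolean("spark.rdd.unpersist.blocking", true)

  // Returns the effective blocking flag so the behavior is observable.
  def unpersist(blocking: Boolean = unpersistBlocking): Boolean = blocking
}
```

Callers that pass `blocking` explicitly are unaffected; only the default changes with the configuration.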
[GitHub] viirya commented on a change in pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary
viirya commented on a change in pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary
URL: https://github.com/apache/spark/pull/23668#discussion_r251297474

## File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

@@ -1162,7 +1162,7 @@ private[spark] class DAGScheduler(
       partitions = stage.rdd.partitions
     }

-    if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB) {
+    if (taskBinaryBytes.length > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1000) {

Review comment: We use 1024 at L497 in TaskSetManager:
```scala
if (serializedTask.limit() > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
    !emittedTaskSizeWarning) {
  emittedTaskSizeWarning = true
  logWarning(s"Stage ${task.stageId} contains a task of very large size " +
    s"(${serializedTask.limit() / 1024} KB). The maximum recommended task size is " +
    s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
}
```
Shall we use a consistent value?
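The comparison under review can be isolated into a small sketch. `TASK_SIZE_TO_WARN_KB` below is an assumed stand-in for `TaskSetManager.TASK_SIZE_TO_WARN_KB` (1000 in Spark); the reviewer's point is that both warning sites should multiply by the same factor (1024 for KiB) before comparing raw byte counts:

```scala
// Assumed stand-in for TaskSetManager.TASK_SIZE_TO_WARN_KB (1000 in Spark).
val TASK_SIZE_TO_WARN_KB = 1000L

// Compare raw bytes against the threshold using the 1024 (KiB) multiplier,
// matching the existing check in TaskSetManager.
def shouldWarn(sizeInBytes: Long): Boolean =
  sizeInBytes > TASK_SIZE_TO_WARN_KB * 1024L
```

With this form the threshold is 1,024,000 bytes; using 1000 instead would warn 2.4% earlier, which is why picking one multiplier and using it everywhere matters.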
[GitHub] gatorsmile commented on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0
gatorsmile commented on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0
URL: https://github.com/apache/spark/pull/23657#issuecomment-458015915

Will our Jenkins run Arrow 0.12?
[GitHub] SparkQA commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
SparkQA commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458013714

**[Test build #101747 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101747/testReport)** for PR 23656 at commit [`52ac627`](https://github.com/apache/spark/commit/52ac627589db4ad63bf959406ad70b9d8790f40a).
[GitHub] AmplabJenkins commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
AmplabJenkins commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458013641

Merged build finished. Test PASSed.
[GitHub] AmplabJenkins removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
AmplabJenkins removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458013646

Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7478/
Test PASSed.
[GitHub] AmplabJenkins removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
AmplabJenkins removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458013641

Merged build finished. Test PASSed.
[GitHub] caneGuy commented on a change in pull request #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable
caneGuy commented on a change in pull request #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable
URL: https://github.com/apache/spark/pull/23519#discussion_r251292265

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/ExchangeSuite.scala

@@ -132,4 +135,33 @@ class ExchangeSuite extends SparkPlanTest with SharedSQLContext {
     val projection2 = cached.select("_1", "_3").queryExecution.executedPlan
     assert(!projection1.sameResult(projection2))
   }

+  test("SPARK-26601: Make broadcast-exchange thread pool configurable") {
+    val previousNumber = SparkSession.getActiveSession.get.sparkContext.conf
+      .get(StaticSQLConf.MAX_BROADCAST_EXCHANGE_THREADNUMBER)
+
+    SparkSession.getActiveSession.get.sparkContext.conf.
+      set(StaticSQLConf.MAX_BROADCAST_EXCHANGE_THREADNUMBER, 1)
+    assert(SQLConf.get.getConf(StaticSQLConf.MAX_BROADCAST_EXCHANGE_THREADNUMBER) === 1)
+
+    Future {
+      Thread.sleep(5 * 1000)
+    } (BroadcastExchangeExec.executionContext)
+
+    val f = Future {} (BroadcastExchangeExec.executionContext)

Review comment: Actually, I thought about this approach, but I could not call the `getMaximumPoolSize` API, so I changed the solution. I will look into this idea further.
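For reference on the idea being discussed: when the pool under test is (or can be unwrapped to) a `java.util.concurrent.ThreadPoolExecutor`, the cap is observable directly via `getMaximumPoolSize`, without saturating the pool with sleeping Futures. The pool below is a stand-in, not the actual executor behind `BroadcastExchangeExec.executionContext`; as the review notes, unwrapping the real one may not be possible.

```scala
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

// Stand-in pool with a cap of 1 thread; in the real test this would be the
// executor backing the broadcast-exchange execution context (an assumption,
// since that executor may not expose a ThreadPoolExecutor directly).
val maxThreads = 1
val pool = new ThreadPoolExecutor(
  maxThreads, maxThreads, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue[Runnable]())

// The configured cap is observable without submitting any work.
val observedCap = pool.getMaximumPoolSize
pool.shutdown()
```

If the executor cannot be unwrapped, saturating the pool (as the test above does) remains a workable, if slower, way to observe the cap.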
[GitHub] AmplabJenkins commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
AmplabJenkins commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458013646

Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7478/
Test PASSed.
[GitHub] HyukjinKwon commented on a change in pull request #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable
HyukjinKwon commented on a change in pull request #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable
URL: https://github.com/apache/spark/pull/23519#discussion_r251294071

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala

@@ -126,4 +126,13 @@ object StaticSQLConf {
     .intConf
     .createWithDefault(1000)

+  val MAX_BROADCAST_EXCHANGE_THREADNUMBER =
+    buildStaticConf("spark.sql.broadcastExchange.maxThreadNumber")
+      .doc("The maximum degree of parallelism to fetch and broadcast the table.If we " +

Review comment: If that's around this code and it's the only one, yea, let's do that. If there are multiple across this file, let's not include them.
[GitHub] dongjoon-hyun commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
dongjoon-hyun commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458012910

Retest this please.
[GitHub] HyukjinKwon commented on issue #23336: [SPARK-26378][SQL] Restore performance of queries against wide CSV/JSON tables
HyukjinKwon commented on issue #23336: [SPARK-26378][SQL] Restore performance of queries against wide CSV/JSON tables
URL: https://github.com/apache/spark/pull/23336#issuecomment-458012967

I think it's clear that it saves time. Let's just push the results file here.
[GitHub] SparkQA removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
SparkQA removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp URL: https://github.com/apache/spark/pull/23656#issuecomment-457993454 **[Test build #101743 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101743/testReport)** for PR 23656 at commit [`52ac627`](https://github.com/apache/spark/commit/52ac627589db4ad63bf959406ad70b9d8790f40a).
[GitHub] AmplabJenkins commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
AmplabJenkins commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp URL: https://github.com/apache/spark/pull/23656#issuecomment-458012248 Merged build finished. Test FAILed.
[GitHub] AmplabJenkins removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
AmplabJenkins removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp URL: https://github.com/apache/spark/pull/23656#issuecomment-458012250 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/101743/ Test FAILed.
[GitHub] AmplabJenkins removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
AmplabJenkins removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp URL: https://github.com/apache/spark/pull/23656#issuecomment-458012248 Merged build finished. Test FAILed.
[GitHub] AmplabJenkins commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
AmplabJenkins commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp URL: https://github.com/apache/spark/pull/23656#issuecomment-458012250 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/101743/ Test FAILed.
[GitHub] SparkQA commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
SparkQA commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp URL: https://github.com/apache/spark/pull/23656#issuecomment-458012077 **[Test build #101743 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101743/testReport)** for PR 23656 at commit [`52ac627`](https://github.com/apache/spark/commit/52ac627589db4ad63bf959406ad70b9d8790f40a). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] deshanxiao edited a comment on issue #23637: [SPARK-26714][CORE][WEBUI] Show 0 partition job in WebUI
deshanxiao edited a comment on issue #23637: [SPARK-26714][CORE][WEBUI] Show 0 partition job in WebUI URL: https://github.com/apache/spark/pull/23637#issuecomment-458003956 Yes, it takes no time at all and it always succeeds. Maybe using the same time in `SparkListenerJobStart` and `SparkListenerJobEnd` would be better. In addition, the method `submitJob` is invoked in two places, and I don't want to handle it twice. Hence, I think the best place for this code is the `submitJob` method itself.
[GitHub] caneGuy commented on a change in pull request #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable
caneGuy commented on a change in pull request #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable URL: https://github.com/apache/spark/pull/23519#discussion_r251292265 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/ExchangeSuite.scala ## @@ -132,4 +135,33 @@ class ExchangeSuite extends SparkPlanTest with SharedSQLContext { val projection2 = cached.select("_1", "_3").queryExecution.executedPlan assert(!projection1.sameResult(projection2)) } + + test("SPARK-26601: Make broadcast-exchange thread pool configurable") { +val previousNumber = SparkSession.getActiveSession.get.sparkContext.conf + .get(StaticSQLConf.MAX_BROADCAST_EXCHANGE_THREADNUMBER) + +SparkSession.getActiveSession.get.sparkContext.conf. + set(StaticSQLConf.MAX_BROADCAST_EXCHANGE_THREADNUMBER, 1) + assert(SQLConf.get.getConf(StaticSQLConf.MAX_BROADCAST_EXCHANGE_THREADNUMBER) === 1) + +Future { + Thread.sleep(5*1000) +} (BroadcastExchangeExec.executionContext) + +val f = Future {} (BroadcastExchangeExec.executionContext) Review comment: Actually, I have thought about this approach, but I could not call the `getMaximumPoolSize` API, so I changed the solution. I will look into this idea further.
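As a sketch of the alternative test approach discussed here (assuming the pool is a plain `java.util.concurrent.ThreadPoolExecutor`, which is the standard JDK type such executors are built on), one could assert on the configured bound directly rather than probing it with sleeping `Future`s:

```scala
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}
import scala.concurrent.ExecutionContext

object PoolBoundSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for the proposed MAX_BROADCAST_EXCHANGE_THREADNUMBER conf value.
    val maxThreads = 1

    // A fixed-size pool bounded at maxThreads, like a capped broadcast pool would be.
    val pool = new ThreadPoolExecutor(
      maxThreads, maxThreads, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue[Runnable]())
    val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

    // Assert on the bound directly instead of racing Futures against it.
    assert(pool.getMaximumPoolSize == maxThreads)

    pool.shutdown()
  }
}
```

This only works if the test can reach the underlying `ThreadPoolExecutor`; if the execution context hides it, the Future-based probing in the diff above is the fallback.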
[GitHub] HyukjinKwon commented on a change in pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary
HyukjinKwon commented on a change in pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary URL: https://github.com/apache/spark/pull/23668#discussion_r251292205 ## File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ## @@ -1162,7 +1162,7 @@ private[spark] class DAGScheduler( partitions = stage.rdd.partitions } - if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB) { + if (taskBinaryBytes.length > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1000) { Review comment: it's KB, not KiB, isn't it?
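The precedence fix in the diff above can be checked with a small sketch. The 100 KB value for `TASK_SIZE_TO_WARN_KB` is an assumption here for illustration; the real constant lives in `TaskSetManager`:

```scala
object TaskSizeWarnSketch {
  // Assumed value for illustration; the real constant is defined in TaskSetManager.
  val TASK_SIZE_TO_WARN_KB = 100

  // Original (buggy) form: scales the byte count up, so almost every task warns.
  def warnsBuggy(taskBinaryBytes: Long): Boolean =
    taskBinaryBytes * 1000 > TASK_SIZE_TO_WARN_KB

  // Fixed form from the diff: scale the KB threshold up to bytes instead.
  def warnsFixed(taskBinaryBytes: Long): Boolean =
    taskBinaryBytes > TASK_SIZE_TO_WARN_KB * 1000

  def main(args: Array[String]): Unit = {
    assert(warnsBuggy(200))        // a 200-byte binary already "warns" under the buggy check
    assert(!warnsFixed(200))       // but is far below the intended 100 KB threshold
    assert(warnsFixed(150 * 1000)) // a 150 KB binary correctly triggers the warning
  }
}
```

Whether the multiplier should be 1000 (KB) or 1024 (KiB) is exactly the question raised in the surrounding review comments.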
[GitHub] liupc commented on issue #23614: [SPARK-26689]Support blacklisting bad disk directory and retry in DiskBlockManager
liupc commented on issue #23614: [SPARK-26689]Support blacklisting bad disk directory and retry in DiskBlockManager URL: https://github.com/apache/spark/pull/23614#issuecomment-458010952 cc @srowen @vanzin @HyukjinKwon @squito @dongjoon-hyun Could anybody have a look at this PR and give some suggestions?
[GitHub] liupc commented on issue #23580: [SPARK-26660]Add warning logs when broadcasting large task binary
liupc commented on issue #23580: [SPARK-26660]Add warning logs when broadcasting large task binary URL: https://github.com/apache/spark/pull/23580#issuecomment-458010241 @maropu @srowen @HyukjinKwon Really sorry for that, I will be more careful next time!
[GitHub] bersprockets edited a comment on issue #23336: [SPARK-26378][SQL] Restore performance of queries against wide CSV/JSON tables
bersprockets edited a comment on issue #23336: [SPARK-26378][SQL] Restore performance of queries against wide CSV/JSON tables URL: https://github.com/apache/spark/pull/23336#issuecomment-458008758 @HyukjinKwon The last time CSVBenchmark-results.txt or JSONBenchmark-results.txt was committed to master was roughly 200 commits before the partial results change. Also, this PR contains some changes to CSVBenchmark.scala and JsonBenchmark.scala. Until this PR, JsonBenchmark.scala had no test for wide rows (which is where the issue really shows up). Thought 1: I am thinking of creating a separate PR with only the CSVBenchmark.scala and JsonBenchmark.scala changes, plus the two results files. This would be the baseline, and it would include wide-row tests. After that PR is merged, I would run the benchmarks again, but against this PR. I would include the new results files as part of this PR. Or, thought 2: I could just update CSVBenchmark.scala and JsonBenchmark.scala in a local working copy of the baseline, run benchmarks for both my local baseline and the PR, and simply verify that the PR fixes the issue. Then, I would commit the two results files to this PR. Lastly, thought 3: Maybe we don't need the changes to CSVBenchmark.scala and JsonBenchmark.scala. I would just commit the results files without the new cases.
[GitHub] bersprockets commented on issue #23336: [SPARK-26378][SQL] Restore performance of queries against wide CSV/JSON tables
bersprockets commented on issue #23336: [SPARK-26378][SQL] Restore performance of queries against wide CSV/JSON tables URL: https://github.com/apache/spark/pull/23336#issuecomment-458008758 @HyukjinKwon The last time CSVBenchmark-results.txt or JSONBenchmark-results.txt was committed to master was roughly 200 commits before the partial results change. Also, this PR contains some changes to CSVBenchmark.scala and JsonBenchmark.scala. Until this PR, JsonBenchmark.scala had no test for wide rows (which is where the issue really shows up). Thought #1: I am thinking of creating a separate PR with only the CSVBenchmark.scala and JsonBenchmark.scala changes, plus the two results files. This would be the baseline, and it would include wide-row tests. After that PR is merged, I would run the benchmarks again, but against this PR. I would include the new results files as part of this PR. Or, thought #2: I could just update CSVBenchmark.scala and JsonBenchmark.scala in a local working copy of the baseline, run benchmarks for both my local baseline and the PR, and simply verify that the PR fixes the issue. Then, I would commit the two results files to this PR.
[GitHub] caneGuy commented on a change in pull request #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable
caneGuy commented on a change in pull request #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable URL: https://github.com/apache/spark/pull/23519#discussion_r251289944 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala ## @@ -126,4 +126,13 @@ object StaticSQLConf { .intConf .createWithDefault(1000) + val MAX_BROADCAST_EXCHANGE_THREADNUMBER = +buildStaticConf("spark.sql.broadcastExchange.maxThreadNumber") + .doc("The maximum degree of parallelism to fetch and broadcast the table.If we " + Review comment: Sorry for this style @HyukjinKwon. Actually, I found some other code with the same problem; can I open a PR to fix that?
[GitHub] caneGuy commented on a change in pull request #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable
caneGuy commented on a change in pull request #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable URL: https://github.com/apache/spark/pull/23519#discussion_r251289872 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala ## @@ -126,4 +126,13 @@ object StaticSQLConf { .intConf .createWithDefault(1000) + val MAX_BROADCAST_EXCHANGE_THREADNUMBER = Review comment: Ok @maropu Thanks
[GitHub] liupc commented on a change in pull request #23580: [SPARK-26660]Add warning logs when broadcasting large task binary
liupc commented on a change in pull request #23580: [SPARK-26660]Add warning logs when broadcasting large task binary URL: https://github.com/apache/spark/pull/23580#discussion_r251289438 ## File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ## @@ -1162,6 +1162,10 @@ private[spark] class DAGScheduler( partitions = stage.rdd.partitions } + if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB) { Review comment: @HyukjinKwon Sorry for that. Also, in the new PR we can change 1000 to 1024; that sounds more reasonable.
[GitHub] deshanxiao commented on a change in pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary
deshanxiao commented on a change in pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary URL: https://github.com/apache/spark/pull/23668#discussion_r251289378 ## File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ## @@ -1162,7 +1162,7 @@ private[spark] class DAGScheduler( partitions = stage.rdd.partitions } - if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB) { + if (taskBinaryBytes.length > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1000) { Review comment: I think 1024 would be more precise, is it right?
[GitHub] deshanxiao commented on a change in pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary
deshanxiao commented on a change in pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary URL: https://github.com/apache/spark/pull/23668#discussion_r251289378 ## File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ## @@ -1162,7 +1162,7 @@ private[spark] class DAGScheduler( partitions = stage.rdd.partitions } - if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB) { + if (taskBinaryBytes.length > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1000) { Review comment: I think 1024 would be more precise, is that right?
[GitHub] HyukjinKwon commented on a change in pull request #23580: [SPARK-26660]Add warning logs when broadcasting large task binary
HyukjinKwon commented on a change in pull request #23580: [SPARK-26660]Add warning logs when broadcasting large task binary URL: https://github.com/apache/spark/pull/23580#discussion_r251287698 ## File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ## @@ -1162,6 +1162,10 @@ private[spark] class DAGScheduler( partitions = stage.rdd.partitions } + if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB) { Review comment: @liupc, even if it was a minor log change, a manual test should be done to avoid such mistakes.
[GitHub] deshanxiao edited a comment on issue #23637: [SPARK-26714][CORE][WEBUI] Show 0 partition job in WebUI
deshanxiao edited a comment on issue #23637: [SPARK-26714][CORE][WEBUI] Show 0 partition job in WebUI URL: https://github.com/apache/spark/pull/23637#issuecomment-458003956 Yes, it takes no time at all and it always succeeds. I think in this case we don't need to check it twice, and could even skip creating a JobWaiter if necessary.
[GitHub] deshanxiao commented on issue #23637: [SPARK-26714][CORE][WEBUI] Show 0 partition job in WebUI
deshanxiao commented on issue #23637: [SPARK-26714][CORE][WEBUI] Show 0 partition job in WebUI URL: https://github.com/apache/spark/pull/23637#issuecomment-458003956 Yes, it takes no time at all and it always succeeds. I think in this case we don't need to check it twice, and could even skip creating a JobWaiter if necessary.
[GitHub] AmplabJenkins commented on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect
AmplabJenkins commented on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect URL: https://github.com/apache/spark/pull/23665#issuecomment-458003268 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/101739/ Test PASSed.
[GitHub] AmplabJenkins removed a comment on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect
AmplabJenkins removed a comment on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect URL: https://github.com/apache/spark/pull/23665#issuecomment-458003268 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/101739/ Test PASSed.
[GitHub] SparkQA removed a comment on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect
SparkQA removed a comment on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect URL: https://github.com/apache/spark/pull/23665#issuecomment-457973075 **[Test build #101739 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101739/testReport)** for PR 23665 at commit [`cd2f30c`](https://github.com/apache/spark/commit/cd2f30cbec303356e276ab83a83f092b8a2c36e6).
[GitHub] AmplabJenkins removed a comment on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect
AmplabJenkins removed a comment on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect URL: https://github.com/apache/spark/pull/23665#issuecomment-458003266 Merged build finished. Test PASSed.
[GitHub] SparkQA commented on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect
SparkQA commented on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect URL: https://github.com/apache/spark/pull/23665#issuecomment-458003043 **[Test build #101739 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101739/testReport)** for PR 23665 at commit [`cd2f30c`](https://github.com/apache/spark/commit/cd2f30cbec303356e276ab83a83f092b8a2c36e6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] AmplabJenkins commented on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect
AmplabJenkins commented on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect URL: https://github.com/apache/spark/pull/23665#issuecomment-458003266 Merged build finished. Test PASSed.
[GitHub] zjf2012 commented on issue #23560: [SPARK-26632][Spark Core] Separate Thread Configurations of Driver and Executor
zjf2012 commented on issue #23560: [SPARK-26632][Spark Core] Separate Thread Configurations of Driver and Executor URL: https://github.com/apache/spark/pull/23560#issuecomment-458002943 @jerryshao, I checked configuration.md as well as other related md files under the docs folder, and I cannot find a proper place to put my documentation. Actually, the Spark documentation doesn't even mention the IO-threads-related configurations. Would you please give some suggestions? Thanks.
[GitHub] AmplabJenkins removed a comment on issue #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary
AmplabJenkins removed a comment on issue #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary URL: https://github.com/apache/spark/pull/23668#issuecomment-458001541 Merged build finished. Test PASSed.
[GitHub] AmplabJenkins commented on issue #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary
AmplabJenkins commented on issue #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary URL: https://github.com/apache/spark/pull/23668#issuecomment-458001546 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7477/ Test PASSed.
[GitHub] maropu commented on issue #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary
maropu commented on issue #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary URL: https://github.com/apache/spark/pull/23668#issuecomment-458001665 LGTM
[GitHub] viirya commented on a change in pull request #22193: [SPARK-25186][SQL] Remove v2 save mode.
viirya commented on a change in pull request #22193: [SPARK-25186][SQL] Remove v2 save mode. URL: https://github.com/apache/spark/pull/22193#discussion_r251284321

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala

```diff
@@ -143,14 +145,17 @@ object DataSourceV2Strategy extends Strategy {
         DataSourceV2ScanExec(
           r.output, r.source, r.options, r.pushedFilters, r.readSupport, scanConfig)) :: Nil
-    case WriteToDataSourceV2(writer, query) =>
-      WriteToDataSourceV2Exec(writer, planLater(query)) :: Nil
+    case WriteToDataSourceV2(writer, writeConfig, query) =>
+      WriteToDataSourceV2Exec(writer, writeConfig, planLater(query)) :: Nil
     case AppendData(r: DataSourceV2Relation, query, _) =>
-      WriteToDataSourceV2Exec(r.newWriteSupport(), planLater(query)) :: Nil
+      val writeSupport = r.newWriteSupport()
+      val writeConfig = writeSupport.createWriteConfig(
+        query.schema, new DataSourceOptions(r.options.asJava))
+      WriteToDataSourceV2Exec(r.newWriteSupport(), writeConfig, planLater(query)) :: Nil
```

Review comment: Why do `r.newWriteSupport()` again? We can reuse `writeSupport`.
[GitHub] srowen commented on issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees
srowen commented on issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees URL: https://github.com/apache/spark/pull/21632#issuecomment-458001245 Is there a good reason to scale it by the square of the samples? if not, yeah, worth a follow-up. If there is a good reason, then is there a case in the tests here where epsilon becomes really large, like of the same order of magnitude as the expected values? I don't think the tests have ~1e8 samples. Up to your judgment.
[GitHub] SparkQA commented on issue #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary
SparkQA commented on issue #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary URL: https://github.com/apache/spark/pull/23668#issuecomment-458001181 **[Test build #101746 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101746/testReport)** for PR 23668 at commit [`e53831a`](https://github.com/apache/spark/commit/e53831a20a91fb7aed88ac57dfac9dd9671b05f7).
[GitHub] HeartSaVioR edited a comment on issue #23634: [SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows
HeartSaVioR edited a comment on issue #23634: [SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows URL: https://github.com/apache/spark/pull/23634#issuecomment-458000808 > The rows in one of the inputs are immediately discarded while I think both inputs should be retained till the watermark of the operator advances so that the correct results can be produced. The rows were stored to state on both sides in batch 1; they were just evicted from state in different batches (batch 3 for the left side, batch 2 for the right side). The global watermark was advanced in batch 2, so the right-side eviction in batch 2 was valid under the watermark condition. The left-side "late" eviction is just due to the join condition.
[GitHub] srowen opened a new pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary
srowen opened a new pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary URL: https://github.com/apache/spark/pull/23668

## What changes were proposed in this pull request?

The warning introduced in https://github.com/apache/spark/pull/23580 has a bug: https://github.com/apache/spark/pull/23580#issuecomment-458000380 This just fixes the logic.

## How was this patch tested?

N/A
[GitHub] maropu commented on issue #23580: [SPARK-26660]Add warning logs when broadcasting large task binary
maropu commented on issue #23580: [SPARK-26660]Add warning logs when broadcasting large task binary URL: https://github.com/apache/spark/pull/23580#issuecomment-458000585 oh...thanks!
[GitHub] srowen commented on issue #23580: [SPARK-26660]Add warning logs when broadcasting large task binary
srowen commented on issue #23580: [SPARK-26660]Add warning logs when broadcasting large task binary URL: https://github.com/apache/spark/pull/23580#issuecomment-458000380 Oh, shoot, there's a bug here: `if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB)` The `* 1000` should be on the other side. I read right past that. I'll make a follow-up. That should eliminate a lot of the warnings!
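The off-by-a-factor comparison is easy to reproduce in isolation. Below is a standalone sketch (the object and constant are hypothetical stand-ins for the real `TaskSetManager.TASK_SIZE_TO_WARN_KB` and scheduler code, with an assumed threshold value): multiplying the byte count by 1000 compares an inflated number against a KB threshold, so even tiny task binaries trigger the warning, whereas scaling the threshold up to bytes gives the intended check.

```scala
// Standalone sketch of the comparison bug described above; names and the
// threshold value are hypothetical, not copied from Spark.
object TaskSizeWarnSketch {
  val TASK_SIZE_TO_WARN_KB = 1000L // threshold in kilobytes (assumed value)

  // Buggy form: inflates the byte count by 1000, so a 5 KB task "warns".
  def shouldWarnBuggy(taskBinaryLen: Long): Boolean =
    taskBinaryLen * 1000 > TASK_SIZE_TO_WARN_KB

  // Fixed form: scale the KB threshold up to bytes instead.
  def shouldWarnFixed(taskBinaryLen: Long): Boolean =
    taskBinaryLen > TASK_SIZE_TO_WARN_KB * 1000
}
```

With a 5000-byte task binary the buggy predicate fires while the fixed one stays quiet, which is why the follow-up "should eliminate a lot of the warnings".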
[GitHub] sadhen commented on a change in pull request #23276: [SPARK-26321][SQL] Improve the behavior of sql text splitting for the spark-sql command line
sadhen commented on a change in pull request #23276: [SPARK-26321][SQL] Improve the behavior of sql text splitting for the spark-sql command line URL: https://github.com/apache/spark/pull/23276#discussion_r251283220 ## File path: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala ## @@ -331,6 +337,65 @@ private[hive] class SparkSQLCLIDriver extends CliDriver with Logging { console.printInfo(s"Spark master: $master, Application Id: $appId") } + // method body imported from Hive and translated from Java to Scala + override def processLine(line: String, allowInterrupting: Boolean): Int = { Review comment: Yes, and I had not checked the impl on Hive master. We may judge which impl is better.
[GitHub] srowen commented on a change in pull request #23534: [SPARK-26610][PYTHON] Fix inconsistency between toJSON Method in Python and Scala.
srowen commented on a change in pull request #23534: [SPARK-26610][PYTHON] Fix inconsistency between toJSON Method in Python and Scala. URL: https://github.com/apache/spark/pull/23534#discussion_r251283106 ## File path: docs/sql-migration-guide-upgrade.md ## @@ -45,6 +45,8 @@ displayTitle: Spark SQL Upgrading Guide - In Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(Any, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. Since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default. + - Since Spark 3.0, `DataFrame.toJSON()` in PySpark returns `DataFrame` of JSON string instead of `RDD`. The method in Scala/Java was changed to return `DataFrame` before, but the one in PySpark was not changed at that time. If you still want to return `RDD`, you can restore the previous behavior by setting `spark.sql.legacy.pyspark.toJsonShouldReturnDataFrame` to `false`. Review comment: The current Scala method does _not_ return a DataFrame. That's where I'm confused about this argument. Yes, either that has to change too, or this change actually makes it less consistent.
[GitHub] AmplabJenkins commented on issue #23667: [SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959
AmplabJenkins commented on issue #23667: [SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959 URL: https://github.com/apache/spark/pull/23667#issuecomment-457999378 Merged build finished. Test PASSed.
[GitHub] AmplabJenkins commented on issue #23667: [SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959
AmplabJenkins commented on issue #23667: [SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959 URL: https://github.com/apache/spark/pull/23667#issuecomment-457999380 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7476/ Test PASSed.
[GitHub] SparkQA commented on issue #23667: [SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959
SparkQA commented on issue #23667: [SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959 URL: https://github.com/apache/spark/pull/23667#issuecomment-457999116 **[Test build #101745 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101745/testReport)** for PR 23667 at commit [`dd5b177`](https://github.com/apache/spark/commit/dd5b1771a95f139d3773f4d796fc29cd61de0ff1).
[GitHub] HyukjinKwon commented on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect
HyukjinKwon commented on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect URL: https://github.com/apache/spark/pull/23665#issuecomment-457998561 @sumitsu, if we agree upon reverting it (at #23667), let's convert this PR into a test-only PR. We can add the test.
[GitHub] HyukjinKwon opened a new pull request #23667: [SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959
HyukjinKwon opened a new pull request #23667: [SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959 URL: https://github.com/apache/spark/pull/23667

## What changes were proposed in this pull request?

This PR reverts the JSON count optimization part of #21909. We cannot distinguish the cases below without parsing:

```
[{...}, {...}]
```

```
[]
```

```
{...}
```

```bash
# empty string
```

when we `count()`. One line (input: IN) can be 0 records, 1 record, or multiple records, and this is dependent on each input. See also https://github.com/apache/spark/pull/23665#discussion_r251276720.

## How was this patch tested?

Manually tested.
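The ambiguity above can be made concrete with a toy line-based counter (entirely hypothetical code, not the datasource implementation): a single input line may contribute zero, one, or several records, so counting lines is not the same as counting records.

```scala
// Toy illustration only: crude per-line record counting for line-delimited
// JSON input. A top-level array contributes one record per element, a single
// object contributes one, and an empty array or blank line contributes none.
def recordsInLine(line: String): Int = line.trim match {
  case ""                     => 0 // empty string
  case "[]"                   => 0 // empty array
  case s if s.startsWith("[") => s.count(_ == '}') // one object per element (crude)
  case _                      => 1 // a single top-level object
}
```

Counting lines (which is what the reverted optimization effectively did when it skipped parsing) disagrees with counting records as soon as arrays or blank lines appear in the input.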
[GitHub] maropu commented on a change in pull request #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect
maropu commented on a change in pull request #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect URL: https://github.com/apache/spark/pull/23665#discussion_r251281725

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala

```diff
@@ -55,11 +56,15 @@ class FailureSafeParser[IN](
   def parse(input: IN): Iterator[InternalRow] = {
     try {
-      if (skipParsing) {
-        Iterator.single(InternalRow.empty)
-      } else {
-        rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
-      }
+      if (skipParsing) {
+        if (unparsedRecordIsNonEmpty(input)) {
+          Iterator.single(InternalRow.empty)
+        } else {
+          Iterator.empty
+        }
+      } else {
+        rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
+      }
```

Review comment: I thought this: https://github.com/maropu/spark/commit/f4df9074619e6cafff679c5f22fa153727be1079 But you should follow other guys who are familiar with this part, @HyukjinKwon and @MaxGekk
[GitHub] cloud-fan commented on a change in pull request #23276: [SPARK-26321][SQL] Improve the behavior of sql text splitting for the spark-sql command line
cloud-fan commented on a change in pull request #23276: [SPARK-26321][SQL] Improve the behavior of sql text splitting for the spark-sql command line URL: https://github.com/apache/spark/pull/23276#discussion_r251281683 ## File path: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala ## @@ -331,6 +337,65 @@ private[hive] class SparkSQLCLIDriver extends CliDriver with Logging { console.printInfo(s"Spark master: $master, Application Id: $appId") } + // method body imported from Hive and translated from Java to Scala + override def processLine(line: String, allowInterrupting: Boolean): Int = { Review comment: I checked the hive code and seems the `;` is well handled at least in the master branch: https://github.com/apache/hive/blob/master/cli/src/java/org/apache/hadoop/hive/cli/CliDriver.java#L395 Do you mean only Hive 1.2 has this bug? Maybe we should upgrade hive.
[GitHub] imatiach-msft commented on issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees
imatiach-msft commented on issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees URL: https://github.com/apache/spark/pull/21632#issuecomment-457997552 @srowen thank you for the merge and the thorough review. I have some doubts about the tolerance we decided for zero values:

```scala
val tolerance = Utils.EPSILON * unweightedNumSamples * unweightedNumSamples
```

https://github.com/apache/spark/pull/21632/files#diff-1fd1bc8d3fc9306c83cd65fbf3ca4bbeR1054

For a large number of unweighted samples I am worried that it might be too high. Note EPSILON=2.2E-16. I am wondering if I should change the tolerance to be:

```scala
val tolerance = Utils.EPSILON * unweightedNumSamples * (some constant)
```

What are your thoughts?
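A quick back-of-envelope check of the concern (plain arithmetic, not Spark code; the sample count is a hypothetical value): with EPSILON around 2.2e-16, a tolerance quadratic in the sample count reaches order 1 at roughly 1e8 samples, which is the same order as the statistics being compared, while a linear tolerance stays negligible.

```scala
// Magnitude comparison of the two tolerance formulas discussed above.
val EPSILON = 2.2e-16 // roughly Utils.EPSILON (machine epsilon for Double)
val n = 1e8           // a large hypothetical unweighted sample count

val quadratic = EPSILON * n * n // ~ 2.2: same order as an O(1) statistic
val linear    = EPSILON * n     // ~ 2.2e-8: still a tight tolerance
```

This is the scenario srowen asks about above: whether any test actually reaches a sample count where the quadratic tolerance becomes comparable to the expected values.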
[GitHub] ueshin commented on a change in pull request #23534: [SPARK-26610][PYTHON] Fix inconsistency between toJSON Method in Python and Scala.
ueshin commented on a change in pull request #23534: [SPARK-26610][PYTHON] Fix inconsistency between toJSON Method in Python and Scala. URL: https://github.com/apache/spark/pull/23534#discussion_r251281383 ## File path: docs/sql-migration-guide-upgrade.md ## @@ -45,6 +45,8 @@ displayTitle: Spark SQL Upgrading Guide - In Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(Any, DataType)` gets a Scala closure with primitive-type argument, the returned UDF will return null if the input values is null. Since Spark 3.0, the UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column `x` is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default. + - Since Spark 3.0, `DataFrame.toJSON()` in PySpark returns `DataFrame` of JSON string instead of `RDD`. The method in Scala/Java was changed to return `DataFrame` before, but the one in PySpark was not changed at that time. If you still want to return `RDD`, you can restore the previous behavior by setting `spark.sql.legacy.pyspark.toJsonShouldReturnDataFrame` to `false`. Review comment: Actually I'm still feeling it's inconsistent because the abstraction layer is different between RDD and DataFrame/Dataset. I guess users expect it returns something of the same abstraction, i.e., RDD returns RDD, DataFrame returns DataFrame, so I'd rather handle Dataset as DataFrame in Python than RDD. We might still need to discuss the function `map` or the behavior of `DataFrameReader.csv`/`.json` as @HyukjinKwon mentioned.
[GitHub] cloud-fan commented on issue #23645: [SPARK-26725][TEST] Fix the input values of UnifiedMemoryManager constructor in test suites
cloud-fan commented on issue #23645: [SPARK-26725][TEST] Fix the input values of UnifiedMemoryManager constructor in test suites URL: https://github.com/apache/spark/pull/23645#issuecomment-457997323 thanks, merging to master!
[GitHub] cloud-fan closed pull request #23645: [SPARK-26725][TEST] Fix the input values of UnifiedMemoryManager constructor in test suites
cloud-fan closed pull request #23645: [SPARK-26725][TEST] Fix the input values of UnifiedMemoryManager constructor in test suites URL: https://github.com/apache/spark/pull/23645
[GitHub] HeartSaVioR commented on issue #23666: [SPARK-26718][SS][Master] Fixed integer overflow in SS kafka rateLimit calculation
HeartSaVioR commented on issue #23666: [SPARK-26718][SS][Master] Fixed integer overflow in SS kafka rateLimit calculation URL: https://github.com/apache/spark/pull/23666#issuecomment-457997277 You might be confused regarding the policy the first time, but we only add `[BRANCH-NAME]` for PRs against a non-master branch. You can remove `[Master]` from the PR's title.
[GitHub] cloud-fan commented on a change in pull request #23644: [SPARK-26708][SQL] Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan
cloud-fan commented on a change in pull request #23644: [SPARK-26708][SQL] Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan URL: https://github.com/apache/spark/pull/23644#discussion_r251281049 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ## @@ -180,7 +180,26 @@ class CacheManager extends Logging { val it = cachedData.iterator() while (it.hasNext) { val cd = it.next() -if (condition(cd.plan)) { +// If `clearCache` is false (which means the recache request comes from a non-cascading +// cache invalidation) and the cache buffer has already been loaded, we do not need to +// re-compile a physical plan because the old plan will not be used any more by the +// CacheManager although it still lives in compiled `Dataset`s and it could still work. Review comment: > the old plan will not be used any more by the CacheManager How do we guarantee it?
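For context, the non-cascading invalidation being discussed boils down to: walk the list of cached entries, and for each matching entry either evict it (cascading) or keep an already-materialized buffer and skip recompiling its plan. A toy sketch of that control flow, in Java for illustration only — all names here are hypothetical and this is not Spark's actual `CacheManager`:

```java
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.function.Predicate;

class ToyCacheManager<P> {
    static final class CachedData<T> {
        final T plan;
        final boolean loaded; // whether the cached buffer has been materialized
        CachedData(T plan, boolean loaded) { this.plan = plan; this.loaded = loaded; }
    }

    final List<CachedData<P>> cachedData = new LinkedList<>();

    /**
     * Evicts matching entries and returns how many were evicted.
     * When cascade is false, entries whose buffer is already loaded are
     * kept as-is, so no physical plan needs to be recompiled for them.
     */
    int uncache(Predicate<P> condition, boolean cascade) {
        int evicted = 0;
        Iterator<CachedData<P>> it = cachedData.iterator();
        while (it.hasNext()) {
            CachedData<P> cd = it.next();
            if (condition.test(cd.plan)) {
                if (cascade || !cd.loaded) {
                    it.remove(); // drop: dependents must recompute
                    evicted++;
                }
                // else: keep the loaded buffer and its old plan untouched
            }
        }
        return evicted;
    }
}
```

The review question above is precisely about the `else` branch: the old plan object kept there may still be reachable from compiled `Dataset`s, so "not used any more" needs a guarantee.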
[GitHub] HeartSaVioR commented on a change in pull request #23666: [SPARK-26718][SS][Master] Fixed integer overflow in SS kafka rateLimit calculation
HeartSaVioR commented on a change in pull request #23666: [SPARK-26718][SS][Master] Fixed integer overflow in SS kafka rateLimit calculation URL: https://github.com/apache/spark/pull/23666#discussion_r251280936 ## File path: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala ## @@ -231,7 +231,11 @@ private[kafka010] class KafkaMicroBatchStream( val begin = from.get(tp).getOrElse(fromNew(tp)) val prorate = limit * (size / total) // Don't completely starve small topicpartitions -val off = begin + (if (prorate < 1) Math.ceil(prorate) else Math.floor(prorate)).toLong +val prorateLong = (if (prorate < 1) Math.ceil(prorate) else Math.floor(prorate)).toLong +// need to be careful of integer overflow +// therefore added canary checks to see if the off variable could overflow +// refer to [https://issues.apache.org/jira/browse/SPARK-26718] +val off = if (prorateLong > Long.MaxValue - begin) Long.MaxValue else begin + prorateLong Review comment: Line length is limited to 100 chars and this line exceeds it. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101744/console You are encouraged to run `dev/scalastyle` and fix all style issues before submitting additional commits.
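The saturating-add pattern in the diff above — comparing against `Long.MaxValue - begin` before adding, so the sum clamps instead of wrapping around — can be sketched in isolation. The snippet below is an illustrative Java translation (the method name is hypothetical), assuming both operands are non-negative as they are for Kafka offsets:

```java
public class SaturatingAdd {
    /**
     * Adds two non-negative longs, clamping to Long.MAX_VALUE instead of
     * wrapping around on overflow (plain `a + b` would silently go negative).
     */
    static long saturatingAdd(long begin, long prorate) {
        return (prorate > Long.MAX_VALUE - begin) ? Long.MAX_VALUE : begin + prorate;
    }

    public static void main(String[] args) {
        System.out.println(saturatingAdd(10L, 5L));                // 15
        System.out.println(saturatingAdd(Long.MAX_VALUE - 1, 5L)); // clamped to Long.MAX_VALUE
    }
}
```

On the JVM there is also `Math.addExact`, which throws `ArithmeticException` on overflow; the clamp above is the right choice here because a rate-limited offset should saturate rather than fail.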
[GitHub] liupc edited a comment on issue #23647: [SPARK-26712]Support multi directories for executor shuffle info recovery in yarn shuffle serivce
liupc edited a comment on issue #23647: [SPARK-26712]Support multi directories for executor shuffle info recovery in yarn shuffle serivce URL: https://github.com/apache/spark/pull/23647#issuecomment-457995661 @vanzin @HyukjinKwon We once ran into a similar problem on Spark 2.0.1, before https://github.com/apache/spark/pull/14162 was introduced. A broken disk on the recovery path caused the NM to start without the `YarnShuffleService`, so executors scheduled on that node were unable to register with the `YarnShuffleService`, which ultimately caused the application to fail. Even though we now have https://github.com/apache/spark/pull/14162 and application-level blacklisting, I think this PR still makes sense for long-running applications (for instance, Spark ThriftServer or Spark Streaming applications). For these kinds of applications this case might not be uncommon, since they run for a long time. Even if Spark recovers with application-level blacklisting enabled, it will still waste resources, because shuffles will always fail on that node — not to mention that there is a chance the node is never blacklisted, which would cause job failures. Resource waste or job failure is unacceptable for a ThriftServer or streaming application. I hope this explanation is convincing.
[GitHub] sadhen commented on a change in pull request #23276: [SPARK-26321][SQL] Improve the behavior of sql text splitting for the spark-sql command line
sadhen commented on a change in pull request #23276: [SPARK-26321][SQL] Improve the behavior of sql text splitting for the spark-sql command line URL: https://github.com/apache/spark/pull/23276#discussion_r251280587 ## File path: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala ## @@ -331,6 +337,65 @@ private[hive] class SparkSQLCLIDriver extends CliDriver with Logging { console.printInfo(s"Spark master: $master, Application Id: $appId") } + // method body imported from Hive and translated from Java to Scala + override def processLine(line: String, allowInterrupting: Boolean): Int = { Review comment: Yes, there is a buggy implementation in Hive.
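The `processLine` override being discussed splits a command line into individual SQL statements on `;`, and a naive split breaks on semicolons inside string literals. A toy splitter that respects single- and double-quoted literals, in Java for illustration only — this is not the Hive or Spark implementation, and it ignores escapes and comments:

```java
import java.util.ArrayList;
import java.util.List;

class SqlSplitter {
    /** Split on top-level ';', ignoring semicolons inside '...' or "..." literals. */
    static List<String> split(String line) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        char quote = 0; // 0 = not currently inside a quoted literal
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0; // closing quote ends the literal
                cur.append(c);
            } else if (c == '\'' || c == '"') {
                quote = c; // opening quote starts a literal
                cur.append(c);
            } else if (c == ';') {
                out.add(cur.toString().trim()); // statement boundary
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        if (!cur.toString().trim().isEmpty()) out.add(cur.toString().trim());
        return out;
    }
}
```

For example, `select 1; select ';'` should yield two statements, with the quoted semicolon left intact in the second one.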
[GitHub] HyukjinKwon commented on a change in pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV
HyukjinKwon commented on a change in pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV URL: https://github.com/apache/spark/pull/21909#discussion_r251280295 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ## @@ -402,7 +403,7 @@ class JacksonParser( } } } catch { - case e @ (_: RuntimeException | _: JsonProcessingException) => + case e @ (_: RuntimeException | _: JsonProcessingException | _: MalformedInputException) => Review comment: Is this change related, @MaxGekk? Let's not add unrelated changes next time.
[GitHub] AmplabJenkins commented on issue #23666: [SPARK-26718][SS][Master] Fixed integer overflow in SS kafka rateLimit calculation
AmplabJenkins commented on issue #23666: [SPARK-26718][SS][Master] Fixed integer overflow in SS kafka rateLimit calculation URL: https://github.com/apache/spark/pull/23666#issuecomment-457994835 Merged build finished. Test PASSed.