[GitHub] [spark] cloud-fan commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite
cloud-fan commented on a change in pull request #29422: URL: https://github.com/apache/spark/pull/29422#discussion_r471246654

File path: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala

@@ -136,7 +139,21 @@ private[spark] class TaskSchedulerImpl(
   // IDs of the tasks running on each executor
   private val executorIdToRunningTaskIds = new HashMap[String, HashSet[Long]]
-  private val executorsPendingDecommission = new HashMap[String, ExecutorDecommissionInfo]
+  // We add executors here when we first get a decommission notification for them. Executors can
+  // continue to run even after being asked to decommission, but they will eventually exit.
+  val executorsPendingDecommission = new HashMap[String, ExecutorDecommissionInfo]
+
+  // When they exit and we learn of that via heartbeat failure, we will add them to this cache.
+  // This cache is consulted to know whether a fetch failure is because a source executor was
+  // decommissioned.
+  lazy val decommissionedExecutorsRemoved = CacheBuilder.newBuilder()
+    .expireAfterWrite(
+      conf.getLong("spark.decommissioningRememberAfterRemoval.seconds", 60L), TimeUnit.SECONDS)

Review comment: Not related to this PR, but just to double-check: the fetch failure may arrive after the executor is marked as removed, even without the "early exit" feature, right? For example, the RPC framework may have a huge delay. So this is best-effort anyway.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
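cloud-fan's "best-effort" point can be made concrete. Below is a minimal Python sketch (hypothetical, standing in for the Guava `CacheBuilder.expireAfterWrite` cache in the snippet above): an entry is visible only for the TTL after it was written, so a fetch failure that arrives later than that after the executor's removal can no longer be attributed to decommissioning.

```python
import time

class ExpireAfterWriteCache:
    """Minimal sketch of an expire-after-write cache, like the one used
    above to remember recently removed decommissioned executors."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock           # injectable clock, for testing
        self._entries = {}           # key -> (value, write_time)

    def put(self, key, value):
        self._entries[key] = (value, self.clock())

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, written = entry
        if self.clock() - written > self.ttl:
            # Entry expired: a fetch failure arriving this late can no
            # longer be attributed to decommissioning (best-effort).
            del self._entries[key]
            return None
        return value
```

With the 60-second default from the diff's `spark.decommissioningRememberAfterRemoval.seconds` conf, a lookup 61 seconds after the write misses, which is exactly the window cloud-fan is asking about.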
[GitHub] [spark] imback82 commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources
imback82 commented on a change in pull request #29437: URL: https://github.com/apache/spark/pull/29437#discussion_r471246355

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala

@@ -64,7 +64,9 @@ abstract class CSVDataSource extends Serializable {
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
     if (inputPaths.nonEmpty) {
-      Some(infer(sparkSession, inputPaths, parsedOptions))
+      // Remove "path" option so that it is not added to the given `inputPaths`.

Review comment: And `FileTable.inferSchema` doesn't take in options.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
AmplabJenkins removed a comment on pull request #29434: URL: https://github.com/apache/spark/pull/29434#issuecomment-674671126
[GitHub] [spark] AmplabJenkins commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
AmplabJenkins commented on pull request #29434: URL: https://github.com/apache/spark/pull/29434#issuecomment-674671126
[GitHub] [spark] SparkQA removed a comment on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
SparkQA removed a comment on pull request #29434: URL: https://github.com/apache/spark/pull/29434#issuecomment-674635457 **[Test build #127495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127495/testReport)** for PR 29434 at commit [`5944512`](https://github.com/apache/spark/commit/594451208426eb3b30bcd9a2b4cce74f25621e8e).
[GitHub] [spark] SaurabhChawla100 edited a comment on pull request #29413: [SPARK-32597][CORE] Tune Event Drop in Async Event Queue
SaurabhChawla100 edited a comment on pull request #29413: URL: https://github.com/apache/spark/pull/29413#issuecomment-674664896

> Yeah so you have described the effects of it dropping the events, which I know. The thing I want to know is why it dropped the events in your cases.

There are scenarios where a stage has a large number of tasks and all tasks complete in a very short time, causing the queues to fill up very fast and start dropping events after a while (the executorManagement queue, the appStatus queue, even the eventLog queue). For us, executorManagement and appStatus matter most, since we have dynamic allocation enabled and most of the jobs run on notebooks (Jupyter/Zeppelin).

> I'm not sure the circumstances you are running or seeing these issues, is this just individuals running their own clusters. If it's a centralized cluster, why not have some cluster defaults and just raise the default values there? This is essentially just doing that.

We are already setting some values in the cluster defaults. But as I said earlier, "There is no fixed size of the Queue which can be used in all the Spark Jobs and even for the Same Spark Job that ran on different input set on daily basis". That's why we are looking for some dynamism in this queue sizing and some way to reduce the human effort of changing the queue size. IMO, changing the driver memory (increasing/decreasing it) is easier than setting the queue size (and there are multiple queues), and an application failure due to OOM is easier to debug than the impact of an event drop, since there are many things to investigate after an event drop (application hangs, wasted resources, etc.). @tgravescs @Ngone51 @itskals
[GitHub] [spark] ulysses-you commented on a change in pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command
ulysses-you commented on a change in pull request #28840: URL: https://github.com/apache/spark/pull/28840#discussion_r471244364

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala

@@ -236,6 +236,45 @@ case class ShowFunctionsCommand(
   }
 }

+
+/**
+ * A command for users to refresh the persistent function.
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   REFRESH FUNCTION functionName
+ * }}}
+ */
+case class RefreshFunctionCommand(
+    databaseName: Option[String],
+    functionName: String)
+  extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    val catalog = sparkSession.sessionState.catalog
+    if (FunctionRegistry.builtin.functionExists(FunctionIdentifier(functionName))) {
+      throw new AnalysisException(s"Cannot refresh builtin function $functionName")

Review comment: Got it.
[GitHub] [spark] SparkQA commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
SparkQA commented on pull request #29434: URL: https://github.com/apache/spark/pull/29434#issuecomment-674670309 **[Test build #127495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127495/testReport)** for PR 29434 at commit [`5944512`](https://github.com/apache/spark/commit/594451208426eb3b30bcd9a2b4cce74f25621e8e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] ulysses-you commented on a change in pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command
ulysses-you commented on a change in pull request #28840: URL: https://github.com/apache/spark/pull/28840#discussion_r471244264

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala

@@ -236,6 +236,45 @@ case class ShowFunctionsCommand(
   }
 }

+
+/**
+ * A command for users to refresh the persistent function.
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   REFRESH FUNCTION functionName
+ * }}}
+ */
+case class RefreshFunctionCommand(
+    databaseName: Option[String],
+    functionName: String)
+  extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    val catalog = sparkSession.sessionState.catalog
+    if (FunctionRegistry.builtin.functionExists(FunctionIdentifier(functionName))) {

Review comment: It seems meaningless to refresh a persistent function whose name is the same as a built-in function's. Yes, we can create a persistent function with the same name as a built-in function, but it is only created in the metastore; the function actually used is the built-in one. The reason is that built-in functions are pre-cached in the registry, and we look up the cached function first. E.g., after `CREATE FUNCTION rand AS 'xxx'`, `DESC FUNCTION rand` will always return `Class: org.apache.spark.sql.catalyst.expressions.Rand`. BTW, maybe that is also why we create the function lazily, just acting as a Hive client; otherwise we couldn't create functions like `rand` or `md5` in the metastore at all. @cloud-fan
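The lookup precedence described above, where the built-in registry is consulted before the persistent (metastore) catalog so that a persistent `rand` is shadowed, can be sketched as follows. This is a hypothetical Python model of the behavior, not Spark's actual implementation; the class and registry names are illustrative.

```python
# Built-in functions are pre-cached in a registry (illustrative subset).
BUILTIN_REGISTRY = {"rand": "org.apache.spark.sql.catalyst.expressions.Rand"}

class FunctionCatalog:
    def __init__(self):
        self.metastore = {}  # persistent functions, keyed by name

    def create_function(self, name, class_name):
        # CREATE FUNCTION only records the function in the metastore,
        # even when a built-in with the same name exists.
        self.metastore[name] = class_name

    def lookup(self, name):
        # Built-ins win: this is why DESC FUNCTION rand keeps showing
        # the built-in Rand class, regardless of the metastore entry.
        if name in BUILTIN_REGISTRY:
            return BUILTIN_REGISTRY[name]
        return self.metastore.get(name)
```

Under this model, refreshing a persistent function that shares a built-in's name would never change what queries resolve to, which is the motivation for rejecting it in `RefreshFunctionCommand`.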
[GitHub] [spark] cloud-fan commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite
cloud-fan commented on a change in pull request #29422: URL: https://github.com/apache/spark/pull/29422#discussion_r471243906

File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala

@@ -294,10 +294,15 @@ private[spark] class CoarseGrainedExecutorBackend(
   override def run(): Unit = {
     var lastTaskRunningTime = System.nanoTime()
     val sleep_time = 1000 // 1s
-
+    // This config is internal and only used by unit tests to force an executor
+    // to hang around for longer when decommissioned.
+    val initialSleepMillis = env.conf.getInt(
+      "spark.executor.decommission.initial.sleep.millis", sleep_time)

Review comment: For a testing-only conf, let's add the prefix `spark.test`.
[GitHub] [spark] imback82 commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources
imback82 commented on a change in pull request #29437: URL: https://github.com/apache/spark/pull/29437#discussion_r471243651

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala

@@ -64,7 +64,9 @@ abstract class CSVDataSource extends Serializable {
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
     if (inputPaths.nonEmpty) {
-      Some(infer(sparkSession, inputPaths, parsedOptions))
+      // Remove "path" option so that it is not added to the given `inputPaths`.

Review comment: But it's not going to fix v2 datasources since `inferSchema` is called from `FileTable`.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources
AmplabJenkins removed a comment on pull request #29395: URL: https://github.com/apache/spark/pull/29395#issuecomment-674665654
[GitHub] [spark] AmplabJenkins commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources
AmplabJenkins commented on pull request #29395: URL: https://github.com/apache/spark/pull/29395#issuecomment-674665654
[GitHub] [spark] SparkQA commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources
SparkQA commented on pull request #29395: URL: https://github.com/apache/spark/pull/29395#issuecomment-674665237 **[Test build #127498 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127498/testReport)** for PR 29395 at commit [`9c18479`](https://github.com/apache/spark/commit/9c18479e0b71c5b6ec1a2a0f268c598cf03fa879).
[GitHub] [spark] SaurabhChawla100 commented on pull request #29413: [SPARK-32597][CORE] Tune Event Drop in Async Event Queue
SaurabhChawla100 commented on pull request #29413: URL: https://github.com/apache/spark/pull/29413#issuecomment-674664896

> Yeah so you have described the effects of it dropping the events, which I know. The thing I want to know is why it dropped the events in your cases.

There are scenarios where a stage has a large number of tasks and all tasks complete in a very short time, causing the queues to fill up very fast and start dropping events after a while (the executorManagement queue, the appStatus queue, even the eventLog queue). For us, executorManagement and appStatus matter most, since we have dynamic allocation enabled and most of the jobs run on notebooks (Jupyter/Zeppelin).

> I'm not sure the circumstances you are running or seeing these issues, is this just individuals running their own clusters. If it's a centralized cluster, why not have some cluster defaults and just raise the default values there? This is essentially just doing that.

We are already setting some values in the cluster defaults. But as I said earlier, "There is no fixed size of the Queue which can be used in all the Spark Jobs and even for the Same Spark Job that ran on different input set on daily basis". That's why we are looking for some dynamism in this queue sizing and some way to reduce the human effort of changing the queue size. Also, IMO, changing the driver memory (increasing/decreasing it) is easier than setting the queue size (and there are multiple queues). @tgravescs @Ngone51 @itskals
[GitHub] [spark] Ngone51 edited a comment on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources
Ngone51 edited a comment on pull request #29395: URL: https://github.com/apache/spark/pull/29395#issuecomment-674664611 `org.apache.spark.sql.DataFrameSuite.SPARK-28224: Aggregate sum big decimal overflow` is quite flaky. But I'm sure it can pass on my laptop.
[GitHub] [spark] Ngone51 commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources
Ngone51 commented on pull request #29395: URL: https://github.com/apache/spark/pull/29395#issuecomment-674664662 retest this please.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources
AmplabJenkins removed a comment on pull request #29395: URL: https://github.com/apache/spark/pull/29395#issuecomment-674663941 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127496/ Test FAILed.
[GitHub] [spark] Ngone51 commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources
Ngone51 commented on pull request #29395: URL: https://github.com/apache/spark/pull/29395#issuecomment-674664611 `org.apache.spark.sql.DataFrameSuite.SPARK-28224: Aggregate sum big decimal overflow` is pretty flaky. But I'm sure it can pass on my laptop.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources
AmplabJenkins removed a comment on pull request #29395: URL: https://github.com/apache/spark/pull/29395#issuecomment-674663938 Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources
SparkQA removed a comment on pull request #29395: URL: https://github.com/apache/spark/pull/29395#issuecomment-674637079 **[Test build #127496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127496/testReport)** for PR 29395 at commit [`9c18479`](https://github.com/apache/spark/commit/9c18479e0b71c5b6ec1a2a0f268c598cf03fa879).
[GitHub] [spark] cloud-fan commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources
cloud-fan commented on a change in pull request #29437: URL: https://github.com/apache/spark/pull/29437#discussion_r471239203

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala

@@ -64,7 +64,9 @@ abstract class CSVDataSource extends Serializable {
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
     if (inputPaths.nonEmpty) {
-      Some(infer(sparkSession, inputPaths, parsedOptions))
+      // Remove "path" option so that it is not added to the given `inputPaths`.

Review comment: Can we do it at the caller side of `FileFormat.inferSchema`? Then we don't need to change various file source implementations.
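The caller-side approach cloud-fan suggests, stripping the `path` entry from the options once before invoking schema inference instead of inside every file source, can be sketched like this. It is a hypothetical Python sketch, with `infer` standing in for `FileFormat.inferSchema`; the lowercase comparison reflects the assumption that data source option keys are matched case-insensitively.

```python
def infer_schema_at_caller(options, infer):
    """Drop the "path" option once at the call site, so no individual
    file source (CSV, JSON, ...) needs to handle it itself."""
    options_without_path = {
        k: v for k, v in options.items() if k.lower() != "path"
    }
    return infer(options_without_path)
```

With this shape, each `infer` implementation receives options that are already free of `path`, which is the point of doing the fix at the caller rather than in each datasource.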
[GitHub] [spark] AmplabJenkins commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources
AmplabJenkins commented on pull request #29395: URL: https://github.com/apache/spark/pull/29395#issuecomment-674663938
[GitHub] [spark] SparkQA commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources
SparkQA commented on pull request #29395: URL: https://github.com/apache/spark/pull/29395#issuecomment-674663650 **[Test build #127496 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127496/testReport)** for PR 29395 at commit [`9c18479`](https://github.com/apache/spark/commit/9c18479e0b71c5b6ec1a2a0f268c598cf03fa879).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] Ngone51 commented on a change in pull request #29270: [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression
Ngone51 commented on a change in pull request #29270: URL: https://github.com/apache/spark/pull/29270#discussion_r471236995

File path: sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala

@@ -256,4 +285,39 @@ trait TPCDSSchema {
       |${options.mkString("\n")}
     """.stripMargin)
  }
+
+  private val originalCBCEnabled = conf.cboEnabled
+  private val originalPlanStatsEnabled = conf.planStatsEnabled
+  private val originalJoinReorderEnabled = conf.joinReorderEnabled
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    if (injectStats) {
+      // Sets configurations for enabling the optimization rules that
+      // exploit data statistics.
+      conf.setConf(SQLConf.CBO_ENABLED, true)
+      conf.setConf(SQLConf.PLAN_STATS_ENABLED, true)
+      conf.setConf(SQLConf.JOIN_REORDER_ENABLED, true)

Review comment: The configurations have been extracted to the base class (`TPCDSBase`). If we change it here, I think it will also take effect for the `TPCDSQuerySuite`.
[GitHub] [spark] cloud-fan commented on a change in pull request #29322: [SPARK-32511][SQL] Add dropFields method to Column class
cloud-fan commented on a change in pull request #29322: URL: https://github.com/apache/spark/pull/29322#discussion_r471236515

File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala

@@ -453,60 +453,81 @@ class ComplexTypesSuite extends PlanTest with ExpressionEvalHelper {
     checkEvaluation(GetMapValue(mb0, Literal(Array[Byte](3, 4))), null)
   }

-  private val structAttr = 'struct1.struct('a.int)
+  private val structAttr = 'struct1.struct('a.int, 'b.int)
   private val testStructRelation = LocalRelation(structAttr)

-  test("simplify GetStructField on WithFields that is not changing the attribute being extracted") {
-    val query = testStructRelation.select(
-      GetStructField(WithFields('struct1, Seq("b"), Seq(Literal(1))), 0, Some("a")) as "outerAtt")
-    val expected = testStructRelation.select(GetStructField('struct1, 0, Some("a")) as "outerAtt")
-    checkRule(query, expected)
+  test("simplify GetStructField on UpdateFields that is not modifying the attribute being " +
+    "extracted") {
+    // add attribute, extract an attribute from the original struct
+    val query1 = testStructRelation.select(GetStructField(UpdateFields('struct1,
+      WithField("b", Literal(1)) :: Nil), 0, None) as "outerAtt")
+    // drop attribute, extract an attribute from the original struct
+    val query2 = testStructRelation.select(GetStructField(UpdateFields('struct1, DropField("b") ::
+      Nil), 0, None) as "outerAtt")
+    // drop attribute, add attribute, extract an attribute from the original struct
+    val query3 = testStructRelation.select(GetStructField(UpdateFields('struct1, DropField("b") ::
+      WithField("c", Literal(2)) :: Nil), 0, None) as "outerAtt")
+    // drop attribute, add attribute, extract an attribute from the original struct
+    val query4 = testStructRelation.select(GetStructField(UpdateFields('struct1, DropField("a") ::
+      WithField("a", Literal(1)) :: Nil), 0, None) as "outerAtt")
+    val expected = testStructRelation.select(GetStructField('struct1, 0, None) as "outerAtt")

Review comment: thanks for catching it! I've reverted it.
[GitHub] [spark] cloud-fan commented on a change in pull request #29270: [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression
cloud-fan commented on a change in pull request #29270: URL: https://github.com/apache/spark/pull/29270#discussion_r471236200

File path: sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala

@@ -256,4 +285,39 @@ trait TPCDSSchema {
       |${options.mkString("\n")}
     """.stripMargin)
  }
+
+  private val originalCBCEnabled = conf.cboEnabled
+  private val originalPlanStatsEnabled = conf.planStatsEnabled
+  private val originalJoinReorderEnabled = conf.joinReorderEnabled
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    if (injectStats) {
+      // Sets configurations for enabling the optimization rules that
+      // exploit data statistics.
+      conf.setConf(SQLConf.CBO_ENABLED, true)
+      conf.setConf(SQLConf.PLAN_STATS_ENABLED, true)
+      conf.setConf(SQLConf.JOIN_REORDER_ENABLED, true)

Review comment: Let's do it later. The `TPCDSQuerySuite` doesn't test shuffle hash join either and we should fix it as well.
[GitHub] [spark] Ngone51 commented on pull request #29270: [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression
Ngone51 commented on pull request #29270: URL: https://github.com/apache/spark/pull/29270#issuecomment-674656885 > Then I think we don't need the tag. Ok.
[GitHub] [spark] cloud-fan commented on pull request #29270: [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression
cloud-fan commented on pull request #29270: URL: https://github.com/apache/spark/pull/29270#issuecomment-674651592

> The total running time with or without the annotation are both about 2 mins

Then I think we don't need the tag.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29448: [SPARK-32018][SQL][3.0][FOLLOWUP] Fix a test failure in DataFrameSuite
AmplabJenkins removed a comment on pull request #29448: URL: https://github.com/apache/spark/pull/29448#issuecomment-674651219
[GitHub] [spark] AmplabJenkins commented on pull request #29448: [SPARK-32018][SQL][3.0][FOLLOWUP] Fix a test failure in DataFrameSuite
AmplabJenkins commented on pull request #29448: URL: https://github.com/apache/spark/pull/29448#issuecomment-674651219
[GitHub] [spark] SparkQA commented on pull request #29448: [SPARK-32018][SQL][3.0][FOLLOWUP] Fix a test failure in DataFrameSuite
SparkQA commented on pull request #29448: URL: https://github.com/apache/spark/pull/29448#issuecomment-674650965 **[Test build #127497 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127497/testReport)** for PR 29448 at commit [`69278b2`](https://github.com/apache/spark/commit/69278b29c65f34f073efcf7c6e463b3a20dc).
[GitHub] [spark] gengliangwang edited a comment on pull request #29404: [SPARK-32018][SQL][FollowUp][3.0] Throw exception on decimal value overflow of sum aggregation
gengliangwang edited a comment on pull request #29404: URL: https://github.com/apache/spark/pull/29404#issuecomment-674650147 @maropu Thanks for reporting. I have created https://github.com/apache/spark/pull/29448 to fix the test failure.
[GitHub] [spark] gengliangwang commented on pull request #29448: [SPARK-32018][SQL][3.0][FOLLOWUP] Fix a test failure in DataFrameSuite
gengliangwang commented on pull request #29448: URL: https://github.com/apache/spark/pull/29448#issuecomment-674650068 cc @maropu @cloud-fan
[GitHub] [spark] gengliangwang opened a new pull request #29448: [SPARK-32018][SQL][3.0][FOLLOWUP] Fix a test failure in DataFrameSuite
gengliangwang opened a new pull request #29448: URL: https://github.com/apache/spark/pull/29448

### What changes were proposed in this pull request?

Fix a test failure in DataFrameSuite introduced by https://github.com/apache/spark/pull/29404

### Why are the changes needed?

Fix the test failure.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test
[GitHub] [spark] Udbhav30 closed pull request #29441: [SPARK-32626][CORE] Do not increase the input metrics when read rdd from cache
Udbhav30 closed pull request #29441: URL: https://github.com/apache/spark/pull/29441
[GitHub] [spark] Udbhav30 commented on a change in pull request #29441: [SPARK-32626][CORE] Do not increase the input metrics when read rdd from cache
Udbhav30 commented on a change in pull request #29441: URL: https://github.com/apache/spark/pull/29441#discussion_r471226995

File path: core/src/main/scala/org/apache/spark/rdd/RDD.scala

@@ -388,11 +388,9 @@ abstract class RDD[T: ClassTag](
       // Block hit.
       case Left(blockResult) =>
         if (readCachedBlock) {
-          val existingMetrics = context.taskMetrics().inputMetrics
-          existingMetrics.incBytesRead(blockResult.bytes)

Review comment: Thanks for the information @viirya. I'll close the PR.
[GitHub] [spark] gengliangwang commented on pull request #29404: [SPARK-32018][SQL][FollowUp][3.0] Throw exception on decimal value overflow of sum aggregation
gengliangwang commented on pull request #29404: URL: https://github.com/apache/spark/pull/29404#issuecomment-674650147 @maropu Thanks for reporting. I have created https://github.com/apache/spark/pull/29448.
[GitHub] [spark] LuciferYang commented on a change in pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
LuciferYang commented on a change in pull request #29434: URL: https://github.com/apache/spark/pull/29434#discussion_r471213921

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala

@@ -314,7 +314,7 @@ trait Row extends Serializable {
    *
    * @throws ClassCastException when data type does not match.
    */
-  def getSeq[T](i: Int): Seq[T] = getAs[Seq[T]](i)
+  def getSeq[T](i: Int): scala.collection.Seq[T] = getAs[scala.collection.Seq[T]](i)

Review comment: Addressed in [5944512](https://github.com/apache/spark/pull/29434/commits/594451208426eb3b30bcd9a2b4cce74f25621e8e): reverted to using the `Seq` alias as the `Row.getSeq` return type. The internal `ArrayType -> scala.collection.Seq` encoding remains unchanged; I think we can change it to `ArrayType -> scala.collection.immutable.Seq` once we no longer need to be compatible with Scala 2.12.
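The compatibility point in this review can be illustrated outside Spark. On Scala 2.12 the default `Seq` alias is `scala.collection.Seq`, while on 2.13 it is `scala.collection.immutable.Seq`; spelling out `scala.collection.Seq` as the return type means the signature accepts both mutable and immutable sequences and means the same thing under both versions. A minimal sketch (`getSeq` here is a stand-in, not the real `Row.getSeq`):

```scala
import scala.collection.mutable.ArrayBuffer

object SeqAliasDemo {
  // Stand-in for Row.getSeq: scala.collection.Seq is the common supertype of
  // mutable and immutable sequences, so this signature is identical on 2.12
  // and 2.13 regardless of what the bare `Seq` alias points to.
  def getSeq[T](value: Any): scala.collection.Seq[T] =
    value.asInstanceOf[scala.collection.Seq[T]]

  def main(args: Array[String]): Unit = {
    val fromImmutable = getSeq[Int](List(1, 2, 3))        // immutable source
    val fromMutable = getSeq[Int](ArrayBuffer(1, 2, 3))   // mutable source
    // Scala sequence equality is element-wise across Seq implementations.
    assert(fromImmutable == fromMutable)
    println("ok")
  }
}
```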
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources
AmplabJenkins removed a comment on pull request #29395: URL: https://github.com/apache/spark/pull/29395#issuecomment-674637326
[GitHub] [spark] AmplabJenkins commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources
AmplabJenkins commented on pull request #29395: URL: https://github.com/apache/spark/pull/29395#issuecomment-674637326
[GitHub] [spark] SparkQA commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources
SparkQA commented on pull request #29395: URL: https://github.com/apache/spark/pull/29395#issuecomment-674637079 **[Test build #127496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127496/testReport)** for PR 29395 at commit [`9c18479`](https://github.com/apache/spark/commit/9c18479e0b71c5b6ec1a2a0f268c598cf03fa879).
[GitHub] [spark] LuciferYang commented on a change in pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
LuciferYang commented on a change in pull request #29434: URL: https://github.com/apache/spark/pull/29434#discussion_r471213921

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala

@@ -314,7 +314,7 @@ trait Row extends Serializable {
    *
    * @throws ClassCastException when data type does not match.
    */
-  def getSeq[T](i: Int): Seq[T] = getAs[Seq[T]](i)
+  def getSeq[T](i: Int): scala.collection.Seq[T] = getAs[scala.collection.Seq[T]](i)

Review comment: Addressed in [5944512](https://github.com/apache/spark/pull/29434/commits/594451208426eb3b30bcd9a2b4cce74f25621e8e): reverted to using the `Seq` alias as the `Row.getSeq` return type. The internal `ArrayType -> scala.collection.Seq` encoding remains unchanged.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
AmplabJenkins removed a comment on pull request #29434: URL: https://github.com/apache/spark/pull/29434#issuecomment-674635770
[GitHub] [spark] AmplabJenkins commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
AmplabJenkins commented on pull request #29434: URL: https://github.com/apache/spark/pull/29434#issuecomment-674635770
[GitHub] [spark] Ngone51 commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources
Ngone51 commented on pull request #29395: URL: https://github.com/apache/spark/pull/29395#issuecomment-674635748 retest this please.
[GitHub] [spark] SparkQA commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
SparkQA commented on pull request #29434: URL: https://github.com/apache/spark/pull/29434#issuecomment-674635457 **[Test build #127495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127495/testReport)** for PR 29434 at commit [`5944512`](https://github.com/apache/spark/commit/594451208426eb3b30bcd9a2b4cce74f25621e8e).
[GitHub] [spark] Ngone51 commented on a change in pull request #29225: [SPARK-32287][TESTS] Flaky Test: ExecutorAllocationManagerSuite.add executors default profile
Ngone51 commented on a change in pull request #29225: URL: https://github.com/apache/spark/pull/29225#discussion_r471211841

File path: core/src/test/scala/org/apache/spark/ExecutorAllocationManagerSuite.scala

@@ -1603,7 +1603,7 @@ class ExecutorAllocationManagerSuite extends SparkFunSuite {
       .set(config.DYN_ALLOCATION_TESTING, true)
       // SPARK-22864: effectively disable the allocation schedule by setting the period to a
       // really long value.
-      .set(TEST_SCHEDULE_INTERVAL, 1L)
+      .set(TEST_SCHEDULE_INTERVAL, 3L)

Review comment: @tgravescs It happened again at: https://github.com/apache/spark/pull/29418/checks?check_run_id=975304054 BTW, this change only prevents the automatic invocation of `schedule()` for the second time, but the first invocation of `schedule()` always happens because the `initialDelay` is 0? https://github.com/apache/spark/blob/8f0fef18438aa8fb07f5ed885ffad1339992f102/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L250
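The timing concern raised here can be demonstrated in isolation: with `initialDelay = 0`, a scheduled task's first run fires immediately no matter how long the repeat period is, so stretching the interval only suppresses the *second* and later runs. A minimal sketch using the JDK scheduler (not Spark's actual `ExecutorAllocationManager` code):

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object InitialDelayDemo {
  def main(args: Array[String]): Unit = {
    val calls = new AtomicInteger(0)
    val executor = Executors.newSingleThreadScheduledExecutor()
    // initialDelay = 0: the first execution happens right away; the huge
    // period only delays subsequent executions.
    executor.scheduleWithFixedDelay(new Runnable {
      override def run(): Unit = { calls.incrementAndGet() }
    }, 0L, 3600L, TimeUnit.SECONDS)
    Thread.sleep(500) // give the immediate first run time to execute
    executor.shutdownNow()
    assert(calls.get() == 1) // exactly one invocation despite the long period
    println("ok")
  }
}
```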
[GitHub] [spark] wangyum commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel
wangyum commented on pull request #29088: URL: https://github.com/apache/spark/pull/29088#issuecomment-674633117 Thank you all. There is a workaround: ![image](https://user-images.githubusercontent.com/5399861/90353884-22a39800-e07a-11ea-8a7b-37ddb1689fd6.png)
[GitHub] [spark] wangyum closed pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel
wangyum closed pull request #29088: URL: https://github.com/apache/spark/pull/29088
[GitHub] [spark] gatorsmile commented on a change in pull request #29224: [SPARK-32430][SQL] Extend SparkSessionExtensions to inject rules into AQE query stage preparation
gatorsmile commented on a change in pull request #29224: URL: https://github.com/apache/spark/pull/29224#discussion_r471209639

File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala

@@ -106,13 +109,28 @@ class SparkSessionExtensions {
     columnarRuleBuilders.map(_.apply(session)).toSeq
   }

+  /**
+   * Build the override rules for the query stage preparation phase of adaptive query execution.
+   */
+  private[sql] def buildQueryStagePrepRules(session: SparkSession): Seq[Rule[SparkPlan]] = {
+    queryStagePrepRuleBuilders.map(_.apply(session)).toSeq
+  }
+
   /**
    * Inject a rule that can override the columnar execution of an executor.
    */
   def injectColumnar(builder: ColumnarRuleBuilder): Unit = {
     columnarRuleBuilders += builder
   }

+  /**
+   * Inject a rule that can override the the query stage preparation phase of adaptive query

Review comment: Nit: the the => the
[GitHub] [spark] gatorsmile commented on a change in pull request #29224: [SPARK-32430][SQL] Extend SparkSessionExtensions to inject rules into AQE query stage preparation
gatorsmile commented on a change in pull request #29224: URL: https://github.com/apache/spark/pull/29224#discussion_r471209521

File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala

@@ -96,8 +97,10 @@ class SparkSessionExtensions {
   type ParserBuilder = (SparkSession, ParserInterface) => ParserInterface
   type FunctionDescription = (FunctionIdentifier, ExpressionInfo, FunctionBuilder)
   type ColumnarRuleBuilder = SparkSession => ColumnarRule
+  type QueryStagePrepRuleBuilder = SparkSession => Rule[SparkPlan]

Review comment: This is a public API. I think we also need to add version information.
[GitHub] [spark] gatorsmile commented on a change in pull request #29224: [SPARK-32430][SQL] Extend SparkSessionExtensions to inject rules into AQE query stage preparation
gatorsmile commented on a change in pull request #29224: URL: https://github.com/apache/spark/pull/29224#discussion_r471209398

File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala

@@ -96,8 +96,10 @@ class SparkSessionExtensions {
   type ParserBuilder = (SparkSession, ParserInterface) => ParserInterface
   type FunctionDescription = (FunctionIdentifier, ExpressionInfo, FunctionBuilder)
   type ColumnarRuleBuilder = SparkSession => ColumnarRule
+  type QueryStagePrepRuleBuilder = SparkSession => Rule[SparkPlan]

Review comment: This is a public API. I think we also need to add version information.
[GitHub] [spark] Ngone51 commented on pull request #29270: [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression
Ngone51 commented on pull request #29270: URL: https://github.com/apache/spark/pull/29270#issuecomment-674631270

> Have you checked the running time? https://github.com/apache/spark/pull/29270/files#r469675879

I may not fully understand the usage of `@ExtendedSQLTest`. The total running time with or without the annotation is about 2 mins in both cases (I used the same command). We may not need it? @maropu
[GitHub] [spark] gatorsmile commented on a change in pull request #29224: [SPARK-32430][SQL] Extend SparkSessionExtensions to inject rules into AQE query stage preparation
gatorsmile commented on a change in pull request #29224: URL: https://github.com/apache/spark/pull/29224#discussion_r471208359

File path: sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala

@@ -73,7 +74,8 @@ private[sql] class SessionState(
     resourceLoaderBuilder: () => SessionResourceLoader,
     createQueryExecution: LogicalPlan => QueryExecution,
     createClone: (SparkSession, SessionState) => SessionState,
-    val columnarRules: Seq[ColumnarRule]) {
+    val columnarRules: Seq[ColumnarRule],
+    val queryStagePrepRules: Seq[Rule[SparkPlan]]) {

Review comment: SessionState is a critical concept/class in Spark SQL. Adding `queryStagePrepRules` into SessionState looks weird to me. WDYT? cc @hvanhovell
[GitHub] [spark] gatorsmile commented on a change in pull request #29224: [SPARK-32430][SQL] Extend SparkSessionExtensions to inject rules into AQE query stage preparation
gatorsmile commented on a change in pull request #29224: URL: https://github.com/apache/spark/pull/29224#discussion_r471207785

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala

@@ -90,7 +90,7 @@ case class AdaptiveSparkPlanExec(
   // Exchange nodes) after running these rules.
   private def queryStagePreparationRules: Seq[Rule[SparkPlan]] = Seq(

Review comment: It sounds like this function should be moved out of the physical node `AdaptiveSparkPlanExec`? cc @maryannxue @cloud-fan
[GitHub] [spark] gatorsmile commented on a change in pull request #29224: [SPARK-32430][SQL] Extend SparkSessionExtensions to inject rules into AQE query stage preparation
gatorsmile commented on a change in pull request #29224: URL: https://github.com/apache/spark/pull/29224#discussion_r471207785

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala

@@ -90,7 +90,7 @@ case class AdaptiveSparkPlanExec(
   // Exchange nodes) after running these rules.
   private def queryStagePreparationRules: Seq[Rule[SparkPlan]] = Seq(

Review comment: It sounds like these rules should be moved out of the physical node `AdaptiveSparkPlanExec`? cc @maryannxue @cloud-fan
[GitHub] [spark] Ngone51 commented on a change in pull request #29270: [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression
Ngone51 commented on a change in pull request #29270: URL: https://github.com/apache/spark/pull/29270#discussion_r471203868

File path: sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala

@@ -256,4 +285,39 @@ trait TPCDSSchema {
         |${options.mkString("\n")}
       """.stripMargin)
   }
+
+  private val originalCBCEnabled = conf.cboEnabled
+  private val originalPlanStatsEnabled = conf.planStatsEnabled
+  private val originalJoinReorderEnabled = conf.joinReorderEnabled
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    if (injectStats) {
+      // Sets configurations for enabling the optimization rules that
+      // exploit data statistics.
+      conf.setConf(SQLConf.CBO_ENABLED, true)
+      conf.setConf(SQLConf.PLAN_STATS_ENABLED, true)
+      conf.setConf(SQLConf.JOIN_REORDER_ENABLED, true)

Review comment: Hmm, I'm not sure about this, but it seems `preferSortMergeJoin` is enabled by default? Or do you mean we should test with `preferSortMergeJoin` both enabled and disabled?
[GitHub] [spark] Karl-WangSK commented on pull request #29360: [SPARK-32542][SQL] Add an optimizer rule to split an Expand into multiple Expands for aggregates
Karl-WangSK commented on pull request #29360: URL: https://github.com/apache/spark/pull/29360#issuecomment-674625909 hi. Can anyone deal with this pr?
[GitHub] [spark] fqaiser94 commented on a change in pull request #29322: [SPARK-32511][SQL] Add dropFields method to Column class
fqaiser94 commented on a change in pull request #29322: URL: https://github.com/apache/spark/pull/29322#discussion_r471198235

File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala

@@ -453,60 +453,81 @@ class ComplexTypesSuite extends PlanTest with ExpressionEvalHelper {
     checkEvaluation(GetMapValue(mb0, Literal(Array[Byte](3, 4))), null)
   }

-  private val structAttr = 'struct1.struct('a.int)
+  private val structAttr = 'struct1.struct('a.int, 'b.int)
   private val testStructRelation = LocalRelation(structAttr)

-  test("simplify GetStructField on WithFields that is not changing the attribute being extracted") {
-    val query = testStructRelation.select(
-      GetStructField(WithFields('struct1, Seq("b"), Seq(Literal(1))), 0, Some("a")) as "outerAtt")
-    val expected = testStructRelation.select(GetStructField('struct1, 0, Some("a")) as "outerAtt")
-    checkRule(query, expected)
+  test("simplify GetStructField on UpdateFields that is not modifying the attribute being " +
+    "extracted") {
+    // add attribute, extract an attribute from the original struct
+    val query1 = testStructRelation.select(GetStructField(UpdateFields('struct1,
+      WithField("b", Literal(1)) :: Nil), 0, None) as "outerAtt")
+    // drop attribute, extract an attribute from the original struct
+    val query2 = testStructRelation.select(GetStructField(UpdateFields('struct1, DropField("b") ::
+      Nil), 0, None) as "outerAtt")
+    // drop attribute, add attribute, extract an attribute from the original struct
+    val query3 = testStructRelation.select(GetStructField(UpdateFields('struct1, DropField("b") ::
+      WithField("c", Literal(2)) :: Nil), 0, None) as "outerAtt")
+    // drop attribute, add attribute, extract an attribute from the original struct
+    val query4 = testStructRelation.select(GetStructField(UpdateFields('struct1, DropField("a") ::
+      WithField("a", Literal(1)) :: Nil), 0, None) as "outerAtt")
+    val expected = testStructRelation.select(GetStructField('struct1, 0, None) as "outerAtt")

Review comment: @cloud-fan I'm afraid I have to recommend we revert the changes in this PR from master. There is another correctness issue I've discovered which exists even in our initial `withField` implementation:

```
sql("SELECT CAST(NULL AS struct) struct_col")
  .select($"struct_col".withField("d", lit(4)).getField("d").as("d"))

// currently returns this
+---+
|d  |
+---+
|4  |
+---+

// when in fact it should return this:
+----+
|d   |
+----+
|null|
+----+
```

I'd like to get `withField` right before coming back to `dropFields`. If that sounds good to you, here is the order of PRs I propose:

1. Revert the `dropFields` changes immediately.
2. Fix the above issue in `withField` (the real issue is in the optimizer rule) through another PR asap.
3. Figure out how to implement `dropFields` in another PR in the future. I think I have to go back to the drawing board for this one. The combination of requirements (return null for a null input struct, allow nested API calls, allow users to arbitrarily mix `dropFields`, `withField`, and `getField` in the same query, and still generate reasonably well-optimized physical plans in a timely manner) makes for a slightly trickier programming exercise than I anticipated and needs a more robust testing strategy.

Please let me know your thoughts.
[GitHub] [spark] brandonJY edited a comment on pull request #29443: [MINOR][DOCS] fix typo in couple docs
brandonJY edited a comment on pull request #29443: URL: https://github.com/apache/spark/pull/29443#issuecomment-674618402

> Ah, I see. I've checked the other PR; are there only three typos in the codebase?

No. For the codebase typos, that is not the complete list (I don't have it yet). And packing all the code typo corrections into one PR might cause mass merge conflicts with other PRs.
[GitHub] [spark] agrawaldevesh commented on pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite
agrawaldevesh commented on pull request #29422: URL: https://github.com/apache/spark/pull/29422#issuecomment-674618506

> PySpark failure is likely unrelated. I'm going to go ahead and say LGTM, but let's give it until Monday to see if anyone else wants to review.

Sounds good. Thanks @holdenk!
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29447: [SPARK-32631][SQL] Handle Null error message in hive ThriftServer UI
AmplabJenkins removed a comment on pull request #29447: URL: https://github.com/apache/spark/pull/29447#issuecomment-674618061

Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #29447: [SPARK-32631][SQL] Handle Null error message in hive ThriftServer UI
AmplabJenkins commented on pull request #29447: URL: https://github.com/apache/spark/pull/29447#issuecomment-674618302

Can one of the admins verify this patch?
[GitHub] [spark] leanken commented on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException
leanken commented on pull request #29431: URL: https://github.com/apache/spark/pull/29431#issuecomment-674618015

@cloud-fan Test passed, ready to merge.
[GitHub] [spark] AmplabJenkins commented on pull request #29447: [SPARK-32631][SQL] Handle Null error message in hive ThriftServer UI
AmplabJenkins commented on pull request #29447: URL: https://github.com/apache/spark/pull/29447#issuecomment-674618061

Can one of the admins verify this patch?
[GitHub] [spark] tianhanhu opened a new pull request #29447: [SPARK-32631][SQL] Handle Null error message in hive ThriftServer UI
tianhanhu opened a new pull request #29447: URL: https://github.com/apache/spark/pull/29447

### What changes were proposed in this pull request?
This fix prevents a NullPointerException caused by a null error message, by wrapping the message in Option and using getOrElse to substitute an empty string for a null value. The error message can be null because java.lang.Throwable allows its detailMessage to be null. However, the render code assumes that the error message is not null and calls indexOf() on the string. Therefore, if some exception doesn't set an error message, it triggers a NullPointerException in the hive ThriftServer UI.

### Why are the changes needed?
To prevent a NullPointerException caused by a null error message.

### Does this PR introduce _any_ user-facing change?
Users see a blank message instead of a NullPointerException.

### How was this patch tested?
Manual testing. For normal behavior, this change does not change anything.
![37435401-658510a8-27a0-11e8-8f8b-00271fd3ef58](https://user-images.githubusercontent.com/30429546/90350571-93f83200-e00b-11ea-9ff3-3b315d2114cc.png)
To test the null case, we manually set detail to null in the code. In this case, the UI shows blank.
![37435498-cebc8182-27a0-11e8-80e7-40df2a65612b](https://user-images.githubusercontent.com/30429546/90350601-ac684c80-e00b-11ea-8375-1911f5852bd5.png)
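The defensive pattern the PR describes (wrap the possibly-null message, fall back to an empty string, then run the string operations) can be sketched in Python. `render_error_summary` and its cut-at-first-newline logic are illustrative stand-ins, not the actual ThriftServer UI code:

```python
# Sketch of the fix: a null (None) detail message is replaced by "" before
# any string operation runs on it.

def render_error_summary(detail):
    # Mirrors the Option(detail).getOrElse("") idea from the Scala fix
    # (assumed shape); None is the Python analogue of a null detailMessage.
    msg = detail if detail is not None else ""
    # The UI does an indexOf-style search on the message; calling a string
    # method on None would raise, the analogue of the NullPointerException.
    cut = msg.find("\n")
    return msg if cut == -1 else msg[:cut]

assert render_error_summary(None) == ""                    # blank, no crash
assert render_error_summary("boom\nstack trace") == "boom" # normal behavior unchanged
```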
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
AmplabJenkins removed a comment on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-674615464
[GitHub] [spark] AmplabJenkins commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
AmplabJenkins commented on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-674615464
[GitHub] [spark] SparkQA commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
SparkQA commented on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-674615193

**[Test build #127494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127494/testReport)** for PR 29396 at commit [`dc8c28c`](https://github.com/apache/spark/commit/dc8c28c134c9aa311ac6c76be407e10d764755e6).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()
AmplabJenkins removed a comment on pull request #29445: URL: https://github.com/apache/spark/pull/29445#issuecomment-674611954
[GitHub] [spark] SparkQA removed a comment on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()
SparkQA removed a comment on pull request #29445: URL: https://github.com/apache/spark/pull/29445#issuecomment-674608241

**[Test build #127493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127493/testReport)** for PR 29445 at commit [`b662ee0`](https://github.com/apache/spark/commit/b662ee0a434bb37d99df1546b3f8f94e01b1f610).
[GitHub] [spark] SparkQA commented on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()
SparkQA commented on pull request #29445: URL: https://github.com/apache/spark/pull/29445#issuecomment-674611838

**[Test build #127493 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127493/testReport)** for PR 29445 at commit [`b662ee0`](https://github.com/apache/spark/commit/b662ee0a434bb37d99df1546b3f8f94e01b1f610).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()
AmplabJenkins commented on pull request #29445: URL: https://github.com/apache/spark/pull/29445#issuecomment-674611954
[GitHub] [spark] viirya commented on a change in pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()
viirya commented on a change in pull request #29445: URL: https://github.com/apache/spark/pull/29445#discussion_r471189589

File path: python/pyspark/ml/tuning.py

```diff
@@ -536,7 +536,7 @@ def copy(self, extra=None):
     bestModel = self.bestModel.copy(extra)
     avgMetrics = self.avgMetrics
     subModels = self.subModels
-    return CrossValidatorModel(bestModel, avgMetrics, subModels)
+    return self._copyValues(CrossValidatorModel(bestModel, avgMetrics, subModels), extra=extra)
```

Review comment: Should we use `Params.copy` like `CrossValidator`?
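The bug the diff above fixes can be modeled in a few lines: constructing a fresh model silently drops the params unless they are explicitly copied onto the new instance. The class and method names below are illustrative stand-ins, not the real `pyspark.ml` classes:

```python
# Minimal model of the copy() bug and its fix.

class Model:
    def __init__(self):
        self.param_map = {}

    def _copy_values(self, other, extra=None):
        # Mirrors the spirit of pyspark.ml Params._copyValues: push this
        # instance's params onto `other`, with `extra` taking precedence.
        other.param_map.update(self.param_map)
        if extra:
            other.param_map.update(extra)
        return other

    def copy_broken(self, extra=None):
        return Model()  # new instance, params silently lost

    def copy_fixed(self, extra=None):
        return self._copy_values(Model(), extra=extra)

m = Model()
m.param_map = {"numFolds": 5}
assert m.copy_broken().param_map == {}                          # params gone
assert m.copy_fixed().param_map == {"numFolds": 5}              # params kept
assert m.copy_fixed({"numFolds": 3}).param_map == {"numFolds": 3}  # extra wins
```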
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()
AmplabJenkins removed a comment on pull request #29445: URL: https://github.com/apache/spark/pull/29445#issuecomment-674608573
[GitHub] [spark] AmplabJenkins commented on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()
AmplabJenkins commented on pull request #29445: URL: https://github.com/apache/spark/pull/29445#issuecomment-674608573
[GitHub] [spark] SparkQA commented on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()
SparkQA commented on pull request #29445: URL: https://github.com/apache/spark/pull/29445#issuecomment-674608241

**[Test build #127493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127493/testReport)** for PR 29445 at commit [`b662ee0`](https://github.com/apache/spark/commit/b662ee0a434bb37d99df1546b3f8f94e01b1f610).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()
AmplabJenkins removed a comment on pull request #29445: URL: https://github.com/apache/spark/pull/29445#issuecomment-674607134
[GitHub] [spark] viirya commented on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()
viirya commented on pull request #29445: URL: https://github.com/apache/spark/pull/29445#issuecomment-674607222

ok to test
[GitHub] [spark] AmplabJenkins commented on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()
AmplabJenkins commented on pull request #29445: URL: https://github.com/apache/spark/pull/29445#issuecomment-674607134
[GitHub] [spark] huaxingao commented on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()
huaxingao commented on pull request #29445: URL: https://github.com/apache/spark/pull/29445#issuecomment-674607031

ok to test
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()
AmplabJenkins removed a comment on pull request #29445: URL: https://github.com/apache/spark/pull/29445#issuecomment-674530895

Can one of the admins verify this patch?
[GitHub] [spark] maropu commented on a change in pull request #29441: [SPARK-32626][CORE] Do not increase the input metrics when read rdd from cache
maropu commented on a change in pull request #29441: URL: https://github.com/apache/spark/pull/29441#discussion_r471172827

File path: core/src/main/scala/org/apache/spark/rdd/RDD.scala

```diff
@@ -388,11 +388,9 @@ abstract class RDD[T: ClassTag](
       // Block hit.
       case Left(blockResult) =>
         if (readCachedBlock) {
-          val existingMetrics = context.taskMetrics().inputMetrics
-          existingMetrics.incBytesRead(blockResult.bytes)
```

Review comment: +1 on the @viirya comment.
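The diff above removes the `incBytesRead` call on a cache hit. The concern it addresses can be sketched with a toy counter: if bytes served from the block-manager cache are added to the input metrics on every hit, a job that reads the same cached partition several times over-reports its input, even though the external source was read only once. The numbers below are made up for illustration:

```python
# Toy model of the input-metrics inflation on cache hits (pre-fix behavior).

class InputMetrics:
    def __init__(self):
        self.bytes_read = 0

    def inc_bytes_read(self, n):
        self.bytes_read += n

metrics = InputMetrics()

# First computation: 100 bytes genuinely read from the external source.
metrics.inc_bytes_read(100)

# Two later cache hits; counting them (as the removed lines did) inflates input.
for _ in range(2):
    metrics.inc_bytes_read(100)

assert metrics.bytes_read == 300  # reported, though only 100 bytes came from the source
```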
[GitHub] [spark] gatorsmile commented on a change in pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command
gatorsmile commented on a change in pull request #28840: URL: https://github.com/apache/spark/pull/28840#discussion_r471172044

File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala

```diff
@@ -3030,6 +3030,49 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils {
       }
     }
   }
+
+  test("REFRESH FUNCTION") {
+    val msg = intercept[AnalysisException] {
+      sql("REFRESH FUNCTION md5")
+    }.getMessage
+    assert(msg.contains("Cannot refresh builtin function"))
+
+    withUserDefinedFunction("func1" -> true) {
+      sql("CREATE TEMPORARY FUNCTION func1 AS 'test.org.apache.spark.sql.MyDoubleAvg'")
+      val msg = intercept[AnalysisException] {
+        sql("REFRESH FUNCTION func1")
+      }.getMessage
+      assert(msg.contains("Cannot refresh temporary function"))
+    }
+
+    withUserDefinedFunction("func1" -> false) {
+      intercept[NoSuchFunctionException] {
+        sql("REFRESH FUNCTION func1")
+      }
+
+      val func = FunctionIdentifier("func1", Some("default"))
+      sql("CREATE FUNCTION func1 AS 'test.org.apache.spark.sql.MyDoubleAvg'")
+      assert(!spark.sessionState.catalog.isRegisteredFunction(func))
+      sql("REFRESH FUNCTION func1")
```

Review comment: This is the only positive test case. Could you think more and try to cover more cases?
[GitHub] [spark] gatorsmile commented on a change in pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command
gatorsmile commented on a change in pull request #28840: URL: https://github.com/apache/spark/pull/28840#discussion_r471172006

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala

```diff
@@ -236,6 +236,45 @@ case class ShowFunctionsCommand(
   }
 }

+
+/**
+ * A command for users to refresh the persistent function.
+ * The syntax of using this command in SQL is:
+ * {{{
+ *    REFRESH FUNCTION functionName
+ * }}}
+ */
+case class RefreshFunctionCommand(
+    databaseName: Option[String],
+    functionName: String)
+  extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    val catalog = sparkSession.sessionState.catalog
+    if (FunctionRegistry.builtin.functionExists(FunctionIdentifier(functionName))) {
+      throw new AnalysisException(s"Cannot refresh builtin function $functionName")
```

Review comment: Nit: built-in
[GitHub] [spark] gatorsmile commented on a change in pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command
gatorsmile commented on a change in pull request #28840: URL: https://github.com/apache/spark/pull/28840#discussion_r471171963

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala

```diff
@@ -236,6 +236,45 @@ case class ShowFunctionsCommand(
   }
 }

+
+/**
+ * A command for users to refresh the persistent function.
+ * The syntax of using this command in SQL is:
+ * {{{
+ *    REFRESH FUNCTION functionName
+ * }}}
+ */
+case class RefreshFunctionCommand(
+    databaseName: Option[String],
+    functionName: String)
+  extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    val catalog = sparkSession.sessionState.catalog
+    if (FunctionRegistry.builtin.functionExists(FunctionIdentifier(functionName))) {
```

Review comment: We still can create a persistent function with the same name as a built-in function. For example:

```sql
CREATE FUNCTION rand AS 'org.apache.spark.sql.catalyst.expressions.Abs';
DESC FUNCTION default.rand;
```

I think we should still allow this case.
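The resolution concern behind this review comment can be sketched with a toy registry: a bare name resolves to the built-in first, but the persistent function remains addressable as `database.name`, so REFRESH FUNCTION cannot reject a name merely because a built-in with the same bare name exists. The dictionaries below are a hypothetical model, not Spark's catalog:

```python
# Toy model of function-name resolution: builtins win on bare names, but a
# qualified name still reaches the persistent function that shadows nothing.

builtins = {"rand": "<builtin rand>"}
persistent = {("default", "rand"): "org.apache.spark.sql.catalyst.expressions.Abs"}

def resolve(name, database=None):
    if database is None and name in builtins:
        return builtins[name]  # bare name: built-in takes precedence
    return persistent.get((database or "default", name))

assert resolve("rand") == "<builtin rand>"
assert resolve("rand", "default") == "org.apache.spark.sql.catalyst.expressions.Abs"
assert resolve("missing") is None
```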
[GitHub] [spark] c21 commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join
c21 commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-674590117

Thank you @maropu, @cloud-fan, @agrawaldevesh, and @viirya for all the insightful discussion and review!
[GitHub] [spark] maropu commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join
maropu commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-674589592

Thanks, @c21, for the great improvement! Merged to master.
[GitHub] [spark] maropu closed pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join
maropu closed pull request #29342: URL: https://github.com/apache/spark/pull/29342
[GitHub] [spark] AmplabJenkins commented on pull request #29366: [SPARK-32550][SQL] Make SpecificInternalRow constructors faster by using while loops instead of maps
AmplabJenkins commented on pull request #29366: URL: https://github.com/apache/spark/pull/29366#issuecomment-674584858
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29366: [SPARK-32550][SQL] Make SpecificInternalRow constructors faster by using while loops instead of maps
AmplabJenkins removed a comment on pull request #29366: URL: https://github.com/apache/spark/pull/29366#issuecomment-674584858
[GitHub] [spark] SparkQA commented on pull request #29366: [SPARK-32550][SQL] Make SpecificInternalRow constructors faster by using while loops instead of maps
SparkQA commented on pull request #29366: URL: https://github.com/apache/spark/pull/29366#issuecomment-674584683 **[Test build #127491 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127491/testReport)** for PR 29366 at commit [`6c878ba`](https://github.com/apache/spark/commit/6c878ba62abdb6255b842d59201d712f4c5eadf3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #29366: [SPARK-32550][SQL] Make SpecificInternalRow constructors faster by using while loops instead of maps
SparkQA removed a comment on pull request #29366: URL: https://github.com/apache/spark/pull/29366#issuecomment-674555103 **[Test build #127491 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127491/testReport)** for PR 29366 at commit [`6c878ba`](https://github.com/apache/spark/commit/6c878ba62abdb6255b842d59201d712f4c5eadf3).
[GitHub] [spark] gatorsmile commented on a change in pull request #29425: [SPARK-32350][FOLLOW-UP] Divide the value list into a set of small batches when bulk writing to LevelDB
gatorsmile commented on a change in pull request #29425: URL: https://github.com/apache/spark/pull/29425#discussion_r471163708

## File path: common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java

## @@ -164,35 +165,39 @@
   public void writeAll(List<?> values) throws Exception {
     Preconditions.checkArgument(values != null && !values.isEmpty(),
       "Non-empty values required.");

-    // Group by class, in case there are values from different classes in the values
+    // Group by class, in case there are values from different classes in the values.
     // Typical usecase is for this to be a single class.
     // A NullPointerException will be thrown if values contain null object.
     for (Map.Entry<? extends Class<?>, ? extends List<?>> entry :
         values.stream().collect(Collectors.groupingBy(Object::getClass)).entrySet()) {
-
-      final Iterator<?> valueIter = entry.getValue().iterator();
-      final Iterator<byte[]> serializedValueIter;
-
-      // Deserialize outside synchronized block
-      List<byte[]> list = new ArrayList<>(entry.getValue().size());
-      for (Object value : values) {
-        list.add(serializer.serialize(value));
-      }
-      serializedValueIter = list.iterator();
-      final Class<?> klass = entry.getKey();
-      final LevelDBTypeInfo ti = getTypeInfo(klass);
-      synchronized (ti) {
-        final LevelDBTypeInfo.Index naturalIndex = ti.naturalIndex();
-        final Collection<LevelDBTypeInfo.Index> indices = ti.indices();
+      // Partition the value list to a set of the 128-values batches. It can reduce the
+      // memory pressure caused by serialization and give fairness to other writing threads
+      // when writing a very large list.
+      for (List<?> batchList : Iterables.partition(entry.getValue(), 128)) {

Review comment: I left a comment in https://github.com/apache/spark/pull/28412#pullrequestreview-468098043 . I think we should add unit test cases to verify the code works as designed. We might need to update the implementation to provide private APIs for testing.
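For readers unfamiliar with the batching utility discussed above: `Iterables.partition` is a Guava helper that splits a list into consecutive fixed-size chunks, so only one batch's worth of serialized values needs to be held at a time. The following is a minimal, self-contained sketch of that behavior (a stand-in for illustration, not Guava's actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchPartition {
    // Split `values` into consecutive sublists of at most `size` elements;
    // the final sublist may be shorter. Mirrors Iterables.partition semantics.
    static <T> List<List<T>> partition(List<T> values, int size) {
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < values.size(); start += size) {
            int end = Math.min(start + size, values.size());
            batches.add(new ArrayList<>(values.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < 300; i++) {
            values.add(i);
        }

        // 300 elements at batch size 128 -> batches of 128, 128 and 44.
        List<List<Integer>> batches = partition(values, 128);
        System.out.println(batches.size());
        System.out.println(batches.get(0).size());
        System.out.println(batches.get(2).size());
    }
}
```

Processing each batch separately (and re-entering the synchronized block per batch) is what bounds the serialization memory footprint and lets other writer threads interleave, as the PR's comment describes.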
[GitHub] [spark] maropu edited a comment on pull request #29439: [SPARK-32624][SQL] Use getCanonicalName to fix byte[] compile issue
maropu edited a comment on pull request #29439: URL: https://github.com/apache/spark/pull/29439#issuecomment-674526844 Can users hit this issue? If you have a query to reproduce it, I think it's better to add end-to-end tests, too.
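For context on why `getCanonicalName` fixes the byte[] compile issue in SPARK-32624: for array types, `Class.getName()` returns the JVM-internal descriptor, which is not valid Java source, so pasting it into generated code fails to compile; `getCanonicalName()` returns the source-level name. An illustrative snippet (not the PR's actual code generation path):

```java
public class ArrayClassName {
    public static void main(String[] args) {
        // JVM descriptor form: "[B" for a byte array. Emitting this into
        // generated Java source produces code that cannot compile.
        System.out.println(byte[].class.getName());

        // Source-level form: "byte[]", which is what generated code must use.
        System.out.println(byte[].class.getCanonicalName());
    }
}
```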
[GitHub] [spark] imback82 commented on pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources
imback82 commented on pull request #29437: URL: https://github.com/apache/spark/pull/29437#issuecomment-674582324 cc @cloud-fan @viirya