[GitHub] [spark] AmplabJenkins commented on pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables
AmplabJenkins commented on pull request #29045: URL: https://github.com/apache/spark/pull/29045#issuecomment-658900416 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a change in pull request #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource
MaxGekk commented on a change in pull request #27366: URL: https://github.com/apache/spark/pull/27366#discussion_r455238390
## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala ##
@@ -508,6 +548,9 @@ object JsonBenchmark extends SqlBasedBenchmark {
       jsonInDS(50 * 1000 * 1000, numIters)
       jsonInFile(50 * 1000 * 1000, numIters)
       datetimeBenchmark(rowsNum = 10 * 1000 * 1000, numIters)
+      // Benchmark pushdown filters that refer to top-level columns.
+      // TODO: Add benchmarks for filters with nested column attributes.
Review comment: I created the sub-task https://issues.apache.org/jira/browse/SPARK-32325
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor
AmplabJenkins removed a comment on pull request #29032: URL: https://github.com/apache/spark/pull/29032#issuecomment-658914488
[GitHub] [spark] AmplabJenkins commented on pull request #29115: [SPARK-32315][ML] Provide an explanation error message when calling require
AmplabJenkins commented on pull request #29115: URL: https://github.com/apache/spark/pull/29115#issuecomment-658914448
[GitHub] [spark] AmplabJenkins commented on pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor
AmplabJenkins commented on pull request #29032: URL: https://github.com/apache/spark/pull/29032#issuecomment-658914488
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29115: [SPARK-32315][ML] Provide an explanation error message when calling require
AmplabJenkins removed a comment on pull request #29115: URL: https://github.com/apache/spark/pull/29115#issuecomment-658914448
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29101: [WIP][SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions
AmplabJenkins removed a comment on pull request #29101: URL: https://github.com/apache/spark/pull/29101#issuecomment-658914370
[GitHub] [spark] AmplabJenkins commented on pull request #29126: [SPARK-32324][SQL]Fix error messages during using PIVOT and lateral view
AmplabJenkins commented on pull request #29126: URL: https://github.com/apache/spark/pull/29126#issuecomment-658914030 Can one of the admins verify this patch?
[GitHub] [spark] cloud-fan commented on a change in pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor
cloud-fan commented on a change in pull request #29032: URL: https://github.com/apache/spark/pull/29032#discussion_r455237329
## File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ##
@@ -715,7 +715,8 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
         accumUpdates: Array[(Long, Seq[AccumulatorV2[_, _]])],
         blockManagerId: BlockManagerId,
         executorUpdates: Map[(Int, Int), ExecutorMetrics]): Boolean = true
-    override def executorDecommission(executorId: String): Unit = {}
+    override def executorDecommission(executorId: String,
Review comment: ditto: indentation
## File path: core/src/test/scala/org/apache/spark/scheduler/ExternalClusterManagerSuite.scala ##
@@ -90,7 +90,8 @@ private class DummyTaskScheduler extends TaskScheduler {
   override def notifyPartitionCompletion(stageId: Int, partitionId: Int): Unit = {}
   override def setDAGScheduler(dagScheduler: DAGScheduler): Unit = {}
   override def defaultParallelism(): Int = 2
-  override def executorDecommission(executorId: String): Unit = {}
+  override def executorDecommission(executorId: String,
Review comment: ditto
[GitHub] [spark] AmplabJenkins commented on pull request #29101: [WIP][SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions
AmplabJenkins commented on pull request #29101: URL: https://github.com/apache/spark/pull/29101#issuecomment-658914370
[GitHub] [spark] cloud-fan commented on a change in pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor
cloud-fan commented on a change in pull request #29032: URL: https://github.com/apache/spark/pull/29032#discussion_r455237815
## File path: core/src/main/scala/org/apache/spark/scheduler/DecommissionInfo.scala ##
@@ -0,0 +1,27 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+/**
+ * Provides more detail about a decommissioning event.
+ * @param message Human readable reason for why the decommissioning is happening.
+ * @param isWorkerDecommissioned Whether the worker is being decommissioned too.
+ *                               Used to know if the shuffle data might be lost too.
+ */
+private[spark]
+case class DecommissionInfo(message: String, isWorkerDecommissioned: Boolean)
Review comment: so this PR is just a refactor and doesn't actually use the `isWorkerDecommissioned` flag?
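For readers following the question above, here is a minimal sketch of how a caller might eventually consume the `isWorkerDecommissioned` flag. It is not taken from the PR: the `DecommissionSketch` object and its two helpers are hypothetical, the case class is copied from the diff without the `private[spark]` modifier so that it compiles standalone, and the only rule it encodes is the one stated in the Scaladoc (shuffle data might be lost when the worker goes away too).

```scala
// Copied from the diff above, minus `private[spark]`, so the sketch compiles on its own.
final case class DecommissionInfo(message: String, isWorkerDecommissioned: Boolean)

// Hypothetical consumer of the flag; none of these names exist in Spark.
object DecommissionSketch {
  // Assumption (from the Scaladoc in the diff): shuffle data might be lost
  // only when the worker is decommissioned together with the executor.
  def shuffleDataMayBeLost(info: DecommissionInfo): Boolean =
    info.isWorkerDecommissioned

  // Builds a log-style description of a decommissioning event.
  def describe(executorId: String, info: DecommissionInfo): String = {
    val shuffleNote =
      if (shuffleDataMayBeLost(info)) "shuffle data may be lost"
      else "worker stays up"
    s"Executor $executorId decommissioned (${info.message}); $shuffleNote"
  }
}
```

A call such as `DecommissionSketch.describe("1", DecommissionInfo("spot kill", isWorkerDecommissioned = true))` would report that shuffle data may be lost.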
[GitHub] [spark] aokolnychyi commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes
aokolnychyi commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658932378 @dongjoon-hyun @viirya @hvanhovell @maropu, what do you think?
[GitHub] [spark] AmplabJenkins commented on pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command
AmplabJenkins commented on pull request #28840: URL: https://github.com/apache/spark/pull/28840#issuecomment-658940338
[GitHub] [spark] MaxGekk commented on a change in pull request #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource
MaxGekk commented on a change in pull request #27366: URL: https://github.com/apache/spark/pull/27366#discussion_r455259788
## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/StructFiltersSuite.scala ##
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.sources.{AlwaysFalse, AlwaysTrue, Filter}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.unsafe.types.UTF8String
+
+abstract class StructFiltersSuite extends SparkFunSuite {
+
+  def createFilters(filters: Seq[sources.Filter], schema: StructType): StructFilters
Review comment: You mix 2 things - scope and what should be implemented in child classes. `protected` doesn't indicate that a method must be implemented in a child class because it can have an implementation in the parent class.
> You had better change your point of view to become a committer.
Thank you, now I know what blocks me ;-)
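The distinction drawn in the review comment above (visibility versus the obligation to implement) can be shown with a small standalone sketch. The class and method names below are hypothetical and only loosely modelled on `StructFiltersSuite`; they use plain strings instead of Spark's `Filter` and `StructType` so that the example compiles without Spark on the classpath.

```scala
// Hypothetical base class, loosely modelled on the suite in the diff above.
abstract class FiltersSuiteSketch {
  // Abstract member: it has no body, so every concrete subclass must implement it.
  def createFilters(names: Seq[String]): Seq[String]

  // Protected member with a body: `protected` only restricts who may call it.
  // Subclasses may override it, but nothing forces them to.
  protected def normalize(name: String): String = name.trim.toLowerCase

  // Shared logic that relies on both members.
  def run(names: Seq[String]): Seq[String] = createFilters(names.map(normalize))
}

// Compiles although `normalize` is not overridden; omitting `createFilters`
// would be a compile error because the class would still be abstract.
class JsonFiltersSuiteSketch extends FiltersSuiteSketch {
  override def createFilters(names: Seq[String]): Seq[String] =
    names.map(n => s"json:$n")
}
```

In other words, a missing body (an abstract member) is what tells the compiler a child class has to provide the implementation; the access modifier is an independent choice.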
[GitHub] [spark] venkata91 commented on pull request #28287: [SPARK-31418][SCHEDULER] Request more executors in case of dynamic allocation is enabled and a task becomes unschedulable due to spark's bl
venkata91 commented on pull request #28287: URL: https://github.com/apache/spark/pull/28287#issuecomment-658957953
> yes we have a test in TaskSchedulerImplSuite that checks to make sure it aborted, but I don't think it covers when dynamic allocation on, so it doesn't hit your new code. So we would want to add a test where it can't acquire a new executor and aborts.
I think the new test I added is just a duplicate. Do you think it's better to add the config that enables dynamic allocation to the other test itself, in order to avoid the duplication?
[GitHub] [spark] dongjoon-hyun commented on pull request #29121: [SPARK-32319][PYSPARK] Remove unused imports
dongjoon-hyun commented on pull request #29121: URL: https://github.com/apache/spark/pull/29121#issuecomment-658958309 It would be great if you mention that in the PR title and PR description. Otherwise, the PR title is misleading. > By suppressing it,
[GitHub] [spark] MaxGekk commented on pull request #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource
MaxGekk commented on pull request #27366: URL: https://github.com/apache/spark/pull/27366#issuecomment-658966760 jenkins, retest this, please
[GitHub] [spark] dongjoon-hyun commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
dongjoon-hyun commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-658966540 Hi, @jiangxb1987. Could you explicitly ping whoever you have in mind, like I did at https://github.com/apache/spark/pull/28708#issuecomment-658965320?
> Please wait for a couple of days (maybe until the end of this week ?) to allow other committers to review and post +1, thanks!
Thanks!
[GitHub] [spark] tgravescs commented on pull request #28287: [SPARK-31418][SCHEDULER] Request more executors in case of dynamic allocation is enabled and a task becomes unschedulable due to spark's bl
tgravescs commented on pull request #28287: URL: https://github.com/apache/spark/pull/28287#issuecomment-658974291 You can make a common function that holds most of the code and is called from 2 separate tests. One test calls it with dynamic allocation on, the other with it off. That will reduce code duplication.
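The suggestion above (one shared function, two thin tests) might look roughly like the following sketch. Everything in it is an assumption made for illustration: `ToyScheduler` is a toy stand-in and not Spark's `TaskSchedulerImpl`, its abort rule is invented, and the test methods are plain functions using `assert` from `Predef` rather than real `TaskSchedulerImplSuite` cases.

```scala
object SharedTestBodySketch {
  // Toy stand-in for the scheduler under test (hypothetical, not Spark code).
  final class ToyScheduler(dynamicAllocationEnabled: Boolean, canAcquireExecutor: Boolean) {
    var executorRequests: Int = 0

    // Returns true when the task set is aborted. Assumed rule: with dynamic
    // allocation on, one more executor is requested first, and the abort is
    // avoided only if that request can be satisfied.
    def handleUnschedulableTask(): Boolean = {
      if (dynamicAllocationEnabled) executorRequests += 1
      !(dynamicAllocationEnabled && canAcquireExecutor)
    }
  }

  // Common body shared by both tests: no executor can be acquired,
  // so the task set should abort in either mode.
  def checkAbortsWhenNoExecutorAvailable(dynamicAllocationEnabled: Boolean): Unit = {
    val scheduler = new ToyScheduler(dynamicAllocationEnabled, canAcquireExecutor = false)
    assert(scheduler.handleUnschedulableTask(), "task set should abort")
    val expectedRequests = if (dynamicAllocationEnabled) 1 else 0
    assert(scheduler.executorRequests == expectedRequests, "unexpected executor request count")
  }

  // The two tests differ only in the flag they pass to the shared body.
  def testAbortWithDynamicAllocationOn(): Unit =
    checkAbortsWhenNoExecutorAvailable(dynamicAllocationEnabled = true)

  def testAbortWithDynamicAllocationOff(): Unit =
    checkAbortsWhenNoExecutorAvailable(dynamicAllocationEnabled = false)
}
```

Keeping the setup and assertions in one place means a later change to the abort behaviour is edited once, while both configurations stay covered.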
[GitHub] [spark] SparkQA commented on pull request #29123: [SPARK-32283][CORE] Kryo should support multiple user registrators
SparkQA commented on pull request #29123: URL: https://github.com/apache/spark/pull/29123#issuecomment-658983163
**[Test build #125880 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125880/testReport)** for PR 29123 at commit [`45d1e43`](https://github.com/apache/spark/commit/45d1e4341ecab8d5271e17f9ae13072c71c46e32).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #29123: [SPARK-32283][CORE] Kryo should support multiple user registrators
SparkQA removed a comment on pull request #29123: URL: https://github.com/apache/spark/pull/29123#issuecomment-658857296 **[Test build #125880 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125880/testReport)** for PR 29123 at commit [`45d1e43`](https://github.com/apache/spark/commit/45d1e4341ecab8d5271e17f9ae13072c71c46e32).
[GitHub] [spark] dongjoon-hyun commented on pull request #29111: [SPARK-29292][SQL][ML] Update rest of default modules (Hive, ML, etc) for Scala 2.13 compilation
dongjoon-hyun commented on pull request #29111: URL: https://github.com/apache/spark/pull/29111#issuecomment-658998288 Hi, @srowen. Your last commit passed the GitHub Action. Please see here.
- https://github.com/apache/spark/pull/29111/commits/6390b6c46f5bf35e0c92b140bfbe12f98c35cd8f
[GitHub] [spark] dongjoon-hyun commented on pull request #29111: [SPARK-29292][SQL][ML] Update rest of default modules (Hive, ML, etc) for Scala 2.13 compilation
dongjoon-hyun commented on pull request #29111: URL: https://github.com/apache/spark/pull/29111#issuecomment-658998506 Also, here. ![Screen Shot 2020-07-15 at 1 41 21 PM](https://user-images.githubusercontent.com/9700541/87593815-e4910e00-c6a0-11ea-9e09-1c8b68fc8ed2.png)
[GitHub] [spark] SparkQA commented on pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor
SparkQA commented on pull request #29032: URL: https://github.com/apache/spark/pull/29032#issuecomment-659020484 **[Test build #125909 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125909/testReport)** for PR 29032 at commit [`090eecd`](https://github.com/apache/spark/commit/090eecd7a9c0293aeb270f154d000da123e602aa).
[GitHub] [spark] AmplabJenkins commented on pull request #29127: [SPARK-32327][SQL] Introduce UnresolvedTableOrPermanentView for commands that support a table and permanent view, but not a temporary v
AmplabJenkins commented on pull request #29127: URL: https://github.com/apache/spark/pull/29127#issuecomment-659047837
[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
karuppayya commented on a change in pull request #28804: URL: https://github.com/apache/spark/pull/28804#discussion_r455402409
## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ##
@@ -2196,6 +2196,25 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sort/spill to disk during partial aggregation")
+      .booleanConf
+      .createWithDefault(true)
+
+  val SKIP_PARTIAL_AGGREGATE_THRESHOLD =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.threshold")
+      .internal()
+      .longConf
+      .createWithDefault(10)
Review comment: @cloud-fan we skip partial aggregation only when the aggregation was not able to cut down the number of records by 50% (defined by spark.sql.aggregate.partialaggregate.skip.ratio). In this case it will not kick in.
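The rule described in the review comment above can be sketched as a small predicate. This is an illustration under stated assumptions and not the PR's implementation: the function name and parameters are hypothetical, `minRows` assumes the `skip.threshold` setting is a minimum number of input rows to observe before deciding, and `ratio` assumes the `skip.ratio` setting is the largest acceptable fraction of output groups per input row (0.5 for the "50%" mentioned above).

```scala
object PartialAggregateSkipSketch {
  // Hypothetical decision helper; names and semantics are assumptions.
  // inputRows:    rows consumed by the partial aggregation so far
  // outputGroups: distinct groups produced from those rows
  // minRows:      assumed meaning of ...skip.threshold
  // ratio:        assumed meaning of ...skip.ratio
  def shouldSkipPartialAggregate(
      inputRows: Long,
      outputGroups: Long,
      minRows: Long,
      ratio: Double): Boolean = {
    // Too few rows seen to judge how well the aggregation reduces the data.
    if (inputRows <= 0 || inputRows < minRows) false
    // Skip only when the aggregation failed to cut the record count enough.
    else outputGroups.toDouble / inputRows > ratio
  }
}
```

Under these assumptions, 100 input rows reduced to 20 groups keeps partial aggregation enabled with a ratio of 0.5, while 100 rows reduced to only 90 groups would trigger the skip.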
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
AmplabJenkins removed a comment on pull request #28804: URL: https://github.com/apache/spark/pull/28804#issuecomment-659047829
[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
AmplabJenkins commented on pull request #28804: URL: https://github.com/apache/spark/pull/28804#issuecomment-659047829
[GitHub] [spark] SparkQA removed a comment on pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables
SparkQA removed a comment on pull request #29045: URL: https://github.com/apache/spark/pull/29045#issuecomment-658857875 **[Test build #125892 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125892/testReport)** for PR 29045 at commit [`cf68729`](https://github.com/apache/spark/commit/cf6872989fdcb5396357c0e4cd3b3529e1334e6a).
[GitHub] [spark] SparkQA commented on pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables
SparkQA commented on pull request #29045: URL: https://github.com/apache/spark/pull/29045#issuecomment-659047383
**[Test build #125892 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125892/testReport)** for PR 29045 at commit [`cf68729`](https://github.com/apache/spark/commit/cf6872989fdcb5396357c0e4cd3b3529e1334e6a).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables
AmplabJenkins removed a comment on pull request #29045: URL: https://github.com/apache/spark/pull/29045#issuecomment-658900416
[GitHub] [spark] huaxingao commented on pull request #29112: [SPARK-32310][ML][PySpark] ML params default value parity part 1
huaxingao commented on pull request #29112: URL: https://github.com/apache/spark/pull/29112#issuecomment-658901689 cc @srowen @viirya @zhengruifeng
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29015: [SPARK-32215] Expose a (protected) /workers/kill endpoint on the MasterWebUI
AmplabJenkins removed a comment on pull request #29015: URL: https://github.com/apache/spark/pull/29015#issuecomment-654419342 Can one of the admins verify this patch?
[GitHub] [spark] cloud-fan commented on pull request #29015: [SPARK-32215] Expose a (protected) /workers/kill endpoint on the MasterWebUI
cloud-fan commented on pull request #29015: URL: https://github.com/apache/spark/pull/29015#issuecomment-658915498 ok to test
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29101: [SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions
AmplabJenkins removed a comment on pull request #29101: URL: https://github.com/apache/spark/pull/29101#issuecomment-658936909
[GitHub] [spark] AmplabJenkins commented on pull request #29101: [SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions
AmplabJenkins commented on pull request #29101: URL: https://github.com/apache/spark/pull/29101#issuecomment-658936909
[GitHub] [spark] tgravescs commented on pull request #28287: [SPARK-31418][SCHEDULER] Request more executors in case of dynamic allocation is enabled and a task becomes unschedulable due to spark's bl
tgravescs commented on pull request #28287: URL: https://github.com/apache/spark/pull/28287#issuecomment-658951229 Yes, we have a test in `TaskSchedulerImplSuite` that checks that it aborted, but I don't think it covers the case where dynamic allocation is on, so it doesn't hit your new code. We would want to add a test where it can't acquire a new executor and aborts.
[GitHub] [spark] dongjoon-hyun commented on pull request #29125: [SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with overflowed value
dongjoon-hyun commented on pull request #29125: URL: https://github.com/apache/spark/pull/29125#issuecomment-658958999 Thank you for pinging me, @cloud-fan .
[GitHub] [spark] dongjoon-hyun commented on pull request #29125: [SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with overflowed value
dongjoon-hyun commented on pull request #29125: URL: https://github.com/apache/spark/pull/29125#issuecomment-658959140 Retest this please.
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29121: [SPARK-32319][PYSPARK] Remove unused imports
dongjoon-hyun edited a comment on pull request #29121: URL: https://github.com/apache/spark/pull/29121#issuecomment-658958309 It would be great if you mention `suppressing` in the PR title and PR description. Otherwise, the PR title is misleading.
[GitHub] [spark] venkata91 commented on pull request #28287: [SPARK-31418][SCHEDULER] Request more executors in case of dynamic allocation is enabled and a task becomes unschedulable due to spark's bl
venkata91 commented on pull request #28287: URL: https://github.com/apache/spark/pull/28287#issuecomment-658977388
> You can make a common function that holds most of the code and is called from two separate tests: one test runs with dynamic allocation on, the other with it off. That will reduce code duplication.

Never mind, I made some changes to the test so that it goes to the `None` block, where we check whether dynamic allocation is enabled and request accordingly.
[GitHub] [spark] SparkQA commented on pull request #29120: [SPARK-32291][SQL] COALESCE should not reduce the child parallelism if it contains a Join
SparkQA commented on pull request #29120: URL: https://github.com/apache/spark/pull/29120#issuecomment-658987615 **[Test build #125881 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125881/testReport)** for PR 29120 at commit [`e56f5d4`](https://github.com/apache/spark/commit/e56f5d4936fc8105d672fea5fe8ae441b7de0f2b).
* This patch **fails Spark unit tests**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #29120: [SPARK-32291][SQL] COALESCE should not reduce the child parallelism if it contains a Join
SparkQA removed a comment on pull request #29120: URL: https://github.com/apache/spark/pull/29120#issuecomment-658857353 **[Test build #125881 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125881/testReport)** for PR 29120 at commit [`e56f5d4`](https://github.com/apache/spark/commit/e56f5d4936fc8105d672fea5fe8ae441b7de0f2b).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29021: [WIP][SPARK-32201][SQL] More general skew join pattern matching
AmplabJenkins removed a comment on pull request #29021: URL: https://github.com/apache/spark/pull/29021#issuecomment-658995807 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125893/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28977: [WIP] Add all hive.execution suite in the parallel test group
AmplabJenkins removed a comment on pull request #28977: URL: https://github.com/apache/spark/pull/28977#issuecomment-659007597 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125896/ Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #29090: [SPARK-32293] Fix inconsistency between Spark memory configs and JVM option
SparkQA commented on pull request #29090: URL: https://github.com/apache/spark/pull/29090#issuecomment-659008171 **[Test build #125885 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125885/testReport)** for PR 29090 at commit [`cc495c1`](https://github.com/apache/spark/commit/cc495c1c45ac0648156b662fdc308287c79f3fdc).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29111: [SPARK-29292][SQL][ML] Update rest of default modules (Hive, ML, etc) for Scala 2.13 compilation
AmplabJenkins removed a comment on pull request #29111: URL: https://github.com/apache/spark/pull/29111#issuecomment-659007765 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
AmplabJenkins removed a comment on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-658910627 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/30513/ Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
SparkQA commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-659007979 **[Test build #125905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125905/testReport)** for PR 28708 at commit [`eb43f20`](https://github.com/apache/spark/commit/eb43f2055a38067c63f925526f91d435d7c90aaa).
[GitHub] [spark] viirya commented on pull request #29125: [SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with overflowed value
viirya commented on pull request #29125: URL: https://github.com/apache/spark/pull/29125#issuecomment-659008506 Jenkins does not seem to be working on this, but GitHub Actions passed.
[GitHub] [spark] tgravescs commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
tgravescs commented on a change in pull request #28708: URL: https://github.com/apache/spark/pull/28708#discussion_r455369019 ## File path: core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala ## @@ -0,0 +1,330 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.storage + +import java.util.concurrent.ExecutorService + +import scala.collection.JavaConverters._ +import scala.collection.mutable +import scala.util.control.NonFatal + +import org.apache.spark._ +import org.apache.spark.internal.Logging +import org.apache.spark.internal.config +import org.apache.spark.shuffle.{MigratableResolver, ShuffleBlockInfo} +import org.apache.spark.storage.BlockManagerMessages.ReplicateBlock +import org.apache.spark.util.ThreadUtils + +/** + * Class to handle block manager decommissioning retries. 
+ * It creates a Thread to retry offloading all RDD cache and Shuffle blocks + */ +private[storage] class BlockManagerDecommissioner( + conf: SparkConf,
Review comment: nit: these should be indented by 4 spaces.
## File path: core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala ## @@ -0,0 +1,330 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.storage + +import java.util.concurrent.ExecutorService + +import scala.collection.JavaConverters._ +import scala.collection.mutable +import scala.util.control.NonFatal + +import org.apache.spark._ +import org.apache.spark.internal.Logging +import org.apache.spark.internal.config +import org.apache.spark.shuffle.{MigratableResolver, ShuffleBlockInfo} +import org.apache.spark.storage.BlockManagerMessages.ReplicateBlock +import org.apache.spark.util.ThreadUtils + +/** + * Class to handle block manager decommissioning retries. + * It creates a Thread to retry offloading all RDD cache and Shuffle blocks
Review comment: This creates a thread each for RDD and shuffle block migration, correct? And possibly another pool for the actual migration. I wonder if we can just clarify or generalize this comment.
[GitHub] [spark] imback82 opened a new pull request #29127: [SPARK-32327][SQL] Introduce UnresolvedTableOrPermanentView for commands that support a table and permanent view, but not a temporary view
imback82 opened a new pull request #29127: URL: https://github.com/apache/spark/pull/29127

### What changes were proposed in this pull request?

This PR proposes to introduce `UnresolvedTableOrPermanentView` for commands that support a table and a permanent view, but not a temporary view, as discussed here: https://github.com/apache/spark/pull/28375#discussion_r416343587. This new logical plan is now used for `SHOW TBLPROPERTIES`.

### Why are the changes needed?

There are commands that support both a table and a permanent view, but not a temporary view. Using `UnresolvedTableOrPermanentView` makes it possible for the analyzer to resolve only the relation that's needed for those commands.

### Does this PR introduce _any_ user-facing change?

Yes.

Before:
```
scala> sql("CREATE TEMPORARY VIEW tv TBLPROPERTIES('p1'='v1') AS SELECT 1 AS c1")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("SHOW TBLPROPERTIES tv").show
+---+-----+
|key|value|
+---+-----+
+---+-----+
```

After:
```
scala> sql("CREATE TEMPORARY VIEW tv TBLPROPERTIES('p1'='v1') AS SELECT 1 AS c1")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("SHOW TBLPROPERTIES tv").show
org.apache.spark.sql.AnalysisException: tv is a temp view, not a table or permanent view.; line 1 pos 0
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$42(Analyzer.scala:863)
  at scala.Option.foreach(Option.scala:407)
  ...
```

### How was this patch tested?

Updated existing tests.
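The before/after behavior described in this PR can be illustrated with a small toy model. This is only a sketch under stated assumptions: `Relation`, `Table`, `PermanentView`, `TempView`, `resolveTableOrPermanentView`, and the single-map catalog are hypothetical names made up for illustration and are not Spark's analyzer classes or its actual lookup order.

```scala
// Hypothetical toy model; these types are NOT Spark's analyzer classes.
object TempViewResolutionSketch {
  sealed trait Relation
  final case class Table(name: String) extends Relation
  final case class PermanentView(name: String) extends Relation
  final case class TempView(name: String) extends Relation

  // Resolve a name for a command that accepts tables and permanent views only.
  // A temp view is reported as an error instead of being silently accepted.
  def resolveTableOrPermanentView(
      name: String,
      catalog: Map[String, Relation]): Either[String, Relation] =
    catalog.get(name) match {
      case Some(_: TempView) =>
        Left(s"$name is a temp view, not a table or permanent view.")
      case Some(relation) => Right(relation)
      case None => Left(s"Table or view not found: $name")
    }

  def main(args: Array[String]): Unit = {
    // Assumed catalog contents, for illustration only.
    val catalog = Map[String, Relation]("t" -> Table("t"), "tv" -> TempView("tv"))
    println(resolveTableOrPermanentView("t", catalog))
    println(resolveTableOrPermanentView("tv", catalog))
  }
}
```

Returning `Either` keeps the sketch free of exceptions; the PR itself reports the failure through an `AnalysisException`, as the "After" output shows.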
[GitHub] [spark] GuoPhilipse commented on a change in pull request #29056: [SPARK-31753][SQL][DOCS] Add missing keywords in the SQL docs
GuoPhilipse commented on a change in pull request #29056: URL: https://github.com/apache/spark/pull/29056#discussion_r455220561 ## File path: docs/sql-ref-syntax-qry-select-groupby.md ## @@ -38,6 +38,8 @@ GROUP BY GROUPING SETS (grouping_set [ , ...]) While aggregate functions are defined as ```sql aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_expression ) ] + +[ FIRST | LAST ] ( expression [ IGNORE NULLS ] ) ] Review comment: I just tried it and it does not work; aggregate functions do not even support `FILTER` in v2.4.5. I will test other versions tomorrow.
[GitHub] [spark] srowen commented on pull request #29112: [SPARK-32310][ML][PySpark] ML params default value parity part 1
srowen commented on pull request #29112: URL: https://github.com/apache/spark/pull/29112#issuecomment-658908251 So in theory this shouldn't change behavior, or if it does, it's fixing an incompatibility that's likely more a bug than anything, right?
[GitHub] [spark] cloud-fan commented on a change in pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor
cloud-fan commented on a change in pull request #29032: URL: https://github.com/apache/spark/pull/29032#discussion_r455235801 ## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ## @@ -912,7 +912,8 @@ private[spark] class TaskSchedulerImpl( } } - override def executorDecommission(executorId: String): Unit = { + override def executorDecommission(executorId: String,
Review comment: nit: code style should be
```
override def ...(
    para1: T,
    para2: T): ...
```
4 space indentation for the parameter list.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource
dongjoon-hyun commented on a change in pull request #27366: URL: https://github.com/apache/spark/pull/27366#discussion_r455251187 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonFilters.scala ## @@ -0,0 +1,157 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.json + +import org.apache.spark.sql.catalyst.{InternalRow, StructFilters} +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.sources +import org.apache.spark.sql.types.StructType + +/** + * The class provides API for applying pushed down source filters to rows with + * a struct schema parsed from JSON records. The class should be used in this way: + * 1. Before processing of the next row, `JacksonParser` (parser for short) resets the internal + *state of `JsonFilters` by calling the `reset()` method. + * 2. The parser reads JSON fields one-by-one in streaming fashion. It converts an incoming + *field value to the desired type from the schema. After that, it sets the value to an instance + *of `InternalRow` at the position according to the schema. 
Order of parsed JSON fields can + *be different from the order in the schema. + * 3. Per every JSON field of the top-level JSON object, the parser calls `skipRow` by passing + *an `InternalRow` in which some of fields can be already set, and the position of the JSON + *field according to the schema. + *3.1 `skipRow` finds a group of predicates that refers to this JSON field. + *3.2 Per each predicate from the group, `skipRow` decrements its reference counter. + *3.2.1 If predicate reference counter becomes 0, it means that all predicate attributes have + * been already set in the internal row, and the predicate can be applied to it. `skipRow` + * invokes the predicate for the row. + *3.3 `skipRow` applies predicates until one of them returns `false`. In that case, the method + *returns `true` to the parser. + *3.4 If all predicates with zero reference counter return `true`, the final result of + *the method is `false` which tells the parser to not skip the row. + * 4. If the parser gets `true` from `JsonFilters.skipRow`, it must not call the method anymore + *for this internal row, and should go the step 1. + * + * `JsonFilters` assumes that: + * - `reset()` is called before any `skipRow()` calls for new row. + * - `skipRow()` can be called for any valid index of the struct fields, + * and in any order. + * - After `skipRow()` returns `true`, the internal state of `JsonFilters` can be inconsistent, + * so, `skipRow()` must not be called for the current row anymore without `reset()`. + * + * @param pushedFilters The pushed down source filters. The filters should refer to + * the fields of the provided schema. + * @param schema The required schema of records from datasource files. + */ +class JsonFilters(pushedFilters: Seq[sources.Filter], schema: StructType) + extends StructFilters(pushedFilters, schema) { + + /** + * Stateful JSON predicate that keeps track of its dependent references in the + * current row via `refCount`. 
+ * + * @param predicate The predicate compiled from pushed down source filters. + * @param totalRefs The total amount of all filters references which the predicate + * compiled from. + */ + case class JsonPredicate(predicate: BasePredicate, totalRefs: Int) { +// The current number of predicate references in the row that have been not set yet. +// When `refCount` reaches zero, the predicate has all dependencies are set, and can +// be applied to the row. +var refCount: Int = totalRefs + +def reset(): Unit = { + refCount = totalRefs +} + } + + // Predicates compiled from the pushed down filters. The predicates are grouped by their + // attributes. The i-th group contains predicates that refer to the i-th field of the given + // schema. A predicates can be placed to many groups if it has many attributes. For example: + // schema: i INTEGER, s STRING + //
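The reference-counting protocol described in the `JsonFilters` doc comment above (`reset()`, then `skipRow` once per parsed field) can be sketched in a few lines. This is a simplified stand-alone model, not the code from the PR: `CountedPredicate`, `RowFilter`, and the use of `Array[Any]` as a row are assumptions made for illustration, and no Spark classes are involved.

```scala
// Hypothetical, simplified model of the protocol; not Spark's JsonFilters.
object RefCountFilterSketch {
  type Row = Array[Any]

  // A predicate plus the schema positions (refs) it reads.
  final class CountedPredicate(val refs: Set[Int], val eval: Row => Boolean) {
    private var refCount: Int = refs.size
    def reset(): Unit = refCount = refs.size
    // Records that one referenced field was set; true once all of them are set.
    def fieldSet(): Boolean = { refCount -= 1; refCount == 0 }
  }

  final class RowFilter(numFields: Int, predicates: Seq[CountedPredicate]) {
    // groups(i) holds the predicates that refer to the i-th field.
    private val groups: Array[Seq[CountedPredicate]] =
      Array.tabulate(numFields)(i => predicates.filter(_.refs.contains(i)))

    def reset(): Unit = predicates.foreach(_.reset())

    // Call after field `index` was written into `row`; true means "skip this row".
    // Stops at the first failing predicate, so state is inconsistent afterwards
    // until reset() is called, as the doc comment warns.
    def skipRow(row: Row, index: Int): Boolean =
      groups(index).exists(p => p.fieldSet() && !p.eval(row))
  }

  def main(args: Array[String]): Unit = {
    // schema: i INTEGER (position 0), s STRING (position 1)
    val iPositive = new CountedPredicate(Set(0), r => r(0).asInstanceOf[Int] > 0)
    val sLongerThanI = new CountedPredicate(
      Set(0, 1), r => r(1).asInstanceOf[String].length > r(0).asInstanceOf[Int])
    val filter = new RowFilter(2, Seq(iPositive, sLongerThanI))

    val row = new Array[Any](2)
    filter.reset()
    row(1) = "abc"                    // fields may arrive in any order
    println(filter.skipRow(row, 1))   // false: sLongerThanI still waits for `i`
    row(0) = 5
    println(filter.skipRow(row, 0))   // true: "abc".length > 5 does not hold
  }
}
```

A predicate that reads several fields appears in several groups, which is why it is only evaluated once its counter reaches zero, whichever field arrives last.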
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29115: [SPARK-32315][ML] Provide an explanation error message when calling require
AmplabJenkins removed a comment on pull request #29115: URL: https://github.com/apache/spark/pull/29115#issuecomment-658929884
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28287: [SPARK-31418][SCHEDULER] Request more executors in case of dynamic allocation is enabled and a task becomes unschedulable due t
AmplabJenkins removed a comment on pull request #28287: URL: https://github.com/apache/spark/pull/28287#issuecomment-658960595
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29125: [SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with overflowed value
AmplabJenkins removed a comment on pull request #29125: URL: https://github.com/apache/spark/pull/29125#issuecomment-658960503
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29124: [WIP][SPARK-31168][BUILD] Upgrade Scala to 2.12.12
dongjoon-hyun edited a comment on pull request #29124: URL: https://github.com/apache/spark/pull/29124#issuecomment-658956530 The failure looks consistent. Could you take a look at that, @wangyum ?
```
[info] org.apache.spark.serializer.KryoSerializerSuite *** ABORTED *** (324 milliseconds)
[info]   java.lang.NoSuchFieldError: numNonEmptyBlocks
[info]   at org.apache.spark.scheduler.HighlyCompressedMapStatus.<init>(MapStatus.scala:174)
```
That might be another Scala bug.
[GitHub] [spark] AmplabJenkins commented on pull request #29125: [SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with overflowed value
AmplabJenkins commented on pull request #29125: URL: https://github.com/apache/spark/pull/29125#issuecomment-658960503
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29125: [SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with overflowed value
AmplabJenkins removed a comment on pull request #29125: URL: https://github.com/apache/spark/pull/29125#issuecomment-658896338
[GitHub] [spark] Fokko edited a comment on pull request #29121: [SPARK-32319][PYSPARK] Remove unused imports
Fokko edited a comment on pull request #29121: URL: https://github.com/apache/spark/pull/29121#issuecomment-658967694 Good point @dongjoon-hyun, I was focusing on getting the CI green again. I've updated the PR description and title. On rereading it, the title is technically correct: if we suppress the error, the import serves a purpose. Feel free to update it if there is something that covers the content better in your opinion.
[GitHub] [spark] SparkQA removed a comment on pull request #29124: [WIP][SPARK-31168][BUILD] Upgrade Scala to 2.12.12
SparkQA removed a comment on pull request #29124: URL: https://github.com/apache/spark/pull/29124#issuecomment-658857295 **[Test build #125879 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125879/testReport)** for PR 29124 at commit [`3adc82a`](https://github.com/apache/spark/commit/3adc82a2c4f9dc4f4ae418efba885ad713d8ee26).
[GitHub] [spark] SparkQA commented on pull request #29124: [WIP][SPARK-31168][BUILD] Upgrade Scala to 2.12.12
SparkQA commented on pull request #29124: URL: https://github.com/apache/spark/pull/29124#issuecomment-658988384 **[Test build #125879 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125879/testReport)** for PR 29124 at commit [`3adc82a`](https://github.com/apache/spark/commit/3adc82a2c4f9dc4f4ae418efba885ad713d8ee26).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] holdenk commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
holdenk commented on a change in pull request #28708:
URL: https://github.com/apache/spark/pull/28708#discussion_r455337945

## File path: core/src/main/scala/org/apache/spark/internal/config/package.scala ##
@@ -420,6 +420,29 @@ package object config {
       .booleanConf
       .createWithDefault(false)
+  private[spark] val STORAGE_DECOMMISSION_SHUFFLE_BLOCKS_ENABLED =
+    ConfigBuilder("spark.storage.decommission.shuffleBlocks.enabled")

Review comment:
I was planning on saving that for once we've agreed it's ready for general usage. I know the SPIP is approved, but I still view this as more of a developer feature (e.g. one we would expect a cloud vendor to build on top of) than a feature that is ready for end users. What do you think?
[GitHub] [spark] SparkQA removed a comment on pull request #28977: [WIP] Add all hive.execution suite in the parallel test group
SparkQA removed a comment on pull request #28977: URL: https://github.com/apache/spark/pull/28977#issuecomment-658874516 **[Test build #125896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125896/testReport)** for PR 28977 at commit [`9600708`](https://github.com/apache/spark/commit/96007086d18db5838fb57e7cd298709f26f1f088).
[GitHub] [spark] SparkQA commented on pull request #28977: [WIP] Add all hive.execution suite in the parallel test group
SparkQA commented on pull request #28977: URL: https://github.com/apache/spark/pull/28977#issuecomment-659006088 **[Test build #125896 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125896/testReport)** for PR 28977 at commit [`9600708`](https://github.com/apache/spark/commit/96007086d18db5838fb57e7cd298709f26f1f088). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #29114: [SPARK-32094][PYTHON] Update cloudpickle to v1.5.0
SparkQA commented on pull request #29114: URL: https://github.com/apache/spark/pull/29114#issuecomment-659014189 **[Test build #125883 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125883/testReport)** for PR 29114 at commit [`465fd8a`](https://github.com/apache/spark/commit/465fd8a5f4773c3fee69df9c5cf8d3ad57160d03). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #29114: [SPARK-32094][PYTHON] Update cloudpickle to v1.5.0
SparkQA removed a comment on pull request #29114: URL: https://github.com/apache/spark/pull/29114#issuecomment-658857494 **[Test build #125883 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125883/testReport)** for PR 29114 at commit [`465fd8a`](https://github.com/apache/spark/commit/465fd8a5f4773c3fee69df9c5cf8d3ad57160d03).
[GitHub] [spark] tgravescs commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
tgravescs commented on a change in pull request #28708:
URL: https://github.com/apache/spark/pull/28708#discussion_r455352852

## File path: core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala ##
@@ -44,9 +47,9 @@ import org.apache.spark.util.Utils
 // org.apache.spark.network.shuffle.ExternalShuffleBlockResolver#getSortBasedShuffleBlockData().
 private[spark] class IndexShuffleBlockResolver(
     conf: SparkConf,
-    _blockManager: BlockManager = null)
+    var _blockManager: BlockManager = null)

Review comment:
Is this a `var` for testing?
[GitHub] [spark] tgravescs commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
tgravescs commented on a change in pull request #28708:
URL: https://github.com/apache/spark/pull/28708#discussion_r455388677

## File path: core/src/main/scala/org/apache/spark/internal/config/package.scala ##
@@ -420,6 +420,29 @@ package object config {
       .booleanConf
       .createWithDefault(false)
+  private[spark] val STORAGE_DECOMMISSION_SHUFFLE_BLOCKS_ENABLED =
+    ConfigBuilder("spark.storage.decommission.shuffleBlocks.enabled")

Review comment:
OK, that is fine with me. I just wanted to make sure we had thought about it.
[GitHub] [spark] AmplabJenkins commented on pull request #28287: [SPARK-31418][SCHEDULER] Request more executors in case of dynamic allocation is enabled and a task becomes unschedulable due to spark'
AmplabJenkins commented on pull request #28287: URL: https://github.com/apache/spark/pull/28287#issuecomment-659036401
[GitHub] [spark] karuppayya closed pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
karuppayya closed pull request #28804: URL: https://github.com/apache/spark/pull/28804
[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r455403785

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ##
@@ -2196,6 +2196,25 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sort/spill to disk during partial aggregation")
+      .booleanConf
+      .createWithDefault(true)
+
+  val SKIP_PARTIAL_AGGREGATE_THRESHOLD =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.threshold")
+      .internal()
+      .longConf
+      .createWithDefault(10)
+
+  val SKIP_PARTIAL_AGGREGATE_RATIO =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.ratio")
+      .internal()
+      .doubleConf
+      .createWithDefault(0.5)

Review comment:
@maropu I have borrowed this heuristic from Hive. We can merge them into one. Any suggestions here?
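The diff above only declares the three settings; it does not show how the threshold and the ratio are combined. The following is a minimal sketch of one plausible reading, assuming a Hive-style check: once at least `threshold` input rows have been seen, partial aggregation is skipped when the number of distinct groups divided by the number of rows reaches `ratio`. The names `SkipPartialAggregateSketch`, `SkipConf`, and `shouldSkip` are hypothetical and are not taken from the PR.

```scala
// Hypothetical sketch: this is NOT the code from the PR, only an illustration
// of how an enabled flag, a row threshold, and a ratio could work together.
object SkipPartialAggregateSketch {

  // Defaults mirror the values declared in the diff (true, 10, 0.5).
  final case class SkipConf(
      enabled: Boolean = true,
      threshold: Long = 10L,
      ratio: Double = 0.5)

  /**
   * Assumed rule: skip partial aggregation when the feature is enabled,
   * enough rows have been observed, and the observed groups-to-rows ratio
   * is at least `conf.ratio` (i.e. aggregation barely reduces the data).
   */
  def shouldSkip(conf: SkipConf, rowsSeen: Long, distinctGroups: Long): Boolean =
    conf.enabled &&
      rowsSeen > 0 &&
      rowsSeen >= conf.threshold &&
      distinctGroups.toDouble / rowsSeen >= conf.ratio
}
```

Under this assumed rule, 95 groups in 100 rows would trigger the skip, while 5 groups in 100 rows would not, because partial aggregation still shrinks the data substantially in the second case.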
[GitHub] [spark] karuppayya opened a new pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
karuppayya opened a new pull request #28804:
URL: https://github.com/apache/spark/pull/28804

### What changes were proposed in this pull request?
In HashAggregation, a partial aggregation (update) is followed by a final aggregation (merge).

During partial aggregation, we sort and spill to disk every time the fast map (when enabled) and UnsafeFixedWidthAggregationMap are exhausted.

**When the cardinality of the grouping column is close to the total number of records being processed, sorting the data and spilling it to disk is not required, since it is effectively a no-op, and the rows can be used directly in the final aggregation.**

A user who is aware of the nature of the data currently has no way to disable this sort/spill operation.

This is similar to the following issues in Hive:
https://issues.apache.org/jira/browse/HIVE-223
https://issues.apache.org/jira/browse/HIVE-291

This PR adds the ability to disable sort/spill during partial aggregation.

### Benchmark
spark.executor.memory = 12G

Init code
```
// init code
case class Data(name: String, value1: String, value2: String, value3: Long, random: Int)
val numRecords = Seq(6000)
val tblName = "tbl"
```
Generate data
```
// init code
case class Data(name: String, value1: String, value2: String, value3: Long, random: Int)
val numRecords = Seq(3000, 6000)
val basePath = "s3://qubole-spar/karuppayya/SPAR-4477/benchmark/"
val rand = scala.util.Random
// write
numRecords.foreach { recordCount =>
  val dataLocation = s"$basePath/$recordCount"
  val dataDF = spark.range(recordCount).map { x =>
    if (x < 10) Data(s"name1", s"value1", s"value1", 10, rand.nextInt(100))
    else Data(s"name$x", s"value$x", s"value$x", 1, rand.nextInt(100))
  }
  // creating data to be processed by one task (also gzip-ing to ensure Spark doesn't
  // create multiple splits)
  val randomDF = dataDF.orderBy("random")
  randomDF.drop("random").repartition(1)
    .write
    .mode("overwrite")
    .option("compression", "gzip")
    .parquet(dataLocation)
}
```
Query
```
val query =
  s"""
     |SELECT name, value1, value2, SUM(value3) s
     |FROM $tblName
     |GROUP BY name, value1, value2
     |"""
```
Benchmark code
```
  .add(StructField("name", StringType))
  .add(StructField("value1", StringType))
  .add(StructField("value2", StringType))
  .add(StructField("value3", LongType))
val query =
  """
    |SELECT name, value1, value2, SUM(value3) s
    |FROM tbl
    |GROUP BY name, value1, value2
    |"""
case class Metric(recordCount: Long, partialAggregateEnabled: Boolean, timeTaken: Long)
val metrics = Seq(true, false).flatMap { enabled =>
  sql(s"set spark.sql.aggregate.partialaggregate.skip.enabled=$enabled").collect
  numRecords.map { recordCount =>
    import java.util.concurrent.TimeUnit.NANOSECONDS
    val dataLocation = s"$basePath/$recordCount"
    spark.read
      .option("inferTimestamp", "false")
      .schema(userSpecifiedSchema)
      .json(dataLocation)
      .createOrReplaceTempView("tbl")
    val start = System.nanoTime()
    spark.sql(query).filter("s > 10").collect
    val end = System.nanoTime()
    val diff = end - start
    Metric(recordCount, enabled, NANOSECONDS.toMillis(diff))
  }
}
```

### Results
```
val df = metrics.toDF
df.createOrReplaceTempView("a")
val df = sql("select * from a order by recordcount desc, partialAggregateEnabled")
df.show()

scala> df.show
+-----------+-----------------------+---------+
|recordCount|partialAggregateEnabled|timeTaken|
+-----------+-----------------------+---------+
|       9000|                  false|   593844|
|       9000|                   true|   412958|
|       6000|                  false|   377054|
|       6000|                   true|   276363|
```

### Percent improvement:
9000 → 30.46%, 6000 → 26.70%

### Why are the changes needed?
This change can improve query performance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
This patch was tested manually.
[GitHub] [spark] karuppayya commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
karuppayya commented on pull request #28804: URL: https://github.com/apache/spark/pull/28804#issuecomment-659049598 Updated the description with the benchmarks, after the latest changes.
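The percent-improvement figures quoted in the PR description can be reproduced from the `timeTaken` column of its results table. The snippet below is a small worked calculation, assuming "improvement" means the relative reduction in elapsed time between the run with the flag set to `false` and the run with it set to `true`; the object and method names are illustrative only.

```scala
// Worked arithmetic for the benchmark table in the PR description.
// `PercentImprovement` and `percent` are hypothetical names for this sketch.
object PercentImprovement {

  /** Relative reduction in elapsed time, expressed as a percentage. */
  def percent(baselineMs: Long, improvedMs: Long): Double =
    (baselineMs - improvedMs).toDouble / baselineMs * 100.0

  def main(args: Array[String]): Unit = {
    // (recordCount, timeTaken with flag = false, timeTaken with flag = true)
    val results = Seq(
      (9000L, 593844L, 412958L),
      (6000L, 377054L, 276363L))

    results.foreach { case (recordCount, disabledMs, enabledMs) =>
      val p = percent(disabledMs, enabledMs)
      println(s"$recordCount rows: ${math.round(p * 100) / 100.0} percent faster")
    }
  }
}
```

For the first row, (593844 - 412958) / 593844 is about 0.3046, and for the second, (377054 - 276363) / 377054 is about 0.2670, which agrees with the 30.46% and 26.70% stated in the description.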
[GitHub] [spark] SparkQA commented on pull request #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource
SparkQA commented on pull request #27366: URL: https://github.com/apache/spark/pull/27366#issuecomment-659048554 **[Test build #125912 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125912/testReport)** for PR 27366 at commit [`fc725bc`](https://github.com/apache/spark/commit/fc725bc8def91f175f84eb1244386cd9d6f52fca).
[GitHub] [spark] cloud-fan opened a new pull request #29125: [SPARK-32018][SQL] UnsafeRow.setDecimal should set null with overflowed value
cloud-fan opened a new pull request #29125: URL: https://github.com/apache/spark/pull/29125 partially backport https://github.com/apache/spark/pull/29026
[GitHub] [spark] MaxGekk commented on a change in pull request #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource
MaxGekk commented on a change in pull request #27366:
URL: https://github.com/apache/spark/pull/27366#discussion_r455208221

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonFilters.scala ##
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.json
+
+import org.apache.spark.sql.catalyst.{InternalRow, StructFilters}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.types.StructType
+
+/**
+ * The class provides API for applying pushed down source filters to rows with
+ * a struct schema parsed from JSON records. The class should be used in this way:
+ * 1. Before processing of the next row, `JacksonParser` (parser for short) resets the internal
+ *    state of `JsonFilters` by calling the `reset()` method.
+ * 2. The parser reads JSON fields one-by-one in streaming fashion. It converts an incoming
+ *    field value to the desired type from the schema. After that, it sets the value to an instance
+ *    of `InternalRow` at the position according to the schema. Order of parsed JSON fields can
+ *    be different from the order in the schema.
+ * 3. Per every JSON field of the top-level JSON object, the parser calls `skipRow` by passing
+ *    an `InternalRow` in which some of fields can be already set, and the position of the JSON
+ *    field according to the schema.
+ *    3.1 `skipRow` finds a group of predicates that refers to this JSON field.
+ *    3.2 Per each predicate from the group, `skipRow` decrements its reference counter.
+ *    3.2.1 If predicate reference counter becomes 0, it means that all predicate attributes have
+ *          been already set in the internal row, and the predicate can be applied to it. `skipRow`
+ *          invokes the predicate for the row.
+ *    3.3 `skipRow` applies predicates until one of them returns `false`. In that case, the method
+ *        returns `true` to the parser.
+ *    3.4 If all predicates with zero reference counter return `true`, the final result of
+ *        the method is `false` which tells the parser to not skip the row.
+ * 4. If the parser gets `true` from `JsonFilters.skipRow`, it must not call the method anymore
+ *    for this internal row, and should go the step 1.
+ *
+ * `JsonFilters` assumes that:
+ *   - `reset()` is called before any `skipRow()` calls for new row.
+ *   - `skipRow()` can be called for any valid index of the struct fields,
+ *     and in any order.
+ *   - After `skipRow()` returns `true`, the internal state of `JsonFilters` can be inconsistent,
+ *     so, `skipRow()` must not be called for the current row anymore without `reset()`.

Review comment:
Actually, only the first one is applicable to `StructFilters` in general. The two other assumptions are specific to `JsonFilters`.
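The reference-counting protocol described in the Scaladoc above can be modelled in a few lines of plain Scala. This is a simplified sketch, not the `JsonFilters` implementation: a row is just an `Array[Any]`, and `Pred`, `RefCountingFilters`, and `RefCountingFiltersSketch` are hypothetical names that do not exist in Spark. It assumes each field position is reported at most once per row.

```scala
// Simplified, hypothetical model of the protocol in the Scaladoc above.
// None of these types are Spark classes.
object RefCountingFiltersSketch {
  type Row = Array[Any]

  /** A predicate over a row, plus the schema positions it reads. */
  final class Pred(val refs: Set[Int], val eval: Row => Boolean) {
    // Number of referenced fields that have not been set yet.
    var remaining: Int = refs.size
    def reset(): Unit = { remaining = refs.size }
  }

  final class RefCountingFilters(numFields: Int, preds: Seq[Pred]) {
    // Step 3.1: for each field position, the predicates that refer to it.
    private val byField: Array[Seq[Pred]] =
      Array.tabulate(numFields)(i => preds.filter(_.refs.contains(i)))

    /** Step 1: restore every counter before a new row is parsed. */
    def reset(): Unit = preds.foreach(_.reset())

    /**
     * Steps 3.2 to 3.4: called after the field at `index` was set in `row`.
     * Returns true if the row can be skipped. `exists` stops at the first
     * failing predicate, so counters may be left half-updated, which is why
     * `reset()` must be called before the next row.
     */
    def skipRow(row: Row, index: Int): Boolean =
      byField(index).exists { p =>
        p.remaining -= 1                  // 3.2: one more attribute is available
        p.remaining == 0 && !p.eval(row)  // 3.2.1 and 3.3: apply when fully bound
      }
  }
}
```

A predicate such as `row(0) < row(1)` is evaluated only on the second of its two fields to arrive, regardless of arrival order, which is the property that lets the parser reject a row before the remaining fields are converted.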
[GitHub] [spark] cloud-fan commented on pull request #29125: [SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with overflowed value
cloud-fan commented on pull request #29125: URL: https://github.com/apache/spark/pull/29125#issuecomment-658894290 cc @dongjoon-hyun @viirya
[GitHub] [spark] viirya commented on pull request #29112: [SPARK-32310][ML][PySpark] ML params default value parity part 1
viirya commented on pull request #29112: URL: https://github.com/apache/spark/pull/29112#issuecomment-658922003 "classification, regression, clustering and fpm" instead of "part 1" in the title?
[GitHub] [spark] cloud-fan commented on a change in pull request #29015: [SPARK-32215] Expose a (protected) /workers/kill endpoint on the MasterWebUI
cloud-fan commented on a change in pull request #29015:
URL: https://github.com/apache/spark/pull/29015#discussion_r455246869

## File path: core/src/main/scala/org/apache/spark/internal/config/UI.scala ##
@@ -191,4 +191,14 @@ private[spark] object UI {
     .version("3.0.0")
     .stringConf
     .createOptional
+
+  val MASTER_UI_DECOMMISSION_ALLOW_MODE = ConfigBuilder("spark.master.ui.decommission.allow.mode")
+    .doc("Specifies the behavior of the Master Web UI's /workers/kill endpoint. Possible choices" +
+      " are: `local` means allow this endpoint from IP's that are local to the machine running" +
+      " the Master, `deny` means to completely disable this endpoint, `allow` means to allow" +
+      " calling this endpoint from any IP.")
+    .internal()
+    .version("3.1.0")
+    .stringConf
+    .createWithDefault("deny")

Review comment:
Shall we use `local` as the default? It looks safe enough.
[GitHub] [spark] cloud-fan commented on a change in pull request #29015: [SPARK-32215] Expose a (protected) /workers/kill endpoint on the MasterWebUI
cloud-fan commented on a change in pull request #29015:
URL: https://github.com/apache/spark/pull/29015#discussion_r455247627

## File path: core/src/main/scala/org/apache/spark/internal/config/UI.scala ##
@@ -191,4 +191,14 @@ private[spark] object UI {
     .version("3.0.0")
     .stringConf
     .createOptional
+
+  val MASTER_UI_DECOMMISSION_ALLOW_MODE = ConfigBuilder("spark.master.ui.decommission.allow.mode")
+    .doc("Specifies the behavior of the Master Web UI's /workers/kill endpoint. Possible choices" +
+      " are: `local` means allow this endpoint from IP's that are local to the machine running" +
+      " the Master, `deny` means to completely disable this endpoint, `allow` means to allow" +
+      " calling this endpoint from any IP.")
+    .internal()
+    .version("3.1.0")
+    .stringConf

Review comment:
It's common to always upper-case the config value, as it should be case-insensitive, e.g.
```
...
.stringConf
.transform(_.toUpperCase(Locale.ROOT))
```
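The suggestion above, normalizing the value before it is compared, can be illustrated without Spark's `ConfigBuilder`. The sketch below is plain Scala under stated assumptions: the three modes come from the doc string in the diff, while `AllowModeSketch`, `validModes`, and `parse` are hypothetical names. `Locale.ROOT` matters because `toUpperCase()` with the JVM default locale can give surprising results (in a Turkish locale, for example, `i` upper-cases to a dotted capital `İ`).

```scala
import java.util.Locale

// Hypothetical sketch in plain Scala; it does not use Spark's ConfigBuilder.
object AllowModeSketch {

  // The three modes named in the config's doc string, in normalized form.
  val validModes: Set[String] = Set("LOCAL", "DENY", "ALLOW")

  /** Upper-cases the raw value with a fixed locale, then validates it. */
  def parse(raw: String): Either[String, String] = {
    val mode = raw.toUpperCase(Locale.ROOT)
    if (validModes.contains(mode)) Right(mode)
    else Left(s"Invalid mode '$raw'; expected one of ${validModes.toSeq.sorted.mkString(", ")}")
  }
}
```

With this normalization, `local`, `Local`, and `LOCAL` all resolve to the same mode, so later comparisons only need to handle the upper-case spelling.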
[GitHub] [spark] aokolnychyi commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes
aokolnychyi commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658931940 Yes, my proposal is to optimize cases when we sort the data after the repartition like in the examples I gave above. In those cases, sorts below seem to be redundant.
[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-658944693 All checks pass, I'm going to merge this to our current development branch.
[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-658954395 The SPIP has been voted on, this has been reviewed extensively, and the original design is from 2017. I'm not waiting unless someone wishes to -1 for a valid technical reason.
[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
frankyin-factual commented on pull request #28898: URL: https://github.com/apache/spark/pull/28898#issuecomment-658954329 @dongjoon-hyun friendly bump
[GitHub] [spark] AmplabJenkins commented on pull request #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource
AmplabJenkins commented on pull request #27366: URL: https://github.com/apache/spark/pull/27366#issuecomment-658986674
[GitHub] [spark] AmplabJenkins removed a comment on pull request #27366: [SPARK-30648][SQL] Support filters pushdown in JSON datasource
AmplabJenkins removed a comment on pull request #27366: URL: https://github.com/apache/spark/pull/27366#issuecomment-658986674
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29111: [SPARK-29292][SQL][ML] Update rest of default modules (Hive, ML, etc) for Scala 2.13 compilation
dongjoon-hyun edited a comment on pull request #29111: URL: https://github.com/apache/spark/pull/29111#issuecomment-658998506 Also, here. The green checkbox at the commit id. ![Screen Shot 2020-07-15 at 1 41 21 PM](https://user-images.githubusercontent.com/9700541/87593815-e4910e00-c6a0-11ea-9e09-1c8b68fc8ed2.png)
[GitHub] [spark] AmplabJenkins commented on pull request #29090: [SPARK-32293] Fix inconsistency between Spark memory configs and JVM option
AmplabJenkins commented on pull request #29090: URL: https://github.com/apache/spark/pull/29090#issuecomment-659010229
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29114: [SPARK-32094][PYTHON] Update cloudpickle to v1.5.0
AmplabJenkins removed a comment on pull request #29114: URL: https://github.com/apache/spark/pull/29114#issuecomment-659015242 Build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on pull request #29114: [SPARK-32094][PYTHON] Update cloudpickle to v1.5.0
AmplabJenkins commented on pull request #29114: URL: https://github.com/apache/spark/pull/29114#issuecomment-659015242
[GitHub] [spark] AmplabJenkins commented on pull request #29101: [SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions
AmplabJenkins commented on pull request #29101: URL: https://github.com/apache/spark/pull/29101#issuecomment-659026999
[GitHub] [spark] AmplabJenkins commented on pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
AmplabJenkins commented on pull request #28917: URL: https://github.com/apache/spark/pull/28917#issuecomment-659026829
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
AmplabJenkins removed a comment on pull request #28917: URL: https://github.com/apache/spark/pull/28917#issuecomment-659026829 Build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29101: [SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions
AmplabJenkins removed a comment on pull request #29101: URL: https://github.com/apache/spark/pull/29101#issuecomment-659026999 Build finished. Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #29015: [SPARK-32215] Expose a (protected) /workers/kill endpoint on the MasterWebUI
SparkQA commented on pull request #29015: URL: https://github.com/apache/spark/pull/29015#issuecomment-659027181 **[Test build #125911 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125911/testReport)** for PR 29015 at commit [`d8e241f`](https://github.com/apache/spark/commit/d8e241fc492a6a626d6cd00ef1f666fa62ffd178).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
AmplabJenkins removed a comment on pull request #28917: URL: https://github.com/apache/spark/pull/28917#issuecomment-659026841 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125897/ Test FAILed.