[GitHub] [spark] viirya commented on a change in pull request #23943: [SPARK-27034][SQL] Nested schema pruning for ORC
viirya commented on a change in pull request #23943: [SPARK-27034][SQL] Nested schema pruning for ORC URL: https://github.com/apache/spark/pull/23943#discussion_r263265265 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSchemaPruningSuite.scala ## @@ -0,0 +1,114 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.parquet + +import java.io.File + +import org.apache.spark.SparkConf +import org.apache.spark.sql.Row +import org.apache.spark.sql.execution.datasources.SchemaPruningSuite +import org.apache.spark.sql.internal.SQLConf + +class OrcSchemaPruningSuite extends SchemaPruningSuite { + override protected val dataSourceName: String = "orc" + override protected val vectorizedReaderEnabledKey: String = +SQLConf.ORC_VECTORIZED_READER_ENABLED.key + + override protected def sparkConf: SparkConf = +super + .sparkConf + .set(SQLConf.USE_V1_SOURCE_READER_LIST, "orc") + .set(SQLConf.USE_V1_SOURCE_WRITER_LIST, "orc") + + testSchemaPruning("select a single complex field") { +val query = sql("select name.middle from contacts") +checkScan(query, "struct>") +checkAnswer(query.orderBy("id"), Row("X.") :: Row("Y.") :: Nil) + } + + testSchemaPruning("select a single complex field and its parent struct") { +val query = sql("select name.middle, name from contacts") +checkScan(query, "struct>") +checkAnswer(query.orderBy("id"), + Row("X.", Row("Jane", "X.", "Doe")) :: +Row("Y.", Row("John", "Y.", "Doe")) :: Review comment: Sure. Automatically edited by the IDE... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #23943: [SPARK-27034][SQL] Nested schema pruning for ORC
viirya commented on a change in pull request #23943: [SPARK-27034][SQL] Nested schema pruning for ORC URL: https://github.com/apache/spark/pull/23943#discussion_r263265164 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala ## @@ -0,0 +1,350 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.File + +import org.scalactic.Equality + +import org.apache.spark.sql.{DataFrame, QueryTest, Row} +import org.apache.spark.sql.catalyst.SchemaPruningTest +import org.apache.spark.sql.catalyst.parser.CatalystSqlParser +import org.apache.spark.sql.execution.FileSourceScanExec +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.test.SharedSQLContext +import org.apache.spark.sql.types.StructType + +abstract class SchemaPruningSuite +extends QueryTest +with FileBasedDataSourceTest +with SchemaPruningTest +with SharedSQLContext { + case class FullName(first: String, middle: String, last: String) + case class Company(name: String, address: String) + case class Employer(id: Int, company: Company) + case class Contact( +id: Int, +name: FullName, +address: String, +pets: Int, +friends: Array[FullName] = Array.empty, +relatives: Map[String, FullName] = Map.empty, +employer: Employer = null) + + val janeDoe = FullName("Jane", "X.", "Doe") + val johnDoe = FullName("John", "Y.", "Doe") + val susanSmith = FullName("Susan", "Z.", "Smith") + + val employer = Employer(0, Company("abc", "123 Business Street")) + val employerWithNullCompany = Employer(1, null) + + val contacts = +Contact(0, janeDoe, "123 Main Street", 1, friends = Array(susanSmith), + relatives = Map("brother" -> johnDoe), employer = employer) :: +Contact(1, johnDoe, "321 Wall Street", 3, relatives = Map("sister" -> janeDoe), + employer = employerWithNullCompany) :: Nil + + case class Name(first: String, last: String) + case class BriefContact(id: Int, name: Name, address: String) + + private val briefContacts = +BriefContact(2, Name("Janet", "Jones"), "567 Maple Drive") :: +BriefContact(3, Name("Jim", "Jones"), "6242 Ash Street") :: Nil + + case class ContactWithDataPartitionColumn( +id: Int, +name: FullName, +address: String, +pets: Int, +friends: Array[FullName] = Array(), +relatives: Map[String, FullName] = Map(), +employer: Employer = null, +p: Int) + + case class BriefContactWithDataPartitionColumn(id: Int, name: Name, address: String, p: Int) + + val contactsWithDataPartitionColumn = +contacts.map { case Contact(id, name, address, pets, friends, relatives, employer) => + ContactWithDataPartitionColumn(id, name, address, pets, friends, relatives, employer, 1) } + val briefContactsWithDataPartitionColumn = +briefContacts.map { case BriefContact(id, name, address) => + BriefContactWithDataPartitionColumn(id, name, address, 2) } + + testSchemaPruning("select a single complex field array and its parent struct array") { Review comment: Yes, we should be able to use all same tests between Parquet and ORC once user specified schema works. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #23998: [MINOR][SQL]Add a conf to control subqueryReuse
AmplabJenkins commented on issue #23998: [MINOR][SQL]Add a conf to control subqueryReuse URL: https://github.com/apache/spark/pull/23998#issuecomment-470417898 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/103116/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #23943: [SPARK-27034][SQL] Nested schema pruning for ORC
viirya commented on a change in pull request #23943: [SPARK-27034][SQL] Nested schema pruning for ORC URL: https://github.com/apache/spark/pull/23943#discussion_r263264514 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSchemaPruningSuite.scala ## @@ -0,0 +1,114 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.parquet + +import java.io.File + +import org.apache.spark.SparkConf +import org.apache.spark.sql.Row +import org.apache.spark.sql.execution.datasources.SchemaPruningSuite +import org.apache.spark.sql.internal.SQLConf + +class OrcSchemaPruningSuite extends SchemaPruningSuite { + override protected val dataSourceName: String = "orc" + override protected val vectorizedReaderEnabledKey: String = +SQLConf.ORC_VECTORIZED_READER_ENABLED.key + + override protected def sparkConf: SparkConf = +super + .sparkConf + .set(SQLConf.USE_V1_SOURCE_READER_LIST, "orc") + .set(SQLConf.USE_V1_SOURCE_WRITER_LIST, "orc") + + testSchemaPruning("select a single complex field") { +val query = sql("select name.middle from contacts") +checkScan(query, "struct>") +checkAnswer(query.orderBy("id"), Row("X.") :: Row("Y.") :: Nil) + } + + testSchemaPruning("select a single complex field and its parent struct") { +val query = sql("select name.middle, name from contacts") +checkScan(query, "struct>") +checkAnswer(query.orderBy("id"), + Row("X.", Row("Jane", "X.", "Doe")) :: +Row("Y.", Row("John", "Y.", "Doe")) :: +Nil) + } + + testSchemaPruning("select a single complex field and the partition column") { +val query = sql("select name.middle, p from contacts") +checkScan(query, "struct>") +checkAnswer(query.orderBy("id"), + Row("X.", 1) :: Row("Y.", 1) :: Nil) + } + + testSchemaPruning("no unnecessary schema pruning") { +val query = + sql("select id, name.last, name.middle, name.first, relatives[''].last, " + +"relatives[''].middle, relatives[''].first, friends[0].last, friends[0].middle, " + +"friends[0].first, pets, address from contacts where p=1") +// We've selected every field in the schema. Therefore, no schema pruning should be performed. +// We check this by asserting that the scanned schema of the query is identical to the schema +// of the contacts relation, even though the fields are selected in different orders. +checkScan(query, + "struct,address:string,pets:int," + +"friends:array>," + + "relatives:map>>") +checkAnswer(query.orderBy("id"), + Row(0, "Doe", "X.", "Jane", null, null, null, "Smith", "Z.", "Susan", 1, "123 Main Street") :: +Row(1, "Doe", "Y.", "John", null, null, null, null, null, null, 3, "321 Wall Street") :: +Nil) + } + + testSchemaPruning("select a single complex field and is null expression in project") { +val query = sql("select name.first, address is not null from contacts") +checkScan(query, "struct,address:string>") +checkAnswer(query.orderBy("id"), + Row("Jane", true) :: Row("John", true) :: Nil) + } + + /** + * Overrides this because ORC datasource doesn't support schema merging currently. Review comment: I haven't tried. But I guess with user specified schema, it should work. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #23998: [MINOR][SQL]Add a conf to control subqueryReuse
AmplabJenkins removed a comment on issue #23998: [MINOR][SQL]Add a conf to control subqueryReuse URL: https://github.com/apache/spark/pull/23998#issuecomment-470417887 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #23998: [MINOR][SQL]Add a conf to control subqueryReuse
AmplabJenkins removed a comment on issue #23998: [MINOR][SQL]Add a conf to control subqueryReuse URL: https://github.com/apache/spark/pull/23998#issuecomment-470417898 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/103116/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #23998: [MINOR][SQL]Add a conf to control subqueryReuse
AmplabJenkins commented on issue #23998: [MINOR][SQL]Add a conf to control subqueryReuse URL: https://github.com/apache/spark/pull/23998#issuecomment-470417887 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #23998: [MINOR][SQL]Add a conf to control subqueryReuse
SparkQA removed a comment on issue #23998: [MINOR][SQL]Add a conf to control subqueryReuse URL: https://github.com/apache/spark/pull/23998#issuecomment-470369391 **[Test build #103116 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103116/testReport)** for PR 23998 at commit [`d468c7f`](https://github.com/apache/spark/commit/d468c7f813f557c01e94e0ae76de395d3bdcd71f). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #23998: [MINOR][SQL]Add a conf to control subqueryReuse
SparkQA commented on issue #23998: [MINOR][SQL]Add a conf to control subqueryReuse URL: https://github.com/apache/spark/pull/23998#issuecomment-470417392 **[Test build #103116 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103116/testReport)** for PR 23998 at commit [`d468c7f`](https://github.com/apache/spark/commit/d468c7f813f557c01e94e0ae76de395d3bdcd71f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] fangshil commented on a change in pull request #20303: [SPARK-23128][SQL] A new approach to do adaptive execution in Spark SQL
fangshil commented on a change in pull request #20303: [SPARK-23128][SQL] A new approach to do adaptive execution in Spark SQL URL: https://github.com/apache/spark/pull/20303#discussion_r263260234 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala ## @@ -36,107 +35,12 @@ import org.apache.spark.sql.internal.SQLConf * the input partition ordering requirements are met. */ case class EnsureRequirements(conf: SQLConf) extends Rule[SparkPlan] { - private def defaultNumPreShufflePartitions: Int = conf.numShufflePartitions - - private def targetPostShuffleInputSize: Long = conf.targetPostShuffleInputSize - - private def adaptiveExecutionEnabled: Boolean = conf.adaptiveExecutionEnabled - - private def minNumPostShufflePartitions: Option[Int] = { -val minNumPostShufflePartitions = conf.minNumPostShufflePartitions -if (minNumPostShufflePartitions > 0) Some(minNumPostShufflePartitions) else None - } - - /** - * Adds [[ExchangeCoordinator]] to [[ShuffleExchangeExec]]s if adaptive query execution is enabled - * and partitioning schemes of these [[ShuffleExchangeExec]]s support [[ExchangeCoordinator]]. - */ - private def withExchangeCoordinator( - children: Seq[SparkPlan], - requiredChildDistributions: Seq[Distribution]): Seq[SparkPlan] = { -val supportsCoordinator = - if (children.exists(_.isInstanceOf[ShuffleExchangeExec])) { -// Right now, ExchangeCoordinator only support HashPartitionings. -children.forall { - case e @ ShuffleExchangeExec(hash: HashPartitioning, _, _) => true - case child => -child.outputPartitioning match { - case hash: HashPartitioning => true - case collection: PartitioningCollection => - collection.partitionings.forall(_.isInstanceOf[HashPartitioning]) - case _ => false -} -} - } else { -// In this case, although we do not have Exchange operators, we may still need to -// shuffle data when we have more than one children because data generated by -// these children may not be partitioned in the same way. -// Please see the comment in withCoordinator for more details. -val supportsDistribution = requiredChildDistributions.forall { dist => - dist.isInstanceOf[ClusteredDistribution] || dist.isInstanceOf[HashClusteredDistribution] -} -children.length > 1 && supportsDistribution - } - -val withCoordinator = - if (adaptiveExecutionEnabled && supportsCoordinator) { -val coordinator = - new ExchangeCoordinator( -targetPostShuffleInputSize, -minNumPostShufflePartitions) -children.zip(requiredChildDistributions).map { - case (e: ShuffleExchangeExec, _) => -// This child is an Exchange, we need to add the coordinator. -e.copy(coordinator = Some(coordinator)) - case (child, distribution) => -// If this child is not an Exchange, we need to add an Exchange for now. -// Ideally, we can try to avoid this Exchange. However, when we reach here, -// there are at least two children operators (because if there is a single child -// and we can avoid Exchange, supportsCoordinator will be false and we -// will not reach here.). Although we can make two children have the same number of -// post-shuffle partitions. Their numbers of pre-shuffle partitions may be different. -// For example, let's say we have the following plan -// Join -// / \ -// Agg Exchange -// / \ -//Exchange t2 -// / -// t1 -// In this case, because a post-shuffle partition can include multiple pre-shuffle -// partitions, a HashPartitioning will not be strictly partitioned by the hashcodes -// after shuffle. So, even we can use the child Exchange operator of the Join to -// have a number of post-shuffle partitions that matches the number of partitions of -// Agg, we cannot say these two children are partitioned in the same way. -// Here is another case -// Join -// / \ -// Agg1 Agg2 -// / \ -// Exchange1 Exchange2 -// / \ -// t1 t2 -// In this case, two Aggs shuffle data with the same column of the join condition. -// After we use ExchangeCoordinator, these two Aggs may not be partitioned in the same -// way. Let's say that Agg1 and Agg2 both have 5 pre-shuffle partitions and 2 -// post-shuffle partitions. It is possible that Agg1 fetches those
[GitHub] [spark] fangshil commented on a change in pull request #20303: [SPARK-23128][SQL] A new approach to do adaptive execution in Spark SQL
fangshil commented on a change in pull request #20303: [SPARK-23128][SQL] A new approach to do adaptive execution in Spark SQL URL: https://github.com/apache/spark/pull/20303#discussion_r263260234 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala ## @@ -36,107 +35,12 @@ import org.apache.spark.sql.internal.SQLConf * the input partition ordering requirements are met. */ case class EnsureRequirements(conf: SQLConf) extends Rule[SparkPlan] { - private def defaultNumPreShufflePartitions: Int = conf.numShufflePartitions - - private def targetPostShuffleInputSize: Long = conf.targetPostShuffleInputSize - - private def adaptiveExecutionEnabled: Boolean = conf.adaptiveExecutionEnabled - - private def minNumPostShufflePartitions: Option[Int] = { -val minNumPostShufflePartitions = conf.minNumPostShufflePartitions -if (minNumPostShufflePartitions > 0) Some(minNumPostShufflePartitions) else None - } - - /** - * Adds [[ExchangeCoordinator]] to [[ShuffleExchangeExec]]s if adaptive query execution is enabled - * and partitioning schemes of these [[ShuffleExchangeExec]]s support [[ExchangeCoordinator]]. - */ - private def withExchangeCoordinator( - children: Seq[SparkPlan], - requiredChildDistributions: Seq[Distribution]): Seq[SparkPlan] = { -val supportsCoordinator = - if (children.exists(_.isInstanceOf[ShuffleExchangeExec])) { -// Right now, ExchangeCoordinator only support HashPartitionings. -children.forall { - case e @ ShuffleExchangeExec(hash: HashPartitioning, _, _) => true - case child => -child.outputPartitioning match { - case hash: HashPartitioning => true - case collection: PartitioningCollection => - collection.partitionings.forall(_.isInstanceOf[HashPartitioning]) - case _ => false -} -} - } else { -// In this case, although we do not have Exchange operators, we may still need to -// shuffle data when we have more than one children because data generated by -// these children may not be partitioned in the same way. -// Please see the comment in withCoordinator for more details. -val supportsDistribution = requiredChildDistributions.forall { dist => - dist.isInstanceOf[ClusteredDistribution] || dist.isInstanceOf[HashClusteredDistribution] -} -children.length > 1 && supportsDistribution - } - -val withCoordinator = - if (adaptiveExecutionEnabled && supportsCoordinator) { -val coordinator = - new ExchangeCoordinator( -targetPostShuffleInputSize, -minNumPostShufflePartitions) -children.zip(requiredChildDistributions).map { - case (e: ShuffleExchangeExec, _) => -// This child is an Exchange, we need to add the coordinator. -e.copy(coordinator = Some(coordinator)) - case (child, distribution) => -// If this child is not an Exchange, we need to add an Exchange for now. -// Ideally, we can try to avoid this Exchange. However, when we reach here, -// there are at least two children operators (because if there is a single child -// and we can avoid Exchange, supportsCoordinator will be false and we -// will not reach here.). Although we can make two children have the same number of -// post-shuffle partitions. Their numbers of pre-shuffle partitions may be different. -// For example, let's say we have the following plan -// Join -// / \ -// Agg Exchange -// / \ -//Exchange t2 -// / -// t1 -// In this case, because a post-shuffle partition can include multiple pre-shuffle -// partitions, a HashPartitioning will not be strictly partitioned by the hashcodes -// after shuffle. So, even we can use the child Exchange operator of the Join to -// have a number of post-shuffle partitions that matches the number of partitions of -// Agg, we cannot say these two children are partitioned in the same way. -// Here is another case -// Join -// / \ -// Agg1 Agg2 -// / \ -// Exchange1 Exchange2 -// / \ -// t1 t2 -// In this case, two Aggs shuffle data with the same column of the join condition. -// After we use ExchangeCoordinator, these two Aggs may not be partitioned in the same -// way. Let's say that Agg1 and Agg2 both have 5 pre-shuffle partitions and 2 -// post-shuffle partitions. It is possible that Agg1 fetches those
[GitHub] [spark] fangshil commented on a change in pull request #20303: [SPARK-23128][SQL] A new approach to do adaptive execution in Spark SQL
fangshil commented on a change in pull request #20303: [SPARK-23128][SQL] A new approach to do adaptive execution in Spark SQL URL: https://github.com/apache/spark/pull/20303#discussion_r263260234 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala ## @@ -36,107 +35,12 @@ import org.apache.spark.sql.internal.SQLConf * the input partition ordering requirements are met. */ case class EnsureRequirements(conf: SQLConf) extends Rule[SparkPlan] { - private def defaultNumPreShufflePartitions: Int = conf.numShufflePartitions - - private def targetPostShuffleInputSize: Long = conf.targetPostShuffleInputSize - - private def adaptiveExecutionEnabled: Boolean = conf.adaptiveExecutionEnabled - - private def minNumPostShufflePartitions: Option[Int] = { -val minNumPostShufflePartitions = conf.minNumPostShufflePartitions -if (minNumPostShufflePartitions > 0) Some(minNumPostShufflePartitions) else None - } - - /** - * Adds [[ExchangeCoordinator]] to [[ShuffleExchangeExec]]s if adaptive query execution is enabled - * and partitioning schemes of these [[ShuffleExchangeExec]]s support [[ExchangeCoordinator]]. - */ - private def withExchangeCoordinator( - children: Seq[SparkPlan], - requiredChildDistributions: Seq[Distribution]): Seq[SparkPlan] = { -val supportsCoordinator = - if (children.exists(_.isInstanceOf[ShuffleExchangeExec])) { -// Right now, ExchangeCoordinator only support HashPartitionings. -children.forall { - case e @ ShuffleExchangeExec(hash: HashPartitioning, _, _) => true - case child => -child.outputPartitioning match { - case hash: HashPartitioning => true - case collection: PartitioningCollection => - collection.partitionings.forall(_.isInstanceOf[HashPartitioning]) - case _ => false -} -} - } else { -// In this case, although we do not have Exchange operators, we may still need to -// shuffle data when we have more than one children because data generated by -// these children may not be partitioned in the same way. -// Please see the comment in withCoordinator for more details. -val supportsDistribution = requiredChildDistributions.forall { dist => - dist.isInstanceOf[ClusteredDistribution] || dist.isInstanceOf[HashClusteredDistribution] -} -children.length > 1 && supportsDistribution - } - -val withCoordinator = - if (adaptiveExecutionEnabled && supportsCoordinator) { -val coordinator = - new ExchangeCoordinator( -targetPostShuffleInputSize, -minNumPostShufflePartitions) -children.zip(requiredChildDistributions).map { - case (e: ShuffleExchangeExec, _) => -// This child is an Exchange, we need to add the coordinator. -e.copy(coordinator = Some(coordinator)) - case (child, distribution) => -// If this child is not an Exchange, we need to add an Exchange for now. -// Ideally, we can try to avoid this Exchange. However, when we reach here, -// there are at least two children operators (because if there is a single child -// and we can avoid Exchange, supportsCoordinator will be false and we -// will not reach here.). Although we can make two children have the same number of -// post-shuffle partitions. Their numbers of pre-shuffle partitions may be different. -// For example, let's say we have the following plan -// Join -// / \ -// Agg Exchange -// / \ -//Exchange t2 -// / -// t1 -// In this case, because a post-shuffle partition can include multiple pre-shuffle -// partitions, a HashPartitioning will not be strictly partitioned by the hashcodes -// after shuffle. So, even we can use the child Exchange operator of the Join to -// have a number of post-shuffle partitions that matches the number of partitions of -// Agg, we cannot say these two children are partitioned in the same way. -// Here is another case -// Join -// / \ -// Agg1 Agg2 -// / \ -// Exchange1 Exchange2 -// / \ -// t1 t2 -// In this case, two Aggs shuffle data with the same column of the join condition. -// After we use ExchangeCoordinator, these two Aggs may not be partitioned in the same -// way. Let's say that Agg1 and Agg2 both have 5 pre-shuffle partitions and 2 -// post-shuffle partitions. It is possible that Agg1 fetches those
[GitHub] [spark] maropu commented on a change in pull request #23964: [SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition
maropu commented on a change in pull request #23964: [SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition URL: https://github.com/apache/spark/pull/23964#discussion_r263259691 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala ## @@ -0,0 +1,193 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.QueryPlan +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.types._ + +/** + * This aims to handle nested column aliasing pattern inside `ColumnPruning` optimizer rule. + * If a project or its child references to nested fields, and not all the fields + * in a nested attribute are used, we can substitute the by alias attributes; then a project + * of the nested fields as aliases on the children of the child will be created. + */ +object NestedColumnAliasing { + + def unapply(plan: LogicalPlan) +: Option[(Map[GetStructField, Alias], Map[ExprId, Seq[Alias]])] = plan match { +case Project(_, child) if canProjectPushThrough(child) => + getAliasSubMap(plan, child) +case _ => None + } + + /** + * Replace nested columns to prune unused nested columns later. + */ + def replaceToAliases( + plan: LogicalPlan, + nestedFieldToAlias: Map[GetStructField, Alias], + attrToAliases: Map[ExprId, Seq[Alias]]): LogicalPlan = plan match { +case Project(projectList, g @ GlobalLimit(_, grandChild: LocalLimit)) => + Project( +getNewProjectList(projectList, nestedFieldToAlias), +g.copy(child = replaceChildrenWithAliases(grandChild, attrToAliases))) + +case Project(projectList, child) => + Project( +getNewProjectList(projectList, nestedFieldToAlias), +replaceChildrenWithAliases(child, attrToAliases)) + } + + /** + * Return a replaced project list. + */ + private def getNewProjectList( + projectList: Seq[NamedExpression], + nestedFieldToAlias: Map[GetStructField, Alias]): Seq[NamedExpression] = { +projectList.map(_.transform { + case f: GetStructField if nestedFieldToAlias.contains(f) => +nestedFieldToAlias(f).toAttribute +}.asInstanceOf[NamedExpression]) + } + + /** + * Return a plan with new childen replaced with aliases. + */ + private def replaceChildrenWithAliases( + plan: LogicalPlan, + attrToAliases: Map[ExprId, Seq[Alias]]): LogicalPlan = { +plan.withNewChildren(plan.children.map { plan => + Project(plan.output.flatMap(a => attrToAliases.getOrElse(a.exprId, Seq(a))), plan) +}) + } + + /** + * Returns true for those operators that project can be pushed through. + */ + private def canProjectPushThrough(plan: LogicalPlan) = plan match { +case _: GlobalLimit => true +case _: LocalLimit => true +case _: Repartition => true +case _: Sample => true +case _ => false + } + + /** + * Similar to [[QueryPlan.references]], but this only returns all attributes + * that are explicitly referenced on the root levels in a [[LogicalPlan]]. + */ + private def getRootReferences(plan: LogicalPlan): AttributeSet = { +def helper(e: Expression): AttributeSet = e match { + case attr: AttributeReference => AttributeSet(attr) + case _: GetStructField => AttributeSet.empty + case es if es.children.nonEmpty => AttributeSet(es.children.flatMap(helper)) + case _ => AttributeSet.empty +} +AttributeSet.fromAttributeSets(plan.expressions.map(helper)) + } + + /** + * Returns all the nested fields that are explicitly referenced as [[Expression]] + * in a [[LogicalPlan]]. Currently, we only support having [[GetStructField]] in the chain + * of the expressions. If the chain contains GetArrayStructFields, GetMapValue, or + * GetArrayItem, the nested field substitution will not be performed. + */ + private def getNestedFieldReferences(plan: LogicalPlan): Seq[GetStructField] = { +def helper(e: Expression): Seq[GetStructField] = e match { +
[GitHub] [spark] maropu commented on a change in pull request #23964: [SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition
maropu commented on a change in pull request #23964: [SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition URL: https://github.com/apache/spark/pull/23964#discussion_r263257592 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala ## @@ -0,0 +1,193 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.QueryPlan +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.types._ + +/** + * This aims to handle nested column aliasing pattern inside `ColumnPruning` optimizer rule. + * If a project or its child references to nested fields, and not all the fields + * in a nested attribute are used, we can substitute the by alias attributes; then a project Review comment: typo: `substitute the` -> `substitute them` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on a change in pull request #23964: [SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition
maropu commented on a change in pull request #23964: [SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition URL: https://github.com/apache/spark/pull/23964#discussion_r263257374 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala ## @@ -0,0 +1,193 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.QueryPlan +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.types._ + +/** + * This aims to handle nested column aliasing pattern inside `ColumnPruning` optimizer rule. Review comment: super nit: " * This aims to handle a nested column aliasing pattern inside the `ColumnPruning` optimizer rule." This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on a change in pull request #23964: [SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition
maropu commented on a change in pull request #23964: [SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition URL: https://github.com/apache/spark/pull/23964#discussion_r263257374 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala ## @@ -0,0 +1,193 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.QueryPlan +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.types._ + +/** + * This aims to handle nested column aliasing pattern inside `ColumnPruning` optimizer rule. Review comment: super nit: ` * This aims to handle a nested column aliasing pattern inside the ColumnPruning optimizer rule.` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on a change in pull request #23964: [SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition
maropu commented on a change in pull request #23964: [SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition URL: https://github.com/apache/spark/pull/23964#discussion_r263257374 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala ## @@ -0,0 +1,193 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.QueryPlan +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.types._ + +/** + * This aims to handle nested column aliasing pattern inside `ColumnPruning` optimizer rule. Review comment: super nit: ` * This aims to handle a nested column aliasing pattern inside the `ColumnPruning` optimizer rule.` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on a change in pull request #23964: [SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition
maropu commented on a change in pull request #23964: [SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition URL: https://github.com/apache/spark/pull/23964#discussion_r263257086 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala ## @@ -0,0 +1,193 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.QueryPlan +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.types._ + +/** + * This aims to handle nested column aliasing pattern inside `ColumnPruning` optimizer rule. + * If a project or its child references to nested fields, and not all the fields + * in a nested attribute are used, we can substitute the by alias attributes; then a project + * of the nested fields as aliases on the children of the child will be created. + */ +object NestedColumnAliasing { + + def unapply(plan: LogicalPlan) +: Option[(Map[GetStructField, Alias], Map[ExprId, Seq[Alias]])] = plan match { +case Project(_, child) if canProjectPushThrough(child) => + getAliasSubMap(plan, child) +case _ => None + } + + /** + * Replace nested columns to prune unused nested columns later. + */ + def replaceToAliases( + plan: LogicalPlan, + nestedFieldToAlias: Map[GetStructField, Alias], + attrToAliases: Map[ExprId, Seq[Alias]]): LogicalPlan = plan match { +case Project(projectList, g @ GlobalLimit(_, grandChild: LocalLimit)) => + Project( +getNewProjectList(projectList, nestedFieldToAlias), +g.copy(child = replaceChildrenWithAliases(grandChild, attrToAliases))) + +case Project(projectList, child) => + Project( +getNewProjectList(projectList, nestedFieldToAlias), +replaceChildrenWithAliases(child, attrToAliases)) + } + + /** + * Return a replaced project list. + */ + private def getNewProjectList( + projectList: Seq[NamedExpression], + nestedFieldToAlias: Map[GetStructField, Alias]): Seq[NamedExpression] = { +projectList.map(_.transform { + case f: GetStructField if nestedFieldToAlias.contains(f) => +nestedFieldToAlias(f).toAttribute +}.asInstanceOf[NamedExpression]) + } + + /** + * Return a plan with new childen replaced with aliases. + */ + private def replaceChildrenWithAliases( + plan: LogicalPlan, + attrToAliases: Map[ExprId, Seq[Alias]]): LogicalPlan = { +plan.withNewChildren(plan.children.map { plan => + Project(plan.output.flatMap(a => attrToAliases.getOrElse(a.exprId, Seq(a))), plan) +}) + } + + /** + * Returns true for those operators that project can be pushed through. + */ + private def canProjectPushThrough(plan: LogicalPlan) = plan match { +case _: GlobalLimit => true +case _: LocalLimit => true +case _: Repartition => true +case _: Sample => true +case _ => false + } + + /** + * Similar to [[QueryPlan.references]], but this only returns all attributes + * that are explicitly referenced on the root levels in a [[LogicalPlan]]. + */ + private def getRootReferences(plan: LogicalPlan): AttributeSet = { +def helper(e: Expression): AttributeSet = e match { + case attr: AttributeReference => AttributeSet(attr) + case _: GetStructField => AttributeSet.empty + case es if es.children.nonEmpty => AttributeSet(es.children.flatMap(helper)) + case _ => AttributeSet.empty +} +AttributeSet.fromAttributeSets(plan.expressions.map(helper)) + } + + /** + * Returns all the nested fields that are explicitly referenced as [[Expression]] + * in a [[LogicalPlan]]. Currently, we only support having [[GetStructField]] in the chain + * of the expressions. If the chain contains GetArrayStructFields, GetMapValue, or + * GetArrayItem, the nested field substitution will not be performed. + */ + private def getNestedFieldReferences(plan: LogicalPlan): Seq[GetStructField] = { +def helper(e: Expression): Seq[GetStructField] = e match { +
[GitHub] [spark] AmplabJenkins removed a comment on issue #23947: [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats
AmplabJenkins removed a comment on issue #23947: [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats URL: https://github.com/apache/spark/pull/23947#issuecomment-470408131 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #23947: [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats
AmplabJenkins commented on issue #23947: [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats URL: https://github.com/apache/spark/pull/23947#issuecomment-470408136 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/103115/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #23947: [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats
AmplabJenkins removed a comment on issue #23947: [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats URL: https://github.com/apache/spark/pull/23947#issuecomment-470408136 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/103115/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #23947: [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats
AmplabJenkins commented on issue #23947: [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats URL: https://github.com/apache/spark/pull/23947#issuecomment-470408131 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #23947: [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats
SparkQA removed a comment on issue #23947: [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats URL: https://github.com/apache/spark/pull/23947#issuecomment-470361512 **[Test build #103115 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103115/testReport)** for PR 23947 at commit [`c359b8d`](https://github.com/apache/spark/commit/c359b8ddf4d52d2529d7a0b7dd24f962e128dd6e). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #23947: [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats
SparkQA commented on issue #23947: [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats URL: https://github.com/apache/spark/pull/23947#issuecomment-470407653 **[Test build #103115 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103115/testReport)** for PR 23947 at commit [`c359b8d`](https://github.com/apache/spark/commit/c359b8ddf4d52d2529d7a0b7dd24f962e128dd6e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24002: [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2
AmplabJenkins removed a comment on issue #24002: [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2 URL: https://github.com/apache/spark/pull/24002#issuecomment-470405704 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24002: [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2
AmplabJenkins commented on issue #24002: [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2 URL: https://github.com/apache/spark/pull/24002#issuecomment-470406127 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24002: [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2
AmplabJenkins removed a comment on issue #24002: [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2 URL: https://github.com/apache/spark/pull/24002#issuecomment-470405610 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Jeffwan commented on issue #24002: [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2
Jeffwan commented on issue #24002: [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2 URL: https://github.com/apache/spark/pull/24002#issuecomment-470405747 Please hold this PR until @shaneknapp confirms. Not sure if minikube get updated on Jenkins side. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24002: [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2
AmplabJenkins commented on issue #24002: [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2 URL: https://github.com/apache/spark/pull/24002#issuecomment-470405704 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24002: [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2
AmplabJenkins commented on issue #24002: [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2 URL: https://github.com/apache/spark/pull/24002#issuecomment-470405610 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Jeffwan opened a new pull request #24002: [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2
Jeffwan opened a new pull request #24002: [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2 URL: https://github.com/apache/spark/pull/24002 ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/23814 was reverted because of Jenkins integration tests failed. After minikube upgrade, v1.4.2 work with kubernetes v1.13. We can bring this change back. Reference: [Bump Kubernetes Client Version to 4.1.2](https://issues.apache.org/jira/browse/SPARK-26742) [Original PR] (https://github.com/apache/spark/pull/23814) [Kubernetes client upgrade for Spark 2.4](https://github.com/apache/spark/pull/23993) https://issues.apache.org/jira/browse/SPARK-26742 ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] prodeezy commented on issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields in DataSource Strategy
prodeezy commented on issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields in DataSource Strategy URL: https://github.com/apache/spark/pull/22573#issuecomment-470403048 @dbtsai can you point me to the jira that tracks the maps/arrays support? thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sandeep-katta commented on issue #22466: [SPARK-25464][SQL] Create Database to the location, only if it is empty or does not exists.
sandeep-katta commented on issue #22466: [SPARK-25464][SQL] Create Database to the location,only if it is empty or does not exists. URL: https://github.com/apache/spark/pull/22466#issuecomment-470403068 @gatorsmile I have updated the SQL migration guide This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #22466: [SPARK-25464][SQL] Create Database to the location, only if it is empty or does not exists.
SparkQA commented on issue #22466: [SPARK-25464][SQL] Create Database to the location,only if it is empty or does not exists. URL: https://github.com/apache/spark/pull/22466#issuecomment-470403101 **[Test build #103124 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103124/testReport)** for PR 22466 at commit [`6532466`](https://github.com/apache/spark/commit/653246692eaccaec5da1e22d80ce8f7982d1161f). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #23943: [SPARK-27034][SQL] Nested schema pruning for ORC
viirya commented on a change in pull request #23943: [SPARK-27034][SQL] Nested schema pruning for ORC URL: https://github.com/apache/spark/pull/23943#discussion_r263250824 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala ## @@ -103,42 +43,13 @@ class ParquetSchemaPruningSuite Nil) } - testSchemaPruning("select a single complex field array and its parent struct array") { -val query = sql("select friends.middle, friends from contacts where p=1") -checkScan(query, - "struct>>") -checkAnswer(query.orderBy("id"), - Row(Array("Z."), Array(Row("Susan", "Z.", "Smith"))) :: - Row(Array.empty[String], Array.empty[Row]) :: - Nil) - } - - testSchemaPruning("select a single complex field from a map entry and its parent map entry") { -val query = - sql("select relatives[\"brother\"].middle, relatives[\"brother\"] from contacts where p=1") -checkScan(query, - "struct>>") -checkAnswer(query.orderBy("id"), - Row("Y.", Row("John", "Y.", "Doe")) :: - Row(null, null) :: - Nil) - } - testSchemaPruning("select a single complex field and the partition column") { val query = sql("select name.middle, p from contacts") checkScan(query, "struct>") checkAnswer(query.orderBy("id"), Row("X.", 1) :: Row("Y.", 1) :: Row(null, 2) :: Row(null, 2) :: Nil) Review comment: Ah, sorry, it is not due to schema merging. But the inferred schema between ORC and Parquet is different. We can test it on current master branch like: ```scala withTempPath { dir => val path = dir.getCanonicalPath makeDataSourceFile(contacts, new File(path + "/contacts/p=1")) makeDataSourceFile(briefContacts, new File(path + "/contacts/p=2")) spark.read.format(dataSourceName).load(path + "/contacts").createOrReplaceTempView("contacts") spark.sql("select * from contacts").printSchema() } ``` When `dataSourceName` is parquet, the schema is: ``` root |-- id: integer (nullable = true) |-- name: struct (nullable = true) ||-- first: string (nullable = true) ||-- middle: string (nullable = true) ||-- last: string (nullable = true) |-- address: string (nullable = true) |-- pets: integer (nullable = true) |-- friends: array (nullable = true) ||-- element: struct (containsNull = true) |||-- first: string (nullable = true) |||-- middle: string (nullable = true) |||-- last: string (nullable = true) |-- relatives: map (nullable = true) ||-- key: string ||-- value: struct (valueContainsNull = true) |||-- first: string (nullable = true) |||-- middle: string (nullable = true) |||-- last: string (nullable = true)
[GitHub] [spark] felixcheungu removed a comment on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager
felixcheungu removed a comment on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager URL: https://github.com/apache/spark/pull/23521#issuecomment-470290769 I think we need to backport this into branch-2.4 @felixcheung This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #23946: [SPARK-26860][PySpark] [SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation
AmplabJenkins removed a comment on issue #23946: [SPARK-26860][PySpark] [SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation URL: https://github.com/apache/spark/pull/23946#issuecomment-470397967 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/8583/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #23946: [SPARK-26860][PySpark] [SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation
AmplabJenkins removed a comment on issue #23946: [SPARK-26860][PySpark] [SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation URL: https://github.com/apache/spark/pull/23946#issuecomment-470397959 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #23946: [SPARK-26860][PySpark] [SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation
AmplabJenkins commented on issue #23946: [SPARK-26860][PySpark] [SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation URL: https://github.com/apache/spark/pull/23946#issuecomment-470397959 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #23946: [SPARK-26860][PySpark] [SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation
AmplabJenkins commented on issue #23946: [SPARK-26860][PySpark] [SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation URL: https://github.com/apache/spark/pull/23946#issuecomment-470397967 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/8583/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #23946: [SPARK-26860][PySpark] [SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation
SparkQA commented on issue #23946: [SPARK-26860][PySpark] [SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation URL: https://github.com/apache/spark/pull/23946#issuecomment-470396985 **[Test build #103123 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103123/testReport)** for PR 23946 at commit [`b349544`](https://github.com/apache/spark/commit/b3495445fa1944134b59ac7b8a16115d7979ce95). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sandeep-katta commented on a change in pull request #22466: [SPARK-25464][SQL] Create Database to the location, only if it is empty or does not exists.
sandeep-katta commented on a change in pull request #22466: [SPARK-25464][SQL] Create Database to the location,only if it is empty or does not exists. URL: https://github.com/apache/spark/pull/22466#discussion_r263247189 ## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala ## @@ -2370,4 +2370,17 @@ class HiveDDLSuite )) } } + + test("SPARK-25464 create a database with a non empty location") { Review comment: yes [Test code is here](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala#L1119) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] felixcheung commented on a change in pull request #23977: [SPARK-26923][SQL][R] Refactor ArrowRRunner and RRunner to share one BaseRRunner
felixcheung commented on a change in pull request #23977: [SPARK-26923][SQL][R] Refactor ArrowRRunner and RRunner to share one BaseRRunner URL: https://github.com/apache/spark/pull/23977#discussion_r263245255 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/r/ArrowRRunner.scala ## @@ -202,14 +175,17 @@ class ArrowRRunner( // Likewise, there looks no way to send each batch in streaming format via socket // connection. See ARROW-4512. // So, it reads the whole Arrow streaming-formatted binary at once for now. - val in = new ByteArrayReadableSeekableByteChannel(readByteArrayData(length)) + val buffer = new Array[Byte](length) Review comment: why replace `readByteArrayData`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] felixcheung commented on a change in pull request #23977: [SPARK-26923][SQL][R] Refactor ArrowRRunner and RRunner to share one BaseRRunner
felixcheung commented on a change in pull request #23977: [SPARK-26923][SQL][R] Refactor ArrowRRunner and RRunner to share one BaseRRunner URL: https://github.com/apache/spark/pull/23977#discussion_r263245012 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala ## @@ -546,14 +548,23 @@ case class FlatMapGroupsInRWithArrowExec( override protected def doExecute(): RDD[InternalRow] = { child.execute().mapPartitionsInternal { iter => - val grouped = GroupedIterator(iter, groupingAttributes, child.output) + val grouped = GroupedIterator(iter, groupingAttributes, child.output).filter(_._2.hasNext) val getKey = ObjectOperator.deserializeRowToObject(keyDeserializer, groupingAttributes) - val runner = new ArrowRRunner(func, packageNames, broadcastVars, inputSchema, -SQLConf.get.sessionLocalTimeZone, RRunnerModes.DATAFRAME_GAPPLY) - val groupedByRKey = grouped.map { case (key, rowIter) => -val newKey = rowToRBytes(getKey(key).asInstanceOf[Row]) -(newKey, rowIter) + // Iterating over keys is relatively cheap. + val keys: Iterator[Array[Byte]] = +grouped.map { case (key, rowIter) => rowToRBytes(getKey(key).asInstanceOf[Row]) } + val groupedByRKey: Iterator[Iterator[InternalRow]] = +grouped.map { case (key, rowIter) => rowIter } + + val runner = new ArrowRRunner(func, packageNames, broadcastVars, inputSchema, +SQLConf.get.sessionLocalTimeZone, RRunnerModes.DATAFRAME_GAPPLY) { +protected override def bufferedWrite( +dataOut: DataOutputStream)(writeFunc: ByteArrayOutputStream => Unit): Unit = { + super.bufferedWrite(dataOut)(writeFunc) + // Don't forget we're sending keys additionally. + keys.foreach(dataOut.write) Review comment: It's a bit hard for me to visualize... isnt each row associated with a key, which could be different? if I understand before it's a pairing for `(key, row)` or `(key, rowIter)` - how does it handle this now? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] felixcheung commented on a change in pull request #23977: [SPARK-26923][SQL][R] Refactor ArrowRRunner and RRunner to share one BaseRRunner
felixcheung commented on a change in pull request #23977: [SPARK-26923][SQL][R] Refactor ArrowRRunner and RRunner to share one BaseRRunner URL: https://github.com/apache/spark/pull/23977#discussion_r263246347 ## File path: core/src/main/scala/org/apache/spark/api/r/RRunner.scala ## @@ -44,380 +35,149 @@ private[spark] class RRunner[U]( isDataFrame: Boolean = false, colNames: Array[String] = null, mode: Int = RRunnerModes.RDD) - extends Logging { - protected var bootTime: Double = _ - private var dataStream: DataInputStream = _ - val readData = numPartitions match { -case -1 => - serializer match { -case SerializationFormats.STRING => readStringData _ -case _ => readByteArrayData _ - } -case _ => readShuffledData _ - } - - def compute( - inputIterator: Iterator[_], - partitionIndex: Int): Iterator[U] = { -// Timing start -bootTime = System.currentTimeMillis / 1000.0 - -// we expect two connections -val serverSocket = new ServerSocket(0, 2, InetAddress.getByName("localhost")) -val listenPort = serverSocket.getLocalPort() - -// The stdout/stderr is shared by multiple tasks, because we use one daemon -// to launch child process as worker. -val errThread = RRunner.createRWorker(listenPort) - -// We use two sockets to separate input and output, then it's easy to manage -// the lifecycle of them to avoid deadlock. -// TODO: optimize it to use one socket - -// the socket used to send out the input of task -serverSocket.setSoTimeout(1) -dataStream = try { - val inSocket = serverSocket.accept() - RRunner.authHelper.authClient(inSocket) - startStdinThread(inSocket.getOutputStream(), inputIterator, partitionIndex) - - // the socket used to receive the output of task - val outSocket = serverSocket.accept() - RRunner.authHelper.authClient(outSocket) - val inputStream = new BufferedInputStream(outSocket.getInputStream) - new DataInputStream(inputStream) -} finally { - serverSocket.close() -} - -try { - newReaderIterator(dataStream, errThread) -} catch { - case e: Exception => -throw new SparkException("R computation failed with\n " + errThread.getLines(), e) -} - } + extends BaseRRunner[IN, OUT]( +func, +deserializer, +serializer, +packageNames, +broadcastVars, +numPartitions, +isDataFrame, +colNames, +mode) { protected def newReaderIterator( - dataStream: DataInputStream, errThread: BufferedStreamThread): Iterator[U] = { -new Iterator[U] { - def next(): U = { -val obj = _nextObj -if (hasNext()) { - _nextObj = read() -} -obj + dataStream: DataInputStream, errThread: BufferedStreamThread): ReaderIterator = { +new ReaderIterator(dataStream, errThread) { + private val readData = numPartitions match { +case -1 => + serializer match { +case SerializationFormats.STRING => readStringData _ +case _ => readByteArrayData _ + } +case _ => readShuffledData _ } - private var _nextObj = read() - - def hasNext(): Boolean = { -val hasMore = _nextObj != null -if (!hasMore) { - dataStream.close() + private def readShuffledData(length: Int): (Int, Array[Byte]) = { +length match { + case length if length == 2 => +val hashedKey = dataStream.readInt() +val contentPairsLength = dataStream.readInt() +val contentPairs = new Array[Byte](contentPairsLength) +dataStream.readFully(contentPairs) +(hashedKey, contentPairs) + case _ => null } -hasMore } -} - } - protected def writeData( - dataOut: DataOutputStream, - printOut: PrintStream, - iter: Iterator[_]): Unit = { -def writeElem(elem: Any): Unit = { - if (deserializer == SerializationFormats.BYTE) { -val elemArr = elem.asInstanceOf[Array[Byte]] -dataOut.writeInt(elemArr.length) -dataOut.write(elemArr) - } else if (deserializer == SerializationFormats.ROW) { -dataOut.write(elem.asInstanceOf[Array[Byte]]) - } else if (deserializer == SerializationFormats.STRING) { -// write string(for StringRRDD) -// scalastyle:off println -printOut.println(elem) -// scalastyle:on println + private def readByteArrayData(length: Int): Array[Byte] = { +length match { + case length if length > 0 => +val obj = new Array[Byte](length) +dataStream.readFully(obj) +obj + case _ => null +} } -} -for (elem <- iter) { - elem match { -case (key, innerIter: Iterator[_]) => - for (innerElem <- innerIter) { -writeElem(innerElem) - } -
[GitHub] [spark] felixcheung commented on a change in pull request #23977: [SPARK-26923][SQL][R] Refactor ArrowRRunner and RRunner to share one BaseRRunner
felixcheung commented on a change in pull request #23977: [SPARK-26923][SQL][R] Refactor ArrowRRunner and RRunner to share one BaseRRunner URL: https://github.com/apache/spark/pull/23977#discussion_r263244744 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala ## @@ -546,14 +548,23 @@ case class FlatMapGroupsInRWithArrowExec( override protected def doExecute(): RDD[InternalRow] = { child.execute().mapPartitionsInternal { iter => - val grouped = GroupedIterator(iter, groupingAttributes, child.output) + val grouped = GroupedIterator(iter, groupingAttributes, child.output).filter(_._2.hasNext) val getKey = ObjectOperator.deserializeRowToObject(keyDeserializer, groupingAttributes) - val runner = new ArrowRRunner(func, packageNames, broadcastVars, inputSchema, -SQLConf.get.sessionLocalTimeZone, RRunnerModes.DATAFRAME_GAPPLY) - val groupedByRKey = grouped.map { case (key, rowIter) => -val newKey = rowToRBytes(getKey(key).asInstanceOf[Row]) -(newKey, rowIter) + // Iterating over keys is relatively cheap. + val keys: Iterator[Array[Byte]] = +grouped.map { case (key, rowIter) => rowToRBytes(getKey(key).asInstanceOf[Row]) } + val groupedByRKey: Iterator[Iterator[InternalRow]] = +grouped.map { case (key, rowIter) => rowIter } + + val runner = new ArrowRRunner(func, packageNames, broadcastVars, inputSchema, +SQLConf.get.sessionLocalTimeZone, RRunnerModes.DATAFRAME_GAPPLY) { +protected override def bufferedWrite( +dataOut: DataOutputStream)(writeFunc: ByteArrayOutputStream => Unit): Unit = { + super.bufferedWrite(dataOut)(writeFunc) + // Don't forget we're sending keys additionally. + keys.foreach(dataOut.write) Review comment: so we expect `keys` is small? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #22466: [SPARK-25464][SQL] Create Database to the location, only if it is empty or does not exists.
SparkQA commented on issue #22466: [SPARK-25464][SQL] Create Database to the location,only if it is empty or does not exists. URL: https://github.com/apache/spark/pull/22466#issuecomment-470395577 **[Test build #103122 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103122/testReport)** for PR 22466 at commit [`6f4c9ce`](https://github.com/apache/spark/commit/6f4c9ce71a4d5951e3306c047263b37dfeb657db). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] codeborui commented on issue #24001: [SPARK-27080][SQL]bug fix: mergeWithMetastoreSchema with uniform lower case comparison
codeborui commented on issue #24001: [SPARK-27080][SQL]bug fix: mergeWithMetastoreSchema with uniform lower case comparison URL: https://github.com/apache/spark/pull/24001#issuecomment-470395593 @xuanyuanking This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24001: [SPARK-27080][SQL]bug fix: mergeWithMetastoreSchema with uniform lower case comparison
AmplabJenkins removed a comment on issue #24001: [SPARK-27080][SQL]bug fix: mergeWithMetastoreSchema with uniform lower case comparison URL: https://github.com/apache/spark/pull/24001#issuecomment-470394959 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24001: [SPARK-27080][SQL]bug fix: mergeWithMetastoreSchema with uniform lower case comparison
AmplabJenkins removed a comment on issue #24001: [SPARK-27080][SQL]bug fix: mergeWithMetastoreSchema with uniform lower case comparison URL: https://github.com/apache/spark/pull/24001#issuecomment-470394070 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24001: [SPARK-27080][SQL]bug fix: mergeWithMetastoreSchema with uniform lower case comparison
AmplabJenkins commented on issue #24001: [SPARK-27080][SQL]bug fix: mergeWithMetastoreSchema with uniform lower case comparison URL: https://github.com/apache/spark/pull/24001#issuecomment-470395080 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24001: [SPARK-27080][SQL]bug fix: mergeWithMetastoreSchema with uniform lower case comparison
AmplabJenkins commented on issue #24001: [SPARK-27080][SQL]bug fix: mergeWithMetastoreSchema with uniform lower case comparison URL: https://github.com/apache/spark/pull/24001#issuecomment-470394959 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24001: [SPARK-27080][SQL]bug fix: mergeWithMetastoreSchema with uniform lower case comparison
AmplabJenkins commented on issue #24001: [SPARK-27080][SQL]bug fix: mergeWithMetastoreSchema with uniform lower case comparison URL: https://github.com/apache/spark/pull/24001#issuecomment-470394070 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #23996: [MINOR][BUILD] Add 2 maven properties(hive.classifier and hive.parquet.group)
AmplabJenkins removed a comment on issue #23996: [MINOR][BUILD] Add 2 maven properties(hive.classifier and hive.parquet.group) URL: https://github.com/apache/spark/pull/23996#issuecomment-470393882 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #23996: [MINOR][BUILD] Add 2 maven properties(hive.classifier and hive.parquet.group)
AmplabJenkins commented on issue #23996: [MINOR][BUILD] Add 2 maven properties(hive.classifier and hive.parquet.group) URL: https://github.com/apache/spark/pull/23996#issuecomment-470393882 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #23996: [MINOR][BUILD] Add 2 maven properties(hive.classifier and hive.parquet.group)
AmplabJenkins removed a comment on issue #23996: [MINOR][BUILD] Add 2 maven properties(hive.classifier and hive.parquet.group) URL: https://github.com/apache/spark/pull/23996#issuecomment-470393884 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/103114/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #23943: [SPARK-27034][SQL] Nested schema pruning for ORC
AmplabJenkins removed a comment on issue #23943: [SPARK-27034][SQL] Nested schema pruning for ORC URL: https://github.com/apache/spark/pull/23943#issuecomment-470393759 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #23943: [SPARK-27034][SQL] Nested schema pruning for ORC
AmplabJenkins removed a comment on issue #23943: [SPARK-27034][SQL] Nested schema pruning for ORC URL: https://github.com/apache/spark/pull/23943#issuecomment-470393765 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/8582/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #23996: [MINOR][BUILD] Add 2 maven properties(hive.classifier and hive.parquet.group)
AmplabJenkins commented on issue #23996: [MINOR][BUILD] Add 2 maven properties(hive.classifier and hive.parquet.group) URL: https://github.com/apache/spark/pull/23996#issuecomment-470393884 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/103114/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] codeborui opened a new pull request #24001: [SPARK-27080][SQL]bug fix: mergeWithMetastoreSchema with uniform lower case comparison
codeborui opened a new pull request #24001: [SPARK-27080][SQL]bug fix: mergeWithMetastoreSchema with uniform lower case comparison URL: https://github.com/apache/spark/pull/24001 (cherry picked from commit f47a765) ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #23943: [SPARK-27034][SQL] Nested schema pruning for ORC
AmplabJenkins commented on issue #23943: [SPARK-27034][SQL] Nested schema pruning for ORC URL: https://github.com/apache/spark/pull/23943#issuecomment-470393765 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/8582/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #23943: [SPARK-27034][SQL] Nested schema pruning for ORC
AmplabJenkins commented on issue #23943: [SPARK-27034][SQL] Nested schema pruning for ORC URL: https://github.com/apache/spark/pull/23943#issuecomment-470393759 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #23996: [MINOR][BUILD] Add 2 maven properties(hive.classifier and hive.parquet.group)
SparkQA removed a comment on issue #23996: [MINOR][BUILD] Add 2 maven properties(hive.classifier and hive.parquet.group) URL: https://github.com/apache/spark/pull/23996#issuecomment-470334837 **[Test build #103114 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103114/testReport)** for PR 23996 at commit [`1f977a5`](https://github.com/apache/spark/commit/1f977a5e5404c01325ef08473f58ed0c49e8be6e). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #23996: [MINOR][BUILD] Add 2 maven properties(hive.classifier and hive.parquet.group)
SparkQA commented on issue #23996: [MINOR][BUILD] Add 2 maven properties(hive.classifier and hive.parquet.group) URL: https://github.com/apache/spark/pull/23996#issuecomment-470393474 **[Test build #103114 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103114/testReport)** for PR 23996 at commit [`1f977a5`](https://github.com/apache/spark/commit/1f977a5e5404c01325ef08473f58ed0c49e8be6e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager
viirya commented on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager URL: https://github.com/apache/spark/pull/23521#issuecomment-470392984 If it can't be directly merged to branch-2.4, I can make a PR for it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #23943: [SPARK-27034][SQL] Nested schema pruning for ORC
SparkQA commented on issue #23943: [SPARK-27034][SQL] Nested schema pruning for ORC URL: https://github.com/apache/spark/pull/23943#issuecomment-470392739 **[Test build #103121 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103121/testReport)** for PR 23943 at commit [`8ac4aed`](https://github.com/apache/spark/commit/8ac4aed746aae252d17c20474bf29edddb8b6ca9). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sandeep-katta commented on issue #22824: [SPARK-25834] [Structured Streaming]Update Mode should not be supported for stream-stream Outer Joins
sandeep-katta commented on issue #22824: [SPARK-25834] [Structured Streaming]Update Mode should not be supported for stream-stream Outer Joins URL: https://github.com/apache/spark/pull/22824#issuecomment-470391935 @koeninger @jose-torres, any comments need to be handled further ?? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] felixcheung commented on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager
felixcheung commented on issue #23521: [SPARK-26604][CORE] Clean up channel registration for StreamManager URL: https://github.com/apache/spark/pull/23521#issuecomment-470391521 I think we need to backport this into branch-2.4 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] felixcheung commented on issue #23842: [SPARK-26927]Fix race condition may cause dynamic allocation not working
felixcheung commented on issue #23842: [SPARK-26927]Fix race condition may cause dynamic allocation not working URL: https://github.com/apache/spark/pull/23842#issuecomment-470391381 FYI, Jenkins hasn't run on this This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] felixcheung commented on issue #23975: [SPARK-27056][MESOS]Remove start-shuffle-service.sh
felixcheung commented on issue #23975: [SPARK-27056][MESOS]Remove start-shuffle-service.sh URL: https://github.com/apache/spark/pull/23975#issuecomment-470390980 @tnachen @susanxhuynh do you know about this for mesos? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] chandulal edited a comment on issue #23990: [SPARK-27061][K8S] Expose Driver UI port on driver service to access …
chandulal edited a comment on issue #23990: [SPARK-27061][K8S] Expose Driver UI port on driver service to access … URL: https://github.com/apache/spark/pull/23990#issuecomment-470349577 > IIRC a headless service doesn't expose the port outside the k8s cluster, right? > > If that's the case, in which situations is this useful? > IIRC a headless service doesn't expose the port outside the k8s cluster, right? > > If that's the case, in which situations is this useful? Yes, It won't expose the port outside of the cluster. Our use-case is we have users who submit spark jobs to Kubernetes cluster, but they don't have Kubernetes background and access to the cluster. But, we need to provide the spark job logs for the debugging. There are two ways to do that: 1) Port forward driver pod on 4040. (This won't work, because users don't have access to Kubernetes cluster) 2) Expose 4040 port on driver service. Then, we can easily relay these logs to UI using Nginx reverse proxy on all driver services. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vanzin commented on a change in pull request #23842: [SPARK-26927]Fix race condition may cause dynamic allocation not working
vanzin commented on a change in pull request #23842: [SPARK-26927]Fix race condition may cause dynamic allocation not working URL: https://github.com/apache/spark/pull/23842#discussion_r263240842 ## File path: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ## @@ -725,10 +740,15 @@ private[spark] class ExecutorAllocationManager( if (stageIdToNumRunningTask.contains(stageId)) { stageIdToNumRunningTask(stageId) += 1 } -// This guards against the race condition in which the `SparkListenerTaskStart` -// event is posted before the `SparkListenerBlockManagerAdded` event, which is -// possible because these events are posted in different threads. (see SPARK-4951) -if (!allocationManager.executorIds.contains(executorId)) { +// This guards against the following race condition: +// 1. The `SparkListenerTaskStart` event is posted before the +// `SparkListenerExecutorAdded` event +// 2. The `SparkListenerExecutorRemoved` event is posted before the +// `SparkListenerTaskStart` event +// Above cases are possible because these events are posted in different threads. +// (see SPARK-4951 SPARK-26927) +if (!allocationManager.executorIds.contains(executorId) && + !allocationManager.removedExecutorIds.contains(executorId)) { Review comment: You didn't understand my suggestion. I'm asking you to, instead of keeping a list of executors that have been removed here, just as the scheduler about whether that executor exist. The scheduler keeps the authoritative lists of executors. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] chandulal commented on a change in pull request #23990: [SPARK-27061][K8S] Expose Driver UI port on driver service to access …
chandulal commented on a change in pull request #23990: [SPARK-27061][K8S] Expose Driver UI port on driver service to access … URL: https://github.com/apache/spark/pull/23990#discussion_r263240701 ## File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala ## @@ -55,6 +56,9 @@ private[spark] class DriverServiceFeatureStep( private val driverBlockManagerPort = kubernetesConf.sparkConf.getInt( config.DRIVER_BLOCK_MANAGER_PORT.key, DEFAULT_BLOCKMANAGER_PORT) + val driverUIPort = SparkUI.getUIPort(kubernetesConf +.sparkConf) Review comment: Made these changes and pushed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] chandulal commented on a change in pull request #23990: [SPARK-27061][K8S] Expose Driver UI port on driver service to access …
chandulal commented on a change in pull request #23990: [SPARK-27061][K8S] Expose Driver UI port on driver service to access … URL: https://github.com/apache/spark/pull/23990#discussion_r263240684 ## File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala ## @@ -45,6 +45,7 @@ private[spark] object Constants { // Default and fixed ports val DEFAULT_DRIVER_PORT = 7078 val DEFAULT_BLOCKMANAGER_PORT = 7079 + val DEFAULT_UI_PORT = 4040 Review comment: Made these changes and pushed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #23943: [SPARK-27034][SQL] Nested schema pruning for ORC
viirya commented on a change in pull request #23943: [SPARK-27034][SQL] Nested schema pruning for ORC URL: https://github.com/apache/spark/pull/23943#discussion_r263240456 ## File path: sql/core/benchmarks/OrcNestedSchemaPruningBenchmark-results.txt ## @@ -6,35 +6,35 @@ Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.3 Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz Selection:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -Top-level column113196 89 8.8 113.0 1.0X -Nested column 1316 1639 240 0.81315.5 0.1X +Top-level column116151 36 8.6 116.3 1.0X +Nested column 544604 31 1.8 544.5 0.2X Review comment: I think the read performance is more determined by persistent library than Spark side here. As Parquet, at Spark side we provide correctly pruned nested schema to ORC library. Pruning nested fields when reading data is done by ORC library. I'm not sure if we have much room to optimize at Spark side for this. Of course I'm open to any suggestion I'm missing right now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #23946: [SPARK-26860][PySpark] [SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation
HyukjinKwon commented on a change in pull request #23946: [SPARK-26860][PySpark] [SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation URL: https://github.com/apache/spark/pull/23946#discussion_r263240174 ## File path: python/pyspark/sql/window.py ## @@ -97,6 +97,33 @@ def rowsBetween(start, end): and ``Window.currentRow`` to specify special boundary values, rather than using integral values directly. +A row based boundary is based on the position of the row within the partition. +An offset indicates the number of rows above or below the current row, the frame for the +current row starts or ends. For instance, given a row based sliding frame with a lower bound +offset of -1 and a upper bound offset of +2. The frame for row with index 5 would range from +index 4 to index 6. + +>>> from pyspark.sql import Window +>>> from pyspark.sql import functions as func +>>> from pyspark.sql import SQLContext +>>> sc = SparkContext.getOrCreate() +>>> sqlContext = SQLContext(sc) +>>> tup = [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")] +>>> df = sqlContext.createDataFrame(tup, ["id", "category"]) +>>> window = Window.partitionBy("category").orderBy("id").rowsBetween(Window.currentRow, 1) +>>> df.withColumn("sum", func.sum("id").over(window)).show() ++---++---+ +| id|category|sum| ++---++---+ +| 1| b| 3| +| 2| b| 5| +| 3| b| 3| +| 1| a| 2| +| 1| a| 3| +| 2| a| 2| ++---++---+ + Review comment: You can change the doctest running codes from: ```python import doctest SparkContext('local[4]', 'PythonTest') (failure_count, test_count) = doctest.testmod() ``` to: ```python import doctest import pyspark.sql.window SparkContext('local[4]', 'PythonTest') globs = pyspark.sql.window.__dict__.copy() (failure_count, test_count) = doctest.testmod( pyspark.sql.window, globs=globs, optionflags=doctest.NORMALIZE_WHITESPACE)) ``` so that: 1. it doesn't need to add `` 2. when the tests are skipped, it shows the correct fully qualified module names like `pyspark.sql.window...`, rather then `__main__. ...`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #23670: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable
SparkQA commented on issue #23670: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable URL: https://github.com/apache/spark/pull/23670#issuecomment-470386400 **[Test build #103120 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103120/testReport)** for PR 23670 at commit [`27b0ec4`](https://github.com/apache/spark/commit/27b0ec46ad7e0ffae096c3520359c86e56ca6ebc). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #23915: [SPARK-24252][SQL] Add v2 catalog plugin system
viirya commented on a change in pull request #23915: [SPARK-24252][SQL] Add v2 catalog plugin system URL: https://github.com/apache/spark/pull/23915#discussion_r263238897 ## File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala ## @@ -620,6 +622,12 @@ class SparkSession private( */ @transient lazy val catalog: Catalog = new CatalogImpl(self) + @transient private lazy val catalogs = new mutable.HashMap[String, CatalogPlugin]() + + private[sql] def catalog(name: String): CatalogPlugin = synchronized { +catalogs.getOrElseUpdate(name, Catalogs.load(name, sessionState.conf)) Review comment: Looks like we don't support unloading or reloading a catalog plugin? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on issue #23670: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable
viirya commented on issue #23670: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable URL: https://github.com/apache/spark/pull/23670#issuecomment-470385336 retest this please. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on issue #23982: [SQL][MINOR] Reconcile the join types between data frame and sql interface
maropu commented on issue #23982: [SQL][MINOR] Reconcile the join types between data frame and sql interface URL: https://github.com/apache/spark/pull/23982#issuecomment-470384537 WDYT? @dongjoon-hyun @gatorsmile @cloud-fan This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] caneGuy commented on issue #23670: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable
caneGuy commented on issue #23670: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable URL: https://github.com/apache/spark/pull/23670#issuecomment-470384498 Sorry for bothering @viirya but the failed case i think is still not related with this pr. Any suggestion?Thanks `org.scalatest.exceptions.TestFailedException: scala.Predef.Set.apply[Int](0, 1, 2, 3).map[org.apache.spark.sql.Row, scala.collection.immutable.Set[org.apache.spark.sql.Row]](((x$3: Int) => org.apache.spark.sql.Row.apply(x$3)))(immutable.this.Set.canBuildFrom[org.apache.spark.sql.Row]).subsetOf(scala.Predef.refArrayOps[org.apache.spark.sql.Row](results).toSet[org.apache.spark.sql.Row]) was false Stacktrace sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: scala.Predef.Set.apply[Int](0, 1, 2, 3).map[org.apache.spark.sql.Row, scala.collection.immutable.Set[org.apache.spark.sql.Row]](((x$3: Int) => org.apache.spark.sql.Row.apply(x$3)))(immutable.this.Set.canBuildFrom[org.apache.spark.sql.Row]).subsetOf(scala.Predef.refArrayOps[org.apache.spark.sql.Row](results).toSet[org.apache.spark.sql.Row]) was false at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528) at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #23970: [SPARK-27054][BUILD][SQL] Remove the Calcite dependency
AmplabJenkins removed a comment on issue #23970: [SPARK-27054][BUILD][SQL] Remove the Calcite dependency URL: https://github.com/apache/spark/pull/23970#issuecomment-470384269 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/103112/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #23970: [SPARK-27054][BUILD][SQL] Remove the Calcite dependency
AmplabJenkins removed a comment on issue #23970: [SPARK-27054][BUILD][SQL] Remove the Calcite dependency URL: https://github.com/apache/spark/pull/23970#issuecomment-470384265 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #23970: [SPARK-27054][BUILD][SQL] Remove the Calcite dependency
AmplabJenkins commented on issue #23970: [SPARK-27054][BUILD][SQL] Remove the Calcite dependency URL: https://github.com/apache/spark/pull/23970#issuecomment-470384269 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/103112/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #23970: [SPARK-27054][BUILD][SQL] Remove the Calcite dependency
AmplabJenkins commented on issue #23970: [SPARK-27054][BUILD][SQL] Remove the Calcite dependency URL: https://github.com/apache/spark/pull/23970#issuecomment-470384265 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #23970: [SPARK-27054][BUILD][SQL] Remove the Calcite dependency
SparkQA removed a comment on issue #23970: [SPARK-27054][BUILD][SQL] Remove the Calcite dependency URL: https://github.com/apache/spark/pull/23970#issuecomment-470325496 **[Test build #103112 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103112/testReport)** for PR 23970 at commit [`66a1704`](https://github.com/apache/spark/commit/66a1704a6436c60a16f2d4daae27c333e73358e3). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #23970: [SPARK-27054][BUILD][SQL] Remove the Calcite dependency
SparkQA commented on issue #23970: [SPARK-27054][BUILD][SQL] Remove the Calcite dependency URL: https://github.com/apache/spark/pull/23970#issuecomment-470383916 **[Test build #103112 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103112/testReport)** for PR 23970 at commit [`66a1704`](https://github.com/apache/spark/commit/66a1704a6436c60a16f2d4daae27c333e73358e3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] caneGuy commented on issue #23826: [SPARK-26914][SQL] Fix scheduler pool may be unpredictable when we only want to use default pool and do not set spark.scheduler.pool for the session
caneGuy commented on issue #23826: [SPARK-26914][SQL] Fix scheduler pool may be unpredictable when we only want to use default pool and do not set spark.scheduler.pool for the session URL: https://github.com/apache/spark/pull/23826#issuecomment-470383325 No with two non-default pool, it can also has some case that we want to submit to 'A' but always be submitted to 'B' @srowen This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] liupc commented on a change in pull request #23842: [SPARK-26927]Fix race condition may cause dynamic allocation not working
liupc commented on a change in pull request #23842: [SPARK-26927]Fix race condition may cause dynamic allocation not working URL: https://github.com/apache/spark/pull/23842#discussion_r263236018 ## File path: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ## @@ -725,10 +740,15 @@ private[spark] class ExecutorAllocationManager( if (stageIdToNumRunningTask.contains(stageId)) { stageIdToNumRunningTask(stageId) += 1 } -// This guards against the race condition in which the `SparkListenerTaskStart` -// event is posted before the `SparkListenerBlockManagerAdded` event, which is -// possible because these events are posted in different threads. (see SPARK-4951) -if (!allocationManager.executorIds.contains(executorId)) { +// This guards against the following race condition: +// 1. The `SparkListenerTaskStart` event is posted before the +// `SparkListenerExecutorAdded` event +// 2. The `SparkListenerExecutorRemoved` event is posted before the +// `SparkListenerTaskStart` event +// Above cases are possible because these events are posted in different threads. +// (see SPARK-4951 SPARK-26927) +if (!allocationManager.executorIds.contains(executorId) && + !allocationManager.removedExecutorIds.contains(executorId)) { Review comment: Or we change the `AsyncEventQueue` for executor management to a SyncQueue which directly calls listener's callbacks. e.g: ``` class SyncQueue extends SparkListenerBus { def post(event: SparkListenerEvent): Unit = { super.postToAll(event) } } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #24000: [SPARK-27079][MINOR][SQL] Fix typo & Remove useless imports & Add missing `override` annotation
SparkQA commented on issue #24000: [SPARK-27079][MINOR][SQL] Fix typo & Remove useless imports & Add missing `override` annotation URL: https://github.com/apache/spark/pull/24000#issuecomment-470381933 **[Test build #103119 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103119/testReport)** for PR 24000 at commit [`389456a`](https://github.com/apache/spark/commit/389456a026fb3e2f64e8aefc2111c2be2f2ff5d8). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24000: [SPARK-27079][MINOR][SQL] Fix typo & Remove useless imports & Add missing `override` annotation
AmplabJenkins removed a comment on issue #24000: [SPARK-27079][MINOR][SQL] Fix typo & Remove useless imports & Add missing `override` annotation URL: https://github.com/apache/spark/pull/24000#issuecomment-470381656 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sujith71955 commented on issue #23921: [SPARK-26560][SQL]:Repeat select on HiveUDF fails
sujith71955 commented on issue #23921: [SPARK-26560][SQL]:Repeat select on HiveUDF fails URL: https://github.com/apache/spark/pull/23921#issuecomment-470381787 cc @srowen @HyukjinKwon This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24000: [SPARK-27079][MINOR][SQL] Fix typo & Remove useless imports & Add missing `override` annotation
AmplabJenkins removed a comment on issue #24000: [SPARK-27079][MINOR][SQL] Fix typo & Remove useless imports & Add missing `override` annotation URL: https://github.com/apache/spark/pull/24000#issuecomment-470381658 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/8581/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24000: [SPARK-27079][MINOR][SQL] Fix typo & Remove useless imports & Add missing `override` annotation
AmplabJenkins commented on issue #24000: [SPARK-27079][MINOR][SQL] Fix typo & Remove useless imports & Add missing `override` annotation URL: https://github.com/apache/spark/pull/24000#issuecomment-470381658 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/8581/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24000: [SPARK-27079][MINOR][SQL] Fix typo & Remove useless imports & Add missing `override` annotation
AmplabJenkins commented on issue #24000: [SPARK-27079][MINOR][SQL] Fix typo & Remove useless imports & Add missing `override` annotation URL: https://github.com/apache/spark/pull/24000#issuecomment-470381656 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24000: [SPARK-27079][MINOR][SQL] Fix typo & Remove useless imports & Add missing `override` annotation
AmplabJenkins removed a comment on issue #24000: [SPARK-27079][MINOR][SQL] Fix typo & Remove useless imports & Add missing `override` annotation URL: https://github.com/apache/spark/pull/24000#issuecomment-470377580 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on issue #24000: [SPARK-27079][MINOR][SQL] Fix typo & Remove useless imports & Add missing `override` annotation
HyukjinKwon commented on issue #24000: [SPARK-27079][MINOR][SQL] Fix typo & Remove useless imports & Add missing `override` annotation URL: https://github.com/apache/spark/pull/24000#issuecomment-470381465 ok to test This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dilipbiswal commented on issue #23982: [SQL][MINOR] Reconcile the join types between data frame and sql interface
dilipbiswal commented on issue #23982: [SQL][MINOR] Reconcile the join types between data frame and sql interface URL: https://github.com/apache/spark/pull/23982#issuecomment-470380969 @maropu OK.. I will do the 2nd option. Thank you. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] liupc commented on a change in pull request #23842: [SPARK-26927]Fix race condition may cause dynamic allocation not working
liupc commented on a change in pull request #23842: [SPARK-26927]Fix race condition may cause dynamic allocation not working URL: https://github.com/apache/spark/pull/23842#discussion_r263231449 ## File path: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ## @@ -725,10 +740,15 @@ private[spark] class ExecutorAllocationManager( if (stageIdToNumRunningTask.contains(stageId)) { stageIdToNumRunningTask(stageId) += 1 } -// This guards against the race condition in which the `SparkListenerTaskStart` -// event is posted before the `SparkListenerBlockManagerAdded` event, which is -// possible because these events are posted in different threads. (see SPARK-4951) -if (!allocationManager.executorIds.contains(executorId)) { +// This guards against the following race condition: +// 1. The `SparkListenerTaskStart` event is posted before the +// `SparkListenerExecutorAdded` event +// 2. The `SparkListenerExecutorRemoved` event is posted before the +// `SparkListenerTaskStart` event +// Above cases are possible because these events are posted in different threads. +// (see SPARK-4951 SPARK-26927) +if (!allocationManager.executorIds.contains(executorId) && + !allocationManager.removedExecutorIds.contains(executorId)) { Review comment: @vanzin > I don't believe the scheduler would post a task start if the executor was not tracked internally already. ~~It can happen, because now the `ListenerBus` works with several `AsyncEventQueue`, the `allocationManager.executorIds` differs from `CoarseGrainedSchedulerBackend.exectuorDataMap` in that `allocationManager.executorIds` is added when `SparkListenerExecutorAdded` polled and processed in `AsyncEventQueue`. So executor allocation is actually not kept sync with task scheduling for now.~~ I looked closer, seems it's because we now post a task start and then call the makeOffers immediately: ``` listenerBus.post( SparkListenerExecutorAdded(System.currentTimeMillis(), executorId, data)) makeOffers() ``` https://github.com/apache/spark/blob/a0e26cffc5c0163a780e86d37183b0cebbc7b24c/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L230 Maybe we should use a sync way to post this `SparkListenerExecutorAdded` events for `appManagement` queue? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dilipbiswal commented on issue #23995: [MINOR][SQL][TEST] Include usage example for generating output for single test in SQLQueryTestSuite
dilipbiswal commented on issue #23995: [MINOR][SQL][TEST] Include usage example for generating output for single test in SQLQueryTestSuite URL: https://github.com/apache/spark/pull/23995#issuecomment-470380454 Thanks a lot @HyukjinKwon This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org