[GitHub] spark issue #23268: [Hive][Minor] Refactor on HiveShim and Add Unit Tests
Github user sadhen commented on the issue: https://github.com/apache/spark/pull/23268 OK, I will modify the PR several hours later.
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23268#discussion_r240110126
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala ---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
    * This function in hive-0.13 become private, but we have to do this to work around hive bug
    */
   private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) {
-    val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-    val result: StringBuilder = new StringBuilder(old)
-    var first: Boolean = old.isEmpty
-
-    for (col <- cols) {
-      if (first) {
-        first = false
-      } else {
-        result.append(',')
-      }
-      result.append(col)
-    }
-    conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString)
+    val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+    val value = Option(conf.get(key, null))
+      .map(old => cols.+:(old))
+      .getOrElse(cols)
+      .mkString(",")
--- End diff --
Thanks. It doesn't matter which company it is. All the PRs are equally and reasonably reviewed, and merged per the same criteria.
[GitHub] spark issue #23270: [SPARK-26317][BUILD] Upgrade SBT to 0.13.18
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23270 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5911/ Test PASSed.
[GitHub] spark issue #23269: [SPARK-26316] Currently the wrong implementation in the ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23269 Merged build finished. Test FAILed.
[GitHub] spark issue #23270: [SPARK-26317][BUILD] Upgrade SBT to 0.13.18
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23270 Merged build finished. Test PASSed.
[GitHub] spark issue #23269: [SPARK-26316] Currently the wrong implementation in the ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23269 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99896/ Test FAILed.
[GitHub] spark issue #23269: [SPARK-26316] Currently the wrong implementation in the ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23269
**[Test build #99896 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99896/testReport)** for PR 23269 at commit [`03cfe2b`](https://github.com/apache/spark/commit/03cfe2b7506f5c5421aaf2858f3f31f2153db8fb).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #23270: [SPARK-26317][BUILD] Upgrade SBT to 0.13.18
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23270 **[Test build #99897 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99897/testReport)** for PR 23270 at commit [`0885c94`](https://github.com/apache/spark/commit/0885c947d3b4561df2d39f1bc9a35a06d7f0ed0c).
[GitHub] spark pull request #23270: [SPARK-26317][BUILD] Upgrade SBT to 0.13.18
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/23270
[SPARK-26317][BUILD] Upgrade SBT to 0.13.18

## What changes were proposed in this pull request?
SBT 0.13.14 ~ 1.1.1 has a bug on accessing `java.util.Base64.getDecoder` on JDK9+. It's fixed at 1.1.2 and backported to [0.13.18 (released on Nov 28th)](https://github.com/sbt/sbt/releases/tag/v0.13.18). This PR aims to update SBT.

## How was this patch tested?
Pass the Jenkins with the building and existing tests.

You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/dongjoon-hyun/spark SPARK-26317
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23270.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #23270
commit 0885c947d3b4561df2d39f1bc9a35a06d7f0ed0c
Author: Dongjoon Hyun
Date: 2018-12-10T07:48:16Z
[SPARK-26317][BUILD] Upgrade SBT to 0.13.18
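[Editor's note] The PR body above does not show the change itself; for readers unfamiliar with how an sbt build pins its tool version, this kind of upgrade is typically a one-line edit, assuming the conventional `project/build.properties` layout (a sketch, not a quote of the actual patch):
```
# project/build.properties -- conventional location of the sbt version pin
sbt.version=0.13.18
```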
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23268#discussion_r240107064
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala ---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
    * This function in hive-0.13 become private, but we have to do this to work around hive bug
    */
   private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) {
-    val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-    val result: StringBuilder = new StringBuilder(old)
-    var first: Boolean = old.isEmpty
-
-    for (col <- cols) {
-      if (first) {
-        first = false
-      } else {
-        result.append(',')
-      }
-      result.append(col)
-    }
-    conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString)
+    val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+    val value = Option(conf.get(key, null))
+      .map(old => cols.+:(old))
--- End diff --
Ah, it's `:+` not `+:`. Yea, it's confusing.
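[Editor's note] For readers following this exchange, a minimal plain-Scala sketch of the two operators being confused (the mnemonic: the colon sits on the collection side of the operator):
```
val cols = Seq("a", "b")
val old = "x"

cols :+ old    // Seq("a", "b", "x")  -- append: the columns first, then the old value
old +: cols    // Seq("x", "a", "b")  -- prepend: the old value first, then the columns
cols.+:(old)   // Seq("x", "a", "b")  -- method-call form of `old +: cols`

// mkString(",") then gives "a,b,x" for the first and "x,a,b" for the other two.
```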
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
Github user maomaoChibei commented on a diff in the pull request: https://github.com/apache/spark/pull/23268#discussion_r240107003
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala ---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
    * This function in hive-0.13 become private, but we have to do this to work around hive bug
    */
   private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) {
-    val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-    val result: StringBuilder = new StringBuilder(old)
-    var first: Boolean = old.isEmpty
-
-    for (col <- cols) {
-      if (first) {
-        first = false
-      } else {
-        result.append(',')
-      }
-      result.append(col)
-    }
-    conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString)
+    val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+    val value = Option(conf.get(key, null))
+      .map(old => cols.+:(old))
+      .getOrElse(cols)
+      .mkString(",")
--- End diff --
Thanks HyukjinKwon for the kind explanation. We are from an Alibaba subsidiary internet company working on real-time data computation over petabyte-scale data. We currently have a real-time engine built on Spark SQL, at least from the SQL engine point of view. The throughput is over 1k QPS with response latency under 100ms. It's a non-stop, 24*7 online platform serving over ten million active users. If this market scale and platform scale meet the criteria, let me know; we are looking for further cooperation. Thanks again. maomao
[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/23132 `spark.sql.legacy.decimalParsing.enabled` is still shown in the PR description and commit messages.
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23268#discussion_r240105070
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala ---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
    * This function in hive-0.13 become private, but we have to do this to work around hive bug
    */
   private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) {
-    val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-    val result: StringBuilder = new StringBuilder(old)
-    var first: Boolean = old.isEmpty
-
-    for (col <- cols) {
-      if (first) {
-        first = false
-      } else {
-        result.append(',')
-      }
-      result.append(col)
-    }
-    conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString)
+    val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+    val value = Option(conf.get(key, null))
+      .map(old => cols.+:(old))
--- End diff --
? output is different, no?
[GitHub] spark pull request #23124: [SPARK-25829][SQL] remove duplicated map keys wit...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/23124#discussion_r240104815
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -27,6 +27,8 @@ displayTitle: Spark SQL Upgrading Guide
   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but users can still distinguish them via `Dataset.show`, `Dataset.collect` etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 internally, and users can't distinguish them any more.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be udefined.
--- End diff --
A few typos. How about?
```
In Spark version 2.4 and earlier, users can create a map with duplicate keys via built-in functions like `CreateMap` and `StringToMap`. The behavior of map with duplicate keys is undefined. For example, the map lookup respects the duplicate key that appears first, `Dataset.collect` only keeps the duplicate key that appears last, and `MapKeys` returns duplicate keys. Since Spark 3.0, these built-in functions will remove duplicate map keys using the last-one-wins policy. Users may still read map values with duplicate keys from the data sources that do not enforce it (e.g. Parquet), but the behavior will be undefined.
```
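[Editor's note] To make the behavior change being documented concrete, here is a small hedged Spark shell sketch based on the migration note above (the expected output follows the described last-one-wins policy; the pre-3.0 result for duplicate keys is undefined, which is the point of the entry):
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dup-map-keys").getOrCreate()

// map(1, 'a', 1, 'b') repeats the key 1.
spark.sql("SELECT map(1, 'a', 1, 'b') AS m").show(false)
// Under the Spark 3.0 last-one-wins policy described above, m should be {1 -> "b"};
// in Spark 2.4 and earlier the result for duplicate keys was undefined.
```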
[GitHub] spark pull request #23204: Revert "[SPARK-21052][SQL] Add hash map metrics t...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23204#discussion_r240104812
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala ---
@@ -213,10 +213,6 @@ trait HashJoin {
         s"BroadcastHashJoin should not take $x as the JoinType")
     }
-    // At the end of the task, we update the avg hash probe.
-    TaskContext.get().addTaskCompletionListener[Unit](_ =>
--- End diff --
in this file, the `join` method takes `avgHashProbe: SQLMetric`, we should remove it.
[GitHub] spark pull request #23225: [SPARK-26287][CORE]Don't need to create an empty ...
Github user wangjiaochun commented on a diff in the pull request: https://github.com/apache/spark/pull/23225#discussion_r239990587
--- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -161,6 +161,10 @@ private void writeSortedFile(boolean isLastFile) {
     final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
       inMemSorter.getSortedIterator();
+    // If there are no sorted records, so we don't need to create an empty spill file.
+    if (!sortedRecords.hasNext()) {
+      return;
+    }
--- End diff --
Okay, I will make the changes.
[GitHub] spark pull request #23225: [SPARK-26287][CORE]Don't need to create an empty ...
Github user wangjiaochun commented on a diff in the pull request: https://github.com/apache/spark/pull/23225#discussion_r239990083
--- Diff: core/src/test/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriterSuite.java ---
@@ -235,6 +235,7 @@ public void writeEmptyIterator() throws Exception {
     final Option mapStatus = writer.stop(true);
     assertTrue(mapStatus.isDefined());
     assertTrue(mergedOutputFile.exists());
+    assertEquals(0, spillFilesCreated.size());
--- End diff --
I think it's unnecessary to add a new test case, and lines 239~242 of this writeEmptyIterator test can be deleted because they are always true.
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
Github user sadhen commented on a diff in the pull request: https://github.com/apache/spark/pull/23268#discussion_r240103941
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala ---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
    * This function in hive-0.13 become private, but we have to do this to work around hive bug
    */
   private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) {
-    val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-    val result: StringBuilder = new StringBuilder(old)
-    var first: Boolean = old.isEmpty
-
-    for (col <- cols) {
-      if (first) {
-        first = false
-      } else {
-        result.append(',')
-      }
-      result.append(col)
-    }
-    conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString)
+    val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+    val value = Option(conf.get(key, null))
+      .map(old => cols.+:(old))
--- End diff --
Just an idiomatic way to write Scala.
[GitHub] spark issue #18784: [SPARK-21559][Mesos] remove mesos fine-grained mode
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18784 Looks inactive. @srowen and @felixcheung, do you know anyone who might be interested in this?
[GitHub] spark issue #23269: [SPARK-26316] partial revert 21052 because of the perfor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23269 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5910/ Test PASSed.
[GitHub] spark issue #23269: [SPARK-26316] partial revert 21052 because of the perfor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23269 Merged build finished. Test PASSed.
[GitHub] spark issue #23269: [SPARK-26316] partial revert 21052 because of the perfor...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/23269 @JkSelf. Since the PR description already mentions the background and the partial revert, could you update the PR title to describe your solution more directly? `partial revert` doesn't give a clear idea.
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23268#discussion_r240101927
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala ---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
    * This function in hive-0.13 become private, but we have to do this to work around hive bug
    */
   private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) {
-    val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-    val result: StringBuilder = new StringBuilder(old)
-    var first: Boolean = old.isEmpty
-
-    for (col <- cols) {
-      if (first) {
-        first = false
-      } else {
-        result.append(',')
-      }
-      result.append(col)
-    }
-    conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString)
+    val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+    val value = Option(conf.get(key, null))
+      .map(old => cols.+:(old))
+      .getOrElse(cols)
+      .mkString(",")
--- End diff --
Right, but I don't think it's more readable. It's a non-critical path, so the performance concern is secondary anyway.
[GitHub] spark issue #23268: [Hive][Minor] Refactor on HiveShim and Add Unit Tests
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23268 Yup, the test looks okay in that way. Let's file a JIRA and only leave the test case.
[GitHub] spark issue #23269: [SPARK-26316] partial revert 21052 because of the perfor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23269 **[Test build #99896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99896/testReport)** for PR 23269 at commit [`03cfe2b`](https://github.com/apache/spark/commit/03cfe2b7506f5c5421aaf2858f3f31f2153db8fb).
[GitHub] spark issue #23269: [SPARK-26316] partial revert 21052 because of the perfor...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/23269 ok to test
[GitHub] spark pull request #23208: [SPARK-25530][SQL] data source v2 API refactor (b...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23208#discussion_r240101369
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -17,52 +17,49 @@ package org.apache.spark.sql.execution.datasources.v2
-import java.util.UUID
-
-import scala.collection.JavaConverters._
+import java.util.{Optional, UUID}

 import org.apache.spark.sql.{AnalysisException, SaveMode}
 import org.apache.spark.sql.catalyst.analysis.{MultiInstanceRelation, NamedRelation}
 import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics}
 import org.apache.spark.sql.catalyst.util.truncatedString
-import org.apache.spark.sql.sources.DataSourceRegister
 import org.apache.spark.sql.sources.v2._
 import org.apache.spark.sql.sources.v2.reader._
 import org.apache.spark.sql.sources.v2.writer.BatchWriteSupport
 import org.apache.spark.sql.types.StructType

 /**
- * A logical plan representing a data source v2 scan.
+ * A logical plan representing a data source v2 table.
  *
- * @param source An instance of a [[DataSourceV2]] implementation.
- * @param options The options for this scan. Used to create fresh [[BatchWriteSupport]].
- * @param userSpecifiedSchema The user-specified schema for this scan.
+ * @param table The table that this relation represents.
+ * @param options The options for this table operation. It's used to create fresh [[ScanBuilder]]
+ *                and [[BatchWriteSupport]].
  */
 case class DataSourceV2Relation(
-    // TODO: remove `source` when we finish API refactor for write.
-    source: TableProvider,
-    table: SupportsBatchRead,
+    table: Table,
     output: Seq[AttributeReference],
-    options: Map[String, String],
-    userSpecifiedSchema: Option[StructType] = None)
+    // TODO: use a simple case insensitive map instead.
+    options: DataSourceOptions)
   extends LeafNode with MultiInstanceRelation with NamedRelation {

-  import DataSourceV2Relation._

   override def name: String = table.name()

   override def simpleString: String = {
     s"RelationV2${truncatedString(output, "[", ", ", "]")} $name"
   }

-  def newWriteSupport(): BatchWriteSupport = source.createWriteSupport(options, schema)
-
-  def newScanBuilder(): ScanBuilder = {
-    val dsOptions = new DataSourceOptions(options.asJava)
-    table.newScanBuilder(dsOptions)
+  def newWriteSupport(inputSchema: StructType, mode: SaveMode): Optional[BatchWriteSupport] = {
--- End diff --
Nit: add a comment for the method, especially about when it will return None, although it is explained in `SupportsBatchWrite.createBatchWriteSupport`.
[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"
Github user JkSelf commented on the issue: https://github.com/apache/spark/pull/23204 @cloud-fan the new ticket is [here](https://github.com/apache/spark/pull/23269). I will close this ticket.
[GitHub] spark issue #23268: [Hive][Minor] Refactor on HiveShim and Add Unit Tests
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/23268 The existing way is too JAVA.
[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23204 Merged build finished. Test FAILed.
[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23204
**[Test build #99894 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99894/testReport)** for PR 23204 at commit [`9baf05e`](https://github.com/apache/spark/commit/9baf05e00f2839a1dfa5d3a23cf265e1fe21d2a6).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23204 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99894/ Test FAILed.
[GitHub] spark issue #23268: [Hive][Minor] Refactor on HiveShim and Add Unit Tests
Github user sadhen commented on the issue: https://github.com/apache/spark/pull/23268 Never mind, I just don't like the coding style personally.
[GitHub] spark issue #23269: partial revert 21052 because of the performance degradat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23269 Can one of the admins verify this patch?
[GitHub] spark issue #23269: partial revert 21052 because of the performance degradat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23269 Can one of the admins verify this patch?
[GitHub] spark issue #23269: partial revert 21052 because of the performance degradat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23269 Can one of the admins verify this patch?
[GitHub] spark pull request #23269: partial revert 21052 because of the performance d...
GitHub user JkSelf opened a pull request: https://github.com/apache/spark/pull/23269
partial revert 21052 because of the performance degradation in TPC-DS

## What changes were proposed in this pull request?
We tested TPC-DS on Spark 2.3 with and without [L486](https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486) and [L487](https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487) in the following cluster configuration. The result ([TPC-DS result](https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0)) shows a performance degradation, so we partially revert 21052 for now.

**Cluster info:**

|  | Master Node | Worker Nodes |
| -- | -- | -- |
| Node | 1x | 4x |
| Processor | Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz | Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz |
| Memory | 192 GB | 384 GB |
| Storage Main | 8 x 960G SSD | 8 x 960G SSD |
| Network | 10Gbe | |
| Role | CM Management, NameNode, Secondary NameNode, Resource Manager, Hive Metastore Server | DataNode, NodeManager |
| OS Version | CentOS 7.2 | CentOS 7.2 |
| Hadoop | Apache Hadoop 2.7.5 | Apache Hadoop 2.7.5 |
| Hive | Apache Hive 2.2.0 | |
| Spark | Apache Spark 2.1.0 & Apache Spark 2.3.0 | |
| JDK version | 1.8.0_112 | 1.8.0_112 |

**Related parameters setting:**

| Component | Parameter | Value |
| -- | -- | -- |
| Yarn Resource Manager | yarn.scheduler.maximum-allocation-mb | 120GB |
| | yarn.scheduler.minimum-allocation-mb | 1GB |
| | yarn.scheduler.maximum-allocation-vcores | 121 |
| | yarn.resourcemanager.scheduler.class | Fair Scheduler |
| Yarn Node Manager | yarn.nodemanager.resource.memory-mb | 120GB |
| | yarn.nodemanager.resource.cpu-vcores | 121 |
| Spark | spark.executor.memory | 110GB |
| | spark.executor.cores | 50 |

## How was this patch tested?
N/A

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/JkSelf/spark partial-revert-21052
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23269.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #23269
commit 03cfe2b7506f5c5421aaf2858f3f31f2153db8fb
Author: jiake
Date: 2018-12-10T06:50:32Z
partial revert 21052 because of the performance degradation in tpc-ds
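[Editor's note] For context on why two lines of metric book-keeping can matter here, the sketch below is a purely hypothetical reduction of the pattern, not the actual `HashedRelation` code: incrementing counters on every probe of a join hash map adds work to a path that runs once per probe-side row.
```
// Hypothetical illustration only. The counters stand in for the per-lookup
// SQLMetric updates that this PR proposes to revert; the get() below is the
// kind of method a hash join calls once per probe-side row.
final class IllustrativeLongMap(capacity: Int) {
  private val keys = new Array[Long](capacity)   // 0 is used as the empty marker here
  private val values = new Array[Long](capacity)

  private var numKeyLookups = 0L
  private var numProbes = 0L

  def put(key: Long, value: Long): Unit = {
    var pos = (key % capacity).toInt.abs
    while (keys(pos) != 0L && keys(pos) != key) pos = (pos + 1) % capacity
    keys(pos) = key
    values(pos) = value
  }

  def get(key: Long): Long = {
    numKeyLookups += 1                      // extra bookkeeping on every single lookup
    var pos = (key % capacity).toInt.abs
    numProbes += 1                          // ...and again on every probe step
    while (keys(pos) != 0L && keys(pos) != key) {
      pos = (pos + 1) % capacity
      numProbes += 1
    }
    values(pos)
  }
}
```
The partial revert keeps the lookup loop free of such per-call increments; as discussed later in this thread, the metric could be added back in a cheaper form afterwards.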
[GitHub] spark pull request #23255: [SPARK-26307] [SQL] Fix CTAS when INSERT a partit...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23255
[GitHub] spark issue #23255: [SPARK-26307] [SQL] Fix CTAS when INSERT a partitioned t...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23255 thanks, merging to master/2.4/2.3!
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
Github user sadhen commented on a diff in the pull request: https://github.com/apache/spark/pull/23268#discussion_r240098806
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala ---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
    * This function in hive-0.13 become private, but we have to do this to work around hive bug
    */
   private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) {
-    val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-    val result: StringBuilder = new StringBuilder(old)
-    var first: Boolean = old.isEmpty
-
-    for (col <- cols) {
-      if (first) {
-        first = false
-      } else {
-        result.append(',')
-      }
-      result.append(col)
-    }
-    conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString)
+    val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+    val value = Option(conf.get(key, null))
+      .map(old => cols.+:(old))
+      .getOrElse(cols)
+      .mkString(",")
--- End diff --
A for comprehension is just syntactic sugar and is not any more performant.
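[Editor's note] As a side note for readers, a minimal plain-Scala sketch of the desugaring being referred to (general Scala behavior, not specific to this PR):
```
val cols = Seq("a", "b", "c")
val result = new StringBuilder

// A `for` loop over a collection...
for (col <- cols) {
  result.append(col)
}

// ...compiles to the equivalent higher-order call:
cols.foreach(col => result.append(col))

// With `yield`, it desugars to `map` (and `flatMap` for nested generators):
val upper1 = for (col <- cols) yield col.toUpperCase
val upper2 = cols.map(col => col.toUpperCase)   // same result
```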
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
Github user sadhen commented on a diff in the pull request: https://github.com/apache/spark/pull/23268#discussion_r240098620
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala ---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
    * This function in hive-0.13 become private, but we have to do this to work around hive bug
    */
   private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) {
-    val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-    val result: StringBuilder = new StringBuilder(old)
-    var first: Boolean = old.isEmpty
-
-    for (col <- cols) {
-      if (first) {
-        first = false
-      } else {
-        result.append(',')
-      }
-      result.append(col)
-    }
-    conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString)
+    val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+    val value = Option(conf.get(key, null))
+      .map(old => cols.+:(old))
+      .getOrElse(cols)
+      .mkString(",")
--- End diff --
As for performance:
Current Version:
```
[info] HiveShimSuite:
[info] - appendReadColumns (549 milliseconds)
```
Previous Version:
```
[info] HiveShimSuite:
[info] - appendReadColumns (949 milliseconds)
```
[GitHub] spark issue #23268: [Hive][Minor] Refactor on HiveShim and Add Unit Tests
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/23268 @sadhen What is the motivation of this PR?
[GitHub] spark issue #23268: [Hive][Minor] Refactor on HiveShim and Add Unit Tests
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23268 Let's close this one.
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23268#discussion_r240097931
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala ---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
    * This function in hive-0.13 become private, but we have to do this to work around hive bug
    */
   private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) {
-    val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-    val result: StringBuilder = new StringBuilder(old)
-    var first: Boolean = old.isEmpty
-
-    for (col <- cols) {
-      if (first) {
-        first = false
-      } else {
-        result.append(',')
-      }
-      result.append(col)
-    }
-    conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString)
+    val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+    val value = Option(conf.get(key, null))
+      .map(old => cols.+:(old))
--- End diff --
Also, it looks like the output would be different after this change. It was `old + col`, but the current change looks like it's doing `col + old`.
[GitHub] spark issue #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicates and R...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23211 to make the PR smaller, can we add an individual rule `PushdownLeftSemiOrAntiJoin` first?
[GitHub] spark pull request #23225: [SPARK-26287][CORE]Don't need to create an empty ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23225
[GitHub] spark pull request #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicate...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23211#discussion_r240097479
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -984,6 +1002,28 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper {
         project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild))
+    // Similar to the above Filter over Project
+    // LeftSemi/LeftAnti over Project
+    case join @ Join(p @ Project(pList, grandChild), rightOp, LeftSemiOrAnti(joinType), joinCond)
--- End diff --
Shall we create a new rule `PushdownLeftSemiOrAntiJoin`?
[GitHub] spark issue #23268: [Hive][Minor] Refactor on HiveShim and Add Unit Tests
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23268 **[Test build #99895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99895/testReport)** for PR 23268 at commit [`b68e7b1`](https://github.com/apache/spark/commit/b68e7b1421f977c7256573564b12b4cc07d31f4a).
[GitHub] spark pull request #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicate...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23211#discussion_r240097255
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -649,13 +664,16 @@ object CollapseProject extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case p1 @ Project(_, p2: Project) =>
-      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList)) {
+      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
+        ScalarSubquery.hasScalarSubquery(p1.projectList) ||
+        ScalarSubquery.hasScalarSubquery(p2.projectList)) {
--- End diff --
Why did we allow it before?
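[Editor's note] A small hedged illustration of what this new check guards against (my own example, not from the PR): collapsing two adjacent Projects can duplicate a scalar subquery expression into several output columns, so it could be planned and executed more than once.
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("collapse-project-demo").getOrCreate()
import spark.implicits._

Seq(1, 2, 3).toDF("x").createOrReplaceTempView("t")
Seq(10).toDF("y").createOrReplaceTempView("r")

// The inner Project computes the scalar subquery once and names it `a`;
// the outer Project then uses `a` twice. If CollapseProject merged the two
// Projects, `(SELECT max(x) FROM t)` would appear in both output columns.
val df = spark.sql(
  """
    |SELECT a + 1 AS a1, a + 2 AS a2
    |FROM (SELECT (SELECT max(x) FROM t) AS a FROM r) sub
  """.stripMargin)

df.explain(true)
```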
[GitHub] spark issue #23225: [SPARK-26287][CORE]Don't need to create an empty spill f...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/23225 Thank you, @wangjiaochun .
[GitHub] spark issue #23268: [Hive][Minor] Refactor on HiveShim and Add Unit Tests
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23268 Can one of the admins verify this patch?
[GitHub] spark issue #23268: [Hive][Minor] Refactor on HiveShim and Add Unit Tests
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23268 Can one of the admins verify this patch?
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23268#discussion_r240097105
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala ---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
    * This function in hive-0.13 become private, but we have to do this to work around hive bug
    */
   private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) {
-    val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-    val result: StringBuilder = new StringBuilder(old)
-    var first: Boolean = old.isEmpty
-
-    for (col <- cols) {
-      if (first) {
-        first = false
-      } else {
-        result.append(',')
-      }
-      result.append(col)
-    }
-    conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString)
+    val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+    val value = Option(conf.get(key, null))
+      .map(old => cols.+:(old))
+      .getOrElse(cols)
+      .mkString(",")
--- End diff --
I don't think this is more readable. Also, the previous code is more performant. I wouldn't change this.
[GitHub] spark issue #23255: [SPARK-26307] [SQL] Fix CTAS when INSERT a partitioned t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23255 Merged build finished. Test PASSed.
[GitHub] spark issue #23255: [SPARK-26307] [SQL] Fix CTAS when INSERT a partitioned t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23255 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99891/ Test PASSed.
[GitHub] spark issue #23255: [SPARK-26307] [SQL] Fix CTAS when INSERT a partitioned t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23255
**[Test build #99891 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99891/testReport)** for PR 23255 at commit [`68c076a`](https://github.com/apache/spark/commit/68c076a34a1406057effb6953e2cedfadabd1324).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23204 can we follow https://github.com/apache/spark/pull/23204#issuecomment-445510026 and create a new ticket?
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
GitHub user sadhen opened a pull request: https://github.com/apache/spark/pull/23268
[Hive][Minor] Refactor on HiveShim and Add Unit Tests

## What changes were proposed in this pull request?
Refactor on HiveShim, and add Unit Tests.

## How was this patch tested?
```
$ build/sbt
> project hive
> testOnly *HiveShimSuite
```

You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/sadhen/spark refactor/hiveshim
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23268.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #23268
commit 987fa18a3b50e117fc839201fcc29c166732486e
Author: Darcy Shen
Date: 2018-12-10T06:32:35Z
Add unit tests for HiveShim appendReadColumns
commit b68e7b1421f977c7256573564b12b4cc07d31f4a
Author: Darcy Shen
Date: 2018-12-10T06:38:19Z
Refactor
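[Editor's note] The PR body only lists the sbt commands, so here is a hedged sketch of what such a `HiveShimSuite` test could look like; the exact suite in the PR may differ, but `HiveShim.appendReadColumns` and the `ColumnProjectionUtils` key are the pieces under test.
```
import java.lang.{Integer => JInt}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils

import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.hive.HiveShim

class HiveShimSuite extends SparkFunSuite {

  test("appendReadColumns") {
    val conf = new Configuration
    val ids: Seq[JInt] = Seq(1, 2, 3).map(JInt.valueOf)
    val names = Seq("a", "b", "c")

    // The first call populates the read-column names...
    HiveShim.appendReadColumns(conf, ids, names)
    assert(conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR) === names.mkString(","))

    // ...and a second call should append rather than overwrite.
    HiveShim.appendReadColumns(conf, Seq(JInt.valueOf(4)), Seq("d"))
    assert(conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR) ===
      (names :+ "d").mkString(","))
  }
}
```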
[GitHub] spark issue #22290: [SPARK-25285][CORE] Add executor task metrics, successfu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22290 Can one of the admins verify this patch?
[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"
Github user LuciferYang commented on the issue: https://github.com/apache/spark/pull/23204 OK ~ I already closed #23214.
[GitHub] spark pull request #23214: [SPARK-26155] Optimizing the performance of LongT...
Github user LuciferYang closed the pull request at: https://github.com/apache/spark/pull/23214
[GitHub] spark issue #23214: [SPARK-26155] Optimizing the performance of LongToUnsafe...
Github user LuciferYang commented on the issue: https://github.com/apache/spark/pull/23214 As @cloud-fan said, `the hash join metrics is wrongly implemented`, so we will partially revert SPARK-21052. This patch is no longer needed, so I am closing it ~
[GitHub] spark issue #23225: [SPARK-26287][CORE]Don't need to create an empty spill f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23225 Merged build finished. Test PASSed.
[GitHub] spark issue #23225: [SPARK-26287][CORE]Don't need to create an empty spill f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23225 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99890/ Test PASSed.
[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"
Github user JkSelf commented on the issue: https://github.com/apache/spark/pull/23204 @cloud-fan @dongjoon-hyun I updated the patch; please help review it if you have time. Thanks.
[GitHub] spark issue #23225: [SPARK-26287][CORE]Don't need to create an empty spill f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23225
**[Test build #99890 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99890/testReport)** for PR 23225 at commit [`bac1991`](https://github.com/apache/spark/commit/bac1991a5c047fd546785179d19144f9c076aa58).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23204 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5909/ Test PASSed.
[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23204 Merged build finished. Test PASSed.
[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23204 **[Test build #99894 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99894/testReport)** for PR 23204 at commit [`9baf05e`](https://github.com/apache/spark/commit/9baf05e00f2839a1dfa5d3a23cf265e1fe21d2a6).
[GitHub] spark pull request #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicate...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23211#discussion_r240092936
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -267,6 +267,17 @@ object ScalarSubquery {
       case _ => false
     }.isDefined
   }
+
+  def hasScalarSubquery(e: Expression): Boolean = {
+    e.find {
+      case s: ScalarSubquery => true
+      case _ => false
+    }.isDefined
+  }
+
+  def hasScalarSubquery(e: Seq[Expression]): Boolean = {
+    e.find(hasScalarSubquery(_)).isDefined
--- End diff --
`e.exists(hasScalarSubquery)`
[GitHub] spark pull request #22305: [SPARK-24561][SQL][Python] User-defined window ag...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22305#discussion_r240092271
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala ---
@@ -144,24 +282,107 @@ case class WindowInPandasExec(
         queue.close()
       }

-      val inputProj = UnsafeProjection.create(allInputs, child.output)
-      val pythonInput = grouped.map { case (_, rows) =>
-        rows.map { row =>
-          queue.add(row.asInstanceOf[UnsafeRow])
-          inputProj(row)
+      val stream = iter.map { row =>
+        queue.add(row.asInstanceOf[UnsafeRow])
+        row
+      }
+
+      val pythonInput = new Iterator[Iterator[UnsafeRow]] {
+
+        // Manage the stream and the grouping.
+        var nextRow: UnsafeRow = null
+        var nextGroup: UnsafeRow = null
+        var nextRowAvailable: Boolean = false
+        private[this] def fetchNextRow() {
+          nextRowAvailable = stream.hasNext
+          if (nextRowAvailable) {
+            nextRow = stream.next().asInstanceOf[UnsafeRow]
+            nextGroup = grouping(nextRow)
+          } else {
+            nextRow = null
+            nextGroup = null
+          }
+        }
+        fetchNextRow()
+
+        // Manage the current partition.
+        val buffer: ExternalAppendOnlyUnsafeRowArray =
+          new ExternalAppendOnlyUnsafeRowArray(inMemoryThreshold, spillThreshold)
+        var bufferIterator: Iterator[UnsafeRow] = _
+
+        val indexRow = new SpecificInternalRow(Array.fill(numBoundIndices)(IntegerType))
+
+        val frames = factories.map(_(indexRow))
+
+        private[this] def fetchNextPartition() {
+          // Collect all the rows in the current partition.
+          // Before we start to fetch new input rows, make a copy of nextGroup.
+          val currentGroup = nextGroup.copy()
+
+          // clear last partition
+          buffer.clear()
+
+          while (nextRowAvailable && nextGroup == currentGroup) {
--- End diff --
Thanks! Could you submit the PR to fix `WindowExec` please?
[GitHub] spark issue #21877: [SPARK-24923][SQL][WIP] Add unpartitioned CTAS and RTAS ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21877 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99893/ Test FAILed.
[GitHub] spark issue #21877: [SPARK-24923][SQL][WIP] Add unpartitioned CTAS and RTAS ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21877 Build finished. Test FAILed.
[GitHub] spark issue #21877: [SPARK-24923][SQL][WIP] Add unpartitioned CTAS and RTAS ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21877
**[Test build #99893 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99893/testReport)** for PR 21877 at commit [`e50d94b`](https://github.com/apache/spark/commit/e50d94bbc205369969e0c1707bda057ee99f0007).
* This patch **fails to build**.
* This patch **does not merge cleanly**.
* This patch adds the following public classes _(experimental)_:
  * `case class CreateTableAsSelect(`
  * `case class ReplaceTableAsSelect(`
  * `case class TableV2Relation(`
  * `case class AppendDataExec(`
  * `case class CreateTableAsSelectExec(`
  * `case class ReplaceTableAsSelectExec(`
  * `case class WriteToDataSourceV2Exec(`
  * `abstract class V2TableWriteExec(`
  * `implicit class CatalogHelper(catalog: CatalogProvider)`
  * `implicit class TableHelper(table: Table)`
  * `implicit class SourceHelper(source: DataSourceV2)`
  * `implicit class OptionsHelper(options: Map[String, String])`
[GitHub] spark issue #23248: [SPARK-26293][SQL] Cast exception when having python udf...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23248 If it's fine for 2.4, I think it's also fine for master as a temporary fix? We can create another ticket to clean up the subquery optimization hack. IIUC https://github.com/apache/spark/pull/23211 may help with it.
[GitHub] spark issue #23228: [MINOR][DOC] Update the condition description of seriali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23228 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5908/ Test PASSed.
[GitHub] spark issue #21877: [SPARK-24923][SQL][WIP] Add unpartitioned CTAS and RTAS ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21877 **[Test build #99893 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99893/testReport)** for PR 21877 at commit [`e50d94b`](https://github.com/apache/spark/commit/e50d94bbc205369969e0c1707bda057ee99f0007).
[GitHub] spark issue #23228: [MINOR][DOC] Update the condition description of seriali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23228 Merged build finished. Test PASSed.
[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23258#discussion_r240090371
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala ---
@@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared
   }

   test("Sort metrics") {
-    // Assume the execution plan is
-    // WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1))
-    val ds = spark.range(10).sort('id)
-    testSparkPlanMetrics(ds.toDF(), 2, Map.empty)
+    // Assume the execution plan with node id is
+    // Sort(nodeId = 0)
+    //   Exchange(nodeId = 1)
+    //     LocalTableScan(nodeId = 2)
+    val df = Seq(1, 3, 2).toDF("id").sort('id)
+    testSparkPlanMetrics(df, 2, Map.empty)
--- End diff --
can we just check something like `sortTime > 0`?
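[Editor's note] A hedged sketch of the kind of assertion being suggested, using plain plan traversal rather than the suite's own metric utilities (assumes a `SparkSession` named `spark`, as in the test suite):
```
// Run a sort, locate the SortExec node and check its "sortTime" metric,
// instead of pinning the whole plan shape.
import org.apache.spark.sql.execution.SortExec

val df = spark.range(1000000).toDF("id").sort("id")
df.collect()

val sortNode = df.queryExecution.executedPlan.collectFirst { case s: SortExec => s }
assert(sortNode.isDefined)
// "sortTime" is SortExec's timing metric; for a trivially small input it could
// round down to 0 ms, so a real test should sort enough rows for `> 0` to hold.
assert(sortNode.get.metrics("sortTime").value > 0)
```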
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23201#discussion_r240090192
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---
@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
           DecimalType(bigDecimal.precision, bigDecimal.scale)
         }
         decimalTry.getOrElse(StringType)
-      case VALUE_STRING => StringType
+      case VALUE_STRING =>
+        val stringValue = parser.getText
--- End diff --
The partition feature is shared between all the file-based sources; I think it's overkill to make it differ across data sources. The simplest solution to me is to ask all text sources to follow the behavior of partition value type inference.
[GitHub] spark issue #23228: [MINOR][DOC] Update the condition description of seriali...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23228 **[Test build #99892 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99892/testReport)** for PR 23228 at commit [`d5dadbf`](https://github.com/apache/spark/commit/d5dadbf30d5429c36ec3d5c2845a71c2717fd6f3).
[GitHub] spark pull request #23265: [2.4][SPARK-26021][SQL][FOLLOWUP] only deal with ...
Github user cloud-fan closed the pull request at: https://github.com/apache/spark/pull/23265 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23228: [MINOR][DOC] Update the condition description of seriali...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23228 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23204 If we can quickly finish #23214 (within several days), let's go for it. But if we can't, I'd suggest we do the partial revert first to fix the perf regression, and add back the metrics later. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23204: Revert "[SPARK-21052][SQL] Add hash map metrics to join"
Github user LuciferYang commented on the issue: https://github.com/apache/spark/pull/23204 @cloud-fan If we decide to partially revert SPARK-21052 and there's no need for #23214, I will close it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23255: [SPARK-26307] [SQL] Fix CTAS when INSERT a partitioned t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23255 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23255: [SPARK-26307] [SQL] Fix CTAS when INSERT a partitioned t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23255 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5907/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23255: [SPARK-26307] [SQL] Fix CTAS when INSERT a partitioned t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23255 **[Test build #99891 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99891/testReport)** for PR 23255 at commit [`68c076a`](https://github.com/apache/spark/commit/68c076a34a1406057effb6953e2cedfadabd1324). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23223: [SPARK-26269][YARN]Yarnallocator should have same blackl...
Github user Ngone51 commented on the issue: https://github.com/apache/spark/pull/23223 Hi @tgravescs, I tried it, but found it difficult to produce the KILLED_BY_RESOURCEMANAGER exit status. I followed [YARN-73](https://issues.apache.org/jira/browse/YARN-73) and [YARN-495](https://issues.apache.org/jira/browse/YARN-495), but things didn't go as I expected. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23248: [SPARK-26293][SQL] Cast exception when having python udf...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/23248 LGTM for the surgical fix for backporting. We need to fix this rule along with the other rules to avoid making such a strong and hidden assumption. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23248: [SPARK-26293][SQL] Cast exception when having pyt...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/23248#discussion_r240079120 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala --- @@ -131,8 +131,20 @@ object ExtractPythonUDFs extends Rule[LogicalPlan] with PredicateHelper { expressions.flatMap(collectEvaluableUDFs) } - def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { -case plan: LogicalPlan => extract(plan) + def apply(plan: LogicalPlan): LogicalPlan = plan match { +// SPARK-26293: A subquery will be rewritten into join later, and will go through this rule +// eventually. Here we skip subquery, as Python UDF only needs to be extracted once. +case _: Subquery => plan --- End diff -- Basically, we want to ensure this rule is running once and only once. In the future, if we have another rule/function that calls Optimizer.this.execute(plan), this rule needs to be fixed again... We have a very strong hidden assumption in the implementation. This looks risky in the long term. The current fix is fine for backporting to 2.4. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Github user fjh100456 commented on a diff in the pull request: https://github.com/apache/spark/pull/22707#discussion_r240077883 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala --- @@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter } } } + + test("SPARK-25717: Insert overwrite a recreated external and partitioned table " --- End diff -- Ok, thank you. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23266: [SPARK-26313][SQL] move read related methods from...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23266#discussion_r240074711 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/SupportsBatchRead.java --- @@ -20,14 +20,27 @@ import org.apache.spark.annotation.Evolving; import org.apache.spark.sql.sources.v2.reader.Scan; import org.apache.spark.sql.sources.v2.reader.ScanBuilder; +import org.apache.spark.sql.types.StructType; /** - * An empty mix-in interface for {@link Table}, to indicate this table supports batch scan. - * - * If a {@link Table} implements this interface, its {@link Table#newScanBuilder(DataSourceOptions)} - * must return a {@link ScanBuilder} that builds {@link Scan} with {@link Scan#toBatch()} - * implemented. - * + * A mix-in interface for {@link Table} to provide data reading ability of batch processing. */ @Evolving -public interface SupportsBatchRead extends Table { } +public interface SupportsBatchRead extends Table { + + /** + * Returns the schema of this table. + */ + StructType schema(); --- End diff -- To me, +1 for the current change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23266: [SPARK-26313][SQL] move read related methods from...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23266#discussion_r240073831 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/SupportsBatchRead.java --- @@ -20,14 +20,27 @@ import org.apache.spark.annotation.Evolving; import org.apache.spark.sql.sources.v2.reader.Scan; import org.apache.spark.sql.sources.v2.reader.ScanBuilder; +import org.apache.spark.sql.types.StructType; /** - * An empty mix-in interface for {@link Table}, to indicate this table supports batch scan. - * - * If a {@link Table} implements this interface, its {@link Table#newScanBuilder(DataSourceOptions)} - * must return a {@link ScanBuilder} that builds {@link Scan} with {@link Scan#toBatch()} - * implemented. - * + * A mix-in interface for {@link Table} to provide data reading ability of batch processing. */ @Evolving -public interface SupportsBatchRead extends Table { } +public interface SupportsBatchRead extends Table { + + /** + * Returns the schema of this table. + */ + StructType schema(); + + /** + * Returns a {@link ScanBuilder} which can be used to build a {@link Scan} later. The built + * {@link Scan} must implement {@link Scan#toBatch()}. Spark will call this method for each data + * scanning query. + * + * The builder can take some query specific information to do operators pushdown, and keep these + * information in the created {@link Scan}. + * + */ + ScanBuilder newScanBuilder(DataSourceOptions options); --- End diff -- +1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
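To make the discussed contract concrete, here is a sketch of what an implementor might write against the proposed interface; `MyScanBuilder` is a hypothetical ScanBuilder whose Scan implements `toBatch()`, and any other Table methods are omitted:

```scala
import org.apache.spark.sql.sources.v2.{DataSourceOptions, SupportsBatchRead}
import org.apache.spark.sql.sources.v2.reader.ScanBuilder
import org.apache.spark.sql.types.{LongType, StructType}

// Hypothetical table exposing a fixed schema and a per-query ScanBuilder.
class MyTable extends SupportsBatchRead {
  override def schema(): StructType = new StructType().add("id", LongType)
  override def newScanBuilder(options: DataSourceOptions): ScanBuilder =
    new MyScanBuilder(schema(), options) // MyScanBuilder is assumed, not shown
}
```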
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22707#discussion_r240073151 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala --- @@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter } } } + + test("SPARK-25717: Insert overwrite a recreated external and partitioned table " --- End diff -- How about `SPARK-25717: Insert overwrites remove old partition dirs correctly`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/23072 It looks enough to me, @srowen . --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16812: [SPARK-19465][SQL] Added options for custom boolean valu...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16812 I still don't think we need this since the workaround is easy. If other committers find it worthwhile, I won't object. If there is no interest in this PR afterwards, I would just close it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Github user fjh100456 commented on a diff in the pull request: https://github.com/apache/spark/pull/22707#discussion_r240070378 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala --- @@ -227,18 +227,22 @@ case class InsertIntoHiveTable( // Newer Hive largely improves insert overwrite performance. As Spark uses older Hive // version and we may not want to catch up new Hive version every time. We delete the // Hive partition first and then load data file into the Hive partition. - if (oldPart.nonEmpty && overwrite) { -oldPart.get.storage.locationUri.foreach { uri => - val partitionPath = new Path(uri) - val fs = partitionPath.getFileSystem(hadoopConf) - if (fs.exists(partitionPath)) { -if (!fs.delete(partitionPath, true)) { - throw new RuntimeException( -"Cannot remove partition directory '" + partitionPath.toString) -} -// Don't let Hive do overwrite operation since it is slower. -doHiveOverwrite = false + if (overwrite) { +val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_))) + .getOrElse { +ExternalCatalogUtils.generatePartitionPath( + partitionSpec, + partitionColumnNames, + new Path(table.location)) --- End diff -- Oops, seems it's a mistake. The `oldPart` is empty. Thank you very much, I'll change the code. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
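For reference, a rough illustration of the fallback shown in the diff above: when there is no existing partition (`oldPart` is empty), the partition directory is derived from the table location and the partition spec. The paths and spec values here are made up:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils

// Assumed table location and partition spec, for illustration only.
val tablePath = new Path("/warehouse/db.db/tbl")
val partitionSpec = Map("dt" -> "2018-12-09")
val partitionPath =
  ExternalCatalogUtils.generatePartitionPath(partitionSpec, Seq("dt"), tablePath)
// partitionPath: /warehouse/db.db/tbl/dt=2018-12-09
```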
[GitHub] spark issue #23228: [MINOR][DOC] Update the condition description of seriali...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/23228 I have updated it, thanks all. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23267: [SPARK-25401] [SQL] Reorder join predicates to match chi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23267 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org