[GitHub] [spark] EnricoMi commented on a diff in pull request #37304: [SPARK-39877][PySpark] Add unpivot to PySpark DataFrame API
EnricoMi commented on code in PR #37304: URL: https://github.com/apache/spark/pull/37304#discussion_r932044028 ## python/pyspark/sql/dataframe.py: ## @@ -2188,6 +2188,142 @@ def cube(self, *cols: "ColumnOrName") -> "GroupedData": # type: ignore[misc] return GroupedData(jgd, self) +def unpivot( +self, +ids: Optional[Union["ColumnOrName", List["ColumnOrName"], Tuple["ColumnOrName", ...]]] = None, +values: Optional[Union["ColumnOrName", List["ColumnOrName"], Tuple["ColumnOrName", ...]]] = None, +variableColumnName: Optional[str] = None, +valueColumnName: Optional[str] = None, +) -> "DataFrame": +""" +Unpivot a DataFrame from wide format to long format, optionally leaving +identifier columns set. This is the reverse to `groupBy(...).pivot(...).agg(...)`, +except for the aggregation, which cannot be reversed. + +This function is useful to massage a DataFrame into a format where some +columns are identifier columns ("ids"), while all other columns ("values") +are "unpivoted" to the rows, leaving just two non-id columns, named as given +by `variableColumnName` and `valueColumnName`. + +When no "id" columns are given, the unpivoted DataFrame consists of only the +"variable" and "value" columns. + +All "value" columns must share a least common data type. Unless they are the same data type, +all "value" columns are cast to the nearest common data type. For instance, types +`IntegerType` and `LongType` are cast to `LongType`, while `IntegerType` and `StringType` +do not have a common data type and `unpivot` fails. + +:func:`groupby` is an alias for :func:`groupBy`. + +.. versionadded:: 3.4.0 + +Parameters +-- +ids : str, Column, tuple, list, optional +Column(s) to use as identifiers. Can be a single column or column name, +or a list for multiple columns. +values : str, Column, tuple, list, optional +Column(s) to unpivot. Can be a single column or column name, or a list +for multiple columns. If not specified or empty, uses all columns that +are not set as `ids`. +variableColumnName : scalar, default 'variable' +Name of the variable column. If None it uses 'variable'. +valueColumnName : scalar, default 'value' +Name of the value column. If None it uses 'value'. + +Returns +--- +DataFrame +Unpivoted DataFrame. + +Examples + +>>> df = spark.createDataFrame( +... [(1, 11, 1.1), (2, 12, 1.2)], +... ["id", "int", "double"], +... ) +>>> df.show() ++---+---+--+ +| id|int|double| ++---+---+--+ +| 1| 11| 1.1| +| 2| 12| 1.2| ++---+---+--+ + +>>> df.unpivot("id").show() ++---++-+ +| id|variable|value| ++---++-+ +| 1| int| 11.0| +| 1| double| 1.1| +| 2| int| 12.0| +| 2| double| 1.2| ++---++-+ +""" +def to_jcols(cols) -> JavaObject: +l = cols +if cols is None: +l = [] +elif isinstance(cols, tuple): +l = list(cols) +elif not isinstance(cols, list): +l = [cols] +return self._jcols(*l) + +if variableColumnName is None: +variableColumnName = "variable" Review Comment: removed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun commented on a diff in pull request #36996: [SPARK-34305][SQL] Unify v1 and v2 ALTER TABLE .. SET SERDE tests
panbingkun commented on code in PR #36996: URL: https://github.com/apache/spark/pull/36996#discussion_r932085322 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableSetSerdeSuite.scala: ## @@ -0,0 +1,203 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.command.v1 + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.execution.command +import org.apache.spark.sql.internal.StaticSQLConf.CATALOG_IMPLEMENTATION + +/** + * This base suite contains unified tests for the `ALTER TABLE .. SET [SERDE|SERDEPROPERTIES]` + * command that check V1 table catalogs. The tests that cannot run for all V1 catalogs + * are located in more specific test suites: + * + * - V1 In-Memory catalog: `org.apache.spark.sql.execution.command.v1.AlterTableSetSerdeSuite` + * - V1 Hive External catalog: + * `org.apache.spark.sql.hive.execution.command.AlterTableSetSerdeSuite` + */ +trait AlterTableSetSerdeSuiteBase extends command.AlterTableSetSerdeSuiteBase { + + protected val isDatasourceTable = true + + private def isUsingHiveMetastore: Boolean = { +spark.sparkContext.conf.get(CATALOG_IMPLEMENTATION) == "hive" + } + + private def normalizeSerdeProp(props: Map[String, String]): Map[String, String] = { +props.filterNot(p => Seq("serialization.format", "path").contains(p._1)) + } + + private def maybeWrapException[T](expectException: Boolean)(body: => T): Unit = { +if (expectException) intercept[AnalysisException] { body } else body + } + + protected def testSetSerde(): Unit = { +withNamespaceAndTable("ns", "tbl") { t => + if (!isUsingHiveMetastore) { +assert(isDatasourceTable, "InMemoryCatalog only supports data source tables") + } + sql(s"CREATE TABLE $t (col1 int, col2 string, a int, b int) $defaultUsing " + +s"PARTITIONED BY (a, b)") + + val catalog = spark.sessionState.catalog + val tableIdent = TableIdentifier("tbl", Some("ns")) + def checkSerdeProps(expectedSerdeProps: Map[String, String]): Unit = { +val serdeProp = catalog.getTableMetadata(tableIdent).storage.properties +if (isUsingHiveMetastore) { Review Comment: OK,I will do it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] physinet opened a new pull request, #37329: [SPARK-39832][PYTHON] Support column arguments in regexp_replace
physinet opened a new pull request, #37329: URL: https://github.com/apache/spark/pull/37329 ### What changes were proposed in this pull request? Support either literal Python strings or Column objects for the pattern and replacement arguments for `regexp_replace`. ### Why are the changes needed? Allows using different replacements per row, as in [this example](https://stackoverflow.com/questions/64613761/in-pyspark-using-regexp-replace-how-to-replace-a-group-with-value-from-another,). ### Does this PR introduce _any_ user-facing change? Users can now use `regexp_replace` with columns for all three arguments. The first argument (string that regex should be applied to) can be either a Column object or the string name of the column. In summary, the following signatures are supported: ```python regexp_replace("str", "\d", "") regexp_replace(F.col("str"), "\d", "") regexp_replace("str", F.col("pattern"), F.col("replacement")) regexp_replace(F.col("str"), F.col("pattern"), F.col("replacement")) ``` ### How was this patch tested? Added unit tests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r932183215 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -410,12 +413,24 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper { case s @ Sort(order, _, operation @ ScanOperation(project, filter, sHolder: ScanBuilderHolder)) // Without building the Scan, we do not know the resulting column names after aggregate // push-down, and thus can't push down Top-N which needs to know the ordering column names. -// TODO: we can support simple cases like GROUP BY columns directly and ORDER BY the same -// columns, which we know the resulting column names: the original table columns. -if sHolder.pushedAggregate.isEmpty && filter.isEmpty && +// In particular, we push down the simple cases like GROUP BY expressions directly and +// ORDER BY the same expressions, which we know the original table columns. +if filter.isEmpty && CollapseProject.canCollapseExpressions(order, project, alwaysInline = true) => val aliasMap = getAliasMap(project) - val newOrder = order.map(replaceAlias(_, aliasMap)).asInstanceOf[Seq[SortOrder]] + def findGroupExprForSortOrder(sortOrder: SortOrder): SortOrder = sortOrder match { Review Comment: I think we can just do something similar with `replaceAlias` ``` sortOrder transform { case a: Attribute => sHolder.pushedAggOutputMap.getOrElse(a, a) } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #36918: [SQL][SPARK-39528] Use V2 Filter in SupportsRuntimeFiltering
cloud-fan closed pull request #36918: [SQL][SPARK-39528] Use V2 Filter in SupportsRuntimeFiltering URL: https://github.com/apache/spark/pull/36918 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] peter-toth commented on pull request #37319: [SPARK-39887][SQL] `PullOutGroupingExpressions` should generate different alias names
peter-toth commented on PR #37319: URL: https://github.com/apache/spark/pull/37319#issuecomment-1197915164 So, I was thinking about adding ``` case _: Union => var first = true plan.mapChildren { child => if (first) { first = false removeRedundantAliases(child, excluded ++ child.outputSet) } else { removeRedundantAliases(child, excluded) } } ``` to `RemoveRedundantAliases` (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L561) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wayneguow commented on pull request #36775: [SPARK-39389]Filesystem closed should not be considered as corrupt files
wayneguow commented on PR #36775: URL: https://github.com/apache/spark/pull/36775#issuecomment-1198234993 IMO, it's better that users can configure what exceptions can ignore corrupt files. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on pull request #37317: [SPARK-39894][SQL] Combine the similar binary comparison in boolean expression.
beliefer commented on PR #37317: URL: https://github.com/apache/spark/pull/37317#issuecomment-1198064228 ping @MaxGekk @gengliangwang @dongjoon-hyun cc @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r932182709 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -545,6 +560,9 @@ case class ScanBuilderHolder( var pushedPredicates: Seq[Predicate] = Seq.empty[Predicate] var pushedAggregate: Option[Aggregation] = None + + var pushedAggregateExpectedOutputMap: Map[AttributeReference, Expression] = Review Comment: ```suggestion var pushedAggOutputMap: Map[AttributeReference, Expression] = ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r932181859 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -545,6 +560,9 @@ case class ScanBuilderHolder( var pushedPredicates: Seq[Predicate] = Seq.empty[Predicate] var pushedAggregate: Option[Aggregation] = None + + var pushedAggregateExpectedOutputMap: Map[AttributeReference, Expression] = Review Comment: We can use `AttributeMap` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
cloud-fan commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r932186273 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -811,6 +800,244 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = sql( + """ +|SELECT dept AS my_dept, SUM(SALARY) FROM h2.test.employee +|GROUP BY dept +|ORDER BY my_dept +|LIMIT 1 +|""".stripMargin) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("DEPT", "IS_MANAGER").sum("SALARY") + .orderBy("DEPT", "IS_MANAGER") + .limit(1) +checkSortRemoved(df4) +checkLimitRemoved(df4) +checkPushedInfo(df4, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df4, Seq(Row(1, false, 9000.00))) + +val df5 = sql( + """ +|SELECT dept AS my_dept, is_manager AS my_manager, SUM(SALARY) FROM h2.test.employee +|GROUP BY dept, my_manager +|ORDER BY my_dept, my_manager +|LIMIT 1 +|""".stripMargin) +checkSortRemoved(df5) +checkLimitRemoved(df5) +checkPushedInfo(df5, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df5, Seq(Row(1, false, 9000.00))) + +val df6 = spark.read + .table("h2.test.employee") + .select($"SALARY", $"IS_MANAGER", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key", "IS_MANAGER").sum("SALARY") + .orderBy("key", "IS_MANAGER") + .limit(1) +checkSortRemoved(df6) +checkLimitRemoved(df6) +checkPushedInfo(df6, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END, " + +"IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df6, Seq(Row(0.00, false, 12000.00))) + +val df7 = sql( + """ +|SELECT dept, SUM(SALARY) FROM h2.test.employee +|GROUP BY dept +|ORDER BY SUM(SALARY) +|LIMIT 1 +|""".stripMargin) +checkSortRemoved(df7, false) +checkLimitRemoved(df7, false) +checkPushedInfo(df7, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []") +checkAnswer(df7, Seq(Row(6, 12000.00))) + +val df8 = sql( + """ +|SELECT dept, SUM(SALARY) AS total FROM h2.test.employee +|GROUP BY dept +|ORDER BY total +|LIMIT 1 +|""".stripMargin) +checkSortRemoved(df8, false) +checkLimitRemoved(df8, false) +checkPushedInfo(df8, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []") +checkAnswer(df8, Seq(Row(6, 12000.00))) + } + +
[GitHub] [spark] goutam-git commented on a diff in pull request #37065: [SPARK-38699][SQL] Use error classes in the execution errors of dictionary encoding
goutam-git commented on code in PR #37065: URL: https://github.com/apache/spark/pull/37065#discussion_r932196681 ## sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/compressionSchemes.scala: ## @@ -421,7 +421,7 @@ private[columnar] case object DictionaryEncoding extends CompressionScheme { override def compress(from: ByteBuffer, to: ByteBuffer): ByteBuffer = { if (overflow) { Review Comment: @MaxGekk should I remove the assert and use the new method name with error class instead of useDictionaryEncodingWhenDictionaryOverflowError() -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #37319: [SPARK-39887][SQL] `PullOutGroupingExpressions` should generate different alias names
cloud-fan commented on PR #37319: URL: https://github.com/apache/spark/pull/37319#issuecomment-1198119033 `Union.output` is a long-standing issue (same for `Join.output`). It reuses the first child's output but apparently `Union` and its first child output different values. We have to carefully work around this issue in places like `FoldablePropagation`. Maybe we need to do the same in `RemoveRedundantAliases`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you opened a new pull request, #37330: [SPARK-39911][SQL] Optimize global Sort to RepartitionByExpression
ulysses-you opened a new pull request, #37330: URL: https://github.com/apache/spark/pull/37330 ### What changes were proposed in this pull request? Optimize Global sort to RepartitionByExpression, for example: ``` Sort local Sort local Sort global=> RepartitionByExpression ``` ### Why are the changes needed? If a global sort below a local sort, the only meaningful thing is it's distribution. So this pr optimizes that global sort to RepartitionByExpression to save a local sort. ### Does this PR introduce _any_ user-facing change? no, only improve performance ### How was this patch tested? add test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #37293: [SPARK-39872][SQL] Change to use `BytePackerForLong#unpack8Values` with Array input api in `VectorizedDeltaBinaryPackedReader`
LuciferYang commented on code in PR #37293: URL: https://github.com/apache/spark/pull/37293#discussion_r932335919 ## sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaBinaryPackedReader.java: ## @@ -300,7 +300,8 @@ private void unpackMiniBlock() throws IOException { bitWidths[currentMiniBlock]); for (int j = 0; j < miniBlockSizeInValues; j += 8) { Review Comment: Testing the performance of the following changes: ``` private void unpackMiniBlock() throws IOException { Arrays.fill(this.unpackedValuesBuffer, 0); BytePackerForLong packer = Packer.LITTLE_ENDIAN.newBytePackerForLong( bitWidths[currentMiniBlock]); int bitWidth = packer.getBitWidth(); int times = miniBlockSizeInValues / 8; int total = times * bitWidth; ByteBuffer buffer = in.slice(total); byte[] array = buffer.array(); int i = 0; int srcOffset = buffer.arrayOffset() + buffer.position(); int targetOffSet = 0; while (i < times) { packer.unpack8Values(array, srcOffset, unpackedValuesBuffer, targetOffSet); i++; srcOffset += bitWidth; targetOffSet += 8; } remainingInMiniBlock = miniBlockSizeInValues; currentMiniBlock++; } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] senthh commented on pull request #35785: [SPARK-38213][STREAMING] Adding KafkaSink Metrics feature
senthh commented on PR #35785: URL: https://github.com/apache/spark/pull/35785#issuecomment-1198036728 @dongjoon-hyun @dgd-contributor @gaborgsomogyi @squito Could you be kind to review this PR, Please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang opened a new pull request, #37331: [SPARK-39913][BUILD] Upgrade to Arrow 9.0.0
LuciferYang opened a new pull request, #37331: URL: https://github.com/apache/spark/pull/37331 ### What changes were proposed in this pull request? Testing with Arrow 9.0.0, will update here later ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #37162: [SPARK-38910][YARN] Clean spark staging before unregister
AngersZh commented on PR #37162: URL: https://github.com/apache/spark/pull/37162#issuecomment-1197950750 ping @dongjoon-hyun The latest GA failed caused by ``` * DONE (miniUI) ERROR: dependency ‘pkgdown’ is not available for package ‘devtools’ * removing ‘/usr/local/lib/R/site-library/devtools’ The downloaded source packages are in ‘/tmp/RtmpTvMfJ6/downloaded_packages’ Warning messages: 1: In install.packages(c("devtools"), repos = "https://cloud.r-project.org/;) : installation of package ‘systemfonts’ had non-zero exit status 2: In install.packages(c("devtools"), repos = "https://cloud.r-project.org/;) : installation of package ‘textshaping’ had non-zero exit status 3: In install.packages(c("devtools"), repos = "https://cloud.r-project.org/;) : installation of package ‘ragg’ had non-zero exit status 4: In install.packages(c("devtools"), repos = "https://cloud.r-project.org/;) : installation of package ‘pkgdown’ had non-zero exit status 5: In install.packages(c("devtools"), repos = "https://cloud.r-project.org/;) : installation of package ‘devtools’ had non-zero exit status Error in loadNamespace(x) : there is no package called ‘devtools’ Calls: loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart Execution halted Error: Process completed with exit code 1. ``` Any advise? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun commented on pull request #37314: [SPARK-39891][BUILD] Bump h2 to 2.1.214
panbingkun commented on PR #37314: URL: https://github.com/apache/spark/pull/37314#issuecomment-1197977785 cc @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #36996: [SPARK-34305][SQL] Unify v1 and v2 ALTER TABLE .. SET SERDE tests
MaxGekk commented on code in PR #36996: URL: https://github.com/apache/spark/pull/36996#discussion_r932008623 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableSetSerdeSuite.scala: ## @@ -0,0 +1,203 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.command.v1 + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.execution.command +import org.apache.spark.sql.internal.StaticSQLConf.CATALOG_IMPLEMENTATION + +/** + * This base suite contains unified tests for the `ALTER TABLE .. SET [SERDE|SERDEPROPERTIES]` + * command that check V1 table catalogs. The tests that cannot run for all V1 catalogs + * are located in more specific test suites: + * + * - V1 In-Memory catalog: `org.apache.spark.sql.execution.command.v1.AlterTableSetSerdeSuite` + * - V1 Hive External catalog: + * `org.apache.spark.sql.hive.execution.command.AlterTableSetSerdeSuite` + */ +trait AlterTableSetSerdeSuiteBase extends command.AlterTableSetSerdeSuiteBase { + + protected val isDatasourceTable = true + + private def isUsingHiveMetastore: Boolean = { +spark.sparkContext.conf.get(CATALOG_IMPLEMENTATION) == "hive" + } + + private def normalizeSerdeProp(props: Map[String, String]): Map[String, String] = { +props.filterNot(p => Seq("serialization.format", "path").contains(p._1)) + } + + private def maybeWrapException[T](expectException: Boolean)(body: => T): Unit = { +if (expectException) intercept[AnalysisException] { body } else body + } + + protected def testSetSerde(): Unit = { +withNamespaceAndTable("ns", "tbl") { t => + if (!isUsingHiveMetastore) { +assert(isDatasourceTable, "InMemoryCatalog only supports data source tables") + } + sql(s"CREATE TABLE $t (col1 int, col2 string, a int, b int) $defaultUsing " + +s"PARTITIONED BY (a, b)") + + val catalog = spark.sessionState.catalog + val tableIdent = TableIdentifier("tbl", Some("ns")) + def checkSerdeProps(expectedSerdeProps: Map[String, String]): Unit = { +val serdeProp = catalog.getTableMetadata(tableIdent).storage.properties +if (isUsingHiveMetastore) { Review Comment: Could you extract common code to functions in `AlterTableSetSerdeSuiteBase`, and create dedicated tests for Hive and In-Memory catalogs. Please, invoke the common code from catalog specific tests. This is common convention in the unified tests for v1 and v2 catalogs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #37275: [SPARK-39835][SQL][3.2] Fix EliminateSorts remove global sort below the local sort
ulysses-you commented on PR #37275: URL: https://github.com/apache/spark/pull/37275#issuecomment-1197941198 cc @cloud-fan ready for branch-3.2 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #37276: [SPARK-39835][SQL][3.1] Fix EliminateSorts remove global sort below the local sort
ulysses-you commented on PR #37276: URL: https://github.com/apache/spark/pull/37276#issuecomment-1197940919 cc @cloud-fan ready for branch-3.1 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on pull request #36918: [SQL][SPARK-39528] Use V2 Filter in SupportsRuntimeFiltering
huaxingao commented on PR #36918: URL: https://github.com/apache/spark/pull/36918#issuecomment-1198240325 Thanks @cloud-fan @zinking -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37287: [WIP] code cleanup for CatalogImpl
cloud-fan commented on code in PR #37287: URL: https://github.com/apache/spark/pull/37287#discussion_r932317590 ## sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala: ## @@ -110,53 +108,44 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog { override def listTables(dbName: String): Dataset[Table] = { // `dbName` could be either a single database name (behavior in Spark 3.3 and prior) or // a qualified namespace with catalog name. We assume it's a single database name -// and check if we can find the dbName in sessionCatalog. If so we listTables under -// that database. Otherwise we try 3-part name parsing and locate the database. -if (sessionCatalog.databaseExists(dbName) || sessionCatalog.isGlobalTempViewDB(dbName)) { Review Comment: no need to check global temp view db. It doesn't belong to any catalog and v2 commands take care of it as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] peter-toth commented on pull request #37319: [SPARK-39887][SQL] `PullOutGroupingExpressions` should generate different alias names
peter-toth commented on PR #37319: URL: https://github.com/apache/spark/pull/37319#issuecomment-1198030620 I don't think that extra `Alias` does any harm in that test, just the expected needs to be amended. My proposal also fixes the issue of the following: ``` SELECT a, b AS a FROM ( SELECT a, a AS b FROM (SELECT a FROM VALUES (1) AS t(a)) UNION ALL SELECT a, b FROM (SELECT a, b FROM VALUES (1, 2) AS t(a, b)) ) ``` and the query returns the correct result: ``` +---+---+ | a| a| +---+---+ | 1| 1| | 1| 2| +---+---+ ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false
cloud-fan commented on code in PR #37327: URL: https://github.com/apache/spark/pull/37327#discussion_r932204473 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -153,19 +153,24 @@ class CSVOptions( * Disabled by default for backwards compatibility and performance. When enabled, date entries in * timestamp columns will be cast to timestamp upon parsing. Not compatible with * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters + * + * The flag is only enabled if inferSchema is set to true. */ - val inferDate = { -val inferDateFlag = getBool("inferDate") -if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) { + val preferDate = { +val preferDateFlag = getBool("preferDate") +if (preferDateFlag && SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) { throw QueryExecutionErrors.inferDateWithLegacyTimeParserError() } -inferDateFlag +if (preferDateFlag && !inferSchemaFlag) { Review Comment: I'd prefer to fix doc. This renaming is kind of "redefine" this option, and it doesn't make sense to bind `preferDate` to ``inferSchema`. ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -153,19 +153,24 @@ class CSVOptions( * Disabled by default for backwards compatibility and performance. When enabled, date entries in * timestamp columns will be cast to timestamp upon parsing. Not compatible with * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters + * + * The flag is only enabled if inferSchema is set to true. */ - val inferDate = { -val inferDateFlag = getBool("inferDate") -if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) { + val preferDate = { +val preferDateFlag = getBool("preferDate") +if (preferDateFlag && SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) { throw QueryExecutionErrors.inferDateWithLegacyTimeParserError() } -inferDateFlag +if (preferDateFlag && !inferSchemaFlag) { Review Comment: I'd prefer to fix doc. This renaming is kind of "redefine" this option, and it doesn't make sense to bind `preferDate` to `inferSchema`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ala commented on a diff in pull request #37228: [SPARK-37980][SQL] Extend METADATA column to support row indexes
ala commented on code in PR #37228: URL: https://github.com/apache/spark/pull/37228#discussion_r932280019 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala: ## @@ -223,8 +216,25 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging { }.toSeq }.getOrElse(Seq.empty) + val fileFormatReaderGeneratedMetadataColumns: Seq[Attribute] = +metadataColumns.map(_.name).flatMap { Review Comment: Added tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] closed pull request #36240: [SPARK-37787][CORE] fix bug, Long running Spark Job throw HDFS_DELEGATE_TOKEN not found in cache Exception
github-actions[bot] closed pull request #36240: [SPARK-37787][CORE] fix bug, Long running Spark Job throw HDFS_DELEGATE_TOKEN not found in cache Exception URL: https://github.com/apache/spark/pull/36240 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] RS131419 commented on a diff in pull request #37230: [SPARK-33326][SQL] Fix the problem of writing hive partition table without updating metadata information
RS131419 commented on code in PR #37230: URL: https://github.com/apache/spark/pull/37230#discussion_r932792260 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala: ## @@ -1611,4 +1611,26 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto } } } + + test("SPARK-33326: partition metadata auto update for dynamic partitions") { +val table = "partition_metadata_dynamic_partition" +Seq("hive", "parquet").foreach { source => + withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> "true") { +withTable(table) { + sql(s"CREATE TABLE $table (id INT, sp INT, dp INT) USING $source PARTITIONED BY (sp, dp)") + sql(s"INSERT INTO $table PARTITION (sp=0, dp) VALUES (0, 0)") + sql(s"INSERT OVERWRITE TABLE $table PARTITION (sp=0, dp) SELECT id, id FROM range(5)") + + for (i <- 0 until 5) { +val partition = spark.sessionState.catalog + .getPartition(TableIdentifier(table), Map("sp" -> s"0", "dp" -> s"$i")) +val numFiles = partition.parameters("numFiles") +assert(numFiles.nonEmpty && numFiles.toInt > 0) +val totalSize = partition.parameters("totalSize") Review Comment: Thanks for the reply! This patch will provide the calculated rawDataSize to hive via the alterPartitions method, but ultimately it fails to take effect, which I think may be due to hive's own behavior. Again, I did the test via spark2.3 and the rawDataSize was not updated, so I didn't do the verification in the unit test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
cloud-fan commented on PR #37287: URL: https://github.com/apache/spark/pull/37287#issuecomment-1198793708 > Is listTables() does not respect current catalog fixed in this PR? I think so, by always passing the fully qualified name to `getTable` in `listTables`. We can add tests later, to make this PR a pure refinement. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37329: [SPARK-39832][PYTHON] Support column arguments in regexp_replace
HyukjinKwon commented on PR #37329: URL: https://github.com/apache/spark/pull/37329#issuecomment-1198809042 cc @zero323 FYI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37329: [SPARK-39832][PYTHON] Support column arguments in regexp_replace
HyukjinKwon commented on code in PR #37329: URL: https://github.com/apache/spark/pull/37329#discussion_r932809698 ## python/pyspark/sql/functions.py: ## @@ -3262,7 +3262,19 @@ def regexp_extract(str: "ColumnOrName", pattern: str, idx: int) -> Column: return _invoke_function("regexp_extract", _to_java_column(str), pattern, idx) -def regexp_replace(str: "ColumnOrName", pattern: str, replacement: str) -> Column: +@overload +def regexp_replace(string: "ColumnOrName", pattern: str, replacement: str) -> Column: +... + + +@overload +def regexp_replace(string: "ColumnOrName", pattern: Column, replacement: Column) -> Column: +... + + +def regexp_replace( +string: "ColumnOrName", pattern: Union[str, Column], replacement: Union[str, Column] Review Comment: Can we write up `Parameters` section in the docs? And I think you can only keep `pattern: Union[str, Column]` and remove the ones above. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun commented on pull request #37258: [DO-NOT-MERGE] trigger CI
Yikun commented on PR #37258: URL: https://github.com/apache/spark/pull/37258#issuecomment-1198812267 Sorry for late reply, I'm busy in some local meeting recent days. > In addition, can we get the content of dmesg? @LuciferYang We can add a separate step like: ``` - name: Print debug info if: failure() run: | # print demsg info ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37258: [DO-NOT-MERGE] trigger CI
LuciferYang commented on PR #37258: URL: https://github.com/apache/spark/pull/37258#issuecomment-1198825739 > Sorry for late reply, I'm busy in some local meeting recent days. > > > In addition, can we get the content of dmesg? > > @LuciferYang We can add a separate step like: > > ``` > - name: Print debug info > if: failure() > run: | > # print demsg info > ``` Thanks for your reply, I have tested, the account executing GA should not have permission to execute `demsg` and GA test should already stability now :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] deshanxiao opened a new pull request, #37336: [SPARK-39916][SQL][MLLIB][REFACTOR] Merge ml SchemaUtils to SQL
deshanxiao opened a new pull request, #37336: URL: https://github.com/apache/spark/pull/37336 ### What changes were proposed in this pull request? Today we have two SchemaUtils: SQL SchemaUtils and mllib SchemaUtils. This pr is try to remove SchemaUtils in mllib. ### Why are the changes needed? Two SchemaUtils are often confused for us. MLlib SchemaUtils add a TODO flag and now we can do it. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? exist UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #37304: [SPARK-39877][PySpark] Add unpivot to PySpark DataFrame API
zhengruifeng commented on code in PR #37304: URL: https://github.com/apache/spark/pull/37304#discussion_r932846004 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2127,6 +2127,15 @@ class Dataset[T] private[sql]( valueColumnName: String): DataFrame = unpivot(ids, Array.empty, variableColumnName, valueColumnName) + /** + * Called from Python as Seq[Column] are easier to create via py4j than Array[Column]. + */ + private[sql] def _unpivot(ids: Seq[Column], Review Comment: can you rename the method? we do not use `_xxx` like function names in scala -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #37337: [SPARK-39917][SQL] Use different error classes for numeric/interval arithmetic overflow
MaxGekk commented on code in PR #37337: URL: https://github.com/apache/spark/pull/37337#discussion_r932884678 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalMathUtils.scala: ## @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.util + +import org.apache.spark.sql.errors.QueryExecutionErrors + +/** + * Helper functions for interval arithmetic operations with overflow. + */ +object IntervalMathUtils { + + def addExact(a: Int, b: Int): Int = withOverflow(Math.addExact(a, b), "try_add") + + def addExact(a: Long, b: Long): Long = withOverflow(Math.addExact(a, b), "try_add") + + def subtractExact(a: Int, b: Int): Int = withOverflow(Math.subtractExact(a, b), "try_subtract") + + def subtractExact(a: Long, b: Long): Long = withOverflow(Math.subtractExact(a, b), "try_subtract") + + def negateExact(a: Int): Int = withOverflow(Math.negateExact(a)) + + def negateExact(a: Long): Long = withOverflow(Math.negateExact(a)) + + private def withOverflow[A](f: => A, hint: String = ""): A = { +try { + f +} catch { + case e: ArithmeticException => +throw QueryExecutionErrors.intervalArithmeticOverflowError(e.getMessage, hint) Review Comment: Should we create sub-error classes per every op instead of using of `e.getMessage` from Java exception? cc @srielau @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ivoson commented on a diff in pull request #37268: [SPARK-39853][CORE] Support stage level task resource schedule for standalone cluster when dynamic allocation disabled
ivoson commented on code in PR #37268: URL: https://github.com/apache/spark/pull/37268#discussion_r928873929 ## core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala: ## @@ -388,14 +388,19 @@ private[spark] class TaskSchedulerImpl( val execId = shuffledOffers(i).executorId val host = shuffledOffers(i).host val taskSetRpID = taskSet.taskSet.resourceProfileId + val prof = sc.resourceProfileManager.resourceProfileFromId(taskSetRpID) + val targetExecutorRpID = if (prof.isForTaskOnly) { +ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID + } else { +taskSetRpID Review Comment: Thanks pointing this out. Will try to add a new `TaskResourceProfile` to process task only profiles to make the change more clear. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun commented on a diff in pull request #37305: [SPARK-39881][PYTHON] Fix erroneous check for black and reenable black validation.
Yikun commented on code in PR #37305: URL: https://github.com/apache/spark/pull/37305#discussion_r932795358 ## dev/lint-python: ## @@ -210,7 +210,7 @@ function black_test { local BLACK_STATUS= # Skip check if black is not installed. -$BLACK_BUILD 2> /dev/null +$PYTHON_EXECUTABLE -c 'import black' &> /dev/null if [ $? -ne 0 ]; then echo "The $BLACK_BUILD command was not found. Skipping black checks for now." Review Comment: nit: we might also change this warning to something like change from `$BLACK_BUILD` to `$PYTHON_EXECUTABLE -c 'import black'` or just simple as: `The black is not installed. Skipping black checks for now.` ## dev/pyproject.toml: ## @@ -27,7 +27,7 @@ testpaths = [ [tool.black] # When changing the version, we have to update # GitHub workflow version and dev/reformat-python -required-version = "21.12b0" +required-version = "22.6.0" Review Comment: unrelated nit: would we also want to pin the version to requirement? https://github.com/apache/spark/blob/master/dev/requirements.txt#L47 When devs install the requirements, then they get the pyspasrk dev requrired version. we can also do it in followup. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
beliefer commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r932847111 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -811,6 +800,244 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel checkAnswer(df2, Seq(Row(2, "david", 1.00))) } + test("scan with aggregate push-down and top N push-down") { +val df1 = spark.read + .table("h2.test.employee") + .groupBy("DEPT").sum("SALARY") + .orderBy("DEPT") + .limit(1) +checkSortRemoved(df1) +checkLimitRemoved(df1) +checkPushedInfo(df1, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df1, Seq(Row(1, 19000.00))) + +val df2 = sql( + """ +|SELECT dept AS my_dept, SUM(SALARY) FROM h2.test.employee +|GROUP BY dept +|ORDER BY my_dept +|LIMIT 1 +|""".stripMargin) +checkSortRemoved(df2) +checkLimitRemoved(df2) +checkPushedInfo(df2, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1") +checkAnswer(df2, Seq(Row(1, 19000.00))) + +val df3 = spark.read + .table("h2.test.employee") + .select($"SALARY", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key").sum("SALARY") + .orderBy("key") + .limit(1) +checkSortRemoved(df3) +checkLimitRemoved(df3) +checkPushedInfo(df3, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST] LIMIT 1") +checkAnswer(df3, Seq(Row(0, 44000.00))) + +val df4 = spark.read + .table("h2.test.employee") + .groupBy("DEPT", "IS_MANAGER").sum("SALARY") + .orderBy("DEPT", "IS_MANAGER") + .limit(1) +checkSortRemoved(df4) +checkLimitRemoved(df4) +checkPushedInfo(df4, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df4, Seq(Row(1, false, 9000.00))) + +val df5 = sql( + """ +|SELECT dept AS my_dept, is_manager AS my_manager, SUM(SALARY) FROM h2.test.employee +|GROUP BY dept, my_manager +|ORDER BY my_dept, my_manager +|LIMIT 1 +|""".stripMargin) +checkSortRemoved(df5) +checkLimitRemoved(df5) +checkPushedInfo(df5, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT, IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df5, Seq(Row(1, false, 9000.00))) + +val df6 = spark.read + .table("h2.test.employee") + .select($"SALARY", $"IS_MANAGER", +when(($"SALARY" > 8000).and($"SALARY" < 1), $"salary").otherwise(0).as("key")) + .groupBy("key", "IS_MANAGER").sum("SALARY") + .orderBy("key", "IS_MANAGER") + .limit(1) +checkSortRemoved(df6) +checkLimitRemoved(df6) +checkPushedInfo(df6, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: " + +"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END, " + +"IS_MANAGER]", + "PushedFilters: []", + "PushedTopN: ORDER BY [" + +"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 0.00 END " + +"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1") +checkAnswer(df6, Seq(Row(0.00, false, 12000.00))) + +val df7 = sql( + """ +|SELECT dept, SUM(SALARY) FROM h2.test.employee +|GROUP BY dept +|ORDER BY SUM(SALARY) +|LIMIT 1 +|""".stripMargin) +checkSortRemoved(df7, false) +checkLimitRemoved(df7, false) +checkPushedInfo(df7, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []") +checkAnswer(df7, Seq(Row(6, 12000.00))) + +val df8 = sql( + """ +|SELECT dept, SUM(SALARY) AS total FROM h2.test.employee +|GROUP BY dept +|ORDER BY total +|LIMIT 1 +|""".stripMargin) +checkSortRemoved(df8, false) +checkLimitRemoved(df8, false) +checkPushedInfo(df8, + "PushedAggregates: [SUM(SALARY)]", + "PushedGroupByExpressions: [DEPT]", + "PushedFilters: []") +checkAnswer(df8, Seq(Row(6, 12000.00))) + } + +
[GitHub] [spark] zhengruifeng commented on a diff in pull request #37304: [SPARK-39877][PySpark] Add unpivot to PySpark DataFrame API
zhengruifeng commented on code in PR #37304: URL: https://github.com/apache/spark/pull/37304#discussion_r932851912 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2127,6 +2127,15 @@ class Dataset[T] private[sql]( valueColumnName: String): DataFrame = unpivot(ids, Array.empty, variableColumnName, valueColumnName) + /** + * Called from Python as Seq[Column] are easier to create via py4j than Array[Column]. + */ + private[sql] def _unpivot(ids: Seq[Column], Review Comment: > And there are other private Python methods in Dataset (mapInPandas) and RelationalGroupedDataset (flatMapGroupsInPandas). they are also used in test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang opened a new pull request, #37337: [SPARK-39917][SQL] Use different error classes for numeric/interval arithmetic overflow
gengliangwang opened a new pull request, #37337: URL: https://github.com/apache/spark/pull/37337 ### What changes were proposed in this pull request? Similar with https://github.com/apache/spark/pull/37313, currently, when arithmetic overflow errors happen under ANSI mode, the error messages are like ``` [ARITHMETIC_OVERFLOW] long overflow. Use 'try_multiply' to tolerate overflow and return NULL instead. If necessary set spark.sql.ansi.enabled to "false" ``` The "(except for ANSI interval type)" part is confusing. We should remove it for the numeric arithmetic operations and have a new error class for the interval division error: `INTERVAL_ARITHMETIC_OVERFLOW` ### Why are the changes needed? For better error messages ### Does this PR introduce _any_ user-facing change? Yes, Use different error classes for arithmetic overflows of numeric/interval.. After changes, the error messages are simpler and more clear. ### How was this patch tested? UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang opened a new pull request, #37338: [SPARK-39918][SQL][MINOR] Replace the wording "un-comparable" with "incomparable" in error message
gengliangwang opened a new pull request, #37338: URL: https://github.com/apache/spark/pull/37338 ### What changes were proposed in this pull request? Update the codegen error message for data type which can't be compared by replacing`un-comparable` with `incomparable` ### Why are the changes needed? Incomparable is the correct wording here ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on pull request #37338: [SPARK-39918][SQL][MINOR] Replace the wording "un-comparable" with "incomparable" in error message
gengliangwang commented on PR #37338: URL: https://github.com/apache/spark/pull/37338#issuecomment-1198884914 This is trivial. I found it when working on https://github.com/apache/spark/pull/37337 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Jonathancui123 commented on pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false
Jonathancui123 commented on PR #37327: URL: https://github.com/apache/spark/pull/37327#issuecomment-1198894995 > Should we keep requirement that `inferDate = true` needs `inferSchema = true`? I think we should clarify semantics. @sadikovi I think we should keep the requirement and the new exception type in this PR. The exception will clarify that the primary purpose of the `inferDate` flag is for allowing dates in the inferred schema. The requirement that `inferDate = true` needs `inferSchema = true` makes sense because otherwise, `inferDate` is modifying parsing fallback behavior for no reason. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead
MaxGekk commented on code in PR #37322: URL: https://github.com/apache/spark/pull/37322#discussion_r932499581 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetUnpivotSuite.scala: ## @@ -305,14 +305,17 @@ class DatasetUnpivotSuite extends QueryTest valueColumnName = "val" ) } -checkErrorClass( +checkError( exception = e, errorClass = "UNPIVOT_VALUE_DATA_TYPE_MISMATCH", - msg = "Unpivot value columns must share a least common type, some types do not: \\[" + -"\"STRING\" \\(`str1#\\d+`\\), " + -"\"INT\" \\(`int1#\\d+`, `int2#\\d+`, `int3#\\d+`, ...\\), " + -"\"BIGINT\" \\(`long1#\\d+L`, `long2#\\d+L`\\)\\];(\n.*)*", - matchMsg = true) + errorSubClass = None, + sqlState = None, Review Comment: Let me add the overloaded method: ```scala protected def checkError( exception: SparkThrowable, errorClass: String, parameters: Map[String, String], matchPVals: Boolean): Unit = ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37335: [SPARK-39895][PYTHON] Support multiple column drop
dongjoon-hyun commented on code in PR #37335: URL: https://github.com/apache/spark/pull/37335#discussion_r932701092 ## python/pyspark/sql/dataframe.py: ## @@ -3237,17 +3237,18 @@ def drop(self, *cols: "ColumnOrName") -> "DataFrame": # type: ignore[misc] """ if len(cols) == 1: col = cols[0] -if isinstance(col, str): -jdf = self._jdf.drop(col) -elif isinstance(col, Column): -jdf = self._jdf.drop(col._jc) -else: +if not isinstance(col, (str, Column)): raise TypeError("col should be a string or a Column") +jdf = self._jdf.drop(_to_java_column(col)) Review Comment: This code path change is irrelevant to this PR's goal, `Support multiple column drop`, isn't it? If this is a bug, it's worth to have another JIRA and make a PR, @santosh-d3vpl3x . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor commented on pull request #37280: [SPARK-39862][SQL] Fix bugs in existence DEFAULT value lookups for V2 data sources
dtenedor commented on PR #37280: URL: https://github.com/apache/spark/pull/37280#issuecomment-1198675997 @gengliangwang Sure, this is done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
amaliujia commented on PR #37287: URL: https://github.com/apache/spark/pull/37287#issuecomment-1198716223 Is `listTables()` does not respect current catalog fixed in this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37335: [SPARK-39895][PYTHON] Support multiple column drop
dongjoon-hyun commented on code in PR #37335: URL: https://github.com/apache/spark/pull/37335#discussion_r932774765 ## python/pyspark/sql/tests/test_dataframe.py: ## @@ -87,6 +87,21 @@ def test_help_command(self): pydoc.render_doc(df.foo) pydoc.render_doc(df.take(1)) +def test_drop(self): +df = self.spark.createDataFrame([("A", 50, "Y"), ("B", 60, "Y")], ["name", "age", "active"]) + +self.assertEqual(df.drop("active").columns, ["name", "age"]) + +self.assertEqual(df.drop("active", "nonexistent_column").columns, ["name", "age"]) + +self.assertEqual(df.drop("name", "age", "active").columns, []) Review Comment: Does this fail without your patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cfmcgrady commented on a diff in pull request #37334: [SPARK-39887][SQL] RemoveRedundantAliases should keep attributes of a Union's first child
cfmcgrady commented on code in PR #37334: URL: https://github.com/apache/spark/pull/37334#discussion_r932804420 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -559,6 +559,17 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] { }) Join(newLeft, newRight, joinType, newCondition, hint) + case _: Union => Review Comment: Shall we update the method comment as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun commented on a diff in pull request #37305: [SPARK-39881][PYTHON] Fix erroneous check for black and reenable black validation.
Yikun commented on code in PR #37305: URL: https://github.com/apache/spark/pull/37305#discussion_r932810013 ## python/pyspark/ml/feature.py: ## @@ -968,7 +968,7 @@ class _CountVectorizerParams(JavaParams, HasInputCol, HasOutputCol): def __init__(self, *args: Any): super(_CountVectorizerParams, self).__init__(*args) -self._setDefault(minTF=1.0, minDF=1.0, maxDF=2 ** 63 - 1, vocabSize=1 << 18, binary=False) +self._setDefault(minTF=1.0, minDF=1.0, maxDF=2**63 - 1, vocabSize=1 << 18, binary=False) Review Comment: Looks like `**` is not covered by PEP8, and the main reason of balck change is [consider about readable](https://github.com/psf/black/pull/2095#issuecomment-871694472), so I personaly think black choice is right. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun commented on pull request #37328: [SPARK-39907][PS] Implement axis and skipna of Series.argmin
Yikun commented on PR #37328: URL: https://github.com/apache/spark/pull/37328#issuecomment-1198820043 otherwise LGTM! Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37258: [DO-NOT-MERGE] trigger CI
HyukjinKwon commented on PR #37258: URL: https://github.com/apache/spark/pull/37258#issuecomment-1198843792 Let me close this one. I believe all are fixed now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #37258: [DO-NOT-MERGE] trigger CI
HyukjinKwon closed pull request #37258: [DO-NOT-MERGE] trigger CI URL: https://github.com/apache/spark/pull/37258 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #37304: [SPARK-39877][PySpark] Add unpivot to PySpark DataFrame API
zhengruifeng commented on code in PR #37304: URL: https://github.com/apache/spark/pull/37304#discussion_r932840669 ## python/pyspark/context.py: ## @@ -309,10 +309,7 @@ def _do_init( if sys.version_info[:2] < (3, 8): with warnings.catch_warnings(): warnings.simplefilter("once") -warnings.warn( -"Python 3.7 support is deprecated in Spark 3.4.", -FutureWarning -) +warnings.warn("Python 3.7 support is deprecated in Spark 3.4.", FutureWarning) Review Comment: I encounter the same issue in other PR, maybe due to the `black` version. Let us revert this change -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead
MaxGekk commented on PR #37322: URL: https://github.com/apache/spark/pull/37322#issuecomment-1198870290 @anchovYu @cloud-fan @HyukjinKwon @gengliangwang Could you review this PR, please. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
amaliujia commented on PR #37287: URL: https://github.com/apache/spark/pull/37287#issuecomment-1198905411 > > Is listTables() does not respect current catalog fixed in this PR? > > I think so, by always passing the fully qualified name to `getTable` in `listTables`. We can add tests later, to make this PR a pure refinement. thanks for the confirmation! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang closed pull request #37280: [SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups
gengliangwang closed pull request #37280: [SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups URL: https://github.com/apache/spark/pull/37280 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on pull request #37280: [SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups
gengliangwang commented on PR #37280: URL: https://github.com/apache/spark/pull/37280#issuecomment-1198710296 Thanks, merging to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on pull request #37332: [SPARK-39914][SQL] Add DS V2 Filter to V1 Filter conversion
huaxingao commented on PR #37332: URL: https://github.com/apache/spark/pull/37332#issuecomment-1198735772 The GA failure doesn't seem relevant. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37335: [SPARK-39895][PYTHON] Support multiple column drop
HyukjinKwon commented on code in PR #37335: URL: https://github.com/apache/spark/pull/37335#discussion_r932808436 ## python/pyspark/sql/dataframe.py: ## @@ -3244,10 +3244,14 @@ def drop(self, *cols: "ColumnOrName") -> "DataFrame": # type: ignore[misc] else: raise TypeError("col should be a string or a Column") else: -for col in cols: -if not isinstance(col, str): -raise TypeError("each col in the param list should be a string") -jdf = self._jdf.drop(self._jseq(cols)) +if all(isinstance(col, str) for col in cols): +jdf = self._jdf.drop(self._jseq(cols)) +elif all(isinstance(col, Column) for col in cols): +jdf = self._jdf +for col in cols: +jdf = jdf.drop(col._jc) # type: ignore[union-attr] Review Comment: Can we avoid looping here? This is super expensive in Spark SQL optmizer. Ideally we should add the signature of `def drop(colNames: Column*` in Scala side first, and PySpark side directlly invokes it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37326: [SPARK-39906][INFRA] Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'
HyukjinKwon commented on PR #37326: URL: https://github.com/apache/spark/pull/37326#issuecomment-1198807963 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] deshanxiao commented on pull request #37336: [SPARK-39916][SQL][MLLIB][REFACTOR] Merge ml SchemaUtils to SQL
deshanxiao commented on PR #37336: URL: https://github.com/apache/spark/pull/37336#issuecomment-1198839786 CC @gengliangwang @dongjoon-hyun @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
beliefer commented on code in PR #37320: URL: https://github.com/apache/spark/pull/37320#discussion_r932843706 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -545,6 +560,9 @@ case class ScanBuilderHolder( var pushedPredicates: Seq[Predicate] = Seq.empty[Predicate] var pushedAggregate: Option[Aggregation] = None + + var pushedAggregateExpectedOutputMap: Map[AttributeReference, Expression] = Review Comment: Thank you for the reminder. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on pull request #37332: [SPARK-39914][SQL] Add DS V2 Filter to V1 Filter conversion
huaxingao commented on PR #37332: URL: https://github.com/apache/spark/pull/37332#issuecomment-1198736391 @cloud-fan Could you please take a look when you have time? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #37326: [SPARK-39906][INFRA] Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'
HyukjinKwon closed pull request #37326: [SPARK-39906][INFRA] Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead' URL: https://github.com/apache/spark/pull/37326 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37328: [SPARK-39907][PS] Implement axis and skipna of Series.argmin
HyukjinKwon commented on PR #37328: URL: https://github.com/apache/spark/pull/37328#issuecomment-1198808336 cc @itholic @xinrong-meng @ueshin FYI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun commented on pull request #37305: [SPARK-39881][PYTHON] Fix erroneous check for black and reenable black validation.
Yikun commented on PR #37305: URL: https://github.com/apache/spark/pull/37305#issuecomment-1198817543 and CI failed due to `[Run / Scala 2.13 build with SBT](https://github.com/grundprinzip/spark/runs/7546678501?check_suite_focus=true)` git clone networking issue, I think we can pass it by re-triggering. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false
sadikovi commented on PR #37327: URL: https://github.com/apache/spark/pull/37327#issuecomment-1198896103 Yes, that was my thinking too. Okay, I will make a few changes to the PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun commented on a diff in pull request #37328: [SPARK-39907][PS] Implement axis and skipna of Series.argmin
Yikun commented on code in PR #37328: URL: https://github.com/apache/spark/pull/37328#discussion_r932814726 ## python/pyspark/pandas/series.py: ## @@ -6322,13 +6322,21 @@ def argmax(self, axis: Axis = None, skipna: bool = True) -> int: # If the maximum is achieved in multiple locations, the first row position is returned. return -1 if max_value[0] is None else max_value[1] -def argmin(self) -> int: +def argmin(self, axis: Axis = None, skipna: bool = True) -> int: """ Return int position of the smallest value in the Series. If the minimum is achieved in multiple locations, the first row position is returned. +Parameters +-- +axis : {{None}} +Dummy argument for consistency with Series. +skipna : bool, default True +Exclude NA/null values. If the entire Series is NA, the result Review Comment: > If the entire Series is NA, the result It seems a plus doc for pandas (1.4.3), do we want to add a test for this? [1] https://pandas.pydata.org/docs/reference/api/pandas.Series.argmin.html -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #36253: [SPARK-38932][SQL] Datasource v2 support report distinct keys
ulysses-you commented on PR #36253: URL: https://github.com/apache/spark/pull/36253#issuecomment-1198822779 cc @cloud-fan @huaxingao if you have time to take a look, thank you -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] c21 commented on a diff in pull request #37290: [SPARK-37194][SQL] Avoid unnecessary sort in v1 write if it's not dynamic partition
c21 commented on code in PR #37290: URL: https://github.com/apache/spark/pull/37290#discussion_r932846383 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala: ## @@ -117,20 +117,26 @@ object V1WritesUtils { outputColumns: Seq[Attribute], partitionColumns: Seq[Attribute], bucketSpec: Option[BucketSpec], - options: Map[String, String]): Seq[SortOrder] = { + options: Map[String, String], + numStaticPartitions: Int = 0): Seq[SortOrder] = { +assert(partitionColumns.size >= numStaticPartitions) Review Comment: ditto. ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala: ## @@ -107,8 +108,10 @@ object FileFormatWriter extends Logging { partitionColumns: Seq[Attribute], bucketSpec: Option[BucketSpec], statsTrackers: Seq[WriteJobStatsTracker], - options: Map[String, String]) + options: Map[String, String], + numStaticPartitions: Int = 0) : Set[String] = { +assert(partitionColumns.size >= numStaticPartitions) Review Comment: nit: would `require()` be better? ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/V1WriteCommandSuite.scala: ## @@ -214,4 +216,34 @@ class V1WriteCommandSuite extends V1WriteCommandSuiteBase with SharedSparkSessio } } } + + test("SPARK-37194: Avoid unnecessary sort in v1 write if it's not dynamic partition") { +withPlannedWrite { enabled => + withTable("t") { +sql( + """ +|CREATE TABLE t(key INT, value STRING) USING PARQUET +|PARTITIONED BY (p1 INT, p2 STRING) +|""".stripMargin) + +// partition columns are static +executeAndCheckOrdering(hasLogicalSort = false, orderingMatched = true) { + sql( +""" + |INSERT INTO t PARTITION(p1=1, p2='a') + |SELECT key, value FROM testData + |""".stripMargin) +} + +// one static partition column and one dynamic partition column +executeAndCheckOrdering(hasLogicalSort = enabled, orderingMatched = enabled) { + sql( +""" + |INSERT INTO t PARTITION(p1=1, p2) + |SELECT key, value, value FROM testData + |""".stripMargin) +} + } +} Review Comment: would it be good to have one more unit test for no static columns? ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala: ## @@ -83,6 +83,7 @@ object FileFormatWriter extends Logging { */ private[sql] var outputOrderingMatched: Boolean = false + // scalastyle:off argcount Review Comment: nit: we can pass in a wrapper class `PartitionSpec(partitionColumns: Seq[Attribute], numStaticPartitions: Int)` to avoid this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false
sadikovi commented on PR #37327: URL: https://github.com/apache/spark/pull/37327#issuecomment-1198856750 Should we keep requirement that `inferDate = true` needs `inferSchema = true`? I think it is unclear right now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] c21 commented on a diff in pull request #37264: [SPARK-39849][SQL] Dataset.as(StructType) fills missing new columns with null value
c21 commented on code in PR #37264: URL: https://github.com/apache/spark/pull/37264#discussion_r932857471 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameAsSchemaSuite.scala: ## @@ -46,15 +46,11 @@ class DataFrameAsSchemaSuite extends QueryTest with SharedSparkSession { checkAnswer(df, Row("b")) } - test("negative: column not found") { Review Comment: @cloud-fan - sure, updated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] c21 commented on pull request #37264: [SPARK-39849][SQL] Dataset.as(StructType) fills missing new columns with null value
c21 commented on PR #37264: URL: https://github.com/apache/spark/pull/37264#issuecomment-1198868034 The PR is ready for review again, thanks @cloud-fan. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a diff in pull request #37290: [SPARK-37194][SQL] Avoid unnecessary sort in v1 write if it's not dynamic partition
viirya commented on code in PR #37290: URL: https://github.com/apache/spark/pull/37290#discussion_r932864145 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala: ## @@ -107,8 +108,10 @@ object FileFormatWriter extends Logging { partitionColumns: Seq[Attribute], bucketSpec: Option[BucketSpec], statsTrackers: Seq[WriteJobStatsTracker], - options: Map[String, String]) + options: Map[String, String], + numStaticPartitions: Int = 0) Review Comment: numStaticPartitionCols? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk closed pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead
MaxGekk closed pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead URL: https://github.com/apache/spark/pull/37322 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead
MaxGekk commented on PR #37322: URL: https://github.com/apache/spark/pull/37322#issuecomment-1198880973 Merging to master. Thank you, @gengliangwang for review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
dongjoon-hyun commented on code in PR #37287: URL: https://github.com/apache/spark/pull/37287#discussion_r932517832 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala: ## @@ -33,36 +33,37 @@ import org.apache.spark.storage.StorageLevel abstract class Catalog { /** - * Returns the current default database in this session. + * Returns the current database/namespace in this session. Review Comment: If you don't mind, could you avoid to use `/` here? You can use `or` literally. Otherwise, `/` could be read as another multi-layer. :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl
dongjoon-hyun commented on code in PR #37287: URL: https://github.com/apache/spark/pull/37287#discussion_r932517832 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala: ## @@ -33,36 +33,37 @@ import org.apache.spark.storage.StorageLevel abstract class Catalog { /** - * Returns the current default database in this session. + * Returns the current database/namespace in this session. Review Comment: If you don't mind, could you avoid to use `/` here? You can use `or` literally. Otherwise, `/` could be read as another multi-layer unlike `table/view` case. We are not confused at `table/view`, but this new sentence looks a little confusing to me at least :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on a diff in pull request #35391: [SPARK-38098][PYTHON] Add support for ArrayType of nested StructType to arrow-based conversion
ueshin commented on code in PR #35391: URL: https://github.com/apache/spark/pull/35391#discussion_r932566706 ## python/pyspark/sql/tests/test_dataframe.py: ## @@ -953,6 +953,30 @@ def test_to_pandas_from_mixed_dataframe(self): pdf_with_only_nulls = self.spark.sql(sql).filter("tinyint is null").toPandas() self.assertTrue(np.all(pdf_with_only_nulls.dtypes == pdf_with_some_nulls.dtypes)) +@unittest.skipIf( +not have_pandas or not have_pyarrow, +cast(str, pandas_requirement_message or pyarrow_requirement_message), +) +def test_to_pandas_for_array_of_struct(self): +# SPARK-38098: Support Array of Struct for Pandas UDFs and toPandas +import numpy as np +import pandas as pd + +df = self.spark.createDataFrame( +[[[("a", 2, 3.0), ("a", 2, 3.0)]], [[("b", 5, 6.0), ("b", 5, 6.0)]]], +"array_struct_col Array>", +) +is_arrow_enabled = [True, False] +for value in is_arrow_enabled: Review Comment: nit: ```py for is_arrow_enabled in [True, False]: ``` ## python/pyspark/sql/tests/test_pandas_udf_scalar.py: ## @@ -134,6 +134,30 @@ def test_pandas_udf_nested_arrays(self): result = df.select(tokenize("vals").alias("hi")) self.assertEqual([Row(hi=[["hi", "boo"]]), Row(hi=[["bye", "boo"]])], result.collect()) +def test_pandas_array_struct(self): +# SPARK-38098: Support Array of Struct for Pandas UDFs and toPandas +# import numpy as np + +@pandas_udf("Array>") +def return_cols(cols): +# self.assertEqual(type(cols), pd.Series) +# self.assertEqual(type(cols[0]), np.ndarray) +# self.assertEqual(type(cols[0][0]), dict) Review Comment: I guess we can't use `self` in the udf. Shall we follow the other tests to use builtin `assert` instead: https://github.com/apache/spark/blob/f8b3d5322e6cbce2e42a6940518686b7255e79cb/python/pyspark/sql/tests/test_pandas_udf_scalar.py#L1206-L1215 Also `import numpy as np` might need to be in the udf. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] santosh-d3vpl3x closed pull request #37335: SPARK-39895 pyspark support multiple column drop
santosh-d3vpl3x closed pull request #37335: SPARK-39895 pyspark support multiple column drop URL: https://github.com/apache/spark/pull/37335 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cabral1888 commented on a diff in pull request #37230: [SPARK-33326][SQL] Fix the problem of writing hive partition table without updating metadata information
cabral1888 commented on code in PR #37230: URL: https://github.com/apache/spark/pull/37230#discussion_r932418981 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala: ## @@ -1611,4 +1611,26 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto } } } + + test("SPARK-33326: partition metadata auto update for dynamic partitions") { +val table = "partition_metadata_dynamic_partition" +Seq("hive", "parquet").foreach { source => + withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> "true") { +withTable(table) { + sql(s"CREATE TABLE $table (id INT, sp INT, dp INT) USING $source PARTITIONED BY (sp, dp)") + sql(s"INSERT INTO $table PARTITION (sp=0, dp) VALUES (0, 0)") + sql(s"INSERT OVERWRITE TABLE $table PARTITION (sp=0, dp) SELECT id, id FROM range(5)") + + for (i <- 0 until 5) { +val partition = spark.sessionState.catalog + .getPartition(TableIdentifier(table), Map("sp" -> s"0", "dp" -> s"$i")) +val numFiles = partition.parameters("numFiles") +assert(numFiles.nonEmpty && numFiles.toInt > 0) +val totalSize = partition.parameters("totalSize") Review Comment: I see that you included the verification only for `totalSize` and `numFiles`, but what about `rawDataSize`? Does it make sense to be verified? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead
MaxGekk commented on code in PR #37322: URL: https://github.com/apache/spark/pull/37322#discussion_r932495675 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetUnpivotSuite.scala: ## @@ -305,14 +305,17 @@ class DatasetUnpivotSuite extends QueryTest valueColumnName = "val" ) } -checkErrorClass( +checkError( exception = e, errorClass = "UNPIVOT_VALUE_DATA_TYPE_MISMATCH", - msg = "Unpivot value columns must share a least common type, some types do not: \\[" + -"\"STRING\" \\(`str1#\\d+`\\), " + -"\"INT\" \\(`int1#\\d+`, `int2#\\d+`, `int3#\\d+`, ...\\), " + -"\"BIGINT\" \\(`long1#\\d+L`, `long2#\\d+L`\\)\\];(\n.*)*", - matchMsg = true) + errorSubClass = None, + sqlState = None, Review Comment: It has but if I remove settings of the parameters, I am getting the errors: ``` overloaded method value checkError with alternatives: (exception: org.apache.spark.SparkThrowable,errorClass: String,parameters: Map[String,String])Unit (exception: org.apache.spark.SparkThrowable,errorClass: String,sqlState: String,parameters: Map[String,String])Unit (exception: org.apache.spark.SparkThrowable,errorClass: String,errorSubClass: String,sqlState: String,parameters: Map[String,String])Unit (exception: org.apache.spark.SparkThrowable,errorClass: String,errorSubClass: Option[String],sqlState: Option[String],parameters: Map[String,String],matchPVals: Boolean)Unit cannot be applied to (exception: org.apache.spark.sql.AnalysisException, errorClass: String, parameters: scala.collection.immutable.Map[String,String], matchPVals: Boolean) checkError( ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Jonathancui123 commented on a diff in pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false
Jonathancui123 commented on code in PR #37327: URL: https://github.com/apache/spark/pull/37327#discussion_r932486986 ## docs/sql-data-sources-csv.md: ## @@ -109,7 +109,7 @@ Data source options of CSV can be set via: read -inferDate +preferDate false Whether or not to infer columns that satisfy the dateFormat option as Date. Requires inferSchema to be true. When false, columns with dates will be inferred as String (or as Timestamp if it fits the timestampFormat). Review Comment: ```suggestion preferDate false Whether or not to infer columns that satisfy the dateFormat option as Date. Inference requires inferSchema to be true. When false, columns with dates will be inferred as String (or as Timestamp if it fits the timestampFormat). When true, the parser will attempt entries in timestamp columns as DateType before other fallbacks ``` What do you think of the following docs change to clarify the behavior? @cloud-fan @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Jonathancui123 commented on a diff in pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false
Jonathancui123 commented on code in PR #37327: URL: https://github.com/apache/spark/pull/37327#discussion_r932486986 ## docs/sql-data-sources-csv.md: ## @@ -109,7 +109,7 @@ Data source options of CSV can be set via: read -inferDate +preferDate false Whether or not to infer columns that satisfy the dateFormat option as Date. Requires inferSchema to be true. When false, columns with dates will be inferred as String (or as Timestamp if it fits the timestampFormat). Review Comment: ```suggestion preferDate false Whether or not to infer columns that satisfy the dateFormat option as Date. Inference requires inferSchema to be true. When false, columns with dates will be inferred as String (or as Timestamp if it fits the timestampFormat). When true, the parser will attempt to parse entries in timestamp columns as DateType before other fallbacks or failing. ``` What do you think of the following docs change to clarify the behavior? @cloud-fan @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead
MaxGekk commented on code in PR #37322: URL: https://github.com/apache/spark/pull/37322#discussion_r932506853 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetUnpivotSuite.scala: ## @@ -305,14 +305,17 @@ class DatasetUnpivotSuite extends QueryTest valueColumnName = "val" ) } -checkErrorClass( +checkError( exception = e, errorClass = "UNPIVOT_VALUE_DATA_TYPE_MISMATCH", - msg = "Unpivot value columns must share a least common type, some types do not: \\[" + -"\"STRING\" \\(`str1#\\d+`\\), " + -"\"INT\" \\(`int1#\\d+`, `int2#\\d+`, `int3#\\d+`, ...\\), " + -"\"BIGINT\" \\(`long1#\\d+L`, `long2#\\d+L`\\)\\];(\n.*)*", - matchMsg = true) + errorSubClass = None, + sqlState = None, Review Comment: but I think it is better to set default values in the most wide method. I will remove one overloaded method and add default values. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] otterc commented on pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
otterc commented on PR #35906: URL: https://github.com/apache/spark/pull/35906#issuecomment-1198491146 > Should be easy to add. We can have a feature flag, and when initiate the RemoteBlockPushResolver, db can be set to null if this feature flag is turned off, and all the later DB operations won't be triggered. @mridulm @otterc What do you think? I don't see this as a feature. For original shuffle, we support work preserving restart. This completes the same support for push-based shuffle. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sunchao commented on a diff in pull request #36995: [SPARK-39607][SQL][DSV2] Distribution and ordering support V2 function in writing
sunchao commented on code in PR #36995: URL: https://github.com/apache/spark/pull/36995#discussion_r932515356 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala: ## @@ -17,22 +17,33 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.catalyst.expressions.Expression +import org.apache.spark.sql.catalyst.analysis.{AnsiTypeCoercion, TypeCoercion} +import org.apache.spark.sql.catalyst.expressions.{Expression, Literal, SortOrder, TransformExpression, V2ExpressionUtils} import org.apache.spark.sql.catalyst.expressions.V2ExpressionUtils._ import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, RebalancePartitions, RepartitionByExpression, Sort} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.connector.catalog.FunctionCatalog +import org.apache.spark.sql.connector.catalog.functions.ScalarFunction import org.apache.spark.sql.connector.distributions._ import org.apache.spark.sql.connector.write.{RequiresDistributionAndOrdering, Write} import org.apache.spark.sql.errors.QueryCompilationErrors object DistributionAndOrderingUtils { - def prepareQuery(write: Write, query: LogicalPlan): LogicalPlan = write match { + def prepareQuery( Review Comment: Hmm I wonder how does the write work with transforms such as bucket. For example, suppose the required distribution is `bucket(col, 100)`, Spark currently will compute the partition (bucket) ID by `murmur_hash(bucket(col, 100)) pmod 100`, so the value of `col` is essentially hashed twice. I'm not sure whether this breaks any assumption from the V2 data source side, or whether it has any effect in the hash key distributions. ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala: ## @@ -17,22 +17,33 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.catalyst.expressions.Expression +import org.apache.spark.sql.catalyst.analysis.{AnsiTypeCoercion, TypeCoercion} +import org.apache.spark.sql.catalyst.expressions.{Expression, Literal, SortOrder, TransformExpression, V2ExpressionUtils} import org.apache.spark.sql.catalyst.expressions.V2ExpressionUtils._ import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, RebalancePartitions, RepartitionByExpression, Sort} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.connector.catalog.FunctionCatalog +import org.apache.spark.sql.connector.catalog.functions.ScalarFunction import org.apache.spark.sql.connector.distributions._ import org.apache.spark.sql.connector.write.{RequiresDistributionAndOrdering, Write} import org.apache.spark.sql.errors.QueryCompilationErrors object DistributionAndOrderingUtils { - def prepareQuery(write: Write, query: LogicalPlan): LogicalPlan = write match { + def prepareQuery( + write: Write, + query: LogicalPlan, + funCatalogOpt: Option[FunctionCatalog]): LogicalPlan = write match { case write: RequiresDistributionAndOrdering => val numPartitions = write.requiredNumPartitions() val distribution = write.requiredDistribution match { -case d: OrderedDistribution => toCatalystOrdering(d.ordering(), query) -case d: ClusteredDistribution => d.clustering.map(e => toCatalyst(e, query)).toSeq +case d: OrderedDistribution => + toCatalystOrdering(d.ordering(), query, funCatalogOpt) +.map(ur => resolveTransformExpression(ur).asInstanceOf[SortOrder]) Review Comment: nit: why the variable is named `ur`? maybe change it to `e`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] EnricoMi commented on pull request #37304: [SPARK-39877][PySpark] Add unpivot to PySpark DataFrame API
EnricoMi commented on PR #37304: URL: https://github.com/apache/spark/pull/37304#issuecomment-1198506801 > btw, you may also need to run `dev/reformat-python` Why do I have to reformat `python/pyspark/context.py`? That seems unrelated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] santosh-d3vpl3x closed pull request #37333: SPARK-39895 pyspark support multiple column drop
santosh-d3vpl3x closed pull request #37333: SPARK-39895 pyspark support multiple column drop URL: https://github.com/apache/spark/pull/37333 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on a diff in pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false
sadikovi commented on code in PR #37327: URL: https://github.com/apache/spark/pull/37327#discussion_r932712356 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -153,19 +153,24 @@ class CSVOptions( * Disabled by default for backwards compatibility and performance. When enabled, date entries in * timestamp columns will be cast to timestamp upon parsing. Not compatible with * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters + * + * The flag is only enabled if inferSchema is set to true. */ - val inferDate = { -val inferDateFlag = getBool("inferDate") -if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) { + val preferDate = { +val preferDateFlag = getBool("preferDate") +if (preferDateFlag && SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) { throw QueryExecutionErrors.inferDateWithLegacyTimeParserError() } -inferDateFlag +if (preferDateFlag && !inferSchemaFlag) { Review Comment: Okay, I can do that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang closed pull request #37311: [SPARK-39865][SQL][3.3] Show proper error messages on the overflow errors of table insert
gengliangwang closed pull request #37311: [SPARK-39865][SQL][3.3] Show proper error messages on the overflow errors of table insert URL: https://github.com/apache/spark/pull/37311 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] peter-toth opened a new pull request, #37334: [SPARK-39887][SQL] RemoveRedundantAliases should keep attributes of a Union's first child
peter-toth opened a new pull request, #37334: URL: https://github.com/apache/spark/pull/37334 ### What changes were proposed in this pull request? Keep the output attributes of a `Union` node's first child in the `RemoveRedundantAliases` rule to avoid correctness issues. ### Why are the changes needed? To fix the result of the following query: ``` SELECT a, b AS a FROM ( SELECT a, a AS b FROM (SELECT a FROM VALUES (1) AS t(a)) UNION ALL SELECT a, b FROM (SELECT a, b FROM VALUES (1, 2) AS t(a, b)) ) ``` Before this PR the query returns the incorrect result: ``` +---+---+ | a| a| +---+---+ | 1| 1| | 2| 2| +---+---+ ``` After this PR it returns the expected result: ``` +---+---+ | a| a| +---+---+ | 1| 1| | 1| 2| +---+---+ ``` ### Does this PR introduce _any_ user-facing change? Yes, fixes a correctness issue. ### How was this patch tested? Added new UTs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] peter-toth commented on pull request #37319: [SPARK-39887][SQL] `PullOutGroupingExpressions` should generate different alias names
peter-toth commented on PR #37319: URL: https://github.com/apache/spark/pull/37319#issuecomment-1198525757 I've opened a PR with my proposal here: https://github.com/apache/spark/pull/37334 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] santosh-d3vpl3x opened a new pull request, #37335: SPARK-39895 pyspark support multiple column drop
santosh-d3vpl3x opened a new pull request, #37335: URL: https://github.com/apache/spark/pull/37335 * SPARK-39895 pyspark support multiple column drop ### What changes were proposed in this pull request? Fixes issues related type confirmation in pyspark api ### Why are the changes needed? We expect that multiple columns can be handled by drop call on df because of its typing but that is not the case. ### Does this PR introduce _any_ user-facing change? Yes, fixes issues related type confirmation in pyspark api ### How was this patch tested? CI Pipeline on fork and CI here -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on pull request #37280: [SPARK-39862][SQL] Fix bug in existence DEFAULT value lookups for V2 data sources
gengliangwang commented on PR #37280: URL: https://github.com/apache/spark/pull/37280#issuecomment-1198601705 @dtenedor could you also update the PR description about the ORC fix? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #37290: [SPARK-37194][SQL] Avoid unnecessary sort in v1 write if it's not dynamic partition
ulysses-you commented on PR #37290: URL: https://github.com/apache/spark/pull/37290#issuecomment-1197941930 cc @viirya @cloud-fan @c21 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cfmcgrady commented on pull request #37319: [SPARK-39887][SQL] `PullOutGroupingExpressions` should generate different alias names
cfmcgrady commented on PR #37319: URL: https://github.com/apache/spark/pull/37319#issuecomment-1197968855 hi, @peter-toth thank you for your feedback. While these changes of `RemoveRedundantAliases` solve this issue, they break the guarantee of `alias removal should not break after push project through union`. https://github.com/apache/spark/blob/0f9c1a2e848fbdfb17af0555d0c8be7d5a7191bb/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAliasAndProjectSuite.scala#L96-L104 ``` == FAIL: Plans do not match === Union false, false Union false, false !:- Project [a#0 AS a#0] :- LocalRelation , [a#0] !: +- LocalRelation , [a#0] +- LocalRelation , [b#0] !+- LocalRelation , [b#0] ScalaTestFailureLocation: org.apache.spark.sql.catalyst.plans.PlanTestBase at (PlanTest.scala:177) org.scalatest.exceptions.TestFailedException: == FAIL: Plans do not match === Union false, false Union false, false !:- Project [a#0 AS a#0] :- LocalRelation , [a#0] !: +- LocalRelation , [a#0] +- LocalRelation , [b#0] !+- LocalRelation , [b#0] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org