[GitHub] [spark] EnricoMi commented on a diff in pull request #37304: [SPARK-39877][PySpark] Add unpivot to PySpark DataFrame API

2022-07-28 Thread GitBox


EnricoMi commented on code in PR #37304:
URL: https://github.com/apache/spark/pull/37304#discussion_r932044028


##
python/pyspark/sql/dataframe.py:
##
@@ -2188,6 +2188,142 @@ def cube(self, *cols: "ColumnOrName") -> "GroupedData":  # type: ignore[misc]
 
         return GroupedData(jgd, self)
 
+    def unpivot(
+        self,
+        ids: Optional[Union["ColumnOrName", List["ColumnOrName"], Tuple["ColumnOrName", ...]]] = None,
+        values: Optional[Union["ColumnOrName", List["ColumnOrName"], Tuple["ColumnOrName", ...]]] = None,
+        variableColumnName: Optional[str] = None,
+        valueColumnName: Optional[str] = None,
+    ) -> "DataFrame":
+        """
+        Unpivot a DataFrame from wide format to long format, optionally leaving
+        identifier columns set. This is the reverse to `groupBy(...).pivot(...).agg(...)`,
+        except for the aggregation, which cannot be reversed.
+
+        This function is useful to massage a DataFrame into a format where some
+        columns are identifier columns ("ids"), while all other columns ("values")
+        are "unpivoted" to the rows, leaving just two non-id columns, named as given
+        by `variableColumnName` and `valueColumnName`.
+
+        When no "id" columns are given, the unpivoted DataFrame consists of only the
+        "variable" and "value" columns.
+
+        All "value" columns must share a least common data type. Unless they are the same
+        data type, all "value" columns are cast to the nearest common data type. For instance,
+        types `IntegerType` and `LongType` are cast to `LongType`, while `IntegerType` and
+        `StringType` do not have a common data type and `unpivot` fails.
+
+        :func:`groupby` is an alias for :func:`groupBy`.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        ids : str, Column, tuple, list, optional
+            Column(s) to use as identifiers. Can be a single column or column name,
+            or a list for multiple columns.
+        values : str, Column, tuple, list, optional
+            Column(s) to unpivot. Can be a single column or column name, or a list
+            for multiple columns. If not specified or empty, uses all columns that
+            are not set as `ids`.
+        variableColumnName : scalar, default 'variable'
+            Name of the variable column. If None it uses 'variable'.
+        valueColumnName : scalar, default 'value'
+            Name of the value column. If None it uses 'value'.
+
+        Returns
+        -------
+        DataFrame
+            Unpivoted DataFrame.
+
+        Examples
+        --------
+        >>> df = spark.createDataFrame(
+        ...     [(1, 11, 1.1), (2, 12, 1.2)],
+        ...     ["id", "int", "double"],
+        ... )
+        >>> df.show()
+        +---+---+------+
+        | id|int|double|
+        +---+---+------+
+        |  1| 11|   1.1|
+        |  2| 12|   1.2|
+        +---+---+------+
+
+        >>> df.unpivot("id").show()
+        +---+--------+-----+
+        | id|variable|value|
+        +---+--------+-----+
+        |  1|     int| 11.0|
+        |  1|  double|  1.1|
+        |  2|     int| 12.0|
+        |  2|  double|  1.2|
+        +---+--------+-----+
+        """
+        def to_jcols(cols) -> JavaObject:
+            l = cols
+            if cols is None:
+                l = []
+            elif isinstance(cols, tuple):
+                l = list(cols)
+            elif not isinstance(cols, list):
+                l = [cols]
+            return self._jcols(*l)
+
+        if variableColumnName is None:
+            variableColumnName = "variable"

Review Comment:
   removed
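   For context, a hedged usage sketch of the API added in this diff — it follows the signature and docstring quoted above, with made-up column names; it is an illustration, not part of the PR:
   ```python
   # Explicit ids/values and custom output column names, per the docstring above.
   df = spark.createDataFrame(
       [(1, 11, 1.1), (2, 12, 1.2)],
       ["id", "int", "double"],
   )

   # "int" and "double" share "double" as their least common type, so the
   # unpivoted "val" column comes out as DoubleType.
   long_df = df.unpivot(
       ids="id",
       values=["int", "double"],
       variableColumnName="var",
       valueColumnName="val",
   )
   # long_df has columns: id, var, val  (one row per id/value-column pair)
   ```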



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun commented on a diff in pull request #36996: [SPARK-34305][SQL] Unify v1 and v2 ALTER TABLE .. SET SERDE tests

2022-07-28 Thread GitBox


panbingkun commented on code in PR #36996:
URL: https://github.com/apache/spark/pull/36996#discussion_r932085322


##
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableSetSerdeSuite.scala:
##
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command.v1
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.execution.command
+import org.apache.spark.sql.internal.StaticSQLConf.CATALOG_IMPLEMENTATION
+
+/**
+ * This base suite contains unified tests for the `ALTER TABLE .. SET 
[SERDE|SERDEPROPERTIES]`
+ * command that check V1 table catalogs. The tests that cannot run for all V1 
catalogs
+ * are located in more specific test suites:
+ *
+ *   - V1 In-Memory catalog: 
`org.apache.spark.sql.execution.command.v1.AlterTableSetSerdeSuite`
+ *   - V1 Hive External catalog:
+ * `org.apache.spark.sql.hive.execution.command.AlterTableSetSerdeSuite`
+ */
+trait AlterTableSetSerdeSuiteBase extends command.AlterTableSetSerdeSuiteBase {
+
+  protected val isDatasourceTable = true
+
+  private def isUsingHiveMetastore: Boolean = {
+spark.sparkContext.conf.get(CATALOG_IMPLEMENTATION) == "hive"
+  }
+
+  private def normalizeSerdeProp(props: Map[String, String]): Map[String, 
String] = {
+props.filterNot(p => Seq("serialization.format", "path").contains(p._1))
+  }
+
+  private def maybeWrapException[T](expectException: Boolean)(body: => T): 
Unit = {
+if (expectException) intercept[AnalysisException] { body } else body
+  }
+
+  protected def testSetSerde(): Unit = {
+withNamespaceAndTable("ns", "tbl") { t =>
+  if (!isUsingHiveMetastore) {
+assert(isDatasourceTable, "InMemoryCatalog only supports data source 
tables")
+  }
+  sql(s"CREATE TABLE $t (col1 int, col2 string, a int, b int) 
$defaultUsing " +
+s"PARTITIONED BY (a, b)")
+
+  val catalog = spark.sessionState.catalog
+  val tableIdent = TableIdentifier("tbl", Some("ns"))
+  def checkSerdeProps(expectedSerdeProps: Map[String, String]): Unit = {
+val serdeProp = catalog.getTableMetadata(tableIdent).storage.properties
+if (isUsingHiveMetastore) {

Review Comment:
   OK, I will do it.






[GitHub] [spark] physinet opened a new pull request, #37329: [SPARK-39832][PYTHON] Support column arguments in regexp_replace

2022-07-28 Thread GitBox


physinet opened a new pull request, #37329:
URL: https://github.com/apache/spark/pull/37329

   ### What changes were proposed in this pull request?
   Support either literal Python strings or Column objects for the pattern and 
replacement arguments for `regexp_replace`. 
   
   ### Why are the changes needed?
   Allows using different replacements per row, as in [this 
example](https://stackoverflow.com/questions/64613761/in-pyspark-using-regexp-replace-how-to-replace-a-group-with-value-from-another,).
   
   
   ### Does this PR introduce _any_ user-facing change?
   Users can now use `regexp_replace` with columns for all three arguments. The 
first argument (string that regex should be applied to) can be either a Column 
object or the string name of the column. In summary, the following signatures 
are supported:
   ```python
   regexp_replace("str", "\d", "")
   regexp_replace(F.col("str"), "\d", "")
   regexp_replace("str", F.col("pattern"), F.col("replacement"))
   regexp_replace(F.col("str"), F.col("pattern"), F.col("replacement"))
   ```
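   For illustration, a hedged sketch of the per-row replacement this enables; the DataFrame and its column names ("str", "pattern", "replacement") are made up:
   ```python
   from pyspark.sql import functions as F

   df = spark.createDataFrame(
       [("abc123", "\\d+", "#"), ("x9y8z7", "[a-z]", "_")],
       ["str", "pattern", "replacement"],
   )

   result = df.select(
       "str",
       F.regexp_replace("str", F.col("pattern"), F.col("replacement")).alias("replaced"),
   )
   # Row 1 replaces digits with "#" -> "abc#"; row 2 replaces letters with "_" -> "_9_8_7".
   ```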
   
   ### How was this patch tested?
   Added unit tests
   





[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-07-28 Thread GitBox


cloud-fan commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r932183215


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala:
##
@@ -410,12 +413,24 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] 
with PredicateHelper {
 case s @ Sort(order, _, operation @ ScanOperation(project, filter, 
sHolder: ScanBuilderHolder))
 // Without building the Scan, we do not know the resulting column 
names after aggregate
 // push-down, and thus can't push down Top-N which needs to know the 
ordering column names.
-// TODO: we can support simple cases like GROUP BY columns directly 
and ORDER BY the same
-//   columns, which we know the resulting column names: the 
original table columns.
-if sHolder.pushedAggregate.isEmpty && filter.isEmpty &&
+// In particular, we push down the simple cases like GROUP BY 
expressions directly and
+// ORDER BY the same expressions, which we know the original table 
columns.
+if filter.isEmpty &&
   CollapseProject.canCollapseExpressions(order, project, alwaysInline 
= true) =>
   val aliasMap = getAliasMap(project)
-  val newOrder = order.map(replaceAlias(_, 
aliasMap)).asInstanceOf[Seq[SortOrder]]
+  def findGroupExprForSortOrder(sortOrder: SortOrder): SortOrder = 
sortOrder match {

Review Comment:
   I think we can just do something similar with `replaceAlias`
   ```
   sortOrder transform {
 case a: Attribute => sHolder.pushedAggOutputMap.getOrElse(a, a)
   }
   ```






[GitHub] [spark] cloud-fan closed pull request #36918: [SQL][SPARK-39528] Use V2 Filter in SupportsRuntimeFiltering

2022-07-28 Thread GitBox


cloud-fan closed pull request #36918: [SQL][SPARK-39528] Use V2 Filter in 
SupportsRuntimeFiltering
URL: https://github.com/apache/spark/pull/36918





[GitHub] [spark] peter-toth commented on pull request #37319: [SPARK-39887][SQL] `PullOutGroupingExpressions` should generate different alias names

2022-07-28 Thread GitBox


peter-toth commented on PR #37319:
URL: https://github.com/apache/spark/pull/37319#issuecomment-1197915164

   So, I was thinking about adding
   ```
 case _: Union =>
   var first = true
   plan.mapChildren { child =>
 if (first) {
   first = false
   removeRedundantAliases(child, excluded ++ child.outputSet)
 } else {
   removeRedundantAliases(child, excluded)
 }
   }
   ```
   to `RemoveRedundantAliases` 
(https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L561)





[GitHub] [spark] wayneguow commented on pull request #36775: [SPARK-39389]Filesystem closed should not be considered as corrupt files

2022-07-28 Thread GitBox


wayneguow commented on PR #36775:
URL: https://github.com/apache/spark/pull/36775#issuecomment-1198234993

   IMO, it would be better if users could configure which exceptions may be treated as corrupt files and ignored.
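   For example (a sketch: the boolean conf below exists today, while the per-exception conf name is purely hypothetical):
   ```python
   # Existing all-or-nothing behavior:
   spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

   # The suggestion, sketched with a hypothetical conf name (not an existing option):
   # let users pick which exception types may be swallowed as "corrupt file" errors.
   spark.conf.set(
       "spark.sql.files.ignoreCorruptFiles.exceptions",  # hypothetical
       "java.io.EOFException,org.apache.parquet.io.ParquetDecodingException",
   )
   ```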
   





[GitHub] [spark] beliefer commented on pull request #37317: [SPARK-39894][SQL] Combine the similar binary comparison in boolean expression.

2022-07-28 Thread GitBox


beliefer commented on PR #37317:
URL: https://github.com/apache/spark/pull/37317#issuecomment-1198064228

   ping @MaxGekk @gengliangwang @dongjoon-hyun cc @cloud-fan 





[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-07-28 Thread GitBox


cloud-fan commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r932182709


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala:
##
@@ -545,6 +560,9 @@ case class ScanBuilderHolder(
   var pushedPredicates: Seq[Predicate] = Seq.empty[Predicate]
 
   var pushedAggregate: Option[Aggregation] = None
+
+  var pushedAggregateExpectedOutputMap: Map[AttributeReference, Expression] =

Review Comment:
   ```suggestion
 var pushedAggOutputMap: Map[AttributeReference, Expression] =
   ```






[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-07-28 Thread GitBox


cloud-fan commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r932181859


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala:
##
@@ -545,6 +560,9 @@ case class ScanBuilderHolder(
   var pushedPredicates: Seq[Predicate] = Seq.empty[Predicate]
 
   var pushedAggregate: Option[Aggregation] = None
+
+  var pushedAggregateExpectedOutputMap: Map[AttributeReference, Expression] =

Review Comment:
   We can use `AttributeMap`






[GitHub] [spark] cloud-fan commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-07-28 Thread GitBox


cloud-fan commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r932186273


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala:
##
@@ -811,6 +800,244 @@ class JDBCV2Suite extends QueryTest with 
SharedSparkSession with ExplainSuiteHel
 checkAnswer(df2, Seq(Row(2, "david", 1.00)))
   }
 
+  test("scan with aggregate push-down and top N push-down") {
+val df1 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT").sum("SALARY")
+  .orderBy("DEPT")
+  .limit(1)
+checkSortRemoved(df1)
+checkLimitRemoved(df1)
+checkPushedInfo(df1,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df1, Seq(Row(1, 19000.00)))
+
+val df2 = sql(
+  """
+|SELECT dept AS my_dept, SUM(SALARY) FROM h2.test.employee
+|GROUP BY dept
+|ORDER BY my_dept
+|LIMIT 1
+|""".stripMargin)
+checkSortRemoved(df2)
+checkLimitRemoved(df2)
+checkPushedInfo(df2,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df2, Seq(Row(1, 19000.00)))
+
+val df3 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key").sum("SALARY")
+  .orderBy("key")
+  .limit(1)
+checkSortRemoved(df3)
+checkLimitRemoved(df3)
+checkPushedInfo(df3,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df3, Seq(Row(0, 44000.00)))
+
+val df4 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT", "IS_MANAGER").sum("SALARY")
+  .orderBy("DEPT", "IS_MANAGER")
+  .limit(1)
+checkSortRemoved(df4)
+checkLimitRemoved(df4)
+checkPushedInfo(df4,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT, IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] 
LIMIT 1")
+checkAnswer(df4, Seq(Row(1, false, 9000.00)))
+
+val df5 = sql(
+  """
+|SELECT dept AS my_dept, is_manager AS my_manager, SUM(SALARY) FROM 
h2.test.employee
+|GROUP BY dept, my_manager
+|ORDER BY my_dept, my_manager
+|LIMIT 1
+|""".stripMargin)
+checkSortRemoved(df5)
+checkLimitRemoved(df5)
+checkPushedInfo(df5,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT, IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] 
LIMIT 1")
+checkAnswer(df5, Seq(Row(1, false, 9000.00)))
+
+val df6 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY", $"IS_MANAGER",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key", "IS_MANAGER").sum("SALARY")
+  .orderBy("key", "IS_MANAGER")
+  .limit(1)
+checkSortRemoved(df6)
+checkLimitRemoved(df6)
+checkPushedInfo(df6,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END, " +
+"IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df6, Seq(Row(0.00, false, 12000.00)))
+
+val df7 = sql(
+  """
+|SELECT dept, SUM(SALARY) FROM h2.test.employee
+|GROUP BY dept
+|ORDER BY SUM(SALARY)
+|LIMIT 1
+|""".stripMargin)
+checkSortRemoved(df7, false)
+checkLimitRemoved(df7, false)
+checkPushedInfo(df7,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []")
+checkAnswer(df7, Seq(Row(6, 12000.00)))
+
+val df8 = sql(
+  """
+|SELECT dept, SUM(SALARY) AS total FROM h2.test.employee
+|GROUP BY dept
+|ORDER BY total
+|LIMIT 1
+|""".stripMargin)
+checkSortRemoved(df8, false)
+checkLimitRemoved(df8, false)
+checkPushedInfo(df8,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []")
+checkAnswer(df8, Seq(Row(6, 12000.00)))
+  }
+
+  

[GitHub] [spark] goutam-git commented on a diff in pull request #37065: [SPARK-38699][SQL] Use error classes in the execution errors of dictionary encoding

2022-07-28 Thread GitBox


goutam-git commented on code in PR #37065:
URL: https://github.com/apache/spark/pull/37065#discussion_r932196681


##
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/compressionSchemes.scala:
##
@@ -421,7 +421,7 @@ private[columnar] case object DictionaryEncoding extends 
CompressionScheme {
 
 override def compress(from: ByteBuffer, to: ByteBuffer): ByteBuffer = {
   if (overflow) {

Review Comment:
   @MaxGekk should I remove the assert and use the new method name with the error class instead of `useDictionaryEncodingWhenDictionaryOverflowError()`?






[GitHub] [spark] cloud-fan commented on pull request #37319: [SPARK-39887][SQL] `PullOutGroupingExpressions` should generate different alias names

2022-07-28 Thread GitBox


cloud-fan commented on PR #37319:
URL: https://github.com/apache/spark/pull/37319#issuecomment-1198119033

   `Union.output` is a long-standing issue (same for `Join.output`). It reuses 
the first child's output but apparently `Union` and its first child output 
different values. We have to carefully work around this issue in places like 
`FoldablePropagation`. Maybe we need to do the same in `RemoveRedundantAliases`.





[GitHub] [spark] ulysses-you opened a new pull request, #37330: [SPARK-39911][SQL] Optimize global Sort to RepartitionByExpression

2022-07-28 Thread GitBox


ulysses-you opened a new pull request, #37330:
URL: https://github.com/apache/spark/pull/37330

   
   
   ### What changes were proposed in this pull request?
   
   Optimize Global sort to RepartitionByExpression, for example:
   ```
   Sort local                    Sort local
     Sort global        =>        RepartitionByExpression
   ```
   
   ### Why are the changes needed?
   
   If a global sort sits below a local sort, the only meaningful thing it still provides is its distribution. So this PR optimizes such a global sort into a RepartitionByExpression to save a local sort.
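   A hedged DataFrame-level illustration of the plan shape this targets (column names are made up):
   ```python
   from pyspark.sql import functions as F

   df = spark.range(100).withColumn("key", F.col("id") % 10)

   # Sort(local) on top of Sort(global): the global sort only contributes its distribution.
   before = df.orderBy("key").sortWithinPartitions("id")

   # The proposed rewrite keeps just that distribution as a range repartition,
   # saving one sort:
   after = df.repartitionByRange("key").sortWithinPartitions("id")
   ```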
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, it only improves performance.
   
   ### How was this patch tested?
   
   add test





[GitHub] [spark] LuciferYang commented on a diff in pull request #37293: [SPARK-39872][SQL] Change to use `BytePackerForLong#unpack8Values` with Array input api in `VectorizedDeltaBinaryPackedReader`

2022-07-28 Thread GitBox


LuciferYang commented on code in PR #37293:
URL: https://github.com/apache/spark/pull/37293#discussion_r932335919


##
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaBinaryPackedReader.java:
##
@@ -300,7 +300,8 @@ private void unpackMiniBlock() throws IOException {
 bitWidths[currentMiniBlock]);
 for (int j = 0; j < miniBlockSizeInValues; j += 8) {

Review Comment:
   Testing the performance of the following changes:
   
   ```
  private void unpackMiniBlock() throws IOException {
    Arrays.fill(this.unpackedValuesBuffer, 0);
    BytePackerForLong packer = Packer.LITTLE_ENDIAN.newBytePackerForLong(
        bitWidths[currentMiniBlock]);
    int bitWidth = packer.getBitWidth();
    int times = miniBlockSizeInValues / 8;
    int total = times * bitWidth;
    ByteBuffer buffer = in.slice(total);
    byte[] array = buffer.array();
    int i = 0;
    int srcOffset = buffer.arrayOffset() + buffer.position();
    int targetOffSet = 0;
    while (i < times) {
      packer.unpack8Values(array, srcOffset, unpackedValuesBuffer, targetOffSet);
      i++;
      srcOffset += bitWidth;
      targetOffSet += 8;
    }
    remainingInMiniBlock = miniBlockSizeInValues;
    currentMiniBlock++;
  }
   ```
   
   






[GitHub] [spark] senthh commented on pull request #35785: [SPARK-38213][STREAMING] Adding KafkaSink Metrics feature

2022-07-28 Thread GitBox


senthh commented on PR #35785:
URL: https://github.com/apache/spark/pull/35785#issuecomment-1198036728

   @dongjoon-hyun @dgd-contributor @gaborgsomogyi @squito Could you kindly review this PR, please?





[GitHub] [spark] LuciferYang opened a new pull request, #37331: [SPARK-39913][BUILD] Upgrade to Arrow 9.0.0

2022-07-28 Thread GitBox


LuciferYang opened a new pull request, #37331:
URL: https://github.com/apache/spark/pull/37331

   ### What changes were proposed in this pull request?
   Testing with Arrow 9.0.0, will update here later
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] AngersZhuuuu commented on pull request #37162: [SPARK-38910][YARN] Clean spark staging before unregister

2022-07-28 Thread GitBox


AngersZhuuuu commented on PR #37162:
URL: https://github.com/apache/spark/pull/37162#issuecomment-1197950750

   ping @dongjoon-hyun The latest GA run failed, caused by:
   ```
   * DONE (miniUI)
   ERROR: dependency ‘pkgdown’ is not available for package ‘devtools’
   * removing ‘/usr/local/lib/R/site-library/devtools’
   
   The downloaded source packages are in
‘/tmp/RtmpTvMfJ6/downloaded_packages’
   Warning messages:
   1: In install.packages(c("devtools"), repos = "https://cloud.r-project.org/") :
     installation of package ‘systemfonts’ had non-zero exit status
   2: In install.packages(c("devtools"), repos = "https://cloud.r-project.org/") :
     installation of package ‘textshaping’ had non-zero exit status
   3: In install.packages(c("devtools"), repos = "https://cloud.r-project.org/") :
     installation of package ‘ragg’ had non-zero exit status
   4: In install.packages(c("devtools"), repos = "https://cloud.r-project.org/") :
     installation of package ‘pkgdown’ had non-zero exit status
   5: In install.packages(c("devtools"), repos = "https://cloud.r-project.org/") :
     installation of package ‘devtools’ had non-zero exit status
   Error in loadNamespace(x) : there is no package called ‘devtools’
   Calls: loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart
   Execution halted
   Error: Process completed with exit code 1.
   ```
   
   Any advice?





[GitHub] [spark] panbingkun commented on pull request #37314: [SPARK-39891][BUILD] Bump h2 to 2.1.214

2022-07-28 Thread GitBox


panbingkun commented on PR #37314:
URL: https://github.com/apache/spark/pull/37314#issuecomment-1197977785

   cc @dongjoon-hyun 





[GitHub] [spark] MaxGekk commented on a diff in pull request #36996: [SPARK-34305][SQL] Unify v1 and v2 ALTER TABLE .. SET SERDE tests

2022-07-28 Thread GitBox


MaxGekk commented on code in PR #36996:
URL: https://github.com/apache/spark/pull/36996#discussion_r932008623


##
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableSetSerdeSuite.scala:
##
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command.v1
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.execution.command
+import org.apache.spark.sql.internal.StaticSQLConf.CATALOG_IMPLEMENTATION
+
+/**
+ * This base suite contains unified tests for the `ALTER TABLE .. SET 
[SERDE|SERDEPROPERTIES]`
+ * command that check V1 table catalogs. The tests that cannot run for all V1 
catalogs
+ * are located in more specific test suites:
+ *
+ *   - V1 In-Memory catalog: 
`org.apache.spark.sql.execution.command.v1.AlterTableSetSerdeSuite`
+ *   - V1 Hive External catalog:
+ * `org.apache.spark.sql.hive.execution.command.AlterTableSetSerdeSuite`
+ */
+trait AlterTableSetSerdeSuiteBase extends command.AlterTableSetSerdeSuiteBase {
+
+  protected val isDatasourceTable = true
+
+  private def isUsingHiveMetastore: Boolean = {
+spark.sparkContext.conf.get(CATALOG_IMPLEMENTATION) == "hive"
+  }
+
+  private def normalizeSerdeProp(props: Map[String, String]): Map[String, 
String] = {
+props.filterNot(p => Seq("serialization.format", "path").contains(p._1))
+  }
+
+  private def maybeWrapException[T](expectException: Boolean)(body: => T): 
Unit = {
+if (expectException) intercept[AnalysisException] { body } else body
+  }
+
+  protected def testSetSerde(): Unit = {
+withNamespaceAndTable("ns", "tbl") { t =>
+  if (!isUsingHiveMetastore) {
+assert(isDatasourceTable, "InMemoryCatalog only supports data source 
tables")
+  }
+  sql(s"CREATE TABLE $t (col1 int, col2 string, a int, b int) 
$defaultUsing " +
+s"PARTITIONED BY (a, b)")
+
+  val catalog = spark.sessionState.catalog
+  val tableIdent = TableIdentifier("tbl", Some("ns"))
+  def checkSerdeProps(expectedSerdeProps: Map[String, String]): Unit = {
+val serdeProp = catalog.getTableMetadata(tableIdent).storage.properties
+if (isUsingHiveMetastore) {

Review Comment:
   Could you extract the common code into functions in `AlterTableSetSerdeSuiteBase`, and create dedicated tests for the Hive and In-Memory catalogs? Please invoke the common code from the catalog-specific tests. This is the common convention in the unified tests for v1 and v2 catalogs.






[GitHub] [spark] ulysses-you commented on pull request #37275: [SPARK-39835][SQL][3.2] Fix EliminateSorts remove global sort below the local sort

2022-07-28 Thread GitBox


ulysses-you commented on PR #37275:
URL: https://github.com/apache/spark/pull/37275#issuecomment-1197941198

   cc @cloud-fan ready for branch-3.2





[GitHub] [spark] ulysses-you commented on pull request #37276: [SPARK-39835][SQL][3.1] Fix EliminateSorts remove global sort below the local sort

2022-07-28 Thread GitBox


ulysses-you commented on PR #37276:
URL: https://github.com/apache/spark/pull/37276#issuecomment-1197940919

   cc @cloud-fan ready for branch-3.1





[GitHub] [spark] huaxingao commented on pull request #36918: [SQL][SPARK-39528] Use V2 Filter in SupportsRuntimeFiltering

2022-07-28 Thread GitBox


huaxingao commented on PR #36918:
URL: https://github.com/apache/spark/pull/36918#issuecomment-1198240325

   Thanks @cloud-fan @zinking 





[GitHub] [spark] cloud-fan commented on a diff in pull request #37287: [WIP] code cleanup for CatalogImpl

2022-07-28 Thread GitBox


cloud-fan commented on code in PR #37287:
URL: https://github.com/apache/spark/pull/37287#discussion_r932317590


##
sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala:
##
@@ -110,53 +108,44 @@ class CatalogImpl(sparkSession: SparkSession) extends 
Catalog {
   override def listTables(dbName: String): Dataset[Table] = {
 // `dbName` could be either a single database name (behavior in Spark 3.3 
and prior) or
 // a qualified namespace with catalog name. We assume it's a single 
database name
-// and check if we can find the dbName in sessionCatalog. If so we 
listTables under
-// that database. Otherwise we try 3-part name parsing and locate the 
database.
-if (sessionCatalog.databaseExists(dbName) || 
sessionCatalog.isGlobalTempViewDB(dbName)) {

Review Comment:
   no need to check global temp view db. It doesn't belong to any catalog and 
v2 commands take care of it as well.






[GitHub] [spark] peter-toth commented on pull request #37319: [SPARK-39887][SQL] `PullOutGroupingExpressions` should generate different alias names

2022-07-28 Thread GitBox


peter-toth commented on PR #37319:
URL: https://github.com/apache/spark/pull/37319#issuecomment-1198030620

   I don't think that extra `Alias` does any harm in that test; just the expected result needs to be amended.
   
   My proposal also fixes the issue of the following:
   
   ```
   SELECT a, b AS a FROM (
 SELECT a, a AS b FROM (SELECT a FROM VALUES (1) AS t(a))
 UNION ALL
 SELECT a, b FROM (SELECT a, b FROM VALUES (1, 2) AS t(a, b))
   )
   ```
   and the query returns the correct result:
   ```
   +---+---+
   |  a|  a|
   +---+---+
   |  1|  1|
   |  1|  2|
   +---+---+
   ```
   





[GitHub] [spark] cloud-fan commented on a diff in pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false

2022-07-28 Thread GitBox


cloud-fan commented on code in PR #37327:
URL: https://github.com/apache/spark/pull/37327#discussion_r932204473


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -153,19 +153,24 @@ class CSVOptions(
* Disabled by default for backwards compatibility and performance. When 
enabled, date entries in
* timestamp columns will be cast to timestamp upon parsing. Not compatible 
with
* legacyTimeParserPolicy == LEGACY since legacy date parser will accept 
extra trailing characters
+   *
+   * The flag is only enabled if inferSchema is set to true.
*/
-  val inferDate = {
-val inferDateFlag = getBool("inferDate")
-if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && 
inferDateFlag) {
+  val preferDate = {
+val preferDateFlag = getBool("preferDate")
+if (preferDateFlag && SQLConf.get.legacyTimeParserPolicy == 
LegacyBehaviorPolicy.LEGACY) {
   throw QueryExecutionErrors.inferDateWithLegacyTimeParserError()
 }
-inferDateFlag
+if (preferDateFlag && !inferSchemaFlag) {

Review Comment:
   I'd prefer to fix the doc. This renaming kind of "redefines" this option, and it doesn't make sense to bind `preferDate` to `inferSchema`.
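   A hedged reader-level sketch of the option combination under discussion; "preferDate" is the name proposed in this PR and may change, and whether it should require `inferSchema` (as the diff above makes it) is exactly what is being debated:
   ```python
   df = (
       spark.read
       .option("header", True)
       .option("inferSchema", True)   # the diff makes preferDate a no-op without this
       .option("preferDate", True)
       .csv("/path/to/data.csv")      # illustrative path
   )
   ```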









[GitHub] [spark] ala commented on a diff in pull request #37228: [SPARK-37980][SQL] Extend METADATA column to support row indexes

2022-07-28 Thread GitBox


ala commented on code in PR #37228:
URL: https://github.com/apache/spark/pull/37228#discussion_r932280019


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##
@@ -223,8 +216,25 @@ object FileSourceStrategy extends Strategy with 
PredicateHelper with Logging {
 }.toSeq
   }.getOrElse(Seq.empty)
 
+  val fileFormatReaderGeneratedMetadataColumns: Seq[Attribute] =
+metadataColumns.map(_.name).flatMap {

Review Comment:
   Added tests.






[GitHub] [spark] github-actions[bot] closed pull request #36240: [SPARK-37787][CORE] fix bug, Long running Spark Job throw HDFS_DELEGATE_TOKEN not found in cache Exception

2022-07-28 Thread GitBox


github-actions[bot] closed pull request #36240: [SPARK-37787][CORE] fix bug, 
Long running Spark Job throw HDFS_DELEGATE_TOKEN not found in cache Exception
URL: https://github.com/apache/spark/pull/36240





[GitHub] [spark] RS131419 commented on a diff in pull request #37230: [SPARK-33326][SQL] Fix the problem of writing hive partition table without updating metadata information

2022-07-28 Thread GitBox


RS131419 commented on code in PR #37230:
URL: https://github.com/apache/spark/pull/37230#discussion_r932792260


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala:
##
@@ -1611,4 +1611,26 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   }
 }
   }
+
+  test("SPARK-33326: partition metadata auto update for dynamic partitions") {
+val table = "partition_metadata_dynamic_partition"
+Seq("hive", "parquet").foreach { source =>
+  withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> "true") {
+withTable(table) {
+  sql(s"CREATE TABLE $table (id INT, sp INT, dp INT) USING $source 
PARTITIONED BY (sp, dp)")
+  sql(s"INSERT INTO $table PARTITION (sp=0, dp) VALUES (0, 0)")
+  sql(s"INSERT OVERWRITE TABLE $table PARTITION (sp=0, dp) SELECT id, 
id FROM range(5)")
+
+  for (i <- 0 until 5) {
+val partition = spark.sessionState.catalog
+  .getPartition(TableIdentifier(table), Map("sp" -> s"0", "dp" -> 
s"$i"))
+val numFiles = partition.parameters("numFiles")
+assert(numFiles.nonEmpty && numFiles.toInt > 0)
+val totalSize = partition.parameters("totalSize")

Review Comment:
   Thanks for the reply! This patch provides the calculated rawDataSize to Hive via the alterPartitions method, but ultimately it fails to take effect, which I think may be due to Hive's own behavior. Also, I tested this via Spark 2.3 and the rawDataSize was not updated, so I didn't add that verification to the unit test.






[GitHub] [spark] cloud-fan commented on pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl

2022-07-28 Thread GitBox


cloud-fan commented on PR #37287:
URL: https://github.com/apache/spark/pull/37287#issuecomment-1198793708

   > Is listTables() does not respect current catalog fixed in this PR?
   
   I think so, by always passing the fully qualified name to `getTable` in 
`listTables`. We can add tests later, to make this PR a pure refinement.





[GitHub] [spark] HyukjinKwon commented on pull request #37329: [SPARK-39832][PYTHON] Support column arguments in regexp_replace

2022-07-28 Thread GitBox


HyukjinKwon commented on PR #37329:
URL: https://github.com/apache/spark/pull/37329#issuecomment-1198809042

   cc @zero323 FYI





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37329: [SPARK-39832][PYTHON] Support column arguments in regexp_replace

2022-07-28 Thread GitBox


HyukjinKwon commented on code in PR #37329:
URL: https://github.com/apache/spark/pull/37329#discussion_r932809698


##
python/pyspark/sql/functions.py:
##
@@ -3262,7 +3262,19 @@ def regexp_extract(str: "ColumnOrName", pattern: str, 
idx: int) -> Column:
 return _invoke_function("regexp_extract", _to_java_column(str), pattern, 
idx)
 
 
-def regexp_replace(str: "ColumnOrName", pattern: str, replacement: str) -> 
Column:
+@overload
+def regexp_replace(string: "ColumnOrName", pattern: str, replacement: str) -> 
Column:
+...
+
+
+@overload
+def regexp_replace(string: "ColumnOrName", pattern: Column, replacement: 
Column) -> Column:
+...
+
+
+def regexp_replace(
+string: "ColumnOrName", pattern: Union[str, Column], replacement: 
Union[str, Column]

Review Comment:
   Can we write up a `Parameters` section in the docs? And I think you can keep only `pattern: Union[str, Column]` and remove the overloads above.
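   A minimal sketch of one reading of that suggestion (the wording of the `Parameters` section is illustrative, not from the PR):
   ```python
   from typing import Union

   from pyspark.sql import Column

   ColumnOrName = Union[Column, str]  # stand-in for pyspark's internal alias


   def regexp_replace(
       string: ColumnOrName, pattern: Union[str, Column], replacement: Union[str, Column]
   ) -> Column:
       """Replace all substrings of the string value that match the pattern.

       Parameters
       ----------
       string : :class:`~pyspark.sql.Column` or str
           column name or column containing the string value.
       pattern : :class:`~pyspark.sql.Column` or str
           the regular expression, as a literal string or per row as a column.
       replacement : :class:`~pyspark.sql.Column` or str
           the replacement, as a literal string or per row as a column.
       """
       ...
   ```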






[GitHub] [spark] Yikun commented on pull request #37258: [DO-NOT-MERGE] trigger CI

2022-07-28 Thread GitBox


Yikun commented on PR #37258:
URL: https://github.com/apache/spark/pull/37258#issuecomment-1198812267

   Sorry for the late reply, I've been busy with local meetings these past few days.
   
   > In addition, can we get the content of dmesg?
   
   @LuciferYang  We can add a separate step like:
   ```
   - name: Print debug info
     if: failure()
     run: |
       # print dmesg info
   ```





[GitHub] [spark] LuciferYang commented on pull request #37258: [DO-NOT-MERGE] trigger CI

2022-07-28 Thread GitBox


LuciferYang commented on PR #37258:
URL: https://github.com/apache/spark/pull/37258#issuecomment-1198825739

   > Sorry for the late reply, I've been busy with local meetings these past few days.
   > 
   > > In addition, can we get the content of dmesg?
   > 
   > @LuciferYang We can add a separate step like:
   > 
   > ```
   > - name: Print debug info
   >   if: failure()
   >   run: |
   >     # print dmesg info
   > ```
   
   Thanks for your reply. I have tested it; the account executing GA does not seem to have permission to execute `dmesg`, and the GA tests already look stable now :)
   
   





[GitHub] [spark] deshanxiao opened a new pull request, #37336: [SPARK-39916][SQL][MLLIB][REFACTOR] Merge ml SchemaUtils to SQL

2022-07-28 Thread GitBox


deshanxiao opened a new pull request, #37336:
URL: https://github.com/apache/spark/pull/37336

   ### What changes were proposed in this pull request?
   Today we have two SchemaUtils: the SQL SchemaUtils and the MLlib SchemaUtils. This PR tries to remove the SchemaUtils in MLlib.
   
   ### Why are the changes needed?
   The two SchemaUtils are often confused. The MLlib SchemaUtils carries a TODO flag for this, and now we can do it.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Existing UTs.
   





[GitHub] [spark] zhengruifeng commented on a diff in pull request #37304: [SPARK-39877][PySpark] Add unpivot to PySpark DataFrame API

2022-07-28 Thread GitBox


zhengruifeng commented on code in PR #37304:
URL: https://github.com/apache/spark/pull/37304#discussion_r932846004


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -2127,6 +2127,15 @@ class Dataset[T] private[sql](
   valueColumnName: String): DataFrame =
 unpivot(ids, Array.empty, variableColumnName, valueColumnName)
 
+  /**
+   * Called from Python as Seq[Column] are easier to create via py4j than 
Array[Column].
+   */
+  private[sql] def _unpivot(ids: Seq[Column],

Review Comment:
   Can you rename the method? We do not use `_xxx`-style function names in Scala.
   






[GitHub] [spark] MaxGekk commented on a diff in pull request #37337: [SPARK-39917][SQL] Use different error classes for numeric/interval arithmetic overflow

2022-07-28 Thread GitBox


MaxGekk commented on code in PR #37337:
URL: https://github.com/apache/spark/pull/37337#discussion_r932884678


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalMathUtils.scala:
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.errors.QueryExecutionErrors
+
+/**
+ * Helper functions for interval arithmetic operations with overflow.
+ */
+object IntervalMathUtils {
+
+  def addExact(a: Int, b: Int): Int = withOverflow(Math.addExact(a, b), 
"try_add")
+
+  def addExact(a: Long, b: Long): Long = withOverflow(Math.addExact(a, b), 
"try_add")
+
+  def subtractExact(a: Int, b: Int): Int = withOverflow(Math.subtractExact(a, 
b), "try_subtract")
+
+  def subtractExact(a: Long, b: Long): Long = 
withOverflow(Math.subtractExact(a, b), "try_subtract")
+
+  def negateExact(a: Int): Int = withOverflow(Math.negateExact(a))
+
+  def negateExact(a: Long): Long = withOverflow(Math.negateExact(a))
+
+  private def withOverflow[A](f: => A, hint: String = ""): A = {
+try {
+  f
+} catch {
+  case e: ArithmeticException =>
+throw 
QueryExecutionErrors.intervalArithmeticOverflowError(e.getMessage, hint)

Review Comment:
   Should we create sub-error classes per every op instead of using of 
`e.getMessage` from Java exception? cc @srielau @cloud-fan 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ivoson commented on a diff in pull request #37268: [SPARK-39853][CORE] Support stage level task resource schedule for standalone cluster when dynamic allocation disabled

2022-07-28 Thread GitBox


ivoson commented on code in PR #37268:
URL: https://github.com/apache/spark/pull/37268#discussion_r928873929


##
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala:
##
@@ -388,14 +388,19 @@ private[spark] class TaskSchedulerImpl(
   val execId = shuffledOffers(i).executorId
   val host = shuffledOffers(i).host
   val taskSetRpID = taskSet.taskSet.resourceProfileId
+  val prof = sc.resourceProfileManager.resourceProfileFromId(taskSetRpID)
+  val targetExecutorRpID = if (prof.isForTaskOnly) {
+ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID
+  } else {
+taskSetRpID

Review Comment:
   Thanks for pointing this out. Will try to add a new `TaskResourceProfile` to 
process task-only profiles to make the change clearer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikun commented on a diff in pull request #37305: [SPARK-39881][PYTHON] Fix erroneous check for black and reenable black validation.

2022-07-28 Thread GitBox


Yikun commented on code in PR #37305:
URL: https://github.com/apache/spark/pull/37305#discussion_r932795358


##
dev/lint-python:
##
@@ -210,7 +210,7 @@ function black_test {
 local BLACK_STATUS=
 
 # Skip check if black is not installed.
-$BLACK_BUILD 2> /dev/null
+$PYTHON_EXECUTABLE -c 'import black' &> /dev/null
 if [ $? -ne 0 ]; then
 echo "The $BLACK_BUILD command was not found. Skipping black checks 
for now."

Review Comment:
   nit: we might also update this warning message accordingly, e.g. change 
`$BLACK_BUILD` to `$PYTHON_EXECUTABLE -c 'import black'` in it,
   
   or make it as simple as: `black is not installed. Skipping black checks for 
now.`
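
   For illustration only, a minimal Python sketch of the same availability check
   (the real check stays in shell in `dev/lint-python`; this just mirrors the
   `import black` probe):

```python
# Illustrative only: mirrors the `$PYTHON_EXECUTABLE -c 'import black'` style
# check from the diff above, expressed with importlib instead of a subprocess.
import importlib.util
import sys

if importlib.util.find_spec("black") is None:
    print("black is not installed. Skipping black checks for now.", file=sys.stderr)
else:
    print("black is available; running black checks.")
```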



##
dev/pyproject.toml:
##
@@ -27,7 +27,7 @@ testpaths = [
 [tool.black]
 # When changing the version, we have to update
 # GitHub workflow version and dev/reformat-python
-required-version = "21.12b0"
+required-version = "22.6.0"

Review Comment:
   unrelated nit: would we also want to pin the version in the requirements file?
   
   https://github.com/apache/spark/blob/master/dev/requirements.txt#L47
   
   When devs install the requirements, they then get the PySpark dev required 
version. We can also do it in a follow-up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-07-28 Thread GitBox


beliefer commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r932847111


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala:
##
@@ -811,6 +800,244 @@ class JDBCV2Suite extends QueryTest with 
SharedSparkSession with ExplainSuiteHel
 checkAnswer(df2, Seq(Row(2, "david", 1.00)))
   }
 
+  test("scan with aggregate push-down and top N push-down") {
+val df1 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT").sum("SALARY")
+  .orderBy("DEPT")
+  .limit(1)
+checkSortRemoved(df1)
+checkLimitRemoved(df1)
+checkPushedInfo(df1,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df1, Seq(Row(1, 19000.00)))
+
+val df2 = sql(
+  """
+|SELECT dept AS my_dept, SUM(SALARY) FROM h2.test.employee
+|GROUP BY dept
+|ORDER BY my_dept
+|LIMIT 1
+|""".stripMargin)
+checkSortRemoved(df2)
+checkLimitRemoved(df2)
+checkPushedInfo(df2,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df2, Seq(Row(1, 19000.00)))
+
+val df3 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key").sum("SALARY")
+  .orderBy("key")
+  .limit(1)
+checkSortRemoved(df3)
+checkLimitRemoved(df3)
+checkPushedInfo(df3,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df3, Seq(Row(0, 44000.00)))
+
+val df4 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT", "IS_MANAGER").sum("SALARY")
+  .orderBy("DEPT", "IS_MANAGER")
+  .limit(1)
+checkSortRemoved(df4)
+checkLimitRemoved(df4)
+checkPushedInfo(df4,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT, IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] 
LIMIT 1")
+checkAnswer(df4, Seq(Row(1, false, 9000.00)))
+
+val df5 = sql(
+  """
+|SELECT dept AS my_dept, is_manager AS my_manager, SUM(SALARY) FROM 
h2.test.employee
+|GROUP BY dept, my_manager
+|ORDER BY my_dept, my_manager
+|LIMIT 1
+|""".stripMargin)
+checkSortRemoved(df5)
+checkLimitRemoved(df5)
+checkPushedInfo(df5,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT, IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [DEPT ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] 
LIMIT 1")
+checkAnswer(df5, Seq(Row(1, false, 9000.00)))
+
+val df6 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY", $"IS_MANAGER",
+when(($"SALARY" > 8000).and($"SALARY" < 1), 
$"salary").otherwise(0).as("key"))
+  .groupBy("key", "IS_MANAGER").sum("SALARY")
+  .orderBy("key", "IS_MANAGER")
+  .limit(1)
+checkSortRemoved(df6)
+checkLimitRemoved(df6)
+checkPushedInfo(df6,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: " +
+"[CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY 
ELSE 0.00 END, " +
+"IS_MANAGER]",
+  "PushedFilters: []",
+  "PushedTopN: ORDER BY [" +
+"CASE WHEN (SALARY > 8000.00) AND (SALARY < 1.00) THEN SALARY ELSE 
0.00 END " +
+"ASC NULLS FIRST, IS_MANAGER ASC NULLS FIRST] LIMIT 1")
+checkAnswer(df6, Seq(Row(0.00, false, 12000.00)))
+
+val df7 = sql(
+  """
+|SELECT dept, SUM(SALARY) FROM h2.test.employee
+|GROUP BY dept
+|ORDER BY SUM(SALARY)
+|LIMIT 1
+|""".stripMargin)
+checkSortRemoved(df7, false)
+checkLimitRemoved(df7, false)
+checkPushedInfo(df7,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []")
+checkAnswer(df7, Seq(Row(6, 12000.00)))
+
+val df8 = sql(
+  """
+|SELECT dept, SUM(SALARY) AS total FROM h2.test.employee
+|GROUP BY dept
+|ORDER BY total
+|LIMIT 1
+|""".stripMargin)
+checkSortRemoved(df8, false)
+checkLimitRemoved(df8, false)
+checkPushedInfo(df8,
+  "PushedAggregates: [SUM(SALARY)]",
+  "PushedGroupByExpressions: [DEPT]",
+  "PushedFilters: []")
+checkAnswer(df8, Seq(Row(6, 12000.00)))
+  }
+
+  

[GitHub] [spark] zhengruifeng commented on a diff in pull request #37304: [SPARK-39877][PySpark] Add unpivot to PySpark DataFrame API

2022-07-28 Thread GitBox


zhengruifeng commented on code in PR #37304:
URL: https://github.com/apache/spark/pull/37304#discussion_r932851912


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -2127,6 +2127,15 @@ class Dataset[T] private[sql](
   valueColumnName: String): DataFrame =
 unpivot(ids, Array.empty, variableColumnName, valueColumnName)
 
+  /**
+   * Called from Python as Seq[Column] are easier to create via py4j than 
Array[Column].
+   */
+  private[sql] def _unpivot(ids: Seq[Column],

Review Comment:
   > And there are other private Python methods in Dataset (mapInPandas) and 
RelationalGroupedDataset (flatMapGroupsInPandas).
   
   they are also used in tests



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang opened a new pull request, #37337: [SPARK-39917][SQL] Use different error classes for numeric/interval arithmetic overflow

2022-07-28 Thread GitBox


gengliangwang opened a new pull request, #37337:
URL: https://github.com/apache/spark/pull/37337

   
   
   ### What changes were proposed in this pull request?
   
   Similar to https://github.com/apache/spark/pull/37313: currently, when 
arithmetic overflow errors happen under ANSI mode, the error messages are like
   ```
   [ARITHMETIC_OVERFLOW] long overflow. Use 'try_multiply' to tolerate overflow 
and return NULL instead. If necessary set spark.sql.ansi.enabled to "false" 
   ```
   
   
   The "(except for ANSI interval type)" part is confusing. We should remove it 
for the numeric arithmetic operations and have a new error class for the 
interval division error: `INTERVAL_ARITHMETIC_OVERFLOW`
   
   
   
   ### Why are the changes needed?
   
   For better error messages
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, use different error classes for arithmetic overflows of numeric/interval 
types. After the changes, the error messages are simpler and clearer.
   
   ### How was this patch tested?
   
   UT
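
   For illustration (not part of this PR), a small PySpark snippet that triggers a
   numeric arithmetic overflow under ANSI mode; with this change such numeric
   overflows keep their own error class, while interval operations get
   `INTERVAL_ARITHMETIC_OVERFLOW`:

```python
# Hedged sketch: reproduces a long overflow under ANSI mode.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

try:
    # Long.MaxValue + 1 overflows the bigint range.
    spark.sql("SELECT 9223372036854775807 + 1").collect()
except Exception as e:
    # Expected: an arithmetic overflow error suggesting try_add as a workaround.
    print(e)
```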


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang opened a new pull request, #37338: [SPARK-39918][SQL][MINOR] Replace the wording "un-comparable" with "incomparable" in error message

2022-07-28 Thread GitBox


gengliangwang opened a new pull request, #37338:
URL: https://github.com/apache/spark/pull/37338

   
   
   ### What changes were proposed in this pull request?
   
   
   Update the codegen error message for data types which can't be compared by 
replacing `un-comparable` with `incomparable`.
   
   ### Why are the changes needed?
   
   Incomparable is the correct wording here
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No
   ### How was this patch tested?
   
   Existing UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #37338: [SPARK-39918][SQL][MINOR] Replace the wording "un-comparable" with "incomparable" in error message

2022-07-28 Thread GitBox


gengliangwang commented on PR #37338:
URL: https://github.com/apache/spark/pull/37338#issuecomment-1198884914

   This is trivial. I found it when working on 
https://github.com/apache/spark/pull/37337


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Jonathancui123 commented on pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false

2022-07-28 Thread GitBox


Jonathancui123 commented on PR #37327:
URL: https://github.com/apache/spark/pull/37327#issuecomment-1198894995

   > Should we keep the requirement that `inferDate = true` needs `inferSchema = 
true`? I think we should clarify semantics.
   
   @sadikovi I think we should keep the requirement and the new exception type 
in this PR. The exception will clarify that the primary purpose of the 
`inferDate` flag is for allowing dates in the inferred schema. The requirement 
that `inferDate = true` needs `inferSchema = true` makes sense because 
otherwise, `inferDate` is modifying parsing fallback behavior for no reason.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on a diff in pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead

2022-07-28 Thread GitBox


MaxGekk commented on code in PR #37322:
URL: https://github.com/apache/spark/pull/37322#discussion_r932499581


##
sql/core/src/test/scala/org/apache/spark/sql/DatasetUnpivotSuite.scala:
##
@@ -305,14 +305,17 @@ class DatasetUnpivotSuite extends QueryTest
   valueColumnName = "val"
 )
 }
-checkErrorClass(
+checkError(
   exception = e,
   errorClass = "UNPIVOT_VALUE_DATA_TYPE_MISMATCH",
-  msg = "Unpivot value columns must share a least common type, some types 
do not: \\[" +
-"\"STRING\" \\(`str1#\\d+`\\), " +
-"\"INT\" \\(`int1#\\d+`, `int2#\\d+`, `int3#\\d+`, ...\\), " +
-"\"BIGINT\" \\(`long1#\\d+L`, `long2#\\d+L`\\)\\];(\n.*)*",
-  matchMsg = true)
+  errorSubClass = None,
+  sqlState = None,

Review Comment:
   Let me add the overloaded method:
   ```scala
protected def checkError(
 exception: SparkThrowable,
 errorClass: String,
 parameters: Map[String, String],
 matchPVals: Boolean): Unit =
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37335: [SPARK-39895][PYTHON] Support multiple column drop

2022-07-28 Thread GitBox


dongjoon-hyun commented on code in PR #37335:
URL: https://github.com/apache/spark/pull/37335#discussion_r932701092


##
python/pyspark/sql/dataframe.py:
##
@@ -3237,17 +3237,18 @@ def drop(self, *cols: "ColumnOrName") -> "DataFrame":  
# type: ignore[misc]
 """
 if len(cols) == 1:
 col = cols[0]
-if isinstance(col, str):
-jdf = self._jdf.drop(col)
-elif isinstance(col, Column):
-jdf = self._jdf.drop(col._jc)
-else:
+if not isinstance(col, (str, Column)):
 raise TypeError("col should be a string or a Column")
+jdf = self._jdf.drop(_to_java_column(col))

Review Comment:
   This code path change is irrelevant to this PR's goal, `Support multiple 
column drop`, isn't it?
   If this is a bug, it's worth having another JIRA and making a separate PR, 
@santosh-d3vpl3x .



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on pull request #37280: [SPARK-39862][SQL] Fix bugs in existence DEFAULT value lookups for V2 data sources

2022-07-28 Thread GitBox


dtenedor commented on PR #37280:
URL: https://github.com/apache/spark/pull/37280#issuecomment-1198675997

   @gengliangwang Sure, this is done.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl

2022-07-28 Thread GitBox


amaliujia commented on PR #37287:
URL: https://github.com/apache/spark/pull/37287#issuecomment-1198716223

   Is the issue that `listTables()` does not respect the current catalog fixed in this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37335: [SPARK-39895][PYTHON] Support multiple column drop

2022-07-28 Thread GitBox


dongjoon-hyun commented on code in PR #37335:
URL: https://github.com/apache/spark/pull/37335#discussion_r932774765


##
python/pyspark/sql/tests/test_dataframe.py:
##
@@ -87,6 +87,21 @@ def test_help_command(self):
 pydoc.render_doc(df.foo)
 pydoc.render_doc(df.take(1))
 
+def test_drop(self):
+df = self.spark.createDataFrame([("A", 50, "Y"), ("B", 60, "Y")], 
["name", "age", "active"])
+
+self.assertEqual(df.drop("active").columns, ["name", "age"])
+
+self.assertEqual(df.drop("active", "nonexistent_column").columns, 
["name", "age"])
+
+self.assertEqual(df.drop("name", "age", "active").columns, [])

Review Comment:
   Does this fail without your patch?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cfmcgrady commented on a diff in pull request #37334: [SPARK-39887][SQL] RemoveRedundantAliases should keep attributes of a Union's first child

2022-07-28 Thread GitBox


cfmcgrady commented on code in PR #37334:
URL: https://github.com/apache/spark/pull/37334#discussion_r932804420


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##
@@ -559,6 +559,17 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
 })
 Join(newLeft, newRight, joinType, newCondition, hint)
 
+  case _: Union =>

Review Comment:
   Shall we update the method comment as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikun commented on a diff in pull request #37305: [SPARK-39881][PYTHON] Fix erroneous check for black and reenable black validation.

2022-07-28 Thread GitBox


Yikun commented on code in PR #37305:
URL: https://github.com/apache/spark/pull/37305#discussion_r932810013


##
python/pyspark/ml/feature.py:
##
@@ -968,7 +968,7 @@ class _CountVectorizerParams(JavaParams, HasInputCol, 
HasOutputCol):
 
 def __init__(self, *args: Any):
 super(_CountVectorizerParams, self).__init__(*args)
-self._setDefault(minTF=1.0, minDF=1.0, maxDF=2 ** 63 - 1, vocabSize=1 
<< 18, binary=False)
+self._setDefault(minTF=1.0, minDF=1.0, maxDF=2**63 - 1, vocabSize=1 << 
18, binary=False)

Review Comment:
   Looks like `**` is not covered by PEP 8, and the main reason for the black 
change is 
[readability](https://github.com/psf/black/pull/2095#issuecomment-871694472), so I 
personally think black's choice is right.
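
   For reference, a tiny illustration of that black 22.x rule (hugging `**` only
   when both operands are simple names or literals):

```python
# Illustration only: how black 22.x formats the expressions from the diff above.
max_df = 2**63 - 1       # black 22.x removes the spaces around `**` for simple operands
vocab_size = 1 << 18     # other operators such as `<<` keep their spacing
```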



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikun commented on pull request #37328: [SPARK-39907][PS] Implement axis and skipna of Series.argmin

2022-07-28 Thread GitBox


Yikun commented on PR #37328:
URL: https://github.com/apache/spark/pull/37328#issuecomment-1198820043

   otherwise LGTM! Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #37258: [DO-NOT-MERGE] trigger CI

2022-07-28 Thread GitBox


HyukjinKwon commented on PR #37258:
URL: https://github.com/apache/spark/pull/37258#issuecomment-1198843792

   Let me close this one. I believe all are fixed now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #37258: [DO-NOT-MERGE] trigger CI

2022-07-28 Thread GitBox


HyukjinKwon closed pull request #37258: [DO-NOT-MERGE] trigger CI
URL: https://github.com/apache/spark/pull/37258


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #37304: [SPARK-39877][PySpark] Add unpivot to PySpark DataFrame API

2022-07-28 Thread GitBox


zhengruifeng commented on code in PR #37304:
URL: https://github.com/apache/spark/pull/37304#discussion_r932840669


##
python/pyspark/context.py:
##
@@ -309,10 +309,7 @@ def _do_init(
 if sys.version_info[:2] < (3, 8):
 with warnings.catch_warnings():
 warnings.simplefilter("once")
-warnings.warn(
-"Python 3.7 support is deprecated in Spark 3.4.",
-FutureWarning
-)
+warnings.warn("Python 3.7 support is deprecated in Spark 
3.4.", FutureWarning)

Review Comment:
   I encountered the same issue in another PR, maybe due to the `black` version. 
Let us revert this change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead

2022-07-28 Thread GitBox


MaxGekk commented on PR #37322:
URL: https://github.com/apache/spark/pull/37322#issuecomment-1198870290

   @anchovYu @cloud-fan @HyukjinKwon @gengliangwang Could you review this PR, 
please.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl

2022-07-28 Thread GitBox


amaliujia commented on PR #37287:
URL: https://github.com/apache/spark/pull/37287#issuecomment-1198905411

   > > Is the issue that listTables() does not respect the current catalog fixed in this PR?
   > 
   > I think so, by always passing the fully qualified name to `getTable` in 
`listTables`. We can add tests later, to make this PR a pure refinement.
   
   thanks for the confirmation! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang closed pull request #37280: [SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups

2022-07-28 Thread GitBox


gengliangwang closed pull request #37280: [SPARK-39862][SQL] Fix two bugs in 
existence DEFAULT value lookups
URL: https://github.com/apache/spark/pull/37280


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #37280: [SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups

2022-07-28 Thread GitBox


gengliangwang commented on PR #37280:
URL: https://github.com/apache/spark/pull/37280#issuecomment-1198710296

   Thanks, merging to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on pull request #37332: [SPARK-39914][SQL] Add DS V2 Filter to V1 Filter conversion

2022-07-28 Thread GitBox


huaxingao commented on PR #37332:
URL: https://github.com/apache/spark/pull/37332#issuecomment-1198735772

   The GA failure doesn't seem relevant.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37335: [SPARK-39895][PYTHON] Support multiple column drop

2022-07-28 Thread GitBox


HyukjinKwon commented on code in PR #37335:
URL: https://github.com/apache/spark/pull/37335#discussion_r932808436


##
python/pyspark/sql/dataframe.py:
##
@@ -3244,10 +3244,14 @@ def drop(self, *cols: "ColumnOrName") -> "DataFrame":  
# type: ignore[misc]
 else:
 raise TypeError("col should be a string or a Column")
 else:
-for col in cols:
-if not isinstance(col, str):
-raise TypeError("each col in the param list should be a 
string")
-jdf = self._jdf.drop(self._jseq(cols))
+if all(isinstance(col, str) for col in cols):
+jdf = self._jdf.drop(self._jseq(cols))
+elif all(isinstance(col, Column) for col in cols):
+jdf = self._jdf
+for col in cols:
+jdf = jdf.drop(col._jc)  # type: ignore[union-attr]

Review Comment:
   Can we avoid looping here? This is super expensive in the Spark SQL optimizer. 
Ideally we should add the `def drop(colNames: Column*)` signature on the Scala 
side first, and have the PySpark side directly invoke it.
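
   For illustration, a hedged sketch of the user-facing behaviour this PR is after
   (the multi-Column form assumes the PR's change, or the suggested Scala-side
   `drop(colNames: Column*)` variant, is in place; it is not current API):

```python
# Hedged sketch of the intended behaviour, not the final implementation.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("A", 50, "Y"), ("B", 60, "Y")], ["name", "age", "active"])

df.drop("age", "active").show()            # dropping multiple names already works
df.drop(col("age"), col("active")).show()  # dropping multiple Column objects is the new case
```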



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #37326: [SPARK-39906][INFRA] Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'

2022-07-28 Thread GitBox


HyukjinKwon commented on PR #37326:
URL: https://github.com/apache/spark/pull/37326#issuecomment-1198807963

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] deshanxiao commented on pull request #37336: [SPARK-39916][SQL][MLLIB][REFACTOR] Merge ml SchemaUtils to SQL

2022-07-28 Thread GitBox


deshanxiao commented on PR #37336:
URL: https://github.com/apache/spark/pull/37336#issuecomment-1198839786

   CC @gengliangwang  @dongjoon-hyun @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #37320: [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)

2022-07-28 Thread GitBox


beliefer commented on code in PR #37320:
URL: https://github.com/apache/spark/pull/37320#discussion_r932843706


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala:
##
@@ -545,6 +560,9 @@ case class ScanBuilderHolder(
   var pushedPredicates: Seq[Predicate] = Seq.empty[Predicate]
 
   var pushedAggregate: Option[Aggregation] = None
+
+  var pushedAggregateExpectedOutputMap: Map[AttributeReference, Expression] =

Review Comment:
   Thank you for the reminder.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on pull request #37332: [SPARK-39914][SQL] Add DS V2 Filter to V1 Filter conversion

2022-07-28 Thread GitBox


huaxingao commented on PR #37332:
URL: https://github.com/apache/spark/pull/37332#issuecomment-1198736391

   @cloud-fan Could you please take a look when you have time? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #37326: [SPARK-39906][INFRA] Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'

2022-07-28 Thread GitBox


HyukjinKwon closed pull request #37326: [SPARK-39906][INFRA] Eliminate build 
warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'
URL: https://github.com/apache/spark/pull/37326


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #37328: [SPARK-39907][PS] Implement axis and skipna of Series.argmin

2022-07-28 Thread GitBox


HyukjinKwon commented on PR #37328:
URL: https://github.com/apache/spark/pull/37328#issuecomment-1198808336

   cc @itholic @xinrong-meng @ueshin FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikun commented on pull request #37305: [SPARK-39881][PYTHON] Fix erroneous check for black and reenable black validation.

2022-07-28 Thread GitBox


Yikun commented on PR #37305:
URL: https://github.com/apache/spark/pull/37305#issuecomment-1198817543

   and the CI failure was due to a git clone networking issue in [Run / Scala 2.13 
build with 
SBT](https://github.com/grundprinzip/spark/runs/7546678501?check_suite_focus=true); 
I think we can pass it by re-triggering.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false

2022-07-28 Thread GitBox


sadikovi commented on PR #37327:
URL: https://github.com/apache/spark/pull/37327#issuecomment-1198896103

   Yes, that was my thinking too. Okay, I will make a few changes to the PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikun commented on a diff in pull request #37328: [SPARK-39907][PS] Implement axis and skipna of Series.argmin

2022-07-28 Thread GitBox


Yikun commented on code in PR #37328:
URL: https://github.com/apache/spark/pull/37328#discussion_r932814726


##
python/pyspark/pandas/series.py:
##
@@ -6322,13 +6322,21 @@ def argmax(self, axis: Axis = None, skipna: bool = 
True) -> int:
 # If the maximum is achieved in multiple locations, the first row 
position is returned.
 return -1 if max_value[0] is None else max_value[1]
 
-def argmin(self) -> int:
+def argmin(self, axis: Axis = None, skipna: bool = True) -> int:
 """
 Return int position of the smallest value in the Series.
 
 If the minimum is achieved in multiple locations,
 the first row position is returned.
 
+Parameters
+--
+axis : {{None}}
+Dummy argument for consistency with Series.
+skipna : bool, default True
+Exclude NA/null values. If the entire Series is NA, the result

Review Comment:
   > If the entire Series is NA, the result
   
   This seems to be a documentation addition relative to pandas (1.4.3); do we want to add a test for this?
   
   [1] https://pandas.pydata.org/docs/reference/api/pandas.Series.argmin.html
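
   For illustration, a quick plain-pandas check one could run to pin down the
   behaviour before writing the test (the all-NA result should be verified against
   pandas 1.4.3 rather than assumed):

```python
# Illustration only: pandas-on-Spark aims to match whatever plain pandas does here.
import numpy as np
import pandas as pd

s = pd.Series([3.0, 1.0, 2.0])
print(s.argmin())            # 1 -> position of the smallest value

s_all_na = pd.Series([np.nan, np.nan])
print(s_all_na.argmin())     # the value printed here is what the new test would assert
```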



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on pull request #36253: [SPARK-38932][SQL] Datasource v2 support report distinct keys

2022-07-28 Thread GitBox


ulysses-you commented on PR #36253:
URL: https://github.com/apache/spark/pull/36253#issuecomment-1198822779

   cc @cloud-fan @huaxingao if you have time to take a look, thank you


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on a diff in pull request #37290: [SPARK-37194][SQL] Avoid unnecessary sort in v1 write if it's not dynamic partition

2022-07-28 Thread GitBox


c21 commented on code in PR #37290:
URL: https://github.com/apache/spark/pull/37290#discussion_r932846383


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala:
##
@@ -117,20 +117,26 @@ object V1WritesUtils {
   outputColumns: Seq[Attribute],
   partitionColumns: Seq[Attribute],
   bucketSpec: Option[BucketSpec],
-  options: Map[String, String]): Seq[SortOrder] = {
+  options: Map[String, String],
+  numStaticPartitions: Int = 0): Seq[SortOrder] = {
+assert(partitionColumns.size >= numStaticPartitions)

Review Comment:
   ditto.



##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala:
##
@@ -107,8 +108,10 @@ object FileFormatWriter extends Logging {
   partitionColumns: Seq[Attribute],
   bucketSpec: Option[BucketSpec],
   statsTrackers: Seq[WriteJobStatsTracker],
-  options: Map[String, String])
+  options: Map[String, String],
+  numStaticPartitions: Int = 0)
 : Set[String] = {
+assert(partitionColumns.size >= numStaticPartitions)

Review Comment:
   nit: would `require()` be better?



##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/V1WriteCommandSuite.scala:
##
@@ -214,4 +216,34 @@ class V1WriteCommandSuite extends V1WriteCommandSuiteBase 
with SharedSparkSessio
   }
 }
   }
+
+  test("SPARK-37194: Avoid unnecessary sort in v1 write if it's not dynamic 
partition") {
+withPlannedWrite { enabled =>
+  withTable("t") {
+sql(
+  """
+|CREATE TABLE t(key INT, value STRING) USING PARQUET
+|PARTITIONED BY (p1 INT, p2 STRING)
+|""".stripMargin)
+
+// partition columns are static
+executeAndCheckOrdering(hasLogicalSort = false, orderingMatched = 
true) {
+  sql(
+"""
+  |INSERT INTO t PARTITION(p1=1, p2='a')
+  |SELECT key, value FROM testData
+  |""".stripMargin)
+}
+
+// one static partition column and one dynamic partition column
+executeAndCheckOrdering(hasLogicalSort = enabled, orderingMatched = 
enabled) {
+  sql(
+"""
+  |INSERT INTO t PARTITION(p1=1, p2)
+  |SELECT key, value, value FROM testData
+  |""".stripMargin)
+}
+  }
+}

Review Comment:
   Would it be good to have one more unit test for the case with no static partition columns?



##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala:
##
@@ -83,6 +83,7 @@ object FileFormatWriter extends Logging {
*/
   private[sql] var outputOrderingMatched: Boolean = false
 
+  // scalastyle:off argcount

Review Comment:
   nit: we can pass in a wrapper class `PartitionSpec(partitionColumns: 
Seq[Attribute], numStaticPartitions: Int)` to avoid this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false

2022-07-28 Thread GitBox


sadikovi commented on PR #37327:
URL: https://github.com/apache/spark/pull/37327#issuecomment-1198856750

   Should we keep the requirement that `inferDate = true` needs `inferSchema = 
true`? I think it is unclear right now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on a diff in pull request #37264: [SPARK-39849][SQL] Dataset.as(StructType) fills missing new columns with null value

2022-07-28 Thread GitBox


c21 commented on code in PR #37264:
URL: https://github.com/apache/spark/pull/37264#discussion_r932857471


##
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAsSchemaSuite.scala:
##
@@ -46,15 +46,11 @@ class DataFrameAsSchemaSuite extends QueryTest with 
SharedSparkSession {
 checkAnswer(df, Row("b"))
   }
 
-  test("negative: column not found") {

Review Comment:
   @cloud-fan - sure, updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on pull request #37264: [SPARK-39849][SQL] Dataset.as(StructType) fills missing new columns with null value

2022-07-28 Thread GitBox


c21 commented on PR #37264:
URL: https://github.com/apache/spark/pull/37264#issuecomment-1198868034

   The PR is ready for review again, thanks @cloud-fan.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a diff in pull request #37290: [SPARK-37194][SQL] Avoid unnecessary sort in v1 write if it's not dynamic partition

2022-07-28 Thread GitBox


viirya commented on code in PR #37290:
URL: https://github.com/apache/spark/pull/37290#discussion_r932864145


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala:
##
@@ -107,8 +108,10 @@ object FileFormatWriter extends Logging {
   partitionColumns: Seq[Attribute],
   bucketSpec: Option[BucketSpec],
   statsTrackers: Seq[WriteJobStatsTracker],
-  options: Map[String, String])
+  options: Map[String, String],
+  numStaticPartitions: Int = 0)

Review Comment:
   numStaticPartitionCols?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk closed pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead

2022-07-28 Thread GitBox


MaxGekk closed pull request #37322: [SPARK-39905][SQL][TESTS] Remove 
`checkErrorClass()` and use `checkError()` instead
URL: https://github.com/apache/spark/pull/37322


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead

2022-07-28 Thread GitBox


MaxGekk commented on PR #37322:
URL: https://github.com/apache/spark/pull/37322#issuecomment-1198880973

   Merging to master. Thank you, @gengliangwang for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl

2022-07-28 Thread GitBox


dongjoon-hyun commented on code in PR #37287:
URL: https://github.com/apache/spark/pull/37287#discussion_r932517832


##
sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala:
##
@@ -33,36 +33,37 @@ import org.apache.spark.storage.StorageLevel
 abstract class Catalog {
 
   /**
-   * Returns the current default database in this session.
+   * Returns the current database/namespace in this session.

Review Comment:
   If you don't mind, could you avoid using `/` here? You can use `or` 
literally. Otherwise, `/` could be read as another multi-layer. :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37287: [SPARK-39912][SQL] Refine CatalogImpl

2022-07-28 Thread GitBox


dongjoon-hyun commented on code in PR #37287:
URL: https://github.com/apache/spark/pull/37287#discussion_r932517832


##
sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala:
##
@@ -33,36 +33,37 @@ import org.apache.spark.storage.StorageLevel
 abstract class Catalog {
 
   /**
-   * Returns the current default database in this session.
+   * Returns the current database/namespace in this session.

Review Comment:
   If you don't mind, could you avoid using `/` here? You can use `or` 
literally. Otherwise, `/` could be read as another multi-layer, unlike the 
`table/view` case. We are not confused by `table/view`, but this new sentence 
looks a little confusing to me, at least :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on a diff in pull request #35391: [SPARK-38098][PYTHON] Add support for ArrayType of nested StructType to arrow-based conversion

2022-07-28 Thread GitBox


ueshin commented on code in PR #35391:
URL: https://github.com/apache/spark/pull/35391#discussion_r932566706


##
python/pyspark/sql/tests/test_dataframe.py:
##
@@ -953,6 +953,30 @@ def test_to_pandas_from_mixed_dataframe(self):
 pdf_with_only_nulls = self.spark.sql(sql).filter("tinyint is 
null").toPandas()
 self.assertTrue(np.all(pdf_with_only_nulls.dtypes == 
pdf_with_some_nulls.dtypes))
 
+@unittest.skipIf(
+not have_pandas or not have_pyarrow,
+cast(str, pandas_requirement_message or pyarrow_requirement_message),
+)
+def test_to_pandas_for_array_of_struct(self):
+# SPARK-38098: Support Array of Struct for Pandas UDFs and toPandas
+import numpy as np
+import pandas as pd
+
+df = self.spark.createDataFrame(
+[[[("a", 2, 3.0), ("a", 2, 3.0)]], [[("b", 5, 6.0), ("b", 5, 
6.0)]]],
+"array_struct_col Array>",
+)
+is_arrow_enabled = [True, False]
+for value in is_arrow_enabled:

Review Comment:
   nit:
   
   ```py
   for is_arrow_enabled in [True, False]:
   ```



##
python/pyspark/sql/tests/test_pandas_udf_scalar.py:
##
@@ -134,6 +134,30 @@ def test_pandas_udf_nested_arrays(self):
 result = df.select(tokenize("vals").alias("hi"))
 self.assertEqual([Row(hi=[["hi", "boo"]]), Row(hi=[["bye", "boo"]])], 
result.collect())
 
+def test_pandas_array_struct(self):
+# SPARK-38098: Support Array of Struct for Pandas UDFs and toPandas
+# import numpy as np
+
+@pandas_udf("Array>")
+def return_cols(cols):
+# self.assertEqual(type(cols), pd.Series)
+# self.assertEqual(type(cols[0]), np.ndarray)
+# self.assertEqual(type(cols[0][0]), dict)

Review Comment:
   I guess we can't use `self` in the udf. Shall we follow the other tests and 
use the builtin `assert` instead:
   
   
https://github.com/apache/spark/blob/f8b3d5322e6cbce2e42a6940518686b7255e79cb/python/pyspark/sql/tests/test_pandas_udf_scalar.py#L1206-L1215
   
   Also `import numpy as np` might need to be in the udf.
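
   For illustration, a self-contained sketch of both suggestions (iterating the
   flag directly and using the builtin `assert` inside a UDF); the columns and
   types here are simplified placeholders, not the ones from this PR:

```python
# Sketch only: simplified stand-ins for the tests under discussion.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(3)

# Suggestion 1: loop the flag directly instead of a separate list variable.
for is_arrow_enabled in [True, False]:
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", str(is_arrow_enabled).lower())
    pdf = df.toPandas()
    assert pdf["id"].tolist() == [0, 1, 2]

# Suggestion 2: use the builtin `assert` inside the UDF instead of `self.assertEqual`.
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    assert isinstance(s, pd.Series)
    return s + 1

assert [r[0] for r in df.select(plus_one("id")).collect()] == [1, 2, 3]
```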



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] santosh-d3vpl3x closed pull request #37335: SPARK-39895 pyspark support multiple column drop

2022-07-28 Thread GitBox


santosh-d3vpl3x closed pull request #37335: SPARK-39895 pyspark support 
multiple column drop
URL: https://github.com/apache/spark/pull/37335


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cabral1888 commented on a diff in pull request #37230: [SPARK-33326][SQL] Fix the problem of writing hive partition table without updating metadata information

2022-07-28 Thread GitBox


cabral1888 commented on code in PR #37230:
URL: https://github.com/apache/spark/pull/37230#discussion_r932418981


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala:
##
@@ -1611,4 +1611,26 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   }
 }
   }
+
+  test("SPARK-33326: partition metadata auto update for dynamic partitions") {
+val table = "partition_metadata_dynamic_partition"
+Seq("hive", "parquet").foreach { source =>
+  withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> "true") {
+withTable(table) {
+  sql(s"CREATE TABLE $table (id INT, sp INT, dp INT) USING $source 
PARTITIONED BY (sp, dp)")
+  sql(s"INSERT INTO $table PARTITION (sp=0, dp) VALUES (0, 0)")
+  sql(s"INSERT OVERWRITE TABLE $table PARTITION (sp=0, dp) SELECT id, 
id FROM range(5)")
+
+  for (i <- 0 until 5) {
+val partition = spark.sessionState.catalog
+  .getPartition(TableIdentifier(table), Map("sp" -> s"0", "dp" -> 
s"$i"))
+val numFiles = partition.parameters("numFiles")
+assert(numFiles.nonEmpty && numFiles.toInt > 0)
+val totalSize = partition.parameters("totalSize")

Review Comment:
   I see that you included the verification only for `totalSize` and 
`numFiles`, but what about `rawDataSize`? Does it make sense to verify it as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on a diff in pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead

2022-07-28 Thread GitBox


MaxGekk commented on code in PR #37322:
URL: https://github.com/apache/spark/pull/37322#discussion_r932495675


##
sql/core/src/test/scala/org/apache/spark/sql/DatasetUnpivotSuite.scala:
##
@@ -305,14 +305,17 @@ class DatasetUnpivotSuite extends QueryTest
   valueColumnName = "val"
 )
 }
-checkErrorClass(
+checkError(
   exception = e,
   errorClass = "UNPIVOT_VALUE_DATA_TYPE_MISMATCH",
-  msg = "Unpivot value columns must share a least common type, some types 
do not: \\[" +
-"\"STRING\" \\(`str1#\\d+`\\), " +
-"\"INT\" \\(`int1#\\d+`, `int2#\\d+`, `int3#\\d+`, ...\\), " +
-"\"BIGINT\" \\(`long1#\\d+L`, `long2#\\d+L`\\)\\];(\n.*)*",
-  matchMsg = true)
+  errorSubClass = None,
+  sqlState = None,

Review Comment:
   It has, but if I remove the parameter settings, I get these errors:
   ```
   overloaded method value checkError with alternatives:
 (exception: org.apache.spark.SparkThrowable,errorClass: String,parameters: 
Map[String,String])Unit 
 (exception: org.apache.spark.SparkThrowable,errorClass: String,sqlState: 
String,parameters: Map[String,String])Unit 
 (exception: org.apache.spark.SparkThrowable,errorClass: 
String,errorSubClass: String,sqlState: String,parameters: 
Map[String,String])Unit 
 (exception: org.apache.spark.SparkThrowable,errorClass: 
String,errorSubClass: Option[String],sqlState: Option[String],parameters: 
Map[String,String],matchPVals: Boolean)Unit
cannot be applied to (exception: org.apache.spark.sql.AnalysisException, 
errorClass: String, parameters: scala.collection.immutable.Map[String,String], 
matchPVals: Boolean)
   checkError(
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Jonathancui123 commented on a diff in pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false

2022-07-28 Thread GitBox


Jonathancui123 commented on code in PR #37327:
URL: https://github.com/apache/spark/pull/37327#discussion_r932486986


##
docs/sql-data-sources-csv.md:
##
@@ -109,7 +109,7 @@ Data source options of CSV can be set via:
 read
   
   
-inferDate 
+preferDate
 false
 Whether or not to infer columns that satisfy the 
dateFormat option as Date. Requires 
inferSchema to be true. When false, 
columns with dates will be inferred as String (or as 
Timestamp if it fits the timestampFormat).

Review Comment:
   ```suggestion
   preferDate
   false
   Whether or not to infer columns that satisfy the 
dateFormat option as Date. Inference requires 
inferSchema to be true. When false, 
columns with dates will be inferred as String (or as 
Timestamp if it fits the timestampFormat). When 
true, the parser will attempt entries in timestamp columns as 
DateType before other fallbacks
   ```
   
   What do you think of the following docs change to clarify the behavior? 
@cloud-fan @HyukjinKwon 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Jonathancui123 commented on a diff in pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false

2022-07-28 Thread GitBox


Jonathancui123 commented on code in PR #37327:
URL: https://github.com/apache/spark/pull/37327#discussion_r932486986


##
docs/sql-data-sources-csv.md:
##
@@ -109,7 +109,7 @@ Data source options of CSV can be set via:
 read
   
   
-inferDate 
+preferDate
 false
 Whether or not to infer columns that satisfy the 
dateFormat option as Date. Requires 
inferSchema to be true. When false, 
columns with dates will be inferred as String (or as 
Timestamp if it fits the timestampFormat).

Review Comment:
   ```suggestion
   preferDate
   false
   Whether or not to infer columns that satisfy the 
dateFormat option as Date. Inference requires 
inferSchema to be true. When false, 
columns with dates will be inferred as String (or as 
Timestamp if it fits the timestampFormat). When 
true, the parser will attempt to parse entries in timestamp 
columns as DateType before other fallbacks or failing.
   ```
   
   What do you think of the following docs change to clarify the behavior? 
@cloud-fan @HyukjinKwon 
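   For illustration, a minimal sketch of how the proposed option would be used 
from PySpark (the `preferDate` name is the rename proposed in this PR and is 
not in a released Spark; the file path and data are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Hypothetical sample file with one date column and one timestamp column.
path = "/tmp/prefer_date_demo.csv"
with open(path, "w") as f:
    f.write("d,ts\n2020-01-01,2020-01-01 12:34:56\n")

df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")   # preferDate only applies when schema inference is on
    .option("preferDate", "true")    # name proposed in this PR; previously inferDate
    .csv(path)
)
df.printSchema()  # expectation: d inferred as date, ts as timestamp
```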



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on a diff in pull request #37322: [SPARK-39905][SQL][TESTS] Remove `checkErrorClass()` and use `checkError()` instead

2022-07-28 Thread GitBox


MaxGekk commented on code in PR #37322:
URL: https://github.com/apache/spark/pull/37322#discussion_r932506853


##
sql/core/src/test/scala/org/apache/spark/sql/DatasetUnpivotSuite.scala:
##
@@ -305,14 +305,17 @@ class DatasetUnpivotSuite extends QueryTest
   valueColumnName = "val"
 )
 }
-checkErrorClass(
+checkError(
   exception = e,
   errorClass = "UNPIVOT_VALUE_DATA_TYPE_MISMATCH",
-  msg = "Unpivot value columns must share a least common type, some types 
do not: \\[" +
-"\"STRING\" \\(`str1#\\d+`\\), " +
-"\"INT\" \\(`int1#\\d+`, `int2#\\d+`, `int3#\\d+`, ...\\), " +
-"\"BIGINT\" \\(`long1#\\d+L`, `long2#\\d+L`\\)\\];(\n.*)*",
-  matchMsg = true)
+  errorSubClass = None,
+  sqlState = None,

Review Comment:
   but I think it is better to set default values in the widest method. I will 
remove one overloaded method and add default values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] otterc commented on pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-07-28 Thread GitBox


otterc commented on PR #35906:
URL: https://github.com/apache/spark/pull/35906#issuecomment-1198491146

   > Should be easy to add. We can have a feature flag, and when initiate the 
RemoteBlockPushResolver, db can be set to null if this feature flag is turned 
off, and all the later DB operations won't be triggered. @mridulm @otterc What 
do you think?
   
   I don't see this as a feature. For the original shuffle, we support 
work-preserving restart; this completes the same support for push-based shuffle.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao commented on a diff in pull request #36995: [SPARK-39607][SQL][DSV2] Distribution and ordering support V2 function in writing

2022-07-28 Thread GitBox


sunchao commented on code in PR #36995:
URL: https://github.com/apache/spark/pull/36995#discussion_r932515356


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala:
##
@@ -17,22 +17,33 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
-import org.apache.spark.sql.catalyst.expressions.Expression
+import org.apache.spark.sql.catalyst.analysis.{AnsiTypeCoercion, TypeCoercion}
+import org.apache.spark.sql.catalyst.expressions.{Expression, Literal, 
SortOrder, TransformExpression, V2ExpressionUtils}
 import org.apache.spark.sql.catalyst.expressions.V2ExpressionUtils._
 import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, 
RebalancePartitions, RepartitionByExpression, Sort}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.connector.catalog.FunctionCatalog
+import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
 import org.apache.spark.sql.connector.distributions._
 import org.apache.spark.sql.connector.write.{RequiresDistributionAndOrdering, 
Write}
 import org.apache.spark.sql.errors.QueryCompilationErrors
 
 object DistributionAndOrderingUtils {
 
-  def prepareQuery(write: Write, query: LogicalPlan): LogicalPlan = write 
match {
+  def prepareQuery(

Review Comment:
   Hmm, I wonder how the write works with transforms such as bucket. For 
example, suppose the required distribution is `bucket(col, 100)`; Spark 
currently computes the partition (bucket) ID as `murmur_hash(bucket(col, 
100)) pmod 100`, so the value of `col` is essentially hashed twice. I'm not 
sure whether this breaks any assumption on the V2 data source side, or 
whether it affects the hash key distribution.
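   A toy illustration of the double-hashing concern (plain Python; the builtin 
`hash` stands in for both Spark's Murmur3 hash and the source's bucket 
function, nothing here uses Spark APIs):

```python
NUM_BUCKETS = 100

def bucket(value: int) -> int:
    # what the data source expects: rows co-located by bucket id
    return hash(value) % NUM_BUCKETS

def partition_id(clustering_key: int, num_partitions: int) -> int:
    # what Spark does when the bucket transform is just another clustering
    # expression: it hashes the transform's output a second time
    return hash(clustering_key) % num_partitions

col_value = 42
source_bucket = bucket(col_value)                     # bucket(col, 100)
shuffle_partition = partition_id(source_bucket, 100)  # murmur_hash(bucket(col, 100)) pmod 100
print(source_bucket, shuffle_partition)               # in general these differ
```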
   
   



##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala:
##
@@ -17,22 +17,33 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
-import org.apache.spark.sql.catalyst.expressions.Expression
+import org.apache.spark.sql.catalyst.analysis.{AnsiTypeCoercion, TypeCoercion}
+import org.apache.spark.sql.catalyst.expressions.{Expression, Literal, 
SortOrder, TransformExpression, V2ExpressionUtils}
 import org.apache.spark.sql.catalyst.expressions.V2ExpressionUtils._
 import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, 
RebalancePartitions, RepartitionByExpression, Sort}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.connector.catalog.FunctionCatalog
+import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
 import org.apache.spark.sql.connector.distributions._
 import org.apache.spark.sql.connector.write.{RequiresDistributionAndOrdering, 
Write}
 import org.apache.spark.sql.errors.QueryCompilationErrors
 
 object DistributionAndOrderingUtils {
 
-  def prepareQuery(write: Write, query: LogicalPlan): LogicalPlan = write 
match {
+  def prepareQuery(
+  write: Write,
+  query: LogicalPlan,
+  funCatalogOpt: Option[FunctionCatalog]): LogicalPlan = write match {
 case write: RequiresDistributionAndOrdering =>
   val numPartitions = write.requiredNumPartitions()
 
   val distribution = write.requiredDistribution match {
-case d: OrderedDistribution => toCatalystOrdering(d.ordering(), query)
-case d: ClusteredDistribution => d.clustering.map(e => toCatalyst(e, 
query)).toSeq
+case d: OrderedDistribution =>
+  toCatalystOrdering(d.ordering(), query, funCatalogOpt)
+.map(ur => resolveTransformExpression(ur).asInstanceOf[SortOrder])

Review Comment:
   nit: why is the variable named `ur`? Maybe change it to `e`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] EnricoMi commented on pull request #37304: [SPARK-39877][PySpark] Add unpivot to PySpark DataFrame API

2022-07-28 Thread GitBox


EnricoMi commented on PR #37304:
URL: https://github.com/apache/spark/pull/37304#issuecomment-1198506801

   > btw, you may also need to run `dev/reformat-python`
   
   Why do I have to reformat `python/pyspark/context.py`? That seems unrelated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] santosh-d3vpl3x closed pull request #37333: SPARK-39895 pyspark support multiple column drop

2022-07-28 Thread GitBox


santosh-d3vpl3x closed pull request #37333: SPARK-39895 pyspark support 
multiple column drop
URL: https://github.com/apache/spark/pull/37333


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on a diff in pull request #37327: [SPARK-39904][SQL] Rename inferDate to preferDate and add check for inferSchema = false

2022-07-28 Thread GitBox


sadikovi commented on code in PR #37327:
URL: https://github.com/apache/spark/pull/37327#discussion_r932712356


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -153,19 +153,24 @@ class CSVOptions(
* Disabled by default for backwards compatibility and performance. When 
enabled, date entries in
* timestamp columns will be cast to timestamp upon parsing. Not compatible 
with
* legacyTimeParserPolicy == LEGACY since legacy date parser will accept 
extra trailing characters
+   *
+   * The flag is only enabled if inferSchema is set to true.
*/
-  val inferDate = {
-val inferDateFlag = getBool("inferDate")
-if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && 
inferDateFlag) {
+  val preferDate = {
+val preferDateFlag = getBool("preferDate")
+if (preferDateFlag && SQLConf.get.legacyTimeParserPolicy == 
LegacyBehaviorPolicy.LEGACY) {
   throw QueryExecutionErrors.inferDateWithLegacyTimeParserError()
 }
-inferDateFlag
+if (preferDateFlag && !inferSchemaFlag) {

Review Comment:
   Okay, I can do that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang closed pull request #37311: [SPARK-39865][SQL][3.3] Show proper error messages on the overflow errors of table insert

2022-07-28 Thread GitBox


gengliangwang closed pull request #37311: [SPARK-39865][SQL][3.3] Show proper 
error messages on the overflow errors of table insert 
URL: https://github.com/apache/spark/pull/37311


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] peter-toth opened a new pull request, #37334: [SPARK-39887][SQL] RemoveRedundantAliases should keep attributes of a Union's first child

2022-07-28 Thread GitBox


peter-toth opened a new pull request, #37334:
URL: https://github.com/apache/spark/pull/37334

   ### What changes were proposed in this pull request?
   Keep the output attributes of a `Union` node's first child in the 
`RemoveRedundantAliases` rule to avoid correctness issues.
   
   ### Why are the changes needed?
   To fix the result of the following query:
   ```
   SELECT a, b AS a FROM (
 SELECT a, a AS b FROM (SELECT a FROM VALUES (1) AS t(a))
 UNION ALL
 SELECT a, b FROM (SELECT a, b FROM VALUES (1, 2) AS t(a, b))
   )
   ```
   Before this PR the query returns the incorrect result: 
   ```
   +---+---+
   |  a|  a|
   +---+---+
   |  1|  1|
   |  2|  2|
   +---+---+
   ```
   After this PR it returns the expected result:
   ```
   +---+---+
   |  a|  a|
   +---+---+
   |  1|  1|
   |  1|  2|
   +---+---+
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, fixes a correctness issue.
   
   ### How was this patch tested?
   Added new UTs.
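   For reference, a minimal PySpark reproduction of the query above (the fix 
itself lives in the Scala optimizer rule; the query is copied from the 
description):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# On affected versions the second row comes back as (2, 2)
# instead of the expected (1, 2).
spark.sql("""
    SELECT a, b AS a FROM (
      SELECT a, a AS b FROM (SELECT a FROM VALUES (1) AS t(a))
      UNION ALL
      SELECT a, b FROM (SELECT a, b FROM VALUES (1, 2) AS t(a, b))
    )
""").show()
```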


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] peter-toth commented on pull request #37319: [SPARK-39887][SQL] `PullOutGroupingExpressions` should generate different alias names

2022-07-28 Thread GitBox


peter-toth commented on PR #37319:
URL: https://github.com/apache/spark/pull/37319#issuecomment-1198525757

   I've opened a PR with my proposal here: 
https://github.com/apache/spark/pull/37334


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] santosh-d3vpl3x opened a new pull request, #37335: SPARK-39895 pyspark support multiple column drop

2022-07-28 Thread GitBox


santosh-d3vpl3x opened a new pull request, #37335:
URL: https://github.com/apache/spark/pull/37335

   * SPARK-39895 pyspark support multiple column drop
   
   ### What changes were proposed in this pull request?
   Fixes issues related to typing in the PySpark API.
   
   ### Why are the changes needed?
   We expect that multiple columns can be handled by the `drop` call on a 
DataFrame, as its type hints suggest, but that is not the case (see the sketch below).
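
   A minimal sketch of the behavior gap (the failure of the list form is an 
assumption based on this description and is left commented out, not verified here):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])

# The varargs form handles multiple columns today:
df.drop("a", "b").show()

# The list form is what this PR is about; per the description it is not
# handled and raises a TypeError on current releases:
# df.drop(["a", "b"]).show()
```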
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, it fixes issues related to typing in the PySpark API.
   
   ### How was this patch tested?
   CI pipeline on the fork and CI here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #37280: [SPARK-39862][SQL] Fix bug in existence DEFAULT value lookups for V2 data sources

2022-07-28 Thread GitBox


gengliangwang commented on PR #37280:
URL: https://github.com/apache/spark/pull/37280#issuecomment-1198601705

   @dtenedor could you also update the PR description about the ORC fix?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on pull request #37290: [SPARK-37194][SQL] Avoid unnecessary sort in v1 write if it's not dynamic partition

2022-07-28 Thread GitBox


ulysses-you commented on PR #37290:
URL: https://github.com/apache/spark/pull/37290#issuecomment-1197941930

   cc @viirya @cloud-fan @c21 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cfmcgrady commented on pull request #37319: [SPARK-39887][SQL] `PullOutGroupingExpressions` should generate different alias names

2022-07-28 Thread GitBox


cfmcgrady commented on PR #37319:
URL: https://github.com/apache/spark/pull/37319#issuecomment-1197968855

   Hi @peter-toth, thank you for your feedback.
   While these changes to `RemoveRedundantAliases` solve this issue, they break 
the guarantee of `alias removal should not break after push project through 
union`.
   
   
https://github.com/apache/spark/blob/0f9c1a2e848fbdfb17af0555d0c8be7d5a7191bb/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAliasAndProjectSuite.scala#L96-L104
   
   ```
   == FAIL: Plans do not match ===
Union false, false   Union false, false
   !:- Project [a#0 AS a#0]  :- LocalRelation , [a#0]
   !:  +- LocalRelation , [a#0]   +- LocalRelation , [b#0]
   !+- LocalRelation , [b#0]  

   ScalaTestFailureLocation: org.apache.spark.sql.catalyst.plans.PlanTestBase 
at (PlanTest.scala:177)
   org.scalatest.exceptions.TestFailedException: 
   == FAIL: Plans do not match ===
Union false, false   Union false, false
   !:- Project [a#0 AS a#0]  :- LocalRelation , [a#0]
   !:  +- LocalRelation , [a#0]   +- LocalRelation , [b#0]
   !+- LocalRelation , [b#0] 
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


