[GitHub] [spark] pan3793 commented on a diff in pull request #36496: [SPARK-39104][SQL] InMemoryRelation#isCachedColumnBuffersLoaded should be thread-safe

2022-05-17 Thread GitBox
pan3793 commented on code in PR #36496: URL: https://github.com/apache/spark/pull/36496#discussion_r874423234 ## sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala: ## @@ -563,4 +564,51 @@ class InMemoryColumnarQuerySuite extends
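The fix under review is Scala, but the concern is generic: a "have the cached column buffers been loaded?" probe must not race with the thread that publishes them. A minimal Python sketch of the pattern (class and method names are illustrative, not Spark's actual code):

```python
import threading

class CachedRelation:
    """Illustrative analogue of guarding a lazily built cache so that a
    'loaded?' probe from another thread is safe (not Spark's actual code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._buffers = None  # built lazily on first materialization

    def materialize(self, rows):
        # Publish the buffers under the lock so readers never observe
        # a partially constructed value.
        with self._lock:
            if self._buffers is None:
                self._buffers = list(rows)
        return self._buffers

    def is_cached_column_buffers_loaded(self):
        # Read under the same lock as the writer.
        with self._lock:
            return self._buffers is not None
```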

[GitHub] [spark] sadikovi commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox
sadikovi commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874433212 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala: ## @@ -30,29 +30,16 @@ import

[GitHub] [spark] Yikun commented on a diff in pull request #36509: [SPARK-38961][PYTHON][DOCS] Enhance to automatically generate the the pandas API support list

2022-05-17 Thread GitBox
Yikun commented on code in PR #36509: URL: https://github.com/apache/spark/pull/36509#discussion_r874452951 ## python/pyspark/pandas/supported_api_gen.py: ## @@ -0,0 +1,377 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license

[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox
gengliangwang commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874425426 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala: ## @@ -30,29 +30,16 @@ import
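The PR speeds up timestamp inference for JSON/CSV sources; the general technique is to reject non-timestamp strings with a cheap shape check before paying for a full parse. A hedged Python sketch (the pattern and format are illustrative, not Spark's actual parser):

```python
import re
from datetime import datetime

# Cheap shape check: most non-timestamp strings fail here without
# a costly parse attempt (pattern is illustrative, not Spark's).
_TS_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")

def try_infer_timestamp(s: str):
    if not _TS_SHAPE.match(s):
        return None  # fast path: bail out early
    try:
        return datetime.strptime(s[:19].replace("T", " "),
                                 "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None  # shape matched but the value is invalid
```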

[GitHub] [spark] cloud-fan commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox
cloud-fan commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874403362 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -178,7 +164,8 @@ class CSVInferSchema(val options: CSVOptions) extends

[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox
gengliangwang commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874474944 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala: ## @@ -456,4 +456,19 @@ class TimestampFormatterSuite extends

[GitHub] [spark] sadikovi commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox
sadikovi commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874424921 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -178,7 +164,8 @@ class CSVInferSchema(val options: CSVOptions) extends

[GitHub] [spark] LuciferYang commented on a diff in pull request #36496: [SPARK-39104][SQL] InMemoryRelation#isCachedColumnBuffersLoaded should be thread-safe

2022-05-17 Thread GitBox
LuciferYang commented on code in PR #36496: URL: https://github.com/apache/spark/pull/36496#discussion_r874402726 ## sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala: ## @@ -563,4 +564,51 @@ class InMemoryColumnarQuerySuite

[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox
gengliangwang commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874415392 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -178,7 +164,8 @@ class CSVInferSchema(val options: CSVOptions)

[GitHub] [spark] Yikun commented on a diff in pull request #36509: [SPARK-38961][PYTHON][DOCS] Enhance to automatically generate the the pandas API support list

2022-05-17 Thread GitBox
Yikun commented on code in PR #36509: URL: https://github.com/apache/spark/pull/36509#discussion_r874457721 ## python/pyspark/pandas/supported_api_gen.py: ## @@ -0,0 +1,377 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license

[GitHub] [spark] beobest2 commented on a diff in pull request #36509: [SPARK-38961][PYTHON][DOCS] Enhance to automatically generate the the pandas API support list

2022-05-17 Thread GitBox
beobest2 commented on code in PR #36509: URL: https://github.com/apache/spark/pull/36509#discussion_r874470001 ## python/pyspark/pandas/supported_api_gen.py: ## @@ -0,0 +1,377 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license

[GitHub] [spark] cloud-fan commented on a diff in pull request #36295: [SPARK-38978][SQL] Support push down OFFSET to JDBC data source V2

2022-05-17 Thread GitBox
cloud-fan commented on code in PR #36295: URL: https://github.com/apache/spark/pull/36295#discussion_r874584082 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownLimit.java: ## @@ -21,8 +21,8 @@ /** * A mix-in interface for {@link
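The thread discusses pushing OFFSET down to a JDBC source alongside LIMIT. A toy sketch of the generic shape of the generated query (clause syntax varies by dialect; the function is illustrative, not Spark's dialect API):

```python
def build_jdbc_query(table, limit=None, offset=None):
    """Append pushed-down LIMIT/OFFSET clauses to a generated query.
    Illustrative only; real dialects differ (e.g. ROWNUM, FETCH FIRST)."""
    sql = f"SELECT * FROM {table}"
    if limit is not None:
        sql += f" LIMIT {limit}"
    if offset is not None:
        sql += f" OFFSET {offset}"
    return sql
```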

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36576: [SPARK-32268][SQL][TESTS][FOLLOW-UP] Use function registry in the SparkSession

2022-05-17 Thread GitBox
HyukjinKwon commented on code in PR #36576: URL: https://github.com/apache/spark/pull/36576#discussion_r874632251 ## sql/core/src/test/scala/org/apache/spark/sql/BloomFilterAggregateQuerySuite.scala: ## @@ -35,23 +34,26 @@ class BloomFilterAggregateQuerySuite extends QueryTest

[GitHub] [spark] jackylee-ch commented on pull request #36578: [SPARK-39207][SQL] Record the SQL text when executing a query using SparkSession.sql()

2022-05-17 Thread GitBox
jackylee-ch commented on PR #36578: URL: https://github.com/apache/spark/pull/36578#issuecomment-1128721543 Great job. BTW, is it possible for the user to define the Description? Sometimes the SQL text is too big to show in the Description, so a user-defined text would be very helpful to find the

[GitHub] [spark] AmplabJenkins commented on pull request #36561: [SPARK-37939][SQL] Use error classes in the parsing errors of properties

2022-05-17 Thread GitBox
AmplabJenkins commented on PR #36561: URL: https://github.com/apache/spark/pull/36561#issuecomment-1128547684 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

[GitHub] [spark] panbingkun commented on pull request #36540: [SPARK-38466][CORE] Use error classes in org.apache.spark.mapred

2022-05-17 Thread GitBox
panbingkun commented on PR #36540: URL: https://github.com/apache/spark/pull/36540#issuecomment-1128547367 @MaxGekk ping

[GitHub] [spark] cloud-fan commented on a diff in pull request #36295: [SPARK-38978][SQL] Support push down OFFSET to JDBC data source V2

2022-05-17 Thread GitBox
cloud-fan commented on code in PR #36295: URL: https://github.com/apache/spark/pull/36295#discussion_r874589600 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -419,6 +420,72 @@ object V2ScanRelationPushDown extends

[GitHub] [spark] HyukjinKwon opened a new pull request, #36576: [SPARK-32268][SQL][TESTS][FOLLOW-UP] Use function registry in the SparkSession

2022-05-17 Thread GitBox
HyukjinKwon opened a new pull request, #36576: URL: https://github.com/apache/spark/pull/36576 ### What changes were proposed in this pull request? This PR proposes: 1. Use the function registry in the Spark Session being used 2. Move function registration into `beforeAll`

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36576: [SPARK-32268][SQL][TESTS][FOLLOW-UP] Use function registry in the SparkSession

2022-05-17 Thread GitBox
HyukjinKwon commented on code in PR #36576: URL: https://github.com/apache/spark/pull/36576#discussion_r874625022 ## sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala: ## @@ -147,6 +147,9 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with

[GitHub] [spark] LuciferYang commented on a diff in pull request #36578: [SPARK-39207][SQL] Record the SQL text when executing a query using SparkSession.sql()

2022-05-17 Thread GitBox
LuciferYang commented on code in PR #36578: URL: https://github.com/apache/spark/pull/36578#discussion_r874671414 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -82,7 +82,7 @@ object SQLExecution { val redactedStr = Utils
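The diff touches the `redactedStr` handling in `SQLExecution`: before the query text is recorded for the UI, sensitive fragments are masked. A minimal Python sketch of that kind of regex-based redaction (patterns and mask token are illustrative, not Spark's `Utils.redact`):

```python
import re

def redact(text: str,
           patterns=(r"(?i)(password|secret|token)\s*=\s*\S+",)):
    """Mask sensitive fragments before a query string is recorded.
    Patterns and the replacement token are illustrative assumptions."""
    for pat in patterns:
        text = re.sub(pat, "*********(redacted)", text)
    return text
```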

[GitHub] [spark] MaxGekk opened a new pull request, #36579: [WIP][SQL] Use double quotes for values of SQL configs/DS options in error messages

2022-05-17 Thread GitBox
MaxGekk opened a new pull request, #36579: URL: https://github.com/apache/spark/pull/36579 ### What changes were proposed in this pull request? Wrap values of SQL configs and datasource options in error messages by double quotes. Added the `toDSOption()` method to `QueryErrorsBase` to

[GitHub] [spark] beliefer commented on a diff in pull request #36295: [SPARK-38978][SQL] Support push down OFFSET to JDBC data source V2

2022-05-17 Thread GitBox
beliefer commented on code in PR #36295: URL: https://github.com/apache/spark/pull/36295#discussion_r874727085 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala: ## @@ -304,10 +307,11 @@ private[jdbc] class JDBCRDD( } val

[GitHub] [spark] cloud-fan commented on pull request #36572: [SPARK-36718][SQL][FOLLOWUP] Improve the extract-only check in CollapseProject

2022-05-17 Thread GitBox
cloud-fan commented on PR #36572: URL: https://github.com/apache/spark/pull/36572#issuecomment-1128539432 thanks for the review, merging to master/3.3!

[GitHub] [spark] cloud-fan commented on a diff in pull request #36531: [SPARK-39171][SQL] Unify the Cast expression

2022-05-17 Thread GitBox
cloud-fan commented on code in PR #36531: URL: https://github.com/apache/spark/pull/36531#discussion_r874528129 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala: ## @@ -2138,199 +2287,28 @@ case class Cast( final override def

[GitHub] [spark] cloud-fan commented on a diff in pull request #36295: [SPARK-38978][SQL] Support push down OFFSET to JDBC data source V2

2022-05-17 Thread GitBox
cloud-fan commented on code in PR #36295: URL: https://github.com/apache/spark/pull/36295#discussion_r874586232 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala: ## @@ -304,10 +307,11 @@ private[jdbc] class JDBCRDD( } val

[GitHub] [spark] beliefer commented on a diff in pull request #36295: [SPARK-38978][SQL] Support push down OFFSET to JDBC data source V2

2022-05-17 Thread GitBox
beliefer commented on code in PR #36295: URL: https://github.com/apache/spark/pull/36295#discussion_r874600195 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownLimit.java: ## @@ -21,8 +21,8 @@ /** * A mix-in interface for {@link

[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox
gengliangwang commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874492296 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala: ## @@ -456,4 +456,19 @@ class TimestampFormatterSuite extends

[GitHub] [spark] zero323 commented on pull request #36547: [SPARK-39197][PYTHON] Implement `skipna` parameter of `GroupBy.all`

2022-05-17 Thread GitBox
zero323 commented on PR #36547: URL: https://github.com/apache/spark/pull/36547#issuecomment-1128565616 By itself LGTM. To concur with others, I also don't see a shading issue, and if there was one, we're not introducing a new method here and changing a name at this point would be a

[GitHub] [spark] cloud-fan commented on a diff in pull request #36576: [SPARK-32268][SQL][TESTS][FOLLOW-UP] Use function registry in the SparkSession

2022-05-17 Thread GitBox
cloud-fan commented on code in PR #36576: URL: https://github.com/apache/spark/pull/36576#discussion_r874598262 ## sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala: ## @@ -147,6 +147,9 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with

[GitHub] [spark] panbingkun commented on pull request #36548: [SPARK-38470][CORE] Use error classes in org.apache.spark.partial

2022-05-17 Thread GitBox
panbingkun commented on PR #36548: URL: https://github.com/apache/spark/pull/36548#issuecomment-1128666500 @bozhang2820 @MaxGekk ping

[GitHub] [spark] gengliangwang opened a new pull request, #36577: [SPARK-39208][SQL] Fix query context bugs in decimal overflow under codegen mode

2022-05-17 Thread GitBox
gengliangwang opened a new pull request, #36577: URL: https://github.com/apache/spark/pull/36577 ### What changes were proposed in this pull request? 1. Fix logical bugs in adding query contexts as references under codegen mode.

[GitHub] [spark] linhongliu-db opened a new pull request, #36578: [SPARK-39207][SQL] Record the query text when executed with SQL API

2022-05-17 Thread GitBox
linhongliu-db opened a new pull request, #36578: URL: https://github.com/apache/spark/pull/36578 ### What changes were proposed in this pull request? Record the query text when executed with SQL API. ### Why are the changes needed? * When executing a query using

[GitHub] [spark] linhongliu-db commented on pull request #36578: [SPARK-39207][SQL] Record the query text when executed with SparkSession.sql()

2022-05-17 Thread GitBox
linhongliu-db commented on PR #36578: URL: https://github.com/apache/spark/pull/36578#issuecomment-1128693761 cc @cloud-fan, I think it should be useful to record the original SQL text of a query.

[GitHub] [spark] cloud-fan closed pull request #36572: [SPARK-36718][SQL][FOLLOWUP] Improve the extract-only check in CollapseProject

2022-05-17 Thread GitBox
cloud-fan closed pull request #36572: [SPARK-36718][SQL][FOLLOWUP] Improve the extract-only check in CollapseProject URL: https://github.com/apache/spark/pull/36572

[GitHub] [spark] cloud-fan commented on a diff in pull request #36531: [SPARK-39171][SQL] Unify the Cast expression

2022-05-17 Thread GitBox
cloud-fan commented on code in PR #36531: URL: https://github.com/apache/spark/pull/36531#discussion_r874526787 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala: ## @@ -41,6 +40,113 @@ import org.apache.spark.unsafe.types.{CalendarInterval,

[GitHub] [spark] cloud-fan commented on a diff in pull request #36531: [SPARK-39171][SQL] Unify the Cast expression

2022-05-17 Thread GitBox
cloud-fan commented on code in PR #36531: URL: https://github.com/apache/spark/pull/36531#discussion_r874525801 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala: ## @@ -772,15 +772,19 @@ abstract class TypeCoercionBase { case e if

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36576: [SPARK-32268][SQL][TESTS][FOLLOW-UP] Use function registry in the SparkSession

2022-05-17 Thread GitBox
HyukjinKwon commented on code in PR #36576: URL: https://github.com/apache/spark/pull/36576#discussion_r874624321 ## sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala: ## @@ -147,6 +147,9 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with

[GitHub] [spark] cloud-fan commented on a diff in pull request #36295: [SPARK-38978][SQL] Support push down OFFSET to JDBC data source V2

2022-05-17 Thread GitBox
cloud-fan commented on code in PR #36295: URL: https://github.com/apache/spark/pull/36295#discussion_r874587041 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -44,6 +44,7 @@ object V2ScanRelationPushDown extends

[GitHub] [spark] cloud-fan commented on a diff in pull request #36295: [SPARK-38978][SQL] Support push down OFFSET to JDBC data source V2

2022-05-17 Thread GitBox
cloud-fan commented on code in PR #36295: URL: https://github.com/apache/spark/pull/36295#discussion_r874587459 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -44,6 +44,7 @@ object V2ScanRelationPushDown extends

[GitHub] [spark] physinet commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-17 Thread GitBox
physinet commented on code in PR #36545: URL: https://github.com/apache/spark/pull/36545#discussion_r874781396 ## python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst: ## @@ -0,0 +1,23 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more

[GitHub] [spark] gengliangwang commented on pull request #36582: [SPARK-39210][SQL] Provide query context of Decimal overflow in AVG when WSCG is off

2022-05-17 Thread GitBox
gengliangwang commented on PR #36582: URL: https://github.com/apache/spark/pull/36582#issuecomment-1129038094 This should be the last of the query context fixes for when WSCG is not available.

[GitHub] [spark] gengliangwang opened a new pull request, #36582: [SPARK-39210][SQL] Provide query context of Decimal overflow in AVG when WSCG is off

2022-05-17 Thread GitBox
gengliangwang opened a new pull request, #36582: URL: https://github.com/apache/spark/pull/36582 ### What changes were proposed in this pull request? Similar to https://github.com/apache/spark/pull/36525, this PR provides runtime error query context for the Average expression

[GitHub] [spark] srowen commented on pull request #36496: [SPARK-39104][SQL] InMemoryRelation#isCachedColumnBuffersLoaded should be thread-safe

2022-05-17 Thread GitBox
srowen commented on PR #36496: URL: https://github.com/apache/spark/pull/36496#issuecomment-1128869583 Hm, try tests again? I'm having trouble seeing the error. I thought it might be MiMa, because you make a method private, but not sure that's it.

[GitHub] [spark] panbingkun opened a new pull request, #36580: [SPARK-39167][SQL] Throw an exception w/ an error class for multiple rows from a subquery used as an expression

2022-05-17 Thread GitBox
panbingkun opened a new pull request, #36580: URL: https://github.com/apache/spark/pull/36580 ### What changes were proposed in this pull request? In the PR, I propose to use the MULTI_VALUE_SUBQUERY_ERROR error classes for multiple rows from a subquery used as an expression. ###

[GitHub] [spark] pan3793 commented on pull request #36496: [SPARK-39104][SQL] InMemoryRelation#isCachedColumnBuffersLoaded should be thread-safe

2022-05-17 Thread GitBox
pan3793 commented on PR #36496: URL: https://github.com/apache/spark/pull/36496#issuecomment-1128885052 Hmm, let me check the error message first

[GitHub] [spark] srowen commented on pull request #36567: [SPARK-39196][CORE][SQL][K8S] replace `getOrElse(null)` with `orNull`

2022-05-17 Thread GitBox
srowen commented on PR #36567: URL: https://github.com/apache/spark/pull/36567#issuecomment-112223 Merged to master

[GitHub] [spark] srowen commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox
srowen commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874840258 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala: ## @@ -52,6 +52,25 @@ sealed trait TimestampFormatter extends Serializable {

[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox
gengliangwang commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874864869 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala: ## @@ -52,6 +52,25 @@ sealed trait TimestampFormatter extends

[GitHub] [spark] pan3793 commented on pull request #36496: [SPARK-39104][SQL] InMemoryRelation#isCachedColumnBuffersLoaded should be thread-safe

2022-05-17 Thread GitBox
pan3793 commented on PR #36496: URL: https://github.com/apache/spark/pull/36496#issuecomment-1128945680 Two jobs failed: the Hive slow tests failed because of OOM, and the other is pyspark (not familiar with Python). Re-triggered.

[GitHub] [spark] gengliangwang commented on pull request #36577: [SPARK-39208][SQL] Fix query context bugs in decimal overflow under codegen mode

2022-05-17 Thread GitBox
gengliangwang commented on PR #36577: URL: https://github.com/apache/spark/pull/36577#issuecomment-1128945945 Merging to master/3.3

[GitHub] [spark] gengliangwang closed pull request #36577: [SPARK-39208][SQL] Fix query context bugs in decimal overflow under codegen mode

2022-05-17 Thread GitBox
gengliangwang closed pull request #36577: [SPARK-39208][SQL] Fix query context bugs in decimal overflow under codegen mode URL: https://github.com/apache/spark/pull/36577

[GitHub] [spark] srowen commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-17 Thread GitBox
srowen commented on code in PR #36545: URL: https://github.com/apache/spark/pull/36545#discussion_r875033677 ## python/pyspark/sql/session.py: ## @@ -570,10 +570,20 @@ def _inferSchemaFromList( if not data: raise ValueError("can not infer schema from empty
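The change makes `_inferSchemaFromList` consider every element of a Python list rather than only the first when inferring an `ArrayType` element type. A toy sketch of merging element types across all values (simplified widening rules, not PySpark's actual merge logic):

```python
def infer_element_type(values):
    """Merge the types of *all* non-null elements instead of trusting
    the first one. Simplified, illustrative widening rules."""
    seen = {type(v) for v in values if v is not None}
    if not seen:
        return "null"
    if seen == {int}:
        return "bigint"
    if seen <= {int, float}:
        return "double"  # widen mixed int/float to double
    if seen == {str}:
        return "string"
    raise TypeError(
        f"cannot merge element types: {sorted(t.__name__ for t in seen)}")
```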

[GitHub] [spark] neilagupta commented on pull request #36441: [SPARK-39091][SQL] Updating specific SQL Expression traits that don't compose when multiple are extended due to nodePatterns being final.

2022-05-17 Thread GitBox
neilagupta commented on PR #36441: URL: https://github.com/apache/spark/pull/36441#issuecomment-1129086391 @AmplabJenkins any chance I could get someone with write access to review this?

[GitHub] [spark] HyukjinKwon commented on pull request #36576: [SPARK-32268][SQL][TESTS][FOLLOW-UP] Use function registry in the SparkSession

2022-05-17 Thread GitBox
HyukjinKwon commented on PR #36576: URL: https://github.com/apache/spark/pull/36576#issuecomment-1128914759 Merged to master and branch-3.3.

[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox
gengliangwang commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874872763 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala: ## @@ -52,6 +52,25 @@ sealed trait TimestampFormatter extends

[GitHub] [spark] Yikun opened a new pull request, #36581: [SPARK-39054][PYTHON][PS] Ensure infer schema accuracy in GroupBy.apply

2022-05-17 Thread GitBox
Yikun opened a new pull request, #36581: URL: https://github.com/apache/spark/pull/36581 ### What changes were proposed in this pull request? Ensure sampling rows >= 2 to make sure apply's infer schema is accurate. ### Why are the changes needed? GroupBy.apply infers schema

[GitHub] [spark] srowen closed pull request #36567: [SPARK-39196][CORE][SQL][K8S] replace `getOrElse(null)` with `orNull`

2022-05-17 Thread GitBox
srowen closed pull request #36567: [SPARK-39196][CORE][SQL][K8S] replace `getOrElse(null)` with `orNull` URL: https://github.com/apache/spark/pull/36567

[GitHub] [spark] srowen commented on pull request #36529: [SPARK-39102][CORE][SQL][DSTREAM] Add checkstyle rules to disabled use of Guava's `Files.createTempDir()`

2022-05-17 Thread GitBox
srowen commented on PR #36529: URL: https://github.com/apache/spark/pull/36529#issuecomment-1128890020 Merged to master

[GitHub] [spark] srowen closed pull request #36529: [SPARK-39102][CORE][SQL][DSTREAM] Add checkstyle rules to disabled use of Guava's `Files.createTempDir()`

2022-05-17 Thread GitBox
srowen closed pull request #36529: [SPARK-39102][CORE][SQL][DSTREAM] Add checkstyle rules to disabled use of Guava's `Files.createTempDir()` URL: https://github.com/apache/spark/pull/36529
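The checkstyle rule steers code away from Guava's deprecated `Files.createTempDir()` toward `java.nio.file.Files.createTempDirectory`, which creates the directory atomically with restrictive permissions. The Python standard-library analogue of the preferred approach:

```python
import os
import tempfile

# tempfile.mkdtemp creates the directory atomically and, on POSIX,
# readable/writable only by the owner -- the same property that makes
# java.nio.file.Files.createTempDirectory preferable to Guava's helper.
temp_dir = tempfile.mkdtemp(prefix="spark-")
assert os.path.isdir(temp_dir)
```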

[GitHub] [spark] pan3793 commented on pull request #36496: [SPARK-39104][SQL] InMemoryRelation#isCachedColumnBuffersLoaded should be thread-safe

2022-05-17 Thread GitBox
pan3793 commented on PR #36496: URL: https://github.com/apache/spark/pull/36496#issuecomment-1129007630 All tests pass now: https://github.com/pan3793/spark/runs/6471801942?check_suite_focus=true

[GitHub] [spark] physinet commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-17 Thread GitBox
physinet commented on code in PR #36545: URL: https://github.com/apache/spark/pull/36545#discussion_r874983582 ## python/pyspark/sql/session.py: ## @@ -570,10 +570,20 @@ def _inferSchemaFromList( if not data: raise ValueError("can not infer schema from

[GitHub] [spark] srowen commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-17 Thread GitBox
srowen commented on code in PR #36545: URL: https://github.com/apache/spark/pull/36545#discussion_r874838643 ## python/pyspark/sql/session.py: ## @@ -570,10 +570,20 @@ def _inferSchemaFromList( if not data: raise ValueError("can not infer schema from empty

[GitHub] [spark] HyukjinKwon closed pull request #36576: [SPARK-32268][SQL][TESTS][FOLLOW-UP] Use function registry in the SparkSession

2022-05-17 Thread GitBox
HyukjinKwon closed pull request #36576: [SPARK-32268][SQL][TESTS][FOLLOW-UP] Use function registry in the SparkSession URL: https://github.com/apache/spark/pull/36576

[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

2022-05-17 Thread GitBox
Eugene-Mark commented on PR #36499: URL: https://github.com/apache/spark/pull/36499#issuecomment-1129110003 @HyukjinKwon [SPARK-38846](https://issues.apache.org/jira/browse/SPARK-38846) shows that Teradata's Number type loses its fractional part after loading into Spark. We

[GitHub] [spark] abellina commented on pull request #36505: [SPARK-39131][SQL] Rewrite exists as LeftSemi earlier to allow filters to be inferred

2022-05-17 Thread GitBox
abellina commented on PR #36505: URL: https://github.com/apache/spark/pull/36505#issuecomment-1129117318 Update on the SPARK-32290 (single-column null-aware anti join optimization) failure: - The original test used a table in the subquery, `testData2`, which has no nulls, so I added

[GitHub] [spark] dtenedor commented on pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

2022-05-17 Thread GitBox
dtenedor commented on PR #36583: URL: https://github.com/apache/spark/pull/36583#issuecomment-1129125976 Note: this PR is based on https://github.com/apache/spark/pull/36501. The additional changes comprise about 15 lines of code, in this commit:

[GitHub] [spark] physinet commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-17 Thread GitBox
physinet commented on code in PR #36545: URL: https://github.com/apache/spark/pull/36545#discussion_r875099589 ## python/pyspark/sql/session.py: ## @@ -570,10 +570,20 @@ def _inferSchemaFromList( if not data: raise ValueError("can not infer schema from

[GitHub] [spark] MaxGekk commented on a diff in pull request #36553: [SPARK-39214][SQL] Improve errors related to CAST

2022-05-17 Thread GitBox
MaxGekk commented on code in PR #36553: URL: https://github.com/apache/spark/pull/36553#discussion_r875133200 ## core/src/main/resources/error/error-classes.json: ## @@ -22,8 +22,12 @@ "CANNOT_UP_CAST_DATATYPE" : { "message" : [ "Cannot up cast from to .\n" ] }, -

[GitHub] [spark] physinet commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-17 Thread GitBox
physinet commented on code in PR #36545: URL: https://github.com/apache/spark/pull/36545#discussion_r875142772 ## python/pyspark/sql/session.py: ## @@ -570,10 +570,20 @@ def _inferSchemaFromList( if not data: raise ValueError("can not infer schema from

[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

2022-05-17 Thread GitBox
Eugene-Mark commented on PR #36499: URL: https://github.com/apache/spark/pull/36499#issuecomment-1129099918 @srowen I'm also not a Teradata guy; I just invoke Teradata's API from Spark and found the issue. I didn't find documentation explaining the issue on the Teradata side. I tried to print

[GitHub] [spark] abellina commented on pull request #36505: [SPARK-39131][SQL] Rewrite exists as LeftSemi earlier to allow filters to be inferred

2022-05-17 Thread GitBox
abellina commented on PR #36505: URL: https://github.com/apache/spark/pull/36505#issuecomment-1129153161 > All other queries in the test are passing, except for the negative case for the multi-column support. It is commented out in my last patch (obviously that's not the solution)

[GitHub] [spark] srowen commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-17 Thread GitBox
srowen commented on code in PR #36545: URL: https://github.com/apache/spark/pull/36545#discussion_r875147010 ## python/pyspark/sql/session.py: ## @@ -570,10 +570,20 @@ def _inferSchemaFromList( if not data: raise ValueError("can not infer schema from empty

[GitHub] [spark] MaxGekk commented on a diff in pull request #36561: [SPARK-37939][SQL] Use error classes in the parsing errors of properties

2022-05-17 Thread GitBox
MaxGekk commented on code in PR #36561: URL: https://github.com/apache/spark/pull/36561#discussion_r875157818 ## sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala: ## @@ -642,4 +642,92 @@ class QueryParsingErrorsSuite extends QueryTest with

[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36560: [SPARK-39192][PS][SQL] Make pandas-on-spark's kurt consistent with pandas

2022-05-17 Thread GitBox
xinrong-databricks commented on code in PR #36560: URL: https://github.com/apache/spark/pull/36560#discussion_r875138420 ## python/pyspark/pandas/tests/test_generic_functions.py: ## @@ -150,8 +150,8 @@ def test_stat_functions(self):

[GitHub] [spark] vli-databricks opened a new pull request, #36584: [SPARK-39213] Create ANY_VALUE aggregate function

2022-05-17 Thread GitBox
vli-databricks opened a new pull request, #36584: URL: https://github.com/apache/spark/pull/36584 ### What changes were proposed in this pull request? Adds an implementation of the ANY_VALUE aggregate function. During the optimization stage it is rewritten to the `First` aggregate
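A hedged sketch of the semantics this PR describes: `ANY_VALUE` is free to return any value from the group, and rewriting it to a `First` aggregate (with matching null handling) is one valid implementation. Plain Python, and the function name and signature below are my own illustration:

```python
def any_value(values, ignore_nulls=True):
    """Sketch of ANY_VALUE semantics: the engine may return any value from
    the group; mapping it to first(value, ignoreNulls) is one valid choice,
    which is the rewrite the PR describes."""
    for v in values:
        # With ignore_nulls, skip leading nulls; otherwise take the very
        # first value as-is, exactly like a FIRST aggregate would.
        if v is not None or not ignore_nulls:
            return v
    return None  # empty group (or all nulls with ignore_nulls=True)

print(any_value([None, 42, 7]))               # skips the null, returns 42
print(any_value([None, 42], ignore_nulls=False))  # first value taken as-is
```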

[GitHub] [spark] MaxGekk commented on pull request #36579: [SPARK-39212][SQL] Use double quotes for values of SQL configs/DS options in error messages

2022-05-17 Thread GitBox
MaxGekk commented on PR #36579: URL: https://github.com/apache/spark/pull/36579#issuecomment-1129142133 @srielau @panbingkun Could you take a look at the PR, please.

[GitHub] [spark] srowen commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

2022-05-17 Thread GitBox
srowen commented on code in PR #36545: URL: https://github.com/apache/spark/pull/36545#discussion_r875104045 ## python/pyspark/sql/session.py: ## @@ -570,10 +570,20 @@ def _inferSchemaFromList( if not data: raise ValueError("can not infer schema from empty

[GitHub] [spark] MaxGekk commented on pull request #36553: [SPARK-39214][SQL] Improve errors related to CAST

2022-05-17 Thread GitBox
MaxGekk commented on PR #36553: URL: https://github.com/apache/spark/pull/36553#issuecomment-1129174496 cc @srielau

[GitHub] [spark] dtenedor commented on pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

2022-05-17 Thread GitBox
dtenedor commented on PR #36501: URL: https://github.com/apache/spark/pull/36501#issuecomment-1129093762 @HyukjinKwon I fixed the bad sync, this is ready to merge now at your convenience. -- This is an automated message from the Apache Git Service. To respond to the message, please log

[GitHub] [spark] srowen commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

2022-05-17 Thread GitBox
srowen commented on PR #36499: URL: https://github.com/apache/spark/pull/36499#issuecomment-1129130839 OK, I just wonder if this is specific to Teradata, or whether it can be changed elsewhere higher up in the abstraction layers. But you're saying the scale/precision info is lost in

[GitHub] [spark] dtenedor opened a new pull request, #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

2022-05-17 Thread GitBox
dtenedor opened a new pull request, #36583: URL: https://github.com/apache/spark/pull/36583 ### What changes were proposed in this pull request? Support JSON scans when the table schema has associated DEFAULT column values. Example: ``` create table t(i int) using
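A minimal illustration of the described behavior, assuming a JSON record that omits a column with a declared DEFAULT should receive that default rather than null. The function and column names are hypothetical; this is not the Spark implementation:

```python
import json

def scan_json_with_defaults(lines, schema_defaults):
    """Illustrative sketch: when a JSON record omits a column that has a
    DEFAULT value in the table schema, substitute the default instead of
    producing null. Only columns in the schema are kept."""
    for line in lines:
        record = json.loads(line)
        yield {col: record.get(col, default)
               for col, default in schema_defaults.items()}

# Second record omits "s", so the hypothetical DEFAULT fills it in.
rows = list(scan_json_with_defaults(
    ['{"i": 1, "s": "a"}', '{"i": 2}'],
    {"i": None, "s": "default-string"},
))
print(rows)
```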

[GitHub] [spark] xinrong-databricks commented on pull request #36547: [SPARK-39197][PYTHON] Implement `skipna` parameter of `GroupBy.all`

2022-05-17 Thread GitBox
xinrong-databricks commented on PR #36547: URL: https://github.com/apache/spark/pull/36547#issuecomment-1129166446 Rebased on master to retrigger an irrelevant failed test. No new changes after review.

[GitHub] [spark] xinrong-databricks commented on pull request #36547: [SPARK-39197][PYTHON] Implement `skipna` parameter of `GroupBy.all`

2022-05-17 Thread GitBox
xinrong-databricks commented on PR #36547: URL: https://github.com/apache/spark/pull/36547#issuecomment-1129178343 > Boolean cast not only is not going to cover all types, but also yields different results in some cases Would you give an example in which case we may diverge from

[GitHub] [spark] zero323 commented on pull request #36547: [SPARK-39197][PYTHON] Implement `skipna` parameter of `GroupBy.all`

2022-05-17 Thread GitBox
zero323 commented on PR #36547: URL: https://github.com/apache/spark/pull/36547#issuecomment-1129228724 > Would you give an example in which case we may diverge from pandas? Sure thing @xinrong-databricks. Sorry for being enigmatic before. So, a very simple case would be something
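The divergence under discussion can be shown with plain Python truthiness: a naive boolean cast answers differently than a SQL-style cast would. The example values below are my own, not taken from the thread:

```python
def naive_all_via_bool_cast(values, skipna=True):
    """Sketch of why a plain boolean cast can diverge from pandas semantics
    (the concern raised in the thread; the input values are my own)."""
    kept = [v for v in values if v is not None] if skipna else values
    return all(bool(v) for v in kept)

# Python truthiness says the non-empty string "false" is truthy, while a
# SQL-style cast of the string "false" to boolean would yield false; the
# same input produces different answers under the two interpretations.
print(naive_all_via_bool_cast(["false", "true"]))  # True under truthiness
```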

[GitHub] [spark] srowen commented on pull request #36496: [SPARK-39104][SQL] InMemoryRelation#isCachedColumnBuffersLoaded should be thread-safe

2022-05-17 Thread GitBox
srowen commented on PR #36496: URL: https://github.com/apache/spark/pull/36496#issuecomment-1129416530 Merged to master/3.3/3.2

[GitHub] [spark] dtenedor commented on pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

2022-05-17 Thread GitBox
dtenedor commented on PR #36583: URL: https://github.com/apache/spark/pull/36583#issuecomment-1129351769 > Is this [[SPARK-38067](https://issues.apache.org/jira/browse/SPARK-38067)][PYTHON] Preserve None values when saved to

[GitHub] [spark] HyukjinKwon commented on pull request #36581: [SPARK-39054][PYTHON][PS] Ensure infer schema accuracy in GroupBy.apply

2022-05-17 Thread GitBox
HyukjinKwon commented on PR #36581: URL: https://github.com/apache/spark/pull/36581#issuecomment-1129428282 Merged to master.

[GitHub] [spark] amaliujia commented on a diff in pull request #36586: [DO NOT MERGE] test catalog API changes

2022-05-17 Thread GitBox
amaliujia commented on code in PR #36586: URL: https://github.com/apache/spark/pull/36586#discussion_r875377592 ## sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala: ## @@ -204,8 +211,12 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog {

[GitHub] [spark] cloud-fan commented on a diff in pull request #36586: [DO NOT MERGE] test catalog API changes

2022-05-17 Thread GitBox
cloud-fan commented on code in PR #36586: URL: https://github.com/apache/spark/pull/36586#discussion_r875383064 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala: ## @@ -1025,9 +1025,14 @@ abstract class CatalogTestUtils { def

[GitHub] [spark] cloud-fan commented on a diff in pull request #36586: [DO NOT MERGE] test catalog API changes

2022-05-17 Thread GitBox
cloud-fan commented on code in PR #36586: URL: https://github.com/apache/spark/pull/36586#discussion_r875383670 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExec.scala: ## @@ -38,12 +38,14 @@ case class ShowTablesExec( val rows = new

[GitHub] [spark] HyukjinKwon opened a new pull request, #36587: [SPARK-39215][PYTHON] Reduce Py4J calls in pyspark.sql.utils.is_timestamp_ntz_preferred

2022-05-17 Thread GitBox
HyukjinKwon opened a new pull request, #36587: URL: https://github.com/apache/spark/pull/36587 ### What changes were proposed in this pull request? This PR proposes to reduce the number of Py4J calls at `pyspark.sql.utils.is_timestamp_ntz_preferred` by having a single method to
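The optimization shape described here, fewer Python-to-JVM round trips by consolidating the work into a single JVM-side method, can be sketched with a stand-in gateway (nothing below is the real pyspark or Py4J API):

```python
class FakeJvm:
    """Stand-in for a Py4J gateway: each method call models one
    Python-to-JVM round trip. Names are hypothetical, not pyspark internals."""

    def __init__(self):
        self.round_trips = 0
        self._conf = {"spark.sql.timestampType": "TIMESTAMP_NTZ"}

    def get(self, key):
        self.round_trips += 1
        return self._conf.get(key)

    def is_timestamp_ntz_preferred(self):
        # Consolidated helper: the whole check runs "JVM-side", so the
        # Python caller pays for exactly one round trip.
        self.round_trips += 1
        return self._conf.get("spark.sql.timestampType") == "TIMESTAMP_NTZ"

jvm = FakeJvm()
# Multi-call version: fetch the conf value, then compare in Python.
assert jvm.get("spark.sql.timestampType") == "TIMESTAMP_NTZ"
# Consolidated version: one call does the whole check.
assert jvm.is_timestamp_ntz_preferred()
print(jvm.round_trips)  # 2 total; the consolidated path used only 1 of them
```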

[GitHub] [spark] gengliangwang commented on pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox
gengliangwang commented on PR #36562: URL: https://github.com/apache/spark/pull/36562#issuecomment-1129515141 Merging to master/3.3
