[GitHub] [spark] pietro-cerutti commented on pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data

2022-01-17 Thread GitBox
pietro-cerutti commented on pull request #35229: URL: https://github.com/apache/spark/pull/35229#issuecomment-1015152944 We've been hit by this, the C++ `arrow::field` API won't limit you on the characters you put in a field name. You can then `arrow::Table::Make` a table using that field

[GitHub] [spark] cloud-fan commented on a change in pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data

2022-01-17 Thread GitBox
cloud-fan commented on a change in pull request #35229: URL: https://github.com/apache/spark/pull/35229#discussion_r786483098 ## File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ## @@ -4243,6 +4243,14 @@ class SQLQuerySuite extends QueryTest with

[GitHub] [spark] cloud-fan commented on a change in pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data

2022-01-17 Thread GitBox
cloud-fan commented on a change in pull request #35229: URL: https://github.com/apache/spark/pull/35229#discussion_r786482308 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ## @@ -434,7 +434,7 @@ case class DataSource(

[GitHub] [spark] cloud-fan commented on a change in pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data

2022-01-17 Thread GitBox
cloud-fan commented on a change in pull request #35229: URL: https://github.com/apache/spark/pull/35229#discussion_r786481288 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ## @@ -434,7 +434,7 @@ case class DataSource(

[GitHub] [spark] AngersZhuuuu commented on pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data

2022-01-17 Thread GitBox
AngersZhuuuu commented on pull request #35229: URL: https://github.com/apache/spark/pull/35229#issuecomment-1015143539 > Also, I think we should at least know how the files can be generated before merging this. How were these files created if they did not use Parquet I/O library to write?

[GitHub] [spark] beliefer commented on a change in pull request #35060: [SPARK-28137][SQL] Data Type Formatting Functions: `to_number`

2022-01-17 Thread GitBox
beliefer commented on a change in pull request #35060: URL: https://github.com/apache/spark/pull/35060#discussion_r786478459 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/NumberConstants.scala ## @@ -0,0 +1,250 @@ +/* + * Licensed to the Apache

[GitHub] [spark] AngersZhuuuu commented on a change in pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data in Parquet

2022-01-17 Thread GitBox
AngersZhuuuu commented on a change in pull request #35229: URL: https://github.com/apache/spark/pull/35229#discussion_r786475590 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ## @@ -434,7 +434,7 @@ case class DataSource(

[GitHub] [spark] AngersZhuuuu commented on pull request #35237: [SPARK-37951][MLLIB] Refactor ImageFileFormatSuite

2022-01-17 Thread GitBox
AngersZhuuuu commented on pull request #35237: URL: https://github.com/apache/spark/pull/35237#issuecomment-1015138333 ping @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] viirya commented on a change in pull request #35214: [SPARK-37915][SQL] Combine unions if there is a project between them

2022-01-17 Thread GitBox
viirya commented on a change in pull request #35214: URL: https://github.com/apache/spark/pull/35214#discussion_r786468967 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -1322,6 +1322,12 @@ object CombineUnions extends

[GitHub] [spark] huaxingao commented on a change in pull request #35221: [SPARK-37923][SQL] Generate partition transforms for BucketSpec inside parser

2022-01-17 Thread GitBox
huaxingao commented on a change in pull request #35221: URL: https://github.com/apache/spark/pull/35221#discussion_r786468915 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala ## @@ -3468,7 +3469,8 @@ class AstBuilder extends

[GitHub] [spark] AngersZhuuuu opened a new pull request #35237: [SPARK-37951][MLLIB] Refactor ImageFileFormatSuite

2022-01-17 Thread GitBox
AngersZhuuuu opened a new pull request #35237: URL: https://github.com/apache/spark/pull/35237 ### What changes were proposed in this pull request? Move test file of ImageFileFormatSuite from `../data` to module resource folder and use standard API to get file path ### Why

[GitHub] [spark] viirya commented on a change in pull request #35214: [SPARK-37915][SQL] Combine unions if there is a project between them

2022-01-17 Thread GitBox
viirya commented on a change in pull request #35214: URL: https://github.com/apache/spark/pull/35214#discussion_r786465506 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -1322,6 +1322,12 @@ object CombineUnions extends

[GitHub] [spark] itholic commented on a change in pull request #34406: Minor fix to docs for read_csv

2022-01-17 Thread GitBox
itholic commented on a change in pull request #34406: URL: https://github.com/apache/spark/pull/34406#discussion_r786454523 ## File path: python/pyspark/pandas/namespace.py ## @@ -272,7 +272,7 @@ def read_csv( The character used to denote the start and end of a quoted

[GitHub] [spark] itholic commented on a change in pull request #34406: Minor fix to docs for read_csv

2022-01-17 Thread GitBox
itholic commented on a change in pull request #34406: URL: https://github.com/apache/spark/pull/34406#discussion_r786452886 ## File path: python/pyspark/pandas/namespace.py ## @@ -272,7 +272,7 @@ def read_csv( The character used to denote the start and end of a quoted

[GitHub] [spark] wangyum commented on a change in pull request #35214: [SPARK-37915][SQL] Combine unions if there is a project between them

2022-01-17 Thread GitBox
wangyum commented on a change in pull request #35214: URL: https://github.com/apache/spark/pull/35214#discussion_r786435360 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -1322,6 +1322,12 @@ object CombineUnions extends

[GitHub] [spark] HyukjinKwon edited a comment on pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data in Parquet

2022-01-17 Thread GitBox
HyukjinKwon edited a comment on pull request #35229: URL: https://github.com/apache/spark/pull/35229#issuecomment-1015113084 Hey, we should at least disallow `.`, and should have a proper error message for Parquet specifically per PARQUET-1809 because it doesn't work with reading.

[GitHub] [spark] HyukjinKwon commented on a change in pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data in Parquet

2022-01-17 Thread GitBox
HyukjinKwon commented on a change in pull request #35229: URL: https://github.com/apache/spark/pull/35229#discussion_r786450390 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ## @@ -434,7 +434,7 @@ case class DataSource(

[GitHub] [spark] HyukjinKwon commented on pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data in Parquet

2022-01-17 Thread GitBox
HyukjinKwon commented on pull request #35229: URL: https://github.com/apache/spark/pull/35229#issuecomment-1015113084 Hey, we should at least disallow `.`, and should have a proper error message for Parquet specifically per PARQUET-1809 because it doesn't work with reading. Also, I

[GitHub] [spark] HyukjinKwon commented on a change in pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data in Parquet

2022-01-17 Thread GitBox
HyukjinKwon commented on a change in pull request #35229: URL: https://github.com/apache/spark/pull/35229#discussion_r786449221 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ## @@ -434,7 +434,7 @@ case class DataSource(

[GitHub] [spark] HyukjinKwon commented on a change in pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data in Parquet

2022-01-17 Thread GitBox
HyukjinKwon commented on a change in pull request #35229: URL: https://github.com/apache/spark/pull/35229#discussion_r786448499 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala ## @@ -81,12 +81,16 @@ object DataSourceUtils

[GitHub] [spark] beliefer commented on a change in pull request #35060: [SPARK-28137][SQL] Data Type Formatting Functions: `to_number`

2022-01-17 Thread GitBox
beliefer commented on a change in pull request #35060: URL: https://github.com/apache/spark/pull/35060#discussion_r786436785 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/NumberConstants.scala ## @@ -0,0 +1,244 @@ +/* + * Licensed to the Apache

[GitHub] [spark] AngersZhuuuu commented on pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data in Parquet

2022-01-17 Thread GitBox
AngersZhuuuu commented on pull request #35229: URL: https://github.com/apache/spark/pull/35229#issuecomment-1015093681 > We should add a test for this. AFAIK Parquet field names can contain special chars (one of our customers hit this issue), regardless of what Parquet spec says. Can we

[GitHub] [spark] cloud-fan closed pull request #35147: [SPARK-37768][SQL][FOLLOWUP] Schema pruning for the metadata struct

2022-01-17 Thread GitBox
cloud-fan closed pull request #35147: URL: https://github.com/apache/spark/pull/35147

[GitHub] [spark] cloud-fan commented on pull request #35147: [SPARK-37768][SQL][FOLLOWUP] Schema pruning for the metadata struct

2022-01-17 Thread GitBox
cloud-fan commented on pull request #35147: URL: https://github.com/apache/spark/pull/35147#issuecomment-1015090660 thanks, merging to master!

[GitHub] [spark] cloud-fan commented on pull request #35130: [SPARK-37839][SQL] DS V2 supports partial aggregate push-down `AVG`

2022-01-17 Thread GitBox
cloud-fan commented on pull request #35130: URL: https://github.com/apache/spark/pull/35130#issuecomment-1015090389 LGTM if tests pass

[GitHub] [spark] cloud-fan commented on a change in pull request #35220: [SPARK-37922][SQL] Combine to one cast if we can safely up-cast two casts

2022-01-17 Thread GitBox
cloud-fan commented on a change in pull request #35220: URL: https://github.com/apache/spark/pull/35220#discussion_r786430773 ## File path: sql/core/src/test/resources/sql-tests/results/typeCoercion/native/concat.sql.out ## @@ -40,16 +40,16 @@ FROM ( -- !query schema struct

[GitHub] [spark] cloud-fan closed pull request #35206: [SPARK-37906][SQL] spark-sql should not pass last comment to backend

2022-01-17 Thread GitBox
cloud-fan closed pull request #35206: URL: https://github.com/apache/spark/pull/35206

[GitHub] [spark] cloud-fan commented on pull request #35206: [SPARK-37906][SQL] spark-sql should not pass last comment to backend

2022-01-17 Thread GitBox
cloud-fan commented on pull request #35206: URL: https://github.com/apache/spark/pull/35206#issuecomment-1015088791 thanks, merging to master!

[GitHub] [spark] wangyum commented on a change in pull request #35220: [SPARK-37922][SQL] Combine to one cast if we can safely up-cast two casts

2022-01-17 Thread GitBox
wangyum commented on a change in pull request #35220: URL: https://github.com/apache/spark/pull/35220#discussion_r786429857 ## File path: sql/core/src/test/resources/sql-tests/results/typeCoercion/native/concat.sql.out ## @@ -40,16 +40,16 @@ FROM ( -- !query schema struct

[GitHub] [spark] dchvn commented on a change in pull request #35236: [SPARK-37903][PYTHON][FOLLOW-UP] Raise TypeError with no return function

2022-01-17 Thread GitBox
dchvn commented on a change in pull request #35236: URL: https://github.com/apache/spark/pull/35236#discussion_r786426897 ## File path: python/pyspark/pandas/tests/test_typedef.py ## @@ -56,6 +56,23 @@ class TypeHintTests(unittest.TestCase): +def

[GitHub] [spark] cloud-fan commented on pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data in Parquet

2022-01-17 Thread GitBox
cloud-fan commented on pull request #35229: URL: https://github.com/apache/spark/pull/35229#issuecomment-1015085542 We should add a test for this. AFAIK Parquet field names can contain special chars (one of our customers hit this issue), regardless of what Parquet spec says. Can we use

[GitHub] [spark] cloud-fan commented on pull request #35234: [SPARK-32165][SQL] SessionState leaks SparkListener with multiple SparkSession

2022-01-17 Thread GitBox
cloud-fan commented on pull request #35234: URL: https://github.com/apache/spark/pull/35234#issuecomment-1015082966 @dnskr Can you describe the case that multiple `SharedState`s are instantiated? Ideally we should only have one `SharedState` instance per driver JVM.

[GitHub] [spark] cloud-fan commented on a change in pull request #35220: [SPARK-37922][SQL] Combine to one cast if we can safely up-cast two casts

2022-01-17 Thread GitBox
cloud-fan commented on a change in pull request #35220: URL: https://github.com/apache/spark/pull/35220#discussion_r786423824 ## File path: sql/core/src/test/resources/sql-tests/results/typeCoercion/native/concat.sql.out ## @@ -40,16 +40,16 @@ FROM ( -- !query schema struct

[GitHub] [spark] cloud-fan closed pull request #35204: [SPARK-37878][SQL] Migrate SHOW CREATE TABLE to use v2 command by default

2022-01-17 Thread GitBox
cloud-fan closed pull request #35204: URL: https://github.com/apache/spark/pull/35204

[GitHub] [spark] cloud-fan commented on pull request #35204: [SPARK-37878][SQL] Migrate SHOW CREATE TABLE to use v2 command by default

2022-01-17 Thread GitBox
cloud-fan commented on pull request #35204: URL: https://github.com/apache/spark/pull/35204#issuecomment-1015081296 thanks, merging to master!

[GitHub] [spark] wangyum commented on a change in pull request #35214: [SPARK-37915][SQL] Combine unions if there is a project between them

2022-01-17 Thread GitBox
wangyum commented on a change in pull request #35214: URL: https://github.com/apache/spark/pull/35214#discussion_r786418127 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -1322,6 +1322,12 @@ object CombineUnions extends

[GitHub] [spark] wangyum commented on a change in pull request #35214: [SPARK-37915][SQL] Combine unions if there is a project between them

2022-01-17 Thread GitBox
wangyum commented on a change in pull request #35214: URL: https://github.com/apache/spark/pull/35214#discussion_r786417364 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -762,22 +762,22 @@ object

[GitHub] [spark] cloud-fan commented on a change in pull request #35214: [SPARK-37915][SQL] Combine unions if there is a project between them

2022-01-17 Thread GitBox
cloud-fan commented on a change in pull request #35214: URL: https://github.com/apache/spark/pull/35214#discussion_r786417032 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -1322,6 +1322,12 @@ object CombineUnions

[GitHub] [spark] cloud-fan commented on a change in pull request #35214: [SPARK-37915][SQL] Push down deterministic projection through SQL UNION and combine them

2022-01-17 Thread GitBox
cloud-fan commented on a change in pull request #35214: URL: https://github.com/apache/spark/pull/35214#discussion_r786415710 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -762,22 +762,22 @@ object

[GitHub] [spark] HyukjinKwon commented on pull request #35150: [SPARK-37850][PYTHON][INFRA] Enable flake's E731 rule in PySpark

2022-01-17 Thread GitBox
HyukjinKwon commented on pull request #35150: URL: https://github.com/apache/spark/pull/35150#issuecomment-1015064709 Okay .. let me maybe go forward and merge in few more days if there are no more comments ...

[GitHub] [spark] itholic edited a comment on pull request #35236: [SPARK-37903][PYTHON][FOLLOW-UP] Raise TypeError with no return function

2022-01-17 Thread GitBox
itholic edited a comment on pull request #35236: URL: https://github.com/apache/spark/pull/35236#issuecomment-1015063480 Also could you add some more context to the PR description why this change is needed ?? At least It is good to provide a link which PR or comment you are

[GitHub] [spark] itholic commented on pull request #35236: [SPARK-37903][PYTHON][FOLLOW-UP] Raise TypeError with no return function

2022-01-17 Thread GitBox
itholic commented on pull request #35236: URL: https://github.com/apache/spark/pull/35236#issuecomment-1015063480 Also add some more context to the PR description why this change is needed ?? At least It is good to provide a link which PR or comment you are following up.

[GitHub] [spark] itholic commented on a change in pull request #35236: [SPARK-37903][PYTHON][FOLLOW-UP] Raise TypeError with no return function

2022-01-17 Thread GitBox
itholic commented on a change in pull request #35236: URL: https://github.com/apache/spark/pull/35236#discussion_r786408066 ## File path: python/pyspark/pandas/tests/test_typedef.py ## @@ -56,6 +56,23 @@ class TypeHintTests(unittest.TestCase): +def

[GitHub] [spark] Peng-Lei commented on pull request #35204: [SPARK-37878][SQL] Migrate SHOW CREATE TABLE to use v2 command by default

2022-01-17 Thread GitBox
Peng-Lei commented on pull request #35204: URL: https://github.com/apache/spark/pull/35204#issuecomment-1015060395 > @Peng-Lei We should update `AstBuilder.cleanTableProperties`, to make `EXTERNAL` a truly reserved property. Let's open a new PR for it as it's a breaking change. The

[GitHub] [spark] dchvn commented on pull request #35236: [SPARK-37903][PYTHON][FOLLOW-UP] Raise TypeError with no return function

2022-01-17 Thread GitBox
dchvn commented on pull request #35236: URL: https://github.com/apache/spark/pull/35236#issuecomment-1015059516 CC @ueshin @HyukjinKwon FYI, Thanks!

[GitHub] [spark] dchvn opened a new pull request #35236: [SPARK-37903][PYTHON][FOLLOW-UP] Raise TypeError with no return function

2022-01-17 Thread GitBox
dchvn opened a new pull request #35236: URL: https://github.com/apache/spark/pull/35236 ### What changes were proposed in this pull request? Raise TypeError with no return function ### Why are the changes needed? Raise TypeError with no return function ### Does this PR

[GitHub] [spark] itholic commented on pull request #35203: [SPARK-37886][PYTHON][TESTS] Use ComparisonTestBase as base class in OpsTestBase

2022-01-17 Thread GitBox
itholic commented on pull request #35203: URL: https://github.com/apache/spark/pull/35203#issuecomment-1015053637 LGTM.

[GitHub] [spark] ulysses-you opened a new pull request #35235: [SPARK-37949][SQL] Improve Rebalance statistics estimation

2022-01-17 Thread GitBox
ulysses-you opened a new pull request #35235: URL: https://github.com/apache/spark/pull/35235 ### What changes were proposed in this pull request? Match `RebalancePartitions` in `SizeInBytesOnlyStatsPlanVisitor` and `BasicStatsPlanVisitor`. ### Why are the changes

[GitHub] [spark] AngersZhuuuu edited a comment on pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data in Parquet

2022-01-17 Thread GitBox
AngersZhuuuu edited a comment on pull request #35229: URL: https://github.com/apache/spark/pull/35229#issuecomment-1015028246 > These special characters are disallowed in Parquet side if I remember correctly. Can we double check what special chars are disallowed in Parquet side, and keep

[GitHub] [spark] AngersZhuuuu commented on pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data in Parquet

2022-01-17 Thread GitBox
AngersZhuuuu commented on pull request #35229: URL: https://github.com/apache/spark/pull/35229#issuecomment-1015028246 > These special characters are disallowed in Parquet side if I remember correctly. Can we double check what special chars are disallowed in Parquet side, and keep the

[GitHub] [spark] viirya commented on a change in pull request #33559: [SPARK-34265][PYTHON][SQL] Instrument Pandas UDFs using SQL metrics

2022-01-17 Thread GitBox
viirya commented on a change in pull request #33559: URL: https://github.com/apache/spark/pull/33559#discussion_r786378908 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala ## @@ -96,6 +101,8 @@ class ArrowPythonRunner(

[GitHub] [spark] itholic commented on a change in pull request #33559: [SPARK-34265][PYTHON][SQL] Instrument Pandas UDFs using SQL metrics

2022-01-17 Thread GitBox
itholic commented on a change in pull request #33559: URL: https://github.com/apache/spark/pull/33559#discussion_r786378875 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala ## @@ -42,7 +43,10 @@ class ArrowPythonRunner(

[GitHub] [spark] viirya commented on a change in pull request #33559: [SPARK-34265][PYTHON][SQL] Instrument Pandas UDFs using SQL metrics

2022-01-17 Thread GitBox
viirya commented on a change in pull request #33559: URL: https://github.com/apache/spark/pull/33559#discussion_r786376142 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ## @@ -162,7 +162,10 @@ case class

[GitHub] [spark] viirya commented on a change in pull request #35214: [SPARK-37915][SQL] Push down deterministic projection through SQL UNION and combine them

2022-01-17 Thread GitBox
viirya commented on a change in pull request #35214: URL: https://github.com/apache/spark/pull/35214#discussion_r786373535 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -78,7 +78,6 @@ abstract class

[GitHub] [spark] HyukjinKwon commented on pull request #35200: [SPARK-37903][PYTHON] Replace string_typehints with get_type_hints

2022-01-17 Thread GitBox
HyukjinKwon commented on pull request #35200: URL: https://github.com/apache/spark/pull/35200#issuecomment-1015006891 oh, okay, `max` doesn't have a return type hint. we should raise an exception

[GitHub] [spark] HyukjinKwon commented on pull request #35200: [SPARK-37903][PYTHON] Replace string_typehints with get_type_hints

2022-01-17 Thread GitBox
HyukjinKwon commented on pull request #35200: URL: https://github.com/apache/spark/pull/35200#issuecomment-1015006329 I think it's correct to infer the type from `max`. does it cause any problem?

[GitHub] [spark] HyukjinKwon commented on pull request #35232: [SPARK-37947][SQL] Extract generator from GeneratorOuter expression contained by a Generate operator.

2022-01-17 Thread GitBox
HyukjinKwon commented on pull request #35232: URL: https://github.com/apache/spark/pull/35232#issuecomment-1015002241 cc @allisonwang-db FYI

[GitHub] [spark] pan3793 edited a comment on pull request #34934: [SPARK-37675][CORE][SHUFFLE] Return PushMergedRemoteMetaFailedFetchResult if no available push-merged block

2022-01-17 Thread GitBox
pan3793 edited a comment on pull request #34934: URL: https://github.com/apache/spark/pull/34934#issuecomment-1015001194 > To clarify, these logs are with a version of spark/shuffle service without modifications ? Or were there any code changes made to them ? Thx. Oops, I forgot to

[GitHub] [spark] pan3793 commented on pull request #34934: [SPARK-37675][CORE][SHUFFLE] Return PushMergedRemoteMetaFailedFetchResult if no available push-merged block

2022-01-17 Thread GitBox
pan3793 commented on pull request #34934: URL: https://github.com/apache/spark/pull/34934#issuecomment-1015001194 > To clarify, these logs are with a version of spark/shuffle service without modifications ? Or were there any code changes made to them ? Thx. Oops, I forgot link the

[GitHub] [spark] beliefer commented on a change in pull request #35130: [SPARK-37839][SQL] DS V2 supports partial aggregate push-down `AVG`

2022-01-17 Thread GitBox
beliefer commented on a change in pull request #35130: URL: https://github.com/apache/spark/pull/35130#discussion_r786360517 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala ## @@ -88,25 +88,65 @@ object

[GitHub] [spark] wangyum commented on a change in pull request #35220: [SPARK-37922][SQL] Combine to one cast if we can safely up-cast two casts

2022-01-17 Thread GitBox
wangyum commented on a change in pull request #35220: URL: https://github.com/apache/spark/pull/35220#discussion_r786351644 ## File path: sql/core/src/test/resources/sql-tests/results/typeCoercion/native/concat.sql.out ## @@ -40,16 +40,16 @@ FROM ( -- !query schema struct

[GitHub] [spark] Stelyus commented on pull request #35233: [SPARK-37290][SQL] - Exponential planning time in case of non-deterministic function

2022-01-17 Thread GitBox
Stelyus commented on pull request #35233: URL: https://github.com/apache/spark/pull/35233#issuecomment-1014977463 Tested with: ``` val adselect_raw = spark.createDataFrame(Seq(("imp-1",1),("imp-2",2))) .cache() val adselect = adselect_raw.select(

[GitHub] [spark] HyukjinKwon edited a comment on pull request #35233: [SPARK-37290][SQL] - Exponential planning time in case of non-deterministic function

2022-01-17 Thread GitBox
HyukjinKwon edited a comment on pull request #35233: URL: https://github.com/apache/spark/pull/35233#issuecomment-1014973053 > How was this patch tested? Can we either add a unittest or describe how you tested?

[GitHub] [spark] HyukjinKwon commented on pull request #35233: [SPARK-37290][SQL] - Exponential planning time in case of non-deterministic function

2022-01-17 Thread GitBox
HyukjinKwon commented on pull request #35233: URL: https://github.com/apache/spark/pull/35233#issuecomment-1014973053 > How was this patch tested? Can we either add a unittest or describe how you tested?

[GitHub] [spark] HyukjinKwon commented on pull request #33559: [SPARK-34265][PYTHON][SQL] Instrument Pandas UDFs using SQL metrics

2022-01-17 Thread GitBox
HyukjinKwon commented on pull request #33559: URL: https://github.com/apache/spark/pull/33559#issuecomment-1014970803 Looks fine from a cursory look .. but let me add some more Python and SQL people here - @cloud-fan, @maryannxue, @viirya @ueshin @BryanCutler FYI

[GitHub] [spark] asfgit closed pull request #34982: [SPARK-37712][YARN] Spark request yarn cluster metrics slow cause delay

2022-01-17 Thread GitBox
asfgit closed pull request #34982: URL: https://github.com/apache/spark/pull/34982

[GitHub] [spark] HyukjinKwon commented on a change in pull request #33559: [SPARK-34265][PYTHON][SQL] Instrument Pandas UDFs using SQL metrics

2022-01-17 Thread GitBox
HyukjinKwon commented on a change in pull request #33559: URL: https://github.com/apache/spark/pull/33559#discussion_r786342835 ## File path: python/pyspark/sql/tests/test_pandas_sqlmetrics.py ## @@ -0,0 +1,66 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one

[GitHub] [spark] HyukjinKwon commented on a change in pull request #33559: [SPARK-34265][PYTHON][SQL] Instrument Pandas UDFs using SQL metrics

2022-01-17 Thread GitBox
HyukjinKwon commented on a change in pull request #33559: URL: https://github.com/apache/spark/pull/33559#discussion_r786342806 ## File path: python/pyspark/sql/tests/test_pandas_sqlmetrics.py ## @@ -0,0 +1,66 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one

[GitHub] [spark] mridulm commented on pull request #34982: [SPARK-37712][YARN] Spark request yarn cluster metrics slow cause delay

2022-01-17 Thread GitBox
mridulm commented on pull request #34982: URL: https://github.com/apache/spark/pull/34982#issuecomment-1014969982 Merging to master.

[GitHub] [spark] mridulm commented on a change in pull request #34982: [SPARK-37712][YARN] Spark request yarn cluster metrics slow cause delay

2022-01-17 Thread GitBox
mridulm commented on a change in pull request #34982: URL: https://github.com/apache/spark/pull/34982#discussion_r786342436 ## File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ## @@ -183,8 +183,10 @@ private[spark] class Client(

[GitHub] [spark] HyukjinKwon commented on a change in pull request #33559: [SPARK-34265][PYTHON][SQL] Instrument Pandas UDFs using SQL metrics

2022-01-17 Thread GitBox
HyukjinKwon commented on a change in pull request #33559: URL: https://github.com/apache/spark/pull/33559#discussion_r786342279 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ## @@ -162,7 +162,10 @@ case class

[GitHub] [spark] mridulm commented on a change in pull request #35085: [SPARK-37618][CORE] Remove shuffle blocks using the shuffle service for released executors

2022-01-17 Thread GitBox
mridulm commented on a change in pull request #35085: URL: https://github.com/apache/spark/pull/35085#discussion_r786341998 ## File path: core/src/main/scala/org/apache/spark/util/Utils.scala ## @@ -2742,6 +2743,16 @@ private[spark] object Utils extends Logging { new

[GitHub] [spark] dnskr commented on pull request #35224: [SPARK-32165][SQL] Ensure Spark only initiates SharedState once across SparkSessions

2022-01-17 Thread GitBox
dnskr commented on pull request #35224: URL: https://github.com/apache/spark/pull/35224#issuecomment-1014968891 I haven't seen https://github.com/apache/spark/commit/4d90c5dc0efcf77ef6735000ee7016428c57077b either. The change fixes `ExecutionListenerBus` memory leak but not the memory

[GitHub] [spark] mridulm commented on pull request #35180: [SPARK-37881][CORE] Cleanup ShuffleBlockResolver from polluted methods to create a developer API

2022-01-17 Thread GitBox
mridulm commented on pull request #35180: URL: https://github.com/apache/spark/pull/35180#issuecomment-1014968390 I am fine with either direction you want to take @attilapiros - either we can merge this PR or make it WIP and fix `MapStatus`/surrounding infra before circling back to this -

[GitHub] [spark] Yikun commented on a change in pull request #35215: [SPARK-37916][K8S] The ConfigMap is assigned to incorrect namespace

2022-01-17 Thread GitBox
Yikun commented on a change in pull request #35215: URL: https://github.com/apache/spark/pull/35215#discussion_r786341158 ## File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala ## @@ -76,13

[GitHub] [spark] mridulm edited a comment on pull request #35185: [SPARK-37831][CORE] add task partition id in TaskInfo and Task Metrics

2022-01-17 Thread GitBox
mridulm edited a comment on pull request #35185: URL: https://github.com/apache/spark/pull/35185#issuecomment-1014966382 > > Took an initial pass through the PR and added some comments - overall looks good. We would need to make sure that skew join and partition coalescing in SQL interact

[GitHub] [spark] github-actions[bot] closed pull request #33888: [SPARK-36634][SQL] Support access and read parquet file by column ordinal

2022-01-17 Thread GitBox
github-actions[bot] closed pull request #33888: URL: https://github.com/apache/spark/pull/33888

[GitHub] [spark] HyukjinKwon commented on pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data

2022-01-17 Thread GitBox
HyukjinKwon commented on pull request #35229: URL: https://github.com/apache/spark/pull/35229#issuecomment-1014966688 See https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/schema/MessageTypeParser.java#L48 as an example. Also dot is not

[GitHub] [spark] mridulm commented on pull request #35185: [SPARK-37831][CORE] add task partition id in TaskInfo and Task Metrics

2022-01-17 Thread GitBox
mridulm commented on pull request #35185: URL: https://github.com/apache/spark/pull/35185#issuecomment-1014966382 > > Took an initial pass through the PR and added some comments - overall looks good. We would need to make sure that skew join and partition coalescing in SQL interact well

[GitHub] [spark] mridulm commented on a change in pull request #35185: [SPARK-37831][CORE] add task partition id in TaskInfo and Task Metrics

2022-01-17 Thread GitBox
mridulm commented on a change in pull request #35185: URL: https://github.com/apache/spark/pull/35185#discussion_r786339891 ## File path: core/src/test/resources/HistoryServerExpectations/excludeOnFailure_for_stage_expectation.json ## @@ -631,6 +642,7 @@ "taskId" : 4,

[GitHub] [spark] mridulm commented on a change in pull request #35185: [SPARK-37831][CORE] add task partition id in TaskInfo and Task Metrics

2022-01-17 Thread GitBox
mridulm commented on a change in pull request #35185: URL: https://github.com/apache/spark/pull/35185#discussion_r786339392 ## File path: core/src/main/scala/org/apache/spark/status/storeTypes.scala ## @@ -286,6 +289,7 @@ private[spark] class TaskDataWrapper( taskId,

[GitHub] [spark] dnskr opened a new pull request #35234: [SPARK-32165][SQL] SessionState leaks SparkListener with multiple SparkSession

2022-01-17 Thread GitBox
dnskr opened a new pull request #35234: URL: https://github.com/apache/spark/pull/35234 ### What changes were proposed in this pull request? The memory leak of `ExecutionListenerBus` was fixed in https://github.com/apache/spark/commit/4d90c5dc0efcf77ef6735000ee7016428c57077b by

[GitHub] [spark] HyukjinKwon commented on pull request #35229: [SPARK-27442][SQL] Remove check field name when reading data

2022-01-17 Thread GitBox
HyukjinKwon commented on pull request #35229: URL: https://github.com/apache/spark/pull/35229#issuecomment-1014963246 These special characters are disallowed on the Parquet side, if I remember correctly. Can we double-check which special characters are disallowed on the Parquet side, and keep the check

[GitHub] [spark] mridulm commented on pull request #34934: [SPARK-37675][CORE][SHUFFLE] Return PushMergedRemoteMetaFailedFetchResult if no available push-merged block

2022-01-17 Thread GitBox
mridulm commented on pull request #34934: URL: https://github.com/apache/spark/pull/34934#issuecomment-1014963088 To clarify, are these logs from a version of Spark/shuffle service without modifications? Or were any code changes made to them? Thanks.

[GitHub] [spark] AmplabJenkins commented on pull request #35230: [SPARK-37934] [Build] Upgrade Jetty version to 9.4.44

2022-01-17 Thread GitBox
AmplabJenkins commented on pull request #35230: URL: https://github.com/apache/spark/pull/35230#issuecomment-1014961317 Can one of the admins verify this patch?

[GitHub] [spark] HyukjinKwon closed pull request #35228: [SPARK-37498][PYTHON] Add eventually for test_reuse_worker_of_parallelize_range

2022-01-17 Thread GitBox
HyukjinKwon closed pull request #35228: URL: https://github.com/apache/spark/pull/35228

[GitHub] [spark] HyukjinKwon commented on pull request #35228: [SPARK-37498][PYTHON] Add eventually for test_reuse_worker_of_parallelize_range

2022-01-17 Thread GitBox
HyukjinKwon commented on pull request #35228: URL: https://github.com/apache/spark/pull/35228#issuecomment-1014958300 Merged to master.

[GitHub] [spark] AmplabJenkins commented on pull request #35233: [SPARK-37290][SQL] - Exponential planning time in case of non-deterministic function

2022-01-17 Thread GitBox
AmplabJenkins commented on pull request #35233: URL: https://github.com/apache/spark/pull/35233#issuecomment-1014936236 Can one of the admins verify this patch?

[GitHub] [spark] Stelyus opened a new pull request #35233: Spark 37290

2022-01-17 Thread GitBox
Stelyus opened a new pull request #35233: URL: https://github.com/apache/spark/pull/35233 ### What changes were proposed in this pull request? When using a non-deterministic function, the method getAllValidConstraints can throw an OOM ``` protected def

[GitHub] [spark] chia7712 commented on pull request #35215: [SPARK-37916][K8S] The ConfigMap is assigned to incorrect namespace

2022-01-17 Thread GitBox
chia7712 commented on pull request #35215: URL: https://github.com/apache/spark/pull/35215#issuecomment-1014889112 > If it's possible, would you mind also adding a test case to make sure the namespace is set correctly in executor/driver? @Yikun thanks for your comments. Will copy that

[GitHub] [spark] bersprockets opened a new pull request #35232: [SPARK-37947][SQL] Extract generator from GeneratorOuter expression contained by a Generate operator.

2022-01-17 Thread GitBox
bersprockets opened a new pull request #35232: URL: https://github.com/apache/spark/pull/35232 ### What changes were proposed in this pull request? This PR updates the ExtractGenerator rule to extract a generator from a GeneratorOuter expression contained by a Generate operator.
