[GitHub] [spark] MaxGekk commented on a diff in pull request #38615: [SPARK-41109][SQL] Rename the error class _LEGACY_ERROR_TEMP_1216 to INVALID_LIKE_PATTERN

2022-11-11 Thread GitBox
MaxGekk commented on code in PR #38615: URL: https://github.com/apache/spark/pull/38615#discussion_r1020704358 ## core/src/main/resources/error/error-classes.json: ## @@ -630,6 +630,11 @@ "Input schema can only contain STRING as a key type for a MAP." ] }, +

[GitHub] [spark] AmplabJenkins commented on pull request #38601: [WIP][SPARK-41100][INFRA] Upgrade Ubuntu to latest

2022-11-11 Thread GitBox
AmplabJenkins commented on PR #38601: URL: https://github.com/apache/spark/pull/38601#issuecomment-1312357558 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

[GitHub] [spark] AmplabJenkins commented on pull request #38607: [SPARK-40938][CONNECT][PYTHON][FOLLOW-UP] Fix SubqueryAlias without the child plan when constructing Connect proto in the Python client

2022-11-11 Thread GitBox
AmplabJenkins commented on PR #38607: URL: https://github.com/apache/spark/pull/38607#issuecomment-1312357530 Can one of the admins verify this patch?

[GitHub] [spark] AmplabJenkins commented on pull request #38603: [SPARK-41101][PYTHON][PROTOBUF] Message classname support for PYSPARK-PROTOBUF

2022-11-11 Thread GitBox
AmplabJenkins commented on PR #38603: URL: https://github.com/apache/spark/pull/38603#issuecomment-1312357548 Can one of the admins verify this patch?

[GitHub] [spark] Dam1029 commented on pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s

2022-11-11 Thread GitBox
Dam1029 commented on PR #38518: URL: https://github.com/apache/spark/pull/38518#issuecomment-1312350979 @dongjoon-hyun @Ngone51 Could you help take a look?

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-11-11 Thread GitBox
WeichenXu123 commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r1020654939 ## python/pyspark/ml/functions.py: ## @@ -106,6 +138,602 @@ def array_to_vector(col: Column) -> Column: return

[GitHub] [spark] viirya commented on a diff in pull request #38626: [SPARK-38959][SQL][FOLLOWUP] Do not optimize subqueries twice

2022-11-11 Thread GitBox
viirya commented on code in PR #38626: URL: https://github.com/apache/spark/pull/38626#discussion_r1020639613 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala: ## @@ -51,8 +51,10 @@ class SparkOptimizer( Batch("Optimize Metadata Only Query",

[GitHub] [spark] viirya commented on a diff in pull request #38626: [SPARK-38959][SQL][FOLLOWUP] Do not optimize subqueries twice

2022-11-11 Thread GitBox
viirya commented on code in PR #38626: URL: https://github.com/apache/spark/pull/38626#discussion_r1020639496 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala: ## @@ -51,8 +51,10 @@ class SparkOptimizer( Batch("Optimize Metadata Only Query",

[GitHub] [spark] viirya commented on a diff in pull request #38626: [SPARK-38959][SQL][FOLLOWUP] Do not optimize subqueries twice

2022-11-11 Thread GitBox
viirya commented on code in PR #38626: URL: https://github.com/apache/spark/pull/38626#discussion_r1020639233 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala: ## @@ -51,8 +51,10 @@ class SparkOptimizer( Batch("Optimize Metadata Only Query",

[GitHub] [spark] github-actions[bot] closed pull request #37365: [SPARK-39938][PYTHON][PS] Accept all inputs of prefix/suffix which implement __str__ in add_predix/add_suffix

2022-11-11 Thread GitBox
github-actions[bot] closed pull request #37365: [SPARK-39938][PYTHON][PS] Accept all inputs of prefix/suffix which implement __str__ in add_predix/add_suffix URL: https://github.com/apache/spark/pull/37365

[GitHub] [spark] github-actions[bot] closed pull request #37355: [SPARK-39930][SQL] Introduce Cache Hints

2022-11-11 Thread GitBox
github-actions[bot] closed pull request #37355: [SPARK-39930][SQL] Introduce Cache Hints URL: https://github.com/apache/spark/pull/37355

[GitHub] [spark] github-actions[bot] commented on pull request #37346: [SPARK-37210][CORE][SQL] Allow forced use of staging directory

2022-11-11 Thread GitBox
github-actions[bot] commented on PR #37346: URL: https://github.com/apache/spark/pull/37346#issuecomment-1312282414 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] amaliujia commented on pull request #38632: [SPARK-41116][CONNECT] Input relation can be optional for Project in Connect proto

2022-11-11 Thread GitBox
amaliujia commented on PR #38632: URL: https://github.com/apache/spark/pull/38632#issuecomment-1312268172 R: @cloud-fan

[GitHub] [spark] amaliujia opened a new pull request, #38632: [SPARK-41116][CONNECT] Input relation can be optional for Project in Connect proto

2022-11-11 Thread GitBox
amaliujia opened a new pull request, #38632: URL: https://github.com/apache/spark/pull/38632 ### What changes were proposed in this pull request? I was writing test cases to test expressions and realized that we can allow `Project` without input plan. For example, `SELECT 1`
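The description above argues that `Project` should be allowed without an input plan, so that expression-only queries like `SELECT 1` can be expressed. As a hedged sketch (message and field names are illustrative, not the real Spark Connect definitions), the proto shape could look like:

```protobuf
// Illustrative only: a Project whose input relation is optional.
message Project {
  // When absent, the Project has no child — e.g. `SELECT 1`,
  // which produces a single row from literal expressions alone.
  optional Relation input = 1;
  repeated Expression expressions = 2;
}
```

With `optional`, the planner can distinguish "no input set" from an empty relation and synthesize a one-row source for literal-only projections.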

[GitHub] [spark] xinrong-meng commented on a diff in pull request #38611: [SPARK-41107][PYTHON][INFRA][TEST] Install memory-profiler in the CI

2022-11-11 Thread GitBox
xinrong-meng commented on code in PR #38611: URL: https://github.com/apache/spark/pull/38611#discussion_r1020600278 ## dev/infra/Dockerfile: ## @@ -32,7 +32,7 @@ RUN $APT_INSTALL software-properties-common git libxml2-dev pkg-config curl wget RUN update-alternatives --set

[GitHub] [spark] AmplabJenkins commented on pull request #38611: [SPARK-41107][PYTHON][INFRA][TEST] Install memory-profiler in the CI

2022-11-11 Thread GitBox
AmplabJenkins commented on PR #38611: URL: https://github.com/apache/spark/pull/38611#issuecomment-1312240061 Can one of the admins verify this patch?

[GitHub] [spark] AmplabJenkins commented on pull request #38615: [SPARK-41109][SQL] Rename the error class _LEGACY_ERROR_TEMP_1216 to INVALID_LIKE_PATTERN

2022-11-11 Thread GitBox
AmplabJenkins commented on PR #38615: URL: https://github.com/apache/spark/pull/38615#issuecomment-1312240034 Can one of the admins verify this patch?

[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

2022-11-11 Thread GitBox
grundprinzip commented on code in PR #38631: URL: https://github.com/apache/spark/pull/38631#discussion_r1020586736 ## python/pyspark/sql/connect/column.py: ## @@ -82,6 +82,74 @@ def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression": def __str__(self) ->

[GitHub] [spark] felipepessoto commented on pull request #37616: [SPARK-40178][PYTHON][SQL] Fix partitioning hint parameters in PySpark

2022-11-11 Thread GitBox
felipepessoto commented on PR #37616: URL: https://github.com/apache/spark/pull/37616#issuecomment-1312224727 For Scala, is it expected that we need to call `.expr`, or do we need to fix it as well?

[GitHub] [spark] mridulm commented on pull request #38091: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case

2022-11-11 Thread GitBox
mridulm commented on PR #38091: URL: https://github.com/apache/spark/pull/38091#issuecomment-131768 Let us see if the recent fix addresses the issue - else we can take that route @LuciferYang

[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

2022-11-11 Thread GitBox
grundprinzip commented on code in PR #38631: URL: https://github.com/apache/spark/pull/38631#discussion_r1020577947 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -334,7 +334,11 @@ class SparkConnectPlanner(session:

[GitHub] [spark] ueshin commented on a diff in pull request #38611: [SPARK-41107][PYTHON][INFRA][TEST] Install memory-profiler in the CI

2022-11-11 Thread GitBox
ueshin commented on code in PR #38611: URL: https://github.com/apache/spark/pull/38611#discussion_r1020481025 ## dev/infra/Dockerfile: ## @@ -32,7 +32,7 @@ RUN $APT_INSTALL software-properties-common git libxml2-dev pkg-config curl wget RUN update-alternatives --set java

[GitHub] [spark] gengliangwang commented on pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI

2022-11-11 Thread GitBox
gengliangwang commented on PR #38567: URL: https://github.com/apache/spark/pull/38567#issuecomment-1312210749 @mridulm FYI I have sent the SPIP to the dev list.

[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38603: [SPARK-41101][PYTHON][PROTOBUF] Message classname support for PYSPARK-PROTOBUF

2022-11-11 Thread GitBox
SandishKumarHN commented on code in PR #38603: URL: https://github.com/apache/spark/pull/38603#discussion_r1020572339 ## python/pyspark/sql/protobuf/functions.py: ## @@ -48,8 +48,11 @@ def from_protobuf( -- data : :class:`~pyspark.sql.Column` or str

[GitHub] [spark] amaliujia commented on a diff in pull request #38621: [SPARK-41111][CONNECT][PYTHON] Implement `DataFrame.show`

2022-11-11 Thread GitBox
amaliujia commented on code in PR #38621: URL: https://github.com/apache/spark/pull/38621#discussion_r1020483723 ## connector/connect/src/main/protobuf/spark/connect/relations.proto: ## @@ -253,6 +254,23 @@ message Repartition { bool shuffle = 3; } +// Compose the string

[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

2022-11-11 Thread GitBox
grundprinzip commented on code in PR #38631: URL: https://github.com/apache/spark/pull/38631#discussion_r1020565080 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -334,7 +334,11 @@ class SparkConnectPlanner(session:

[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

2022-11-11 Thread GitBox
grundprinzip commented on code in PR #38631: URL: https://github.com/apache/spark/pull/38631#discussion_r1020564580 ## python/pyspark/sql/connect/column.py: ## @@ -82,6 +82,74 @@ def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression": def __str__(self) ->

[GitHub] [spark] amaliujia commented on a diff in pull request #38607: [SPARK-40938][CONNECT][PYTHON][FOLLOW-UP] Fix SubqueryAlias without the child plan when constructing Connect proto in the Python

2022-11-11 Thread GitBox
amaliujia commented on code in PR #38607: URL: https://github.com/apache/spark/pull/38607#discussion_r1020564169 ## python/pyspark/sql/connect/plan.py: ## @@ -712,6 +712,8 @@ def __init__(self, child: Optional["LogicalPlan"], alias: str) -> None: def plan(self, session:

[GitHub] [spark] amaliujia commented on pull request #38630: [SPARK-41115][CONNECT] Add ClientType Enum to proto to indicate which client sends a request

2022-11-11 Thread GitBox
amaliujia commented on PR #38630: URL: https://github.com/apache/spark/pull/38630#issuecomment-1312198788 Yeah let me take a look.

[GitHub] [spark] amaliujia commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

2022-11-11 Thread GitBox
amaliujia commented on code in PR #38631: URL: https://github.com/apache/spark/pull/38631#discussion_r1020563219 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -334,7 +334,11 @@ class SparkConnectPlanner(session:

[GitHub] [spark] amaliujia commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

2022-11-11 Thread GitBox
amaliujia commented on code in PR #38631: URL: https://github.com/apache/spark/pull/38631#discussion_r1020563219 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -334,7 +334,11 @@ class SparkConnectPlanner(session:

[GitHub] [spark] grundprinzip commented on a diff in pull request #38607: [SPARK-40938][CONNECT][PYTHON][FOLLOW-UP] Fix SubqueryAlias without the child plan when constructing Connect proto in the Pyth

2022-11-11 Thread GitBox
grundprinzip commented on code in PR #38607: URL: https://github.com/apache/spark/pull/38607#discussion_r1020562864 ## python/pyspark/sql/connect/plan.py: ## @@ -712,6 +712,8 @@ def __init__(self, child: Optional["LogicalPlan"], alias: str) -> None: def plan(self,

[GitHub] [spark] amaliujia commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

2022-11-11 Thread GitBox
amaliujia commented on code in PR #38631: URL: https://github.com/apache/spark/pull/38631#discussion_r1020561328 ## python/pyspark/sql/connect/column.py: ## @@ -82,6 +82,74 @@ def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression": def __str__(self) ->

[GitHub] [spark] amaliujia commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

2022-11-11 Thread GitBox
amaliujia commented on code in PR #38631: URL: https://github.com/apache/spark/pull/38631#discussion_r1020561852 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -334,7 +334,11 @@ class SparkConnectPlanner(session:

[GitHub] [spark] grundprinzip commented on pull request #38630: [SPARK-41115][CONNECT] Add ClientType Enum to proto to indicate which client sends a request

2022-11-11 Thread GitBox
grundprinzip commented on PR #38630: URL: https://github.com/apache/spark/pull/38630#issuecomment-1312195716 With all this in mind, it still makes more sense to keep this as an optional string field instead of an enum.

[GitHub] [spark] grundprinzip commented on pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

2022-11-11 Thread GitBox
grundprinzip commented on PR #38631: URL: https://github.com/apache/spark/pull/38631#issuecomment-1312192934 @amaliujia @cloud-fan @HyukjinKwon @hvanhovell

[GitHub] [spark] amaliujia commented on pull request #38630: [SPARK-41115][CONNECT] Add ClientType Enum to proto to indicate which client sends a request

2022-11-11 Thread GitBox
amaliujia commented on PR #38630: URL: https://github.com/apache/spark/pull/38630#issuecomment-1312190938 @grundprinzip I actually want to log to different entries for the usage from different clients. The perfect logging to me is x jobs submitted through proto (which

[GitHub] [spark] grundprinzip opened a new pull request, #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

2022-11-11 Thread GitBox
grundprinzip opened a new pull request, #38631: URL: https://github.com/apache/spark/pull/38631 ### What changes were proposed in this pull request? This extends the implementation of column aliases in Spark Connect by supporting lists of column names and providing the appropriate

[GitHub] [spark] grundprinzip commented on a diff in pull request #38630: [SPARK-41115][CONNECT] Add ClientType Enum to proto to indicate which client sends a request

2022-11-11 Thread GitBox
grundprinzip commented on code in PR #38630: URL: https://github.com/apache/spark/pull/38630#discussion_r1020553812 ## connector/connect/src/main/protobuf/spark/connect/base.proto: ## @@ -48,6 +54,9 @@ message Request { // The logical plan to be executed / analyzed. Plan

[GitHub] [spark] grundprinzip commented on a diff in pull request #38630: [SPARK-41115][CONNECT] Add ClientType Enum to proto to indicate which client sends a request

2022-11-11 Thread GitBox
grundprinzip commented on code in PR #38630: URL: https://github.com/apache/spark/pull/38630#discussion_r1020553414 ## connector/connect/src/main/protobuf/spark/connect/base.proto: ## @@ -48,6 +54,9 @@ message Request { // The logical plan to be executed / analyzed. Plan

[GitHub] [spark] amaliujia commented on pull request #38630: [SPARK-41115][CONNECT] Add ClientType Enum to proto to indicate which client sends a request

2022-11-11 Thread GitBox
amaliujia commented on PR #38630: URL: https://github.com/apache/spark/pull/38630#issuecomment-1312181336 @grundprinzip @HyukjinKwon cc @hvanhovell

[GitHub] [spark] amaliujia opened a new pull request, #38630: [SPARK-41115][CONNECT] Add ClientType Enum to proto to indicate which client sends a request

2022-11-11 Thread GitBox
amaliujia opened a new pull request, #38630: URL: https://github.com/apache/spark/pull/38630 ### What changes were proposed in this pull request? This PR introduces an enum into the Connect proto that can be included in a Request to indicate the client type. ### Why are
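For context on the design question debated in this thread (enum vs. string), the two shapes look roughly as follows. This is an illustrative sketch only — the names, values, and field numbers are hypothetical, not the actual Spark Connect proto definitions:

```protobuf
// Variant A (this PR's proposal): a closed enum of known clients.
// Adding a new client requires a proto change.
enum ClientType {
  CLIENT_TYPE_UNSPECIFIED = 0;  // proto3 enums need a zero default
  CLIENT_TYPE_PYTHON = 1;
  CLIENT_TYPE_SCALA = 2;
}

// Variant B (suggested in review): an open-ended string such as
// "pyspark/3.4.0" — new clients can self-describe with no proto change.
message Request {
  string client_type = 7;  // hypothetical field number
}
```

The trade-off is the usual one: an enum gives exhaustive matching and cheap server-side logging buckets, while a free-form string keeps the protocol forward-compatible with clients the server has never heard of.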

[GitHub] [spark] MaxGekk commented on a diff in pull request #38623: [WIP][SPARK-41072][SQL][SS] Add the error class `STREAM_FAILED`

2022-11-11 Thread GitBox
MaxGekk commented on code in PR #38623: URL: https://github.com/apache/spark/pull/38623#discussion_r1020544244 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala: ## @@ -321,7 +321,10 @@ abstract class StreamExecution( // to `new

[GitHub] [spark] MaxGekk opened a new pull request, #38629: [WIP][SS] Add the error class `STREAM_FAILED` to `StreamingQueryException`

2022-11-11 Thread GitBox
MaxGekk opened a new pull request, #38629: URL: https://github.com/apache/spark/pull/38629 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How

[GitHub] [spark] amaliujia commented on a diff in pull request #38621: [SPARK-41111][CONNECT][PYTHON] Implement `DataFrame.show`

2022-11-11 Thread GitBox
amaliujia commented on code in PR #38621: URL: https://github.com/apache/spark/pull/38621#discussion_r1020488570 ## python/pyspark/sql/tests/connect/test_connect_basic.py: ## @@ -217,6 +217,12 @@ def test_empty_dataset(self): def test_session(self):

[GitHub] [spark] amaliujia commented on a diff in pull request #38595: [SPARK-41090][SQL] Fix view not found issue for `db_name.view_name`

2022-11-11 Thread GitBox
amaliujia commented on code in PR #38595: URL: https://github.com/apache/spark/pull/38595#discussion_r1020520053 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -3804,6 +3804,13 @@ class Dataset[T] private[sql]( } catch { case _: ParseException

[GitHub] [spark] amaliujia commented on a diff in pull request #38595: [SPARK-41090][SQL] Fix view not found issue for `db_name.view_name`

2022-11-11 Thread GitBox
amaliujia commented on code in PR #38595: URL: https://github.com/apache/spark/pull/38595#discussion_r1020516505 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -3804,6 +3804,13 @@ class Dataset[T] private[sql]( } catch { case _: ParseException

[GitHub] [spark] zsxwing commented on a diff in pull request #38623: [WIP][SPARK-41072][SQL][SS] Add the error class `STREAM_FAILED`

2022-11-11 Thread GitBox
zsxwing commented on code in PR #38623: URL: https://github.com/apache/spark/pull/38623#discussion_r1020491006 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala: ## @@ -321,7 +321,10 @@ abstract class StreamExecution( // to `new

[GitHub] [spark] amaliujia commented on a diff in pull request #38621: [SPARK-41111][CONNECT][PYTHON] Implement `DataFrame.show`

2022-11-11 Thread GitBox
amaliujia commented on code in PR #38621: URL: https://github.com/apache/spark/pull/38621#discussion_r1020488570 ## python/pyspark/sql/tests/connect/test_connect_basic.py: ## @@ -217,6 +217,12 @@ def test_empty_dataset(self): def test_session(self):

[GitHub] [spark] amaliujia commented on a diff in pull request #38621: [SPARK-41111][CONNECT][PYTHON] Implement `DataFrame.show`

2022-11-11 Thread GitBox
amaliujia commented on code in PR #38621: URL: https://github.com/apache/spark/pull/38621#discussion_r1020484857 ## connector/connect/src/main/protobuf/spark/connect/relations.proto: ## @@ -253,6 +254,23 @@ message Repartition { bool shuffle = 3; } +// Compose the string

[GitHub] [spark] amaliujia commented on a diff in pull request #38621: [SPARK-41111][CONNECT][PYTHON] Implement `DataFrame.show`

2022-11-11 Thread GitBox
amaliujia commented on code in PR #38621: URL: https://github.com/apache/spark/pull/38621#discussion_r1020483723 ## connector/connect/src/main/protobuf/spark/connect/relations.proto: ## @@ -253,6 +254,23 @@ message Repartition { bool shuffle = 3; } +// Compose the string

[GitHub] [spark] ueshin commented on a diff in pull request #38611: [SPARK-41107][PYTHON][INFRA][TEST] Install memory-profiler in the CI

2022-11-11 Thread GitBox
ueshin commented on code in PR #38611: URL: https://github.com/apache/spark/pull/38611#discussion_r1020481025 ## dev/infra/Dockerfile: ## @@ -32,7 +32,7 @@ RUN $APT_INSTALL software-properties-common git libxml2-dev pkg-config curl wget RUN update-alternatives --set java

[GitHub] [spark] kazuyukitanimura commented on pull request #38628: [SPARK-41096][SQL] Support reading parquet FIXED_LEN_BYTE_ARRAY type

2022-11-11 Thread GitBox
kazuyukitanimura commented on PR #38628: URL: https://github.com/apache/spark/pull/38628#issuecomment-1312069028 cc @huaxingao @sunchao @viirya

[GitHub] [spark] kazuyukitanimura opened a new pull request, #38628: [SPARK-41096][SQL] Support reading parquet FIXED_LEN_BYTE_ARRAY type

2022-11-11 Thread GitBox
kazuyukitanimura opened a new pull request, #38628: URL: https://github.com/apache/spark/pull/38628 ### What changes were proposed in this pull request? Parquet supports the FIXED_LEN_BYTE_ARRAY (FLBA) data type. However, the Spark Parquet reader currently cannot handle FLBA. This PR proposes
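For readers unfamiliar with the type in question: FIXED_LEN_BYTE_ARRAY is one of Parquet's physical types, a byte array whose length is fixed in the schema rather than stored per value. A minimal illustrative schema fragment (column names are hypothetical, not from the PR):

```
message example_schema {
  required fixed_len_byte_array(16) uuid_col (UUID);
  required fixed_len_byte_array(12) interval_col;
}
```

Files written with such schemas (e.g. UUIDs, or fixed-width binary from other writers) would previously fail to load in Spark's Parquet reader; the PR adds read support for this physical type.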

[GitHub] [spark] carlfu-db commented on a diff in pull request #38404: [SPARK-40956] SQL Equivalent for Dataframe overwrite command

2022-11-11 Thread GitBox
carlfu-db commented on code in PR #38404: URL: https://github.com/apache/spark/pull/38404#discussion_r1013470376 ## sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -319,7 +319,7 @@ query insertInto : INSERT OVERWRITE TABLE?

[GitHub] [spark] grundprinzip commented on pull request #38627: [SPARK-40875] [CONNECT] [FOLLOW] Retain Group expressions in aggregate.

2022-11-11 Thread GitBox
grundprinzip commented on PR #38627: URL: https://github.com/apache/spark/pull/38627#issuecomment-1311969221 R: @cloud-fan @amaliujia @zhengruifeng

[GitHub] [spark] grundprinzip opened a new pull request, #38627: [SPARK-40875] [CONNECT] [FOLLOW] Retain Group expressions in aggregate.

2022-11-11 Thread GitBox
grundprinzip opened a new pull request, #38627: URL: https://github.com/apache/spark/pull/38627 ### What changes were proposed in this pull request? This is a follow-up improving the behavior and compatibility for aggregate relations using Spark Connect. Previously, Spark Connect
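The behavior this follow-up targets — retaining grouping expressions in the aggregate output — matches regular Spark SQL semantics, where grouping columns appear in the result schema alongside the aggregate expressions. An illustrative query (table and column names are hypothetical):

```sql
-- The grouping column `dept` is part of the result schema,
-- not just the aggregate `cnt`:
SELECT dept, count(*) AS cnt
FROM employees
GROUP BY dept;
```

Dropping `dept` from the output here would make Connect aggregates incompatible with what the same query returns through the regular SQL path.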

[GitHub] [spark] 19Serhii99 commented on pull request #38574: [SPARK-41060][K8S] Fix generating driver and executor Config Maps

2022-11-11 Thread GitBox
19Serhii99 commented on PR #38574: URL: https://github.com/apache/spark/pull/38574#issuecomment-1311956108 Need help with fixing the integration tests. I did not expect them to fail, as all I had done was replace the constants with the methods.

[GitHub] [spark] srielau commented on a diff in pull request #38615: [SPARK-41109][SQL] Rename the error class _LEGACY_ERROR_TEMP_1216 to INVALID_LIKE_PATTERN

2022-11-11 Thread GitBox
srielau commented on code in PR #38615: URL: https://github.com/apache/spark/pull/38615#discussion_r1020406807 ## core/src/main/resources/error/error-classes.json: ## @@ -630,6 +630,11 @@ "Input schema can only contain STRING as a key type for a MAP." ] }, +

[GitHub] [spark] AmplabJenkins commented on pull request #38622: [SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down

2022-11-11 Thread GitBox
AmplabJenkins commented on PR #38622: URL: https://github.com/apache/spark/pull/38622#issuecomment-1311933990 Can one of the admins verify this patch?

[GitHub] [spark] cloud-fan commented on pull request #38626: [SPARK-38959][SQL][FOLLOWUP] Do not optimize subqueries twice

2022-11-11 Thread GitBox
cloud-fan commented on PR #38626: URL: https://github.com/apache/spark/pull/38626#issuecomment-1311906245 cc @aokolnychyi @viirya

[GitHub] [spark] cloud-fan opened a new pull request, #38626: [SPARK-38959][SQL][FOLLOWUP] Do not optimize subqueries twice

2022-11-11 Thread GitBox
cloud-fan opened a new pull request, #38626: URL: https://github.com/apache/spark/pull/38626 ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/38557 . We found that some optimizer rules can't be applied twice

[GitHub] [spark] WeichenXu123 commented on pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-11-11 Thread GitBox
WeichenXu123 commented on PR #37734: URL: https://github.com/apache/spark/pull/37734#issuecomment-1311877149 @mengxr Could you make a final pass? The PR is LGTM once all my comments are addressed.

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-11-11 Thread GitBox
WeichenXu123 commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r1020345415 ## python/pyspark/ml/functions.py: ## @@ -106,6 +117,601 @@ def array_to_vector(col: Column) -> Column: return

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-11-11 Thread GitBox
WeichenXu123 commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r1020344325 ## python/pyspark/ml/functions.py: ## @@ -106,6 +117,601 @@ def array_to_vector(col: Column) -> Column: return

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-11-11 Thread GitBox
WeichenXu123 commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r1020344325 ## python/pyspark/ml/functions.py: ## @@ -106,6 +117,601 @@ def array_to_vector(col: Column) -> Column: return

[GitHub] [spark] deepyaman commented on pull request #38625: [PYTHON][PS] Fix the `.groupby()` method docstring

2022-11-11 Thread GitBox
deepyaman commented on PR #38625: URL: https://github.com/apache/spark/pull/38625#issuecomment-1311866247 Not sure why the Build job is failing; Actions are enabled on my fork, and I did run the following, as specified in the failed job, just in case: ```bash git fetch upstream

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-11-11 Thread GitBox
WeichenXu123 commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r1020333239 ## python/pyspark/ml/functions.py: ## @@ -106,6 +117,601 @@ def array_to_vector(col: Column) -> Column: return

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-11-11 Thread GitBox
WeichenXu123 commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r1020332942 ## python/pyspark/ml/functions.py: ## @@ -106,6 +117,601 @@ def array_to_vector(col: Column) -> Column: return

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-11-11 Thread GitBox
WeichenXu123 commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r1020331415 ## python/pyspark/ml/functions.py: ## @@ -106,6 +117,601 @@ def array_to_vector(col: Column) -> Column: return

[GitHub] [spark] srielau commented on a diff in pull request #38531: [SPARK-40755][SQL] Migrate type check failures of number formatting onto error classes

2022-11-11 Thread GitBox
srielau commented on code in PR #38531: URL: https://github.com/apache/spark/pull/38531#discussion_r1020324789 ## core/src/main/resources/error/error-classes.json: ## @@ -290,6 +290,46 @@ "Null typed values cannot be used as arguments of ." ] }, +

[GitHub] [spark] deepyaman opened a new pull request, #38625: Update generic.py

2022-11-11 Thread GitBox
deepyaman opened a new pull request, #38625: URL: https://github.com/apache/spark/pull/38625 ### What changes were proposed in this pull request? Update the docstring for the `.groupby()` method. ### Why are the changes needed? The `.groupby()` method accept

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-11-11 Thread GitBox
WeichenXu123 commented on code in PR #37734: URL: https://github.com/apache/spark/pull/37734#discussion_r1020309114 ## python/pyspark/ml/model_cache.py: ## @@ -0,0 +1,46 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license

[GitHub] [spark] AmplabJenkins commented on pull request #38624: [SPARK-40559][PYTHON] Add applyInArrow to groupBy and cogroup

2022-11-11 Thread GitBox
AmplabJenkins commented on PR #38624: URL: https://github.com/apache/spark/pull/38624#issuecomment-1311784451 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

[GitHub] [spark] srowen closed pull request #38596: [SPARK-41093][BUILD] Remove netty-tcnative-classes from Spark dependencyList

2022-11-11 Thread GitBox
srowen closed pull request #38596: [SPARK-41093][BUILD] Remove netty-tcnative-classes from Spark dependencyList URL: https://github.com/apache/spark/pull/38596

[GitHub] [spark] srowen commented on pull request #38596: [SPARK-41093][BUILD] Remove netty-tcnative-classes from Spark dependencyList

2022-11-11 Thread GitBox
srowen commented on PR #38596: URL: https://github.com/apache/spark/pull/38596#issuecomment-1311748449 Merged to master

[GitHub] [spark] mridulm commented on pull request #38617: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case

2022-11-11 Thread GitBox
mridulm commented on PR #38617: URL: https://github.com/apache/spark/pull/38617#issuecomment-1311717433 Thanks @HyukjinKwon , @LuciferYang !

[GitHub] [spark] wankunde commented on a diff in pull request #38495: [SPARK-35531][SQL] Update hive table stats without unnecessary convert

2022-11-11 Thread GitBox
wankunde commented on code in PR #38495: URL: https://github.com/apache/spark/pull/38495#discussion_r1020228031 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/AlterTableDropPartitionSuite.scala: ## @@ -39,11 +39,11 @@ class AlterTableDropPartitionSuite

[GitHub] [spark] MaxGekk commented on pull request #38531: [SPARK-40755][SQL] Migrate type check failures of number formatting onto error classes

2022-11-11 Thread GitBox
MaxGekk commented on PR #38531: URL: https://github.com/apache/spark/pull/38531#issuecomment-1311696603 also cc @dtenedor @cloud-fan @srielau @itholic

[GitHub] [spark] MaxGekk commented on a diff in pull request #38531: [SPARK-40755][SQL] Migrate type check failures of number formatting onto error classes

2022-11-11 Thread GitBox
MaxGekk commented on code in PR #38531: URL: https://github.com/apache/spark/pull/38531#discussion_r1020213977 ## core/src/main/resources/error/error-classes.json: ## @@ -290,6 +290,46 @@ "Null typed values cannot be used as arguments of ." ] }, +

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38621: [SPARK-41111][CONNECT][PYTHON] Implement `DataFrame.show`

2022-11-11 Thread GitBox
HyukjinKwon commented on code in PR #38621: URL: https://github.com/apache/spark/pull/38621#discussion_r1020209283 ## connector/connect/src/main/protobuf/spark/connect/relations.proto: ## @@ -253,6 +254,23 @@ message Repartition { bool shuffle = 3; } +// Compose the

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38621: [SPARK-41111][CONNECT][PYTHON] Implement `DataFrame.show`

2022-11-11 Thread GitBox
HyukjinKwon commented on code in PR #38621: URL: https://github.com/apache/spark/pull/38621#discussion_r1020208301 ## python/pyspark/sql/connect/dataframe.py: ## @@ -388,8 +388,55 @@ def sample( session=self._session, ) -def show(self, n: int,

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38621: [SPARK-41111][CONNECT][PYTHON] Implement `DataFrame.show`

2022-11-11 Thread GitBox
HyukjinKwon commented on code in PR #38621: URL: https://github.com/apache/spark/pull/38621#discussion_r1020207990 ## python/pyspark/sql/connect/dataframe.py: ## @@ -388,8 +388,55 @@ def sample( session=self._session, ) -def show(self, n: int,

[GitHub] [spark] MaxGekk closed pull request #38507: [SPARK-40372][SQL] Migrate failures of array type checks onto error classes

2022-11-11 Thread GitBox
MaxGekk closed pull request #38507: [SPARK-40372][SQL] Migrate failures of array type checks onto error classes URL: https://github.com/apache/spark/pull/38507

[GitHub] [spark] wangyum commented on a diff in pull request #38495: [SPARK-35531][SQL] Update hive table stats without unnecessary convert

2022-11-11 Thread GitBox
wangyum commented on code in PR #38495: URL: https://github.com/apache/spark/pull/38495#discussion_r1020196122 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/AlterTableDropPartitionSuite.scala: ## @@ -39,11 +39,11 @@ class AlterTableDropPartitionSuite

[GitHub] [spark] MaxGekk commented on pull request #38507: [SPARK-40372][SQL] Migrate failures of array type checks onto error classes

2022-11-11 Thread GitBox
MaxGekk commented on PR #38507: URL: https://github.com/apache/spark/pull/38507#issuecomment-1311659255 +1, LGTM. Merging to master. All GAs passed on the previous commit, and the last one is just a rebase. Thank you, @LuciferYang.

[GitHub] [spark] EnricoMi opened a new pull request, #38624: [SPARK-40559][PYTHON] Add applyInArrow to groupBy and cogroup

2022-11-11 Thread GitBox
EnricoMi opened a new pull request, #38624: URL: https://github.com/apache/spark/pull/38624 ### What changes were proposed in this pull request? Add `applyInArrow` method to PySpark `groupBy` and `groupBy.cogroup` to allow for user functions that work on Arrow. Similar to existing

[GitHub] [spark] HyukjinKwon closed pull request #38614: [SPARK-41005][CONNECT][FOLLOWUP] Collect should use `submitJob` instead of `runJob`

2022-11-11 Thread GitBox
HyukjinKwon closed pull request #38614: [SPARK-41005][CONNECT][FOLLOWUP] Collect should use `submitJob` instead of `runJob` URL: https://github.com/apache/spark/pull/38614

[GitHub] [spark] HyukjinKwon commented on pull request #38614: [SPARK-41005][CONNECT][FOLLOWUP] Collect should use `submitJob` instead of `runJob`

2022-11-11 Thread GitBox
HyukjinKwon commented on PR #38614: URL: https://github.com/apache/spark/pull/38614#issuecomment-1311628662 Merged to master.

[GitHub] [spark] HyukjinKwon commented on pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-11 Thread GitBox
HyukjinKwon commented on PR #38613: URL: https://github.com/apache/spark/pull/38613#issuecomment-1311621739 Actually let's just go with https://github.com/apache/spark/pull/38614 approach which is simpler. This approach can't easily dedup the codes anyway because of ordering anyway.

[GitHub] [spark] HyukjinKwon closed pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-11 Thread GitBox
HyukjinKwon closed pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect URL: https://github.com/apache/spark/pull/38613

[GitHub] [spark] hvanhovell commented on a diff in pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-11 Thread GitBox
hvanhovell commented on code in PR #38613: URL: https://github.com/apache/spark/pull/38613#discussion_r1020162907 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -144,36 +144,10 @@ class

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-11 Thread GitBox
HyukjinKwon commented on code in PR #38613: URL: https://github.com/apache/spark/pull/38613#discussion_r1020130165 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -144,36 +144,10 @@ class

[GitHub] [spark] hvanhovell commented on a diff in pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-11 Thread GitBox
hvanhovell commented on code in PR #38613: URL: https://github.com/apache/spark/pull/38613#discussion_r1020133986 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -144,36 +144,10 @@ class

[GitHub] [spark] MaxGekk commented on a diff in pull request #38615: [SPARK-41109][SQL] Rename the error class _LEGACY_ERROR_TEMP_1216 to INVALID_LIKE_PATTERN

2022-11-11 Thread GitBox
MaxGekk commented on code in PR #38615: URL: https://github.com/apache/spark/pull/38615#discussion_r1020153905 ## core/src/main/resources/error/error-classes.json: ## @@ -630,6 +630,11 @@ "Input schema can only contain STRING as a key type for a MAP." ] }, +

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-11 Thread GitBox
HyukjinKwon commented on code in PR #38613: URL: https://github.com/apache/spark/pull/38613#discussion_r1020127951 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -56,7 +56,7 @@ class

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-11 Thread GitBox
HyukjinKwon commented on code in PR #38613: URL: https://github.com/apache/spark/pull/38613#discussion_r1020127788 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -56,7 +56,7 @@ class

[GitHub] [spark] hvanhovell commented on a diff in pull request #38613: [SPARK-41005][CONNECT][PYTHON][FOLLOW-UP] Fetch/send partitions in parallel for Arrow based collect

2022-11-11 Thread GitBox
hvanhovell commented on code in PR #38613: URL: https://github.com/apache/spark/pull/38613#discussion_r1020126889 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala: ## @@ -184,9 +158,30 @@ class
