[jira] [Resolved] (SPARK-43979) CollectedMetrics should be treated as the same one for self-join

2023-06-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-43979.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41475
[https://github.com/apache/spark/pull/41475]

> CollectedMetrics should be treated as the same one for self-join
> 
>
> Key: SPARK-43979
> URL: https://issues.apache.org/jira/browse/SPARK-43979
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.5.0
>
>
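
For context, a hedged sketch of the pattern this fix targets, using the
standard Dataset.observe API (the query below is illustrative, not the one
from the PR):

{code:scala}
import org.apache.spark.sql.functions._

// df carries a CollectedMetrics node created by observe().
val df = spark.range(10).observe("metrics", count(lit(1)).alias("rows"))

// A self-join references the same CollectedMetrics node on both sides. The
// analyzer should treat both references as one metrics collection instead of
// rejecting the plan for duplicate observed metrics.
val joined = df.join(df.select(col("id").alias("id2")), col("id") === col("id2"))
joined.collect()
{code}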







[jira] [Resolved] (SPARK-43717) Scala Client Dataset#reduce failed to handle null partitions for scala primitive types

2023-06-06 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-43717.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41264
[https://github.com/apache/spark/pull/41264]

> Scala Client Dataset#reduce failed to handle null partitions for scala 
> primitive types
> --
>
> Key: SPARK-43717
> URL: https://issues.apache.org/jira/browse/SPARK-43717
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Zhen Li
>Assignee: Zhen Li
>Priority: Major
> Fix For: 3.5.0
>
>
> Scala client failed with NPE when running:
> assert(spark.range(0, 5, 1, 10).as[Long].reduce(_ + _) == 10)
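
A hedged reading of why this repro trips the bug (my gloss, not text from the
ticket): spark.range(0, 5, 1, 10) spreads only 5 rows across 10 partitions,
so several partitions are empty, and the client's reduce presumably received
null partial results for them, which cannot be unboxed into a Scala
primitive Long.

{code:scala}
// 10 partitions but only 5 rows => at least 5 empty partitions. Each
// partition contributes a partial sum; empty partitions contributed null,
// which failed when combined into a primitive Long.
val sum = spark.range(0, 5, 1, 10).as[Long].reduce(_ + _)
assert(sum == 10)  // 0 + 1 + 2 + 3 + 4
{code}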






[jira] [Assigned] (SPARK-43717) Scala Client Dataset#reduce failed to handle null partitions for scala primitive types

2023-06-06 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-43717:


Assignee: Zhen Li

> Scala Client Dataset#reduce failed to handle null partitions for scala 
> primitive types
> --
>
> Key: SPARK-43717
> URL: https://issues.apache.org/jira/browse/SPARK-43717
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Zhen Li
>Assignee: Zhen Li
>Priority: Major
>
> Scala client failed with NPE when running:
> assert(spark.range(0, 5, 1, 10).as[Long].reduce(_ + _) == 10)






[jira] [Created] (SPARK-43989) Add maven testing GA task for connect server module

2023-06-06 Thread Yang Jie (Jira)
Yang Jie created SPARK-43989:


 Summary: Add maven testing GA task for connect server module
 Key: SPARK-43989
 URL: https://issues.apache.org/jira/browse/SPARK-43989
 Project: Spark
  Issue Type: Improvement
  Components: Connect, Project Infra
Affects Versions: 3.5.0
Reporter: Yang Jie









[jira] [Updated] (SPARK-43988) Add maven testing GA task for connect client module

2023-06-06 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-43988:
-
Summary: Add maven testing GA task for connect client module  (was: Add 
independent maven testing GA task for connect client module)

> Add maven testing GA task for connect client module
> ---
>
> Key: SPARK-43988
> URL: https://issues.apache.org/jira/browse/SPARK-43988
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Project Infra
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Created] (SPARK-43988) Add independent maven testing GA task for connect client module

2023-06-06 Thread Yang Jie (Jira)
Yang Jie created SPARK-43988:


 Summary: Add independent maven testing GA task for connect client 
module
 Key: SPARK-43988
 URL: https://issues.apache.org/jira/browse/SPARK-43988
 Project: Spark
  Issue Type: Improvement
  Components: Connect, Project Infra
Affects Versions: 3.5.0
Reporter: Yang Jie









[jira] [Created] (SPARK-43987) Separate finalizeShuffleMerge Processing to Dedicated Thread Pools

2023-06-06 Thread SHU WANG (Jira)
SHU WANG created SPARK-43987:


 Summary: Separate finalizeShuffleMerge Processing to Dedicated 
Thread Pools
 Key: SPARK-43987
 URL: https://issues.apache.org/jira/browse/SPARK-43987
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.4.0, 3.2.0
Reporter: SHU WANG


In our production environment, _finalizeShuffleMerge_ processing takes much 
longer (p90 is around 20s) than other RPC requests. This is because 
_finalizeShuffleMerge_ invokes I/O operations like truncate and file 
open/close.

More importantly, processing _finalizeShuffleMerge_ can block other critical 
lightweight messages such as authentication, which can cause authentication 
timeouts as well as fetch failures. Those timeouts and fetch failures affect 
the stability of Spark job execution.
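
A hedged sketch of the proposed direction; the names below are illustrative,
not Spark's actual shuffle-service code:

{code:scala}
import java.util.concurrent.Executors

object FinalizeOffloadSketch {
  sealed trait Rpc
  case object Authenticate extends Rpc // lightweight, latency-sensitive
  final case class FinalizeShuffleMerge(appId: String) extends Rpc // heavy I/O

  // Dedicated pool so finalizeShuffleMerge I/O (truncate, file open/close)
  // cannot stall the threads serving lightweight RPCs.
  private val finalizePool = Executors.newFixedThreadPool(4)

  def handle(rpc: Rpc): Unit = rpc match {
    case Authenticate =>
      // Served inline; must never wait behind a ~20s file operation.
      println("authenticated")
    case FinalizeShuffleMerge(appId) =>
      finalizePool.submit(new Runnable {
        override def run(): Unit = println(s"finalized merged shuffle for $appId")
      })
  }
}
{code}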






[jira] [Resolved] (SPARK-43669) Fix BinaryOps.lt to work with Spark Connect Column.

2023-06-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee resolved SPARK-43669.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Fixed in [https://github.com/apache/spark/pull/41305]

> Fix BinaryOps.lt to work with Spark Connect Column.
> ---
>
> Key: SPARK-43669
> URL: https://issues.apache.org/jira/browse/SPARK-43669
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Fix BinaryOps.lt to work with Spark Connect Column.






[jira] [Resolved] (SPARK-43668) Fix BinaryOps.le to work with Spark Connect Column.

2023-06-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee resolved SPARK-43668.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Fixed in [https://github.com/apache/spark/pull/41305]

> Fix BinaryOps.le to work with Spark Connect Column.
> ---
>
> Key: SPARK-43668
> URL: https://issues.apache.org/jira/browse/SPARK-43668
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Fix BinaryOps.le to work with Spark Connect Column.






[jira] [Resolved] (SPARK-43672) Enable CategoricalOps.gt to work with Spark Connect.

2023-06-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee resolved SPARK-43672.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

> Enable CategoricalOps.gt to work with Spark Connect.
> 
>
> Key: SPARK-43672
> URL: https://issues.apache.org/jira/browse/SPARK-43672
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Enable CategoricalOps.gt to work with Spark Connect.






[jira] [Updated] (SPARK-43673) Enable CategoricalOps.le to work with Spark Connect.

2023-06-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-43673:

Fix Version/s: 3.5.0

> Enable CategoricalOps.le to work with Spark Connect.
> 
>
> Key: SPARK-43673
> URL: https://issues.apache.org/jira/browse/SPARK-43673
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Enable CategoricalOps.le to work with Spark Connect.






[jira] [Updated] (SPARK-43674) Enable CategoricalOps.lt to work with Spark Connect.

2023-06-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-43674:

Fix Version/s: 3.5.0

> Enable CategoricalOps.lt to work with Spark Connect.
> 
>
> Key: SPARK-43674
> URL: https://issues.apache.org/jira/browse/SPARK-43674
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Enable CategoricalOps.lt to work with Spark Connect.






[jira] [Reopened] (SPARK-43672) Enable CategoricalOps.gt to work with Spark Connect.

2023-06-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee reopened SPARK-43672:
-

> Enable CategoricalOps.gt to work with Spark Connect.
> 
>
> Key: SPARK-43672
> URL: https://issues.apache.org/jira/browse/SPARK-43672
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Enable CategoricalOps.gt to work with Spark Connect.






[jira] [Resolved] (SPARK-43667) Fix BinaryOps.gt to work with Spark Connect Column.

2023-06-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43667.
--
Fix Version/s: 3.5.0
 Assignee: Haejoon Lee
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/41305

> Fix BinaryOps.gt to work with Spark Connect Column.
> ---
>
> Key: SPARK-43667
> URL: https://issues.apache.org/jira/browse/SPARK-43667
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Fix BinaryOps.gt to work with Spark Connect Column.






[jira] [Resolved] (SPARK-43674) Enable CategoricalOps.lt to work with Spark Connect.

2023-06-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee resolved SPARK-43674.
-
Resolution: Fixed

This is resolved from https://github.com/apache/spark/pull/41310.

> Enable CategoricalOps.lt to work with Spark Connect.
> 
>
> Key: SPARK-43674
> URL: https://issues.apache.org/jira/browse/SPARK-43674
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Enable CategoricalOps.lt to work with Spark Connect.






[jira] [Resolved] (SPARK-43673) Enable CategoricalOps.le to work with Spark Connect.

2023-06-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee resolved SPARK-43673.
-
Resolution: Fixed

This is resolved from https://github.com/apache/spark/pull/41310.

> Enable CategoricalOps.le to work with Spark Connect.
> 
>
> Key: SPARK-43673
> URL: https://issues.apache.org/jira/browse/SPARK-43673
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Enable CategoricalOps.le to work with Spark Connect.






[jira] [Resolved] (SPARK-43672) Enable CategoricalOps.gt to work with Spark Connect.

2023-06-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee resolved SPARK-43672.
-
Resolution: Fixed

This is resolved from https://github.com/apache/spark/pull/41310.

> Enable CategoricalOps.gt to work with Spark Connect.
> 
>
> Key: SPARK-43672
> URL: https://issues.apache.org/jira/browse/SPARK-43672
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Enable CategoricalOps.gt to work with Spark Connect.






[jira] [Resolved] (SPARK-43985) Spark protobuf enums.as.ints raises exception on repeated enum types

2023-06-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43985.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41481
[https://github.com/apache/spark/pull/41481]

> Spark protobuf enums.as.ints raises exception on repeated enum types
> 
>
> Key: SPARK-43985
> URL: https://issues.apache.org/jira/browse/SPARK-43985
> Project: Spark
>  Issue Type: Bug
>  Components: Protobuf
>Affects Versions: 3.4.0
>Reporter: Parth Upadhyay
>Assignee: Parth Upadhyay
>Priority: Major
> Fix For: 3.5.0
>
>
> When the `enums.as.ints` option is enabled, deserializing repeated enum 
> fields currently raises an exception. We should fix this behavior so that 
> repeated enum fields work correctly.
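
A hedged sketch of the failing configuration; the message name, descriptor
path, and input dataframe are illustrative:

{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.protobuf.functions.from_protobuf

val options = new java.util.HashMap[String, String]()
options.put("enums.as.ints", "true")

// binaryDf holds serialized MyMessage bytes in a "payload" column.
val parsed = binaryDf.select(
  from_protobuf(col("payload"), "MyMessage", "/path/to/descriptors.desc", options)
    .alias("msg"))
// Before the fix, a repeated enum field inside MyMessage raised an exception
// here; with the fix it deserializes to an array of ints.
{code}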






[jira] [Resolved] (SPARK-43901) Avro to Support custom decimal type backed by Long

2023-06-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43901.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41409
[https://github.com/apache/spark/pull/41409]

> Avro to Support custom decimal type backed by Long
> --
>
> Key: SPARK-43901
> URL: https://issues.apache.org/jira/browse/SPARK-43901
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Siying Dong
>Assignee: Siying Dong
>Priority: Major
> Fix For: 3.5.0
>
>
> Right now, Avro only allows the Decimal logical type on fixed and bytes 
> types. However, users need to represent decimals backed by the long type, 
> for example to represent currency (money). The request is to support a 
> customized decimal type backed by long.
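
A hedged sketch of what such a schema could look like; the logical-type name
and the data path are illustrative, with the exact spelling defined by the
linked PR:

{code:scala}
// An Avro field that stores money as a long, annotated with a custom decimal
// logical type (precision/scale as in the fixed/bytes-backed case).
val avroSchema = """
{
  "type": "record",
  "name": "Payment",
  "fields": [
    {"name": "amount",
     "type": {"type": "long", "logicalType": "custom-decimal",
              "precision": 18, "scale": 2}}
  ]
}
"""
// Reading with spark-avro, supplying the schema explicitly:
val df = spark.read.format("avro").option("avroSchema", avroSchema).load("/path/to/payments")
{code}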






[jira] [Assigned] (SPARK-43901) Avro to Support custom decimal type backed by Long

2023-06-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43901:
-

Assignee: Siying Dong

> Avro to Support custom decimal type backed by Long
> --
>
> Key: SPARK-43901
> URL: https://issues.apache.org/jira/browse/SPARK-43901
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Siying Dong
>Assignee: Siying Dong
>Priority: Major
>
> Right now, Avro only allows the Decimal logical type on fixed and bytes 
> types. However, users need to represent decimals backed by the long type, 
> for example to represent currency (money). The request is to support a 
> customized decimal type backed by long.






[jira] [Resolved] (SPARK-42750) Support INSERT INTO by name

2023-06-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-42750.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40908
[https://github.com/apache/spark/pull/40908]

> Support INSERT INTO by name
> ---
>
> Key: SPARK-42750
> URL: https://issues.apache.org/jira/browse/SPARK-42750
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jose Torres
>Assignee: Jia Fan
>Priority: Major
> Fix For: 3.5.0
>
>
> In some use cases, users have incoming dataframes with fixed column names 
> which might differ from the canonical order. Currently there's no way to 
> handle this easily through the INSERT INTO API - the user has to make sure 
> the columns are in the right order, as they would when inserting a tuple. We 
> should add an optional BY NAME clause, such that:
> INSERT INTO tgt BY NAME <source>
> takes each column of <source> and inserts it into the column in `tgt` which 
> has the same name according to the configured `resolver` logic.
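
A hedged end-to-end example of the proposed clause; the table and data are
illustrative:

{code:scala}
spark.sql("CREATE TABLE tgt (id INT, name STRING, score DOUBLE) USING parquet")

// The source columns arrive in a different order than tgt's schema.
spark.sql("""
  INSERT INTO tgt BY NAME
  SELECT 'alice' AS name, 0.9D AS score, 1 AS id
""")

// Columns were matched by name, not by position:
spark.sql("SELECT * FROM tgt").show()  // => id=1, name=alice, score=0.9
{code}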






[jira] [Assigned] (SPARK-43615) Enable DataFrameSlowParityTests.test_eval

2023-06-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43615:


Assignee: Haejoon Lee

> Enable DataFrameSlowParityTests.test_eval
> -
>
> Key: SPARK-43615
> URL: https://issues.apache.org/jira/browse/SPARK-43615
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Repro:
> {code:java}
> pdf = pd.DataFrame({"A": range(1, 6), "B": range(10, 0, -2)})
> psdf = ps.from_pandas(pdf)
> pdf.eval("B = A + B // (100 + 200) * (500 - B) - 10.5")
> psdf.eval("B = A + B // (100 + 200) * (500 - B) - 10.5") {code}






[jira] [Resolved] (SPARK-43615) Enable DataFrameSlowParityTests.test_eval

2023-06-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43615.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41471
[https://github.com/apache/spark/pull/41471]

> Enable DataFrameSlowParityTests.test_eval
> -
>
> Key: SPARK-43615
> URL: https://issues.apache.org/jira/browse/SPARK-43615
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Repro:
> {code:java}
> pdf = pd.DataFrame({"A": range(1, 6), "B": range(10, 0, -2)})
> psdf = ps.from_pandas(pdf)
> pdf.eval("B = A + B // (100 + 200) * (500 - B) - 10.5")
> psdf.eval("B = A + B // (100 + 200) * (500 - B) - 10.5") {code}






[jira] [Assigned] (SPARK-43930) Add unix_* functions to Scala and Python

2023-06-06 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43930:
-

Assignee: BingKun Pan

> Add unix_* functions to Scala and Python
> 
>
> Key: SPARK-43930
> URL: https://issues.apache.org/jira/browse/SPARK-43930
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: BingKun Pan
>Priority: Major
>
> Add following functions:
> * unix_date
> * unix_micros
> * unix_millis
> * unix_seconds
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client
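
A hedged usage sketch of the four functions once this lands in 3.5.0:

{code:scala}
import org.apache.spark.sql.functions._

val df = spark.sql("SELECT timestamp'2023-06-06 00:00:00' AS ts, date'2023-06-06' AS d")
df.select(
  unix_date(col("d")),     // days since 1970-01-01
  unix_seconds(col("ts")), // seconds since the epoch
  unix_millis(col("ts")),  // milliseconds since the epoch
  unix_micros(col("ts"))   // microseconds since the epoch
).show()
{code}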






[jira] [Resolved] (SPARK-43930) Add unix_* functions to Scala and Python

2023-06-06 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43930.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41463
[https://github.com/apache/spark/pull/41463]

> Add unix_* functions to Scala and Python
> 
>
> Key: SPARK-43930
> URL: https://issues.apache.org/jira/browse/SPARK-43930
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: BingKun Pan
>Priority: Major
> Fix For: 3.5.0
>
>
> Add following functions:
> * unix_date
> * unix_micros
> * unix_millis
> * unix_seconds
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client






[jira] [Assigned] (SPARK-43356) Migrate deprecated createOrReplace to serverSideApply

2023-06-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43356:
-

Assignee: Cheng Pan

> Migrate deprecated createOrReplace to serverSideApply
> -
>
> Key: SPARK-43356
> URL: https://issues.apache.org/jira/browse/SPARK-43356
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>
> {code:java}
> public interface CreateOrReplaceable extends Replaceable {
>   /**
>    * Creates a provided resource in a Kubernetes Cluster. If creation
>    * fails with a HTTP_CONFLICT, it tries to replace resource.
>    *
>    * @return created item returned in kubernetes api response
>    *
>    * @deprecated please use {@link ServerSideApplicable#serverSideApply()}
>    * or attempt a create and edit/patch operation.
>    */
>   @Deprecated
>   T createOrReplace();
>
>   /**
>    * Creates an item
>    *
>    * @return the item from the api server
>    */
>   T create();
> }
> {code}
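
A hedged migration sketch against the fabric8 client API quoted above; the
resource construction is illustrative:

{code:scala}
import io.fabric8.kubernetes.api.model.PodBuilder
import io.fabric8.kubernetes.client.KubernetesClientBuilder

val client = new KubernetesClientBuilder().build()
val pod = new PodBuilder()
  .withNewMetadata().withName("spark-driver").endMetadata()
  .build()

// Deprecated: create, then replace on HTTP_CONFLICT.
client.pods().inNamespace("spark").resource(pod).createOrReplace()

// Preferred: a single server-side apply; the API server merges the fields.
client.pods().inNamespace("spark").resource(pod).serverSideApply()
{code}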






[jira] [Resolved] (SPARK-43356) Migrate deprecated createOrReplace to serverSideApply

2023-06-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43356.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41136
[https://github.com/apache/spark/pull/41136]

> Migrate deprecated createOrReplace to serverSideApply
> -
>
> Key: SPARK-43356
> URL: https://issues.apache.org/jira/browse/SPARK-43356
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.5.0
>
>
> {code:java}
> public interface CreateOrReplaceable extends Replaceable {
>   /**
>    * Creates a provided resource in a Kubernetes Cluster. If creation
>    * fails with a HTTP_CONFLICT, it tries to replace resource.
>    *
>    * @return created item returned in kubernetes api response
>    *
>    * @deprecated please use {@link ServerSideApplicable#serverSideApply()}
>    * or attempt a create and edit/patch operation.
>    */
>   @Deprecated
>   T createOrReplace();
>
>   /**
>    * Creates an item
>    *
>    * @return the item from the api server
>    */
>   T create();
> }
> {code}






[jira] [Resolved] (SPARK-43906) Implement the file support in SparkSession.addArtifacts

2023-06-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43906.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41415
[https://github.com/apache/spark/pull/41415]

> Implement the file support in SparkSession.addArtifacts
> ---
>
> Key: SPARK-43906
> URL: https://issues.apache.org/jira/browse/SPARK-43906
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.5.0
>
>
> Related to SPARK-42748, SPARK-43747 and SPARK-43612. We should also make 
> SparkSession.addArtifacts work with regular files.






[jira] [Assigned] (SPARK-43906) Implement the file support in SparkSession.addArtifacts

2023-06-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43906:


Assignee: Hyukjin Kwon

> Implement the file support in SparkSession.addArtifacts
> ---
>
> Key: SPARK-43906
> URL: https://issues.apache.org/jira/browse/SPARK-43906
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Related to SPARK-42748, SPARK-43747 and SPARK-43612. We should also make 
> SparkSession.addArtifacts work with regular files.






[jira] [Assigned] (SPARK-43970) Hide unsupported dataframe methods from auto-completion

2023-06-06 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43970:
-

Assignee: Ruifeng Zheng

> Hide unsupported dataframe methods from auto-completion
> ---
>
> Key: SPARK-43970
> URL: https://issues.apache.org/jira/browse/SPARK-43970
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Resolved] (SPARK-43970) Hide unsupported dataframe methods from auto-completion

2023-06-06 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43970.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41462
[https://github.com/apache/spark/pull/41462]

> Hide unsupported dataframe methods from auto-completion
> ---
>
> Key: SPARK-43970
> URL: https://issues.apache.org/jira/browse/SPARK-43970
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Created] (SPARK-43986) Add error classes for HyperLogLog functions

2023-06-06 Thread Daniel (Jira)
Daniel created SPARK-43986:
--

 Summary: Add error classes for HyperLogLog functions
 Key: SPARK-43986
 URL: https://issues.apache.org/jira/browse/SPARK-43986
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Daniel









[jira] [Assigned] (SPARK-43893) StructType input/output support in Arrow-optimized Python UDF

2023-06-06 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng reassigned SPARK-43893:


Assignee: Xinrong Meng

> StructType input/output support in Arrow-optimized Python UDF
> -
>
> Key: SPARK-43893
> URL: https://issues.apache.org/jira/browse/SPARK-43893
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>







[jira] [Resolved] (SPARK-43893) StructType input/output support in Arrow-optimized Python UDF

2023-06-06 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-43893.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41321
[https://github.com/apache/spark/pull/41321]

> StructType input/output support in Arrow-optimized Python UDF
> -
>
> Key: SPARK-43893
> URL: https://issues.apache.org/jira/browse/SPARK-43893
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Comment Edited] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2023-06-06 Thread Zach Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729825#comment-17729825
 ] 

Zach Liu edited comment on SPARK-36277 at 6/6/23 6:55 PM:
--

I see the same behavior on Spark 3.3.1. I have to create this "checkpoint":
{code:java}
spark.conf.set(
"spark.sql.optimizer.excludedRules",
    "org.apache.spark.sql.catalyst.optimizer.ColumnPruning",
)
true_count = df.count()
spark.conf.set("spark.sql.optimizer.excludedRules", "null")
all_count = df.count()
malformed_count = all_count - true_count
if malformed_count > 0:
    raise ValueError("Self-defined schema is not compatible with the data") 
{code}
[~fchen] I don't know if disabling `ColumnPruning` has other implications, so I 
just re-enable it.


was (Author: zach liu):
I see the same behavior on Spark 3.3.1. I have to create this "checkpoint":
{code:java}
spark.sql("set 
spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ColumnPruning")
true_count = df.count()
spark.sql("set spark.sql.optimizer.excludedRules=null")
all_count = df.count()
malformed_count = all_count - true_count
if malformed_count > 0:
    raise ValueError("Self-defined schema is not compatible with the data") 
{code}
[~fchen] I don't know if disabling `ColumnPruning` has other implications, so I 
just re-enable it.

> Issue with record count of data frame while reading in DropMalformed mode
> -
>
> Key: SPARK-36277
> URL: https://issues.apache.org/jira/browse/SPARK-36277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: anju
>Priority: Major
> Attachments: 111.PNG, Inputfile.PNG, sample.csv
>
>
> I am writing the steps to reproduce the issue with the "count" pyspark api 
> while using mode as dropmalformed.
> I have a sample csv file in an s3 bucket. I am reading the file using the 
> pyspark csv api, both "without schema" and "with schema using mode 
> 'dropmalformed'", into two different dataframes. When displaying the "with 
> schema using mode 'dropmalformed'" dataframe, the display looks good: it 
> does not show the malformed records. But when we apply the count api on the 
> dataframe, it gives the record count of the actual file. I am expecting it 
> to give the valid record count.
> Here is the code used:
> {code}
> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
> without_schema_df = spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv", header=True)
> schema = StructType([
>     StructField("firstname", StringType(), True),
>     StructField("middlename", StringType(), True),
>     StructField("lastname", StringType(), True),
>     StructField("id", StringType(), True),
>     StructField("gender", StringType(), True),
>     StructField("salary", IntegerType(), True),
> ])
> with_schema_df = spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv", header=True, schema=schema, mode="DROPMALFORMED")
> print("The dataframe with schema")
> with_schema_df.show()
> print("The dataframe without schema")
> without_schema_df.show()
> cnt_with_schema = with_schema_df.count()
> print("The records count from with schema df: " + str(cnt_with_schema))
> cnt_without_schema = without_schema_df.count()
> print("The records count from without schema df: " + str(cnt_without_schema))
> {code}
> Here is the output screenshot: 111.PNG shows the output of the code, and 
> inputfile.csv is the input to the code.
>  
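
A hedged note on the mechanism behind the workaround above: count() lets the
optimizer prune every column, so the CSV parser never materializes the fields
whose parse failures would mark a row malformed, and DROPMALFORMED then drops
nothing. If that reading is right, disabling CSV column pruning specifically,
rather than the whole ColumnPruning rule, should have the same effect:

{code:scala}
// Narrower alternative to excluding the ColumnPruning optimizer rule:
// force the CSV parser to parse all columns even for count()-style plans.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")
val cnt = with_schema_df.count()  // malformed rows are now dropped
{code}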






[jira] [Comment Edited] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2023-06-06 Thread Zach Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729825#comment-17729825
 ] 

Zach Liu edited comment on SPARK-36277 at 6/6/23 6:44 PM:
--

I see the same behavior on Spark 3.3.1. I have to create this "checkpoint":
{code:java}
spark.sql("set 
spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ColumnPruning")
true_count = df.count()
spark.sql("set spark.sql.optimizer.excludedRules=null")
all_count = df.count()
malformed_count = all_count - true_count
if malformed_count > 0:
    raise ValueError("Self-defined schema is not compatible with the data") 
{code}
[~fchen] I don't know if disabling `ColumnPruning` has other implications, so I 
just re-enable it.


was (Author: zach liu):
I see the same behavior on Spark 3.3.1. I have to create this "checkpoint":
{code:java}
spark.sql("set 
spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ColumnPruning")
true_count = df.count()
spark.sql("set spark.sql.optimizer.excludedRules=null")
all_count = df.count()
malformed_count = all_count - true_count
if malformed_count > 0:
    raise ValueError("Self-defined schema is not compatible with the data") 
{code}
I don't know if disabling `ColumnPruning` has other implications, so I just 
re-enable it.

> Issue with record count of data frame while reading in DropMalformed mode
> -
>
> Key: SPARK-36277
> URL: https://issues.apache.org/jira/browse/SPARK-36277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: anju
>Priority: Major
> Attachments: 111.PNG, Inputfile.PNG, sample.csv
>
>
> I am writing the steps to reproduce the issue with the "count" pyspark api 
> while using mode as dropmalformed.
> I have a sample csv file in an s3 bucket. I am reading the file using the 
> pyspark csv api, both "without schema" and "with schema using mode 
> 'dropmalformed'", into two different dataframes. When displaying the "with 
> schema using mode 'dropmalformed'" dataframe, the display looks good: it 
> does not show the malformed records. But when we apply the count api on the 
> dataframe, it gives the record count of the actual file. I am expecting it 
> to give the valid record count.
> Here is the code used:
> {code}
> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
> without_schema_df = spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv", header=True)
> schema = StructType([
>     StructField("firstname", StringType(), True),
>     StructField("middlename", StringType(), True),
>     StructField("lastname", StringType(), True),
>     StructField("id", StringType(), True),
>     StructField("gender", StringType(), True),
>     StructField("salary", IntegerType(), True),
> ])
> with_schema_df = spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv", header=True, schema=schema, mode="DROPMALFORMED")
> print("The dataframe with schema")
> with_schema_df.show()
> print("The dataframe without schema")
> without_schema_df.show()
> cnt_with_schema = with_schema_df.count()
> print("The records count from with schema df: " + str(cnt_with_schema))
> cnt_without_schema = without_schema_df.count()
> print("The records count from without schema df: " + str(cnt_without_schema))
> {code}
> Here is the output screenshot: 111.PNG shows the output of the code, and 
> inputfile.csv is the input to the code.
>  






[jira] [Comment Edited] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2023-06-06 Thread Zach Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729825#comment-17729825
 ] 

Zach Liu edited comment on SPARK-36277 at 6/6/23 6:41 PM:
--

I see the same behavior on Spark 3.3.1. I have to create this "checkpoint":
{code:java}
spark.sql("set 
spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ColumnPruning")
true_count = df.count()
spark.sql("set spark.sql.optimizer.excludedRules=null")
all_count = df.count()
malformed_count = all_count - true_count
if malformed_count > 0:
    raise ValueError("Self-defined schema is not compatible with the data") 
{code}
I don't know if disabling `ColumnPruning` has other implications, so I just 
re-enable it.


was (Author: zach liu):
I see the same behavior on Spark 3.3.1. I have to create this "checkpoint":

 

```

spark.sql("set 
spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ColumnPruning")

true_count = df.count()

spark.sql("set spark.sql.optimizer.excludedRules=null")

all_count = df.count()

malformed_count = all_count - true_count

if malformed_count > 0:

    raise ValueError("Self-defined schema is not compatible with the data")

```

I don't know if disabling `ColumnPruning` has other implications, so I just 
re-enable it.

> Issue with record count of data frame while reading in DropMalformed mode
> -
>
> Key: SPARK-36277
> URL: https://issues.apache.org/jira/browse/SPARK-36277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: anju
>Priority: Major
> Attachments: 111.PNG, Inputfile.PNG, sample.csv
>
>
> I am writing the steps to reproduce the issue with the "count" pyspark api 
> while using mode as dropmalformed.
> I have a sample csv file in an s3 bucket. I am reading the file using the 
> pyspark csv api, both "without schema" and "with schema using mode 
> 'dropmalformed'", into two different dataframes. When displaying the "with 
> schema using mode 'dropmalformed'" dataframe, the display looks good: it 
> does not show the malformed records. But when we apply the count api on the 
> dataframe, it gives the record count of the actual file. I am expecting it 
> to give the valid record count.
> Here is the code used:
> {code}
> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
> without_schema_df = spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv", header=True)
> schema = StructType([
>     StructField("firstname", StringType(), True),
>     StructField("middlename", StringType(), True),
>     StructField("lastname", StringType(), True),
>     StructField("id", StringType(), True),
>     StructField("gender", StringType(), True),
>     StructField("salary", IntegerType(), True),
> ])
> with_schema_df = spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv", header=True, schema=schema, mode="DROPMALFORMED")
> print("The dataframe with schema")
> with_schema_df.show()
> print("The dataframe without schema")
> without_schema_df.show()
> cnt_with_schema = with_schema_df.count()
> print("The records count from with schema df: " + str(cnt_with_schema))
> cnt_without_schema = without_schema_df.count()
> print("The records count from without schema df: " + str(cnt_without_schema))
> {code}
> Here is the output screenshot: 111.PNG shows the output of the code, and 
> inputfile.csv is the input to the code.
>  






[jira] [Commented] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2023-06-06 Thread Zach Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729825#comment-17729825
 ] 

Zach Liu commented on SPARK-36277:
--

I see the same behavior on Spark 3.3.1. I have to create this "checkpoint":

 

```

spark.sql("set 
spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ColumnPruning")

true_count = df.count()

spark.sql("set spark.sql.optimizer.excludedRules=null")

all_count = df.count()

malformed_count = all_count - true_count

if malformed_count > 0:

    raise ValueError("Self-defined schema is not compatible with the data")

```

I don't know if disabling `ColumnPruning` has other implications, so I just 
re-enable it.

> Issue with record count of data frame while reading in DropMalformed mode
> -
>
> Key: SPARK-36277
> URL: https://issues.apache.org/jira/browse/SPARK-36277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: anju
>Priority: Major
> Attachments: 111.PNG, Inputfile.PNG, sample.csv
>
>
> I am writing the steps to reproduce the issue with the "count" pyspark api 
> while using mode as dropmalformed.
> I have a sample csv file in an s3 bucket. I am reading the file using the 
> pyspark csv api, both "without schema" and "with schema using mode 
> 'dropmalformed'", into two different dataframes. When displaying the "with 
> schema using mode 'dropmalformed'" dataframe, the display looks good: it 
> does not show the malformed records. But when we apply the count api on the 
> dataframe, it gives the record count of the actual file. I am expecting it 
> to give the valid record count.
> Here is the code used:
> {code}
> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
> without_schema_df = spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv", header=True)
> schema = StructType([
>     StructField("firstname", StringType(), True),
>     StructField("middlename", StringType(), True),
>     StructField("lastname", StringType(), True),
>     StructField("id", StringType(), True),
>     StructField("gender", StringType(), True),
>     StructField("salary", IntegerType(), True),
> ])
> with_schema_df = spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv", header=True, schema=schema, mode="DROPMALFORMED")
> print("The dataframe with schema")
> with_schema_df.show()
> print("The dataframe without schema")
> without_schema_df.show()
> cnt_with_schema = with_schema_df.count()
> print("The records count from with schema df: " + str(cnt_with_schema))
> cnt_without_schema = without_schema_df.count()
> print("The records count from without schema df: " + str(cnt_without_schema))
> {code}
> Here is the output screenshot: 111.PNG shows the output of the code, and 
> inputfile.csv is the input to the code.
>  






[jira] [Assigned] (SPARK-43959) Make RowLevelOperationSuiteBase and AlignAssignmentsSuite abstract

2023-06-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43959:
-

Assignee: Anton Okolnychyi

> Make RowLevelOperationSuiteBase and AlignAssignmentsSuite abstract
> --
>
> Key: SPARK-43959
> URL: https://issues.apache.org/jira/browse/SPARK-43959
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
>
> Make RowLevelOperationSuiteBase and AlignAssignmentsSuite abstract.






[jira] [Resolved] (SPARK-43959) Make RowLevelOperationSuiteBase and AlignAssignmentsSuite abstract

2023-06-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43959.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41449
[https://github.com/apache/spark/pull/41449]

> Make RowLevelOperationSuiteBase and AlignAssignmentsSuite abstract
> --
>
> Key: SPARK-43959
> URL: https://issues.apache.org/jira/browse/SPARK-43959
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 3.5.0
>
>
> Make RowLevelOperationSuiteBase and AlignAssignmentsSuite abstract.






[jira] [Assigned] (SPARK-43976) Handle the case where modifiedConfigs doesn't exist in event logs

2023-06-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-43976:
-

Assignee: Dongjoon Hyun

> Handle the case where modifiedConfigs doesn't exist in event logs
> -
>
> Key: SPARK-43976
> URL: https://issues.apache.org/jira/browse/SPARK-43976
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>







[jira] [Resolved] (SPARK-43976) Handle the case where modifiedConfigs doesn't exist in event logs

2023-06-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-43976.
---
Fix Version/s: 3.3.3
   3.5.0
   3.4.1
   Resolution: Fixed

Issue resolved by pull request 41472
[https://github.com/apache/spark/pull/41472]

> Handle the case where modifiedConfigs doesn't exist in event logs
> -
>
> Key: SPARK-43976
> URL: https://issues.apache.org/jira/browse/SPARK-43976
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.3, 3.5.0, 3.4.1
>
>







[jira] [Resolved] (SPARK-43919) Extract JSON functionality out of Row

2023-06-06 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-43919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-43919.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> Extract JSON functionality out of Row
> --
>
> Key: SPARK-43919
> URL: https://issues.apache.org/jira/browse/SPARK-43919
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Created] (SPARK-43984) Change to use foreach when map doesn't produce results

2023-06-06 Thread Yang Jie (Jira)
Yang Jie created SPARK-43984:


 Summary: Change to use foreach when map doesn't produce results
 Key: SPARK-43984
 URL: https://issues.apache.org/jira/browse/SPARK-43984
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Yang Jie


Seq(1, 2).map(println) -> Seq(1, 2).foreach(println)
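
A brief illustration of the intent: when the result is discarded, foreach
avoids building a throwaway collection and states the side effect explicitly.

{code:scala}
Seq(1, 2).map(println)      // builds and discards Seq((), ())
Seq(1, 2).foreach(println)  // performs the side effect only
{code}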






[jira] [Commented] (SPARK-43980) Add support for EXCEPT in select clause, similar to what databricks provides

2023-06-06 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729773#comment-17729773
 ] 

Yuming Wang commented on SPARK-43980:
-

Spark SQL currently supports regex column specification, similar to EXCEPT:
https://github.com/apache/spark/blob/2cbfc975ba937a4eb761de7a6473b7747941f386/sql/core/src/test/resources/sql-tests/inputs/query_regex_column.sql#L19-L33
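
For illustration, a hedged sketch of that workaround; the config and the
regex trick follow the linked test file:

{code:scala}
// Quoted identifiers are treated as regexes over column names.
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")

spark.range(1).selectExpr("id AS col1", "id AS col2", "id AS col3")
  .createOrReplaceTempView("t")

// Select every column except col1, a regex emulation of
// SELECT * EXCEPT (col1):
spark.sql("SELECT `(col1)?+.+` FROM t").show()  // => col2, col3
{code}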

> Add support for EXCEPT in select clause, similar to what databricks provides
> 
>
> Key: SPARK-43980
> URL: https://issues.apache.org/jira/browse/SPARK-43980
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yash Kothari
>Priority: Major
>
> I'm looking for a way to incorporate the {{select * except(col1, ...)}} 
> clause provided by Databricks into my workflow. I don't use Databricks and 
> would like to introduce this {{select except}} clause either as a 
> spark-package or by contributing a change to Spark.
> However, I'm unsure about how to begin this process and would appreciate any 
> guidance from the community.
> [https://docs.databricks.com/sql/language-manual/sql-ref-syntax-qry-select.html#examples]
>  
> Thank you for your assistance.






[jira] [Assigned] (SPARK-43977) bad case of connect-jvm-client-mima-check

2023-06-06 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-43977:


Assignee: Yang Jie

> bad case of connect-jvm-client-mima-check
> -
>
> Key: SPARK-43977
> URL: https://issues.apache.org/jira/browse/SPARK-43977
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> run 
> ```
> build/sbt "protobuf/clean"
> dev/connect-jvm-client-mima-check
> ```
> {code:java}
> Using SPARK_LOCAL_IP=localhost
> Using SPARK_LOCAL_IP=localhost
> Do connect-client-jvm module mima check ...
> Failed to find the jar: spark-protobuf-assembly(.*).jar or 
> spark-protobuf(.*)3.5.0-SNAPSHOT.jar inside folder: 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/connector/protobuf/target. 
> This file can be generated by similar to the following command: build/sbt 
> package|assembly
> finish connect-client-jvm module mima check ...
> connect-client-jvm module mima check passed.
>  {code}
> The check result is wrong: the output contains both an error message and a 
> successful check result.
>  






[jira] [Resolved] (SPARK-43977) bad case of connect-jvm-client-mima-check

2023-06-06 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-43977.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41473
[https://github.com/apache/spark/pull/41473]

> bad case of connect-jvm-client-mima-check
> -
>
> Key: SPARK-43977
> URL: https://issues.apache.org/jira/browse/SPARK-43977
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> run 
> ```
> build/sbt "protobuf/clean"
> dev/connect-jvm-client-mima-check
> ```
> {code:java}
> Using SPARK_LOCAL_IP=localhost
> Using SPARK_LOCAL_IP=localhost
> Do connect-client-jvm module mima check ...
> Failed to find the jar: spark-protobuf-assembly(.*).jar or 
> spark-protobuf(.*)3.5.0-SNAPSHOT.jar inside folder: 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/connector/protobuf/target. 
> This file can be generated by similar to the following command: build/sbt 
> package|assembly
> finish connect-client-jvm module mima check ...
> connect-client-jvm module mima check passed.
>  {code}
> The check result is wrong: the output contains both an error message and a
> successful check.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43939) Add try_* functions to Scala and Python

2023-06-06 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729752#comment-17729752
 ] 

BingKun Pan commented on SPARK-43939:
-

I'll work on it.

> Add try_* functions to Scala and Python
> ---
>
> Key: SPARK-43939
> URL: https://issues.apache.org/jira/browse/SPARK-43939
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Add following functions:
> * try_add
> * try_avg
> * try_divide
> * try_element_at
> * try_multiply
> * try_subtract
> * try_sum
> * try_to_binary
> * try_to_number
> * try_to_timestamp
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client
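
For context, most of these already exist as SQL functions, so they can be reached from Scala today through SQL expressions; a minimal sketch of that, pending the dedicated API bindings this ticket tracks:

{code:scala}
// try_divide returns NULL instead of raising an error on division by zero
// (under ANSI mode the plain / operator would throw).
val df = spark.range(5).selectExpr("id", "try_divide(id, id - 2) AS ratio")
df.show() // ratio is NULL for the row where id - 2 = 0
{code}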



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43097) Implement pyspark ML logistic regression estimator on top of torch distributor

2023-06-06 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu resolved SPARK-43097.

Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41383
[https://github.com/apache/spark/pull/41383]

> Implement pyspark ML logistic regression estimator on top of torch distributor
> --
>
> Key: SPARK-43097
> URL: https://issues.apache.org/jira/browse/SPARK-43097
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43510) Spark application hangs when YarnAllocator adds running executors after processing completed containers

2023-06-06 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-43510.
---
Fix Version/s: 3.4.1
   3.5.0
 Assignee: Manu Zhang
   Resolution: Fixed

> Spark application hangs when YarnAllocator adds running executors after 
> processing completed containers
> ---
>
> Key: SPARK-43510
> URL: https://issues.apache.org/jira/browse/SPARK-43510
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.4.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Major
> Fix For: 3.4.1, 3.5.0
>
>
> I see the application hang when containers are preempted immediately after
> allocation, as follows.
> {code:java}
> 23/05/14 09:11:33 INFO YarnAllocator: Launching container 
> container_e3812_1684033797982_57865_01_000382 on host 
> hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with 
> ID 277 for ResourceProfile Id 0 
> 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
> container_e3812_1684033797982_57865_01_000382
> 23/05/14 09:11:33 INFO YarnAllocator: Completed container 
> container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
> -102)
> 23/05/14 09:11:33 INFO YarnAllocator: Container 
> container_e3812_1684033797982_57865_01_000382 was preempted.{code}
> Note the warning log where YarnAllocator cannot find the executorId for the
> container when processing completed containers. The only plausible cause is
> that YarnAllocator added the running executor after it had processed the
> completed containers; that registration happens in a separate thread after
> executor launch.
> YarnAllocator believes there are still running executors, although they are 
> already lost due to preemption. Hence, the application hangs without any 
> running executors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43907) Add SQL functions into Scala, Python and R API

2023-06-06 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729735#comment-17729735
 ] 

Yang Jie commented on SPARK-43907:
--

[~ivoson] feel free to pick up any of these you're interested in.

> Add SQL functions into Scala, Python and R API
> --
>
> Key: SPARK-43907
> URL: https://issues.apache.org/jira/browse/SPARK-43907
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SparkR, SQL
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See the discussion on the dev mailing list
> (https://lists.apache.org/thread/0tdcfyzxzcv8w46qbgwys2rormhdgyqg).
> This is an umbrella JIRA to implement all SQL functions in Scala, Python and R.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43982) Implement pipeline estimator

2023-06-06 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-43982:
--

Assignee: Weichen Xu

> Implement pipeline estimator
> 
>
> Key: SPARK-43982
> URL: https://issues.apache.org/jira/browse/SPARK-43982
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43982) Implement pipeline estimator

2023-06-06 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-43982:
--

 Summary: Implement pipeline estimator
 Key: SPARK-43982
 URL: https://issues.apache.org/jira/browse/SPARK-43982
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, ML, PySpark
Affects Versions: 3.5.0
Reporter: Weichen Xu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43983) Implement cross validator estimator

2023-06-06 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-43983:
--

Assignee: Weichen Xu

> Implement cross validator estimator
> ---
>
> Key: SPARK-43983
> URL: https://issues.apache.org/jira/browse/SPARK-43983
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43983) Implement cross validator estimator

2023-06-06 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-43983:
--

 Summary: Implement cross validator estimator
 Key: SPARK-43983
 URL: https://issues.apache.org/jira/browse/SPARK-43983
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, ML, PySpark
Affects Versions: 3.5.0
Reporter: Weichen Xu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43981) Basic saving / loading implementation

2023-06-06 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-43981:
---
Description: 
Support saving/loading for estimators / transformers / evaluators / models.

We have two design goals:
 * The model format is decoupled from Spark, i.e. we can run model inference
without a Spark service.
 * We can save models to either the local file system or a cloud storage file
system.

> Basic saving / loading implementation
> -
>
> Key: SPARK-43981
> URL: https://issues.apache.org/jira/browse/SPARK-43981
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> Support saving/loading for estimators / transformers / evaluators / models.
> We have two design goals:
>  * The model format is decoupled from Spark, i.e. we can run model inference
> without a Spark service.
>  * We can save models to either the local file system or a cloud storage file
> system.
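
A hedged sketch of what the second goal implies, using the standard Hadoop FileSystem API (the helper and paths are illustrative, not the actual implementation):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// The same save code targets local disk or cloud storage: the concrete
// FileSystem implementation is resolved from the URI scheme of the path.
def saveModelBytes(bytes: Array[Byte], uri: String): Unit = {
  val path = new Path(uri) // e.g. "file:/tmp/model.bin" or "s3a://bucket/model.bin"
  val fs = path.getFileSystem(new Configuration())
  val out = fs.create(path, true) // overwrite = true
  try out.write(bytes) finally out.close()
}
{code}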



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43981) Basic saving / loading implementation

2023-06-06 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-43981:
--

 Summary: Basic saving / loading implementation
 Key: SPARK-43981
 URL: https://issues.apache.org/jira/browse/SPARK-43981
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Weichen Xu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43981) Basic saving / loading implementation

2023-06-06 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-43981:
--

Assignee: Weichen Xu

> Basic saving / loading implementation
> -
>
> Key: SPARK-43981
> URL: https://issues.apache.org/jira/browse/SPARK-43981
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43981) Basic saving / loading implementation

2023-06-06 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-43981:
---
Component/s: Connect
 ML

> Basic saving / loading implementation
> -
>
> Key: SPARK-43981
> URL: https://issues.apache.org/jira/browse/SPARK-43981
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43980) Add support for EXCEPT in select clause, similar to what databricks provides

2023-06-06 Thread Yash Kothari (Jira)
Yash Kothari created SPARK-43980:


 Summary: Add support for EXCEPT in select clause, similar to what 
databricks provides
 Key: SPARK-43980
 URL: https://issues.apache.org/jira/browse/SPARK-43980
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yash Kothari


I'm looking for a way to incorporate the {{select * except(col1, ...)}} clause 
provided by Databricks into my workflow. I don't use Databricks and would like 
to introduce this {{select except}} clause either as a spark-package or by 
contributing a change to Spark.

However, I'm unsure about how to begin this process and would appreciate any 
guidance from the community.

[https://docs.databricks.com/sql/language-manual/sql-ref-syntax-qry-select.html#examples]

 

Thank you for your assistance.
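
For reference, a sketch of the requested syntax next to the closest existing alternative (table and column names are hypothetical):

{code:scala}
// Requested: Databricks SQL accepts this; Apache Spark 3.4 does not.
//   SELECT * EXCEPT (col1, col2) FROM events

// Closest DataFrame equivalent available in Spark today:
spark.table("events").drop("col1", "col2")
{code}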



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43914) Assign names to the error class _LEGACY_ERROR_TEMP_[2433-2437]

2023-06-06 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-43914:
---
Summary: Assign names to the error class _LEGACY_ERROR_TEMP_[2433-2437]  
(was: Assign a name to the error class _LEGACY_ERROR_TEMP_2427)

> Assign names to the error class _LEGACY_ERROR_TEMP_[2433-2437]
> --
>
> Key: SPARK-43914
> URL: https://issues.apache.org/jira/browse/SPARK-43914
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43715) Add spark DataFrame binary file format writer

2023-06-06 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu resolved SPARK-43715.

Resolution: Won't Do

> Add spark DataFrame binary file format writer
> -
>
> Key: SPARK-43715
> URL: https://issues.apache.org/jira/browse/SPARK-43715
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> In the new distributed Spark ML module (designed to support Spark Connect and
> local inference), we need to save ML models to a Hadoop file system using a
> custom binary file format, for two reasons:
>  * We often submit a Spark application to a Spark cluster to run the model
> training job, and we need to save the trained model to a Hadoop file system
> before the Spark application completes.
>  * We also want to support local model inference. If we save the model with
> the current Spark DataFrame writers (e.g. Parquet format), loading it requires
> the Spark service, but we want to load models without one. So the model should
> be saved in the original binary format that our ML code can handle.
> We already have a reader API for the "binaryFile" format; we need to add a
> writer API:
> {*}Writer API{*}:
> Supposing we have a DataFrame with schema [file_path: String, content: binary],
> we can save the DataFrame to a Hadoop path, writing each row as a file under
> that path. The saved file path is \{hadoop path}/\{file_path}, where
> "file_path" can be a multi-part path.
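
A hypothetical sketch of the proposed round trip, mirroring the existing reader (write-side "binaryFile" support is the proposal itself; it was never implemented, since the issue was resolved as Won't Do):

{code:scala}
// Existing reader: each file becomes a row
// [path, modificationTime, length, content].
val files = spark.read.format("binaryFile").load("/input/models")

// Proposed writer (hypothetical): each row [file_path, content] would be
// written out as {output path}/{file_path}.
files
  .selectExpr("path AS file_path", "content")
  .write
  .format("binaryFile")
  .save("hdfs://namenode/output/models")
{code}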



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43913) Assign names to the error class _LEGACY_ERROR_TEMP_[2426-2432]

2023-06-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43913.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41424
[https://github.com/apache/spark/pull/41424]

> Assign names to the error class _LEGACY_ERROR_TEMP_[2426-2432]
> --
>
> Key: SPARK-43913
> URL: https://issues.apache.org/jira/browse/SPARK-43913
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43913) Assign names to the error class _LEGACY_ERROR_TEMP_[2426-2432]

2023-06-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43913:


Assignee: jiaan.geng

> Assign names to the error class _LEGACY_ERROR_TEMP_[2426-2432]
> --
>
> Key: SPARK-43913
> URL: https://issues.apache.org/jira/browse/SPARK-43913
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43962) Improve error messages: CANNOT_DECODE_URL, CANNOT_MERGE_INCOMPATIBLE_DATA_TYPE, CANNOT_PARSE_DECIMAL, CANNOT_READ_FILE_FOOTER, CANNOT_RECOGNIZE_HIVE_TYPE.

2023-06-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43962.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41455
[https://github.com/apache/spark/pull/41455]

> Improve error messages: CANNOT_DECODE_URL, 
> CANNOT_MERGE_INCOMPATIBLE_DATA_TYPE, CANNOT_PARSE_DECIMAL, 
> CANNOT_READ_FILE_FOOTER, CANNOT_RECOGNIZE_HIVE_TYPE.
> --
>
> Key: SPARK-43962
> URL: https://issues.apache.org/jira/browse/SPARK-43962
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Improve error message for usability.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43962) Improve error messages: CANNOT_DECODE_URL, CANNOT_MERGE_INCOMPATIBLE_DATA_TYPE, CANNOT_PARSE_DECIMAL, CANNOT_READ_FILE_FOOTER, CANNOT_RECOGNIZE_HIVE_TYPE.

2023-06-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43962:


Assignee: Haejoon Lee

> Improve error messages: CANNOT_DECODE_URL, 
> CANNOT_MERGE_INCOMPATIBLE_DATA_TYPE, CANNOT_PARSE_DECIMAL, 
> CANNOT_READ_FILE_FOOTER, CANNOT_RECOGNIZE_HIVE_TYPE.
> --
>
> Key: SPARK-43962
> URL: https://issues.apache.org/jira/browse/SPARK-43962
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Improve error message for usability.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43948) Assign names to the error class _LEGACY_ERROR_TEMP_[0050|0057|0058|0059]

2023-06-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43948:


Assignee: BingKun Pan

> Assign names to the error class _LEGACY_ERROR_TEMP_[0050|0057|0058|0059]
> 
>
> Key: SPARK-43948
> URL: https://issues.apache.org/jira/browse/SPARK-43948
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>
> _LEGACY_ERROR_TEMP_0050 => LOCAL_MUST_WITH_SCHEMA_FILE
> _LEGACY_ERROR_TEMP_0057 => UNSUPPORTED_DEFAULT_VALUE.WITHOUT_SUGGESTION
> _LEGACY_ERROR_TEMP_0058 => UNSUPPORTED_DEFAULT_VALUE.WITH_SUGGESTION
> _LEGACY_ERROR_TEMP_0059 => REF_DEFAULT_VALUE_IS_NOT_ALLOWED_IN_PARTITION



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43948) Assign names to the error class _LEGACY_ERROR_TEMP_[0050|0057|0058|0059]

2023-06-06 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43948.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41451
[https://github.com/apache/spark/pull/41451]

> Assign names to the error class _LEGACY_ERROR_TEMP_[0050|0057|0058|0059]
> 
>
> Key: SPARK-43948
> URL: https://issues.apache.org/jira/browse/SPARK-43948
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>
> _LEGACY_ERROR_TEMP_0050 => LOCAL_MUST_WITH_SCHEMA_FILE
> _LEGACY_ERROR_TEMP_0057 => UNSUPPORTED_DEFAULT_VALUE.WITHOUT_SUGGESTION
> _LEGACY_ERROR_TEMP_0058 => UNSUPPORTED_DEFAULT_VALUE.WITH_SUGGESTION
> _LEGACY_ERROR_TEMP_0059 => REF_DEFAULT_VALUE_IS_NOT_ALLOWED_IN_PARTITION



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43378) SerializerHelper.deserializeFromChunkedBuffer leaks deserialization streams

2023-06-06 Thread Emil Ejbyfeldt (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emil Ejbyfeldt updated SPARK-43378:
---
Summary: SerializerHelper.deserializeFromChunkedBuffer leaks 
deserialization streams  (was: SerializerHelper.deserializeFromChunkedBuffer)

> SerializerHelper.deserializeFromChunkedBuffer leaks deserialization streams
> ---
>
> Key: SPARK-43378
> URL: https://issues.apache.org/jira/browse/SPARK-43378
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.4.1, 3.5.0
>Reporter: Emil Ejbyfeldt
>Assignee: Emil Ejbyfeldt
>Priority: Major
> Fix For: 3.4.1, 3.5.0
>
>
> The method SerializerHelper.deserializeFromChunkedBuffer leaks deserialization
> streams. This can lead to large performance regressions when using the Kryo
> serializer, as the Spark application can become bottlenecked on the driver
> creating expensive Kryo objects that are then leaked along with the
> deserialization streams.
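
A minimal sketch of the leak pattern (these are Spark-internal, private[spark] classes, shown for illustration only):

{code:scala}
import scala.reflect.ClassTag

import org.apache.spark.serializer.SerializerInstance
import org.apache.spark.util.io.ChunkedByteBuffer

def deserializeFromChunkedBuffer[T: ClassTag](
    serializerInstance: SerializerInstance,
    data: ChunkedByteBuffer): T = {
  val in = serializerInstance.deserializeStream(data.toInputStream())
  // Bug: the stream is never closed, so the pooled Kryo instance it borrowed
  // is leaked and a new, expensive one must be built for subsequent calls.
  in.readObject[T]()
}
{code}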



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org