[jira] [Commented] (SPARK-40803) LZ4CompressionCodec looks up configuration on each stream creation
[ https://issues.apache.org/jira/browse/SPARK-40803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618682#comment-17618682 ] Apache Spark commented on SPARK-40803: -- User 'eejbyfeldt' has created a pull request for this issue: https://github.com/apache/spark/pull/38282 > LZ4CompressionCodec looks up configuration on each stream creation > -- > > Key: SPARK-40803 > URL: https://issues.apache.org/jira/browse/SPARK-40803 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Emil Ejbyfeldt >Priority: Major > > This lookup in SparkConf is quite expensive and shows up in profiling for > cases where lots of smaller streams are created.
[jira] [Assigned] (SPARK-40803) LZ4CompressionCodec looks up configuration on each stream creation
[ https://issues.apache.org/jira/browse/SPARK-40803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40803: Assignee: Apache Spark > LZ4CompressionCodec looks up configuration on each stream creation > -- > > Key: SPARK-40803 > URL: https://issues.apache.org/jira/browse/SPARK-40803 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Emil Ejbyfeldt >Assignee: Apache Spark >Priority: Major > > This lookup in SparkConf is quite expensive and shows up in profiling for > cases where lots of smaller streams are created.
[jira] [Assigned] (SPARK-40803) LZ4CompressionCodec looks up configuration on each stream creation
[ https://issues.apache.org/jira/browse/SPARK-40803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40803: Assignee: (was: Apache Spark) > LZ4CompressionCodec looks up configuration on each stream creation > -- > > Key: SPARK-40803 > URL: https://issues.apache.org/jira/browse/SPARK-40803 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Emil Ejbyfeldt >Priority: Major > > This lookup in SparkConf is quite expensive and shows up in profiling for > cases where lots of smaller streams are created.
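For context, the change being proposed amounts to hoisting the SparkConf lookup out of the per-stream path. Below is a minimal sketch of that pattern; it is illustrative only (the actual fix is in the pull request above), and the config key handling is a simplified assumption:

{code:scala}
import java.io.OutputStream

import net.jpountz.lz4.LZ4BlockOutputStream

import org.apache.spark.SparkConf

class LZ4CompressionCodec(conf: SparkConf) {
  // Resolve the block size once per codec instance instead of on every
  // compressedOutputStream() call, where the SparkConf lookup shows up in profiles.
  private lazy val blockSize: Int =
    conf.getSizeAsBytes("spark.io.compression.lz4.blockSize", "32k").toInt

  def compressedOutputStream(s: OutputStream): OutputStream =
    new LZ4BlockOutputStream(s, blockSize)
}
{code}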
[jira] [Assigned] (SPARK-40796) Check the generated python protos in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-40796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-40796: - Assignee: Ruifeng Zheng > Check the generated python protos in GitHub Actions > --- > > Key: SPARK-40796 > URL: https://issues.apache.org/jira/browse/SPARK-40796 > Project: Spark > Issue Type: Sub-task > Components: Build, Connect >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major
[jira] [Resolved] (SPARK-40796) Check the generated python protos in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-40796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-40796. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38253 [https://github.com/apache/spark/pull/38253] > Check the generated python protos in GitHub Actions > --- > > Key: SPARK-40796 > URL: https://issues.apache.org/jira/browse/SPARK-40796 > Project: Spark > Issue Type: Sub-task > Components: Build, Connect >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0
[jira] [Commented] (SPARK-40737) Add basic support for DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-40737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618659#comment-17618659 ] Apache Spark commented on SPARK-40737: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38281 > Add basic support for DataFrameWriter > - > > Key: SPARK-40737 > URL: https://issues.apache.org/jira/browse/SPARK-40737 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Fix For: 3.4.0 > > > A key element of using Spark Connect is going to be the ability to write data > from a logical plan.
[jira] [Assigned] (SPARK-40790) Check error classes in DDL parsing tests
[ https://issues.apache.org/jira/browse/SPARK-40790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40790: Assignee: (was: Apache Spark) > Check error classes in DDL parsing tests > > > Key: SPARK-40790 > URL: https://issues.apache.org/jira/browse/SPARK-40790 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > Labels: starter > Fix For: 3.4.0 > > > Check error classes in DDL command tests by using checkError(). For instance > - AlterNamespaceSetPropertiesParserSuite > - AlterTableDropPartitionParserSuite > - AlterTableRenameParserSuite > - AlterTableRecoverPartitionsParserSuite > - DescribeTableParserSuite > - TruncateTableParserSuite > - AlterTableSetSerdeParserSuite > - ShowPartitionsParserSuite > [https://github.com/apache/spark/blob/414771d4e8b52d0a76a7729d005794dc04f1e075/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterNamespaceSetPropertiesParserSuite.scala#L43]
[jira] [Assigned] (SPARK-40790) Check error classes in DDL parsing tests
[ https://issues.apache.org/jira/browse/SPARK-40790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40790: Assignee: Apache Spark > Check error classes in DDL parsing tests > > > Key: SPARK-40790 > URL: https://issues.apache.org/jira/browse/SPARK-40790 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > Labels: starter > Fix For: 3.4.0 > > > Check error classes in DDL command tests by using checkError(). For instance > - AlterNamespaceSetPropertiesParserSuite > - AlterTableDropPartitionParserSuite > - AlterTableRenameParserSuite > - AlterTableRecoverPartitionsParserSuite > - DescribeTableParserSuite > - TruncateTableParserSuite > - AlterTableSetSerdeParserSuite > - ShowPartitionsParserSuite > [https://github.com/apache/spark/blob/414771d4e8b52d0a76a7729d005794dc04f1e075/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterNamespaceSetPropertiesParserSuite.scala#L43]
[jira] [Commented] (SPARK-40790) Check error classes in DDL parsing tests
[ https://issues.apache.org/jira/browse/SPARK-40790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618657#comment-17618657 ] Apache Spark commented on SPARK-40790: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/38280 > Check error classes in DDL parsing tests > > > Key: SPARK-40790 > URL: https://issues.apache.org/jira/browse/SPARK-40790 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > Labels: starter > Fix For: 3.4.0 > > > Check error classes in DDL command tests by using checkError(). For instance > - AlterNamespaceSetPropertiesParserSuite > - AlterTableDropPartitionParserSuite > - AlterTableRenameParserSuite > - AlterTableRecoverPartitionsParserSuite > - DescribeTableParserSuite > - TruncateTableParserSuite > - AlterTableSetSerdeParserSuite > - ShowPartitionsParserSuite > [https://github.com/apache/spark/blob/414771d4e8b52d0a76a7729d005794dc04f1e075/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterNamespaceSetPropertiesParserSuite.scala#L43]
[jira] [Updated] (SPARK-40790) Check error classes in DDL parsing tests
[ https://issues.apache.org/jira/browse/SPARK-40790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-40790: Description: Check error classes in DDL command tests by using checkError(). For instance - AlterNamespaceSetPropertiesParserSuite - AlterTableDropPartitionParserSuite - AlterTableRenameParserSuite - AlterTableRecoverPartitionsParserSuite - DescribeTableParserSuite - TruncateTableParserSuite - AlterTableSetSerdeParserSuite - ShowPartitionsParserSuite [https://github.com/apache/spark/blob/414771d4e8b52d0a76a7729d005794dc04f1e075/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterNamespaceSetPropertiesParserSuite.scala#L43] was: Check error classes in DDL command tests by using checkError(). For instance - AlterNamespaceSetPropertiesParserSuite - AlterTableDropPartitionParserSuite - AlterTableRenameParserSuite - CreateNamespaceParserSuite - AlterTableRecoverPartitionsParserSuite - DescribeTableParserSuite - TruncateTableParserSuite - AlterTableSetSerdeParserSuite - ShowPartitionsParserSuite [https://github.com/apache/spark/blob/414771d4e8b52d0a76a7729d005794dc04f1e075/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterNamespaceSetPropertiesParserSuite.scala#L43] > Check error classes in DDL parsing tests > > > Key: SPARK-40790 > URL: https://issues.apache.org/jira/browse/SPARK-40790 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > Labels: starter > Fix For: 3.4.0 > > > Check error classes in DDL command tests by using checkError(). For instance > - AlterNamespaceSetPropertiesParserSuite > - AlterTableDropPartitionParserSuite > - AlterTableRenameParserSuite > - AlterTableRecoverPartitionsParserSuite > - DescribeTableParserSuite > - TruncateTableParserSuite > - AlterTableSetSerdeParserSuite > - ShowPartitionsParserSuite > [https://github.com/apache/spark/blob/414771d4e8b52d0a76a7729d005794dc04f1e075/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterNamespaceSetPropertiesParserSuite.scala#L43]
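For readers picking up this starter task, a converted test typically has the shape sketched below. The error class, message parameters, and offsets here are illustrative, not taken from the actual suites, and parseException/ExpectedContext are assumed to be the helpers the parser suites already rely on:

{code:scala}
test("alter namespace set properties - key without value") {
  val sql = "ALTER NAMESPACE a.b.c SET PROPERTIES('key_without_value')"
  checkError(
    exception = parseException(sql),
    // Illustrative error class and parameters; use whatever the command actually raises.
    errorClass = "_LEGACY_ERROR_TEMP_0035",
    parameters = Map("message" -> "Values must be specified for key(s): [key_without_value]"),
    // The context pins the offending SQL fragment and its 0-based start/stop offsets.
    context = ExpectedContext(fragment = sql, start = 0, stop = sql.length - 1))
}
{code}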
[jira] [Assigned] (SPARK-40816) Python: rename LogicalPlan.collect to LogicalPlan.to_proto
[ https://issues.apache.org/jira/browse/SPARK-40816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40816: Assignee: Apache Spark > Python: rename LogicalPlan.collect to LogicalPlan.to_proto > -- > > Key: SPARK-40816 > URL: https://issues.apache.org/jira/browse/SPARK-40816 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major
[jira] [Commented] (SPARK-40816) Python: rename LogicalPlan.collect to LogicalPlan.to_proto
[ https://issues.apache.org/jira/browse/SPARK-40816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618421#comment-17618421 ] Apache Spark commented on SPARK-40816: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38279 > Python: rename LogicalPlan.collect to LogicalPlan.to_proto > -- > > Key: SPARK-40816 > URL: https://issues.apache.org/jira/browse/SPARK-40816 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major
[jira] [Assigned] (SPARK-40816) Python: rename LogicalPlan.collect to LogicalPlan.to_proto
[ https://issues.apache.org/jira/browse/SPARK-40816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40816: Assignee: (was: Apache Spark) > Python: rename LogicalPlan.collect to LogicalPlan.to_proto > -- > > Key: SPARK-40816 > URL: https://issues.apache.org/jira/browse/SPARK-40816 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major
[jira] [Created] (SPARK-40816) Python: rename LogicalPlan.collect to LogicalPlan.to_proto
Rui Wang created SPARK-40816: Summary: Python: rename LogicalPlan.collect to LogicalPlan.to_proto Key: SPARK-40816 URL: https://issues.apache.org/jira/browse/SPARK-40816 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Rui Wang
[jira] [Commented] (SPARK-40780) Add WHERE to Connect proto and DSL
[ https://issues.apache.org/jira/browse/SPARK-40780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618410#comment-17618410 ] Apache Spark commented on SPARK-40780: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38278 > Add WHERE to Connect proto and DSL > -- > > Key: SPARK-40780 > URL: https://issues.apache.org/jira/browse/SPARK-40780 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0
[jira] [Commented] (SPARK-40809) Add as(alias: String) to connect DSL
[ https://issues.apache.org/jira/browse/SPARK-40809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618408#comment-17618408 ] Apache Spark commented on SPARK-40809: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38278 > Add as(alias: String) to connect DSL > > > Key: SPARK-40809 > URL: https://issues.apache.org/jira/browse/SPARK-40809 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0
[jira] [Commented] (SPARK-40815) SymlinkTextInputFormat returns incorrect result due to enabled spark.hadoopRDD.ignoreEmptySplits
[ https://issues.apache.org/jira/browse/SPARK-40815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618397#comment-17618397 ] Apache Spark commented on SPARK-40815: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/38277 > SymlinkTextInputFormat returns incorrect result due to enabled > spark.hadoopRDD.ignoreEmptySplits > > > Key: SPARK-40815 > URL: https://issues.apache.org/jira/browse/SPARK-40815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Ivan Sadikov >Priority: Major
[jira] [Assigned] (SPARK-40815) SymlinkTextInputFormat returns incorrect result due to enabled spark.hadoopRDD.ignoreEmptySplits
[ https://issues.apache.org/jira/browse/SPARK-40815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40815: Assignee: Apache Spark > SymlinkTextInputFormat returns incorrect result due to enabled > spark.hadoopRDD.ignoreEmptySplits > > > Key: SPARK-40815 > URL: https://issues.apache.org/jira/browse/SPARK-40815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Ivan Sadikov >Assignee: Apache Spark >Priority: Major
[jira] [Assigned] (SPARK-40815) SymlinkTextInputFormat returns incorrect result due to enabled spark.hadoopRDD.ignoreEmptySplits
[ https://issues.apache.org/jira/browse/SPARK-40815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40815: Assignee: (was: Apache Spark) > SymlinkTextInputFormat returns incorrect result due to enabled > spark.hadoopRDD.ignoreEmptySplits > > > Key: SPARK-40815 > URL: https://issues.apache.org/jira/browse/SPARK-40815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Ivan Sadikov >Priority: Major
[jira] [Created] (SPARK-40815) SymlinkTextInputFormat returns incorrect result due to enabled spark.hadoopRDD.ignoreEmptySplits
Ivan Sadikov created SPARK-40815: Summary: SymlinkTextInputFormat returns incorrect result due to enabled spark.hadoopRDD.ignoreEmptySplits Key: SPARK-40815 URL: https://issues.apache.org/jira/browse/SPARK-40815 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.2, 3.3.0, 3.4.0 Reporter: Ivan Sadikov
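The likely mechanism, as the summary suggests, is that SymlinkTextInputFormat reports zero-length splits for its symlink targets, so the spark.hadoopRDD.ignoreEmptySplits optimization (enabled by default since Spark 3.2) silently drops them. Assuming that diagnosis, a possible mitigation until the fix lands is to disable the flag for affected jobs:

{code:scala}
import org.apache.spark.sql.SparkSession

// Must be set before the SparkContext is created; here via the session builder.
val spark = SparkSession.builder()
  .appName("symlink-table-read")
  .config("spark.hadoopRDD.ignoreEmptySplits", "false")
  .getOrCreate()
{code}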
[jira] [Commented] (SPARK-40802) Enhance JDBC Connector to use PreparedStatement.getMetaData() to resolve schema instead of PreparedStatement.executeQuery()
[ https://issues.apache.org/jira/browse/SPARK-40802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618389#comment-17618389 ] Hyukjin Kwon commented on SPARK-40802: -- I guess the problem is that {{getMetaData}} isn't guaranteed to work in all cases or all DBMSes. We could probably introduce a dialect to optimize this further. > Enhance JDBC Connector to use PreparedStatement.getMetaData() to resolve > schema instead of PreparedStatement.executeQuery() > --- > > Key: SPARK-40802 > URL: https://issues.apache.org/jira/browse/SPARK-40802 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mingli Rui >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Currently, Spark JDBC Connector uses *PreparedStatement.executeQuery()* to > resolve the JDBCRelation's schema. The schema query is like *s"SELECT * FROM > $table_or_query WHERE 1=0".* > But it is not necessary to execute the query. It's enough to *prepare* the > query. Preparing the statement parses and compiles the query without > executing it, which is more efficient. > So, it's better to use PreparedStatement.getMetaData() to resolve the schema.
[jira] [Updated] (SPARK-40802) Enhance JDBC Connector to use PreparedStatement.getMetaData() to resolve schema instead of PreparedStatement.executeQuery()
[ https://issues.apache.org/jira/browse/SPARK-40802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40802: - Summary: Enhance JDBC Connector to use PreparedStatement.getMetaData() to resolve schema instead of PreparedStatement.executeQuery() (was: [SQL] Enhance JDBC Connector to use PreparedStatement.getMetaData() to resolve schema instead of PreparedStatement.executeQuery()) > Enhance JDBC Connector to use PreparedStatement.getMetaData() to resolve > schema instead of PreparedStatement.executeQuery() > --- > > Key: SPARK-40802 > URL: https://issues.apache.org/jira/browse/SPARK-40802 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mingli Rui >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Currently, Spark JDBC Connector uses *PreparedStatement.executeQuery()* to > resolve the JDBCRelation's schema. The schema query is like *s"SELECT * FROM > $table_or_query WHERE 1=0".* > But it is not necessary to execute the query. It's enough to *prepare* the > query. Preparing the statement parses and compiles the query without > executing it, which is more efficient. > So, it's better to use PreparedStatement.getMetaData() to resolve the schema.
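A sketch of the proposed approach, with the fallback the comment above implies (this is not Spark's actual JDBCRDD code, and statement lifecycle management is omitted for brevity):

{code:scala}
import java.sql.{Connection, ResultSetMetaData}

def resolveSchema(conn: Connection, tableOrQuery: String): ResultSetMetaData = {
  // Preparing parses and compiles the probe query without running it.
  val stmt = conn.prepareStatement(s"SELECT * FROM $tableOrQuery WHERE 1=0")
  val meta = stmt.getMetaData
  if (meta != null) {
    meta
  } else {
    // getMetaData() may return null on drivers that cannot describe an
    // unexecuted statement, so fall back to running the zero-row probe;
    // a per-dialect switch could gate which path is used.
    stmt.executeQuery().getMetaData
  }
}
{code}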
[jira] [Updated] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema
[ https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40808: - Component/s: SQL (was: Spark Core) > Infer schema for CSV files - wrong behavior using header + merge schema > --- > > Key: SPARK-40808 > URL: https://issues.apache.org/jira/browse/SPARK-40808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.2 >Reporter: ohad >Priority: Major > Labels: CSVReader, csv, csvparser > > Hello. > I am writing unit tests for some functionality in my application that reads > data from CSV files using Spark. > I am reading the data using: > {code:java} > header=True > mergeSchema=True > inferSchema=True{code} > When I am reading this single file: > {code:java} > File1: > "int_col","string_col","decimal_col","date_col" > 1,"hello",1.43,2022-02-23 > 2,"world",5.534,2021-05-05 > 3,"my name",86.455,2011-08-15 > 4,"is ohad",6.234,2002-03-22{code} > I am getting this schema: > {code:java} > int_col=int > string_col=string > decimal_col=double > date_col=string{code} > When I am duplicating this file, I am getting the same schema. > The strange part is when I add a new int column: it looks like Spark gets > confused and thinks that the columns already identified as int are now > string: > {code:java} > File1: > "int_col","string_col","decimal_col","date_col" > 1,"hello",1.43,2022-02-23 > 2,"world",5.534,2021-05-05 > 3,"my name",86.455,2011-08-15 > 4,"is ohad",6.234,2002-03-22 > File2: > "int_col","string_col","decimal_col","date_col","int2_col" > 1,"hello",1.43,2022-02-23,234 > 2,"world",5.534,2021-05-05,5 > 3,"my name",86.455,2011-08-15,32 > 4,"is ohad",6.234,2002-03-22,2 > {code} > result: > {code:java} > int_col=string > string_col=string > decimal_col=string > date_col=string > int2_col=int{code} > When I am reading only the second file, it looks fine: > {code:java} > File2: > "int_col","string_col","decimal_col","date_col","int2_col" > 1,"hello",1.43,2022-02-23,234 > 2,"world",5.534,2021-05-05,5 > 3,"my name",86.455,2011-08-15,32 > 4,"is ohad",6.234,2002-03-22,2{code} > result: > {code:java} > int_col=int > string_col=string > decimal_col=double > date_col=string > int2_col=int{code} > In conclusion, it looks like there is a bug in mixing the two features: header > recognition and merge schema.
[jira] [Commented] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema
[ https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618388#comment-17618388 ] Hyukjin Kwon commented on SPARK-40808: -- Yeah, a reproducer would be helpful to assess this ticket further. > Infer schema for CSV files - wrong behavior using header + merge schema > --- > > Key: SPARK-40808 > URL: https://issues.apache.org/jira/browse/SPARK-40808 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.2 >Reporter: ohad >Priority: Major > Labels: CSVReader, csv, csvparser > > Hello. > I am writing unit tests for some functionality in my application that reads > data from CSV files using Spark. > I am reading the data using: > {code:java} > header=True > mergeSchema=True > inferSchema=True{code} > When I am reading this single file: > {code:java} > File1: > "int_col","string_col","decimal_col","date_col" > 1,"hello",1.43,2022-02-23 > 2,"world",5.534,2021-05-05 > 3,"my name",86.455,2011-08-15 > 4,"is ohad",6.234,2002-03-22{code} > I am getting this schema: > {code:java} > int_col=int > string_col=string > decimal_col=double > date_col=string{code} > When I am duplicating this file, I am getting the same schema. > The strange part is when I add a new int column: it looks like Spark gets > confused and thinks that the columns already identified as int are now > string: > {code:java} > File1: > "int_col","string_col","decimal_col","date_col" > 1,"hello",1.43,2022-02-23 > 2,"world",5.534,2021-05-05 > 3,"my name",86.455,2011-08-15 > 4,"is ohad",6.234,2002-03-22 > File2: > "int_col","string_col","decimal_col","date_col","int2_col" > 1,"hello",1.43,2022-02-23,234 > 2,"world",5.534,2021-05-05,5 > 3,"my name",86.455,2011-08-15,32 > 4,"is ohad",6.234,2002-03-22,2 > {code} > result: > {code:java} > int_col=string > string_col=string > decimal_col=string > date_col=string > int2_col=int{code} > When I am reading only the second file, it looks fine: > {code:java} > File2: > "int_col","string_col","decimal_col","date_col","int2_col" > 1,"hello",1.43,2022-02-23,234 > 2,"world",5.534,2021-05-05,5 > 3,"my name",86.455,2011-08-15,32 > 4,"is ohad",6.234,2002-03-22,2{code} > result: > {code:java} > int_col=int > string_col=string > decimal_col=double > date_col=string > int2_col=int{code} > In conclusion, it looks like there is a bug in mixing the two features: header > recognition and merge schema.
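In the spirit of that request, a minimal reproducer only needs the two files from the description plus a read that mirrors the reporter's options. The sketch below is illustrative (the path is made up, and mergeSchema is passed through exactly as the reporter does, even though it is not a documented CSV option):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-header-merge-repro").getOrCreate()

// /tmp/csv_repro should contain file1.csv and file2.csv from the description.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("mergeSchema", "true")
  .csv("/tmp/csv_repro")

// Expected: int_col stays int and decimal_col double; the report says they flip to string.
df.printSchema()
{code}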
[jira] [Commented] (SPARK-40814) Exception in thread "main" java.lang.NoClassDefFoundError: io/fabric8/kubernetes/client/KubernetesClient
[ https://issues.apache.org/jira/browse/SPARK-40814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618386#comment-17618386 ] Hyukjin Kwon commented on SPARK-40814: -- Spark 2.4.x is EOL. Mind trying if the same issue persists in Spark 3+? > Exception in thread "main" java.lang.NoClassDefFoundError: > io/fabric8/kubernetes/client/KubernetesClient > > > Key: SPARK-40814 > URL: https://issues.apache.org/jira/browse/SPARK-40814 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Submit >Affects Versions: 2.4.0 > Environment: k8s version: v1.18.9 > spark version: v2.4.0 > kubernetes-client:v6.1.1 >Reporter: jiangjian >Priority: Major > Attachments: Dockerfile, spark-error.log > > > After I change the user in the Spark image, the running program reports an > error. What is the problem? > ++ id -u > + myuid=2023 > ++ id -g > + mygid=2023 > + set +e > ++ getent passwd 2023 > + uidentry=zndw:x:2023:2023::/home/zndw:/bin/sh > + set -e > + '[' -z zndw:x:2023:2023::/home/zndw:/bin/sh ']' > + SPARK_K8S_CMD=driver > + case "$SPARK_K8S_CMD" in > + shift 1 > + SPARK_CLASSPATH=':/opt/spark/jars/*' > + env > + grep SPARK_JAVA_OPT_ > + sort -t_ -k4 -n > + sed 's/[^=]*=\(.*\)/\1/g' > + readarray -t SPARK_EXECUTOR_JAVA_OPTS > + '[' -n '' ']' > + '[' -n '' ']' > + PYSPARK_ARGS= > + '[' -n '' ']' > + R_ARGS= > + '[' -n '' ']' > + '[' '' == 2 ']' > + '[' '' == 3 ']' > + case "$SPARK_K8S_CMD" in > + CMD=("$SPARK_HOME/bin/spark-submit" --conf > "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client > "$@") > + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress=10.1.1.11 --deploy-mode client --properties-file > /opt/spark/conf/spark.properties --class > com.frontier.pueedas.computer.batchTool.etl.EtlScheduler > 'http://26.47.128.120:18000/spark/spark/raw/master/computer-batch-etl-hadoop-basic.jar?inline=false' > configMode=HDFS metaMode=HDFS platformConfigMode=NACOS storeConfigMode=NACOS > startDate=2022-08-02 endDate=2022-08-03 > _file=/user/config/YC2/TEST/config/computer/business/opc/EMpHpReadCurveHourData.xml > runMode=TEST > 2022-10-14 06:52:21 WARN NativeCodeLoader:62 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 2022-10-14 06:52:29 INFO SparkContext:54 - Running Spark version 2.4.0 > 2022-10-14 06:52:29 INFO SparkContext:54 - Submitted application: > [TEST]ETL[2022-08-02 00:00:00,2022-08-03 > 00:00:00]\{/user/config/YC2/TEST/config/computer/business/opc/EMpHpReadCurveHourData.xml} > 2022-10-14 06:52:29 INFO SecurityManager:54 - Changing view acls to: > zndw,root > 2022-10-14 06:52:29 INFO SecurityManager:54 - Changing modify acls to: > zndw,root > 2022-10-14 06:52:29 INFO SecurityManager:54 - Changing view acls groups to: > 2022-10-14 06:52:29 INFO SecurityManager:54 - Changing modify acls groups > to: > 2022-10-14 06:52:29 INFO SecurityManager:54 - SecurityManager: > authentication disabled; ui acls disabled; users with view permissions: > Set(zndw, root); groups with view permissions: Set(); users with modify > permissions: Set(zndw, root); groups with modify permissions: Set() > 2022-10-14 06:52:29 INFO Utils:54 - Successfully started service > 'sparkDriver' on port 7078. > 2022-10-14 06:52:29 INFO SparkEnv:54 - Registering MapOutputTracker > 2022-10-14 06:52:29 INFO SparkEnv:54 - Registering BlockManagerMaster > 2022-10-14 06:52:29 INFO BlockManagerMasterEndpoint:54 - Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 2022-10-14 06:52:29 INFO BlockManagerMasterEndpoint:54 - > BlockManagerMasterEndpoint up > 2022-10-14 06:52:29 INFO DiskBlockManager:54 - Created local directory at > /var/data/spark-9a270950-7527-4d08-a7bd-d6c1062e8522/blockmgr-79ab0f0d-6f9e-401e-aa90-91baa00a3ff3 > 2022-10-14 06:52:29 INFO MemoryStore:54 - MemoryStore started with capacity > 912.3 MB > 2022-10-14 06:52:29 INFO SparkEnv:54 - Registering OutputCommitCoordinator > 2022-10-14 06:52:30 INFO log:192 - Logging initialized @9926ms > 2022-10-14 06:52:30 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: > unknown, git hash: unknown > 2022-10-14 06:52:30 INFO Server:419 - Started @10035ms > 2022-10-14 06:52:30 INFO AbstractConnector:278 - Started > ServerConnector@66f0548d\{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} > 2022-10-14 06:52:30 INFO Utils:54 - Successfully started service 'SparkUI' > on port 4040. > 2022-10-14 06:52:30 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@59ed3e6c\{/jobs,null,AVAILABLE,@Spark} > 2022-10-14 06:52:30 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@70c53dbe\{/jobs/j
[jira] [Updated] (SPARK-40790) Check error classes in DDL parsing tests
[ https://issues.apache.org/jira/browse/SPARK-40790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-40790: Description: Check error classes in DDL command tests by using checkError(). For instance - AlterNamespaceSetPropertiesParserSuite - AlterTableDropPartitionParserSuite - AlterTableRenameParserSuite - CreateNamespaceParserSuite - AlterTableRecoverPartitionsParserSuite - DescribeTableParserSuite - TruncateTableParserSuite - AlterTableSetSerdeParserSuite - ShowPartitionsParserSuite [https://github.com/apache/spark/blob/414771d4e8b52d0a76a7729d005794dc04f1e075/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterNamespaceSetPropertiesParserSuite.scala#L43] was: Check error classes in DDL command tests by using checkError(). For instance - AlterNamespaceSetPropertiesParserSuite - AlterTableDropPartitionParserSuite https://github.com/apache/spark/blob/414771d4e8b52d0a76a7729d005794dc04f1e075/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterNamespaceSetPropertiesParserSuite.scala#L43 > Check error classes in DDL parsing tests > > > Key: SPARK-40790 > URL: https://issues.apache.org/jira/browse/SPARK-40790 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > Labels: starter > Fix For: 3.4.0 > > > Check error classes in DDL command tests by using checkError(). For instance > - AlterNamespaceSetPropertiesParserSuite > - AlterTableDropPartitionParserSuite > - AlterTableRenameParserSuite > - CreateNamespaceParserSuite > - AlterTableRecoverPartitionsParserSuite > - DescribeTableParserSuite > - TruncateTableParserSuite > - AlterTableSetSerdeParserSuite > - ShowPartitionsParserSuite > [https://github.com/apache/spark/blob/414771d4e8b52d0a76a7729d005794dc04f1e075/sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterNamespaceSetPropertiesParserSuite.scala#L43]
[jira] [Updated] (SPARK-40814) Exception in thread "main" java.lang.NoClassDefFoundError: io/fabric8/kubernetes/client/KubernetesClient
[ https://issues.apache.org/jira/browse/SPARK-40814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40814: - Priority: Major (was: Blocker) > Exception in thread "main" java.lang.NoClassDefFoundError: > io/fabric8/kubernetes/client/KubernetesClient > > > Key: SPARK-40814 > URL: https://issues.apache.org/jira/browse/SPARK-40814 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Submit >Affects Versions: 2.4.0 > Environment: k8s version: v1.18.9 > spark version: v2.4.0 > kubernetes-client:v6.1.1 >Reporter: jiangjian >Priority: Major > Attachments: Dockerfile, spark-error.log
[jira] [Updated] (SPARK-40814) Exception in thread "main" java.lang.NoClassDefFoundError: io/fabric8/kubernetes/client/KubernetesClient
[ https://issues.apache.org/jira/browse/SPARK-40814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiangjian updated SPARK-40814: -- Priority: Blocker (was: Major) > Exception in thread "main" java.lang.NoClassDefFoundError: > io/fabric8/kubernetes/client/KubernetesClient > > > Key: SPARK-40814 > URL: https://issues.apache.org/jira/browse/SPARK-40814 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Submit >Affects Versions: 2.4.0 > Environment: k8s version: v1.18.9 > spark version: v2.4.0 > kubernetes-client:v6.1.1 >Reporter: jiangjian >Priority: Blocker > Attachments: Dockerfile, spark-error.log
[jira] [Updated] (SPARK-40814) Exception in thread "main" java.lang.NoClassDefFoundError: io/fabric8/kubernetes/client/KubernetesClient
[ https://issues.apache.org/jira/browse/SPARK-40814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiangjian updated SPARK-40814: -- Component/s: Spark Submit > Exception in thread "main" java.lang.NoClassDefFoundError: > io/fabric8/kubernetes/client/KubernetesClient > > > Key: SPARK-40814 > URL: https://issues.apache.org/jira/browse/SPARK-40814 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Submit >Affects Versions: 2.4.0 > Environment: k8s version: v1.18.9 > spark version: v2.4.0 > kubernetes-client:v6.1.1 >Reporter: jiangjian >Priority: Major > Attachments: Dockerfile, spark-error.log
[jira] [Updated] (SPARK-40814) Exception in thread "main" java.lang.NoClassDefFoundError: io/fabric8/kubernetes/client/KubernetesClient
[ https://issues.apache.org/jira/browse/SPARK-40814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiangjian updated SPARK-40814: -- Attachment: Dockerfile > Exception in thread "main" java.lang.NoClassDefFoundError: > io/fabric8/kubernetes/client/KubernetesClient > > > Key: SPARK-40814 > URL: https://issues.apache.org/jira/browse/SPARK-40814 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.4.0 > Environment: k8s version: v1.18.9 > spark version: v2.4.0 > kubernetes-client:v6.1.1 >Reporter: jiangjian >Priority: Major > Attachments: Dockerfile, spark-error.log
[jira] [Updated] (SPARK-40814) Exception in thread "main" java.lang.NoClassDefFoundError: io/fabric8/kubernetes/client/KubernetesClient
[ https://issues.apache.org/jira/browse/SPARK-40814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiangjian updated SPARK-40814: -- Environment: k8s version: v1.18.9 spark version: v2.4.0 kubernetes-client:v6.1.1 was: k8s version: v1.18.9 spark version: v2.4.0 > Exception in thread "main" java.lang.NoClassDefFoundError: > io/fabric8/kubernetes/client/KubernetesClient > > > Key: SPARK-40814 > URL: https://issues.apache.org/jira/browse/SPARK-40814 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.4.0 > Environment: k8s version: v1.18.9 > spark version: v2.4.0 > kubernetes-client:v6.1.1 >Reporter: jiangjian >Priority: Major > Attachments: spark-error.log > > > After I change the user in the Spark image, the running program reports an > error. What is the problem > ++ id -u > + myuid=2023 > ++ id -g > + mygid=2023 > + set +e > ++ getent passwd 2023 > + uidentry=zndw:x:2023:2023::/home/zndw:/bin/sh > + set -e > + '[' -z zndw:x:2023:2023::/home/zndw:/bin/sh ']' > + SPARK_K8S_CMD=driver > + case "$SPARK_K8S_CMD" in > + shift 1 > + SPARK_CLASSPATH=':/opt/spark/jars/*' > + env > + grep SPARK_JAVA_OPT_ > + sort -t_ -k4 -n > + sed 's/[^=]*=\(.*\)/\1/g' > + readarray -t SPARK_EXECUTOR_JAVA_OPTS > + '[' -n '' ']' > + '[' -n '' ']' > + PYSPARK_ARGS= > + '[' -n '' ']' > + R_ARGS= > + '[' -n '' ']' > + '[' '' == 2 ']' > + '[' '' == 3 ']' > + case "$SPARK_K8S_CMD" in > + CMD=("$SPARK_HOME/bin/spark-submit" --conf > "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client > "$@") > + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress=10.1.1.11 --deploy-mode client --properties-file > /opt/spark/conf/spark.properties --class > com.frontier.pueedas.computer.batchTool.etl.EtlScheduler > 'http://26.47.128.120:18000/spark/spark/raw/master/computer-batch-etl-hadoop-basic.jar?inline=false' > configMode=HDFS metaMode=HDFS platformConfigMode=NACOS storeConfigMode=NACOS > startDate=2022-08-02 endDate=2022-08-03 > _file=/user/config/YC2/TEST/config/computer/business/opc/EMpHpReadCurveHourData.xml > runMode=TEST > 2022-10-14 06:52:21 WARN NativeCodeLoader:62 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 2022-10-14 06:52:29 INFO SparkContext:54 - Running Spark version 2.4.0 > 2022-10-14 06:52:29 INFO SparkContext:54 - Submitted application: > [TEST]ETL[2022-08-02 00:00:00,2022-08-03 > 00:00:00]\{/user/config/YC2/TEST/config/computer/business/opc/EMpHpReadCurveHourData.xml} > 2022-10-14 06:52:29 INFO SecurityManager:54 - Changing view acls to: > zndw,root > 2022-10-14 06:52:29 INFO SecurityManager:54 - Changing modify acls to: > zndw,root > 2022-10-14 06:52:29 INFO SecurityManager:54 - Changing view acls groups to: > 2022-10-14 06:52:29 INFO SecurityManager:54 - Changing modify acls groups > to: > 2022-10-14 06:52:29 INFO SecurityManager:54 - SecurityManager: > authentication disabled; ui acls disabled; users with view permissions: > Set(zndw, root); groups with view permissions: Set(); users with modify > permissions: Set(zndw, root); groups with modify permissions: Set() > 2022-10-14 06:52:29 INFO Utils:54 - Successfully started service > 'sparkDriver' on port 7078. 
> 2022-10-14 06:52:29 INFO SparkEnv:54 - Registering MapOutputTracker > 2022-10-14 06:52:29 INFO SparkEnv:54 - Registering BlockManagerMaster > 2022-10-14 06:52:29 INFO BlockManagerMasterEndpoint:54 - Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 2022-10-14 06:52:29 INFO BlockManagerMasterEndpoint:54 - > BlockManagerMasterEndpoint up > 2022-10-14 06:52:29 INFO DiskBlockManager:54 - Created local directory at > /var/data/spark-9a270950-7527-4d08-a7bd-d6c1062e8522/blockmgr-79ab0f0d-6f9e-401e-aa90-91baa00a3ff3 > 2022-10-14 06:52:29 INFO MemoryStore:54 - MemoryStore started with capacity > 912.3 MB > 2022-10-14 06:52:29 INFO SparkEnv:54 - Registering OutputCommitCoordinator > 2022-10-14 06:52:30 INFO log:192 - Logging initialized @9926ms > 2022-10-14 06:52:30 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: > unknown, git hash: unknown > 2022-10-14 06:52:30 INFO Server:419 - Started @10035ms > 2022-10-14 06:52:30 INFO AbstractConnector:278 - Started > ServerConnector@66f0548d\{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} > 2022-10-14 06:52:30 INFO Utils:54 - Successfully started service 'SparkUI' > on port 4040. > 2022-10-14 06:52:30 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@59ed3e6c\{/jobs,null,AVAILABLE,@Spark} > 2022-10-14 06:52:30 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@70c53dbe\{/jobs/json,null,AVAILABLE,@S
[jira] [Updated] (SPARK-40814) Exception in thread "main" java.lang.NoClassDefFoundError: io/fabric8/kubernetes/client/KubernetesClient
[ https://issues.apache.org/jira/browse/SPARK-40814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiangjian updated SPARK-40814: -- Attachment: spark-error.log > Exception in thread "main" java.lang.NoClassDefFoundError: > io/fabric8/kubernetes/client/KubernetesClient > > > Key: SPARK-40814 > URL: https://issues.apache.org/jira/browse/SPARK-40814 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.4.0 > Environment: k8s version: v1.18.9 > spark version: v2.4.0 >Reporter: jiangjian >Priority: Major > Attachments: spark-error.log > > > After I change the user in the Spark image, the running program reports an > error. What is the problem > ++ id -u > + myuid=2023 > ++ id -g > + mygid=2023 > + set +e > ++ getent passwd 2023 > + uidentry=zndw:x:2023:2023::/home/zndw:/bin/sh > + set -e > + '[' -z zndw:x:2023:2023::/home/zndw:/bin/sh ']' > + SPARK_K8S_CMD=driver > + case "$SPARK_K8S_CMD" in > + shift 1 > + SPARK_CLASSPATH=':/opt/spark/jars/*' > + env > + grep SPARK_JAVA_OPT_ > + sort -t_ -k4 -n > + sed 's/[^=]*=\(.*\)/\1/g' > + readarray -t SPARK_EXECUTOR_JAVA_OPTS > + '[' -n '' ']' > + '[' -n '' ']' > + PYSPARK_ARGS= > + '[' -n '' ']' > + R_ARGS= > + '[' -n '' ']' > + '[' '' == 2 ']' > + '[' '' == 3 ']' > + case "$SPARK_K8S_CMD" in > + CMD=("$SPARK_HOME/bin/spark-submit" --conf > "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client > "$@") > + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress=10.1.1.11 --deploy-mode client --properties-file > /opt/spark/conf/spark.properties --class > com.frontier.pueedas.computer.batchTool.etl.EtlScheduler > 'http://26.47.128.120:18000/spark/spark/raw/master/computer-batch-etl-hadoop-basic.jar?inline=false' > configMode=HDFS metaMode=HDFS platformConfigMode=NACOS storeConfigMode=NACOS > startDate=2022-08-02 endDate=2022-08-03 > _file=/user/config/YC2/TEST/config/computer/business/opc/EMpHpReadCurveHourData.xml > runMode=TEST > 2022-10-14 06:52:21 WARN NativeCodeLoader:62 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 2022-10-14 06:52:29 INFO SparkContext:54 - Running Spark version 2.4.0 > 2022-10-14 06:52:29 INFO SparkContext:54 - Submitted application: > [TEST]ETL[2022-08-02 00:00:00,2022-08-03 > 00:00:00]\{/user/config/YC2/TEST/config/computer/business/opc/EMpHpReadCurveHourData.xml} > 2022-10-14 06:52:29 INFO SecurityManager:54 - Changing view acls to: > zndw,root > 2022-10-14 06:52:29 INFO SecurityManager:54 - Changing modify acls to: > zndw,root > 2022-10-14 06:52:29 INFO SecurityManager:54 - Changing view acls groups to: > 2022-10-14 06:52:29 INFO SecurityManager:54 - Changing modify acls groups > to: > 2022-10-14 06:52:29 INFO SecurityManager:54 - SecurityManager: > authentication disabled; ui acls disabled; users with view permissions: > Set(zndw, root); groups with view permissions: Set(); users with modify > permissions: Set(zndw, root); groups with modify permissions: Set() > 2022-10-14 06:52:29 INFO Utils:54 - Successfully started service > 'sparkDriver' on port 7078. 
> 2022-10-14 06:52:29 INFO SparkEnv:54 - Registering MapOutputTracker > 2022-10-14 06:52:29 INFO SparkEnv:54 - Registering BlockManagerMaster > 2022-10-14 06:52:29 INFO BlockManagerMasterEndpoint:54 - Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 2022-10-14 06:52:29 INFO BlockManagerMasterEndpoint:54 - > BlockManagerMasterEndpoint up > 2022-10-14 06:52:29 INFO DiskBlockManager:54 - Created local directory at > /var/data/spark-9a270950-7527-4d08-a7bd-d6c1062e8522/blockmgr-79ab0f0d-6f9e-401e-aa90-91baa00a3ff3 > 2022-10-14 06:52:29 INFO MemoryStore:54 - MemoryStore started with capacity > 912.3 MB > 2022-10-14 06:52:29 INFO SparkEnv:54 - Registering OutputCommitCoordinator > 2022-10-14 06:52:30 INFO log:192 - Logging initialized @9926ms > 2022-10-14 06:52:30 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: > unknown, git hash: unknown > 2022-10-14 06:52:30 INFO Server:419 - Started @10035ms > 2022-10-14 06:52:30 INFO AbstractConnector:278 - Started > ServerConnector@66f0548d\{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} > 2022-10-14 06:52:30 INFO Utils:54 - Successfully started service 'SparkUI' > on port 4040. > 2022-10-14 06:52:30 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@59ed3e6c\{/jobs,null,AVAILABLE,@Spark} > 2022-10-14 06:52:30 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@70c53dbe\{/jobs/json,null,AVAILABLE,@Spark} > 2022-10-14 06:52:30 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@1894e40d\{/jobs/job,null,AVAILABLE,@Spar
[jira] [Created] (SPARK-40814) Exception in thread "main" java.lang.NoClassDefFoundError: io/fabric8/kubernetes/client/KubernetesClient
jiangjian created SPARK-40814: - Summary: Exception in thread "main" java.lang.NoClassDefFoundError: io/fabric8/kubernetes/client/KubernetesClient Key: SPARK-40814 URL: https://issues.apache.org/jira/browse/SPARK-40814 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 2.4.0 Environment: k8s version: v1.18.9 spark version: v2.4.0 Reporter: jiangjian After I change the user in the Spark image, the running program reports an error. What is the problem ++ id -u + myuid=2023 ++ id -g + mygid=2023 + set +e ++ getent passwd 2023 + uidentry=zndw:x:2023:2023::/home/zndw:/bin/sh + set -e + '[' -z zndw:x:2023:2023::/home/zndw:/bin/sh ']' + SPARK_K8S_CMD=driver + case "$SPARK_K8S_CMD" in + shift 1 + SPARK_CLASSPATH=':/opt/spark/jars/*' + env + grep SPARK_JAVA_OPT_ + sort -t_ -k4 -n + sed 's/[^=]*=\(.*\)/\1/g' + readarray -t SPARK_EXECUTOR_JAVA_OPTS + '[' -n '' ']' + '[' -n '' ']' + PYSPARK_ARGS= + '[' -n '' ']' + R_ARGS= + '[' -n '' ']' + '[' '' == 2 ']' + '[' '' == 3 ']' + case "$SPARK_K8S_CMD" in + CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@") + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.1.1.11 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class com.frontier.pueedas.computer.batchTool.etl.EtlScheduler 'http://26.47.128.120:18000/spark/spark/raw/master/computer-batch-etl-hadoop-basic.jar?inline=false' configMode=HDFS metaMode=HDFS platformConfigMode=NACOS storeConfigMode=NACOS startDate=2022-08-02 endDate=2022-08-03 _file=/user/config/YC2/TEST/config/computer/business/opc/EMpHpReadCurveHourData.xml runMode=TEST 2022-10-14 06:52:21 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2022-10-14 06:52:29 INFO SparkContext:54 - Running Spark version 2.4.0 2022-10-14 06:52:29 INFO SparkContext:54 - Submitted application: [TEST]ETL[2022-08-02 00:00:00,2022-08-03 00:00:00]\{/user/config/YC2/TEST/config/computer/business/opc/EMpHpReadCurveHourData.xml} 2022-10-14 06:52:29 INFO SecurityManager:54 - Changing view acls to: zndw,root 2022-10-14 06:52:29 INFO SecurityManager:54 - Changing modify acls to: zndw,root 2022-10-14 06:52:29 INFO SecurityManager:54 - Changing view acls groups to: 2022-10-14 06:52:29 INFO SecurityManager:54 - Changing modify acls groups to: 2022-10-14 06:52:29 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(zndw, root); groups with view permissions: Set(); users with modify permissions: Set(zndw, root); groups with modify permissions: Set() 2022-10-14 06:52:29 INFO Utils:54 - Successfully started service 'sparkDriver' on port 7078. 
2022-10-14 06:52:29 INFO SparkEnv:54 - Registering MapOutputTracker 2022-10-14 06:52:29 INFO SparkEnv:54 - Registering BlockManagerMaster 2022-10-14 06:52:29 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 2022-10-14 06:52:29 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up 2022-10-14 06:52:29 INFO DiskBlockManager:54 - Created local directory at /var/data/spark-9a270950-7527-4d08-a7bd-d6c1062e8522/blockmgr-79ab0f0d-6f9e-401e-aa90-91baa00a3ff3 2022-10-14 06:52:29 INFO MemoryStore:54 - MemoryStore started with capacity 912.3 MB 2022-10-14 06:52:29 INFO SparkEnv:54 - Registering OutputCommitCoordinator 2022-10-14 06:52:30 INFO log:192 - Logging initialized @9926ms 2022-10-14 06:52:30 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown 2022-10-14 06:52:30 INFO Server:419 - Started @10035ms 2022-10-14 06:52:30 INFO AbstractConnector:278 - Started ServerConnector@66f0548d\{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} 2022-10-14 06:52:30 INFO Utils:54 - Successfully started service 'SparkUI' on port 4040. 2022-10-14 06:52:30 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@59ed3e6c\{/jobs,null,AVAILABLE,@Spark} 2022-10-14 06:52:30 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@70c53dbe\{/jobs/json,null,AVAILABLE,@Spark} 2022-10-14 06:52:30 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1894e40d\{/jobs/job,null,AVAILABLE,@Spark} 2022-10-14 06:52:30 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7342e05d\{/jobs/job/json,null,AVAILABLE,@Spark} 2022-10-14 06:52:30 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2a331b46\{/stages,null,AVAILABLE,@Spark} 2022-10-14 06:52:30 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@15383681\{/stages/json,null,AVAILABLE,@Spark} 2022-10-14 06:52:30 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@743e66f7\{/stages/stag
[jira] [Commented] (SPARK-40812) Add Deduplicate to Connect proto
[ https://issues.apache.org/jira/browse/SPARK-40812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618344#comment-17618344 ] Apache Spark commented on SPARK-40812: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38276 > Add Deduplicate to Connect proto > > > Key: SPARK-40812 > URL: https://issues.apache.org/jira/browse/SPARK-40812 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40812) Add Deduplicate to Connect proto
[ https://issues.apache.org/jira/browse/SPARK-40812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40812: Assignee: Apache Spark > Add Deduplicate to Connect proto > > > Key: SPARK-40812 > URL: https://issues.apache.org/jira/browse/SPARK-40812 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40812) Add Deduplicate to Connect proto
[ https://issues.apache.org/jira/browse/SPARK-40812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618343#comment-17618343 ] Apache Spark commented on SPARK-40812: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38276 > Add Deduplicate to Connect proto > > > Key: SPARK-40812 > URL: https://issues.apache.org/jira/browse/SPARK-40812 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40812) Add Deduplicate to Connect proto
[ https://issues.apache.org/jira/browse/SPARK-40812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40812: Assignee: (was: Apache Spark) > Add Deduplicate to Connect proto > > > Key: SPARK-40812 > URL: https://issues.apache.org/jira/browse/SPARK-40812 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40813) Add limit and offset to Connect DSL
[ https://issues.apache.org/jira/browse/SPARK-40813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40813: Assignee: (was: Apache Spark) > Add limit and offset to Connect DSL > --- > > Key: SPARK-40813 > URL: https://issues.apache.org/jira/browse/SPARK-40813 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40813) Add limit and offset to Connect DSL
[ https://issues.apache.org/jira/browse/SPARK-40813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618342#comment-17618342 ] Apache Spark commented on SPARK-40813: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38275 > Add limit and offset to Connect DSL > --- > > Key: SPARK-40813 > URL: https://issues.apache.org/jira/browse/SPARK-40813 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40813) Add limit and offset to Connect DSL
[ https://issues.apache.org/jira/browse/SPARK-40813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40813: Assignee: Apache Spark > Add limit and offset to Connect DSL > --- > > Key: SPARK-40813 > URL: https://issues.apache.org/jira/browse/SPARK-40813 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40813) Add limit and offset to Connect DSL
Rui Wang created SPARK-40813: Summary: Add limit and offset to Connect DSL Key: SPARK-40813 URL: https://issues.apache.org/jira/browse/SPARK-40813 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
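To make the DSL change concrete, here is a minimal sketch of what limit and offset could look like as Connect DSL extensions, assuming a DSL that wraps proto.Relation with builder calls; the message and field names below (Limit, Offset, setInput, setLimit, setOffset) are assumptions for illustration, not the merged API.

{code:scala}
// Hypothetical sketch of Connect DSL extensions; proto message and field
// names are assumed for illustration.
import org.apache.spark.connect.proto

object ConnectDslSketch {
  implicit class DslRelation(val plan: proto.Relation) extends AnyVal {
    // Wrap the current relation in a Limit node that keeps the first n rows.
    def limit(n: Int): proto.Relation =
      proto.Relation.newBuilder()
        .setLimit(proto.Limit.newBuilder().setInput(plan).setLimit(n))
        .build()

    // Wrap the current relation in an Offset node that skips the first n rows.
    def offset(n: Int): proto.Relation =
      proto.Relation.newBuilder()
        .setOffset(proto.Offset.newBuilder().setInput(plan).setOffset(n))
        .build()
  }
}
{code}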
[jira] [Created] (SPARK-40812) Add Deduplicate to Connect proto
Rui Wang created SPARK-40812: Summary: Add Deduplicate to Connect proto Key: SPARK-40812 URL: https://issues.apache.org/jira/browse/SPARK-40812 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
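For context on what "Add Deduplicate to Connect proto" would enable, a hedged sketch of building a Deduplicate relation, mirroring Dataset.dropDuplicates over a subset of columns; the message and field names (Deduplicate, input, column_names) are assumptions for illustration, not the merged schema.

{code:scala}
// Hypothetical sketch; proto message and field names are assumed.
import org.apache.spark.connect.proto

object DeduplicateSketch {
  // Build a Deduplicate relation over the given key columns, the proto
  // analogue of df.dropDuplicates(cols).
  def dedup(input: proto.Relation, cols: Seq[String]): proto.Relation = {
    val d = proto.Deduplicate.newBuilder().setInput(input)
    cols.foreach(c => d.addColumnNames(c))
    proto.Relation.newBuilder().setDeduplicate(d).build()
  }
}
{code}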
[jira] [Assigned] (SPARK-40811) Use checkError() to intercept ParseException
[ https://issues.apache.org/jira/browse/SPARK-40811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40811: Assignee: (was: Apache Spark) > Use checkError() to intercept ParseException > > > Key: SPARK-40811 > URL: https://issues.apache.org/jira/browse/SPARK-40811 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Port the following test suites onto checkError(): > - SQLViewSuite > - JDBCTableCatalogSuite -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40811) Use checkError() to intercept ParseException
[ https://issues.apache.org/jira/browse/SPARK-40811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618300#comment-17618300 ] Apache Spark commented on SPARK-40811: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/38267 > Use checkError() to intercept ParseException > > > Key: SPARK-40811 > URL: https://issues.apache.org/jira/browse/SPARK-40811 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Port the following test suites onto checkError(): > - SQLViewSuite > - JDBCTableCatalogSuite -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40811) Use checkError() to intercept ParseException
[ https://issues.apache.org/jira/browse/SPARK-40811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40811: Assignee: Apache Spark > Use checkError() to intercept ParseException > > > Key: SPARK-40811 > URL: https://issues.apache.org/jira/browse/SPARK-40811 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Port the following test suites onto checkError(): > - SQLViewSuite > - JDBCTableCatalogSuite -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40811) Use checkError() to intercept ParseException
[ https://issues.apache.org/jira/browse/SPARK-40811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618299#comment-17618299 ] Apache Spark commented on SPARK-40811: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/38267 > Use checkError() to intercept ParseException > > > Key: SPARK-40811 > URL: https://issues.apache.org/jira/browse/SPARK-40811 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Port the following test suites onto checkError(): > - SQLViewSuite > - JDBCTableCatalogSuite -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40811) Use checkError() to intercept ParseException
[ https://issues.apache.org/jira/browse/SPARK-40811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-40811: - Description: Port the following test suites onto checkError(): - SQLViewSuite - JDBCTableCatalogSuite was: Port the following test suites onto checkError(): - SQLViewSuite > Use checkError() to intercept ParseException > > > Key: SPARK-40811 > URL: https://issues.apache.org/jira/browse/SPARK-40811 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Port the following test suites onto checkError(): > - SQLViewSuite > - JDBCTableCatalogSuite -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40811) Use checkError() to intercept ParseException
Max Gekk created SPARK-40811: Summary: Use checkError() to intercept ParseException Key: SPARK-40811 URL: https://issues.apache.org/jira/browse/SPARK-40811 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Port the following test suites onto checkError(): - SQLViewSuite -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
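For readers who have not seen the pattern the ticket asks for, a hedged sketch of porting an intercepted ParseException onto checkError() inside a SparkFunSuite-derived SQL test; the error class and parameter values below are hypothetical placeholders, not the ones used in SQLViewSuite or JDBCTableCatalogSuite.

{code:scala}
// Illustrative only: error class and parameter values are placeholders.
import org.apache.spark.sql.catalyst.parser.ParseException

test("port intercepted ParseException onto checkError") {
  checkError(
    exception = intercept[ParseException] {
      sql("CREATE VIEW v AS SELECT * FROM tbl WHERE") // deliberately malformed
    },
    errorClass = "PARSE_SYNTAX_ERROR", // placeholder error class
    parameters = Map("error" -> "end of input", "hint" -> ""))
}
{code}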
[jira] [Resolved] (SPARK-40786) Check error classes in PlanParserSuite
[ https://issues.apache.org/jira/browse/SPARK-40786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40786. -- Resolution: Fixed Issue resolved by pull request 38271 [https://github.com/apache/spark/pull/38271] > Check error classes in PlanParserSuite > -- > > Key: SPARK-40786 > URL: https://issues.apache.org/jira/browse/SPARK-40786 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: BingKun Pan >Priority: Major > Labels: starter > Fix For: 3.4.0 > > > Check error classes in PlanParserSuite by using checkError(). For instance, > replace > {code:scala} > intercept("EXPLAIN logical SELECT 1", "Unsupported SQL statement") > {code} > by > {code:scala} > checkError( > exception = parseException("EXPLAIN logical SELECT 1"), > errorClass = "...", > parameters = Map.empty, > context = ...) > {code} > at > https://github.com/apache/spark/blob/35d00df9bba7238ad4f40617fae4d04ddbfd/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala#L225 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40786) Check error classes in PlanParserSuite
[ https://issues.apache.org/jira/browse/SPARK-40786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-40786: Assignee: BingKun Pan > Check error classes in PlanParserSuite > -- > > Key: SPARK-40786 > URL: https://issues.apache.org/jira/browse/SPARK-40786 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: BingKun Pan >Priority: Major > Labels: starter > Fix For: 3.4.0 > > > Check error classes in PlanParserSuite by using checkError(). For instance, > replace > {code:scala} > intercept("EXPLAIN logical SELECT 1", "Unsupported SQL statement") > {code} > by > {code:scala} > checkError( > exception = parseException("EXPLAIN logical SELECT 1"), > errorClass = "...", > parameters = Map.empty, > context = ...) > {code} > at > https://github.com/apache/spark/blob/35d00df9bba7238ad4f40617fae4d04ddbfd/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala#L225 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40728) Upgrade ASM to 9.4
[ https://issues.apache.org/jira/browse/SPARK-40728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40728: - Priority: Minor (was: Major) > Upgrade ASM to 9.4 > -- > > Key: SPARK-40728 > URL: https://issues.apache.org/jira/browse/SPARK-40728 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40728) Upgrade ASM to 9.4
[ https://issues.apache.org/jira/browse/SPARK-40728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40728. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38189 [https://github.com/apache/spark/pull/38189] > Upgrade ASM to 9.4 > -- > > Key: SPARK-40728 > URL: https://issues.apache.org/jira/browse/SPARK-40728 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40728) Upgrade ASM to 9.4
[ https://issues.apache.org/jira/browse/SPARK-40728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-40728: Assignee: Yang Jie > Upgrade ASM to 9.4 > -- > > Key: SPARK-40728 > URL: https://issues.apache.org/jira/browse/SPARK-40728 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema
[ https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618277#comment-17618277 ] ming95 commented on SPARK-40808: [~ohadm] Can you provide code to reproduce this issue? > Infer schema for CSV files - wrong behavior using header + merge schema > --- > > Key: SPARK-40808 > URL: https://issues.apache.org/jira/browse/SPARK-40808 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.2 >Reporter: ohad >Priority: Major > Labels: CSVReader, csv, csvparser > > Hello. > I am writing unit tests for some functionality in my application that reads > data from CSV files using Spark. > I am reading the data using: > {code:java} > header=True > mergeSchema=True > inferSchema=True{code} > When I read this single file: > {code:java} > File1: > "int_col","string_col","decimal_col","date_col" > 1,"hello",1.43,2022-02-23 > 2,"world",5.534,2021-05-05 > 3,"my name",86.455,2011-08-15 > 4,"is ohad",6.234,2002-03-22{code} > I get this schema: > {code:java} > int_col=int > string_col=string > decimal_col=double > date_col=string{code} > When I duplicate this file, I get the same schema. > The strange part is that when I add a new int column, it looks like Spark > gets confused and thinks that the columns already identified as int are > now string: > {code:java} > File1: > "int_col","string_col","decimal_col","date_col" > 1,"hello",1.43,2022-02-23 > 2,"world",5.534,2021-05-05 > 3,"my name",86.455,2011-08-15 > 4,"is ohad",6.234,2002-03-22 > File2: > "int_col","string_col","decimal_col","date_col","int2_col" > 1,"hello",1.43,2022-02-23,234 > 2,"world",5.534,2021-05-05,5 > 3,"my name",86.455,2011-08-15,32 > 4,"is ohad",6.234,2002-03-22,2 > {code} > result: > {code:java} > int_col=string > string_col=string > decimal_col=string > date_col=string > int2_col=int{code} > When I read only the second file, it looks fine: > {code:java} > File2: > "int_col","string_col","decimal_col","date_col","int2_col" > 1,"hello",1.43,2022-02-23,234 > 2,"world",5.534,2021-05-05,5 > 3,"my name",86.455,2011-08-15,32 > 4,"is ohad",6.234,2002-03-22,2{code} > result: > {code:java} > int_col=int > string_col=string > decimal_col=double > date_col=string > int2_col=int{code} > In conclusion, it looks like there is a bug in the interaction of the two features: header > recognition and schema merging. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
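Responding to the reproduction request above, a hedged sketch of a reader configured with the options named in the report; the directory path is a placeholder, and the files are expected to match File1/File2 from the description.

{code:scala}
// Hedged reproduction sketch for SPARK-40808; /tmp/csv_repro is a placeholder
// directory containing File1 and File2 exactly as shown in the description.
import org.apache.spark.sql.SparkSession

object Spark40808Repro extends App {
  val spark = SparkSession.builder()
    .master("local[*]").appName("SPARK-40808-repro").getOrCreate()

  val df = spark.read
    .option("header", "true")
    .option("mergeSchema", "true") // option named in the report
    .option("inferSchema", "true")
    .csv("/tmp/csv_repro")

  // Per the report, int_col flips from int to string once File2 is present.
  df.printSchema()
}
{code}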
[jira] [Resolved] (SPARK-40809) Add as(alias: String) to connect DSL
[ https://issues.apache.org/jira/browse/SPARK-40809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40809. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38272 [https://github.com/apache/spark/pull/38272] > Add as(alias: String) to connect DSL > > > Key: SPARK-40809 > URL: https://issues.apache.org/jira/browse/SPARK-40809 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40809) Add as(alias: String) to connect DSL
[ https://issues.apache.org/jira/browse/SPARK-40809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40809: Assignee: Rui Wang > Add as(alias: String) to connect DSL > > > Key: SPARK-40809 > URL: https://issues.apache.org/jira/browse/SPARK-40809 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40588) Sorting issue with AQE turned on
[ https://issues.apache.org/jira/browse/SPARK-40588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618265#comment-17618265 ] ming95 commented on SPARK-40588: After my testing, I think this is not a problem of AQE: with the reproduction code I used, the sort still does not take effect even after setting spark.sql.adaptive.enabled to false. !image-2022-10-16-22-05-47-159.png! It can be reproduced by modifying a few parameters and running in Spark local mode: ``` val partitions = 200 val minRand = 100 val maxRand = 300 ``` The real problem seems to be in the sort + partitionBy operation. > Sorting issue with AQE turned on > -- > > Key: SPARK-40588 > URL: https://issues.apache.org/jira/browse/SPARK-40588 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.3 > Environment: Spark v3.1.3 > Scala v2.12.13 >Reporter: Swetha Baskaran >Priority: Major > Attachments: image-2022-10-16-22-05-47-159.png > > > We are attempting to partition data by a few columns, sort by a particular > _sortCol_ and write out one file per partition. > {code:java} > df > .repartition(col("day"), col("month"), col("year")) > .withColumn("partitionId",spark_partition_id) > .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId) > .sortWithinPartitions("year", "month", "day", "sortCol") > .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId) > .write > .partitionBy("year", "month", "day") > .parquet(path){code} > When inspecting the results, we observe one file per partition, however we > see an _alternating_ pattern of unsorted rows in some files. > {code:java} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code} > Here is a > [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to > reproduce the issue. > Turning off AQE with spark.conf.set("spark.sql.adaptive.enabled", false) > fixes the issue. > I'm working on identifying why AQE affects the sort order. Any leads or > thoughts would be appreciated! 
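Expanding the comment's parameters into a runnable sketch (hedged: the data generation below is an assumption standing in for the original gist, which is not reproduced here; it only mirrors the columns used in the description).

{code:scala}
// Hedged spark-shell sketch; assumes an existing `spark` session.
import org.apache.spark.sql.functions._

val partitions = 200
val minRand = 100
val maxRand = 300

val df = spark.range(0, 1000000)
  .withColumn("year", lit(2022))
  .withColumn("month", lit(10))
  .withColumn("day", (rand() * 28).cast("int"))
  .withColumn("sortCol", (lit(minRand) + rand() * (maxRand - minRand)).cast("int"))

df.repartition(partitions, col("day"), col("month"), col("year"))
  .sortWithinPartitions("year", "month", "day", "sortCol")
  .write
  .mode("overwrite")
  .partitionBy("year", "month", "day")
  .parquet("/tmp/spark40588_repro")
{code}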
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40588) Sorting issue with AQE turned on
[ https://issues.apache.org/jira/browse/SPARK-40588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ming95 updated SPARK-40588: --- Attachment: image-2022-10-16-22-05-47-159.png > Sorting issue with AQE turned on > -- > > Key: SPARK-40588 > URL: https://issues.apache.org/jira/browse/SPARK-40588 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.3 > Environment: Spark v3.1.3 > Scala v2.12.13 >Reporter: Swetha Baskaran >Priority: Major > Attachments: image-2022-10-16-22-05-47-159.png > > > We are attempting to partition data by a few columns, sort by a particular > _sortCol_ and write out one file per partition. > {code:java} > df > .repartition(col("day"), col("month"), col("year")) > .withColumn("partitionId",spark_partition_id) > .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId) > .sortWithinPartitions("year", "month", "day", "sortCol") > .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId) > .write > .partitionBy("year", "month", "day") > .parquet(path){code} > When inspecting the results, we observe one file per partition, however we > see an _alternating_ pattern of unsorted rows in some files. > {code:java} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code} > Here is a > [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to > reproduce the issue. > Turning off AQE with spark.conf.set("spark.sql.adaptive.enabled", false) > fixes the issue. > I'm working on identifying why AQE affects the sort order. Any leads or > thoughts would be appreciated! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40810) Use SparkIllegalArgumentException instead of IllegalArgumentException in CreateDatabaseCommand & AlterDatabaseSetLocationCommand
[ https://issues.apache.org/jira/browse/SPARK-40810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618263#comment-17618263 ] Apache Spark commented on SPARK-40810: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/38274 > Use SparkIllegalArgumentException instead of IllegalArgumentException in > CreateDatabaseCommand & AlterDatabaseSetLocationCommand > > > Key: SPARK-40810 > URL: https://issues.apache.org/jira/browse/SPARK-40810 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40810) Use SparkIllegalArgumentException instead of IllegalArgumentException in CreateDatabaseCommand & AlterDatabaseSetLocationCommand
[ https://issues.apache.org/jira/browse/SPARK-40810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40810: Assignee: Apache Spark > Use SparkIllegalArgumentException instead of IllegalArgumentException in > CreateDatabaseCommand & AlterDatabaseSetLocationCommand > > > Key: SPARK-40810 > URL: https://issues.apache.org/jira/browse/SPARK-40810 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40810) Use SparkIllegalArgumentException instead of IllegalArgumentException in CreateDatabaseCommand & AlterDatabaseSetLocationCommand
[ https://issues.apache.org/jira/browse/SPARK-40810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618262#comment-17618262 ] Apache Spark commented on SPARK-40810: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/38274 > Use SparkIllegalArgumentException instead of IllegalArgumentException in > CreateDatabaseCommand & AlterDatabaseSetLocationCommand > > > Key: SPARK-40810 > URL: https://issues.apache.org/jira/browse/SPARK-40810 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40810) Use SparkIllegalArgumentException instead of IllegalArgumentException in CreateDatabaseCommand & AlterDatabaseSetLocationCommand
[ https://issues.apache.org/jira/browse/SPARK-40810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40810: Assignee: (was: Apache Spark) > Use SparkIllegalArgumentException instead of IllegalArgumentException in > CreateDatabaseCommand & AlterDatabaseSetLocationCommand > > > Key: SPARK-40810 > URL: https://issues.apache.org/jira/browse/SPARK-40810 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40810) Use SparkIllegalArgumentException instead of IllegalArgumentException in CreateDatabaseCommand & AlterDatabaseSetLocationCommand
BingKun Pan created SPARK-40810: --- Summary: Use SparkIllegalArgumentException instead of IllegalArgumentException in CreateDatabaseCommand & AlterDatabaseSetLocationCommand Key: SPARK-40810 URL: https://issues.apache.org/jira/browse/SPARK-40810 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
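As a hedged illustration of the requested change, the sketch below replaces a bare IllegalArgumentException with SparkIllegalArgumentException carrying an error class; the error class name and the constructor shape are assumptions for illustration, not the merged code.

{code:scala}
// Hypothetical sketch; error class name and constructor shape are assumed.
import org.apache.spark.SparkIllegalArgumentException

object CreateDatabaseValidationSketch {
  // Validate a database location the way CreateDatabaseCommand might,
  // raising an error-class based exception instead of a bare one.
  def validateLocation(location: String): Unit = {
    if (location.isEmpty) {
      // Before: throw new IllegalArgumentException("Empty location")
      throw new SparkIllegalArgumentException(
        errorClass = "INVALID_EMPTY_LOCATION", // placeholder name
        messageParameters = Map("location" -> location))
    }
  }
}
{code}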
[jira] [Assigned] (SPARK-37945) Use error classes in the execution errors of arithmetic ops
[ https://issues.apache.org/jira/browse/SPARK-37945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37945: Assignee: Apache Spark > Use error classes in the execution errors of arithmetic ops > --- > > Key: SPARK-37945 > URL: https://issues.apache.org/jira/browse/SPARK-37945 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Migrate the following errors in QueryExecutionErrors: > * overflowInSumOfDecimalError > * overflowInIntegralDivideError > * arithmeticOverflowError > * unaryMinusCauseOverflowError > * binaryArithmeticCauseOverflowError > * unscaledValueTooLargeForPrecisionError > * decimalPrecisionExceedsMaxPrecisionError > * outOfDecimalTypeRangeError > * integerOverflowError > onto error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryExecutionErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37945) Use error classes in the execution errors of arithmetic ops
[ https://issues.apache.org/jira/browse/SPARK-37945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618253#comment-17618253 ] Apache Spark commented on SPARK-37945: -- User 'khalidmammadov' has created a pull request for this issue: https://github.com/apache/spark/pull/38273 > Use error classes in the execution errors of arithmetic ops > --- > > Key: SPARK-37945 > URL: https://issues.apache.org/jira/browse/SPARK-37945 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryExecutionErrors: > * overflowInSumOfDecimalError > * overflowInIntegralDivideError > * arithmeticOverflowError > * unaryMinusCauseOverflowError > * binaryArithmeticCauseOverflowError > * unscaledValueTooLargeForPrecisionError > * decimalPrecisionExceedsMaxPrecisionError > * outOfDecimalTypeRangeError > * integerOverflowError > onto error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryExecutionErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37945) Use error classes in the execution errors of arithmetic ops
[ https://issues.apache.org/jira/browse/SPARK-37945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37945: Assignee: (was: Apache Spark) > Use error classes in the execution errors of arithmetic ops > --- > > Key: SPARK-37945 > URL: https://issues.apache.org/jira/browse/SPARK-37945 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryExecutionErrors: > * overflowInSumOfDecimalError > * overflowInIntegralDivideError > * arithmeticOverflowError > * unaryMinusCauseOverflowError > * binaryArithmeticCauseOverflowError > * unscaledValueTooLargeForPrecisionError > * decimalPrecisionExceedsMaxPrecisionError > * outOfDecimalTypeRangeError > * integerOverflowError > onto error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryExecutionErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
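For one of the listed errors, a hedged sketch of the migration pattern in QueryExecutionErrors; the error class name, parameter names, and the SparkArithmeticException constructor shape are all assumptions for illustration, not the merged code.

{code:scala}
// Hedged sketch; error class, parameter names and constructor are assumed.
import org.apache.spark.SparkArithmeticException

object ArithmeticErrorsSketch {
  // arithmeticOverflowError rewritten to throw a SparkThrowable
  // implementation keyed by an error class instead of a plain message.
  def arithmeticOverflowError(
      message: String,
      hint: String = "",
      config: String = "spark.sql.ansi.enabled"): ArithmeticException = {
    new SparkArithmeticException(
      errorClass = "ARITHMETIC_OVERFLOW", // assumed class name
      messageParameters = Map(
        "message" -> message,
        "alternative" -> hint,
        "config" -> config))
  }
}
{code}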
[jira] [Commented] (SPARK-40809) Add as(alias: String) to connect DSL
[ https://issues.apache.org/jira/browse/SPARK-40809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618239#comment-17618239 ] Apache Spark commented on SPARK-40809: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38272 > Add as(alias: String) to connect DSL > > > Key: SPARK-40809 > URL: https://issues.apache.org/jira/browse/SPARK-40809 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40809) Add as(alias: String) to connect DSL
[ https://issues.apache.org/jira/browse/SPARK-40809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618238#comment-17618238 ] Apache Spark commented on SPARK-40809: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38272 > Add as(alias: String) to connect DSL > > > Key: SPARK-40809 > URL: https://issues.apache.org/jira/browse/SPARK-40809 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40809) Add as(alias: String) to connect DSL
[ https://issues.apache.org/jira/browse/SPARK-40809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40809: Assignee: Apache Spark > Add as(alias: String) to connect DSL > > > Key: SPARK-40809 > URL: https://issues.apache.org/jira/browse/SPARK-40809 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40809) Add as(alias: String) to connect DSL
[ https://issues.apache.org/jira/browse/SPARK-40809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40809: Assignee: (was: Apache Spark) > Add as(alias: String) to connect DSL > > > Key: SPARK-40809 > URL: https://issues.apache.org/jira/browse/SPARK-40809 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40809) Add as(alias: String) to connect DSL
[ https://issues.apache.org/jira/browse/SPARK-40809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-40809: - Summary: Add as(alias: String) to connect DSL (was: Add as(alias) to connect DSL) > Add as(alias: String) to connect DSL > > > Key: SPARK-40809 > URL: https://issues.apache.org/jira/browse/SPARK-40809 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40809) Add as(alias) to connect DSL
Rui Wang created SPARK-40809: Summary: Add as(alias) to connect DSL Key: SPARK-40809 URL: https://issues.apache.org/jira/browse/SPARK-40809 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
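A hedged sketch of what as(alias: String) could look like in the Connect DSL, assuming aliasing maps to a SubqueryAlias-style message; the message and field names are assumptions for illustration, not the merged API.

{code:scala}
// Hypothetical sketch; proto message and field names are assumed.
import org.apache.spark.connect.proto

object AliasDslSketch {
  implicit class DslAlias(val plan: proto.Relation) extends AnyVal {
    // Wrap the relation in an alias node, mirroring Dataset.as(alias).
    def as(alias: String): proto.Relation =
      proto.Relation.newBuilder()
        .setSubqueryAlias(
          proto.SubqueryAlias.newBuilder().setInput(plan).setAlias(alias))
        .build()
  }
}
{code}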
[jira] [Assigned] (SPARK-39951) Support columnar batches with nested fields in Parquet V2
[ https://issues.apache.org/jira/browse/SPARK-39951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-39951: --- Assignee: Adam Binford (was: Apache Spark) > Support columnar batches with nested fields in Parquet V2 > - > > Key: SPARK-39951 > URL: https://issues.apache.org/jira/browse/SPARK-39951 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Adam Binford >Assignee: Adam Binford >Priority: Major > Fix For: 3.4.0, 3.3.1 > > > Follow up to https://issues.apache.org/jira/browse/SPARK-34863 to update > `supportsColumnarReads` to account for nested fields -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40563) Error at where clause, when sql case executes by else branch
[ https://issues.apache.org/jira/browse/SPARK-40563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618234#comment-17618234 ] Yuming Wang commented on SPARK-40563: - [~Zing] Does branch-3.3 also fix this issue? > Error at where clause, when sql case executes by else branch > > > Key: SPARK-40563 > URL: https://issues.apache.org/jira/browse/SPARK-40563 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Vadim >Priority: Major > Fix For: 3.3.1 > > Attachments: java-code-example.txt, sql.txt, stack-trace.txt > > > Hello! > The Spark SQL phase optimization failed with an internal error. Please, fill > a bug report in, and provide the full stack trace. > - Spark version 3.3.0 > - Scala version 2.12 > - DatasourceV2 > - Postgres > - Postgres JDBC Driver: 42+ > - Java8 > Case: > select > case > when (t_name = 'foo') then 'foo' > else 'default' > end as case_when > from > t > where > case > when (t_name = 'foo') then 'foo' > else 'default' > end *= 'foo'; -> works as expected* > *--* > select > case > when (t_name = 'foo') then 'foo' > else 'default' > end as case_when > from > t > where > case > when (t_name = 'foo') then 'foo' > else 'default' > end *= 'default'; -> query throws ex;* > In the where clause, when we try to find rows via the else branch, Spark throws an exception: > The Spark SQL phase optimization failed with an internal error. Please, fill > a bug report in, and provide the full stack trace. > Caused by: java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:208) > > org.apache.spark.sql.execution.datasources.v2.PushablePredicate.$anonfun$unapply$1(DataSourceV2Strategy.scala:589) > In the debugger, at def unapply in PushablePredicate.class: > when the sql case returns 'foo', the unapply function accepts (t_name = 'foo') as an > instance of Predicate; > when the sql case returns 'default', the unapply function accepts COALESCE(t_name = > 'foo', FALSE) as an instance of GeneralScalarExpression, and the assertion fails > with an error > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
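The same predicate expressed through the DataFrame API, as a hedged sketch; the catalog and table names are placeholders for a Postgres V2 source, and the failing branch is the one the reporter describes reaching PushablePredicate.

{code:scala}
// Hedged sketch; catalog/table names are placeholders for a JDBC V2 catalog.
import org.apache.spark.sql.functions._

val t = spark.read.table("postgres.public.t") // placeholder V2 catalog table

val caseWhen = when(col("t_name") === "foo", lit("foo")).otherwise(lit("default"))

t.select(caseWhen.as("case_when"))
  .where(caseWhen === "default") // per the report, this branch trips the assertion
  .show()
{code}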
[jira] [Updated] (SPARK-39200) Stream is corrupted Exception while fetching the blocks from fallback storage system
[ https://issues.apache.org/jira/browse/SPARK-39200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-39200: Fix Version/s: 3.3.1 (was: 3.3.2) > Stream is corrupted Exception while fetching the blocks from fallback storage > system > > > Key: SPARK-39200 > URL: https://issues.apache.org/jira/browse/SPARK-39200 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Rajendra Gujja >Assignee: Frank Yin >Priority: Major > Fix For: 3.4.0, 3.3.1, 3.2.3 > > > When executor decommissioning and fallback storage are enabled, shuffle > reads fail with `FetchFailedException: Stream is corrupted` > ref: https://issues.apache.org/jira/browse/SPARK-18105 (search for > decommission) > > This happens when the shuffle block is bigger than what `InputStream.read` > can return in one attempt. The code path does not read the block fully > (as `readFully` would), and the partial read causes the exception. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
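To make the failure mode concrete: `InputStream.read(buf, off, len)` may return fewer than `len` bytes, so a single call can leave the buffer partially filled. A minimal sketch of the read-fully loop that avoids this (not the actual Spark code path):
{code:scala}
import java.io.{EOFException, InputStream}

// A single read() may stop short; loop until the buffer is full or EOF.
// java.io.DataInputStream#readFully gives the same guarantee out of the box.
def readFully(in: InputStream, buf: Array[Byte]): Unit = {
  var off = 0
  while (off < buf.length) {
    val n = in.read(buf, off, buf.length - off)
    if (n < 0) throw new EOFException(s"Stream ended after $off of ${buf.length} bytes")
    off += n
  }
}
{code}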
[jira] [Updated] (SPARK-40535) NPE from observe of collect_list
[ https://issues.apache.org/jira/browse/SPARK-40535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40535: Fix Version/s: 3.3.1 (was: 3.3.2) > NPE from observe of collect_list > > > Key: SPARK-40535 > URL: https://issues.apache.org/jira/browse/SPARK-40535 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: jiaan.geng >Priority: Major > Fix For: 3.4.0, 3.3.1 > > > The code below reproduces the issue: > {code:scala} > import org.apache.spark.sql.functions._ > val df = spark.range(1,10,1,11) > df.observe("collectedList", collect_list("id")).collect() > {code} > instead of > {code} > Array(1, 2, 3, 4, 5, 6, 7, 8, 9) > {code} > it fails with the NPE: > {code:java} > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:641) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:602) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:624) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40547) Fix dead links in sparkr-vignettes.Rmd
[ https://issues.apache.org/jira/browse/SPARK-40547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40547: Fix Version/s: 3.3.1 (was: 3.3.2) > Fix dead links in sparkr-vignettes.Rmd > -- > > Key: SPARK-40547 > URL: https://issues.apache.org/jira/browse/SPARK-40547 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0, 3.3.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40322) Fix all dead links
[ https://issues.apache.org/jira/browse/SPARK-40322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40322: Fix Version/s: 3.4.0 3.3.1 (was: 3.3.2) > Fix all dead links > -- > > Key: SPARK-40322 > URL: https://issues.apache.org/jira/browse/SPARK-40322 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0, 3.3.1 > > > > [https://www.deadlinkchecker.com/website-dead-link-checker.asp] > > > ||Status||URL||Source link text|| > |-1 Not found: The server name or address could not be > resolved|[http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark]|[Using > Parquet and Scrooge with Spark|https://spark.apache.org/documentation.html]| > |-1 Not found: The server name or address could not be > resolved|[http://blinkdb.org/]|[BlinkDB|https://spark.apache.org/third-party-projects.html]| > |404 Not > Found|[https://github.com/AyasdiOpenSource/df]|[DF|https://spark.apache.org/third-party-projects.html]| > |-1 Timeout|[https://atp.io/]|[atp|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://www.sehir.edu.tr/en/]|[Istanbul Sehir > University|https://spark.apache.org/powered-by.html]| > |404 Not Found|[http://nsn.com/]|[Nokia Solutions and > Networks|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://www.nubetech.co/]|[Nube > Technologies|https://spark.apache.org/powered-by.html]| > |-1 Timeout|[http://ooyala.com/]|[Ooyala, > Inc.|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://engineering.ooyala.com/blog/fast-spark-queries-memory-datasets]|[Spark > for Fast Queries|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://www.sisa.samsung.com/]|[Samsung Research > America|https://spark.apache.org/powered-by.html]| > |-1 > Timeout|[https://checker.apache.org/projs/spark.html]|[https://checker.apache.org/projs/spark.html|https://spark.apache.org/release-process.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|[AMP > Camp 2 [302 from > http://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|https://spark.apache.org/documentation.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/agenda-2012/]|[AMP Camp 1 [302 > from > http://ampcamp.berkeley.edu/agenda-2012/]|https://spark.apache.org/documentation.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/4/]|[AMP Camp 4 [302 from > http://ampcamp.berkeley.edu/4/]|https://spark.apache.org/documentation.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/3/]|[AMP Camp 3 [302 from > http://ampcamp.berkeley.edu/3/]|https://spark.apache.org/documentation.html]| > |-500 Internal Server > Error-|-[https://www.packtpub.com/product/spark-cookbook/9781783987061]-|-[Spark > Cookbook [301 from > https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook]|https://spark.apache.org/documentation.html]-| > |-500 Internal Server > Error-|-[https://www.packtpub.com/product/apache-spark-graph-processing/9781784391805]-|-[Apache > Spark Graph Processing [301 from > https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing]|https://spark.apache.org/documentation.html]-| > |500 Internal Server > 
Error|[https://prevalentdesignevents.com/sparksummit/eu17/]|[register|https://spark.apache.org/news/]| > |500 Internal Server > Error|[https://prevalentdesignevents.com/sparksummit/ss17/?_ga=1.211902866.780052874.1433437196]|[register|https://spark.apache.org/news/]| > |500 Internal Server > Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/registration.aspx?source=header]|[register|https://spark.apache.org/news/]| > |500 Internal Server > Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/speaker/]|[Spark > Summit Europe|https://spark.apache.org/news/]| > |-1 > Timeout|[http://strataconf.com/strata2013]|[Strata|https://spark.apache.org/news/]| > |-1 Not found: The server name or address could not be > resolved|[http://blog.quantifind.com/posts/spark-unit-test/]|[Unit testing > with Spark|https://spark.apache.org/news/]| > |-1 Not found: The server name or address could not be > resolved|[http://blog.quantifind.com/posts/logging-post/]|[Configuring > Spark's logs|https://spark.apache.org/news/]| > |-1 > Timeout|[http://strata.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html]|[Spark|https://spark.apache.org/news/]| > |-1 > Timeout|[http://strata.oreilly.com/2012/11/shark-real-time-queries-and-a
[jira] [Updated] (SPARK-40562) Add spark.sql.legacy.groupingIdWithAppendedUserGroupBy
[ https://issues.apache.org/jira/browse/SPARK-40562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40562: Fix Version/s: 3.3.1 (was: 3.3.2) > Add spark.sql.legacy.groupingIdWithAppendedUserGroupBy > -- > > Key: SPARK-40562 > URL: https://issues.apache.org/jira/browse/SPARK-40562 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.3.1, 3.2.3 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0, 3.3.1, 3.2.3 > > > {code:java} > scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS > t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show() > +--------+------------+ > |count(1)|grouping__id| > +--------+------------+ > | 1| 2| > | 1| 2| > +--------+------------+ > scala> sql("set spark.sql.legacy.groupingIdWithAppendedUserGroupBy=true") > res1: org.apache.spark.sql.DataFrame = [key: string, value: string] > scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS > t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show() > +--------+------------+ > |count(1)|grouping__id| > +--------+------------+ > | 1| 1| > | 1| 1| > +--------+------------+ {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38717) Handle Hive's bucket spec case preserving behaviour
[ https://issues.apache.org/jira/browse/SPARK-38717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-38717: Fix Version/s: 3.3.1 (was: 3.3.2) > Handle Hive's bucket spec case preserving behaviour > --- > > Key: SPARK-38717 > URL: https://issues.apache.org/jira/browse/SPARK-38717 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 3.4.0, 3.3.1 > > > {code} > CREATE TABLE t( > c STRING, > B_C STRING > ) > PARTITIONED BY (p_c STRING) > CLUSTERED BY (B_C) INTO 4 BUCKETS > STORED AS PARQUET > {code} > then > {code} > SELECT * FROM t > {code} > fails with: > {code} > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns > B_C is not part of the table columns ([FieldSchema(name:c, type:string, > comment:null), FieldSchema(name:b_c, type:string, comment:null)] > at > org.apache.hadoop.hive.ql.metadata.Table.setBucketCols(Table.java:552) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.toHiveTable(HiveClientImpl.scala:1098) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:764) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:274) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:763) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitionsByFilter$1(HiveExternalCatalog.scala:1287) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:101) > ... 110 more > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40583) Documentation error in "Integration with Cloud Infrastructures"
[ https://issues.apache.org/jira/browse/SPARK-40583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40583: Fix Version/s: 3.3.1 (was: 3.3.2) > Documentation error in "Integration with Cloud Infrastructures" > --- > > Key: SPARK-40583 > URL: https://issues.apache.org/jira/browse/SPARK-40583 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.3.0 >Reporter: Daniel Ranchal >Assignee: Daniel Ranchal >Priority: Minor > Fix For: 3.4.0, 3.3.1, 3.2.3 > > > The artifactId that implements the integration with several cloud > infrastructures is wrong. Instead of "hadoop-cloud-\{SCALA_VERSION}", it > should say "spark-hadoop-cloud-\{SCALA_VERSION}". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
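For illustration, the corrected coordinate as it would appear in an sbt build; the version number here is an arbitrary example:
{code:scala}
// sbt: the cloud-integration module is published as spark-hadoop-cloud,
// not hadoop-cloud. %% appends the Scala binary version to the artifactId.
libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % "3.3.1"
{code}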
[jira] [Updated] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner
[ https://issues.apache.org/jira/browse/SPARK-40636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40636: Fix Version/s: 3.3.1 (was: 3.3.2) > Fix wrong remained shuffles log in BlockManagerDecommissioner > - > > Key: SPARK-40636 > URL: https://issues.apache.org/jira/browse/SPARK-40636 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.4.0, 3.3.1, 3.2.3 > > > BlockManagerDecommissioner should log the correct number of remaining shuffles; in the log below the "remained" count never decreases: > {code:java} > 4 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:15.035 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:45.069 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
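A hedged sketch of the accounting the log message needs; the names below are illustrative, not the actual BlockManagerDecommissioner fields. The point is that the remaining figure should shrink as shuffles are migrated instead of repeating the total:
{code:scala}
// Illustrative bookkeeping only; field and method names are hypothetical.
val totalShuffles = 24
var migrated = 0

def logProgress(addedNow: Int): Unit = {
  migrated += addedNow
  val remaining = totalShuffles - migrated
  println(s"$addedNow of $totalShuffles local shuffles are added. " +
    s"In total, $remaining shuffles are remained.")
}

logProgress(4) // logs "... 20 shuffles are remained." rather than a constant 24
{code}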
[jira] [Updated] (SPARK-40648) Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module
[ https://issues.apache.org/jira/browse/SPARK-40648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40648: Fix Version/s: 3.3.1 (was: 3.3.2) > Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module > -- > > Key: SPARK-40648 > URL: https://issues.apache.org/jira/browse/SPARK-40648 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0, 3.3.1 > > > SPARK-40490 made the test cases related to `YarnShuffleIntegrationSuite` > verify the registeredExecFile reload scenario again, so we need > to add `@ExtendedLevelDBTest` to the test cases that use LevelDB so that > macOS/Apple Silicon users can skip the relevant tests through > `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
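For reference, a hedged sketch of how such a tag is typically applied; the suite below is hypothetical, while the annotation class and the exclude flag are the ones named in the issue:
{code:scala}
import org.apache.spark.tags.ExtendedLevelDBTest
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical suite: the class-level tag lets builds exclude it wholesale.
@ExtendedLevelDBTest
class LevelDBBackedSuite extends AnyFunSuite {
  test("reload registeredExecFile state") {
    // ... exercises a LevelDB-backed store ...
  }
}
{code}
Runs on macOS/Apple Silicon can then skip such suites with `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest`, as described above.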
[jira] [Updated] (SPARK-40574) Add PURGE to DROP TABLE doc
[ https://issues.apache.org/jira/browse/SPARK-40574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40574: Fix Version/s: 3.3.1 (was: 3.3.2) > Add PURGE to DROP TABLE doc > --- > > Key: SPARK-40574 > URL: https://issues.apache.org/jira/browse/SPARK-40574 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0, 3.3.1, 3.2.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
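The statement the documentation should now cover, as a minimal illustration; the table name is arbitrary and the session setup assumes a Hive-enabled build:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").enableHiveSupport().getOrCreate()
spark.sql("CREATE TABLE t (id INT) USING parquet")
// PURGE asks the catalog to delete the data immediately instead of moving it
// to the trash (honoured where the underlying Hive metastore supports it).
spark.sql("DROP TABLE t PURGE")
{code}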
[jira] [Updated] (SPARK-40612) On Kubernetes for long running app Spark using an invalid principal to renew the delegation token
[ https://issues.apache.org/jira/browse/SPARK-40612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40612: Fix Version/s: 3.3.1 (was: 3.3.2) > On Kubernetes for long running app Spark using an invalid principal to renew > the delegation token > - > > Key: SPARK-40612 > URL: https://issues.apache.org/jira/browse/SPARK-40612 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.1.3, 3.2.1, 3.3.0, 3.2.2 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.4.0, 3.3.1, 3.2.3 > > > When the delegation token is fetched for the first time, the principal is the > current user, but subsequent token renewals use a MapReduce/YARN-specific > principal even on Kubernetes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39725) Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622
[ https://issues.apache.org/jira/browse/SPARK-39725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-39725: Fix Version/s: 3.3.1 (was: 3.3.2) > Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622 > > > Key: SPARK-39725 > URL: https://issues.apache.org/jira/browse/SPARK-39725 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: jetty-io-spark.png > > > [Release note |https://github.com/eclipse/jetty.project/releases] > [CVE-2022-2047|https://nvd.nist.gov/vuln/detail/CVE-2022-2047] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40682) Set spark.driver.maxResultSize to 3g in SqlBasedBenchmark
[ https://issues.apache.org/jira/browse/SPARK-40682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40682: Fix Version/s: 3.3.1 (was: 3.3.2) > Set spark.driver.maxResultSize to 3g in SqlBasedBenchmark > - > > Key: SPARK-40682 > URL: https://issues.apache.org/jira/browse/SPARK-40682 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.4.0, 3.3.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
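What the change amounts to, sketched below; the builder call shows how the conf is set rather than the benchmark's exact code (the default limit is 1g):
{code:scala}
import org.apache.spark.sql.SparkSession

// Collecting large benchmark results to the driver can exceed the default
// 1g spark.driver.maxResultSize; the benchmarks raise it to 3g.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.driver.maxResultSize", "3g")
  .getOrCreate()
{code}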
[jira] [Updated] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema
[ https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ohad updated SPARK-40808: - Description: Hello. I am writing unit-tests for some functionality in my application that reads data from CSV files using Spark. I am reading the data using: {code:java} header=True mergeSchema=True inferSchema=True{code} When I am reading this single file: {code:java} File1: "int_col","string_col","decimal_col","date_col" 1,"hello",1.43,2022-02-23 2,"world",5.534,2021-05-05 3,"my name",86.455,2011-08-15 4,"is ohad",6.234,2002-03-22{code} I am getting this schema: {code:java} int_col=int string_col=string decimal_col=double date_col=string{code} When I am duplicating this file, I am getting the same schema. The strange part is when I am adding a new int column: it looks like Spark is getting confused and thinks that the columns that were already identified as int are now string: {code:java} File1: "int_col","string_col","decimal_col","date_col" 1,"hello",1.43,2022-02-23 2,"world",5.534,2021-05-05 3,"my name",86.455,2011-08-15 4,"is ohad",6.234,2002-03-22 File2: "int_col","string_col","decimal_col","date_col","int2_col" 1,"hello",1.43,2022-02-23,234 2,"world",5.534,2021-05-05,5 3,"my name",86.455,2011-08-15,32 4,"is ohad",6.234,2002-03-22,2 {code} result: {code:java} int_col=string string_col=string decimal_col=string date_col=string int2_col=int{code} When I am reading only the second file, it looks fine: {code:java} File2: "int_col","string_col","decimal_col","date_col","int2_col" 1,"hello",1.43,2022-02-23,234 2,"world",5.534,2021-05-05,5 3,"my name",86.455,2011-08-15,32 4,"is ohad",6.234,2002-03-22,2{code} result: {code:java} int_col=int string_col=string decimal_col=double date_col=string int2_col=int{code} In conclusion, it looks like there is a bug when mixing the two features: header recognition and merge schema. was: Hello. I am writing unit-tests for some functionality in my application that reads data from CSV files using Spark. I am reading the data using: ``` header=True mergeSchema=True inferSchema=True ``` When I am reading this single file: ``` File1: "int_col","string_col","decimal_col","date_col" 1,"hello",1.43,2022-02-23 2,"world",5.534,2021-05-05 3,"my name",86.455,2011-08-15 4,"is ohad",6.234,2002-03-22 ``` I am getting this schema: ``` int_col=int string_col=string decimal_col=double date_col=string ``` When I am duplicating this file, I am getting the same schema. The strange part is when I am adding a new int column: it looks like Spark is getting confused and thinks that the columns that were already identified as int are now string: ``` File1: "int_col","string_col","decimal_col","date_col" 1,"hello",1.43,2022-02-23 2,"world",5.534,2021-05-05 3,"my name",86.455,2011-08-15 4,"is ohad",6.234,2002-03-22 File2: "int_col","string_col","decimal_col","date_col","int2_col" 1,"hello",1.43,2022-02-23,234 2,"world",5.534,2021-05-05,5 3,"my name",86.455,2011-08-15,32 4,"is ohad",6.234,2002-03-22,2 ``` result: ``` int_col=string string_col=string decimal_col=string date_col=string int2_col=int ``` When I am reading only the second file, it looks fine: ``` File2: "int_col","string_col","decimal_col","date_col","int2_col" 1,"hello",1.43,2022-02-23,234 2,"world",5.534,2021-05-05,5 3,"my name",86.455,2011-08-15,32 4,"is ohad",6.234,2002-03-22,2 ``` result: ``` int_col=int string_col=string decimal_col=double date_col=string int2_col=int ``` In conclusion, it looks like there is a bug when mixing the two features: header recognition and merge schema. 
> Infer schema for CSV files - wrong behavior using header + merge schema > --- > > Key: SPARK-40808 > URL: https://issues.apache.org/jira/browse/SPARK-40808 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.2 >Reporter: ohad >Priority: Major > Labels: CSVReader, csv, csvparser > > Hello. > I am writing unit-tests for some functionality in my application that reads > data from CSV files using Spark. > I am reading the data using: > {code:java} > header=True > mergeSchema=True > inferSchema=True{code} > When I am reading this single file: > {code:java} > File1: > "int_col","string_col","decimal_col","date_col" > 1,"hello",1.43,2022-02-23 > 2,"world",5.534,2021-05-05 > 3,"my name",86.455,2011-08-15 > 4,"is ohad",6.234,2002-03-22{code} > I am getting this schema: > {code:java} > int_col=int > string_col=string > decimal_col=double > date_col=string{code} > When I am duplicating this file, I am getting the same schema. > The strange part is when I am adding a new int column: it looks like Spark is > getting confused and thinks that the columns that were already identified as int are > now string: > {code:java} > File1: > "int_col","string_col","decimal_col","date_col"
[jira] [Updated] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema
[ https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ohad updated SPARK-40808: - Description: Hello. I am writing unit-tests for some functionality in my application that reads data from CSV files using Spark. I am reading the data using: ``` header=True mergeSchema=True inferSchema=True ``` When I am reading this single file: ``` File1: "int_col","string_col","decimal_col","date_col" 1,"hello",1.43,2022-02-23 2,"world",5.534,2021-05-05 3,"my name",86.455,2011-08-15 4,"is ohad",6.234,2002-03-22 ``` I am getting this schema: ``` int_col=int string_col=string decimal_col=double date_col=string ``` When I am duplicating this file, I am getting the same schema. The strange part is when I am adding a new int column: it looks like Spark is getting confused and thinks that the columns that were already identified as int are now string: ``` File1: "int_col","string_col","decimal_col","date_col" 1,"hello",1.43,2022-02-23 2,"world",5.534,2021-05-05 3,"my name",86.455,2011-08-15 4,"is ohad",6.234,2002-03-22 File2: "int_col","string_col","decimal_col","date_col","int2_col" 1,"hello",1.43,2022-02-23,234 2,"world",5.534,2021-05-05,5 3,"my name",86.455,2011-08-15,32 4,"is ohad",6.234,2002-03-22,2 ``` result: ``` int_col=string string_col=string decimal_col=string date_col=string int2_col=int ``` When I am reading only the second file, it looks fine: ``` File2: "int_col","string_col","decimal_col","date_col","int2_col" 1,"hello",1.43,2022-02-23,234 2,"world",5.534,2021-05-05,5 3,"my name",86.455,2011-08-15,32 4,"is ohad",6.234,2002-03-22,2 ``` result: ``` int_col=int string_col=string decimal_col=double date_col=string int2_col=int ``` In conclusion, it looks like there is a bug when mixing the two features: header recognition and merge schema. was: Hello. I am writing some unit-tests for some functionality in my application that reads data from CSV files using Spark. I am reading the data using: ``` header=True mergeSchema=True inferSchema=True ``` When I am reading this single file: ``` File1: "int_col","string_col","decimal_col","date_col" 1,"hello",1.43,2022-02-23 2,"world",5.534,2021-05-05 3,"my name",86.455,2011-08-15 4,"is ohad",6.234,2002-03-22 ``` I am getting this schema: ``` int_col=int string_col=string decimal_col=double date_col=string ``` When I am duplicating this file, I am getting the same schema. The strange part is when I am adding a new int column: it looks like Spark is getting confused and thinks that the columns that were already identified as int are now string: ``` File1: "int_col","string_col","decimal_col","date_col" 1,"hello",1.43,2022-02-23 2,"world",5.534,2021-05-05 3,"my name",86.455,2011-08-15 4,"is ohad",6.234,2002-03-22 File2: "int_col","string_col","decimal_col","date_col","int2_col" 1,"hello",1.43,2022-02-23,234 2,"world",5.534,2021-05-05,5 3,"my name",86.455,2011-08-15,32 4,"is ohad",6.234,2002-03-22,2 ``` result: ``` int_col=string string_col=string decimal_col=string date_col=string int2_col=int ``` When I am reading only the second file, it looks fine: ``` File2: "int_col","string_col","decimal_col","date_col","int2_col" 1,"hello",1.43,2022-02-23,234 2,"world",5.534,2021-05-05,5 3,"my name",86.455,2011-08-15,32 4,"is ohad",6.234,2002-03-22,2 ``` result: ``` int_col=int string_col=string decimal_col=double date_col=string int2_col=int ``` In conclusion, it looks like there is a bug when mixing the two features: header recognition and merge schema. 
> Infer schema for CSV files - wrong behavior using header + merge schema > --- > > Key: SPARK-40808 > URL: https://issues.apache.org/jira/browse/SPARK-40808 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.2 >Reporter: ohad >Priority: Major > Labels: CSVReader, csv, csvparser > > Hello. > I am writing unit-tests for some functionality in my application that reads > data from CSV files using Spark. > I am reading the data using: > ``` > header=True > mergeSchema=True > inferSchema=True > ``` > When I am reading this single file: > ``` > File1: > "int_col","string_col","decimal_col","date_col" > 1,"hello",1.43,2022-02-23 > 2,"world",5.534,2021-05-05 > 3,"my name",86.455,2011-08-15 > 4,"is ohad",6.234,
[jira] [Created] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema
ohad created SPARK-40808: Summary: Infer schema for CSV files - wrong behavior using header + merge schema Key: SPARK-40808 URL: https://issues.apache.org/jira/browse/SPARK-40808 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.2 Reporter: ohad Hello. I am writing some unit-tests for some functionality in my application that reads data from CSV files using Spark. I am reading the data using: ``` header=True mergeSchema=True inferSchema=True ``` When I am reading this single file: ``` File1: "int_col","string_col","decimal_col","date_col" 1,"hello",1.43,2022-02-23 2,"world",5.534,2021-05-05 3,"my name",86.455,2011-08-15 4,"is ohad",6.234,2002-03-22 ``` I am getting this schema: ``` int_col=int string_col=string decimal_col=double date_col=string ``` When I am duplicating this file, I am getting the same schema. The strange part is when I am adding a new int column: it looks like Spark is getting confused and thinks that the columns that were already identified as int are now string: ``` File1: "int_col","string_col","decimal_col","date_col" 1,"hello",1.43,2022-02-23 2,"world",5.534,2021-05-05 3,"my name",86.455,2011-08-15 4,"is ohad",6.234,2002-03-22 File2: "int_col","string_col","decimal_col","date_col","int2_col" 1,"hello",1.43,2022-02-23,234 2,"world",5.534,2021-05-05,5 3,"my name",86.455,2011-08-15,32 4,"is ohad",6.234,2002-03-22,2 ``` result: ``` int_col=string string_col=string decimal_col=string date_col=string int2_col=int ``` When I am reading only the second file, it looks fine: ``` File2: "int_col","string_col","decimal_col","date_col","int2_col" 1,"hello",1.43,2022-02-23,234 2,"world",5.534,2021-05-05,5 3,"my name",86.455,2011-08-15,32 4,"is ohad",6.234,2002-03-22,2 ``` result: ``` int_col=int string_col=string decimal_col=double date_col=string int2_col=int ``` In conclusion, it looks like there is a bug when mixing the two features: header recognition and merge schema. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org