[jira] [Updated] (SPARK-42445) Fix SparkR install.spark function

2023-02-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42445:
--
Affects Version/s: 3.4.0

> Fix SparkR install.spark function
> -
>
> Key: SPARK-42445
> URL: https://issues.apache.org/jira/browse/SPARK-42445
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.3.0, 3.3.1, 3.3.2, 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> $ R
> R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid"
> Copyright (C) 2022 The R Foundation for Statistical Computing
> Platform: aarch64-apple-darwin20 (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> > library(SparkR)
> Attaching package: ‘SparkR’
> The following objects are masked from ‘package:stats’:
> cov, filter, lag, na.omit, predict, sd, var, window
> The following objects are masked from ‘package:base’:
> as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
> rank, rbind, sample, startsWith, subset, summary, transform, union
> > install.spark()
> Spark not found in the cache directory. Installation will start.
> MirrorUrl not provided.
> Looking for preferred site from apache website...
> Preferred mirror site found: https://dlcdn.apache.org/spark
> Downloading spark-3.3.2 for Hadoop 2.7 from:
> - https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.7.tgz
> trying URL 
> 'https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.7.tgz'
> simpleWarning in download.file(remotePath, localPath): downloaded length 0 != 
> reported length 196
> {code}






[jira] [Created] (SPARK-42446) Updating PySpark documentation to enhance usability

2023-02-14 Thread Allan Folting (Jira)
Allan Folting created SPARK-42446:
-

 Summary: Updating PySpark documentation to enhance usability
 Key: SPARK-42446
 URL: https://issues.apache.org/jira/browse/SPARK-42446
 Project: Spark
  Issue Type: Documentation
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Allan Folting


Updates to the PySpark documentation web site:
 * Fixing typo on the Getting Started page (Version => Versions)
 * Capitalizing "In/Out" in the DataFrame Quick Start notebook
 * Adding "(Legacy)" to the Spark Streaming heading on the Spark Streaming page
 * Reorganizing the User Guide page to list PySpark guides first + minor 
language updates






[jira] [Updated] (SPARK-42445) Fix SparkR install.spark function

2023-02-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42445:
--
Description: 
{code}
$ R

R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: aarch64-apple-darwin20 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(SparkR)

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union

> install.spark()
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://dlcdn.apache.org/spark
Downloading spark-3.3.2 for Hadoop 2.7 from:
- https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.7.tgz
trying URL 
'https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.7.tgz'
simpleWarning in download.file(remotePath, localPath): downloaded length 0 != 
reported length 196
{code}

> Fix SparkR install.spark function
> -
>
> Key: SPARK-42445
> URL: https://issues.apache.org/jira/browse/SPARK-42445
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.3.0, 3.3.1, 3.3.2
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> $ R
> R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid"
> Copyright (C) 2022 The R Foundation for Statistical Computing
> Platform: aarch64-apple-darwin20 (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> > library(SparkR)
> Attaching package: ‘SparkR’
> The following objects are masked from ‘package:stats’:
> cov, filter, lag, na.omit, predict, sd, var, window
> The following objects are masked from ‘package:base’:
> as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
> rank, rbind, sample, startsWith, subset, summary, transform, union
> > install.spark()
> Spark not found in the cache directory. Installation will start.
> MirrorUrl not provided.
> Looking for preferred site from apache website...
> Preferred mirror site found: https://dlcdn.apache.org/spark
> Downloading spark-3.3.2 for Hadoop 2.7 from:
> - https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.7.tgz
> trying URL 
> 'https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.7.tgz'
> simpleWarning in download.file(remotePath, localPath): downloaded length 0 != 
> reported length 196
> {code}






[jira] [Resolved] (SPARK-42427) Conv should return an error if the internal conversion overflows

2023-02-14 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-42427.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40001
[https://github.com/apache/spark/pull/40001]

> Conv should return an error if the internal conversion overflows
> 
>
> Key: SPARK-42427
> URL: https://issues.apache.org/jira/browse/SPARK-42427
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-42445) Fix SparkR install.spark function

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688837#comment-17688837
 ] 

Apache Spark commented on SPARK-42445:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40031

> Fix SparkR install.spark function
> -
>
> Key: SPARK-42445
> URL: https://issues.apache.org/jira/browse/SPARK-42445
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.3.0, 3.3.1, 3.3.2
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-42445) Fix SparkR install.spark function

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42445:


Assignee: (was: Apache Spark)

> Fix SparkR install.spark function
> -
>
> Key: SPARK-42445
> URL: https://issues.apache.org/jira/browse/SPARK-42445
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.3.0, 3.3.1, 3.3.2
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-42445) Fix SparkR install.spark function

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42445:


Assignee: Apache Spark

> Fix SparkR install.spark function
> -
>
> Key: SPARK-42445
> URL: https://issues.apache.org/jira/browse/SPARK-42445
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.3.0, 3.3.1, 3.3.2
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-42445) Fix SparkR install.spark function

2023-02-14 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-42445:
-

 Summary: Fix SparkR install.spark function
 Key: SPARK-42445
 URL: https://issues.apache.org/jira/browse/SPARK-42445
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 3.3.1, 3.3.0, 3.3.2
Reporter: Dongjoon Hyun









[jira] [Commented] (SPARK-42002) Implement DataFrameWriterV2 (ReadwriterV2Tests)

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688832#comment-17688832
 ] 

Apache Spark commented on SPARK-42002:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40030

> Implement DataFrameWriterV2 (ReadwriterV2Tests)
> ---
>
> Key: SPARK-42002
> URL: https://issues.apache.org/jira/browse/SPARK-42002
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code}
> pyspark/sql/tests/test_readwriter.py:182 (ReadwriterV2ParityTests.test_api)
> self = <...ReadwriterV2ParityTests testMethod=test_api>
> def test_api(self):
> df = self.df
> >   writer = df.writeTo("testcat.t")
> ../test_readwriter.py:185: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> self = DataFrame[key: bigint, value: string], args = ('testcat.t',), kwargs = 
> {}
> def writeTo(self, *args: Any, **kwargs: Any) -> None:
> >   raise NotImplementedError("writeTo() is not implemented.")
> E   NotImplementedError: writeTo() is not implemented.
> ../../connect/dataframe.py:1529: NotImplementedError
> {code}
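For context, here is a small sketch of the DataFrameWriterV2 surface the parity test exercises, written against the existing (non-Connect) PySpark API; `df` and the table name "testcat.t" come from the test above, the rest is illustrative:

{code:python}
# Sketch: the DataFrameWriterV2 calls that Spark Connect's DataFrame.writeTo
# needs to support. Assumes `df` is an existing DataFrame and that a catalog
# named "testcat" is configured, as in the parity test.
writer = df.writeTo("testcat.t")               # returns a DataFrameWriterV2
writer.using("parquet").create()               # create the table from df
df.writeTo("testcat.t").append()               # append to an existing table
df.writeTo("testcat.t").overwritePartitions()  # dynamically replace matching partitions
{code}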






[jira] [Commented] (SPARK-42002) Implement DataFrameWriterV2 (ReadwriterV2Tests)

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688833#comment-17688833
 ] 

Apache Spark commented on SPARK-42002:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40030

> Implement DataFrameWriterV2 (ReadwriterV2Tests)
> ---
>
> Key: SPARK-42002
> URL: https://issues.apache.org/jira/browse/SPARK-42002
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code}
> pyspark/sql/tests/test_readwriter.py:182 (ReadwriterV2ParityTests.test_api)
> self = <...ReadwriterV2ParityTests testMethod=test_api>
> def test_api(self):
> df = self.df
> >   writer = df.writeTo("testcat.t")
> ../test_readwriter.py:185: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> self = DataFrame[key: bigint, value: string], args = ('testcat.t',), kwargs = 
> {}
> def writeTo(self, *args: Any, **kwargs: Any) -> None:
> >   raise NotImplementedError("writeTo() is not implemented.")
> E   NotImplementedError: writeTo() is not implemented.
> ../../connect/dataframe.py:1529: NotImplementedError
> {code}






[jira] [Commented] (SPARK-42431) Union avoid calling `output` before analysis

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688813#comment-17688813
 ] 

Apache Spark commented on SPARK-42431:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40029

> Union avoid calling `output` before analysis
> 
>
> Key: SPARK-42431
> URL: https://issues.apache.org/jira/browse/SPARK-42431
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-38324) The second range is not [0, 59] in the day time ANSI interval

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38324:


Assignee: (was: Apache Spark)

> The second range is not [0, 59] in the day time ANSI interval
> -
>
> Key: SPARK-38324
> URL: https://issues.apache.org/jira/browse/SPARK-38324
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.0
> Environment: Spark 3.3.0 snapshot
>Reporter: chong
>Priority: Major
>
> [https://spark.apache.org/docs/latest/sql-ref-datatypes.html]
>  * SECOND, seconds within minutes and possibly fractions of a second
> [0..59.99]
> The doc says SECOND is seconds within minutes, so its range should be [0, 59].
>  
> But testing shows that 99 seconds is accepted:
> >>> spark.sql("select INTERVAL '10 01:01:99' DAY TO SECOND")
> DataFrame[INTERVAL '10 01:02:39' DAY TO SECOND: interval day to second]
>  
> Meanwhile, the minute range check works as expected:
> >>> spark.sql("select INTERVAL '10 01:60:01' DAY TO SECOND")
> requirement failed: minute 60 outside range [0, 59] (line 1, pos 16)
> == SQL ==
> select INTERVAL '10 01:60:01' DAY TO SECOND
> ^^^
>  
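To make the expected behaviour concrete, a short sketch that only restates the two statements from the report (the exception type is an assumption about where the parser check is raised):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.utils import ParseException

spark = SparkSession.builder.getOrCreate()

# The minute field is already range-checked and fails analysis with
# "requirement failed: minute 60 outside range [0, 59]".
try:
    spark.sql("select INTERVAL '10 01:60:01' DAY TO SECOND")
except ParseException as e:
    print(e)

# The seconds field arguably should be rejected the same way; today it is
# accepted and silently rolls over into the minutes field (01:02:39).
print(spark.sql("select INTERVAL '10 01:01:99' DAY TO SECOND"))
{code}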






[jira] [Assigned] (SPARK-38324) The second range is not [0, 59] in the day time ANSI interval

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38324:


Assignee: Apache Spark

> The second range is not [0, 59] in the day time ANSI interval
> -
>
> Key: SPARK-38324
> URL: https://issues.apache.org/jira/browse/SPARK-38324
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.0
> Environment: Spark 3.3.0 snapshot
>Reporter: chong
>Assignee: Apache Spark
>Priority: Major
>
> [https://spark.apache.org/docs/latest/sql-ref-datatypes.html]
>  * SECOND, seconds within minutes and possibly fractions of a second
> [0..59.99]
> The doc says SECOND is seconds within minutes, so its range should be [0, 59].
>  
> But testing shows that 99 seconds is accepted:
> >>> spark.sql("select INTERVAL '10 01:01:99' DAY TO SECOND")
> DataFrame[INTERVAL '10 01:02:39' DAY TO SECOND: interval day to second]
>  
> Meanwhile, the minute range check works as expected:
> >>> spark.sql("select INTERVAL '10 01:60:01' DAY TO SECOND")
> requirement failed: minute 60 outside range [0, 59] (line 1, pos 16)
> == SQL ==
> select INTERVAL '10 01:60:01' DAY TO SECOND
> ^^^
>  






[jira] [Commented] (SPARK-38324) The second range is not [0, 59] in the day time ANSI interval

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688807#comment-17688807
 ] 

Apache Spark commented on SPARK-38324:
--

User 'haoyanzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40028

> The second range is not [0, 59] in the day time ANSI interval
> -
>
> Key: SPARK-38324
> URL: https://issues.apache.org/jira/browse/SPARK-38324
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.0
> Environment: Spark 3.3.0 snapshot
>Reporter: chong
>Priority: Major
>
> [https://spark.apache.org/docs/latest/sql-ref-datatypes.html]
>  * SECOND, seconds within minutes and possibly fractions of a second
> [0..59.99]
> The doc says SECOND is seconds within minutes, so its range should be [0, 59].
>  
> But testing shows that 99 seconds is accepted:
> >>> spark.sql("select INTERVAL '10 01:01:99' DAY TO SECOND")
> DataFrame[INTERVAL '10 01:02:39' DAY TO SECOND: interval day to second]
>  
> Meanwhile, the minute range check works as expected:
> >>> spark.sql("select INTERVAL '10 01:60:01' DAY TO SECOND")
> requirement failed: minute 60 outside range [0, 59] (line 1, pos 16)
> == SQL ==
> select INTERVAL '10 01:60:01' DAY TO SECOND
> ^^^
>  






[jira] [Commented] (SPARK-38324) The second range is not [0, 59] in the day time ANSI interval

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688806#comment-17688806
 ] 

Apache Spark commented on SPARK-38324:
--

User 'haoyanzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40028

> The second range is not [0, 59] in the day time ANSI interval
> -
>
> Key: SPARK-38324
> URL: https://issues.apache.org/jira/browse/SPARK-38324
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.0
> Environment: Spark 3.3.0 snapshot
>Reporter: chong
>Priority: Major
>
> [https://spark.apache.org/docs/latest/sql-ref-datatypes.html]
>  * SECOND, seconds within minutes and possibly fractions of a second
> [0..59.99]
> The doc says SECOND is seconds within minutes, so its range should be [0, 59].
>  
> But testing shows that 99 seconds is accepted:
> >>> spark.sql("select INTERVAL '10 01:01:99' DAY TO SECOND")
> DataFrame[INTERVAL '10 01:02:39' DAY TO SECOND: interval day to second]
>  
> Meanwhile, the minute range check works as expected:
> >>> spark.sql("select INTERVAL '10 01:60:01' DAY TO SECOND")
> requirement failed: minute 60 outside range [0, 59] (line 1, pos 16)
> == SQL ==
> select INTERVAL '10 01:60:01' DAY TO SECOND
> ^^^
>  






[jira] [Commented] (SPARK-42441) Scala Client - Implement Column API

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688805#comment-17688805
 ] 

Apache Spark commented on SPARK-42441:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/40027

> Scala Client - Implement Column API
> ---
>
> Key: SPARK-42441
> URL: https://issues.apache.org/jira/browse/SPARK-42441
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>







[jira] [Commented] (SPARK-42441) Scala Client - Implement Column API

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688804#comment-17688804
 ] 

Apache Spark commented on SPARK-42441:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/40027

> Scala Client - Implement Column API
> ---
>
> Key: SPARK-42441
> URL: https://issues.apache.org/jira/browse/SPARK-42441
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>







[jira] [Assigned] (SPARK-42441) Scala Client - Implement Column API

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42441:


Assignee: Apache Spark

> Scala Client - Implement Column API
> ---
>
> Key: SPARK-42441
> URL: https://issues.apache.org/jira/browse/SPARK-42441
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-42441) Scala Client - Implement Column API

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42441:


Assignee: (was: Apache Spark)

> Scala Client - Implement Column API
> ---
>
> Key: SPARK-42441
> URL: https://issues.apache.org/jira/browse/SPARK-42441
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>







[jira] [Updated] (SPARK-42444) DataFrame.drop should handle multi columns properly

2023-02-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-42444:
--
Description: 
{code:java}
from pyspark.sql import Row
df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", 
"name"])
df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, 
name="Bob")])
df1.join(df2, df1.name == df2.name, 'inner').drop('name', 'age').show()
{code}

This works in 3.3

{code:java}
+------+
|height|
+------+
|    85|
|    80|
+------+
{code}

but fails in 3.4


{code:java}
---
AnalysisException Traceback (most recent call last)
Cell In[1], line 4
  2 df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], 
["age", "name"])
  3 df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, 
name="Bob")])
> 4 df1.join(df2, df1.name == df2.name, 'inner').drop('name', 'age').show()

File ~/Dev/spark/python/pyspark/sql/dataframe.py:4913, in DataFrame.drop(self, 
*cols)
   4911 jcols = [_to_java_column(c) for c in cols]
   4912 first_column, *remaining_columns = jcols
-> 4913 jdf = self._jdf.drop(first_column, self._jseq(remaining_columns))
   4915 return DataFrame(jdf, self.sparkSession)

File ~/Dev/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in 
JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317 self.command_header +\
   1318 args_command +\
   1319 proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323 answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326 if hasattr(temp_arg, "_detach"):

File ~/Dev/spark/python/pyspark/errors/exceptions/captured.py:159, in 
capture_sql_exception.<locals>.deco(*a, **kw)
155 converted = convert_exception(e.java_exception)
156 if not isinstance(converted, UnknownException):
157 # Hide where the exception came from that shows a non-Pythonic
158 # JVM exception message.
--> 159 raise converted from None
160 else:
161 raise

AnalysisException: [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could 
be: [`name`, `name`].

{code}



  was:

{code:java}
from pyspark.sql import Row
df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", 
"name"])
df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, 
name="Bob")])
df1.join(df2, df1.name == df2.name, 'inner').drop('name', 'age').show()
{code}

This works in 3.3.0

{code:java}
+------+
|height|
+------+
|    85|
|    80|
+------+
{code}

but fails in 3.4


{code:java}
---
AnalysisException Traceback (most recent call last)
Cell In[1], line 4
  2 df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], 
["age", "name"])
  3 df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, 
name="Bob")])
> 4 df1.join(df2, df1.name == df2.name, 'inner').drop('name', 'age').show()

File ~/Dev/spark/python/pyspark/sql/dataframe.py:4913, in DataFrame.drop(self, 
*cols)
   4911 jcols = [_to_java_column(c) for c in cols]
   4912 first_column, *remaining_columns = jcols
-> 4913 jdf = self._jdf.drop(first_column, self._jseq(remaining_columns))
   4915 return DataFrame(jdf, self.sparkSession)

File ~/Dev/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in 
JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317 self.command_header +\
   1318 args_command +\
   1319 proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323 answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326 if hasattr(temp_arg, "_detach"):

File ~/Dev/spark/python/pyspark/errors/exceptions/captured.py:159, in 
capture_sql_exception.<locals>.deco(*a, **kw)
155 converted = convert_exception(e.java_exception)
156 if not isinstance(converted, UnknownException):
157 # Hide where the exception came from that shows a non-Pythonic
158 # JVM exception message.
--> 159 raise converted from None
160 else:
161 raise

AnalysisException: [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could 
be: [`name`, `name`].

{code}




> DataFrame.drop should handle multi columns properly
> ---
>
> Key: SPARK-42444
> URL: https://issues.apache.org/jira/browse/SPARK-42444
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>

[jira] [Created] (SPARK-42444) DataFrame.drop should handle multi columns properly

2023-02-14 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-42444:
-

 Summary: DataFrame.drop should handle multi columns properly
 Key: SPARK-42444
 URL: https://issues.apache.org/jira/browse/SPARK-42444
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng



{code:java}
from pyspark.sql import Row
df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", 
"name"])
df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, 
name="Bob")])
df1.join(df2, df1.name == df2.name, 'inner').drop('name', 'age').show()
{code}

This works in 3.3.0

{code:java}
+------+
|height|
+------+
|    85|
|    80|
+------+
{code}

but fails in 3.4


{code:java}
---
AnalysisException Traceback (most recent call last)
Cell In[1], line 4
  2 df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], 
["age", "name"])
  3 df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, 
name="Bob")])
> 4 df1.join(df2, df1.name == df2.name, 'inner').drop('name', 'age').show()

File ~/Dev/spark/python/pyspark/sql/dataframe.py:4913, in DataFrame.drop(self, 
*cols)
   4911 jcols = [_to_java_column(c) for c in cols]
   4912 first_column, *remaining_columns = jcols
-> 4913 jdf = self._jdf.drop(first_column, self._jseq(remaining_columns))
   4915 return DataFrame(jdf, self.sparkSession)

File ~/Dev/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in 
JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317 self.command_header +\
   1318 args_command +\
   1319 proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323 answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326 if hasattr(temp_arg, "_detach"):

File ~/Dev/spark/python/pyspark/errors/exceptions/captured.py:159, in 
capture_sql_exception.<locals>.deco(*a, **kw)
155 converted = convert_exception(e.java_exception)
156 if not isinstance(converted, UnknownException):
157 # Hide where the exception came from that shows a non-Pythonic
158 # JVM exception message.
--> 159 raise converted from None
160 else:
161 raise

AnalysisException: [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could 
be: [`name`, `name`].

{code}
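A possible workaround until the multi-column drop is fixed on 3.4 (a sketch, not from the ticket): drop the ambiguous 'name' columns through their parent DataFrames' Column references, then drop the unambiguous 'age' column by name.

{code:python}
# Assumes the same df1/df2 as above. Dropping by Column reference avoids the
# AMBIGUOUS_REFERENCE error that drop('name', 'age') hits on 3.4.
joined = df1.join(df2, df1.name == df2.name, "inner")
joined.drop(df1.name).drop(df2.name).drop("age").show()
# expected: only the 'height' column, matching the 3.3 output shown above
{code}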








[jira] [Commented] (SPARK-42401) Incorrect results or NPE when inserting null value into array using array_insert/array_append

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688786#comment-17688786
 ] 

Apache Spark commented on SPARK-42401:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/40026

> Incorrect results or NPE when inserting null value into array using 
> array_insert/array_append
> -
>
> Key: SPARK-42401
> URL: https://issues.apache.org/jira/browse/SPARK-42401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness
> Fix For: 3.4.0
>
>
> Example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (array(1, 2, 3, 4), 5, 5),
> (array(1, 2, 3, 4), 5, null)
> as v1(col1,col2,col3);
> select array_insert(col1, col2, col3) from v1;
> {noformat}
> This produces an incorrect result:
> {noformat}
> [1,2,3,4,5]
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> {noformat}
> A more succinct example:
> {noformat}
> select array_insert(array(1, 2, 3, 4), 5, cast(null as int));
> {noformat}
> This also produces an incorrect result:
> {noformat}
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> {noformat}
> Another example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (array('1', '2', '3', '4'), 5, '5'),
> (array('1', '2', '3', '4'), 5, null)
> as v1(col1,col2,col3);
> select array_insert(col1, col2, col3) from v1;
> {noformat}
> The above query throws a {{NullPointerException}}:
> {noformat}
> 23/02/10 11:08:05 ERROR SparkSQLDriver: Failed in [select array_insert(col1, 
> col2, col3) from v1]
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.$anonfun$unsafeRows$1(LocalTableScanExec.scala:44)
> {noformat}
> {{array_append}} has the same issue:
> {noformat}
> spark-sql> select array_append(array(1, 2, 3, 4), cast(null as int));
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> Time taken: 3.679 seconds, Fetched 1 row(s)
> spark-sql> select array_append(array('1', '2', '3', '4'), cast(null as 
> string));
> 23/02/10 11:13:36 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> {noformat}
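The same expectation written against the DataFrame API, as a sketch (array_append is one of the 3.4 functions the ticket covers; the rest of the snippet is illustrative):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Appending a null element should keep the null rather than turning it into 0.
df = spark.range(1).select(
    F.array_append(
        F.array(F.lit(1), F.lit(2), F.lit(3), F.lit(4)),
        F.lit(None).cast("int"),
    ).alias("arr")
)
df.show(truncate=False)  # expected once fixed: [1, 2, 3, 4, null]
{code}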






[jira] [Commented] (SPARK-42401) Incorrect results or NPE when inserting null value into array using array_insert/array_append

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688788#comment-17688788
 ] 

Apache Spark commented on SPARK-42401:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/40026

> Incorrect results or NPE when inserting null value into array using 
> array_insert/array_append
> -
>
> Key: SPARK-42401
> URL: https://issues.apache.org/jira/browse/SPARK-42401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness
> Fix For: 3.4.0
>
>
> Example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (array(1, 2, 3, 4), 5, 5),
> (array(1, 2, 3, 4), 5, null)
> as v1(col1,col2,col3);
> select array_insert(col1, col2, col3) from v1;
> {noformat}
> This produces an incorrect result:
> {noformat}
> [1,2,3,4,5]
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> {noformat}
> A more succinct example:
> {noformat}
> select array_insert(array(1, 2, 3, 4), 5, cast(null as int));
> {noformat}
> This also produces an incorrect result:
> {noformat}
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> {noformat}
> Another example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (array('1', '2', '3', '4'), 5, '5'),
> (array('1', '2', '3', '4'), 5, null)
> as v1(col1,col2,col3);
> select array_insert(col1, col2, col3) from v1;
> {noformat}
> The above query throws a {{NullPointerException}}:
> {noformat}
> 23/02/10 11:08:05 ERROR SparkSQLDriver: Failed in [select array_insert(col1, 
> col2, col3) from v1]
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.$anonfun$unsafeRows$1(LocalTableScanExec.scala:44)
> {noformat}
> {{array_append}} has the same issue:
> {noformat}
> spark-sql> select array_append(array(1, 2, 3, 4), cast(null as int));
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> Time taken: 3.679 seconds, Fetched 1 row(s)
> spark-sql> select array_append(array('1', '2', '3', '4'), cast(null as 
> string));
> 23/02/10 11:13:36 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> {noformat}






[jira] [Commented] (SPARK-42426) insertInto fails when the column names are different from the table columns

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688765#comment-17688765
 ] 

Apache Spark commented on SPARK-42426:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/40024

> insertInto fails when the column names are different from the table columns
> ---
>
> Key: SPARK-42426
> URL: https://issues.apache.org/jira/browse/SPARK-42426
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> {noformat}
> File "/.../python/pyspark/sql/connect/readwriter.py", line 518, in 
> pyspark.sql.connect.readwriter.DataFrameWriter.insertInto
> Failed example:
> df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA")
> Exception raised:
> Traceback (most recent call last):
>   File "/.../lib/python3.9/doctest.py", line 1334, in __run
> exec(compile(example.source, filename, "single",
>   File "<doctest pyspark.sql.connect.readwriter.DataFrameWriter.insertInto[3]>", line 1, in <module>
> df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA")
>   File "/.../python/pyspark/sql/connect/readwriter.py", line 477, in 
> insertInto
> self.saveAsTable(tableName)
>   File "/.../python/pyspark/sql/connect/readwriter.py", line 495, in 
> saveAsTable
> 
> self._spark.client.execute_command(self._write.command(self._spark.client))
>   File "/.../python/pyspark/sql/connect/client.py", line 553, in 
> execute_command
> self._execute(req)
>   File "/.../python/pyspark/sql/connect/client.py", line 648, in _execute
> self._handle_error(rpc_error)
>   File "/.../python/pyspark/sql/connect/client.py", line 718, in 
> _handle_error
> raise convert_exception(info, status.message) from None
> pyspark.errors.exceptions.connect.AnalysisException: Cannot resolve 'age' 
> given input columns: [col1, col2].
> {noformat}
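For reference, insertInto resolves columns by position rather than by name, which is why the doctest renames the columns before writing; the traceback above shows that the Connect implementation currently routes insertInto through saveAsTable, which resolves by name and therefore fails. A sketch of the intended call (table and column names follow the doctest):

{code:python}
# Assumes the doctest's `df` (columns: age, name) and table `tblA` already exist.
# insertInto matches the target table's columns by position, so renaming the
# DataFrame columns is expected to work:
df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA")

# saveAsTable resolves by name instead, which is what produces
# "Cannot resolve 'age' given input columns: [col1, col2]" above.
{code}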






[jira] [Commented] (SPARK-42426) insertInto fails when the column names are different from the table columns

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688763#comment-17688763
 ] 

Apache Spark commented on SPARK-42426:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/40024

> insertInto fails when the column names are different from the table columns
> ---
>
> Key: SPARK-42426
> URL: https://issues.apache.org/jira/browse/SPARK-42426
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> {noformat}
> File "/.../python/pyspark/sql/connect/readwriter.py", line 518, in 
> pyspark.sql.connect.readwriter.DataFrameWriter.insertInto
> Failed example:
> df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA")
> Exception raised:
> Traceback (most recent call last):
>   File "/.../lib/python3.9/doctest.py", line 1334, in __run
> exec(compile(example.source, filename, "single",
>   File "<doctest pyspark.sql.connect.readwriter.DataFrameWriter.insertInto[3]>", line 1, in <module>
> df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA")
>   File "/.../python/pyspark/sql/connect/readwriter.py", line 477, in 
> insertInto
> self.saveAsTable(tableName)
>   File "/.../python/pyspark/sql/connect/readwriter.py", line 495, in 
> saveAsTable
> 
> self._spark.client.execute_command(self._write.command(self._spark.client))
>   File "/.../python/pyspark/sql/connect/client.py", line 553, in 
> execute_command
> self._execute(req)
>   File "/.../python/pyspark/sql/connect/client.py", line 648, in _execute
> self._handle_error(rpc_error)
>   File "/.../python/pyspark/sql/connect/client.py", line 718, in 
> _handle_error
> raise convert_exception(info, status.message) from None
> pyspark.errors.exceptions.connect.AnalysisException: Cannot resolve 'age' 
> given input columns: [col1, col2].
> {noformat}






[jira] [Assigned] (SPARK-42426) insertInto fails when the column names are different from the table columns

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42426:


Assignee: (was: Apache Spark)

> insertInto fails when the column names are different from the table columns
> ---
>
> Key: SPARK-42426
> URL: https://issues.apache.org/jira/browse/SPARK-42426
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> {noformat}
> File "/.../python/pyspark/sql/connect/readwriter.py", line 518, in 
> pyspark.sql.connect.readwriter.DataFrameWriter.insertInto
> Failed example:
> df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA")
> Exception raised:
> Traceback (most recent call last):
>   File "/.../lib/python3.9/doctest.py", line 1334, in __run
> exec(compile(example.source, filename, "single",
>   File "<doctest pyspark.sql.connect.readwriter.DataFrameWriter.insertInto[3]>", line 1, in <module>
> df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA")
>   File "/.../python/pyspark/sql/connect/readwriter.py", line 477, in 
> insertInto
> self.saveAsTable(tableName)
>   File "/.../python/pyspark/sql/connect/readwriter.py", line 495, in 
> saveAsTable
> 
> self._spark.client.execute_command(self._write.command(self._spark.client))
>   File "/.../python/pyspark/sql/connect/client.py", line 553, in 
> execute_command
> self._execute(req)
>   File "/.../python/pyspark/sql/connect/client.py", line 648, in _execute
> self._handle_error(rpc_error)
>   File "/.../python/pyspark/sql/connect/client.py", line 718, in 
> _handle_error
> raise convert_exception(info, status.message) from None
> pyspark.errors.exceptions.connect.AnalysisException: Cannot resolve 'age' 
> given input columns: [col1, col2].
> {noformat}






[jira] [Assigned] (SPARK-42426) insertInto fails when the column names are different from the table columns

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42426:


Assignee: Apache Spark

> insertInto fails when the column names are different from the table columns
> ---
>
> Key: SPARK-42426
> URL: https://issues.apache.org/jira/browse/SPARK-42426
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> {noformat}
> File "/.../python/pyspark/sql/connect/readwriter.py", line 518, in 
> pyspark.sql.connect.readwriter.DataFrameWriter.insertInto
> Failed example:
> df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA")
> Exception raised:
> Traceback (most recent call last):
>   File "/.../lib/python3.9/doctest.py", line 1334, in __run
> exec(compile(example.source, filename, "single",
>   File "<doctest pyspark.sql.connect.readwriter.DataFrameWriter.insertInto[3]>", line 1, in <module>
> df.selectExpr("age AS col1", "name AS col2").write.insertInto("tblA")
>   File "/.../python/pyspark/sql/connect/readwriter.py", line 477, in 
> insertInto
> self.saveAsTable(tableName)
>   File "/.../python/pyspark/sql/connect/readwriter.py", line 495, in 
> saveAsTable
> 
> self._spark.client.execute_command(self._write.command(self._spark.client))
>   File "/.../python/pyspark/sql/connect/client.py", line 553, in 
> execute_command
> self._execute(req)
>   File "/.../python/pyspark/sql/connect/client.py", line 648, in _execute
> self._handle_error(rpc_error)
>   File "/.../python/pyspark/sql/connect/client.py", line 718, in 
> _handle_error
> raise convert_exception(info, status.message) from None
> pyspark.errors.exceptions.connect.AnalysisException: Cannot resolve 'age' 
> given input columns: [col1, col2].
> {noformat}






[jira] [Commented] (SPARK-42401) Incorrect results or NPE when inserting null value into array using array_insert/array_append

2023-02-14 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688759#comment-17688759
 ] 

Bruce Robbins commented on SPARK-42401:
---

There is another case:
{noformat}
spark-sql> select array_insert(array('1', '2', '3', '4'), -6, '5');
23/02/14 16:10:19 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{noformat}
{{array_insert}} might implicitly add nulls, and my fix does not cover that 
case. I will follow up.
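To make "implicitly add nulls" concrete: inserting before the start of the array forces null padding, which goes through the same null-element write path (a sketch; only the query itself is from the comment above):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Position -6 in a 4-element array lies before the first element, so the
# result has to be padded with nulls; currently this raises the
# NullPointerException shown above instead of returning the padded array.
spark.sql("select array_insert(array('1', '2', '3', '4'), -6, '5')").show(truncate=False)
{code}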

> Incorrect results or NPE when inserting null value into array using 
> array_insert/array_append
> -
>
> Key: SPARK-42401
> URL: https://issues.apache.org/jira/browse/SPARK-42401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness
> Fix For: 3.4.0
>
>
> Example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (array(1, 2, 3, 4), 5, 5),
> (array(1, 2, 3, 4), 5, null)
> as v1(col1,col2,col3);
> select array_insert(col1, col2, col3) from v1;
> {noformat}
> This produces an incorrect result:
> {noformat}
> [1,2,3,4,5]
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> {noformat}
> A more succinct example:
> {noformat}
> select array_insert(array(1, 2, 3, 4), 5, cast(null as int));
> {noformat}
> This also produces an incorrect result:
> {noformat}
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> {noformat}
> Another example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (array('1', '2', '3', '4'), 5, '5'),
> (array('1', '2', '3', '4'), 5, null)
> as v1(col1,col2,col3);
> select array_insert(col1, col2, col3) from v1;
> {noformat}
> The above query throws a {{NullPointerException}}:
> {noformat}
> 23/02/10 11:08:05 ERROR SparkSQLDriver: Failed in [select array_insert(col1, 
> col2, col3) from v1]
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.$anonfun$unsafeRows$1(LocalTableScanExec.scala:44)
> {noformat}
> {{array_append}} has the same issue:
> {noformat}
> spark-sql> select array_append(array(1, 2, 3, 4), cast(null as int));
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> Time taken: 3.679 seconds, Fetched 1 row(s)
> spark-sql> select array_append(array('1', '2', '3', '4'), cast(null as 
> string));
> 23/02/10 11:13:36 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> {noformat}






[jira] [Resolved] (SPARK-42443) Remove unused object in DataFrameAggregateSuite

2023-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42443.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40023
[https://github.com/apache/spark/pull/40023]

> Remove unused object in DataFrameAggregateSuite
> ---
>
> Key: SPARK-42443
> URL: https://issues.apache.org/jira/browse/SPARK-42443
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Commented] (SPARK-42443) Remove unused object in DataFrameAggregateSuite

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688757#comment-17688757
 ] 

Apache Spark commented on SPARK-42443:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/40023

> Remove unused object in DataFrameAggregateSuite
> ---
>
> Key: SPARK-42443
> URL: https://issues.apache.org/jira/browse/SPARK-42443
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-42443) Remove unused object in DataFrameAggregateSuite

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42443:


Assignee: Rui Wang  (was: Apache Spark)

> Remove unused object in DataFrameAggregateSuite
> ---
>
> Key: SPARK-42443
> URL: https://issues.apache.org/jira/browse/SPARK-42443
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-42443) Remove unused object in DataFrameAggregateSuite

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42443:


Assignee: Apache Spark  (was: Rui Wang)

> Remove unused object in DataFrameAggregateSuite
> ---
>
> Key: SPARK-42443
> URL: https://issues.apache.org/jira/browse/SPARK-42443
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-42443) Remove unused object in DataFrameAggregateSuite

2023-02-14 Thread Rui Wang (Jira)
Rui Wang created SPARK-42443:


 Summary: Remove unused object in DataFrameAggregateSuite
 Key: SPARK-42443
 URL: https://issues.apache.org/jira/browse/SPARK-42443
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Rui Wang
Assignee: Rui Wang









[jira] [Commented] (SPARK-42399) CONV() silently overflows returning wrong results

2023-02-14 Thread Serge Rielau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688755#comment-17688755
 ] 

Serge Rielau commented on SPARK-42399:
--

Adding support is of course best, if it can be done quickly; if not, we should 
stop the wrong results first.

> CONV() silently overflows returning wrong results
> -
>
> Key: SPARK-42399
> URL: https://issues.apache.org/jira/browse/SPARK-42399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Critical
>
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> In ANSI mode we should raise an error for sure.
> In non-ANSI mode either an error or a NULL may be acceptable.
> Alternatively, of course, we could consider if we can support arbitrary 
> domains since the result is a STRING again. 
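A compact repro of the silent overflow (the hex literal in the report is truncated above, so the value below is illustrative; anything larger than 2^64 - 1 shows the same clamping):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

# Twenty hex F's is far larger than 2^64 - 1, yet CONV silently clamps the
# result to the unsigned 64-bit maximum instead of raising an error.
spark.sql("SELECT CONV('FFFFFFFFFFFFFFFFFFFF', 16, 10)").show(truncate=False)
# prints 18446744073709551615 today; under ANSI mode the ticket argues this should fail
{code}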






[jira] [Commented] (SPARK-42406) [PROTOBUF] Recursive field handling is incompatible with delta

2023-02-14 Thread Raghu Angadi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688752#comment-17688752
 ] 

Raghu Angadi commented on SPARK-42406:
--

Thanks for merging [https://github.com/apache/spark/pull/40011] 

Keeping this ticket open to fix the issue with 'nullType' and delta.

> [PROTOBUF] Recursive field handling is incompatible with delta
> --
>
> Key: SPARK-42406
> URL: https://issues.apache.org/jira/browse/SPARK-42406
> Project: Spark
>  Issue Type: Bug
>  Components: Protobuf
>Affects Versions: 3.4.0
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
>Priority: Major
> Fix For: 3.4.0
>
>
> The Protobuf deserializer (the `from_protobuf()` function) optionally supports 
> recursive fields by limiting the depth to a certain level. See the example below. 
> It assigns a 'NullType' for such a field when the allowed depth is reached. 
> This causes a few issues. E.g. a repeated field, as in the following example, 
> results in an Array field with 'NullType'. Delta does not support null type in 
> a complex type.
> Actually, `Array[NullType]` is not really useful anyway.
> How about this fix: drop the recursive field when the limit is reached rather 
> than using a NullType. 
> The example below makes it clear:
> Consider a recursive Protobuf:
>  
> {code:python}
> message TreeNode {
>   string value = 1;
>   repeated TreeNode children = 2;
> }
> {code}
> Allow depth of 2: 
>  
> {code:python}
> df.select(
>     from_protobuf(
>         'proto',
>         messageName = 'TreeNode',
>         options = { ... "recursive.fields.max.depth" : "2" }
>     )
> ).printSchema()
> {code}
> Schema looks like this:
> {noformat}
> root
> |– from_protobuf(proto): struct (nullable = true)|
> | |– value: string (nullable = true)|
> | |– children: array (nullable = false)|
> | | |– element: struct (containsNull = false)|
> | | | |– value: string (nullable = true)|
> | | | |– children: array (nullable = false)|
> | | | | |– element: struct (containsNull = false)|
> | | | | | |– value: string (nullable = true)|
> | | | | | |– children: array (nullable = false). [ === Proposed fix: Drop 
> this field === ]|
> | | | | | | |– element: void (containsNull = false) [ === NOTICE 'void' HERE 
> === ] 
> {noformat}
> When we try to write this to a delta table, we get an error:
> {noformat}
> AnalysisException: Found nested NullType in column 
> from_protobuf(proto).children which is of ArrayType. Delta doesn't support 
> writing NullType in complex types.
> {noformat}
>  
> We could just drop the field 'element' when recursion depth is reached. It is 
> simpler and does not need to deal with NullType. We are ignoring the value 
> anyway. There is no use in keeping the field.
> Another issue is the setting 'recursive.fields.max.depth': it is not enforced 
> correctly, and '0' does not make sense. 
>  
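
A minimal Scala sketch of the "drop the field" idea, applied here to an already-inferred
schema rather than inside the connector; the helper names (`isNullLike`, `dropNullFields`)
are illustrative and not part of the Protobuf connector:

{code:scala}
import org.apache.spark.sql.types._

// True if a type is NullType, or an array whose element type is (recursively) NullType.
def isNullLike(dt: DataType): Boolean = dt match {
  case NullType         => true
  case ArrayType(et, _) => isNullLike(et)
  case _                => false
}

// Recursively drops struct fields that could only ever hold nulls, mirroring the
// proposal to remove the recursion-terminating field instead of keeping Array[NullType].
def dropNullFields(dt: DataType): DataType = dt match {
  case s: StructType =>
    StructType(s.fields
      .filterNot(f => isNullLike(f.dataType))
      .map(f => f.copy(dataType = dropNullFields(f.dataType))))
  case a: ArrayType => a.copy(elementType = dropNullFields(a.elementType))
  case m: MapType   => m.copy(keyType = dropNullFields(m.keyType),
                              valueType = dropNullFields(m.valueType))
  case other => other
}
{code}

Pruning the example schema with such a helper before writing would avoid the Delta
AnalysisException; the change proposed in this ticket does the equivalent pruning inside
`from_protobuf`'s schema generation instead.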



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-42406) [PROTOBUF] Recursive field handling is incompatible with delta

2023-02-14 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reopened SPARK-42406:


> [PROTOBUF] Recursive field handling is incompatible with delta
> --
>
> Key: SPARK-42406
> URL: https://issues.apache.org/jira/browse/SPARK-42406
> Project: Spark
>  Issue Type: Bug
>  Components: Protobuf
>Affects Versions: 3.4.0
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
>Priority: Major
> Fix For: 3.4.0
>
>
> The Protobuf deserializer (the `from_protobuf()` function) optionally supports 
> recursive fields by limiting the depth to a certain level. See the example below. 
> It assigns a 'NullType' for such a field when the allowed depth is reached. 
> This causes a few issues. E.g. a repeated field, as in the following example, 
> results in an Array field with 'NullType'. Delta does not support null type in 
> a complex type.
> Actually, `Array[NullType]` is not really useful anyway.
> How about this fix: drop the recursive field when the limit is reached rather 
> than using a NullType. 
> The example below makes it clear:
> Consider a recursive Protobuf:
>  
> {code:python}
> message TreeNode {
>   string value = 1;
>   repeated TreeNode children = 2;
> }
> {code}
> Allow depth of 2: 
>  
> {code:python}
> df.select(
>     from_protobuf(
>         'proto',
>         messageName = 'TreeNode',
>         options = { ... "recursive.fields.max.depth" : "2" }
>     )
> ).printSchema()
> {code}
> Schema looks like this:
> {noformat}
> root
> |– from_protobuf(proto): struct (nullable = true)|
> | |– value: string (nullable = true)|
> | |– children: array (nullable = false)|
> | | |– element: struct (containsNull = false)|
> | | | |– value: string (nullable = true)|
> | | | |– children: array (nullable = false)|
> | | | | |– element: struct (containsNull = false)|
> | | | | | |– value: string (nullable = true)|
> | | | | | |– children: array (nullable = false). [ === Proposed fix: Drop 
> this field === ]|
> | | | | | | |– element: void (containsNull = false) [ === NOTICE 'void' HERE 
> === ] 
> {noformat}
> When we try to write this to a delta table, we get an error:
> {noformat}
> AnalysisException: Found nested NullType in column 
> from_protobuf(proto).children which is of ArrayType. Delta doesn't support 
> writing NullType in complex types.
> {noformat}
>  
> We could just drop the field 'element' when recursion depth is reached. It is 
> simpler and does not need to deal with NullType. We are ignoring the value 
> anyway. There is no use in keeping the field.
> Another issue is the setting 'recursive.fields.max.depth': it is not enforced 
> correctly, and '0' does not make sense. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42442) Use spark.sql.timestampType for data source inference

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42442:


Assignee: Gengliang Wang  (was: Apache Spark)

> Use spark.sql.timestampType for data source inference
> -
>
> Key: SPARK-42442
> URL: https://issues.apache.org/jira/browse/SPARK-42442
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> With the configuration `spark.sql.timestampType`, TIMESTAMP in Spark is a 
> user-specified alias associated with one of the TIMESTAMP_LTZ and 
> TIMESTAMP_NTZ variations. This is quite complicated for Spark users.
> There is another option `spark.sql.sources.timestampNTZTypeInference.enabled` 
> for schema inference. I would like to introduce it in 
> [https://github.com/apache/spark/pull/40005] but having two flags seems too 
> much. After some thought, I decided to merge 
> `spark.sql.sources.timestampNTZTypeInference.enabled` into 
> `spark.sql.timestampType` and let `spark.sql.timestampType` control the 
> schema inference behavior.
> We can have follow-ups to add a data source option "inferTimestampNTZType" for 
> CSV/JSON/partition columns, like the JDBC data source did.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42442) Use spark.sql.timestampType for data source inference

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42442:


Assignee: Apache Spark  (was: Gengliang Wang)

> Use spark.sql.timestampType for data source inference
> -
>
> Key: SPARK-42442
> URL: https://issues.apache.org/jira/browse/SPARK-42442
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> With the configuration `spark.sql.timestampType`, TIMESTAMP in Spark is a 
> user-specified alias associated with one of the TIMESTAMP_LTZ and 
> TIMESTAMP_NTZ variations. This is quite complicated for Spark users.
> There is another option `spark.sql.sources.timestampNTZTypeInference.enabled` 
> for schema inference. I would like to introduce it in 
> [https://github.com/apache/spark/pull/40005] but having two flags seems too 
> much. After some thought, I decided to merge 
> `spark.sql.sources.timestampNTZTypeInference.enabled` into 
> `spark.sql.timestampType` and let `spark.sql.timestampType` control the 
> schema inference behavior.
> We can have follow-ups to add a data source option "inferTimestampNTZType" for 
> CSV/JSON/partition columns, like the JDBC data source did.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42442) Use spark.sql.timestampType for data source inference

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688748#comment-17688748
 ] 

Apache Spark commented on SPARK-42442:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40022

> Use spark.sql.timestampType for data source inference
> -
>
> Key: SPARK-42442
> URL: https://issues.apache.org/jira/browse/SPARK-42442
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> With the configuration `spark.sql.timestampType`, TIMESTAMP in Spark is a 
> user-specified alias associated with one of the TIMESTAMP_LTZ and 
> TIMESTAMP_NTZ variations. This is quite complicated for Spark users.
> There is another option `spark.sql.sources.timestampNTZTypeInference.enabled` 
> for schema inference. I would like to introduce it in 
> [https://github.com/apache/spark/pull/40005] but having two flags seems too 
> much. After some thought, I decided to merge 
> `spark.sql.sources.timestampNTZTypeInference.enabled` into 
> `spark.sql.timestampType` and let `spark.sql.timestampType` control the 
> schema inference behavior.
> We can have follow-ups to add a data source option "inferTimestampNTZType" for 
> CSV/JSON/partition columns, like the JDBC data source did.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42442) Use spark.sql.timestampType for data source inference

2023-02-14 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-42442:
--

 Summary: Use spark.sql.timestampType for data source inference
 Key: SPARK-42442
 URL: https://issues.apache.org/jira/browse/SPARK-42442
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


With the configuration `spark.sql.timestampType`, TIMESTAMP in Spark is a 
user-specified alias associated with one of the TIMESTAMP_LTZ and TIMESTAMP_NTZ 
variations. This is quite complicated for Spark users.

There is another option `spark.sql.sources.timestampNTZTypeInference.enabled` 
for schema inference. I would like to introduce it in 
[https://github.com/apache/spark/pull/40005] but having two flags seems too 
much. After some thought, I decided to merge 
`spark.sql.sources.timestampNTZTypeInference.enabled` into 
`spark.sql.timestampType` and let `spark.sql.timestampType` control the schema 
inference behavior.

We can have follow-ups to add a data source option "inferTimestampNTZType" for 
CSV/JSON/partition columns, like the JDBC data source did.
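
As an illustrative sketch (not from the ticket) of how the merged behavior would look
from user code: the config value and CSV options below already exist, while having
schema inference follow `spark.sql.timestampType` is exactly what this ticket proposes.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// TIMESTAMP becomes an alias for TIMESTAMP_NTZ ...
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")

// ... and, under this proposal, timestamp columns inferred from CSV/JSON/partition
// values would come back as timestamp_ntz as well, with no second flag involved.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/events.csv")   // hypothetical input path

df.printSchema()
{code}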



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42406) [PROTOBUF] Recursive field handling is incompatible with delta

2023-02-14 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-42406:
--

Assignee: Raghu Angadi

> [PROTOBUF] Recursive field handling is incompatible with delta
> --
>
> Key: SPARK-42406
> URL: https://issues.apache.org/jira/browse/SPARK-42406
> Project: Spark
>  Issue Type: Bug
>  Components: Protobuf
>Affects Versions: 3.4.0
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
>Priority: Major
> Fix For: 3.4.1
>
>
> The Protobuf deserializer (the `from_protobuf()` function) optionally supports 
> recursive fields by limiting the depth to a certain level. See the example below. 
> It assigns a 'NullType' for such a field when the allowed depth is reached. 
> This causes a few issues. E.g. a repeated field, as in the following example, 
> results in an Array field with 'NullType'. Delta does not support null type in 
> a complex type.
> Actually, `Array[NullType]` is not really useful anyway.
> How about this fix: drop the recursive field when the limit is reached rather 
> than using a NullType. 
> The example below makes it clear:
> Consider a recursive Protobuf:
>  
> {code:python}
> message TreeNode {
>   string value = 1;
>   repeated TreeNode children = 2;
> }
> {code}
> Allow depth of 2: 
>  
> {code:python}
> df.select(
>     from_protobuf(
>         'proto',
>         messageName = 'TreeNode',
>         options = { ... "recursive.fields.max.depth" : "2" }
>     )
> ).printSchema()
> {code}
> Schema looks like this:
> {noformat}
> root
> |– from_protobuf(proto): struct (nullable = true)|
> | |– value: string (nullable = true)|
> | |– children: array (nullable = false)|
> | | |– element: struct (containsNull = false)|
> | | | |– value: string (nullable = true)|
> | | | |– children: array (nullable = false)|
> | | | | |– element: struct (containsNull = false)|
> | | | | | |– value: string (nullable = true)|
> | | | | | |– children: array (nullable = false). [ === Proposed fix: Drop 
> this field === ]|
> | | | | | | |– element: void (containsNull = false) [ === NOTICE 'void' HERE 
> === ] 
> {noformat}
> When we try to write this to a delta table, we get an error:
> {noformat}
> AnalysisException: Found nested NullType in column 
> from_protobuf(proto).children which is of ArrayType. Delta doesn't support 
> writing NullType in complex types.
> {noformat}
>  
> We could just drop the field 'element' when recursion depth is reached. It is 
> simpler and does not need to deal with NullType. We are ignoring the value 
> anyway. There is no use in keeping the field.
> Another issue is the setting 'recursive.fields.max.depth': it is not enforced 
> correctly, and '0' does not make sense. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42406) [PROTOBUF] Recursive field handling is incompatible with delta

2023-02-14 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-42406.

Fix Version/s: 3.4.0
   (was: 3.4.1)
   Resolution: Fixed

Issue resolved by pull request 40011
[https://github.com/apache/spark/pull/40011]

> [PROTOBUF] Recursive field handling is incompatible with delta
> --
>
> Key: SPARK-42406
> URL: https://issues.apache.org/jira/browse/SPARK-42406
> Project: Spark
>  Issue Type: Bug
>  Components: Protobuf
>Affects Versions: 3.4.0
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
>Priority: Major
> Fix For: 3.4.0
>
>
> The Protobuf deserializer (the `from_protobuf()` function) optionally supports 
> recursive fields by limiting the depth to a certain level. See the example below. 
> It assigns a 'NullType' for such a field when the allowed depth is reached. 
> This causes a few issues. E.g. a repeated field, as in the following example, 
> results in an Array field with 'NullType'. Delta does not support null type in 
> a complex type.
> Actually, `Array[NullType]` is not really useful anyway.
> How about this fix: drop the recursive field when the limit is reached rather 
> than using a NullType. 
> The example below makes it clear:
> Consider a recursive Protobuf:
>  
> {code:python}
> message TreeNode {
>   string value = 1;
>   repeated TreeNode children = 2;
> }
> {code}
> Allow depth of 2: 
>  
> {code:python}
> df.select(
>     from_protobuf(
>         'proto',
>         messageName = 'TreeNode',
>         options = { ... "recursive.fields.max.depth" : "2" }
>     )
> ).printSchema()
> {code}
> Schema looks like this:
> {noformat}
> root
> |– from_protobuf(proto): struct (nullable = true)|
> | |– value: string (nullable = true)|
> | |– children: array (nullable = false)|
> | | |– element: struct (containsNull = false)|
> | | | |– value: string (nullable = true)|
> | | | |– children: array (nullable = false)|
> | | | | |– element: struct (containsNull = false)|
> | | | | | |– value: string (nullable = true)|
> | | | | | |– children: array (nullable = false). [ === Proposed fix: Drop 
> this field === ]|
> | | | | | | |– element: void (containsNull = false) [ === NOTICE 'void' HERE 
> === ] 
> {noformat}
> When we try to write this to a delta table, we get an error:
> {noformat}
> AnalysisException: Found nested NullType in column 
> from_protobuf(proto).children which is of ArrayType. Delta doesn't support 
> writing NullType in complex types.
> {noformat}
>  
> We could just drop the field 'element' when recursion depth is reached. It is 
> simpler and does not need to deal with NullType. We are ignoring the value 
> anyway. There is no use in keeping the field.
> Another issue is the setting 'recursive.fields.max.depth': it is not enforced 
> correctly, and '0' does not make sense. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42342) Introduce base hierarchy to exceptions.

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688691#comment-17688691
 ] 

Apache Spark commented on SPARK-42342:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/40021

> Introduce base hierarchy to exceptions.
> ---
>
> Key: SPARK-42342
> URL: https://issues.apache.org/jira/browse/SPARK-42342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41775) Implement training functions as input

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688683#comment-17688683
 ] 

Apache Spark commented on SPARK-41775:
--

User 'rithwik-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/40020

> Implement training functions as input
> -
>
> Key: SPARK-41775
> URL: https://issues.apache.org/jira/browse/SPARK-41775
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Assignee: Rithwik Ediga Lakhamsani
>Priority: Major
> Fix For: 3.4.0
>
>
> Sidenote: make formatting updates described in 
> https://github.com/apache/spark/pull/39188
>  
> Currently, `Distributor().run(...)` takes only files as input. Now we will 
> add functionality to take in functions as well. This will require us to go 
> through the following process on each task on the executor nodes:
> 1. Take the input function and args and pickle them
> 2. Create a temp train.py file that looks like
> {code:python}
> import cloudpickle
> import os
> 
> if __name__ == "__main__":
>     with open(f"{tempdir}/train_input.pkl", "rb") as f:
>         train, args = cloudpickle.load(f)
>     output = train(*args)
>     if output and os.environ.get("RANK", "") == "0":  # this is for partitionId == 0
>         with open(f"{tempdir}/train_output.pkl", "wb") as f:
>             cloudpickle.dump(output, f)
> {code}
> 3. Run that train.py file with `torchrun`
> 4. Check whether `train_output.pkl` has been created by the process with 
> partitionId == 0; if it has, deserialize it and return that output through `.collect()`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42430) Add documentation for TimestampNTZ type

2023-02-14 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-42430.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40005
[https://github.com/apache/spark/pull/40005]

> Add documentation for TimestampNTZ type
> ---
>
> Key: SPARK-42430
> URL: https://issues.apache.org/jira/browse/SPARK-42430
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42441) Scala Client - Implement Column API

2023-02-14 Thread Jira
Herman van Hövell created SPARK-42441:
-

 Summary: Scala Client - Implement Column API
 Key: SPARK-42441
 URL: https://issues.apache.org/jira/browse/SPARK-42441
 Project: Spark
  Issue Type: Task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Herman van Hövell






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42440) Implement First batch of Dataset APIs

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688640#comment-17688640
 ] 

Apache Spark commented on SPARK-42440:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/40019

> Implement First batch of Dataset APIs
> -
>
> Key: SPARK-42440
> URL: https://issues.apache.org/jira/browse/SPARK-42440
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42440) Implement First batch of Dataset APIs

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42440:


Assignee: (was: Apache Spark)

> Implement First batch of Dataset APIs
> -
>
> Key: SPARK-42440
> URL: https://issues.apache.org/jira/browse/SPARK-42440
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42440) Implement First batch of Dataset APIs

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42440:


Assignee: Apache Spark

> Implement First batch of Dataset APIs
> -
>
> Key: SPARK-42440
> URL: https://issues.apache.org/jira/browse/SPARK-42440
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42440) Implement First batch of Dataset APIs

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688639#comment-17688639
 ] 

Apache Spark commented on SPARK-42440:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/40019

> Implement First batch of Dataset APIs
> -
>
> Key: SPARK-42440
> URL: https://issues.apache.org/jira/browse/SPARK-42440
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42440) Implement First batch of Dataset APIs

2023-02-14 Thread Jira
Herman van Hövell created SPARK-42440:
-

 Summary: Implement First batch of Dataset APIs
 Key: SPARK-42440
 URL: https://issues.apache.org/jira/browse/SPARK-42440
 Project: Spark
  Issue Type: Task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Herman van Hövell






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42439) Job description in v2 FileWrites can have the wrong committer

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688621#comment-17688621
 ] 

Apache Spark commented on SPARK-42439:
--

User 'LorenzoMartini' has created a pull request for this issue:
https://github.com/apache/spark/pull/40018

> Job description in v2 FileWrites can have the wrong committer
> -
>
> Key: SPARK-42439
> URL: https://issues.apache.org/jira/browse/SPARK-42439
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Lorenzo Martini
>Priority: Minor
>  Labels: bug
>
> There is a difference in behavior between v1 writes and v2 writes in the 
> order of events happening when configuring the file writer and the committer.
> v1:
>  # writer.prepareWrite()
>  # committer.setupJob()
> v2:
>  # committer.setupJob()
>  # writer.prepareWrite()
>  
> This is because the `prepareWrite()` call (the one performing the 
> `job.setOutputFormatClass(classOf[ParquetOutputFormat[Row]])` call) happens 
> as part of `createWriteJobDescription`, which is a `lazy val` in the 
> `toBatch` call and is therefore evaluated after the `committer.setupJob` at 
> the end of `toBatch`.
> This causes issues when evaluating the committer, as some elements might be 
> missing, for example the aforementioned output format class not being set, 
> causing the committer to be set up as a generic write instead of a parquet write.
>  
> The fix is very simple: make the `createJobDescription` call non-lazy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42439) Job description in v2 FileWrites can have the wrong committer

2023-02-14 Thread Lorenzo Martini (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lorenzo Martini updated SPARK-42439:

Labels: bug  (was: )

> Job description in v2 FileWrites can have the wrong committer
> -
>
> Key: SPARK-42439
> URL: https://issues.apache.org/jira/browse/SPARK-42439
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Lorenzo Martini
>Priority: Minor
>  Labels: bug
>
> There is a difference in behavior between v1 writes and v2 writes in the 
> order of events happening when configuring the file writer and the committer.
> v1:
>  # writer.prepareWrite()
>  # committer.setupJob()
> v2:
>  # committer.setupJob()
>  # writer.prepareWrite()
>  
> This is because the `prepareWrite()` call (the one performing the 
> `job.setOutputFormatClass(classOf[ParquetOutputFormat[Row]])` call) happens 
> as part of `createWriteJobDescription`, which is a `lazy val` in the 
> `toBatch` call and is therefore evaluated after the `committer.setupJob` at 
> the end of `toBatch`.
> This causes issues when evaluating the committer, as some elements might be 
> missing, for example the aforementioned output format class not being set, 
> causing the committer to be set up as a generic write instead of a parquet write.
>  
> The fix is very simple: make the `createJobDescription` call non-lazy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42439) Job description in v2 FileWrites can have the wrong committer

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688615#comment-17688615
 ] 

Apache Spark commented on SPARK-42439:
--

User 'LorenzoMartini' has created a pull request for this issue:
https://github.com/apache/spark/pull/40017

> Job description in v2 FileWrites can have the wrong committer
> -
>
> Key: SPARK-42439
> URL: https://issues.apache.org/jira/browse/SPARK-42439
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Lorenzo Martini
>Priority: Minor
>
> There is a difference in behavior between v1 writes and v2 writes in the 
> order of events happening when configuring the file writer and the committer.
> v1:
>  # writer.prepareWrite()
>  # committer.setupJob()
> v2:
>  # committer.setupJob()
>  # writer.prepareWrite()
>  
> This is because the `prepareWrite()` call (the one performing the 
> `job.setOutputFormatClass(classOf[ParquetOutputFormat[Row]])` call) happens 
> as part of `createWriteJobDescription`, which is a `lazy val` in the 
> `toBatch` call and is therefore evaluated after the `committer.setupJob` at 
> the end of `toBatch`.
> This causes issues when evaluating the committer, as some elements might be 
> missing, for example the aforementioned output format class not being set, 
> causing the committer to be set up as a generic write instead of a parquet write.
>  
> The fix is very simple: make the `createJobDescription` call non-lazy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42439) Job description in v2 FileWrites can have the wrong committer

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42439:


Assignee: (was: Apache Spark)

> Job description in v2 FileWrites can have the wrong committer
> -
>
> Key: SPARK-42439
> URL: https://issues.apache.org/jira/browse/SPARK-42439
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Lorenzo Martini
>Priority: Minor
>
> There is a difference in behavior between v1 writes and v2 writes in the 
> order of events happening when configuring the file writer and the committer.
> v1:
>  # writer.prepareWrite()
>  # committer.setupJob()
> v2:
>  # committer.setupJob()
>  # writer.prepareWrite()
>  
> This is because the `prepareWrite()` call (the one performing the 
> `job.setOutputFormatClass(classOf[ParquetOutputFormat[Row]])` call) happens 
> as part of `createWriteJobDescription`, which is a `lazy val` in the 
> `toBatch` call and is therefore evaluated after the `committer.setupJob` at 
> the end of `toBatch`.
> This causes issues when evaluating the committer, as some elements might be 
> missing, for example the aforementioned output format class not being set, 
> causing the committer to be set up as a generic write instead of a parquet write.
>  
> The fix is very simple: make the `createJobDescription` call non-lazy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42439) Job description in v2 FileWrites can have the wrong committer

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688613#comment-17688613
 ] 

Apache Spark commented on SPARK-42439:
--

User 'LorenzoMartini' has created a pull request for this issue:
https://github.com/apache/spark/pull/40017

> Job description in v2 FileWrites can have the wrong committer
> -
>
> Key: SPARK-42439
> URL: https://issues.apache.org/jira/browse/SPARK-42439
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Lorenzo Martini
>Priority: Minor
>
> There is a difference in behavior between v1 writes and v2 writes in the 
> order of events happening when configuring the file writer and the committer.
> v1:
>  # writer.prepareWrite()
>  # committer.setupJob()
> v2:
>  # committer.setupJob()
>  # writer.prepareWrite()
>  
> This is because the `prepareWrite()` call (the one performing the 
> `job.setOutputFormatClass(classOf[ParquetOutputFormat[Row]])` call) happens 
> as part of `createWriteJobDescription`, which is a `lazy val` in the 
> `toBatch` call and is therefore evaluated after the `committer.setupJob` at 
> the end of `toBatch`.
> This causes issues when evaluating the committer, as some elements might be 
> missing, for example the aforementioned output format class not being set, 
> causing the committer to be set up as a generic write instead of a parquet write.
>  
> The fix is very simple: make the `createJobDescription` call non-lazy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42439) Job description in v2 FileWrites can have the wrong committer

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42439:


Assignee: Apache Spark

> Job description in v2 FileWrites can have the wrong committer
> -
>
> Key: SPARK-42439
> URL: https://issues.apache.org/jira/browse/SPARK-42439
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Lorenzo Martini
>Assignee: Apache Spark
>Priority: Minor
>
> There is a difference in behavior between v1 writes and v2 writes in the 
> order of events happening when configuring the file writer and the committer.
> v1:
>  # writer.prepareWrite()
>  # committer.setupJob()
> v2:
>  # committer.setupJob()
>  # writer.prepareWrite()
>  
> This is because the `prepareWrite()` call (the one performing the 
> `job.setOutputFormatClass(classOf[ParquetOutputFormat[Row]])` call) happens 
> as part of `createWriteJobDescription`, which is a `lazy val` in the 
> `toBatch` call and is therefore evaluated after the `committer.setupJob` at 
> the end of `toBatch`.
> This causes issues when evaluating the committer, as some elements might be 
> missing, for example the aforementioned output format class not being set, 
> causing the committer to be set up as a generic write instead of a parquet write.
>  
> The fix is very simple: make the `createJobDescription` call non-lazy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42439) Job description in v2 FileWrites can have the wrong committer

2023-02-14 Thread Lorenzo Martini (Jira)
Lorenzo Martini created SPARK-42439:
---

 Summary: Job description in v2 FileWrites can have the wrong 
committer
 Key: SPARK-42439
 URL: https://issues.apache.org/jira/browse/SPARK-42439
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.1
Reporter: Lorenzo Martini


There is a difference in behavior between v1 writes and v2 writes in the order 
of events happening when configuring the file writer and the committer.

v1:
 # writer.prepareWrite()
 # committer.setupJob()

v2:
 # committer.setupJob()
 # writer.prepareWrite()

 

This is because the `prepareWrite()` call (the one performing the 
`job.setOutputFormatClass(classOf[ParquetOutputFormat[Row]])` call) happens as 
part of `createWriteJobDescription`, which is a `lazy val` in the `toBatch` call 
and is therefore evaluated after the `committer.setupJob` at the end of `toBatch`.

This causes issues when evaluating the committer, as some elements might be 
missing, for example the aforementioned output format class not being set, 
causing the committer to be set up as a generic write instead of a parquet write.

The fix is very simple: make the `createJobDescription` call non-lazy.
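
A standalone Scala illustration (not Spark code) of the ordering problem: a `lazy val`
is evaluated only on first access, so anything it configures happens too late for code
that runs before that access. The names below are illustrative stand-ins.

{code:scala}
object LazyOrdering extends App {
  var outputFormatSet = false

  lazy val jobDescription: String = {
    outputFormatSet = true             // stands in for job.setOutputFormatClass(...)
    "job description"
  }

  def setupCommitter(): Unit =
    println(s"committer sees outputFormatSet = $outputFormatSet")

  setupCommitter()          // v2 order: the committer is set up first and sees `false`
  println(jobDescription)   // the description is only materialised here
  setupCommitter()          // after forcing the lazy val, the committer would see `true`
}
{code}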



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-42399) CONV() silently overflows returning wrong results

2023-02-14 Thread Narek Karapetian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688598#comment-17688598
 ] 

Narek Karapetian edited comment on SPARK-42399 at 2/14/23 4:39 PM:
---

Why do we need to throw an exception in ANSI mode? Is it described somewhere in 
the SQL standard? 

What do you think about treating such a case as a valid scenario that simply 
gives the correct result?

For example, such a query:
{code:java}
spark-sql> SELECT 
CONV(SUBSTRING('0x',
 3), 16, 10); {code}
will be evaluated to:
{code:java}
115792089237316195423570985008687907853269984665640564039457584007913129639935 
{code}
 

It could be implemented by using BigInt instead of 
`NumberConverter.convert(...)`, which uses Long as its data type.

 

P.S. But it might affect the performance.


was (Author: JIRAUSER298803):
Why do we need to throw an exception in ANSI mode? Is it described somewhere in 
the SQL standard? 

What do you think about treating such a case as a valid scenario that simply 
gives the correct result?

For example, such a query:
{code:java}
spark-sql> SELECT 
CONV(SUBSTRING('0x',
 3), 16, 10); {code}
will be evaluated to:
{code:java}
115792089237316195423570985008687907853269984665640564039457584007913129639935 
{code}
 

It could be implemented by using BigInt instead of 
`NumberConverter.convert(...)`, which uses Long as its data type.

> CONV() silently overflows returning wrong results
> -
>
> Key: SPARK-42399
> URL: https://issues.apache.org/jira/browse/SPARK-42399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Critical
>
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> In ANSI mode we should raise an error for sure.
> In non-ANSI mode either an error or a NULL may be acceptable.
> Alternatively, of course, we could consider supporting arbitrary 
> domains since the result is a STRING again. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42399) CONV() silently overflows returning wrong results

2023-02-14 Thread Narek Karapetian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688598#comment-17688598
 ] 

Narek Karapetian commented on SPARK-42399:
--

Why do we need to throw an exception in ANSI mode? Is it described somewhere in 
the SQL standard? 

What do you think about treating such a case as a valid scenario that simply 
gives the correct result?

For example, such a query:
{code:java}
spark-sql> SELECT 
CONV(SUBSTRING('0x',
 3), 16, 10); {code}
will be evaluated to:
{code:java}
115792089237316195423570985008687907853269984665640564039457584007913129639935 
{code}
 

It could be implemented by using BigInt instead of 
`NumberConverter.convert(...)`, which uses Long as its data type.
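
A minimal Scala sketch of that BigInt-based approach, independent of `NumberConverter`;
the helper name `convUnsigned` is illustrative and ignores CONV's handling of negative
numbers and invalid digits.

{code:scala}
// Arbitrary-precision base conversion via BigInt, avoiding the 64-bit wrap-around
// that Long arithmetic produces.
def convUnsigned(num: String, fromBase: Int, toBase: Int): String =
  BigInt(num, fromBase).toString(toBase)

// 64 f's overflow a Long, but BigInt keeps the exact value:
convUnsigned("f" * 64, 16, 10)
// 115792089237316195423570985008687907853269984665640564039457584007913129639935
{code}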

> CONV() silently overflows returning wrong results
> -
>
> Key: SPARK-42399
> URL: https://issues.apache.org/jira/browse/SPARK-42399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Critical
>
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> In ANSI mode we should raise an error for sure.
> In non-ANSI mode either an error or a NULL may be acceptable.
> Alternatively, of course, we could consider supporting arbitrary 
> domains since the result is a STRING again. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42438) Improve constraint propagation using multiTransform

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688542#comment-17688542
 ] 

Apache Spark commented on SPARK-42438:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/38035

> Improve constraint propagation using multiTransform
> ---
>
> Key: SPARK-42438
> URL: https://issues.apache.org/jira/browse/SPARK-42438
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42438) Improve constraint propagation using multiTransform

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42438:


Assignee: Apache Spark

> Improve constraint propagation using multiTransform
> ---
>
> Key: SPARK-42438
> URL: https://issues.apache.org/jira/browse/SPARK-42438
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Peter Toth
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42438) Improve constraint propagation using multiTransform

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42438:


Assignee: (was: Apache Spark)

> Improve constraint propagation using multiTransform
> ---
>
> Key: SPARK-42438
> URL: https://issues.apache.org/jira/browse/SPARK-42438
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42436) Improve multiTransform to generate alternatives dynamically

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688541#comment-17688541
 ] 

Apache Spark commented on SPARK-42436:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/40016

> Improve multiTransform to generate alternatives dynamically
> ---
>
> Key: SPARK-42436
> URL: https://issues.apache.org/jira/browse/SPARK-42436
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42436) Improve multiTransform to generate alternatives dynamically

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688539#comment-17688539
 ] 

Apache Spark commented on SPARK-42436:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/40016

> Improve multiTransform to generate alternatives dynamically
> ---
>
> Key: SPARK-42436
> URL: https://issues.apache.org/jira/browse/SPARK-42436
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42438) Improve constraint propagation using multiTransform

2023-02-14 Thread Peter Toth (Jira)
Peter Toth created SPARK-42438:
--

 Summary: Improve constraint propagation using multiTransform
 Key: SPARK-42438
 URL: https://issues.apache.org/jira/browse/SPARK-42438
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Peter Toth






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42436) Improve multiTransform to generate alternatives dynamically

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42436:


Assignee: (was: Apache Spark)

> Improve multiTransform to generate alternatives dynamically
> ---
>
> Key: SPARK-42436
> URL: https://issues.apache.org/jira/browse/SPARK-42436
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42436) Improve multiTransform to generate alternatives dynamically

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42436:


Assignee: Apache Spark

> Improve multiTransform to generate alternatives dynamically
> ---
>
> Key: SPARK-42436
> URL: https://issues.apache.org/jira/browse/SPARK-42436
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Peter Toth
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42373) Remove unused blank line removal from CSVExprUtils

2023-02-14 Thread Ted Chester Jenks (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688531#comment-17688531
 ] 

Ted Chester Jenks commented on SPARK-42373:
---

For the main use case for this, 
[#39907|https://github.com/apache/spark/pull/39907], I have settled on defining 
an ordering that doesn't become unclear with these method names.

> Remove unused blank line removal from CSVExprUtils
> --
>
> Key: SPARK-42373
> URL: https://issues.apache.org/jira/browse/SPARK-42373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Willi Raschkowski
>Priority: Minor
>
> The non-multiline CSV read codepath contains references to removal of blank 
> lines throughout. This is not necessary as blank lines are removed by the 
> parser. Furthermore, it causes confusion, indicating that blank lines are 
> removed at this point when instead they are already omitted from the data. 
> The multiline code path does not explicitly remove blank lines, leading to 
> what looks like a disparity in behavior between the two.
> The codepath for {{DataFrameReader.csv(dataset: Dataset[String])}} does need 
> to explicitly skip lines, and this should be respected in {{CSVUtils}}.
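
A small Scala sketch of the `Dataset[String]` entry point mentioned above (illustrative
data, assuming a spark-shell style `spark` session); on this path blank lines have to be
skipped explicitly, unlike the file-based path where the parser already drops them.

{code:scala}
import spark.implicits._

val lines = Seq("name,age", "", "alice,29", "", "bob,31").toDS()
val df = spark.read.option("header", "true").csv(lines)
df.show()   // two data rows; the empty strings do not become null rows
{code}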



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42437) Pyspark catalog.cacheTable allow to specify storage level

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42437:


Assignee: (was: Apache Spark)

> Pyspark catalog.cacheTable allow to specify storage level
> -
>
> Key: SPARK-42437
> URL: https://issues.apache.org/jira/browse/SPARK-42437
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Khalid Mammadov
>Priority: Major
>
> Currently the PySpark version of the catalog.cacheTable function does not 
> support specifying a storage level. This ticket is to add that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42437) Pyspark catalog.cacheTable allow to specify storage level

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688517#comment-17688517
 ] 

Apache Spark commented on SPARK-42437:
--

User 'khalidmammadov' has created a pull request for this issue:
https://github.com/apache/spark/pull/40015

> Pyspark catalog.cacheTable allow to specify storage level
> -
>
> Key: SPARK-42437
> URL: https://issues.apache.org/jira/browse/SPARK-42437
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Khalid Mammadov
>Priority: Major
>
> Currently the PySpark version of the catalog.cacheTable function does not 
> support specifying a storage level. This ticket is to add that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42437) Pyspark catalog.cacheTable allow to specify storage level

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42437:


Assignee: Apache Spark

> Pyspark catalog.cacheTable allow to specify storage level
> -
>
> Key: SPARK-42437
> URL: https://issues.apache.org/jira/browse/SPARK-42437
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Khalid Mammadov
>Assignee: Apache Spark
>Priority: Major
>
> Currently the PySpark version of the catalog.cacheTable function does not 
> support specifying a storage level. This ticket is to add that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42437) Pyspark catalog.cacheTable allow to specify storage level

2023-02-14 Thread Khalid Mammadov (Jira)
Khalid Mammadov created SPARK-42437:
---

 Summary: Pyspark catalog.cacheTable allow to specify storage level
 Key: SPARK-42437
 URL: https://issues.apache.org/jira/browse/SPARK-42437
 Project: Spark
  Issue Type: Improvement
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Khalid Mammadov


Currently the PySpark version of the catalog.cacheTable function does not 
support specifying a storage level. This ticket is to add that.
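
For reference, a sketch of what already exists on the Scala and SQL side (assuming a
spark-shell style `spark` session and an illustrative table name `sales`); the PySpark
overload itself is what this ticket adds, so it is not shown here.

{code:scala}
import org.apache.spark.storage.StorageLevel

// The Scala Catalog API already accepts a storage level:
spark.catalog.cacheTable("sales", StorageLevel.DISK_ONLY)

// SQL offers the same control, which also works from PySpark today via spark.sql(...):
spark.sql("CACHE TABLE sales OPTIONS ('storageLevel' 'DISK_ONLY')")
{code}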



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42436) Improve multiTransform to generate alternatives dynamically

2023-02-14 Thread Peter Toth (Jira)
Peter Toth created SPARK-42436:
--

 Summary: Improve multiTransform to generate alternatives 
dynamically
 Key: SPARK-42436
 URL: https://issues.apache.org/jira/browse/SPARK-42436
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Peter Toth






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42418) Updating PySpark documentation to support new users better

2023-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42418.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39992
[https://github.com/apache/spark/pull/39992]

> Updating PySpark documentation to support new users better
> --
>
> Key: SPARK-42418
> URL: https://issues.apache.org/jira/browse/SPARK-42418
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Allan Folting
>Assignee: Allan Folting
>Priority: Major
> Fix For: 3.4.0
>
>
> This is the first of a series of updates to the PySpark documentation site to 
> better guide new users on what to use and when, as well as to help improve 
> discoverability of related pages/resources.
>  * Added "Overview" to the top navigation bar to make it easy to get back to 
> the main page (clicking the logo is not super discoverable)
>  * Broke the architecture image into separate, clickable parts for easy 
> navigation to information for each part
>  * Added links to related topics under each area description
>  * Added date and version to the page



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42418) Updating PySpark documentation to support new users better

2023-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42418:


Assignee: Allan Folting

> Updating PySpark documentation to support new users better
> --
>
> Key: SPARK-42418
> URL: https://issues.apache.org/jira/browse/SPARK-42418
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Allan Folting
>Assignee: Allan Folting
>Priority: Major
>
> This is the first in a series of updates to the PySpark documentation site to 
> better guide new users on what to use and when, as well as to improve the 
> discoverability of related pages/resources.
>  * Add "Overview" to the top navigation bar to make it easy to get back to 
> the main page (clicking the logo is not very discoverable)
>  * Break the architecture image into separate, clickable parts for easy 
> navigation to information about each part
>  * Add links to related topics under each area description
>  * Add the date and version to the page



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42433) Add `array_insert` to Connect

2023-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42433.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40010
[https://github.com/apache/spark/pull/40010]

> Add `array_insert` to Connect
> -
>
> Key: SPARK-42433
> URL: https://issues.apache.org/jira/browse/SPARK-42433
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42433) Add `array_insert` to Connect

2023-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42433:


Assignee: Ruifeng Zheng

> Add `array_insert` to Connect
> -
>
> Key: SPARK-42433
> URL: https://issues.apache.org/jira/browse/SPARK-42433
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42435) Update DataTables to 1.13.2

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688502#comment-17688502
 ] 

Apache Spark commented on SPARK-42435:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/40014

> Update DataTables to 1.13.2
> ---
>
> Key: SPARK-42435
> URL: https://issues.apache.org/jira/browse/SPARK-42435
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.5.0
>Reporter: Peter Toth
>Priority: Minor
>
> The 1.10.25 version of DataTables, which Spark uses, seems vulnerable: 
> https://security.snyk.io/package/npm/datatables.net/1.10.25.
> It may or may not affect Spark, but updating to the latest 1.13.2 seems doable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42435) Update DataTables to 1.13.2

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42435:


Assignee: (was: Apache Spark)

> Update DataTables to 1.13.2
> ---
>
> Key: SPARK-42435
> URL: https://issues.apache.org/jira/browse/SPARK-42435
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.5.0
>Reporter: Peter Toth
>Priority: Minor
>
> The 1.10.25 version of DataTables, which Spark uses, seems vulnerable: 
> https://security.snyk.io/package/npm/datatables.net/1.10.25.
> It may or may not affect Spark, but updating to the latest 1.13.2 seems doable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42435) Update DataTables to 1.13.2

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42435:


Assignee: Apache Spark

> Update DataTables to 1.13.2
> ---
>
> Key: SPARK-42435
> URL: https://issues.apache.org/jira/browse/SPARK-42435
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.5.0
>Reporter: Peter Toth
>Assignee: Apache Spark
>Priority: Minor
>
> The 1.10.25 version of DataTables, which Spark uses, seems vulnerable: 
> https://security.snyk.io/package/npm/datatables.net/1.10.25.
> It may or may not affect Spark, but updating to the latest 1.13.2 seems doable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42431) Union avoid calling `output` before analysis

2023-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42431.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40008
[https://github.com/apache/spark/pull/40008]

> Union avoid calling `output` before analysis
> 
>
> Key: SPARK-42431
> URL: https://issues.apache.org/jira/browse/SPARK-42431
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42431) Union avoid calling `output` before analysis

2023-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42431:


Assignee: Ruifeng Zheng

> Union avoid calling `output` before analysis
> 
>
> Key: SPARK-42431
> URL: https://issues.apache.org/jira/browse/SPARK-42431
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42434) `array_append` should accept `Any` value

2023-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42434.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40012
[https://github.com/apache/spark/pull/40012]

> `array_append` should accept `Any` value
> 
>
> Key: SPARK-42434
> URL: https://issues.apache.org/jira/browse/SPARK-42434
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42434) `array_append` should accept `Any` value

2023-02-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42434:


Assignee: Ruifeng Zheng

> `array_append` should accept `Any` value
> 
>
> Key: SPARK-42434
> URL: https://issues.apache.org/jira/browse/SPARK-42434
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42435) Update DataTables to 1.13.2

2023-02-14 Thread Peter Toth (Jira)
Peter Toth created SPARK-42435:
--

 Summary: Update DataTables to 1.13.2
 Key: SPARK-42435
 URL: https://issues.apache.org/jira/browse/SPARK-42435
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.5.0
Reporter: Peter Toth


The 1.10.25 version of DataTables, which Spark uses, seems vulnerable: 
https://security.snyk.io/package/npm/datatables.net/1.10.25.
It may or may not affect Spark, but updating to the latest 1.13.2 seems doable.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42367) DataFrame.drop should handle duplicated columns properly

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42367:


Assignee: (was: Apache Spark)

> DataFrame.drop should handle duplicated columns properly
> 
>
> Key: SPARK-42367
> URL: https://issues.apache.org/jira/browse/SPARK-42367
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> {code:python}
> >>> df.join(df2, df.name == df2.name, 'inner').show()
> +---+----+------+----+
> |age|name|height|name|
> +---+----+------+----+
> | 16| Bob|    85| Bob|
> | 14| Tom|    80| Tom|
> +---+----+------+----+
> >>> df.join(df2, df.name == df2.name, 'inner').drop('name').show()
> +---+------+
> |age|height|
> +---+------+
> | 16|    85|
> | 14|    80|
> +---+------+
> {code}
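
A hedged aside (reusing the df and df2 from the snippet above): in vanilla
PySpark, passing a Column reference to drop rather than a string removes only
that side of the join, which is the usual way to keep one of the duplicated
columns. A minimal sketch:

{code:python}
# Drop only the right-hand "name" column by passing the Column object;
# passing the string 'name' removes both duplicated columns, as shown above.
df.join(df2, df.name == df2.name, 'inner').drop(df2.name).show()
# +---+----+------+
# |age|name|height|
# +---+----+------+
# | 16| Bob|    85|
# | 14| Tom|    80|
# +---+----+------+
{code}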



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42367) DataFrame.drop should handle duplicated columns properly

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42367:


Assignee: Apache Spark

> DataFrame.drop should handle duplicated columns properly
> 
>
> Key: SPARK-42367
> URL: https://issues.apache.org/jira/browse/SPARK-42367
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>
> {code:python}
> >>> df.join(df2, df.name == df2.name, 'inner').show()
> +---+----+------+----+
> |age|name|height|name|
> +---+----+------+----+
> | 16| Bob|    85| Bob|
> | 14| Tom|    80| Tom|
> +---+----+------+----+
> >>> df.join(df2, df.name == df2.name, 'inner').drop('name').show()
> +---+------+
> |age|height|
> +---+------+
> | 16|    85|
> | 14|    80|
> +---+------+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42367) DataFrame.drop should handle duplicated columns properly

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688434#comment-17688434
 ] 

Apache Spark commented on SPARK-42367:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40013

> DataFrame.drop should handle duplicated columns properly
> 
>
> Key: SPARK-42367
> URL: https://issues.apache.org/jira/browse/SPARK-42367
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> {code:python}
> >>> df.join(df2, df.name == df2.name, 'inner').show()
> +---+----+------+----+
> |age|name|height|name|
> +---+----+------+----+
> | 16| Bob|    85| Bob|
> | 14| Tom|    80| Tom|
> +---+----+------+----+
> >>> df.join(df2, df.name == df2.name, 'inner').drop('name').show()
> +---+------+
> |age|height|
> +---+------+
> | 16|    85|
> | 14|    80|
> +---+------+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42374) User-facing documentation

2023-02-14 Thread Martin Grund (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688394#comment-17688394
 ] 

Martin Grund commented on SPARK-42374:
--

Yes, that is correct. There is no built-in authentication. The benefit of the 
gRPC / HTTP2 interface is that it's very easy to put a capable authenticating 
proxy in front of it, so we don't need to implement the logic in Spark 
directly but can simply use existing infrastructure.

> User-facing documentation
> -
>
> Key: SPARK-42374
> URL: https://issues.apache.org/jira/browse/SPARK-42374
> Project: Spark
>  Issue Type: Documentation
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Haejoon Lee
>Priority: Major
>
> Should provide user-facing documentation so that end users know how to use Spark 
> Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark

2023-02-14 Thread Martin Grund (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688393#comment-17688393
 ] 

Martin Grund commented on SPARK-39375:
--

[~tgraves] Currently, Python UDFs are implemented in exactly the same way as 
they are today. In today's world, they are serialized bytes that are sent from 
the Python process via Py4J to the driver and then to the executors, where 
they're deserialized and executed. The primary difference with Spark Connect is 
that we don't use Py4J anymore but leverage the protocol directly. This is 
backward compatible and allows us to make sure that we can build upon the 
existing architecture going forward. Please keep in mind that today, the Python 
process for UDF execution is started by the executor as part of query 
execution. Depending on the setup, the Python process is kept around or 
destroyed at the end of processing. None of this behavior has changed. This 
means that all existing applications using PySpark will simply continue 
to work.

Similarly, this means we're not changing the assumptions around the 
requirements of which Python version has to be present where. In the same way, 
the Python version on the client has to be the same as on the executor. 

The reason we did not create a design for it is that we did not change the 
semantics, the logic or the implementation. This is very similar to the way 
we're translating the Spark Connect proto API into Catalyst plans.
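
As a concrete illustration of that backward compatibility, here is a minimal
sketch (a hypothetical session; the table and column names are illustrative):
the UDF is defined exactly the same way whether the session is a classic
PySpark session or a Spark Connect session, because in both cases the function
is cloudpickled on the client side and deserialized in the executor-side
Python worker.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# A classic local session; a Connect session would instead be created with
# SparkSession.builder.remote("sc://...").getOrCreate().
spark = SparkSession.builder.getOrCreate()
df = spark.range(5)  # DataFrame with a single column "id"

# Same definition in both worlds; only the transport of the pickled function
# bytes differs (Py4J today, the Spark Connect protocol with Connect).
add_one = udf(lambda x: x + 1, IntegerType())

df.select(add_one(df.id)).show()
{code}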



> SPIP: Spark Connect - A client and server interface for Apache Spark
> 
>
> Key: SPARK-39375
> URL: https://issues.apache.org/jira/browse/SPARK-39375
> Project: Spark
>  Issue Type: Epic
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Critical
>  Labels: SPIP
>
> Please find the full document for discussion here: [Spark Connect 
> SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]
>  Below, we have just referenced the introduction.
> h2. What are you trying to do?
> While Spark is used extensively, it was designed nearly a decade ago, which, 
> in the age of serverless computing and ubiquitous programming language use, 
> poses a number of limitations. Most of the limitations stem from the tightly 
> coupled Spark driver architecture and fact that clusters are typically shared 
> across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark 
> driver runs both the client application and scheduler, which results in a 
> heavyweight architecture that requires proximity to the cluster. There is no 
> built-in capability to  remotely connect to a Spark cluster in languages 
> other than SQL and users therefore rely on external solutions such as the 
> inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich 
> developer experience{*}: The current architecture and APIs do not cater for 
> interactive data exploration (as done with Notebooks), or allow for building 
> out rich developer experience common in modern code editors. (3) 
> {*}Stability{*}: with the current shared driver architecture, users causing 
> critical exceptions (e.g. OOM) bring the whole cluster down for all users. 
> (4) {*}Upgradability{*}: the current entangling of platform and client APIs 
> (e.g. first and third-party dependencies in the classpath) does not allow for 
> seamless upgrades between Spark versions (and with that, hinders new feature 
> adoption).
>  
> We propose to overcome these challenges by building on the DataFrame API and 
> the underlying unresolved logical plans. The DataFrame API is widely used and 
> makes it very easy to iteratively express complex logic. We will introduce 
> {_}Spark Connect{_}, a remote option of the DataFrame API that separates the 
> client from the Spark server. With Spark Connect, Spark will become 
> decoupled, allowing for built-in remote connectivity: The decoupled client 
> SDK can be used to run interactive data exploration and connect to the server 
> for DataFrame operations. 
>  
> Spark Connect will benefit Spark developers in different ways: The decoupled 
> architecture will result in improved stability, as clients are separated from 
> the driver. From the Spark Connect client perspective, Spark will be (almost) 
> versionless, and thus enable seamless upgradability, as server APIs can 
> evolve without affecting the client API. The decoupled client-server 
> architecture can be leveraged to build close integrations with local 
> developer tooling. Finally, separating the client process from the Spark 
> server process will improve Spark’s overall security posture by avoiding the 
> tight coupling of the client inside the Spark runtime 

[jira] [Resolved] (SPARK-42428) Standardize __repr__ of CommonInlineUserDefinedFunction

2023-02-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-42428.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40003
[https://github.com/apache/spark/pull/40003]

> Standardize __repr__ of CommonInlineUserDefinedFunction
> ---
>
> Key: SPARK-42428
> URL: https://issues.apache.org/jira/browse/SPARK-42428
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> As shown below, `f(df.id)` evaluates to a Column with a very long 
> representation in Connect, whereas vanilla PySpark returns 
> `Column<'(id)'>`. We should standardize __repr__ of 
> CommonInlineUserDefinedFunction.
> {code:python}
> >>> f = udf(lambda x : x + 1)
> >>> df.id
> Column<'id'>
> >>> f(df.id)
> Column<'(id), True, "string", 100, 
> b'\x80\x05\x95\xe1\x01\x00\x00\x00\x00\x00\x00\x8c\x1fpyspark.cloudpickle.cloudpickle\x94\x8c\x0e_make_function\x94\x93\x94(h\x00\x8c\r_builtin_type\x94\x93\x94\x8c\x08CodeType\x94\x85\x94R\x94(K\x01K\x00K\x00K\x01K\x02KCC\x08|\x00d\x01\x17\x00S\x00\x94NK\x01\x86\x94)\x8c\x01x\x94\x85\x94\x8c\x07\x94\x8c\x08\x94K\x01C\x00\x94))t\x94R\x94}\x94(\x8c\x0b__package__\x94N\x8c\x08__name__\x94\x8c\x08__main__\x94uNNNt\x94R\x94\x8c$pyspark.cloudpickle.cloudpickle_fast\x94\x8c\x12_function_setstate\x94\x93\x94h\x16}\x94}\x94(h\x13h\r\x8c\x0c__qualname__\x94h\r\x8c\x0f__annotations__\x94}\x94\x8c\x0e__kwdefaults__\x94N\x8c\x0c__defaults__\x94N\x8c\n__module__\x94h\x14\x8c\x07__doc__\x94N\x8c\x0b__closure__\x94N\x8c\x17_cloudpickle_submodules\x94]\x94\x8c\x0b__globals__\x94}\x94u\x86\x94\x86R0\x8c\x11pyspark.sql.types\x94\x8c\nStringType\x94\x93\x94)\x81\x94\x86\x94.',
>  f3.9'>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42428) Standardize __repr__ of CommonInlineUserDefinedFunction

2023-02-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-42428:
-

Assignee: Xinrong Meng

> Standardize __repr__ of CommonInlineUserDefinedFunction
> ---
>
> Key: SPARK-42428
> URL: https://issues.apache.org/jira/browse/SPARK-42428
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> As shown below, `f(df.id)` evaluates to a Column with a very long 
> representation in Connect, whereas vanilla PySpark returns 
> `Column<'(id)'>`. We should standardize __repr__ of 
> CommonInlineUserDefinedFunction.
> {code:python}
> >>> f = udf(lambda x : x + 1)
> >>> df.id
> Column<'id'>
> >>> f(df.id)
> Column<'(id), True, "string", 100, 
> b'\x80\x05\x95\xe1\x01\x00\x00\x00\x00\x00\x00\x8c\x1fpyspark.cloudpickle.cloudpickle\x94\x8c\x0e_make_function\x94\x93\x94(h\x00\x8c\r_builtin_type\x94\x93\x94\x8c\x08CodeType\x94\x85\x94R\x94(K\x01K\x00K\x00K\x01K\x02KCC\x08|\x00d\x01\x17\x00S\x00\x94NK\x01\x86\x94)\x8c\x01x\x94\x85\x94\x8c\x07\x94\x8c\x08\x94K\x01C\x00\x94))t\x94R\x94}\x94(\x8c\x0b__package__\x94N\x8c\x08__name__\x94\x8c\x08__main__\x94uNNNt\x94R\x94\x8c$pyspark.cloudpickle.cloudpickle_fast\x94\x8c\x12_function_setstate\x94\x93\x94h\x16}\x94}\x94(h\x13h\r\x8c\x0c__qualname__\x94h\r\x8c\x0f__annotations__\x94}\x94\x8c\x0e__kwdefaults__\x94N\x8c\x0c__defaults__\x94N\x8c\n__module__\x94h\x14\x8c\x07__doc__\x94N\x8c\x0b__closure__\x94N\x8c\x17_cloudpickle_submodules\x94]\x94\x8c\x0b__globals__\x94}\x94u\x86\x94\x86R0\x8c\x11pyspark.sql.types\x94\x8c\nStringType\x94\x93\x94)\x81\x94\x86\x94.',
>  f3.9'>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42434) `array_append` should accept `Any` value

2023-02-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-42434:
--
Parent: SPARK-41283
Issue Type: Sub-task  (was: Improvement)

> `array_append` should accept `Any` value
> 
>
> Key: SPARK-42434
> URL: https://issues.apache.org/jira/browse/SPARK-42434
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42433) Add `array_insert` to Connect

2023-02-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-42433:
--
Parent: SPARK-41283
Issue Type: Sub-task  (was: Improvement)

> Add `array_insert` to Connect
> -
>
> Key: SPARK-42433
> URL: https://issues.apache.org/jira/browse/SPARK-42433
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42433) Add `array_insert` to Connect

2023-02-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-42433:
--
Summary: Add `array_insert` to Connect  (was: `array_insert` should accept 
literal parameters)

> Add `array_insert` to Connect
> -
>
> Key: SPARK-42433
> URL: https://issues.apache.org/jira/browse/SPARK-42433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41727) ClassCastException when config spark.sql.hive.metastore* properties under jdk17

2023-02-14 Thread kevinshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688377#comment-17688377
 ] 

kevinshin commented on SPARK-41727:
---

https://github.com/apache/hive/commit/93f2274b5ddce0454f5fcaef605823618c5d9c77

> ClassCastException when config spark.sql.hive.metastore* properties under 
> jdk17
> ---
>
> Key: SPARK-41727
> URL: https://issues.apache.org/jira/browse/SPARK-41727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
> Environment: Apache spark3.3.1 \ HDP3.1.5 with hive 3.1.0
>Reporter: kevinshin
>Priority: Major
> Attachments: hms-init-error.txt
>
>
> Apache spark3.3.1 \ HDP3.1.5 with hive 3.1.0
> when configuring spark.sql.hive.metastore* properties to use 
> hive.metastore.version 3.1.2: 
> *spark.sql.hive.metastore.jars /data/soft/spark3/standalone-metastore/**
> *spark.sql.hive.metastore.version 3.1.2*
> then starting spark-shell with master = local[*] under JDK 17 and 
> trying to select a Hive table produces the following error:
> 13:44:52.428 [main] ERROR 
> org.apache.hadoop.hive.metastore.utils.MetaStoreUtils - Got exception: 
> java.lang.ClassCastException class [Ljava.lang.Object; cannot be cast to 
> class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module 
> java.base of loader 'bootstrap')
> java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to 
> class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module 
> java.base of loader 'bootstrap')
>         at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.resolveUris(HiveMetaStoreClient.java:262)
>  ~[hive-standalone-metastore-3.1.2.jar:3.1.2]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42434) `array_append` should accept `Any` value

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42434:


Assignee: Apache Spark

> `array_append` should accept `Any` value
> 
>
> Key: SPARK-42434
> URL: https://issues.apache.org/jira/browse/SPARK-42434
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42434) `array_append` should accept `Any` value

2023-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688374#comment-17688374
 ] 

Apache Spark commented on SPARK-42434:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40012

> `array_append` should accept `Any` value
> 
>
> Key: SPARK-42434
> URL: https://issues.apache.org/jira/browse/SPARK-42434
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42434) `array_append` should accept `Any` value

2023-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42434:


Assignee: (was: Apache Spark)

> `array_append` should accept `Any` value
> 
>
> Key: SPARK-42434
> URL: https://issues.apache.org/jira/browse/SPARK-42434
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


