[jira] [Assigned] (SPARK-47145) Provide table identifier to scan node when DS v2 strategy is applied

2024-02-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47145:
---

Assignee: Uros Stankovic

> Provide table identifier to scan node when DS v2 strategy is applied
> 
>
> Key: SPARK-47145
> URL: https://issues.apache.org/jira/browse/SPARK-47145
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Uros Stankovic
>Assignee: Uros Stankovic
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, the DataSourceScanExec node can accept a table identifier, and
> that information can be useful for later logging, debugging, etc., but
> DataSourceV2Strategy does not provide that information to the scan node.
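
A minimal usage sketch of what the identifier enables, assuming a configured v2 catalog named `cat` (all names here are placeholders, not from the issue):

{code:scala}
// Illustrative only: once DataSourceV2Strategy threads the identifier through,
// the scan node in the formatted plan can report which table it reads.
spark.sql("CREATE TABLE cat.db.events (id BIGINT) USING parquet")
spark.table("cat.db.events").explain("formatted")
// Without the identifier, the scan node cannot name its source table in logs
// or plan dumps, which is the gap this task closes.
{code}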






[jira] [Resolved] (SPARK-47145) Provide table identifier to scan node when DS v2 strategy is applied

2024-02-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47145.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45200
[https://github.com/apache/spark/pull/45200]

> Provide table identifier to scan node when DS v2 strategy is applied
> 
>
> Key: SPARK-47145
> URL: https://issues.apache.org/jira/browse/SPARK-47145
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Uros Stankovic
>Assignee: Uros Stankovic
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, the DataSourceScanExec node can accept a table identifier, and
> that information can be useful for later logging, debugging, etc., but
> DataSourceV2Strategy does not provide that information to the scan node.






[jira] [Resolved] (SPARK-47191) avoid unnecessary relation lookup when uncaching table/view

2024-02-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47191.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45289
[https://github.com/apache/spark/pull/45289]

> avoid unnecessary relation lookup when uncaching table/view
> ---
>
> Key: SPARK-47191
> URL: https://issues.apache.org/jira/browse/SPARK-47191
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47191) avoid unnecessary relation lookup when uncaching table/view

2024-02-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47191:
---

Assignee: Wenchen Fan

> avoid unnecessary relation lookup when uncaching table/view
> ---
>
> Key: SPARK-47191
> URL: https://issues.apache.org/jira/browse/SPARK-47191
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47144) Fix Spark Connect collation issue

2024-02-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47144.
-
Resolution: Fixed

Issue resolved by pull request 45233
[https://github.com/apache/spark/pull/45233]

> Fix Spark Connect collation issue
> -
>
> Key: SPARK-47144
> URL: https://issues.apache.org/jira/browse/SPARK-47144
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> The collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" fails when
> connecting to a server using Spark Connect:
> {code:java}
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support 
> convert string(UCS_BASIC_LCASE) to connect proto types.{code}
> When using the default collation "UCS_BASIC", the error does not occur.
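
A repro sketch using the Scala Connect client (the issue's stack trace is from PySpark, but the failure is server-side; the endpoint below is a placeholder):

{code:scala}
import org.apache.spark.sql.SparkSession

// Requires the Spark Connect Scala client on the classpath and a server
// listening at the given address.
val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()
spark.sql("SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'").show() // failed before the fix
spark.sql("SELECT 'abc' COLLATE 'UCS_BASIC'").show()       // default collation worked
{code}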






[jira] [Assigned] (SPARK-47144) Fix Spark Connect collation issue

2024-02-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47144:
---

Assignee: Nikola Mandic

> Fix Spark Connect collation issue
> -
>
> Key: SPARK-47144
> URL: https://issues.apache.org/jira/browse/SPARK-47144
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> The collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" fails when
> connecting to a server using Spark Connect:
> {code:java}
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support 
> convert string(UCS_BASIC_LCASE) to connect proto types.{code}
> When using the default collation "UCS_BASIC", the error does not occur.






[jira] [Created] (SPARK-47191) avoid unnecessary relation lookup when uncaching table/view

2024-02-27 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-47191:
---

 Summary: avoid unnecessary relation lookup when uncaching 
table/view
 Key: SPARK-47191
 URL: https://issues.apache.org/jira/browse/SPARK-47191
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan









[jira] [Resolved] (SPARK-47176) Have a ResolveAllExpressionsUpWithPruning helper function

2024-02-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47176.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45270
[https://github.com/apache/spark/pull/45270]

> Have a ResolveAllExpressionsUpWithPruning helper function
> -
>
> Key: SPARK-47176
> URL: https://issues.apache.org/jira/browse/SPARK-47176
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-47009) Create table with collation

2024-02-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47009.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45105
[https://github.com/apache/spark/pull/45105]

> Create table with collation
> ---
>
> Key: SPARK-47009
> URL: https://issues.apache.org/jira/browse/SPARK-47009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add support for creating a table with columns containing non-default
> collated data.
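
A sketch of the kind of DDL this enables, reusing the collation name from the related collation issues; the exact accepted syntax is whatever the PR defines:

{code:scala}
// Hedged sketch: column-level collation in CREATE TABLE; the syntax may
// differ in the final implementation.
spark.sql("CREATE TABLE names (name STRING COLLATE 'UCS_BASIC_LCASE') USING parquet")
{code}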






[jira] [Assigned] (SPARK-47009) Create table with collation

2024-02-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47009:
---

Assignee: Stefan Kandic

> Create table with collation
> ---
>
> Key: SPARK-47009
> URL: https://issues.apache.org/jira/browse/SPARK-47009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
>
> Add support for creating a table with columns containing non-default
> collated data.






[jira] [Assigned] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-02-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-45599:
---

Assignee: Nicholas Chammas

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Assignee: Nicholas Chammas
>Priority: Critical
>  Labels: correctness, pull-request-available
>
> I think this actually impacts all versions that have ever supported
> percentile, and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug, because everything has to line
> up just wrong to make it happen. In python/pyspark, if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.4758
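
A far smaller sketch that exercises the same -0.0 / 0.0 mixing (illustrative only; the original repro above uses a large randomized dataset):

{code:scala}
import spark.implicits._

// In affected versions, OpenHashMap treated -0.0 and 0.0 as distinct keys, so
// the two zeros landed in separate buckets and percentile's counts went wrong.
val df = Seq(-0.0, 0.0, -0.0, 0.0, 1.0).toDF("v")
df.createOrReplaceTempView("t")
spark.sql("SELECT percentile(v, 0.5) FROM t").show()
// IEEE 754 treats -0.0 == 0.0, so the median here must be (-)0.0, not 1.0.
{code}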

[jira] [Resolved] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-02-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-45599.
-
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 45036
[https://github.com/apache/spark/pull/45036]

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Assignee: Nicholas Chammas
>Priority: Critical
>  Labels: correctness, pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>
> I think this actually impacts all versions that have ever supported
> percentile, and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug, because everything has to line
> up just wrong to make it happen. In python/pyspark, if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e

[jira] [Assigned] (SPARK-47044) Add JDBC query to explain formatted command

2024-02-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47044:
---

Assignee: Uros Stankovic

> Add JDBC query to explain formatted command
> ---
>
> Key: SPARK-47044
> URL: https://issues.apache.org/jira/browse/SPARK-47044
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Uros Stankovic
>Assignee: Uros Stankovic
>Priority: Major
>  Labels: pull-request-available
>
> Add the generated JDBC query to the EXPLAIN FORMATTED output when the
> physical scan node accesses a JDBC source to create an RDD.
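
A usage sketch of the intended behavior (the JDBC URL and table are placeholders): after the change, the formatted plan for a JDBC-backed scan can include the external query Spark generates:

{code:scala}
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb")
  .option("dbtable", "public.events")
  .load()
// EXPLAIN FORMATTED output for this plan can now show the generated JDBC query.
df.filter(df("id") > 10).explain("formatted")
{code}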






[jira] [Resolved] (SPARK-47044) Add JDBC query to explain formatted command

2024-02-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47044.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45102
[https://github.com/apache/spark/pull/45102]

> Add JDBC query to explain formatted command
> ---
>
> Key: SPARK-47044
> URL: https://issues.apache.org/jira/browse/SPARK-47044
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Uros Stankovic
>Assignee: Uros Stankovic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add the generated JDBC query to the EXPLAIN FORMATTED output when the
> physical scan node accesses a JDBC source to create an RDD.






[jira] [Resolved] (SPARK-45789) Support DESCRIBE TABLE for clustering columns

2024-02-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-45789.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45077
[https://github.com/apache/spark/pull/45077]

> Support DESCRIBE TABLE for clustering columns
> -
>
> Key: SPARK-45789
> URL: https://issues.apache.org/jira/browse/SPARK-45789
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-45789) Support DESCRIBE TABLE for clustering columns

2024-02-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-45789:
---

Assignee: Terry Kim

> Support DESCRIBE TABLE for clustering columns
> -
>
> Key: SPARK-45789
> URL: https://issues.apache.org/jira/browse/SPARK-45789
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-47071) inline With expression if it contains special expression

2024-02-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-47071:

Summary: inline With expression if it contains special expression  (was: 
inline With expression if it contains aggregate/window expression)

> inline With expression if it contains special expression
> 
>
> Key: SPARK-47071
> URL: https://issues.apache.org/jira/browse/SPARK-47071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Created] (SPARK-47071) inline With expression if it contains aggregate/window expression

2024-02-15 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-47071:
---

 Summary: inline With expression if it contains aggregate/window 
expression
 Key: SPARK-47071
 URL: https://issues.apache.org/jira/browse/SPARK-47071
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan









[jira] [Created] (SPARK-47059) attach error context for ALTER COLUMN v1 command

2024-02-15 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-47059:
---

 Summary: attach error context for ALTER COLUMN v1 command
 Key: SPARK-47059
 URL: https://issues.apache.org/jira/browse/SPARK-47059
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan









[jira] [Assigned] (SPARK-39910) DataFrameReader API cannot read files from hadoop archives (.har)

2024-02-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-39910:
---

Assignee: Christophe Préaud

> DataFrameReader API cannot read files from hadoop archives (.har)
> -
>
> Key: SPARK-39910
> URL: https://issues.apache.org/jira/browse/SPARK-39910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2
>Reporter: Christophe Préaud
>Assignee: Christophe Préaud
>Priority: Minor
>  Labels: DataFrameReader, pull-request-available
>
> Reading a file from a Hadoop archive using the DataFrameReader API returns
> an empty Dataset:
> {code:java}
> scala> val df = 
> spark.read.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719")
> df: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> df.count
> res7: Long = 0 {code}
>  
> On the other hand, reading the same file from the same Hadoop archive using
> the RDD API yields the correct result:
> {code:java}
> scala> val df = 
> sc.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").toDF("value")
> df: org.apache.spark.sql.DataFrame = [value: string]
> scala> df.count
> res8: Long = 5589 {code}






[jira] [Resolved] (SPARK-39910) DataFrameReader API cannot read files from hadoop archives (.har)

2024-02-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-39910.
-
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 43463
[https://github.com/apache/spark/pull/43463]

> DataFrameReader API cannot read files from hadoop archives (.har)
> -
>
> Key: SPARK-39910
> URL: https://issues.apache.org/jira/browse/SPARK-39910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2
>Reporter: Christophe Préaud
>Assignee: Christophe Préaud
>Priority: Minor
>  Labels: DataFrameReader, pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>
> Reading a file from a Hadoop archive using the DataFrameReader API returns
> an empty Dataset:
> {code:java}
> scala> val df = 
> spark.read.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719")
> df: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> df.count
> res7: Long = 0 {code}
>  
> On the other hand, reading the same file from the same Hadoop archive using
> the RDD API yields the correct result:
> {code:java}
> scala> val df = 
> sc.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").toDF("value")
> df: org.apache.spark.sql.DataFrame = [value: string]
> scala> df.count
> res8: Long = 5589 {code}






[jira] [Resolved] (SPARK-46999) ExpressionWithUnresolvedIdentifier should include other expressions in the expression tree

2024-02-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46999.
-
Fix Version/s: 4.0.0
 Assignee: Wenchen Fan
   Resolution: Fixed

> ExpressionWithUnresolvedIdentifier should include other expressions in the 
> expression tree
> --
>
> Key: SPARK-46999
> URL: https://issues.apache.org/jira/browse/SPARK-46999
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-46993) Allow session variables in more places such as from_json for schema

2024-02-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46993.
-
Fix Version/s: 4.0.0
 Assignee: Serge Rielau
   Resolution: Fixed

> Allow session variables in more places such as from_json for schema
> ---
>
> Key: SPARK-46993
> URL: https://issues.apache.org/jira/browse/SPARK-46993
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.2
>Reporter: Serge Rielau
>Assignee: Serge Rielau
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> It appears we do not allow session variables to provide a schema for
> from_json().
> This is likely a generic restriction related to constant folding.
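
A sketch of what lifting the restriction would allow, using SQL session variables (assumes a build with the fix):

{code:scala}
// Hedged sketch: a session variable supplying the schema argument of from_json().
spark.sql("DECLARE VARIABLE json_schema STRING DEFAULT 'a INT, b STRING'")
spark.sql("""SELECT from_json('{"a": 1, "b": "x"}', json_schema)""").show()
{code}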






[jira] [Assigned] (SPARK-46922) Do not wrap runtime user-facing errors

2024-02-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46922:
---

Assignee: Wenchen Fan

> Do not wrap runtime user-facing errors
> --
>
> Key: SPARK-46922
> URL: https://issues.apache.org/jira/browse/SPARK-46922
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-46922) Do not wrap runtime user-facing errors

2024-02-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46922.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44953
[https://github.com/apache/spark/pull/44953]

> Do not wrap runtime user-facing errors
> --
>
> Key: SPARK-46922
> URL: https://issues.apache.org/jira/browse/SPARK-46922
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Created] (SPARK-46999) ExpressionWithUnresolvedIdentifier should include other expressions in the expression tree

2024-02-07 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-46999:
---

 Summary: ExpressionWithUnresolvedIdentifier should include other 
expressions in the expression tree
 Key: SPARK-46999
 URL: https://issues.apache.org/jira/browse/SPARK-46999
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Wenchen Fan









[jira] [Resolved] (SPARK-46526) Limit over certain correlated subqueries results in Nosuchelement exception

2024-02-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46526.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44514
[https://github.com/apache/spark/pull/44514]

> Limit over certain correlated subqueries results in Nosuchelement exception
> ---
>
> Key: SPARK-46526
> URL: https://issues.apache.org/jira/browse/SPARK-46526
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Andrey Gubichev
>Assignee: Andrey Gubichev
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> The queries that result in errors are those that:
>  * have a LIMIT in the subquery, and
>  * have a predicate with correlated references that does not depend on the
> inner query (it references exclusively the outer table).
> For example:
> {code:java}
> SELECT COUNT(DISTINCT(t1a))
> FROM t1
> WHERE t1d IN (SELECT t2d
>   FROM   t2
>   WHERE t1a IS NOT NULL
>   LIMIT 10);
>  {code}
> Here, WHERE t1a IS NOT NULL can be conceptually lifted to the join that
> connects the inner and outer queries.
> Currently, this query results in an error ("no such element exception").
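
Conceptually, the lifted form of the example looks like the following sketch (an illustration of the rewrite the description alludes to, not the optimizer's literal output):

{code:scala}
// The outer-only predicate moves out of the subquery body; the subquery
// becomes uncorrelated, so its LIMIT no longer blocks decorrelation.
spark.sql("""
  SELECT COUNT(DISTINCT t1a)
  FROM t1
  WHERE t1a IS NOT NULL
    AND t1d IN (SELECT t2d FROM t2 LIMIT 10)
""")
{code}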






[jira] [Resolved] (SPARK-46980) Avoid using internal APIs in dataframe end-to-end tests

2024-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46980.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45034
[https://github.com/apache/spark/pull/45034]

> Avoid using internal APIs in dataframe end-to-end tests
> ---
>
> Key: SPARK-46980
> URL: https://issues.apache.org/jira/browse/SPARK-46980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Mark Jarvin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Created] (SPARK-46980) Avoid using internal APIs in tests

2024-02-05 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-46980:
---

 Summary: Avoid using internal APIs in tests
 Key: SPARK-46980
 URL: https://issues.apache.org/jira/browse/SPARK-46980
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan









[jira] [Updated] (SPARK-46980) Avoid using internal APIs in dataframe end-to-end tests

2024-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-46980:

Summary: Avoid using internal APIs in dataframe end-to-end tests  (was: 
Avoid using internal APIs in tests)

> Avoid using internal APIs in dataframe end-to-end tests
> ---
>
> Key: SPARK-46980
> URL: https://issues.apache.org/jira/browse/SPARK-46980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Mark Jarvin
>Priority: Major
>







[jira] [Assigned] (SPARK-46833) Using ICU library for collation tracking

2024-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46833:
---

Assignee: Aleksandar Tomic

> Using ICU library for collation tracking
> 
>
> Key: SPARK-46833
> URL: https://issues.apache.org/jira/browse/SPARK-46833
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-46833) Using ICU library for collation tracking

2024-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46833.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44968
[https://github.com/apache/spark/pull/44968]

> Using ICU library for collation tracking
> 
>
> Key: SPARK-46833
> URL: https://issues.apache.org/jira/browse/SPARK-46833
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-46946) Supporting broadcast of multiple filtering keys in DynamicPruning

2024-02-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46946.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44988
[https://github.com/apache/spark/pull/44988]

> Supporting broadcast of multiple filtering keys in DynamicPruning
> -
>
> Key: SPARK-46946
> URL: https://issues.apache.org/jira/browse/SPARK-46946
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Thang Long Vu
>Assignee: Thang Long Vu
>Priority: Major
>  Labels: pull-request-available, releasenotes
> Fix For: 4.0.0
>
>
> This PR extends `DynamicPruningSubquery` to support broadcasting multiple
> filtering keys (instead of one, as before). The majority of the PR simply
> generalises the single-key case to the multi-key case.
> Note: we do not actually use the multiple-filtering-key
> `DynamicPruningSubquery` in this PR; we are doing this to make it easier to
> support DPP for Null Safe Equality or multiple Equality predicates in the
> future.
> In a Null Safe Equality JOIN, the JOIN condition `a <=> b` is transformed to
> `Coalesce(key1, Literal(key1.dataType)) = Coalesce(key2,
> Literal(key2.dataType)) AND IsNull(key1) = IsNull(key2)`. To get the highest
> pruning efficiency, we broadcast the two keys `Coalesce(key,
> Literal(key.dataType))` and `IsNull(key)` and use both to prune the other
> side at the same time.
> Before, `DynamicPruningSubquery` had only one broadcasting key and we
> supported DPP only for a single `EqualTo` JOIN predicate; now we extend the
> subquery to multiple broadcasting keys. Note that DPP is still not supported
> for multiple JOIN predicates.
> Put another way, at the moment we don't insert a DPP Filter for multiple
> JOIN predicates at the same time; we only potentially insert a DPP Filter
> for a given Equality JOIN predicate.
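
As a sketch, a null-safe equality join of the shape the description targets (table and column names are placeholders):

{code:scala}
// With a multi-key DynamicPruningSubquery, both keys derived from
// `fact.part_key <=> dim.key`, i.e. Coalesce(key, <default>) and IsNull(key),
// could be broadcast together to prune the fact side of this join.
spark.sql("""
  SELECT *
  FROM fact JOIN dim
    ON fact.part_key <=> dim.key
  WHERE dim.region = 'EU'
""")
{code}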






[jira] [Assigned] (SPARK-46946) Supporting broadcast of multiple filtering keys in DynamicPruning

2024-02-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46946:
---

Assignee: Thang Long Vu

> Supporting broadcast of multiple filtering keys in DynamicPruning
> -
>
> Key: SPARK-46946
> URL: https://issues.apache.org/jira/browse/SPARK-46946
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Thang Long Vu
>Assignee: Thang Long Vu
>Priority: Major
>  Labels: pull-request-available, releasenotes
>
> This PR extends `DynamicPruningSubquery` to support broadcasting multiple
> filtering keys (instead of one, as before). The majority of the PR simply
> generalises the single-key case to the multi-key case.
> Note: we do not actually use the multiple-filtering-key
> `DynamicPruningSubquery` in this PR; we are doing this to make it easier to
> support DPP for Null Safe Equality or multiple Equality predicates in the
> future.
> In a Null Safe Equality JOIN, the JOIN condition `a <=> b` is transformed to
> `Coalesce(key1, Literal(key1.dataType)) = Coalesce(key2,
> Literal(key2.dataType)) AND IsNull(key1) = IsNull(key2)`. To get the highest
> pruning efficiency, we broadcast the two keys `Coalesce(key,
> Literal(key.dataType))` and `IsNull(key)` and use both to prune the other
> side at the same time.
> Before, `DynamicPruningSubquery` had only one broadcasting key and we
> supported DPP only for a single `EqualTo` JOIN predicate; now we extend the
> subquery to multiple broadcasting keys. Note that DPP is still not supported
> for multiple JOIN predicates.
> Put another way, at the moment we don't insert a DPP Filter for multiple
> JOIN predicates at the same time; we only potentially insert a DPP Filter
> for a given Equality JOIN predicate.






[jira] [Updated] (SPARK-46922) Do not wrap runtime user-facing errors

2024-02-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-46922:

Summary: Do not wrap runtime user-facing errors  (was: better handling for 
runtime user errors)

> Do not wrap runtime user-facing errors
> --
>
> Key: SPARK-46922
> URL: https://issues.apache.org/jira/browse/SPARK-46922
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-46951) Define retry-able errors

2024-02-01 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-46951:
---

 Summary: Define retry-able errors
 Key: SPARK-46951
 URL: https://issues.apache.org/jira/browse/SPARK-46951
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan









[jira] [Assigned] (SPARK-46908) Extend SELECT * support outside of select list

2024-02-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46908:
---

Assignee: Serge Rielau

> Extend SELECT * support outside of select list
> --
>
> Key: SPARK-46908
> URL: https://issues.apache.org/jira/browse/SPARK-46908
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Serge Rielau
>Assignee: Serge Rielau
>Priority: Major
>  Labels: SQL, pull-request-available
>
> Traditionally, * is confined to the select list, and there to the top level
> of expressions.
> Spark does, in an undocumented fashion, support * in the SELECT list as a
> function argument list.
> Here we want to expand upon this capability by adding the WHERE clause
> (Filter), as well as a couple more scenarios such as row value constructors
> and the IN operator.
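
Illustrative shapes of the queries this targets (hedged guesses; the exact supported forms are defined by the PR):

{code:scala}
// * outside the top level of the select list:
spark.sql("SELECT * FROM t WHERE (t.*) IN (SELECT * FROM s)")   // * in a row value constructor feeding IN
spark.sql("SELECT * FROM t WHERE concat_ws(',', *) LIKE '%x%'") // * as function arguments in WHERE
{code}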






[jira] [Resolved] (SPARK-46908) Extend SELECT * support outside of select list

2024-02-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46908.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44938
[https://github.com/apache/spark/pull/44938]

> Extend SELECT * support outside of select list
> --
>
> Key: SPARK-46908
> URL: https://issues.apache.org/jira/browse/SPARK-46908
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Serge Rielau
>Assignee: Serge Rielau
>Priority: Major
>  Labels: SQL, pull-request-available
> Fix For: 4.0.0
>
>
> Traditionally, * is confined to the select list, and there to the top level
> of expressions.
> Spark does, in an undocumented fashion, support * in the SELECT list as a
> function argument list.
> Here we want to expand upon this capability by adding the WHERE clause
> (Filter), as well as a couple more scenarios such as row value constructors
> and the IN operator.






[jira] [Resolved] (SPARK-46933) Add execution time metric for jdbc query

2024-02-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46933.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44969
[https://github.com/apache/spark/pull/44969]

> Add execution time metric for jdbc query
> 
>
> Key: SPARK-46933
> URL: https://issues.apache.org/jira/browse/SPARK-46933
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Milan Stefanovic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Expose additional metrics when the JDBC RDD is used.






[jira] [Created] (SPARK-46922) better handling for runtime user errors

2024-01-30 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-46922:
---

 Summary: better handling for runtime user errors
 Key: SPARK-46922
 URL: https://issues.apache.org/jira/browse/SPARK-46922
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan









[jira] [Resolved] (SPARK-46905) Add dedicated class to keep column definition instead of StructField in Create/ReplaceTable command

2024-01-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46905.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44935
[https://github.com/apache/spark/pull/44935]

> Add dedicated class to keep column definition instead of StructField in 
> Create/ReplaceTable command
> ---
>
> Key: SPARK-46905
> URL: https://issues.apache.org/jira/browse/SPARK-46905
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-46905) Add dedicated class to keep column definition instead of StructField in Create/ReplaceTable command

2024-01-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46905:
---

Assignee: Wenchen Fan

> Add dedicated class to keep column definition instead of StructField in 
> Create/ReplaceTable command
> ---
>
> Key: SPARK-46905
> URL: https://issues.apache.org/jira/browse/SPARK-46905
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-46905) Add dedicated class to keep column definition instead of StructField in Create/ReplaceTable command

2024-01-29 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-46905:
---

 Summary: Add dedicated class to keep column definition instead of 
StructField in Create/ReplaceTable command
 Key: SPARK-46905
 URL: https://issues.apache.org/jira/browse/SPARK-46905
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan









[jira] [Resolved] (SPARK-46683) Write a subquery generator that generates subqueries of different variations to increase testing coverage in this area

2024-01-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46683.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44599
[https://github.com/apache/spark/pull/44599]

> Write a subquery generator that generates subqueries of different variations 
> to increase testing coverage in this area
> --
>
> Key: SPARK-46683
> URL: https://issues.apache.org/jira/browse/SPARK-46683
> Project: Spark
>  Issue Type: Test
>  Components: Optimizer, SQL
>Affects Versions: 3.5.1
>Reporter: Andy Lam
>Assignee: Andy Lam
>Priority: Major
>  Labels: correctness, pull-request-available, testing
> Fix For: 4.0.0
>
>
> There are a lot of subquery correctness issues, ranging from very old bugs to
> new ones being introduced by ongoing work on subquery correlation
> optimization. This is especially true in the areas of COUNT bugs and null
> behaviors.
> To increase test coverage and robustness in this area, we want to write a
> subquery generator that produces variations of subqueries, emitting SQL tests
> that also run against Postgres (from my work in SPARK-46179).






[jira] [Resolved] (SPARK-46763) ReplaceDeduplicateWithAggregate fails when non-grouping keys have duplicate attributes

2024-01-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46763.
-
Fix Version/s: 3.4.3
   3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 44835
[https://github.com/apache/spark/pull/44835]

> ReplaceDeduplicateWithAggregate fails when non-grouping keys have duplicate 
> attributes
> --
>
> Key: SPARK-46763
> URL: https://issues.apache.org/jira/browse/SPARK-46763
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.5.0
>Reporter: Nikhil Sheoran
>Assignee: Nikhil Sheoran
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.3, 3.5.1, 4.0.0
>
>







[jira] [Assigned] (SPARK-46763) ReplaceDeduplicateWithAggregate fails when non-grouping keys have duplicate attributes

2024-01-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46763:
---

Assignee: Nikhil Sheoran

> ReplaceDeduplicateWithAggregate fails when non-grouping keys have duplicate 
> attributes
> --
>
> Key: SPARK-46763
> URL: https://issues.apache.org/jira/browse/SPARK-46763
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.5.0
>Reporter: Nikhil Sheoran
>Assignee: Nikhil Sheoran
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-46590) Coalesce partition assert error after skew join optimization

2024-01-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46590.
-
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 44661
[https://github.com/apache/spark/pull/44661]

> Coalesce partition assert error after skew join optimization
> ---
>
> Key: SPARK-46590
> URL: https://issues.apache.org/jira/browse/SPARK-46590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 
> 3.3.4
>Reporter: Jackey Lee
>Assignee: Jackey Lee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
> Attachments: problem.log
>
>
> Recently, when we were testing TPC-DS Q71, we found that if
> `spark.sql.shuffle.partitions` and
> `spark.sql.adaptive.coalescePartitions.initialPartitionNum` are both set to
> the number of executor cores, an assertion error may be reported during
> partition coalescing because the partitionSpecs of the joins after skew
> optimization differ.
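
The reported configuration as a sketch (the partition count is illustrative; the report sets both values to the executor-core count):

{code:scala}
spark.conf.set("spark.sql.shuffle.partitions", "64")
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "64")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
// With this combination, TPC-DS Q71 could trip the assertion during partition
// coalescing after skew-join optimization.
{code}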






[jira] [Assigned] (SPARK-46590) Coalesce partition assert error after skew join optimization

2024-01-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46590:
---

Assignee: Jackey Lee

> Coalesce partition assert error after skew join optimization
> ---
>
> Key: SPARK-46590
> URL: https://issues.apache.org/jira/browse/SPARK-46590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 
> 3.3.4
>Reporter: Jackey Lee
>Assignee: Jackey Lee
>Priority: Major
>  Labels: pull-request-available
> Attachments: problem.log
>
>
> Recently, when we were testing TPCDS Q71, we found that if 
> `spark.sql.shuffle.partitions` and 
> `spark.sql.adaptive.coalescePartitions.initialPartitionNum` are both set to 
> the number of executor cores, an `AssertionError` may be reported while 
> coalescing partitions, because the partitionSpecs of the joins differ after 
> skew join optimization.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46769) Refine timestamp related schema inference

2024-01-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-46769:

Summary: Refine timestamp related schema inference  (was: Fix inferring of 
TIMESTAMP_NTZ in CSV/JSON)

> Refine timestamp related schema inference
> -
>
> Key: SPARK-46769
> URL: https://issues.apache.org/jira/browse/SPARK-46769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>
> After the PR https://github.com/apache/spark/pull/43243, the TIMESTAMP_NTZ 
> type inference in CSV/JSON datasource got 2 new guards which means 
> TIMESTAMP_NTZ should be inferred either if:
> 1. the SQL config `spark.sql.legacy.timeParserPolicy` is set to `LEGACY` or
> 2. `spark.sql.timestampType` is set to `TIMESTAMP_NTZ`.
> otherwise CSV/JSON should try to infer `TIMESTAMP_LTZ`.
> Both guards are unnecessary because:
> 1. when `spark.sql.legacy.timeParserPolicy` is `LEGACY`, that only means Spark 
> should use a legacy (pre-Java 8) parser: `FastDateFormat` or `SimpleDateFormat`. 
> Both parsers are applicable for parsing `TIMESTAMP_NTZ`.
> 2. when `spark.sql.timestampType` is set to `TIMESTAMP_LTZ`, it doesn't mean 
> that we should skip inferring `TIMESTAMP_NTZ` types in CSV/JSON and try 
> to parse a timestamp string value without a time zone, like 
> `2024-01-19T09:10:11.123`, using an LTZ format with a time zone, like 
> `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`. The latter can never match an NTZ value.
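> A minimal sketch of the NTZ case (assuming a local session; per the reasoning 
> above, a zone-less string should be eligible for TIMESTAMP_NTZ inference 
> regardless of the two configs):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder().master("local[*]").getOrCreate()
> import spark.implicits._
>
> // A timestamp string without a time zone: an LTZ pattern ending in a zone
> // suffix (XXX) can never match it.
> val json = Seq("""{"ts": "2024-01-19T09:10:11.123"}""").toDS()
> spark.read.json(json).printSchema()
> {code}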



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46769) Refine timestamp related schema inference

2024-01-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-46769:

Description: (was: After the PR 
https://github.com/apache/spark/pull/43243, the TIMESTAMP_NTZ type inference in 
CSV/JSON datasource got 2 new guards which means TIMESTAMP_NTZ should be 
inferred either if:

1. the SQL config `spark.sql.legacy.timeParserPolicy` is set to `LEGACY` or
2. `spark.sql.timestampType` is set to `TIMESTAMP_NTZ`.

otherwise CSV/JSON should try to infer `TIMESTAMP_LTZ`.

Both guards are unnecessary because:

1. when `spark.sql.legacy.timeParserPolicy` is `LEGACY`, that only means Spark 
should use a legacy (pre-Java 8) parser: `FastDateFormat` or `SimpleDateFormat`. 
Both parsers are applicable for parsing `TIMESTAMP_NTZ`.
2. when `spark.sql.timestampType` is set to `TIMESTAMP_LTZ`, it doesn't mean 
that we should skip inferring `TIMESTAMP_NTZ` types in CSV/JSON and try to 
parse a timestamp string value without a time zone, like `2024-01-19T09:10:11.123`, 
using an LTZ format with a time zone, like `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`. The 
latter can never match an NTZ value.)

> Refine timestamp related schema inference
> -
>
> Key: SPARK-46769
> URL: https://issues.apache.org/jira/browse/SPARK-46769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46769) Refine timestamp related schema inference

2024-01-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46769:
---

Assignee: Wenchen Fan  (was: Max Gekk)

> Refine timestamp related schema inference
> -
>
> Key: SPARK-46769
> URL: https://issues.apache.org/jira/browse/SPARK-46769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>
> After the PR https://github.com/apache/spark/pull/43243, the TIMESTAMP_NTZ 
> type inference in CSV/JSON datasource got 2 new guards which means 
> TIMESTAMP_NTZ should be inferred either if:
> 1. the SQL config `spark.sql.legacy.timeParserPolicy` is set to `LEGACY` or
> 2. `spark.sql.timestampType` is set to `TIMESTAMP_NTZ`.
> otherwise CSV/JSON should try to infer `TIMESTAMP_LTZ`.
> Both guards are unnecessary because:
> 1. when `spark.sql.legacy.timeParserPolicy` is `LEGACY`, that only means Spark 
> should use a legacy (pre-Java 8) parser: `FastDateFormat` or `SimpleDateFormat`. 
> Both parsers are applicable for parsing `TIMESTAMP_NTZ`.
> 2. when `spark.sql.timestampType` is set to `TIMESTAMP_LTZ`, it doesn't mean 
> that we should skip inferring `TIMESTAMP_NTZ` types in CSV/JSON and try 
> to parse a timestamp string value without a time zone, like 
> `2024-01-19T09:10:11.123`, using an LTZ format with a time zone, like 
> `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`. The latter can never match an NTZ value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46769) Fix inferring of TIMESTAMP_NTZ in CSV/JSON

2024-01-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46769.
-
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 44800
[https://github.com/apache/spark/pull/44800]

> Fix inferring of TIMESTAMP_NTZ in CSV/JSON
> --
>
> Key: SPARK-46769
> URL: https://issues.apache.org/jira/browse/SPARK-46769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>
> After the PR https://github.com/apache/spark/pull/43243, the TIMESTAMP_NTZ 
> type inference in CSV/JSON datasource got 2 new guards which means 
> TIMESTAMP_NTZ should be inferred either if:
> 1. the SQL config `spark.sql.legacy.timeParserPolicy` is set to `LEGACY` or
> 2. `spark.sql.timestampType` is set to `TIMESTAMP_NTZ`.
> otherwise CSV/JSON should try to infer `TIMESTAMP_LTZ`.
> Both guards are unnecessary because:
> 1. when `spark.sql.legacy.timeParserPolicy` is `LEGACY`, that only means Spark 
> should use a legacy (pre-Java 8) parser: `FastDateFormat` or `SimpleDateFormat`. 
> Both parsers are applicable for parsing `TIMESTAMP_NTZ`.
> 2. when `spark.sql.timestampType` is set to `TIMESTAMP_LTZ`, it doesn't mean 
> that we should skip inferring `TIMESTAMP_NTZ` types in CSV/JSON and try 
> to parse a timestamp string value without a time zone, like 
> `2024-01-19T09:10:11.123`, using an LTZ format with a time zone, like 
> `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`. The latter can never match an NTZ value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46644) Fix add in SQLMetric

2024-01-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46644.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44649
[https://github.com/apache/spark/pull/44649]

> Fix add in SQLMetric
> 
>
> Key: SPARK-46644
> URL: https://issues.apache.org/jira/browse/SPARK-46644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Davin Tjong
>Assignee: Davin Tjong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> A previous refactor mistakenly used `isValid` in `add`. Since 
> `defaultValidValue` was always `0`, this didn't cause any correctness issues.
> What we really want for `add` (and `merge`) is `if (isZero) _value = 0`.
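> A simplified sketch of the intended pattern (invented names, not the actual 
> SQLMetric code): treat the initial sentinel as "no data" and reset it before 
> accumulating:
> {code:scala}
> class SimpleMetric(initValue: Long = -1L) {
>   private var _value: Long = initValue
>   // "Zero" here means "never updated", signalled by the sentinel initValue.
>   def isZero: Boolean = _value == initValue
>   def add(v: Long): Unit = {
>     if (isZero) _value = 0  // drop the sentinel before the first add
>     _value += v
>   }
>   def value: Long = _value
> }
> {code}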



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46274) Range operator computeStats() proper long conversions

2024-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46274:
---

Assignee: Nick Young  (was: Kelvin Jiang)

> Range operator computeStats() proper long conversions
> -
>
> Key: SPARK-46274
> URL: https://issues.apache.org/jira/browse/SPARK-46274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kelvin Jiang
>Assignee: Nick Young
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>
> Range operator's `computeStats()` function unsafely casts from `BigInt` to 
> `Long` and causes issues downstream with statistics estimation. Adds bounds 
> checking to avoid crashing.
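> A minimal sketch of the bounds-checked conversion (illustrative, not the 
> actual computeStats code):
> {code:scala}
> // Clamp a BigInt estimate into the Long range instead of overflowing on an
> // unchecked .toLong.
> def toLongSafe(b: BigInt): Long =
>   if (b.isValidLong) b.toLong
>   else if (b > 0) Long.MaxValue
>   else Long.MinValue
> {code}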



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45435) Document that lazy checkpoint may not be a consistent snapshot

2024-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-45435:
---

Assignee: Juliusz Sompolski

> Document that lazy checkpoint may not be a consistent snapshot
> -
>
> Key: SPARK-45435
> URL: https://issues.apache.org/jira/browse/SPARK-45435
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
>  Labels: pull-request-available
>
> Some people may want to use checkpoint to get a consistent snapshot of the 
> Dataset / RDD. Warn that this is not the case with a lazy checkpoint, because 
> the checkpoint is computed only at the end of the first action, and the data 
> used during the first action may be different because of non-determinism and 
> retries.
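> A minimal sketch (hypothetical checkpoint directory) of why a lazy checkpoint 
> of a non-deterministic Dataset is not a snapshot of what the first action saw:
> {code:scala}
> spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  // hypothetical path
>
> val df = spark.range(10).selectExpr("rand() AS r")  // non-deterministic
> val cp = df.checkpoint(eager = false)               // lazy: nothing runs yet
>
> // The checkpoint is materialized only at the end of this first action; with
> // retries or recomputation it may differ from what the action observed.
> cp.count()
> {code}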



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45435) Document that lazy checkpoint may not be a consistent snapshot

2024-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-45435.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43247
[https://github.com/apache/spark/pull/43247]

> Document that lazy checkpoint may not be a consistent snapshot
> -
>
> Key: SPARK-45435
> URL: https://issues.apache.org/jira/browse/SPARK-45435
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Some people may want to use checkpoint to get a consistent snapshot of the 
> Dataset / RDD. Warn that this is not the case with a lazy checkpoint, because 
> the checkpoint is computed only at the end of the first action, and the data 
> used during the first action may be different because of non-determinism and 
> retries.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46700) count the last spilling for the shuffle disk spilling bytes metric

2024-01-12 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-46700:
---

 Summary: count the last spilling for the shuffle disk spilling 
bytes metric
 Key: SPARK-46700
 URL: https://issues.apache.org/jira/browse/SPARK-46700
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46052) Remove unnecessary TaskScheduler.killAllTaskAttempts

2024-01-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46052:
---

Assignee: wuyi

> Remove unnecessary TaskScheduler.killAllTaskAttempts
> 
>
> Key: SPARK-46052
> URL: https://issues.apache.org/jira/browse/SPARK-46052
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>  Labels: pull-request-available
>
> Spark has two functions to kill all tasks in a stage:
> * `cancelTasks`: kills all the running tasks in all the stage attempts and 
> also aborts all the stage attempts.
> * `killAllTaskAttempts`: only kills all the running tasks in all the stage 
> attempts but won't abort the attempts.
> However, there's no use case in Spark where a stage would launch new tasks 
> after all its tasks get killed. So I think we can replace 
> `killAllTaskAttempts` with `cancelTasks` directly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46052) Remove unnecessary TaskScheduler.killAllTaskAttempts

2024-01-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46052.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43954
[https://github.com/apache/spark/pull/43954]

> Remove unnecessary TaskScheduler.killAllTaskAttempts
> 
>
> Key: SPARK-46052
> URL: https://issues.apache.org/jira/browse/SPARK-46052
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Spark has two functions to kill all tasks in a stage:
> * `cancelTasks`: kills all the running tasks in all the stage attempts and 
> also aborts all the stage attempts.
> * `killAllTaskAttempts`: only kills all the running tasks in all the stage 
> attempts but won't abort the attempts.
> However, there's no use case in Spark where a stage would launch new tasks 
> after all its tasks get killed. So I think we can replace 
> `killAllTaskAttempts` with `cancelTasks` directly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46383) Reduce Driver Heap Usage by Reducing the Lifespan of `TaskInfo.accumulables()`

2024-01-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46383:
---

Assignee: Utkarsh Agarwal

> Reduce Driver Heap Usage by Reducing the Lifespan of `TaskInfo.accumulables()`
> --
>
> Key: SPARK-46383
> URL: https://issues.apache.org/jira/browse/SPARK-46383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Utkarsh Agarwal
>Assignee: Utkarsh Agarwal
>Priority: Major
>  Labels: pull-request-available
> Attachments: Screenshot 2023-11-06 at 3.56.26 PM.png, screenshot-1.png
>
>
> `AccumulableInfo` is one of the top heap consumers in driver's heap dumps for 
> stages with many tasks. For a stage with a large number of tasks 
> (O(100k)), we saw 30% of the heap usage stemming from 
> `TaskInfo.accumulables()`.
> !screenshot-1.png|width=641,height=98!  
> The `TaskSetManager` today keeps around the TaskInfo objects 
> ([ref1|https://github.com/apache/spark/blob/c1ba963e64a22dea28e17b1ed954e6d03d38da1e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L134],
>  
> [ref2|https://github.com/apache/spark/blob/c1ba963e64a22dea28e17b1ed954e6d03d38da1e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L192])
>  and in turn the task metrics (`AccumulableInfo`) for every task attempt 
> until the stage is completed. This means that for stages with a large number 
> of tasks, we keep metrics for all the tasks (`AccumulableInfo`) around even 
> when the task has completed and its metrics have been aggregated. Given a 
> task has a large number of metrics, stages with many tasks end up with a 
> large heap usage in the form of task metrics.
> Ideally, we should clear up a task's TaskInfo upon the task's completion, 
> thereby reducing the driver's heap usage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46383) Reduce Driver Heap Usage by Reducing the Lifespan of `TaskInfo.accumulables()`

2024-01-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46383.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44321
[https://github.com/apache/spark/pull/44321]

> Reduce Driver Heap Usage by Reducing the Lifespan of `TaskInfo.accumulables()`
> --
>
> Key: SPARK-46383
> URL: https://issues.apache.org/jira/browse/SPARK-46383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Utkarsh Agarwal
>Assignee: Utkarsh Agarwal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: Screenshot 2023-11-06 at 3.56.26 PM.png, screenshot-1.png
>
>
> `AccumulableInfo` is one of the top heap consumers in driver's heap dumps for 
> stages with many tasks. For a stage with a large number of tasks 
> (O(100k)), we saw 30% of the heap usage stemming from 
> `TaskInfo.accumulables()`.
> !screenshot-1.png|width=641,height=98!  
> The `TaskSetManager` today keeps around the TaskInfo objects 
> ([ref1|https://github.com/apache/spark/blob/c1ba963e64a22dea28e17b1ed954e6d03d38da1e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L134],
>  
> [ref2|https://github.com/apache/spark/blob/c1ba963e64a22dea28e17b1ed954e6d03d38da1e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L192])
>  and in turn the task metrics (`AccumulableInfo`) for every task attempt 
> until the stage is completed. This means that for stages with a large number 
> of tasks, we keep metrics for all the tasks (`AccumulableInfo`) around even 
> when the task has completed and its metrics have been aggregated. Given a 
> task has a large number of metrics, stages with many tasks end up with a 
> large heap usage in the form of task metrics.
> Ideally, we should clear up a task's TaskInfo upon the task's completion, 
> thereby reducing the driver's heap usage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46640) RemoveRedundantAliases does not account for SubqueryExpression when removing aliases

2024-01-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46640.
-
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 44645
[https://github.com/apache/spark/pull/44645]

> RemoveRedundantAliases does not account for SubqueryExpression when removing 
> aliases
> 
>
> Key: SPARK-46640
> URL: https://issues.apache.org/jira/browse/SPARK-46640
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 4.0.0
>Reporter: Nikhil Sheoran
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>
> `RemoveRedundantAliases` does not take into account the outer attributes of 
> a `SubqueryExpression` when removing aliases, potentially removing them if it 
> thinks they are redundant.
> This can cause scenarios where a subquery expression has conditions like `a#x 
> = a#x` i.e. both the attribute names and the expression ID(s) are the same. 
> This can then lead to conflicting expression ID(s) error.
> In `RemoveRedundantAliases`, we have an excluded AttributeSet argument 
> denoting the references for which we should not remove aliases. For a query 
> with a subquery expression, adding the references of this subquery in the 
> excluded set prevents such rewrite from happening.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46640) RemoveRedundantAliases does not account for SubqueryExpression when removing aliases

2024-01-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46640:
---

Assignee: Nikhil Sheoran

> RemoveRedundantAliases does not account for SubqueryExpression when removing 
> aliases
> 
>
> Key: SPARK-46640
> URL: https://issues.apache.org/jira/browse/SPARK-46640
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 4.0.0
>Reporter: Nikhil Sheoran
>Assignee: Nikhil Sheoran
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>
> `RemoveRedundantAliases` does not take into account the outer attributes of 
> a `SubqueryExpression` when removing aliases, potentially removing them if it 
> thinks they are redundant.
> This can cause scenarios where a subquery expression has conditions like `a#x 
> = a#x` i.e. both the attribute names and the expression ID(s) are the same. 
> This can then lead to conflicting expression ID(s) error.
> In `RemoveRedundantAliases`, we have an excluded AttributeSet argument 
> denoting the references for which we should not remove aliases. For a query 
> with a subquery expression, adding the references of this subquery in the 
> excluded set prevents such rewrite from happening.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46634) literal validation should not drill down to null fields

2024-01-09 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-46634:
---

 Summary: literal validation should not drill down to null fields
 Key: SPARK-46634
 URL: https://issues.apache.org/jira/browse/SPARK-46634
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46331) Removing CodeGenFallback trait from subset of datetime and spark version functions

2024-01-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46331:
---

Assignee: Aleksandar Tomic

> Removing CodeGenFallback trait from subset of datetime and spark version 
> functions
> --
>
> Key: SPARK-46331
> URL: https://issues.apache.org/jira/browse/SPARK-46331
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
>
> This change moves us further in the direction of removing CodegenFallback 
> and instead using RuntimeReplaceable with StaticInvoke, which directly inserts 
> the provided code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46331) Removing CodeGenFallback trait from subset of datetime and spark version functions

2024-01-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46331.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44261
[https://github.com/apache/spark/pull/44261]

> Removing CodeGenFallback trait from subset of datetime and spark version 
> functions
> --
>
> Key: SPARK-46331
> URL: https://issues.apache.org/jira/browse/SPARK-46331
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This change moves us further in the direction of removing CodegenFallback 
> and instead using RuntimeReplaceable with StaticInvoke, which directly inserts 
> the provided code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46610) Create table should throw exception when no value for a key in options

2024-01-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46610.
-
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 44615
[https://github.com/apache/spark/pull/44615]

> Create table should throw exception when no value for a key in options
> --
>
> Key: SPARK-46610
> URL: https://issues.apache.org/jira/browse/SPARK-46610
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46581) AccumulatorV2 isZero doesn't do what its name implies

2024-01-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46581:
---

Assignee: Davin Tjong

> AccumulatorV2 isZero doesn't do what its name implies
> -
>
> Key: SPARK-46581
> URL: https://issues.apache.org/jira/browse/SPARK-46581
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Davin Tjong
>Assignee: Davin Tjong
>Priority: Major
>  Labels: pull-request-available
>
> `AccumulatorV2`'s `isZero` doesn't do what the name or comment implies - it 
> actually checks if the accumulator hasn't been updated.
> The comment implies that for a `LongAccumulator`, for example, a value of `0` 
> would cause `isZero` to be `true`. But if we were to `add(0)`, then the value 
> would still be `0` but `isZero` would return `false`.
> Propose to rename this to `isUpdated` so the name matches the meaning more 
> closely.
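> A short sketch demonstrating the mismatch:
> {code:scala}
> import org.apache.spark.util.LongAccumulator
>
> val acc = new LongAccumulator
> acc.isZero   // true: never updated
> acc.add(0)   // the value stays 0, but the update count becomes 1
> acc.value    // 0
> acc.isZero   // false: "isZero" really means "never updated"
> {code}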



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46581) AccumulatorV2 isZero doesn't do what its name implies

2024-01-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46581.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44583
[https://github.com/apache/spark/pull/44583]

> AccumulatorV2 isZero doesn't do what its name implies
> -
>
> Key: SPARK-46581
> URL: https://issues.apache.org/jira/browse/SPARK-46581
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Davin Tjong
>Assignee: Davin Tjong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> `AccumulatorV2`'s `isZero` doesn't do what the name or comment implies - it 
> actually checks if the accumulator hasn't been updated.
> The comment implies that for a `LongAccumulator`, for example, a value of `0` 
> would cause `isZero` to be `true`. But if we were to `add(0)`, then the value 
> would still be `0` but `isZero` would return `false`.
> Propose to rename this to `isUpdated` so the name matches the meaning more 
> closely.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45352) Eliminate foldable window partitions

2024-01-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-45352.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43144
[https://github.com/apache/spark/pull/43144]

> Eliminate foldable window partitions
> 
>
> Key: SPARK-45352
> URL: https://issues.apache.org/jira/browse/SPARK-45352
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Mingliang Zhu
>Assignee: Mingliang Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> A foldable window partition is redundant. Removing it not only simplifies 
> the plan, but also lets some rules take effect when all the partitions are 
> foldable, such as `LimitPushDownThroughWindow`.
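> A sketch of the kind of query this helps (assuming literals are accepted in 
> the PARTITION BY clause):
> {code:scala}
> // Once the foldable partition 'x' is eliminated, the window has no
> // partition spec, so LimitPushDownThroughWindow can push the LIMIT
> // below the window operator.
> spark.sql("""
>   SELECT id, row_number() OVER (PARTITION BY 'x' ORDER BY id) AS rn
>   FROM range(10)
>   LIMIT 5
> """).explain()
> {code}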



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45352) Eliminate foldable window partitions

2024-01-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-45352:
---

Assignee: Mingliang Zhu

> Eliminate foldable window partitions
> 
>
> Key: SPARK-45352
> URL: https://issues.apache.org/jira/browse/SPARK-45352
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Mingliang Zhu
>Assignee: Mingliang Zhu
>Priority: Major
>  Labels: pull-request-available
>
> A foldable window partition is redundant. Removing it not only simplifies 
> the plan, but also lets some rules take effect when all the partitions are 
> foldable, such as `LimitPushDownThroughWindow`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46609) avoid exponential explosion in PartitioningPreservingUnaryExecNode

2024-01-05 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-46609:
---

 Summary: avoid exponential explosion in 
PartitioningPreservingUnaryExecNode
 Key: SPARK-46609
 URL: https://issues.apache.org/jira/browse/SPARK-46609
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46602) CREATE VIEW IF NOT EXISTS should never throw `TABLE_OR_VIEW_ALREADY_EXISTS` exception

2024-01-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46602:
---

Assignee: Xinyi Yu

> CREATE VIEW IF NOT EXISTS should never throw `TABLE_OR_VIEW_ALREADY_EXISTS` 
> exception
> -
>
> Key: SPARK-46602
> URL: https://issues.apache.org/jira/browse/SPARK-46602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Xinyi Yu
>Assignee: Xinyi Yu
>Priority: Major
>  Labels: pull-request-available
>
> `CREATE VIEW IF NOT EXISTS` should never throw `TABLE_OR_VIEW_ALREADY_EXISTS` 
> exceptions. However, the current implementation errors out in some 
> concurrent cases.
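> A sketch of the intended semantics (hypothetical view name `v`):
> {code:scala}
> // Issued concurrently from two sessions, one CREATE wins and the other must
> // silently no-op; neither should surface TABLE_OR_VIEW_ALREADY_EXISTS.
> spark.sql("CREATE VIEW IF NOT EXISTS v AS SELECT 1 AS c")
> spark.sql("CREATE VIEW IF NOT EXISTS v AS SELECT 1 AS c")  // no-op
> {code}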



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46602) CREATE VIEW IF NOT EXISTS should never throw `TABLE_OR_VIEW_ALREADY_EXISTS` exception

2024-01-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46602.
-
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 44603
[https://github.com/apache/spark/pull/44603]

> CREATE VIEW IF NOT EXISTS should never throw `TABLE_OR_VIEW_ALREADY_EXISTS` 
> exception
> -
>
> Key: SPARK-46602
> URL: https://issues.apache.org/jira/browse/SPARK-46602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Xinyi Yu
>Assignee: Xinyi Yu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>
> `CREATE VIEW IF NOT EXISTS` should never throw `TABLE_OR_VIEW_ALREADY_EXISTS` 
> exceptions. However, the current implementation errors out in some 
> concurrent cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46179) Generate golden files for SQLQueryTestSuites with Postgres

2024-01-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46179.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44084
[https://github.com/apache/spark/pull/44084]

> Generate golden files for SQLQueryTestSuites with Postgres
> --
>
> Key: SPARK-46179
> URL: https://issues.apache.org/jira/browse/SPARK-46179
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Andy Lam
>Assignee: Andy Lam
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> For correctness checking of our SQLQueryTestSuites, we want to run the 
> suites against another DBMS as a reference to generate golden files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46179) Generate golden files for SQLQueryTestSuites with Postgres

2024-01-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46179:
---

Assignee: Andy Lam

> Generate golden files for SQLQueryTestSuites with Postgres
> --
>
> Key: SPARK-46179
> URL: https://issues.apache.org/jira/browse/SPARK-46179
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Andy Lam
>Assignee: Andy Lam
>Priority: Major
>  Labels: pull-request-available
>
> For correctness checking of our SQLQueryTestSuites, we want to run the 
> suites against another DBMS as a reference to generate golden files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46598) OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column

2024-01-04 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-46598:
---

 Summary: OrcColumnarBatchReader should respect the memory mode 
when creating column vectors for the missing column
 Key: SPARK-46598
 URL: https://issues.apache.org/jira/browse/SPARK-46598
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46536) Support GROUP BY calendar_interval_type

2023-12-28 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-46536:
---

 Summary: Support GROUP BY calendar_interval_type
 Key: SPARK-46536
 URL: https://issues.apache.org/jira/browse/SPARK-46536
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan


Currently, Spark GROUP BY only allows orderable data types, otherwise the plan 
analysis fails: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala#L197-L203]

However, this is too strict as GROUP BY only cares about equality, not 
ordering. The CalendarInterval type is not orderable (1 month and 30 days, we 
don't know which one is larger), but has well-defined equality. In fact, we 
already support `SELECT DISTINCT calendar_interval_type` in some cases (when 
hash aggregate is picked by the planner).

The proposal here is to officially support calendar interval type in GROUP BY. 
We should relax the check inside `CheckAnalysis`, make `CalendarInterval` 
implement `Comparable` using natural ordering (compare months first, then 
days, then seconds), and test with both hash aggregate and sort aggregate.
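
A sketch of the target behavior (illustrative; assumes `make_interval` yields a 
CalendarInterval value and GROUP BY aliases are enabled):
{code:scala}
// Should aggregate by interval equality once the CheckAnalysis restriction
// is relaxed, under both hash aggregate and sort aggregate.
spark.sql(
  """SELECT make_interval(0, 1) AS i, count(*) AS cnt
    |FROM VALUES (1), (2) AS t(c)
    |GROUP BY i""".stripMargin).show()
{code}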



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46366) Use with expression to avoid duplicating expressions in BETWEEN operation

2023-12-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46366.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44299
[https://github.com/apache/spark/pull/44299]

> Use with expression to avoid duplicating expressions in BETWEEN operation
> -
>
> Key: SPARK-46366
> URL: https://issues.apache.org/jira/browse/SPARK-46366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46444) V2SessionCatalog#createTable should not load the table

2023-12-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46444:
---

Assignee: Wenchen Fan

> V2SessionCatalog#createTable should not load the table
> --
>
> Key: SPARK-46444
> URL: https://issues.apache.org/jira/browse/SPARK-46444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46444) V2SessionCatalog#createTable should not load the table

2023-12-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46444.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44377
[https://github.com/apache/spark/pull/44377]

> V2SessionCatalog#createTable should not load the table
> --
>
> Key: SPARK-46444
> URL: https://issues.apache.org/jira/browse/SPARK-46444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46481) EXECUTE IMMEDIATE does not fold variables when given as parameters

2023-12-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46481.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44450
[https://github.com/apache/spark/pull/44450]

> EXECUTE IMMEDIATE does not fold variables when given as parameters
> --
>
> Key: SPARK-46481
> URL: https://issues.apache.org/jira/browse/SPARK-46481
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Milan Stefanovic
>Assignee: Milan Stefanovic
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46485) V1Write should not add Sort when not needed

2023-12-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46485.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44458
[https://github.com/apache/spark/pull/44458]

> V1Write should not add Sort when not needed
> ---
>
> Key: SPARK-46485
> URL: https://issues.apache.org/jira/browse/SPARK-46485
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46466) vectorized parquet reader should never do rebase for timestamp ntz

2023-12-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-46466:

Fix Version/s: 3.5.1
   3.4.3

> vectorized parquet reader should never do rebase for timestamp ntz
> --
>
> Key: SPARK-46466
> URL: https://issues.apache.org/jira/browse/SPARK-46466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 4.0.0, 3.5.1, 3.4.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40876) Spark's Vectorized ParquetReader should support type promotions

2023-12-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40876.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44368
[https://github.com/apache/spark/pull/44368]

> Spark's Vectorized ParquetReader should support type promotions
> ---
>
> Key: SPARK-40876
> URL: https://issues.apache.org/jira/browse/SPARK-40876
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.3.0
>Reporter: Alexey Kudinkin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, when reading a Parquet table using Spark's `VectorizedColumnReader`, 
> we hit an issue when the requested (projection) schema widens one of the 
> fields' types from int32 to long.
> The expectation is that since this is a totally legitimate primitive type 
> promotion, we should be able to read Ints into Longs with no problems (Avro, 
> for example, handles this perfectly fine).
> However, the `ParquetVectorUpdaterFactory.getUpdater` method fails with the 
> exception listed below.
> Looking at the code, it actually seems to allow the opposite: it allows 
> "down-sizing" Int32s persisted in Parquet to be read as Bytes or Shorts. I'm 
> not sure what the rationale for this behavior is, and it actually seems like 
> a bug to me (as it will essentially lead to data truncation):
> {code:java}
> case INT32:
>   if (sparkType == DataTypes.IntegerType || canReadAsIntDecimal(descriptor, 
> sparkType)) {
> return new IntegerUpdater();
>   } else if (sparkType == DataTypes.LongType && isUnsignedIntTypeMatched(32)) 
> {
> // In `ParquetToSparkSchemaConverter`, we map parquet UINT32 to our 
> LongType.
> // For unsigned int32, it stores as plain signed int32 in Parquet when 
> dictionary
> // fallbacks. We read them as long values.
> return new UnsignedIntegerUpdater();
>   } else if (sparkType == DataTypes.ByteType) {
> return new ByteUpdater();
>   } else if (sparkType == DataTypes.ShortType) {
> return new ShortUpdater();
>   } else if (sparkType == DataTypes.DateType) {
> if ("CORRECTED".equals(datetimeRebaseMode)) {
>   return new IntegerUpdater();
> } else {
>   boolean failIfRebase = "EXCEPTION".equals(datetimeRebaseMode);
>   return new IntegerWithRebaseUpdater(failIfRebase);
> }
>   }
>   break; {code}
> Exception:
> {code:java}
> at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
>     at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>     at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>     at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
>     at scala.Option.foreach(Option.scala:407)
>     at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2279)
>     at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>     at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
>     at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:304)
>     at org.apache.spark.RangePartitioner.(Partitioner.
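> A minimal sketch of the widening scenario (hypothetical path and local 
> session assumed):
> {code:scala}
> import org.apache.spark.sql.types._
>
> val path = "/tmp/int32_table"  // hypothetical location
> spark.range(10).selectExpr("CAST(id AS INT) AS id").write.parquet(path)
>
> // Read back with the field widened from int32 to long: a legitimate
> // primitive promotion that the vectorized reader should accept.
> val widened = new StructType().add("id", LongType)
> spark.read.schema(widened).parquet(path).show()
> {code}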

[jira] [Assigned] (SPARK-40876) Spark's Vectorized ParquetReader should support type promotions

2023-12-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-40876:
---

Assignee: Johan Lasperas

> Spark's Vectorized ParquetReader should support type promotions
> ---
>
> Key: SPARK-40876
> URL: https://issues.apache.org/jira/browse/SPARK-40876
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.3.0
>Reporter: Alexey Kudinkin
>Assignee: Johan Lasperas
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, when reading a Parquet table using Spark's `VectorizedColumnReader`, 
> we hit an issue when the requested (projection) schema widens one of the 
> fields' types from int32 to long.
> The expectation is that since this is a totally legitimate primitive type 
> promotion, we should be able to read Ints into Longs with no problems (Avro, 
> for example, handles this perfectly fine).
> However, the `ParquetVectorUpdaterFactory.getUpdater` method fails with the 
> exception listed below.
> Looking at the code, it actually seems to allow the opposite: it allows 
> "down-sizing" Int32s persisted in Parquet to be read as Bytes or Shorts. I'm 
> not sure what the rationale for this behavior is, and it actually seems like 
> a bug to me (as it will essentially lead to data truncation):
> {code:java}
> case INT32:
>   if (sparkType == DataTypes.IntegerType || canReadAsIntDecimal(descriptor, 
> sparkType)) {
> return new IntegerUpdater();
>   } else if (sparkType == DataTypes.LongType && isUnsignedIntTypeMatched(32)) 
> {
> // In `ParquetToSparkSchemaConverter`, we map parquet UINT32 to our 
> LongType.
> // For unsigned int32, it stores as plain signed int32 in Parquet when 
> dictionary
> // fallbacks. We read them as long values.
> return new UnsignedIntegerUpdater();
>   } else if (sparkType == DataTypes.ByteType) {
> return new ByteUpdater();
>   } else if (sparkType == DataTypes.ShortType) {
> return new ShortUpdater();
>   } else if (sparkType == DataTypes.DateType) {
> if ("CORRECTED".equals(datetimeRebaseMode)) {
>   return new IntegerUpdater();
> } else {
>   boolean failIfRebase = "EXCEPTION".equals(datetimeRebaseMode);
>   return new IntegerWithRebaseUpdater(failIfRebase);
> }
>   }
>   break; {code}
> Exception:
> {code:java}
> at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
>     at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>     at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>     at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
>     at scala.Option.foreach(Option.scala:407)
>     at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2279)
>     at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>     at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
>     at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:304)
>     at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:171)
>     at 
> org.apache.spark.sql.execution.exchang

[jira] [Created] (SPARK-46485) V1Write should not add Sort when not needed

2023-12-21 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-46485:
---

 Summary: V1Write should not add Sort when not needed
 Key: SPARK-46485
 URL: https://issues.apache.org/jira/browse/SPARK-46485
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46443) Decimal precision and scale should be decided by the JDBC dialect

2023-12-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46443.
-
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 44398
[https://github.com/apache/spark/pull/44398]

> Decimal precision and scale should be decided by the JDBC dialect
> ---
>
> Key: SPARK-46443
> URL: https://issues.apache.org/jira/browse/SPARK-46443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46468) COUNT bug in lateral/exists subqueries

2023-12-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46468.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44451
[https://github.com/apache/spark/pull/44451]

> COUNT bug in lateral/exists subqueries
> --
>
> Key: SPARK-46468
> URL: https://issues.apache.org/jira/browse/SPARK-46468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Andrey Gubichev
>Assignee: Andrey Gubichev
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Some further instances of a COUNT bug.
>  
> One example is this test from join-lateral.sql
> [https://github.com/apache/spark/blame/master/sql/core/src/test/resources/sql-tests/results/join-lateral.sql.out#L757]
>  
> According to PostgreSQL, the query should return 2 rows:
> c1 | c2 | sum
> ----+----+------
>   0 |  1 |    2
>   1 |  2 | NULL
>  
> whereas Spark SQL only returns the first one.
>  
> Similar instance is the following query, which should return 1 row from t1 
> but has an empty result now:
> {{create temporary view t1(c1, c2) as values (0, 1), (1, 2);}}
> {{create temporary view t2(c1, c2) as values (0, 2), (0, 3);}}
> {{SELECT tt1.c2}}
> {{FROM t1 as tt1}}
> {{WHERE tt1.c1 in (}}
> {{select max(tt2.c1)}}
> {{from t2 as tt2}}
> {{where tt1.c2 is null);}}
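
The second example above, wrapped into a self-contained Scala repro (assumes only an active SparkSession named `spark`):

{code:scala}
// Repro of the IN-subquery instance from the description.
spark.sql("create temporary view t1(c1, c2) as values (0, 1), (1, 2)")
spark.sql("create temporary view t2(c1, c2) as values (0, 2), (0, 3)")
spark.sql(
  """SELECT tt1.c2
    |FROM t1 AS tt1
    |WHERE tt1.c1 IN (
    |  SELECT max(tt2.c1)
    |  FROM t2 AS tt2
    |  WHERE tt1.c2 IS NULL)""".stripMargin).show()
// Per the report: this should return 1 row from t1, while affected versions
// return an empty result.
{code}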



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46468) COUNT bug in lateral/exists subqueries

2023-12-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46468:
---

Assignee: Andrey Gubichev

> COUNT bug in lateral/exists subqueries
> --
>
> Key: SPARK-46468
> URL: https://issues.apache.org/jira/browse/SPARK-46468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Andrey Gubichev
>Assignee: Andrey Gubichev
>Priority: Major
>  Labels: pull-request-available
>
> Some further instances of a COUNT bug.
>  
> One example is this test from join-lateral.sql
> [https://github.com/apache/spark/blame/master/sql/core/src/test/resources/sql-tests/results/join-lateral.sql.out#L757]
>  
> According to PostgreSQL, the query should return 2 rows:
> c1 | c2 | sum
> ----+----+------
>   0 |  1 |    2
>   1 |  2 | NULL
>  
> whereas Spark SQL only returns the first one.
>  
> Similar instance is the following query, which should return 1 row from t1 
> but has an empty result now:
> {{create temporary view t1(c1, c2) as values (0, 1), (1, 2);}}
> {{create temporary view t2(c1, c2) as values (0, 2), (0, 3);}}
> {{SELECT tt1.c2}}
> {{FROM t1 as tt1}}
> {{WHERE tt1.c1 in (}}
> {{select max(tt2.c1)}}
> {{from t2 as tt2}}
> {{where tt1.c2 is null);}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45525) Initial support for Python data source write API

2023-12-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-45525.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43791
[https://github.com/apache/spark/pull/43791]

> Initial support for Python data source write API
> 
>
> Key: SPARK-45525
> URL: https://issues.apache.org/jira/browse/SPARK-45525
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add a new command and logical rules (similar to V1Writes and V2Writes) to 
> support Python data source write.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45525) Initial support for Python data source write API

2023-12-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-45525:
---

Assignee: Allison Wang

> Initial support for Python data source write API
> 
>
> Key: SPARK-45525
> URL: https://issues.apache.org/jira/browse/SPARK-45525
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Add a new command and logical rules (similar to V1Writes and V2Writes) to 
> support Python data source write.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46380) Replacing current time prior to inline table eval

2023-12-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46380.
-
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 44316
[https://github.com/apache/spark/pull/44316]

> Replacing current time prior to inline table eval
> -
>
> Key: SPARK-46380
> URL: https://issues.apache.org/jira/browse/SPARK-46380
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>
> When current time/date functions are used inside inline tables, each 
> invocation is evaluated separately. The proper behaviour is to replace such 
> expressions with the current time/date up front, so that they always return 
> a single value. Example:
> SELECT COUNT(DISTINCT ct) FROM VALUES
> (CURRENT_TIMESTAMP()),
> (CURRENT_TIMESTAMP()),
> (CURRENT_TIMESTAMP()) as data(ct)
>  
> This query is supposed to return 1, while it currently returns 3.
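
The example as a quick check in Scala (assumes an active SparkSession named `spark`):

{code:scala}
// Each inline-table row invokes CURRENT_TIMESTAMP(); after the fix the
// expression is evaluated once and shared, so the distinct count is 1.
spark.sql(
  """SELECT COUNT(DISTINCT ct) FROM VALUES
    |  (CURRENT_TIMESTAMP()),
    |  (CURRENT_TIMESTAMP()),
    |  (CURRENT_TIMESTAMP()) AS data(ct)""".stripMargin).show()
// Expected: 1. Per the description, affected versions print 3.
{code}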



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46466) vectorized parquet reader should never do rebase for timestamp ntz

2023-12-20 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-46466:
---

 Summary: vectorized parquet reader should never do rebase for 
timestamp ntz
 Key: SPARK-46466
 URL: https://issues.apache.org/jira/browse/SPARK-46466
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Wenchen Fan
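
No description was attached. For background (my reading, not the ticket's text): datetime rebase exists to reconcile the Julian/Gregorian calendar switch for legacy TIMESTAMP data, while TIMESTAMP_NTZ is a newer type that cannot carry legacy values, so the vectorized reader has no reason to rebase it. A hedged probe (the path is made up):

{code:scala}
// Even under LEGACY rebase mode, an ancient TIMESTAMP_NTZ value should
// round-trip through parquet unchanged.
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "LEGACY")
spark.sql("SELECT TIMESTAMP_NTZ '1500-01-01 00:00:00' AS ts")
  .write.mode("overwrite").parquet("/tmp/ntz_demo")
spark.read.parquet("/tmp/ntz_demo").show(truncate = false)
{code}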






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46452) Add a new API in DSv2 DataWriter to write an iterator of records

2023-12-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46452.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44410
[https://github.com/apache/spark/pull/44410]

> Add a new API in DSv2 DataWriter to write an iterator of records
> 
>
> Key: SPARK-46452
> URL: https://issues.apache.org/jira/browse/SPARK-46452
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add a new API that takes an iterator of records.
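
A hedged sketch of what implementing the iterator-based entry point could look like on the DSv2 DataWriter interface. The method name `writeAll` is taken from the linked PR; treat the exact signature as an assumption:

{code:scala}
import java.util.{Iterator => JIterator}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}

// A writer that receives a whole partition's records in one call instead of
// one write(...) invocation per record.
class BufferingWriter extends DataWriter[InternalRow] {
  private val buffer = scala.collection.mutable.ArrayBuffer.empty[InternalRow]

  override def write(record: InternalRow): Unit = buffer += record.copy()

  // Assumed new API from this ticket: consume the iterator in one pass.
  override def writeAll(records: JIterator[InternalRow]): Unit =
    while (records.hasNext) write(records.next())

  override def commit(): WriterCommitMessage = new WriterCommitMessage {}
  override def abort(): Unit = buffer.clear()
  override def close(): Unit = ()
}
{code}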



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46452) Add a new API in DSv2 DataWriter to write an iterator of records

2023-12-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46452:
---

Assignee: Allison Wang

> Add a new API in DSv2 DataWriter to write an iterator of records
> 
>
> Key: SPARK-46452
> URL: https://issues.apache.org/jira/browse/SPARK-46452
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Add a new API that takes an iterator of records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46272) Support CTAS using DSv2 sources

2023-12-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46272.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44190
[https://github.com/apache/spark/pull/44190]

> Support CTAS using DSv2 sources
> ---
>
> Key: SPARK-46272
> URL: https://issues.apache.org/jira/browse/SPARK-46272
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
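
The description is empty; for orientation, CTAS here refers to the SQL shape below. A hedged sketch: the provider class is hypothetical, standing in for any DSv2 TableProvider registered under that name:

{code:scala}
// CREATE TABLE ... AS SELECT routed through a DSv2 source.
// "com.example.MySource" is a made-up provider name.
spark.sql(
  """CREATE TABLE ctas_demo
    |USING com.example.MySource
    |AS SELECT id, id * 2 AS doubled FROM range(10)""".stripMargin)
{code}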




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46272) Support CTAS using DSv2 sources

2023-12-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46272:
---

Assignee: Allison Wang

> Support CTAS using DSv2 sources
> ---
>
> Key: SPARK-46272
> URL: https://issues.apache.org/jira/browse/SPARK-46272
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46446) Correctness bug in correlated subquery with OFFSET

2023-12-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46446.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44401
[https://github.com/apache/spark/pull/44401]

> Correctness bug in correlated subquery with OFFSET
> --
>
> Key: SPARK-46446
> URL: https://issues.apache.org/jira/browse/SPARK-46446
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Subqueries with correlation under LIMIT with OFFSET have a correctness bug, 
> introduced recently when support for correlation under OFFSET was enabled but 
> not handled correctly. (So we went from unsupported, where the query threw an 
> error, to wrong results.)
> It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS
>  
> It's easy to repro with a query like
> {code:java}
> create table x(x1 int, x2 int);
> insert into x values (1, 1), (2, 2);
> create table y(y1 int, y2 int);
> insert into y values (1, 1), (1, 2), (2, 4);
> select * from x where exists (select * from y where x1 = y1 limit 1 offset 
> 2){code}
> Correct result: empty set, see postgres: 
> [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] 
> Spark result: Array([2,2])
>  
> The 
> [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403]
>  where it was introduced added a test for it, but the golden file results for 
> the test actually were incorrect and we didn't notice. (The bug was initially 
> found by https://github.com/apache/spark/pull/44084)
> I'll work on both:
>  * Adding support for offset in DecorrelateInnerQuery (the transformation is 
> into a filter on row_number window function, similar to limit).
>  * Adding a feature flag to enable/disable offset in subquery support
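
For intuition about the row_number rewrite mentioned in the first bullet, here is a hand-decorrelated version of the repro (my sketch, not the PR's actual output): `LIMIT 1 OFFSET 2` per correlation key becomes a row_number() filter keeping rows 3..3 within each y1 group. It returns the empty set, matching PostgreSQL:

{code:scala}
spark.sql(
  """SELECT * FROM x
    |WHERE EXISTS (
    |  SELECT * FROM (
    |    SELECT y.*, row_number() OVER (PARTITION BY y1 ORDER BY y1) AS rn
    |    FROM y) numbered
    |  WHERE x1 = y1 AND rn > 2 AND rn <= 3)""".stripMargin).show()
// With the sample data neither y1 group has a 3rd row, so the result is empty.
{code}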



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46446) Correctness bug in correlated subquery with OFFSET

2023-12-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46446:
---

Assignee: Jack Chen

> Correctness bug in correlated subquery with OFFSET
> --
>
> Key: SPARK-46446
> URL: https://issues.apache.org/jira/browse/SPARK-46446
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
>  Labels: pull-request-available
>
> Subqueries with correlation under LIMIT with OFFSET have a correctness bug, 
> introduced recently when support for correlation under OFFSET was enabled but 
> not handled correctly. (So we went from unsupported, where the query threw an 
> error, to wrong results.)
> It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS
>  
> It's easy to repro with a query like
> {code:java}
> create table x(x1 int, x2 int);
> insert into x values (1, 1), (2, 2);
> create table y(y1 int, y2 int);
> insert into y values (1, 1), (1, 2), (2, 4);
> select * from x where exists (select * from y where x1 = y1 limit 1 offset 
> 2){code}
> Correct result: empty set, see postgres: 
> [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] 
> Spark result: Array([2,2])
>  
> The 
> [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403]
>  where it was introduced added a test for it, but the golden file results for 
> the test actually were incorrect and we didn't notice. (The bug was initially 
> found by https://github.com/apache/spark/pull/44084)
> I'll work on both:
>  * Adding support for offset in DecorrelateInnerQuery (the transformation is 
> into a filter on row_number window function, similar to limit).
>  * Adding a feature flag to enable/disable offset in subquery support



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


