[jira] [Updated] (SPARK-48382) Add controller / reconciler module to operator

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48382:
---
Labels: pull-request-available  (was: )

> Add controller / reconciler module to operator
> --
>
> Key: SPARK-48382
> URL: https://issues.apache.org/jira/browse/SPARK-48382
> Project: Spark
>  Issue Type: Sub-task
>  Components: k8s
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Zhou JIANG
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker

2024-05-22 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848824#comment-17848824
 ] 

Eric Yang edited comment on SPARK-48397 at 5/23/24 6:38 AM:


The PR: https://github.com/apache/spark/pull/46714


was (Author: JIRAUSER304132):
I'm working on a PR for it.

> Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
> 
>
> Key: SPARK-48397
> URL: https://issues.apache.org/jira/browse/SPARK-48397
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Eric Yang
>Priority: Major
>  Labels: pull-request-available
>
> For FileFormatDataWriter we currently record metrics of "task commit time" 
> and "job commit time" in 
> `org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`.
>  We may also record the time spent on "data write" (together with the time 
> spent on producing records from the iterator), which is usually one of the 
> major parts of the total duration of a write operation. It would help us 
> identify bottlenecks and time skew, and also aid general performance tuning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48397:
---
Labels: pull-request-available  (was: )

> Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
> 
>
> Key: SPARK-48397
> URL: https://issues.apache.org/jira/browse/SPARK-48397
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Eric Yang
>Priority: Major
>  Labels: pull-request-available
>
> For FileFormatDataWriter we currently record metrics of "task commit time" 
> and "job commit time" in 
> `org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`.
>  We may also record the time spent on "data write" (together with the time 
> spent on producing records from the iterator), which is usually one of the 
> major parts of the total duration of a write operation. It would help us 
> identify bottlenecks and time skew, and also aid general performance tuning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48396) Support configuring limit control for SQL to use maximum cores

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48396:
---
Labels: pull-request-available  (was: )

> Support configuring limit control for SQL to use maximum cores
> --
>
> Key: SPARK-48396
> URL: https://issues.apache.org/jira/browse/SPARK-48396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Mars
>Priority: Major
>  Labels: pull-request-available
>
> In a long-running shared Spark SQL cluster, a single large SQL query can occupy 
> all the cores of the cluster and affect the execution of other queries. It would 
> therefore be useful to have a configuration that limits the maximum number of 
> cores a single SQL query can use.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker

2024-05-22 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848824#comment-17848824
 ] 

Eric Yang commented on SPARK-48397:
---

I'm working on a PR for it.

> Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
> 
>
> Key: SPARK-48397
> URL: https://issues.apache.org/jira/browse/SPARK-48397
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Eric Yang
>Priority: Major
>
> For FileFormatDataWriter we currently record metrics of "task commit time" 
> and "job commit time" in 
> `org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`.
>  We may also record the time spent on "data write" (together with the time 
> spent on producing records from the iterator), which is usually one of the 
> major parts of the total duration of a write operation. It would help us 
> identify bottlenecks and time skew, and also aid general performance tuning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48398) Add Helm chart for Operator Deployment

2024-05-22 Thread Zhou JIANG (Jira)
Zhou JIANG created SPARK-48398:
--

 Summary: Add Helm chart for Operator Deployment
 Key: SPARK-48398
 URL: https://issues.apache.org/jira/browse/SPARK-48398
 Project: Spark
  Issue Type: Sub-task
  Components: k8s
Affects Versions: kubernetes-operator-0.1.0
Reporter: Zhou JIANG






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker

2024-05-22 Thread Eric Yang (Jira)
Eric Yang created SPARK-48397:
-

 Summary: Add data write time metric to 
FileFormatDataWriter/BasicWriteJobStatsTracker
 Key: SPARK-48397
 URL: https://issues.apache.org/jira/browse/SPARK-48397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Eric Yang


For FileFormatDataWriter we currently record metrics of "task commit time" and 
"job commit time" in 
`org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`. 
We may also record the time spent on "data write" (together with the time spent 
on producing records from the iterator), which is usually one of the major 
parts of the total duration of a write operation. It would help us identify 
bottlenecks and time skew, and also aid general performance tuning.
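
A minimal sketch of the proposed timing (illustration only, not the actual 
FileFormatDataWriter or BasicWriteJobStatsTracker internals): accumulate 
nanoseconds around pulling a row from the iterator and writing it, then report 
the total in milliseconds next to the existing commit-time metrics.
{code:java}
import java.util.concurrent.TimeUnit

// Illustration only: a tiny write loop that tracks "data write time" as described
// above, covering both producing rows from the iterator and writing them out.
final class TimedWriteLoop[T](write: T => Unit) {
  private var writeTimeNs: Long = 0L

  def writeAll(rows: Iterator[T]): Unit = {
    while (rows.hasNext) {
      val start = System.nanoTime()
      val row = rows.next()          // time spent producing the record from the iterator
      write(row)                     // time spent writing the record out
      writeTimeNs += System.nanoTime() - start
    }
  }

  // Reported in milliseconds, like the existing task/job commit time metrics.
  def dataWriteTimeMs: Long = TimeUnit.NANOSECONDS.toMillis(writeTimeNs)
}

// Usage sketch:
//   val loop = new TimedWriteLoop[String](_ => ())
//   loop.writeAll(Iterator("a", "b", "c"))
//   println(s"data write time: ${loop.dataWriteTimeMs} ms")
{code}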



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48396) Support configuring limit control for SQL to use maximum cores

2024-05-22 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars updated SPARK-48396:
-
Description: 
In a long-running shared Spark SQL cluster, a single large SQL query can occupy all 
the cores of the cluster and affect the execution of other queries. It would 
therefore be useful to have a configuration that limits the maximum number of cores 
a single SQL query can use.
 

> Support configuring limit control for SQL to use maximum cores
> --
>
> Key: SPARK-48396
> URL: https://issues.apache.org/jira/browse/SPARK-48396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Mars
>Priority: Major
>
> In a long-running shared Spark SQL cluster, a single large SQL query can occupy 
> all the cores of the cluster and affect the execution of other queries. It would 
> therefore be useful to have a configuration that limits the maximum number of 
> cores a single SQL query can use.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48396) Support configuring limit control for SQL to use maximum cores

2024-05-22 Thread Mars (Jira)
Mars created SPARK-48396:


 Summary: Support configuring limit control for SQL to use maximum 
cores
 Key: SPARK-48396
 URL: https://issues.apache.org/jira/browse/SPARK-48396
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.1
Reporter: Mars






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48370) Checkpoint and localCheckpoint in Scala Spark Connect client

2024-05-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48370.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46683
[https://github.com/apache/spark/pull/46683]

> Checkpoint and localCheckpoint in Scala Spark Connect client
> 
>
> Key: SPARK-48370
> URL: https://issues.apache.org/jira/browse/SPARK-48370
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> SPARK-48258 implemented checkpoint and localCheckpoint in the Python Spark 
> Connect client. We should do it in Scala too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48370) Checkpoint and localCheckpoint in Scala Spark Connect client

2024-05-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48370:


Assignee: Hyukjin Kwon

> Checkpoint and localCheckpoint in Scala Spark Connect client
> 
>
> Key: SPARK-48370
> URL: https://issues.apache.org/jira/browse/SPARK-48370
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> SPARK-48258 implemented checkpoint and localCheckpoint in the Python Spark 
> Connect client. We should do it in Scala too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48391) use addAll instead of add function in TaskMetrics to accelerate

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48391:
---
Labels: pull-request-available  (was: )

> use addAll instead of add function  in TaskMetrics  to accelerate
> -
>
> Key: SPARK-48391
> URL: https://issues.apache.org/jira/browse/SPARK-48391
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: jiahong.li
>Priority: Major
>  Labels: pull-request-available
>
> In the fromAccumulators method of TaskMetrics, we should use
> `tm._externalAccums.addAll` instead of `tm._externalAccums.add`, as
> _externalAccums is an instance of CopyOnWriteArrayList.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48387) Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE

2024-05-22 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48387.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46701
[https://github.com/apache/spark/pull/46701]

> Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE
> ---
>
> Key: SPARK-48387
> URL: https://issues.apache.org/jira/browse/SPARK-48387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48386) Replace JVM assert with JUnit Assert in tests

2024-05-22 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-48386.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46698
[https://github.com/apache/spark/pull/46698]

> Replace JVM assert with JUnit Assert in tests
> -
>
> Key: SPARK-48386
> URL: https://issues.apache.org/jira/browse/SPARK-48386
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48387) Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE

2024-05-22 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-48387:


Assignee: Kent Yao

> Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE
> ---
>
> Key: SPARK-48387
> URL: https://issues.apache.org/jira/browse/SPARK-48387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48394) Cleanup mapIdToMapIndex on mapoutput unregister

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48394:
---
Labels: pull-request-available  (was: )

> Cleanup mapIdToMapIndex on mapoutput unregister
> ---
>
> Key: SPARK-48394
> URL: https://issues.apache.org/jira/browse/SPARK-48394
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: wuyi
>Priority: Major
>  Labels: pull-request-available
>
> There is only one valid mapstatus for the same {{mapIndex}} at the same time 
> in Spark. {{mapIdToMapIndex}} should also follow the same rule to avoid 
> chaos.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48394) Cleanup mapIdToMapIndex on mapoutput unregister

2024-05-22 Thread wuyi (Jira)
wuyi created SPARK-48394:


 Summary: Cleanup mapIdToMapIndex on mapoutput unregister
 Key: SPARK-48394
 URL: https://issues.apache.org/jira/browse/SPARK-48394
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.1, 3.5.0, 4.0.0
Reporter: wuyi


There is only one valid mapstatus for the same {{mapIndex}} at the same time in 
Spark. {{mapIdToMapIndex}} should also follow the same rule to avoid chaos.
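
Below is a minimal sketch of that invariant (illustration only, not the real 
ShuffleStatus code): when a map output is unregistered, its mapId is removed from 
mapIdToMapIndex as well, so the index never points at a slot that no longer holds 
that map status.
{code:java}
import scala.collection.mutable

// Illustration only: keep the mapId -> mapIndex lookup consistent with the
// per-partition map output slots on both register and unregister.
final class MapOutputIndex(numPartitions: Int) {
  private val mapIdForIndex = Array.fill[Option[Long]](numPartitions)(None)
  private val mapIdToMapIndex = mutable.HashMap.empty[Long, Int]

  def register(mapIndex: Int, mapId: Long): Unit = {
    // A new attempt replaces the old one: drop the stale mapId first.
    mapIdForIndex(mapIndex).foreach(mapIdToMapIndex.remove)
    mapIdForIndex(mapIndex) = Some(mapId)
    mapIdToMapIndex(mapId) = mapIndex
  }

  def unregister(mapIndex: Int): Unit = {
    mapIdForIndex(mapIndex).foreach(mapIdToMapIndex.remove) // cleanup, as this ticket proposes
    mapIdForIndex(mapIndex) = None
  }

  def indexOf(mapId: Long): Option[Int] = mapIdToMapIndex.get(mapId)
}
{code}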



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-22 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848787#comment-17848787
 ] 

Bruce Robbins commented on SPARK-48361:
---

Did you mean the following?
{noformat}
val dfDropped = dfWithJagged.filter(col("__is_jagged") =!= true)
{noformat}
Either way (with `=== true` or `=!= true`), a bug of some sort is revealed.

With `=== true`, the grouping produces an empty result (it shouldn't).

With `=!= true`, the grouping includes `8, 9` (it shouldn't, as you mentioned).

In fact, for both cases, if you persist {{dfWithJagged}}, you get the right 
answer.

> Correctness: CSV corrupt record filter with aggregate ignored
> -
>
> Key: SPARK-48361
> URL: https://issues.apache.org/jira/browse/SPARK-48361
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Using spark shell 3.5.1 on M1 Mac
>Reporter: Ted Chester Jenks
>Priority: Major
>
> Using corrupt record in CSV parsing for some data cleaning logic, I came 
> across a correctness bug.
>  
> The following repro can be run with spark-shell 3.5.1.
> *Create test.csv with the following content:*
> {code:java}
> test,1,2,three
> four,5,6,seven
> 8,9
> ten,11,12,thirteen {code}
>  
>  
> *In spark-shell:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> # define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> # add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> # read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
>  
> # define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
>  
> # add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> dfWithJagged.show(){code}
> *Returns:*
> {code:java}
> +---+---+---++---+---+
> |column1|column2|column3| column4|_corrupt_record|__is_jagged|
> +---+---+---++---+---+
> |   four|    5.0|    6.0|   seven|           NULL|      false|
> |      8|    9.0|   NULL|    NULL|            8,9|       true|
> |    ten|   11.0|   12.0|thirteen|           NULL|      false|
> +---+---+---++---+---+ {code}
> So far so good...
>  
> *BUT*
>  
> *If we add an aggregate before we show:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> val dfDropped = dfWithJagged.filter(col("__is_jagged") === true)
> val groupedSum = 
> dfDropped.groupBy("column1").agg(sum("column2").alias("sum_column2"))
> groupedSum.show(){code}
> *We get:*
> {code:java}
> +---+---+
> |column1|sum_column2|
> +---+---+
> |      8|        9.0|
> |   four|        5.0|
> |    ten|       11.0|
> +---+---+ {code}
>  
> *Which is not correct*
>  
> With the addition of the aggregate, the filter down to rows with 3 commas in 
> the corrupt record column is ignored. This does not happen with any other 
> operators I have tried - just aggregates so far.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apa

[jira] [Assigned] (SPARK-48393) Move a group of constants to `pyspark.util`

2024-05-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48393:


Assignee: Ruifeng Zheng

> Move a group of constants to `pyspark.util`
> ---
>
> Key: SPARK-48393
> URL: https://issues.apache.org/jira/browse/SPARK-48393
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48393) Move a group of constants to `pyspark.util`

2024-05-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48393.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46710
[https://github.com/apache/spark/pull/46710]

> Move a group of constants to `pyspark.util`
> ---
>
> Key: SPARK-48393
> URL: https://issues.apache.org/jira/browse/SPARK-48393
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48393) Move a group of constants to `pyspark.util`

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48393:
---
Labels: pull-request-available  (was: )

> Move a group of constants to `pyspark.util`
> ---
>
> Key: SPARK-48393
> URL: https://issues.apache.org/jira/browse/SPARK-48393
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48393) Move a group of constants to `pyspark.util`

2024-05-22 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-48393:
-

 Summary: Move a group of constants to `pyspark.util`
 Key: SPARK-48393
 URL: https://issues.apache.org/jira/browse/SPARK-48393
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48392) (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48392:
---
Labels: pull-request-available  (was: )

> (Optionally) Load `spark-defaults.conf` when passing configurations to 
> `spark-submit` through `--properties-file`
> -
>
> Key: SPARK-48392
> URL: https://issues.apache.org/jira/browse/SPARK-48392
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Chao Sun
>Priority: Major
>  Labels: pull-request-available
>
> Currently, when a user passes configurations to {{spark-submit.sh}} via 
> {{--properties-file}}, {{spark-defaults.conf}} is completely 
> ignored. This poses issues for some users, for instance those using the [Spark 
> on K8S operator from kubeflow|https://github.com/kubeflow/spark-operator]. 
> See related issues: 
> - https://github.com/kubeflow/spark-operator/issues/1183
> - https://github.com/kubeflow/spark-operator/issues/1321



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48392) (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`

2024-05-22 Thread Chao Sun (Jira)
Chao Sun created SPARK-48392:


 Summary: (Optionally) Load `spark-defaults.conf` when passing 
configurations to `spark-submit` through `--properties-file`
 Key: SPARK-48392
 URL: https://issues.apache.org/jira/browse/SPARK-48392
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.1
Reporter: Chao Sun


Currently, when a user passes configurations to `spark-submit.sh` via 
`--properties-file`, `spark-defaults.conf` is completely ignored. This 
poses issues for some users, for instance those using the [Spark on K8S operator 
from kubeflow|https://github.com/kubeflow/spark-operator]. See related issues: 
- https://github.com/kubeflow/spark-operator/issues/1183
- https://github.com/kubeflow/spark-operator/issues/1321
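
A sketch of the merge semantics being requested (illustration only, not the actual 
spark-submit code; the file paths are made-up examples): load spark-defaults.conf 
first, then overlay the values from the --properties-file so the user-supplied file 
wins on conflicts instead of replacing the defaults entirely.
{code:java}
import java.io.FileInputStream
import java.util.Properties

// Illustration only: java.util.Properties treats '=', ':' or whitespace as the
// key/value separator, so it can read spark-defaults.conf style files too.
object MergeSubmitProps {
  private def load(path: String): Properties = {
    val props = new Properties()
    val in = new FileInputStream(path)
    try props.load(in) finally in.close()
    props
  }

  def main(args: Array[String]): Unit = {
    val merged = new Properties()
    merged.putAll(load("/opt/spark/conf/spark-defaults.conf")) // cluster defaults first
    merged.putAll(load("/tmp/job-overrides.properties"))       // --properties-file values win
    merged.stringPropertyNames().forEach(k => println(s"$k=${merged.getProperty(k)}"))
  }
}
{code}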



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48392) (Optionally) Load `spark-defaults.conf` when passing configurations to `spark-submit` through `--properties-file`

2024-05-22 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-48392:
-
Description: 
Currently, when a user passes configurations to {{spark-submit.sh}} via 
{{--properties-file}}, {{spark-defaults.conf}} is completely ignored. 
This poses issues for some users, for instance those using the [Spark on K8S 
operator from kubeflow|https://github.com/kubeflow/spark-operator]. See related 
issues: 
- https://github.com/kubeflow/spark-operator/issues/1183
- https://github.com/kubeflow/spark-operator/issues/1321

  was:
Currently, when user pass configurations to `spark-submit.sh` via 
`--properties-file`, the `spark-defaults.conf` will be completely ignored. This 
poses issues for some people, for instance, those using [Spark on K8S operator 
from kubeflow|https://github.com/kubeflow/spark-operator]. See related issues: 
- https://github.com/kubeflow/spark-operator/issues/1183
- https://github.com/kubeflow/spark-operator/issues/1321


> (Optionally) Load `spark-defaults.conf` when passing configurations to 
> `spark-submit` through `--properties-file`
> -
>
> Key: SPARK-48392
> URL: https://issues.apache.org/jira/browse/SPARK-48392
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Chao Sun
>Priority: Major
>
> Currently, when a user passes configurations to {{spark-submit.sh}} via 
> {{--properties-file}}, {{spark-defaults.conf}} is completely 
> ignored. This poses issues for some users, for instance those using the [Spark 
> on K8S operator from kubeflow|https://github.com/kubeflow/spark-operator]. 
> See related issues: 
> - https://github.com/kubeflow/spark-operator/issues/1183
> - https://github.com/kubeflow/spark-operator/issues/1321



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48275) array_sort: Improve documentation for default comparator's behavior for different types

2024-05-22 Thread Matt Braymer-Hayes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Braymer-Hayes updated SPARK-48275:
---
Description: 
h1. tl;dr

It would be helpful for the documentation for array_sort() to include the 
default comparator behavior for different array element types, especially 
structs. It would also be helpful for the 
{{DATATYPE_MISMATCH.INVALID_ORDERING_TYPE}} error to recommend using a 
custom comparator instead of the default comparator, especially when sorting on 
a complex type (e.g., a struct containing an unorderable field, like a map).

 

h1. Background

The default comparator for {{array_sort()}} for struct elements is to sort by 
every field in the struct in schema order (i.e., ORDER BY field1, field2, ..., 
fieldN). This requires every field to be orderable; if any field is not, an error 
occurs.

 

Here's a small example:
{code:java}
import pyspark.sql.functions as F
import pyspark.sql.types as T

schema = T.StructType([
    T.StructField(
        'value',
        T.ArrayType(
            T.StructType([
                T.StructField('orderable', T.IntegerType(), True),
                # remove this field and both commands below succeed
                T.StructField('unorderable', T.MapType(T.StringType(), T.StringType(), True), True),
            ]),
            False
        ),
        False
    )
])
df = spark.createDataFrame([], schema=schema)

df.select(F.array_sort(df['value'])).collect(){code}
Output:
{code:java}
[DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve 
"(namedlambdavariable() < namedlambdavariable())" due to data type mismatch: 
The `<` does not support ordering on type "STRUCT>". SQLSTATE: 42K09 {code}
 

If the default comparator doesn't work for a user (e.g., they have an 
unorderable field like a map in their struct), array_sort() accepts a custom 
comparator, where users can order array elements however they like.

 

Building on the previous example:

 
{code:java}
import pyspark.sql as psql


def comparator(l: psql.Column, r: psql.Column) -> psql.Column:
    """Order structs l and r by the orderable field.
    Rules:
    * Nulls are last
    * In ascending order
    """
    return (
        F.when(l['orderable'].isNull() & r['orderable'].isNull(), 0)
        .when(l['orderable'].isNull(), 1)
        .when(r['orderable'].isNull(), -1)
        .when(l['orderable'] < r['orderable'], -1)
        .when(l['orderable'] == r['orderable'], 0)
        .otherwise(1)
    )

df.select(F.array_sort(df['value'], comparator)).collect(){code}
This works as intended.

 

h1. Ask

The documentation for {{array_sort()}} should include information on the 
behavior of the default comparator for various datatypes. For the 
array-of-unorderable-structs example, it would be helpful to know that the 
default comparator for structs compares all fields in schema order (i.e., 
{{ORDER BY field1, field2, ..., fieldN}}).

 

Additionally, when a user passes an unorderable type to array_sort() and uses 
the default comparator, the returned error should recommend the user use a 
custom comparator instead.

  was:
h1. tl;dr

It would be helpful for the documentation for array_sort() to include the 
default comparator behavior for different array element types, especially 
structs. It would also be helpful for the 
\{{DATATYPE_MISMATCH.INVALID_ORDERING_TYPE }}error to recommend using a custom 
comparator instead of the default comparator, especially when sorting on a 
complex type (e.g., a struct containing an unorderable field, like a map).

 

h1. Background

The default comparator for {{array_sort()}} for struct elements is to sort by 
every field in the struct in schema order (i.e., ORDER BY field1, field2, ..., 
fieldN). This requires every field to be orderable: if they aren't an error 
occurs.

 

Here's a small example:
{code:java}
import pyspark.sql.functions as F
import pyspark.sql.types as T

schema = T.StructType([
T.StructField(
'value',
T.ArrayType(
T.StructType([
T.StructField('orderable', T.IntegerType(), True),
T.StructField('unorderable', T.MapType(T.StringType(), 
T.StringType(), True), True), # remove this field and both commands below 
succeed
]),
False
),
False
)
])
df = spark.createDataFrame([], schema=schema)

df.select(F.array_sort(df['value'])).collect(){code}
Output:
{code:java}
[DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve 
"(namedlambdavariable() < namedlambdavariable())" due to data type mismatch: 
The `<` does not support ordering on type "STRUCT>". SQLSTATE: 42K09 {code}
 

If the default comparator doesn't work for a user (e.g., they have an 
unorderable field like a map in their struct), array_sort() accepts a custom 
comparator, where users can order array elements however they like.

 

Building on the previous example:

 
{code:java}
import pyspark.sql as psql


d

[jira] [Updated] (SPARK-48386) Replace JVM assert with JUnit Assert in tests

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48386:
---
Labels: pull-request-available  (was: )

> Replace JVM assert with JUnit Assert in tests
> -
>
> Key: SPARK-48386
> URL: https://issues.apache.org/jira/browse/SPARK-48386
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-22 Thread Ted Chester Jenks (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Chester Jenks updated SPARK-48361:
--
Description: 
Using corrupt record in CSV parsing for some data cleaning logic, I came across 
a correctness bug.

 

The following repro can be run with spark-shell 3.5.1.

*Create test.csv with the following content:*
{code:java}
test,1,2,three
four,5,6,seven
8,9
ten,11,12,thirteen {code}
 

 

*In spark-shell:*
{code:java}
import org.apache.spark.sql.types._ 
import org.apache.spark.sql.functions._
 
// define a STRING, DOUBLE, DOUBLE, STRING schema for the data
val schema = StructType(List(StructField("column1", StringType, true), 
StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
true), StructField("column4", StringType, true)))
 
// add a column for corrupt records to the schema
val schemaWithCorrupt = StructType(schema.fields :+ 
StructField("_corrupt_record", StringType, true)) 
 
// read the CSV with the schema, headers, permissive parsing, and the corrupt
// record column
val df = spark.read.option("header", "true").option("mode", 
"PERMISSIVE").option("columnNameOfCorruptRecord", 
"_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
 
// define a UDF to count the commas in the corrupt record column
val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else -1) 
 
// add a true/false column for whether the number of commas is 3
val dfWithJagged = df.withColumn("__is_jagged", 
when(col("_corrupt_record").isNull, 
false).otherwise(countCommas(col("_corrupt_record")) =!= 3))

dfWithJagged.show(){code}
*Returns:*
{code:java}
+---+---+---++---+---+
|column1|column2|column3| column4|_corrupt_record|__is_jagged|
+---+---+---++---+---+
|   four|    5.0|    6.0|   seven|           NULL|      false|
|      8|    9.0|   NULL|    NULL|            8,9|       true|
|    ten|   11.0|   12.0|thirteen|           NULL|      false|
+---+---+---++---+---+ {code}
So far so good...

 

*BUT*

 

*If we add an aggregate before we show:*
{code:java}
import org.apache.spark.sql.types._ 
import org.apache.spark.sql.functions._
val schema = StructType(List(StructField("column1", StringType, true), 
StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
true), StructField("column4", StringType, true)))
val schemaWithCorrupt = StructType(schema.fields :+ 
StructField("_corrupt_record", StringType, true)) 
val df = spark.read.option("header", "true").option("mode", 
"PERMISSIVE").option("columnNameOfCorruptRecord", 
"_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else -1) 
val dfWithJagged = df.withColumn("__is_jagged", 
when(col("_corrupt_record").isNull, 
false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
val dfDropped = dfWithJagged.filter(col("__is_jagged") === true)
val groupedSum = 
dfDropped.groupBy("column1").agg(sum("column2").alias("sum_column2"))
groupedSum.show(){code}
*We get:*
{code:java}
+---+---+
|column1|sum_column2|
+---+---+
|      8|        9.0|
|   four|        5.0|
|    ten|       11.0|
+---+---+ {code}
 

*Which is not correct*

 

With the addition of the aggregate, the filter down to rows with 3 commas in 
the corrupt record column is ignored. This does not happen with any other 
operators I have tried - just aggregates so far.

 

 

 

  was:
Using corrupt record in CSV parsing for some data cleaning logic, I came across 
a correctness bug.

 

The following repro can be ran with spark-shell 3.5.1.

*Create test.csv with the following content:*
{code:java}
test,1,2,three
four,5,6,seven
8,9
ten,11,12,thirteen {code}
 

 

*In spark-shell:*
{code:java}
import org.apache.spark.sql.types._ 
import org.apache.spark.sql.functions._
 
# define a STRING, DOUBLE, DOUBLE, STRING schema for the data
val schema = StructType(List(StructField("column1", StringType, true), 
StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
true), StructField("column4", StringType, true)))
 
# add a column for corrupt records to the schema
val schemaWithCorrupt = StructType(schema.fields :+ 
StructField("_corrupt_record", StringType, true)) 
 
# read the CSV with the schema, headers, permissive parsing, and the corrupt 
record column
val df = spark.read.option("header", "true").option("mode", 
"PERMISSIVE").option("columnNameOfCorruptRecord", 
"_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
 
# define a UDF to count the commas in the corrupt record column
val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else -1) 
 
# add a true/false column for whether the number of commas is 3
val dfWithJagged = df.withColumn("__is_jagged", 
when(col("_corrupt_reco

[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-22 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848696#comment-17848696
 ] 

Bruce Robbins commented on SPARK-48361:
---

`8,9` is still present before the aggregate:
{noformat}
scala> dfWithJagged.show(false)
24/05/22 10:33:24 WARN CSVHeaderChecker: CSV header does not conform to the 
schema.
 Header: test, 1, 2, three
 Schema: column1, column2, column3, column4
Expected: column1 but found: test
CSV file: file:///Users/bruce/github/spark_up3.5.1/test.csv
+---+---+---++---+---+
|column1|column2|column3|column4 |_corrupt_record|__is_jagged|
+---+---+---++---+---+
|four   |5.0|6.0|seven   |NULL   |false  |
|8  |9.0|NULL   |NULL|8,9|true   |
|ten|11.0   |12.0   |thirteen|NULL   |false  |
+---+---+---++---+---+


scala> sql("select version()").collect
res6: Array[org.apache.spark.sql.Row] = Array([3.5.1 
fd86f85e181fc2dc0f50a096855acf83a6cc5d9c])

scala> 
{noformat}
Which piece of code filters out `8,9`? I couldn't find the filter in your 
example, but again I may be missing something. 

> Correctness: CSV corrupt record filter with aggregate ignored
> -
>
> Key: SPARK-48361
> URL: https://issues.apache.org/jira/browse/SPARK-48361
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Using spark shell 3.5.1 on M1 Mac
>Reporter: Ted Chester Jenks
>Priority: Major
>
> Using corrupt record in CSV parsing for some data cleaning logic, I came 
> across a correctness bug.
>  
> The following repro can be run with spark-shell 3.5.1.
> *Create test.csv with the following content:*
> {code:java}
> test,1,2,three
> four,5,6,seven
> 8,9
> ten,11,12,thirteen {code}
>  
>  
> *In spark-shell:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> # define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> # add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> # read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
>  
> # define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
>  
> # add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> dfWithJagged.show(){code}
> *Returns:*
> {code:java}
> +---+---+---++---+---+
> |column1|column2|column3| column4|_corrupt_record|__is_jagged|
> +---+---+---++---+---+
> |   four|    5.0|    6.0|   seven|           NULL|      false|
> |      8|    9.0|   NULL|    NULL|            8,9|       true|
> |    ten|   11.0|   12.0|thirteen|           NULL|      false|
> +---+---+---++---+---+ {code}
> So far so good...
>  
> *BUT*
>  
> *If we add an aggregate before we show:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> # define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> # add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> # read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
>  
> # define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
>  
> # add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged",

[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-22 Thread Ted Chester Jenks (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848694#comment-17848694
 ] 

Ted Chester Jenks commented on SPARK-48361:
---

{code:java}
+---+---+
|column1|sum_column2|
+---+---+
|   four|        5.0|
|    ten|       11.0|
+---+---+  {code}
The row with `8,9` should be filtered out as it was before adding the aggregate.

> Correctness: CSV corrupt record filter with aggregate ignored
> -
>
> Key: SPARK-48361
> URL: https://issues.apache.org/jira/browse/SPARK-48361
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Using spark shell 3.5.1 on M1 Mac
>Reporter: Ted Chester Jenks
>Priority: Major
>
> Using corrupt record in CSV parsing for some data cleaning logic, I came 
> across a correctness bug.
>  
> The following repro can be run with spark-shell 3.5.1.
> *Create test.csv with the following content:*
> {code:java}
> test,1,2,three
> four,5,6,seven
> 8,9
> ten,11,12,thirteen {code}
>  
>  
> *In spark-shell:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> # define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> # add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> # read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
>  
> # define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
>  
> # add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> dfWithJagged.show(){code}
> *Returns:*
> {code:java}
> +---+---+---++---+---+
> |column1|column2|column3| column4|_corrupt_record|__is_jagged|
> +---+---+---++---+---+
> |   four|    5.0|    6.0|   seven|           NULL|      false|
> |      8|    9.0|   NULL|    NULL|            8,9|       true|
> |    ten|   11.0|   12.0|thirteen|           NULL|      false|
> +---+---+---++---+---+ {code}
> So far so good...
>  
> *BUT*
>  
> *If we add an aggregate before we show:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> # define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> # add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> # read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
>  
> # define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
>  
> # add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
>   
> # sum up column1
> val groupedSum = 
> dfWithJagged.groupBy("column1").agg(sum("column2").alias("sum_column2"))
> groupedSum.show(){code}
> *We get:*
> {code:java}
> +---+---+
> |column1|sum_column2|
> +---+---+
> |      8|        9.0|
> |   four|        5.0|
> |    ten|       11.0|
> +---+---+ {code}
>  
> *Which is not correct*
>  
> With the addition of the aggregate, the filter down to rows with 3 commas in 
> the corrupt record column is ignored. This does not happen with any other 
> operators I have tried - just aggregates so far.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

--

[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-22 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848692#comment-17848692
 ] 

Bruce Robbins commented on SPARK-48361:
---

Sorry for being dense. What would the correct answer be?

> Correctness: CSV corrupt record filter with aggregate ignored
> -
>
> Key: SPARK-48361
> URL: https://issues.apache.org/jira/browse/SPARK-48361
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Using spark shell 3.5.1 on M1 Mac
>Reporter: Ted Chester Jenks
>Priority: Major
>
> Using corrupt record in CSV parsing for some data cleaning logic, I came 
> across a correctness bug.
>  
> The following repro can be run with spark-shell 3.5.1.
> *Create test.csv with the following content:*
> {code:java}
> test,1,2,three
> four,5,6,seven
> 8,9
> ten,11,12,thirteen {code}
>  
>  
> *In spark-shell:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> # define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> # add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> # read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
>  
> # define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
>  
> # add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> dfWithJagged.show(){code}
> *Returns:*
> {code:java}
> +---+---+---++---+---+
> |column1|column2|column3| column4|_corrupt_record|__is_jagged|
> +---+---+---++---+---+
> |   four|    5.0|    6.0|   seven|           NULL|      false|
> |      8|    9.0|   NULL|    NULL|            8,9|       true|
> |    ten|   11.0|   12.0|thirteen|           NULL|      false|
> +---+---+---++---+---+ {code}
> So far so good...
>  
> *BUT*
>  
> *If we add an aggregate before we show:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> # define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> # add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> # read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
>  
> # define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
>  
> # add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
>   
> # sum up column1
> val groupedSum = 
> dfWithJagged.groupBy("column1").agg(sum("column2").alias("sum_column2"))
> groupedSum.show(){code}
> *We get:*
> {code:java}
> +---+---+
> |column1|sum_column2|
> +---+---+
> |      8|        9.0|
> |   four|        5.0|
> |    ten|       11.0|
> +---+---+ {code}
>  
> *Which is not correct*
>  
> With the addition of the aggregate, the filter down to rows with 3 commas in 
> the corrupt record column is ignored. This does not happen with any other 
> operators I have tried - just aggregates so far.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43043) Improve the performance of MapOutputTracker.updateMapOutput

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-43043:
---
Labels: pull-request-available  (was: )

> Improve the performance of MapOutputTracker.updateMapOutput
> ---
>
> Key: SPARK-43043
> URL: https://issues.apache.org/jira/browse/SPARK-43043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.2
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Inside of MapOutputTracker, there is a line of code which does a linear find 
> through a mapStatuses collection: 
> https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L167
>   (plus a similar search a few lines down at 
> https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L174)
> This scan is necessary because we only know the mapId of the updated status 
> and not its mapPartitionId.
> We perform this scan once per migrated block, so if a large proportion of all 
> blocks in the map are migrated then we get O(n^2) total runtime across all of 
> the calls.
> I think we might be able to fix this by extending ShuffleStatus to have an 
> OpenHashMap mapping from mapId to mapPartitionId. 
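
A minimal sketch of the proposed fix (illustration only, not the real ShuffleStatus 
or MapOutputTracker code): keep a side index from mapId to map partition index so 
each update becomes a hash lookup instead of a linear scan of mapStatuses.
{code:java}
import scala.collection.mutable

final case class FakeMapStatus(mapId: Long, location: String)

// Illustration only: contrast the O(n) scan described above with an O(1) lookup
// through a mapId -> mapPartitionId index built alongside the statuses array.
final class StatusIndex(statuses: Array[FakeMapStatus]) {
  private val mapIdToMapIndex: mutable.HashMap[Long, Int] =
    mutable.HashMap(statuses.zipWithIndex.collect { case (s, i) if s != null => s.mapId -> i }: _*)

  // Current approach: scan mapStatuses for the entry with a matching mapId.
  def updateByScan(mapId: Long, newLoc: String): Unit = {
    val i = statuses.indexWhere(s => s != null && s.mapId == mapId)
    if (i >= 0) statuses(i) = statuses(i).copy(location = newLoc)
  }

  // Proposed approach: resolve the partition index directly from the side map.
  def updateByIndex(mapId: Long, newLoc: String): Unit =
    mapIdToMapIndex.get(mapId).foreach { i =>
      statuses(i) = statuses(i).copy(location = newLoc)
    }
}
{code}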



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48391) use addAll instead of add function in TaskMetrics to accelerate

2024-05-22 Thread jiahong.li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiahong.li updated SPARK-48391:
---
Summary: use addAll instead of add function  in TaskMetrics  to accelerate  
(was: use addAll instead of add function  in TaskMetrics )

> use addAll instead of add function  in TaskMetrics  to accelerate
> -
>
> Key: SPARK-48391
> URL: https://issues.apache.org/jira/browse/SPARK-48391
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: jiahong.li
>Priority: Major
>
> In the fromAccumulators method of TaskMetrics, we should use 
> `tm._externalAccums.addAll` instead of `tm._externalAccums.add`, as 
> _externalAccums is an instance of CopyOnWriteArrayList.
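
A rough sketch of why the bulk call matters, relying only on the documented CopyOnWriteArrayList semantics (every mutation copies the whole backing array); the names below are illustrative and not the actual TaskMetrics fields:
{code:scala}
import java.util.concurrent.CopyOnWriteArrayList
import scala.jdk.CollectionConverters._

val incoming = (1 to 1000).map(i => s"acc$i")

// Element-by-element: each add() copies the entire backing array, so O(n^2) copying overall.
val slow = new CopyOnWriteArrayList[String]()
incoming.foreach(x => slow.add(x))

// Bulk: a single addAll() copies the backing array once.
val fast = new CopyOnWriteArrayList[String]()
fast.addAll(incoming.asJava)
{code}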



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48391) use addAll instead of add function in TaskMetrics

2024-05-22 Thread jiahong.li (Jira)
jiahong.li created SPARK-48391:
--

 Summary: use addAll instead of add function  in TaskMetrics 
 Key: SPARK-48391
 URL: https://issues.apache.org/jira/browse/SPARK-48391
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.1, 3.5.0
Reporter: jiahong.li


In the fromAccumulators method of TaskMetrics, we should use 
`tm._externalAccums.addAll` instead of `tm._externalAccums.add`, as 
_externalAccums is an instance of CopyOnWriteArrayList.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48390) SparkListenerBus not sending tableName details in logical plan for spark versions 3.4.2 and above

2024-05-22 Thread Mayur Madnani (Jira)
Mayur Madnani created SPARK-48390:
-

 Summary: SparkListenerBus not sending tableName details in logical 
plan for spark versions 3.4.2 and above
 Key: SPARK-48390
 URL: https://issues.apache.org/jira/browse/SPARK-48390
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 3.4.3, 3.5.1, 3.5.0, 3.4.2, 3.5.2
Reporter: Mayur Madnani


In OpenLineage, a logical plan event is received via SparkEventListener, and by 
parsing it the framework deduces input/output table names to create a lineage.
The issue is that in Spark versions 3.4.2 and above (tested and reproducible in 
3.4.2 and 3.5.0) the logical plan event sent by Spark core is partial and is 
missing the tableName property that was sent in earlier versions (working in 
Spark 3.3.4).


+_Note: This issue is only encountered in drop table events._+

For a drop table event, see below the logical plan in different spark versions

*Spark 3.3.4*
{code:java}
[
{
"class": "org.apache.spark.sql.execution.command.DropTableCommand",
"num-children": 0,
"tableName":

{ "product-class": "org.apache.spark.sql.catalyst.TableIdentifier", "table": 
"drop_table_test", "database": "default" }

,
"ifExists": false,
"isView": false,
"purge": false
}
]

{code}
*Spark 3.4.2*
{code:java}
[

{ "class": "org.apache.spark.sql.catalyst.plans.logical.DropTable", 
"num-children": 1, "child": 0, "ifExists": false, "purge": false }

,

{ "class": "org.apache.spark.sql.catalyst.analysis.ResolvedIdentifier", 
"num-children": 0, "catalog": null, "identifier": null }

]

{code}
More details in referenced issue here: 
[https://github.com/OpenLineage/OpenLineage/issues/2716]
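
For reference, a minimal sketch for dumping the logical plan JSON produced by the DROP TABLE statement, to compare the two shapes above. This uses QueryExecutionListener rather than the OpenLineage SparkListener integration, and the table name is only an example:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Prints the logical plan JSON for each successfully executed statement.
object PlanDumpListener extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(qe.logical.toJSON)
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
}

val spark = SparkSession.builder().master("local[*]").appName("plan-dump").getOrCreate()
spark.listenerManager.register(PlanDumpListener)
spark.sql("CREATE TABLE IF NOT EXISTS drop_table_test (id INT) USING parquet")
// Emits DropTableCommand with tableName on 3.3.x, but DropTable + ResolvedIdentifier on 3.4.2+.
spark.sql("DROP TABLE drop_table_test")
{code}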



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48364) Type casting for AbstractMapType

2024-05-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48364:
---

Assignee: Uroš Bojanić

> Type casting for AbstractMapType
> 
>
> Key: SPARK-48364
> URL: https://issues.apache.org/jira/browse/SPARK-48364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48364) Type casting for AbstractMapType

2024-05-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48364.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46661
[https://github.com/apache/spark/pull/46661]

> Type casting for AbstractMapType
> 
>
> Key: SPARK-48364
> URL: https://issues.apache.org/jira/browse/SPARK-48364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48364) Type casting for AbstractMapType

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48364:
---
Labels: pull-request-available  (was: )

> Type casting for AbstractMapType
> 
>
> Key: SPARK-48364
> URL: https://issues.apache.org/jira/browse/SPARK-48364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48215) DateFormatClass (all collations)

2024-05-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48215.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46561
[https://github.com/apache/spark/pull/46561]

> DateFormatClass (all collations)
> 
>
> Key: SPARK-48215
> URL: https://issues.apache.org/jira/browse/SPARK-48215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Nebojsa Savic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Enable collation support for the *DateFormatClass* built-in function in 
> Spark. First confirm what is the expected behaviour for this expression when 
> given collated strings, and then move on to implementation and testing. You 
> will find this expression in the *datetimeExpressions.scala* file, and it 
> should be considered a pass-through function with respect to collation 
> awareness. Implement the corresponding E2E SQL tests 
> (CollationSQLExpressionsSuite) to reflect how this function should be used 
> with collation in SparkSQL, and feel free to use your chosen Spark SQL Editor 
> to experiment with the existing functions to learn more about how they work. 
> In addition, look into the possible use-cases and implementation of similar 
> functions within other open-source DBMSs, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *DateFormatClass* 
> expression so that it supports all collation types currently supported in 
> Spark. To understand what changes were introduced in order to enable full 
> collation support for other existing functions in Spark, take a look at the 
> Spark PRs and Jira tickets for completed tasks in this parent (for example: 
> Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, 
> FormatNumber, Sentences).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
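
As a rough illustration of the pass-through behaviour (not taken from the PR; the COLLATE syntax and the UNICODE_CI collation name are assumptions based on the parent collation tickets):
{code:scala}
// date_format should return the same result whether or not its format string is collated.
val plain = spark.sql(
  "SELECT date_format(TIMESTAMP '2024-05-22 10:00:00', 'yyyy-MM-dd') AS d")
val collated = spark.sql(
  "SELECT date_format(TIMESTAMP '2024-05-22 10:00:00', 'yyyy-MM-dd' COLLATE UNICODE_CI) AS d")
assert(plain.collect().sameElements(collated.collect()))
{code}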



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48215) DateFormatClass (all collations)

2024-05-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48215:
---

Assignee: Nebojsa Savic

> DateFormatClass (all collations)
> 
>
> Key: SPARK-48215
> URL: https://issues.apache.org/jira/browse/SPARK-48215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Nebojsa Savic
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *DateFormatClass* built-in function in 
> Spark. First confirm what is the expected behaviour for this expression when 
> given collated strings, and then move on to implementation and testing. You 
> will find this expression in the *datetimeExpressions.scala* file, and it 
> should be considered a pass-through function with respect to collation 
> awareness. Implement the corresponding E2E SQL tests 
> (CollationSQLExpressionsSuite) to reflect how this function should be used 
> with collation in SparkSQL, and feel free to use your chosen Spark SQL Editor 
> to experiment with the existing functions to learn more about how they work. 
> In addition, look into the possible use-cases and implementation of similar 
> functions within other open-source DBMSs, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *DateFormatClass* 
> expression so that it supports all collation types currently supported in 
> Spark. To understand what changes were introduced in order to enable full 
> collation support for other existing functions in Spark, take a look at the 
> Spark PRs and Jira tickets for completed tasks in this parent (for example: 
> Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, 
> FormatNumber, Sentences).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-48379) Cancel build during a PR when a new commit is pushed

2024-05-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-48379:
--
  Assignee: (was: Stefan Kandic)

Reverted in 
https://github.com/apache/spark/commit/9fd85d9acc5acf455d0ad910ef2848695576242b

> Cancel build during a PR when a new commit is pushed
> 
>
> Key: SPARK-48379
> URL: https://issues.apache.org/jira/browse/SPARK-48379
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Creating a new commit on a branch should cancel the build of previous commits 
> for the same branch.
> Exceptions are master and branch-* branches where we still want to have 
> concurrent builds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48379) Cancel build during a PR when a new commit is pushed

2024-05-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-48379:
-
Fix Version/s: (was: 4.0.0)

> Cancel build during a PR when a new commit is pushed
> 
>
> Key: SPARK-48379
> URL: https://issues.apache.org/jira/browse/SPARK-48379
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
>
> Creating a new commit on a branch should cancel the build of previous commits 
> for the same branch.
> Exceptions are master and branch-* branches where we still want to have 
> concurrent builds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48389) Remove obsolete workflow cancel_duplicate_workflow_runs

2024-05-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48389.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46703
[https://github.com/apache/spark/pull/46703]

> Remove obsolete workflow cancel_duplicate_workflow_runs
> ---
>
> Key: SPARK-48389
> URL: https://issues.apache.org/jira/browse/SPARK-48389
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> After https://github.com/apache/spark/pull/46689, we don't need this anymore



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48373) Allow schema parameter of createDataFrame() to be length-1 list or tuple of StructType

2024-05-22 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook resolved SPARK-48373.
--
Resolution: Won't Fix

> Allow schema parameter of createDataFrame() to be length-1 list or tuple of 
> StructType
> --
>
> Key: SPARK-48373
> URL: https://issues.apache.org/jira/browse/SPARK-48373
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.1
>Reporter: Ian Cook
>Priority: Major
>  Labels: pull-request-available
>
> Currently in PySpark (both Classic and Connect), if a user passes a length-1 
> list or tuple of {{StructType}} as the {{schema}} argument to 
> {{{}createDataFrame{}}}, PySpark raises an unhelpful error message.
> Unfortunately it is easy for this to happen, for example if a user leaves a 
> trailing comma at the end of the line that defines the {{{}StructType{}}}.
> Add some simple code to {{createDataFrame}} to handle this case more 
> gracefully.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-48373) Allow schema parameter of createDataFrame() to be length-1 list or tuple of StructType

2024-05-22 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook closed SPARK-48373.


> Allow schema parameter of createDataFrame() to be length-1 list or tuple of 
> StructType
> --
>
> Key: SPARK-48373
> URL: https://issues.apache.org/jira/browse/SPARK-48373
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.1
>Reporter: Ian Cook
>Priority: Major
>  Labels: pull-request-available
>
> Currently in PySpark (both Classic and Connect), if a user passes a length-1 
> list or tuple of {{StructType}} as the {{schema}} argument to 
> {{{}createDataFrame{}}}, PySpark raises an unhelpful error message.
> Unfortunately it is easy for this to happen, for example if a user leaves a 
> trailing comma at the end of the line that defines the {{{}StructType{}}}.
> Add some simple code to {{createDataFrame}} to handle this case more 
> gracefully.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48389) Remove obsolete workflow cancel_duplicate_workflow_runs

2024-05-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48389:


Assignee: Hyukjin Kwon

> Remove obsolete workflow cancel_duplicate_workflow_runs
> ---
>
> Key: SPARK-48389
> URL: https://issues.apache.org/jira/browse/SPARK-48389
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> After https://github.com/apache/spark/pull/46689, we don't need this anymore



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48389) Remove obsolete workflow cancel_duplicate_workflow_runs

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48389:
---
Labels: pull-request-available  (was: )

> Remove obsolete workflow cancel_duplicate_workflow_runs
> ---
>
> Key: SPARK-48389
> URL: https://issues.apache.org/jira/browse/SPARK-48389
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> After https://github.com/apache/spark/pull/46689, we don't need this anymore



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48389) Remove obsolete workflow cancel_duplicate_workflow_runs

2024-05-22 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-48389:


 Summary: Remove obsolete workflow cancel_duplicate_workflow_runs
 Key: SPARK-48389
 URL: https://issues.apache.org/jira/browse/SPARK-48389
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


After https://github.com/apache/spark/pull/46689, we don't need this anymore



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48388) Fix SET behavior for scripts

2024-05-22 Thread David Milicevic (Jira)
David Milicevic created SPARK-48388:
---

 Summary: Fix SET behavior for scripts
 Key: SPARK-48388
 URL: https://issues.apache.org/jira/browse/SPARK-48388
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


By the standard, SET is used to set variable values in SQL scripts.

On our end, SET is configured to work with some Hive configs, so the grammar is 
a bit tangled; for that reason it was decided to use SET VAR instead of SET to 
work with SQL variables.

This is not standard-compliant, so we should figure out a way to use SET for 
SQL variables and forbid setting Hive configs from SQL scripts.

 

For more details, the design doc can be found in the parent Jira item.
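
A small sketch of the split described above, assuming the session-variable syntax already available in Spark (DECLARE VARIABLE / SET VAR); how SET should behave inside SQL scripts is exactly what this ticket is about:
{code:scala}
// Today, outside of SQL scripting:
spark.sql("DECLARE VARIABLE v INT DEFAULT 1")
spark.sql("SET VAR v = 2")                       // assigns the SQL variable
spark.sql("SET spark.sql.shuffle.partitions=8")  // plain SET goes to configs, not variables
{code}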



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48379) Cancel build during a PR when a new commit is pushed

2024-05-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48379:


Assignee: Stefan Kandic

> Cancel build during a PR when a new commit is pushed
> 
>
> Key: SPARK-48379
> URL: https://issues.apache.org/jira/browse/SPARK-48379
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Creating a new commit on a branch should cancel the build of previous commits 
> for the same branch.
> Exceptions are master and branch-* branches where we still want to have 
> concurrent builds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48379) Cancel build during a PR when a new commit is pushed

2024-05-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48379.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46689
[https://github.com/apache/spark/pull/46689]

> Cancel build during a PR when a new commit is pushed
> 
>
> Key: SPARK-48379
> URL: https://issues.apache.org/jira/browse/SPARK-48379
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Creating a new commit on a branch should cancel the build of previous commits 
> for the same branch.
> Exceptions are master and branch-* branches where we still want to have 
> concurrent builds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48379) Cancel build during a PR when a new commit is pushed

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48379:
--

Assignee: (was: Apache Spark)

> Cancel build during a PR when a new commit is pushed
> 
>
> Key: SPARK-48379
> URL: https://issues.apache.org/jira/browse/SPARK-48379
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
>
> Creating a new commit on a branch should cancel the build of previous commits 
> for the same branch.
> Exceptions are master and branch-* branches where we still want to have 
> concurrent builds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48379) Cancel build during a PR when a new commit is pushed

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48379:
--

Assignee: Apache Spark

> Cancel build during a PR when a new commit is pushed
> 
>
> Key: SPARK-48379
> URL: https://issues.apache.org/jira/browse/SPARK-48379
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Creating a new commit on a branch should cancel the build of previous commits 
> for the same branch.
> Exceptions are master and branch-* branches where we still want to have 
> concurrent builds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48370) Checkpoint and localCheckpoint in Scala Spark Connect client

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48370:
--

Assignee: Apache Spark

> Checkpoint and localCheckpoint in Scala Spark Connect client
> 
>
> Key: SPARK-48370
> URL: https://issues.apache.org/jira/browse/SPARK-48370
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> SPARK-48258 implemented checkpoint and localCheckpoint in the Python Spark 
> Connect client. We should do the same in the Scala client.
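
For context, a minimal sketch of the intended usage, mirroring the existing Dataset API on the classic Scala client (exposing the same calls through the Connect client is what this ticket adds); the checkpoint directory is just an example path:
{code:scala}
val df = spark.range(10).toDF("id")

// Local checkpoint: truncates lineage using executor-local storage, no checkpoint dir needed.
val local = df.localCheckpoint(eager = true)

// Reliable checkpoint: requires a checkpoint directory to be configured first.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
val reliable = df.checkpoint()
{code}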



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48370) Checkpoint and localCheckpoint in Scala Spark Connect client

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48370:
--

Assignee: (was: Apache Spark)

> Checkpoint and localCheckpoint in Scala Spark Connect client
> 
>
> Key: SPARK-48370
> URL: https://issues.apache.org/jira/browse/SPARK-48370
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> SPARK-48258 implemented checkpoint and localCheckpoint in the Python Spark 
> Connect client. We should do the same in the Scala client.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48311) Nested pythonUDF in groupBy and aggregate result in Binding Exception

2024-05-22 Thread Sumit Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848516#comment-17848516
 ] 

Sumit Singh commented on SPARK-48311:
-

Looks like the issue is due to the *expr.transformUp* code in 
ExtractGroupingPythonUDFFromAggregate:
{code:java}
val aggExpr = agg.aggregateExpressions.map { expr =>
  expr.transformUp {
// PythonUDF over aggregate was pull out by ExtractPythonUDFFromAggregate.
// PythonUDF here should be either
// 1. Argument of an aggregate function.
//CheckAnalysis guarantees the arguments are deterministic.
// 2. PythonUDF in grouping key. Grouping key must be deterministic.
// 3. PythonUDF not in grouping key. It is either no arguments or with 
grouping key
// in its arguments. Such PythonUDF was pull out by 
ExtractPythonUDFFromAggregate, too.
case p: PythonUDF if p.udfDeterministic =>
  val canonicalized = p.canonicalized.asInstanceOf[PythonUDF]
  attributeMap.getOrElse(canonicalized, p) {code}

If we have udf1("a") and udf2(udf1("a")) in the grouping expressions and 
udf2(udf1("a")) in the aggregate expressions, then because *expr.transformUp* 
rewrites bottom-up, for the expression udf2(udf1("a")):
1. udf1("a") is matched first and replaced by its canonicalized grouping 
attribute, some groupingUDF#.
2. The expression then becomes udf2(groupingUDF#).
3. This rewritten expression is no longer found in the cache, so it is added 
as is.

I think this should be changed to expr.transformDown, as is done in the 
grouping section of the same class.

Details: 
[https://docs.google.com/document/d/1RXOOCsaFU-E1ZmXJnSrJ2jdRgYQgGAVxDPeFNNaoB1M/edit?usp=sharing]
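
A toy, self-contained sketch (plain case classes, not catalyst code) of why the bottom-up rewrite misses the outer match:
{code:scala}
sealed trait Expr
case class Udf(name: String, child: Expr) extends Expr
case class Col(name: String) extends Expr
case class Ref(name: String) extends Expr // stands in for the groupingPythonUDF attribute

// Simplified bottom-up rewrite: children first, then the rule on the rebuilt node.
def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
  val withNewChildren = e match {
    case Udf(n, c) => Udf(n, transformUp(c)(rule))
    case other     => other
  }
  rule.applyOrElse(withNewChildren, (x: Expr) => x)
}

// Cache keyed by the *original* grouping expressions, as attributeMap is.
val cache = Map[Expr, Expr](
  Udf("udf1", Col("a"))              -> Ref("groupingUDF#1"),
  Udf("udf2", Udf("udf1", Col("a"))) -> Ref("groupingUDF#2"))

val aggExpr = Udf("udf2", Udf("udf1", Col("a")))

// Bottom-up: the inner udf1 is replaced first, so the outer lookup key becomes
// Udf("udf2", Ref("groupingUDF#1")), which is not in the cache, and the outer UDF survives.
println(transformUp(aggExpr) { case e if cache.contains(e) => cache(e) })
// Udf(udf2,Ref(groupingUDF#1))
{code}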

 

> Nested pythonUDF in groupBy and aggregate result in Binding Exception 
> --
>
> Key: SPARK-48311
> URL: https://issues.apache.org/jira/browse/SPARK-48311
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.3.2
>Reporter: Sumit Singh
>Priority: Major
>
> Steps to Reproduce 
> 1. Data creation
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, LongType, 
> TimestampType, StringType
> from datetime import datetime
> # Define the schema
> schema = StructType([
>     StructField("col1", LongType(), nullable=True),
>     StructField("col2", TimestampType(), nullable=True),
>     StructField("col3", StringType(), nullable=True)
> ])
> # Define the data
> data = [
>     (1, datetime(2023, 5, 15, 12, 30), "Discount"),
>     (2, datetime(2023, 5, 16, 16, 45), "Promotion"),
>     (3, datetime(2023, 5, 17, 9, 15), "Coupon")
> ]
> # Create the DataFrame
> df = spark.createDataFrame(data, schema)
> df.createOrReplaceTempView("temp_offers")
> # Query the temporary table using SQL
> # DISTINCT required to reproduce the issue. 
> testDf = spark.sql("""
>                     SELECT DISTINCT 
>                     col1,
>                     col2,
>                     col3 FROM temp_offers
>                     """) {code}
> 2. UDF registration 
> {code:java}
> import pyspark.sql.functions as F 
> import pyspark.sql.types as T
> #Creating udf functions 
> def udf1(d):
>     return d
> def udf2(d):
>     if d.isoweekday() in (1, 2, 3, 4):
>         return 'WEEKDAY'
>     else:
>         return 'WEEKEND'
> udf1_name = F.udf(udf1, T.TimestampType())
> udf2_name = F.udf(udf2, T.StringType()) {code}
> 3. Adding UDF in grouping and agg
> {code:java}
> groupBy_cols = ['col1', 'col4', 'col5', 'col3']
> temp = testDf \
>   .select('*', udf1_name(F.col('col2')).alias('col4')).select('*', 
> udf2_name('col4').alias('col5')) 
> result = 
> (temp.groupBy(*groupBy_cols).agg(F.countDistinct('col5').alias('col6'))){code}
> 4. Result
> {code:java}
> result.show(5, False) {code}
> *We get below error*
> {code:java}
> An error was encountered:
> An error occurred while calling o1079.showString.
> : java.lang.IllegalStateException: Couldn't find pythonUDF0#1108 in 
> [col1#978L,groupingPythonUDF#1104,groupingPythonUDF#1105,col3#980,count(pythonUDF0#1108)#1080L]
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48385) Migrate the jdbc driver of mariadb from 2.x to 3.x

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48385:
---
Labels: pull-request-available  (was: )

> Migrate the jdbc driver of mariadb from 2.x to 3.x
> --
>
> Key: SPARK-48385
> URL: https://issues.apache.org/jira/browse/SPARK-48385
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48387) Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48387:
---
Labels: pull-request-available  (was: )

> Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE
> ---
>
> Key: SPARK-48387
> URL: https://issues.apache.org/jira/browse/SPARK-48387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48311) Nested pythonUDF in groupBy and aggregate result in Binding Exception

2024-05-22 Thread kalyan s (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848511#comment-17848511
 ] 

kalyan s commented on SPARK-48311:
--

It seems something changed in ExtractGroupingPythonUDFFromAggregate with this 
commit: fdccd88c2a6dd18c9d446b63fccd5c6188ea125c
[~cloud_fan] Can you help with this change?
 
 

> Nested pythonUDF in groupBy and aggregate result in Binding Exception 
> --
>
> Key: SPARK-48311
> URL: https://issues.apache.org/jira/browse/SPARK-48311
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.3.2
>Reporter: Sumit Singh
>Priority: Major
>
> Steps to Reproduce 
> 1. Data creation
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, LongType, 
> TimestampType, StringType
> from datetime import datetime
> # Define the schema
> schema = StructType([
>     StructField("col1", LongType(), nullable=True),
>     StructField("col2", TimestampType(), nullable=True),
>     StructField("col3", StringType(), nullable=True)
> ])
> # Define the data
> data = [
>     (1, datetime(2023, 5, 15, 12, 30), "Discount"),
>     (2, datetime(2023, 5, 16, 16, 45), "Promotion"),
>     (3, datetime(2023, 5, 17, 9, 15), "Coupon")
> ]
> # Create the DataFrame
> df = spark.createDataFrame(data, schema)
> df.createOrReplaceTempView("temp_offers")
> # Query the temporary table using SQL
> # DISTINCT required to reproduce the issue. 
> testDf = spark.sql("""
>                     SELECT DISTINCT 
>                     col1,
>                     col2,
>                     col3 FROM temp_offers
>                     """) {code}
> 2. UDF registration 
> {code:java}
> import pyspark.sql.functions as F 
> import pyspark.sql.types as T
> #Creating udf functions 
> def udf1(d):
>     return d
> def udf2(d):
>     if d.isoweekday() in (1, 2, 3, 4):
>         return 'WEEKDAY'
>     else:
>         return 'WEEKEND'
> udf1_name = F.udf(udf1, T.TimestampType())
> udf2_name = F.udf(udf2, T.StringType()) {code}
> 3. Adding UDF in grouping and agg
> {code:java}
> groupBy_cols = ['col1', 'col4', 'col5', 'col3']
> temp = testDf \
>   .select('*', udf1_name(F.col('col2')).alias('col4')).select('*', 
> udf2_name('col4').alias('col5')) 
> result = 
> (temp.groupBy(*groupBy_cols).agg(F.countDistinct('col5').alias('col6'))){code}
> 4. Result
> {code:java}
> result.show(5, False) {code}
> *We get below error*
> {code:java}
> An error was encountered:
> An error occurred while calling o1079.showString.
> : java.lang.IllegalStateException: Couldn't find pythonUDF0#1108 in 
> [col1#978L,groupingPythonUDF#1104,groupingPythonUDF#1105,col3#980,count(pythonUDF0#1108)#1080L]
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48387) Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE

2024-05-22 Thread Kent Yao (Jira)
Kent Yao created SPARK-48387:


 Summary: Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE
 Key: SPARK-48387
 URL: https://issues.apache.org/jira/browse/SPARK-48387
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48386) Replace JVM assert with JUnit Assert in tests

2024-05-22 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48386:
---

 Summary: Replace JVM assert with JUnit Assert in tests
 Key: SPARK-48386
 URL: https://issues.apache.org/jira/browse/SPARK-48386
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47920) Add documentation for python streaming data source

2024-05-22 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-47920.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46139
[https://github.com/apache/spark/pull/46139]

> Add documentation for python streaming data source
> --
>
> Key: SPARK-47920
> URL: https://issues.apache.org/jira/browse/SPARK-47920
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SS
>Affects Versions: 4.0.0
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add documentation (user guide) for the Python data source API.
> The doc should explain how to develop and use DataSourceStreamReader and 
> DataSourceStreamWriter.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45716) Python parity method StructType.treeString

2024-05-22 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-45716.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46685
[https://github.com/apache/spark/pull/46685]

> Python parity method StructType.treeString
> --
>
> Key: SPARK-45716
> URL: https://issues.apache.org/jira/browse/SPARK-45716
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Khalid Mammadov
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add the missing parity method from Scala to Python.
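
For reference, a minimal sketch of the existing Scala behaviour that the Python method mirrors (the printed output is shown as an approximate comment):
{code:scala}
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("props", MapType(StringType, StringType))))

// Prints an indented tree of the schema, roughly:
// root
//  |-- id: long (nullable = true)
//  |-- props: map (nullable = true)
//  |    |-- key: string
//  |    |-- value: string (valueContainsNull = true)
println(schema.treeString)
{code}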



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48372) Implement `StructType.treeString`

2024-05-22 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-48372.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46685
[https://github.com/apache/spark/pull/46685]

> Implement `StructType.treeString`
> -
>
> Key: SPARK-48372
> URL: https://issues.apache.org/jira/browse/SPARK-48372
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48385) Migrate the jdbc driver of mariadb from 2.x to 3.x

2024-05-22 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-48385:

Summary: Migrate the jdbc driver of mariadb from 2.x to 3.x  (was: Migrate 
the driver of mariadb from 2.x to 3.x)

> Migrate the jdbc driver of mariadb from 2.x to 3.x
> --
>
> Key: SPARK-48385
> URL: https://issues.apache.org/jira/browse/SPARK-48385
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org