[jira] [Updated] (SPARK-48421) SPJ: Add documentation

2024-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48421:
---
Labels: pull-request-available  (was: )

> SPJ: Add documentation
> --
>
> Key: SPARK-48421
> URL: https://issues.apache.org/jira/browse/SPARK-48421
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Szehon Ho
>Priority: Major
>  Labels: pull-request-available
>
> As part of SPARK-48329, we mentioned "Storage Partition Join" but noticed that 
> there is no documentation describing it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48421) SPJ: Add documentation

2024-05-25 Thread Szehon Ho (Jira)
Szehon Ho created SPARK-48421:
-

 Summary: SPJ: Add documentation
 Key: SPARK-48421
 URL: https://issues.apache.org/jira/browse/SPARK-48421
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 4.0.0
Reporter: Szehon Ho


As part of SPARK-48329, we mentioned "Storage Partition Join" but noticed that there 
is no documentation describing it.
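
As context for the documentation work, a minimal illustrative sketch of how SPJ can be 
enabled (table names are placeholders; the config key exists since Spark 3.3, and 
additional requirements such as compatible V2 partition transforms apply):
{code:java}
// Illustrative sketch only: SPJ requires DataSource V2 tables whose reported
// partitioning is compatible with the join keys. Table names are hypothetical.
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")

val joined = spark.table("cat.db.orders")
  .join(spark.table("cat.db.customers"), Seq("customer_id"))
// When SPJ applies, the shuffle before the join is avoided.
{code}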



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48420) Upgrade netty to `4.1.110.Final`

2024-05-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48420:
---
Labels: pull-request-available  (was: )

> Upgrade netty to `4.1.110.Final`
> 
>
> Key: SPARK-48420
> URL: https://issues.apache.org/jira/browse/SPARK-48420
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48420) Upgrade netty to `4.1.110.Final`

2024-05-24 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48420:
---

 Summary: Upgrade netty to `4.1.110.Final`
 Key: SPARK-48420
 URL: https://issues.apache.org/jira/browse/SPARK-48420
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48419) Foldable propagation replace foldable column should use origin column

2024-05-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48419:
---
Labels: pull-request-available  (was: )

> Foldable propagation replace foldable column should use origin column
> -
>
> Key: SPARK-48419
> URL: https://issues.apache.org/jira/browse/SPARK-48419
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.2.4, 4.0.0, 3.5.1, 3.3.4
>Reporter: KnightChess
>Priority: Major
>  Labels: pull-request-available
>
> The column name will be changed by `FoldablePropagation` in the optimizer.
> Before optimization:
> ```shell
> 'Project ['x, 'y, 'z]
> +- 'Project ['a AS x#112, str AS Y#113, 'b AS z#114]
>    +- LocalRelation , [a#0, b#1]
> ```
> After optimization:
> ```shell
> Project [x#112, str AS Y#113, z#114]
> +- Project [a#0 AS x#112, str AS Y#113, b#1 AS z#114]
>    +- LocalRelation , [a#0, b#1]
> ```
> The column name `y` will be replaced with 'Y'.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48419) Foldable propagation replace foldable column should use origin column

2024-05-24 Thread KnightChess (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KnightChess updated SPARK-48419:

Summary: Foldable propagation replace foldable column should use origin 
column  (was: Foldable propagation change output schema)

> Foldable propagation replace foldable column should use origin column
> -
>
> Key: SPARK-48419
> URL: https://issues.apache.org/jira/browse/SPARK-48419
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.2.4, 4.0.0, 3.5.1, 3.3.4
>Reporter: KnightChess
>Priority: Major
>
> The column name will be changed by `FoldablePropagation` in the optimizer.
> Before optimization:
> ```shell
> 'Project ['x, 'y, 'z]
> +- 'Project ['a AS x#112, str AS Y#113, 'b AS z#114]
>    +- LocalRelation , [a#0, b#1]
> ```
> After optimization:
> ```shell
> Project [x#112, str AS Y#113, z#114]
> +- Project [a#0 AS x#112, str AS Y#113, b#1 AS z#114]
>    +- LocalRelation , [a#0, b#1]
> ```
> The column name `y` will be replaced with 'Y'.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48419) Foldable propagation change output schema

2024-05-24 Thread KnightChess (Jira)
KnightChess created SPARK-48419:
---

 Summary: Foldable propagation change output schema
 Key: SPARK-48419
 URL: https://issues.apache.org/jira/browse/SPARK-48419
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.4, 3.5.1, 3.2.4, 3.1.3, 3.0.3, 4.0.0
Reporter: KnightChess


The column name will be changed by `FoldablePropagation` in the optimizer.

Before optimization:

```shell

'Project ['x, 'y, 'z]
+- 'Project ['a AS x#112, str AS Y#113, 'b AS z#114]
   +- LocalRelation , [a#0, b#1]

```

After optimization:

```shell

Project [x#112, str AS Y#113, z#114]
+- Project [a#0 AS x#112, str AS Y#113, b#1 AS z#114]
   +- LocalRelation , [a#0, b#1]

```

The column name `y` will be replaced with 'Y'.
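
A minimal repro sketch of the reported behavior (names mirror the plans above; `spark` 
is an existing SparkSession):

```scala
// Select a foldable literal aliased "Y", then reference it as "y" in a later select.
// After FoldablePropagation the reported output column name becomes "Y" instead of "y".
import spark.implicits._
import org.apache.spark.sql.functions.{col, lit}

val df  = Seq((1, 2)).toDF("a", "b")
val out = df
  .select(col("a").as("x"), lit("str").as("Y"), col("b").as("z"))
  .select("x", "y", "z")   // "y" resolves case-insensitively against the alias "Y"

out.printSchema()          // reportedly prints x, Y, z rather than x, y, z
```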



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48418) Spark structured streaming: Add microbatch timestamp to foreachBatch method

2024-05-24 Thread Anil Dasari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anil Dasari updated SPARK-48418:

Description: 
We are on Spark 3.x, using Spark DStream + Kafka, and planning to move to 
Structured Streaming + Kafka.

Differences between the DStream and Structured Streaming microbatch metadata 
are making the migration difficult. 

DStream#foreachRDD provides both the microbatch RDD and its start timestamp (as a 
long). However, Structured Streaming's Dataset#foreachBatch provides only the 
microbatch dataset and the batch ID, which is a plain sequence number. 

The microbatch start time is used across our data pipelines and in the final result. 

Could you add the microbatch start timestamp to the Dataset#foreachBatch method?

Pseudo code:

 
{code:java}
val inputStream = sparkSession.readStream.format("rate").load

inputStream
  .writeStream
  .trigger(Trigger.ProcessingTime(10 * 1000))
  .foreachBatch {
(ds: Dataset[Row], batchId: Long, batchTime: Long) => // batchTime is 
microbatch triggered/start timestamp
  
 // application logic.
  }
  .start()
  .awaitTermination() {code}
 

Implementation approach, where batchTime is the trigger executor's execution time:

(`currentTriggerStartTimestamp` could also be used as the batch time. The trigger 
executor time is the source of the microbatch and can easily be added to the query 
processor event as well.)

1. Add trigger time to 
[TriggerExecutor|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala#L31]
 
{code:java}
trait TriggerExecutor {
  // added batchTime (Long) argument 
  def execute(batchRunner: (MicroBatchExecutionContext, Long) => Boolean): Unit 

  ... // (other methods)
}{code}
2. Update ProcessingTimeExecutor and other executors to pass trigger time.
{code:java}
override def execute(triggerHandler: (MicroBatchExecutionContext, Long) => 
Boolean): Unit = {
  while (true) {
val triggerTimeMs = clock.getTimeMillis()
val nextTriggerTimeMs = nextBatchTime(triggerTimeMs)

// pass triggerTimeMs to runOneBatch which invokes triggerHandler and is 
used in MicroBatchExecution#runActivatedStream method.
    val terminated = !runOneBatch(triggerHandler, triggerTimeMs)

   ...
  }
} {code}
3. Add an executionTime (long) argument to the MicroBatchExecution#executeOneBatch 
method 
[here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L330]

4. Pass the execution time in 
[runBatch|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L380C11-L380C19]
 and 
[here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L849]

5. Finally, add the following `foreachBatch` method to `DataStreamWriter`, update the 
existing `foreachBatch` methods for the new argument, and also add it to the query 
processor event. 
{code:java}
def foreachBatch(function: (Dataset[T], Long, Long) => Unit): 
DataStreamWriter[T] = {
  this.source = SOURCE_NAME_FOREACH_BATCH
  if (function == null) throw new IllegalArgumentException("foreachBatch 
function cannot be null")
  this.foreachBatchWriter = function
  this
}{code}
Let me know your thoughts. 

  was:
We are on Spark 3.x, using Spark DStream + Kafka, and planning to move to 
Structured Streaming + Kafka.

Differences between the DStream and Structured Streaming microbatch metadata 
are making the migration difficult. 

DStream#foreachRDD provides both the microbatch RDD and its start timestamp (as a 
long). However, Structured Streaming's Dataset#foreachBatch provides only the 
microbatch dataset and the batch ID, which is a plain sequence number. 

The microbatch start time is used across our data pipelines and in the final result. 

Could you add the microbatch start timestamp to the Dataset#foreachBatch method?

Pseudo code:

 
{code:java}
val inputStream = sparkSession.readStream.format("rate").load

inputStream
  .writeStream
  .trigger(Trigger.ProcessingTime(10 * 1000))
  .foreachBatch {
(ds: Dataset[Row], batchId: Long, batchTime: Long) => // batchTime is 
microbatch triggered/start timestamp
  
 // application logic.
  }
  .start()
  .awaitTermination() {code}
 

Implementation approach, where batchTime is the trigger executor's execution time:

(`currentTriggerStartTimestamp` could also be used as the batch time. The trigger 
executor time is the source of the microbatch and can easily be added to the query 
processor event as well.)
 # Add trigger time to 
[TriggerExecutor|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala#L31]
 
{code:java}
trait TriggerExecutor {
  // added batchTime (Long) argument 
  def execute(batchRunner: (MicroBatchExecutionContext, Long) => 

[jira] [Created] (SPARK-48418) Spark structured streaming: Add microbatch timestamp to foreachBatch method

2024-05-24 Thread Anil Dasari (Jira)
Anil Dasari created SPARK-48418:
---

 Summary: Spark structured streaming: Add microbatch timestamp to 
foreachBatch method
 Key: SPARK-48418
 URL: https://issues.apache.org/jira/browse/SPARK-48418
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.5.1
Reporter: Anil Dasari


We are on Spark 3.x, using Spark DStream + Kafka, and planning to move to 
Structured Streaming + Kafka.

Differences between the DStream and Structured Streaming microbatch metadata 
are making the migration difficult. 

DStream#foreachRDD provides both the microbatch RDD and its start timestamp (as a 
long). However, Structured Streaming's Dataset#foreachBatch provides only the 
microbatch dataset and the batch ID, which is a plain sequence number. 

The microbatch start time is used across our data pipelines and in the final result. 

Could you add the microbatch start timestamp to the Dataset#foreachBatch method?

Pseudo code:

 
{code:java}
val inputStream = sparkSession.readStream.format("rate").load

inputStream
  .writeStream
  .trigger(Trigger.ProcessingTime(10 * 1000))
  .foreachBatch {
(ds: Dataset[Row], batchId: Long, batchTime: Long) => // batchTime is 
microbatch triggered/start timestamp
  
 // application logic.
  }
  .start()
  .awaitTermination() {code}
 

Implementation approach, where batchTime is the trigger executor's execution time:

(`currentTriggerStartTimestamp` could also be used as the batch time. The trigger 
executor time is the source of the microbatch and can easily be added to the query 
processor event as well.)
 # Add trigger time to 
[TriggerExecutor|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala#L31]
 
{code:java}
trait TriggerExecutor {
  // added batchTime (Long) argument 
  def execute(batchRunner: (MicroBatchExecutionContext, Long) => Boolean): Unit 

  ... // (other methods)
}{code}

 # Update ProcessingTimeExecutor and other executors to pass trigger time.
{code:java}
override def execute(triggerHandler: (MicroBatchExecutionContext, Long) => 
Boolean): Unit = {
  while (true) {
val triggerTimeMs = clock.getTimeMillis()
val nextTriggerTimeMs = nextBatchTime(triggerTimeMs)

// pass triggerTimeMs to runOneBatch which invokes triggerHandler and is 
used in MicroBatchExecution#runActivatedStream method.
    val terminated = !runOneBatch(triggerHandler, triggerTimeMs)

   ...
  }
} {code}
 

 # Add an executionTime (long) argument to the MicroBatchExecution#executeOneBatch 
method 
[here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L330]

 # Pass the execution time in 
[runBatch|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L380C11-L380C19]
 and 
[here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L849]

 # Finally, add the following `foreachBatch` method to `DataStreamWriter`, update the 
existing `foreachBatch` methods for the new argument, and also add it to the query 
processor event. 
{code:java}
def foreachBatch(function: (Dataset[T], Long, Long) => Unit): 
DataStreamWriter[T] = {
  this.source = SOURCE_NAME_FOREACH_BATCH
  if (function == null) throw new IllegalArgumentException("foreachBatch 
function cannot be null")
  this.foreachBatchWriter = function
  this
}{code}
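
Until such an API exists, a possible interim workaround (a sketch only, building on the 
pseudo code above, not the proposed change) is to record the wall-clock time at the top 
of the batch function; note this approximates when the batch starts processing, not the 
trigger timestamp requested above:
{code:java}
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.streaming.Trigger

val inputStream = sparkSession.readStream.format("rate").load

// Interim sketch: capture wall-clock time when the batch function starts.
val handler: (Dataset[Row], Long) => Unit = (ds, batchId) => {
  val batchStartMs = System.currentTimeMillis()
  // application logic can use (batchId, batchStartMs)
}

inputStream
  .writeStream
  .trigger(Trigger.ProcessingTime(10 * 1000))
  .foreachBatch(handler)
  .start()
  .awaitTermination()
{code}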

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47041) PushDownUtils uses FileScanBuilder instead of SupportsPushDownCatalystFilters trait

2024-05-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47041:
---
Labels: pull-request-available  (was: )

> PushDownUtils uses FileScanBuilder instead of SupportsPushDownCatalystFilters 
> trait
> ---
>
> Key: SPARK-47041
> URL: https://issues.apache.org/jira/browse/SPARK-47041
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Никита Соколов
>Priority: Major
>  Labels: pull-request-available
>
> It could use an existing, more generic interface that looks like it was created 
> for exactly this purpose, but instead it uses a narrower type, forcing you to 
> extend FileScanBuilder when implementing a ScanBuilder.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48416) Support related nested WITH expression

2024-05-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48416:
---
Labels: pull-request-available  (was: )

> Support related nested WITH expression
> --
>
> Key: SPARK-48416
> URL: https://issues.apache.org/jira/browse/SPARK-48416
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Mingliang Zhu
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48394) Cleanup mapIdToMapIndex on mapoutput unregister

2024-05-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48394.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46706
[https://github.com/apache/spark/pull/46706]

> Cleanup mapIdToMapIndex on mapoutput unregister
> ---
>
> Key: SPARK-48394
> URL: https://issues.apache.org/jira/browse/SPARK-48394
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> There is only one valid mapstatus for the same {{mapIndex}} at the same time 
> in Spark. {{mapIdToMapIndex}} should also follow the same rule to avoid 
> chaos.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48394) Cleanup mapIdToMapIndex on mapoutput unregister

2024-05-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48394:
-

Assignee: wuyi

> Cleanup mapIdToMapIndex on mapoutput unregister
> ---
>
> Key: SPARK-48394
> URL: https://issues.apache.org/jira/browse/SPARK-48394
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>  Labels: pull-request-available
>
> There is only one valid mapstatus for the same {{mapIndex}} at the same time 
> in Spark. {{mapIdToMapIndex}} should also follow the same rule to avoid 
> chaos.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48407) Teradata: Document Type Conversion rules between Spark SQL and teradata

2024-05-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48407.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46728
[https://github.com/apache/spark/pull/46728]

> Teradata: Document Type Conversion rules between Spark SQL and teradata
> ---
>
> Key: SPARK-48407
> URL: https://issues.apache.org/jira/browse/SPARK-48407
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48325) Always specify messages in ExecutorRunner.killProcess

2024-05-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48325.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46641
[https://github.com/apache/spark/pull/46641]

> Always specify messages in ExecutorRunner.killProcess
> -
>
> Key: SPARK-48325
> URL: https://issues.apache.org/jira/browse/SPARK-48325
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> For some of the cases in ExecutorRunner.killProcess, the argument `message` 
> is `None`. We should always specify the message so that we can get the 
> occurrence rate for different cases, in order to analyze executor running 
> stability.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48417) Filesystems do not load with spark.jars.packages configuration

2024-05-24 Thread Ravi Dalal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849399#comment-17849399
 ] 

Ravi Dalal commented on SPARK-48417:


For anyone facing this issue, use the following configuration to read files from GCS 
when spark.jars.packages is used:
{code:java}
config("spark.jars", "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-2.2.22.jar")
config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"){code}
When spark.jars.packages is not used, the following configuration alone works:
{code:java}
config("spark.jars", "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-2.2.22.jar")
config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS") {code}

> Filesystems do not load with spark.jars.packages configuration
> --
>
> Key: SPARK-48417
> URL: https://issues.apache.org/jira/browse/SPARK-48417
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.5.1
>Reporter: Ravi Dalal
>Priority: Major
> Attachments: pyspark_mleap.py, 
> pyspark_spark_jar_package_config_logs.txt, 
> pyspark_without_spark_jar_package_config_logs.txt
>
>
> When we use the spark.jars.packages configuration parameter in the Python 
> SparkSession builder (PySpark), it appears that the filesystems are not 
> loaded when the session starts. Because of this, Spark fails to read files from a 
> Google Cloud Storage (GCS) bucket (with the GCS Connector). 
> I tested this with different packages, so it does not appear to be specific to a 
> particular package. I will attach the sample code and debug logs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-48417) Filesystems do not load with spark.jars.packages configuration

2024-05-24 Thread Ravi Dalal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Dalal closed SPARK-48417.
--

> Filesystems do not load with spark.jars.packages configuration
> --
>
> Key: SPARK-48417
> URL: https://issues.apache.org/jira/browse/SPARK-48417
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.5.1
>Reporter: Ravi Dalal
>Priority: Major
> Attachments: pyspark_mleap.py, 
> pyspark_spark_jar_package_config_logs.txt, 
> pyspark_without_spark_jar_package_config_logs.txt
>
>
> When we use the spark.jars.packages configuration parameter in the Python 
> SparkSession builder (PySpark), it appears that the filesystems are not 
> loaded when the session starts. Because of this, Spark fails to read files from a 
> Google Cloud Storage (GCS) bucket (with the GCS Connector). 
> I tested this with different packages, so it does not appear to be specific to a 
> particular package. I will attach the sample code and debug logs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48417) Filesystems do not load with spark.jars.packages configuration

2024-05-24 Thread Ravi Dalal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Dalal resolved SPARK-48417.

Resolution: Not A Problem

Apologies. We missed a configuration parameter. Found it after creating this 
bug. Resolving the bug now.

> Filesystems do not load with spark.jars.packages configuration
> --
>
> Key: SPARK-48417
> URL: https://issues.apache.org/jira/browse/SPARK-48417
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.5.1
>Reporter: Ravi Dalal
>Priority: Major
> Attachments: pyspark_mleap.py, 
> pyspark_spark_jar_package_config_logs.txt, 
> pyspark_without_spark_jar_package_config_logs.txt
>
>
> When we use the spark.jars.packages configuration parameter in the Python 
> SparkSession builder (PySpark), it appears that the filesystems are not 
> loaded when the session starts. Because of this, Spark fails to read files from a 
> Google Cloud Storage (GCS) bucket (with the GCS Connector). 
> I tested this with different packages, so it does not appear to be specific to a 
> particular package. I will attach the sample code and debug logs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48417) Filesystems do not load with spark.jars.packages configuration

2024-05-24 Thread Ravi Dalal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Dalal updated SPARK-48417:
---
Attachment: pyspark_mleap.py
pyspark_spark_jar_package_config_logs.txt
pyspark_without_spark_jar_package_config_logs.txt

> Filesystems do not load with spark.jars.packages configuration
> --
>
> Key: SPARK-48417
> URL: https://issues.apache.org/jira/browse/SPARK-48417
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.5.1
>Reporter: Ravi Dalal
>Priority: Major
> Attachments: pyspark_mleap.py, 
> pyspark_spark_jar_package_config_logs.txt, 
> pyspark_without_spark_jar_package_config_logs.txt
>
>
> When we use the spark.jars.packages configuration parameter in the Python 
> SparkSession builder (PySpark), it appears that the filesystems are not 
> loaded when the session starts. Because of this, Spark fails to read files from a 
> Google Cloud Storage (GCS) bucket (with the GCS Connector). 
> I tested this with different packages, so it does not appear to be specific to a 
> particular package. I will attach the sample code and debug logs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48417) Filesystems do not load with spark.jars.packages configuration

2024-05-24 Thread Ravi Dalal (Jira)
Ravi Dalal created SPARK-48417:
--

 Summary: Filesystems do not load with spark.jars.packages 
configuration
 Key: SPARK-48417
 URL: https://issues.apache.org/jira/browse/SPARK-48417
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 3.5.1
Reporter: Ravi Dalal


When we use the spark.jars.packages configuration parameter in the Python SparkSession 
builder (PySpark), it appears that the filesystems are not loaded when the session 
starts. Because of this, Spark fails to read files from a Google Cloud Storage 
(GCS) bucket (with the GCS Connector). 

I tested this with different packages, so it does not appear to be specific to a 
particular package. I will attach the sample code and debug logs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark

2024-05-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48411:
---
Labels: pull-request-available  (was: )

> Add E2E test for DropDuplicateWithinWatermark
> -
>
> Key: SPARK-48411
> URL: https://issues.apache.org/jira/browse/SPARK-48411
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SS
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Priority: Major
>  Labels: pull-request-available
>
> Currently we do not have an e2e test for DropDuplicateWithinWatermark; we 
> should add one. We can simply take one of the tests written in Scala here (using 
> the testStream API) and replicate it in Python:
> [https://github.com/apache/spark/commit/0e9e34c1bd9bd16ad5efca77ce2763eb950f3103]
>  
> The change should happen in 
> [https://github.com/apache/spark/blob/eee179135ed21dbdd8b342d053c9eda849e2de77/python/pyspark/sql/tests/streaming/test_streaming.py#L29]
>  
> so we can test it in both connect and non-connect.
>  
> Test with:
> ```
> python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming
> python/run-tests --testnames 
> pyspark.sql.tests.connect.streaming.test_parity_streaming
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45373) Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated

2024-05-24 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-45373:
-
Priority: Major  (was: Minor)

> Minimizing calls to HiveMetaStore layer for getting partitions,  when tables 
> are repeated
> -
>
> Key: SPARK-45373
> URL: https://issues.apache.org/jira/browse/SPARK-45373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Major
>  Labels: pull-request-available
>
> In the rule PruneFileSourcePartitions where the CatalogFileIndex gets 
> converted to InMemoryFileIndex,  the HMS calls can get very expensive if :
> 1) The translated filter string for push down to HMS layer becomes empty ,  
> resulting in fetching of all partitions and same table is referenced multiple 
> times in the query. 
> 2) Or just in case same table is referenced multiple times in the query with 
> different partition filters.
> In such cases current code would result in multiple calls to HMS layer. 
> This can be avoided by grouping the tables based on CatalogFileIndex and 
> passing a common minimum filter ( filter1 || filter2) and getting a base 
> PrunedInmemoryFileIndex which can become a basis for each of the specific 
> table.
> Opened following PR for ticket:
> [SPARK-45373-PR|https://github.com/apache/spark/pull/43183]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark

2024-05-24 Thread Yuchen Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849356#comment-17849356
 ] 

Yuchen Liu commented on SPARK-48411:


I will work on this.

> Add E2E test for DropDuplicateWithinWatermark
> -
>
> Key: SPARK-48411
> URL: https://issues.apache.org/jira/browse/SPARK-48411
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SS
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Priority: Major
>
> Currently we do not have an e2e test for DropDuplicateWithinWatermark; we 
> should add one. We can simply take one of the tests written in Scala here (using 
> the testStream API) and replicate it in Python:
> [https://github.com/apache/spark/commit/0e9e34c1bd9bd16ad5efca77ce2763eb950f3103]
>  
> The change should happen in 
> [https://github.com/apache/spark/blob/eee179135ed21dbdd8b342d053c9eda849e2de77/python/pyspark/sql/tests/streaming/test_streaming.py#L29]
>  
> so we can test it in both connect and non-connect.
>  
> Test with:
> ```
> python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming
> python/run-tests --testnames 
> pyspark.sql.tests.connect.streaming.test_parity_streaming
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark

2024-05-24 Thread Wei Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849348#comment-17849348
 ] 

Wei Liu commented on SPARK-48411:
-

Sorry I tagged the wrong Yuchen

> Add E2E test for DropDuplicateWithinWatermark
> -
>
> Key: SPARK-48411
> URL: https://issues.apache.org/jira/browse/SPARK-48411
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SS
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Priority: Major
>
> Currently we do not have an e2e test for DropDuplicateWithinWatermark; we 
> should add one. We can simply take one of the tests written in Scala here (using 
> the testStream API) and replicate it in Python:
> [https://github.com/apache/spark/commit/0e9e34c1bd9bd16ad5efca77ce2763eb950f3103]
>  
> The change should happen in 
> [https://github.com/apache/spark/blob/eee179135ed21dbdd8b342d053c9eda849e2de77/python/pyspark/sql/tests/streaming/test_streaming.py#L29]
>  
> so we can test it in both connect and non-connect.
>  
> Test with:
> ```
> python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming
> python/run-tests --testnames 
> pyspark.sql.tests.connect.streaming.test_parity_streaming
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48338) Sql Scripting support for Spark SQL

2024-05-24 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48338:

Attachment: [Design Doc] Sql Scripting - OSS.pdf

> Sql Scripting support for Spark SQL
> ---
>
> Key: SPARK-48338
> URL: https://issues.apache.org/jira/browse/SPARK-48338
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
> Attachments: Sql Scripting - OSS.odt, [Design Doc] Sql Scripting - 
> OSS.pdf
>
>
> Design doc for this feature is in attachment.
> High level example of Sql Script:
> ```
> BEGIN
>   DECLARE c INT = 10;
>   WHILE c > 0 DO
> INSERT INTO tscript VALUES (c);
> SET c = c - 1;
>   END WHILE;
> END
> ```
> High level motivation behind this feature:
> SQL Scripting gives customers the ability to develop complex ETL and analysis 
> entirely in SQL. Until now, customers have had to write verbose SQL 
> statements or combine SQL + Python to efficiently write business logic. 
> Coming from another system, customers have to choose whether or not they want 
> to migrate to pyspark. Some customers end up not using Spark because of this 
> gap. SQL Scripting is a key milestone towards enabling SQL practitioners to 
> write sophisticated queries, without the need to use pyspark. Further, SQL 
> Scripting is a necessary step towards support for SQL Stored Procedures, and 
> along with SQL Variables (released) and Temp Tables (in progress), will allow 
> for more seamless data warehouse migrations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48343) Interpreter support

2024-05-24 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48343:

Description: 
Implement interpreter for SQL scripting:
 * Interpreter
 * Interpreter testing

For more details, design doc can be found in parent Jira item.

Update design doc accordingly.

  was:
Implement interpreter for SQL scripting:
 * Interpreter
 * Interpreter testing

For more details, design doc can be found in parent Jira item.


> Interpreter support
> ---
>
> Key: SPARK-48343
> URL: https://issues.apache.org/jira/browse/SPARK-48343
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Implement interpreter for SQL scripting:
>  * Interpreter
>  * Interpreter testing
> For more details, design doc can be found in parent Jira item.
> Update design doc accordingly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48416) Support related nested WITH expression

2024-05-24 Thread Mingliang Zhu (Jira)
Mingliang Zhu created SPARK-48416:
-

 Summary: Support related nested WITH expression
 Key: SPARK-48416
 URL: https://issues.apache.org/jira/browse/SPARK-48416
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.1
Reporter: Mingliang Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48388) Fix SET behavior for scripts

2024-05-24 Thread David Milicevic (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849301#comment-17849301
 ] 

David Milicevic commented on SPARK-48388:
-

I already have the changes ready. I will work on this as soon as SPARK-48342 is 
completed.

> Fix SET behavior for scripts
> 
>
> Key: SPARK-48388
> URL: https://issues.apache.org/jira/browse/SPARK-48388
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> By standard, SET is used to set variable value in SQL scripts.
> On our end, SET is configured to work with some Hive configs, so the grammar 
> is a bit messed up and for that reason it was decided to use SET VAR instead 
> of SET to work with SQL variables.
> This is not by standard and we should figure out the way to be able to use 
> SET for SQL variables and forbid setting of Hive configs from SQL scripts.
>  
> For more details, design doc can be found in parent Jira item.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48415) TypeName support parameterized datatypes

2024-05-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48415:
---
Labels: pull-request-available  (was: )

> TypeName support parameterized datatypes
> 
>
> Key: SPARK-48415
> URL: https://issues.apache.org/jira/browse/SPARK-48415
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48415) TypeName support parameterized datatypes

2024-05-24 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-48415:
-

 Summary: TypeName support parameterized datatypes
 Key: SPARK-48415
 URL: https://issues.apache.org/jira/browse/SPARK-48415
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21720) Filter predicate with many conditions throw stackoverflow error

2024-05-24 Thread Abhishek Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849267#comment-17849267
 ] 

Abhishek Singh commented on SPARK-21720:


Hi, I am using Spark version 2.4 and I am getting the same issue. 
{code:java}
operation match {
  case "equals" =>
    val joinDf = filterDf.select(lower(col("field_value")).alias("field_value")).distinct()
    excludeDf = excludeDf.join(broadcast(joinDf), colLower === joinDf("field_value"), "left_anti")
  case "contains" =>
    values.foreach { value =>
      excludeDf = excludeDf.filter(!colLower.contains(value))
    }
} {code}
I am using this code to generate a filter condition on around 110 distinct values. 

The error I am getting is:


{panel:title=Error log}
glue.ProcessLauncher (Logging.scala:logError(70)): Exception in User Class: 
java.lang.StackOverflowError
org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:395)
org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:557)
org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:557)
org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:557)
org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:557){panel}
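
A hedged suggestion, not a confirmed fix: the ~110 chained {{filter}} calls in the 
"contains" branch can be expressed as a single combined predicate, and if the 
StackOverflowError still comes from Janino code generation, disabling whole-stage 
codegen is a commonly suggested mitigation (at a performance cost):
{code:java}
// Sketch only: one combined predicate instead of ~110 chained filter() calls.
val containsAny = values
  .map(v => colLower.contains(v))
  .reduce(_ || _)

excludeDf = excludeDf.filter(!containsAny)

// Commonly suggested (not guaranteed) mitigation if codegen still overflows:
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}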

> Filter predicate with many conditions throw stackoverflow error
> ---
>
>     Key: SPARK-21720
> URL: https://issues.apache.org/jira/browse/SPARK-21720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: srinivasan
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.2.1, 2.3.0
>
>
> When trying to filter on dataset with many predicate conditions on both spark 
> sql and dataset filter transformation as described below, spark throws a 
> stackoverflow exception
> Case 1: Filter Transformation on Data
> Dataset filter = sourceDataset.filter(String.format("not(%s)", 
> buildQuery()));
> filter.show();
> where buildQuery() returns
> Field1 = "" and  Field2 = "" and  Field3 = "" and  Field4 = "" and  Field5 = 
> "" and  BLANK_5 = "" and  Field7 = "" and  Field8 = "" and  Field9 = "" and  
> Field10 = "" and  Field11 = "" and  Field12 = "" and  Field13 = "" and  
> Field14 = "" and  Field15 = "" and  Field16 = "" and  Field17 = "" and  
> Field18 = "" and  Field19 = "" and  Field20 = "" and  Field21 = "" and  
> Field22 = "" and  Field23 = "" and  Field24 = "" and  Field25 = "" and  
> Field26 = "" and  Field27 = "" and  Field28 = "" and  Field29 = "" and  
> Field30 = "" and  Field31 = "" and  Field32 = "" and  Field33 = "" and  
> Field34 = "" and  Field35 = "" and  Field36 = "" and  Field37 = "" and  
> Field38 = "" and  Field39 = "" and  Field40 = "" and  Field41 = "" and  
> Field42 = "" and  Field43 = "" and  Field44 = "" and  Field45 = "" and  
> Field46 = "" and  Field47 = "" and  Field48 = "" and  Field49 = "" and  
> Field50 = "" and  Field51 = "" and  Field52 = "" and  Field53 = "" and  
> Field54 = "" and  Field55 = "" and  Field56 = "" and  Field57 = "" and  
> Field58 = "" and  Field59 = "" and  Field60 = "" and  Field61 = "" and  
> Field62 = "" and  Field63 = "" and  Field64 = "" and  Field65 = "" and  
> Field66 = "" and  Field67 = "" and  Field68 = "" and  Field69 = "" and  
> Field70 = "" and  Field71 = "" and  Field72 = "" and  Field73 = "" and  
> Field74 = "" and  Field75 = "" and  Field76 = "" and  Field77 = "" and  
> Field78 = "" and  Field79 = "" and  Field80 = "" and  Field81 = "" and  
> Field82 = "" and  Field83 = "" and  Field84 = "" and  Field85 = "" and  
> Field86 = "" and  Field87 = "" and  Field88 = "&q

[jira] [Updated] (SPARK-48414) Fix breaking change in python's `fromJson`

2024-05-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48414:
---
Labels: pull-request-available  (was: )

> Fix breaking change in python's `fromJson`
> --
>
> Key: SPARK-48414
> URL: https://issues.apache.org/jira/browse/SPARK-48414
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48168) Add bitwise shifting operators support

2024-05-24 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-48168.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46440
[https://github.com/apache/spark/pull/46440]

> Add bitwise shifting operators support
> --
>
> Key: SPARK-48168
> URL: https://issues.apache.org/jira/browse/SPARK-48168
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48412) Refactor data type json parse

2024-05-24 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-48412.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46733
[https://github.com/apache/spark/pull/46733]

> Refactor data type json parse
> -
>
> Key: SPARK-48412
> URL: https://issues.apache.org/jira/browse/SPARK-48412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48409) Upgrade MySQL & Postgres & mariadb docker image version

2024-05-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48409.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46704
[https://github.com/apache/spark/pull/46704]

> Upgrade MySQL & Postgres & mariadb docker image version
> ---
>
> Key: SPARK-48409
> URL: https://issues.apache.org/jira/browse/SPARK-48409
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48409) Upgrade MySQL & Postgres & mariadb docker image version

2024-05-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-48409:


Assignee: BingKun Pan

> Upgrade MySQL & Postgres & mariadb docker image version
> ---
>
> Key: SPARK-48409
> URL: https://issues.apache.org/jira/browse/SPARK-48409
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48409) Upgrade MySQL & Postgres & mariadb docker image version

2024-05-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48409:
---
Labels: pull-request-available  (was: )

> Upgrade MySQL & Postgres & mariadb docker image version
> ---
>
> Key: SPARK-48409
> URL: https://issues.apache.org/jira/browse/SPARK-48409
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48384) Exclude `io.netty:netty-tcnative-boringssl-static` from `zookeeper`

2024-05-24 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-48384.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46695
[https://github.com/apache/spark/pull/46695]

> Exclude `io.netty:netty-tcnative-boringssl-static` from `zookeeper`
> ---
>
> Key: SPARK-48384
> URL: https://issues.apache.org/jira/browse/SPARK-48384
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48384) Exclude `io.netty:netty-tcnative-boringssl-static` from `zookeeper`

2024-05-24 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-48384:


Assignee: BingKun Pan

> Exclude `io.netty:netty-tcnative-boringssl-static` from `zookeeper`
> ---
>
> Key: SPARK-48384
> URL: https://issues.apache.org/jira/browse/SPARK-48384
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48413) ALTER COLUMN with collation

2024-05-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48413:
---
Labels: pull-request-available  (was: )

> ALTER COLUMN with collation
> ---
>
> Key: SPARK-48413
> URL: https://issues.apache.org/jira/browse/SPARK-48413
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
>
> Add support for changing collation of a column with ALTER COLUMN command.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48412) Refactor data type json parse

2024-05-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48412:
---
Labels: pull-request-available  (was: )

> Refactor data type json parse
> -
>
> Key: SPARK-48412
> URL: https://issues.apache.org/jira/browse/SPARK-48412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48413) ALTER COLUMN with collation

2024-05-24 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-48413:
--
Epic Link: SPARK-46830

> ALTER COLUMN with collation
> ---
>
> Key: SPARK-48413
> URL: https://issues.apache.org/jira/browse/SPARK-48413
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
>
> Add support for changing collation of a column with ALTER COLUMN command.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48413) ALTER COLUMN with collation

2024-05-24 Thread Nikola Mandic (Jira)
Nikola Mandic created SPARK-48413:
-

 Summary: ALTER COLUMN with collation
 Key: SPARK-48413
 URL: https://issues.apache.org/jira/browse/SPARK-48413
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nikola Mandic


Add support for changing collation of a column with ALTER COLUMN command.
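
The intended usage might look like the sketch below (illustrative only; the exact 
syntax is precisely what this ticket adds, and the table and collation names are 
placeholders):
{code:java}
// Illustrative sketch; syntax subject to the final implementation of this ticket.
spark.sql("CREATE TABLE t (c STRING COLLATE UTF8_BINARY) USING parquet")
spark.sql("ALTER TABLE t ALTER COLUMN c TYPE STRING COLLATE UNICODE_CI")
{code}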



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48412) Refactor data type json parse

2024-05-24 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-48412:
-

 Summary: Refactor data type json parse
 Key: SPARK-48412
 URL: https://issues.apache.org/jira/browse/SPARK-48412
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48410) Fix InitCap expression

2024-05-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48410:
---
Labels: pull-request-available  (was: )

> Fix InitCap expression
> --
>
> Key: SPARK-48410
> URL: https://issues.apache.org/jira/browse/SPARK-48410
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark

2024-05-24 Thread Wei Liu (Jira)
Wei Liu created SPARK-48411:
---

 Summary: Add E2E test for DropDuplicateWithinWatermark
 Key: SPARK-48411
 URL: https://issues.apache.org/jira/browse/SPARK-48411
 Project: Spark
  Issue Type: New Feature
  Components: Connect, SS
Affects Versions: 4.0.0
Reporter: Wei Liu


Currently we do not have an e2e test for DropDuplicateWithinWatermark, so we should 
add one. We can simply take one of the tests written in Scala here (with the 
testStream API) and replicate it in Python:

[https://github.com/apache/spark/commit/0e9e34c1bd9bd16ad5efca77ce2763eb950f3103]

 

The change should happen in 
[https://github.com/apache/spark/blob/eee179135ed21dbdd8b342d053c9eda849e2de77/python/pyspark/sql/tests/streaming/test_streaming.py#L29]

 

so we can test it in both Connect and non-Connect modes.

 

Test with:

```
python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming
python/run-tests --testnames 
pyspark.sql.tests.connect.streaming.test_parity_streaming
```
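
For reference, here is a minimal Scala sketch of the behavior under test, written against the DataFrame API with an assumed rate source and console sink; it is only an illustration, not the testStream-based E2E test this ticket asks for:

```
import org.apache.spark.sql.SparkSession

// Minimal sketch: rows sharing the same id are deduplicated as long as they
// arrive within the watermark delay. Source/sink choices here are assumptions.
val spark = SparkSession.builder().appName("dedup-sketch").getOrCreate()
import spark.implicits._

val events = spark.readStream.format("rate").load()
  .withColumn("id", ($"value" % 10).cast("string"))
  .withColumnRenamed("timestamp", "eventTime")

val deduped = events
  .withWatermark("eventTime", "1 hour")
  .dropDuplicatesWithinWatermark("id")

val query = deduped.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```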



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48411) Add E2E test for DropDuplicateWithinWatermark

2024-05-24 Thread Wei Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849202#comment-17849202
 ] 

Wei Liu commented on SPARK-48411:
-

[~liuyuchen777] is going to work on this

> Add E2E test for DropDuplicateWithinWatermark
> -
>
> Key: SPARK-48411
> URL: https://issues.apache.org/jira/browse/SPARK-48411
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SS
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Priority: Major
>
> Currently we do not have an e2e test for DropDuplicateWithinWatermark, so we 
> should add one. We can simply take one of the tests written in Scala here (with 
> the testStream API) and replicate it in Python:
> [https://github.com/apache/spark/commit/0e9e34c1bd9bd16ad5efca77ce2763eb950f3103]
>  
> The change should happen in 
> [https://github.com/apache/spark/blob/eee179135ed21dbdd8b342d053c9eda849e2de77/python/pyspark/sql/tests/streaming/test_streaming.py#L29]
>  
> so we can test it in both Connect and non-Connect modes.
>  
> Test with:
> ```
> python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming
> python/run-tests --testnames 
> pyspark.sql.tests.connect.streaming.test_parity_streaming
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47257) Assign error classes to ALTER COLUMN errors

2024-05-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47257:
---
Labels: pull-request-available starter  (was: starter)

> Assign error classes to ALTER COLUMN errors
> ---
>
> Key: SPARK-47257
> URL: https://issues.apache.org/jira/browse/SPARK-47257
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: pull-request-available, starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_105[3-4]* 
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name 
> should be short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> exist yet. Check the exception fields by using {*}checkError(){*} (a rough 
> sketch is shown after the PR list below). That function checks only the 
> valuable error fields and avoids depending on the error text message. In this 
> way, tech editors can modify the error format in error-classes.json without 
> worrying about breaking Spark's internal tests. Migrate other tests that might 
> trigger the error to checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is 
> not clear. Propose to users how to avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]
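
A rough, illustrative sketch of the expected test shape inside a QueryTest-based suite; the SQL statement, error class name, and parameters below are placeholders, not the real values for _LEGACY_ERROR_TEMP_105[3-4]:

{code:scala}
// Placeholders only: the statement, error class name and parameters must be
// replaced with the ones actually produced by the error being renamed.
val e = intercept[AnalysisException] {
  sql("ALTER TABLE t ALTER COLUMN c SET NOT NULL")
}
checkError(
  exception = e,
  errorClass = "SOME_DESCRIPTIVE_NAME",
  parameters = Map("fieldName" -> "c"))
{code}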



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48403) Fix Upper & Lower expressions for UTF8_BINARY_LCASE

2024-05-24 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-48403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-48403:
-
Summary: Fix Upper & Lower expressions for UTF8_BINARY_LCASE  (was: Fix 
Upper & Lower expressions for UTF8_BINARY_LCASE))

> Fix Upper & Lower expressions for UTF8_BINARY_LCASE
> ---
>
> Key: SPARK-48403
> URL: https://issues.apache.org/jira/browse/SPARK-48403
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48410) Fix InitCap expression

2024-05-24 Thread Jira
Uroš Bojanić created SPARK-48410:


 Summary: Fix InitCap expression
 Key: SPARK-48410
 URL: https://issues.apache.org/jira/browse/SPARK-48410
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48403) Fix Upper & Lower expressions for UTF8_BINARY_LCASE)

2024-05-24 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-48403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-48403:
-
Summary: Fix Upper & Lower expressions for UTF8_BINARY_LCASE)  (was: Fix 
Upper, Lower, InitCap for UTF8_BINARY_LCASE)

> Fix Upper & Lower expressions for UTF8_BINARY_LCASE)
> 
>
> Key: SPARK-48403
> URL: https://issues.apache.org/jira/browse/SPARK-48403
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48409) Upgrade MySQL & Postgres & mariadb docker image version

2024-05-24 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48409:
---

 Summary: Upgrade MySQL & Postgres & mariadb docker image version
 Key: SPARK-48409
 URL: https://issues.apache.org/jira/browse/SPARK-48409
 Project: Spark
  Issue Type: Improvement
  Components: Build, Tests
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46090) Support plan fragment level SQL configs in AQE

2024-05-24 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-46090:
-

Assignee: XiDuo You

> Support plan fragment level SQL configs  in AQE
> ---
>
> Key: SPARK-46090
> URL: https://issues.apache.org/jira/browse/SPARK-46090
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
>
> AQE executes query plan stage by stage, so there is a chance to support plan 
> fragment level SQL configs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46090) Support plan fragment level SQL configs in AQE

2024-05-24 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-46090.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44013
[https://github.com/apache/spark/pull/44013]

> Support plan fragment level SQL configs  in AQE
> ---
>
> Key: SPARK-46090
> URL: https://issues.apache.org/jira/browse/SPARK-46090
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> AQE executes query plan stage by stage, so there is a chance to support plan 
> fragment level SQL configs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48406) Upgrade commons-cli to 1.8.0

2024-05-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48406.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46727
[https://github.com/apache/spark/pull/46727]

> Upgrade commons-cli to 1.8.0
> 
>
> Key: SPARK-48406
> URL: https://issues.apache.org/jira/browse/SPARK-48406
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> * [https://commons.apache.org/proper/commons-cli/changes-report.html#a1.7.0]
>  * 
> [https://commons.apache.org/proper/commons-cli/changes-report.html#a1.8.0|https://commons.apache.org/proper/commons-cli/changes-report.html#a1.7.8]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48406) Upgrade commons-cli to 1.8.0

2024-05-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-48406:


Assignee: Yang Jie

> Upgrade commons-cli to 1.8.0
> 
>
> Key: SPARK-48406
> URL: https://issues.apache.org/jira/browse/SPARK-48406
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> * [https://commons.apache.org/proper/commons-cli/changes-report.html#a1.7.0]
>  * 
> [https://commons.apache.org/proper/commons-cli/changes-report.html#a1.8.0|https://commons.apache.org/proper/commons-cli/changes-report.html#a1.7.8]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48408) Simplify `date_format` & `from_unixtime`

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48408:
---
Labels: pull-request-available  (was: )

> Simplify `date_format` & `from_unixtime`
> 
>
> Key: SPARK-48408
> URL: https://issues.apache.org/jira/browse/SPARK-48408
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48408) Simplify `date_format` & `from_unixtime`

2024-05-23 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48408:
---

 Summary: Simplify `date_format` & `from_unixtime`
 Key: SPARK-48408
 URL: https://issues.apache.org/jira/browse/SPARK-48408
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48405) Upgrade `commons-compress` to 1.26.2

2024-05-23 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48405.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46725
[https://github.com/apache/spark/pull/46725]

> Upgrade `commons-compress` to 1.26.2
> 
>
> Key: SPARK-48405
> URL: https://issues.apache.org/jira/browse/SPARK-48405
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48405) Upgrade `commons-compress` to 1.26.2

2024-05-23 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-48405:


Assignee: BingKun Pan

> Upgrade `commons-compress` to 1.26.2
> 
>
> Key: SPARK-48405
> URL: https://issues.apache.org/jira/browse/SPARK-48405
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48407) Teradata: Document Type Conversion rules between Spark SQL and teradata

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48407:
---
Labels: pull-request-available  (was: )

> Teradata: Document Type Conversion rules between Spark SQL and teradata
> ---
>
> Key: SPARK-48407
> URL: https://issues.apache.org/jira/browse/SPARK-48407
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48407) Teradata: Document Type Conversion rules between Spark SQL and teradata

2024-05-23 Thread Kent Yao (Jira)
Kent Yao created SPARK-48407:


 Summary: Teradata: Document Type Conversion rules between Spark 
SQL and teradata
 Key: SPARK-48407
 URL: https://issues.apache.org/jira/browse/SPARK-48407
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48406) Upgrade commons-cli to 1.8.0

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48406:
---
Labels: pull-request-available  (was: )

> Upgrade commons-cli to 1.8.0
> 
>
> Key: SPARK-48406
> URL: https://issues.apache.org/jira/browse/SPARK-48406
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> * [https://commons.apache.org/proper/commons-cli/changes-report.html#a1.7.0]
>  * 
> [https://commons.apache.org/proper/commons-cli/changes-report.html#a1.8.0|https://commons.apache.org/proper/commons-cli/changes-report.html#a1.7.8]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48283) Implement modified Lowercase operation for UTF8_BINARY_LCASE

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48283:
---
Labels: pull-request-available  (was: )

> Implement modified Lowercase operation for UTF8_BINARY_LCASE
> 
>
> Key: SPARK-48283
> URL: https://issues.apache.org/jira/browse/SPARK-48283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48406) Upgrade commons-cli to 1.8.0

2024-05-23 Thread Yang Jie (Jira)
Yang Jie created SPARK-48406:


 Summary: Upgrade commons-cli to 1.8.0
 Key: SPARK-48406
 URL: https://issues.apache.org/jira/browse/SPARK-48406
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Yang Jie


* [https://commons.apache.org/proper/commons-cli/changes-report.html#a1.7.0]
 * 
[https://commons.apache.org/proper/commons-cli/changes-report.html#a1.8.0|https://commons.apache.org/proper/commons-cli/changes-report.html#a1.7.8]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48399) Teradata: ByteType wrongly mapping to teradata byte(binary) type

2024-05-23 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48399.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46715
[https://github.com/apache/spark/pull/46715]

> Teradata: ByteType wrongly mapping to teradata byte(binary) type
> 
>
> Key: SPARK-48399
> URL: https://issues.apache.org/jira/browse/SPARK-48399
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48405) Upgrade `commons-compress` to 1.26.2

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48405:
---
Labels: pull-request-available  (was: )

> Upgrade `commons-compress` to 1.26.2
> 
>
> Key: SPARK-48405
> URL: https://issues.apache.org/jira/browse/SPARK-48405
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48405) Upgrade `commons-compress` to 1.26.2

2024-05-23 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48405:
---

 Summary: Upgrade `commons-compress` to 1.26.2
 Key: SPARK-48405
 URL: https://issues.apache.org/jira/browse/SPARK-48405
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48404) Driver and Executor support merge and run in a single jvm

2024-05-23 Thread melin (Jira)
melin created SPARK-48404:
-

 Summary: Driver and Executor support merge and run in a single jvm
 Key: SPARK-48404
 URL: https://issues.apache.org/jira/browse/SPARK-48404
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: melin


Spark is used in data integration scenarios (such as reading data from MySQL and 
writing it to other data sources), and in many cases such tasks can run with a 
single degree of concurrency. Today the Driver and Executor consume resources 
separately. If the Driver and Executor could be merged to run in a single JVM, it 
could save compute cost, especially when running in the cloud.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33164) SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to DataSet.dropColumn(someColumn)

2024-05-23 Thread Jonathan Boarman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849126#comment-17849126
 ] 

Jonathan Boarman commented on SPARK-33164:
--

There are significant benefits to the `{*}{{EXCEPT}}{*}` feature offered by most 
large data platforms, including Databricks, Snowflake, BigQuery, DuckDB, etc.  The 
list of vendors that support *{{EXCEPT}}* (increasingly called `{*}{{EXCLUDE}}{*}` 
to avoid conflicts) is long and growing.  As such, migrating projects from those 
platforms to a pure Spark SQL environment is extremely costly.

Further, the "risks" associated with `{*}{{SELECT *}}{*}` do not apply to all 
scenarios – very importantly, with CTEs these risks are not applicable since 
the constraints on column selection are generally made in the first CTE.

For example, any subsequent CTE in a chain of CTEs inherits the field selection of 
the first CTE.  On platforms that lack this feature, we face a different risk, 
caused by crazy levels of duplication, if we are forced to enumerate fields in 
each and every CTE.  This is particularly problematic when joining two CTEs that 
share a field, such as an `{*}{{ID}}{*}` column.  In that situation, the most 
efficient and risk-free approach is to `{*}{{SELECT * 
EXCEPT(right.id)}}{*}` from the join of its two dependent CTEs.

Any perceived judgment aside, this is a highly-relied-upon feature in 
enterprise environments that depend on these quality-of-life innovations.  
Clearly such improvements are providing value in those environments, and Spark 
SQL should not be any different in supporting its users that have come to rely 
on such innovations.
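
For comparison, the closest DataFrame-side equivalent available today is {{Dataset.drop}} on the join output; a minimal sketch, assuming two DataFrames {{left}} and {{right}} that both carry an {{id}} column:

{code:scala}
// Sketch of today's DataFrame-side workaround for "SELECT * EXCEPT (right.id)":
// join, then drop the duplicated key coming from the right-hand side.
val joined = left.join(right, left("id") === right("id"))
val result = joined.drop(right("id"))   // keeps every column except right.id
{code}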

> SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to 
> DataSet.dropColumn(someColumn)
> 
>
> Key: SPARK-33164
>     URL: https://issues.apache.org/jira/browse/SPARK-33164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Arnaud Nauwynck
>Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> I would like to have the extended SQL syntax "SELECT * EXCEPT someColumn FROM 
> .." 
> to be able to select all columns except some in a SELECT clause.
> It would be similar to SQL syntax from some databases, like Google BigQuery 
> or PostgresQL.
> https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax
> Google question "select * EXCEPT one column", and you will see many 
> developpers have the same problems.
> example posts: 
> https://blog.jooq.org/2018/05/14/selecting-all-columns-except-one-in-postgresql/
> https://www.thetopsites.net/article/53001825.shtml
> There are several typical examples where it is very helpful:
> use-case1:
>  you add "count ( * )  countCol" column, and then filter on it using for 
> example "having countCol = 1" 
>   ... and then you want to select all columns EXCEPT this dummy column which 
> always is "1"
> {noformat}
>   select * (EXCEPT countCol)
>   from (  
>  select count(*) countCol, * 
>from MyTable 
>where ... 
>group by ... having countCol = 1
>   )
> {noformat}
>
> use-case 2:
>  same with analytical function "partition over(...) rankCol  ... where 
> rankCol=1"
>  For example to get the latest row before a given time, in a time series 
> table.
>  This is "Time-Travel" queries addressed by framework like "DeltaLake"
> {noformat}
>  CREATE table t_updates (update_time timestamp, id string, col1 type1, col2 
> type2, ... col42)
>  pastTime=..
>  SELECT * (except rankCol)
>  FROM (
>SELECT *,
>   RANK() OVER (PARTITION BY id ORDER BY update_time) rankCol   
>FROM t_updates
>where update_time < pastTime
>  ) WHERE rankCol = 1
>  
> {noformat}
>  
> use-case 3:
>  copy some data from table "t" to corresponding table "t_snapshot", and back 
> to "t"
> {noformat}
>CREATE TABLE t (col1 type1, col2 type2, col3 type3, ... col42 type42) ...
>
>/* create corresponding table: (snap_id string, col1 type1, col2 type2, 
> col3 type3, ... col42 type42) */
>CREATE TABLE t_snapshot
>AS SELECT '' as snap_id, * FROM t WHERE 1=2
>/* insert data from t to some snapshot */
>INSERT INTO t_snapshot
>SELECT 'snap1' 

[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-23 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849095#comment-17849095
 ] 

Bruce Robbins commented on SPARK-48361:
---

I can take a look at the root cause, unless you are already looking at that, in 
which case I will hold off.

> Correctness: CSV corrupt record filter with aggregate ignored
> -
>
> Key: SPARK-48361
> URL: https://issues.apache.org/jira/browse/SPARK-48361
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Using spark shell 3.5.1 on M1 Mac
>Reporter: Ted Chester Jenks
>Priority: Major
>
> Using corrupt record in CSV parsing for some data cleaning logic, I came 
> across a correctness bug.
>  
> The following repro can be run with spark-shell 3.5.1.
> *Create test.csv with the following content:*
> {code:java}
> test,1,2,three
> four,5,6,seven
> 8,9
> ten,11,12,thirteen {code}
>  
>  
> *In spark-shell:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> // define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> // add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> // read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
>  
> // define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
>  
> // add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> dfWithJagged.show(){code}
> *Returns:*
> {code:java}
> +---+---+---++---+---+
> |column1|column2|column3| column4|_corrupt_record|__is_jagged|
> +---+---+---++---+---+
> |   four|    5.0|    6.0|   seven|           NULL|      false|
> |      8|    9.0|   NULL|    NULL|            8,9|       true|
> |    ten|   11.0|   12.0|thirteen|           NULL|      false|
> +---+---+---++---+---+ {code}
> So far so good...
>  
> *BUT*
>  
> *If we add an aggregate before we show:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> val dfDropped = dfWithJagged.filter(col("__is_jagged") =!= true)
> val groupedSum = 
> dfDropped.groupBy("column1").agg(sum("column2").alias("sum_column2"))
> groupedSum.show(){code}
> *We get:*
> {code:java}
> +---+---+
> |column1|sum_column2|
> +---+---+
> |      8|        9.0|
> |   four|        5.0|
> |    ten|       11.0|
> +---+---+ {code}
>  
> *Which is not correct*
>  
> With the addition of the aggregate, the filter down to rows with 3 commas in 
> the corrupt record column is ignored. This does not happen with any other 
> operators I have tried - just aggregates so far.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48318) Hash join support for strings with collation (complex types)

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48318:
---
Labels: pull-request-available  (was: )

> Hash join support for strings with collation (complex types)
> 
>
> Key: SPARK-48318
> URL: https://issues.apache.org/jira/browse/SPARK-48318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-41049:
---
Labels: correctness pull-request-available  (was: correctness)

> Nondeterministic expressions have unstable values if they are children of 
> CodegenFallback expressions
> -
>
> Key: SPARK-41049
> URL: https://issues.apache.org/jira/browse/SPARK-41049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Guy Boo
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness, pull-request-available
> Fix For: 3.4.0
>
>
> h2. Expectation
> For a given row, Nondeterministic expressions are expected to have stable 
> values.
> {code:scala}
> import org.apache.spark.sql.functions._
> val df = sparkContext.parallelize(1 to 5).toDF("x")
> val v1 = rand().*(lit(1)).cast(IntegerType)
> df.select(v1, v1).collect{code}
> Returns a set like this:
> |8777|8777|
> |1357|1357|
> |3435|3435|
> |9204|9204|
> |3870|3870|
> where both columns always have the same value, but what that value is changes 
> from row to row. This is different from the following:
> {code:scala}
> df.select(rand(), rand()).collect{code}
> In this case, because the rand() calls are distinct, the values in both 
> columns should be different.
> h2. Problem
> This expectation does not appear to be stable in the event that any 
> subsequent expression is a CodegenFallback. This program:
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> val sparkSession = SparkSession.builder().getOrCreate()
> val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
> val v1 = rand().*(lit(1)).cast(IntegerType)
> val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
> df.select(v1, v1, v2, v2).collect {code}
> produces output like this:
> |8159|8159|8159|{color:#ff}2028{color}|
> |8320|8320|8320|{color:#ff}1640{color}|
> |7937|7937|7937|{color:#ff}769{color}|
> |436|436|436|{color:#ff}8924{color}|
> |8924|8924|2827|{color:#ff}2731{color}|
> Not sure why the first call via the CodegenFallback path should be correct 
> while subsequent calls aren't.
> h2. Workaround
> If the Nondeterministic expression is moved to a separate, earlier select() 
> call, so the CodegenFallback instead only refers to a column reference, then 
> the problem seems to go away. But this workaround may not be reliable if 
> optimization is ever able to restructure adjacent select()s.
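
A minimal sketch of that workaround, reusing {{df}} and {{v1}} from the repro above (illustrative only, and subject to the caveat about adjacent selects noted in the description):

{code:scala}
// Materialize the nondeterministic value in an earlier select(), so the
// CodegenFallback expression (to_csv) only sees a stable column reference.
val withV1 = df.select(col("x"), v1.as("v1"))
withV1.select(col("v1"), col("v1"),
  to_csv(struct(col("v1").as("a"))), to_csv(struct(col("v1").as("a")))).collect()
{code}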



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48403) Fix Upper, Lower, InitCap for UTF8_BINARY_LCASE

2024-05-23 Thread Jira
Uroš Bojanić created SPARK-48403:


 Summary: Fix Upper, Lower, InitCap for UTF8_BINARY_LCASE
 Key: SPARK-48403
 URL: https://issues.apache.org/jira/browse/SPARK-48403
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48400) Promote `PrometheusServlet` to `DeveloperApi`

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48400:
---
Labels: pull-request-available  (was: )

> Promote `PrometheusServlet` to `DeveloperApi`
> -
>
> Key: SPARK-48400
> URL: https://issues.apache.org/jira/browse/SPARK-48400
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Zhou JIANG
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-23 Thread Ted Chester Jenks (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848936#comment-17848936
 ] 

Ted Chester Jenks commented on SPARK-48361:
---

Ah yes! Sorry [~bersprockets] I messed up my example but you are right, I meant 
the 
{quote}With `=!= true`, the grouping includes `8, 9` (it shouldn't, as you 
mentioned).
{quote}
as you mentioned. Fixed in the example now.

 

I know persisting fixes it - so the bug occurs specifically when the data is not 
collected/checkpointed/cached.

> Correctness: CSV corrupt record filter with aggregate ignored
> -
>
> Key: SPARK-48361
> URL: https://issues.apache.org/jira/browse/SPARK-48361
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Using spark shell 3.5.1 on M1 Mac
>Reporter: Ted Chester Jenks
>Priority: Major
>
> Using corrupt record in CSV parsing for some data cleaning logic, I came 
> across a correctness bug.
>  
> The following repro can be run with spark-shell 3.5.1.
> *Create test.csv with the following content:*
> {code:java}
> test,1,2,three
> four,5,6,seven
> 8,9
> ten,11,12,thirteen {code}
>  
>  
> *In spark-shell:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
>  
> // define a STRING, DOUBLE, DOUBLE, STRING schema for the data
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
>  
> // add a column for corrupt records to the schema
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
>  
> // read the CSV with the schema, headers, permissive parsing, and the corrupt 
> record column
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
>  
> // define a UDF to count the commas in the corrupt record column
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
>  
> // add a true/false column for whether the number of commas is 3
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> dfWithJagged.show(){code}
> *Returns:*
> {code:java}
> +---+---+---++---+---+
> |column1|column2|column3| column4|_corrupt_record|__is_jagged|
> +---+---+---++---+---+
> |   four|    5.0|    6.0|   seven|           NULL|      false|
> |      8|    9.0|   NULL|    NULL|            8,9|       true|
> |    ten|   11.0|   12.0|thirteen|           NULL|      false|
> +---+---+---++---+---+ {code}
> So far so good...
>  
> *BUT*
>  
> *If we add an aggregate before we show:*
> {code:java}
> import org.apache.spark.sql.types._ 
> import org.apache.spark.sql.functions._
> val schema = StructType(List(StructField("column1", StringType, true), 
> StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
> true), StructField("column4", StringType, true)))
> val schemaWithCorrupt = StructType(schema.fields :+ 
> StructField("_corrupt_record", StringType, true)) 
> val df = spark.read.option("header", "true").option("mode", 
> "PERMISSIVE").option("columnNameOfCorruptRecord", 
> "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
> val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else 
> -1) 
> val dfWithJagged = df.withColumn("__is_jagged", 
> when(col("_corrupt_record").isNull, 
> false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
> val dfDropped = dfWithJagged.filter(col("__is_jagged") =!= true)
> val groupedSum = 
> dfDropped.groupBy("column1").agg(sum("column2").alias("sum_column2"))
> groupedSum.show(){code}
> *We get:*
> {code:java}
> +---+---+
> |column1|sum_column2|
> +---+---+
> |      8|        9.0|
> |   four|        5.0|
> |    ten|       11.0|
> +---+---+ {code}
>  
> *Which is not correct*
>  
> With the addition of the aggregate, the filter down to rows with 3 commas in 
> the corrupt record column is ignored. This does not happen with any other 
> operators I have tried - just aggregates so far.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored

2024-05-23 Thread Ted Chester Jenks (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Chester Jenks updated SPARK-48361:
--
Description: 
Using corrupt record in CSV parsing for some data cleaning logic, I came across 
a correctness bug.

 

The following repro can be run with spark-shell 3.5.1.

*Create test.csv with the following content:*
{code:java}
test,1,2,three
four,5,6,seven
8,9
ten,11,12,thirteen {code}
 

 

*In spark-shell:*
{code:java}
import org.apache.spark.sql.types._ 
import org.apache.spark.sql.functions._
 
// define a STRING, DOUBLE, DOUBLE, STRING schema for the data
val schema = StructType(List(StructField("column1", StringType, true), 
StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
true), StructField("column4", StringType, true)))
 
// add a column for corrupt records to the schema
val schemaWithCorrupt = StructType(schema.fields :+ 
StructField("_corrupt_record", StringType, true)) 
 
// read the CSV with the schema, headers, permissive parsing, and the corrupt 
record column
val df = spark.read.option("header", "true").option("mode", 
"PERMISSIVE").option("columnNameOfCorruptRecord", 
"_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
 
// define a UDF to count the commas in the corrupt record column
val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else -1) 
 
// add a true/false column for whether the number of commas is 3
val dfWithJagged = df.withColumn("__is_jagged", 
when(col("_corrupt_record").isNull, 
false).otherwise(countCommas(col("_corrupt_record")) =!= 3))

dfWithJagged.show(){code}
*Returns:*
{code:java}
+---+---+---++---+---+
|column1|column2|column3| column4|_corrupt_record|__is_jagged|
+---+---+---++---+---+
|   four|    5.0|    6.0|   seven|           NULL|      false|
|      8|    9.0|   NULL|    NULL|            8,9|       true|
|    ten|   11.0|   12.0|thirteen|           NULL|      false|
+---+---+---++---+---+ {code}
So far so good...

 

*BUT*

 

*If we add an aggregate before we show:*
{code:java}
import org.apache.spark.sql.types._ 
import org.apache.spark.sql.functions._
val schema = StructType(List(StructField("column1", StringType, true), 
StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
true), StructField("column4", StringType, true)))
val schemaWithCorrupt = StructType(schema.fields :+ 
StructField("_corrupt_record", StringType, true)) 
val df = spark.read.option("header", "true").option("mode", 
"PERMISSIVE").option("columnNameOfCorruptRecord", 
"_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") 
val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else -1) 
val dfWithJagged = df.withColumn("__is_jagged", 
when(col("_corrupt_record").isNull, 
false).otherwise(countCommas(col("_corrupt_record")) =!= 3))
val dfDropped = dfWithJagged.filter(col("__is_jagged") =!= true)
val groupedSum = 
dfDropped.groupBy("column1").agg(sum("column2").alias("sum_column2"))
groupedSum.show(){code}
*We get:*
{code:java}
+---+---+
|column1|sum_column2|
+---+---+
|      8|        9.0|
|   four|        5.0|
|    ten|       11.0|
+---+---+ {code}
 

*Which is not correct*

 

With the addition of the aggregate, the filter down to rows with 3 commas in 
the corrupt record column is ignored. This does not happen with any other 
operators I have tried - just aggregates so far.

 

 

 

  was:
Using corrupt record in CSV parsing for some data cleaning logic, I came across 
a correctness bug.

 

The following repro can be ran with spark-shell 3.5.1.

*Create test.csv with the following content:*
{code:java}
test,1,2,three
four,5,6,seven
8,9
ten,11,12,thirteen {code}
 

 

*In spark-shell:*
{code:java}
import org.apache.spark.sql.types._ 
import org.apache.spark.sql.functions._
 
# define a STRING, DOUBLE, DOUBLE, STRING schema for the data
val schema = StructType(List(StructField("column1", StringType, true), 
StructField("column2", DoubleType, true), StructField("column3", DoubleType, 
true), StructField("column4", StringType, true)))
 
# add a column for corrupt records to the schema
val schemaWithCorrupt = StructType(schema.fields :+ 
StructField("_corrupt_record", StringType, true)) 
 
# read the CSV with the schema, headers, permissive parsing, and the corrupt 
record column
val df = spark.read.option("header", 

[jira] [Commented] (SPARK-45265) Support Hive 4.0 metastore

2024-05-23 Thread Ankit Prakash Gupta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848924#comment-17848924
 ] 

Ankit Prakash Gupta commented on SPARK-45265:
-

Since Hive 4.0.0 is already GA, maybe I can help on this one.

!image-2024-05-23-17-35-48-426.png!

> Support Hive 4.0 metastore
> --
>
> Key: SPARK-45265
> URL: https://issues.apache.org/jira/browse/SPARK-45265
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-05-23-17-35-48-426.png
>
>
> Although Hive 4.0.0 is still beta, I would like to work on this, as Hive 4.0.0 
> will support the pushdown of partition column filters with VARCHAR/CHAR types.
> For details please see HIVE-26661: Support partition filter for char and 
> varchar types on Hive metastore



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45265) Support Hive 4.0 metastore

2024-05-23 Thread Ankit Prakash Gupta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Prakash Gupta updated SPARK-45265:

Attachment: image-2024-05-23-17-35-48-426.png

> Support Hive 4.0 metastore
> --
>
> Key: SPARK-45265
> URL: https://issues.apache.org/jira/browse/SPARK-45265
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-05-23-17-35-48-426.png
>
>
> Although Hive 4.0.0 is still beta, I would like to work on this, as Hive 4.0.0 
> will support the pushdown of partition column filters with VARCHAR/CHAR types.
> For details please see HIVE-26661: Support partition filter for char and 
> varchar types on Hive metastore



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48395) Fix StructType.treeString for parameterized types

2024-05-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-48395.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46711
[https://github.com/apache/spark/pull/46711]

> Fix StructType.treeString for parameterized types
> -
>
> Key: SPARK-48395
> URL: https://issues.apache.org/jira/browse/SPARK-48395
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48401) Spark Connect Go Session Type cannot be used in complex code

2024-05-23 Thread David Sisson (Jira)
David Sisson created SPARK-48401:


 Summary: Spark Connect Go Session Type cannot be used in complex 
code
 Key: SPARK-48401
 URL: https://issues.apache.org/jira/browse/SPARK-48401
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.1
Reporter: David Sisson


The return type of the Spark Session builder is an unexported type. This means 
that code cannot pass around a sql.sparkSession to perform more complex tasks 
(for instance, a routine that sets up a set of temporary views for later 
processing).

 
spark, err := sql.SparkSession.Builder.Remote(*remote).Build()
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48399) Teradata: ByteType wrongly mapping to teradata byte(binary) type

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48399:
---
Labels: pull-request-available  (was: )

> Teradata: ByteType wrongly mapping to teradata byte(binary) type
> 
>
> Key: SPARK-48399
> URL: https://issues.apache.org/jira/browse/SPARK-48399
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48400) Promote `PrometheusServlet` to `DeveloperApi`

2024-05-23 Thread Zhou JIANG (Jira)
Zhou JIANG created SPARK-48400:
--

 Summary: Promote `PrometheusServlet` to `DeveloperApi`
 Key: SPARK-48400
 URL: https://issues.apache.org/jira/browse/SPARK-48400
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: kubernetes-operator-0.1.0
Reporter: Zhou JIANG






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48399) Teradata: ByteType wrongly mapping to teradata byte(binary) type

2024-05-23 Thread Kent Yao (Jira)
Kent Yao created SPARK-48399:


 Summary: Teradata: ByteType wrongly mapping to teradata 
byte(binary) type
 Key: SPARK-48399
 URL: https://issues.apache.org/jira/browse/SPARK-48399
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48382) Add controller / reconciler module to operator

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48382:
---
Labels: pull-request-available  (was: )

> Add controller / reconciler module to operator
> --
>
> Key: SPARK-48382
> URL: https://issues.apache.org/jira/browse/SPARK-48382
> Project: Spark
>  Issue Type: Sub-task
>  Components: k8s
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Zhou JIANG
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker

2024-05-23 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848824#comment-17848824
 ] 

Eric Yang edited comment on SPARK-48397 at 5/23/24 6:38 AM:


The PR: https://github.com/apache/spark/pull/46714


was (Author: JIRAUSER304132):
I'm working on a PR for it.

> Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
> 
>
> Key: SPARK-48397
> URL: https://issues.apache.org/jira/browse/SPARK-48397
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Eric Yang
>Priority: Major
>  Labels: pull-request-available
>
> For FileFormatDataWriter we currently record metrics of "task commit time" 
> and "job commit time" in 
> `org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`.
>  We may also record the time spent on "data write" (together with the time 
> spent on producing records from the iterator), which is usually one of the 
> major parts of the total duration of a writing operation. It helps us 
> identify bottlenecks and time skew, and also supports general performance 
> tuning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48397:
---
Labels: pull-request-available  (was: )

> Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
> 
>
> Key: SPARK-48397
> URL: https://issues.apache.org/jira/browse/SPARK-48397
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Eric Yang
>Priority: Major
>  Labels: pull-request-available
>
> For FileFormatDataWriter we currently record metrics of "task commit time" 
> and "job commit time" in 
> `org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`.
>  We may also record the time spent on "data write" (together with the time 
> spent on producing records from the iterator), which is usually one of the 
> major parts of the total duration of a writing operation. It helps us 
> identify bottlenecks and time skew, and also supports general performance 
> tuning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48396) Support configuring limit control for SQL to use maximum cores

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48396:
---
Labels: pull-request-available  (was: )

> Support configuring limit control for SQL to use maximum cores
> --
>
> Key: SPARK-48396
> URL: https://issues.apache.org/jira/browse/SPARK-48396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Mars
>Priority: Major
>  Labels: pull-request-available
>
> When there is a long-running shared Spark SQL cluster, a single large SQL 
> query may occupy all the cores of the cluster, affecting the execution of 
> other queries. Therefore, it would be useful to have a configuration that 
> limits the maximum number of cores a single SQL query can use.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker

2024-05-23 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848824#comment-17848824
 ] 

Eric Yang commented on SPARK-48397:
---

I'm working on a PR for it.

> Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
> 
>
> Key: SPARK-48397
> URL: https://issues.apache.org/jira/browse/SPARK-48397
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Eric Yang
>Priority: Major
>
> For FileFormatDataWriter we currently record the "task commit time" and 
> "job commit time" metrics in 
> `org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`.
> We should also record the time spent on "data write" (together with the time 
> spent producing records from the iterator), which is usually one of the major 
> parts of the total duration of a write operation. This would help us identify 
> bottlenecks and time skew, and support general performance tuning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48398) Add Helm chart for Operator Deployment

2024-05-23 Thread Zhou JIANG (Jira)
Zhou JIANG created SPARK-48398:
--

 Summary: Add Helm chart for Operator Deployment
 Key: SPARK-48398
 URL: https://issues.apache.org/jira/browse/SPARK-48398
 Project: Spark
  Issue Type: Sub-task
  Components: k8s
Affects Versions: kubernetes-operator-0.1.0
Reporter: Zhou JIANG






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker

2024-05-23 Thread Eric Yang (Jira)
Eric Yang created SPARK-48397:
-

 Summary: Add data write time metric to 
FileFormatDataWriter/BasicWriteJobStatsTracker
 Key: SPARK-48397
 URL: https://issues.apache.org/jira/browse/SPARK-48397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Eric Yang


For FileFormatDataWriter we currently record the "task commit time" and 
"job commit time" metrics in 
`org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`. 
We should also record the time spent on "data write" (together with the time 
spent producing records from the iterator), which is usually one of the major 
parts of the total duration of a write operation. This would help us identify 
bottlenecks and time skew, and support general performance tuning.
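As a rough illustration of the idea (a sketch, not the actual patch; `TimedWriter`, `underlying` and `writeTimeNs` are invented names), the write time could be accumulated around the per-record write call and reported next to the existing commit-time metrics:

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow

// Sketch only: wraps the real per-record write and accumulates elapsed
// nanoseconds, which a stats tracker could expose as a "data write time" metric.
class TimedWriter(underlying: InternalRow => Unit) {
  private var writeTimeNs: Long = 0L

  def write(row: InternalRow): Unit = {
    val start = System.nanoTime()
    underlying(row)                          // the actual file-format write
    writeTimeNs += System.nanoTime() - start
  }

  // Converted to millis when the metric is reported at task commit time.
  def totalWriteTimeMs: Long = writeTimeNs / 1000000L
}
{code}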



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48396) Support configuring limit control for SQL to use maximum cores

2024-05-22 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars updated SPARK-48396:
-
Description: 
In a long-running shared Spark SQL cluster, a single large SQL query can occupy 
all the cores of the cluster and affect the execution of other queries. It 
would therefore be useful to have a configuration that limits the maximum 
number of cores a single query may use.
 

> Support configuring limit control for SQL to use maximum cores
> --
>
> Key: SPARK-48396
> URL: https://issues.apache.org/jira/browse/SPARK-48396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Mars
>Priority: Major
>
> In a long-running shared Spark SQL cluster, a single large SQL query can 
> occupy all the cores of the cluster and affect the execution of other 
> queries. It would therefore be useful to have a configuration that limits 
> the maximum number of cores a single query may use.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48396) Support configuring limit control for SQL to use maximum cores

2024-05-22 Thread Mars (Jira)
Mars created SPARK-48396:


 Summary: Support configuring limit control for SQL to use maximum 
cores
 Key: SPARK-48396
 URL: https://issues.apache.org/jira/browse/SPARK-48396
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.1
Reporter: Mars






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48370) Checkpoint and localCheckpoint in Scala Spark Connect client

2024-05-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48370.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46683
[https://github.com/apache/spark/pull/46683]

> Checkpoint and localCheckpoint in Scala Spark Connect client
> 
>
> Key: SPARK-48370
> URL: https://issues.apache.org/jira/browse/SPARK-48370
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> SPARK-48258 implemented checkpoint and localCheckpoint in the Python Spark 
> Connect client. We should do the same in the Scala client.
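The expected Scala usage would mirror the existing Dataset API; a minimal sketch, assuming the Connect client exposes the same signatures:

{code:scala}
// checkpoint() needs a checkpoint directory configured on the session;
// localCheckpoint() keeps the materialized blocks on the executors instead.
val df = spark.range(0, 100).toDF("id")

val reliable = df.checkpoint()                  // truncates lineage, writes to the checkpoint dir
val local    = df.localCheckpoint(eager = true) // truncates lineage, stores blocks locally
{code}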



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48370) Checkpoint and localCheckpoint in Scala Spark Connect client

2024-05-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48370:


Assignee: Hyukjin Kwon

> Checkpoint and localCheckpoint in Scala Spark Connect client
> 
>
> Key: SPARK-48370
> URL: https://issues.apache.org/jira/browse/SPARK-48370
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> SPARK-48258 implemented checkpoint and localCheckpoint in the Python Spark 
> Connect client. We should do the same in the Scala client.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48391) use addAll instead of add function in TaskMetrics to accelerate

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48391:
---
Labels: pull-request-available  (was: )

> use addAll instead of add function  in TaskMetrics  to accelerate
> -
>
> Key: SPARK-48391
> URL: https://issues.apache.org/jira/browse/SPARK-48391
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: jiahong.li
>Priority: Major
>  Labels: pull-request-available
>
> In the fromAccumulators method of TaskMetrics, we should use 
> `tm._externalAccums.addAll` instead of `tm._externalAccums.add`, as 
> `_externalAccums` is an instance of CopyOnWriteArrayList.
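The difference matters because every individual add on a CopyOnWriteArrayList copies the whole backing array. A standalone sketch of the two patterns (illustrative data, not the TaskMetrics code itself):

{code:scala}
import java.util.{ArrayList => JArrayList}
import java.util.concurrent.CopyOnWriteArrayList

val source = new JArrayList[String]()
(1 to 10000).foreach(i => source.add(s"acc-$i"))

// One by one: every add() copies the whole backing array, O(n^2) copying in total.
val slow = new CopyOnWriteArrayList[String]()
(0 until source.size()).foreach(i => slow.add(source.get(i)))

// Batch: addAll() copies the backing array once for the entire collection.
val fast = new CopyOnWriteArrayList[String]()
fast.addAll(source)
{code}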



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48387) Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE

2024-05-22 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48387.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46701
[https://github.com/apache/spark/pull/46701]

> Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE
> ---
>
> Key: SPARK-48387
> URL: https://issues.apache.org/jira/browse/SPARK-48387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
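For context, a dialect-level mapping along these lines is what the title describes; a sketch against the public JdbcDialect/JdbcType API (not necessarily the shape of the actual change):

{code:scala}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types.{DataType, TimestampNTZType, TimestampType}

// Sketch: Spark's TimestampType is session-time-zone aware, so it lines up with
// Postgres TIMESTAMP WITH TIME ZONE, while TimestampNTZType stays without time zone.
object ExamplePgDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case TimestampType    => Some(JdbcType("TIMESTAMP WITH TIME ZONE", Types.TIMESTAMP_WITH_TIMEZONE))
    case TimestampNTZType => Some(JdbcType("TIMESTAMP", Types.TIMESTAMP))
    case _                => None
  }
}
{code}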




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48386) Replace JVM assert with JUnit Assert in tests

2024-05-22 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-48386.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46698
[https://github.com/apache/spark/pull/46698]

> Replace JVM assert with JUnit Assert in tests
> -
>
> Key: SPARK-48386
> URL: https://issues.apache.org/jira/browse/SPARK-48386
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
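One likely motivation: a bare JVM `assert` in Java test code only runs when the JVM is started with `-ea` (and Scala's built-in `assert` can similarly be elided at compile time), whereas JUnit assertions always execute and report expected/actual values. A small sketch, assuming JUnit 5 on the test classpath (`computeAnswer` is a stand-in for the code under test):

{code:scala}
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test

class ExampleSuite {
  private def computeAnswer(): Int = 42      // stand-in for the code under test

  @Test
  def answerIsCorrect(): Unit = {
    // Before: depends on assertion settings and gives no expected/actual detail.
    assert(computeAnswer() == 42)
    // After: always evaluated, reports expected and actual values on failure.
    assertEquals(42, computeAnswer())
  }
}
{code}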




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48387) Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE

2024-05-22 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-48387:


Assignee: Kent Yao

> Postgres: Map TimestampType to TIMESTAMP WITH TIME ZONE
> ---
>
> Key: SPARK-48387
> URL: https://issues.apache.org/jira/browse/SPARK-48387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48394) Cleanup mapIdToMapIndex on mapoutput unregister

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48394:
---
Labels: pull-request-available  (was: )

> Cleanup mapIdToMapIndex on mapoutput unregister
> ---
>
> Key: SPARK-48394
> URL: https://issues.apache.org/jira/browse/SPARK-48394
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: wuyi
>Priority: Major
>  Labels: pull-request-available
>
> In Spark there is only one valid map status for a given {{mapIndex}} at any 
> point in time. {{mapIdToMapIndex}} should follow the same rule to avoid 
> inconsistencies.
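A minimal standalone sketch of that invariant (illustrative names, not the actual MapOutputTracker code): the reverse id-to-index map has to be cleaned up together with the map status it points at, otherwise a stale map id keeps resolving to an index whose output was already unregistered.

{code:scala}
import scala.collection.mutable

// Stand-ins for the real structures: one status slot per map index, plus a
// reverse lookup from the (unique) map task id to its index.
val statusByIndex   = mutable.Map[Int, String]()
val mapIdToMapIndex = mutable.Map[Long, Int]()

def register(mapId: Long, mapIndex: Int, status: String): Unit = {
  statusByIndex(mapIndex) = status
  mapIdToMapIndex(mapId) = mapIndex
}

def unregister(mapId: Long): Unit = {
  // Remove both entries together so the reverse index never outlives the status.
  mapIdToMapIndex.remove(mapId).foreach(statusByIndex.remove)
}
{code}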



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


