[jira] [Commented] (SPARK-40099) Merge adjacent CaseWhen branches if their values are the same

2023-02-08 Thread daile (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686249#comment-17686249
 ] 

daile commented on SPARK-40099:
---

[~yumwang] can you help review it again?

> Merge adjacent CaseWhen branches if their values are the same
> -
>
> Key: SPARK-40099
> URL: https://issues.apache.org/jira/browse/SPARK-40099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:sql}
>   CASE
> WHEN f1.buyer_id IS NOT NULL THEN 1
> WHEN f2.buyer_id IS NOT NULL THEN 1
> ELSE 0
>   END
> {code}
> The expected result:
> {code:sql}
>   CASE
> WHEN f1.buyer_id IS NOT NULL OR f2.buyer_id IS NOT NULL 
> THEN 1
> ELSE 0
>   END
> {code}
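
For context, the proposed optimization amounts to OR-ing the conditions of adjacent branches whose values are semantically equal. A minimal sketch of what such a Catalyst rule could look like; the object name and structure are illustrative, not the actual patch:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{CaseWhen, Expression, Or}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object MergeAdjacentCaseWhenBranches extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
    case cw @ CaseWhen(branches, elseValue) =>
      // Walk the branches in order, OR-ing each condition into the previous
      // branch whenever the two branch values are semantically equal.
      val merged = branches.foldLeft(List.empty[(Expression, Expression)]) {
        case ((prevCond, prevValue) :: rest, (cond, value)) if prevValue.semanticEquals(value) =>
          (Or(prevCond, cond), prevValue) :: rest
        case (acc, branch) =>
          branch :: acc
      }.reverse
      if (merged.length < branches.length) CaseWhen(merged, elseValue) else cw
  }
}
{code}

Only adjacent branches are safe to merge this way: reordering non-adjacent branches could change results when earlier conditions overlap or can throw.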



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4073) Parquet+Snappy can cause significant off-heap memory usage

2023-02-08 Thread shufan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-4073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686239#comment-17686239
 ] 

shufan edited comment on SPARK-4073 at 2/9/23 7:46 AM:
---

I had a similar problem.

When I submitted a Hive on Spark task that joined two tables, one of them a 
big Parquet + Snappy table about 5G in size with 100 million rows of data, 
the executor was killed by k8s.

The configuration was:
    set spark.executor.memoryOverhead=6g;
    set spark.executor.memory=5g;
    set spark.executor.cores=4;
    set spark.executor.instances=2;
    set spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=4096m 
-Dio.netty.maxDirectMemory=104857600;

With this configuration, the JVM memory usage exceeds 11G. The executor has 
at most 5G of heap memory and at most 4G of direct ByteBuffers, which 
together come to around 9G, so 11 - 9 = 2G of memory is unaccounted for.

Can you tell me what the remaining 2G of non-heap memory is used for? Is 
there any way to limit it?


was (Author: shufan084):
I had a similar problem.

When I submitted a Hive on Spark task that joined two tables, one of them a 
big Parquet + Snappy table about 5G in size with 100 million rows of data, 
the executor was killed by k8s.

The configuration was:
    set spark.executor.memoryOverhead=6g;
    set spark.executor.memory=5g;
    set spark.executor.cores=4;
    set spark.executor.instances=2;
    set spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=4096m 
-Dio.netty.maxDirectMemory=104857600;

With this configuration, the JVM memory usage exceeds 11G. The executor has 
at most 5G of heap memory and at most 4G of direct ByteBuffers, which 
together come to around 9G, so 11 - 9 = 2G of memory is unaccounted for.

Can you tell me what the remaining 2G of memory is used for? Is there any 
way to limit it?

> Parquet+Snappy can cause significant off-heap memory usage
> --
>
> Key: SPARK-4073
> URL: https://issues.apache.org/jira/browse/SPARK-4073
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Patrick Wendell
>Priority: Critical
>
> The parquet snappy codec allocates off-heap buffers for decompression[1]. In 
> one case the observed size of these buffers was high enough to add several 
> GB of data to the overall virtual memory usage of the Spark executor process. 
> I don't understand enough about our use of Snappy to fully grok how much data 
> we would _expect_ to be present in these buffers at any given time, but I can 
> say a few things.
> 1. The dataset had individual rows that were fairly large, e.g. megabytes.
> 2. Direct buffers are not cleaned up until GC events, and overall there was 
> not much heap contention. So maybe they just weren't being cleaned.
> I opened PARQUET-118 to see if they can provide an option to use on-heap 
> buffers for decompression. In the meantime, we could consider changing the 
> default back to gzip, or we could do nothing (not sure how many other users 
> will hit this).
> [1] 
> https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/codec/SnappyDecompressor.java#L28
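
One of the mitigations mentioned above, writing with gzip instead of snappy, is a one-line configuration change in current Spark. A hedged sketch (session setup and output path are illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

object GzipParquetWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gzip-parquet").getOrCreate()
    // Write new Parquet data with gzip so later reads avoid Snappy's
    // direct-buffer decompression path.
    spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
    spark.range(10).write.parquet("/tmp/gzip-parquet-example")
  }
}
{code}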



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4073) Parquet+Snappy can cause significant off-heap memory usage

2023-02-08 Thread shufan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-4073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686239#comment-17686239
 ] 

shufan commented on SPARK-4073:
---

I had a similar problem.

When I submitted a Hive on Spark task that joined two tables, one of them a 
big Parquet + Snappy table about 5G in size with 100 million rows of data, 
the executor was killed by k8s.

The configuration was:
    set spark.executor.memoryOverhead=6g;
    set spark.executor.memory=5g;
    set spark.executor.cores=4;
    set spark.executor.instances=2;
    set spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=4096m 
-Dio.netty.maxDirectMemory=104857600;

With this configuration, the JVM memory usage exceeds 11G. The executor has 
at most 5G of heap memory and at most 4G of direct ByteBuffers, which 
together come to around 9G, so 11 - 9 = 2G of memory is unaccounted for.

Can you tell me what the remaining 2G of memory is used for? Is there any 
way to limit it?
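
Not an answer as to where the missing 2G goes, but one way to narrow it down: the JVM itself tracks its non-heap regions (Metaspace, code cache, compressed class space) and the direct/mapped buffer pools via MXBeans. A minimal probe, assuming it is run inside the executor JVM (for example from a task); whatever the process uses beyond what these report is untracked native allocation such as glibc arenas or codec scratch buffers:

{code:scala}
import java.lang.management.{BufferPoolMXBean, ManagementFactory, MemoryType}
import scala.collection.JavaConverters._

object NonHeapProbe {
  def main(args: Array[String]): Unit = {
    // Direct and mapped ByteBuffer pools (the part capped by -XX:MaxDirectMemorySize)
    ManagementFactory.getPlatformMXBeans(classOf[BufferPoolMXBean]).asScala
      .foreach(p => println(s"buffer pool ${p.getName}: used=${p.getMemoryUsed} bytes"))

    // JVM-tracked non-heap pools: Metaspace, Code Cache, Compressed Class Space
    ManagementFactory.getMemoryPoolMXBeans.asScala
      .filter(_.getType == MemoryType.NON_HEAP)
      .foreach(p => println(s"memory pool ${p.getName}: used=${p.getUsage.getUsed} bytes"))
  }
}
{code}

Thread stacks are not in these pools; the JVM's -XX:NativeMemoryTracking=summary flag together with `jcmd <pid> VM.native_memory` covers those and other internal allocations.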

> Parquet+Snappy can cause significant off-heap memory usage
> --
>
> Key: SPARK-4073
> URL: https://issues.apache.org/jira/browse/SPARK-4073
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Patrick Wendell
>Priority: Critical
>
> The parquet snappy codec allocates off-heap buffers for decompression[1]. In 
> one case the observed size of these buffers was high enough to add several 
> GB of data to the overall virtual memory usage of the Spark executor process. 
> I don't understand enough about our use of Snappy to fully grok how much data 
> we would _expect_ to be present in these buffers at any given time, but I can 
> say a few things.
> 1. The dataset had individual rows that were fairly large, e.g. megabytes.
> 2. Direct buffers are not cleaned up until GC events, and overall there was 
> not much heap contention. So maybe they just weren't being cleaned.
> I opened PARQUET-118 to see if they can provide an option to use on-heap 
> buffers for decompression. In the meantime, we could consider changing the 
> default back to gzip, or we could do nothing (not sure how many other users 
> will hit this).
> [1] 
> https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/codec/SnappyDecompressor.java#L28



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42388) Avoid unnecessary parquet footer reads when no filters in vectorized parquet reader

2023-02-08 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars updated SPARK-42388:
-
Summary: Avoid unnecessary parquet footer reads when no filters in 
vectorized parquet reader  (was: Avoid unnecessary parquet footer reads when no 
filters)

> Avoid unnecessary parquet footer reads when no filters in vectorized parquet 
> reader
> ---
>
> Key: SPARK-42388
> URL: https://issues.apache.org/jira/browse/SPARK-42388
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Mars
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42388) Avoid unnecessary parquet footer reads when no filters

2023-02-08 Thread Mars (Jira)
Mars created SPARK-42388:


 Summary: Avoid unnecessary parquet footer reads when no filters
 Key: SPARK-42388
 URL: https://issues.apache.org/jira/browse/SPARK-42388
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Mars






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42387) Avoid unnecessary parquet footer reads when no filters

2023-02-08 Thread Mars (Jira)
Mars created SPARK-42387:


 Summary: Avoid unnecessary parquet footer reads when no filters
 Key: SPARK-42387
 URL: https://issues.apache.org/jira/browse/SPARK-42387
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Mars






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42386) Rewrite HiveGenericUDF with Invoke

2023-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686222#comment-17686222
 ] 

Apache Spark commented on SPARK-42386:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39949

> Rewrite HiveGenericUDF with Invoke
> --
>
> Key: SPARK-42386
> URL: https://issues.apache.org/jira/browse/SPARK-42386
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42386) Rewrite HiveGenericUDF with Invoke

2023-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42386:


Assignee: (was: Apache Spark)

> Rewrite HiveGenericUDF with Invoke
> --
>
> Key: SPARK-42386
> URL: https://issues.apache.org/jira/browse/SPARK-42386
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42386) Rewrite HiveGenericUDF with Invoke

2023-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42386:


Assignee: Apache Spark

> Rewrite HiveGenericUDF with Invoke
> --
>
> Key: SPARK-42386
> URL: https://issues.apache.org/jira/browse/SPARK-42386
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42386) Rewrite HiveGenericUDF with Invoke

2023-02-08 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-42386:
---

 Summary: Rewrite HiveGenericUDF with Invoke
 Key: SPARK-42386
 URL: https://issues.apache.org/jira/browse/SPARK-42386
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42385) Upgrade RoaringBitmap to 0.9.39

2023-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686190#comment-17686190
 ] 

Apache Spark commented on SPARK-42385:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39948

> Upgrade RoaringBitmap to 0.9.39
> ---
>
> Key: SPARK-42385
> URL: https://issues.apache.org/jira/browse/SPARK-42385
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.39]
>  * ForAllInRange Fixes Yet Again by [@larsk-db|https://github.com/larsk-db] 
> in [#614|https://github.com/RoaringBitmap/RoaringBitmap/pull/614]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42385) Upgrade RoaringBitmap to 0.9.39

2023-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42385:


Assignee: (was: Apache Spark)

> Upgrade RoaringBitmap to 0.9.39
> ---
>
> Key: SPARK-42385
> URL: https://issues.apache.org/jira/browse/SPARK-42385
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.39]
>  * ForAllInRange Fixes Yet Again by [@larsk-db|https://github.com/larsk-db] 
> in [#614|https://github.com/RoaringBitmap/RoaringBitmap/pull/614]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42385) Upgrade RoaringBitmap to 0.9.39

2023-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42385:


Assignee: Apache Spark

> Upgrade RoaringBitmap to 0.9.39
> ---
>
> Key: SPARK-42385
> URL: https://issues.apache.org/jira/browse/SPARK-42385
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> [https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.39]
>  * ForAllInRange Fixes Yet Again by [@larsk-db|https://github.com/larsk-db] 
> in [#614|https://github.com/RoaringBitmap/RoaringBitmap/pull/614]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41715) Catch specific exceptions for both Spark Connect and PySpark

2023-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686163#comment-17686163
 ] 

Apache Spark commented on SPARK-41715:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39947

> Catch specific exceptions for both Spark Connect and PySpark
> 
>
> Key: SPARK-41715
> URL: https://issues.apache.org/jira/browse/SPARK-41715
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> In python/pyspark/sql/tests/test_catalog.py, we should catch more specific 
> exceptions such as AnalysisException. The test is shared between Spark 
> Connect and PySpark, so we need to figure out a way to share this change.
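
For illustration, the Scala-side analogue of the tightened assertion; the actual change targets the Python test, and the helper shape here is assumed:

{code:scala}
import org.apache.spark.sql.{AnalysisException, SparkSession}

// Fail unless the statement raises the specific AnalysisException,
// rather than accepting any Exception.
def expectAnalysisError(spark: SparkSession)(stmt: String): Unit =
  try {
    spark.sql(stmt)
    throw new AssertionError(s"expected an AnalysisException for: $stmt")
  } catch {
    case _: AnalysisException => // the precise error the test should pin down
  }
{code}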



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41715) Catch specific exceptions for both Spark Connect and PySpark

2023-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686164#comment-17686164
 ] 

Apache Spark commented on SPARK-41715:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39947

> Catch specific exceptions for both Spark Connect and PySpark
> 
>
> Key: SPARK-41715
> URL: https://issues.apache.org/jira/browse/SPARK-41715
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> In python/pyspark/sql/tests/test_catalog.py, we should catch more specific 
> exceptions such as AnalysisException. The test is shared between Spark 
> Connect and PySpark, so we need to figure out a way to share this change.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41715) Catch specific exceptions for both Spark Connect and PySpark

2023-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41715:


Assignee: (was: Apache Spark)

> Catch specific exceptions for both Spark Connect and PySpark
> 
>
> Key: SPARK-41715
> URL: https://issues.apache.org/jira/browse/SPARK-41715
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> In python/pyspark/sql/tests/test_catalog.py, we should catch more specific 
> exceptions such as AnalysisException. The test is shared between Spark 
> Connect and PySpark, so we need to figure out a way to share this change.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41715) Catch specific exceptions for both Spark Connect and PySpark

2023-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686162#comment-17686162
 ] 

Apache Spark commented on SPARK-41715:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39947

> Catch specific exceptions for both Spark Connect and PySpark
> 
>
> Key: SPARK-41715
> URL: https://issues.apache.org/jira/browse/SPARK-41715
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> In python/pyspark/sql/tests/test_catalog.py, we should catch more specific 
> exceptions such as AnalysisException. The test is shared between Spark 
> Connect and PySpark, so we need to figure out a way to share this change.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41715) Catch specific exceptions for both Spark Connect and PySpark

2023-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41715:


Assignee: Apache Spark

> Catch specific exceptions for both Spark Connect and PySpark
> 
>
> Key: SPARK-41715
> URL: https://issues.apache.org/jira/browse/SPARK-41715
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> In python/pyspark/sql/tests/test_catalog.py, we should catch more specific 
> exceptions such as AnalysisException. The test is shared in both Spark 
> Connect and PySpark so we should figure the way out to share it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40453) Improve error handling for GRPC server

2023-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40453:


Assignee: Apache Spark

> Improve error handling for GRPC server
> --
>
> Key: SPARK-40453
> URL: https://issues.apache.org/jira/browse/SPARK-40453
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.2.2
>Reporter: Martin Grund
>Assignee: Apache Spark
>Priority: Major
>
> Right now the errors are handled very rudimentarily and do not produce proper 
> GRPC errors. This issue addresses the work needed to return proper errors.
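
"Proper GRPC errors" here means mapping internal failures onto grpc-java status codes instead of letting raw exceptions escape the handler. A minimal sketch, assuming the grpc-java API and an illustrative exception mapping (not the actual Spark Connect error handler):

{code:scala}
import io.grpc.Status
import io.grpc.stub.StreamObserver

// Run a handler body and surface failures as structured GRPC statuses.
def handleWithStatus[T](observer: StreamObserver[T])(body: => Unit): Unit =
  try body catch {
    case e: IllegalArgumentException =>
      observer.onError(
        Status.INVALID_ARGUMENT.withDescription(e.getMessage).asRuntimeException())
    case e: Exception =>
      observer.onError(
        Status.INTERNAL.withDescription(e.getMessage).asRuntimeException())
  }
{code}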



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40453) Improve error handling for GRPC server

2023-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686161#comment-17686161
 ] 

Apache Spark commented on SPARK-40453:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39947

> Improve error handling for GRPC server
> --
>
> Key: SPARK-40453
> URL: https://issues.apache.org/jira/browse/SPARK-40453
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.2.2
>Reporter: Martin Grund
>Priority: Major
>
> Right now the errors are handled very rudimentarily and do not produce proper 
> GRPC errors. This issue addresses the work needed to return proper errors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40453) Improve error handling for GRPC server

2023-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40453:


Assignee: (was: Apache Spark)

> Improve error handling for GRPC server
> --
>
> Key: SPARK-40453
> URL: https://issues.apache.org/jira/browse/SPARK-40453
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.2.2
>Reporter: Martin Grund
>Priority: Major
>
> Right now the errors are handled very rudimentarily and do not produce proper 
> GRPC errors. This issue addresses the work needed to return proper errors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40453) Improve error handling for GRPC server

2023-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686160#comment-17686160
 ] 

Apache Spark commented on SPARK-40453:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39947

> Improve error handling for GRPC server
> --
>
> Key: SPARK-40453
> URL: https://issues.apache.org/jira/browse/SPARK-40453
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.2.2
>Reporter: Martin Grund
>Priority: Major
>
> Right now the errors are handled very rudimentarily and do not produce proper 
> GRPC errors. This issue addresses the work needed to return proper errors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42385) Upgrade RoaringBitmap to 0.9.39

2023-02-08 Thread Yang Jie (Jira)
Yang Jie created SPARK-42385:


 Summary: Upgrade RoaringBitmap to 0.9.39
 Key: SPARK-42385
 URL: https://issues.apache.org/jira/browse/SPARK-42385
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Yang Jie


[https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.39]
 * ForAllInRange Fixes Yet Again by [@larsk-db|https://github.com/larsk-db] in 
[#614|https://github.com/RoaringBitmap/RoaringBitmap/pull/614]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42355) Upgrade some maven-plugins

2023-02-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-42355:


Assignee: Yang Jie

> Upgrade some maven-plugins
> --
>
> Key: SPARK-42355
> URL: https://issues.apache.org/jira/browse/SPARK-42355
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> [INFO]   maven-checkstyle-plugin  3.2.0 -> 3.2.1
> [INFO]   maven-clean-plugin  3.1.0 -> 3.2.0
> [INFO]   maven-dependency-plugin  3.3.0 -> 3.5.0
> [INFO]   maven-enforcer-plugin  3.0.0-M2 -> 3.2.1
> [INFO]   maven-source-plugin  3.1.0 -> 3.2.1
> [INFO]   maven-surefire-plugin  3.0.0-M7 -> 3.0.0-M8
> [INFO]   maven-jar-plugin  3.2.2 -> 3.3.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42355) Upgrade some maven-plugins

2023-02-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42355.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 39899
[https://github.com/apache/spark/pull/39899]

> Upgrade some maven-plugins
> --
>
> Key: SPARK-42355
> URL: https://issues.apache.org/jira/browse/SPARK-42355
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0
>
>
> [INFO]   maven-checkstyle-plugin  3.2.0 -> 3.2.1
> [INFO]   maven-clean-plugin  3.1.0 -> 3.2.0
> [INFO]   maven-dependency-plugin  3.3.0 -> 3.5.0
> [INFO]   maven-enforcer-plugin  3.0.0-M2 -> 3.2.1
> [INFO]   maven-source-plugin  3.1.0 -> 3.2.1
> [INFO]   maven-surefire-plugin  3.0.0-M7 -> 3.0.0-M8
> [INFO]   maven-jar-plugin  3.2.2 -> 3.3.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42350) Replace `get().getOrElse` with `getOrElse`

2023-02-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42350.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 39893
[https://github.com/apache/spark/pull/39893]

> Replace `get().getOrElse` with `getOrElse`
> ---
>
> Key: SPARK-42350
> URL: https://issues.apache.org/jira/browse/SPARK-42350
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42350) Replace `get().getOrElse` with `getOrElse`

2023-02-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-42350:


Assignee: Yang Jie

> Replace `get().getOrElse` with `getOrElse`
> ---
>
> Key: SPARK-42350
> URL: https://issues.apache.org/jira/browse/SPARK-42350
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40770) Improved error messages for applyInPandas for schema mismatch

2023-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40770:
-
Fix Version/s: 3.5.0
   (was: 3.4.0)

> Improved error messages for applyInPandas for schema mismatch
> -
>
> Key: SPARK-40770
> URL: https://issues.apache.org/jira/browse/SPARK-40770
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Minor
> Fix For: 3.5.0
>
>
> Error messages raised by `applyInPandas` are very generic or useless when 
> used with complex schemata:
> {code}
> KeyError: 'val'
> {code}
> {code}
> RuntimeError: Number of columns of the returned pandas.DataFrame doesn't 
> match specified schema. Expected: 2 Actual: 3
> {code}
> {code}
> java.lang.IllegalArgumentException: not all nodes and buffers were consumed. 
> nodes: [ArrowFieldNode [length=3, nullCount=0]] buffers: [ArrowBuf[304], 
> address:139860828549160, length:0, ArrowBuf[305], address:139860828549160, 
> length:24]
> {code}
> {code}
> pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
> {code}
> {code}
> pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to 
> convert to double
> {code}
> These should be improved by adding column names or descriptive messages (in 
> the same order as above):
> {code}
> RuntimeError: Column names of the returned pandas.DataFrame do not match 
> specified schema.  Missing: val  Unexpected: v  Schema: id, val
> {code}
> {code}
> RuntimeError: Column names of the returned pandas.DataFrame do not match 
> specified schema.  Missing: val  Unexpected: foo, v  Schema: id, val
> {code}
> {code}
> RuntimeError: Column names of the returned pandas.DataFrame do not match 
> specified schema.  Unexpected: v  Schema: id, id
> {code}
> {code}
> pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
> The above exception was the direct cause of the following exception:
> TypeError: Exception thrown when converting pandas.Series (int64) with name 
> 'val' to Arrow Array (string).
> {code}
> {code}
> pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to 
> convert to double
> The above exception was the direct cause of the following exception:
> ValueError: Exception thrown when converting pandas.Series (object) with name 
> 'val' to Arrow Array (double).
> {code}
> When no column names are given, the following error was returned:
> {code}
> RuntimeError: Number of columns of the returned pandas.DataFrame doesn't 
> match specified schema. Expected: 2 Actual: 3
> {code}
> Where it should contain the output schema:
> {code}
> RuntimeError: Number of columns of the returned pandas.DataFrame doesn't 
> match specified schema.  Expected: 2  Actual: 3  Schema: id, val
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40770) Improved error messages for applyInPandas for schema mismatch

2023-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40770:


Assignee: Enrico Minack

> Improved error messages for applyInPandas for schema mismatch
> -
>
> Key: SPARK-40770
> URL: https://issues.apache.org/jira/browse/SPARK-40770
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Minor
>
> Error messages raised by `applyInPandas` are very generic or useless when 
> used with complex schemata:
> {code}
> KeyError: 'val'
> {code}
> {code}
> RuntimeError: Number of columns of the returned pandas.DataFrame doesn't 
> match specified schema. Expected: 2 Actual: 3
> {code}
> {code}
> java.lang.IllegalArgumentException: not all nodes and buffers were consumed. 
> nodes: [ArrowFieldNode [length=3, nullCount=0]] buffers: [ArrowBuf[304], 
> address:139860828549160, length:0, ArrowBuf[305], address:139860828549160, 
> length:24]
> {code}
> {code}
> pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
> {code}
> {code}
> pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to 
> convert to double
> {code}
> These should be improved by adding column names or descriptive messages (in 
> the same order as above):
> {code}
> RuntimeError: Column names of the returned pandas.DataFrame do not match 
> specified schema.  Missing: val  Unexpected: v  Schema: id, val
> {code}
> {code}
> RuntimeError: Column names of the returned pandas.DataFrame do not match 
> specified schema.  Missing: val  Unexpected: foo, v  Schema: id, val
> {code}
> {code}
> RuntimeError: Column names of the returned pandas.DataFrame do not match 
> specified schema.  Unexpected: v  Schema: id, id
> {code}
> {code}
> pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
> The above exception was the direct cause of the following exception:
> TypeError: Exception thrown when converting pandas.Series (int64) with name 
> 'val' to Arrow Array (string).
> {code}
> {code}
> pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to 
> convert to double
> The above exception was the direct cause of the following exception:
> ValueError: Exception thrown when converting pandas.Series (object) with name 
> 'val' to Arrow Array (double).
> {code}
> When no column names are given, the following error was returned:
> {code}
> RuntimeError: Number of columns of the returned pandas.DataFrame doesn't 
> match specified schema. Expected: 2 Actual: 3
> {code}
> Where it should contain the output schema:
> {code}
> RuntimeError: Number of columns of the returned pandas.DataFrame doesn't 
> match specified schema.  Expected: 2  Actual: 3  Schema: id, val
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40770) Improved error messages for applyInPandas for schema mismatch

2023-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40770.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38223
[https://github.com/apache/spark/pull/38223]

> Improved error messages for applyInPandas for schema mismatch
> -
>
> Key: SPARK-40770
> URL: https://issues.apache.org/jira/browse/SPARK-40770
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Minor
> Fix For: 3.4.0
>
>
> Error messages raised by `applyInPandas` are very generic or useless when 
> used with complex schemata:
> {code}
> KeyError: 'val'
> {code}
> {code}
> RuntimeError: Number of columns of the returned pandas.DataFrame doesn't 
> match specified schema. Expected: 2 Actual: 3
> {code}
> {code}
> java.lang.IllegalArgumentException: not all nodes and buffers were consumed. 
> nodes: [ArrowFieldNode [length=3, nullCount=0]] buffers: [ArrowBuf[304], 
> address:139860828549160, length:0, ArrowBuf[305], address:139860828549160, 
> length:24]
> {code}
> {code}
> pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
> {code}
> {code}
> pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to 
> convert to double
> {code}
> These should be improved by adding column names or descriptive messages (in 
> the same order as above):
> {code}
> RuntimeError: Column names of the returned pandas.DataFrame do not match 
> specified schema.  Missing: val  Unexpected: v  Schema: id, val
> {code}
> {code}
> RuntimeError: Column names of the returned pandas.DataFrame do not match 
> specified schema.  Missing: val  Unexpected: foo, v  Schema: id, val
> {code}
> {code}
> RuntimeError: Column names of the returned pandas.DataFrame do not match 
> specified schema.  Unexpected: v  Schema: id, id
> {code}
> {code}
> pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
> The above exception was the direct cause of the following exception:
> TypeError: Exception thrown when converting pandas.Series (int64) with name 
> 'val' to Arrow Array (string).
> {code}
> {code}
> pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to 
> convert to double
> The above exception was the direct cause of the following exception:
> ValueError: Exception thrown when converting pandas.Series (object) with name 
> 'val' to Arrow Array (double).
> {code}
> When no column names are given, the following error was returned:
> {code}
> RuntimeError: Number of columns of the returned pandas.DataFrame doesn't 
> match specified schema. Expected: 2 Actual: 3
> {code}
> Where it should contain the output schema:
> {code}
> RuntimeError: Number of columns of the returned pandas.DataFrame doesn't 
> match specified schema.  Expected: 2  Actual: 3  Schema: id, val
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42379) Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists

2023-02-08 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-42379.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 39936
[https://github.com/apache/spark/pull/39936]

> Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists
> 
>
> Key: SPARK-42379
> URL: https://issues.apache.org/jira/browse/SPARK-42379
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.5.0
>
>
> Other methods in FileSystemBasedCheckpointFileManager already use 
> FileSystem.exists wherever they check for the existence of a path. Use 
> FileSystem.exists in FileSystemBasedCheckpointFileManager.exists as well, 
> to be consistent with the other methods.
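
For illustration, the consistent form simply delegates the existence check to the Hadoop API; a minimal sketch (method shape assumed, not the actual patch):

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Match the other methods: ask the FileSystem directly whether the path exists.
def exists(fs: FileSystem, path: Path): Boolean = fs.exists(path)
{code}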



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42379) Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists

2023-02-08 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-42379:


Assignee: Jungtaek Lim

> Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists
> 
>
> Key: SPARK-42379
> URL: https://issues.apache.org/jira/browse/SPARK-42379
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> Other methods in FileSystemBasedCheckpointFileManager already use 
> FileSystem.exists wherever they check for the existence of a path. Use 
> FileSystem.exists in FileSystemBasedCheckpointFileManager.exists as well, 
> to be consistent with the other methods.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41962) Update the import order of scala package in class SpecificParquetRecordReaderBase

2023-02-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-41962:
--
Fix Version/s: (was: 3.2.4)

> Update the import order of scala package in class 
> SpecificParquetRecordReaderBase
> -
>
> Key: SPARK-41962
> URL: https://issues.apache.org/jira/browse/SPARK-41962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: shuyouZZ
>Assignee: shuyouZZ
>Priority: Major
> Fix For: 3.3.2, 3.4.0
>
>
> There is a checkstyle issue in class {{SpecificParquetRecordReaderBase}}: 
> the import order of the scala package is incorrect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42335.
--
Fix Version/s: 3.5.0
   (was: 3.4.0)
   Resolution: Fixed

Issue resolved by pull request 39878
[https://github.com/apache/spark/pull/39878]

> Pass the comment option through to univocity if users set it explicitly in 
> CSV dataSource
> -
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Assignee: Wei Guo
>Priority: Minor
> Fix For: 3.5.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], univocity-parsers, used by 
> the CSV dataSource, was upgraded from 2.8.3 to 2.9.0 in order to fix some bugs. 
> The upgrade also brought in a new univocity-parsers feature that quotes values 
> in the first column when they start with the comment character. This was a 
> breaking change for downstream users that take the whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0, the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> Users can't set the comment option to '\u0000' to keep the previous behavior, 
> because of the newly added `isCommentSet` check logic, shown as follows:
> {code:java}
> val isCommentSet = this.comment != '\u0000'
> def asWriterSettings: CsvWriterSettings = {
>   // other code
>   if (isCommentSet) {
> format.setComment(comment)
>   }
>   // other code
> }
>  {code}
> It's better to pass the comment option through to univocity if users set it 
> explicitly in CSV dataSource.
>  
> After this change, the behavior is as follows:
> |id|code|2.4 and before|3.0 and after|this update|remark|
> |1|Seq("#abc", "\u0000def", "xyz").toDF()
> .write.{color:#57d9a3}option("comment", "\u0000"){color}.csv(path)|#abc
> *def*
> xyz|{color:#4c9aff}"#abc"{color}
> {color:#4c9aff}*def*{color}
> {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
> {color:#4c9aff}*"def"*{color}
> {color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a slight 
> difference from 3.0{color}|
> |2|Seq("#abc", "\u0000def", "xyz").toDF()
> .write{color:#57d9a3}.option("comment", "#"){color}.csv(path)|#abc
> *def*
> xyz|"#abc"
> *def*
> xyz|"#abc"
> *def*
> xyz|the same|
> |3|Seq("#abc", "\u0000def", "xyz").toDF()
> .write.csv(path)|#abc
> *def*
> xyz|"#abc"
> *def*
> xyz|"#abc"
> *def*
> xyz|default behavior: the same|
> |4|{_}Seq{_}("#abc", "\u0000def", "xyz").toDF().write.text(path)
> spark.read.{color:#57d9a3}option("comment", "\u0000"){color}.csv(path)|#abc
> xyz|{color:#4c9aff}#abc{color}
> {color:#4c9aff}\u0000def{color}
> {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
> {color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a slight 
> difference from 3.0{color}|
> |5|{_}Seq{_}("#abc", "\u0000def", "xyz").toDF().write.text(path)
> spark.read.{color:#57d9a3}option("comment", "#"){color}.csv(path)|\u0000def
> xyz|\u0000def
> xyz|\u0000def
> xyz|the same|
> |6|{_}Seq{_}("#abc", "\u0000def", "xyz").toDF().write.text(path)
> spark.read.csv(path)|#abc
> xyz|#abc
> \u0000def
> xyz|#abc
> \u0000def
> xyz|default behavior: the same|
> 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-42335:


Assignee: Wei Guo

> Pass the comment option through to univocity if users set it explicitly in 
> CSV dataSource
> -
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Assignee: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], univocity-parsers, used by 
> the CSV dataSource, was upgraded from 2.8.3 to 2.9.0 in order to fix some bugs. 
> The upgrade also brought in a new univocity-parsers feature that quotes values 
> in the first column when they start with the comment character. This was a 
> breaking change for downstream users that take the whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0, the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> Users can't set the comment option to '\u0000' to keep the previous behavior, 
> because of the newly added `isCommentSet` check logic, shown as follows:
> {code:java}
> val isCommentSet = this.comment != '\u0000'
> def asWriterSettings: CsvWriterSettings = {
>   // other code
>   if (isCommentSet) {
> format.setComment(comment)
>   }
>   // other code
> }
>  {code}
> It's better to pass the comment option through to univocity if users set it 
> explicitly in CSV dataSource.
>  
> After this change, the behavior is as follows:
> |id|code|2.4 and before|3.0 and after|this update|remark|
> |1|Seq("#abc", "\u0000def", "xyz").toDF()
> .write.{color:#57d9a3}option("comment", "\u0000"){color}.csv(path)|#abc
> *def*
> xyz|{color:#4c9aff}"#abc"{color}
> {color:#4c9aff}*def*{color}
> {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
> {color:#4c9aff}*"def"*{color}
> {color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a slight 
> difference from 3.0{color}|
> |2|Seq("#abc", "\u0000def", "xyz").toDF()
> .write{color:#57d9a3}.option("comment", "#"){color}.csv(path)|#abc
> *def*
> xyz|"#abc"
> *def*
> xyz|"#abc"
> *def*
> xyz|the same|
> |3|Seq("#abc", "\u0000def", "xyz").toDF()
> .write.csv(path)|#abc
> *def*
> xyz|"#abc"
> *def*
> xyz|"#abc"
> *def*
> xyz|default behavior: the same|
> |4|{_}Seq{_}("#abc", "\u0000def", "xyz").toDF().write.text(path)
> spark.read.{color:#57d9a3}option("comment", "\u0000"){color}.csv(path)|#abc
> xyz|{color:#4c9aff}#abc{color}
> {color:#4c9aff}\u0000def{color}
> {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
> {color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a slight 
> difference from 3.0{color}|
> |5|{_}Seq{_}("#abc", "\u0000def", "xyz").toDF().write.text(path)
> spark.read.{color:#57d9a3}option("comment", "#"){color}.csv(path)|\u0000def
> xyz|\u0000def
> xyz|\u0000def
> xyz|the same|
> |6|{_}Seq{_}("#abc", "\u0000def", "xyz").toDF().write.text(path)
> spark.read.csv(path)|#abc
> xyz|#abc
> \u0000def
> xyz|#abc
> \u0000def
> xyz|default behavior: the same|
> 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41695) Upgrade netty to 4.1.86.Final

2023-02-08 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-41695.
-
Resolution: Won't Fix

> Upgrade netty to 4.1.86.Final
> -
>
> Key: SPARK-41695
> URL: https://issues.apache.org/jira/browse/SPARK-41695
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.3.1
>Reporter: Tobias Stadler
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41696) Upgrade Hadoop to 3.3.4

2023-02-08 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-41696.
-
Resolution: Won't Fix

> Upgrade Hadoop to 3.3.4
> ---
>
> Key: SPARK-41696
> URL: https://issues.apache.org/jira/browse/SPARK-41696
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.3.1
>Reporter: Tobias Stadler
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug

2023-02-08 Thread Ritika Maheshwari (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686081#comment-17686081
 ] 

Ritika Maheshwari commented on SPARK-42346:
---

Yes, that caused the error to appear. Thanks.

> distinct(count colname) with UNION ALL causes query analyzer bug
> 
>
> Key: SPARK-42346
> URL: https://issues.apache.org/jira/browse/SPARK-42346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
>Reporter: Robin
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.3.2, 3.4.0, 3.5.0
>
>
> If you combine a UNION ALL with a count(distinct colname), you get a query 
> analyzer bug.
>  
> This behaviour was introduced in 3.3.0. The bug was not present in 3.2.1.
>  
> Here is a reprex in PySpark:
> {{import pandas as pd}}
> {{df_pd = pd.DataFrame([}}
> {{    \{'surname': 'a', 'first_name': 'b'}}}
> {{])}}
> {{df_spark = spark.createDataFrame(df_pd)}}
> {{df_spark.createOrReplaceTempView("input_table")}}
> {{sql = """}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT first_name) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table}}
> {{UNION ALL}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT surname) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table """}}
> {{spark.sql(sql).toPandas()}}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42318) Assign name to _LEGACY_ERROR_TEMP_2125

2023-02-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-42318.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39891
[https://github.com/apache/spark/pull/39891]

> Assign name to _LEGACY_ERROR_TEMP_2125
> --
>
> Key: SPARK-42318
> URL: https://issues.apache.org/jira/browse/SPARK-42318
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42318) Assign name to _LEGACY_ERROR_TEMP_2125

2023-02-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-42318:


Assignee: Haejoon Lee

> Assign name to _LEGACY_ERROR_TEMP_2125
> --
>
> Key: SPARK-42318
> URL: https://issues.apache.org/jira/browse/SPARK-42318
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42319) Assign name to _LEGACY_ERROR_TEMP_2123

2023-02-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-42319.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39891
[https://github.com/apache/spark/pull/39891]

> Assign name to _LEGACY_ERROR_TEMP_2123
> --
>
> Key: SPARK-42319
> URL: https://issues.apache.org/jira/browse/SPARK-42319
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42319) Assign name to _LEGACY_ERROR_TEMP_2123

2023-02-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-42319:


Assignee: Haejoon Lee

> Assign name to _LEGACY_ERROR_TEMP_2123
> --
>
> Key: SPARK-42319
> URL: https://issues.apache.org/jira/browse/SPARK-42319
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42310) Assign name to _LEGACY_ERROR_TEMP_1289

2023-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42310:


Assignee: (was: Apache Spark)

> Assign name to _LEGACY_ERROR_TEMP_1289
> --
>
> Key: SPARK-42310
> URL: https://issues.apache.org/jira/browse/SPARK-42310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42310) Assign name to _LEGACY_ERROR_TEMP_1289

2023-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42310:


Assignee: Apache Spark

> Assign name to _LEGACY_ERROR_TEMP_1289
> --
>
> Key: SPARK-42310
> URL: https://issues.apache.org/jira/browse/SPARK-42310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42310) Assign name to _LEGACY_ERROR_TEMP_1289

2023-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686036#comment-17686036
 ] 

Apache Spark commented on SPARK-42310:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39946

> Assign name to _LEGACY_ERROR_TEMP_1289
> --
>
> Key: SPARK-42310
> URL: https://issues.apache.org/jira/browse/SPARK-42310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42314) Assign name to _LEGACY_ERROR_TEMP_2127

2023-02-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-42314:


Assignee: Haejoon Lee

> Assign name to _LEGACY_ERROR_TEMP_2127
> --
>
> Key: SPARK-42314
> URL: https://issues.apache.org/jira/browse/SPARK-42314
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42314) Assign name to _LEGACY_ERROR_TEMP_2127

2023-02-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-42314.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39890
[https://github.com/apache/spark/pull/39890]

> Assign name to _LEGACY_ERROR_TEMP_2127
> --
>
> Key: SPARK-42314
> URL: https://issues.apache.org/jira/browse/SPARK-42314
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42384) Mask function's generated code does not handle null input

2023-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685995#comment-17685995
 ] 

Apache Spark commented on SPARK-42384:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/39945

> Mask function's generated code does not handle null input
> -
>
> Key: SPARK-42384
> URL: https://issues.apache.org/jira/browse/SPARK-42384
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (null),
> ('AbCD123-@$#')
> as data(col1);
> cache table v1;
> select mask(col1) from v1;
> {noformat}
> This query results in a {{NullPointerException}}:
> {noformat}
> 23/02/07 16:36:06 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
> {noformat}
> The generated code calls {{UnsafeWriter.write(0, value_0)}} regardless of 
> whether {{Mask.transformInput}} returns null or not. The 
> {{UnsafeWriter.write}} method for {{UTF8String}} does not expect a null 
> pointer.
> {noformat}
> /* 031 */ boolean isNull_1 = i.isNullAt(0);
> /* 032 */ UTF8String value_1 = isNull_1 ?
> /* 033 */ null : (i.getUTF8String(0));
> /* 034 */
> /* 035 */
> /* 036 */
> /* 037 */
> /* 038 */ UTF8String value_0 = null;
> /* 039 */ value_0 = 
> org.apache.spark.sql.catalyst.expressions.Mask.transformInput(value_1, 
> ((UTF8String) references[0] /* literal */), ((UTF8String) references[1] /* 
> literal */), ((UTF8String) references[2] /* literal */), ((UTF8String) 
> references[3] /* literal */));;
> /* 040 */ if (false) {
> /* 041 */   mutableStateArray_0[0].setNullAt(0);
> /* 042 */ } else {
> /* 043 */   mutableStateArray_0[0].write(0, value_0);
> /* 044 */ }
> /* 045 */ return (mutableStateArray_0[0].getRow());
> /* 046 */   }
> {noformat}
> The bug is not exercised by a literal null input value, since there appears 
> to be some optimization that simply replaces the entire function call with a 
> null literal:
> {noformat}
> spark-sql> explain SELECT mask(NULL);
> == Physical Plan ==
> *(1) Project [null AS mask(NULL, X, x, n, NULL)#47]
> +- *(1) Scan OneRowRelation[]
> Time taken: 0.026 seconds, Fetched 1 row(s)
> spark-sql> SELECT mask(NULL);
> NULL
> Time taken: 0.042 seconds, Fetched 1 row(s)
> spark-sql> 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42384) Mask function's generated code does not handle null input

2023-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42384:


Assignee: (was: Apache Spark)

> Mask function's generated code does not handle null input
> -
>
> Key: SPARK-42384
> URL: https://issues.apache.org/jira/browse/SPARK-42384
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (null),
> ('AbCD123-@$#')
> as data(col1);
> cache table v1;
> select mask(col1) from v1;
> {noformat}
> This query results in a {{NullPointerException}}:
> {noformat}
> 23/02/07 16:36:06 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
> {noformat}
> The generated code calls {{UnsafeWriter.write(0, value_0)}} regardless of 
> whether {{Mask.transformInput}} returns null or not. The 
> {{UnsafeWriter.write}} method for {{UTF8String}} does not expect a null 
> pointer.
> {noformat}
> /* 031 */ boolean isNull_1 = i.isNullAt(0);
> /* 032 */ UTF8String value_1 = isNull_1 ?
> /* 033 */ null : (i.getUTF8String(0));
> /* 034 */
> /* 035 */
> /* 036 */
> /* 037 */
> /* 038 */ UTF8String value_0 = null;
> /* 039 */ value_0 = 
> org.apache.spark.sql.catalyst.expressions.Mask.transformInput(value_1, 
> ((UTF8String) references[0] /* literal */), ((UTF8String) references[1] /* 
> literal */), ((UTF8String) references[2] /* literal */), ((UTF8String) 
> references[3] /* literal */));;
> /* 040 */ if (false) {
> /* 041 */   mutableStateArray_0[0].setNullAt(0);
> /* 042 */ } else {
> /* 043 */   mutableStateArray_0[0].write(0, value_0);
> /* 044 */ }
> /* 045 */ return (mutableStateArray_0[0].getRow());
> /* 046 */   }
> {noformat}
> The bug is not exercised by a literal null input value, since there appears 
> to be some optimization that simply replaces the entire function call with a 
> null literal:
> {noformat}
> spark-sql> explain SELECT mask(NULL);
> == Physical Plan ==
> *(1) Project [null AS mask(NULL, X, x, n, NULL)#47]
> +- *(1) Scan OneRowRelation[]
> Time taken: 0.026 seconds, Fetched 1 row(s)
> spark-sql> SELECT mask(NULL);
> NULL
> Time taken: 0.042 seconds, Fetched 1 row(s)
> spark-sql> 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42384) Mask function's generated code does not handle null input

2023-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42384:


Assignee: Apache Spark

> Mask function's generated code does not handle null input
> -
>
> Key: SPARK-42384
> URL: https://issues.apache.org/jira/browse/SPARK-42384
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>
> Example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (null),
> ('AbCD123-@$#')
> as data(col1);
> cache table v1;
> select mask(col1) from v1;
> {noformat}
> This query results in a {{NullPointerException}}:
> {noformat}
> 23/02/07 16:36:06 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
> {noformat}
> The generated code calls {{UnsafeWriter.write(0, value_0)}} regardless of 
> whether {{Mask.transformInput}} returns null or not. The 
> {{UnsafeWriter.write}} method for {{UTF8String}} does not expect a null 
> pointer.
> {noformat}
> /* 031 */ boolean isNull_1 = i.isNullAt(0);
> /* 032 */ UTF8String value_1 = isNull_1 ?
> /* 033 */ null : (i.getUTF8String(0));
> /* 034 */
> /* 035 */
> /* 036 */
> /* 037 */
> /* 038 */ UTF8String value_0 = null;
> /* 039 */ value_0 = 
> org.apache.spark.sql.catalyst.expressions.Mask.transformInput(value_1, 
> ((UTF8String) references[0] /* literal */), ((UTF8String) references[1] /* 
> literal */), ((UTF8String) references[2] /* literal */), ((UTF8String) 
> references[3] /* literal */));;
> /* 040 */ if (false) {
> /* 041 */   mutableStateArray_0[0].setNullAt(0);
> /* 042 */ } else {
> /* 043 */   mutableStateArray_0[0].write(0, value_0);
> /* 044 */ }
> /* 045 */ return (mutableStateArray_0[0].getRow());
> /* 046 */   }
> {noformat}
> The bug is not exercised by a literal null input value, since there appears 
> to be some optimization that simply replaces the entire function call with a 
> null literal:
> {noformat}
> spark-sql> explain SELECT mask(NULL);
> == Physical Plan ==
> *(1) Project [null AS mask(NULL, X, x, n, NULL)#47]
> +- *(1) Scan OneRowRelation[]
> Time taken: 0.026 seconds, Fetched 1 row(s)
> spark-sql> SELECT mask(NULL);
> NULL
> Time taken: 0.042 seconds, Fetched 1 row(s)
> spark-sql> 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42384) Mask function's generated code does not handle null input

2023-02-08 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-42384:
--
Affects Version/s: 3.4.0

> Mask function's generated code does not handle null input
> -
>
> Key: SPARK-42384
> URL: https://issues.apache.org/jira/browse/SPARK-42384
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (null),
> ('AbCD123-@$#')
> as data(col1);
> cache table v1;
> select mask(col1) from v1;
> {noformat}
> This query results in a {{NullPointerException}}:
> {noformat}
> 23/02/07 16:36:06 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
> {noformat}
> The generated code calls {{UnsafeWriter.write(0, value_0)}} regardless of 
> whether {{Mask.transformInput}} returns null or not. The 
> {{UnsafeWriter.write}} method for {{UTF8String}} does not expect a null 
> pointer.
> {noformat}
> /* 031 */ boolean isNull_1 = i.isNullAt(0);
> /* 032 */ UTF8String value_1 = isNull_1 ?
> /* 033 */ null : (i.getUTF8String(0));
> /* 034 */
> /* 035 */
> /* 036 */
> /* 037 */
> /* 038 */ UTF8String value_0 = null;
> /* 039 */ value_0 = 
> org.apache.spark.sql.catalyst.expressions.Mask.transformInput(value_1, 
> ((UTF8String) references[0] /* literal */), ((UTF8String) references[1] /* 
> literal */), ((UTF8String) references[2] /* literal */), ((UTF8String) 
> references[3] /* literal */));;
> /* 040 */ if (false) {
> /* 041 */   mutableStateArray_0[0].setNullAt(0);
> /* 042 */ } else {
> /* 043 */   mutableStateArray_0[0].write(0, value_0);
> /* 044 */ }
> /* 045 */ return (mutableStateArray_0[0].getRow());
> /* 046 */   }
> {noformat}
> The bug is not exercised by a literal null input value, since there appears 
> to be some optimization that simply replaces the entire function call with a 
> null literal:
> {noformat}
> spark-sql> explain SELECT mask(NULL);
> == Physical Plan ==
> *(1) Project [null AS mask(NULL, X, x, n, NULL)#47]
> +- *(1) Scan OneRowRelation[]
> Time taken: 0.026 seconds, Fetched 1 row(s)
> spark-sql> SELECT mask(NULL);
> NULL
> Time taken: 0.042 seconds, Fetched 1 row(s)
> spark-sql> 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42384) Mask function's generated code does not handle null input

2023-02-08 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-42384:
-

 Summary: Mask function's generated code does not handle null input
 Key: SPARK-42384
 URL: https://issues.apache.org/jira/browse/SPARK-42384
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


Example:
{noformat}
create or replace temp view v1 as
select * from values
(null),
('AbCD123-@$#')
as data(col1);

cache table v1;

select mask(col1) from v1;
{noformat}
This query results in a {{NullPointerException}}:
{noformat}
23/02/07 16:36:06 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
{noformat}
The generated code calls {{UnsafeWriter.write(0, value_0)}} regardless of 
whether {{Mask.transformInput}} returns null or not. The {{UnsafeWriter.write}} 
method for {{UTF8String}} does not expect a null pointer.
{noformat}
/* 031 */ boolean isNull_1 = i.isNullAt(0);
/* 032 */ UTF8String value_1 = isNull_1 ?
/* 033 */ null : (i.getUTF8String(0));
/* 034 */
/* 035 */
/* 036 */
/* 037 */
/* 038 */ UTF8String value_0 = null;
/* 039 */ value_0 = 
org.apache.spark.sql.catalyst.expressions.Mask.transformInput(value_1, 
((UTF8String) references[0] /* literal */), ((UTF8String) references[1] /* 
literal */), ((UTF8String) references[2] /* literal */), ((UTF8String) 
references[3] /* literal */));;
/* 040 */ if (false) {
/* 041 */   mutableStateArray_0[0].setNullAt(0);
/* 042 */ } else {
/* 043 */   mutableStateArray_0[0].write(0, value_0);
/* 044 */ }
/* 045 */ return (mutableStateArray_0[0].getRow());
/* 046 */   }
{noformat}

The bug is not exercised by a literal null input value, since there appears to 
be some optimization that simply replaces the entire function call with a null 
literal:
{noformat}
spark-sql> explain SELECT mask(NULL);
== Physical Plan ==
*(1) Project [null AS mask(NULL, X, x, n, NULL)#47]
+- *(1) Scan OneRowRelation[]

Time taken: 0.026 seconds, Fetched 1 row(s)
spark-sql> SELECT mask(NULL);
NULL
Time taken: 0.042 seconds, Fetched 1 row(s)
spark-sql> 
{noformat}
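
For context, a minimal sketch of the failing contract (hedged: it assumes a 
spark-catalyst build where this reproduces, and calls Mask.transformInput with 
the same arguments the generated code above uses):

{code:java}
import org.apache.spark.sql.catalyst.expressions.Mask
import org.apache.spark.unsafe.types.UTF8String

// transformInput propagates a null input to a null result, so the generated
// writer must branch on that result (setNullAt) rather than on the constant
// `false` guard shown above.
val masked: UTF8String = Mask.transformInput(
  null,                          // the null column value from the view
  UTF8String.fromString("X"),    // upper-case replacement
  UTF8String.fromString("x"),    // lower-case replacement
  UTF8String.fromString("n"),    // digit replacement
  null)                          // other-char replacement (NULL = keep as-is)
assert(masked == null)           // writing this via UnsafeWriter.write NPEs
{code}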




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42383) Protobuf serializer for RocksDB.TypeAliases

2023-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42383:


Assignee: Apache Spark

> Protobuf serializer for RocksDB.TypeAliases
> ---
>
> Key: SPARK-42383
> URL: https://issues.apache.org/jira/browse/SPARK-42383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42383) Protobuf serializer for RocksDB.TypeAliases

2023-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42383:


Assignee: (was: Apache Spark)

> Protobuf serializer for RocksDB.TypeAliases
> ---
>
> Key: SPARK-42383
> URL: https://issues.apache.org/jira/browse/SPARK-42383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42383) Protobuf serializer for RocksDB.TypeAliases

2023-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685950#comment-17685950
 ] 

Apache Spark commented on SPARK-42383:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39944

> Protobuf serializer for RocksDB.TypeAliases
> ---
>
> Key: SPARK-42383
> URL: https://issues.apache.org/jira/browse/SPARK-42383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42383) Protobuf serializer for RocksDB.TypeAliases

2023-02-08 Thread Yang Jie (Jira)
Yang Jie created SPARK-42383:


 Summary: Protobuf serializer for RocksDB.TypeAliases
 Key: SPARK-42383
 URL: https://issues.apache.org/jira/browse/SPARK-42383
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40819) Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type instead of automatically converting to LongType

2023-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685928#comment-17685928
 ] 

Apache Spark commented on SPARK-40819:
--

User 'awdavidson' has created a pull request for this issue:
https://github.com/apache/spark/pull/39943

> Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type 
> instead of automatically converting to LongType 
> 
>
> Key: SPARK-40819
> URL: https://issues.apache.org/jira/browse/SPARK-40819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.3.2, 3.4.0
>Reporter: Alfred Davidson
>Assignee: Alfred Davidson
>Priority: Critical
>  Labels: regression
> Fix For: 3.2.4, 3.3.2, 3.4.0
>
>
> Since 3.2, parquet files containing attributes with type "INT64 
> (TIMESTAMP(NANOS, true))" are no longer readable, and attempting to read one 
> throws:
>  
> {code:java}
> Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: 
> INT64 (TIMESTAMP(NANOS,true))
>   at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:174)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:90)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convert$1(ParquetSchemaConverter.scala:72)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:66)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:548)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:548)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:528)
>   at scala.collection.immutable.Stream.map(Stream.scala:418)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:528)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:521)
>   at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:76)
>  {code}
> Prior to 3.2, Spark successfully read such parquet files, automatically 
> converting the column to a LongType.
> I believe work that was part of https://issues.apache.org/jira/browse/SPARK-34661 
> introduced the change in behaviour, more specifically here: 
> [https://github.com/apache/spark/pull/31776/files#diff-3730a913c4b95edf09fb78f8739c538bae53f7269555b6226efe7ccee1901b39R154]
>  which throws QueryCompilationErrors.illegalParquetTypeError
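
As a hedged illustration of the regression surface (the path is hypothetical, 
and such files come from external writers like parquet-mr, since Spark itself 
does not produce TIMESTAMP(NANOS, true) columns):

{code:java}
// <= 3.1: the NANOS column surfaces as LongType; >= 3.2: AnalysisException.
val df = spark.read.parquet("/data/events_with_nanos_timestamps.parquet")
df.printSchema()
{code}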



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42305) Assign name to _LEGACY_ERROR_TEMP_1229

2023-02-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-42305:


Assignee: Haejoon Lee

> Assign name to _LEGACY_ERROR_TEMP_1229
> --
>
> Key: SPARK-42305
> URL: https://issues.apache.org/jira/browse/SPARK-42305
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42305) Assign name to _LEGACY_ERROR_TEMP_1229

2023-02-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-42305.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39875
[https://github.com/apache/spark/pull/39875]

> Assign name to _LEGACY_ERROR_TEMP_1229
> --
>
> Key: SPARK-42305
> URL: https://issues.apache.org/jira/browse/SPARK-42305
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42131) Extract the function that construct the select statement for JDBC dialect.

2023-02-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-42131.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 39667
[https://github.com/apache/spark/pull/39667]

> Extract the function that construct the select statement for JDBC dialect.
> --
>
> Key: SPARK-42131
> URL: https://issues.apache.org/jira/browse/SPARK-42131
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>
> Currently, JDBCRDD uses a fixed format for the SELECT statement.
> {code:java}
> val sqlText = options.prepareQuery +
>   s"SELECT $columnList FROM ${options.tableOrQuery} $myTableSampleClause" 
> +
>   s" $myWhereClause $getGroupByClause $getOrderByClause $myLimitClause 
> $myOffsetClause"
> {code}
> But some databases use a different syntax, with different keywords or clause 
> ordering. For example, MS SQL Server uses the TOP keyword to express a LIMIT 
> clause or a Top-N query.
> The LIMIT clause of MS SQL Server.
> {code:java}
> SELECT TOP(1) Model, Color, Price  
>   FROM dbo.Cars  
>   WHERE Color = 'blue'
> {code}
> The Top N of MS SQL Server.
> {code:java}
> SELECT TOP(1) Model, Color, Price  
> FROM dbo.Cars  
> WHERE Color = 'blue'  
> ORDER BY Price ASC
> {code}
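
A hedged sketch of the extraction being proposed; the trait and method names 
below are hypothetical, not Spark's actual API:

{code:java}
// Factor SELECT assembly into a per-dialect hook so that, for example,
// MS SQL Server can emit a leading TOP(n) instead of a trailing LIMIT clause.
trait JdbcSelectBuilder {
  def build(columns: String, table: String, where: String, limit: Option[Int]): String =
    s"SELECT $columns FROM $table $where" +
      limit.map(n => s" LIMIT $n").getOrElse("")
}

object MsSqlServerSelectBuilder extends JdbcSelectBuilder {
  override def build(columns: String, table: String, where: String, limit: Option[Int]): String = {
    val top = limit.map(n => s"TOP($n) ").getOrElse("")
    s"SELECT $top$columns FROM $table $where"
  }
}

// MsSqlServerSelectBuilder.build("Model, Color, Price", "dbo.Cars",
//   "WHERE Color = 'blue'", Some(1))
// => "SELECT TOP(1) Model, Color, Price FROM dbo.Cars WHERE Color = 'blue'"
{code}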



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42131) Extract the function that construct the select statement for JDBC dialect.

2023-02-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-42131:
---

Assignee: jiaan.geng

> Extract the function that construct the select statement for JDBC dialect.
> --
>
> Key: SPARK-42131
> URL: https://issues.apache.org/jira/browse/SPARK-42131
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> Currently, JDBCRDD uses a fixed format for the SELECT statement.
> {code:java}
> val sqlText = options.prepareQuery +
>   s"SELECT $columnList FROM ${options.tableOrQuery} $myTableSampleClause" 
> +
>   s" $myWhereClause $getGroupByClause $getOrderByClause $myLimitClause 
> $myOffsetClause"
> {code}
> But some databases use a different syntax, with different keywords or clause 
> ordering. For example, MS SQL Server uses the TOP keyword to express a LIMIT 
> clause or a Top-N query.
> The LIMIT clause of MS SQL Server.
> {code:java}
> SELECT TOP(1) Model, Color, Price  
>   FROM dbo.Cars  
>   WHERE Color = 'blue'
> {code}
> The Top N of MS SQL Server.
> {code:java}
> SELECT TOP(1) Model, Color, Price  
> FROM dbo.Cars  
> WHERE Color = 'blue'  
> ORDER BY Price ASC
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42303) Assign name to _LEGACY_ERROR_TEMP_1326

2023-02-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-42303:


Assignee: Haejoon Lee

> Assign name to _LEGACY_ERROR_TEMP_1326
> --
>
> Key: SPARK-42303
> URL: https://issues.apache.org/jira/browse/SPARK-42303
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42303) Assign name to _LEGACY_ERROR_TEMP_1326

2023-02-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-42303.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39873
[https://github.com/apache/spark/pull/39873]

> Assign name to _LEGACY_ERROR_TEMP_1326
> --
>
> Key: SPARK-42303
> URL: https://issues.apache.org/jira/browse/SPARK-42303
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes

2023-02-08 Thread Hussein Awala (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685897#comment-17685897
 ] 

Hussein Awala commented on SPARK-34645:
---

I am facing a similar problem with Spark 3.2.1 and JDK 8. I'm running the jobs 
in client mode on arm64 nodes. In about 10% of these jobs, after the executor 
pods and the created PVCs are deleted, the driver pod gets stuck in the Running 
state with this log:
{code:java}
23/02/08 13:04:38 INFO SparkUI: Stopped Spark web UI at http://172.17.45.51:4040
23/02/08 13:04:38 INFO KubernetesClusterSchedulerBackend: Shutting down all 
executors
23/02/08 13:04:38 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
executor to shut down
23/02/08 13:04:38 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has 
been closed.
23/02/08 13:04:39 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
23/02/08 13:04:39 INFO MemoryStore: MemoryStore cleared
23/02/08 13:04:39 INFO BlockManager: BlockManager stopped
23/02/08 13:04:39 INFO BlockManagerMaster: BlockManagerMaster stopped
23/02/08 13:04:39 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
23/02/08 13:04:39 INFO SparkContext: Successfully stopped SparkContext {code}
JDK:
{code:java}
root@***:/# java -version | tail -n3
openjdk version "1.8.0_362"
OpenJDK Runtime Environment (Temurin)(build 1.8.0_362-b09)
OpenJDK 64-Bit Server VM (Temurin)(build 25.362-b09, mixed mode) {code}
I tried:
 * running without the conf _spark.kubernetes.driver.reusePersistentVolumeClaim_ 
and without the PVC at all
 * applying the patch 
[https://github.com/apache/spark/commit/457b75ea2bca6b5811d61ce9f1d28c94b0dde3a2]
 proposed by [~mickayg] on Spark 3.2.1
 * upgrading to 3.2.3

but I still have the same problem.

I didn't find any relevant fix in the Spark 3.3.0 and 3.3.1 release notes apart 
from the Kubernetes client upgrade. Do you have any tips for investigating this 
issue?

> [K8S] Driver pod stuck in Running state after job completes
> ---
>
> Key: SPARK-34645
> URL: https://issues.apache.org/jira/browse/SPARK-34645
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2
> Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", 
> GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", 
> BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
> GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
> BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
> Platform:"linux/amd64"}
>  {code}
>Reporter: Andy Grove
>Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, 
> so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1 everything works as expected and I 
> see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2 I do not see the context get shut down 
> and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1 and this 
> output does not appear with 3.0.2. With 3.0.2 there is no output at all after 
> the job completes.
> {code:java}
> 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from 
> shutdown hook
> 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped 
> Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at 
> http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
> 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting 
> down all executors
> 2021-03-05 20:09:24,600 INFO 
> k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes 
> client has been closed (this is expected if the application is shutting down.)
> 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared
> 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped
> 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster 
> stopped
> 2021-03-05 20:09:24,752 INFO 
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped 
> SparkContext
> 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: 

[jira] [Comment Edited] (SPARK-42380) Upgrade maven to 3.9.0

2023-02-08 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685881#comment-17685881
 ] 

Yang Jie edited comment on SPARK-42380 at 2/8/23 12:24 PM:
---

Upgrading `cyclonedx-maven-plugin` to 2.7.4 avoids this error, but 
`cyclonedx-maven-plugin` 2.7.4 has another issue waiting to be fixed:

https://github.com/CycloneDX/cyclonedx-maven-plugin/issues/272


was (Author: luciferyang):
Upgrading `cyclonedx-maven-plugin` to 2.7.4 avoids this error, but 
`cyclonedx-maven-plugin` 2.7.4 has another issue waiting to be fixed.

> Upgrade maven to 3.9.0
> --
>
> Key: SPARK-42380
> URL: https://issues.apache.org/jira/browse/SPARK-42380
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> {code:java}
> [ERROR] An error occurred attempting to read POM
> org.codehaus.plexus.util.xml.pull.XmlPullParserException: UTF-8 BOM plus xml 
> decl of ISO-8859-1 is incompatible (position: START_DOCUMENT seen <?xml 
> version="1.0" encoding="ISO-8859-1"... @1:42) 
> at org.codehaus.plexus.util.xml.pull.MXParser.parseXmlDeclWithVersion 
> (MXParser.java:3423)
> at org.codehaus.plexus.util.xml.pull.MXParser.parseXmlDecl 
> (MXParser.java:3345)
> at org.codehaus.plexus.util.xml.pull.MXParser.parsePI (MXParser.java:3197)
> at org.codehaus.plexus.util.xml.pull.MXParser.parseProlog 
> (MXParser.java:1828)
> at org.codehaus.plexus.util.xml.pull.MXParser.nextImpl 
> (MXParser.java:1757)
> at org.codehaus.plexus.util.xml.pull.MXParser.next (MXParser.java:1375)
> at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read 
> (MavenXpp3Reader.java:3940)
> at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read 
> (MavenXpp3Reader.java:612)
> at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read 
> (MavenXpp3Reader.java:627)
> at org.cyclonedx.maven.BaseCycloneDxMojo.readPom 
> (BaseCycloneDxMojo.java:759)
> at org.cyclonedx.maven.BaseCycloneDxMojo.readPom 
> (BaseCycloneDxMojo.java:746)
> at org.cyclonedx.maven.BaseCycloneDxMojo.retrieveParentProject 
> (BaseCycloneDxMojo.java:694)
> at org.cyclonedx.maven.BaseCycloneDxMojo.getClosestMetadata 
> (BaseCycloneDxMojo.java:524)
> at org.cyclonedx.maven.BaseCycloneDxMojo.convert 
> (BaseCycloneDxMojo.java:481)
> at org.cyclonedx.maven.CycloneDxMojo.execute (CycloneDxMojo.java:70)
> at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo 
> (DefaultBuildPluginManager.java:126)
> at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 
> (MojoExecutor.java:342)
> at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute 
> (MojoExecutor.java:330)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:213)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:175)
> at org.apache.maven.lifecycle.internal.MojoExecutor.access$000 
> (MojoExecutor.java:76)
> at org.apache.maven.lifecycle.internal.MojoExecutor$1.run 
> (MojoExecutor.java:163)
> at org.apache.maven.plugin.DefaultMojosExecutionStrategy.execute 
> (DefaultMojosExecutionStrategy.java:39)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:160)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
> (LifecycleModuleBuilder.java:105)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
> (LifecycleModuleBuilder.java:73)
> at 
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build
>  (SingleThreadedBuilder.java:53)
> at org.apache.maven.lifecycle.internal.LifecycleStarter.execute 
> (LifecycleStarter.java:118)
> at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:260)
> at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:172)
> at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:100)
> at org.apache.maven.cli.MavenCli.execute (MavenCli.java:821)
> at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:270)
> at org.apache.maven.cli.MavenCli.main (MavenCli.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke 
> (NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke 
> (DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke (Method.java:498)
> at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced 
> (Launcher.java:282)
> at org.codehaus.plexus.classworlds.launcher.Launcher.launch 
> (Launcher.java:225)
> at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode 
> (Launcher.java:406)
> at 

[jira] [Commented] (SPARK-42380) Upgrade maven to 3.9.0

2023-02-08 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685881#comment-17685881
 ] 

Yang Jie commented on SPARK-42380:
--

Upgrading `cyclonedx-maven-plugin` to 2.7.4 avoids this error, but 
`cyclonedx-maven-plugin` 2.7.4 has another issue waiting to be fixed.

> Upgrade maven to 3.9.0
> --
>
> Key: SPARK-42380
> URL: https://issues.apache.org/jira/browse/SPARK-42380
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> {code:java}
> [ERROR] An error occurred attempting to read POM
> org.codehaus.plexus.util.xml.pull.XmlPullParserException: UTF-8 BOM plus xml 
> decl of ISO-8859-1 is incompatible (position: START_DOCUMENT seen <?xml 
> version="1.0" encoding="ISO-8859-1"... @1:42) 
> at org.codehaus.plexus.util.xml.pull.MXParser.parseXmlDeclWithVersion 
> (MXParser.java:3423)
> at org.codehaus.plexus.util.xml.pull.MXParser.parseXmlDecl 
> (MXParser.java:3345)
> at org.codehaus.plexus.util.xml.pull.MXParser.parsePI (MXParser.java:3197)
> at org.codehaus.plexus.util.xml.pull.MXParser.parseProlog 
> (MXParser.java:1828)
> at org.codehaus.plexus.util.xml.pull.MXParser.nextImpl 
> (MXParser.java:1757)
> at org.codehaus.plexus.util.xml.pull.MXParser.next (MXParser.java:1375)
> at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read 
> (MavenXpp3Reader.java:3940)
> at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read 
> (MavenXpp3Reader.java:612)
> at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read 
> (MavenXpp3Reader.java:627)
> at org.cyclonedx.maven.BaseCycloneDxMojo.readPom 
> (BaseCycloneDxMojo.java:759)
> at org.cyclonedx.maven.BaseCycloneDxMojo.readPom 
> (BaseCycloneDxMojo.java:746)
> at org.cyclonedx.maven.BaseCycloneDxMojo.retrieveParentProject 
> (BaseCycloneDxMojo.java:694)
> at org.cyclonedx.maven.BaseCycloneDxMojo.getClosestMetadata 
> (BaseCycloneDxMojo.java:524)
> at org.cyclonedx.maven.BaseCycloneDxMojo.convert 
> (BaseCycloneDxMojo.java:481)
> at org.cyclonedx.maven.CycloneDxMojo.execute (CycloneDxMojo.java:70)
> at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo 
> (DefaultBuildPluginManager.java:126)
> at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 
> (MojoExecutor.java:342)
> at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute 
> (MojoExecutor.java:330)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:213)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:175)
> at org.apache.maven.lifecycle.internal.MojoExecutor.access$000 
> (MojoExecutor.java:76)
> at org.apache.maven.lifecycle.internal.MojoExecutor$1.run 
> (MojoExecutor.java:163)
> at org.apache.maven.plugin.DefaultMojosExecutionStrategy.execute 
> (DefaultMojosExecutionStrategy.java:39)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:160)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
> (LifecycleModuleBuilder.java:105)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
> (LifecycleModuleBuilder.java:73)
> at 
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build
>  (SingleThreadedBuilder.java:53)
> at org.apache.maven.lifecycle.internal.LifecycleStarter.execute 
> (LifecycleStarter.java:118)
> at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:260)
> at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:172)
> at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:100)
> at org.apache.maven.cli.MavenCli.execute (MavenCli.java:821)
> at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:270)
> at org.apache.maven.cli.MavenCli.main (MavenCli.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke 
> (NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke 
> (DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke (Method.java:498)
> at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced 
> (Launcher.java:282)
> at org.codehaus.plexus.classworlds.launcher.Launcher.launch 
> (Launcher.java:225)
> at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode 
> (Launcher.java:406)
> at org.codehaus.plexus.classworlds.launcher.Launcher.main 
> (Launcher.java:347)
> {code}
> An existing problem



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-42380) Upgrade maven to 3.9.0

2023-02-08 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-42380:
-
Description: 
{code:java}
[ERROR] An error occurred attempting to read POM
org.codehaus.plexus.util.xml.pull.XmlPullParserException: UTF-8 BOM plus xml 
decl of ISO-8859-1 is incompatible (position: START_DOCUMENT seen <?xml 
version="1.0" encoding="ISO-8859-1"... @1:42) ...
{code}

> Upgrade maven to 3.9.0
> --
>
> Key: SPARK-42380
> URL: https://issues.apache.org/jira/browse/SPARK-42380
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> {code:java}
> [ERROR] An error occurred attempting to read POM
> org.codehaus.plexus.util.xml.pull.XmlPullParserException: UTF-8 BOM plus xml 
> decl of ISO-8859-1 is incompatible (position: START_DOCUMENT seen <?xml 
> version="1.0" encoding="ISO-8859-1"... @1:42) 
> at org.codehaus.plexus.util.xml.pull.MXParser.parseXmlDeclWithVersion 
> (MXParser.java:3423)
> at org.codehaus.plexus.util.xml.pull.MXParser.parseXmlDecl 
> (MXParser.java:3345)
> at org.codehaus.plexus.util.xml.pull.MXParser.parsePI (MXParser.java:3197)
> at org.codehaus.plexus.util.xml.pull.MXParser.parseProlog 
> (MXParser.java:1828)
> at org.codehaus.plexus.util.xml.pull.MXParser.nextImpl 
> (MXParser.java:1757)
> at org.codehaus.plexus.util.xml.pull.MXParser.next (MXParser.java:1375)
> at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read 
> (MavenXpp3Reader.java:3940)
> at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read 
> (MavenXpp3Reader.java:612)
> at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read 
> (MavenXpp3Reader.java:627)
> at org.cyclonedx.maven.BaseCycloneDxMojo.readPom 
> (BaseCycloneDxMojo.java:759)
> at org.cyclonedx.maven.BaseCycloneDxMojo.readPom 
> (BaseCycloneDxMojo.java:746)
> at org.cyclonedx.maven.BaseCycloneDxMojo.retrieveParentProject 
> (BaseCycloneDxMojo.java:694)
> at org.cyclonedx.maven.BaseCycloneDxMojo.getClosestMetadata 
> (BaseCycloneDxMojo.java:524)
> at org.cyclonedx.maven.BaseCycloneDxMojo.convert 
> (BaseCycloneDxMojo.java:481)
> at org.cyclonedx.maven.CycloneDxMojo.execute (CycloneDxMojo.java:70)
> at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo 
> (DefaultBuildPluginManager.java:126)
> at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 
> (MojoExecutor.java:342)
> at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute 
> (MojoExecutor.java:330)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:213)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:175)
> at org.apache.maven.lifecycle.internal.MojoExecutor.access$000 
> (MojoExecutor.java:76)
> at org.apache.maven.lifecycle.internal.MojoExecutor$1.run 
> (MojoExecutor.java:163)
> at org.apache.maven.plugin.DefaultMojosExecutionStrategy.execute 
> (DefaultMojosExecutionStrategy.java:39)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:160)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
> (LifecycleModuleBuilder.java:105)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
> (LifecycleModuleBuilder.java:73)
> at 
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build
>  (SingleThreadedBuilder.java:53)
> at org.apache.maven.lifecycle.internal.LifecycleStarter.execute 
> (LifecycleStarter.java:118)
> at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:260)
> at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:172)
> at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:100)
> at org.apache.maven.cli.MavenCli.execute (MavenCli.java:821)
> at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:270)
> at org.apache.maven.cli.MavenCli.main (MavenCli.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke 
> (NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke 
> (DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke (Method.java:498)
> at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced 
> (Launcher.java:282)
> at org.codehaus.plexus.classworlds.launcher.Launcher.launch 
> (Launcher.java:225)
> at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode 
> (Launcher.java:406)
> at org.codehaus.plexus.classworlds.launcher.Launcher.main 
> (Launcher.java:347)
> {code}
> An existing problem



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-42382) Upgrade `cyclonedx-maven-plugin` to 2.7.4

2023-02-08 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685880#comment-17685880
 ] 

Yang Jie commented on SPARK-42382:
--

[https://github.com/CycloneDX/cyclonedx-maven-plugin/issues/272]

We need to wait for this fix.

> Upgrade `cyclonedx-maven-plugin` to 2.7.4
> -
>
> Key: SPARK-42382
> URL: https://issues.apache.org/jira/browse/SPARK-42382
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.4



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42342) Introduce base hierarchy to exceptions.

2023-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42342.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39882
[https://github.com/apache/spark/pull/39882]

> Introduce base hierarchy to exceptions.
> ---
>
> Key: SPARK-42342
> URL: https://issues.apache.org/jira/browse/SPARK-42342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42342) Introduce base hierarchy to exceptions.

2023-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42342:


Assignee: Takuya Ueshin

> Introduce base hierarchy to exceptions.
> ---
>
> Key: SPARK-42342
> URL: https://issues.apache.org/jira/browse/SPARK-42342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30022) Supporting Parsing of Simple Hive Virtual View created from Presto

2023-02-08 Thread jorge arada (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685861#comment-17685861
 ] 

jorge arada commented on SPARK-30022:
-

We are facing the same problem: the Presto/Trino views are not readable from 
Spark. What can we do to push this request forward?

> Supporting Parsing of Simple Hive Virtual View created from Presto
> --
>
> Key: SPARK-30022
> URL: https://issues.apache.org/jira/browse/SPARK-30022
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Arun Ravi M V
>Priority: Major
>
>  
> We have an environment where we use Apache Spark and Presto (both backed by 
> Apache Hive Metastore). Currently, views created from Presto fail to get 
> parsed in Apache Spark. This is because Presto stores the view definition and 
> view schema in a base64-encoded fashion that Spark is unable to process. I 
> would like to propose a minor change that will allow us to read these encoded 
> definitions created by Presto in a Spark program.
> Assuming that the UDFs are made available, the user should be able to read 
> Presto views after the fix.
>  
> I would like to propose a change to 
> [https://github.com/apache/spark/blob/9459833eae7fae887af560f3127997e023c51d00/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L440]
> to support the creation of a CatalogTable for views created from Presto.
>  
> In the Hive Metastore DB, the table definition for Presto views (*select* * 
> *from* TBLS *where* `TBL_TYPE` *like* '%VIRTUAL_VIEW%') shows that 
> VIEW_EXPANDED_TEXT is hardcoded as `/* Presto View **/` and 
> VIEW_ORIGINAL_TEXT is `/** Presto View: base64({ "originalSql": "", "catalog": 
> "", "schema": "", "columns": [
> { "name": "", "type": "" }
> ], "owner": ""}) */`  
> Refer: 
> [https://github.com/prestodb/presto/blob/3242715959a169dbcdd88946c28488d2365c8886/presto-hive/src/main/java/com/facebook/presto/hive/HiveUtil.java#L614]
>  
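
A hedged sketch of decoding that storage format (the payload and regex here 
are illustrative, assuming the `/* Presto View: base64(...) */` layout 
described above):

{code:java}
import java.util.Base64

// VIEW_ORIGINAL_TEXT as Presto writes it, with a sample base64 payload that
// decodes to {"originalSql": "SELECT 1"}.
val viewOriginalText = "/* Presto View: eyJvcmlnaW5hbFNxbCI6ICJTRUxFQ1QgMSJ9 */"
val PrestoView = """(?s)/\*+\s*Presto View:\s*([A-Za-z0-9+/=]+)\s*\*+/""".r

viewOriginalText match {
  case PrestoView(b64) =>
    // The decoded JSON carries originalSql, catalog, schema, columns, owner.
    val json = new String(Base64.getDecoder.decode(b64), "UTF-8")
    println(json)
  case _ =>
    println("not a Presto view")
}
{code}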



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42381) `CreateDataFrame` should accept objects

2023-02-08 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-42381:
-

Assignee: Ruifeng Zheng

> `CreateDataFrame` should accept objects
> ---
>
> Key: SPARK-42381
> URL: https://issues.apache.org/jira/browse/SPARK-42381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug

2023-02-08 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685858#comment-17685858
 ] 

Peter Toth edited comment on SPARK-42346 at 2/8/23 11:16 AM:
-

[~ritikam], you also need to disable the "ConvertToLocalRelation" optimizer 
rule via `--conf 
"spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation"`
 to get the error from spark-shell.


was (Author: petertoth):
[~ritikam], you also need to disable the "ConvertToLocalRelation" rule 
optimization `--conf 
"spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation"`
 to get the error from spark-schell.

> distinct(count colname) with UNION ALL causes query analyzer bug
> 
>
> Key: SPARK-42346
> URL: https://issues.apache.org/jira/browse/SPARK-42346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
>Reporter: Robin
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.3.2, 3.4.0, 3.5.0
>
>
> If you combine a UNION ALL with a count(distinct colname) you get a query 
> analyzer bug.
>  
> This behaviour was introduced in 3.3.0; the bug was not present in 3.2.1.
>  
> Here is a reprex in PySpark:
> {code:python}
> import pandas as pd
> 
> df_pd = pd.DataFrame([
>     {'surname': 'a', 'first_name': 'b'}
> ])
> df_spark = spark.createDataFrame(df_pd)
> df_spark.createOrReplaceTempView("input_table")
> sql = """
> SELECT
>     (SELECT Count(DISTINCT first_name) FROM input_table)
>         AS distinct_value_count
> FROM input_table
> UNION ALL
> SELECT
>     (SELECT Count(DISTINCT surname) FROM input_table)
>         AS distinct_value_count
> FROM input_table """
> spark.sql(sql).toPandas()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug

2023-02-08 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685858#comment-17685858
 ] 

Peter Toth commented on SPARK-42346:


[~ritikam], you also need to disable the "ConvertToLocalRelation" optimizer 
rule via `--conf 
"spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation"`
 to get the error from spark-shell.
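
For example, a minimal sketch of setting that conf when building the session 
(the same value can be passed to spark-shell or spark-submit via --conf):

{code:python}
from pyspark.sql import SparkSession

# Exclude ConvertToLocalRelation from the optimizer so the analyzer
# bug below can be reproduced interactively.
spark = (
    SparkSession.builder
    .config(
        "spark.sql.optimizer.excludedRules",
        "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation",
    )
    .getOrCreate()
)
{code}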

> distinct(count colname) with UNION ALL causes query analyzer bug
> 
>
> Key: SPARK-42346
> URL: https://issues.apache.org/jira/browse/SPARK-42346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
>Reporter: Robin
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.3.2, 3.4.0, 3.5.0
>
>
> If you combine a UNION ALL with a count(distinct colname) you get a query 
> analyzer bug.
>  
> This behaviour was introduced in 3.3.0; the bug was not present in 3.2.1.
>  
> Here is a reprex in PySpark:
> {code:python}
> import pandas as pd
> 
> df_pd = pd.DataFrame([
>     {'surname': 'a', 'first_name': 'b'}
> ])
> df_spark = spark.createDataFrame(df_pd)
> df_spark.createOrReplaceTempView("input_table")
> sql = """
> SELECT
>     (SELECT Count(DISTINCT first_name) FROM input_table)
>         AS distinct_value_count
> FROM input_table
> UNION ALL
> SELECT
>     (SELECT Count(DISTINCT surname) FROM input_table)
>         AS distinct_value_count
> FROM input_table """
> spark.sql(sql).toPandas()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42381) `CreateDataFrame` should accept objects

2023-02-08 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-42381.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39939
[https://github.com/apache/spark/pull/39939]
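
A short sketch of the behavior the title suggests (an assumption on my part 
that "objects" means inputs like Row instances passed over Spark Connect):

{code:python}
from pyspark.sql import Row

# Assumed usage: Row objects accepted by createDataFrame on a Connect session.
df = spark.createDataFrame([Row(id=1, name="a"), Row(id=2, name="b")])
df.show()
{code}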

> `CreateDataFrame` should accept objects
> ---
>
> Key: SPARK-42381
> URL: https://issues.apache.org/jira/browse/SPARK-42381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42382) Upgrade `cyclonedx-maven-plugin` to 2.7.4

2023-02-08 Thread Yang Jie (Jira)
Yang Jie created SPARK-42382:


 Summary: Upgrade `cyclonedx-maven-plugin` to 2.7.4
 Key: SPARK-42382
 URL: https://issues.apache.org/jira/browse/SPARK-42382
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Yang Jie


https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.4



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications

2023-02-08 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685765#comment-17685765
 ] 

Gengliang Wang commented on SPARK-41053:


[~dongjoon] sure.

 [~LuciferYang] [~techaddict] [~panbingkun]  [~mridul] [~dongjoon]  
[~cloud_fan] Thanks all for the contributions and reviews!

> Better Spark UI scalability and Driver stability for large applications
> ---
>
> Key: SPARK-41053
> URL: https://issues.apache.org/jira/browse/SPARK-41053
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>  Labels: releasenotes
> Attachments: Better Spark UI scalability and Driver stability for 
> large applications.pdf
>
>
> After SPARK-18085, the Spark history server (SHS) became more scalable for 
> processing large applications by supporting a persistent 
> KV store (LevelDB/RocksDB) as the storage layer.
> As for the live Spark UI, all the data is still stored in memory, which can 
> put memory pressure on the Spark driver for large applications.
> For better Spark UI scalability and Driver stability, I propose to
>  * {*}Support storing all the UI data in a persistent KV store{*}. 
> RocksDB/LevelDB provide low memory overhead, and their write/read performance 
> is fast enough to serve the write/read workload for the live UI. SHS can 
> leverage the persistent KV store to speed up its startup.
>  * *Support a new Protobuf serializer for all the UI data.* The new 
> serializer is supposed to be faster, according to benchmarks. It will be the 
> default serializer for the persistent KV store of the live UI; for event 
> logs, it is optional. The current serializer for UI data is JSON. When 
> writing to the persistent KV store, there is GZip compression. Since there is 
> compression support in RocksDB/LevelDB, the new serializer won’t compress the 
> output before writing to the persistent KV store. Here is a benchmark of 
> writing/reading 100,000 SQLExecutionUIData to/from RocksDB:
>  
> |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total 
> Size(MB)*|*Result total size in memory(MB)*|
> |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868|
> |*Protobuf*|109.9|34.3|858|2105|
> I am also proposing to support only RocksDB, instead of both LevelDB and 
> RocksDB, in the live UI.
> SPIP: 
> [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing]
> SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj
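
For anyone trying the live-UI store out, a minimal sketch (assuming the 
spark.ui.store.path conf this umbrella adds; the directory is arbitrary):

{code:python}
from pyspark.sql import SparkSession

# Back the live UI with a local disk store instead of keeping all
# UI data on the driver heap.
spark = (
    SparkSession.builder
    .config("spark.ui.store.path", "/tmp/spark-ui-store")
    .getOrCreate()
)
{code}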



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications

2023-02-08 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-41053.

Resolution: Fixed

> Better Spark UI scalability and Driver stability for large applications
> ---
>
> Key: SPARK-41053
> URL: https://issues.apache.org/jira/browse/SPARK-41053
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>  Labels: releasenotes
> Attachments: Better Spark UI scalability and Driver stability for 
> large applications.pdf
>
>
> After SPARK-18085, the Spark history server (SHS) became more scalable for 
> processing large applications by supporting a persistent 
> KV store (LevelDB/RocksDB) as the storage layer.
> As for the live Spark UI, all the data is still stored in memory, which can 
> put memory pressure on the Spark driver for large applications.
> For better Spark UI scalability and Driver stability, I propose to
>  * {*}Support storing all the UI data in a persistent KV store{*}. 
> RocksDB/LevelDB provide low memory overhead, and their write/read performance 
> is fast enough to serve the write/read workload for the live UI. SHS can 
> leverage the persistent KV store to speed up its startup.
>  * *Support a new Protobuf serializer for all the UI data.* The new 
> serializer is supposed to be faster, according to benchmarks. It will be the 
> default serializer for the persistent KV store of the live UI; for event 
> logs, it is optional. The current serializer for UI data is JSON. When 
> writing to the persistent KV store, there is GZip compression. Since there is 
> compression support in RocksDB/LevelDB, the new serializer won’t compress the 
> output before writing to the persistent KV store. Here is a benchmark of 
> writing/reading 100,000 SQLExecutionUIData to/from RocksDB:
>  
> |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total 
> Size(MB)*|*Result total size in memory(MB)*|
> |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868|
> |*Protobuf*|109.9|34.3|858|2105|
> I am also proposing to support only RocksDB, instead of both LevelDB and 
> RocksDB, in the live UI.
> SPIP: 
> [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing]
> SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications

2023-02-08 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-41053:
--

Assignee: Apache Spark

> Better Spark UI scalability and Driver stability for large applications
> ---
>
> Key: SPARK-41053
> URL: https://issues.apache.org/jira/browse/SPARK-41053
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>  Labels: releasenotes
> Attachments: Better Spark UI scalability and Driver stability for 
> large applications.pdf
>
>
> After SPARK-18085, the Spark history server (SHS) became more scalable for 
> processing large applications by supporting a persistent 
> KV store (LevelDB/RocksDB) as the storage layer.
> As for the live Spark UI, all the data is still stored in memory, which can 
> put memory pressure on the Spark driver for large applications.
> For better Spark UI scalability and Driver stability, I propose to
>  * {*}Support storing all the UI data in a persistent KV store{*}. 
> RocksDB/LevelDB provide low memory overhead, and their write/read performance 
> is fast enough to serve the write/read workload for the live UI. SHS can 
> leverage the persistent KV store to speed up its startup.
>  * *Support a new Protobuf serializer for all the UI data.* The new 
> serializer is supposed to be faster, according to benchmarks. It will be the 
> default serializer for the persistent KV store of the live UI; for event 
> logs, it is optional. The current serializer for UI data is JSON. When 
> writing to the persistent KV store, there is GZip compression. Since there is 
> compression support in RocksDB/LevelDB, the new serializer won’t compress the 
> output before writing to the persistent KV store. Here is a benchmark of 
> writing/reading 100,000 SQLExecutionUIData to/from RocksDB:
>  
> |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total 
> Size(MB)*|*Result total size in memory(MB)*|
> |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868|
> |*Protobuf*|109.9|34.3|858|2105|
> I am also proposing to support only RocksDB, instead of both LevelDB and 
> RocksDB, in the live UI.
> SPIP: 
> [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing]
> SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38005) Support cleaning up merged shuffle files and state from external shuffle service

2023-02-08 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars updated SPARK-38005:
-
Fix Version/s: 3.4.0

> Support cleaning up merged shuffle files and state from external shuffle 
> service
> 
>
> Key: SPARK-38005
> URL: https://issues.apache.org/jira/browse/SPARK-38005
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Chandni Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, merged shuffle files and state are not cleaned up until an 
> application ends. SPARK-37618 handles the cleanup of regular shuffle files. 
> This Jira will address the cleanup of merged shuffle files/state.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38005) Support cleaning up merged shuffle files and state from external shuffle service

2023-02-08 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars resolved SPARK-38005.
--
Resolution: Fixed

> Support cleaning up merged shuffle files and state from external shuffle 
> service
> 
>
> Key: SPARK-38005
> URL: https://issues.apache.org/jira/browse/SPARK-38005
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Chandni Singh
>Priority: Major
>
> Currently, merged shuffle files and state are not cleaned up until an 
> application ends. SPARK-37618 handles the cleanup of regular shuffle files. 
> This Jira will address the cleanup of merged shuffle files/state.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42267) Support left_outer join

2023-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685758#comment-17685758
 ] 

Apache Spark commented on SPARK-42267:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39940

> Support left_outer join
> ---
>
> Key: SPARK-42267
> URL: https://issues.apache.org/jira/browse/SPARK-42267
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> ```
> >>> df = spark.range(1)
> >>> df2 = spark.range(2)
> >>> df.join(df2, how="left_outer")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/dataframe.py", 
> line 438, in join
> plan.Join(left=self._plan, right=other._plan, on=on, how=how),
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/plan.py", line 
> 730, in __init__
> raise NotImplementedError(
> NotImplementedError: 
> Unsupported join type: left_outer. Supported join types 
> include:
> "inner", "outer", "full", "fullouter", "full_outer",
> "leftouter", "left", "left_outer", "rightouter",
> "right", "right_outer", "leftsemi", "left_semi",
> "semi", "leftanti", "left_anti", "anti", "cross",
> ```
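
With the fix in place, a quick sketch of the now-accepted call (assuming a 
Spark Connect session bound to `spark`):

{code:python}
df = spark.range(1)
df2 = spark.range(2)
# left_outer is now recognized by the Connect client's join plan.
df.join(df2, how="left_outer").show()
{code}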



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42267) Support left_outer join

2023-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685760#comment-17685760
 ] 

Apache Spark commented on SPARK-42267:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39940

> Support left_outer join
> ---
>
> Key: SPARK-42267
> URL: https://issues.apache.org/jira/browse/SPARK-42267
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> ```
> >>> df = spark.range(1)
> >>> df2 = spark.range(2)
> >>> df.join(df2, how="left_outer")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/dataframe.py", 
> line 438, in join
> plan.Join(left=self._plan, right=other._plan, on=on, how=how),
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/plan.py", line 
> 730, in __init__
> raise NotImplementedError(
> NotImplementedError: 
> Unsupported join type: left_outer. Supported join types 
> include:
> "inner", "outer", "full", "fullouter", "full_outer",
> "leftouter", "left", "left_outer", "rightouter",
> "right", "right_outer", "leftsemi", "left_semi",
> "semi", "leftanti", "left_anti", "anti", "cross",
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42267) Support left_outer join

2023-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42267:


Assignee: Ruifeng Zheng

> Support left_outer join
> ---
>
> Key: SPARK-42267
> URL: https://issues.apache.org/jira/browse/SPARK-42267
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> ```
> >>> df = spark.range(1)
> >>> df2 = spark.range(2)
> >>> df.join(df2, how="left_outer")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/dataframe.py", 
> line 438, in join
> plan.Join(left=self._plan, right=other._plan, on=on, how=how),
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/plan.py", line 
> 730, in __init__
> raise NotImplementedError(
> NotImplementedError: 
> Unsupported join type: left_outer. Supported join types 
> include:
> "inner", "outer", "full", "fullouter", "full_outer",
> "leftouter", "left", "left_outer", "rightouter",
> "right", "right_outer", "leftsemi", "left_semi",
> "semi", "leftanti", "left_anti", "anti", "cross",
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42267) Support left_outer join

2023-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42267.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39938
[https://github.com/apache/spark/pull/39938]

> Support left_outer join
> ---
>
> Key: SPARK-42267
> URL: https://issues.apache.org/jira/browse/SPARK-42267
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> ```
> >>> df = spark.range(1)
> >>> df2 = spark.range(2)
> >>> df.join(df2, how="left_outer")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/dataframe.py", 
> line 438, in join
> plan.Join(left=self._plan, right=other._plan, on=on, how=how),
>   File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/plan.py", line 
> 730, in __init__
> raise NotImplementedError(
> NotImplementedError: 
> Unsupported join type: left_outer. Supported join types 
> include:
> "inner", "outer", "full", "fullouter", "full_outer",
> "leftouter", "left", "left_outer", "rightouter",
> "right", "right_outer", "leftsemi", "left_semi",
> "semi", "leftanti", "left_anti", "anti", "cross",
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42381) `CreateDataFrame` should accept objects

2023-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685752#comment-17685752
 ] 

Apache Spark commented on SPARK-42381:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39939

> `CreateDataFrame` should accept objects
> ---
>
> Key: SPARK-42381
> URL: https://issues.apache.org/jira/browse/SPARK-42381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42381) `CreateDataFrame` should accept objects

2023-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42381:


Assignee: (was: Apache Spark)

> `CreateDataFrame` should accept objects
> ---
>
> Key: SPARK-42381
> URL: https://issues.apache.org/jira/browse/SPARK-42381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42381) `CreateDataFrame` should accept objects

2023-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685750#comment-17685750
 ] 

Apache Spark commented on SPARK-42381:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39939

> `CreateDataFrame` should accept objects
> ---
>
> Key: SPARK-42381
> URL: https://issues.apache.org/jira/browse/SPARK-42381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42381) `CreateDataFrame` should accept objects

2023-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42381:


Assignee: Apache Spark

> `CreateDataFrame` should accept objects
> ---
>
> Key: SPARK-42381
> URL: https://issues.apache.org/jira/browse/SPARK-42381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42381) `CreateDataFrame` should accept objects

2023-02-08 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-42381:
-

 Summary: `CreateDataFrame` should accept objects
 Key: SPARK-42381
 URL: https://issues.apache.org/jira/browse/SPARK-42381
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org