[jira] [Commented] (SPARK-40502) Support dataframe API use jdbc data source in PySpark

2022-09-20 Thread CaoYu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607523#comment-17607523
 ] 

CaoYu commented on SPARK-40502:
---

I am a teacher.
I recently designed a basic Python language course with a big-data focus.

PySpark is one of the practical cases, but it only uses simple RDD code 
to do the basic data processing work, and using a JDBC data source is 
part of the course.

DataFrames (Spark SQL) will be used in the advanced courses designed later.
So I hope the RDD (datastream) API can gain JDBC data source capability.

 

 

> Support dataframe API use jdbc data source in PySpark
> -
>
> Key: SPARK-40502
> URL: https://issues.apache.org/jira/browse/SPARK-40502
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: CaoYu
>Priority: Major
>
> When I use PySpark, I want to get data from a MySQL database, so I want to use 
> JDBCRDD like in Java/Scala.
> But that is not supported in PySpark.
>  
> For some reasons I can't use the DataFrame API and can only use the RDD (datastream) 
> API, even though I know the DataFrame can get data from a JDBC source fairly well.
>  
> So I want to implement functionality for PySpark that can use an RDD to get data 
> from a JDBC source.
>  
> *But I don't know whether that is necessary for PySpark, so we can discuss it.*
>  
> *If it is necessary for PySpark, I want to contribute to Spark. I hope this Jira 
> task can be assigned to me, so I can start working on implementing it.*
>  
> *If not, please close this Jira task.*
>  
>  
> *Thanks a lot.*
>  
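For reference, a minimal sketch (not part of the original report; connection details below are placeholders) of the existing DataFrame JDBC reader in a spark-shell session. The PySpark DataFrameReader takes the same options, and an RDD of Rows can still be obtained from the resulting DataFrame:
{code:java}
// Sketch only: placeholder URL, table name and credentials.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb")
  .option("dbtable", "some_table")
  .option("user", "user")
  .option("password", "password")
  .load()

// RDD-style processing is still possible on top of the DataFrame-based read.
val rowRdd = jdbcDF.rdd
{code}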



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40511) Upgrade slf4j to 2.x

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40511:


Assignee: Apache Spark

> Upgrade slf4j to 2.x
> 
>
> Key: SPARK-40511
> URL: https://issues.apache.org/jira/browse/SPARK-40511
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> https://www.slf4j.org/news.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40511) Upgrade slf4j to 2.x

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40511:


Assignee: (was: Apache Spark)

> Upgrade slf4j to 2.x
> 
>
> Key: SPARK-40511
> URL: https://issues.apache.org/jira/browse/SPARK-40511
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://www.slf4j.org/news.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40511) Upgrade slf4j to 2.x

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607522#comment-17607522
 ] 

Apache Spark commented on SPARK-40511:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37844

> Upgrade slf4j to 2.x
> 
>
> Key: SPARK-40511
> URL: https://issues.apache.org/jira/browse/SPARK-40511
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://www.slf4j.org/news.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40511) Upgrade slf4j to 2.x

2022-09-20 Thread Yang Jie (Jira)
Yang Jie created SPARK-40511:


 Summary: Upgrade slf4j to 2.x
 Key: SPARK-40511
 URL: https://issues.apache.org/jira/browse/SPARK-40511
 Project: Spark
  Issue Type: Improvement
  Components: Build, Spark Core
Affects Versions: 3.4.0
Reporter: Yang Jie


https://www.slf4j.org/news.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped

2022-09-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-40496:
---

Assignee: Ivan Sadikov

> Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
> --
>
> Key: SPARK-40496
> URL: https://issues.apache.org/jira/browse/SPARK-40496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped

2022-09-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40496.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37942
[https://github.com/apache/spark/pull/37942]

> Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
> --
>
> Key: SPARK-40496
> URL: https://issues.apache.org/jira/browse/SPARK-40496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40506) Spark Streaming metrics name don't need application name

2022-09-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

王俊博 updated SPARK-40506:

Description: 
The Spark StreamingSource metrics sourceName is inappropriate. The label currently looks 
like 
`application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime` 
instead of 
`application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`; 
the Spark app name is not needed.

This makes it hard to track metrics across different Spark applications over time, 
and it makes the metrics sourceName convention inconsistent.
{code:java}
// code placeholder

private[streaming] class StreamingSource(ssc: StreamingContext) extends Source {
  override val metricRegistry = new MetricRegistry
  override val sourceName =
    "%s.StreamingMetrics".format(ssc.sparkContext.appName)

}{code}
For example, other metrics sourceNames do not include the appName.
{code:java}
// code placeholder
private[spark] class LiveListenerBusMetrics(conf: SparkConf)
  extends Source with Logging {

  override val sourceName: String = "LiveListenerBus"
  override val metricRegistry: MetricRegistry = new MetricRegistry
...
}

{code}

  was:
Spark  StreamingSource  Metrics sourceName is inappropriate.The label now looks 
like 
`application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime
 `, instead of 
`application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`,
 the Spark app name is not need.

This makes it hard to use metrics for different Spark applications over time. 
And this makes the metrics sourceName standard inconsistent
{code:java}
// code placeholder

private[streaming] class StreamingSource(ssc: StreamingContext) extends Source {
  override val metricRegistry = new MetricRegistry
  override val sourceName = 
"%s.StreamingMetrics".format(ssc.sparkContext.appName)

}{code}
And for example, other metrics sourceName don't have appName.
{code:java}
// code placeholder
private[spark] class LiveListenerBusMetrics(conf: SparkConf)
  extends Source with Logging {

  override val sourceName: String = "LiveListenerBus"
  override val metricRegistry: MetricRegistry = new MetricRegistry
...
}

{code}


> Spark Streaming metrics name don't need application name
> 
>
> Key: SPARK-40506
> URL: https://issues.apache.org/jira/browse/SPARK-40506
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.2.2
>Reporter: 王俊博
>Priority: Trivial
>
> The Spark StreamingSource metrics sourceName is inappropriate. The label currently 
> looks like 
> `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime` 
> instead of 
> `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`; 
> the Spark app name is not needed.
> This makes it hard to track metrics across different Spark applications over time, 
> and it makes the metrics sourceName convention inconsistent.
> {code:java}
> // code placeholder
> private[streaming] class StreamingSource(ssc: StreamingContext) extends 
> Source {
>   override val metricRegistry = new MetricRegistry
>   override val sourceName = 
>     "%s.StreamingMetrics".format(ssc.sparkContext.appName)
> 
> }{code}
> For example, other metrics sourceNames do not include the appName.
> {code:java}
> // code placeholder
> private[spark] class LiveListenerBusMetrics(conf: SparkConf)
>   extends Source with Logging {
>   override val sourceName: String = "LiveListenerBus"
>   override val metricRegistry: MetricRegistry = new MetricRegistry
> ...
> }
> {code}
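Following the StreamingSource snippet above, a minimal sketch of what the suggested change might look like (an assumption for illustration, not a committed patch):
{code:java}
// Hypothetical: drop the app name prefix so the source name matches other sources.
private[streaming] class StreamingSource(ssc: StreamingContext) extends Source {
  override val metricRegistry = new MetricRegistry
  override val sourceName = "StreamingMetrics"  // no ssc.sparkContext.appName prefix
}
{code}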



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40332) Implement `GroupBy.quantile`.

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607489#comment-17607489
 ] 

Apache Spark commented on SPARK-40332:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37954

> Implement `GroupBy.quantile`.
> -
>
> Key: SPARK-40332
> URL: https://issues.apache.org/jira/browse/SPARK-40332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> We should implement `GroupBy.quantile` to increase pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40332) Implement `GroupBy.quantile`.

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607488#comment-17607488
 ] 

Apache Spark commented on SPARK-40332:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37954

> Implement `GroupBy.quantile`.
> -
>
> Key: SPARK-40332
> URL: https://issues.apache.org/jira/browse/SPARK-40332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> We should implement `GroupBy.quantile` to increase pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40501) Enhance 'SpecialLimits' to support project(..., limit(...))

2022-09-20 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-40501:

Summary: Enhance 'SpecialLimits' to support project(..., limit(...))  (was: 
Add PushProjectionThroughLimit for Optimizer)

> Enhance 'SpecialLimits' to support project(..., limit(...))
> ---
>
> Key: SPARK-40501
> URL: https://issues.apache.org/jira/browse/SPARK-40501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>
> h4. The query takes a very long time to return (still running after 20 minutes)
> when running the following code in spark-shell:
> spark.sql("select * from xxx where event_day = '20220919' limit 1").show()
> [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png]
> [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png]
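In the spirit of this issue's earlier title ("Add PushProjectionThroughLimit for Optimizer"), a hypothetical optimizer-rule sketch; this is an assumption for discussion, not Spark's actual change:
{code:java}
import org.apache.spark.sql.catalyst.plans.logical.{Limit, LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: a projection is row-preserving, so a Project sitting on top of a
// Limit can be pushed below it, letting the limit be applied before the projection.
object PushProjectionThroughLimit extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Project(projectList, Limit(limitExpr, child)) =>
      Limit(limitExpr, Project(projectList, child))
  }
}
{code}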



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40510) Implement `ddof` in `Series.cov`

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607483#comment-17607483
 ] 

Apache Spark commented on SPARK-40510:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37953

> Implement `ddof` in `Series.cov`
> 
>
> Key: SPARK-40510
> URL: https://issues.apache.org/jira/browse/SPARK-40510
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40510) Implement `ddof` in `Series.cov`

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40510:


Assignee: Apache Spark

> Implement `ddof` in `Series.cov`
> 
>
> Key: SPARK-40510
> URL: https://issues.apache.org/jira/browse/SPARK-40510
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40510) Implement `ddof` in `Series.cov`

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607481#comment-17607481
 ] 

Apache Spark commented on SPARK-40510:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37953

> Implement `ddof` in `Series.cov`
> 
>
> Key: SPARK-40510
> URL: https://issues.apache.org/jira/browse/SPARK-40510
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40510) Implement `ddof` in `Series.cov`

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40510:


Assignee: (was: Apache Spark)

> Implement `ddof` in `Series.cov`
> 
>
> Key: SPARK-40510
> URL: https://issues.apache.org/jira/browse/SPARK-40510
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40510) Implement `ddof` in `Series.cov`

2022-09-20 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40510:
-

 Summary: Implement `ddof` in `Series.cov`
 Key: SPARK-40510
 URL: https://issues.apache.org/jira/browse/SPARK-40510
 Project: Spark
  Issue Type: Sub-task
  Components: ps
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40491) Remove too old TODO for JdbcRDD

2022-09-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40491.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37937
[https://github.com/apache/spark/pull/37937]

> Remove too old TODO for JdbcRDD
> ---
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Trivial
> Fix For: 3.4.0
>
>
> According to the legacy document of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.
> In fact, this is a very old TODO and we need to revisit if this is still 
> necessary. Since Spark SQL is the new core, I'm not sure if anyone is 
> interested in a new API to create jdbc RDD.
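For context, a minimal sketch of how a JdbcRDD is constructed directly today (placeholder connection string, query and bounds; not taken from this issue):
{code:java}
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// The SQL must contain two '?' placeholders that JdbcRDD binds to the partition bounds.
val rdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://localhost:3306/testdb", "user", "password"),
  "SELECT id, name FROM some_table WHERE ? <= id AND id <= ?",
  1L, 1000L, 4,  // lowerBound, upperBound, numPartitions
  (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))
{code}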



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40491) Remove too old TODO for JdbcRDD

2022-09-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40491:


Assignee: jiaan.geng

> Remove too old TODO for JdbcRDD
> ---
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Trivial
>
> According to the legacy document of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.
> In fact, this is a very old TODO and we need to revisit if this is still 
> necessary. Since Spark SQL is the new core, I'm not sure if anyone is 
> interested in a new API to create jdbc RDD.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`

2022-09-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40500.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37947
[https://github.com/apache/spark/pull/37947]

> Use `pd.items` instead of `pd.iteritems`
> 
>
> Key: SPARK-40500
> URL: https://issues.apache.org/jira/browse/SPARK-40500
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`

2022-09-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40500:


Assignee: Ruifeng Zheng

> Use `pd.items` instead of `pd.iteritems`
> 
>
> Key: SPARK-40500
> URL: https://issues.apache.org/jira/browse/SPARK-40500
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0

2022-09-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40499:
-
Priority: Major  (was: Blocker)

> Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop: 3.0.0 
> spark:  2.4.0 / 3.2.1
> shuffle:spark 2.4.0
>Reporter: xuanzhiang
>Priority: Major
> Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png
>
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used Spark 2.4.0, aggregation adopted objHashAggregator and stage 2 
> pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old 
> shuffle, 140 MB of shuffle data takes 3 hours to pull. 
>  * If we upgrade the shuffle, will we get a performance regression?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40502) Support dataframe API use jdbc data source in PySpark

2022-09-20 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607447#comment-17607447
 ] 

Hyukjin Kwon commented on SPARK-40502:
--

{quote}
For some reasons, i can't using DataFrame API, only can use RDD(datastream) API.
{quote}
What's the reason?

> Support dataframe API use jdbc data source in PySpark
> -
>
> Key: SPARK-40502
> URL: https://issues.apache.org/jira/browse/SPARK-40502
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: CaoYu
>Priority: Major
>
> When I use PySpark, I want to get data from a MySQL database, so I want to use 
> JDBCRDD like in Java/Scala.
> But that is not supported in PySpark.
>  
> For some reasons I can't use the DataFrame API and can only use the RDD (datastream) 
> API, even though I know the DataFrame can get data from a JDBC source fairly well.
>  
> So I want to implement functionality for PySpark that can use an RDD to get data 
> from a JDBC source.
>  
> *But I don't know whether that is necessary for PySpark, so we can discuss it.*
>  
> *If it is necessary for PySpark, I want to contribute to Spark. I hope this Jira 
> task can be assigned to me, so I can start working on implementing it.*
>  
> *If not, please close this Jira task.*
>  
>  
> *Thanks a lot.*
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40509) Construct an example of applyInPandasWithState in examples directory

2022-09-20 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-40509:


 Summary: Construct an example of applyInPandasWithState in 
examples directory
 Key: SPARK-40509
 URL: https://issues.apache.org/jira/browse/SPARK-40509
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Jungtaek Lim


Since we are introducing a new API (applyInPandasWithState) in PySpark, it is worth 
having a separate, full example of the API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607384#comment-17607384
 ] 

Apache Spark commented on SPARK-40508:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/37952

> Treat unknown partitioning as UnknownPartitioning
> -
>
> Key: SPARK-40508
> URL: https://issues.apache.org/jira/browse/SPARK-40508
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ted Yu
>Priority: Major
>
> When running spark application against spark 3.3, I see the following :
> {code}
> java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
> type: CustomPartitioning
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> {code}
> The CustomPartitioning works fine with Spark 3.2.1
> This PR proposes to relax the code and treat all unknown partitioning the 
> same way as that for UnknownPartitioning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40508:


Assignee: (was: Apache Spark)

> Treat unknown partitioning as UnknownPartitioning
> -
>
> Key: SPARK-40508
> URL: https://issues.apache.org/jira/browse/SPARK-40508
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ted Yu
>Priority: Major
>
> When running spark application against spark 3.3, I see the following :
> {code}
> java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
> type: CustomPartitioning
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> {code}
> The CustomPartitioning works fine with Spark 3.2.1
> This PR proposes to relax the code and treat all unknown partitioning the 
> same way as that for UnknownPartitioning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40508:


Assignee: Apache Spark

> Treat unknown partitioning as UnknownPartitioning
> -
>
> Key: SPARK-40508
> URL: https://issues.apache.org/jira/browse/SPARK-40508
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ted Yu
>Assignee: Apache Spark
>Priority: Major
>
> When running spark application against spark 3.3, I see the following :
> {code}
> java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
> type: CustomPartitioning
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> {code}
> The CustomPartitioning works fine with Spark 3.2.1
> This PR proposes to relax the code and treat all unknown partitioning the 
> same way as that for UnknownPartitioning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning

2022-09-20 Thread Ted Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-40508:
---
Description: 
When running spark application against spark 3.3, I see the following :
{code}
java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
type: CustomPartitioning
at 
org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
at 
org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
{code}
The CustomPartitioning works fine with Spark 3.2.1
This PR proposes to relax the code and treat all unknown partitioning the same 
way as that for UnknownPartitioning.

  was:
When running spark application against spark 3.3, I see the following :
```
java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
type: CustomPartitioning
at 
org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
at 
org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
```
The CustomPartitioning works fine with Spark 3.2.1
This PR proposes to relax the code and treat all unknown partitioning the same 
way as that for UnknownPartitioning.


> Treat unknown partitioning as UnknownPartitioning
> -
>
> Key: SPARK-40508
> URL: https://issues.apache.org/jira/browse/SPARK-40508
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ted Yu
>Priority: Major
>
> When running spark application against spark 3.3, I see the following :
> {code}
> java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
> type: CustomPartitioning
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> {code}
> The CustomPartitioning works fine with Spark 3.2.1
> This PR proposes to relax the code and treat all unknown partitioning the 
> same way as that for UnknownPartitioning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning

2022-09-20 Thread Ted Yu (Jira)
Ted Yu created SPARK-40508:
--

 Summary: Treat unknown partitioning as UnknownPartitioning
 Key: SPARK-40508
 URL: https://issues.apache.org/jira/browse/SPARK-40508
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Ted Yu


When running spark application against spark 3.3, I see the following :
```
java.lang.IllegalArgumentException: Unsupported data source V2 partitioning 
type: CustomPartitioning
at 
org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
at 
org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
```
The CustomPartitioning works fine with Spark 3.2.1
This PR proposes to relax the code and treat all unknown partitioning the same 
way as that for UnknownPartitioning.
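A hypothetical sketch of the proposed relaxation (a shape assumed from the stack trace above, not the actual patch): unrecognized connector partitionings fall back to the same handling as UnknownPartitioning instead of throwing.
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning, UnknownPartitioning}

// Sketch only: return None (i.e. treat as unknown) rather than throwing for unsupported types.
def partitioningKeysOrNone(p: Partitioning): Option[Array[_]] = p match {
  case kgp: KeyGroupedPartitioning => Some(kgp.keys())  // keep the supported case
  case _: UnknownPartitioning      => None
  case _ =>
    // instead of: throw new IllegalArgumentException("Unsupported data source V2 partitioning type: ...")
    None
}
{code}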



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40477) Support `NullType` in `ColumnarBatchRow`

2022-09-20 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura resolved SPARK-40477.
---
Resolution: Won't Fix

Gave it another thought and decided to close this one as won't fix. There is no 
natural code path that calls ColumnarBatchRow.get() for NullType columns; in particular, 
NullType cannot be stored as a partition column in a columnar format like 
Parquet.

> Support `NullType` in `ColumnarBatchRow`
> 
>
> Key: SPARK-40477
> URL: https://issues.apache.org/jira/browse/SPARK-40477
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> `ColumnarBatchRow.get()` does not currently support `NullType`. Support 
> `NullType` in `ColumnarBatchRow` so that `NullType` can be used as a partition 
> column type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40416) Add error classes for subquery expression CheckAnalysis failures

2022-09-20 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-40416.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37840
[https://github.com/apache/spark/pull/37840]

> Add error classes for subquery expression CheckAnalysis failures
> 
>
> Key: SPARK-40416
> URL: https://issues.apache.org/jira/browse/SPARK-40416
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40416) Add error classes for subquery expression CheckAnalysis failures

2022-09-20 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-40416:
--

Assignee: Daniel

> Add error classes for subquery expression CheckAnalysis failures
> 
>
> Key: SPARK-40416
> URL: https://issues.apache.org/jira/browse/SPARK-40416
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40507) Spark creates an optional columns in hive table for fields that are not null

2022-09-20 Thread Anil Dasari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anil Dasari updated SPARK-40507:

Description: 
DataFrame saveAsTable sets all columns as optional/nullable while creating the 
table, here:

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L531]

(`outputColumns.toStructType.asNullable`)

As a result, the source Parquet schema and the Hive table schema do not match, which is 
problematic when a process over large DataFrames uses Hive as temporary storage to 
relieve memory pressure. 

Hive 3.x supports non-null constraints on table columns. Please add support for 
non-null constraints on Spark SQL Hive tables. 

  was:
Dataframe saveAsTable sets all columns as optional/nullable while creating the 
table here  

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L531]

(`outputColumns.toStructType.asNullable`)

This makes source parquet schema and hive table schema doesn't match and is 
problematic when large dataframe(s) process uses hive as temporary storage to 
avoid the memory pressure. 

Hive 3.x supports non null constraints on table columns. Please add support non 
null constraints on Spark sql hive table. 


> Spark creates an optional columns in hive table for fields that are not null
> 
>
> Key: SPARK-40507
> URL: https://issues.apache.org/jira/browse/SPARK-40507
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anil Dasari
>Priority: Major
>
> DataFrame saveAsTable sets all columns as optional/nullable while creating 
> the table, here: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L531]
> (`outputColumns.toStructType.asNullable`)
> As a result, the source Parquet schema and the Hive table schema do not match, which 
> is problematic when a process over large DataFrames uses Hive as temporary storage 
> to relieve memory pressure. 
> Hive 3.x supports non-null constraints on table columns. Please add support 
> for non-null constraints on Spark SQL Hive tables. 
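A minimal spark-shell sketch (with a hypothetical table name, not taken from the report) illustrating the nullability change described above:
{code:java}
// "id" from spark.range is non-nullable on the DataFrame side...
val df = spark.range(10).toDF("id")
df.schema.fields.map(_.nullable)                               // Array(false)

// ...but after saveAsTable, the table schema reports the column as nullable.
df.write.mode("overwrite").saveAsTable("nullability_check")    // hypothetical table name
spark.table("nullability_check").schema.fields.map(_.nullable) // Array(true)
{code}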



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40507) Spark creates an optional columns in hive table for fields that are not null

2022-09-20 Thread Anil Dasari (Jira)
Anil Dasari created SPARK-40507:
---

 Summary: Spark creates an optional columns in hive table for 
fields that are not null
 Key: SPARK-40507
 URL: https://issues.apache.org/jira/browse/SPARK-40507
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Anil Dasari


DataFrame saveAsTable sets all columns as optional/nullable while creating the 
table, here:

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L531]

(`outputColumns.toStructType.asNullable`)

As a result, the source Parquet schema and the Hive table schema do not match, which is 
problematic when a process over large DataFrames uses Hive as temporary storage to 
relieve memory pressure. 

Hive 3.x supports non-null constraints on table columns. Please add support for 
non-null constraints on Spark SQL Hive tables. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:23 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy}} would work.

For instance, after inspecting the code, I thought nullOnOverflow is controlled 
by {{spark.sql.ansi.enabled. }}I tried to achieve the desired behavior by 
altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 
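A minimal sketch of the workaround mentioned above (assumed spark-shell usage; not part of the original comment):
{code:java}
// Fall back to the legacy store assignment behavior before the insert.
spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
spark.sql("insert into decimal_extra_precision select 333.22")  // value truncated here as in the report
{code}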


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy}} would work.

For instance, after inspecting the code, I thought that nullOnOverflow is 
controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:23 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy}} would work.

For instance, after inspecting the code, I thought nullOnOverflow is controlled 
by \{{spark.sql.ansi.enabled.}} I tried to achieve the desired behavior by 
altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy}} would work.

For instance, after inspecting the code, I thought nullOnOverflow is controlled 
by {{spark.sql.ansi.enabled. }}I tried to achieve the desired behavior by 
altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       scale: 

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:22 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy}} would work.

For instance, after inspecting the code, I thought that nullOnOverflow is 
controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy }}would work.

For instance, after inspecting the code, I thought that nullOnOverflow is 
controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:22 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy }}would work.

For instance, after inspecting the code, I thought that nullOnOverflow is 
controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy would work.}}

For instance, after inspecting the code, I thought that nullOnOverflow is 
controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at <console>:27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:21 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy would work.}}

{{ For instance, after inspecting the code, I thought that nullOnOverflow}} is 
controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. 

I believe it could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy would work. For instance, after inspecting 
the code, I thought that nullOnOverflow}} is controlled by 
{{spark.sql.ansi.enabled. I}} tried to achieve the desired behaviour by 
altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at <console>:27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:21 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy would work.}}

For instance, after inspecting the code, I thought that nullOnOverflow is 
controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it 
could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy would work.}}

{{ For instance, after inspecting the code, I thought that nullOnOverflow}} is 
controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at <console>:27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>  

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:20 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to LEGACY works. 

I believe it could get non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy would work. For instance, after inspecting 
the code, I thought that nullOnOverflow}} is controlled by 
{{spark.sql.ansi.enabled. I}} tried to achieve the desired behaviour by 
altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. }}

I believe it could get non trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy }}would work. For instance, after inspecting 
the code, I thought that {{nullOnOverflow}} is controlled by 
{{spark.sql.ansi.enabled and}} I tried to achieve the desired behaviour by 
altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at <console>:27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:20 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. }}

I believe it could get non trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy }}would work. For instance, after inspecting 
the code, I thought that {{nullOnOverflow}} is controlled by 
{{spark.sql.ansi.enabled and}} I tried to achieve the desired behaviour by 
altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. }}I believe it could get 
non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} 
would work. For instance, after inspecting the code, I thought that 
{{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to 
achieve the desired behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at <console>:27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>   

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:18 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. }}I believe it could get 
non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} 
would work. For instance, after inspecting the code, I thought that 
{{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to 
achieve the desired behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. I believe it could get 
non-trivial for users to discover that spark.sql.storeAssignmentPolicy}} would 
work. For instance, after inspecting the code, I thought that 
{{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to 
achieve the desired behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at <console>:27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       

[jira] [Commented] (SPARK-31404) file source backward compatibility after calendar switch

2022-09-20 Thread Sachit (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607326#comment-17607326
 ] 

Sachit commented on SPARK-31404:


Hi [~cloud_fan] ,

Could you please confirm whether we need to set the properties below to ensure 
Spark 3.x can read data written by Spark 2.4.x?


spark.conf.set("spark.sql.parquet.int96RebaseModeInRead","CORRECTED")
spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite","CORRECTED")


Regards
Sachit
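
A sketch of how these session-level settings are typically applied, added for 
illustration; the file path is a placeholder and the {{datetimeRebaseMode*}} 
entries are the analogous configs for DATE/TIMESTAMP columns in recent 3.x 
releases. Per my reading of the Spark 3 migration guide, {{LEGACY}} is the read 
mode that rebases calendar-sensitive values written by Spark 2.4.x, while 
{{CORRECTED}} reads the stored values as-is, so which one is appropriate depends 
on the data:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read side: LEGACY rebases values written with the old hybrid calendar
# (Spark 2.4.x); CORRECTED reads them as-is without any rebase.
spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "LEGACY")
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "LEGACY")

# Write side: CORRECTED writes plain proleptic Gregorian values for Spark 3.x
# readers; LEGACY keeps the output readable by Spark 2.4.x / Hive / Impala.
spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")

df = spark.read.parquet("/path/written/by/spark-2.4")  # placeholder path
df.show()
{code}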

> file source backward compatibility after calendar switch
> 
>
> Key: SPARK-31404
> URL: https://issues.apache.org/jira/browse/SPARK-31404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
> Fix For: 3.0.0
>
> Attachments: Switch to Java 8 time API in Spark 3.0.pdf
>
>
> In Spark 3.0, we switch to the Proleptic Gregorian calendar by using the Java 
> 8 datetime APIs. This makes Spark follow the ISO and SQL standard, but 
> introduces some backward compatibility problems:
> 1. may read wrong data from the data files written by Spark 2.4
> 2. may have perf regression



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:17 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. I believe it could get 
non-trivial for users to discover that spark.sql.storeAssignmentPolicy}} would 
work. For instance, after inspecting the code, I thought that 
{{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to 
achieve the desired behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY }}works. I believe it could get 
non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} 
would work. For instance, after inspecting the code, I thought that 
{{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to 
achieve the desired behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at <console>:27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       

[jira] [Commented] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys commented on SPARK-40439:
--

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY }}works. I believe it could get 
non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} 
would work. For instance, after inspecting the code, I thought that 
{{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to 
achieve the desired behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at <console>:27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       scale: Int,
>       roundMode: BigDecimal.RoundingMode.Value = ROUND_HALF_UP,
>       nullOnOverflow: Boolean = true,
>       context: SQLQueryContext = null): Decimal = {
>     val copy = clone()
>     if (copy.changePrecision(precision, scale, roundMode)) {
>       copy
>     } else {
>       if (nullOnOverflow) {
>         null
>       } else {
>         throw QueryExecutionErrors.cannotChangeDecimalPrecisionError(
>           this, precision, scale, context)
>       }
>     }
>   }{code}
> The above function is invoked from 
> 
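
As a side illustration of the {{nullOnOverflow}} flag quoted above (a sketch 
added for clarity, not taken from the report, using a made-up overflowing 
value): a plain CAST to a narrower decimal appears to go through the same 
{{toPrecision}} logic, where {{spark.sql.ansi.enabled}} chooses between NULL 
and an exception, whereas the INSERT path in this report is governed by 
{{spark.sql.storeAssignmentPolicy}} instead:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 12 integral digits cannot fit in DECIMAL(20,10), which keeps 10 of its
# 20 digits for the fractional part, so this cast always overflows.
query = "SELECT CAST(123456789012.3 AS DECIMAL(20,10)) AS c1"

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql(query).show()          # overflow becomes NULL (nullOnOverflow = true)

spark.conf.set("spark.sql.ansi.enabled", "true")
try:
    spark.sql(query).show()      # overflow now raises (nullOnOverflow = false)
except Exception as e:
    print("ANSI mode raised:", type(e).__name__)
{code}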

[jira] [Assigned] (SPARK-39494) Support `createDataFrame` from a list of scalars when schema is not provided

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39494:


Assignee: Apache Spark

> Support `createDataFrame` from a list of scalars when schema is not provided
> 
>
> Key: SPARK-39494
> URL: https://issues.apache.org/jira/browse/SPARK-39494
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Currently, DataFrame creation from a list of native Python scalars is 
> unsupported in PySpark, for example,
> {{>>> spark.createDataFrame([1, 2]).collect()}}
> {{Traceback (most recent call last):}}
> {{...}}
> {{TypeError: Can not infer schema for type: <class 'int'>}}
> {{However, Spark DataFrame Scala API supports that:}}
> {{scala> Seq(1, 2).toDF().collect()}}
> {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}}
> To maintain API consistency, we propose to support DataFrame creation from a 
> list of scalars. 
> See more 
> [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]).
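
Until something like this is supported, a commonly used workaround (shown here 
as an illustrative sketch, not part of the proposal) is to pass an explicit type 
so that schema inference is skipped, or to wrap each scalar in a one-element 
tuple:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Passing a DataType (or a DDL string such as "int") skips schema inference,
# so plain scalars are accepted and land in a single column named "value".
spark.createDataFrame([1, 2], IntegerType()).show()

# Alternatively, wrap each scalar in a one-element tuple and name the column.
spark.createDataFrame([(1,), (2,)], ["id"]).show()
{code}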



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39494) Support `createDataFrame` from a list of scalars when schema is not provided

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39494:


Assignee: (was: Apache Spark)

> Support `createDataFrame` from a list of scalars when schema is not provided
> 
>
> Key: SPARK-39494
> URL: https://issues.apache.org/jira/browse/SPARK-39494
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, DataFrame creation from a list of native Python scalars is 
> unsupported in PySpark, for example,
> {{>>> spark.createDataFrame([1, 2]).collect()}}
> {{Traceback (most recent call last):}}
> {{...}}
> {{TypeError: Can not infer schema for type: <class 'int'>}}
> {{However, Spark DataFrame Scala API supports that:}}
> {{scala> Seq(1, 2).toDF().collect()}}
> {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}}
> To maintain API consistency, we propose to support DataFrame creation from a 
> list of scalars. 
> See more 
> [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40357) Migrate window type check failures onto error classes

2022-09-20 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607277#comment-17607277
 ] 

Max Gekk commented on SPARK-40357:
--

[~lvshaokang] Sure, go ahead.

> Migrate window type check failures onto error classes
> -
>
> Key: SPARK-40357
> URL: https://issues.apache.org/jira/browse/SPARK-40357
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in window 
> expressions:
> 1. WindowSpecDefinition (4): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L68-L85
> 2. SpecifiedWindowFrame (3): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L216-L231
> 3. checkBoundary (2): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L264-L269
> 4. FrameLessOffsetWindowFunction (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L424



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40357) Migrate window type check failures onto error classes

2022-09-20 Thread Shaokang Lv (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607275#comment-17607275
 ] 

Shaokang Lv commented on SPARK-40357:
-

Hi [~maxgekk], I would like to do some work and pick this up if possible.

> Migrate window type check failures onto error classes
> -
>
> Key: SPARK-40357
> URL: https://issues.apache.org/jira/browse/SPARK-40357
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in window 
> expressions:
> 1. WindowSpecDefinition (4): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L68-L85
> 2. SpecifiedWindowFrame (3): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L216-L231
> 3. checkBoundary (2): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L264-L269
> 4. FrameLessOffsetWindowFunction (1): 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L424



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns

2022-09-20 Thread Joost Farla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607270#comment-17607270
 ] 

Joost Farla commented on SPARK-34805:
-

[~cloud_fan] I was running into the exact same issue using Spark v3.3.0. It 
looks like the fix was merged into the 3.3 branch (on March 21st), but was not 
yet released as part of v3.3. It is also not mentioned in the release notes. Is 
that possible? Thanks in advance!

> PySpark loses metadata in DataFrame fields when selecting nested columns
> 
>
> Key: SPARK-34805
> URL: https://issues.apache.org/jira/browse/SPARK-34805
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Mark Ressler
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: jsonMetadataTest.py, nested_columns_metadata.scala
>
>
> For a DataFrame schema with nested StructTypes, where metadata is set for 
> fields in the schema, that metadata is lost when a DataFrame selects nested 
> fields.  For example, suppose
> {code:java}
> df.schema.fields[0].dataType.fields[0].metadata
> {code}
> returns a non-empty dictionary, then
> {code:java}
> df.select('Field0.SubField0').schema.fields[0].metadata{code}
> returns an empty dictionary, where "Field0" is the name of the first field in 
> the DataFrame and "SubField0" is the name of the first nested field under 
> "Field0".
>  
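
For anyone who wants to reproduce this quickly, a self-contained sketch of the 
behaviour described above (field names and the metadata payload are made up for 
illustration):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

inner = StructType([
    StructField("SubField0", StringType(), True, metadata={"comment": "kept?"})
])
schema = StructType([StructField("Field0", inner, True)])

df = spark.createDataFrame([(("abc",),)], schema)

# Metadata is present on the nested field in the original schema...
print(df.schema["Field0"].dataType["SubField0"].metadata)        # {'comment': 'kept?'}

# ...but comes back empty after selecting the nested column on affected versions.
print(df.select("Field0.SubField0").schema.fields[0].metadata)   # {}
{code}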



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40479) Migrate unexpected input type error to an error class

2022-09-20 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40479.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37921
[https://github.com/apache/spark/pull/37921]

> Migrate unexpected input type error to an error class
> -
>
> Key: SPARK-40479
> URL: https://issues.apache.org/jira/browse/SPARK-40479
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Migrate the function ExpectsInputTypes.checkInputDataTypes onto 
> DataTypeMismatch and introduce new error class.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40479) Migrate unexpected input type error to an error class

2022-09-20 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40479:


Assignee: Max Gekk

> Migrate unexpected input type error to an error class
> -
>
> Key: SPARK-40479
> URL: https://issues.apache.org/jira/browse/SPARK-40479
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Migrate the function ExpectsInputTypes.checkInputDataTypes onto 
> DataTypeMismatch and introduce new error class.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40491) Remove too old TODO for JdbcRDD

2022-09-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40491:
-
Issue Type: Task  (was: New Feature)
  Priority: Trivial  (was: Major)

This didn't need a JIRA - it was not Major. Please set the fields appropriately.

> Remove too old TODO for JdbcRDD
> ---
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Trivial
>
> According to the legacy document of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.
> In fact, this is a very old TODO and we need to revisit if this is still 
> necessary. Since Spark SQL is the new core, I'm not sure if anyone is 
> interested in a new API to create jdbc RDD.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.

2022-09-20 Thread Garret Wilson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607188#comment-17607188
 ] 

Garret Wilson edited comment on SPARK-40489 at 9/20/22 1:19 PM:


{quote}It sounds like the new major version upgrade is done in a month ago and 
we don't quite know about stability.{quote}

You're missing the point. This ticket is not a request for Spark to move to 
SLF4J 2.x. The ticket is a bug report asking Spark to stop accessing 
{{StaticLoggerBinder}}, which isn't part of the SLF4J API in the first place, 
so that it won't break for people who do move to SLF4J 2.x. Spark can keep 
using SLF4J 1.x as long as it wants.

{quote}The comment about log4j1 is moot as recent version of Spark uses 
log4j2.{quote}

This too is missing the point. You shouldn't be using Log4J2 directly. You 
should be coding to the SLF4J public API. Directly accessing one particular 
SLF4J implementation is just making it cumbersome for everybody else because 
you're not playing by the rules.


was (Author: garretwilson):
{quote}It sounds like the new major version upgrade is done in a month ago and 
we don't quite know about stability.{quote}

You're missing the point. This ticket is not a request for Spark to move to 
SLF4J 2.x. The ticket is a bug report for Spark to stop access 
{{StaticLoggerBinder}}, which isn't part of the SLF4J API in the first place, 
so it won't break for people who do move to SLF4J 2.x Spark can keep using 
SLF4J 1.x as long as it wants.

{quote}The comment about log4j1 is moot as recent version of Spark uses 
log4j2.{quote}

This too is missing the point. You shouldn't be using Log4J2 directly. You 
should be coding to the SLF4J public API. Directly accessing one particular 
SLF4J is just making it cumbersome for everybody else because you're not 
playing by the rules.

> Spark 3.3.0 breaks with SFL4J 2.
> 
>
> Key: SPARK-40489
> URL: https://issues.apache.org/jira/browse/SPARK-40489
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Garret Wilson
>Priority: Major
>
> Spark breaks fundamentally with SLF4J 2.x because it uses 
> {{StaticLoggerBinder}}.
> SLF4J is the logging facade that is meant to shield the application from the 
> implementation, whether it be Log4J or Logback or whatever. Historically 
> SLF4J 1.x used a bad approach to configuration: it used a 
> {{StaticLoggerBinder}} (a global static singleton instance) rather than the 
> Java {{ServiceLoader}} mechanism.
> SLF4J 2.x, which has been in development for years, has been released. It 
> finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
> FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
> compatible; an application just needs to use the latest Log4J/Logback 
> implementation which has the service loader.
> *Above all the application must _not_ use the low-level 
> {{StaticLoggerBinder}} method, because it has been removed!*
> Unfortunately 
> [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
>  uses {{StaticLoggerBinder}} and completely breaks any environment using 
> SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API 
> and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark 
> breaks completely just trying to get a Spark session:
> {noformat}
> Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
> at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
> at 
> org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.log(Logging.scala:53)
> at org.apache.spark.internal.Logging.log$(Logging.scala:51)
> at org.apache.spark.SparkContext.log(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.logInfo(Logging.scala:61)
> at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60)
> at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:195)
>

[jira] [Comment Edited] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.

2022-09-20 Thread Garret Wilson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607188#comment-17607188
 ] 

Garret Wilson edited comment on SPARK-40489 at 9/20/22 1:18 PM:


{quote}It sounds like the new major version upgrade is done in a month ago and 
we don't quite know about stability.{quote}

You're missing the point. This ticket is not a request for Spark to move to 
SLF4J 2.x. The ticket is a bug report for Spark to stop access 
{{StaticLoggerBinder}}, which isn't part of the SLF4J API in the first place, 
so it won't break for people who do move to SLF4J 2.x Spark can keep using 
SLF4J 1.x as long as it wants.

{quote}The comment about log4j1 is moot as recent version of Spark uses 
log4j2.{quote}

This too is missing the point. You shouldn't be using Log4J2 directly. You 
should be coding to the SLF4J public API. Directly accessing one particular 
SLF4J is just making it cumbersome for everybody else because you're not 
playing by the rules.


was (Author: garretwilson):
{quote}It sounds like the new major version upgrade is done in a month ago and 
we don't quite know about stability.{quote}

You're missing the point. This ticket is not a request for Spark to move to 
SLF4J 2.x. The ticket is a bug report for Spark to stop access 
{{StaticLoggerBinder}}, which isn't part of the SLF4J API in the first place, 
so it won't break for people who do move to SLF4J 2.x Spark can keep using 
SLF4J 1.x as long as it wants.

{quote}The comment about log4j1 is moot as recent version of Spark uses 
log4j2.{quote}

This too is missing the point. You shouldn't be using Log4J2 directly. You 
should be coding to the SLF4J public API. Directly accessing one particular 
SLF4J is just making it cumbersome for everybody else because you're playing by 
the rules.

> Spark 3.3.0 breaks with SFL4J 2.
> 
>
> Key: SPARK-40489
> URL: https://issues.apache.org/jira/browse/SPARK-40489
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Garret Wilson
>Priority: Major
>
> Spark breaks fundamentally with SLF4J 2.x because it uses 
> {{StaticLoggerBinder}}.
> SLF4J is the logging facade that is meant to shield the application from the 
> implementation, whether it be Log4J or Logback or whatever. Historically 
> SLF4J 1.x used a bad approach to configuration: it used a 
> {{StaticLoggerBinder}} (a global static singleton instance) rather than the 
> Java {{ServiceLoader}} mechanism.
> SLF4J 2.x, which has been in development for years, has been released. It 
> finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
> FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
> compatible; an application just needs to use the latest Log4J/Logback 
> implementation which has the service loader.
> *Above all the application must _not_ use the low-level 
> {{StaticLoggerBinder}} method, because it has been removed!*
> Unfortunately 
> [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
>  uses {{StaticLoggerBinder}} and completely breaks any environment using 
> SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API 
> and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark 
> breaks completely just trying to get a Spark session:
> {noformat}
> Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
> at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
> at 
> org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.log(Logging.scala:53)
> at org.apache.spark.internal.Logging.log$(Logging.scala:51)
> at org.apache.spark.SparkContext.log(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.logInfo(Logging.scala:61)
> at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60)
> at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:195)
> at 

[jira] [Commented] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.

2022-09-20 Thread Garret Wilson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607188#comment-17607188
 ] 

Garret Wilson commented on SPARK-40489:
---

{quote}It sounds like the new major version upgrade was done only a month ago and 
we don't quite know about its stability.{quote}

You're missing the point. This ticket is not a request for Spark to move to 
SLF4J 2.x. The ticket is a bug report for Spark to stop accessing 
{{StaticLoggerBinder}}, which isn't part of the SLF4J API in the first place, 
so that Spark won't break for people who do move to SLF4J 2.x. Spark can keep 
using SLF4J 1.x as long as it wants.

{quote}The comment about log4j1 is moot as recent version of Spark uses 
log4j2.{quote}

This too is missing the point. You shouldn't be using Log4J 2 directly. You 
should be coding to the SLF4J public API. Directly accessing one particular 
SLF4J implementation just makes things cumbersome for everybody else, because 
you're not playing by the rules.
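
For illustration, a minimal example (an editor's sketch, not Spark code) of what coding 
to the SLF4J public API looks like: only {{org.slf4j}} types are referenced, so the 
backend can be Log4j 2, Logback, or anything else resolved at runtime.

{code:java}
import org.slf4j.{Logger, LoggerFactory}

object FacadeOnly {
  // Only the facade is referenced; the concrete backend is resolved by SLF4J
  // itself (via StaticLoggerBinder in 1.x, via ServiceLoader in 2.x).
  private val log: Logger = LoggerFactory.getLogger(getClass)

  def main(args: Array[String]): Unit = {
    log.info("Bound logger factory: " + LoggerFactory.getILoggerFactory.getClass.getName)
  }
}
{code}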

> Spark 3.3.0 breaks with SFL4J 2.
> 
>
> Key: SPARK-40489
> URL: https://issues.apache.org/jira/browse/SPARK-40489
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Garret Wilson
>Priority: Major
>
> Spark breaks fundamentally with SLF4J 2.x because it uses 
> {{StaticLoggerBinder}}.
> SLF4J is the logging facade that is meant to shield the application from the 
> implementation, whether it be Log4J or Logback or whatever. Historically 
> SLF4J 1.x used a bad approach to configuration: it used a 
> {{StaticLoggerBinder}} (a global static singleton instance) rather than the 
> Java {{ServiceLoader}} mechanism.
> SLF4J 2.x, which has been in development for years, has been released. It 
> finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
> FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
> compatible; an application just needs to use the latest Log4J/Logback 
> implementation which has the service loader.
> *Above all the application must _not_ use the low-level 
> {{StaticLoggerBinder}} method, because it has been removed!*
> Unfortunately 
> [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
>  uses {{StaticLoggerBinder}} and completely breaks any environment using 
> SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API 
> and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark 
> breaks completely just trying to get a Spark session:
> {noformat}
> Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
> at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
> at 
> org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.log(Logging.scala:53)
> at org.apache.spark.internal.Logging.log$(Logging.scala:51)
> at org.apache.spark.SparkContext.log(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.logInfo(Logging.scala:61)
> at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60)
> at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:195)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
> at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
> at scala.Option.getOrElse(Option.scala:201)
> at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
> {noformat}
> This is because Spark is playing low-level tricks to find out if the logging 
> platform is Log4J, and relying on {{StaticLoggerBinder}} to do it.
> {code}
>   private def isLog4j2(): Boolean = {
> // This distinguishes the log4j 1.2 binding, currently
> // org.slf4j.impl.Log4jLoggerFactory, from the log4j 2.0 binding, 
> currently
> // org.apache.logging.slf4j.Log4jLoggerFactory
> val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
> 
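// Editor's sketch (an assumption about one possible direction, not the actual
// Spark patch): the same class-name check can be made through the public SLF4J
// API, which exists in both 1.x and 2.x, so no org.slf4j.impl.StaticLoggerBinder
// reference is needed.
import org.slf4j.LoggerFactory

private def isLog4j2(): Boolean = {
  // getILoggerFactory returns the bound ILoggerFactory; its class name
  // identifies the log4j 2.x binding without touching StaticLoggerBinder.
  val binderClass = LoggerFactory.getILoggerFactory.getClass.getName
  "org.apache.logging.slf4j.Log4jLoggerFactory".equals(binderClass)
}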

[jira] [Comment Edited] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.

2022-09-20 Thread Garret Wilson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606852#comment-17606852
 ] 

Garret Wilson edited comment on SPARK-40489 at 9/20/22 1:13 PM:


# Dropping explicit Log4J 1.x support is certainly one of the things that needs 
to be done immediately. Not only is it full of vulnerabilities, it [reached end 
of life|https://logging.apache.org/log4j/1.2/] over five years ago!
# The Log4J implementation dependencies should be removed from Spark as well. 
See my question [Correctly fixing multiple `StaticLoggerBinder` bindings in 
Spark|https://stackoverflow.com/q/73615263] on Stack Overflow (which few people 
seem to have given any thought or care about, given the zero responses I have 
received so far).
# And of course {{StaticLoggerBinder}} references should be abandoned.

All this should have been done years ago. I mention this to give it some sense 
of urgency, in light of what I will say next.

I hesitate to even mention the following, because it might lower the priority 
of the ticket, but for those who might be in a pickle, I just released 
{{io.clogr:slf4j1-shim:0.8.3}} to Maven Central, which is a 
[shim|https://github.com/globalmentor/clogr/tree/master/slf4j1-shim] that will 
keep Spark from breaking in the face of SLF4J 2.x. Just include it as a 
dependency and Spark will stop breaking. *But this is a stop-gap measure! 
Please fix this bug!* :)


was (Author: garretwilson):
# Dropping explicit Log4J 1.x support is certainly one of the things that needs 
to be done immediately. Not only is it full of vulnerabilities, it [reached end 
of life|https://logging.apache.org/log4j/1.2/] over five years ago!
# The Log4J implementation dependencies should be removed from Spark as well. 
See my question [Correctly fixing multiple `StaticLoggerBinder` bindings in 
Spark|https://stackoverflow.com/q/73615263] on Stack Overflow (which few people 
seem to have given any thought or care about, given the zero responses I have 
received so far).
# And of course {{StaticLoggerBinder}} references should be abandoned.

All this should have been done years ago. I mention this to give it some sense 
of urgency, in light of what I will say next.

I hesitate to even mention the following, because it might lower the priority 
of the ticket, but for those who might be in a pickle, I just released 
{{io.clogr:clogr-slf4j1-adapter:0.8.2}} to Maven Central, which is an 
[adapter|https://github.com/globalmentor/clogr/tree/master/clogr-slf4j1-adapter]
 (a shim, really) that will keep Spark from breaking in the face of SLF4J 2.x. 
Just include it as a dependency and Spark will stop breaking. *But this is a 
stop-gap measure! Please fix this bug!* :)

> Spark 3.3.0 breaks with SFL4J 2.
> 
>
> Key: SPARK-40489
> URL: https://issues.apache.org/jira/browse/SPARK-40489
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Garret Wilson
>Priority: Major
>
> Spark breaks fundamentally with SLF4J 2.x because it uses 
> {{StaticLoggerBinder}}.
> SLF4J is the logging facade that is meant to shield the application from the 
> implementation, whether it be Log4J or Logback or whatever. Historically 
> SLF4J 1.x used a bad approach to configuration: it used a 
> {{StaticLoggerBinder}} (a global static singleton instance) rather than the 
> Java {{ServiceLoader}} mechanism.
> SLF4J 2.x, which has been in development for years, has been released. It 
> finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
> FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
> compatible; an application just needs to use the latest Log4J/Logback 
> implementation which has the service loader.
> *Above all the application must _not_ use the low-level 
> {{StaticLoggerBinder}} method, because it has been removed!*
> Unfortunately 
> [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
>  uses {{StaticLoggerBinder}} and completely breaks any environment using 
> SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API 
> and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark 
> breaks completely just trying to get a Spark session:
> {noformat}
> Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
> at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
> at 
> org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
> at 
> 

[jira] [Assigned] (SPARK-40506) Spark Streaming metrics name don't need application name

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40506:


Assignee: (was: Apache Spark)

> Spark Streaming metrics name don't need application name
> 
>
> Key: SPARK-40506
> URL: https://issues.apache.org/jira/browse/SPARK-40506
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.2.2
>Reporter: 王俊博
>Priority: Trivial
>
> The Spark StreamingSource metrics sourceName is inappropriate. The label now 
> looks like 
> `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`
>  instead of 
> `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`;
>  the Spark app name is not needed.
> This makes it hard to use metrics for different Spark applications over time, 
> and it makes the metrics sourceName standard inconsistent.
> {code:java}
> // code placeholder
> private[streaming] class StreamingSource(ssc: StreamingContext) extends 
> Source {
>   override val metricRegistry = new MetricRegistry
>   override val sourceName = 
> "%s.StreamingMetrics".format(ssc.sparkContext.appName)
> 
> }{code}
> For example, other metrics sourceNames don't include the appName.
> {code:java}
> // code placeholder
> private[spark] class LiveListenerBusMetrics(conf: SparkConf)
>   extends Source with Logging {
>   override val sourceName: String = "LiveListenerBus"
>   override val metricRegistry: MetricRegistry = new MetricRegistry
> ...
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40506) Spark Streaming metrics name don't need application name

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40506:


Assignee: Apache Spark

> Spark Streaming metrics name don't need application name
> 
>
> Key: SPARK-40506
> URL: https://issues.apache.org/jira/browse/SPARK-40506
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.2.2
>Reporter: 王俊博
>Assignee: Apache Spark
>Priority: Trivial
>
> The Spark StreamingSource metrics sourceName is inappropriate. The label now 
> looks like 
> `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`
>  instead of 
> `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`;
>  the Spark app name is not needed.
> This makes it hard to use metrics for different Spark applications over time, 
> and it makes the metrics sourceName standard inconsistent.
> {code:java}
> // code placeholder
> private[streaming] class StreamingSource(ssc: StreamingContext) extends 
> Source {
>   override val metricRegistry = new MetricRegistry
>   override val sourceName = 
> "%s.StreamingMetrics".format(ssc.sparkContext.appName)
> 
> }{code}
> For example, other metrics sourceNames don't include the appName.
> {code:java}
> // code placeholder
> private[spark] class LiveListenerBusMetrics(conf: SparkConf)
>   extends Source with Logging {
>   override val sourceName: String = "LiveListenerBus"
>   override val metricRegistry: MetricRegistry = new MetricRegistry
> ...
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40506) Spark Streaming metrics name don't need application name

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607167#comment-17607167
 ] 

Apache Spark commented on SPARK-40506:
--

User 'Kwafoor' has created a pull request for this issue:
https://github.com/apache/spark/pull/37951

> Spark Streaming metrics name don't need application name
> 
>
> Key: SPARK-40506
> URL: https://issues.apache.org/jira/browse/SPARK-40506
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.2.2
>Reporter: 王俊博
>Priority: Trivial
>
> The Spark StreamingSource metrics sourceName is inappropriate. The label now 
> looks like 
> `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`
>  instead of 
> `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`;
>  the Spark app name is not needed.
> This makes it hard to use metrics for different Spark applications over time, 
> and it makes the metrics sourceName standard inconsistent.
> {code:java}
> // code placeholder
> private[streaming] class StreamingSource(ssc: StreamingContext) extends 
> Source {
>   override val metricRegistry = new MetricRegistry
>   override val sourceName = 
> "%s.StreamingMetrics".format(ssc.sparkContext.appName)
> 
> }{code}
> For example, other metrics sourceNames don't include the appName.
> {code:java}
> // code placeholder
> private[spark] class LiveListenerBusMetrics(conf: SparkConf)
>   extends Source with Logging {
>   override val sourceName: String = "LiveListenerBus"
>   override val metricRegistry: MetricRegistry = new MetricRegistry
> ...
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40506) Spark Streaming metrics name don't need application name

2022-09-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

王俊博 updated SPARK-40506:

Summary: Spark Streaming metrics name don't need application name  (was: 
Spark Streaming Metrics SourceName is unsuitable)

> Spark Streaming metrics name don't need application name
> 
>
> Key: SPARK-40506
> URL: https://issues.apache.org/jira/browse/SPARK-40506
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.2.2
>Reporter: 王俊博
>Priority: Trivial
>
> The Spark StreamingSource metrics sourceName is inappropriate. The label now 
> looks like 
> `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`
>  instead of 
> `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`;
>  the Spark app name is not needed.
> This makes it hard to use metrics for different Spark applications over time, 
> and it makes the metrics sourceName standard inconsistent.
> {code:java}
> // code placeholder
> private[streaming] class StreamingSource(ssc: StreamingContext) extends 
> Source {
>   override val metricRegistry = new MetricRegistry
>   override val sourceName = 
> "%s.StreamingMetrics".format(ssc.sparkContext.appName)
> 
> }{code}
> For example, other metrics sourceNames don't include the appName.
> {code:java}
> // code placeholder
> private[spark] class LiveListenerBusMetrics(conf: SparkConf)
>   extends Source with Logging {
>   override val sourceName: String = "LiveListenerBus"
>   override val metricRegistry: MetricRegistry = new MetricRegistry
> ...
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40506) Spark Streaming Metrics SourceName is unsuitable

2022-09-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

王俊博 updated SPARK-40506:

Description: 
The Spark StreamingSource metrics sourceName is inappropriate. The label now looks 
like 
`application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`
 instead of 
`application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`;
 the Spark app name is not needed.

This makes it hard to use metrics for different Spark applications over time, 
and it makes the metrics sourceName standard inconsistent.
{code:java}
// code placeholder

private[streaming] class StreamingSource(ssc: StreamingContext) extends Source {
  override val metricRegistry = new MetricRegistry
  override val sourceName = 
"%s.StreamingMetrics".format(ssc.sparkContext.appName)

}{code}
For example, other metrics sourceNames don't include the appName.
{code:java}
// code placeholder
private[spark] class LiveListenerBusMetrics(conf: SparkConf)
  extends Source with Logging {

  override val sourceName: String = "LiveListenerBus"
  override val metricRegistry: MetricRegistry = new MetricRegistry
...
}

{code}

  was:
The Spark StreamingSource metrics sourceName is inappropriate. The label now looks 
like 
`application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`
 instead of 
`application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`;
 the Spark app name is not needed.
{code:java}
// code placeholder

private[streaming] class StreamingSource(ssc: StreamingContext) extends Source {
  override val metricRegistry = new MetricRegistry
  override val sourceName = 
"%s.StreamingMetrics".format(ssc.sparkContext.appName)

}{code}
For example, other metrics sourceNames don't include the appName.
{code:java}
// code placeholder
private[spark] class LiveListenerBusMetrics(conf: SparkConf)
  extends Source with Logging {

  override val sourceName: String = "LiveListenerBus"
  override val metricRegistry: MetricRegistry = new MetricRegistry
...
}

{code}


> Spark Streaming Metrics SourceName is unsuitable
> 
>
> Key: SPARK-40506
> URL: https://issues.apache.org/jira/browse/SPARK-40506
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.2.2
>Reporter: 王俊博
>Priority: Trivial
>
> The Spark StreamingSource metrics sourceName is inappropriate. The label now 
> looks like 
> `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`
>  instead of 
> `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`;
>  the Spark app name is not needed.
> This makes it hard to use metrics for different Spark applications over time, 
> and it makes the metrics sourceName standard inconsistent.
> {code:java}
> // code placeholder
> private[streaming] class StreamingSource(ssc: StreamingContext) extends 
> Source {
>   override val metricRegistry = new MetricRegistry
>   override val sourceName = 
> "%s.StreamingMetrics".format(ssc.sparkContext.appName)
> 
> }{code}
> For example, other metrics sourceNames don't include the appName.
> {code:java}
> // code placeholder
> private[spark] class LiveListenerBusMetrics(conf: SparkConf)
>   extends Source with Logging {
>   override val sourceName: String = "LiveListenerBus"
>   override val metricRegistry: MetricRegistry = new MetricRegistry
> ...
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40506) Spark Streaming Metrics SourceName is unsuitable

2022-09-20 Thread Jira
王俊博 created SPARK-40506:
---

 Summary: Spark Streaming Metrics SourceName is unsuitable
 Key: SPARK-40506
 URL: https://issues.apache.org/jira/browse/SPARK-40506
 Project: Spark
  Issue Type: Improvement
  Components: DStreams
Affects Versions: 3.2.2
Reporter: 王俊博


The Spark StreamingSource metrics sourceName is inappropriate. The label now looks 
like 
`application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`
 instead of 
`application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`;
 the Spark app name is not needed.
{code:java}
// code placeholder

private[streaming] class StreamingSource(ssc: StreamingContext) extends Source {
  override val metricRegistry = new MetricRegistry
  override val sourceName = 
"%s.StreamingMetrics".format(ssc.sparkContext.appName)

}{code}
For example, other metrics sourceNames don't include the appName.
{code:java}
// code placeholder
private[spark] class LiveListenerBusMetrics(conf: SparkConf)
  extends Source with Logging {

  override val sourceName: String = "LiveListenerBus"
  override val metricRegistry: MetricRegistry = new MetricRegistry
...
}

{code}
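
A minimal sketch of the direction the summary points at (an editor's assumption, not a 
committed patch): register the source without the app name, matching the other sources 
shown above. The metrics system already prefixes registered sources with the 
application id, so dropping appName here removes the duplicated, app-specific part of 
the label.

{code:java}
import com.codahale.metrics.MetricRegistry

import org.apache.spark.metrics.source.Source
import org.apache.spark.streaming.StreamingContext

private[streaming] class StreamingSource(ssc: StreamingContext) extends Source {
  override val metricRegistry = new MetricRegistry

  // "StreamingMetrics" only; the per-application prefix is added by the
  // metrics system itself, so the app name no longer appears in the source name.
  override val sourceName = "StreamingMetrics"
}
{code}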



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607151#comment-17607151
 ] 

Apache Spark commented on SPARK-40505:
--

User 'bryanck' has created a pull request for this issue:
https://github.com/apache/spark/pull/37950

> Remove min heap setting in Kubernetes Dockerfile entrypoint
> ---
>
> Key: SPARK-40505
> URL: https://issues.apache.org/jira/browse/SPARK-40505
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Bryan Keller
>Priority: Major
>
> The entrypoint script for the Kubernetes Dockerfile sets the Java min heap 
> setting (-Xms) to be the same as the max setting (-Xmx) for the executor 
> process. This prevents the JVM from shrinking the heap and can lead to 
> excessive memory usage in some scenarios. Removing the min heap setting is 
> consistent with YARN executor startup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40505:


Assignee: Apache Spark

> Remove min heap setting in Kubernetes Dockerfile entrypoint
> ---
>
> Key: SPARK-40505
> URL: https://issues.apache.org/jira/browse/SPARK-40505
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Bryan Keller
>Assignee: Apache Spark
>Priority: Major
>
> The entrypoint script for the Kubernetes Dockerfile sets the Java min heap 
> setting (-Xms) to be the same as the max setting (-Xmx) for the executor 
> process. This prevents the JVM from shrinking the heap and can lead to 
> excessive memory usage in some scenarios. Removing the min heap setting is 
> consistent with YARN executor startup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40505:


Assignee: (was: Apache Spark)

> Remove min heap setting in Kubernetes Dockerfile entrypoint
> ---
>
> Key: SPARK-40505
> URL: https://issues.apache.org/jira/browse/SPARK-40505
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Bryan Keller
>Priority: Major
>
> The entrypoint script for the Kubernetes Dockerfile sets the Java min heap 
> setting (-Xms) to be the same as the max setting (-Xmx) for the executor 
> process. This prevents the JVM from shrinking the heap and can lead to 
> excessive memory usage in some scenarios. Removing the min heap setting is 
> consistent with YARN executor startup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607149#comment-17607149
 ] 

Apache Spark commented on SPARK-40505:
--

User 'bryanck' has created a pull request for this issue:
https://github.com/apache/spark/pull/37950

> Remove min heap setting in Kubernetes Dockerfile entrypoint
> ---
>
> Key: SPARK-40505
> URL: https://issues.apache.org/jira/browse/SPARK-40505
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Bryan Keller
>Priority: Major
>
> The entrypoint script for the Kubernetes Dockerfile sets the Java min heap 
> setting (-Xms) to be the same as the max setting (-Xmx) for the executor 
> process. This prevents the JVM from shrinking the heap and can lead to 
> excessive memory usage in some scenarios. Removing the min heap setting is 
> consistent with YARN executor startup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint

2022-09-20 Thread Bryan Keller (Jira)
Bryan Keller created SPARK-40505:


 Summary: Remove min heap setting in Kubernetes Dockerfile 
entrypoint
 Key: SPARK-40505
 URL: https://issues.apache.org/jira/browse/SPARK-40505
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.2.2, 3.3.0
Reporter: Bryan Keller


The entrypoint script for the Kubernetes Dockerfile sets the Java min heap 
setting (-Xms) to be the same as the max setting (-Xmx) for the executor 
process. This prevents the JVM from shrinking the heap and can lead to 
excessive memory usage in some scenarios. Removing the min heap setting is 
consistent with YARN executor startup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40501) Add PushProjectionThroughLimit for Optimizer

2022-09-20 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-40501:

Description: 
h4. It takes a very long time to fetch the result; the query was still running after 20 minutes...

when running the following code in spark-shell:

spark.sql("select * from xxx where event_day = '20220919' limit 1").show()

[!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png]
[!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png]

  was:
h4. It takes a very long time to fetch the result

when running the following code in spark-shell:

spark.sql("select * from xxx where event_day = '20220919' limit 1").show()


[!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png]
[!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png]


> Add PushProjectionThroughLimit for Optimizer
> 
>
> Key: SPARK-40501
> URL: https://issues.apache.org/jira/browse/SPARK-40501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>
> h4. It takes a very long time to fetch the result; the query was still running after 20 minutes...
> when running the following code in spark-shell:
> spark.sql("select * from xxx where event_day = '20220919' limit 1").show()
> [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png]
> [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png]
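
As context for the title, a sketch of what such an optimizer rule could look like (an 
editor's illustration built on the usual Catalyst Project/GlobalLimit/LocalLimit 
operators, not the actual change in the linked pull request):

{code:java}
import org.apache.spark.sql.catalyst.plans.logical.{GlobalLimit, LocalLimit, LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

object PushProjectionThroughLimit extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    // Rewrite Project(GlobalLimit(LocalLimit(child))) into
    // GlobalLimit(LocalLimit(Project(child))) when the projection is deterministic.
    case Project(projectList, GlobalLimit(gExpr, LocalLimit(lExpr, child)))
        if projectList.forall(_.deterministic) =>
      GlobalLimit(gExpr, LocalLimit(lExpr, Project(projectList, child)))
  }
}
{code}

Pushing the projection below the limit only changes which columns reach the limit, not 
how many rows, so the rewrite is safe as long as the project list is deterministic 
(hence the guard).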



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40501) Add PushProjectionThroughLimit for Optimizer

2022-09-20 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-40501:

Summary: Add PushProjectionThroughLimit for Optimizer  (was: add 
PushProjectionThroughLimit for Optimizer)

> Add PushProjectionThroughLimit for Optimizer
> 
>
> Key: SPARK-40501
> URL: https://issues.apache.org/jira/browse/SPARK-40501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>
> h4. It takes a very long time to fetch the result
> when running the following code in spark-shell:
> spark.sql("select * from xxx where event_day = '20220919' limit 1").show()
> [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png]
> [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40504) Make yarn appmaster load config from client

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607092#comment-17607092
 ] 

Apache Spark commented on SPARK-40504:
--

User 'zhengchenyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/37949

> Make yarn appmaster load config from client
> ---
>
> Key: SPARK-40504
> URL: https://issues.apache.org/jira/browse/SPARK-40504
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.1
>Reporter: zhengchenyu
>Priority: Major
>
> In YARN federation mode, the config on the client side and on the NM side may 
> be different. The AppMaster should override its config with the client-side 
> config.
> For example: 
> On the client side, yarn.resourcemanager.ha.rm-ids points to the YARN routers.
> On the NM side, yarn.resourcemanager.ha.rm-ids points to the RMs of the subcluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40504) Make yarn appmaster load config from client

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40504:


Assignee: Apache Spark

> Make yarn appmaster load config from client
> ---
>
> Key: SPARK-40504
> URL: https://issues.apache.org/jira/browse/SPARK-40504
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.1
>Reporter: zhengchenyu
>Assignee: Apache Spark
>Priority: Major
>
> In YARN federation mode, the config on the client side and on the NM side may 
> be different. The AppMaster should override its config with the client-side 
> config.
> For example: 
> On the client side, yarn.resourcemanager.ha.rm-ids points to the YARN routers.
> On the NM side, yarn.resourcemanager.ha.rm-ids points to the RMs of the subcluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40504) Make yarn appmaster load config from client

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40504:


Assignee: (was: Apache Spark)

> Make yarn appmaster load config from client
> ---
>
> Key: SPARK-40504
> URL: https://issues.apache.org/jira/browse/SPARK-40504
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.1
>Reporter: zhengchenyu
>Priority: Major
>
> In YARN federation mode, the config on the client side and on the NM side may 
> be different. The AppMaster should override its config with the client-side 
> config.
> For example: 
> On the client side, yarn.resourcemanager.ha.rm-ids points to the YARN routers.
> On the NM side, yarn.resourcemanager.ha.rm-ids points to the RMs of the subcluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40457) upgrade jackson data mapper to latest

2022-09-20 Thread Bilna (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607082#comment-17607082
 ] 

Bilna commented on SPARK-40457:
---

[~hyukjin.kwon] it is org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13

> upgrade jackson data mapper to latest 
> --
>
> Key: SPARK-40457
> URL: https://issues.apache.org/jira/browse/SPARK-40457
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bilna
>Priority: Major
>
> Upgrade  jackson-mapper-asl to the latest to resolve CVE-2019-10172



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40504) Make yarn appmaster load config from client

2022-09-20 Thread zhengchenyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengchenyu updated SPARK-40504:

Description: 
In YARN federation mode, the config on the client side and on the NM side may be 
different. The AppMaster should override its config with the client-side config.

For example: 

On the client side, yarn.resourcemanager.ha.rm-ids points to the YARN routers.

On the NM side, yarn.resourcemanager.ha.rm-ids points to the RMs of the subcluster.

  was:In YARN federation mode, the config on the client side and on the NM side 
may be different. The AppMaster should override its config with the client-side config.


> Make yarn appmaster load config from client
> ---
>
> Key: SPARK-40504
> URL: https://issues.apache.org/jira/browse/SPARK-40504
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.1
>Reporter: zhengchenyu
>Priority: Major
>
> In YARN federation mode, the config on the client side and on the NM side may 
> be different. The AppMaster should override its config with the client-side 
> config.
> For example: 
> On the client side, yarn.resourcemanager.ha.rm-ids points to the YARN routers.
> On the NM side, yarn.resourcemanager.ha.rm-ids points to the RMs of the subcluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607079#comment-17607079
 ] 

Apache Spark commented on SPARK-40327:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37948

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40504) Make yarn appmaster load config from client

2022-09-20 Thread zhengchenyu (Jira)
zhengchenyu created SPARK-40504:
---

 Summary: Make yarn appmaster load config from client
 Key: SPARK-40504
 URL: https://issues.apache.org/jira/browse/SPARK-40504
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 3.0.1
Reporter: zhengchenyu


In YARN federation mode, the config on the client side and on the NM side may be 
different. The AppMaster should override its config with the client-side config.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40327:


Assignee: (was: Apache Spark)

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40327:


Assignee: Apache Spark

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607077#comment-17607077
 ] 

Apache Spark commented on SPARK-40327:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37948

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40503) Add resampling to API references

2022-09-20 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40503:
-

 Summary: Add resampling to API references
 Key: SPARK-40503
 URL: https://issues.apache.org/jira/browse/SPARK-40503
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ps
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40491) Remove too old TODO for JdbcRDD

2022-09-20 Thread CaoYu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607045#comment-17607045
 ] 

CaoYu commented on SPARK-40491:
---

Maybe we can just not remove these. 

I have already created https://issues.apache.org/jira/browse/SPARK-40502; 
please take a look.

I want to try to implement a JDBC data source in PySpark.

I'm also interested in this task for Scala.

If possible, please assign this task to me; I want to try to get it done.

> Remove too old TODO for JdbcRDD
> ---
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> According to the legacy documentation of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.
> In fact, this is a very old TODO and we need to revisit whether it is still 
> necessary. Since Spark SQL is the new core, I'm not sure if anyone is 
> interested in a new API to create a JDBC RDD.
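
For reference, a minimal sketch of how the existing {{JdbcRDD}} constructor is used from 
Scala today (the JDBC URL, credentials, table, and bounds below are placeholders); the 
TODO above is essentially about whether a convenience wrapper around this still belongs 
on {{SparkContext}}:

{code:java}
import java.sql.{DriverManager, ResultSet}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object JdbcRddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-rdd-example").setMaster("local[*]"))

    // The SQL must contain two '?' placeholders, which JdbcRDD fills with each
    // partition's lower and upper bound.
    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "password"),
      "SELECT id, name FROM people WHERE id >= ? AND id <= ?",
      1L,    // lowerBound
      1000L, // upperBound
      4,     // numPartitions
      (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))

    rows.take(10).foreach(println)
    sc.stop()
  }
}
{code}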



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40502) Support dataframe API use jdbc data source in PySpark

2022-09-20 Thread CaoYu (Jira)
CaoYu created SPARK-40502:
-

 Summary: Support dataframe API use jdbc data source in PySpark
 Key: SPARK-40502
 URL: https://issues.apache.org/jira/browse/SPARK-40502
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 3.3.0
Reporter: CaoYu


When i using pyspark, i wanna get data from mysql database.  so i want use 
JDBCRDD like java\scala.

But that is not be supported in PySpark.

 

For some reasons, i can't using DataFrame API, only can use RDD(datastream) 
API. Even i know the DataFrame can get data from jdbc source fairly well.

 
So i want to implement functionality that can use rdd to get data from jdbc 
source for PySpark.
 
*But i don't know if that are necessary for PySpark.   so we can discuss it.*
 
{*}If it is necessary for PySpark{*}{*}, i want to contribute to Spark.{*}  
*i hope this Jira task can assigned to me, so i can start working to implement 
it.*
 
*if not, please close this Jira task.*
 
 
*thanks a lot.*
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607034#comment-17607034
 ] 

Apache Spark commented on SPARK-40500:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37947

> Use `pd.items` instead of `pd.iteritems`
> 
>
> Key: SPARK-40500
> URL: https://issues.apache.org/jira/browse/SPARK-40500
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40500:


Assignee: Apache Spark

> Use `pd.items` instead of `pd.iteritems`
> 
>
> Key: SPARK-40500
> URL: https://issues.apache.org/jira/browse/SPARK-40500
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40500:


Assignee: (was: Apache Spark)

> Use `pd.items` instead of `pd.iteritems`
> 
>
> Key: SPARK-40500
> URL: https://issues.apache.org/jira/browse/SPARK-40500
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40501) add PushProjectionThroughLimit for Optimizer

2022-09-20 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-40501:

Description: 
h4. It takes a very long time to fetch the result

when running the following code in spark-shell:

spark.sql("select * from xxx where event_day = '20220919' limit 1").show()


[!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png]
[!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png]

> add PushProjectionThroughLimit for Optimizer
> 
>
> Key: SPARK-40501
> URL: https://issues.apache.org/jira/browse/SPARK-40501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>
> h4. It takes a very long time to fetch the result
> when running the following code in spark-shell:
> spark.sql("select * from xxx where event_day = '20220919' limit 1").show()
> [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png]
> [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40501) add PushProjectionThroughLimit for Optimizer

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607033#comment-17607033
 ] 

Apache Spark commented on SPARK-40501:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37941

> add PushProjectionThroughLimit for Optimizer
> 
>
> Key: SPARK-40501
> URL: https://issues.apache.org/jira/browse/SPARK-40501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40501) add PushProjectionThroughLimit for Optimizer

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40501:


Assignee: (was: Apache Spark)

> add PushProjectionThroughLimit for Optimizer
> 
>
> Key: SPARK-40501
> URL: https://issues.apache.org/jira/browse/SPARK-40501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40501) add PushProjectionThroughLimit for Optimizer

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40501:


Assignee: Apache Spark

> add PushProjectionThroughLimit for Optimizer
> 
>
> Key: SPARK-40501
> URL: https://issues.apache.org/jira/browse/SPARK-40501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Priority: Blocker  (was: Major)

> Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop: 3.0.0 
> spark:  2.4.0 / 3.2.1
> shuffle:spark 2.4.0
>Reporter: xuanzhiang
>Priority: Blocker
> Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png
>
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used Spark 2.4.0, aggregation adopted objHashAggregator and stage 2 
> pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old 
> shuffle, 140M of shuffle data takes 3 hours. 
>  * If we upgrade the shuffle, will we get a performance regression?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40501) add PushProjectionThroughLimit for Optimizer

2022-09-20 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-40501:
---

 Summary: add PushProjectionThroughLimit for Optimizer
 Key: SPARK-40501
 URL: https://issues.apache.org/jira/browse/SPARK-40501
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`

2022-09-20 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40500:
-

 Summary: Use `pd.items` instead of `pd.iteritems`
 Key: SPARK-40500
 URL: https://issues.apache.org/jira/browse/SPARK-40500
 Project: Spark
  Issue Type: Improvement
  Components: ps
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng
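
For context, iteritems was deprecated in pandas 1.5 and removed in pandas 2.0, and items is the drop-in replacement, so the change should be mechanical. A minimal pandas-only sketch of the rename:

import pandas as pd

pdf = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Old spelling, removed in pandas 2.0:
#   for name, col in pdf.iteritems(): ...
# New spelling, available in pandas 1.x as well, with the same behaviour:
for name, col in pdf.items():
    print(name, col.tolist())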






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Priority: Blocker  (was: Minor)

> Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop: 3.0.0 
> spark:  2.4.0 / 3.2.1
> shuffle:spark 2.4.0
>Reporter: xuanzhiang
>Priority: Blocker
> Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png
>
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used Spark 2.4.0, aggregation used ObjectHashAggregate, and stage 2 
> pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old 
> (Spark 2.4.0) shuffle service, 140 MB of shuffle data takes 3 hours. 
>  * If we upgrade the shuffle service, will we still get this performance regression?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Priority: Major  (was: Blocker)

> Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop: 3.0.0 
> spark:  2.4.0 / 3.2.1
> shuffle:spark 2.4.0
>Reporter: xuanzhiang
>Priority: Major
> Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png
>
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used Spark 2.4.0, aggregation used ObjectHashAggregate, and stage 2 
> pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old 
> (Spark 2.4.0) shuffle service, 140 MB of shuffle data takes 3 hours. 
>  * If we upgrade the shuffle service, will we still get this performance regression?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Environment: 
hadoop: 3.0.0 

spark:  2.4.0 / 3.2.1

shuffle:spark 2.4.0

  was:
hadoop 3.0.0 

spark2.4.0 / spark3.2.1

shuffle: spark2.4.0


> Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop: 3.0.0 
> spark:  2.4.0 / 3.2.1
> shuffle:spark 2.4.0
>Reporter: xuanzhiang
>Priority: Minor
> Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png
>
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used Spark 2.4.0, aggregation used ObjectHashAggregate, and stage 2 
> pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old 
> (Spark 2.4.0) shuffle service, 140 MB of shuffle data takes 3 hours. 
>  * If we upgrade the shuffle service, will we still get this performance regression?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Attachment: spark3.2-shuffle-data.png

> Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop 3.0.0 
> spark2.4.0 / spark3.2.1
> shuffle: spark2.4.0
>Reporter: xuanzhiang
>Priority: Minor
> Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png
>
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used Spark 2.4.0, aggregation used ObjectHashAggregate, and stage 2 
> pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old 
> (Spark 2.4.0) shuffle service, 140 MB of shuffle data takes 3 hours. 
>  * If we upgrade the shuffle service, will we still get this performance regression?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40419) Integrate Grouped Aggregate Pandas UDFs into *.sql test cases

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607022#comment-17607022
 ] 

Apache Spark commented on SPARK-40419:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/37946

> Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
> -
>
> Key: SPARK-40419
> URL: https://issues.apache.org/jira/browse/SPARK-40419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> We ported Python UDFs, Scala UDFs and Scalar Pandas UDFs into the SQL test 
> cases in SPARK-27921, but Grouped Aggregate Pandas UDFs are not exercised 
> from SQL at all.
> We should leverage the same SQL test cases to cover grouped aggregate pandas UDFs as well.
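
For readers who have not called them from SQL before, a grouped aggregate pandas UDF can be registered and used in a GROUP BY roughly as below. This is a minimal sketch with made-up names, not the harness the *.sql test files actually use:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("double")
def pandas_mean(v: pd.Series) -> float:
    # Series -> scalar type hints make this a grouped aggregate pandas UDF.
    return float(v.mean())

spark.udf.register("pandas_mean", pandas_mean)

spark.range(0, 10) \
    .selectExpr("id % 2 AS k", "CAST(id AS DOUBLE) AS v") \
    .createOrReplaceTempView("t")

spark.sql("SELECT k, pandas_mean(v) AS mean_v FROM t GROUP BY k").show()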



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40419) Integrate Grouped Aggregate Pandas UDFs into *.sql test cases

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607021#comment-17607021
 ] 

Apache Spark commented on SPARK-40419:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/37946

> Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
> -
>
> Key: SPARK-40419
> URL: https://issues.apache.org/jira/browse/SPARK-40419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> We ported Python UDFs, Scala UDFs and Scalar Pandas UDFs into the SQL test 
> cases in SPARK-27921, but Grouped Aggregate Pandas UDFs are not exercised 
> from SQL at all.
> We should leverage the same SQL test cases to cover grouped aggregate pandas UDFs as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Description: 
spark.sql(
      s"""
         |SELECT
         | Info ,
         | PERCENTILE_APPROX(cost,0.5) cost_p50,
         | PERCENTILE_APPROX(cost,0.9) cost_p90,
         | PERCENTILE_APPROX(cost,0.95) cost_p95,
         | PERCENTILE_APPROX(cost,0.99) cost_p99,
         | PERCENTILE_APPROX(cost,0.999) cost_p999
         |FROM
         | textData
         |""".stripMargin)
 * When we used Spark 2.4.0, aggregation used ObjectHashAggregate, and stage 2 
pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old 
(Spark 2.4.0) shuffle service, 140 MB of shuffle data takes 3 hours. 

 * If we upgrade the shuffle service, will we still get this performance regression?

  was:
spark.sql(
      s"""
         |SELECT
         | Info ,
         | PERCENTILE_APPROX(cost,0.5) cost_p50,
         | PERCENTILE_APPROX(cost,0.9) cost_p90,
         | PERCENTILE_APPROX(cost,0.95) cost_p95,
         | PERCENTILE_APPROX(cost,0.99) cost_p99,
         | PERCENTILE_APPROX(cost,0.999) cost_p999
         |FROM
         | textData
         |""".stripMargin)
 * When we used Spark 2.4.0, aggregation used ObjectHashAggregate, and stage 2 
pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old 
(Spark 2.4.0) shuffle service, 140 MB of shuffle data takes 3 hours. 

 *  


> Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop 3.0.0 
> spark2.4.0 / spark3.2.1
> shuffle: spark2.4.0
>Reporter: xuanzhiang
>Priority: Minor
> Attachments: spark2.4-shuffle-data.png
>
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used Spark 2.4.0, aggregation used ObjectHashAggregate, and stage 2 
> pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old 
> (Spark 2.4.0) shuffle service, 140 MB of shuffle data takes 3 hours. 
>  * If we upgrade the shuffle service, will we still get this performance regression?
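
A possible mitigation to try while this is investigated, offered only as a suggestion and unverified against this workload: percentile_approx accepts a percentage array and an optional accuracy argument (default 10000), so one call can share a single sketch per group, and a smaller accuracy value shrinks the state that is shuffled and merged, at some cost in precision. A sketch, assuming the textData view from the report (a GROUP BY is added so the query is well formed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes `textData` is already registered, as in the report above.
# ARRAY(...) returns all requested percentiles from one sketch per group;
# the trailing 1000 lowers the sketch accuracy from the default 10000.
spark.sql("""
    SELECT Info,
           PERCENTILE_APPROX(cost, ARRAY(0.5, 0.9, 0.95, 0.99, 0.999), 1000) AS cost_pcts
    FROM textData
    GROUP BY Info
""").show(truncate=False)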



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


