[jira] [Commented] (SPARK-40502) Support dataframe API use jdbc data source in PySpark

2022-09-20 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607447#comment-17607447
 ] 

Hyukjin Kwon commented on SPARK-40502:
--

{quote}
For some reasons, I can't use the DataFrame API; I can only use the RDD (datastream) API.
{quote}
What's the reason?

> Support dataframe API use jdbc data source in PySpark
> -
>
> Key: SPARK-40502
> URL: https://issues.apache.org/jira/browse/SPARK-40502
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: CaoYu
>Priority: Major
>
> When I use PySpark, I want to get data from a MySQL database, so I want to 
> use JdbcRDD as in Java/Scala.
> But that is not supported in PySpark.
>  
> For some reasons, I can't use the DataFrame API and can only use the RDD 
> (datastream) API, even though I know a DataFrame can get data from a JDBC 
> source fairly well.
>  
> So I want to implement functionality for PySpark that can use an RDD to get 
> data from a JDBC source.
>  
> *But I don't know whether that is necessary for PySpark, so we can discuss it.*
>  
> *If it is necessary for PySpark, I want to contribute to Spark. I hope this 
> Jira task can be assigned to me, so I can start working on implementing it.*
>  
> *If not, please close this Jira task.*
>  
>  
> *Thanks a lot.*
>  
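For reference, the DataFrame path mentioned above already covers JDBC in PySpark. A 
minimal sketch, assuming a reachable MySQL instance and a JDBC driver jar on the 
classpath (the URL, table, and credentials below are placeholders):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-read-sketch").getOrCreate()

# Read a table through the built-in JDBC data source.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/testdb")  # placeholder URL
    .option("dbtable", "some_table")                      # placeholder table
    .option("user", "user")
    .option("password", "password")
    .load()
)

# If RDD-style processing is required afterwards, an RDD of Rows is still available.
rows = df.rdd
{code}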



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0

2022-09-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40499:
-
Priority: Major  (was: Blocker)

> Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop: 3.0.0 
> spark:  2.4.0 / 3.2.1
> shuffle:spark 2.4.0
>Reporter: xuanzhiang
>Priority: Major
> Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png
>
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used Spark 2.4.0, aggregation adopted objHashAggregator and stage 2 
> pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old 
> shuffle, 140 MB of shuffle data takes 3 hours.
>  * If we upgrade the shuffle, will we see a performance regression?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40509) Construct an example of applyInPandasWithState in examples directory

2022-09-20 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-40509:


 Summary: Construct an example of applyInPandasWithState in 
examples directory
 Key: SPARK-40509
 URL: https://issues.apache.org/jira/browse/SPARK-40509
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Jungtaek Lim


Since we introduced a new API (applyInPandasWithState) in PySpark, it is worth 
having a separate, full example of the API.
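A rough sketch of what such an example could look like, assuming the PySpark API shape 
targeted for Spark 3.4 (a user function taking the group key, an iterator of pandas 
DataFrames, and a GroupState handle); the socket source, the count_words name, and the 
exact GroupState accessors (exists/get/update) are assumptions here, not the final example:

{code:python}
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

spark = SparkSession.builder.appName("applyInPandasWithState-sketch").getOrCreate()

# Hypothetical streaming source: one word per row in column "value".
words = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

def count_words(key, pdf_iter, state: GroupState):
    # Keep a running count per word in the state (stored as a one-field tuple).
    count = state.get[0] if state.exists else 0
    for pdf in pdf_iter:
        count += len(pdf)
    state.update((count,))
    yield pd.DataFrame({"word": [key[0]], "count": [count]})

counts = words.groupBy("value").applyInPandasWithState(
    count_words,
    outputStructType="word STRING, count LONG",
    stateStructType="count LONG",
    outputMode="Update",
    timeoutConf=GroupStateTimeout.NoTimeout,
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
{code}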



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40491) Remove too old TODO for JdbcRDD

2022-09-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40491:


Assignee: jiaan.geng

> Remove too old TODO for JdbcRDD
> ---
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Trivial
>
> According to the legacy documentation of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.
> In fact, this is a very old TODO and we need to revisit whether it is still 
> necessary. Since Spark SQL is the new core, I'm not sure anyone is interested 
> in a new API to create a JDBC RDD.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40332) Implement `GroupBy.quantile`.

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607489#comment-17607489
 ] 

Apache Spark commented on SPARK-40332:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37954

> Implement `GroupBy.quantile`.
> -
>
> Key: SPARK-40332
> URL: https://issues.apache.org/jira/browse/SPARK-40332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> We should implement `GroupBy.quantile` to increase pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html
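For context, a small pandas snippet showing the behavior to be mirrored (made-up data); 
the commented pandas-on-Spark call is what should work once this ticket lands:

{code:python}
import pandas as pd

pdf = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1.0, 2.0, 3.0, 4.0]})

# pandas behavior to match: per-group 50th percentile.
print(pdf.groupby("key")["val"].quantile(0.5))
# key
# a    1.5
# b    3.5

# Expected pandas-on-Spark equivalent once implemented (assumed):
# import pyspark.pandas as ps
# ps.from_pandas(pdf).groupby("key")["val"].quantile(0.5)
{code}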



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40501) Enhance 'SpecialLimits' to support project(..., limit(...))

2022-09-20 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-40501:

Summary: Enhance 'SpecialLimits' to support project(..., limit(...))  (was: 
Add PushProjectionThroughLimit for Optimizer)

> Enhance 'SpecialLimits' to support project(..., limit(...))
> ---
>
> Key: SPARK-40501
> URL: https://issues.apache.org/jira/browse/SPARK-40501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>
> h4. It took a long time to fetch the result and was still running after 20 minutes...
> when running the following code in spark-shell:
> spark.sql("select * from xxx where event_day = '20220919' limit 1").show()
> [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png]
> [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40510) Implement `ddof` in `Series.cov`

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607483#comment-17607483
 ] 

Apache Spark commented on SPARK-40510:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37953

> Implement `ddof` in `Series.cov`
> 
>
> Key: SPARK-40510
> URL: https://issues.apache.org/jira/browse/SPARK-40510
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
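The ticket has no description; for context, this is the pandas parameter that 
pandas-on-Spark would mirror (made-up data):

{code:python}
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, 4.0])
s2 = pd.Series([2.0, 4.0, 5.0, 9.0])

# pandas Series.cov exposes a delta-degrees-of-freedom parameter.
print(s1.cov(s2))          # default ddof=1, sample covariance
print(s1.cov(s2, ddof=0))  # population covariance
{code}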




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40510) Implement `ddof` in `Series.cov`

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40510:


Assignee: (was: Apache Spark)

> Implement `ddof` in `Series.cov`
> 
>
> Key: SPARK-40510
> URL: https://issues.apache.org/jira/browse/SPARK-40510
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40510) Implement `ddof` in `Series.cov`

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40510:


Assignee: Apache Spark

> Implement `ddof` in `Series.cov`
> 
>
> Key: SPARK-40510
> URL: https://issues.apache.org/jira/browse/SPARK-40510
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40510) Implement `ddof` in `Series.cov`

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607481#comment-17607481
 ] 

Apache Spark commented on SPARK-40510:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37953

> Implement `ddof` in `Series.cov`
> 
>
> Key: SPARK-40510
> URL: https://issues.apache.org/jira/browse/SPARK-40510
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40510) Implement `ddof` in `Series.cov`

2022-09-20 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40510:
-

 Summary: Implement `ddof` in `Series.cov`
 Key: SPARK-40510
 URL: https://issues.apache.org/jira/browse/SPARK-40510
 Project: Spark
  Issue Type: Sub-task
  Components: ps
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`

2022-09-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40500:


Assignee: Ruifeng Zheng

> Use `pd.items` instead of `pd.iteritems`
> 
>
> Key: SPARK-40500
> URL: https://issues.apache.org/jira/browse/SPARK-40500
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>
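For context, the pandas rename this ticket tracks: iteritems is deprecated in newer 
pandas in favor of items, which behaves the same way. A tiny illustration:

{code:python}
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Deprecated spelling (removed in newer pandas releases):
# for name, col in df.iteritems(): ...

# Preferred spelling, same (name, Series) pairs:
for name, col in df.items():
    print(name, col.tolist())
{code}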




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`

2022-09-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40500.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37947
[https://github.com/apache/spark/pull/37947]

> Use `pd.items` instead of `pd.iteritems`
> 
>
> Key: SPARK-40500
> URL: https://issues.apache.org/jira/browse/SPARK-40500
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40491) Remove too old TODO for JdbcRDD

2022-09-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40491.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37937
[https://github.com/apache/spark/pull/37937]

> Remove too old TODO for JdbcRDD
> ---
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Trivial
> Fix For: 3.4.0
>
>
> According to the legacy documentation of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.
> In fact, this is a very old TODO and we need to revisit whether it is still 
> necessary. Since Spark SQL is the new core, I'm not sure anyone is interested 
> in a new API to create a JDBC RDD.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40332) Implement `GroupBy.quantile`.

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607488#comment-17607488
 ] 

Apache Spark commented on SPARK-40332:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37954

> Implement `GroupBy.quantile`.
> -
>
> Key: SPARK-40332
> URL: https://issues.apache.org/jira/browse/SPARK-40332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> We should implement `GroupBy.quantile` to increase pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40506) Spark Streaming metrics names don't need the application name

2022-09-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

王俊博 updated SPARK-40506:

Description: 
The Spark StreamingSource metrics sourceName is inappropriate. The label currently 
looks like 
`application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime` 
instead of 
`application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`; 
the Spark app name is not needed.

This makes it hard to use metrics for different Spark applications over time, 
and it makes the metrics sourceName convention inconsistent.
{code:java}
// code placeholder

private[streaming] class StreamingSource(ssc: StreamingContext) extends Source {
  override val metricRegistry = new MetricRegistry
  override val sourceName = 
"%s.StreamingMetrics".format(ssc.sparkContext.appName)

}{code}
For example, other metrics sources' sourceName values do not include the appName.
{code:java}
// code placeholder
private[spark] class LiveListenerBusMetrics(conf: SparkConf)
  extends Source with Logging {

  override val sourceName: String = "LiveListenerBus"
  override val metricRegistry: MetricRegistry = new MetricRegistry
...
}

{code}

  was:
Spark  StreamingSource  Metrics sourceName is inappropriate.The label now looks 
like 
`application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime
 `, instead of 
`application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`,
 the Spark app name is not need.

This makes it hard to use metrics for different Spark applications over time. 
And this makes the metrics sourceName standard inconsistent
{code:java}
//代码占位符

private[streaming] class StreamingSource(ssc: StreamingContext) extends Source {
  override val metricRegistry = new MetricRegistry
  override val sourceName = 
"%s.StreamingMetrics".format(ssc.sparkContext.appName)

}{code}
And for example, other metrics sourceName don't have appName.
{code:java}
//代码占位符
private[spark] class LiveListenerBusMetrics(conf: SparkConf)
  extends Source with Logging {

  override val sourceName: String = "LiveListenerBus"
  override val metricRegistry: MetricRegistry = new MetricRegistry
...
}

{code}


> Spark Streaming metrics names don't need the application name
> 
>
> Key: SPARK-40506
> URL: https://issues.apache.org/jira/browse/SPARK-40506
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.2.2
>Reporter: 王俊博
>Priority: Trivial
>
> The Spark StreamingSource metrics sourceName is inappropriate. The label 
> currently looks like 
> `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime` 
> instead of 
> `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`; 
> the Spark app name is not needed.
> This makes it hard to use metrics for different Spark applications over time, 
> and it makes the metrics sourceName convention inconsistent.
> {code:java}
> // code placeholder
> private[streaming] class StreamingSource(ssc: StreamingContext) extends 
> Source {
>   override val metricRegistry = new MetricRegistry
>   override val sourceName = 
> "%s.StreamingMetrics".format(ssc.sparkContext.appName)
> 
> }{code}
> For example, other metrics sources' sourceName values do not include the appName.
> {code:java}
> // code placeholder
> private[spark] class LiveListenerBusMetrics(conf: SparkConf)
>   extends Source with Logging {
>   override val sourceName: String = "LiveListenerBus"
>   override val metricRegistry: MetricRegistry = new MetricRegistry
> ...
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40511) Upgrade slf4j to 2.x

2022-09-20 Thread Yang Jie (Jira)
Yang Jie created SPARK-40511:


 Summary: Upgrade slf4j to 2.x
 Key: SPARK-40511
 URL: https://issues.apache.org/jira/browse/SPARK-40511
 Project: Spark
  Issue Type: Improvement
  Components: Build, Spark Core
Affects Versions: 3.4.0
Reporter: Yang Jie


https://www.slf4j.org/news.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped

2022-09-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40496.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37942
[https://github.com/apache/spark/pull/37942]

> Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
> --
>
> Key: SPARK-40496
> URL: https://issues.apache.org/jira/browse/SPARK-40496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped

2022-09-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-40496:
---

Assignee: Ivan Sadikov

> Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
> --
>
> Key: SPARK-40496
> URL: https://issues.apache.org/jira/browse/SPARK-40496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40511) Upgrade slf4j to 2.x

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40511:


Assignee: Apache Spark

> Upgrade slf4j to 2.x
> 
>
> Key: SPARK-40511
> URL: https://issues.apache.org/jira/browse/SPARK-40511
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> https://www.slf4j.org/news.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40511) Upgrade slf4j to 2.x

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607522#comment-17607522
 ] 

Apache Spark commented on SPARK-40511:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37844

> Upgrade slf4j to 2.x
> 
>
> Key: SPARK-40511
> URL: https://issues.apache.org/jira/browse/SPARK-40511
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://www.slf4j.org/news.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40511) Upgrade slf4j to 2.x

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40511:


Assignee: (was: Apache Spark)

> Upgrade slf4j to 2.x
> 
>
> Key: SPARK-40511
> URL: https://issues.apache.org/jira/browse/SPARK-40511
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://www.slf4j.org/news.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40502) Support dataframe API use jdbc data source in PySpark

2022-09-20 Thread CaoYu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607523#comment-17607523
 ] 

CaoYu commented on SPARK-40502:
---

I am a teacher.
I recently designed a basic Python language course with a big data focus.

PySpark is one of the practical cases, but it only uses simple RDD code to 
complete basic data processing work, and using a JDBC data source is part of 
the course.

DataFrames (Spark SQL) will be used when I design the advanced courses later.
So I hope the RDD (datastream) API can gain JDBC data source capability.

 

 

> Support dataframe API use jdbc data source in PySpark
> -
>
> Key: SPARK-40502
> URL: https://issues.apache.org/jira/browse/SPARK-40502
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: CaoYu
>Priority: Major
>
> When I use PySpark, I want to get data from a MySQL database, so I want to 
> use JdbcRDD as in Java/Scala.
> But that is not supported in PySpark.
>  
> For some reasons, I can't use the DataFrame API and can only use the RDD 
> (datastream) API, even though I know a DataFrame can get data from a JDBC 
> source fairly well.
>  
> So I want to implement functionality for PySpark that can use an RDD to get 
> data from a JDBC source.
>  
> *But I don't know whether that is necessary for PySpark, so we can discuss it.*
>  
> *If it is necessary for PySpark, I want to contribute to Spark. I hope this 
> Jira task can be assigned to me, so I can start working on implementing it.*
>  
> *If not, please close this Jira task.*
>  
>  
> *Thanks a lot.*
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40497) Upgrade Scala to 2.13.9

2022-09-20 Thread Yang Jie (Jira)
Yang Jie created SPARK-40497:


 Summary: Upgrade Scala to 2.13.9
 Key: SPARK-40497
 URL: https://issues.apache.org/jira/browse/SPARK-40497
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: Yang Jie


2.13.9 released [https://github.com/scala/scala/releases/tag/v2.13.9]

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Attachment: spark2.4-shuffle-data.png

> Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop 3.0.0 
> spark2.4.0 / spark3.2.1
> shuffle: spark2.4.0
>Reporter: xuanzhiang
>Priority: Minor
> Attachments: spark2.4-shuffle-data.png
>
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used Spark 2.4.0, aggregation adopted objHashAggregator and stage 2 
> pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old 
> shuffle, 140 MB of shuffle data takes 3 hours.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Description: 
spark.sql(
      s"""
         |SELECT
         | Info ,
         | PERCENTILE_APPROX(cost,0.5) cost_p50,
         | PERCENTILE_APPROX(cost,0.9) cost_p90,
         | PERCENTILE_APPROX(cost,0.95) cost_p95,
         | PERCENTILE_APPROX(cost,0.99) cost_p99,
         | PERCENTILE_APPROX(cost,0.999) cost_p999
         |FROM
         | textData
         |""".stripMargin)
 * When we used Spark 2.4.0, aggregation adopted objHashAggregator and stage 2 
pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old 
shuffle, 140 MB of shuffle data takes 3 hours.

 * If we upgrade the shuffle, will we see a performance regression?

  was:
spark.sql(
      s"""
         |SELECT
         | Info ,
         | PERCENTILE_APPROX(cost,0.5) cost_p50,
         | PERCENTILE_APPROX(cost,0.9) cost_p90,
         | PERCENTILE_APPROX(cost,0.95) cost_p95,
         | PERCENTILE_APPROX(cost,0.99) cost_p99,
         | PERCENTILE_APPROX(cost,0.999) cost_p999
         |FROM
         | textData
         |""".stripMargin)
 * When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2 
pull shuffle data very quick . but , when we use spark 3.2.1 and use old 
shuffle , 140M shuffle data cost 3 hours. 

 *  


> Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop 3.0.0 
> spark2.4.0 / spark3.2.1
> shuffle: spark2.4.0
>Reporter: xuanzhiang
>Priority: Minor
> Attachments: spark2.4-shuffle-data.png
>
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2 
> pull shuffle data very quick . but , when we use spark 3.2.1 and use old 
> shuffle , 140M shuffle data cost 3 hours. 
>  * If we upgrade the Shuffle, will we get performance regression?
>  *  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40419) Integrate Grouped Aggregate Pandas UDFs into *.sql test cases

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607022#comment-17607022
 ] 

Apache Spark commented on SPARK-40419:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/37946

> Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
> -
>
> Key: SPARK-40419
> URL: https://issues.apache.org/jira/browse/SPARK-40419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> We ported Python UDFs, Scala UDFs and Scalar Pandas UDFs into SQL test cases 
> from SPARK-27921, but Grouped Aggregate Pandas UDFs are not tested from SQL at 
> all.
> We should leverage this to test pandas aggregate UDFs too.
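For reference, a minimal sketch of the kind of grouped aggregate pandas UDF such *.sql 
test cases would register and call (the data, names, and the temp view are illustrative 
only):

{code:python}
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("grouped-agg-udf-sketch").getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ("id", "v"))

@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    # Grouped aggregate pandas UDF: a pandas Series in, a scalar out.
    return v.mean()

# DataFrame API usage.
df.groupBy("id").agg(mean_udf(df["v"]).alias("mean_v")).show()

# Registered and called from SQL, which is what the *.sql test cases exercise.
spark.udf.register("mean_udf", mean_udf)
df.createOrReplaceTempView("t")
spark.sql("SELECT id, mean_udf(v) AS mean_v FROM t GROUP BY id").show()
{code}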



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606963#comment-17606963
 ] 

Apache Spark commented on SPARK-40496:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/37942

> Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
> --
>
> Key: SPARK-40496
> URL: https://issues.apache.org/jira/browse/SPARK-40496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40497) Upgrade Scala to 2.13.9

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40497:


Assignee: (was: Apache Spark)

> Upgrade Scala to 2.13.9
> ---
>
> Key: SPARK-40497
> URL: https://issues.apache.org/jira/browse/SPARK-40497
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> 2.13.9 released [https://github.com/scala/scala/releases/tag/v2.13.9]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40497) Upgrade Scala to 2.13.9

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40497:


Assignee: Apache Spark

> Upgrade Scala to 2.13.9
> ---
>
> Key: SPARK-40497
> URL: https://issues.apache.org/jira/browse/SPARK-40497
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> 2.13.9 released [https://github.com/scala/scala/releases/tag/v2.13.9]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40497) Upgrade Scala to 2.13.9

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606966#comment-17606966
 ] 

Apache Spark commented on SPARK-40497:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37943

> Upgrade Scala to 2.13.9
> ---
>
> Key: SPARK-40497
> URL: https://issues.apache.org/jira/browse/SPARK-40497
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> 2.13.9 released [https://github.com/scala/scala/releases/tag/v2.13.9]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40497) Upgrade Scala to 2.13.9

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606968#comment-17606968
 ] 

Apache Spark commented on SPARK-40497:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37943

> Upgrade Scala to 2.13.9
> ---
>
> Key: SPARK-40497
> URL: https://issues.apache.org/jira/browse/SPARK-40497
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> 2.13.9 released [https://github.com/scala/scala/releases/tag/v2.13.9]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40498) Implement `kendall` and `min_periods` in `Series.corr`

2022-09-20 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40498:
-

 Summary: Implement `kendall` and `min_periods` in `Series.corr`
 Key: SPARK-40498
 URL: https://issues.apache.org/jira/browse/SPARK-40498
 Project: Spark
  Issue Type: Sub-task
  Components: ps
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40419) Integrate Grouped Aggregate Pandas UDFs into *.sql test cases

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607021#comment-17607021
 ] 

Apache Spark commented on SPARK-40419:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/37946

> Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
> -
>
> Key: SPARK-40419
> URL: https://issues.apache.org/jira/browse/SPARK-40419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> We ported Python UDFs, Scala UDFs and Scalar Pandas UDFs into SQL test cases 
> from SPARK-27921, but Grouped Aggregate Pandas UDFs are not tested from SQL at 
> all.
> We should leverage this to test pandas aggregate UDFs too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40495) Add additional tests to StreamingSessionWindowSuite

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40495:


Assignee: Apache Spark

> Add additional tests to StreamingSessionWindowSuite
> ---
>
> Key: SPARK-40495
> URL: https://issues.apache.org/jira/browse/SPARK-40495
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Wei Liu
>Assignee: Apache Spark
>Priority: Minor
>
> Add complex tests to {{{}StreamingSessionWindowSuite{}}}. Concretely, I 
> created two helper functions,
>  * one is called {{{}sessionWindowQueryNestedKey{}}}, which would convert 
> {{sessionId}} from the single word key used in {{sessionWindowQuery}} to a 
> nested column key. For example: {{{}"hello" -> (("hello", "hello"), 
> "hello"){}}}.
>  * The other is called {{{}sessionWindowQueryMultiColKey{}}}. It would 
> convert {{sessionId}} from the single word key used in {{sessionWindowQuery}} 
> to two columns. For example: "hello" -> col1: ("hello", "hello"), col2: 
> "hello"
> With the two new helper functions, I added more tests for {{complete mode}} 
> and {{cap gap duration}} ({{{}append{}}} and {{async state}} were not 
> included, because the first two are enough for testing the change I added). 
> The logic of the tests is not changed at all, just the key.
> For the aggregation test ({{{}session window - with more aggregation 
> functions{}}}), I added some more functions as well as a UDAF to test, and I 
> tried the {{first()}} and {{last()}} functions on a nested triple, created 
> using the same method as above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40495) Add additional tests to StreamingSessionWindowSuite

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40495:


Assignee: (was: Apache Spark)

> Add additional tests to StreamingSessionWindowSuite
> ---
>
> Key: SPARK-40495
> URL: https://issues.apache.org/jira/browse/SPARK-40495
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Wei Liu
>Priority: Minor
>
> Add complex tests to {{{}StreamingSessionWindowSuite{}}}. Concretely, I 
> created two helper functions,
>  * one is called {{{}sessionWindowQueryNestedKey{}}}, which would convert 
> {{sessionId}} from the single word key used in {{sessionWindowQuery}} to a 
> nested column key. For example: {{{}"hello" -> (("hello", "hello"), 
> "hello"){}}}.
>  * The other is called {{{}sessionWindowQueryMultiColKey{}}}. It would 
> convert {{sessionId}} from the single word key used in {{sessionWindowQuery}} 
> to two columns. For example: "hello" -> col1: ("hello", "hello"), col2: 
> "hello"
> With the two new helper functions, I added more tests for {{complete mode}} 
> and {{cap gap duration}} ({{{}append{}}} and {{async state}} were not 
> included, because the first two are enough for testing the change I added). 
> The logic of the tests is not changed at all, just the key.
> For the aggregation test ({{{}session window - with more aggregation 
> functions{}}}), I added some more functions as well as a UDAF to test, and I 
> tried the {{first()}} and {{last()}} functions on a nested triple, created 
> using the same method as above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40495) Add additional tests to StreamingSessionWindowSuite

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606940#comment-17606940
 ] 

Apache Spark commented on SPARK-40495:
--

User 'WweiL' has created a pull request for this issue:
https://github.com/apache/spark/pull/37936

> Add additional tests to StreamingSessionWindowSuite
> ---
>
> Key: SPARK-40495
> URL: https://issues.apache.org/jira/browse/SPARK-40495
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Wei Liu
>Priority: Minor
>
> Add complex tests to {{{}StreamingSessionWindowSuite{}}}. Concretely, I 
> created two helper functions,
>  * one is called {{{}sessionWindowQueryNestedKey{}}}, which would convert 
> {{sessionId}} from the single word key used in {{sessionWindowQuery}} to a 
> nested column key. For example: {{{}"hello" -> (("hello", "hello"), 
> "hello"){}}}.
>  * The other is called {{{}sessionWindowQueryMultiColKey{}}}. It would 
> convert {{sessionId}} from the single word key used in {{sessionWindowQuery}} 
> to two columns. For example: "hello" -> col1: ("hello", "hello"), col2: 
> "hello"
> With the two new helper functions, I added more tests for {{complete mode}} 
> and {{cap gap duration}} ({{{}append{}}} and {{async state}} were not 
> included, because the first two are enough for testing the change I added). 
> The logic of the tests is not changed at all, just the key.
> For the aggregation test ({{{}session window - with more aggregation 
> functions{}}}), I added some more functions as well as a UDAF to test, and I 
> tried the {{first()}} and {{last()}} functions on a nested triple, created 
> using the same method as above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40486) Implement `spearman` and `kendall` in `DataFrame.corrwith`

2022-09-20 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-40486.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37929
[https://github.com/apache/spark/pull/37929]

> Implement `spearman` and `kendall` in `DataFrame.corrwith`
> --
>
> Key: SPARK-40486
> URL: https://issues.apache.org/jira/browse/SPARK-40486
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40486) Implement `spearman` and `kendall` in `DataFrame.corrwith`

2022-09-20 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40486:
-

Assignee: Ruifeng Zheng

> Implement `spearman` and `kendall` in `DataFrame.corrwith`
> --
>
> Key: SPARK-40486
> URL: https://issues.apache.org/jira/browse/SPARK-40486
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped

2022-09-20 Thread Ivan Sadikov (Jira)
Ivan Sadikov created SPARK-40496:


 Summary: Configs to control "enableDateTimeParsingFallback" are 
incorrectly swapped
 Key: SPARK-40496
 URL: https://issues.apache.org/jira/browse/SPARK-40496
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Ivan Sadikov






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40496:


Assignee: (was: Apache Spark)

> Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
> --
>
> Key: SPARK-40496
> URL: https://issues.apache.org/jira/browse/SPARK-40496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40496:


Assignee: Apache Spark

> Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
> --
>
> Key: SPARK-40496
> URL: https://issues.apache.org/jira/browse/SPARK-40496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606962#comment-17606962
 ] 

Apache Spark commented on SPARK-40496:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/37942

> Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
> --
>
> Key: SPARK-40496
> URL: https://issues.apache.org/jira/browse/SPARK-40496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40489) Spark 3.3.0 breaks with SLF4J 2.

2022-09-20 Thread Piotr Karwasz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606973#comment-17606973
 ] 

Piotr Karwasz commented on SPARK-40489:
---

[~kabhwan],

The code snippet I posted above (using {{LoggerFactory.getLoggerFactory()}}) 
should work on both SLF4J 1.x and SLF4J 2.x.

You can compile Spark against SLF4J 1.x and it should work with both versions 
of the API. Note that if the runtime uses SLF4J 2.x, you need to use 
{{log4j-slf4j2-impl}} as the binding.

> Spark 3.3.0 breaks with SLF4J 2.
> 
>
> Key: SPARK-40489
> URL: https://issues.apache.org/jira/browse/SPARK-40489
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Garret Wilson
>Priority: Major
>
> Spark breaks fundamentally with SLF4J 2.x because it uses 
> {{StaticLoggerBinder}}.
> SLF4J is the logging facade that is meant to shield the application from the 
> implementation, whether it be Log4J or Logback or whatever. Historically 
> SLF4J 1.x used a bad approach to configuration: it used a 
> {{StaticLoggerBinder}} (a global static singleton instance) rather than the 
> Java {{ServiceLoader}} mechanism.
> SLF4J 2.x, which has been in development for years, has been released. It 
> finally switches to use the {{ServiceLoader}} mechanism. As [described in the 
> FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be 
> compatible; an application just needs to use the latest Log4J/Logback 
> implementation which has the service loader.
> *Above all the application must _not_ use the low-level 
> {{StaticLoggerBinder}} method, because it has been removed!*
> Unfortunately 
> [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala]
>  uses {{StaticLoggerBinder}} and completely breaks any environment using 
> SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API 
> and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark 
> breaks completely just trying to get a Spark session:
> {noformat}
> Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
> at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232)
> at 
> org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106)
> at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
> at 
> org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.log(Logging.scala:53)
> at org.apache.spark.internal.Logging.log$(Logging.scala:51)
> at org.apache.spark.SparkContext.log(SparkContext.scala:84)
> at org.apache.spark.internal.Logging.logInfo(Logging.scala:61)
> at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60)
> at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:195)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
> at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
> at scala.Option.getOrElse(Option.scala:201)
> at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
> {noformat}
> This is because Spark is playing low-level tricks to find out if the logging 
> platform is Log4J, and relying on {{StaticLoggerBinder}} to do it.
> {code}
>   private def isLog4j2(): Boolean = {
> // This distinguishes the log4j 1.2 binding, currently
> // org.slf4j.impl.Log4jLoggerFactory, from the log4j 2.0 binding, 
> currently
> // org.apache.logging.slf4j.Log4jLoggerFactory
> val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
> "org.apache.logging.slf4j.Log4jLoggerFactory".equals(binderClass)
>   }
> {code}
> Whatever the wisdom of Spark's relying on Log4J-specific functionality, Spark 
> should not be using {{StaticLoggerBinder}} to do that detection. There are 
> many other approaches. (The code itself suggest one approach: 
> {{LogManager.getRootLogger.asInstanceOf[Log4jLogger]}}. You could check to 
> see if the root logger actually is a {{Log4jLogger}}. There may be even 
> better approaches.)
> The other big problem is relying on the Log4J 

[jira] [Updated] (SPARK-40491) Remove too old TODO for JdbcRDD

2022-09-20 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-40491:
---
Summary: Remove too old TODO for JdbcRDD  (was: Remove too old todo 
comments for JdbcRDD)

> Remove too old TODO for JdbcRDD
> ---
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> According to the legacy documentation of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.
> In fact, this is a very old TODO and we need to revisit whether it is still 
> necessary. Since Spark SQL is the new core, I'm not sure anyone is interested 
> in a new API to create a JDBC RDD.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40498) Implement `kendall` and `min_periods` in `Series.corr`

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607006#comment-17607006
 ] 

Apache Spark commented on SPARK-40498:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37945

> Implement `kendall` and `min_periods` in `Series.corr`
> --
>
> Key: SPARK-40498
> URL: https://issues.apache.org/jira/browse/SPARK-40498
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
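No description on the ticket; for context, the corresponding pandas parameters that 
pandas-on-Spark would mirror (made-up data):

{code:python}
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
s2 = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

print(s1.corr(s2, method="kendall"))  # Kendall rank correlation
print(s1.corr(s2, min_periods=10))    # NaN: fewer than 10 overlapping observations
{code}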




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40498) Implement `kendall` and `min_periods` in `Series.corr`

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40498:


Assignee: Apache Spark

> Implement `kendall` and `min_periods` in `Series.corr`
> --
>
> Key: SPARK-40498
> URL: https://issues.apache.org/jira/browse/SPARK-40498
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40498) Implement `kendall` and `min_periods` in `Series.corr`

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607004#comment-17607004
 ] 

Apache Spark commented on SPARK-40498:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37945

> Implement `kendall` and `min_periods` in `Series.corr`
> --
>
> Key: SPARK-40498
> URL: https://issues.apache.org/jira/browse/SPARK-40498
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40498) Implement `kendall` and `min_periods` in `Series.corr`

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40498:


Assignee: (was: Apache Spark)

> Implement `kendall` and `min_periods` in `Series.corr`
> --
>
> Key: SPARK-40498
> URL: https://issues.apache.org/jira/browse/SPARK-40498
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Attachment: spark3.2-shuffle-data.png

> Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop 3.0.0 
> spark2.4.0 / spark3.2.1
> shuffle: spark2.4.0
>Reporter: xuanzhiang
>Priority: Minor
> Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png
>
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used Spark 2.4.0, aggregation adopted ObjectHashAggregate and stage 2 
> pulled the shuffle data very quickly. But when we use Spark 3.2.1 with the old 
> shuffle service, fetching 140M of shuffle data takes 3 hours.
>  * If we upgrade the shuffle service, will we get a performance regression?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Environment: 
hadoop: 3.0.0 

spark:  2.4.0 / 3.2.1

shuffle:spark 2.4.0

  was:
hadoop 3.0.0 

spark2.4.0 / spark3.2.1

shuffle: spark2.4.0


> Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop: 3.0.0 
> spark:  2.4.0 / 3.2.1
> shuffle:spark 2.4.0
>Reporter: xuanzhiang
>Priority: Minor
> Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png
>
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used Spark 2.4.0, aggregation adopted ObjectHashAggregate and stage 2 
> pulled the shuffle data very quickly. But when we use Spark 3.2.1 with the old 
> shuffle service, fetching 140M of shuffle data takes 3 hours.
>  * If we upgrade the shuffle service, will we get a performance regression?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40495) Add additional tests to StreamingSessionWindowSuite

2022-09-20 Thread Wei Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606937#comment-17606937
 ] 

Wei Liu commented on SPARK-40495:
-

Hi there, this is my first time contributing to OSS Spark. I'm not entirely sure 
what to put for `Affects Version/s`, so please correct me if I'm wrong! Thanks in 
advance!

> Add additional tests to StreamingSessionWindowSuite
> ---
>
> Key: SPARK-40495
> URL: https://issues.apache.org/jira/browse/SPARK-40495
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Wei Liu
>Priority: Minor
>
> Add complex tests to {{{}StreamingSessionWindowSuite{}}}. Concretely, I 
> created two helper functions,
>  * one is called {{{}sessionWindowQueryNestedKey{}}}, which would convert 
> {{sessionId}} from the single word key used in {{sessionWindowQuery}} to a 
> nested column key. For example: {{{}"hello" -> (("hello", "hello"), 
> "hello"){}}}.
>  * The other is called {{{}sessionWindowQueryMultiColKey{}}}. It would 
> convert {{sessionId}} from the single word key used in {{sessionWindowQuery}} 
> to two columns. For example: "hello" -> col1: ("hello", "hello"), col2: 
> "hello"
> With the two new helper functions, I added more tests for {{complete mode}} and 
> {{cap gap duration}} ({{{}append{}}} and {{async state}} were not included, 
> because the first two are enough for testing the change I added). The logic of 
> the tests is not changed at all, just the key.
> For the aggregation test ({{{}session window - with more aggregation 
> functions{}}}), I added some more functions as well as a UDAF, and I tried the 
> {{first()}} and {{last()}} functions on a nested triple, created using the same 
> method as above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40495) Add additional tests to StreamingSessionWindowSuite

2022-09-20 Thread Wei Liu (Jira)
Wei Liu created SPARK-40495:
---

 Summary: Add additional tests to StreamingSessionWindowSuite
 Key: SPARK-40495
 URL: https://issues.apache.org/jira/browse/SPARK-40495
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 3.3.0
Reporter: Wei Liu


Add complex tests to {{{}StreamingSessionWindowSuite{}}}. Concretely, I created 
two helper functions,
 * one is called {{{}sessionWindowQueryNestedKey{}}}, which would convert 
{{sessionId}} from the single word key used in {{sessionWindowQuery}} to a 
nested column key. For example: {{{}"hello" -> (("hello", "hello"), 
"hello"){}}}.
 * The other is called {{{}sessionWindowQueryMultiColKey{}}}. It would convert 
{{sessionId}} from the single word key used in {{sessionWindowQuery}} to two 
columns. For example: "hello" -> col1: ("hello", "hello"), col2: "hello"

With the two new helper functions, I added more tests for {{complete mode}} and 
{{cap gap duration}} ({{{}append{}}} and {{async state}} were not included, because 
the first two are enough for testing the change I added). The logic of the tests is 
not changed at all, just the key.

For the aggregation test ({{{}session window - with more aggregation functions{}}}), 
I added some more functions as well as a UDAF, and I tried the {{first()}} and 
{{last()}} functions on a nested triple, created using the same method as above. A 
rough sketch of the nested-key variant follows below.
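
For illustration, a rough Scala sketch of the nested-key variant described above 
(the column names {{sessionId}}/{{eventTime}}, the 10-second gap, and the count 
aggregation are assumptions, not the actual suite code):
{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, session_window, struct}

// Hypothetical sketch: the same session-window aggregation, but grouped by a nested
// key built from sessionId, i.e. "hello" -> (("hello", "hello"), "hello").
def sessionWindowQueryNestedKey(input: DataFrame): DataFrame = {
  input
    .withColumn("nestedKey",
      struct(struct(col("sessionId"), col("sessionId")), col("sessionId")))
    .groupBy(session_window(col("eventTime"), "10 seconds"), col("nestedKey"))
    .count()
}
{code}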



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40491) Remove too old todo comments for JdbcRDD

2022-09-20 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-40491:
---
Summary: Remove too old todo comments for JdbcRDD  (was: Expose a jdbcRDD 
function in SparkContext)

> Remove too old todo comments for JdbcRDD
> 
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> According to the legacy document of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40491) Remove too old todo comments for JdbcRDD

2022-09-20 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-40491:
---
Description: 
According to the legacy document of JdbcRDD, we need to expose a jdbcRDD 
function in SparkContext.

In fact, this is a very old TODO and we need to revisit if this is still 
necessary. Since Spark SQL is the new core, I'm not sure if anyone is 
interested in a new API to create jdbc RDD.

  was:According to the legacy document of JdbcRDD, we need to expose a jdbcRDD 
function in SparkContext.


> Remove too old todo comments for JdbcRDD
> 
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> According to the legacy document of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.
> In fact, this is a very old TODO and we need to revisit if this is still 
> necessary. Since Spark SQL is the new core, I'm not sure if anyone is 
> interested in a new API to create jdbc RDD.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Environment: 
hadoop 3.0.0 

spark2.4.0 / spark3.2.1

shuffle: spark2.4.0

  was:!image-2022-09-20-16-57-01-881.png!


> Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop 3.0.0 
> spark2.4.0 / spark3.2.1
> shuffle: spark2.4.0
>Reporter: xuanzhiang
>Priority: Minor
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used Spark 2.4.0, aggregation adopted ObjectHashAggregate and stage 2 
> pulled the shuffle data very quickly. But when we use Spark 3.2.1 with the old 
> shuffle service, fetching 140M of shuffle data takes 3 hours.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)
xuanzhiang created SPARK-40499:
--

 Summary: Spark 3.2.1 percentile_approx query much slower than 
Spark 2.4.0
 Key: SPARK-40499
 URL: https://issues.apache.org/jira/browse/SPARK-40499
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 3.2.1
 Environment: !image-2022-09-20-16-57-01-881.png!
Reporter: xuanzhiang


spark.sql(
      s"""
         |SELECT
         | Info ,
         | PERCENTILE_APPROX(cost,0.5) cost_p50,
         | PERCENTILE_APPROX(cost,0.9) cost_p90,
         | PERCENTILE_APPROX(cost,0.95) cost_p95,
         | PERCENTILE_APPROX(cost,0.99) cost_p99,
         | PERCENTILE_APPROX(cost,0.999) cost_p999
         |FROM
         | textData
         |""".stripMargin)
 * When we used Spark 2.4.0, aggregation adopted ObjectHashAggregate and stage 2 
pulled the shuffle data very quickly. But when we use Spark 3.2.1 with the old 
shuffle service, fetching 140M of shuffle data takes 3 hours (see the sketch below).
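
To narrow this down, it may help to confirm which aggregate operator the planner 
actually picks in 3.2.1. A minimal sketch in spark-shell (the config name below is 
my assumption of the relevant flag, and the GROUP BY is added only for illustration; 
please verify against your Spark version):
{code:java}
// Check whether ObjectHashAggregate (vs. SortAggregate) is chosen for the percentile query.
// Assumption: spark.sql.execution.useObjectHashAggregateExec is the controlling flag (default true).
spark.conf.set("spark.sql.execution.useObjectHashAggregateExec", "true")

spark.sql(
  """SELECT Info, PERCENTILE_APPROX(cost, 0.5) AS cost_p50
    |FROM textData
    |GROUP BY Info""".stripMargin).explain()
{code}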



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Priority: Blocker  (was: Minor)

> Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop: 3.0.0 
> spark:  2.4.0 / 3.2.1
> shuffle:spark 2.4.0
>Reporter: xuanzhiang
>Priority: Blocker
> Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png
>
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used Spark 2.4.0, aggregation adopted ObjectHashAggregate and stage 2 
> pulled the shuffle data very quickly. But when we use Spark 3.2.1 with the old 
> shuffle service, fetching 140M of shuffle data takes 3 hours.
>  * If we upgrade the shuffle service, will we get a performance regression?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Priority: Major  (was: Blocker)

> Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop: 3.0.0 
> spark:  2.4.0 / 3.2.1
> shuffle:spark 2.4.0
>Reporter: xuanzhiang
>Priority: Major
> Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png
>
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used Spark 2.4.0, aggregation adopted ObjectHashAggregate and stage 2 
> pulled the shuffle data very quickly. But when we use Spark 3.2.1 with the old 
> shuffle service, fetching 140M of shuffle data takes 3 hours.
>  * If we upgrade the shuffle service, will we get a performance regression?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40501) Add PushProjectionThroughLimit for Optimizer

2022-09-20 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-40501:

Summary: Add PushProjectionThroughLimit for Optimizer  (was: add 
PushProjectionThroughLimit for Optimizer)

> Add PushProjectionThroughLimit for Optimizer
> 
>
> Key: SPARK-40501
> URL: https://issues.apache.org/jira/browse/SPARK-40501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>
> h4. It takes a very long time to fetch the result
> when running the following code in spark-shell:
> spark.sql("select * from xxx where event_day = '20220919' limit 1").show()
> [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png]
> [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607149#comment-17607149
 ] 

Apache Spark commented on SPARK-40505:
--

User 'bryanck' has created a pull request for this issue:
https://github.com/apache/spark/pull/37950

> Remove min heap setting in Kubernetes Dockerfile entrypoint
> ---
>
> Key: SPARK-40505
> URL: https://issues.apache.org/jira/browse/SPARK-40505
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Bryan Keller
>Priority: Major
>
> The entrypoint script for the Kubernetes Dockerfile sets the Java min heap 
> setting (-Xms) to be the same as the max setting (-Xmx) for the executor 
> process. This prevents the JVM from shrinking the heap and can lead to 
> excessive memory usage in some scenarios. Removing the min heap setting is 
> consistent with YARN executor startup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40505:


Assignee: Apache Spark

> Remove min heap setting in Kubernetes Dockerfile entrypoint
> ---
>
> Key: SPARK-40505
> URL: https://issues.apache.org/jira/browse/SPARK-40505
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Bryan Keller
>Assignee: Apache Spark
>Priority: Major
>
> The entrypoint script for the Kubernetes Dockerfile sets the Java min heap 
> setting (-Xms) to be the same as the max setting (-Xmx) for the executor 
> process. This prevents the JVM from shrinking the heap and can lead to 
> excessive memory usage in some scenarios. Removing the min heap setting is 
> consistent with YARN executor startup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40505:


Assignee: (was: Apache Spark)

> Remove min heap setting in Kubernetes Dockerfile entrypoint
> ---
>
> Key: SPARK-40505
> URL: https://issues.apache.org/jira/browse/SPARK-40505
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Bryan Keller
>Priority: Major
>
> The entrypoint script for the Kubernetes Dockerfile sets the Java min heap 
> setting (-Xms) to be the same as the max setting (-Xmx) for the executor 
> process. This prevents the JVM from shrinking the heap and can lead to 
> excessive memory usage in some scenarios. Removing the min heap setting is 
> consistent with YARN executor startup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607151#comment-17607151
 ] 

Apache Spark commented on SPARK-40505:
--

User 'bryanck' has created a pull request for this issue:
https://github.com/apache/spark/pull/37950

> Remove min heap setting in Kubernetes Dockerfile entrypoint
> ---
>
> Key: SPARK-40505
> URL: https://issues.apache.org/jira/browse/SPARK-40505
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Bryan Keller
>Priority: Major
>
> The entrypoint script for the Kubernetes Dockerfile sets the Java min heap 
> setting (-Xms) to be the same as the max setting (-Xmx) for the executor 
> process. This prevents the JVM from shrinking the heap and can lead to 
> excessive memory usage in some scenarios. Removing the min heap setting is 
> consistent with YARN executor startup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40506) Spark Streaming Metrics SourceName is unsuitable

2022-09-20 Thread Jira
王俊博 created SPARK-40506:
---

 Summary: Spark Streaming Metrics SourceName is unsuitable
 Key: SPARK-40506
 URL: https://issues.apache.org/jira/browse/SPARK-40506
 Project: Spark
  Issue Type: Improvement
  Components: DStreams
Affects Versions: 3.2.2
Reporter: 王俊博


The Spark StreamingSource metrics sourceName is inappropriate. The label currently 
looks like 
`application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime` 
instead of 
`application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`; 
the Spark app name is not needed.
{code:java}
// code placeholder

private[streaming] class StreamingSource(ssc: StreamingContext) extends Source {
  override val metricRegistry = new MetricRegistry
  override val sourceName = 
"%s.StreamingMetrics".format(ssc.sparkContext.appName)

}{code}
Other metrics sourceNames, for example, do not include the appName:
{code:java}
// code placeholder
private[spark] class LiveListenerBusMetrics(conf: SparkConf)
  extends Source with Logging {

  override val sourceName: String = "LiveListenerBus"
  override val metricRegistry: MetricRegistry = new MetricRegistry
...
}

{code}
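
A rough sketch of what the proposed change could look like (my reading of this 
report, not an actual patch):
{code:java}
import com.codahale.metrics.MetricRegistry
import org.apache.spark.metrics.source.Source
import org.apache.spark.streaming.StreamingContext

// Sketch: drop the app name so the source name follows the same convention as other sources.
private[streaming] class StreamingSource(ssc: StreamingContext) extends Source {
  override val metricRegistry = new MetricRegistry
  override val sourceName = "StreamingMetrics"
}
{code}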



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40506) Spark Streaming metrics name don't need application name

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40506:


Assignee: (was: Apache Spark)

> Spark Streaming metrics name don't need application name
> 
>
> Key: SPARK-40506
> URL: https://issues.apache.org/jira/browse/SPARK-40506
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.2.2
>Reporter: 王俊博
>Priority: Trivial
>
> The Spark StreamingSource metrics sourceName is inappropriate. The label currently 
> looks like 
> `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime` 
> instead of 
> `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`; 
> the Spark app name is not needed.
> This makes it hard to use metrics for different Spark applications over time, and 
> it makes the metrics sourceName convention inconsistent.
> {code:java}
> // code placeholder
> private[streaming] class StreamingSource(ssc: StreamingContext) extends 
> Source {
>   override val metricRegistry = new MetricRegistry
>   override val sourceName = 
> "%s.StreamingMetrics".format(ssc.sparkContext.appName)
> 
> }{code}
> Other metrics sourceNames, for example, do not include the appName:
> {code:java}
> // code placeholder
> private[spark] class LiveListenerBusMetrics(conf: SparkConf)
>   extends Source with Logging {
>   override val sourceName: String = "LiveListenerBus"
>   override val metricRegistry: MetricRegistry = new MetricRegistry
> ...
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40506) Spark Streaming metrics name don't need application name

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607167#comment-17607167
 ] 

Apache Spark commented on SPARK-40506:
--

User 'Kwafoor' has created a pull request for this issue:
https://github.com/apache/spark/pull/37951

> Spark Streaming metrics name don't need application name
> 
>
> Key: SPARK-40506
> URL: https://issues.apache.org/jira/browse/SPARK-40506
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.2.2
>Reporter: 王俊博
>Priority: Trivial
>
> The Spark StreamingSource metrics sourceName is inappropriate. The label currently 
> looks like 
> `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime` 
> instead of 
> `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`; 
> the Spark app name is not needed.
> This makes it hard to use metrics for different Spark applications over time, and 
> it makes the metrics sourceName convention inconsistent.
> {code:java}
> // code placeholder
> private[streaming] class StreamingSource(ssc: StreamingContext) extends 
> Source {
>   override val metricRegistry = new MetricRegistry
>   override val sourceName = 
> "%s.StreamingMetrics".format(ssc.sparkContext.appName)
> 
> }{code}
> Other metrics sourceNames, for example, do not include the appName:
> {code:java}
> // code placeholder
> private[spark] class LiveListenerBusMetrics(conf: SparkConf)
>   extends Source with Logging {
>   override val sourceName: String = "LiveListenerBus"
>   override val metricRegistry: MetricRegistry = new MetricRegistry
> ...
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40506) Spark Streaming metrics name don't need application name

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40506:


Assignee: Apache Spark

> Spark Streaming metrics name don't need application name
> 
>
> Key: SPARK-40506
> URL: https://issues.apache.org/jira/browse/SPARK-40506
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.2.2
>Reporter: 王俊博
>Assignee: Apache Spark
>Priority: Trivial
>
> The Spark StreamingSource metrics sourceName is inappropriate. The label currently 
> looks like 
> `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime` 
> instead of 
> `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`; 
> the Spark app name is not needed.
> This makes it hard to use metrics for different Spark applications over time, and 
> it makes the metrics sourceName convention inconsistent.
> {code:java}
> // code placeholder
> private[streaming] class StreamingSource(ssc: StreamingContext) extends 
> Source {
>   override val metricRegistry = new MetricRegistry
>   override val sourceName = 
> "%s.StreamingMetrics".format(ssc.sparkContext.appName)
> 
> }{code}
> Other metrics sourceNames, for example, do not include the appName:
> {code:java}
> // code placeholder
> private[spark] class LiveListenerBusMetrics(conf: SparkConf)
>   extends Source with Logging {
>   override val sourceName: String = "LiveListenerBus"
>   override val metricRegistry: MetricRegistry = new MetricRegistry
> ...
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40501) Add PushProjectionThroughLimit for Optimizer

2022-09-20 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-40501:

Description: 
h4. It takes a very long time to fetch the result (still running after 20 minutes...)

when running the following code in spark-shell:

spark.sql("select * from xxx where event_day = '20220919' limit 1").show()

[!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png]
[!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png]

  was:
h4. It takes a very long time to fetch the result

when running the following code in spark-shell:

spark.sql("select * from xxx where event_day = '20220919' limit 1").show()


[!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png]
[!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png]
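
For context, a very rough sketch of the general shape such a rule could take in 
Catalyst (this is not the actual change in the linked PR; the class and field names 
are from my recollection of the Catalyst API and should be double-checked):
{code:java}
import org.apache.spark.sql.catalyst.plans.logical.{LocalLimit, LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch: push a deterministic Project below a LocalLimit so column pruning can
// happen before rows are collected for the limit.
object PushProjectionThroughLimit extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    case p @ Project(projectList, limit @ LocalLimit(_, child))
        if projectList.forall(_.deterministic) =>
      limit.copy(child = p.copy(child = child))
  }
}
{code}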


> Add PushProjectionThroughLimit for Optimizer
> 
>
> Key: SPARK-40501
> URL: https://issues.apache.org/jira/browse/SPARK-40501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>
> h4. It takes a very long time to fetch the result (still running after 20 minutes...)
> when running the following code in spark-shell:
> spark.sql("select * from xxx where event_day = '20220919' limit 1").show()
> [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png]
> [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint

2022-09-20 Thread Bryan Keller (Jira)
Bryan Keller created SPARK-40505:


 Summary: Remove min heap setting in Kubernetes Dockerfile 
entrypoint
 Key: SPARK-40505
 URL: https://issues.apache.org/jira/browse/SPARK-40505
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.2.2, 3.3.0
Reporter: Bryan Keller


The entrypoint script for the Kubernetes Dockerfile sets the Java min heap 
setting (-Xms) to be the same as the max setting (-Xmx) for the executor 
process. This prevents the JVM from shrinking the heap and can lead to 
excessive memory usage in some scenarios. Removing the min heap setting is 
consistent with YARN executor startup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40506) Spark Streaming Metrics SourceName is unsuitable

2022-09-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

王俊博 updated SPARK-40506:

Description: 
The Spark StreamingSource metrics sourceName is inappropriate. The label currently 
looks like 
`application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime` 
instead of 
`application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`; 
the Spark app name is not needed.

This makes it hard to use metrics for different Spark applications over time, and it 
makes the metrics sourceName convention inconsistent.
{code:java}
// code placeholder

private[streaming] class StreamingSource(ssc: StreamingContext) extends Source {
  override val metricRegistry = new MetricRegistry
  override val sourceName = 
"%s.StreamingMetrics".format(ssc.sparkContext.appName)

}{code}
Other metrics sourceNames, for example, do not include the appName:
{code:java}
// code placeholder
private[spark] class LiveListenerBusMetrics(conf: SparkConf)
  extends Source with Logging {

  override val sourceName: String = "LiveListenerBus"
  override val metricRegistry: MetricRegistry = new MetricRegistry
...
}

{code}

  was:
The Spark StreamingSource metrics sourceName is inappropriate. The label currently 
looks like 
`application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime` 
instead of 
`application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`; 
the Spark app name is not needed.
{code:java}
// code placeholder

private[streaming] class StreamingSource(ssc: StreamingContext) extends Source {
  override val metricRegistry = new MetricRegistry
  override val sourceName = 
"%s.StreamingMetrics".format(ssc.sparkContext.appName)

}{code}
Other metrics sourceNames, for example, do not include the appName:
{code:java}
// code placeholder
private[spark] class LiveListenerBusMetrics(conf: SparkConf)
  extends Source with Logging {

  override val sourceName: String = "LiveListenerBus"
  override val metricRegistry: MetricRegistry = new MetricRegistry
...
}

{code}


> Spark Streaming Metrics SourceName is unsuitable
> 
>
> Key: SPARK-40506
> URL: https://issues.apache.org/jira/browse/SPARK-40506
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.2.2
>Reporter: 王俊博
>Priority: Trivial
>
> The Spark StreamingSource metrics sourceName is inappropriate. The label currently 
> looks like 
> `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime` 
> instead of 
> `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`; 
> the Spark app name is not needed.
> This makes it hard to use metrics for different Spark applications over time, and 
> it makes the metrics sourceName convention inconsistent.
> {code:java}
> // code placeholder
> private[streaming] class StreamingSource(ssc: StreamingContext) extends 
> Source {
>   override val metricRegistry = new MetricRegistry
>   override val sourceName = 
> "%s.StreamingMetrics".format(ssc.sparkContext.appName)
> 
> }{code}
> Other metrics sourceNames, for example, do not include the appName:
> {code:java}
> // code placeholder
> private[spark] class LiveListenerBusMetrics(conf: SparkConf)
>   extends Source with Logging {
>   override val sourceName: String = "LiveListenerBus"
>   override val metricRegistry: MetricRegistry = new MetricRegistry
> ...
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40506) Spark Streaming metrics name don't need application name

2022-09-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

王俊博 updated SPARK-40506:

Summary: Spark Streaming metrics name don't need application name  (was: 
Spark Streaming Metrics SourceName is unsuitable)

> Spark Streaming metrics name don't need application name
> 
>
> Key: SPARK-40506
> URL: https://issues.apache.org/jira/browse/SPARK-40506
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.2.2
>Reporter: 王俊博
>Priority: Trivial
>
> The Spark StreamingSource metrics sourceName is inappropriate. The label currently 
> looks like 
> `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime` 
> instead of 
> `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`; 
> the Spark app name is not needed.
> This makes it hard to use metrics for different Spark applications over time, and 
> it makes the metrics sourceName convention inconsistent.
> {code:java}
> // code placeholder
> private[streaming] class StreamingSource(ssc: StreamingContext) extends 
> Source {
>   override val metricRegistry = new MetricRegistry
>   override val sourceName = 
> "%s.StreamingMetrics".format(ssc.sparkContext.appName)
> 
> }{code}
> Other metrics sourceNames, for example, do not include the appName:
> {code:java}
> // code placeholder
> private[spark] class LiveListenerBusMetrics(conf: SparkConf)
>   extends Source with Logging {
>   override val sourceName: String = "LiveListenerBus"
>   override val metricRegistry: MetricRegistry = new MetricRegistry
> ...
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607079#comment-17607079
 ] 

Apache Spark commented on SPARK-40327:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37948

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40504) Make yarn appmaster load config from client

2022-09-20 Thread zhengchenyu (Jira)
zhengchenyu created SPARK-40504:
---

 Summary: Make yarn appmaster load config from client
 Key: SPARK-40504
 URL: https://issues.apache.org/jira/browse/SPARK-40504
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 3.0.1
Reporter: zhengchenyu


In YARN federation mode, the configuration on the client side and the NM side may 
be different. The AppMaster should override its configuration with the one from the 
client side.
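
For illustration only (host names are made up): in federation mode the client may 
point {{yarn.resourcemanager.ha.rm-ids}} at the YARN routers while the NM-side 
yarn-site.xml lists a sub-cluster's RMs, and Hadoop settings can be forwarded from 
the client through the {{spark.hadoop.*}} prefix:
{code:java}
import org.apache.spark.SparkConf

// Client-side view in a federated cluster: the RM ids point at the routers.
// (Hypothetical values; the NM-side config may instead list the RMs of one sub-cluster.)
val conf = new SparkConf()
  .setAppName("federation-example")
  .set("spark.hadoop.yarn.resourcemanager.ha.rm-ids", "router-0,router-1")
{code}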



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`

2022-09-20 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40500:
-

 Summary: Use `pd.items` instead of `pd.iteritems`
 Key: SPARK-40500
 URL: https://issues.apache.org/jira/browse/SPARK-40500
 Project: Spark
  Issue Type: Improvement
  Components: ps
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40501) add PushProjectionThroughLimit for Optimizer

2022-09-20 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-40501:
---

 Summary: add PushProjectionThroughLimit for Optimizer
 Key: SPARK-40501
 URL: https://issues.apache.org/jira/browse/SPARK-40501
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0

2022-09-20 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-40499:
---
Priority: Blocker  (was: Major)

> Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
> 
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.1
> Environment: hadoop: 3.0.0 
> spark:  2.4.0 / 3.2.1
> shuffle:spark 2.4.0
>Reporter: xuanzhiang
>Priority: Blocker
> Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png
>
>
> spark.sql(
>       s"""
>          |SELECT
>          | Info ,
>          | PERCENTILE_APPROX(cost,0.5) cost_p50,
>          | PERCENTILE_APPROX(cost,0.9) cost_p90,
>          | PERCENTILE_APPROX(cost,0.95) cost_p95,
>          | PERCENTILE_APPROX(cost,0.99) cost_p99,
>          | PERCENTILE_APPROX(cost,0.999) cost_p999
>          |FROM
>          | textData
>          |""".stripMargin)
>  * When we used Spark 2.4.0, aggregation adopted ObjectHashAggregate and stage 2 
> pulled the shuffle data very quickly. But when we use Spark 3.2.1 with the old 
> shuffle service, fetching 140M of shuffle data takes 3 hours.
>  * If we upgrade the shuffle service, will we get a performance regression?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40457) upgrade jackson data mapper to latest

2022-09-20 Thread Bilna (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607082#comment-17607082
 ] 

Bilna commented on SPARK-40457:
---

[~hyukjin.kwon] it is org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13

> upgrade jackson data mapper to latest 
> --
>
> Key: SPARK-40457
> URL: https://issues.apache.org/jira/browse/SPARK-40457
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bilna
>Priority: Major
>
> Upgrade  jackson-mapper-asl to the latest to resolve CVE-2019-10172



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40504) Make yarn appmaster load config from client

2022-09-20 Thread zhengchenyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengchenyu updated SPARK-40504:

Description: 
In YARN federation mode, the configuration on the client side and the NM side may 
be different. The AppMaster should override its configuration with the one from the 
client side.

For example:

On the client side, yarn.resourcemanager.ha.rm-ids points to the YARN routers.

On the NM side, yarn.resourcemanager.ha.rm-ids points to the RMs of a sub-cluster.

  was:In YARN federation mode, the configuration on the client side and the NM side 
may be different. The AppMaster should override its configuration with the one from 
the client side.


> Make yarn appmaster load config from client
> ---
>
> Key: SPARK-40504
> URL: https://issues.apache.org/jira/browse/SPARK-40504
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.1
>Reporter: zhengchenyu
>Priority: Major
>
> In YARN federation mode, the configuration on the client side and the NM side may 
> be different. The AppMaster should override its configuration with the one from 
> the client side.
> For example:
> On the client side, yarn.resourcemanager.ha.rm-ids points to the YARN routers.
> On the NM side, yarn.resourcemanager.ha.rm-ids points to the RMs of a sub-cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40501) add PushProjectionThroughLimit for Optimizer

2022-09-20 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-40501:

Description: 
h4. It takes a very long time to fetch the result

when running the following code in spark-shell:

spark.sql("select * from xxx where event_day = '20220919' limit 1").show()


[!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png]
[!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png]

> add PushProjectionThroughLimit for Optimizer
> 
>
> Key: SPARK-40501
> URL: https://issues.apache.org/jira/browse/SPARK-40501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>
> h4. It takes a very long time to fetch the result
> when running the following code in spark-shell:
> spark.sql("select * from xxx where event_day = '20220919' limit 1").show()
> [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png]
> [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40500:


Assignee: (was: Apache Spark)

> Use `pd.items` instead of `pd.iteritems`
> 
>
> Key: SPARK-40500
> URL: https://issues.apache.org/jira/browse/SPARK-40500
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40500:


Assignee: Apache Spark

> Use `pd.items` instead of `pd.iteritems`
> 
>
> Key: SPARK-40500
> URL: https://issues.apache.org/jira/browse/SPARK-40500
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40501) add PushProjectionThroughLimit for Optimizer

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40501:


Assignee: (was: Apache Spark)

> add PushProjectionThroughLimit for Optimizer
> 
>
> Key: SPARK-40501
> URL: https://issues.apache.org/jira/browse/SPARK-40501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40501) add PushProjectionThroughLimit for Optimizer

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607033#comment-17607033
 ] 

Apache Spark commented on SPARK-40501:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37941

> add PushProjectionThroughLimit for Optimizer
> 
>
> Key: SPARK-40501
> URL: https://issues.apache.org/jira/browse/SPARK-40501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607034#comment-17607034
 ] 

Apache Spark commented on SPARK-40500:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37947

> Use `pd.items` instead of `pd.iteritems`
> 
>
> Key: SPARK-40500
> URL: https://issues.apache.org/jira/browse/SPARK-40500
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40501) add PushProjectionThroughLimit for Optimizer

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40501:


Assignee: Apache Spark

> add PushProjectionThroughLimit for Optimizer
> 
>
> Key: SPARK-40501
> URL: https://issues.apache.org/jira/browse/SPARK-40501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40502) Support dataframe API use jdbc data source in PySpark

2022-09-20 Thread CaoYu (Jira)
CaoYu created SPARK-40502:
-

 Summary: Support dataframe API use jdbc data source in PySpark
 Key: SPARK-40502
 URL: https://issues.apache.org/jira/browse/SPARK-40502
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 3.3.0
Reporter: CaoYu


When i using pyspark, i wanna get data from mysql database.  so i want use 
JDBCRDD like java\scala.

But that is not be supported in PySpark.

 

For some reasons, i can't using DataFrame API, only can use RDD(datastream) 
API. Even i know the DataFrame can get data from jdbc source fairly well.

 
So i want to implement functionality that can use rdd to get data from jdbc 
source for PySpark.
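
For reference, a minimal sketch of the Scala-side JdbcRDD usage this would mirror 
(connection URL, credentials, table and SQL are placeholders; note the query needs 
two {{?}} bind parameters for the partition bounds):
{code:java}
import java.sql.DriverManager

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

// Placeholder MySQL connection details; mapRow turns each ResultSet row into a tuple.
val sc = new SparkContext(new SparkConf().setAppName("jdbc-rdd-example").setMaster("local[*]"))
val rows = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password"),
  "SELECT id, name FROM people WHERE id >= ? AND id <= ?",
  1L,    // lower bound, bound to the first ?
  1000L, // upper bound, bound to the second ?
  4,     // number of partitions
  rs => (rs.getLong(1), rs.getString(2)))
rows.take(5).foreach(println)
{code}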
 
*But i don't know if that are necessary for PySpark.   so we can discuss it.*
 
{*}If it is necessary for PySpark{*}{*}, i want to contribute to Spark.{*}  
*i hope this Jira task can assigned to me, so i can start working to implement 
it.*
 
*if not, please close this Jira task.*
 
 
*thanks a lot.*
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40503) Add resampling to API references

2022-09-20 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40503:
-

 Summary: Add resampling to API references
 Key: SPARK-40503
 URL: https://issues.apache.org/jira/browse/SPARK-40503
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ps
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40504) Make yarn appmaster load config from client

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40504:


Assignee: Apache Spark

> Make yarn appmaster load config from client
> ---
>
> Key: SPARK-40504
> URL: https://issues.apache.org/jira/browse/SPARK-40504
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.1
>Reporter: zhengchenyu
>Assignee: Apache Spark
>Priority: Major
>
> In YARN federation mode, the configuration on the client side and the NM side may 
> be different. The AppMaster should override its configuration with the one from 
> the client side.
> For example:
> On the client side, yarn.resourcemanager.ha.rm-ids points to the YARN routers.
> On the NM side, yarn.resourcemanager.ha.rm-ids points to the RMs of a sub-cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40504) Make yarn appmaster load config from client

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607092#comment-17607092
 ] 

Apache Spark commented on SPARK-40504:
--

User 'zhengchenyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/37949

> Make yarn appmaster load config from client
> ---
>
> Key: SPARK-40504
> URL: https://issues.apache.org/jira/browse/SPARK-40504
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.1
>Reporter: zhengchenyu
>Priority: Major
>
> In YARN federation mode, the configuration on the client side and the NM side may 
> be different. The AppMaster should override its configuration with the one from 
> the client side.
> For example:
> On the client side, yarn.resourcemanager.ha.rm-ids points to the YARN routers.
> On the NM side, yarn.resourcemanager.ha.rm-ids points to the RMs of a sub-cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40491) Remove too old TODO for JdbcRDD

2022-09-20 Thread CaoYu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607045#comment-17607045
 ] 

CaoYu commented on SPARK-40491:
---

Maybe we should just not remove these.

I have already created https://issues.apache.org/jira/browse/SPARK-40502, 
please take a look.

I want to try to implement a JDBC data source for the RDD API in PySpark.

I'm also interested in this task for Scala.

If possible, please assign this task to me; I want to try to get it done.

> Remove too old TODO for JdbcRDD
> ---
>
> Key: SPARK-40491
> URL: https://issues.apache.org/jira/browse/SPARK-40491
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> According to the legacy documentation of JdbcRDD, we need to expose a jdbcRDD 
> function in SparkContext.
> In fact, this is a very old TODO and we need to revisit whether it is still 
> necessary. Since Spark SQL is the new core, I'm not sure anyone is interested 
> in a new API for creating a JDBC RDD.
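
For comparison, here is a minimal sketch of the DataFrame-based JDBC read that 
already covers this use case; the URL, table, credentials and bounds are 
illustrative placeholders, not values from this ticket, and the MySQL JDBC 
driver is assumed to be on the classpath.

{code:java}
// Sketch only: all options below are placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("dbtable", "people")
  .option("user", "user")
  .option("password", "password")
  .option("partitionColumn", "id") // optional: enables parallel reads
  .option("lowerBound", "1")
  .option("upperBound", "1000")
  .option("numPartitions", "4")
  .load()

// An RDD[Row] is still available if RDD-level processing is really needed.
val rows = df.rdd
{code}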



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2022-09-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607077#comment-17607077
 ] 

Apache Spark commented on SPARK-40327:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37948

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40327:


Assignee: (was: Apache Spark)

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40327:


Assignee: Apache Spark

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40504) Make yarn appmaster load config from client

2022-09-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40504:


Assignee: (was: Apache Spark)

> Make yarn appmaster load config from client
> ---
>
> Key: SPARK-40504
> URL: https://issues.apache.org/jira/browse/SPARK-40504
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.1
>Reporter: zhengchenyu
>Priority: Major
>
> In YARN federation mode, the configuration on the client side and on the NM 
> side may differ, so the AppMaster should override its configuration with the 
> client-side values.
> For example:
> On the client side, yarn.resourcemanager.ha.rm-ids points to the YARN routers.
> On the NM side, yarn.resourcemanager.ha.rm-ids points to the RMs of the subcluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:17 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} works. I believe it could be 
non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} would 
work. For instance, after inspecting the code, I thought that {{nullOnOverflow}} 
was controlled by {{spark.sql.ansi.enabled}}, and I tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we mention setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} in 
the error message?


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY }}works. I believe it could get 
non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} 
would work. For instance, after inspecting the code, I thought that 
{{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to 
achieve the desired behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       
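
For readers following the workaround above, here is a minimal spark-shell sketch 
(assuming a Hive-enabled session as in the report; the inserted value is an 
illustrative decimal wider than DECIMAL(20,10), since the exact literal from the 
report is truncated in this archive):

{code:java}
// Sketch only: LEGACY restores the pre-ANSI store-assignment behaviour, so an
// overflowing decimal is stored as NULL instead of raising ArithmeticException.
spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")

spark.sql("create table decimal_extra_precision(c1 DECIMAL(20,10)) stored as orc")
spark.sql("insert into decimal_extra_precision select 12345678901.2345678901")
spark.sql("select * from decimal_extra_precision").show() // expected: a single NULL row
{code}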

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:18 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} works. I believe it could be 
non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} would 
work. For instance, after inspecting the code, I thought that {{nullOnOverflow}} 
was controlled by {{spark.sql.ansi.enabled}}, and I tried to achieve the desired 
behaviour by altering it (but to no avail).

Could we mention setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} in 
the error message?


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. I believe it could get 
non-trivial for users to discover that spark.sql.storeAssignmentPolicy}} would 
work. For instance, after inspecting the code, I thought that 
{{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to 
achieve the desired behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       

[jira] [Commented] (SPARK-31404) file source backward compatibility after calendar switch

2022-09-20 Thread Sachit (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607326#comment-17607326
 ] 

Sachit commented on SPARK-31404:


Hi [~cloud_fan],

Could you please confirm whether we need to set the properties below to ensure 
Spark 3.x can read data written by Spark 2.4.x:


spark.conf.set("spark.sql.parquet.int96RebaseModeInRead","CORRECTED")
spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite","CORRECTED")


Regards
Sachit
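
For context, a minimal sketch of how these session configs are typically 
applied; whether LEGACY or CORRECTED is the right mode depends on how the Spark 
2.4 files were written, so the values and path below are illustrative only.

{code:java}
// Sketch only: valid modes are EXCEPTION, LEGACY (rebase from the hybrid
// Julian/Gregorian calendar) and CORRECTED (read the stored values as-is).
spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "LEGACY")
spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite", "LEGACY")

val df = spark.read.parquet("/path/written/by/spark-2.4") // placeholder path
df.show()
{code}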

> file source backward compatibility after calendar switch
> 
>
> Key: SPARK-31404
> URL: https://issues.apache.org/jira/browse/SPARK-31404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
> Fix For: 3.0.0
>
> Attachments: Switch to Java 8 time API in Spark 3.0.pdf
>
>
> In Spark 3.0, we switch to the Proleptic Gregorian calendar by using the Java 
> 8 datetime APIs. This makes Spark follow the ISO and SQL standard, but 
> introduces some backward compatibility problems:
> 1. may read wrong data from the data files written by Spark 2.4
> 2. may have perf regression



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:20 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} works.

I believe it could be non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting 
the code, I thought that {{nullOnOverflow}} was controlled by 
{{spark.sql.ansi.enabled}}, and I tried to achieve the desired behaviour by 
altering it (but to no avail).

Could we mention setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} in 
the error message?


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. }}

I believe it could get non trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy }}would work. For instance, after inspecting 
the code, I thought that {{nullOnOverflow}} is controlled by 
{{spark.sql.ansi.enabled and}} I tried to achieve the desired behaviour by 
altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>       

[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-20 Thread xsys (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314
 ] 

xsys edited comment on SPARK-40439 at 9/20/22 5:20 PM:
---

[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} works.

I believe it could be non-trivial for users to discover that 
{{spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting 
the code, I thought that {{nullOnOverflow}} was controlled by 
{{spark.sql.ansi.enabled}}, and I tried to achieve the desired behaviour by 
altering it (but to no avail).

Could we mention setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} in 
the error message?


was (Author: JIRAUSER288838):
[~hyukjin.kwon]: Thank you for your response! Setting 
{{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. }}I believe it could get 
non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} 
would work. For instance, after inspecting the code, I thought that 
{{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to 
achieve the desired behaviour by altering it (but to no avail).

Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to 
the error message? 

> DECIMAL value with more precision than what is defined in the schema raises 
> exception in SparkSQL but evaluates to NULL for DataFrame
> -
>
> Key: SPARK-40439
> URL: https://issues.apache.org/jira/browse/SPARK-40439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more 
> precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This 
> leads to a {{NULL}} value being stored if the table is created using 
> DataFrames via {{{}spark-shell{}}}. However, it leads to the following 
> exception if the table is created via {{{}spark-sql{}}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: 
> Decimal(expanded,333.22,21,10}) cannot be represented as 
> Decimal(20, 10){code}
> h3. Step to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> Execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). 
> Here is a simplified example in {{{}spark-shell{}}}, where insertion of the 
> aforementioned decimal value evaluates to a {{{}NULL{}}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = 
> sc.parallelize(Seq(Row(BigDecimal("333.22"
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[0] at parallelize at :27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 
> 10), true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> ++
> |  c1|
> ++
> |null|
> ++
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is being raised from 
> [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
>  ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in 
> [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].):
> {code:java}
>   private[sql] def toPrecision(
>       precision: Int,
>   
