[jira] [Commented] (SPARK-40502) Support dataframe API use jdbc data source in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607447#comment-17607447 ] Hyukjin Kwon commented on SPARK-40502: -- {quote} For some reasons, i can't using DataFrame API, only can use RDD(datastream) API. {quote} What's the reason? > Support dataframe API use jdbc data source in PySpark > - > > Key: SPARK-40502 > URL: https://issues.apache.org/jira/browse/SPARK-40502 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.3.0 >Reporter: CaoYu >Priority: Major > > When i using pyspark, i wanna get data from mysql database. so i want use > JDBCRDD like java\scala. > But that is not be supported in PySpark. > > For some reasons, i can't using DataFrame API, only can use RDD(datastream) > API. Even i know the DataFrame can get data from jdbc source fairly well. > > So i want to implement functionality that can use rdd to get data from jdbc > source for PySpark. > > *But i don't know if that are necessary for PySpark. so we can discuss it.* > > {*}If it is necessary for PySpark{*}{*}, i want to contribute to Spark.{*} > *i hope this Jira task can assigned to me, so i can start working to > implement it.* > > *if not, please close this Jira task.* > > > *thanks a lot.* > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40499: - Priority: Major (was: Blocker) > Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0 > > > Key: SPARK-40499 > URL: https://issues.apache.org/jira/browse/SPARK-40499 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.2.1 > Environment: hadoop: 3.0.0 > spark: 2.4.0 / 3.2.1 > shuffle:spark 2.4.0 >Reporter: xuanzhiang >Priority: Major > Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png > > > spark.sql( > s""" > |SELECT > | Info , > | PERCENTILE_APPROX(cost,0.5) cost_p50, > | PERCENTILE_APPROX(cost,0.9) cost_p90, > | PERCENTILE_APPROX(cost,0.95) cost_p95, > | PERCENTILE_APPROX(cost,0.99) cost_p99, > | PERCENTILE_APPROX(cost,0.999) cost_p999 > |FROM > | textData > |""".stripMargin) > * When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2 > pull shuffle data very quick . but , when we use spark 3.2.1 and use old > shuffle , 140M shuffle data cost 3 hours. > * If we upgrade the Shuffle, will we get performance regression? > * -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40509) Construct an example of applyInPandasWithState in examples directory
Jungtaek Lim created SPARK-40509: Summary: Construct an example of applyInPandasWithState in examples directory Key: SPARK-40509 URL: https://issues.apache.org/jira/browse/SPARK-40509 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 3.4.0 Reporter: Jungtaek Lim Since we introduce a new API (applyInPandasWithState) in PySpark, it worths to have a separate full example of the API. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40491) Remove too old TODO for JdbcRDD
[ https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40491: Assignee: jiaan.geng > Remove too old TODO for JdbcRDD > --- > > Key: SPARK-40491 > URL: https://issues.apache.org/jira/browse/SPARK-40491 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Trivial > > According to the legacy document of JdbcRDD, we need to expose a jdbcRDD > function in SparkContext. > In fact, this is a very old TODO and we need to revisit if this is still > necessary. Since Spark SQL is the new core, I'm not sure if anyone is > interested in a new API to create jdbc RDD. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40332) Implement `GroupBy.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607489#comment-17607489 ] Apache Spark commented on SPARK-40332: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37954 > Implement `GroupBy.quantile`. > - > > Key: SPARK-40332 > URL: https://issues.apache.org/jira/browse/SPARK-40332 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > We should implement `GroupBy.quantile` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40501) Enhance 'SpecialLimits' to support project(..., limit(...))
[ https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-40501: Summary: Enhance 'SpecialLimits' to support project(..., limit(...)) (was: Add PushProjectionThroughLimit for Optimizer) > Enhance 'SpecialLimits' to support project(..., limit(...)) > --- > > Key: SPARK-40501 > URL: https://issues.apache.org/jira/browse/SPARK-40501 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > h4. It took a long time to fetch out, still running after 20 minutes... > when run as follow code in spark-shell: > spark.sql("select * from xxx where event_day = '20220919' limit 1").show() > [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png] > [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40510) Implement `ddof` in `Series.cov`
[ https://issues.apache.org/jira/browse/SPARK-40510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607483#comment-17607483 ] Apache Spark commented on SPARK-40510: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37953 > Implement `ddof` in `Series.cov` > > > Key: SPARK-40510 > URL: https://issues.apache.org/jira/browse/SPARK-40510 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40510) Implement `ddof` in `Series.cov`
[ https://issues.apache.org/jira/browse/SPARK-40510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40510: Assignee: (was: Apache Spark) > Implement `ddof` in `Series.cov` > > > Key: SPARK-40510 > URL: https://issues.apache.org/jira/browse/SPARK-40510 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40510) Implement `ddof` in `Series.cov`
[ https://issues.apache.org/jira/browse/SPARK-40510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40510: Assignee: Apache Spark > Implement `ddof` in `Series.cov` > > > Key: SPARK-40510 > URL: https://issues.apache.org/jira/browse/SPARK-40510 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40510) Implement `ddof` in `Series.cov`
[ https://issues.apache.org/jira/browse/SPARK-40510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607481#comment-17607481 ] Apache Spark commented on SPARK-40510: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37953 > Implement `ddof` in `Series.cov` > > > Key: SPARK-40510 > URL: https://issues.apache.org/jira/browse/SPARK-40510 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40510) Implement `ddof` in `Series.cov`
Ruifeng Zheng created SPARK-40510: - Summary: Implement `ddof` in `Series.cov` Key: SPARK-40510 URL: https://issues.apache.org/jira/browse/SPARK-40510 Project: Spark Issue Type: Sub-task Components: ps Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`
[ https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40500: Assignee: Ruifeng Zheng > Use `pd.items` instead of `pd.iteritems` > > > Key: SPARK-40500 > URL: https://issues.apache.org/jira/browse/SPARK-40500 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`
[ https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40500. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37947 [https://github.com/apache/spark/pull/37947] > Use `pd.items` instead of `pd.iteritems` > > > Key: SPARK-40500 > URL: https://issues.apache.org/jira/browse/SPARK-40500 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40491) Remove too old TODO for JdbcRDD
[ https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40491. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37937 [https://github.com/apache/spark/pull/37937] > Remove too old TODO for JdbcRDD > --- > > Key: SPARK-40491 > URL: https://issues.apache.org/jira/browse/SPARK-40491 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Trivial > Fix For: 3.4.0 > > > According to the legacy document of JdbcRDD, we need to expose a jdbcRDD > function in SparkContext. > In fact, this is a very old TODO and we need to revisit if this is still > necessary. Since Spark SQL is the new core, I'm not sure if anyone is > interested in a new API to create jdbc RDD. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40332) Implement `GroupBy.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607488#comment-17607488 ] Apache Spark commented on SPARK-40332: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37954 > Implement `GroupBy.quantile`. > - > > Key: SPARK-40332 > URL: https://issues.apache.org/jira/browse/SPARK-40332 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > We should implement `GroupBy.quantile` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40506) Spark Streaming metrics name don't need application name
[ https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 王俊博 updated SPARK-40506: Description: Spark StreamingSource Metrics sourceName is inappropriate.The label now looks like `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime `, instead of `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, the Spark app name is not need. This makes it hard to use metrics for different Spark applications over time. And this makes the metrics sourceName standard inconsistent. {code:java} //代码占位符 private[streaming] class StreamingSource(ssc: StreamingContext) extends Source { override val metricRegistry = new MetricRegistry override val sourceName = "%s.StreamingMetrics".format(ssc.sparkContext.appName) }{code} And for example, other metrics sourceName don't have appName. {code:java} //代码占位符 private[spark] class LiveListenerBusMetrics(conf: SparkConf) extends Source with Logging { override val sourceName: String = "LiveListenerBus" override val metricRegistry: MetricRegistry = new MetricRegistry ... } {code} was: Spark StreamingSource Metrics sourceName is inappropriate.The label now looks like `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime `, instead of `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, the Spark app name is not need. This makes it hard to use metrics for different Spark applications over time. And this makes the metrics sourceName standard inconsistent {code:java} //代码占位符 private[streaming] class StreamingSource(ssc: StreamingContext) extends Source { override val metricRegistry = new MetricRegistry override val sourceName = "%s.StreamingMetrics".format(ssc.sparkContext.appName) }{code} And for example, other metrics sourceName don't have appName. {code:java} //代码占位符 private[spark] class LiveListenerBusMetrics(conf: SparkConf) extends Source with Logging { override val sourceName: String = "LiveListenerBus" override val metricRegistry: MetricRegistry = new MetricRegistry ... } {code} > Spark Streaming metrics name don't need application name > > > Key: SPARK-40506 > URL: https://issues.apache.org/jira/browse/SPARK-40506 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.2.2 >Reporter: 王俊博 >Priority: Trivial > > Spark StreamingSource Metrics sourceName is inappropriate.The label now > looks like > `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime > `, instead of > `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, > the Spark app name is not need. > This makes it hard to use metrics for different Spark applications over time. > And this makes the metrics sourceName standard inconsistent. > {code:java} > //代码占位符 > private[streaming] class StreamingSource(ssc: StreamingContext) extends > Source { > override val metricRegistry = new MetricRegistry > override val sourceName = > "%s.StreamingMetrics".format(ssc.sparkContext.appName) > > }{code} > And for example, other metrics sourceName don't have appName. > {code:java} > //代码占位符 > private[spark] class LiveListenerBusMetrics(conf: SparkConf) > extends Source with Logging { > override val sourceName: String = "LiveListenerBus" > override val metricRegistry: MetricRegistry = new MetricRegistry > ... > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40511) Upgrade slf4j to 2.x
Yang Jie created SPARK-40511: Summary: Upgrade slf4j to 2.x Key: SPARK-40511 URL: https://issues.apache.org/jira/browse/SPARK-40511 Project: Spark Issue Type: Improvement Components: Build, Spark Core Affects Versions: 3.4.0 Reporter: Yang Jie https://www.slf4j.org/news.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
[ https://issues.apache.org/jira/browse/SPARK-40496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40496. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37942 [https://github.com/apache/spark/pull/37942] > Configs to control "enableDateTimeParsingFallback" are incorrectly swapped > -- > > Key: SPARK-40496 > URL: https://issues.apache.org/jira/browse/SPARK-40496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
[ https://issues.apache.org/jira/browse/SPARK-40496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40496: --- Assignee: Ivan Sadikov > Configs to control "enableDateTimeParsingFallback" are incorrectly swapped > -- > > Key: SPARK-40496 > URL: https://issues.apache.org/jira/browse/SPARK-40496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40511) Upgrade slf4j to 2.x
[ https://issues.apache.org/jira/browse/SPARK-40511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40511: Assignee: Apache Spark > Upgrade slf4j to 2.x > > > Key: SPARK-40511 > URL: https://issues.apache.org/jira/browse/SPARK-40511 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > https://www.slf4j.org/news.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40511) Upgrade slf4j to 2.x
[ https://issues.apache.org/jira/browse/SPARK-40511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607522#comment-17607522 ] Apache Spark commented on SPARK-40511: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37844 > Upgrade slf4j to 2.x > > > Key: SPARK-40511 > URL: https://issues.apache.org/jira/browse/SPARK-40511 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > https://www.slf4j.org/news.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40511) Upgrade slf4j to 2.x
[ https://issues.apache.org/jira/browse/SPARK-40511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40511: Assignee: (was: Apache Spark) > Upgrade slf4j to 2.x > > > Key: SPARK-40511 > URL: https://issues.apache.org/jira/browse/SPARK-40511 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > https://www.slf4j.org/news.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40502) Support dataframe API use jdbc data source in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607523#comment-17607523 ] CaoYu commented on SPARK-40502: --- I am a teacher Recently designed Python language basic course, big data direction PySpark is one of the practical cases, but it is only a simple use of RDD code to complete the basic data processing work, and the use of JDBC data source is a part of the course DataFrames(SparkSQL) will be used in future design advanced courses. So I hope the datastream API to have the capability of jdbc datasource. > Support dataframe API use jdbc data source in PySpark > - > > Key: SPARK-40502 > URL: https://issues.apache.org/jira/browse/SPARK-40502 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.3.0 >Reporter: CaoYu >Priority: Major > > When i using pyspark, i wanna get data from mysql database. so i want use > JDBCRDD like java\scala. > But that is not be supported in PySpark. > > For some reasons, i can't using DataFrame API, only can use RDD(datastream) > API. Even i know the DataFrame can get data from jdbc source fairly well. > > So i want to implement functionality that can use rdd to get data from jdbc > source for PySpark. > > *But i don't know if that are necessary for PySpark. so we can discuss it.* > > {*}If it is necessary for PySpark{*}{*}, i want to contribute to Spark.{*} > *i hope this Jira task can assigned to me, so i can start working to > implement it.* > > *if not, please close this Jira task.* > > > *thanks a lot.* > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40497) Upgrade Scala to 2.13.9
Yang Jie created SPARK-40497: Summary: Upgrade Scala to 2.13.9 Key: SPARK-40497 URL: https://issues.apache.org/jira/browse/SPARK-40497 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.4.0 Reporter: Yang Jie 2.13.9 released [https://github.com/scala/scala/releases/tag/v2.13.9] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Attachment: spark2.4-shuffle-data.png > Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0 > > > Key: SPARK-40499 > URL: https://issues.apache.org/jira/browse/SPARK-40499 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.2.1 > Environment: hadoop 3.0.0 > spark2.4.0 / spark3.2.1 > shuffle: spark2.4.0 >Reporter: xuanzhiang >Priority: Minor > Attachments: spark2.4-shuffle-data.png > > > spark.sql( > s""" > |SELECT > | Info , > | PERCENTILE_APPROX(cost,0.5) cost_p50, > | PERCENTILE_APPROX(cost,0.9) cost_p90, > | PERCENTILE_APPROX(cost,0.95) cost_p95, > | PERCENTILE_APPROX(cost,0.99) cost_p99, > | PERCENTILE_APPROX(cost,0.999) cost_p999 > |FROM > | textData > |""".stripMargin) > * When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2 > pull shuffle data very quick . but , when we use spark 3.2.1 and use old > shuffle , 140M shuffle data cost 3 hours. > * -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Description: spark.sql( s""" |SELECT | Info , | PERCENTILE_APPROX(cost,0.5) cost_p50, | PERCENTILE_APPROX(cost,0.9) cost_p90, | PERCENTILE_APPROX(cost,0.95) cost_p95, | PERCENTILE_APPROX(cost,0.99) cost_p99, | PERCENTILE_APPROX(cost,0.999) cost_p999 |FROM | textData |""".stripMargin) * When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2 pull shuffle data very quick . but , when we use spark 3.2.1 and use old shuffle , 140M shuffle data cost 3 hours. * If we upgrade the Shuffle, will we get performance regression? * was: spark.sql( s""" |SELECT | Info , | PERCENTILE_APPROX(cost,0.5) cost_p50, | PERCENTILE_APPROX(cost,0.9) cost_p90, | PERCENTILE_APPROX(cost,0.95) cost_p95, | PERCENTILE_APPROX(cost,0.99) cost_p99, | PERCENTILE_APPROX(cost,0.999) cost_p999 |FROM | textData |""".stripMargin) * When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2 pull shuffle data very quick . but , when we use spark 3.2.1 and use old shuffle , 140M shuffle data cost 3 hours. * > Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0 > > > Key: SPARK-40499 > URL: https://issues.apache.org/jira/browse/SPARK-40499 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.2.1 > Environment: hadoop 3.0.0 > spark2.4.0 / spark3.2.1 > shuffle: spark2.4.0 >Reporter: xuanzhiang >Priority: Minor > Attachments: spark2.4-shuffle-data.png > > > spark.sql( > s""" > |SELECT > | Info , > | PERCENTILE_APPROX(cost,0.5) cost_p50, > | PERCENTILE_APPROX(cost,0.9) cost_p90, > | PERCENTILE_APPROX(cost,0.95) cost_p95, > | PERCENTILE_APPROX(cost,0.99) cost_p99, > | PERCENTILE_APPROX(cost,0.999) cost_p999 > |FROM > | textData > |""".stripMargin) > * When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2 > pull shuffle data very quick . but , when we use spark 3.2.1 and use old > shuffle , 140M shuffle data cost 3 hours. > * If we upgrade the Shuffle, will we get performance regression? > * -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40419) Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
[ https://issues.apache.org/jira/browse/SPARK-40419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607022#comment-17607022 ] Apache Spark commented on SPARK-40419: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/37946 > Integrate Grouped Aggregate Pandas UDFs into *.sql test cases > - > > Key: SPARK-40419 > URL: https://issues.apache.org/jira/browse/SPARK-40419 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > We ported Python UDF, Scala UDF and Scalar Pandas UDF into SQL test cases > from SPARK-27921, but Grouped Aggregate Pandas UDF is not tested from SQL at > all. > We should also leverage this to test pandas aggregate UDFs too. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
[ https://issues.apache.org/jira/browse/SPARK-40496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606963#comment-17606963 ] Apache Spark commented on SPARK-40496: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/37942 > Configs to control "enableDateTimeParsingFallback" are incorrectly swapped > -- > > Key: SPARK-40496 > URL: https://issues.apache.org/jira/browse/SPARK-40496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40497) Upgrade Scala to 2.13.9
[ https://issues.apache.org/jira/browse/SPARK-40497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40497: Assignee: (was: Apache Spark) > Upgrade Scala to 2.13.9 > --- > > Key: SPARK-40497 > URL: https://issues.apache.org/jira/browse/SPARK-40497 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > 2.13.9 released [https://github.com/scala/scala/releases/tag/v2.13.9] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40497) Upgrade Scala to 2.13.9
[ https://issues.apache.org/jira/browse/SPARK-40497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40497: Assignee: Apache Spark > Upgrade Scala to 2.13.9 > --- > > Key: SPARK-40497 > URL: https://issues.apache.org/jira/browse/SPARK-40497 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > 2.13.9 released [https://github.com/scala/scala/releases/tag/v2.13.9] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40497) Upgrade Scala to 2.13.9
[ https://issues.apache.org/jira/browse/SPARK-40497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606966#comment-17606966 ] Apache Spark commented on SPARK-40497: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37943 > Upgrade Scala to 2.13.9 > --- > > Key: SPARK-40497 > URL: https://issues.apache.org/jira/browse/SPARK-40497 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > 2.13.9 released [https://github.com/scala/scala/releases/tag/v2.13.9] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40497) Upgrade Scala to 2.13.9
[ https://issues.apache.org/jira/browse/SPARK-40497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606968#comment-17606968 ] Apache Spark commented on SPARK-40497: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37943 > Upgrade Scala to 2.13.9 > --- > > Key: SPARK-40497 > URL: https://issues.apache.org/jira/browse/SPARK-40497 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > 2.13.9 released [https://github.com/scala/scala/releases/tag/v2.13.9] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40498) Implement `kendall` and `min_periods` in `Series.corr`
Ruifeng Zheng created SPARK-40498: - Summary: Implement `kendall` and `min_periods` in `Series.corr` Key: SPARK-40498 URL: https://issues.apache.org/jira/browse/SPARK-40498 Project: Spark Issue Type: Sub-task Components: ps Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40419) Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
[ https://issues.apache.org/jira/browse/SPARK-40419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607021#comment-17607021 ] Apache Spark commented on SPARK-40419: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/37946 > Integrate Grouped Aggregate Pandas UDFs into *.sql test cases > - > > Key: SPARK-40419 > URL: https://issues.apache.org/jira/browse/SPARK-40419 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > We ported Python UDF, Scala UDF and Scalar Pandas UDF into SQL test cases > from SPARK-27921, but Grouped Aggregate Pandas UDF is not tested from SQL at > all. > We should also leverage this to test pandas aggregate UDFs too. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40495) Add additional tests to StreamingSessionWindowSuite
[ https://issues.apache.org/jira/browse/SPARK-40495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40495: Assignee: Apache Spark > Add additional tests to StreamingSessionWindowSuite > --- > > Key: SPARK-40495 > URL: https://issues.apache.org/jira/browse/SPARK-40495 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.3.0 >Reporter: Wei Liu >Assignee: Apache Spark >Priority: Minor > > Add complex tests to {{{}StreamingSessionWindowSuite{}}}. Concretely, I > created two helper functions, > * one is called {{{}sessionWindowQueryNestedKey{}}}, which would convert > {{sessionId}} from the single word key used in {{sessionWindowQuery}} to a > nested column key. For example: {{{}"hello" -> (("hello", "hello"), > "hello"){}}}. > * The other is called {{{}sessionWindowQueryMultiColKey{}}}. It would > convert {{sessionId}} from the single word key used in {{sessionWindowQuery}} > to two columns. For example: "hello" -> col1: ("hello", "hello"), col2: > "hello" > With the two new helper functions, I added more tests for the tests for > {{complete mode}} and {{cap gap duration}} ({{{}append{}}} and {{async > state}} was not included, because the first two is enough for testing the > change I added). The logic of the tests are not changed at all, just the key. > For the aggregation test ({{{}session window - with more aggregation > functions{}}}), I added some more functions as well as a UDAF to test, and I > tried {{first()}} and {{last()}} function on a nested triple, created using > the same method as the above. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40495) Add additional tests to StreamingSessionWindowSuite
[ https://issues.apache.org/jira/browse/SPARK-40495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40495: Assignee: (was: Apache Spark) > Add additional tests to StreamingSessionWindowSuite > --- > > Key: SPARK-40495 > URL: https://issues.apache.org/jira/browse/SPARK-40495 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.3.0 >Reporter: Wei Liu >Priority: Minor > > Add complex tests to {{{}StreamingSessionWindowSuite{}}}. Concretely, I > created two helper functions, > * one is called {{{}sessionWindowQueryNestedKey{}}}, which would convert > {{sessionId}} from the single word key used in {{sessionWindowQuery}} to a > nested column key. For example: {{{}"hello" -> (("hello", "hello"), > "hello"){}}}. > * The other is called {{{}sessionWindowQueryMultiColKey{}}}. It would > convert {{sessionId}} from the single word key used in {{sessionWindowQuery}} > to two columns. For example: "hello" -> col1: ("hello", "hello"), col2: > "hello" > With the two new helper functions, I added more tests for the tests for > {{complete mode}} and {{cap gap duration}} ({{{}append{}}} and {{async > state}} was not included, because the first two is enough for testing the > change I added). The logic of the tests are not changed at all, just the key. > For the aggregation test ({{{}session window - with more aggregation > functions{}}}), I added some more functions as well as a UDAF to test, and I > tried {{first()}} and {{last()}} function on a nested triple, created using > the same method as the above. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40495) Add additional tests to StreamingSessionWindowSuite
[ https://issues.apache.org/jira/browse/SPARK-40495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606940#comment-17606940 ] Apache Spark commented on SPARK-40495: -- User 'WweiL' has created a pull request for this issue: https://github.com/apache/spark/pull/37936 > Add additional tests to StreamingSessionWindowSuite > --- > > Key: SPARK-40495 > URL: https://issues.apache.org/jira/browse/SPARK-40495 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.3.0 >Reporter: Wei Liu >Priority: Minor > > Add complex tests to {{{}StreamingSessionWindowSuite{}}}. Concretely, I > created two helper functions, > * one is called {{{}sessionWindowQueryNestedKey{}}}, which would convert > {{sessionId}} from the single word key used in {{sessionWindowQuery}} to a > nested column key. For example: {{{}"hello" -> (("hello", "hello"), > "hello"){}}}. > * The other is called {{{}sessionWindowQueryMultiColKey{}}}. It would > convert {{sessionId}} from the single word key used in {{sessionWindowQuery}} > to two columns. For example: "hello" -> col1: ("hello", "hello"), col2: > "hello" > With the two new helper functions, I added more tests for the tests for > {{complete mode}} and {{cap gap duration}} ({{{}append{}}} and {{async > state}} was not included, because the first two is enough for testing the > change I added). The logic of the tests are not changed at all, just the key. > For the aggregation test ({{{}session window - with more aggregation > functions{}}}), I added some more functions as well as a UDAF to test, and I > tried {{first()}} and {{last()}} function on a nested triple, created using > the same method as the above. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40486) Implement `spearman` and `kendall` in `DataFrame.corrwith`
[ https://issues.apache.org/jira/browse/SPARK-40486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-40486. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37929 [https://github.com/apache/spark/pull/37929] > Implement `spearman` and `kendall` in `DataFrame.corrwith` > -- > > Key: SPARK-40486 > URL: https://issues.apache.org/jira/browse/SPARK-40486 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40486) Implement `spearman` and `kendall` in `DataFrame.corrwith`
[ https://issues.apache.org/jira/browse/SPARK-40486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-40486: - Assignee: Ruifeng Zheng > Implement `spearman` and `kendall` in `DataFrame.corrwith` > -- > > Key: SPARK-40486 > URL: https://issues.apache.org/jira/browse/SPARK-40486 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
Ivan Sadikov created SPARK-40496: Summary: Configs to control "enableDateTimeParsingFallback" are incorrectly swapped Key: SPARK-40496 URL: https://issues.apache.org/jira/browse/SPARK-40496 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Ivan Sadikov -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
[ https://issues.apache.org/jira/browse/SPARK-40496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40496: Assignee: (was: Apache Spark) > Configs to control "enableDateTimeParsingFallback" are incorrectly swapped > -- > > Key: SPARK-40496 > URL: https://issues.apache.org/jira/browse/SPARK-40496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
[ https://issues.apache.org/jira/browse/SPARK-40496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40496: Assignee: Apache Spark > Configs to control "enableDateTimeParsingFallback" are incorrectly swapped > -- > > Key: SPARK-40496 > URL: https://issues.apache.org/jira/browse/SPARK-40496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
[ https://issues.apache.org/jira/browse/SPARK-40496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606962#comment-17606962 ] Apache Spark commented on SPARK-40496: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/37942 > Configs to control "enableDateTimeParsingFallback" are incorrectly swapped > -- > > Key: SPARK-40496 > URL: https://issues.apache.org/jira/browse/SPARK-40496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.
[ https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606973#comment-17606973 ] Piotr Karwasz commented on SPARK-40489: --- [~kabhwan], The code snippet I posted above (using {{LoggerFactory.getLoggerFactory()}}) should work on both SLF4J 1.x and SLF4J 2.x. You can compile Spark against SLF4J 1.x and it should work with both versions of the API. Note that if the runtime uses SLF4J 2.x, you need to use {{log4j-slf4j2-impl}} as binding. > Spark 3.3.0 breaks with SFL4J 2. > > > Key: SPARK-40489 > URL: https://issues.apache.org/jira/browse/SPARK-40489 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Garret Wilson >Priority: Major > > Spark breaks fundamentally with SLF4J 2.x because it uses > {{StaticLoggerBinder}}. > SLF4J is the logging facade that is meant to shield the application from the > implementation, whether it be Log4J or Logback or whatever. Historically > SLF4J 1.x used a bad approach to configuration: it used a > {{StaticLoggerBinder}} (a global static singleton instance) rather than the > Java {{ServiceLoader}} mechanism. > SLF4J 2.x, which has been in development for years, has been released. It > finally switches to use the {{ServiceLoader}} mechanism. As [described in the > FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be > compatible; an application just needs to use the latest Log4J/Logback > implementation which has the service loader. > *Above all the application must _not_ use the low-level > {{StaticLoggerBinder}} method, because it has been removed!* > Unfortunately > [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala] > uses {{StaticLoggerBinder}} and completely breaks any environment using > SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API > and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark > breaks completely just trying to get a Spark session: > {noformat} > Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder > at > org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232) > at > org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at org.apache.spark.internal.Logging.log(Logging.scala:53) > at org.apache.spark.internal.Logging.log$(Logging.scala:51) > at org.apache.spark.SparkContext.log(SparkContext.scala:84) > at org.apache.spark.internal.Logging.logInfo(Logging.scala:61) > at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60) > at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84) > at org.apache.spark.SparkContext.(SparkContext.scala:195) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704) > at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953) > at scala.Option.getOrElse(Option.scala:201) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947) > {noformat} > This is because Spark is playing low-level tricks to find out if the logging > platform is Log4J, and relying on {{StaticLoggerBinder}} to do it. > {code} > private def isLog4j2(): Boolean = { > // This distinguishes the log4j 1.2 binding, currently > // org.slf4j.impl.Log4jLoggerFactory, from the log4j 2.0 binding, > currently > // org.apache.logging.slf4j.Log4jLoggerFactory > val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr > "org.apache.logging.slf4j.Log4jLoggerFactory".equals(binderClass) > } > {code} > Whatever the wisdom of Spark's relying on Log4J-specific functionality, Spark > should not be using {{StaticLoggerBinder}} to do that detection. There are > many other approaches. (The code itself suggest one approach: > {{LogManager.getRootLogger.asInstanceOf[Log4jLogger]}}. You could check to > see if the root logger actually is a {{Log4jLogger}}. There may be even > better approaches.) > The other big problem is relying on the Log4J
[jira] [Updated] (SPARK-40491) Remove too old TODO for JdbcRDD
[ https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-40491: --- Summary: Remove too old TODO for JdbcRDD (was: Remove too old todo comments for JdbcRDD) > Remove too old TODO for JdbcRDD > --- > > Key: SPARK-40491 > URL: https://issues.apache.org/jira/browse/SPARK-40491 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > According to the legacy document of JdbcRDD, we need to expose a jdbcRDD > function in SparkContext. > In fact, this is a very old TODO and we need to revisit if this is still > necessary. Since Spark SQL is the new core, I'm not sure if anyone is > interested in a new API to create jdbc RDD. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40498) Implement `kendall` and `min_periods` in `Series.corr`
[ https://issues.apache.org/jira/browse/SPARK-40498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607006#comment-17607006 ] Apache Spark commented on SPARK-40498: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37945 > Implement `kendall` and `min_periods` in `Series.corr` > -- > > Key: SPARK-40498 > URL: https://issues.apache.org/jira/browse/SPARK-40498 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40498) Implement `kendall` and `min_periods` in `Series.corr`
[ https://issues.apache.org/jira/browse/SPARK-40498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40498: Assignee: Apache Spark > Implement `kendall` and `min_periods` in `Series.corr` > -- > > Key: SPARK-40498 > URL: https://issues.apache.org/jira/browse/SPARK-40498 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40498) Implement `kendall` and `min_periods` in `Series.corr`
[ https://issues.apache.org/jira/browse/SPARK-40498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607004#comment-17607004 ] Apache Spark commented on SPARK-40498: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37945 > Implement `kendall` and `min_periods` in `Series.corr` > -- > > Key: SPARK-40498 > URL: https://issues.apache.org/jira/browse/SPARK-40498 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40498) Implement `kendall` and `min_periods` in `Series.corr`
[ https://issues.apache.org/jira/browse/SPARK-40498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40498: Assignee: (was: Apache Spark) > Implement `kendall` and `min_periods` in `Series.corr` > -- > > Key: SPARK-40498 > URL: https://issues.apache.org/jira/browse/SPARK-40498 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Attachment: spark3.2-shuffle-data.png > Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0 > > > Key: SPARK-40499 > URL: https://issues.apache.org/jira/browse/SPARK-40499 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.2.1 > Environment: hadoop 3.0.0 > spark2.4.0 / spark3.2.1 > shuffle: spark2.4.0 >Reporter: xuanzhiang >Priority: Minor > Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png > > > spark.sql( > s""" > |SELECT > | Info , > | PERCENTILE_APPROX(cost,0.5) cost_p50, > | PERCENTILE_APPROX(cost,0.9) cost_p90, > | PERCENTILE_APPROX(cost,0.95) cost_p95, > | PERCENTILE_APPROX(cost,0.99) cost_p99, > | PERCENTILE_APPROX(cost,0.999) cost_p999 > |FROM > | textData > |""".stripMargin) > * When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2 > pull shuffle data very quick . but , when we use spark 3.2.1 and use old > shuffle , 140M shuffle data cost 3 hours. > * If we upgrade the Shuffle, will we get performance regression? > * -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Environment: hadoop: 3.0.0 spark: 2.4.0 / 3.2.1 shuffle:spark 2.4.0 was: hadoop 3.0.0 spark2.4.0 / spark3.2.1 shuffle: spark2.4.0 > Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0 > > > Key: SPARK-40499 > URL: https://issues.apache.org/jira/browse/SPARK-40499 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.2.1 > Environment: hadoop: 3.0.0 > spark: 2.4.0 / 3.2.1 > shuffle:spark 2.4.0 >Reporter: xuanzhiang >Priority: Minor > Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png > > > spark.sql( > s""" > |SELECT > | Info , > | PERCENTILE_APPROX(cost,0.5) cost_p50, > | PERCENTILE_APPROX(cost,0.9) cost_p90, > | PERCENTILE_APPROX(cost,0.95) cost_p95, > | PERCENTILE_APPROX(cost,0.99) cost_p99, > | PERCENTILE_APPROX(cost,0.999) cost_p999 > |FROM > | textData > |""".stripMargin) > * When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2 > pull shuffle data very quick . but , when we use spark 3.2.1 and use old > shuffle , 140M shuffle data cost 3 hours. > * If we upgrade the Shuffle, will we get performance regression? > * -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40495) Add additional tests to StreamingSessionWindowSuite
[ https://issues.apache.org/jira/browse/SPARK-40495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606937#comment-17606937 ] Wei Liu commented on SPARK-40495: - Hi there, this is my first time to contribute to OSS spark. I'm not particularly sure what to put at `Affects Version/s`, so please correct me if I'm wrong! Thanks in advance! > Add additional tests to StreamingSessionWindowSuite > --- > > Key: SPARK-40495 > URL: https://issues.apache.org/jira/browse/SPARK-40495 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.3.0 >Reporter: Wei Liu >Priority: Minor > > Add complex tests to {{{}StreamingSessionWindowSuite{}}}. Concretely, I > created two helper functions, > * one is called {{{}sessionWindowQueryNestedKey{}}}, which would convert > {{sessionId}} from the single word key used in {{sessionWindowQuery}} to a > nested column key. For example: {{{}"hello" -> (("hello", "hello"), > "hello"){}}}. > * The other is called {{{}sessionWindowQueryMultiColKey{}}}. It would > convert {{sessionId}} from the single word key used in {{sessionWindowQuery}} > to two columns. For example: "hello" -> col1: ("hello", "hello"), col2: > "hello" > With the two new helper functions, I added more tests for the tests for > {{complete mode}} and {{cap gap duration}} ({{{}append{}}} and {{async > state}} was not included, because the first two is enough for testing the > change I added). The logic of the tests are not changed at all, just the key. > For the aggregation test ({{{}session window - with more aggregation > functions{}}}), I added some more functions as well as a UDAF to test, and I > tried {{first()}} and {{last()}} function on a nested triple, created using > the same method as the above. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40495) Add additional tests to StreamingSessionWindowSuite
Wei Liu created SPARK-40495: --- Summary: Add additional tests to StreamingSessionWindowSuite Key: SPARK-40495 URL: https://issues.apache.org/jira/browse/SPARK-40495 Project: Spark Issue Type: Test Components: Tests Affects Versions: 3.3.0 Reporter: Wei Liu Add complex tests to {{{}StreamingSessionWindowSuite{}}}. Concretely, I created two helper functions, * one is called {{{}sessionWindowQueryNestedKey{}}}, which would convert {{sessionId}} from the single word key used in {{sessionWindowQuery}} to a nested column key. For example: {{{}"hello" -> (("hello", "hello"), "hello"){}}}. * The other is called {{{}sessionWindowQueryMultiColKey{}}}. It would convert {{sessionId}} from the single word key used in {{sessionWindowQuery}} to two columns. For example: "hello" -> col1: ("hello", "hello"), col2: "hello" With the two new helper functions, I added more tests for the tests for {{complete mode}} and {{cap gap duration}} ({{{}append{}}} and {{async state}} was not included, because the first two is enough for testing the change I added). The logic of the tests are not changed at all, just the key. For the aggregation test ({{{}session window - with more aggregation functions{}}}), I added some more functions as well as a UDAF to test, and I tried {{first()}} and {{last()}} function on a nested triple, created using the same method as the above. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40491) Remove too old todo comments for JdbcRDD
[ https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-40491: --- Summary: Remove too old todo comments for JdbcRDD (was: Expose a jdbcRDD function in SparkContext) > Remove too old todo comments for JdbcRDD > > > Key: SPARK-40491 > URL: https://issues.apache.org/jira/browse/SPARK-40491 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > According to the legacy document of JdbcRDD, we need to expose a jdbcRDD > function in SparkContext. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40491) Remove too old todo comments for JdbcRDD
[ https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-40491: --- Description: According to the legacy document of JdbcRDD, we need to expose a jdbcRDD function in SparkContext. In fact, this is a very old TODO and we need to revisit if this is still necessary. Since Spark SQL is the new core, I'm not sure if anyone is interested in a new API to create jdbc RDD. was:According to the legacy document of JdbcRDD, we need to expose a jdbcRDD function in SparkContext. > Remove too old todo comments for JdbcRDD > > > Key: SPARK-40491 > URL: https://issues.apache.org/jira/browse/SPARK-40491 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > According to the legacy document of JdbcRDD, we need to expose a jdbcRDD > function in SparkContext. > In fact, this is a very old TODO and we need to revisit if this is still > necessary. Since Spark SQL is the new core, I'm not sure if anyone is > interested in a new API to create jdbc RDD. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Environment: hadoop 3.0.0 spark2.4.0 / spark3.2.1 shuffle: spark2.4.0 was:!image-2022-09-20-16-57-01-881.png! > Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0 > > > Key: SPARK-40499 > URL: https://issues.apache.org/jira/browse/SPARK-40499 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.2.1 > Environment: hadoop 3.0.0 > spark2.4.0 / spark3.2.1 > shuffle: spark2.4.0 >Reporter: xuanzhiang >Priority: Minor > > spark.sql( > s""" > |SELECT > | Info , > | PERCENTILE_APPROX(cost,0.5) cost_p50, > | PERCENTILE_APPROX(cost,0.9) cost_p90, > | PERCENTILE_APPROX(cost,0.95) cost_p95, > | PERCENTILE_APPROX(cost,0.99) cost_p99, > | PERCENTILE_APPROX(cost,0.999) cost_p999 > |FROM > | textData > |""".stripMargin) > * When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2 > pull shuffle data very quick . but , when we use spark 3.2.1 and use old > shuffle , 140M shuffle data cost 3 hours. > * -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
xuanzhiang created SPARK-40499: -- Summary: Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0 Key: SPARK-40499 URL: https://issues.apache.org/jira/browse/SPARK-40499 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 3.2.1 Environment: !image-2022-09-20-16-57-01-881.png! Reporter: xuanzhiang spark.sql( s""" |SELECT | Info , | PERCENTILE_APPROX(cost,0.5) cost_p50, | PERCENTILE_APPROX(cost,0.9) cost_p90, | PERCENTILE_APPROX(cost,0.95) cost_p95, | PERCENTILE_APPROX(cost,0.99) cost_p99, | PERCENTILE_APPROX(cost,0.999) cost_p999 |FROM | textData |""".stripMargin) * When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2 pull shuffle data very quick . but , when we use spark 3.2.1 and use old shuffle , 140M shuffle data cost 3 hours. * -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Priority: Blocker (was: Minor) > Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0 > > > Key: SPARK-40499 > URL: https://issues.apache.org/jira/browse/SPARK-40499 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.2.1 > Environment: hadoop: 3.0.0 > spark: 2.4.0 / 3.2.1 > shuffle:spark 2.4.0 >Reporter: xuanzhiang >Priority: Blocker > Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png > > > spark.sql( > s""" > |SELECT > | Info , > | PERCENTILE_APPROX(cost,0.5) cost_p50, > | PERCENTILE_APPROX(cost,0.9) cost_p90, > | PERCENTILE_APPROX(cost,0.95) cost_p95, > | PERCENTILE_APPROX(cost,0.99) cost_p99, > | PERCENTILE_APPROX(cost,0.999) cost_p999 > |FROM > | textData > |""".stripMargin) > * When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2 > pull shuffle data very quick . but , when we use spark 3.2.1 and use old > shuffle , 140M shuffle data cost 3 hours. > * If we upgrade the Shuffle, will we get performance regression? > * -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Priority: Major (was: Blocker) > Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0 > > > Key: SPARK-40499 > URL: https://issues.apache.org/jira/browse/SPARK-40499 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.2.1 > Environment: hadoop: 3.0.0 > spark: 2.4.0 / 3.2.1 > shuffle:spark 2.4.0 >Reporter: xuanzhiang >Priority: Major > Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png > > > spark.sql( > s""" > |SELECT > | Info , > | PERCENTILE_APPROX(cost,0.5) cost_p50, > | PERCENTILE_APPROX(cost,0.9) cost_p90, > | PERCENTILE_APPROX(cost,0.95) cost_p95, > | PERCENTILE_APPROX(cost,0.99) cost_p99, > | PERCENTILE_APPROX(cost,0.999) cost_p999 > |FROM > | textData > |""".stripMargin) > * When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2 > pull shuffle data very quick . but , when we use spark 3.2.1 and use old > shuffle , 140M shuffle data cost 3 hours. > * If we upgrade the Shuffle, will we get performance regression? > * -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40501) Add PushProjectionThroughLimit for Optimizer
[ https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-40501: Summary: Add PushProjectionThroughLimit for Optimizer (was: add PushProjectionThroughLimit for Optimizer) > Add PushProjectionThroughLimit for Optimizer > > > Key: SPARK-40501 > URL: https://issues.apache.org/jira/browse/SPARK-40501 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > h4. It took a long time to fetch out > when run as follow code in spark-shell: > spark.sql("select * from xxx where event_day = '20220919' limit 1").show() > [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png] > [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint
[ https://issues.apache.org/jira/browse/SPARK-40505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607149#comment-17607149 ] Apache Spark commented on SPARK-40505: -- User 'bryanck' has created a pull request for this issue: https://github.com/apache/spark/pull/37950 > Remove min heap setting in Kubernetes Dockerfile entrypoint > --- > > Key: SPARK-40505 > URL: https://issues.apache.org/jira/browse/SPARK-40505 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0, 3.2.2 >Reporter: Bryan Keller >Priority: Major > > The entrypoint script for the Kubernetes Dockerfile sets the Java min heap > setting (-Xms) to be the same as the max setting (-Xmx) for the executor > process. This prevents the JVM from shrinking the heap and can lead to > excessive memory usage in some scenarios. Removing the min heap setting is > consistent with YARN executor startup. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint
[ https://issues.apache.org/jira/browse/SPARK-40505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40505: Assignee: Apache Spark > Remove min heap setting in Kubernetes Dockerfile entrypoint > --- > > Key: SPARK-40505 > URL: https://issues.apache.org/jira/browse/SPARK-40505 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0, 3.2.2 >Reporter: Bryan Keller >Assignee: Apache Spark >Priority: Major > > The entrypoint script for the Kubernetes Dockerfile sets the Java min heap > setting (-Xms) to be the same as the max setting (-Xmx) for the executor > process. This prevents the JVM from shrinking the heap and can lead to > excessive memory usage in some scenarios. Removing the min heap setting is > consistent with YARN executor startup. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint
[ https://issues.apache.org/jira/browse/SPARK-40505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40505: Assignee: (was: Apache Spark) > Remove min heap setting in Kubernetes Dockerfile entrypoint > --- > > Key: SPARK-40505 > URL: https://issues.apache.org/jira/browse/SPARK-40505 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0, 3.2.2 >Reporter: Bryan Keller >Priority: Major > > The entrypoint script for the Kubernetes Dockerfile sets the Java min heap > setting (-Xms) to be the same as the max setting (-Xmx) for the executor > process. This prevents the JVM from shrinking the heap and can lead to > excessive memory usage in some scenarios. Removing the min heap setting is > consistent with YARN executor startup. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint
[ https://issues.apache.org/jira/browse/SPARK-40505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607151#comment-17607151 ] Apache Spark commented on SPARK-40505: -- User 'bryanck' has created a pull request for this issue: https://github.com/apache/spark/pull/37950 > Remove min heap setting in Kubernetes Dockerfile entrypoint > --- > > Key: SPARK-40505 > URL: https://issues.apache.org/jira/browse/SPARK-40505 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0, 3.2.2 >Reporter: Bryan Keller >Priority: Major > > The entrypoint script for the Kubernetes Dockerfile sets the Java min heap > setting (-Xms) to be the same as the max setting (-Xmx) for the executor > process. This prevents the JVM from shrinking the heap and can lead to > excessive memory usage in some scenarios. Removing the min heap setting is > consistent with YARN executor startup. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40506) Spark Streaming Metrics SourceName is unsuitable
王俊博 created SPARK-40506: --- Summary: Spark Streaming Metrics SourceName is unsuitable Key: SPARK-40506 URL: https://issues.apache.org/jira/browse/SPARK-40506 Project: Spark Issue Type: Improvement Components: DStreams Affects Versions: 3.2.2 Reporter: 王俊博 Spark StreamingSource Metrics sourceName is inappropriate.The label now looks like `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime `, instead of `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, the Spark app name is not need. {code:java} //代码占位符 private[streaming] class StreamingSource(ssc: StreamingContext) extends Source { override val metricRegistry = new MetricRegistry override val sourceName = "%s.StreamingMetrics".format(ssc.sparkContext.appName) }{code} And for example, other metrics sourceName don't have appName. {code:java} //代码占位符 private[spark] class LiveListenerBusMetrics(conf: SparkConf) extends Source with Logging { override val sourceName: String = "LiveListenerBus" override val metricRegistry: MetricRegistry = new MetricRegistry ... } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40506) Spark Streaming metrics name don't need application name
[ https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40506: Assignee: (was: Apache Spark) > Spark Streaming metrics name don't need application name > > > Key: SPARK-40506 > URL: https://issues.apache.org/jira/browse/SPARK-40506 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.2.2 >Reporter: 王俊博 >Priority: Trivial > > Spark StreamingSource Metrics sourceName is inappropriate.The label now > looks like > `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime > `, instead of > `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, > the Spark app name is not need. > This makes it hard to use metrics for different Spark applications over time. > And this makes the metrics sourceName standard inconsistent > {code:java} > //代码占位符 > private[streaming] class StreamingSource(ssc: StreamingContext) extends > Source { > override val metricRegistry = new MetricRegistry > override val sourceName = > "%s.StreamingMetrics".format(ssc.sparkContext.appName) > > }{code} > And for example, other metrics sourceName don't have appName. > {code:java} > //代码占位符 > private[spark] class LiveListenerBusMetrics(conf: SparkConf) > extends Source with Logging { > override val sourceName: String = "LiveListenerBus" > override val metricRegistry: MetricRegistry = new MetricRegistry > ... > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40506) Spark Streaming metrics name don't need application name
[ https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607167#comment-17607167 ] Apache Spark commented on SPARK-40506: -- User 'Kwafoor' has created a pull request for this issue: https://github.com/apache/spark/pull/37951 > Spark Streaming metrics name don't need application name > > > Key: SPARK-40506 > URL: https://issues.apache.org/jira/browse/SPARK-40506 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.2.2 >Reporter: 王俊博 >Priority: Trivial > > Spark StreamingSource Metrics sourceName is inappropriate.The label now > looks like > `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime > `, instead of > `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, > the Spark app name is not need. > This makes it hard to use metrics for different Spark applications over time. > And this makes the metrics sourceName standard inconsistent > {code:java} > //代码占位符 > private[streaming] class StreamingSource(ssc: StreamingContext) extends > Source { > override val metricRegistry = new MetricRegistry > override val sourceName = > "%s.StreamingMetrics".format(ssc.sparkContext.appName) > > }{code} > And for example, other metrics sourceName don't have appName. > {code:java} > //代码占位符 > private[spark] class LiveListenerBusMetrics(conf: SparkConf) > extends Source with Logging { > override val sourceName: String = "LiveListenerBus" > override val metricRegistry: MetricRegistry = new MetricRegistry > ... > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40506) Spark Streaming metrics name don't need application name
[ https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40506: Assignee: Apache Spark > Spark Streaming metrics name don't need application name > > > Key: SPARK-40506 > URL: https://issues.apache.org/jira/browse/SPARK-40506 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.2.2 >Reporter: 王俊博 >Assignee: Apache Spark >Priority: Trivial > > Spark StreamingSource Metrics sourceName is inappropriate.The label now > looks like > `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime > `, instead of > `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, > the Spark app name is not need. > This makes it hard to use metrics for different Spark applications over time. > And this makes the metrics sourceName standard inconsistent > {code:java} > //代码占位符 > private[streaming] class StreamingSource(ssc: StreamingContext) extends > Source { > override val metricRegistry = new MetricRegistry > override val sourceName = > "%s.StreamingMetrics".format(ssc.sparkContext.appName) > > }{code} > And for example, other metrics sourceName don't have appName. > {code:java} > //代码占位符 > private[spark] class LiveListenerBusMetrics(conf: SparkConf) > extends Source with Logging { > override val sourceName: String = "LiveListenerBus" > override val metricRegistry: MetricRegistry = new MetricRegistry > ... > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40501) Add PushProjectionThroughLimit for Optimizer
[ https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-40501: Description: h4. It took a long time to fetch out, still running after 20 minutes... when run as follow code in spark-shell: spark.sql("select * from xxx where event_day = '20220919' limit 1").show() [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png] [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png] was: h4. It took a long time to fetch out when run as follow code in spark-shell: spark.sql("select * from xxx where event_day = '20220919' limit 1").show() [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png] [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png] > Add PushProjectionThroughLimit for Optimizer > > > Key: SPARK-40501 > URL: https://issues.apache.org/jira/browse/SPARK-40501 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > h4. It took a long time to fetch out, still running after 20 minutes... > when run as follow code in spark-shell: > spark.sql("select * from xxx where event_day = '20220919' limit 1").show() > [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png] > [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint
Bryan Keller created SPARK-40505: Summary: Remove min heap setting in Kubernetes Dockerfile entrypoint Key: SPARK-40505 URL: https://issues.apache.org/jira/browse/SPARK-40505 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.2.2, 3.3.0 Reporter: Bryan Keller The entrypoint script for the Kubernetes Dockerfile sets the Java min heap setting (-Xms) to be the same as the max setting (-Xmx) for the executor process. This prevents the JVM from shrinking the heap and can lead to excessive memory usage in some scenarios. Removing the min heap setting is consistent with YARN executor startup. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40506) Spark Streaming Metrics SourceName is unsuitable
[ https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 王俊博 updated SPARK-40506: Description: Spark StreamingSource Metrics sourceName is inappropriate.The label now looks like `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime `, instead of `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, the Spark app name is not need. This makes it hard to use metrics for different Spark applications over time. And this makes the metrics sourceName standard inconsistent {code:java} //代码占位符 private[streaming] class StreamingSource(ssc: StreamingContext) extends Source { override val metricRegistry = new MetricRegistry override val sourceName = "%s.StreamingMetrics".format(ssc.sparkContext.appName) }{code} And for example, other metrics sourceName don't have appName. {code:java} //代码占位符 private[spark] class LiveListenerBusMetrics(conf: SparkConf) extends Source with Logging { override val sourceName: String = "LiveListenerBus" override val metricRegistry: MetricRegistry = new MetricRegistry ... } {code} was: Spark StreamingSource Metrics sourceName is inappropriate.The label now looks like `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime `, instead of `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, the Spark app name is not need. {code:java} //代码占位符 private[streaming] class StreamingSource(ssc: StreamingContext) extends Source { override val metricRegistry = new MetricRegistry override val sourceName = "%s.StreamingMetrics".format(ssc.sparkContext.appName) }{code} And for example, other metrics sourceName don't have appName. {code:java} //代码占位符 private[spark] class LiveListenerBusMetrics(conf: SparkConf) extends Source with Logging { override val sourceName: String = "LiveListenerBus" override val metricRegistry: MetricRegistry = new MetricRegistry ... } {code} > Spark Streaming Metrics SourceName is unsuitable > > > Key: SPARK-40506 > URL: https://issues.apache.org/jira/browse/SPARK-40506 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.2.2 >Reporter: 王俊博 >Priority: Trivial > > Spark StreamingSource Metrics sourceName is inappropriate.The label now > looks like > `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime > `, instead of > `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, > the Spark app name is not need. > This makes it hard to use metrics for different Spark applications over time. > And this makes the metrics sourceName standard inconsistent > {code:java} > //代码占位符 > private[streaming] class StreamingSource(ssc: StreamingContext) extends > Source { > override val metricRegistry = new MetricRegistry > override val sourceName = > "%s.StreamingMetrics".format(ssc.sparkContext.appName) > > }{code} > And for example, other metrics sourceName don't have appName. > {code:java} > //代码占位符 > private[spark] class LiveListenerBusMetrics(conf: SparkConf) > extends Source with Logging { > override val sourceName: String = "LiveListenerBus" > override val metricRegistry: MetricRegistry = new MetricRegistry > ... > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40506) Spark Streaming metrics name don't need application name
[ https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 王俊博 updated SPARK-40506: Summary: Spark Streaming metrics name don't need application name (was: Spark Streaming Metrics SourceName is unsuitable) > Spark Streaming metrics name don't need application name > > > Key: SPARK-40506 > URL: https://issues.apache.org/jira/browse/SPARK-40506 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.2.2 >Reporter: 王俊博 >Priority: Trivial > > Spark StreamingSource Metrics sourceName is inappropriate.The label now > looks like > `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime > `, instead of > `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, > the Spark app name is not need. > This makes it hard to use metrics for different Spark applications over time. > And this makes the metrics sourceName standard inconsistent > {code:java} > //代码占位符 > private[streaming] class StreamingSource(ssc: StreamingContext) extends > Source { > override val metricRegistry = new MetricRegistry > override val sourceName = > "%s.StreamingMetrics".format(ssc.sparkContext.appName) > > }{code} > And for example, other metrics sourceName don't have appName. > {code:java} > //代码占位符 > private[spark] class LiveListenerBusMetrics(conf: SparkConf) > extends Source with Logging { > override val sourceName: String = "LiveListenerBus" > override val metricRegistry: MetricRegistry = new MetricRegistry > ... > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607079#comment-17607079 ] Apache Spark commented on SPARK-40327: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37948 > Increase pandas API coverage for pandas API on Spark > > > Key: SPARK-40327 > URL: https://issues.apache.org/jira/browse/SPARK-40327 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Increasing the pandas API coverage for Apache Spark 3.4.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40504) Make yarn appmaster load config from client
zhengchenyu created SPARK-40504: --- Summary: Make yarn appmaster load config from client Key: SPARK-40504 URL: https://issues.apache.org/jira/browse/SPARK-40504 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 3.0.1 Reporter: zhengchenyu In yarn federation mode, config in client side and nm side may be different. AppMaster should override config from client side. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`
Ruifeng Zheng created SPARK-40500: - Summary: Use `pd.items` instead of `pd.iteritems` Key: SPARK-40500 URL: https://issues.apache.org/jira/browse/SPARK-40500 Project: Spark Issue Type: Improvement Components: ps Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40501) add PushProjectionThroughLimit for Optimizer
BingKun Pan created SPARK-40501: --- Summary: add PushProjectionThroughLimit for Optimizer Key: SPARK-40501 URL: https://issues.apache.org/jira/browse/SPARK-40501 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Priority: Blocker (was: Major) > Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0 > > > Key: SPARK-40499 > URL: https://issues.apache.org/jira/browse/SPARK-40499 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.2.1 > Environment: hadoop: 3.0.0 > spark: 2.4.0 / 3.2.1 > shuffle:spark 2.4.0 >Reporter: xuanzhiang >Priority: Blocker > Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png > > > spark.sql( > s""" > |SELECT > | Info , > | PERCENTILE_APPROX(cost,0.5) cost_p50, > | PERCENTILE_APPROX(cost,0.9) cost_p90, > | PERCENTILE_APPROX(cost,0.95) cost_p95, > | PERCENTILE_APPROX(cost,0.99) cost_p99, > | PERCENTILE_APPROX(cost,0.999) cost_p999 > |FROM > | textData > |""".stripMargin) > * When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2 > pull shuffle data very quick . but , when we use spark 3.2.1 and use old > shuffle , 140M shuffle data cost 3 hours. > * If we upgrade the Shuffle, will we get performance regression? > * -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40457) upgrade jackson data mapper to latest
[ https://issues.apache.org/jira/browse/SPARK-40457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607082#comment-17607082 ] Bilna commented on SPARK-40457: --- [~hyukjin.kwon] it is org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13 > upgrade jackson data mapper to latest > -- > > Key: SPARK-40457 > URL: https://issues.apache.org/jira/browse/SPARK-40457 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Bilna >Priority: Major > > Upgrade jackson-mapper-asl to the latest to resolve CVE-2019-10172 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40504) Make yarn appmaster load config from client
[ https://issues.apache.org/jira/browse/SPARK-40504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengchenyu updated SPARK-40504: Description: In yarn federation mode, config in client side and nm side may be different. AppMaster should override config from client side. For example: In client side, yarn.resourcemanager.ha.rm-ids are yarn routers. In nm side, yarn.resourcemanager.ha.rm-ids are the rms of subcluster. was:In yarn federation mode, config in client side and nm side may be different. AppMaster should override config from client side. > Make yarn appmaster load config from client > --- > > Key: SPARK-40504 > URL: https://issues.apache.org/jira/browse/SPARK-40504 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.1 >Reporter: zhengchenyu >Priority: Major > > In yarn federation mode, config in client side and nm side may be different. > AppMaster should override config from client side. > For example: > In client side, yarn.resourcemanager.ha.rm-ids are yarn routers. > In nm side, yarn.resourcemanager.ha.rm-ids are the rms of subcluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40501) add PushProjectionThroughLimit for Optimizer
[ https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-40501: Description: h4. It took a long time to fetch out when run as follow code in spark-shell: spark.sql("select * from xxx where event_day = '20220919' limit 1").show() [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png] [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png] > add PushProjectionThroughLimit for Optimizer > > > Key: SPARK-40501 > URL: https://issues.apache.org/jira/browse/SPARK-40501 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > h4. It took a long time to fetch out > when run as follow code in spark-shell: > spark.sql("select * from xxx where event_day = '20220919' limit 1").show() > [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png] > [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`
[ https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40500: Assignee: (was: Apache Spark) > Use `pd.items` instead of `pd.iteritems` > > > Key: SPARK-40500 > URL: https://issues.apache.org/jira/browse/SPARK-40500 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`
[ https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40500: Assignee: Apache Spark > Use `pd.items` instead of `pd.iteritems` > > > Key: SPARK-40500 > URL: https://issues.apache.org/jira/browse/SPARK-40500 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40501) add PushProjectionThroughLimit for Optimizer
[ https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40501: Assignee: (was: Apache Spark) > add PushProjectionThroughLimit for Optimizer > > > Key: SPARK-40501 > URL: https://issues.apache.org/jira/browse/SPARK-40501 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40501) add PushProjectionThroughLimit for Optimizer
[ https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607033#comment-17607033 ] Apache Spark commented on SPARK-40501: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/37941 > add PushProjectionThroughLimit for Optimizer > > > Key: SPARK-40501 > URL: https://issues.apache.org/jira/browse/SPARK-40501 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`
[ https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607034#comment-17607034 ] Apache Spark commented on SPARK-40500: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37947 > Use `pd.items` instead of `pd.iteritems` > > > Key: SPARK-40500 > URL: https://issues.apache.org/jira/browse/SPARK-40500 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40501) add PushProjectionThroughLimit for Optimizer
[ https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40501: Assignee: Apache Spark > add PushProjectionThroughLimit for Optimizer > > > Key: SPARK-40501 > URL: https://issues.apache.org/jira/browse/SPARK-40501 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40502) Support dataframe API use jdbc data source in PySpark
CaoYu created SPARK-40502: - Summary: Support dataframe API use jdbc data source in PySpark Key: SPARK-40502 URL: https://issues.apache.org/jira/browse/SPARK-40502 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 3.3.0 Reporter: CaoYu When i using pyspark, i wanna get data from mysql database. so i want use JDBCRDD like java\scala. But that is not be supported in PySpark. For some reasons, i can't using DataFrame API, only can use RDD(datastream) API. Even i know the DataFrame can get data from jdbc source fairly well. So i want to implement functionality that can use rdd to get data from jdbc source for PySpark. *But i don't know if that are necessary for PySpark. so we can discuss it.* {*}If it is necessary for PySpark{*}{*}, i want to contribute to Spark.{*} *i hope this Jira task can assigned to me, so i can start working to implement it.* *if not, please close this Jira task.* *thanks a lot.* -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40503) Add resampling to API references
Ruifeng Zheng created SPARK-40503: - Summary: Add resampling to API references Key: SPARK-40503 URL: https://issues.apache.org/jira/browse/SPARK-40503 Project: Spark Issue Type: Sub-task Components: Documentation, ps Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40504) Make yarn appmaster load config from client
[ https://issues.apache.org/jira/browse/SPARK-40504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40504: Assignee: Apache Spark > Make yarn appmaster load config from client > --- > > Key: SPARK-40504 > URL: https://issues.apache.org/jira/browse/SPARK-40504 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.1 >Reporter: zhengchenyu >Assignee: Apache Spark >Priority: Major > > In yarn federation mode, config in client side and nm side may be different. > AppMaster should override config from client side. > For example: > In client side, yarn.resourcemanager.ha.rm-ids are yarn routers. > In nm side, yarn.resourcemanager.ha.rm-ids are the rms of subcluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40504) Make yarn appmaster load config from client
[ https://issues.apache.org/jira/browse/SPARK-40504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607092#comment-17607092 ] Apache Spark commented on SPARK-40504: -- User 'zhengchenyu' has created a pull request for this issue: https://github.com/apache/spark/pull/37949 > Make yarn appmaster load config from client > --- > > Key: SPARK-40504 > URL: https://issues.apache.org/jira/browse/SPARK-40504 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.1 >Reporter: zhengchenyu >Priority: Major > > In yarn federation mode, config in client side and nm side may be different. > AppMaster should override config from client side. > For example: > In client side, yarn.resourcemanager.ha.rm-ids are yarn routers. > In nm side, yarn.resourcemanager.ha.rm-ids are the rms of subcluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40491) Remove too old TODO for JdbcRDD
[ https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607045#comment-17607045 ] CaoYu commented on SPARK-40491: --- Maybe we can just not remove these. I have already created https://issues.apache.org/jira/browse/SPARK-40502, please take a look. i want try to implement jdbc data source in pyspark. Also I'm interested in this task for scala. if possible, Please assign me this task, I want to try to get it done > Remove too old TODO for JdbcRDD > --- > > Key: SPARK-40491 > URL: https://issues.apache.org/jira/browse/SPARK-40491 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > According to the legacy document of JdbcRDD, we need to expose a jdbcRDD > function in SparkContext. > In fact, this is a very old TODO and we need to revisit if this is still > necessary. Since Spark SQL is the new core, I'm not sure if anyone is > interested in a new API to create jdbc RDD. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607077#comment-17607077 ] Apache Spark commented on SPARK-40327: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37948 > Increase pandas API coverage for pandas API on Spark > > > Key: SPARK-40327 > URL: https://issues.apache.org/jira/browse/SPARK-40327 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Increasing the pandas API coverage for Apache Spark 3.4.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40327: Assignee: (was: Apache Spark) > Increase pandas API coverage for pandas API on Spark > > > Key: SPARK-40327 > URL: https://issues.apache.org/jira/browse/SPARK-40327 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Increasing the pandas API coverage for Apache Spark 3.4.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40327: Assignee: Apache Spark > Increase pandas API coverage for pandas API on Spark > > > Key: SPARK-40327 > URL: https://issues.apache.org/jira/browse/SPARK-40327 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > Increasing the pandas API coverage for Apache Spark 3.4.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40504) Make yarn appmaster load config from client
[ https://issues.apache.org/jira/browse/SPARK-40504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40504: Assignee: (was: Apache Spark) > Make yarn appmaster load config from client > --- > > Key: SPARK-40504 > URL: https://issues.apache.org/jira/browse/SPARK-40504 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.1 >Reporter: zhengchenyu >Priority: Major > > In yarn federation mode, config in client side and nm side may be different. > AppMaster should override config from client side. > For example: > In client side, yarn.resourcemanager.ha.rm-ids are yarn routers. > In nm side, yarn.resourcemanager.ha.rm-ids are the rms of subcluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314 ] xsys edited comment on SPARK-40439 at 9/20/22 5:17 PM: --- [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. I believe it could get non-trivial for users to discover that spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting the code, I thought that {{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? was (Author: JIRAUSER288838): [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY }}works. I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting the code, I thought that {{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? > DECIMAL value with more precision than what is defined in the schema raises > exception in SparkSQL but evaluates to NULL for DataFrame > - > > Key: SPARK-40439 > URL: https://issues.apache.org/jira/browse/SPARK-40439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to store a DECIMAL value {{333.22}} with more > precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This > leads to a {{NULL}} value being stored if the table is created using > DataFrames via {{{}spark-shell{}}}. However, it leads to the following > exception if the table is created via {{{}spark-sql{}}}: > {code:java} > Failed in [insert into decimal_extra_precision select 333.22] > java.lang.ArithmeticException: > Decimal(expanded,333.22,21,10}) cannot be represented as > Decimal(20, 10){code} > h3. Step to reproduce: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > Execute the following: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > h3. Expected behavior > We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) > to behave consistently for the same data type & input combination > ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). > Here is a simplified example in {{{}spark-shell{}}}, where insertion of the > aforementioned decimal value evaluates to a {{{}NULL{}}}: > {code:java} > scala> import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.{Row, SparkSession} > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> val rdd = > sc.parallelize(Seq(Row(BigDecimal("333.22" > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[0] at parallelize at :27 > scala> val schema = new StructType().add(StructField("c1", DecimalType(20, > 10), true)) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(c1,DecimalType(20,10),true)) > scala> val df = spark.createDataFrame(rdd, schema) > df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > scala> df.show() > ++ > | c1| > ++ > |null| > ++ > scala> > df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision") > 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > scala> spark.sql("select * from decimal_extra_precision;") > res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > {code} > h3. Root Cause > The exception is being raised from > [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] > ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in > [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].): > {code:java} > private[sql] def toPrecision( > precision: Int, >
[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314 ] xsys edited comment on SPARK-40439 at 9/20/22 5:18 PM: --- [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. }}I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting the code, I thought that {{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? was (Author: JIRAUSER288838): [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. I believe it could get non-trivial for users to discover that spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting the code, I thought that {{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? > DECIMAL value with more precision than what is defined in the schema raises > exception in SparkSQL but evaluates to NULL for DataFrame > - > > Key: SPARK-40439 > URL: https://issues.apache.org/jira/browse/SPARK-40439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to store a DECIMAL value {{333.22}} with more > precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This > leads to a {{NULL}} value being stored if the table is created using > DataFrames via {{{}spark-shell{}}}. However, it leads to the following > exception if the table is created via {{{}spark-sql{}}}: > {code:java} > Failed in [insert into decimal_extra_precision select 333.22] > java.lang.ArithmeticException: > Decimal(expanded,333.22,21,10}) cannot be represented as > Decimal(20, 10){code} > h3. Step to reproduce: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > Execute the following: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > h3. Expected behavior > We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) > to behave consistently for the same data type & input combination > ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). > Here is a simplified example in {{{}spark-shell{}}}, where insertion of the > aforementioned decimal value evaluates to a {{{}NULL{}}}: > {code:java} > scala> import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.{Row, SparkSession} > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> val rdd = > sc.parallelize(Seq(Row(BigDecimal("333.22" > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[0] at parallelize at :27 > scala> val schema = new StructType().add(StructField("c1", DecimalType(20, > 10), true)) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(c1,DecimalType(20,10),true)) > scala> val df = spark.createDataFrame(rdd, schema) > df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > scala> df.show() > ++ > | c1| > ++ > |null| > ++ > scala> > df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision") > 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > scala> spark.sql("select * from decimal_extra_precision;") > res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > {code} > h3. Root Cause > The exception is being raised from > [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] > ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in > [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].): > {code:java} > private[sql] def toPrecision( > precision: Int, >
[jira] [Commented] (SPARK-31404) file source backward compatibility after calendar switch
[ https://issues.apache.org/jira/browse/SPARK-31404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607326#comment-17607326 ] Sachit commented on SPARK-31404: Hi [~cloud_fan] , Could you please confirm if we need to use below properties to ensure it can read data written by spark2.4x spark.conf.set("spark.sql.parquet.int96RebaseModeInRead","CORRECTED") spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite","CORRECTED") Regards Sachit > file source backward compatibility after calendar switch > > > Key: SPARK-31404 > URL: https://issues.apache.org/jira/browse/SPARK-31404 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Blocker > Fix For: 3.0.0 > > Attachments: Switch to Java 8 time API in Spark 3.0.pdf > > > In Spark 3.0, we switch to the Proleptic Gregorian calendar by using the Java > 8 datetime APIs. This makes Spark follow the ISO and SQL standard, but > introduces some backward compatibility problems: > 1. may read wrong data from the data files written by Spark 2.4 > 2. may have perf regression -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314 ] xsys edited comment on SPARK-40439 at 9/20/22 5:20 PM: --- [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy would work. For instance, after inspecting the code, I thought that nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? was (Author: JIRAUSER288838): [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. }} I believe it could get non trivial for users to discover that {{spark.sql.storeAssignmentPolicy }}would work. For instance, after inspecting the code, I thought that {{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? > DECIMAL value with more precision than what is defined in the schema raises > exception in SparkSQL but evaluates to NULL for DataFrame > - > > Key: SPARK-40439 > URL: https://issues.apache.org/jira/browse/SPARK-40439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to store a DECIMAL value {{333.22}} with more > precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This > leads to a {{NULL}} value being stored if the table is created using > DataFrames via {{{}spark-shell{}}}. However, it leads to the following > exception if the table is created via {{{}spark-sql{}}}: > {code:java} > Failed in [insert into decimal_extra_precision select 333.22] > java.lang.ArithmeticException: > Decimal(expanded,333.22,21,10}) cannot be represented as > Decimal(20, 10){code} > h3. Step to reproduce: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > Execute the following: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > h3. Expected behavior > We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) > to behave consistently for the same data type & input combination > ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). > Here is a simplified example in {{{}spark-shell{}}}, where insertion of the > aforementioned decimal value evaluates to a {{{}NULL{}}}: > {code:java} > scala> import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.{Row, SparkSession} > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> val rdd = > sc.parallelize(Seq(Row(BigDecimal("333.22" > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[0] at parallelize at :27 > scala> val schema = new StructType().add(StructField("c1", DecimalType(20, > 10), true)) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(c1,DecimalType(20,10),true)) > scala> val df = spark.createDataFrame(rdd, schema) > df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > scala> df.show() > ++ > | c1| > ++ > |null| > ++ > scala> > df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision") > 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > scala> spark.sql("select * from decimal_extra_precision;") > res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > {code} > h3. Root Cause > The exception is being raised from > [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] > ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in > [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].): > {code:java} > private[sql] def toPrecision( > precision: Int, >
[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607314#comment-17607314 ] xsys edited comment on SPARK-40439 at 9/20/22 5:20 PM: --- [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. }} I believe it could get non trivial for users to discover that {{spark.sql.storeAssignmentPolicy }}would work. For instance, after inspecting the code, I thought that {{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? was (Author: JIRAUSER288838): [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY works. }}I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting the code, I thought that {{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? > DECIMAL value with more precision than what is defined in the schema raises > exception in SparkSQL but evaluates to NULL for DataFrame > - > > Key: SPARK-40439 > URL: https://issues.apache.org/jira/browse/SPARK-40439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to store a DECIMAL value {{333.22}} with more > precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This > leads to a {{NULL}} value being stored if the table is created using > DataFrames via {{{}spark-shell{}}}. However, it leads to the following > exception if the table is created via {{{}spark-sql{}}}: > {code:java} > Failed in [insert into decimal_extra_precision select 333.22] > java.lang.ArithmeticException: > Decimal(expanded,333.22,21,10}) cannot be represented as > Decimal(20, 10){code} > h3. Step to reproduce: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > Execute the following: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > h3. Expected behavior > We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) > to behave consistently for the same data type & input combination > ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). > Here is a simplified example in {{{}spark-shell{}}}, where insertion of the > aforementioned decimal value evaluates to a {{{}NULL{}}}: > {code:java} > scala> import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.{Row, SparkSession} > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> val rdd = > sc.parallelize(Seq(Row(BigDecimal("333.22" > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[0] at parallelize at :27 > scala> val schema = new StructType().add(StructField("c1", DecimalType(20, > 10), true)) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(c1,DecimalType(20,10),true)) > scala> val df = spark.createDataFrame(rdd, schema) > df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > scala> df.show() > ++ > | c1| > ++ > |null| > ++ > scala> > df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision") > 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > scala> spark.sql("select * from decimal_extra_precision;") > res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > {code} > h3. Root Cause > The exception is being raised from > [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] > ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in > [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].): > {code:java} > private[sql] def toPrecision( > precision: Int, >