[jira] [Created] (SPARK-42672) Document error class list
Haejoon Lee created SPARK-42672:
-----------------------------------

             Summary: Document error class list
                 Key: SPARK-42672
                 URL: https://issues.apache.org/jira/browse/SPARK-42672
             Project: Spark
          Issue Type: Sub-task
          Components: Documentation, PySpark
    Affects Versions: 3.4.0
            Reporter: Haejoon Lee
[jira] [Commented] (SPARK-41497) Accumulator undercounting in the case of retry task with rdd cache
[ https://issues.apache.org/jira/browse/SPARK-41497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696526#comment-17696526 ]

Apache Spark commented on SPARK-41497:
--------------------------------------

User 'ivoson' has created a pull request for this issue:
https://github.com/apache/spark/pull/40281

> Accumulator undercounting in the case of retry task with rdd cache
> -------------------------------------------------------------------
>
>                 Key: SPARK-41497
>                 URL: https://issues.apache.org/jira/browse/SPARK-41497
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.2, 3.3.1
>            Reporter: wuyi
>            Assignee: Tengfei Huang
>            Priority: Major
>             Fix For: 3.5.0
>
> An accumulator can be undercounted when a retried task has an rdd cache. See the example below; a complete, reproducible example is at [https://github.com/apache/spark/compare/master...Ngone51:spark:fix-acc]
>
> {code:scala}
> test("SPARK-XXX") {
>   // Set up a cluster with 2 executors.
>   val conf = new SparkConf()
>     .setMaster("local-cluster[2, 1, 1024]").setAppName("TaskSchedulerImplSuite")
>   sc = new SparkContext(conf)
>   // Set up a custom task scheduler. The scheduler will fail the first task attempt of the
>   // job submitted below. In particular, the failed first attempt succeeds in its computation
>   // (accumulator accounting, result caching) but fails to report its success status due to a
>   // concurrent executor loss. The second task attempt succeeds.
>   taskScheduler = setupSchedulerWithCustomStatusUpdate(sc)
>   val myAcc = sc.longAccumulator("myAcc")
>   // Create an rdd with only one partition, so there's only one task, and specify the storage
>   // level MEMORY_ONLY_2 so that the rdd result is cached on both executors.
>   val rdd = sc.parallelize(0 until 10, 1).mapPartitions { iter =>
>     myAcc.add(100)
>     iter.map(x => x + 1)
>   }.persist(StorageLevel.MEMORY_ONLY_2)
>   // This passes since the second task attempt succeeds.
>   assert(rdd.count() === 10)
>   // This fails because `myAcc.add(100)` is not executed during the second task attempt:
>   // the second attempt loads the rdd cache directly instead of executing the task function,
>   // so `myAcc.add(100)` is skipped.
>   assert(myAcc.value === 100)
> } {code}
>
> We can also hit this issue with decommissioning, even if the rdd has only one copy. For example, decommissioning can migrate the rdd cache block to another executor (the result is effectively the same as with 2 copies), and the decommissioned executor is lost before the task reports its success status to the driver.
>
> The issue is a bit more complicated to fix than expected. I have tried several fixes, but none of them is ideal:
> Option 1: Clean up any rdd cache related to the failed task. In practice, this option already fixes the issue in most cases. However, theoretically, an rdd cache could be reported to the driver right after the driver cleans up the failed task's caches, due to asynchronous communication. So this option can't resolve the issue thoroughly;
> Option 2: Disallow rdd cache reuse across task attempts for the same task. This option can 100% fix the issue. The problem is that it also affects cases where the rdd cache could safely be reused across attempts (e.g., when there is no accumulator operation in the task), which can cause a perf regression;
> Option 3: Introduce an accumulator cache. First, this requires a new framework to support the accumulator cache; second, the driver would have to improve its logic to distinguish whether a cached accumulator value should be reported to the user, to avoid overcounting. For example, in the case of a task retry the value should be reported, but in the case of rdd cache reuse it shouldn't be (should it?);
> Option 4: Validate task success when a task tries to load the rdd cache. This defines an rdd cache as valid/accessible only if the producing task has succeeded. This could be either overkill or a bit complex, because Spark currently cleans up a task's state once the task finishes, so we would need to maintain a structure recording whether a task ever succeeded. A hypothetical sketch of such a structure follows below.
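As a minimal illustration of the bookkeeping Option 4 would need, here is a hedged Scala sketch. Everything in it (TaskSuccessRegistry, markSucceeded, isCacheValid) is invented for this example and is not an existing Spark API.

{code:scala}
import scala.collection.concurrent.TrieMap

// Hypothetical driver-side registry: a cached rdd block is treated as valid
// only if the task that produced it reported success. Retaining this map past
// normal task-state cleanup is exactly the extra complexity the option notes,
// since Spark normally discards task state once a task finishes.
class TaskSuccessRegistry {
  private val succeeded = TrieMap.empty[(Int, Int), Boolean] // (stageId, partitionId)

  def markSucceeded(stageId: Int, partitionId: Int): Unit =
    succeeded.put((stageId, partitionId), true)

  // A retrying task attempt would consult this before reusing a cached block;
  // if the producer never succeeded, it recomputes the partition, which also
  // re-runs the accumulator updates instead of skipping them.
  def isCacheValid(stageId: Int, partitionId: Int): Boolean =
    succeeded.getOrElse((stageId, partitionId), false)
}
{code}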
[jira] [Commented] (SPARK-42671) Fix bug for createDataFrame from complex type schema
[ https://issues.apache.org/jira/browse/SPARK-42671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696504#comment-17696504 ]

Apache Spark commented on SPARK-42671:
--------------------------------------

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40280

> Fix bug for createDataFrame from complex type schema
> ----------------------------------------------------
>
>                 Key: SPARK-42671
>                 URL: https://issues.apache.org/jira/browse/SPARK-42671
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 3.4.1
>            Reporter: BingKun Pan
>            Priority: Minor
[jira] [Assigned] (SPARK-42671) Fix bug for createDataFrame from complex type schema
[ https://issues.apache.org/jira/browse/SPARK-42671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42671:
------------------------------------

    Assignee:     (was: Apache Spark)

> Fix bug for createDataFrame from complex type schema
> ----------------------------------------------------
>
>                 Key: SPARK-42671
>                 URL: https://issues.apache.org/jira/browse/SPARK-42671
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 3.4.1
>            Reporter: BingKun Pan
>            Priority: Minor
[jira] [Commented] (SPARK-42671) Fix bug for createDataFrame from complex type schema
[ https://issues.apache.org/jira/browse/SPARK-42671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696503#comment-17696503 ]

Apache Spark commented on SPARK-42671:
--------------------------------------

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40280

> Fix bug for createDataFrame from complex type schema
> ----------------------------------------------------
>
>                 Key: SPARK-42671
>                 URL: https://issues.apache.org/jira/browse/SPARK-42671
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 3.4.1
>            Reporter: BingKun Pan
>            Priority: Minor
[jira] [Assigned] (SPARK-42671) Fix bug for createDataFrame from complex type schema
[ https://issues.apache.org/jira/browse/SPARK-42671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42671:
------------------------------------

    Assignee: Apache Spark

> Fix bug for createDataFrame from complex type schema
> ----------------------------------------------------
>
>                 Key: SPARK-42671
>                 URL: https://issues.apache.org/jira/browse/SPARK-42671
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 3.4.1
>            Reporter: BingKun Pan
>            Assignee: Apache Spark
>            Priority: Minor
[jira] [Created] (SPARK-42671) Fix bug for createDataFrame from complex type schema
BingKun Pan created SPARK-42671:
-----------------------------------

             Summary: Fix bug for createDataFrame from complex type schema
                 Key: SPARK-42671
                 URL: https://issues.apache.org/jira/browse/SPARK-42671
             Project: Spark
          Issue Type: Bug
          Components: Connect
    Affects Versions: 3.4.1
            Reporter: BingKun Pan
[jira] [Commented] (SPARK-42623) parameter markers not blocked in DDL
[ https://issues.apache.org/jira/browse/SPARK-42623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696501#comment-17696501 ]

ming95 commented on SPARK-42623:
--------------------------------

I can try to fix this issue.

> parameter markers not blocked in DDL
> ------------------------------------
>
>                 Key: SPARK-42623
>                 URL: https://issues.apache.org/jira/browse/SPARK-42623
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.0
>            Reporter: Serge Rielau
>            Priority: Major
>             Fix For: 3.4.0
>
> The parameterized query code does not block DDL statements from referencing parameter markers. For example:
>
> {code:java}
> scala> spark.sql(sqlText = "CREATE VIEW v1 AS SELECT current_timestamp() + :later as stamp, :x * :x AS square", args = Map("later" -> "INTERVAL'3' HOUR", "x" -> "15.0")).show()
> ++
> ||
> ++
> ++
> {code}
> It appears we have some protection that only fails us when the view is invoked:
>
> {code:java}
> scala> spark.sql(sqlText = "SELECT * FROM v1", args = Map("later" -> "INTERVAL'3' HOUR", "x" -> "15.0")).show()
> org.apache.spark.sql.AnalysisException: [UNBOUND_SQL_PARAMETER] Found the unbound parameter: `later`. Please, fix `args` and provide a mapping of the parameter to a SQL literal.; line 1 pos 29
> {code}
> Right now I think the affected statements are:
> * DEFAULT definition
> * VIEW definition
> but any other future standard expression popping up is at risk, such as SQL functions or GENERATED COLUMN.
> CREATE TABLE AS is debatable, since it executes the query at definition time only.
> For simplicity I propose to block the feature in ANY DDL statement (CREATE, ALTER); a hypothetical sketch of such a guard follows below.
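To make the proposal concrete, here is a rough, hypothetical Scala sketch of the kind of fail-fast guard being proposed. The object name, the prefix check, and the regex-based marker detector are all invented for illustration; a real implementation would walk the parsed plan rather than inspect text, and none of this mirrors Spark's actual parser or analyzer.

{code:scala}
// Hypothetical guard: reject named parameter markers (e.g. :later, :x) in any
// DDL statement (CREATE, ALTER) before argument binding, instead of deferring
// the failure until the view is queried.
object ParameterMarkerGuard {
  // Crude textual detector for named markers; it can false-positive on
  // constructs like `::` casts, which is why a plan-based check is preferable.
  private val Marker = ":[A-Za-z_][A-Za-z0-9_]*".r
  private val DdlPrefixes = Seq("CREATE", "ALTER")

  def check(sqlText: String): Unit = {
    val isDdl = DdlPrefixes.exists(p => sqlText.trim.toUpperCase.startsWith(p))
    if (isDdl && Marker.findFirstIn(sqlText).isDefined) {
      throw new IllegalArgumentException(
        "[UNSUPPORTED_FEATURE] Parameter markers are not allowed in DDL statements.")
    }
  }
}

// Usage: ParameterMarkerGuard.check("CREATE VIEW v1 AS SELECT :x * :x AS square")
// would throw at definition time rather than at first SELECT from the view.
{code}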
[jira] [Updated] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk updated SPARK-42635:
-----------------------------
    Fix Version/s: 3.3.3

> Several counter-intuitive behaviours in the TimestampAdd expression
> -------------------------------------------------------------------
>
>                 Key: SPARK-42635
>                 URL: https://issues.apache.org/jira/browse/SPARK-42635
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0, 3.3.1, 3.3.2
>            Reporter: Chenhao Li
>            Assignee: Chenhao Li
>            Priority: Major
>             Fix For: 3.3.3, 3.4.1
>
> 1. When the time is close to a daylight saving time transition, the result may be discontinuous and non-monotonic.
> We currently have:
> {code:scala}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
> +-------------------------------------------------------------------------+
> |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
> +-------------------------------------------------------------------------+
> |                                                      2011-03-13 03:59:59|
> +-------------------------------------------------------------------------+
> scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
> +-------------------------------------------------------------------+
> |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
> +-------------------------------------------------------------------+
> |                                                2011-03-13 03:00:00|
> +-------------------------------------------------------------------+
> {code}
> In the second query, adding one more second sets the time back one hour instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due to the daylight saving time transition.
> The root cause of the problem is that the Spark code at [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790] wrongly assumes every day has {{MICROS_PER_DAY}} microseconds, and does the day and time-in-day split before looking at the time zone.
> 2. Adding month, quarter, and year silently ignores Int overflow during unit conversion.
> The root cause is [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]: {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking for overflow. Note that we do have overflow checking when adding the amount to the timestamp, so the behavior is inconsistent.
> This can cause counter-intuitive results like this:
> {code:scala}
> scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |                                               1969-09-01 00:00:00|
> +------------------------------------------------------------------+
> {code}
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond, microsecond) silently ignores Long overflow during unit conversion.
> This is similar to the previous problem:
> {code:scala}
> scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384                                |
> +-------------------------------------------------------------+
> {code}
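For contrast with the behaviours described above, here is a hedged Scala sketch using plain java.time rather than Spark's internals. It assumes the America/Los_Angeles zone from the examples; `TimestampAddSketch`, `addSeconds`, and `quartersToMonths` are invented names for illustration only.

{code:scala}
import java.time.{LocalDateTime, ZoneId}

object TimestampAddSketch {
  private val LA: ZoneId = ZoneId.of("America/Los_Angeles")

  // Zone-aware second addition: ZonedDateTime.plusSeconds operates on the
  // instant time-line, so it never assumes every day has MICROS_PER_DAY
  // microseconds. Around the 2011-03-13 DST transition (a 23-hour day):
  //   addSeconds(2011-03-12T03:00:00, 24 * 3600 - 1) => 2011-03-13T03:59:59
  //   addSeconds(2011-03-12T03:00:00, 24 * 3600)     => 2011-03-13T04:00:00
  // i.e. the result stays monotonic, unlike the buggy output above.
  def addSeconds(ts: LocalDateTime, seconds: Long): LocalDateTime =
    ts.atZone(LA).plusSeconds(seconds).toLocalDateTime

  // Overflow-checked unit conversion: Math.multiplyExact throws an
  // ArithmeticException instead of silently wrapping the way an unchecked
  // `quantity * 3` does, so quartersToMonths(1431655764) fails loudly.
  def quartersToMonths(quarters: Int): Int = Math.multiplyExact(quarters, 3)
}
{code}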
[jira] [Assigned] (SPARK-42670) Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings
[ https://issues.apache.org/jira/browse/SPARK-42670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42670:
------------------------------------

    Assignee:     (was: Apache Spark)

> Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings
> --------------------------------------------------------------------
>
>                 Key: SPARK-42670
>                 URL: https://issues.apache.org/jira/browse/SPARK-42670
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 3.5.0
>            Reporter: BingKun Pan
>            Priority: Minor
[jira] [Assigned] (SPARK-42670) Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings
[ https://issues.apache.org/jira/browse/SPARK-42670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42670:
------------------------------------

    Assignee: Apache Spark

> Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings
> --------------------------------------------------------------------
>
>                 Key: SPARK-42670
>                 URL: https://issues.apache.org/jira/browse/SPARK-42670
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 3.5.0
>            Reporter: BingKun Pan
>            Assignee: Apache Spark
>            Priority: Minor
[jira] [Commented] (SPARK-42670) Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings
[ https://issues.apache.org/jira/browse/SPARK-42670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696450#comment-17696450 ]

Apache Spark commented on SPARK-42670:
--------------------------------------

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40278

> Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings
> --------------------------------------------------------------------
>
>                 Key: SPARK-42670
>                 URL: https://issues.apache.org/jira/browse/SPARK-42670
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 3.5.0
>            Reporter: BingKun Pan
>            Priority: Minor
[jira] [Created] (SPARK-42670) Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings
BingKun Pan created SPARK-42670:
-----------------------------------

             Summary: Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings
                 Key: SPARK-42670
                 URL: https://issues.apache.org/jira/browse/SPARK-42670
             Project: Spark
          Issue Type: Improvement
          Components: Build
    Affects Versions: 3.5.0
            Reporter: BingKun Pan
[jira] [Commented] (SPARK-42562) UnresolvedLambdaVariables in python do not need unique names
[ https://issues.apache.org/jira/browse/SPARK-42562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696429#comment-17696429 ]

jiaan.geng commented on SPARK-42562:
------------------------------------

I got it!

> UnresolvedLambdaVariables in python do not need unique names
> -------------------------------------------------------------
>
>                 Key: SPARK-42562
>                 URL: https://issues.apache.org/jira/browse/SPARK-42562
>             Project: Spark
>          Issue Type: New Feature
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Herman van Hövell
>            Priority: Major
>
> UnresolvedLambdaVariables do not need unique names in Python. We already did this for the Scala client, and it is good to have parity between the two implementations.