[jira] [Created] (SPARK-42672) Document error class list

2023-03-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-42672:
---

 Summary: Document error class list
 Key: SPARK-42672
 URL: https://issues.apache.org/jira/browse/SPARK-42672
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 3.4.0
Reporter: Haejoon Lee









[jira] [Commented] (SPARK-41497) Accumulator undercounting in the case of retry task with rdd cache

2023-03-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696526#comment-17696526
 ] 

Apache Spark commented on SPARK-41497:
--

User 'ivoson' has created a pull request for this issue:
https://github.com/apache/spark/pull/40281

> Accumulator undercounting in the case of retry task with rdd cache
> --
>
> Key: SPARK-41497
> URL: https://issues.apache.org/jira/browse/SPARK-41497
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.2, 3.3.1
>Reporter: wuyi
>Assignee: Tengfei Huang
>Priority: Major
> Fix For: 3.5.0
>
>
> Accumulators can be undercounted when a retried task has an rdd cache. See 
> the example below; a complete and reproducible example is also available at 
> [https://github.com/apache/spark/compare/master...Ngone51:spark:fix-acc]
>   
> {code:scala}
> test("SPARK-XXX") {
>   // Set up a cluster with 2 executors.
>   val conf = new SparkConf()
>     .setMaster("local-cluster[2, 1, 1024]")
>     .setAppName("TaskSchedulerImplSuite")
>   sc = new SparkContext(conf)
>   // Set up a custom task scheduler that fails the first task attempt of the
>   // job submitted below. The failed first attempt completes its computation
>   // (accumulator accounting, result caching) but fails to report its success
>   // status due to a concurrent executor loss. The second attempt succeeds.
>   taskScheduler = setupSchedulerWithCustomStatusUpdate(sc)
>   val myAcc = sc.longAccumulator("myAcc")
>   // Create an rdd with a single partition so there is only one task, and
>   // persist it with MEMORY_ONLY_2 so the result is cached on both executors.
>   val rdd = sc.parallelize(0 until 10, 1).mapPartitions { iter =>
>     myAcc.add(100)
>     iter.map(x => x + 1)
>   }.persist(StorageLevel.MEMORY_ONLY_2)
>   // This passes because the second task attempt succeeds.
>   assert(rdd.count() === 10)
>   // This fails: `myAcc.add(100)` is not executed during the second attempt,
>   // because the second attempt loads the rdd cache directly instead of
>   // running the task function, so `myAcc.add(100)` is skipped.
>   assert(myAcc.value === 100)
> }
> {code}
>  
> We can also hit this issue with decommissioning, even if the rdd has only 
> one copy. For example, decommissioning could migrate the rdd cache block to 
> another executor (effectively the same situation as having 2 copies) and 
> the decommissioned executor could be lost before the task reports its 
> success status to the driver.
>  
> The issue is more complicated to fix than expected. I have tried several 
> fixes, but none of them is ideal:
> Option 1: Clean up any rdd cache related to the failed task. In practice, 
> this already fixes the issue in most cases. However, theoretically, an rdd 
> cache could be reported to the driver right after the driver cleans up the 
> failed task's caches, due to asynchronous communication, so this option 
> can't resolve the issue thoroughly;
> Option 2: Disallow rdd cache reuse across task attempts for the same task. 
> This fixes the issue completely, but it also affects cases where the cache 
> could safely be reused across attempts (e.g., when the task performs no 
> accumulator operation), which can cause a performance regression;
> Option 3: Introduce an accumulator cache. First, this requires a new 
> framework for supporting accumulator caches; second, the driver would need 
> extra logic to decide whether a cached accumulator value should be reported 
> to the user, to avoid overcounting. For example, in the case of a task 
> retry the value should be reported, but in the case of rdd cache reuse it 
> shouldn't be (should it?);
> Option 4: Validate task success when a task tries to load the rdd cache. 
> This defines an rdd cache as valid/accessible only if the producing task 
> has succeeded. It could be either overkill or a bit complex, because Spark 
> currently cleans up task state once a task finishes, so we would need to 
> maintain a structure recording whether a task ever succeeded (see the 
> sketch below).
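> A minimal, hypothetical sketch of the bookkeeping Option 4 would need. None 
> of the names below ({{TaskSuccessTracker}}, {{markSucceeded}}, 
> {{isCacheValid}}) exist in Spark; the code only illustrates the idea of 
> gating cache reads on recorded task success:
> {code:scala}
> // Hypothetical sketch, not Spark code: record which tasks ever succeeded
> // so that a cached block is served only if its producing task succeeded.
> import scala.collection.concurrent.TrieMap
>
> class TaskSuccessTracker {
>   // (stageId, partitionId) -> true once any attempt of the task succeeded
>   private val succeededTasks = TrieMap.empty[(Int, Int), Boolean]
>
>   // Called by the driver when a task attempt reports success.
>   def markSucceeded(stageId: Int, partitionId: Int): Unit =
>     succeededTasks.put((stageId, partitionId), true)
>
>   // Called before serving a cached block produced by this task: treat the
>   // cache as valid only if the producing task is known to have succeeded.
>   def isCacheValid(stageId: Int, partitionId: Int): Boolean =
>     succeededTasks.getOrElse((stageId, partitionId), false)
> }
> {code}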






[jira] [Commented] (SPARK-42671) Fix bug for createDataFrame from complex type schema

2023-03-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696504#comment-17696504
 ] 

Apache Spark commented on SPARK-42671:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40280

> Fix bug for createDataFrame from complex type schema
> 
>
> Key: SPARK-42671
> URL: https://issues.apache.org/jira/browse/SPARK-42671
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Assigned] (SPARK-42671) Fix bug for createDataFrame from complex type schema

2023-03-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42671:


Assignee: (was: Apache Spark)

> Fix bug for createDataFrame from complex type schema
> 
>
> Key: SPARK-42671
> URL: https://issues.apache.org/jira/browse/SPARK-42671
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Commented] (SPARK-42671) Fix bug for createDataFrame from complex type schema

2023-03-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696503#comment-17696503
 ] 

Apache Spark commented on SPARK-42671:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40280

> Fix bug for createDataFrame from complex type schema
> 
>
> Key: SPARK-42671
> URL: https://issues.apache.org/jira/browse/SPARK-42671
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Assigned] (SPARK-42671) Fix bug for createDataFrame from complex type schema

2023-03-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42671:


Assignee: Apache Spark

> Fix bug for createDataFrame from complex type schema
> 
>
> Key: SPARK-42671
> URL: https://issues.apache.org/jira/browse/SPARK-42671
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Created] (SPARK-42671) Fix bug for createDataFrame from complex type schema

2023-03-04 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-42671:
---

 Summary: Fix bug for createDataFrame from complex type schema
 Key: SPARK-42671
 URL: https://issues.apache.org/jira/browse/SPARK-42671
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.4.1
Reporter: BingKun Pan









[jira] [Commented] (SPARK-42623) parameter markers not blocked in DDL

2023-03-04 Thread zzzzming95 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696501#comment-17696501
 ] 

zzzzming95 commented on SPARK-42623:


I can try to fix this issue.

> parameter markers not blocked in DDL
> 
>
> Key: SPARK-42623
> URL: https://issues.apache.org/jira/browse/SPARK-42623
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
> Fix For: 3.4.0
>
>
> The parameterized query code does not block DDL statements from referencing 
> parameter markers. For example:
>  
> {code:java}
> scala> spark.sql(
>          sqlText = "CREATE VIEW v1 AS SELECT current_timestamp() + :later as stamp, :x * :x AS square",
>          args = Map("later" -> "INTERVAL'3' HOUR", "x" -> "15.0")).show()
> ++
> ||
> ++
> ++
> {code}
> It appears we do have some protection, but it only kicks in when the view 
> is invoked:
>  
> {code:java}
> scala> spark.sql(sqlText = "SELECT * FROM v1",
>          args = Map("later" -> "INTERVAL'3' HOUR", "x" -> "15.0")).show()
> org.apache.spark.sql.AnalysisException: [UNBOUND_SQL_PARAMETER] Found the
> unbound parameter: `later`. Please, fix `args` and provide a mapping of
> the parameter to a SQL literal.; line 1 pos 29
> {code}
> Right now I think the affected statements are:
> * DEFAULT definitions
> * VIEW definitions
> but any other standard expression added in the future is also at risk, 
> such as SQL functions or GENERATED COLUMN.
> CREATE TABLE AS is debatable, since it executes the query at definition 
> time only.
> For simplicity I propose to block the feature in ANY DDL statement (CREATE, 
> ALTER).
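> A toy sketch of the kind of guard this would need. The AST classes below 
> ({{Statement}}, {{CreateView}}, {{Parameter}}, ...) are hypothetical 
> stand-ins, not Spark's parser or plan types; the point is only "reject 
> parameter markers anywhere under a DDL node":
> {code:scala}
> // Hypothetical toy AST, not Spark internals.
> sealed trait Expr { def children: Seq[Expr] = Nil }
> case class Literal(value: Any) extends Expr
> case class Parameter(name: String) extends Expr
> case class Add(left: Expr, right: Expr) extends Expr {
>   override def children: Seq[Expr] = Seq(left, right)
> }
>
> sealed trait Statement
> case class CreateView(name: String, query: Expr) extends Statement
> case class Select(projection: Expr) extends Statement
>
> // True if a parameter marker appears anywhere in the expression tree.
> def containsParameter(e: Expr): Boolean =
>   e.isInstanceOf[Parameter] || e.children.exists(containsParameter)
>
> // The proposed rule: fail DDL at definition time; queries stay unaffected.
> def checkParameters(stmt: Statement): Unit = stmt match {
>   case CreateView(name, query) if containsParameter(query) =>
>     throw new IllegalArgumentException(
>       s"Parameter markers are not allowed in DDL: CREATE VIEW $name")
>   case _ => // SELECT etc. may keep using parameter markers
> }
> {code}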
>  
>  






[jira] [Updated] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression

2023-03-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-42635:
-
Fix Version/s: 3.3.3

> Several counter-intuitive behaviours in the TimestampAdd expression
> ---
>
> Key: SPARK-42635
> URL: https://issues.apache.org/jira/browse/SPARK-42635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.3.2
>Reporter: Chenhao Li
>Assignee: Chenhao Li
>Priority: Major
> Fix For: 3.3.3, 3.4.1
>
>
> 1. When the time is close to a daylight saving time transition, the result 
> may be discontinuous and not monotonic.
> We currently have:
> {code:scala}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------------+
> |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------------+
> |                                                     2011-03-13 03:59:59|
> +------------------------------------------------------------------------+
> scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------+
> |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------+
> |                                                2011-03-13 03:00:00|
> +------------------------------------------------------------------+
> {code}
>  
> In the second query, adding one more second sets the time back one hour 
> instead. Moreover, there are only {{23 * 3600}} seconds from {{2011-03-12 
> 03:00:00}} to {{2011-03-13 03:00:00}}, not {{24 * 3600}}, due to the 
> daylight saving time transition.
> The root cause is that the Spark code at 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790]
>  wrongly assumes every day has {{MICROS_PER_DAY}} microseconds, and does 
> the day and time-in-day split before looking at the timezone.
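> For intuition, here is a self-contained {{java.time}} snippet (plain JDK, 
> not Spark code) showing the two behaviours around this transition:
> {code:scala}
> import java.time.{ZoneId, ZonedDateTime}
>
> val zone = ZoneId.of("America/Los_Angeles")
> val start = ZonedDateTime.of(2011, 3, 12, 3, 0, 0, 0, zone)
>
> // Timeline-correct: 2011-03-13 is only 23 hours long in this zone, so
> // adding 24 * 3600 real seconds lands on 04:00, not 03:00.
> println(start.plusSeconds(24 * 3600))
> // 2011-03-13T04:00-07:00[America/Los_Angeles]
>
> // Wall-clock addition ("a day is always a day") yields 03:00, which is
> // what the buggy day/time-in-day split effectively computes.
> println(start.plusDays(1))
> // 2011-03-13T03:00-07:00[America/Los_Angeles]
> {code}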
> 2. Adding month, quarter, and year silently ignores Int overflow during 
> unit conversion.
> The root cause is 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]:
>  {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without 
> checking for overflow. Note that we do check for overflow when adding the 
> amount to the timestamp, so the behavior is inconsistent.
> This can cause counter-intuitive results like this:
> {code:scala}
> scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |                                                1969-09-01 00:00:00|
> +------------------------------------------------------------------+
> {code}
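> A small plain-Scala illustration of the silent wrap (not the actual patch); 
> a checked multiply such as {{Math.multiplyExact}} would raise an error 
> instead of producing a wrong answer:
> {code:scala}
> val quantity = 1431655764            // quarters, as in the example above
> // Int multiplication silently wraps: 1431655764 * 3 overflows to -4, so
> // Spark effectively adds -4 months, hence the "1969-09-01" result above.
> val wrapped = quantity * 3           // -4
>
> // A checked conversion surfaces the overflow instead:
> try Math.multiplyExact(quantity, 3)
> catch { case e: ArithmeticException => println(s"overflow detected: $e") }
> {code}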
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond, 
> microsecond) silently ignores Long overflow during unit conversion.
> This is similar to the previous problem:
> {code:scala}
> scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384                                |
> +-------------------------------------------------------------+
> {code}
>  






[jira] [Assigned] (SPARK-42670) Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings

2023-03-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42670:


Assignee: (was: Apache Spark)

> Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings
> 
>
> Key: SPARK-42670
> URL: https://issues.apache.org/jira/browse/SPARK-42670
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Assigned] (SPARK-42670) Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings

2023-03-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42670:


Assignee: Apache Spark

> Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings
> 
>
> Key: SPARK-42670
> URL: https://issues.apache.org/jira/browse/SPARK-42670
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Commented] (SPARK-42670) Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings

2023-03-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696450#comment-17696450
 ] 

Apache Spark commented on SPARK-42670:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40278

> Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings
> 
>
> Key: SPARK-42670
> URL: https://issues.apache.org/jira/browse/SPARK-42670
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Created] (SPARK-42670) Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings

2023-03-04 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-42670:
---

 Summary: Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate 
build warnings
 Key: SPARK-42670
 URL: https://issues.apache.org/jira/browse/SPARK-42670
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Commented] (SPARK-42562) UnresolvedLambdaVariables in python do not need unique names

2023-03-04 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696429#comment-17696429
 ] 

jiaan.geng commented on SPARK-42562:


I got it!

> UnresolvedLambdaVariables in python do not need unique names
> 
>
> Key: SPARK-42562
> URL: https://issues.apache.org/jira/browse/SPARK-42562
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>
> UnresolvedLambdaVariables do not need unique names in Python. We already 
> did this for the Scala client, and it is good to have parity between the 
> two implementations.


