[jira] [Created] (SPARK-39175) Provide runtime error query context for Cast when WSCG is off
Gengliang Wang created SPARK-39175: -- Summary: Provide runtime error query context for Cast when WSCG is off Key: SPARK-39175 URL: https://issues.apache.org/jira/browse/SPARK-39175 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Gengliang Wang Assignee: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39174) Catalogs loading swallows missing classname for ClassNotFoundException
Kent Yao created SPARK-39174: Summary: Catalogs loading swallows missing classname for ClassNotFoundException Key: SPARK-39174 URL: https://issues.apache.org/jira/browse/SPARK-39174 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.1, 3.1.2, 3.3.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39166) Provide runtime error query context for Binary Arithmetic when WSCG is off
[ https://issues.apache.org/jira/browse/SPARK-39166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-39166. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 36525 [https://github.com/apache/spark/pull/36525] > Provide runtime error query context for Binary Arithmetic when WSCG is off > -- > > Key: SPARK-39166 > URL: https://issues.apache.org/jira/browse/SPARK-39166 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > > > Currently, for most of the cases, the project > https://issues.apache.org/jira/browse/SPARK-38615 is able to show where the > runtime errors happen within the original query. > However, after trying on production, I found that the following queries won't > show where the divide by 0 error happens > {code:java} > create table aggTest(i int, j int, k int, d date) using parquet > insert into aggTest values(1, 2, 0, date'2022-01-01') > select sum(j)/sum(k),percentile(i, 0.9) from aggTest group by d{code} > With `percentile` function in the query, the plan can't execute with whole > stage codegen. Thus the child plan of `Project` is serialized to executors > for execution, from ProjectExec: > {code:java} > protected override def doExecute(): RDD[InternalRow] = { > child.execute().mapPartitionsWithIndexInternal { (index, iter) => > val project = UnsafeProjection.create(projectList, child.output) > project.initialize(index) > iter.map(project) > } > }{code} > Note that the `TreeNode.origin` is not serialized to executors since > `TreeNode` doesn't extend the trait `Serializable`, which results in an empty > query context on errors. For more details, please read > https://issues.apache.org/jira/browse/SPARK-39140 > A dummy fix is to make `TreeNode` extend the trait `Serializable`. 
However, > it can be a performance regression if the query text is long (every `TreeNode` > carries it for serialization). > A better fix is to introduce a new trait `SupportQueryContext` and > materialize the truncated query context for special expressions. This jira > targets binary arithmetic expressions only. I will create follow-ups for > the remaining expressions that support runtime error query context. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
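The trade-off above can be sketched outside Spark. In this hypothetical Python sketch (class and field names are illustrative, not Spark's actual `SupportQueryContext` implementation), the expression materializes a truncated context fragment at construction time, so serialization ships only that short string to executors instead of the full query text or origin tree:

```python
import pickle

MAX_CONTEXT_LEN = 64  # cap on what each expression carries (illustrative value)

class DivideExpr:
    """Hypothetical sketch: capture a short, pre-truncated query-context
    string at construction, so only this string is serialized."""

    def __init__(self, full_query_text, start, stop):
        fragment = full_query_text[start:stop]
        self.query_context = (
            fragment[:MAX_CONTEXT_LEN] + "..."
            if len(fragment) > MAX_CONTEXT_LEN
            else fragment
        )

    def eval(self, a, b):
        if b == 0:
            # the runtime error can point at the offending query fragment
            raise ArithmeticError("divide by zero in: " + self.query_context)
        return a // b

sql = "select sum(j)/sum(k), percentile(i, 0.9) from aggTest group by d"
expr = DivideExpr(sql, 7, 20)  # span of "sum(j)/sum(k)"

# Round-trip through serialization, as happens when a plan fragment is
# shipped to executors; the truncated context survives the trip.
remote = pickle.loads(pickle.dumps(expr))
try:
    remote.eval(1, 0)
except ArithmeticError as e:
    print(e)  # divide by zero in: sum(j)/sum(k)
```

The point of the design is that serialized size stays bounded by `MAX_CONTEXT_LEN` per expression, regardless of how long the original query text is.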
[jira] [Assigned] (SPARK-39172) Remove outer join if all output come from streamed side and buffered side keys exist unique key
[ https://issues.apache.org/jira/browse/SPARK-39172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39172: Assignee: (was: Apache Spark) > Remove outer join if all output come from streamed side and buffered side > keys exist unique key > --- > > Key: SPARK-39172 > URL: https://issues.apache.org/jira/browse/SPARK-39172 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > Improve the optimization case using the distinct keys framework. > For example: > {code:java} > SELECT t1.* FROM t1 LEFT JOIN (SELECT distinct c1 as c1 FROM t)t2 ON t1.c1 = > t2.c1 > ==> > SELECT t1.* FROM t1 {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39172) Remove outer join if all output come from streamed side and buffered side keys exist unique key
[ https://issues.apache.org/jira/browse/SPARK-39172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536428#comment-17536428 ] Apache Spark commented on SPARK-39172: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/36530 > Remove outer join if all output come from streamed side and buffered side > keys exist unique key > --- > > Key: SPARK-39172 > URL: https://issues.apache.org/jira/browse/SPARK-39172 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > Improve the optimization case using the distinct keys framework. > For example: > {code:java} > SELECT t1.* FROM t1 LEFT JOIN (SELECT distinct c1 as c1 FROM t)t2 ON t1.c1 = > t2.c1 > ==> > SELECT t1.* FROM t1 {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39172) Remove outer join if all output come from streamed side and buffered side keys exist unique key
[ https://issues.apache.org/jira/browse/SPARK-39172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39172: Assignee: Apache Spark > Remove outer join if all output come from streamed side and buffered side > keys exist unique key > --- > > Key: SPARK-39172 > URL: https://issues.apache.org/jira/browse/SPARK-39172 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > Improve the optimization case using the distinct keys framework. > For example: > {code:java} > SELECT t1.* FROM t1 LEFT JOIN (SELECT distinct c1 as c1 FROM t)t2 ON t1.c1 = > t2.c1 > ==> > SELECT t1.* FROM t1 {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
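The safety argument behind this rewrite can be checked on plain in-memory data. The sketch below (illustrative only, not Spark's optimizer code) shows that when the right side's join keys are distinct and only left-side columns are projected, a LEFT JOIN neither drops nor duplicates left rows, so removing it preserves the result; duplicate keys on the right would break the rewrite:

```python
def left_join_project_left(t1, t2_keys):
    """LEFT JOIN t1 against a keyed right side, projecting only t1's columns."""
    out = []
    for row in t1:
        matches = sum(1 for k in t2_keys if k == row["c1"])
        # a matched row appears once per match; an unmatched row still
        # survives a LEFT JOIN (padded with nulls we never project)
        out.extend([row] * max(matches, 1))
    return out

t1 = [{"c1": 1, "x": "a"}, {"c1": 2, "x": "b"}, {"c1": 9, "x": "c"}]
t2_distinct_c1 = [1, 2, 5]  # SELECT DISTINCT c1 FROM t -- keys are unique

# Unique right-side keys: the join is a no-op over t1's output.
print(left_join_project_left(t1, t2_distinct_c1) == t1)  # True

# Duplicate right-side keys would duplicate left rows, so the
# elimination is only valid under the distinct-keys guarantee.
print(left_join_project_left(t1, [1, 1]) == t1)  # False
```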
[jira] [Updated] (SPARK-39173) The error message is different if disable broadcast join
[ https://issues.apache.org/jira/browse/SPARK-39173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-39173: Description: How to reproduce this issue: {code:scala} Seq(-1, 10L).foreach { broadcastThreshold => withSQLConf( SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> broadcastThreshold.toString, SQLConf.ANSI_ENABLED.key -> "true") { val df = sql( """ |SELECT | item.i_brand_id brand_id, | avg(ss_ext_sales_price) avg_agg |FROM store_sales, item |WHERE store_sales.ss_item_sk = item.i_item_sk |GROUP BY item.i_brand_id """.stripMargin) val error = intercept[SparkException] { df.collect() } println("Error message: " + error.getMessage) } } {code} {noformat} Error message: org.apache.spark.SparkArithmeticException: [CANNOT_CHANGE_DECIMAL_PRECISION] Decimal(expanded,9.28175,38,5}) cannot be represented as Decimal(38, 6). If necessary set "spark.sql.ansi.enabled" to false to bypass this error. Error message: org.apache.spark.SparkArithmeticException: [ARITHMETIC_OVERFLOW] Overflow in sum of decimals. If necessary set spark.sql.ansi.enabled to false (except for ANSI interval type) to bypass this error. {noformat} was: How to reproduce this issue: {code:scala} Seq(-1, 10L).foreach { broadcastThreshold => withSQLConf( SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> broadcastThreshold.toString, SQLConf.ANSI_ENABLED.key -> "true") { val df = sql( """ |SELECT | item.i_brand_id brand_id, | avg(ss_ext_sales_price) avg_agg |FROM store_sales, item |WHERE store_sales.ss_item_sk = item.i_item_sk |GROUP BY item.i_brand_id """.stripMargin) df.collect() } } {code} {noformat} Error message: org.apache.spark.SparkArithmeticException: [CANNOT_CHANGE_DECIMAL_PRECISION] Decimal(expanded,9.28175,38,5}) cannot be represented as Decimal(38, 6). If necessary set "spark.sql.ansi.enabled" to false to bypass this error. Error message: org.apache.spark.SparkArithmeticException: [ARITHMETIC_OVERFLOW] Overflow in sum of decimals. 
If necessary set spark.sql.ansi.enabled to false (except for ANSI interval type) to bypass this error. {noformat} > The error message is different if disable broadcast join > > > Key: SPARK-39173 > URL: https://issues.apache.org/jira/browse/SPARK-39173 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:scala} > Seq(-1, 10L).foreach { broadcastThreshold => > withSQLConf( > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> broadcastThreshold.toString, > SQLConf.ANSI_ENABLED.key -> "true") { > val df = sql( > """ > |SELECT > | item.i_brand_id brand_id, > | avg(ss_ext_sales_price) avg_agg > |FROM store_sales, item > |WHERE store_sales.ss_item_sk = item.i_item_sk > |GROUP BY item.i_brand_id > """.stripMargin) > val error = intercept[SparkException] { > df.collect() > } > println("Error message: " + error.getMessage) > } > } > {code} > {noformat} > Error message: org.apache.spark.SparkArithmeticException: > [CANNOT_CHANGE_DECIMAL_PRECISION] > Decimal(expanded,9.28175,38,5}) cannot be > represented as Decimal(38, 6). If necessary set "spark.sql.ansi.enabled" to > false to bypass this error. > Error message: org.apache.spark.SparkArithmeticException: > [ARITHMETIC_OVERFLOW] Overflow in sum of decimals. If necessary set > spark.sql.ansi.enabled to false (except for ANSI interval type) to bypass > this error. > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39173) The error message is different if disable broadcast join
[ https://issues.apache.org/jira/browse/SPARK-39173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-39173: Description: How to reproduce this issue: {code:scala} Seq(-1, 10L).foreach { broadcastThreshold => withSQLConf( SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> broadcastThreshold.toString, SQLConf.ANSI_ENABLED.key -> "true") { val df = sql( """ |SELECT | item.i_brand_id brand_id, | avg(ss_ext_sales_price) avg_agg |FROM store_sales, item |WHERE store_sales.ss_item_sk = item.i_item_sk |GROUP BY item.i_brand_id """.stripMargin) df.collect() } } {code} {noformat} Error message: org.apache.spark.SparkArithmeticException: [CANNOT_CHANGE_DECIMAL_PRECISION] Decimal(expanded,9.28175,38,5}) cannot be represented as Decimal(38, 6). If necessary set "spark.sql.ansi.enabled" to false to bypass this error. Error message: org.apache.spark.SparkArithmeticException: [ARITHMETIC_OVERFLOW] Overflow in sum of decimals. If necessary set spark.sql.ansi.enabled to false (except for ANSI interval type) to bypass this error. {noformat} was: How to reproduce this issue: {code:scala} Seq(-1, 10L).foreach { broadcastThreshold => withSQLConf( SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> broadcastThreshold.toString, SQLConf.ANSI_ENABLED.key -> "true") { val df = sql( """ |SELECT | item.i_brand_id brand_id, | avg(ss_ext_sales_price) avg_agg |FROM store_sales, item |WHERE store_sales.ss_item_sk = item.i_item_sk |GROUP BY item.i_brand_id """.stripMargin) df.collect() } } {code} {noformat} Error message: Job aborted due to stage failure: Task 0 in stage 10.0 failed 1 times, most recent failure: Lost task 0.0 in stage 10.0 (TID 9) (localhost executor driver): org.apache.spark.SparkArithmeticException: [CANNOT_CHANGE_DECIMAL_PRECISION] Decimal(expanded,9.28175,38,5}) cannot be represented as Decimal(38, 6). If necessary set "spark.sql.ansi.enabled" to false to bypass this error. 
Error message: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 14) (localhost executor driver): org.apache.spark.SparkArithmeticException: [ARITHMETIC_OVERFLOW] Overflow in sum of decimals. If necessary set spark.sql.ansi.enabled to false (except for ANSI interval type) to bypass this error. {noformat} > The error message is different if disable broadcast join > > > Key: SPARK-39173 > URL: https://issues.apache.org/jira/browse/SPARK-39173 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:scala} > Seq(-1, 10L).foreach { broadcastThreshold => > withSQLConf( > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> broadcastThreshold.toString, > SQLConf.ANSI_ENABLED.key -> "true") { > val df = sql( > """ > |SELECT > | item.i_brand_id brand_id, > | avg(ss_ext_sales_price) avg_agg > |FROM store_sales, item > |WHERE store_sales.ss_item_sk = item.i_item_sk > |GROUP BY item.i_brand_id > """.stripMargin) > df.collect() > } > } > {code} > {noformat} > Error message: org.apache.spark.SparkArithmeticException: > [CANNOT_CHANGE_DECIMAL_PRECISION] > Decimal(expanded,9.28175,38,5}) cannot be > represented as Decimal(38, 6). If necessary set "spark.sql.ansi.enabled" to > false to bypass this error. > Error message: org.apache.spark.SparkArithmeticException: > [ARITHMETIC_OVERFLOW] Overflow in sum of decimals. If necessary set > spark.sql.ansi.enabled to false (except for ANSI interval type) to bypass > this error. > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39173) The error message is different if disable broadcast join
Yuming Wang created SPARK-39173: --- Summary: The error message is different if disable broadcast join Key: SPARK-39173 URL: https://issues.apache.org/jira/browse/SPARK-39173 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang How to reproduce this issue: {code:scala} Seq(-1, 10L).foreach { broadcastThreshold => withSQLConf( SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> broadcastThreshold.toString, SQLConf.ANSI_ENABLED.key -> "true") { val df = sql( """ |SELECT | item.i_brand_id brand_id, | avg(ss_ext_sales_price) avg_agg |FROM store_sales, item |WHERE store_sales.ss_item_sk = item.i_item_sk |GROUP BY item.i_brand_id """.stripMargin) df.collect() } } {code} {noformat} Error message: Job aborted due to stage failure: Task 0 in stage 10.0 failed 1 times, most recent failure: Lost task 0.0 in stage 10.0 (TID 9) (localhost executor driver): org.apache.spark.SparkArithmeticException: [CANNOT_CHANGE_DECIMAL_PRECISION] Decimal(expanded,9.28175,38,5}) cannot be represented as Decimal(38, 6). If necessary set "spark.sql.ansi.enabled" to false to bypass this error. Error message: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 14) (localhost executor driver): org.apache.spark.SparkArithmeticException: [ARITHMETIC_OVERFLOW] Overflow in sum of decimals. If necessary set spark.sql.ansi.enabled to false (except for ANSI interval type) to bypass this error. {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28516) Data Type Formatting Functions: `to_char`
[ https://issues.apache.org/jira/browse/SPARK-28516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-28516: --- Assignee: Daniel > Data Type Formatting Functions: `to_char` > - > > Key: SPARK-28516 > URL: https://issues.apache.org/jira/browse/SPARK-28516 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Assignee: Daniel >Priority: Major > Fix For: 3.4.0 > > > Currently, Spark does not have support for `to_char`. PgSQL, however, > [does|[https://www.postgresql.org/docs/12/functions-formatting.html]]: > Query example: > {code:sql} > SELECT to_char(SUM(n) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND 1 > FOLLOWING),'9D9') > {code} > ||Function||Return Type||Description||Example|| > |{{to_char(}}{{timestamp}}{{, }}{{text}}{{)}}|{{text}}|convert time stamp to > string|{{to_char(current_timestamp, 'HH12:MI:SS')}}| > |{{to_char(}}{{interval}}{{, }}{{text}}{{)}}|{{text}}|convert interval to > string|{{to_char(interval '15h 2m 12s', 'HH24:MI:SS')}}| > |{{to_char(}}{{int}}{{, }}{{text}}{{)}}|{{text}}|convert integer to > string|{{to_char(125, '999')}}| > |{{to_char}}{{(}}{{double precision}}{{, }}{{text}}{{)}}|{{text}}|convert > real/double precision to string|{{to_char(125.8::real, '999D9')}}| > |{{to_char(}}{{numeric}}{{, }}{{text}}{{)}}|{{text}}|convert numeric to > string|{{to_char(-125.8, '999D99S')}}| -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28516) Data Type Formatting Functions: `to_char`
[ https://issues.apache.org/jira/browse/SPARK-28516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28516. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36365 [https://github.com/apache/spark/pull/36365] > Data Type Formatting Functions: `to_char` > - > > Key: SPARK-28516 > URL: https://issues.apache.org/jira/browse/SPARK-28516 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > Fix For: 3.4.0 > > > Currently, Spark does not have support for `to_char`. PgSQL, however, > [does|[https://www.postgresql.org/docs/12/functions-formatting.html]]: > Query example: > {code:sql} > SELECT to_char(SUM(n) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND 1 > FOLLOWING),'9D9') > {code} > ||Function||Return Type||Description||Example|| > |{{to_char(}}{{timestamp}}{{, }}{{text}}{{)}}|{{text}}|convert time stamp to > string|{{to_char(current_timestamp, 'HH12:MI:SS')}}| > |{{to_char(}}{{interval}}{{, }}{{text}}{{)}}|{{text}}|convert interval to > string|{{to_char(interval '15h 2m 12s', 'HH24:MI:SS')}}| > |{{to_char(}}{{int}}{{, }}{{text}}{{)}}|{{text}}|convert integer to > string|{{to_char(125, '999')}}| > |{{to_char}}{{(}}{{double precision}}{{, }}{{text}}{{)}}|{{text}}|convert > real/double precision to string|{{to_char(125.8::real, '999D9')}}| > |{{to_char(}}{{numeric}}{{, }}{{text}}{{)}}|{{text}}|convert numeric to > string|{{to_char(-125.8, '999D99S')}}| -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
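The numeric templates in the table above follow PostgreSQL semantics ('9' is a digit position, 'D' the decimal point, a trailing 'S' an anchored sign). As a rough illustration of how a small subset behaves (a toy formatter under those assumptions, not Spark's or PostgreSQL's actual implementation), consider:

```python
def to_char(value, fmt):
    """Toy subset of the PostgreSQL-style numeric to_char templates:
    handles only '9', 'D', and a trailing 'S'. Illustrative, not the
    real implementation."""
    sign_suffix = fmt.endswith("S")
    if sign_suffix:
        fmt = fmt[:-1]
    int_part, _, frac_part = fmt.partition("D")
    decimals = frac_part.count("9")
    text = f"{abs(value):.{decimals}f}"
    # width: one slot per integer '9', plus the point and fraction digits
    width = len(int_part) + (1 + decimals if decimals else 0)
    text = text.rjust(width)
    if sign_suffix:
        text += "-" if value < 0 else "+"
    elif value < 0:
        text = "-" + text
    return text

print(to_char(125, "999"))         # 125
print(to_char(125.8, "999D9"))     # 125.8
print(to_char(-125.8, "999D99S"))  # 125.80-
```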
[jira] [Updated] (SPARK-39172) Remove outer join if all output come from streamed side and buffered side keys exist unique key
[ https://issues.apache.org/jira/browse/SPARK-39172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-39172: -- Summary: Remove outer join if all output come from streamed side and buffered side keys exist unique key (was: Remove outer join if all output come from streamed side and buffered side keys exist unique) > Remove outer join if all output come from streamed side and buffered side > keys exist unique key > --- > > Key: SPARK-39172 > URL: https://issues.apache.org/jira/browse/SPARK-39172 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > Improve the optimization case using the distinct keys framework. > For example: > {code:java} > SELECT t1.* FROM t1 LEFT JOIN (SELECT distinct c1 as c1 FROM t)t2 ON t1.c1 = > t2.c1 > ==> > SELECT t1.* FROM t1 {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39102) Replace the usage of guava's Files.createTempDir() with java.nio.file.Files.createTempDirectory()
[ https://issues.apache.org/jira/browse/SPARK-39102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39102: Assignee: Apache Spark > Replace the usage of guava's Files.createTempDir() with > java.nio.file.Files.createTempDirectory() > -- > > Key: SPARK-39102 > URL: https://issues.apache.org/jira/browse/SPARK-39102 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.2.1, 3.4.0 >Reporter: pralabhkumar >Assignee: Apache Spark >Priority: Minor > > Hi > There are several classes where Spark is using guava's Files.createTempDir() > which has vulnerabilities. I think it's better to move to > java.nio.file.Files.createTempDirectory() for those classes. > Classes > Java8RDDAPISuite > JavaAPISuite.java > RPackageUtilsSuite > StreamTestHelper > TestShuffleDataContext > ExternalBlockHandlerSuite > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39102) Replace the usage of guava's Files.createTempDir() with java.nio.file.Files.createTempDirectory()
[ https://issues.apache.org/jira/browse/SPARK-39102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39102: Assignee: (was: Apache Spark) > Replace the usage of guava's Files.createTempDir() with > java.nio.file.Files.createTempDirectory() > -- > > Key: SPARK-39102 > URL: https://issues.apache.org/jira/browse/SPARK-39102 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.2.1, 3.4.0 >Reporter: pralabhkumar >Priority: Minor > > Hi > There are several classes where Spark is using guava's Files.createTempDir() > which has vulnerabilities. I think it's better to move to > java.nio.file.Files.createTempDirectory() for those classes. > Classes > Java8RDDAPISuite > JavaAPISuite.java > RPackageUtilsSuite > StreamTestHelper > TestShuffleDataContext > ExternalBlockHandlerSuite > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39172) Remove outer join if all output come from streamed side and buffered side keys exist unique
XiDuo You created SPARK-39172: - Summary: Remove outer join if all output come from streamed side and buffered side keys exist unique Key: SPARK-39172 URL: https://issues.apache.org/jira/browse/SPARK-39172 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You Improve the optimization case using the distinct keys framework. For example: {code:java} SELECT t1.* FROM t1 LEFT JOIN (SELECT distinct c1 as c1 FROM t)t2 ON t1.c1 = t2.c1 ==> SELECT t1.* FROM t1 {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39102) Replace the usage of guava's Files.createTempDir() with java.nio.file.Files.createTempDirectory()
[ https://issues.apache.org/jira/browse/SPARK-39102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536413#comment-17536413 ] Yang Jie commented on SPARK-39102: -- Give a pr https://github.com/apache/spark/pull/36529 > Replace the usage of guava's Files.createTempDir() with > java.nio.file.Files.createTempDirectory() > -- > > Key: SPARK-39102 > URL: https://issues.apache.org/jira/browse/SPARK-39102 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.2.1, 3.4.0 >Reporter: pralabhkumar >Priority: Minor > > Hi > There are several classes where Spark is using guava's Files.createTempDir() > which has vulnerabilities. I think it's better to move to > java.nio.file.Files.createTempDirectory() for those classes. > Classes > Java8RDDAPISuite > JavaAPISuite.java > RPackageUtilsSuite > StreamTestHelper > TestShuffleDataContext > ExternalBlockHandlerSuite > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
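The concern behind this swap is not Java-specific. As an analogy only (illustrative, not Spark's code), Python's standard library draws the same distinction between a hand-rolled, predictable temp path and a race-free API; `tempfile.mkdtemp` is the moral equivalent of `java.nio.file.Files.createTempDirectory()`:

```python
import os
import tempfile

# Anti-pattern: a predictable, guessable path that another process could
# pre-create or symlink -- the kind of weakness guava's Files.createTempDir()
# is criticized for. We only build the name; we deliberately never create it.
predictable = os.path.join(tempfile.gettempdir(), "spark-test")

# Preferred: the standard library creates a uniquely named directory
# atomically, readable/writable only by the creating user.
safe = tempfile.mkdtemp(prefix="spark-test-")
print(os.path.isdir(safe))  # True
os.rmdir(safe)
```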
[jira] [Updated] (SPARK-39171) Unify the Cast expression
[ https://issues.apache.org/jira/browse/SPARK-39171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-39171: --- Issue Type: Improvement (was: New Feature) > Unify the Cast expression > - > > Key: SPARK-39171 > URL: https://issues.apache.org/jira/browse/SPARK-39171 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39171) Unify the Cast expression
jiaan.geng created SPARK-39171: -- Summary: Unify the Cast expression Key: SPARK-39171 URL: https://issues.apache.org/jira/browse/SPARK-39171 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.4.0 Reporter: jiaan.geng -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39041) Mapping Spark Query ResultSet/Schema to TRowSet/TTableSchema directly
[ https://issues.apache.org/jira/browse/SPARK-39041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-39041. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36373 [https://github.com/apache/spark/pull/36373] > Mapping Spark Query ResultSet/Schema to TRowSet/TTableSchema directly > - > > Key: SPARK-39041 > URL: https://issues.apache.org/jira/browse/SPARK-39041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39041) Mapping Spark Query ResultSet/Schema to TRowSet/TTableSchema directly
[ https://issues.apache.org/jira/browse/SPARK-39041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-39041: Assignee: Kent Yao > Mapping Spark Query ResultSet/Schema to TRowSet/TTableSchema directly > - > > Key: SPARK-39041 > URL: https://issues.apache.org/jira/browse/SPARK-39041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38850) Upgrade Kafka to 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-38850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38850. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36526 [https://github.com/apache/spark/pull/36526] > Upgrade Kafka to 3.2.0 > -- > > Key: SPARK-38850 > URL: https://issues.apache.org/jira/browse/SPARK-38850 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38850) Upgrade Kafka to 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-38850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38850: - Assignee: Dongjoon Hyun > Upgrade Kafka to 3.2.0 > -- > > Key: SPARK-38850 > URL: https://issues.apache.org/jira/browse/SPARK-38850 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39170) ImportError when creating pyspark.pandas document "Supported APIs" if pandas version is low.
Hyunwoo Park created SPARK-39170: Summary: ImportError when creating pyspark.pandas document "Supported APIs" if pandas version is low. Key: SPARK-39170 URL: https://issues.apache.org/jira/browse/SPARK-39170 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: Hyunwoo Park The pyspark.pandas documentation "Supported APIs" will be auto-generated. ([SPARK-38961|https://issues.apache.org/jira/browse/SPARK-38961]) At this point, we need to verify the version of pandas. It can be applied after the docker image used in github action is upgraded and republished at https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage. Related: https://github.com/apache/spark/pull/36509 -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39144) Nested subquery expressions deduplicate relations should be done bottom up
[ https://issues.apache.org/jira/browse/SPARK-39144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-39144: - Description: When we have nested subquery expressions, there is a chance that deduplicate relations could replace an attribute with a wrong one. This is because the attribute replacement is done top down rather than bottom up. This could happen if the subplan gets deduplicate relations first (thus two copies of the same relation with different attribute ids), then a more complex plan built on top of the subplan (e.g. a UNION of queries with nested subquery expressions) can trigger this wrong attribute replacement error. > Nested subquery expressions deduplicate relations should be done bottom up > -- > > Key: SPARK-39144 > URL: https://issues.apache.org/jira/browse/SPARK-39144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > > When we have nested subquery expressions, there is a chance that deduplicate > relations could replace an attribute with a wrong one. This is because the > attribute replacement is done top down rather than bottom up. This could happen > if the subplan gets deduplicate relations first (thus two copies of the same > relation with different attribute ids), then a more complex plan built on top of the > subplan (e.g. a UNION of queries with nested subquery expressions) can > trigger this wrong attribute replacement error. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39144) Nested subquery expressions deduplicate relations should be done bottom up
[ https://issues.apache.org/jira/browse/SPARK-39144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-39144: - Summary: Nested subquery expressions deduplicate relations should be done bottom up (was: Spark SQL replace wrong attributes for nested subquery expression in which all tables are the same relation) > Nested subquery expressions deduplicate relations should be done bottom up > -- > > Key: SPARK-39144 > URL: https://issues.apache.org/jira/browse/SPARK-39144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39169) Optimize FIRST when used as a single aggregate function
[ https://issues.apache.org/jira/browse/SPARK-39169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39169: Assignee: Apache Spark > Optimize FIRST when used as a single aggregate function > --- > > Key: SPARK-39169 > URL: https://issues.apache.org/jira/browse/SPARK-39169 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Vitalii Li >Assignee: Apache Spark >Priority: Major > > When `FIRST` is a single aggregate function in `Aggregate` we could either > rewrite the whole query or optimize the execution logic. > * Plan => `SELECT FIRST(<expr>) FROM <table>` => `SELECT <expr> FROM <table> LIMIT 1`. > Note that setting `ignoreNulls` to `true` should block such a rewrite > since the result could differ in case all values of `<expr>` are `NULL` > * Execution => `SELECT FIRST(<expr>) FROM <table> GROUP BY <keys>` => > short circuit iteration per key once a value for `FIRST` is set. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39169) Optimize FIRST when used as a single aggregate function
[ https://issues.apache.org/jira/browse/SPARK-39169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536390#comment-17536390 ] Apache Spark commented on SPARK-39169: -- User 'vli-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/36527 > Optimize FIRST when used as a single aggregate function > --- > > Key: SPARK-39169 > URL: https://issues.apache.org/jira/browse/SPARK-39169 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Vitalii Li >Priority: Major > > When `FIRST` is a single aggregate function in `Aggregate` we could either > rewrite the whole query or optimize the execution logic. > * Plan => `SELECT FIRST(<expr>) FROM <table>` => `SELECT <expr> FROM <table> LIMIT 1`. > Note that setting `ignoreNulls` to `true` should block such a rewrite > since the result could differ in case all values of `<expr>` are `NULL` > * Execution => `SELECT FIRST(<expr>) FROM <table> GROUP BY <keys>` => > short circuit iteration per key once a value for `FIRST` is set. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39169) Optimize FIRST when used as a single aggregate function
[ https://issues.apache.org/jira/browse/SPARK-39169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39169: Assignee: (was: Apache Spark) > Optimize FIRST when used as a single aggregate function > --- > > Key: SPARK-39169 > URL: https://issues.apache.org/jira/browse/SPARK-39169 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Vitalii Li >Priority: Major > > When `FIRST` is a single aggregate function in `Aggregate` we could either > rewrite the whole query or optimize the execution logic. > * Plan => `SELECT FIRST(<expr>) FROM <table>` => `SELECT <expr> FROM <table> LIMIT 1`. > Note that setting `ignoreNulls` to `true` should block such a rewrite > since the result could differ in case all values of `<expr>` are `NULL` > * Execution => `SELECT FIRST(<expr>) FROM <table> GROUP BY <keys>` => > short circuit iteration per key once a value for `FIRST` is set. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39142) Type overloads in `pandas_udf`
[ https://issues.apache.org/jira/browse/SPARK-39142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536388#comment-17536388 ] Hyukjin Kwon commented on SPARK-39142: -- [~tigerhawkvok], Are you interested in submitting a PR? cc [~zero323] FYI > Type overloads in `pandas_udf` > --- > > Key: SPARK-39142 > URL: https://issues.apache.org/jira/browse/SPARK-39142 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Philip Kahn >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > It seems that the `returnType` in the type overloads for `pandas_udf` never > specify a generic for PySpark SQL types or explicitly list those types: > > [https://github.com/apache/spark/blob/f84018a4810867afa84658fec76494aaae6d57fc/python/pyspark/sql/pandas/functions.pyi] > > This results in static type checkers flagging the type of the decorated > functions (and their parameters) as incorrect, see > [https://github.com/microsoft/pylance-release/issues/2789] as an example. > > For someone familiar with the code base, this should be a very fast patch. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
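[Editor's note] A pyspark-free sketch of what typed overloads for `returnType` could look like. `DataType` and `LongType` here are local stand-ins for the real `pyspark.sql.types` classes, and the simplified two-argument signature is an assumption for illustration only:

```python
from typing import Any, Callable, overload


class DataType:  # stand-in for pyspark.sql.types.DataType
    pass


class LongType(DataType):  # stand-in for pyspark.sql.types.LongType
    pass


# Overloads let a static checker accept both the DDL-string form ("long")
# and a DataType instance for returnType, instead of flagging one of them.
@overload
def pandas_udf(f: Callable[..., Any], returnType: str) -> Callable[..., Any]: ...
@overload
def pandas_udf(f: Callable[..., Any], returnType: DataType) -> Callable[..., Any]: ...


def pandas_udf(f: Callable[..., Any], returnType: object) -> Callable[..., Any]:
    # Runtime behaviour is irrelevant to the typing issue; just return f.
    return f
```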
[jira] [Updated] (SPARK-38397) Support Kueue: K8s-native Job Queueing
[ https://issues.apache.org/jira/browse/SPARK-38397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38397: -- Description: There are several ways to run Spark on K8s including vanilla `spark-submit` with built-in `KubernetesClusterManager`, `spark-submit` with custom `ExternalClusterManager`, CRD-based operators (like spark-on-k8s-operator), custom K8s `schedulers`, custom `standalone pod definitions`, and so on. This issue is tracking K8s-native Job Queueing related work. * [https://github.com/kubernetes-sigs/kueue] {code} metadata: generateName: sample-job- annotations: kueue.k8s.io/queue-name: main {code} The best case is Apache Spark users use it in the future via pod templates or existing configuration. In other words, we don't need to do anything and close this JIRA without any patches. *Documentation* - https://github.com/kubernetes-sigs/kueue/tree/main/docs *Release History* - https://github.com/kubernetes-sigs/kueue/releases/tag/v0.1.0 was: There are several ways to run Spark on K8s including vanilla `spark-submit` with built-in `KubernetesClusterManager`, `spark-submit` with custom `ExternalClusterManager`, CRD-based operators (like spark-on-k8s-operator), custom K8s `schedulers`, custom `standalone pod definitions`, and so on. This issue is tracking K8s-native Job Queueing related work. * [https://github.com/kubernetes-sigs/kueue] {code} metadata: generateName: sample-job- annotations: kueue.k8s.io/queue-name: main {code} The best case is Apache Spark users use it in the future via pod templates or existing configuration. In other words, we don't need to do anything and close this JIRA without any patches. 
> Support Kueue: K8s-native Job Queueing > -- > > Key: SPARK-38397 > URL: https://issues.apache.org/jira/browse/SPARK-38397 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > There are several ways to run Spark on K8s including vanilla `spark-submit` > with built-in `KubernetesClusterManager`, `spark-submit` with custom > `ExternalClusterManager`, CRD-based operators (like spark-on-k8s-operator), > custom K8s `schedulers`, custom `standalone pod definitions`, and so on. > This issue is tracking K8s-native Job Queueing related work. > * [https://github.com/kubernetes-sigs/kueue]
> {code}
> metadata:
>   generateName: sample-job-
>   annotations:
>     kueue.k8s.io/queue-name: main
> {code}
> The best case is Apache Spark users use it in the future via pod templates or > existing configuration. In other words, we don't need to do anything and > close this JIRA without any patches. > *Documentation* > - https://github.com/kubernetes-sigs/kueue/tree/main/docs > *Release History* > - https://github.com/kubernetes-sigs/kueue/releases/tag/v0.1.0 -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38850) Upgrade Kafka to 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-38850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38850: -- Affects Version/s: 3.4.0 (was: 3.3.0) > Upgrade Kafka to 3.2.0 > -- > > Key: SPARK-38850 > URL: https://issues.apache.org/jira/browse/SPARK-38850 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39168) Consider all values in a python list when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-39168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536385#comment-17536385 ] Hyukjin Kwon commented on SPARK-39168: -- Sounds making sense to me. Are you interested in submitting a PR? cc [~itholic] who wrote {{spark.sql.pyspark.inferNestedDictAsStruct.enabled}} option. > Consider all values in a python list when inferring schema > -- > > Key: SPARK-39168 > URL: https://issues.apache.org/jira/browse/SPARK-39168 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Brian Schaefer >Priority: Major > > Schema inference fails on the following case: > {code:python} > >>> data = [{"a": [1, None], "b": [None, 2]}] > >>> spark.createDataFrame(data) > ValueError: Some of types cannot be determined after inferring > {code} > This is because only the first value in the array is used to infer the > element type for the array: > [https://github.com/apache/spark/blob/b63674ea5f746306a96ab8c39c23a230a6cb9566/python/pyspark/sql/types.py#L1260]. > The element type of the "b" array is inferred as {{NullType}} but I think it > makes sense to infer the element type as {{{}LongType{}}}. > One approach to address the above would be to infer the type from the first > non-null value in the array. However, consider a case with structs: > {code:python} > >>> spark.conf.set("spark.sql.pyspark.inferNestedDictAsStruct.enabled", True) > >>> data = [{"a": [{"b": 1}, {"c": 2}]}] > >>> spark.createDataFrame(data).schema > StructType([StructField('a', ArrayType(StructType([StructField('b', > LongType(), True)]), True), True)]) > {code} > The element type of the "a" array is inferred as a struct with one field, > "b". However, it would be convenient to infer the element type as a struct > with both fields "b" and "c". 
Omitted fields from each dictionary would > become null values in each struct:
> {code:java}
> +----------------------+
> |                     a|
> +----------------------+
> |[{1, null}, {null, 2}]|
> +----------------------+
> {code}
> To support both of these cases, the type of each array element could be inferred, and those types could be merged, similar to the approach > [here|https://github.com/apache/spark/blob/b63674ea5f746306a96ab8c39c23a230a6cb9566/python/pyspark/sql/session.py#L574-L576]. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
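[Editor's note] The merge-based inference proposed above can be sketched without PySpark. `infer_type` and `merge_types` below are illustrative stand-ins for the helpers in `pyspark/sql/types.py`, covering only ints, dicts, and None:

```python
# Infer a type per array element, then merge across all elements, so a null
# in one position or a missing dict key in one element does not lose the type.
def infer_type(value):
    if value is None:
        return None  # unknown, the analogue of NullType
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, dict):  # analogue of inferNestedDictAsStruct
        return {k: infer_type(v) for k, v in value.items()}
    raise TypeError(f"unsupported value: {value!r}")


def merge_types(a, b):
    if a is None:
        return b
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)  # union of struct fields, merging shared ones
        for key, t in b.items():
            merged[key] = merge_types(merged.get(key), t)
        return merged
    if a == b:
        return a
    raise TypeError(f"incompatible types: {a} vs {b}")


def infer_array_element_type(values):
    result = None
    for value in values:
        result = merge_types(result, infer_type(value))
    return result
```

This reproduces both behaviours the ticket wants: `[1, None]` and `[None, 2]` both infer a long element type, and `[{"b": 1}, {"c": 2}]` infers a struct with both fields.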
[jira] [Updated] (SPARK-39169) Optimize FIRST when used as a single aggregate function
[ https://issues.apache.org/jira/browse/SPARK-39169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vitalii Li updated SPARK-39169: --- Summary: Optimize FIRST when used as a single aggregate function (was: Optimize FIRST when used as non-aggregate) > Optimize FIRST when used as a single aggregate function > --- > > Key: SPARK-39169 > URL: https://issues.apache.org/jira/browse/SPARK-39169 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Vitalii Li >Priority: Major > > When `FIRST` is a single aggregate function in `Aggregate` we could either > rewrite the whole query or optimize the execution logic. > * Plan => `SELECT FIRST(<expr>) FROM <table>` => `SELECT <expr> FROM <table> LIMIT 1`. > Note that setting `ignoreNulls` to `true` should block such a rewrite > since the result could differ in case all values of `<expr>` are `NULL` > * Execution => `SELECT FIRST(<expr>) FROM <table> GROUP BY <keys>` => > short circuit iteration per key once a value for `FIRST` is set. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39169) Optimize FIRST when used as non-aggregate
[ https://issues.apache.org/jira/browse/SPARK-39169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vitalii Li updated SPARK-39169: --- Description: When `FIRST` is a single aggregate function in `Aggregate` we could either rewrite the whole query or optimize the execution logic. * Plan => `SELECT FIRST(<expr>) FROM <table>` => `SELECT <expr> FROM <table> LIMIT 1`. Note that setting `ignoreNulls` to `true` should block such a rewrite since the result could differ in case all values of `<expr>` are `NULL` * Execution => `SELECT FIRST(<expr>) FROM <table> GROUP BY <keys>` => short circuit iteration per key once a value for `FIRST` is set. was: When `FIRST` is a single aggregate function in `Aggregate` we could either rewrite the whole query or optimize the execution logic. * Plan => `SELECT FIRST(<expr>) FROM <table> [GROUP BY <keys>]` => `SELECT <expr> FROM <table> LIMIT 1`. Note that setting `ignoreNulls` to `true` should block such a rewrite since the result could differ in case all values of `<expr>` are `NULL` * Execution => `SELECT FIRST(<expr>) FROM <table> GROUP BY <keys>` => short circuit iteration per key once a value for `FIRST` is set. > Optimize FIRST when used as non-aggregate > - > > Key: SPARK-39169 > URL: https://issues.apache.org/jira/browse/SPARK-39169 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Vitalii Li >Priority: Major > > When `FIRST` is a single aggregate function in `Aggregate` we could either > rewrite the whole query or optimize the execution logic. > * Plan => `SELECT FIRST(<expr>) FROM <table>` => `SELECT <expr> FROM <table> LIMIT 1`. > Note that setting `ignoreNulls` to `true` should block such a rewrite > since the result could differ in case all values of `<expr>` are `NULL` > * Execution => `SELECT FIRST(<expr>) FROM <table> GROUP BY <keys>` => > short circuit iteration per key once a value for `FIRST` is set. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38850) Upgrade Kafka to 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-38850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536381#comment-17536381 ] Apache Spark commented on SPARK-38850: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/36526 > Upgrade Kafka to 3.2.0 > -- > > Key: SPARK-38850 > URL: https://issues.apache.org/jira/browse/SPARK-38850 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39169) Optimize FIRST when used as non-aggregate
Vitalii Li created SPARK-39169: -- Summary: Optimize FIRST when used as non-aggregate Key: SPARK-39169 URL: https://issues.apache.org/jira/browse/SPARK-39169 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Vitalii Li When `FIRST` is a single aggregate function in `Aggregate` we could either rewrite the whole query or optimize the execution logic. * Plan => `SELECT FIRST(<expr>) FROM <table> [GROUP BY <keys>]` => `SELECT <expr> FROM <table> LIMIT 1`. Note that setting `ignoreNulls` to `true` should block such a rewrite since the result could differ in case all values of `<expr>` are `NULL` * Execution => `SELECT FIRST(<expr>) FROM <table> GROUP BY <keys>` => short circuit iteration per key once a value for `FIRST` is set. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
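[Editor's note] The per-key short circuit in the Execution bullet can be illustrated in plain Python. This is a toy over lists of rows, not Spark's Aggregate operator; `ignore_nulls` mirrors the `ignoreNulls` flag:

```python
# Toy FIRST-per-group with short-circuiting: once a group's FIRST value is
# set, later rows for that key are skipped without evaluating the value.
def first_per_group(rows, key_fn, value_fn, ignore_nulls=False):
    result = {}
    for row in rows:
        key = key_fn(row)
        if key in result:
            continue  # FIRST already set for this key: short circuit
        value = value_fn(row)
        if ignore_nulls and value is None:
            continue  # keep scanning this group until a non-null appears
        result[key] = value
    return result
```

Note how `ignore_nulls=True` changes the answer when a group's first value is null — the same reason the description says `ignoreNulls` must block the `LIMIT 1` rewrite.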
[jira] [Resolved] (SPARK-34930) Install PyArrow and pandas on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-34930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34930. -- Resolution: Invalid We dropped Jenkins. > Install PyArrow and pandas on Jenkins > - > > Key: SPARK-34930 > URL: https://issues.apache.org/jira/browse/SPARK-34930 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Shane Knapp >Priority: Critical > > Looks like Jenkins machines don't have pandas and PyArrow (ever since it got > upgraded?), which results in skipping related tests in PySpark; see also > https://github.com/apache/spark/pull/31470#issuecomment-811618571 > It would be great if we can install both in Python 3.6 on Jenkins. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38850) Upgrade Kafka to 3.1.1
[ https://issues.apache.org/jira/browse/SPARK-38850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38850: -- Issue Type: Improvement (was: Bug) > Upgrade Kafka to 3.1.1 > -- > > Key: SPARK-38850 > URL: https://issues.apache.org/jira/browse/SPARK-38850 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38850) Upgrade Kafka to 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-38850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38850: -- Summary: Upgrade Kafka to 3.2.0 (was: Upgrade Kafka to 3.1.1) > Upgrade Kafka to 3.2.0 > -- > > Key: SPARK-38850 > URL: https://issues.apache.org/jira/browse/SPARK-38850 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39160) Remove workaround for ARROW-1948
[ https://issues.apache.org/jira/browse/SPARK-39160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-39160. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36518 [https://github.com/apache/spark/pull/36518] > Remove workaround for ARROW-1948 > > > Key: SPARK-39160 > URL: https://issues.apache.org/jira/browse/SPARK-39160 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39160) Remove workaround for ARROW-1948
[ https://issues.apache.org/jira/browse/SPARK-39160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-39160: Assignee: Cheng Pan > Remove workaround for ARROW-1948 > > > Key: SPARK-39160 > URL: https://issues.apache.org/jira/browse/SPARK-39160 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39164) Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception in actions
[ https://issues.apache.org/jira/browse/SPARK-39164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39164. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36500 [https://github.com/apache/spark/pull/36500] > Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception in > actions > > > Key: SPARK-39164 > URL: https://issues.apache.org/jira/browse/SPARK-39164 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Catch exceptions from asserts and IllegalStateException raised from actions, > and replace them by SparkException w/ the INTERNAL_ERROR error class. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
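[Editor's note] A rough Python analogue of the wrapping SPARK-39164 describes — the real change is in Spark's Scala code, and `InternalError` / `run_action` are invented names for illustration:

```python
# Re-raise internal invariant failures as a single, clearly labelled
# exception class, mirroring SparkException with the INTERNAL_ERROR error class.
class InternalError(Exception):
    def __init__(self, message: str):
        super().__init__(f"[INTERNAL_ERROR] {message}")


def run_action(action):
    """Run a user-facing action, converting stray AssertionErrors into
    InternalError so callers see one well-defined failure type."""
    try:
        return action()
    except AssertionError as exc:
        raise InternalError(str(exc) or "assertion failed") from exc
```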
[jira] [Updated] (SPARK-39168) Consider all values in a python list when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-39168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Schaefer updated SPARK-39168: --- Description: Schema inference fails on the following case: {code:python} >>> data = [{"a": [1, None], "b": [None, 2]}] >>> spark.createDataFrame(data) ValueError: Some of types cannot be determined after inferring {code} This is because only the first value in the array is used to infer the element type for the array: [https://github.com/apache/spark/blob/b63674ea5f746306a96ab8c39c23a230a6cb9566/python/pyspark/sql/types.py#L1260]. The element type of the "b" array is inferred as {{NullType}} but I think it makes sense to infer the element type as {{{}LongType{}}}. One approach to address the above would be to infer the type from the first non-null value in the array. However, consider a case with structs: {code:python} >>> spark.conf.set("spark.sql.pyspark.inferNestedDictAsStruct.enabled", True) >>> data = [{"a": [{"b": 1}, {"c": 2}]}] >>> spark.createDataFrame(data).schema StructType([StructField('a', ArrayType(StructType([StructField('b', LongType(), True)]), True), True)]) {code} The element type of the "a" array is inferred as a struct with one field, "b". However, it would be convenient to infer the element type as a struct with both fields "b" and "c". Omitted fields from each dictionary would become null values in each struct: {code:java} +--+ | a| +--+ |[{1, null}, {null, 2}]| +--+ {code} To support both of these cases, the type of each array element could be inferred, and those types could be merged, similar to the approach [here|https://github.com/apache/spark/blob/b63674ea5f746306a96ab8c39c23a230a6cb9566/python/pyspark/sql/session.py#L574-L576]. 
was: Schema inference fails on the following case: {code:python} >>> data = [{"a": [1, None], "b": [None, 2]}] >>> spark.createDataFrame(data) ValueError: Some of types cannot be determined after inferring {code} This is because only the first value in the array is used to infer the element type for the array: [https://github.com/apache/spark/blob/b63674ea5f746306a96ab8c39c23a230a6cb9566/python/pyspark/sql/types.py#L1260]. The element type of the "b" array is inferred as {{NullType}} but I think it makes sense to infer the element type as {{{}LongType{}}}. One approach to address the above would be to infer the type from the first non-null value in the array. However, consider a case with structs: {code:python} >>> spark.conf.set("spark.sql.pyspark.inferNestedDictAsStruct.enabled", True) >>> data = [{"a": [{"b": 1}, {"c": 2}]}] >>> spark.createDataFrame(data).schema StructType([StructField('a', ArrayType(StructType([StructField('b', LongType(), True)]), True), True)]) {code} The element type of the "a" array is inferred as a struct with one field, "b". However, it would be convenient to infer the element type as a struct with both fields "b" and "c". Omitted fields from each dictionary would become null values in each struct: {code:java} +--+ | a| +--+ |[{1, null}, {null, 1}]| +--+ {code} To support both of these cases, the type of each array element could be inferred, and those types could be merged, similar to the approach [here|https://github.com/apache/spark/blob/b63674ea5f746306a96ab8c39c23a230a6cb9566/python/pyspark/sql/session.py#L574-L576]. 
> Consider all values in a python list when inferring schema > -- > > Key: SPARK-39168 > URL: https://issues.apache.org/jira/browse/SPARK-39168 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Brian Schaefer >Priority: Major > > Schema inference fails on the following case: > {code:python} > >>> data = [{"a": [1, None], "b": [None, 2]}] > >>> spark.createDataFrame(data) > ValueError: Some of types cannot be determined after inferring > {code} > This is because only the first value in the array is used to infer the > element type for the array: > [https://github.com/apache/spark/blob/b63674ea5f746306a96ab8c39c23a230a6cb9566/python/pyspark/sql/types.py#L1260]. > The element type of the "b" array is inferred as {{NullType}} but I think it > makes sense to infer the element type as {{{}LongType{}}}. > One approach to address the above would be to infer the type from the first > non-null value in the array. However, consider a case with structs: > {code:python} > >>> spark.conf.set("spark.sql.pyspark.inferNestedDictAsStruct.enabled", True) > >>> data = [{"a": [{"b": 1}, {"c": 2}]}] > >>> spark.createDataFrame(data).schema > StructType([StructField('a', ArrayType(StructType([StructField('b', > LongType(), True)]), True), Tru
[jira] [Resolved] (SPARK-39145) CLONE - SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-39145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen resolved SPARK-39145. - Resolution: Duplicate Closing this as a duplicate.. > CLONE - SPIP: Public APIs for extended Columnar Processing Support > -- > > Key: SPARK-39145 > URL: https://issues.apache.org/jira/browse/SPARK-39145 > Project: Spark > Issue Type: Epic > Components: SQL >Affects Versions: 3.0.0 >Reporter: Abhi Shah >Assignee: Robert Joseph Evans >Priority: Major > > *strong text**SPIP: Columnar Processing Without Arrow Formatting Guarantees.* > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > The Dataset/DataFrame API in Spark currently only exposes to users one row at > a time when processing data. The goals of this are to > # Add to the current sql extensions mechanism so advanced users can have > access to the physical SparkPlan and manipulate it to provide columnar > processing for existing operators, including shuffle. This will allow them > to implement their own cost based optimizers to decide when processing should > be columnar and when it should not. > # Make any transitions between the columnar memory layout and a row based > layout transparent to the users so operations that are not columnar see the > data as rows, and operations that are columnar see the data as columns. > > Not Requirements, but things that would be nice to have. > # Transition the existing in memory columnar layouts to be compatible with > Apache Arrow. This would make the transformations to Apache Arrow format a > no-op. The existing formats are already very close to those layouts in many > cases. This would not be using the Apache Arrow java library, but instead > being compatible with the memory > [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a > subset of that layout. > > *Q2.* What problem is this proposal NOT designed to solve? 
> The goal of this is not for ML/AI but to provide APIs for accelerated > computing in Spark primarily targeting SQL/ETL like workloads. ML/AI already > have several mechanisms to get data into/out of them. These can be improved > but will be covered in a separate SPIP. > This is not trying to implement any of the processing itself in a columnar > way, with the exception of examples for documentation. > This does not cover exposing the underlying format of the data. The only way > to get at the data in a ColumnVector is through the public APIs. Exposing > the underlying format to improve efficiency will be covered in a separate > SPIP. > This is not trying to implement new ways of transferring data to external > ML/AI applications. That is covered by separate SPIPs already. > This is not trying to add in generic code generation for columnar processing. > Currently code generation for columnar processing is only supported when > translating columns to rows. We will continue to support this, but will not > extend it as a general solution. That will be covered in a separate SPIP if > we find it is helpful. For now columnar processing will be interpreted. > This is not trying to expose a way to get columnar data into Spark through > DataSource V2 or any other similar API. That would be covered by a separate > SPIP if we find it is needed. > > *Q3.* How is it done today, and what are the limits of current practice? > The current columnar support is limited to 3 areas. > # Internal implementations of FileFormats, optionally can return a > ColumnarBatch instead of rows. The code generation phase knows how to take > that columnar data and iterate through it as rows for stages that wants rows, > which currently is almost everything. The limitations here are mostly > implementation specific. The current standard is to abuse Scala’s type > erasure to return ColumnarBatches as the elements of an RDD[InternalRow]. 
The > code generation can handle this because it is generating java code, so it > bypasses scala’s type checking and just casts the InternalRow to the desired > ColumnarBatch. This makes it difficult for others to implement the same > functionality for different processing because they can only do it through > code generation. There really is no clean separate path in the code > generation for columnar vs row based. Additionally, because it is only > supported through code generation if for any reason code generation would > fail there is no backup. This is typically fine for input formats but can be > problematic when we get into more extensive processing. > # When caching data it can optionally be cached in a columnar format if the > input is also columnar. This is similar to the first area and has t
[jira] [Resolved] (SPARK-39161) Upgrade rocksdbjni to 7.2.2
[ https://issues.apache.org/jira/browse/SPARK-39161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-39161. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36522 [https://github.com/apache/spark/pull/36522] > Upgrade rocksdbjni to 7.2.2 > --- > > Key: SPARK-39161 > URL: https://issues.apache.org/jira/browse/SPARK-39161 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39161) Upgrade rocksdbjni to 7.2.2
[ https://issues.apache.org/jira/browse/SPARK-39161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-39161: - Assignee: Yang Jie > Upgrade rocksdbjni to 7.2.2 > --- > > Key: SPARK-39161 > URL: https://issues.apache.org/jira/browse/SPARK-39161 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor >
[jira] [Updated] (SPARK-36837) Upgrade Kafka to 3.1.0
[ https://issues.apache.org/jira/browse/SPARK-36837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36837: -- Parent: (was: SPARK-33772) Issue Type: Improvement (was: Sub-task) > Upgrade Kafka to 3.1.0 > -- > > Key: SPARK-36837 > URL: https://issues.apache.org/jira/browse/SPARK-36837 > Project: Spark > Issue Type: Improvement > Components: Build, Structured Streaming >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: releasenotes > Fix For: 3.4.0 > > > Kafka 3.1.0 has the official Java 17 support. We had better align with it.
[jira] [Updated] (SPARK-36837) Upgrade Kafka to 3.1.0
[ https://issues.apache.org/jira/browse/SPARK-36837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36837: -- Fix Version/s: 3.4.0 (was: 3.3.0) > Upgrade Kafka to 3.1.0 > -- > > Key: SPARK-36837 > URL: https://issues.apache.org/jira/browse/SPARK-36837 > Project: Spark > Issue Type: Sub-task > Components: Build, Structured Streaming >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: releasenotes > Fix For: 3.4.0 > > > Kafka 3.1.0 has the official Java 17 support. We had better align with it.
[jira] [Updated] (SPARK-36837) Upgrade Kafka to 3.1.0
[ https://issues.apache.org/jira/browse/SPARK-36837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36837: -- Affects Version/s: 3.4.0 (was: 3.3.0) > Upgrade Kafka to 3.1.0 > -- > > Key: SPARK-36837 > URL: https://issues.apache.org/jira/browse/SPARK-36837 > Project: Spark > Issue Type: Sub-task > Components: Build, Structured Streaming >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: releasenotes > Fix For: 3.3.0 > > > Kafka 3.1.0 has the official Java 17 support. We had better align with it.
[jira] [Created] (SPARK-39167) Throw an exception w/ an error class for multiple rows from a subquery used as an expression
Max Gekk created SPARK-39167: Summary: Throw an exception w/ an error class for multiple rows from a subquery used as an expression Key: SPARK-39167 URL: https://issues.apache.org/jira/browse/SPARK-39167 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Users can trigger an illegal state exception by the SQL statement: {code:sql} > select (select a from (select 1 as a union all select 2 as a) t) as b {code} {code:java} Caused by: java.lang.IllegalStateException: more than one row returned by a subquery used as an expression: Subquery subquery#242, [id=#100] +- AdaptiveSparkPlan isFinalPlan=true +- == Final Plan == Union :- *(1) Project [1 AS a#240] : +- *(1) Scan OneRowRelation[] +- *(2) Project [2 AS a#241] +- *(2) Scan OneRowRelation[] +- == Initial Plan == Union :- Project [1 AS a#240] : +- Scan OneRowRelation[] +- Project [2 AS a#241] +- Scan OneRowRelation[] at org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:83) {code} but such kind of exceptions are not supposed to be visible to users. Need to introduce an error class (or re-use an existing one), and replace the IllegalStateException.
[jira] [Created] (SPARK-39168) Consider all values in a python list when inferring schema
Brian Schaefer created SPARK-39168: -- Summary: Consider all values in a python list when inferring schema Key: SPARK-39168 URL: https://issues.apache.org/jira/browse/SPARK-39168 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 3.2.1 Reporter: Brian Schaefer Schema inference fails on the following case: {code:python} >>> data = [{"a": [1, None], "b": [None, 2]}] >>> spark.createDataFrame(data) ValueError: Some of types cannot be determined after inferring {code} This is because only the first value in the array is used to infer the element type for the array: [https://github.com/apache/spark/blob/b63674ea5f746306a96ab8c39c23a230a6cb9566/python/pyspark/sql/types.py#L1260]. The element type of the "b" array is inferred as {{NullType}}, but I think it makes sense to infer the element type as {{LongType}}. One approach to address the above would be to infer the type from the first non-null value in the array. However, consider a case with structs: {code:python} >>> spark.conf.set("spark.sql.pyspark.inferNestedDictAsStruct.enabled", True) >>> data = [{"a": [{"b": 1}, {"c": 2}]}] >>> spark.createDataFrame(data).schema StructType([StructField('a', ArrayType(StructType([StructField('b', LongType(), True)]), True), True)]) {code} The element type of the "a" array is inferred as a struct with one field, "b". However, it would be convenient to infer the element type as a struct with both fields "b" and "c". Omitted fields from each dictionary would become null values in each struct:
{code}
+----------------------+
|                     a|
+----------------------+
|[{1, null}, {null, 2}]|
+----------------------+
{code}
To support both of these cases, the type of each array element could be inferred, and those types could be merged, similar to the approach [here|https://github.com/apache/spark/blob/b63674ea5f746306a96ab8c39c23a230a6cb9566/python/pyspark/sql/session.py#L574-L576].
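The merge-based inference proposed in the ticket can be sketched in plain Python (illustrative only; the function names and string type tags are hypothetical stand-ins, not PySpark's actual `_infer_type`/`_merge_type` machinery):

```python
# Sketch: infer an element type from *all* values in a list and merge the
# results, instead of looking only at the first element.
def infer_type(value):
    if value is None:
        return "null"
    if isinstance(value, bool):          # check bool before int (bool is an int subclass)
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, dict):
        # struct: field name -> inferred field type
        return {k: infer_type(v) for k, v in value.items()}
    raise TypeError(f"unsupported: {type(value)}")

def merge_types(a, b):
    if a == "null":
        return b
    if b == "null":
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        # Union of struct fields; fields missing from one side stay nullable.
        merged = dict(a)
        for k, t in b.items():
            merged[k] = merge_types(merged[k], t) if k in merged else t
        return merged
    if a == b:
        return a
    raise TypeError(f"cannot merge {a} and {b}")

def infer_array_element_type(values):
    element_type = "null"
    for v in values:
        element_type = merge_types(element_type, infer_type(v))
    return element_type

# The two cases from the ticket:
infer_array_element_type([None, 2])             # -> "long", not "null"
infer_array_element_type([{"b": 1}, {"c": 2}])  # -> {"b": "long", "c": "long"}
```

This is the same idea the ticket points at in `session.py`: per-element inference followed by a pairwise type merge.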
[jira] [Resolved] (SPARK-39165) Replace sys.error by IllegalStateException in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-39165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39165. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36524 [https://github.com/apache/spark/pull/36524] > Replace sys.error by IllegalStateException in Spark SQL > --- > > Key: SPARK-39165 > URL: https://issues.apache.org/jira/browse/SPARK-39165 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Replace all sys.error by IllegalStateException. sys.error throws > RuntimeException which is hard to distinguish from Spark exception.
[jira] [Assigned] (SPARK-39165) Replace sys.error by IllegalStateException in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-39165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-39165: Assignee: Max Gekk > Replace sys.error by IllegalStateException in Spark SQL > --- > > Key: SPARK-39165 > URL: https://issues.apache.org/jira/browse/SPARK-39165 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Replace all sys.error by IllegalStateException. sys.error throws > RuntimeException which is hard to distinguish from Spark exception.
[jira] [Updated] (SPARK-39166) Provide runtime error query context for Binary Arithmetic when WSCG is off
[ https://issues.apache.org/jira/browse/SPARK-39166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-39166: --- Description: Currently, for most of the cases, the project https://issues.apache.org/jira/browse/SPARK-38615 is able to show where the runtime errors happen within the original query. However, after trying on production, I found that the following queries won't show where the divide by 0 error happens {code:java} create table aggTest(i int, j int, k int, d date) using parquet insert into aggTest values(1, 2, 0, date'2022-01-01') select sum(j)/sum(k),percentile(i, 0.9) from aggTest group by d{code} With `percentile` function in the query, the plan can't execute with whole stage codegen. Thus the child plan of `Project` is serialized to executors for execution, from ProjectExec: {code:java} protected override def doExecute(): RDD[InternalRow] = { child.execute().mapPartitionsWithIndexInternal { (index, iter) => val project = UnsafeProjection.create(projectList, child.output) project.initialize(index) iter.map(project) } }{code} Note that the `TreeNode.origin` is not serialized to executors since `TreeNode` doesn't extend the trait `Serializable`, which results in an empty query context on errors. For more details, please read https://issues.apache.org/jira/browse/SPARK-39140 A dummy fix is to make `TreeNode` extend the trait `Serializable`. However, it can be performance regression if the query text is long (every `TreeNode` carries it for serialization). A better fix is to introduce a new trait `SupportQueryContext` and materialize the truncated query context for special expressions. This jira targets on binary arithmetic expressions only. I will create follow-ups for the remaining expressions which support runtime error query context. 
> Provide runtime error query context for Binary Arithmetic when WSCG is off > -- > > Key: SPARK-39166 > URL: https://issues.apache.org/jira/browse/SPARK-39166 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Currently, for most of the cases, the project > https://issues.apache.org/jira/browse/SPARK-38615 is able to show where the > runtime errors happen within the original query. > However, after trying on production, I found that the following queries won't > show where the divide by 0 error happens > {code:java} > create table aggTest(i int, j int, k int, d date) using parquet > insert into aggTest values(1, 2, 0, date'2022-01-01') > select sum(j)/sum(k),percentile(i, 0.9) from aggTest group by d{code} > With `percentile` function in the query, the plan can't execute with whole > stage codegen. Thus the child plan of `Project` is serialized to executors > for execution, from ProjectExec: > {code:java} > protected override def doExecute(): RDD[InternalRow] = { > child.execute().mapPartitionsWithIndexInternal { (index, iter) => > val project = UnsafeProjection.create(projectList, child.output) > project.initialize(index) > iter.map(project) > } > }{code} > Note that the `TreeNode.origin` is not serialized to executors since > `TreeNode` doesn't extend the trait `Serializable`, which results in an empty > query context on errors. For more details, please read > https://issues.apache.org/jira/browse/SPARK-39140 > A dummy fix is to make `TreeNode` extend the trait `Serializable`. However, > it can be performance regression if the query text is long (every `TreeNode` > carries it for serialization). > A better fix is to introduce a new trait `SupportQueryContext` and > materialize the truncated query context for special expressions. This jira > targets on binary arithmetic expressions only. 
I will create follow-ups for > the remaining expressions which support runtime error query context.
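The materialize-early idea behind `SupportQueryContext` can be illustrated with a toy Python analogue (the class, slicing, and field names here are hypothetical; Spark's actual fix is a Scala trait, and Python's pickle stands in for Java serialization):

```python
import pickle

# Toy analogue of the `SupportQueryContext` proposal: instead of shipping the
# whole query text (or a non-serializable origin) to executors, an expression
# materializes a small, truncated context string when it is constructed on the
# driver. Only that short string is carried across serialization, so runtime
# errors on executors can still point at the offending query fragment.
class DivideExpr:
    def __init__(self, query_text, start, stop):
        # Materialize the truncated context eagerly, on the driver.
        self.query_context = query_text[start:stop]

    def eval(self, left, right):
        if right == 0:
            raise ZeroDivisionError(
                f"divide by zero in query fragment: {self.query_context!r}")
        return left / right

query = "select sum(j)/sum(k), percentile(i, 0.9) from aggTest group by d"
expr = DivideExpr(query, 7, 20)  # fragment covering "sum(j)/sum(k)"

# Round-trip through pickle to mimic shipping the expression to an executor.
remote = pickle.loads(pickle.dumps(expr))
try:
    remote.eval(3, 0)
except ZeroDivisionError as e:
    assert "sum(j)/sum(k)" in str(e)
```

The design choice matches the ticket's reasoning: carrying a short precomputed fragment avoids both the lost-context problem and the cost of serializing the full query text with every node.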
[jira] [Commented] (SPARK-39166) Provide runtime error query context for Binary Arithmetic when WSCG is off
[ https://issues.apache.org/jira/browse/SPARK-39166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536175#comment-17536175 ] Apache Spark commented on SPARK-39166: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/36525 > Provide runtime error query context for Binary Arithmetic when WSCG is off > -- > > Key: SPARK-39166 > URL: https://issues.apache.org/jira/browse/SPARK-39166 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major >
[jira] [Assigned] (SPARK-39166) Provide runtime error query context for Binary Arithmetic when WSCG is off
[ https://issues.apache.org/jira/browse/SPARK-39166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39166: Assignee: Apache Spark (was: Gengliang Wang) > Provide runtime error query context for Binary Arithmetic when WSCG is off > -- > > Key: SPARK-39166 > URL: https://issues.apache.org/jira/browse/SPARK-39166 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-39166) Provide runtime error query context for Binary Arithmetic when WSCG is off
[ https://issues.apache.org/jira/browse/SPARK-39166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39166: Assignee: Gengliang Wang (was: Apache Spark) > Provide runtime error query context for Binary Arithmetic when WSCG is off > -- > > Key: SPARK-39166 > URL: https://issues.apache.org/jira/browse/SPARK-39166 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major >
[jira] [Created] (SPARK-39166) Provide runtime error query context for Binary Arithmetic when WSCG is off
Gengliang Wang created SPARK-39166: -- Summary: Provide runtime error query context for Binary Arithmetic when WSCG is off Key: SPARK-39166 URL: https://issues.apache.org/jira/browse/SPARK-39166 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Gengliang Wang Assignee: Gengliang Wang
[jira] [Assigned] (SPARK-39165) Replace sys.error by IllegalStateException in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-39165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39165: Assignee: (was: Apache Spark) > Replace sys.error by IllegalStateException in Spark SQL > --- > > Key: SPARK-39165 > URL: https://issues.apache.org/jira/browse/SPARK-39165 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace all sys.error by IllegalStateException. sys.error throws > RuntimeException which is hard to distinguish from Spark exception.
[jira] [Assigned] (SPARK-39165) Replace sys.error by IllegalStateException in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-39165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39165: Assignee: Apache Spark > Replace sys.error by IllegalStateException in Spark SQL > --- > > Key: SPARK-39165 > URL: https://issues.apache.org/jira/browse/SPARK-39165 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Replace all sys.error by IllegalStateException. sys.error throws > RuntimeException which is hard to distinguish from Spark exception.
[jira] [Commented] (SPARK-39165) Replace sys.error by IllegalStateException in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-39165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536101#comment-17536101 ] Apache Spark commented on SPARK-39165: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/36524 > Replace sys.error by IllegalStateException in Spark SQL > --- > > Key: SPARK-39165 > URL: https://issues.apache.org/jira/browse/SPARK-39165 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace all sys.error by IllegalStateException. sys.error throws > RuntimeException which is hard to distinguish from Spark exception.
[jira] [Created] (SPARK-39165) Replace sys.error by IllegalStateException in Spark SQL
Max Gekk created SPARK-39165: Summary: Replace sys.error by IllegalStateException in Spark SQL Key: SPARK-39165 URL: https://issues.apache.org/jira/browse/SPARK-39165 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Replace all sys.error by IllegalStateException. sys.error throws RuntimeException which is hard to distinguish from Spark exception.
[jira] [Commented] (SPARK-39164) Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception in actions
[ https://issues.apache.org/jira/browse/SPARK-39164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536047#comment-17536047 ] Apache Spark commented on SPARK-39164: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/36500 > Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception in > actions > > > Key: SPARK-39164 > URL: https://issues.apache.org/jira/browse/SPARK-39164 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Catch exceptions from asserts and IllegalStateException raised from actions, > and replace them by SparkException w/ the INTERNAL_ERROR error class.
[jira] [Assigned] (SPARK-39164) Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception in actions
[ https://issues.apache.org/jira/browse/SPARK-39164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39164: Assignee: Apache Spark (was: Max Gekk) > Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception in > actions > > > Key: SPARK-39164 > URL: https://issues.apache.org/jira/browse/SPARK-39164 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Catch exceptions from asserts and IllegalStateException raised from actions, > and replace them by SparkException w/ the INTERNAL_ERROR error class.
[jira] [Assigned] (SPARK-39164) Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception in actions
[ https://issues.apache.org/jira/browse/SPARK-39164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39164: Assignee: Max Gekk (was: Apache Spark) > Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception in > actions > > > Key: SPARK-39164 > URL: https://issues.apache.org/jira/browse/SPARK-39164 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Catch exceptions from asserts and IllegalStateException raised from actions, > and replace them by SparkException w/ the INTERNAL_ERROR error class.
[jira] [Commented] (SPARK-37956) Add Java and Python examples to the Parquet encryption feature documentation
[ https://issues.apache.org/jira/browse/SPARK-37956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536046#comment-17536046 ] Apache Spark commented on SPARK-37956: -- User 'andersonm-ibm' has created a pull request for this issue: https://github.com/apache/spark/pull/36523 > Add Java and Python examples to the Parquet encryption feature documentation > - > > Key: SPARK-37956 > URL: https://issues.apache.org/jira/browse/SPARK-37956 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.2.0 >Reporter: Maya Anderson >Priority: Minor > > Add Java and Python examples to the Parquet encryption feature documentation, > based on the Scala example in [SPARK-35658].
[jira] [Updated] (SPARK-39164) Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception in actions
[ https://issues.apache.org/jira/browse/SPARK-39164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39164: - Summary: Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception in actions (was: Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception) > Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception in > actions > > > Key: SPARK-39164 > URL: https://issues.apache.org/jira/browse/SPARK-39164 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Catch exceptions from asserts and IllegalStateException raised from actions, > and replace them by SparkException w/ the INTERNAL_ERROR error class.
[jira] [Created] (SPARK-39164) Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception
Max Gekk created SPARK-39164: Summary: Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception Key: SPARK-39164 URL: https://issues.apache.org/jira/browse/SPARK-39164 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk Catch exceptions from asserts and IllegalStateException raised from actions, and replace them by SparkException w/ the INTERNAL_ERROR error class.
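The wrapping pattern this ticket describes can be sketched in Python (an analogue only; Spark implements this in Scala with `SparkException` and its error-class framework, and `RuntimeError` stands in for Java's `IllegalStateException`):

```python
# Sketch: internal invariant failures are caught at action boundaries and
# re-raised under a stable INTERNAL_ERROR error class, instead of leaking raw
# assertion/illegal-state errors to users.
class SparkException(Exception):
    def __init__(self, error_class, message):
        super().__init__(f"[{error_class}] {message}")
        self.error_class = error_class

def with_internal_error(action):
    """Decorator marking an action boundary where internal errors are wrapped."""
    def wrapped(*args, **kwargs):
        try:
            return action(*args, **kwargs)
        except (AssertionError, RuntimeError) as e:
            # RuntimeError stands in for Java's IllegalStateException here.
            raise SparkException("INTERNAL_ERROR", str(e)) from e
    return wrapped

@with_internal_error
def collect():
    # Simulated internal bug surfacing during an action.
    raise RuntimeError("more than one row returned by a subquery")

try:
    collect()
except SparkException as e:
    print(e)  # [INTERNAL_ERROR] more than one row returned by a subquery
```

Wrapping only at action boundaries keeps the hot execution path free of extra try/catch layers while still guaranteeing users see a classified error.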
[jira] [Updated] (SPARK-39163) Throw an exception w/ error class for an invalid bucket file
[ https://issues.apache.org/jira/browse/SPARK-39163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39163: - Description: Replace IllegalStateException by Spark's exception w/ an error class there [https://github.com/apache/spark/blob/ee6ea3c68694e35c36ad006a7762297800d1e463/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L621.|https://github.com/apache/spark/blob/ee6ea3c68694e35c36ad006a7762297800d1e463/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L621] Move related tests to an Query.*ErrorsSuite. was:Replace IllegalStateException by Spark's exception w/ an error class there https://github.com/apache/spark/blob/ee6ea3c68694e35c36ad006a7762297800d1e463/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L621 > Throw an exception w/ error class for an invalid bucket file > > > Key: SPARK-39163 > URL: https://issues.apache.org/jira/browse/SPARK-39163 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace IllegalStateException by Spark's exception w/ an error class there > [https://github.com/apache/spark/blob/ee6ea3c68694e35c36ad006a7762297800d1e463/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L621.|https://github.com/apache/spark/blob/ee6ea3c68694e35c36ad006a7762297800d1e463/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L621] > Move related tests to an Query.*ErrorsSuite.
[jira] [Created] (SPARK-39163) Throw an exception w/ error class for an invalid bucket file
Max Gekk created SPARK-39163: Summary: Throw an exception w/ error class for an invalid bucket file Key: SPARK-39163 URL: https://issues.apache.org/jira/browse/SPARK-39163 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Replace IllegalStateException by Spark's exception w/ an error class there https://github.com/apache/spark/blob/ee6ea3c68694e35c36ad006a7762297800d1e463/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L621
[jira] [Assigned] (SPARK-39161) Upgrade rocksdbjni to 7.2.2
[ https://issues.apache.org/jira/browse/SPARK-39161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39161: Assignee: (was: Apache Spark) > Upgrade rocksdbjni to 7.2.2 > --- > > Key: SPARK-39161 > URL: https://issues.apache.org/jira/browse/SPARK-39161 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor >
[jira] [Commented] (SPARK-39161) Upgrade rocksdbjni to 7.2.2
[ https://issues.apache.org/jira/browse/SPARK-39161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536020#comment-17536020 ] Apache Spark commented on SPARK-39161: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/36522 > Upgrade rocksdbjni to 7.2.2 > --- > > Key: SPARK-39161 > URL: https://issues.apache.org/jira/browse/SPARK-39161 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor >
[jira] [Assigned] (SPARK-39161) Upgrade rocksdbjni to 7.2.2
[ https://issues.apache.org/jira/browse/SPARK-39161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39161: Assignee: Apache Spark > Upgrade rocksdbjni to 7.2.2 > --- > > Key: SPARK-39161 > URL: https://issues.apache.org/jira/browse/SPARK-39161 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor >
[jira] [Assigned] (SPARK-39162) Jdbc dialect should decide which function could be pushed down.
[ https://issues.apache.org/jira/browse/SPARK-39162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39162: Assignee: Apache Spark > Jdbc dialect should decide which function could be pushed down. > --- > > Key: SPARK-39162 > URL: https://issues.apache.org/jira/browse/SPARK-39162 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > Regardless of whether the functions are ANSI or not, most databases are > actually unsure of their support. > So we should add a new API into JdbcDialect so that Jdbc dialect could decide > which function could be pushed down.
[jira] [Assigned] (SPARK-39162) Jdbc dialect should decide which function could be pushed down.
[ https://issues.apache.org/jira/browse/SPARK-39162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39162: Assignee: (was: Apache Spark) > Jdbc dialect should decide which function could be pushed down. > --- > > Key: SPARK-39162 > URL: https://issues.apache.org/jira/browse/SPARK-39162 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Regardless of whether the functions are ANSI or not, most databases are > actually unsure of their support. > So we should add a new API into JdbcDialect so that Jdbc dialect could decide > which function could be pushed down.
[jira] [Commented] (SPARK-39162) Jdbc dialect should decide which function could be pushed down.
[ https://issues.apache.org/jira/browse/SPARK-39162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536016#comment-17536016 ] Apache Spark commented on SPARK-39162: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/36521 > Jdbc dialect should decide which function could be pushed down. > --- > > Key: SPARK-39162 > URL: https://issues.apache.org/jira/browse/SPARK-39162 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Regardless of whether the functions are ANSI or not, most databases are > actually unsure of their support. > So we should add a new API into JdbcDialect so that Jdbc dialect could decide > which function could be pushed down.
[jira] [Updated] (SPARK-39162) Jdbc dialect should decide which function could be pushed down.
[ https://issues.apache.org/jira/browse/SPARK-39162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-39162: --- Description: Regardless of whether the functions are ANSI or not, most databases are actually unsure of their support. So we should add a new API into JdbcDialect so that Jdbc dialect could decide which function could be pushed down. was: Regardless of whether the functions are ANSI or not, most databases are actually unsure of their support. So we should add a new API into JdbcDialect so that Jdbc dialect should decide which function could be pushed down. > Jdbc dialect should decide which function could be pushed down. > --- > > Key: SPARK-39162 > URL: https://issues.apache.org/jira/browse/SPARK-39162 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Regardless of whether the functions are ANSI or not, most databases are > actually unsure of their support. > So we should add a new API into JdbcDialect so that Jdbc dialect could decide > which function could be pushed down.
[jira] [Created] (SPARK-39162) Jdbc dialect should decide which function could be pushed down.
jiaan.geng created SPARK-39162: -- Summary: Jdbc dialect should decide which function could be pushed down. Key: SPARK-39162 URL: https://issues.apache.org/jira/browse/SPARK-39162 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: jiaan.geng Regardless of whether the functions are ANSI or not, most databases are actually unsure of their support. So we should add a new API into JdbcDialect so that Jdbc dialect should decide which function could be pushed down.
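The API proposed in SPARK-39162 can be pictured as a dialect-level allowlist check: each dialect declares which functions it knows the remote database supports, and the pushdown planner consults the dialect instead of assuming ANSI coverage. A minimal sketch of that idea follows; note that Spark's actual `JdbcDialect` is Scala, and the method and class names here are illustrative assumptions, not the merged API.

```python
# Hypothetical model of SPARK-39162: the JDBC dialect, not the optimizer,
# decides whether a given function may be pushed down to the database.

class JdbcDialect:
    """Base dialect: conservatively pushes down nothing."""

    def is_supported_function(self, func_name: str) -> bool:
        return False


class H2Dialect(JdbcDialect):
    # Allowlist of functions this dialect believes H2 can evaluate
    # (illustrative set, not an exhaustive or authoritative list).
    SUPPORTED = {"ABS", "COALESCE", "LN", "UPPER", "LOWER"}

    def is_supported_function(self, func_name: str) -> bool:
        return func_name.upper() in self.SUPPORTED


def try_push_down(dialect: JdbcDialect, func_name: str) -> bool:
    # The planner asks the dialect; unknown functions stay in Spark.
    return dialect.is_supported_function(func_name)
```

With this shape, supporting a new database is a matter of overriding the allowlist in its dialect rather than special-casing functions inside the pushdown rules themselves.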
[jira] [Created] (SPARK-39161) Upgrade rocksdbjni to 7.2.2
Yang Jie created SPARK-39161: Summary: Upgrade rocksdbjni to 7.2.2 Key: SPARK-39161 URL: https://issues.apache.org/jira/browse/SPARK-39161 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.4.0 Reporter: Yang Jie
[jira] [Commented] (SPARK-38633) Support push down Cast to JDBC data source V2
[ https://issues.apache.org/jira/browse/SPARK-38633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17535941#comment-17535941 ] Apache Spark commented on SPARK-38633: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/36520 > Support push down Cast to JDBC data source V2 > - > > Key: SPARK-38633 > URL: https://issues.apache.org/jira/browse/SPARK-38633 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.3.0 > > > Cast is very useful and Spark always use Cast to convert data type > automatically.
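Pushing a Cast down to a JDBC source, as tracked in SPARK-38633, amounts to compiling Spark's Cast expression into the remote database's `CAST(expr AS type)` syntax. The sketch below models that translation step; the function and the type mapping are hypothetical illustrations, not Spark's actual V2 pushdown API, and a real dialect would override the type names per database.

```python
# Illustrative mapping from Spark SQL type names to generic JDBC-side
# type names; real dialects would customize this per database.
TYPE_NAMES = {
    "int": "INTEGER",
    "long": "BIGINT",
    "double": "DOUBLE",
    "string": "VARCHAR",
}


def compile_cast(child_sql: str, spark_type: str) -> str:
    # Refuse to push down casts to types the remote side may not have;
    # in that case the Cast would stay in Spark and run post-scan.
    if spark_type not in TYPE_NAMES:
        raise ValueError(f"cannot push down cast to {spark_type}")
    return f"CAST({child_sql} AS {TYPE_NAMES[spark_type]})"
```

The useful property is the fallback: any cast the dialect cannot express is simply kept on the Spark side, so pushdown is an optimization, never a correctness requirement.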
[jira] [Commented] (SPARK-39159) Add new Dataset API for Offset
[ https://issues.apache.org/jira/browse/SPARK-39159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17535909#comment-17535909 ] Apache Spark commented on SPARK-39159: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/36519 > Add new Dataset API for Offset > -- > > Key: SPARK-39159 > URL: https://issues.apache.org/jira/browse/SPARK-39159 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major >
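The Dataset Offset API tracked in SPARK-39159 mirrors SQL's OFFSET clause: skip the first n rows, typically combined with a limit for pagination. A pure-Python sketch of those semantics, assuming rows are an in-memory list rather than a distributed Dataset (the function names are illustrative, not Spark's API):

```python
def offset(rows: list, n: int) -> list:
    # OFFSET n: skip the first n rows; an offset past the end yields [].
    if n < 0:
        raise ValueError("offset must be non-negative")
    return rows[n:]


def paginate(rows: list, page: int, page_size: int) -> list:
    # Zero-based paging: OFFSET page * page_size, then LIMIT page_size.
    return offset(rows, page * page_size)[:page_size]
```

As in SQL, offset without a deterministic ordering is of limited use, so in practice it would follow a sort.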