[jira] [Created] (SPARK-45089) Remove obsolete repo of DB2 JDBC driver
Cheng Pan created SPARK-45089: - Summary: Remove obsolete repo of DB2 JDBC driver Key: SPARK-45089 URL: https://issues.apache.org/jira/browse/SPARK-45089 Project: Spark Issue Type: Test Components: Build, Tests Affects Versions: 4.0.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44833) Spark Connect reattach when initial ExecutePlan didn't reach server doing too eager Reattach
[ https://issues.apache.org/jira/browse/SPARK-44833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762282#comment-17762282 ] Aparna Garg commented on SPARK-44833: - User 'juliuszsompolski' has created a pull request for this issue: https://github.com/apache/spark/pull/42806 > Spark Connect reattach when initial ExecutePlan didn't reach server doing too > eager Reattach > > > Key: SPARK-44833 > URL: https://issues.apache.org/jira/browse/SPARK-44833 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski >Priority: Major > Fix For: 4.0.0, 3.5.1 > > > In > {code:java} > case ex: StatusRuntimeException > if Option(StatusProto.fromThrowable(ex)) > .exists(_.getMessage.contains("INVALID_HANDLE.OPERATION_NOT_FOUND")) => > if (lastReturnedResponseId.isDefined) { > throw new IllegalStateException( > "OPERATION_NOT_FOUND on the server but responses were already received > from it.", > ex) > } > // Try a new ExecutePlan, and throw upstream for retry. > -> iter = rawBlockingStub.executePlan(initialRequest) > -> throw new GrpcRetryHandler.RetryException {code} > we call executePlan, and throw RetryException to have an exception handled > upstream. > Then it goes to > {code:java} > retry { > if (firstTry) { > // on first try, we use the existing iter. > firstTry = false > } else { > // on retry, the iter is borked, so we need a new one > ->iter = rawBlockingStub.reattachExecute(createReattachExecuteRequest()) > } {code} > and because it's not firstTry, immediately does reattach. > This causes no failure - the reattach will work and attach to the query, the > original executePlan will get detached. But it could be improved. > Same issue is also present in python reattach.py. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
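A self-contained sketch of the control flow quoted in the issue above. `newExecutePlan`/`newReattach` stand in for the client's gRPC stub calls (`rawBlockingStub.executePlan` / `reattachExecute`), and the `freshIterator` flag is a hypothetical illustration of one way to avoid the eager reattach; it models the problem, not the actual change made in PR 42806.
{code:scala}
object ReattachFlowSketch {
  final class RetryException extends RuntimeException

  // Stand-ins for the gRPC stub calls; each returns a response iterator.
  def newExecutePlan(): Iterator[String] = Iterator("response from ExecutePlan")
  def newReattach(): Iterator[String] = Iterator("response from ReattachExecute")

  def main(args: Array[String]): Unit = {
    var firstTry = true
    var freshIterator = false // hypothetical flag, not the actual fix
    var iter: Iterator[String] = newExecutePlan()

    // Simulated OPERATION_NOT_FOUND handler from the first snippet: issue a
    // new ExecutePlan and throw so the retry loop runs again.
    def onOperationNotFound(): Nothing = {
      iter = newExecutePlan()
      freshIterator = true
      throw new RetryException
    }

    try onOperationNotFound()
    catch {
      case _: RetryException =>
        // Retry loop from the second snippet. Without the freshIterator
        // check, this branch would discard the just-created ExecutePlan
        // iterator and immediately reattach, which is the eagerness the
        // issue describes.
        if (firstTry || freshIterator) {
          firstTry = false
          freshIterator = false
        } else {
          iter = newReattach()
        }
    }
    println(iter.next()) // prints "response from ExecutePlan"
  }
}
{code}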
[jira] [Resolved] (SPARK-45070) Describe the binary and datetime formats of `to_char`/`to_varchar`
[ https://issues.apache.org/jira/browse/SPARK-45070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-45070. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42801 [https://github.com/apache/spark/pull/42801] > Describe the binary and datetime formats of `to_char`/`to_varchar` > -- > > Key: SPARK-45070 > URL: https://issues.apache.org/jira/browse/SPARK-45070 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 4.0.0 > > > In the PR, I propose to document the recent changes related to the `format` > of the `to_char`/`to_varchar` functions: > 1. binary formats added by https://github.com/apache/spark/pull/42632 > 2. datetime formats introduced by https://github.com/apache/spark/pull/42534 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
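A hedged illustration of the formats being documented: the binary format names ('base64', 'hex', 'utf-8') and the datetime pattern support are recalled from the linked PRs rather than stated in this thread, so treat both as assumptions, including the expected outputs in the comments.
{code:scala}
import org.apache.spark.sql.SparkSession

object ToCharFormatsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    // Binary input: the format selects the string encoding of the bytes.
    spark.sql("SELECT to_char(x'537061726b2053514c', 'utf-8')").show() // Spark SQL
    spark.sql("SELECT to_char(x'537061726b2053514c', 'hex')").show()   // hex digits
    // Datetime input: the format is a datetime pattern, as with date_format.
    spark.sql("SELECT to_char(date'2023-09-06', 'yyyy-MM-dd')").show() // 2023-09-06
    spark.stop()
  }
}
{code}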
[jira] [Assigned] (SPARK-44833) Spark Connect reattach when initial ExecutePlan didn't reach server doing too eager Reattach
[ https://issues.apache.org/jira/browse/SPARK-44833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44833: Assignee: Juliusz Sompolski > Spark Connect reattach when initial ExecutePlan didn't reach server doing too > eager Reattach > > > Key: SPARK-44833 > URL: https://issues.apache.org/jira/browse/SPARK-44833 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski >Priority: Major > > In > {code:java} > case ex: StatusRuntimeException > if Option(StatusProto.fromThrowable(ex)) > .exists(_.getMessage.contains("INVALID_HANDLE.OPERATION_NOT_FOUND")) => > if (lastReturnedResponseId.isDefined) { > throw new IllegalStateException( > "OPERATION_NOT_FOUND on the server but responses were already received > from it.", > ex) > } > // Try a new ExecutePlan, and throw upstream for retry. > -> iter = rawBlockingStub.executePlan(initialRequest) > -> throw new GrpcRetryHandler.RetryException {code} > we call executePlan, and throw RetryException to have an exception handled > upstream. > Then it goes to > {code:java} > retry { > if (firstTry) { > // on first try, we use the existing iter. > firstTry = false > } else { > // on retry, the iter is borked, so we need a new one > ->iter = rawBlockingStub.reattachExecute(createReattachExecuteRequest()) > } {code} > and because it's not firstTry, immediately does reattach. > This causes no failure - the reattach will work and attach to the query, the > original executePlan will get detached. But it could be improved. > Same issue is also present in python reattach.py. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44833) Spark Connect reattach when initial ExecutePlan didn't reach server doing too eager Reattach
[ https://issues.apache.org/jira/browse/SPARK-44833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44833. -- Fix Version/s: 3.5.1 4.0.0 Resolution: Fixed Issue resolved by pull request 42806 [https://github.com/apache/spark/pull/42806] > Spark Connect reattach when initial ExecutePlan didn't reach server doing too > eager Reattach > > > Key: SPARK-44833 > URL: https://issues.apache.org/jira/browse/SPARK-44833 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski >Priority: Major > Fix For: 3.5.1, 4.0.0 > > > In > {code:java} > case ex: StatusRuntimeException > if Option(StatusProto.fromThrowable(ex)) > .exists(_.getMessage.contains("INVALID_HANDLE.OPERATION_NOT_FOUND")) => > if (lastReturnedResponseId.isDefined) { > throw new IllegalStateException( > "OPERATION_NOT_FOUND on the server but responses were already received > from it.", > ex) > } > // Try a new ExecutePlan, and throw upstream for retry. > -> iter = rawBlockingStub.executePlan(initialRequest) > -> throw new GrpcRetryHandler.RetryException {code} > we call executePlan, and throw RetryException to have an exception handled > upstream. > Then it goes to > {code:java} > retry { > if (firstTry) { > // on first try, we use the existing iter. > firstTry = false > } else { > // on retry, the iter is borked, so we need a new one > ->iter = rawBlockingStub.reattachExecute(createReattachExecuteRequest()) > } {code} > and because it's not firstTry, immediately does reattach. > This causes no failure - the reattach will work and attach to the query, the > original executePlan will get detached. But it could be improved. > Same issue is also present in python reattach.py. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45088) Make `getitem` work with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-45088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762278#comment-17762278 ] Aparna Garg commented on SPARK-45088: - User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/42828 > Make `getitem` work with duplicated columns > --- > > Key: SPARK-45088 > URL: https://issues.apache.org/jira/browse/SPARK-45088 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45088) Make `getitem` work with duplicated columns
Ruifeng Zheng created SPARK-45088: - Summary: Make `getitem` work with duplicated columns Key: SPARK-45088 URL: https://issues.apache.org/jira/browse/SPARK-45088 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45087) Improve Python DataFrame API test coverage
Ruifeng Zheng created SPARK-45087: - Summary: Improve Python DataFrame API test coverage Key: SPARK-45087 URL: https://issues.apache.org/jira/browse/SPARK-45087 Project: Spark Issue Type: Umbrella Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44801) SQL Page does not capture failed queries in analyzer
[ https://issues.apache.org/jira/browse/SPARK-44801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762262#comment-17762262 ] Snoot.io commented on SPARK-44801: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/42825 > SQL Page does not capture failed queries in analyzer > - > > Key: SPARK-44801 > URL: https://issues.apache.org/jira/browse/SPARK-44801 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 3.2.4, 3.3.2, 3.4.1, 3.5.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45086) Display hexadecimal for thread lock hash code
[ https://issues.apache.org/jira/browse/SPARK-45086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762261#comment-17762261 ] Snoot.io commented on SPARK-45086: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/42826 > Display hexadecimal for thread lock hash code > - > > Key: SPARK-45086 > URL: https://issues.apache.org/jira/browse/SPARK-45086 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.4.1, 3.5.0, 4.0.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45086) Display hexadecimal for thread lock hash code
Kent Yao created SPARK-45086: Summary: Display hexadecimal for thread lock hash code Key: SPARK-45086 URL: https://issues.apache.org/jira/browse/SPARK-45086 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.4.1, 3.5.0, 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
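A minimal sketch of the formatting this issue proposes, assuming the intent is to match JVM thread dumps, which print monitor identity hash codes as "- locked <0x...>". The surrounding Web UI code is not shown; only the decimal-to-hexadecimal conversion is.
{code:scala}
object LockHashFormat {
  // Render a lock's identity hash code in hex instead of decimal.
  def format(lock: AnyRef): String =
    f"lock(${lock.getClass.getName}@${System.identityHashCode(lock)}%08x)"

  def main(args: Array[String]): Unit = {
    val monitor = new Object
    println(format(monitor)) // e.g. lock(java.lang.Object@1b6d3586)
  }
}
{code}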
[jira] [Updated] (SPARK-45071) Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
[ https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-45071: Fix Version/s: 3.5.1 (was: 3.5.0) > Optimize the processing speed of `BinaryArithmetic#dataType` when processing > multi-column data > -- > > Key: SPARK-45071 > URL: https://issues.apache.org/jira/browse/SPARK-45071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: ming95 >Assignee: ming95 >Priority: Major > Fix For: 3.4.2, 4.0.0, 3.5.1 > > > Since `BinaryArithmetic#dataType` will recursively process the datatype of > each node, the driver will be very slow when multiple columns are processed. > For example, the following code: > {code:java} > ``` > import spark.implicits._ > import scala.util.Random > import org.apache.spark.sql.functions.sum > import org.apache.spark.sql.types.{StructType, StructField, IntegerType} > val N = 30 > val M = 100 > val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString) > val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5)) > val schema = StructType(columns.map(StructField(_, IntegerType))) > val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_))) > val df = spark.createDataFrame(rdd, schema) > val colExprs = columns.map(sum(_)) > // gen a new column , and add the other 30 column > df.withColumn("new_col_sum", expr(columns.mkString(" + "))) > ``` > {code} > > This code will take a few minutes for the driver to execute in the spark3.4 > version, but only takes a few seconds to execute in the spark3.2 version. > Related issue: SPARK-39316 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
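A self-contained toy model of the slowdown described above, not Spark code: `Add` consults its left child's dataType more than once per call, so a left-deep chain like the generated `c1 + c2 + ... + cN` costs exponentially many calls, while caching the result per node (the shape of the fix) makes it linear.
{code:scala}
object DataTypeRecursionSketch {
  var calls = 0L

  sealed trait Expr { def dataType: String }
  final case class Leaf(tpe: String) extends Expr { def dataType: String = tpe }

  // Uncached: the left child's dataType is computed twice per call (once for
  // the check, once for the result), giving O(2^N) work on a chain of N Adds.
  final case class Add(l: Expr, r: Expr) extends Expr {
    def dataType: String = {
      calls += 1
      require(l.dataType == r.dataType, "type mismatch")
      l.dataType
    }
  }

  // Cached: each node computes its type once, giving O(N) work.
  final case class CachedAdd(l: Expr, r: Expr) extends Expr {
    lazy val dataType: String = {
      calls += 1
      require(l.dataType == r.dataType, "type mismatch")
      l.dataType
    }
  }

  def main(args: Array[String]): Unit = {
    val leaves = Seq.fill(20)(Leaf("int"): Expr)
    calls = 0
    println(leaves.reduce[Expr](Add(_, _)).dataType + ", calls = " + calls)       // 524287
    calls = 0
    println(leaves.reduce[Expr](CachedAdd(_, _)).dataType + ", calls = " + calls) // 19
  }
}
{code}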
[jira] [Commented] (SPARK-45046) Set shadeTestJar of core module to false
[ https://issues.apache.org/jira/browse/SPARK-45046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762260#comment-17762260 ] Snoot.io commented on SPARK-45046: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/42766 > Set shadeTestJar of core module to false > > > Key: SPARK-45046 > URL: https://issues.apache.org/jira/browse/SPARK-45046 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45046) Set shadeTestJar of core module to false
[ https://issues.apache.org/jira/browse/SPARK-45046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-45046. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42766 [https://github.com/apache/spark/pull/42766] > Set shadeTestJar of core module to false > > > Key: SPARK-45046 > URL: https://issues.apache.org/jira/browse/SPARK-45046 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45071) Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
[ https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762257#comment-17762257 ] Snoot.io commented on SPARK-45071: -- User 'ming95' has created a pull request for this issue: https://github.com/apache/spark/pull/42804 > Optimize the processing speed of `BinaryArithmetic#dataType` when processing > multi-column data > -- > > Key: SPARK-45071 > URL: https://issues.apache.org/jira/browse/SPARK-45071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: ming95 >Assignee: ming95 >Priority: Major > Fix For: 3.4.2, 3.5.0, 4.0.0 > > > Since `BinaryArithmetic#dataType` will recursively process the datatype of > each node, the driver will be very slow when multiple columns are processed. > For example, the following code: > {code:java} > ``` > import spark.implicits._ > import scala.util.Random > import org.apache.spark.sql.functions.sum > import org.apache.spark.sql.types.{StructType, StructField, IntegerType} > val N = 30 > val M = 100 > val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString) > val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5)) > val schema = StructType(columns.map(StructField(_, IntegerType))) > val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_))) > val df = spark.createDataFrame(rdd, schema) > val colExprs = columns.map(sum(_)) > // gen a new column , and add the other 30 column > df.withColumn("new_col_sum", expr(columns.mkString(" + "))) > ``` > {code} > > This code will take a few minutes for the driver to execute in the spark3.4 > version, but only takes a few seconds to execute in the spark3.2 version. > Related issue: SPARK-39316 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45080) Kafka DSv2 streaming source implementation calls planInputPartitions 4 times per microbatch
[ https://issues.apache.org/jira/browse/SPARK-45080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762258#comment-17762258 ] Snoot.io commented on SPARK-45080: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/42823 > Kafka DSv2 streaming source implementation calls planInputPartitions 4 times > per microbatch > --- > > Key: SPARK-45080 > URL: https://issues.apache.org/jira/browse/SPARK-45080 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Jungtaek Lim >Priority: Major > > I was tracking through method calls for DSv2 streaming source, and figured > out planInputPartitions is called 4 times per microbatch. > It turned out that multiple calls of planInputPartitions is due to > `DataSourceV2ScanExecBase.supportsColumnar`, though it is called through > `MicroBatchScanExec.inputPartitions` which is defined as lazy, hence > shouldn't happen. > The behavior seems to be coupled with catalyst and very hard to figure out > why, but with SPARK-44505, we can at least fix this per each data source. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45046) Set shadeTestJar of core module to false
[ https://issues.apache.org/jira/browse/SPARK-45046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-45046: Assignee: Yang Jie > Set shadeTestJar of core module to false > > > Key: SPARK-45046 > URL: https://issues.apache.org/jira/browse/SPARK-45046 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45071) Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
[ https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-45071. - Fix Version/s: 3.5.0 4.0.0 3.4.2 Resolution: Fixed Issue resolved by pull request 42804 [https://github.com/apache/spark/pull/42804] > Optimize the processing speed of `BinaryArithmetic#dataType` when processing > multi-column data > -- > > Key: SPARK-45071 > URL: https://issues.apache.org/jira/browse/SPARK-45071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: ming95 >Assignee: ming95 >Priority: Major > Fix For: 3.5.0, 4.0.0, 3.4.2 > > > Since `BinaryArithmetic#dataType` will recursively process the datatype of > each node, the driver will be very slow when multiple columns are processed. > For example, the following code: > {code:java} > ``` > import spark.implicits._ > import scala.util.Random > import org.apache.spark.sql.functions.sum > import org.apache.spark.sql.types.{StructType, StructField, IntegerType} > val N = 30 > val M = 100 > val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString) > val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5)) > val schema = StructType(columns.map(StructField(_, IntegerType))) > val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_))) > val df = spark.createDataFrame(rdd, schema) > val colExprs = columns.map(sum(_)) > // gen a new column , and add the other 30 column > df.withColumn("new_col_sum", expr(columns.mkString(" + "))) > ``` > {code} > > This code will take a few minutes for the driver to execute in the spark3.4 > version, but only takes a few seconds to execute in the spark3.2 version. > Related issue: SPARK-39316 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45071) Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
[ https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-45071: --- Assignee: ming95 > Optimize the processing speed of `BinaryArithmetic#dataType` when processing > multi-column data > -- > > Key: SPARK-45071 > URL: https://issues.apache.org/jira/browse/SPARK-45071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: ming95 >Assignee: ming95 >Priority: Major > > Since `BinaryArithmetic#dataType` will recursively process the datatype of > each node, the driver will be very slow when multiple columns are processed. > For example, the following code: > {code:java} > ``` > import spark.implicits._ > import scala.util.Random > import org.apache.spark.sql.functions.sum > import org.apache.spark.sql.types.{StructType, StructField, IntegerType} > val N = 30 > val M = 100 > val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString) > val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5)) > val schema = StructType(columns.map(StructField(_, IntegerType))) > val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_))) > val df = spark.createDataFrame(rdd, schema) > val colExprs = columns.map(sum(_)) > // gen a new column , and add the other 30 column > df.withColumn("new_col_sum", expr(columns.mkString(" + "))) > ``` > {code} > > This code will take a few minutes for the driver to execute in the spark3.4 > version, but only takes a few seconds to execute in the spark3.2 version. > Related issue: SPARK-39316 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45077) Upgrade dagre-d3.js from 0.4.3 to 0.6.4
[ https://issues.apache.org/jira/browse/SPARK-45077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-45077: - Labels: (was: pat) > Upgrade dagre-d3.js from 0.4.3 to 0.6.4 > --- > > Key: SPARK-45077 > URL: https://issues.apache.org/jira/browse/SPARK-45077 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45083) Refine docstring of `min`
[ https://issues.apache.org/jira/browse/SPARK-45083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-45083: - Assignee: Allison Wang > Refine docstring of `min` > - > > Key: SPARK-45083 > URL: https://issues.apache.org/jira/browse/SPARK-45083 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > > Refine the docstring of the function `min`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45083) Refine docstring of `min`
[ https://issues.apache.org/jira/browse/SPARK-45083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-45083. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42821 [https://github.com/apache/spark/pull/42821] > Refine docstring of `min` > - > > Key: SPARK-45083 > URL: https://issues.apache.org/jira/browse/SPARK-45083 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 4.0.0 > > > Refine the docstring of the function `min`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45085) Merge UNSUPPORTED_TEMP_VIEW_OPERATION into UNSUPPORTED_VIEW_OPERATION and refactor some logic
[ https://issues.apache.org/jira/browse/SPARK-45085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-45085: Summary: Merge UNSUPPORTED_TEMP_VIEW_OPERATION into UNSUPPORTED_VIEW_OPERATION and refactor some logic (was: Merge UNSUPPORTED_TEMP_VIEW_OPERATION to UNSUPPORTED_VIEW_OPERATION and refactor some logic) > Merge UNSUPPORTED_TEMP_VIEW_OPERATION into UNSUPPORTED_VIEW_OPERATION and > refactor some logic > - > > Key: SPARK-45085 > URL: https://issues.apache.org/jira/browse/SPARK-45085 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45085) Merge UNSUPPORTED_TEMP_VIEW_OPERATION to UNSUPPORTED_VIEW_OPERATION and refactor some logic
BingKun Pan created SPARK-45085: --- Summary: Merge UNSUPPORTED_TEMP_VIEW_OPERATION to UNSUPPORTED_VIEW_OPERATION and refactor some logic Key: SPARK-45085 URL: https://issues.apache.org/jira/browse/SPARK-45085 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true
[ https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762234#comment-17762234 ] Bruce Robbins commented on SPARK-44805: --- I looked at this yesterday and I think I have a handle on what's going on. I will make a PR in the coming days. > Data lost after union using > spark.sql.parquet.enableNestedColumnVectorizedReader=true > - > > Key: SPARK-44805 > URL: https://issues.apache.org/jira/browse/SPARK-44805 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.1 > Environment: pySpark, linux, hadoop, parquet. >Reporter: Jakub Wozniak >Priority: Major > Labels: correctness > > When union-ing two DataFrames read from parquet containing nested structures > (2 fields of array types where one is double and second is integer) data from > the second field seems to be lost (zeros are set instead). > This seems to be the case only if nested vectorised reader is used > (spark.sql.parquet.enableNestedColumnVectorizedReader=true). > The following Python code reproduces the problem: > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql.types import * > # PREPARING DATA > data1 = [] > data2 = [] > for i in range(2): > data1.append( (([1,2,3],[1,1,2]),i)) > data2.append( (([1.0,2.0,3.0],[1,1]),i+10)) > schema1 = StructType([ > StructField('value', StructType([ > StructField('f1', ArrayType(IntegerType()), True), > StructField('f2', ArrayType(IntegerType()), True) > ])), > StructField('id', IntegerType(), True) > ]) > schema2 = StructType([ > StructField('value', StructType([ > StructField('f1', ArrayType(DoubleType()), True), > StructField('f2', ArrayType(IntegerType()), True) > ])), > StructField('id', IntegerType(), True) > ]) > spark = SparkSession.builder.getOrCreate() > data_dir = "/user//" > df1 = spark.createDataFrame(data1, schema1) > df1.write.mode('overwrite').parquet(data_dir + "data1") > df2 = spark.createDataFrame(data2, schema2) > df2.write.mode('overwrite').parquet(data_dir + "data2") > # READING DATA > parquet1 = spark.read.parquet(data_dir + "data1") > parquet2 = spark.read.parquet(data_dir + "data2") > # UNION > out = parquet1.union(parquet2) > parquet1.select("value.f2").distinct().show() > out.select("value.f2").distinct().show() > print(parquet1.collect()) > print(out.collect()) {code} > Output: > {code:java} > +-+ > | f2| > +-+ > |[1, 1, 2]| > +-+ > +-+ > | f2| > +-+ > |[0, 0, 0]| > | [1, 1]| > +-+ > [ > Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), > Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1) > ] > [ > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11) > ] {code} > Please notice that values for the field f2 are lost after the union is done. > This only happens when this data is read from parquet files. > Could you please look into this? > Best regards, > Jakub -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
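A hedged workaround sketch for the behavior reported above, pending the fix: read the affected data with the nested-column vectorized reader turned off. The config name comes from the report itself; the path is a placeholder since the report elides its own.
{code:scala}
import org.apache.spark.sql.SparkSession

object NestedReaderWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    // Trade read performance for correctness until the fix lands.
    spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "false")

    val dataDir = "/tmp/spark-44805/" // placeholder path
    val out = spark.read.parquet(dataDir + "data1")
      .union(spark.read.parquet(dataDir + "data2"))
    out.select("value.f2").distinct().show() // f2 values survive the union
  }
}
{code}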
[jira] [Created] (SPARK-45084) ProgressReport should include an accurate effective shuffle partition number
Siying Dong created SPARK-45084: --- Summary: ProgressReport should include an accurate effective shuffle partition number Key: SPARK-45084 URL: https://issues.apache.org/jira/browse/SPARK-45084 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.4.2 Reporter: Siying Dong Currently, there is a numShufflePartitions "metric" reported in the StateOperatorProgress part of the progress report. However, the number is aggregated across executors, so in the case of task retries or speculative execution, the metric is higher than the number of shuffle partitions for the query plan. The number of shuffle partitions can be useful for reporting purposes, so having an accurate metric is helpful. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
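For context, a sketch of where this metric surfaces, assuming an active StreamingQuery named `query`; `numShufflePartitions` is the StateOperatorProgress field named above, while `operatorName` is an assumption about the progress schema on the affected versions.
{code:scala}
// Assumes `query` is an active org.apache.spark.sql.streaming.StreamingQuery.
val progress = query.lastProgress
progress.stateOperators.foreach { op =>
  // On task retries or speculative execution this value can exceed the
  // plan's actual shuffle partition count, which is what the issue calls out.
  println(s"${op.operatorName}: numShufflePartitions = ${op.numShufflePartitions}")
}
{code}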
[jira] [Created] (SPARK-45083) Refine docstring of `min`
Allison Wang created SPARK-45083: Summary: Refine docstring of `min` Key: SPARK-45083 URL: https://issues.apache.org/jira/browse/SPARK-45083 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Refine the docstring of the function `min`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45082) Review and fix issues in API docs
[ https://issues.apache.org/jira/browse/SPARK-45082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanjian Li updated SPARK-45082: Fix Version/s: (was: 3.1.1) > Review and fix issues in API docs > - > > Key: SPARK-45082 > URL: https://issues.apache.org/jira/browse/SPARK-45082 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > > Compare the 3.4 API doc with the 3.5 RC3 cut. Fix the following issues: > * Remove the leaking class/object in API doc -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45057) Deadlock caused by rdd replication level of 2
[ https://issues.apache.org/jira/browse/SPARK-45057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-45057: - Description: When 2 tasks try to compute the same RDD with a replication level of 2 while running on only 2 executors, a deadlock will happen. Tasks only release the lock after writing to the local machine and replicating to the remote executor. ||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task Thread T3)||Exe 2 (Shuffle Server Thread T4)|| |T0|write lock of rdd| | | | |T1| | |write lock of rdd| | |T2|replicate -> UploadBlockSync (blocked by T4)| | | | |T3| | | |Received UploadBlock request from T1 (blocked by T3)| |T4| | |replicate -> UploadBlockSync (blocked by T2)| | |T5| |Received UploadBlock request from T3 (blocked by T1)| | | |T6|Deadlock|Deadlock|Deadlock|Deadlock| was: When 2 tasks try to compute the same RDD with a replication level of 2 while running on only 2 executors, a deadlock will happen. ||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task Thread T3)||Exe 2 (Shuffle Server Thread T4)|| |T0|write lock of rdd| | | | |T1| | |write lock of rdd| | |T2|replicate -> UploadBlockSync (blocked by T4)| | | | |T3| | | |Received UploadBlock request from T1 (blocked by T3)| |T4| | |replicate -> UploadBlockSync (blocked by T2)| | |T5| |Received UploadBlock request from T3 (blocked by T1)| | | |T6|Deadlock|Deadlock|Deadlock|Deadlock| > Deadlock caused by rdd replication level of 2 > - > > Key: SPARK-45057 > URL: https://issues.apache.org/jira/browse/SPARK-45057 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Zhongwei Zhu >Priority: Major > > > When 2 tasks try to compute the same RDD with a replication level of 2 while > running on only 2 executors, a deadlock will happen. > Tasks only release the lock after writing to the local machine and > replicating to the remote executor. > > ||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task > Thread T3)||Exe 2 (Shuffle Server Thread T4)|| > |T0|write lock of rdd| | | | > |T1| | |write lock of rdd| | > |T2|replicate -> UploadBlockSync (blocked by T4)| | | | > |T3| | | |Received UploadBlock request from T1 (blocked by T3)| > |T4| | |replicate -> UploadBlockSync (blocked by T2)| | > |T5| |Received UploadBlock request from T3 (blocked by T1)| | | > |T6|Deadlock|Deadlock|Deadlock|Deadlock| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
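A minimal sketch of the setup in the timeline above, assuming a hypothetical two-executor local cluster. Whether the deadlock actually manifests depends on the two tasks' timing; this shows the ingredients (replication level 2 plus concurrent computation of the same blocks), not a deterministic reproduction.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ReplicatedRddDeadlockSetup {
  def main(args: Array[String]): Unit = {
    // Two executors with one core each, mirroring Exe 1 / Exe 2 above.
    val spark = SparkSession.builder()
      .master("local-cluster[2,1,1024]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Replication level 2: each computed block is also uploaded to the peer.
    val rdd = sc.parallelize(1 to 4, 2).persist(StorageLevel.MEMORY_ONLY_2)

    // Two jobs computing the same partitions concurrently: each task can hold
    // its local block write lock while blocking in UploadBlockSync, forming
    // the cycle shown in the table.
    val t1 = new Thread(() => { rdd.count(); () })
    val t2 = new Thread(() => { rdd.count(); () })
    t1.start(); t2.start()
    t1.join(); t2.join()
    spark.stop()
  }
}
{code}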
[jira] [Resolved] (SPARK-45051) Connect: Use UUIDv7 for operation IDs to make operations chronologically sortable
[ https://issues.apache.org/jira/browse/SPARK-45051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Dillitz resolved SPARK-45051. Resolution: Abandoned We agreed that the benefits of adding this are not big enough because we can not rely on the operation ID being UUIDv7 and need to sort by startDate anyway. Closing this PR. > Connect: Use UUIDv7 for operation IDs to make operations chronologically > sortable > - > > Key: SPARK-45051 > URL: https://issues.apache.org/jira/browse/SPARK-45051 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.1 >Reporter: Robert Dillitz >Priority: Major > Labels: Connect > > Spark Connect currently uses UUIDv4 for operation IDs. Using UUIDv7 instead > allows us to sort operations by ID to receive a chronological order while > keeping the collision-free properties we require from this ID. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45082) Review and fix issues in API docs
[ https://issues.apache.org/jira/browse/SPARK-45082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanjian Li updated SPARK-45082: Description: Compare the 3.4 API doc with the 3.5 RC3 cut. Fix the following issues: * Remove the leaking class/object in API doc was: Compare the 3.1.1 API doc with the latest release version 3.0.1. Fix the following issues: * Add missing `Since` annotation for new APIs * Remove the leaking class/object in API doc > Review and fix issues in API docs > - > > Key: SPARK-45082 > URL: https://issues.apache.org/jira/browse/SPARK-45082 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.1.1 > > > Compare the 3.4 API doc with the 3.5 RC3 cut. Fix the following issues: > * Remove the leaking class/object in API doc -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45082) Review and fix issues in API docs
Yuanjian Li created SPARK-45082: --- Summary: Review and fix issues in API docs Key: SPARK-45082 URL: https://issues.apache.org/jira/browse/SPARK-45082 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.1.1 Reporter: Yuanjian Li Assignee: Yuanjian Li Fix For: 3.1.1 Compare the 3.1.1 API doc with the latest release version 3.0.1. Fix the following issues: * Add missing `Since` annotation for new APIs * Remove the leaking class/object in API doc -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45082) Review and fix issues in API docs
[ https://issues.apache.org/jira/browse/SPARK-45082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanjian Li updated SPARK-45082: Affects Version/s: 3.5.0 (was: 3.1.1) > Review and fix issues in API docs > - > > Key: SPARK-45082 > URL: https://issues.apache.org/jira/browse/SPARK-45082 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.1.1 > > > Compare the 3.1.1 API doc with the latest release version 3.0.1. Fix the > following issues: > * Add missing `Since` annotation for new APIs > * Remove the leaking class/object in API doc -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45072) Fix Outerscopes for same cell evaluation
[ https://issues.apache.org/jira/browse/SPARK-45072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45072. --- Fix Version/s: 3.5.1 Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/42807 > Fix Outerscopes for same cell evaluation > > > Key: SPARK-45072 > URL: https://issues.apache.org/jira/browse/SPARK-45072 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45072) Fix Outerscopes for same cell evaluation
[ https://issues.apache.org/jira/browse/SPARK-45072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45072: -- Issue Type: Bug (was: New Feature) > Fix Outerscopes for same cell evaluation > > > Key: SPARK-45072 > URL: https://issues.apache.org/jira/browse/SPARK-45072 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44284) Introduce simple conf system for sql/api
[ https://issues.apache.org/jira/browse/SPARK-44284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762123#comment-17762123 ] Herman van Hövell commented on SPARK-44284: --- I added a description. IMO the change itself is not too spectacular. > Introduce simple conf system for sql/api > --- > > Key: SPARK-44284 > URL: https://issues.apache.org/jira/browse/SPARK-44284 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.1 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0 > > > Create a simple conf system for classes in sql/api. This is needed for a > number of classes that are moved from sql/catalyst to sql/api that require > configuration access (e.g. timeZone, parsing behavior, ...). > The change will add a small common interface that allows you to read the > needed configurations, this interface is implemented by SQLConf and SQLConf > will be used when we are executing on the driver, and there will be an > implementation using the default values for when we are in Connect mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45081) Encoders.bean no longer works with read-only properties
Giambattista Bloisi created SPARK-45081: --- Summary: Encoders.bean no longer works with read-only properties Key: SPARK-45081 URL: https://issues.apache.org/jira/browse/SPARK-45081 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1 Reporter: Giambattista Bloisi Since Spark 3.4.x, an exception is thrown when Encoders.bean is called with a bean that has read-only properties, such as: {code:java} public static class ReadOnlyPropertyBean implements Serializable { public boolean isEmpty() { return true; } } {code} Encoders.bean(ReadOnlyPropertyBean.class) will throw: {code:java} java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$deserializerFor$8(ScalaReflection.scala:359) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.catalyst.ScalaReflection$.deserializerFor(ScalaReflection.scala:348) at org.apache.spark.sql.catalyst.ScalaReflection$.deserializerFor(ScalaReflection.scala:183) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:56) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:62) at org.apache.spark.sql.Encoders$.bean(Encoders.scala:179) at org.apache.spark.sql.Encoders.bean(Encoders.scala) {code} This problem is also described in [link Encoders.bean doesn't work anymore on a Java POJO, with Spark 3.4.0|https://stackoverflow.com/questions/76036349/encoders-bean-doesnt-work-anymore-on-a-java-pojo-with-spark-3-4-0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
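A hedged workaround sketch: on the affected versions the failure comes from the deserializer looking up a write method for the property, so giving the bean a matching setter (making the property read-write) sidesteps it. `ReadWritePropertyBean` is a hypothetical variant of the reported bean, written in Scala with an explicit getter/setter pair so `Encoders.bean` sees a complete property.
{code:scala}
import org.apache.spark.sql.Encoders

class ReadWritePropertyBean extends Serializable {
  private var empty: Boolean = true
  def isEmpty: Boolean = empty                           // getter, as in the report
  def setEmpty(value: Boolean): Unit = { empty = value } // added setter
}

object BeanEncoderWorkaround {
  def main(args: Array[String]): Unit = {
    val encoder = Encoders.bean(classOf[ReadWritePropertyBean])
    println(encoder.schema) // a single boolean field `empty`, no exception
  }
}
{code}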
[jira] [Updated] (SPARK-44284) Introduce simple conf system for sql/api
[ https://issues.apache.org/jira/browse/SPARK-44284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell updated SPARK-44284: -- Description: Create a simple conf system for classes in sql/api. This is needed for a number of classes that are moved from sql/catalyst to sql/api that require configuration access (e.g. timeZone, parsing behavior, ...). The change will add a small common interface that allows you to read the needed configurations, this interface is implemented by SQLConf and SQLConf will be used when we are executing on the driver, and there will be an implementation using the default values for when we are in Connect mode. was:Create a simple conf system for classes in sql/api > Introduce simple conf system for sql/api > --- > > Key: SPARK-44284 > URL: https://issues.apache.org/jira/browse/SPARK-44284 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.1 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0 > > > Create a simple conf system for classes in sql/api. This is needed for a > number of classes that are moved from sql/catalyst to sql/api that require > configuration access (e.g. timeZone, parsing behavior, ...). > The change will add a small common interface that allows you to read the > needed configurations, this interface is implemented by SQLConf and SQLConf > will be used when we are executing on the driver, and there will be an > implementation using the default values for when we are in Connect mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45075) Alter table with invalid default value will not report error
[ https://issues.apache.org/jira/browse/SPARK-45075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762110#comment-17762110 ] Ignite TC Bot commented on SPARK-45075: --- User 'Hisoka-X' has created a pull request for this issue: https://github.com/apache/spark/pull/42810 > Alter table with invalid default value will not report error > > > Key: SPARK-45075 > URL: https://issues.apache.org/jira/browse/SPARK-45075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: Jia Fan >Priority: Major > > create table t(i boolean, s bigint); > alter table t alter column s set default badvalue; > > The code wouldn't report an error on DataSource V2, which does not align with > V1. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44284) Introduce simple conf system for sql/api
[ https://issues.apache.org/jira/browse/SPARK-44284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762102#comment-17762102 ] Thomas Graves commented on SPARK-44284: --- Can we get a description on this? This seems like a fairly significant change for a one-line summary without a description here or in the PR. > Introduce simple conf system for sql/api > --- > > Key: SPARK-44284 > URL: https://issues.apache.org/jira/browse/SPARK-44284 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.1 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0 > > > Create a simple conf system for classes in sql/api -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45080) Kafka DSv2 streaming source implementation calls planInputPartitions 4 times per microbatch
[ https://issues.apache.org/jira/browse/SPARK-45080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762089#comment-17762089 ] Jungtaek Lim commented on SPARK-45080: -- Working on this. Will submit a PR soon. > Kafka DSv2 streaming source implementation calls planInputPartitions 4 times > per microbatch > --- > > Key: SPARK-45080 > URL: https://issues.apache.org/jira/browse/SPARK-45080 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Jungtaek Lim >Priority: Major > > I was tracking through method calls for DSv2 streaming source, and figured > out planInputPartitions is called 4 times per microbatch. > It turned out that multiple calls of planInputPartitions is due to > `DataSourceV2ScanExecBase.supportsColumnar`, though it is called through > `MicroBatchScanExec.inputPartitions` which is defined as lazy, hence > shouldn't happen. > The behavior seems to be coupled with catalyst and very hard to figure out > why, but with SPARK-44505, we can at least fix this per each data source. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45080) Kafka DSv2 streaming source implementation calls planInputPartitions 4 times per microbatch
Jungtaek Lim created SPARK-45080: Summary: Kafka DSv2 streaming source implementation calls planInputPartitions 4 times per microbatch Key: SPARK-45080 URL: https://issues.apache.org/jira/browse/SPARK-45080 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Jungtaek Lim I was tracing through method calls for the DSv2 streaming source and found that planInputPartitions is called 4 times per microbatch. It turned out that the multiple calls of planInputPartitions are due to `DataSourceV2ScanExecBase.supportsColumnar`, even though it is called through `MicroBatchScanExec.inputPartitions`, which is defined as lazy and hence shouldn't trigger recomputation. The behavior seems to be coupled with Catalyst and is very hard to pin down, but with SPARK-44505 we can at least fix this per data source. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
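A sketch of the per-source mitigation hinted at above: memoizing partition planning inside the stream so repeated planInputPartitions calls within one microbatch reuse the result. The MicroBatchStream types are from the DSv2 API; the cache-by-offset-range scheme is a hypothetical illustration, not the actual fix in the Kafka connector.
{code:scala}
import org.apache.spark.sql.connector.read.InputPartition
import org.apache.spark.sql.connector.read.streaming.{MicroBatchStream, Offset}

abstract class CachingMicroBatchStream extends MicroBatchStream {
  // Planning for a given (start, end) range is deterministic, so cache the
  // last result and reuse it when the same range is requested again.
  @volatile private var cached: ((Offset, Offset), Array[InputPartition]) = _

  // Subclasses do the actual (expensive) planning here.
  protected def doPlanInputPartitions(start: Offset, end: Offset): Array[InputPartition]

  override def planInputPartitions(start: Offset, end: Offset): Array[InputPartition] = {
    val key = (start, end)
    val c = cached
    if (c != null && c._1 == key) {
      c._2
    } else {
      val parts = doPlanInputPartitions(start, end)
      cached = (key, parts)
      parts
    }
  }
}
{code}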
[jira] [Commented] (SPARK-45079) percentile_approx() fails with an internal error on NULL accuracy
[ https://issues.apache.org/jira/browse/SPARK-45079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762061#comment-17762061 ] Aparna Garg commented on SPARK-45079: - User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/42817 > percentile_approx() fails with an internal error on NULL accuracy > - > > Key: SPARK-45079 > URL: https://issues.apache.org/jira/browse/SPARK-45079 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The example below demonstrates the issue: > {code:sql} > spark-sql (default)> SELECT percentile_approx(col, array(0.5, 0.4, 0.1), > NULL) FROM VALUES (0), (1), (2), (10) AS tab(col); > [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. > You hit a bug in Spark or the Spark plugins you use. Please, report this bug > to the corresponding communities or vendors, and provide the full stack trace. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45079) percentile_approx() fails with an internal error on NULL accuracy
[ https://issues.apache.org/jira/browse/SPARK-45079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-45079: - Affects Version/s: 3.3.2 > percentile_approx() fails with an internal error on NULL accuracy > - > > Key: SPARK-45079 > URL: https://issues.apache.org/jira/browse/SPARK-45079 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The example below demonstrates the issue: > {code:sql} > spark-sql (default)> SELECT percentile_approx(col, array(0.5, 0.4, 0.1), > NULL) FROM VALUES (0), (1), (2), (10) AS tab(col); > [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. > You hit a bug in Spark or the Spark plugins you use. Please, report this bug > to the corresponding communities or vendors, and provide the full stack trace. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45079) percentile_approx() fails with an internal error on NULL accuracy
[ https://issues.apache.org/jira/browse/SPARK-45079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-45079: - Affects Version/s: 3.5.0 > percentile_approx() fails with an internal error on NULL accuracy > - > > Key: SPARK-45079 > URL: https://issues.apache.org/jira/browse/SPARK-45079 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0, 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The example below demonstrates the issue: > {code:sql} > spark-sql (default)> SELECT percentile_approx(col, array(0.5, 0.4, 0.1), > NULL) FROM VALUES (0), (1), (2), (10) AS tab(col); > [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. > You hit a bug in Spark or the Spark plugins you use. Please, report this bug > to the corresponding communities or vendors, and provide the full stack trace. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45079) percentile_approx() fails with an internal error on NULL accuracy
[ https://issues.apache.org/jira/browse/SPARK-45079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-45079: - Affects Version/s: 3.4.1 > percentile_approx() fails with an internal error on NULL accuracy > - > > Key: SPARK-45079 > URL: https://issues.apache.org/jira/browse/SPARK-45079 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The example below demonstrates the issue: > {code:sql} > spark-sql (default)> SELECT percentile_approx(col, array(0.5, 0.4, 0.1), > NULL) FROM VALUES (0), (1), (2), (10) AS tab(col); > [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. > You hit a bug in Spark or the Spark plugins you use. Please, report this bug > to the corresponding communities or vendors, and provide the full stack trace. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44404) Assign names to the error class _LEGACY_ERROR_TEMP_[1009,1010,1013,1015,1016,1278]
[ https://issues.apache.org/jira/browse/SPARK-44404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762055#comment-17762055 ] Aparna Garg commented on SPARK-44404: - User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/42109 > Assign names to the error class > _LEGACY_ERROR_TEMP_[1009,1010,1013,1015,1016,1278] > -- > > Key: SPARK-44404 > URL: https://issues.apache.org/jira/browse/SPARK-44404 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45079) percentile_approx() fails with an internal error on NULL accuracy
[ https://issues.apache.org/jira/browse/SPARK-45079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-45079: - Description: The example below demonstrates the issue: {code:sql} spark-sql (default)> SELECT percentile_approx(col, array(0.5, 0.4, 0.1), NULL) FROM VALUES (0), (1), (2), (10) AS tab(col); [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace. {code} was: The example below demonstrates the issue: {code:sql} spark-sql (default)> SELECT to_char(x'537061726b2053514c', CAST(NULL AS STRING)); [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace. {code} > percentile_approx() fails with an internal error on NULL accuracy > - > > Key: SPARK-45079 > URL: https://issues.apache.org/jira/browse/SPARK-45079 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The example below demonstrates the issue: > {code:sql} > spark-sql (default)> SELECT percentile_approx(col, array(0.5, 0.4, 0.1), > NULL) FROM VALUES (0), (1), (2), (10) AS tab(col); > [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. > You hit a bug in Spark or the Spark plugins you use. Please, report this bug > to the corresponding communities or vendors, and provide the full stack trace. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45079) percentile_approx() fails with an internal error on NULL accuracy
[ https://issues.apache.org/jira/browse/SPARK-45079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-45079: - Fix Version/s: (was: 4.0.0) > percentile_approx() fails with an internal error on NULL accuracy > - > > Key: SPARK-45079 > URL: https://issues.apache.org/jira/browse/SPARK-45079 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The example below demonstrates the issue: > {code:sql} > spark-sql (default)> SELECT to_char(x'537061726b2053514c', CAST(NULL AS > STRING)); > [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. > You hit a bug in Spark or the Spark plugins you use. Please, report this bug > to the corresponding communities or vendors, and provide the full stack trace. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45079) percentile_approx() fails with an internal error on NULL accuracy
Max Gekk created SPARK-45079: Summary: percentile_approx() fails with an internal error on NULL accuracy Key: SPARK-45079 URL: https://issues.apache.org/jira/browse/SPARK-45079 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Max Gekk Assignee: Max Gekk Fix For: 4.0.0 The example below demonstrates the issue: {code:sql} spark-sql (default)> SELECT to_char(x'537061726b2053514c', CAST(NULL AS STRING)); [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace. {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
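Note that the Created entry above still carries the to_char snippet copied from SPARK-45070; the actual reproducer is the percentile_approx call with a NULL accuracy, as fixed in the later description update. For contrast, a minimal sketch of the same query with a concrete accuracy value, which does not hit the internal error (the result line is illustrative for this 4-row input):
{code:sql}
-- Passing a concrete accuracy instead of NULL succeeds; only the NULL
-- accuracy path trips the internal error described in this ticket.
spark-sql (default)> SELECT percentile_approx(col, array(0.5, 0.4, 0.1), 100) FROM VALUES (0), (1), (2), (10) AS tab(col);
[1,1,0]
{code}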
[jira] [Updated] (SPARK-45078) The ArrayInsert function should apply explicit casting when the element type does not equal the derived component type
[ https://issues.apache.org/jira/browse/SPARK-45078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Tao updated SPARK-45078: Description: Generally speaking, array_insert has the same insert semantics as array_prepend/array_append. However, if we run SQL that casts the element, as below, array_prepend/array_append return the right result, but array_insert fails.
{code:java}
spark-sql (default)> select array_prepend(array(1), cast(2 as tinyint));
[2,1]
Time taken: 0.123 seconds, Fetched 1 row(s) {code}
{code:java}
spark-sql (default)> select array_append(array(1), cast(2 as tinyint));
[1,2]
Time taken: 0.206 seconds, Fetched 1 row(s) {code}
{code:java}
spark-sql (default)> select array_insert(array(1), 2, cast(2 as tinyint));
[DATATYPE_MISMATCH.ARRAY_FUNCTION_DIFF_TYPES] Cannot resolve "array_insert(array(1), 2, CAST(2 AS TINYINT))" due to data type mismatch: Input to `array_insert` should have been "ARRAY" followed by a value with same element type, but it's ["ARRAY", "TINYINT"].; line 1 pos 7;
'Project [unresolvedalias(array_insert(array(1), 2, cast(2 as tinyint)), None)]
+- OneRowRelation {code}
The reported error is clear; however, we should probably apply explicit casting here, because multiset types such as array and map allow operands of the same type family to coexist.

was: Generally speaking, array_insert has the same insert semantics as array_prepend/array_append. However, if we run SQL that casts the element, as below, array_prepend/array_append return the right result, but array_insert fails.
{code:java}
spark-sql (default)> select array_prepend(array(1), cast(2 as tinyint));
[2,1]
Time taken: 0.123 seconds, Fetched 1 row(s) {code}
{code:java}
spark-sql (default)> select array_append(array(1), cast(2 as tinyint));
[1,2]
Time taken: 0.206 seconds, Fetched 1 row(s) {code}
{code:java}
spark-sql (default)> select array_insert(array(1), 2, cast(2 as tinyint));
[DATATYPE_MISMATCH.ARRAY_FUNCTION_DIFF_TYPES] Cannot resolve "array_insert(array(1), 2, CAST(2 AS TINYINT))" due to data type mismatch: Input to `array_insert` should have been "ARRAY" followed by a value with same element type, but it's ["ARRAY", "TINYINT"].; line 1 pos 7;
'Project [unresolvedalias(array_insert(array(1), 2, cast(2 as tinyint)), None)]
+- OneRowRelation {code}
The reported error is clear; however, we should probably apply explicit casting here, because multiset types such as array and map allow operands of the same type family to coexist.

> The ArrayInsert function should apply explicit casting when the element type does not equal the derived component type
> -
>
> Key: SPARK-45078
> URL: https://issues.apache.org/jira/browse/SPARK-45078
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 3.4.1
>Reporter: Ran Tao
>Priority: Major
>
> Generally speaking, array_insert has the same insert semantics as array_prepend/array_append. However, if we run SQL that casts the element, as below, array_prepend/array_append return the right result, but array_insert fails.
> {code:java}
> spark-sql (default)> select array_prepend(array(1), cast(2 as tinyint));
> [2,1]
> Time taken: 0.123 seconds, Fetched 1 row(s) {code}
> {code:java}
> spark-sql (default)> select array_append(array(1), cast(2 as tinyint));
> [1,2]
> Time taken: 0.206 seconds, Fetched 1 row(s) {code}
> {code:java}
> spark-sql (default)> select array_insert(array(1), 2, cast(2 as tinyint));
> [DATATYPE_MISMATCH.ARRAY_FUNCTION_DIFF_TYPES] Cannot resolve "array_insert(array(1), 2, CAST(2 AS TINYINT))" due to data type mismatch: Input to `array_insert` should have been "ARRAY" followed by a value with same element type, but it's ["ARRAY", "TINYINT"].; line 1 pos 7;
> 'Project [unresolvedalias(array_insert(array(1), 2, cast(2 as tinyint)), None)]
> +- OneRowRelation {code}
> The reported error is clear; however, we should probably apply explicit casting here, because multiset types such as array and map allow operands of the same type family to coexist. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
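Until such an implicit cast is in place, a workaround is to cast the inserted element to the array's element type yourself; a minimal sketch:
{code:sql}
-- Workaround sketch: make the element type (int here) match the array's
-- element type explicitly, so no implicit cast is required.
spark-sql (default)> select array_insert(array(1), 2, cast(2 as int));
[1,2]
{code}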
[jira] [Commented] (SPARK-45068) Make function output column name consistent in case
[ https://issues.apache.org/jira/browse/SPARK-45068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762050#comment-17762050 ] Aparna Garg commented on SPARK-45068: - User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/42797 > Make function output column name consistent in case > --- > > Key: SPARK-45068 > URL: https://issues.apache.org/jira/browse/SPARK-45068 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45078) The ArrayInsert function should apply explicit casting when the element type does not equal the derived component type
Ran Tao created SPARK-45078: --- Summary: The ArrayInsert function should apply explicit casting when the element type does not equal the derived component type Key: SPARK-45078 URL: https://issues.apache.org/jira/browse/SPARK-45078 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1 Reporter: Ran Tao Generally speaking, array_insert has the same insert semantics as array_prepend/array_append. However, if we run SQL that casts the element, as below, array_prepend/array_append return the right result, but array_insert fails.
{code:java}
spark-sql (default)> select array_prepend(array(1), cast(2 as tinyint));
[2,1]
Time taken: 0.123 seconds, Fetched 1 row(s) {code}
{code:java}
spark-sql (default)> select array_append(array(1), cast(2 as tinyint));
[1,2]
Time taken: 0.206 seconds, Fetched 1 row(s) {code}
{code:java}
spark-sql (default)> select array_insert(array(1), 2, cast(2 as tinyint));
[DATATYPE_MISMATCH.ARRAY_FUNCTION_DIFF_TYPES] Cannot resolve "array_insert(array(1), 2, CAST(2 AS TINYINT))" due to data type mismatch: Input to `array_insert` should have been "ARRAY" followed by a value with same element type, but it's ["ARRAY", "TINYINT"].; line 1 pos 7;
'Project [unresolvedalias(array_insert(array(1), 2, cast(2 as tinyint)), None)]
+- OneRowRelation {code}
The reported error is clear; however, we should probably apply explicit casting here, because multiset types such as array and map allow operands of the same type family to coexist. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45070) Describe the binary and datetime formats of `to_char`/`to_varchar`
[ https://issues.apache.org/jira/browse/SPARK-45070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762046#comment-17762046 ] Aparna Garg commented on SPARK-45070: - User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/42801 > Describe the binary and datetime formats of `to_char`/`to_varchar` > -- > > Key: SPARK-45070 > URL: https://issues.apache.org/jira/browse/SPARK-45070 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > In the PR, I propose to document the recent changes related to the `format` > of the `to_char`/`to_varchar` functions: > 1. binary formats added by https://github.com/apache/spark/pull/42632 > 2. datetime formats introduced by https://github.com/apache/spark/pull/42534 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
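For context, the binary formats let to_char render a BINARY value as a string. A hedged sketch, assuming the 'utf-8' format name from the linked PR:
{code:sql}
-- to_char on BINARY input; x'537061726b2053514c' is the UTF-8 encoding of
-- the string 'Spark SQL'.
spark-sql (default)> SELECT to_char(x'537061726b2053514c', 'utf-8');
Spark SQL
{code}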
[jira] [Commented] (SPARK-45022) Provide context for dataset API errors
[ https://issues.apache.org/jira/browse/SPARK-45022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762043#comment-17762043 ] Aparna Garg commented on SPARK-45022: - User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/42740 > Provide context for dataset API errors > -- > > Key: SPARK-45022 > URL: https://issues.apache.org/jira/browse/SPARK-45022 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Peter Toth >Priority: Major > > SQL failures already provide nice error context when there is a failure: > {noformat} > org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. > Use `try_divide` to tolerate divisor being 0 and return NULL instead. If > necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. > == SQL(line 1, position 1) == > a / b > ^ > at > org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201) > at > org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala) > ... > {noformat} > We could add a similar user friendly error context to Dataset APIs. > E.g. consider the following Spark app SimpleApp.scala: > {noformat} >1 import org.apache.spark.sql.SparkSession >2 import org.apache.spark.sql.functions._ >3 >4 object SimpleApp { >5def main(args: Array[String]) { >6 val spark = SparkSession.builder.appName("Simple > Application").config("spark.sql.ansi.enabled", true).getOrCreate() >7 import spark.implicits._ >8 >9 val c = col("a") / col("b") > 10 > 11 Seq((1, 0)).toDF("a", "b").select(c).show() > 12 > 13 spark.stop() > 14} > 15 } > {noformat} > then the error context could be: > {noformat} > Exception in thread "main" org.apache.spark.SparkArithmeticException: > [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being > 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to > "false" to bypass this error. > == Dataset == > "div" was called from SimpleApp$.main(SimpleApp.scala:9) > at > org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201) > at > org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:672 > ... > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45022) Provide context for dataset API errors
[ https://issues.apache.org/jira/browse/SPARK-45022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762041#comment-17762041 ] Aparna Garg commented on SPARK-45022: - User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/42816 > Provide context for dataset API errors > -- > > Key: SPARK-45022 > URL: https://issues.apache.org/jira/browse/SPARK-45022 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Peter Toth >Priority: Major > > SQL failures already provide nice error context when there is a failure: > {noformat} > org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. > Use `try_divide` to tolerate divisor being 0 and return NULL instead. If > necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. > == SQL(line 1, position 1) == > a / b > ^ > at > org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201) > at > org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala) > ... > {noformat} > We could add a similar user friendly error context to Dataset APIs. > E.g. consider the following Spark app SimpleApp.scala: > {noformat} >1 import org.apache.spark.sql.SparkSession >2 import org.apache.spark.sql.functions._ >3 >4 object SimpleApp { >5def main(args: Array[String]) { >6 val spark = SparkSession.builder.appName("Simple > Application").config("spark.sql.ansi.enabled", true).getOrCreate() >7 import spark.implicits._ >8 >9 val c = col("a") / col("b") > 10 > 11 Seq((1, 0)).toDF("a", "b").select(c).show() > 12 > 13 spark.stop() > 14} > 15 } > {noformat} > then the error context could be: > {noformat} > Exception in thread "main" org.apache.spark.SparkArithmeticException: > [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being > 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to > "false" to bypass this error. > == Dataset == > "div" was called from SimpleApp$.main(SimpleApp.scala:9) > at > org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201) > at > org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:672 > ... > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45077) Upgrade dagre-d3.js from 0.4.3 to 0.6.4
[ https://issues.apache.org/jira/browse/SPARK-45077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-45077: - Labels: pat (was: ) > Upgrade dagre-d3.js from 0.4.3 to 0.6.4 > --- > > Key: SPARK-45077 > URL: https://issues.apache.org/jira/browse/SPARK-45077 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pat > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45077) Upgrade dagre-d3.js from 0.4.3 to 0.6.4
Kent Yao created SPARK-45077: Summary: Upgrade dagre-d3.js from 0.4.3 to 0.6.4 Key: SPARK-45077 URL: https://issues.apache.org/jira/browse/SPARK-45077 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45074) DataFrame.{sort, sortWithinPartitions} support column ordinals
[ https://issues.apache.org/jira/browse/SPARK-45074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45074. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42809 [https://github.com/apache/spark/pull/42809] > DataFrame.{sort, sortWithinPartitions} support column ordinals > -- > > Key: SPARK-45074 > URL: https://issues.apache.org/jira/browse/SPARK-45074 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
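The new DataFrame behavior presumably mirrors the column-ordinal semantics SQL already has in ORDER BY, where an integer literal refers to a select-list position; a SQL illustration of those semantics:
{code:sql}
-- ORDER BY 1 sorts by the first projected column (1-based ordinal);
-- DataFrame.sort with an integer ordinal is meant to match this behavior.
spark-sql (default)> SELECT * FROM VALUES (3, 'c'), (1, 'a') AS tab(id, name) ORDER BY 1;
1	a
3	c
{code}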
[jira] [Assigned] (SPARK-45074) DataFrame.{sort, sortWithinPartitions} support column ordinals
[ https://issues.apache.org/jira/browse/SPARK-45074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45074: - Assignee: Ruifeng Zheng > DataFrame.{sort, sortWithinPartitions} support column ordinals > -- > > Key: SPARK-45074 > URL: https://issues.apache.org/jira/browse/SPARK-45074 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45076) Switch to built-in repeat function
[ https://issues.apache.org/jira/browse/SPARK-45076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-45076. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42812 [https://github.com/apache/spark/pull/42812] > Switch to built-in repeat function > -- > > Key: SPARK-45076 > URL: https://issues.apache.org/jira/browse/SPARK-45076 > Project: Spark > Issue Type: Improvement > Components: Connect, Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
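The built-in function referred to here is the standard string repeat; per the ticket title, the pandas-on-Spark code now delegates to it instead of assembling the expression by hand. A quick illustration of its semantics:
{code:sql}
-- Built-in repeat: concatenates the input string with itself n times.
spark-sql (default)> SELECT repeat('ab', 3);
ababab
{code}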
[jira] [Assigned] (SPARK-45076) Switch to built-in repeat function
[ https://issues.apache.org/jira/browse/SPARK-45076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-45076: - Assignee: Ruifeng Zheng > Switch to built-in repeat function > -- > > Key: SPARK-45076 > URL: https://issues.apache.org/jira/browse/SPARK-45076 > Project: Spark > Issue Type: Improvement > Components: Connect, Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org