[jira] [Assigned] (SPARK-40285) Simplify the roundTo[Numeric] for Decimal
[ https://issues.apache.org/jira/browse/SPARK-40285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40285: Assignee: Apache Spark > Simplify the roundTo[Numeric] for Decimal > - > > Key: SPARK-40285 > URL: https://issues.apache.org/jira/browse/SPARK-40285 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > Spark Decimal has a lot of methods named roundTo*. > Except for roundToLong, the rest are redundant. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40285) Simplify the roundTo[Numeric] for Decimal
[ https://issues.apache.org/jira/browse/SPARK-40285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40285: Assignee: (was: Apache Spark) > Simplify the roundTo[Numeric] for Decimal > - > > Key: SPARK-40285 > URL: https://issues.apache.org/jira/browse/SPARK-40285 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Spark Decimal has a lot of methods named roundTo*. > Except for roundToLong, the rest are redundant.
[jira] [Commented] (SPARK-40285) Simplify the roundTo[Numeric] for Decimal
[ https://issues.apache.org/jira/browse/SPARK-40285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598204#comment-17598204 ] Apache Spark commented on SPARK-40285: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/37736 > Simplify the roundTo[Numeric] for Decimal > - > > Key: SPARK-40285 > URL: https://issues.apache.org/jira/browse/SPARK-40285 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Spark Decimal has a lot of methods named roundTo*. > Except for roundToLong, the rest are redundant.
[jira] [Updated] (SPARK-40284) spark concurrent overwrite mode writes data to files in HDFS format, all request data write success
[ https://issues.apache.org/jira/browse/SPARK-40284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu updated SPARK-40284: Description: We use Spark as a service. The same Spark service needs to handle multiple requests, and I have a problem with this: when multiple requests overwrite the same directory at the same time, the results of both overwrite requests may be written successfully. I think this does not meet the definition of an overwrite. First I ran write SQL1, then write SQL2, and found that both sets of data were written in the end, which seems unreasonable. {code:java} sparkSession.udf.register("sleep", (time: Long) => Thread.sleep(time)) -- write sql1 sparkSession.sql("select 1 as id, sleep(4) as time").write.mode(SaveMode.Overwrite).parquet("path") -- write sql2 sparkSession.sql("select 2 as id, 1 as time").write.mode(SaveMode.Overwrite).parquet("path") {code} Reading the Spark source, I saw that all of this logic is in the InsertIntoHadoopFsRelationCommand class. When the target directory already exists, Spark deletes it directly and writes to the _temporary directory it creates. However, when multiple requests write concurrently, the data all gets appended; for the write SQL above, the following sequence occurs: 1. Write SQL1 executes; Spark creates the _temporary directory for SQL1 and continues. 2. Write SQL2 executes; Spark deletes the entire target directory and creates its own _temporary directory. 3. SQL2 writes its data. 4. SQL1 completes its computation; the corresponding _temporary/0/attempt_id directory no longer exists, so the request fails. The task is retried, but the _temporary directory is not deleted on retry. Therefore, the execution result of SQL1 is appended to the target directory. Given this process, could Spark do a directory check before the write task, or use some other way to avoid this kind of problem? was: We use Spark as a service.
The same Spark service needs to handle multiple requests, and I have a problem with this: when multiple requests overwrite the same directory at the same time, the results of both overwrite requests may be written successfully. I think this does not meet the definition of an overwrite. First I ran write SQL1, then write SQL2, and found that both sets of data were written in the end, which seems unreasonable. {code:java} sparkSession.udf.register("sleep", (time: Long) => Thread.sleep(time)) -- write sql1 sparkSession.sql("select 1 as id, sleep(4) as time").write.mode(SaveMode.Overwrite).parquet("path") -- write sql2 sparkSession.sql("select 2 as id, 1 as time").write.mode(SaveMode.Overwrite).parquet("path") {code} Reading the Spark source, I saw that all of this logic is in the InsertIntoHadoopFsRelationCommand class. When the target directory already exists, Spark deletes it directly and writes to the _temporary directory it creates. However, when multiple requests write concurrently, the data all gets appended; for the write SQL above, the following sequence occurs: 1. Write SQL1 executes and creates the _temporary directory. 2. Write SQL2 executes, deletes the entire target directory, and creates its own _temporary directory. 3. SQL2 writes its data. 4. SQL1 completes; the corresponding _temporary/0/attempt_id directory does not exist, so it fails. The task is retried, but the _temporary directory is not deleted on retry.
Therefore, the execution result of SQL1 is appended to the target directory. Given this process, could a directory check be done before the write task, or some other way be used to avoid this kind of problem?
[jira] [Updated] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew updated SPARK-40286: - Description: Hello, I'm using Spark to [load data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into a Hive table through PySpark, and when I load data from a path in Amazon S3, the original file is wiped from the directory. The file is found and populates the table with data. I also tried adding the `LOCAL` clause, but that throws an error when looking for the file. The documentation doesn't explicitly state that this is the intended behavior. Thanks in advance! {code:java} spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile") spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE src"){code} was: Hello, I'm using Spark to load data into a Hive table through PySpark, and when I load data from a path in Amazon S3, the original file is wiped from the directory. The file is found and populates the table with data. I also tried adding the `LOCAL` clause, but that throws an error when looking for the file. The documentation doesn't explicitly state that this is the intended behavior. Thanks in advance! {code:java} spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile") spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE src"){code} > Load Data from S3 deletes data source file > -- > > Key: SPARK-40286 > URL: https://issues.apache.org/jira/browse/SPARK-40286 > Project: Spark > Issue Type: Question > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Drew >Priority: Major > > Hello, > I'm using Spark to [load > data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into > a Hive table through PySpark, and when I load data from a path in Amazon S3, > the original file is wiped from the directory.
The file is found and > is populating the table with data. I also tried adding the `LOCAL` clause, but > that throws an error when looking for the file. The documentation doesn't explicitly > state that this is the intended behavior. > Thanks in advance! > {code:java} > spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile") > spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE > src"){code}
[jira] [Updated] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew updated SPARK-40286: - Priority: Major (was: Trivial) > Load Data from S3 deletes data source file > -- > > Key: SPARK-40286 > URL: https://issues.apache.org/jira/browse/SPARK-40286 > Project: Spark > Issue Type: Question > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Drew >Priority: Major > > Hello, > I'm using Spark to load data into a Hive table through PySpark, and when I > load data from a path in Amazon S3, the original file is wiped from > the directory. The file is found and populates the table with data. I > also tried adding the `LOCAL` clause, but that throws an error when looking > for the file. The documentation doesn't explicitly > state that this is the intended behavior. > Thanks in advance! > {code:java} > spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile") > spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE > src"){code}
[jira] [Created] (SPARK-40286) Load Data from S3 deletes data source file
Drew created SPARK-40286: Summary: Load Data from S3 deletes data source file Key: SPARK-40286 URL: https://issues.apache.org/jira/browse/SPARK-40286 Project: Spark Issue Type: Question Components: Documentation Affects Versions: 3.2.1 Reporter: Drew Hello, I'm using Spark to load data into a Hive table through PySpark, and when I load data from a path in Amazon S3, the original file is wiped from the directory. The file is found and populates the table with data. I also tried adding the `LOCAL` clause, but that throws an error when looking for the file. The documentation doesn't explicitly state that this is the intended behavior. Thanks in advance! {code:java} spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile") spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE src"){code}
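For context, Hive-style LOAD DATA INPATH is a *move* operation, not a copy: the source file is relocated into the table's storage location, which is why the original S3 object disappears. The sketch below simulates that semantics with local files and shows the usual copy-to-staging workaround. Everything here is illustrative (plain files stand in for S3 objects; `load_data_inpath` and the `.staging` suffix are made-up names, not Spark or Hive APIs).

```python
import os
import shutil
import tempfile

def load_data_inpath(src: str, table_dir: str) -> str:
    """Move src into table_dir, mimicking LOAD DATA INPATH's move semantics."""
    dest = os.path.join(table_dir, os.path.basename(src))
    shutil.move(src, dest)  # the source path is emptied, as reported
    return dest

def load_data_preserving_source(src: str, table_dir: str) -> str:
    """Workaround: copy to a staging path first, then load the copy."""
    staging = src + ".staging"
    shutil.copy2(src, staging)  # original stays intact
    return load_data_inpath(staging, table_dir)

root = tempfile.mkdtemp()
src = os.path.join(root, "kv1.txt")
table_dir = os.path.join(root, "warehouse")
os.makedirs(table_dir)
with open(src, "w") as f:
    f.write("1\tvalue1\n")

loaded = load_data_preserving_source(src, table_dir)
print(os.path.exists(src), os.path.exists(loaded))  # True True
```

With a plain `load_data_inpath(src, table_dir)` call, `os.path.exists(src)` would be False afterwards, matching the behavior described in the ticket.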
[jira] [Commented] (SPARK-40271) Support list type for pyspark.sql.functions.lit
[ https://issues.apache.org/jira/browse/SPARK-40271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598183#comment-17598183 ] Apache Spark commented on SPARK-40271: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/37735 > Support list type for pyspark.sql.functions.lit > --- > > Key: SPARK-40271 > URL: https://issues.apache.org/jira/browse/SPARK-40271 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > Currently, `pyspark.sql.functions.lit` doesn't support the Python list type > as below: > {code:python} > >>> df = spark.range(3).withColumn("c", lit([1,2,3])) > Traceback (most recent call last): > ... > : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] > The feature is not supported: Literal for '[1, 2, 3]' of class > java.util.ArrayList. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302) > at > org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100) > at org.apache.spark.sql.functions$.lit(functions.scala:125) > at org.apache.spark.sql.functions.lit(functions.scala) > at > java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:577) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) > at py4j.Gateway.invoke(Gateway.java:282) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at > py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) > at py4j.ClientServerConnection.run(ClientServerConnection.java:106) > at java.base/java.lang.Thread.run(Thread.java:833) > 
{code} > We should make it supported.
[jira] [Commented] (SPARK-40271) Support list type for pyspark.sql.functions.lit
[ https://issues.apache.org/jira/browse/SPARK-40271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598182#comment-17598182 ] Apache Spark commented on SPARK-40271: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/37735 > Support list type for pyspark.sql.functions.lit > --- > > Key: SPARK-40271 > URL: https://issues.apache.org/jira/browse/SPARK-40271 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > Currently, `pyspark.sql.functions.lit` doesn't support the Python list type > as below: > {code:python} > >>> df = spark.range(3).withColumn("c", lit([1,2,3])) > Traceback (most recent call last): > ... > : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] > The feature is not supported: Literal for '[1, 2, 3]' of class > java.util.ArrayList. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302) > at > org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100) > at org.apache.spark.sql.functions$.lit(functions.scala:125) > at org.apache.spark.sql.functions.lit(functions.scala) > at > java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:577) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) > at py4j.Gateway.invoke(Gateway.java:282) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at > py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) > at py4j.ClientServerConnection.run(ClientServerConnection.java:106) > at java.base/java.lang.Thread.run(Thread.java:833) > 
{code} > We should make it supported.
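Until `lit()` accepts Python lists directly, the common rewrite in PySpark is to build an array of scalar literals, e.g. `F.array(*[F.lit(x) for x in [1, 2, 3]])`. The helper below is a pure-Python sketch of the same rewrite at the SQL-expression level; `list_to_array_expr` is an illustrative name, not a Spark API.

```python
def list_to_array_expr(values):
    """Render a Python list as a SQL array(...) expression of scalar literals."""
    def scalar(v):
        if isinstance(v, str):
            return "'" + v.replace("'", "''") + "'"  # escape embedded quotes
        if isinstance(v, bool):
            return "TRUE" if v else "FALSE"
        return repr(v)  # ints and floats render as-is
    return "array(" + ", ".join(scalar(v) for v in values) + ")"

print(list_to_array_expr([1, 2, 3]))     # array(1, 2, 3)
print(list_to_array_expr(["a", "b'c"]))  # array('a', 'b''c')
```

The resulting string can be used with `F.expr(...)`, which is why the `F.array`/`F.lit` rewrite is the usual interim answer to the unsupported `java.util.ArrayList` literal in the traceback above.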
[jira] [Assigned] (SPARK-39896) The structural integrity of the plan is broken after UnwrapCastInBinaryComparison
[ https://issues.apache.org/jira/browse/SPARK-39896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-39896: --- Assignee: Fu Chen > The structural integrity of the plan is broken after > UnwrapCastInBinaryComparison > - > > Key: SPARK-39896 > URL: https://issues.apache.org/jira/browse/SPARK-39896 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Fu Chen >Priority: Major > Fix For: 3.4.0, 3.3.1 > > > {code:scala} > sql("create table t1(a decimal(3, 0)) using parquet") > sql("insert into t1 values(100), (10), (1)") > sql("select * from t1 where a in(10, 10, 0, 1.00)").show > {code} > {noformat} > After applying rule > org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch > Operator Optimization before Inferring Filters, the structural integrity of > the plan is broken. > java.lang.RuntimeException: After applying rule > org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch > Operator Optimization before Inferring Filters, the structural integrity of > the plan is broken. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229) > {noformat}
[jira] [Resolved] (SPARK-39896) The structural integrity of the plan is broken after UnwrapCastInBinaryComparison
[ https://issues.apache.org/jira/browse/SPARK-39896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-39896. - Fix Version/s: 3.3.1 3.4.0 Resolution: Fixed Issue resolved by pull request 37439 [https://github.com/apache/spark/pull/37439] > The structural integrity of the plan is broken after > UnwrapCastInBinaryComparison > - > > Key: SPARK-39896 > URL: https://issues.apache.org/jira/browse/SPARK-39896 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > Fix For: 3.3.1, 3.4.0 > > > {code:scala} > sql("create table t1(a decimal(3, 0)) using parquet") > sql("insert into t1 values(100), (10), (1)") > sql("select * from t1 where a in(10, 10, 0, 1.00)").show > {code} > {noformat} > After applying rule > org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch > Operator Optimization before Inferring Filters, the structural integrity of > the plan is broken. > java.lang.RuntimeException: After applying rule > org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch > Operator Optimization before Inferring Filters, the structural integrity of > the plan is broken. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229) > {noformat}
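One way to see why this reproduction is delicate: the column is decimal(3, 0), but the IN-list mixes literals of different scales (10, 0, 1.00). When the optimizer unwraps the cast on the column, each literal must be checked for exact representability at the column's scale. The snippet below is a simplified model of that per-literal check using Python's decimal module; it is an illustration of the type-narrowing issue, not Spark's actual implementation.

```python
from decimal import Decimal

def fits_scale(lit: Decimal, scale: int) -> bool:
    """True if lit can be rescaled to `scale` without losing digits."""
    return lit == lit.quantize(Decimal(1).scaleb(-scale))

# Column type is decimal(3, 0); the IN-list from the bug report:
column_scale = 0
in_list = [Decimal("10"), Decimal("10"), Decimal("0"), Decimal("1.00")]

# 1.00 narrows safely to scale 0 (it equals 1 exactly)...
print([fits_scale(v, column_scale) for v in in_list])  # [True, True, True, True]
# ...but a literal like 1.25 would not, and must not be naively pushed down.
print(fits_scale(Decimal("1.25"), column_scale))       # False
```

A literal that fails this check cannot simply be rewritten to the column's type, which is the kind of edge the rule has to preserve to keep the plan's structural integrity intact.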
[jira] [Created] (SPARK-40285) Simplify the roundTo[Numeric] for Decimal
jiaan.geng created SPARK-40285: -- Summary: Simplify the roundTo[Numeric] for Decimal Key: SPARK-40285 URL: https://issues.apache.org/jira/browse/SPARK-40285 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: jiaan.geng Spark Decimal has a lot of methods named roundTo*. Except for roundToLong, the rest are redundant.
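A hedged sketch of the simplification the ticket proposes: if roundToLong is the one primitive, the narrower variants (roundToInt, roundToShort, roundToByte) reduce to "round to long, then range-check for the narrower type". The names, rounding mode, and overflow handling below are illustrative, not Spark's actual Decimal method signatures.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_to_long(d: Decimal) -> int:
    """The single rounding primitive: round to an integral value."""
    return int(d.quantize(Decimal(1), rounding=ROUND_HALF_UP))

def round_to_integral(d: Decimal, min_val: int, max_val: int) -> int:
    """Round via the primitive, then check the narrower type's range."""
    v = round_to_long(d)
    if not (min_val <= v <= max_val):
        raise OverflowError(f"{d} does not fit in [{min_val}, {max_val}]")
    return v

round_to_int   = lambda d: round_to_integral(d, -2**31, 2**31 - 1)
round_to_short = lambda d: round_to_integral(d, -2**15, 2**15 - 1)
round_to_byte  = lambda d: round_to_integral(d, -2**7,  2**7 - 1)

print(round_to_byte(Decimal("99.5")))  # 100
print(round_to_int(Decimal("3.14")))   # 3
```

Collapsing the variants this way leaves one place to maintain the rounding logic, with only the range bounds differing per target type.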
[jira] [Updated] (SPARK-40284) spark concurrent overwrite mode writes data to files in HDFS format, all request data write success
[ https://issues.apache.org/jira/browse/SPARK-40284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu updated SPARK-40284: Description: We use Spark as a service. The same Spark service needs to handle multiple requests, and I have a problem with this: when multiple requests overwrite the same directory at the same time, the results of both overwrite requests may be written successfully. I think this does not meet the definition of an overwrite. First I ran write SQL1, then write SQL2, and found that both sets of data were written in the end, which seems unreasonable. {code:java} sparkSession.udf.register("sleep", (time: Long) => Thread.sleep(time)) -- write sql1 sparkSession.sql("select 1 as id, sleep(4) as time").write.mode(SaveMode.Overwrite).parquet("path") -- write sql2 sparkSession.sql("select 2 as id, 1 as time").write.mode(SaveMode.Overwrite).parquet("path") {code} Reading the Spark source, I saw that all of this logic is in the InsertIntoHadoopFsRelationCommand class. When the target directory already exists, Spark deletes it directly and writes to the _temporary directory it creates. However, when multiple requests write concurrently, the data all gets appended; for the write SQL above, the following sequence occurs: 1. Write SQL1 executes and creates the _temporary directory. 2. Write SQL2 executes, deletes the entire target directory, and creates its own _temporary directory. 3. SQL2 writes its data. 4. SQL1 completes; the corresponding _temporary/0/attempt_id directory does not exist, so it fails. The task is retried, but the _temporary directory is not deleted on retry. Therefore, the execution result of SQL1 is appended to the target directory. Given this process, could a directory check be done before the write task, or some other way be used to avoid this kind of problem? was: We use Spark as a service.
The same Spark service needs to handle multiple requests, and I have a problem with this: when multiple requests overwrite the same directory at the same time, the results of both overwrite requests may be written successfully. I think this does not meet the definition of an overwrite. First I ran write SQL1, then write SQL2, and found that both sets of data were written in the end, which seems unreasonable. {code:java} sparkSession.udf.register("sleep", (time: Long) => Thread.sleep(time)) -- write sql1 sparkSession.sql("select 1 as id, sleep(4) as time").write.mode(SaveMode.Overwrite).parquet("path") -- write sql2 sparkSession.sql("select 2 as id, 1 as time").write.mode(SaveMode.Overwrite).parquet("path") {code} Reading the Spark source, I saw that all of this logic is in the InsertIntoHadoopFsRelationCommand class. When the target directory already exists, Spark deletes it directly and writes to the _temporary directory it creates. However, when multiple requests write concurrently, the data all gets appended; for the write SQL above, the following sequence occurs: 1. Write SQL1 executes and creates the _temporary directory. 2. Write SQL2 executes, deletes the entire target directory, and creates its own _temporary directory. 3. SQL2 writes its data. 4. SQL1 completes; the corresponding _temporary/0/attempt_id directory does not exist, so it fails. The task is retried, but the directory is not deleted on retry.
Therefore, the execution result of SQL1 is appended to the target directory. Given this process, could a directory check be done before the write task, or some other way be used to avoid this kind of problem?
[jira] [Created] (SPARK-40284) spark concurrent overwrite mode writes data to files in HDFS format, all request data write success
Liu created SPARK-40284: --- Summary: spark concurrent overwrite mode writes data to files in HDFS format, all request data write success Key: SPARK-40284 URL: https://issues.apache.org/jira/browse/SPARK-40284 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 3.0.1 Reporter: Liu We use Spark as a service. The same Spark service needs to handle multiple requests, and I have a problem with this: when multiple requests overwrite the same directory at the same time, the results of both overwrite requests may be written successfully. I think this does not meet the definition of an overwrite. First I ran write SQL1, then write SQL2, and found that both sets of data were written in the end, which seems unreasonable. {code:java} sparkSession.udf.register("sleep", (time: Long) => Thread.sleep(time)) -- write sql1 sparkSession.sql("select 1 as id, sleep(4) as time").write.mode(SaveMode.Overwrite).parquet("path") -- write sql2 sparkSession.sql("select 2 as id, 1 as time").write.mode(SaveMode.Overwrite).parquet("path") {code} Reading the Spark source, I saw that all of this logic is in the InsertIntoHadoopFsRelationCommand class. When the target directory already exists, Spark deletes it directly and writes to the _temporary directory it creates. However, when multiple requests write concurrently, the data all gets appended; for the write SQL above, the following sequence occurs: 1. Write SQL1 executes and creates the _temporary directory. 2. Write SQL2 executes, deletes the entire target directory, and creates its own _temporary directory. 3. SQL2 writes its data. 4. SQL1 completes; the corresponding _temporary/0/attempt_id directory does not exist, so it fails. The task is retried, but the directory is not deleted on retry.
Therefore, the execution result of SQL1 is appended to the target directory. Given this process, could a directory check be done before the write task, or some other way be used to avoid this kind of problem?
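The race described above comes from both jobs sharing a _temporary directory under the target. One pattern that avoids it is to give each job its own staging directory and publish with a single rename. The sketch below is a minimal local illustration of that idea, not Spark's actual committer protocol; `overwrite_atomically` and the `__staging__` suffix are made-up names.

```python
import os
import shutil
import tempfile
import uuid

def overwrite_atomically(target: str, rows) -> None:
    """Write to a job-unique staging dir, then publish via rename.

    Because each job stages under its own uuid-suffixed directory, a
    concurrent overwrite can never append into another job's output;
    the final rename is atomic on POSIX filesystems (the delete+rename
    pair is still two steps, so this is a simplification).
    """
    staging = f"{target}.__staging__{uuid.uuid4().hex}"  # unique per job
    os.makedirs(staging)
    with open(os.path.join(staging, "part-00000"), "w") as f:
        for row in rows:
            f.write(f"{row}\n")
    shutil.rmtree(target, ignore_errors=True)  # drop any previous output
    os.rename(staging, target)                 # publish

root = tempfile.mkdtemp()
target = os.path.join(root, "path")
overwrite_atomically(target, [1])
overwrite_atomically(target, [2])  # the later overwrite wins cleanly
with open(os.path.join(target, "part-00000")) as f:
    print(f.read().strip())  # 2
```

A retried task in this scheme restarts in a fresh staging directory, so a failed first attempt cannot leave data that later gets appended into the committed output.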
[jira] [Comment Edited] (SPARK-33598) Support Java Class with circular references
[ https://issues.apache.org/jira/browse/SPARK-33598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598165#comment-17598165 ] Santokh Singh edited comment on SPARK-33598 at 8/31/22 4:38 AM: this revision changed Encoders.bean(PersonBean.class) to Encoders.bean(person.class); the comment body is otherwise the same as in the [Commented] notification below. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33598) Support Java Class with circular references
[ https://issues.apache.org/jira/browse/SPARK-33598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598165#comment-17598165 ] Santokh Singh commented on SPARK-33598:
---
*Facing the same exception, Spark version 3.2.2.*
*Using the Avro Maven plugin to generate a Java class from the Avro schema below.*

{color:#ff0000}*Exception in thread "main" java.lang.UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class org.apache.avro.Schema*{color}

*AVRO SCHEMA*
{code:json}
[
  {
    "type": "record",
    "namespace": "kafka.avro.schema.nested",
    "name": "Address",
    "fields": [
      { "name": "streetaddress", "type": "string" },
      { "name": "city", "type": "string" }
    ]
  },
  {
    "type": "record",
    "name": "person",
    "namespace": "kafka.avro.schema.nested",
    "fields": [
      { "name": "firstname", "type": "string" },
      { "name": "lastname", "type": "string" },
      { "name": "address", "type": ["null", "Address"] }
    ]
  }
]
{code}

*CODE* (generic type parameters below are restored from the HTML-stripped original and may differ in detail)
{code:java}
import kafka.avro.schema.nested.person;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.streaming.Trigger;
import za.co.absa.abris.avro.functions.*;
import za.co.absa.abris.config.AbrisConfig;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeoutException;

public class KafkaAvroStreamingAbris {
    public static void main(String[] args) throws IOException, StreamingQueryException, TimeoutException {
        SparkSession spark = SparkSession.builder()
                .appName("AvroApp")
                .master("local")
                .getOrCreate();

        Dataset<Row> df = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "127.0.0.1:9092")
                .option("subscribe", "person")
                .option("startingOffsets", "earliest")
                .load();

        Dataset<Row> df2 = df.select(za.co.absa.abris.avro.functions.from_avro(
                org.apache.spark.sql.functions.col("value"),
                za.co.absa.abris.config.AbrisConfig
                        .fromConfluentAvro().downloadReaderSchemaByLatestVersion()
                        .andTopicNameStrategy("person", false)
                        .usingSchemaRegistry("http://localhost:8089")).as("data"));

        Dataset<PersonBean> df3 = df2.map((MapFunction<Row, PersonBean>) row -> {
            String rr = row.toString();
            return null;
        }, Encoders.bean(PersonBean.class));

        StreamingQuery streamingQuery = df2
                .writeStream()
                .queryName("Kafka-Write")
                .format("console")
                .outputMode(OutputMode.Append())
                .trigger(Trigger.ProcessingTime(Long.parseLong("2000")))
                .start();

        streamingQuery.awaitTermination();
    }
}
{code}

> Support Java Class with circular references
> ---
>
> Key: SPARK-33598
> URL: https://issues.apache.org/jira/browse/SPARK-33598
> Project: Spark
> Issue Type: Improvement
> Components: Java API
> Affects Versions: 3.1.2
> Reporter: jacklzg
> Priority: Minor
>
> If the target Java data class has a circular reference, Spark fails fast when creating the Dataset or running Encoders.
>
> For example, a protobuf class holds a reference to its Descriptor, so there is no way to build a Dataset from the protobuf class.
> From this line:
> {color:#7a869a}Encoders.bean(ProtoBuffOuterClass.ProtoBuff.class);{color}
>
> it throws immediately:
>
> {quote}Exception in thread "main" java.lang.UnsupportedOperationException:
> Cannot have circular references in bean class, but got the circular reference
> of class class com.google.protobuf.Descriptors$Descriptor
> {quote}
>
> Can we add a parameter, for example,
>
> {code:java}
> Encoders.bean(Class<T> clas, List<String> fieldsToIgnore);{code}
>
> or
>
> {code:java}
> Encoders.bean(Class<T> clas, boolean skipCircularRefField);{code}
>
> which would, instead of throwing an exception at
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L556],
> simply skip the field?
>
> {code:java}
> if (seenTypeSet.contains(t)) {
>   if (skipCircularRefField)
>     println("field skipped") // just skip this field
>   else throw new UnsupportedOperationException(
>     s"cannot have circular references in class, but got the circular reference of class $t")
> }
> {code}
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
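The guard proposed above is easy to model outside Spark. Below is a minimal, illustrative Python sketch of the seenTypeSet idea with a skipCircularRefField-style flag; the SCHEMA table, the expand function, and the "<skipped>" marker are hypothetical names for this sketch, not Spark APIs:

```python
# Hypothetical field-expansion walk: SCHEMA maps a "class" name to its fields.
# "Descriptor" refers to itself, mirroring com.google.protobuf.Descriptors$Descriptor.
SCHEMA = {
    "Descriptor": {"name": "string", "containing_type": "Descriptor"},
}

def expand(type_name, seen=frozenset(), skip_circular_ref_field=False):
    if type_name not in SCHEMA:        # primitive: nothing to recurse into
        return type_name
    if type_name in seen:              # circular reference detected
        if skip_circular_ref_field:
            return "<skipped>"         # skip the field instead of failing fast
        raise ValueError(
            "cannot have circular references in class, "
            f"but got the circular reference of class {type_name}")
    return {field: expand(t, seen | {type_name}, skip_circular_ref_field)
            for field, t in SCHEMA[type_name].items()}
```

With the flag set, the walk replaces the circular field with a marker instead of raising; without it, the walk raises, which is the fail-fast behavior the ticket asks to make optional.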
[jira] [Comment Edited] (SPARK-33598) Support Java Class with circular references
[ https://issues.apache.org/jira/browse/SPARK-33598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598165#comment-17598165 ] Santokh Singh edited comment on SPARK-33598 at 8/31/22 4:27 AM: this revision only adjusted formatting; the comment body is the same as in the [Commented] notification above. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39971) ANALYZE TABLE makes some queries run forever
[ https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felipe updated SPARK-39971: --- Attachment: explainMode-cost.zip > ANALYZE TABLE makes some queries run forever > > > Key: SPARK-39971 > URL: https://issues.apache.org/jira/browse/SPARK-39971 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.2.2 >Reporter: Felipe >Priority: Major > Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, > 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT > ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT > ForAllColumns-joinreorder-enabled.txt, > 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, > 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, > explainMode-cost.zip > > > I'm using TPCDS to run benchmarks, and after running ANALYZE TABLE (without > the FOR ALL COLUMNS) some queries became really slow. For example query24 - > [https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql] takes > between 10~15min before running the ANALYZE TABLE. > After running ANALYZE TABLE I waited 24h before cancelling the execution. > If I disable spark.sql.cbo.joinReorder.enabled or > spark.sql.cbo.enabled it becomes fast again. > It seems something in join reordering is not working well when we have table > stats, but not column stats. > Rows Count: > store_sales - 2879966589 > store_returns - 288009578 > store - 1002 > item - 30 > customer - 1200 > customer_address - 600 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39971) ANALYZE TABLE makes some queries run forever
[ https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felipe updated SPARK-39971: --- Attachment: (was: explainMode-cost.zip) > ANALYZE TABLE makes some queries run forever > > > Key: SPARK-39971 > URL: https://issues.apache.org/jira/browse/SPARK-39971 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.2.2 >Reporter: Felipe >Priority: Major > Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, > 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT > ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT > ForAllColumns-joinreorder-enabled.txt, > 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, > 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, > explainMode-cost.zip > > > I'm using TPCDS to run benchmarks, and after running ANALYZE TABLE (without > the FOR ALL COLUMNS) some queries became really slow. For example query24 - > [https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql] takes > between 10~15min before running the ANALYZE TABLE. > After running ANALYZE TABLE I waited 24h before cancelling the execution. > If I disable spark.sql.cbo.joinReorder.enabled or > spark.sql.cbo.enabled it becomes fast again. > It seems something in join reordering is not working well when we have table > stats, but not column stats. > Rows Count: > store_sales - 2879966589 > store_returns - 288009578 > store - 1002 > item - 30 > customer - 1200 > customer_address - 600 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-39971) ANALYZE TABLE makes some queries run forever
[ https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felipe reopened SPARK-39971: Sorry. I found my change was causing the issue in some of the TPC-DS queries, but not all. For query 24 specifically, the behavior is exactly the same without my changes; the query plan is also the same. > ANALYZE TABLE makes some queries run forever > > > Key: SPARK-39971 > URL: https://issues.apache.org/jira/browse/SPARK-39971 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.2.2 >Reporter: Felipe >Priority: Major > Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, > 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT > ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT > ForAllColumns-joinreorder-enabled.txt, > 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, > 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, > explainMode-cost.zip > > > I'm using TPCDS to run benchmarks, and after running ANALYZE TABLE (without > the FOR ALL COLUMNS) some queries became really slow. For example query24 - > [https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql] takes > between 10~15min before running the ANALYZE TABLE. > After running ANALYZE TABLE I waited 24h before cancelling the execution. > If I disable spark.sql.cbo.joinReorder.enabled or > spark.sql.cbo.enabled it becomes fast again. > It seems something in join reordering is not working well when we have table > stats, but not column stats. > Rows Count: > store_sales - 2879966589 > store_returns - 288009578 > store - 1002 > item - 30 > customer - 1200 > customer_address - 600 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
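For readers hitting the same plan regression, the reporter's own observations suggest session-level workarounds. This is a sketch, not a fix from the ticket: the conf keys are real Spark SQL settings named in the description, the `spark` session and the ANALYZE statement target are assumptions:

```python
# Assumes an existing SparkSession named `spark`.
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "false")  # reporter saw this restore performance
# or disable cost-based optimization entirely:
spark.conf.set("spark.sql.cbo.enabled", "false")
# Alternatively, give CBO complete inputs by also computing column statistics:
spark.sql("ANALYZE TABLE store_sales COMPUTE STATISTICS FOR ALL COLUMNS")
```

The tradeoff: disabling join reorder avoids the pathological plan search, while computing column stats (FOR ALL COLUMNS) gives the optimizer the inputs it needs to reorder correctly.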
[jira] [Resolved] (SPARK-40271) Support list type for pyspark.sql.functions.lit
[ https://issues.apache.org/jira/browse/SPARK-40271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-40271. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37722 [https://github.com/apache/spark/pull/37722] > Support list type for pyspark.sql.functions.lit > --- > > Key: SPARK-40271 > URL: https://issues.apache.org/jira/browse/SPARK-40271 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > Currently, `pyspark.sql.functions.lit` doesn't support for Python list type > as below: > {code:python} > >>> df = spark.range(3).withColumn("c", lit([1,2,3])) > Traceback (most recent call last): > ... > : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] > The feature is not supported: Literal for '[1, 2, 3]' of class > java.util.ArrayList. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302) > at > org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100) > at org.apache.spark.sql.functions$.lit(functions.scala:125) > at org.apache.spark.sql.functions.lit(functions.scala) > at > java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:577) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) > at py4j.Gateway.invoke(Gateway.java:282) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at > py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) > at py4j.ClientServerConnection.run(ClientServerConnection.java:106) > at java.base/java.lang.Thread.run(Thread.java:833) > {code} > We should make it 
supported. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
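This was resolved for 3.4.0 via the linked pull request. On earlier versions, a common workaround is to build the array column from per-element literals; a sketch, assuming a running SparkSession named `spark`:

```python
from pyspark.sql.functions import array, lit

# lit([1, 2, 3]) raises UNSUPPORTED_FEATURE.LITERAL_TYPE before Spark 3.4;
# an equivalent array column can be built element by element:
df = spark.range(3).withColumn("c", array(*[lit(x) for x in [1, 2, 3]]))
```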
[jira] [Assigned] (SPARK-40271) Support list type for pyspark.sql.functions.lit
[ https://issues.apache.org/jira/browse/SPARK-40271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-40271: - Assignee: Haejoon Lee > Support list type for pyspark.sql.functions.lit > --- > > Key: SPARK-40271 > URL: https://issues.apache.org/jira/browse/SPARK-40271 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > Currently, `pyspark.sql.functions.lit` doesn't support for Python list type > as below: > {code:python} > >>> df = spark.range(3).withColumn("c", lit([1,2,3])) > Traceback (most recent call last): > ... > : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] > The feature is not supported: Literal for '[1, 2, 3]' of class > java.util.ArrayList. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302) > at > org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100) > at org.apache.spark.sql.functions$.lit(functions.scala:125) > at org.apache.spark.sql.functions.lit(functions.scala) > at > java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:577) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) > at py4j.Gateway.invoke(Gateway.java:282) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at > py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) > at py4j.ClientServerConnection.run(ClientServerConnection.java:106) > at java.base/java.lang.Thread.run(Thread.java:833) > {code} > We should make it supported. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
[ https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598138#comment-17598138 ] Hyukjin Kwon commented on SPARK-40274: -- Judging by the error message, the error is likely from other libraries:
{code}
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
at com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
at com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.s
{code}
> ArrayIndexOutOfBoundsException in BytecodeReadingParanamer > -- > > Key: SPARK-40274 > URL: https://issues.apache.org/jira/browse/SPARK-40274 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.1.2 > Environment: spark 3.1.2 scala 2.12.10 jdk 11 linux >Reporter: 张刘强 >Priority: Major > Attachments: code.scala, error.txt, pom.txt > > > spark 3.1.2 scala 2.12.10 jdk 1.8 linux > > calling dataframe.count throws this exception: > > stacktrace like this: > > java.lang.ArrayIndexOutOfBoundsException: Index 28499 out of bounds for > length 206 > at > com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532) > at > com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315) > at > com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102) > at > 
com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:59) > at > scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292) > at > scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:59) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:181) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at scala.collection.TraversableLike.map(TraversableLike.scala:285) > at scala.collection.TraversableLike.map$(TraversableLike.scala:278) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) > at > 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14(BeanIntrospector.scala:175) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14$adapted(BeanIntrospector.scala:174) > at scala.collection.immutable.List.flatMap(List.scala:366) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:174) > at > com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:20) > at > com.fa
[jira] [Resolved] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
[ https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40274. -- Resolution: Invalid > ArrayIndexOutOfBoundsException in BytecodeReadingParanamer > -- > > Key: SPARK-40274 > URL: https://issues.apache.org/jira/browse/SPARK-40274 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.1.2 > Environment: spark 3.1.2 scala 2.12.10 jdk 11 linux >Reporter: 张刘强 >Priority: Major > Attachments: code.scala, error.txt, pom.txt > > > spark 3.1.2 scala 2.12.10 jdk 1.8 linux > > when use dataframe.count will throw this exception: > > stacktrace like this: > > java.lang.ArrayIndexOutOfBoundsException: Index 28499 out of bounds for > length 206 > at > com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532) > at > com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315) > at > com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102) > at > com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:59) > at > scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at 
scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292) > at > scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:59) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:181) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at scala.collection.TraversableLike.map(TraversableLike.scala:285) > at scala.collection.TraversableLike.map$(TraversableLike.scala:278) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14(BeanIntrospector.scala:175) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14$adapted(BeanIntrospector.scala:174) > at scala.collection.immutable.List.flatMap(List.scala:366) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:174) > at > com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:20) > at > com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.fieldName(ScalaAnnotationIntrospectorModule.scala:28) > at > com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.findImplicitPropertyName(ScalaAnnotationIntrospectorModule.scala:80) > at > 
com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findImplicitPropertyName(AnnotationIntrospectorPair.java:490) > at > com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector._addFields(POJOPropertiesCollector.java:380) > at > com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:308) > at > com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.getJsonValueAccessor(POJOPropertiesCollector.java:196) > at > com.fasterxml.jackson.databind.introspect.BasicBeanDescription.findJsonValueAccessor(BasicBeanD
[jira] [Commented] (SPARK-40282) DataType argument in StructType.add is incorrectly throwing scala.MatchError
[ https://issues.apache.org/jira/browse/SPARK-40282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598132#comment-17598132 ] Hyukjin Kwon commented on SPARK-40282: -- We don't have this problem in the languages supported by official Apache Spark. Is it a problem from Kotlin? > DataType argument in StructType.add is incorrectly throwing scala.MatchError > > > Key: SPARK-40282 > URL: https://issues.apache.org/jira/browse/SPARK-40282 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: M. Manna >Priority: Major > Attachments: SparkApplication.kt, retailstore.csv > > > *Problem Description* > As part of the contract mentioned here, Spark should be able to support > {{IntegerType}} as an argument to the StructType.add method. However, it > complains with {{scala.MatchError}} today. > > If we call the overloaded version that accepts a String value as the type, > e.g. "Integer", it works. > *How to Reproduce* > # Create a Kotlin project - I have used Kotlin but Java will also work > (needs minor adjustment) > # Place the attached CSV file in {{src/main/resources}} > # Compile the project with Java 11 > # Run - it will give you the error. 
> {code:java} > Exception in thread "main" scala.MatchError: > org.apache.spark.sql.types.IntegerType@363fe35a (of class > org.apache.spark.sql.types.IntegerType) > at > org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:240) > at > org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:236) > at > org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.(objects.scala:1890) > at > org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:197) > at > scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293) > at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290) > at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198) > at > org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:192) > at > org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:73) > at > org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:81) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:92) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89) > at > org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:444) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210) > at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185) > {code} > # Now change the line (commented as HERE) to use a String value, i.e. "Integer" > # It works > *Ask* > # Why does it not accept IntegerType or StringType as a DataType among the > parameters supplied to the {{add}} function in {{StructType}}? > # If this is a bug, do we know when the fix can come? > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
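The dispatch failure in the report can be modeled outside Spark. Below is a toy Python sketch (hypothetical code, not Spark's RowEncoder or StructType implementation): an `add` that handles a string type name and a type object in separate branches. A branch that is missing behaves like the non-exhaustive Scala `match` that `scala.MatchError` signals.

```python
# Toy model (NOT Spark's implementation) of an overloaded StructType.add.
# The report says the String overload ("Integer") works while passing a
# DataType object hits an unmatched case; scala.MatchError is Scala's
# signal for exactly such a non-exhaustive match.

class IntegerType:
    """Hypothetical stand-in for org.apache.spark.sql.types.IntegerType."""

_NAME_TO_TYPE = {"Integer": IntegerType}  # lookup table for the string overload

class StructType:
    def __init__(self):
        self.fields = []

    def add(self, name, data_type):
        if isinstance(data_type, str):
            # The string overload resolves the name first, so it works.
            self.fields.append((name, _NAME_TO_TYPE[data_type]))
        elif isinstance(data_type, type):
            # The case the bug report says is effectively missing:
            # accept the type object directly.
            self.fields.append((name, data_type))
        else:
            # Analogue of scala.MatchError for an uncovered input.
            raise TypeError(f"unmatched data type: {data_type!r}")
        return self

schema = StructType().add("age", "Integer").add("salary", IntegerType)
```

If the `elif` branch is removed, the second `add` call raises, which mirrors the reporter's observation that only the string overload succeeds.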
[jira] [Updated] (SPARK-40282) DataType argument in StructType.add is incorrectly throwing scala.MatchError
[ https://issues.apache.org/jira/browse/SPARK-40282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40282: - Component/s: SQL (was: Spark Core) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40282) DataType argument in StructType.add is incorrectly throwing scala.MatchError
[ https://issues.apache.org/jira/browse/SPARK-40282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40282: - Priority: Major (was: Blocker) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40283) Update mima's previousSparkVersion to 3.3.0
Yuming Wang created SPARK-40283: --- Summary: Update mima's previousSparkVersion to 3.3.0 Key: SPARK-40283 URL: https://issues.apache.org/jira/browse/SPARK-40283 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.4.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39971) ANALYZE TABLE makes some queries run forever
[ https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-39971. - Resolution: Not A Problem > ANALYZE TABLE makes some queries run forever > > > Key: SPARK-39971 > URL: https://issues.apache.org/jira/browse/SPARK-39971 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.2.2 >Reporter: Felipe >Priority: Major > Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, > 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT > ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT > ForAllColumns-joinreorder-enabled.txt, > 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, > 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, > explainMode-cost.zip > > > I'm using TPCDS to run benchmarks, and after running ANALYZE TABLE (without > the FOR ALL COLUMNS) some queries became really slow. For example query24 - > [https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql] takes > between 10~15min before running the ANALYZE TABLE. > After running ANALYZE TABLE I waited 24h before cancelling the execution. > If I disable spark.sql.cbo.joinReorder.enabled or > spark.sql.cbo.enabled it becomes fast again. > It seems something in join reordering is not working well when we have table > stats, but not column stats. > Rows Count: > store_sales - 2879966589 > store_returns - 288009578 > store - 1002 > item - 30 > customer - 1200 > customer_address - 600 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39971) ANALYZE TABLE makes some queries run forever
[ https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598121#comment-17598121 ] Felipe commented on SPARK-39971: I found the issue was caused by a customization in my code: we are setting the distinctCount stats from the rowCount when we have only the rowCount. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
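The reporter ultimately traced the slowdown to a local customization of the statistics, but the thread's underlying observation - join reordering misbehaving when table stats exist without column stats - can be illustrated with a toy cardinality estimate. The formula and the ndv fallback below are illustrative only, not Spark's actual CBO code:

```python
# Toy join-size estimate: |L| * |R| / max(ndv of the join keys).
# Illustrative only; Spark's CBO formula and its handling of missing
# column stats differ. The point: without distinct-value counts (ndv),
# a naive fallback of ndv=1 makes every join look like a near-cross-
# product, which can push a cost-based join reorderer toward very
# expensive plans.

def join_cardinality(rows_left, rows_right, ndv_left=None, ndv_right=None):
    ndv = max(ndv_left or 1, ndv_right or 1)  # missing column stats -> ndv=1
    return rows_left * rows_right // ndv

# Row counts taken from the issue; the ndv values are made up.
with_stats = join_cardinality(2_879_966_589, 288_009_578,
                              ndv_left=10**6, ndv_right=10**6)
without_stats = join_cardinality(2_879_966_589, 288_009_578)
```

With the invented ndv of one million, the estimate shrinks by six orders of magnitude; without column stats it stays at the full cross-product size, which is the kind of distortion that can make a reordered plan run "forever".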
[jira] [Commented] (SPARK-31001) Add ability to create a partitioned table via catalog.createTable()
[ https://issues.apache.org/jira/browse/SPARK-31001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598115#comment-17598115 ] Nicholas Chammas commented on SPARK-31001: -- What's {{{}__partition_columns{}}}? Is that something specific to Delta, or are you saying it's a hidden feature of Spark? > Add ability to create a partitioned table via catalog.createTable() > --- > > Key: SPARK-31001 > URL: https://issues.apache.org/jira/browse/SPARK-31001 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > There doesn't appear to be a way to create a partitioned table using the > Catalog interface. > In SQL, however, you can do this via {{{}CREATE TABLE ... PARTITIONED BY{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
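Until the Catalog interface grows this ability, the SQL route can be generated programmatically. A minimal sketch follows; the table and column names are invented for illustration, and in a live session the returned string would be passed to spark.sql(...):

```python
def create_partitioned_table_ddl(table, columns, partition_cols):
    # Build a CREATE TABLE ... PARTITIONED BY statement. For datasource
    # tables, the partition columns must also appear in the column list.
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(partition_cols)
    return (f"CREATE TABLE {table} ({cols}) USING parquet "
            f"PARTITIONED BY ({parts})")

# Hypothetical schema: a sales table partitioned by region.
ddl = create_partitioned_table_ddl(
    "sales",
    [("id", "BIGINT"), ("region", "STRING"), ("amount", "DOUBLE")],
    ["region"],
)
```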
[jira] [Updated] (SPARK-40282) DataType argument in StructType.add is incorrectly throwing scala.MatchError
[ https://issues.apache.org/jira/browse/SPARK-40282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M. Manna updated SPARK-40282: - Summary: DataType argument in StructType.add is incorrectly throwing scala.MatchError (was: IntegerType is missed in "ExternalDataTypeForInput" function) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40282) DataType argument in StructType.add is incorrectly throwing scala.MatchError
[ https://issues.apache.org/jira/browse/SPARK-40282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M. Manna updated SPARK-40282: - Description: *Problem Description* as part of contract mentioned here, Spark should be able to support {{IntegerType}} as an argument in StructType.add method. However, it complaints with {{scala.MatchError}} today. If we call the override version which access String value as Type e.g. "Integer" - it works. *How to Reproduce* # Create a Kotlin Project - I have used Kotlin but Java will also work (needs minor adjustment) # Place the attached CSV file in {{src/main/resources}} # Compile the project with Java 11 # Run - it will give you error. {code:java} Exception in thread "main" scala.MatchError: org.apache.spark.sql.types.IntegerType@363fe35a (of class org.apache.spark.sql.types.IntegerType) at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:240) at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:236) at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.(objects.scala:1890) at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:197) at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293) at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290) at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198) at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:192) at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:73) at 
org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:81) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:92) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89) at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:444) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228) at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185) {code} # Now change line (commented as HERE) - to have a String value i.e. "Integer" # It works *Ask* # Why does it not accept IntegerType, StringType as DataType as part of the parameters supplied through {{add}} function in {{StructType}} ? # If this is a bug, do we know when the fix can come? was: *Problem Description* as part of contract mentioned here, Spark should be able to support {{IntegerType}} as an argument in StructType.add method. However, it complaints with {{scala.MatchError}} today. If we call the override version which access String value as Type e.g. "Integer" - it works. *How to Reproduce* # Create a Kotlin Project - I have used Kotlin but Java will also work (needs minor adjustment) # Place the attached CSV file in {{src/main/resources}} # Compile the project with Java 11 # Run - it will give you error. # Now change line (commented as HERE) - to have a String value i.e. "Integer" # It works *Ask* # Why does it not accept IntegerType, StringType as DataType as part of the parameters supplied through {{add}} function in {{StructType}} ? # If this is a bug, do we know when the fix can come? 
[jira] [Updated] (SPARK-40282) IntegerType is missed in "ExternalDataTypeForInput" function
[ https://issues.apache.org/jira/browse/SPARK-40282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M. Manna updated SPARK-40282: - Attachment: SparkApplication.kt -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40282) IntegerType is missed in "ExternalDataTypeForInput" function
M. Manna created SPARK-40282: Summary: IntegerType is missed in "ExternalDataTypeForInput" function Key: SPARK-40282 URL: https://issues.apache.org/jira/browse/SPARK-40282 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.0 Reporter: M. Manna Attachments: SparkApplication.kt, retailstore.csv -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40282) IntegerType is missed in "ExternalDataTypeForInput" function
[ https://issues.apache.org/jira/browse/SPARK-40282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M. Manna updated SPARK-40282: - Attachment: retailstore.csv -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40281) Memory Profiler on Executors
[ https://issues.apache.org/jira/browse/SPARK-40281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng reassigned SPARK-40281: Assignee: (was: Xinrong Meng) > Memory Profiler on Executors > > > Key: SPARK-40281 > URL: https://issues.apache.org/jira/browse/SPARK-40281 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Profiling is critical to performance engineering. Memory consumption is a key > indicator of how efficient a PySpark program is. There is an existing effort > on memory profiling of Python programs, Memory Profiler > ([https://pypi.org/project/memory-profiler/]). > PySpark applications run as independent sets of processes on a cluster, > coordinated by the SparkContext object in the driver program. On the driver > side, PySpark is a regular Python process; thus, we can profile it as a > normal Python program using Memory Profiler. > However, on the executor side, we are missing such a memory profiler. Since > executors are distributed across different nodes in the cluster, we need > to aggregate profiles. Furthermore, Python worker processes are spawned per > executor for Python/Pandas UDF execution, which makes memory > profiling more intricate. > This umbrella proposes to implement a Memory Profiler on Executors. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40281) Memory Profiler on Executors
Xinrong Meng created SPARK-40281: Summary: Memory Profiler on Executors Key: SPARK-40281 URL: https://issues.apache.org/jira/browse/SPARK-40281 Project: Spark Issue Type: Umbrella Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40281) Memory Profiler on Executors
[ https://issues.apache.org/jira/browse/SPARK-40281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng reassigned SPARK-40281: Assignee: Xinrong Meng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
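The umbrella's driver-side point - PySpark's driver is a regular Python process - means stdlib tooling already applies there. The sketch below uses tracemalloc to capture a per-task peak and then merges per-worker numbers, the aggregation step the umbrella names as the hard part on executors. This is toy code, not the proposed PySpark profiler:

```python
import tracemalloc

def profile_peak(fn, *args):
    # Measure the peak allocation of one task, roughly what a per-worker
    # memory profiler would record (stdlib sketch; the umbrella proposes
    # a real executor-side profiler, not this).
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

def aggregate(peaks):
    # Executors report independently; the driver would need to merge
    # their per-worker results into one view.
    return {"max": max(peaks), "total": sum(peaks)}

# Two "tasks" standing in for work done on two Python workers.
peaks = [profile_peak(lambda n: [0] * n, n) for n in (10_000, 20_000)]
report = aggregate(peaks)
```

The intricate part the description mentions - one Python worker per executor per UDF execution - would multiply the number of independent reports to merge, but the shape of the aggregation stays the same.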
[jira] [Closed] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long
[ https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Singh closed SPARK-40266. -- > Corrected console output in quick-start - Datatype Integer instead of Long > > > Key: SPARK-40266 > URL: https://issues.apache.org/jira/browse/SPARK-40266 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.2, 3.3.0 > Environment: spark 3.3.0 > Windows 10 (OS Build 19044.1889) >Reporter: Prashant Singh >Assignee: Prashant Singh >Priority: Minor > Fix For: 3.4.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > h3. What changes were proposed in this pull request? > Corrected datatype output of command from Long to Int > h3. Why are the changes needed? > It shows incorrect datatype > h3. Does this PR introduce _any_ user-facing change? > Yes. It proposes changes in documentation for console output. > [!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png] > h3. How was this patch tested? > Manually checked the changes by previewing markdown output. I tested output > by installing spark 3.3.0 locally and running commands present in quick start > docs > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40256) Switch base image from openjdk to eclipse-temurin
[ https://issues.apache.org/jira/browse/SPARK-40256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-40256. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37705 [https://github.com/apache/spark/pull/37705] > Switch base image from openjdk to eclipse-temurin > - > > Key: SPARK-40256 > URL: https://issues.apache.org/jira/browse/SPARK-40256 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > According to https://github.com/docker-library/openjdk/issues/505 and > https://github.com/docker-library/docs/pull/2162, openjdk:8/11 is EOL and > Eclipse Temurin replaces it. > The `openjdk` image is [no longer > updated](https://adoptopenjdk.net/upstream.html) (the last releases were > 8u342 and 11.0.16) and will `remove the 11 and 8 tags (in October 2022, > perhaps)`; since Spark uses these tags, we have to switch before that > happens. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40256) Switch base image from openjdk to eclipse-temurin
[ https://issues.apache.org/jira/browse/SPARK-40256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-40256: -- Assignee: Yikun Jiang > Switch base image from openjdk to eclipse-temurin > - > > Key: SPARK-40256 > URL: https://issues.apache.org/jira/browse/SPARK-40256 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > > According to https://github.com/docker-library/openjdk/issues/505 and > https://github.com/docker-library/docs/pull/2162, openjdk:8/11 are EOL and > Eclipse Temurin replaces them. > The `openjdk` image is [not updated > anymore](https://adoptopenjdk.net/upstream.html) (the last releases were > 8u342 and 11.0.16) and will `remove the 11 and 8 tags (in October 2022, > perhaps)` (we are using these tags in Spark), so we have to switch before that > happens. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40264) Add helper function for DL model inference in pyspark.ml.functions
[ https://issues.apache.org/jira/browse/SPARK-40264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598067#comment-17598067 ] Apache Spark commented on SPARK-40264: -- User 'leewyang' has created a pull request for this issue: https://github.com/apache/spark/pull/37734 > Add helper function for DL model inference in pyspark.ml.functions > -- > > Key: SPARK-40264 > URL: https://issues.apache.org/jira/browse/SPARK-40264 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.2 >Reporter: Lee Yang >Priority: Minor > > Add a helper function to create a pandas_udf for inference on a given DL > model, where the user provides a predict function that is responsible for > loading the model and inferring on a batch of numpy inputs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40264) Add helper function for DL model inference in pyspark.ml.functions
[ https://issues.apache.org/jira/browse/SPARK-40264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40264: Assignee: Apache Spark > Add helper function for DL model inference in pyspark.ml.functions > -- > > Key: SPARK-40264 > URL: https://issues.apache.org/jira/browse/SPARK-40264 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.2 >Reporter: Lee Yang >Assignee: Apache Spark >Priority: Minor > > Add a helper function to create a pandas_udf for inference on a given DL > model, where the user provides a predict function that is responsible for > loading the model and inferring on a batch of numpy inputs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40264) Add helper function for DL model inference in pyspark.ml.functions
[ https://issues.apache.org/jira/browse/SPARK-40264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40264: Assignee: (was: Apache Spark) > Add helper function for DL model inference in pyspark.ml.functions > -- > > Key: SPARK-40264 > URL: https://issues.apache.org/jira/browse/SPARK-40264 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.2 >Reporter: Lee Yang >Priority: Minor > > Add a helper function to create a pandas_udf for inference on a given DL > model, where the user provides a predict function that is responsible for > loading the model and inferring on a batch of numpy inputs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40280) Failure to create parquet predicate push down for ints and longs on some valid files
Robert Joseph Evans created SPARK-40280: --- Summary: Failure to create parquet predicate push down for ints and longs on some valid files Key: SPARK-40280 URL: https://issues.apache.org/jira/browse/SPARK-40280 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0, 3.2.0, 3.1.0, 3.4.0 Reporter: Robert Joseph Evans The [parquet format|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#signed-integers] specification states that... bq. {{{}INT(8, true){}}}, {{{}INT(16, true){}}}, and {{INT(32, true)}} must annotate an {{int32}} primitive type and {{INT(64, true)}} must annotate an {{int64}} primitive type. {{INT(32, true)}} and {{INT(64, true)}} are implied by the {{int32}} and {{int64}} primitive types if no other annotation is present and should be considered optional. But the code inside of [ParquetFilters.scala|https://github.com/apache/spark/blob/296fe49ec855ac8c15c080e7bab6d519fe504bd3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L125-L126] requires that for {{int32}} and {{int64}} that there be no annotation. If there is an annotation for those columns and they are a part of a predicate push down, the hard coded types will not match and the corresponding filter ends up being {{None}}. This can be a huge performance penalty for a valid parquet file. I am happy to provide files that show the issue if needed for testing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
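The type-matching pitfall described in this report can be sketched in a few lines of plain Python. This is an illustrative model only — the function names and the string-based type representation are invented here; the actual check compares Parquet schema objects inside ParquetFilters.scala:

```python
# Illustrative sketch (invented names) of the predicate push-down type
# check this issue describes. Types and logical-type annotations are
# modeled as plain strings.

def build_pushdown_filter(primitive_type, annotation, predicate):
    """Strict matching, analogous to the reported behavior: int32/int64
    columns must carry NO logical-type annotation, so a valid file with
    the spec-optional INT(32, true) annotation loses the filter."""
    expected = {"int32": None, "int64": None}
    if primitive_type in expected and annotation == expected[primitive_type]:
        return predicate
    return None  # filter silently dropped -> no row-group skipping

def build_pushdown_filter_spec_aware(primitive_type, annotation, predicate):
    """Also accept the optional signed-int annotations that the Parquet
    logical-types specification says are implied by int32/int64."""
    allowed = {
        "int32": {None, "INT(32, true)"},
        "int64": {None, "INT(64, true)"},
    }
    if primitive_type in allowed and annotation in allowed[primitive_type]:
        return predicate
    return None
```

With the strict check, every int predicate against an annotated (but spec-valid) file evaluates to `None`, which is exactly the lost push-down and performance penalty the report describes.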
[jira] [Commented] (SPARK-40233) Unable to load large pandas dataframe to pyspark
[ https://issues.apache.org/jira/browse/SPARK-40233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598021#comment-17598021 ] Niranda Perera commented on SPARK-40233: I believe the issue is that the executors are unable to load the data from the Python driver (possibly because they do not have enough memory). A single executor would first have to load the entire data dump from the Python driver and then repartition it. My suggestion is to add a `num_partitions` option to the createDataFrame method so that partitioning can be handled at the driver and the data sent to the executors as a list of RDDs. Is this an acceptable approach from the POV of Spark internals? > Unable to load large pandas dataframe to pyspark > > > Key: SPARK-40233 > URL: https://issues.apache.org/jira/browse/SPARK-40233 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Niranda Perera >Priority: Major > > I've been trying to join two large pandas dataframes using pyspark with the > following code. I'm trying to vary the executor cores allocated to the > application and measure the scalability of pyspark (strong scaling).
> {code:java} > r = 10 ** 9 # 1Bn rows > it = 10 > w = 256 > unique = 0.9 > TOTAL_MEM = 240 > TOTAL_NODES = 14 > max_val = r * unique > rng = default_rng() > frame_data = rng.integers(0, max_val, size=(r, 2)) > frame_data1 = rng.integers(0, max_val, size=(r, 2)) > print(f"data generated", flush=True) > df_l = pd.DataFrame(frame_data).add_prefix("col") > df_r = pd.DataFrame(frame_data1).add_prefix("col") > print(f"data loaded", flush=True) > procs = int(math.ceil(w / TOTAL_NODES)) > mem = int(TOTAL_MEM*0.9) > print(f"world sz {w} procs per worker {procs} mem {mem} iter {it}", > flush=True) > spark = SparkSession\ > .builder\ > .appName(f'join {r} {w}')\ > .master('spark://node:7077')\ > .config('spark.executor.memory', f'{int(mem*0.6)}g')\ > .config('spark.executor.pyspark.memory', f'{int(mem*0.4)}g')\ > .config('spark.cores.max', w)\ > .config('spark.driver.memory', '100g')\ > .config('spark.sql.execution.arrow.pyspark.enabled', 'true')\ > .getOrCreate() > sdf0 = spark.createDataFrame(df_l).repartition(w).cache() > sdf1 = spark.createDataFrame(df_r).repartition(w).cache() > print(f"data loaded to spark", flush=True) > try: > for i in range(it): > t1 = time.time() > out = sdf0.join(sdf1, on='col0', how='inner') > count = out.count() > t2 = time.time() > print(f"timings {r} 1 {i} {(t2 - t1) * 1000:.0f} ms, {count}", > flush=True) > > del out > del count > gc.collect() > finally: > spark.stop() {code} > {*}Cluster{*}: I am using a standalone spark cluster on a 15-node cluster with > 48 cores and 240GB RAM each. I've spawned the master and the driver code on > node1, while the other 14 nodes have spawned workers allocating maximum memory. > In the spark context, I am reserving 90% of total memory for the executors, > splitting 60% to the JVM and 40% to pyspark. > {*}Issue{*}: When I run the above program, I can see that the executors are > being assigned to the app. But it doesn't move forward, even after 60 mins. > For a smaller row count (10M), this was working without a problem.
Driver output > {code:java} > world sz 256 procs per worker 19 mem 216 iter 8 > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 22/08/26 14:52:22 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > /N/u/d/dnperera/.conda/envs/env/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:425: > UserWarning: createDataFrame attempted Arrow optimization because > 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed > by the reason below: > Negative initial size: -589934400 > Attempting non-optimization as > 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true. > warn(msg) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
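The `num_partitions` idea proposed in the comment above would let the driver pre-split the local data before shipping it, instead of funneling the whole dataset through one executor and repartitioning afterwards. A minimal sketch of the splitting arithmetic (`chunk_bounds` is an invented helper for illustration, not a PySpark API):

```python
def chunk_bounds(n_rows, num_partitions):
    """Yield (start, end) slice bounds dividing n_rows into num_partitions
    near-equal chunks; the first n_rows % num_partitions chunks each get
    one extra row."""
    base, extra = divmod(n_rows, num_partitions)
    start = 0
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)
        yield start, start + size
        start += size

# The driver could then hand each pandas slice df.iloc[start:end] to a
# separate task, rather than serializing one giant blob to a single executor.
```

This only illustrates the partitioning scheme; whether createDataFrame should expose it is the question raised in the comment.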
[jira] [Assigned] (SPARK-40267) Add description for ExecutorAllocationManager metrics
[ https://issues.apache.org/jira/browse/SPARK-40267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40267: Assignee: Apache Spark > Add description for ExecutorAllocationManager metrics > - > > Key: SPARK-40267 > URL: https://issues.apache.org/jira/browse/SPARK-40267 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Assignee: Apache Spark >Priority: Minor > > Some ExecutorAllocationManager metrics are hard to know what stands for just > from metric name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40267) Add description for ExecutorAllocationManager metrics
[ https://issues.apache.org/jira/browse/SPARK-40267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40267: Assignee: (was: Apache Spark) > Add description for ExecutorAllocationManager metrics > - > > Key: SPARK-40267 > URL: https://issues.apache.org/jira/browse/SPARK-40267 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Priority: Minor > > Some ExecutorAllocationManager metrics are hard to know what stands for just > from metric name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40267) Add description for ExecutorAllocationManager metrics
[ https://issues.apache.org/jira/browse/SPARK-40267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597996#comment-17597996 ] Apache Spark commented on SPARK-40267: -- User 'warrenzhu25' has created a pull request for this issue: https://github.com/apache/spark/pull/37733 > Add description for ExecutorAllocationManager metrics > - > > Key: SPARK-40267 > URL: https://issues.apache.org/jira/browse/SPARK-40267 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Priority: Minor > > Some ExecutorAllocationManager metrics are hard to know what stands for just > from metric name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40260) Use error classes in the compilation errors of GROUP BY a position
[ https://issues.apache.org/jira/browse/SPARK-40260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40260. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37712 [https://github.com/apache/spark/pull/37712] > Use error classes in the compilation errors of GROUP BY a position > -- > > Key: SPARK-40260 > URL: https://issues.apache.org/jira/browse/SPARK-40260 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Migrate the following errors in QueryCompilationErrors: > * groupByPositionRefersToAggregateFunctionError > * groupByPositionRangeError > onto use error classes. Throw an implementation of SparkThrowable. Also write > a test per every error in QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40253) Data read exception in orc format
[ https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597977#comment-17597977 ] Apache Spark commented on SPARK-40253: -- User 'SelfImpr001' has created a pull request for this issue: https://github.com/apache/spark/pull/37732 > Data read exception in orc format > -- > > Key: SPARK-40253 > URL: https://issues.apache.org/jira/browse/SPARK-40253 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 > Environment: os centos7 > spark 2.4.3 > hive 1.2.1 > hadoop 2.7.2 >Reporter: yihangqiao >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > Caused by: java.io.EOFException: Read past end of RLE integer from compressed > stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 > offset: 0 limit: 0 > When running batches using spark-sql and using the create table xxx as select > syntax, the select query part uses a static value as the default value (0.00 > as column_name) and does not specify the data type of the default value. 
In > this usage scenario, because the data type is not explicitly specified, the > metadata information of the field in the written ORC file is missing (the > writing is successful), but when reading, as long as the query column > contains this field, it will not be able to parse the ORC file, and the > following error occurs: > > {code:java} > create table testgg as select 0.00 as gg;select * from testgg;Caused by: > java.io.IOException: Error reading file: > viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-0-e7df51a1-98b9-4472-9899-3c132b97885b-c000 > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown > Source) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at >
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at > org.apache.spark.scheduler.Task.run(Task.scala:121) at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)Caused by: > java.io.EOFException: Read past end of RLE integer from compressed stream > Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 > limit: 0 at > org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61) > at > org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323) > at > org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398) > at > org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(Tree
[jira] [Commented] (SPARK-40253) Data read exception in orc format
[ https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597975#comment-17597975 ] Apache Spark commented on SPARK-40253: -- User 'SelfImpr001' has created a pull request for this issue: https://github.com/apache/spark/pull/37732 > Data read exception in orc format > -- > > Key: SPARK-40253 > URL: https://issues.apache.org/jira/browse/SPARK-40253 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 > Environment: os centos7 > spark 2.4.3 > hive 1.2.1 > hadoop 2.7.2 >Reporter: yihangqiao >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > Caused by: java.io.EOFException: Read past end of RLE integer from compressed > stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 > offset: 0 limit: 0 > When running batches using spark-sql and using the create table xxx as select > syntax, the select query part uses a static value as the default value (0.00 > as column_name) and does not specify the data type of the default value. 
In > this usage scenario, because the data type is not explicitly specified, the > metadata information of the field in the written ORC file is missing (the > writing is successful), but when reading, as long as the query column > contains this field, it will not be able to parse the ORC file, and the > following error occurs: > > {code:java} > create table testgg as select 0.00 as gg;select * from testgg;Caused by: > java.io.IOException: Error reading file: > viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-0-e7df51a1-98b9-4472-9899-3c132b97885b-c000 > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown > Source) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at >
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at > org.apache.spark.scheduler.Task.run(Task.scala:121) at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)Caused by: > java.io.EOFException: Read past end of RLE integer from compressed stream > Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 > limit: 0 at > org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61) > at > org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323) > at > org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398) > at > org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(Tree
[jira] [Resolved] (SPARK-38603) Qualified star selection produces duplicated common columns after join then alias
[ https://issues.apache.org/jira/browse/SPARK-38603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38603. - Resolution: Duplicate > Qualified star selection produces duplicated common columns after join then > alias > - > > Key: SPARK-38603 > URL: https://issues.apache.org/jira/browse/SPARK-38603 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: OS: Ubuntu 18.04.5 LTS > Scala version: 2.12.15 >Reporter: Yves Li >Priority: Minor > > When joining two DataFrames and then aliasing the result, selecting columns > from the resulting Dataset by a qualified star produces duplicates of the > joined columns. > {code:scala} > scala> val df1 = Seq((1, 10), (2, 20)).toDF("a", "x") > df1: org.apache.spark.sql.DataFrame = [a: int, x: int] > scala> val df2 = Seq((2, 200), (3, 300)).toDF("a", "y") > df2: org.apache.spark.sql.DataFrame = [a: int, y: int] > scala> val joined = df1.join(df2, "a").alias("joined") > joined: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, x: > int ... 1 more field] > scala> joined.select("*").show() > +---+---+---+ > | a| x| y| > +---+---+---+ > | 2| 20|200| > +---+---+---+ > scala> joined.select("joined.*").show() > +---+---+---+---+ > | a| a| x| y| > +---+---+---+---+ > | 2| 2| 20|200| > +---+---+---+---+ > scala> joined.select("*").select("joined.*").show() > +---+---+---+ > | a| x| y| > +---+---+---+ > | 2| 20|200| > +---+---+---+ {code} > This appears to be introduced by SPARK-34527, leading to some surprising > behaviour. Using an earlier version, such as Spark 3.0.2, produces the same > output for all three {{{}show(){}}}s. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31001) Add ability to create a partitioned table via catalog.createTable()
[ https://issues.apache.org/jira/browse/SPARK-31001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597941#comment-17597941 ] Kevin Appel commented on SPARK-31001: - [~nchammas] I ran into this recently while trying to create a partitioned external table. I found a number of suggestions, but nothing worked easily without manually extracting the DDL and removing the partition column. In [https://github.com/delta-io/delta/issues/31], the user vprus posted an example that passes the partition columns as a table option. Combining that with a recover-partitions statement works for me to create the external table in Spark over the existing parquet data, and since the table is then registered, I am also able to view it in Trino: spark.catalog.createTable("kevin.ktest1", "/user/kevin/ktest1", **{"__partition_columns": "['id']"}) spark.sql("alter table kevin.ktest1 recover partitions") Can you check whether this works for you as well? If it does, perhaps we can get the Spark documentation updated to cover this approach. > Add ability to create a partitioned table via catalog.createTable() > --- > > Key: SPARK-31001 > URL: https://issues.apache.org/jira/browse/SPARK-31001 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > There doesn't appear to be a way to create a partitioned table using the > Catalog interface. > In SQL, however, you can do this via {{{}CREATE TABLE ... PARTITIONED BY{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40113) Refactor ParquetScanBuilder DataSourceV2 interface implementation
[ https://issues.apache.org/jira/browse/SPARK-40113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao resolved SPARK-40113. Fix Version/s: 3.4.0 Assignee: miracle Resolution: Fixed > Refactor ParquetScanBuilder DataSourceV2 interface implementation > > > Key: SPARK-40113 > URL: https://issues.apache.org/jira/browse/SPARK-40113 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.3.0 >Reporter: Mars >Assignee: miracle >Priority: Minor > Fix For: 3.4.0 > > > The `FileScanBuilder` interface is currently not fully implemented in > `ParquetScanBuilder`, unlike > `OrcScanBuilder`, `AvroScanBuilder`, and `CSVScanBuilder`. > To unify the logic of the code and make it clearer, this part of the > implementation is brought in line with the others. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
[ https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40056. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37727 [https://github.com/apache/spark/pull/37727] > Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9 > - > > Key: SPARK-40056 > URL: https://issues.apache.org/jira/browse/SPARK-40056 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Trivial > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
[ https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40056: Assignee: BingKun Pan > Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9 > - > > Key: SPARK-40056 > URL: https://issues.apache.org/jira/browse/SPARK-40056 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40253) Data read exception in orc format
[ https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597905#comment-17597905 ] yihangqiao commented on SPARK-40253: Solution: in Literal, one reserved integral digit is added by default; for example, 0.00 is converted to decimal(3,2) by default. This remains compatible with 38-digit decimal calculations. > Data read exception in orc format > -- > > Key: SPARK-40253 > URL: https://issues.apache.org/jira/browse/SPARK-40253 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 > Environment: os centos7 > spark 2.4.3 > hive 1.2.1 > hadoop 2.7.2 >Reporter: yihangqiao >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > Caused by: java.io.EOFException: Read past end of RLE integer from compressed > stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 > offset: 0 limit: 0 > When running batches using spark-sql and using the create table xxx as select > syntax, the select query part uses a static value as the default value (0.00 > as column_name) and does not specify the data type of the default value.
In > this usage scenario, because the data type is not explicitly specified, the > metadata information of the field in the written ORC file is missing (the > writing is successful), but when reading, as long as the query column > contains this field, it will not be able to parse the ORC file, and the > following error occurs: > > {code:java} > create table testgg as select 0.00 as gg;select * from testgg;Caused by: > java.io.IOException: Error reading file: > viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-0-e7df51a1-98b9-4472-9899-3c132b97885b-c000 > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown > Source) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at >
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at > org.apache.spark.scheduler.Task.run(Task.scala:121) at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)Caused by: > java.io.EOFException: Read past end of RLE integer from compressed stream > Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 > limit: 0 at > org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61) > at > org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323) > at > org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:
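The decimal(3,2) typing of the bare literal 0.00 described in the comment above can be sketched in plain Python (an illustration of the stated rule, not Spark's actual Literal implementation):

```python
def infer_decimal_type(literal: str):
    """Infer (precision, scale) for a decimal literal string.

    Sketch of the behavior described in the comment: the untyped
    literal 0.00 maps to decimal(3,2). Not Spark's actual code.
    """
    s = literal.lstrip("+-")
    int_part, _, frac_part = s.partition(".")
    # Precision counts all digits; an integer part of "0" still counts as one digit.
    precision = max(len(int_part), 1) + len(frac_part)
    scale = len(frac_part)
    return precision, scale

print(infer_decimal_type("0.00"))  # (3, 2)
print(infer_decimal_type("12.5"))  # (3, 1)
```

A workaround implied by the description would be to give the literal an explicit type, e.g. `select cast(0.00 as decimal(3,2)) as gg`, so that the written ORC metadata is complete; this is an inference from the report, not a fix confirmed in the ticket.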
[jira] [Assigned] (SPARK-40279) Document spark.yarn.report.interval
[ https://issues.apache.org/jira/browse/SPARK-40279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40279: Assignee: Apache Spark > Document spark.yarn.report.interval > --- > > Key: SPARK-40279 > URL: https://issues.apache.org/jira/browse/SPARK-40279 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.3.0 >Reporter: Luca Canali >Assignee: Apache Spark >Priority: Minor > > This proposes to document the configuration parameter > spark.yarn.report.interval -> Interval between reports of the current Spark > job status in cluster mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40279) Document spark.yarn.report.interval
[ https://issues.apache.org/jira/browse/SPARK-40279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40279: Assignee: (was: Apache Spark) > Document spark.yarn.report.interval > --- > > Key: SPARK-40279 > URL: https://issues.apache.org/jira/browse/SPARK-40279 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.3.0 >Reporter: Luca Canali >Priority: Minor > > This proposes to document the configuration parameter > spark.yarn.report.interval -> Interval between reports of the current Spark > job status in cluster mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40279) Document spark.yarn.report.interval
[ https://issues.apache.org/jira/browse/SPARK-40279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597896#comment-17597896 ] Apache Spark commented on SPARK-40279: -- User 'LucaCanali' has created a pull request for this issue: https://github.com/apache/spark/pull/37731 > Document spark.yarn.report.interval > --- > > Key: SPARK-40279 > URL: https://issues.apache.org/jira/browse/SPARK-40279 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.3.0 >Reporter: Luca Canali >Priority: Minor > > This proposes to document the configuration parameter > spark.yarn.report.interval -> Interval between reports of the current Spark > job status in cluster mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40279) Document spark.yarn.report.interval
Luca Canali created SPARK-40279: --- Summary: Document spark.yarn.report.interval Key: SPARK-40279 URL: https://issues.apache.org/jira/browse/SPARK-40279 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 3.3.0 Reporter: Luca Canali This proposes to document the configuration parameter spark.yarn.report.interval -> Interval between reports of the current Spark job status in cluster mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
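For context, once documented, this parameter would be set like any other Spark configuration when submitting in cluster mode. A sketch (the 3s value and the application class/jar names are only illustrations; the commonly reported default of 1s should be verified against the documentation being added):

```
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.report.interval=3s \
  --class com.example.Main app.jar
```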
[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions
[ https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597894#comment-17597894 ] Apache Spark commented on SPARK-39915: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/37730 > Dataset.repartition(N) may not create N partitions > -- > > Key: SPARK-39915 > URL: https://issues.apache.org/jira/browse/SPARK-39915 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Shixiong Zhu >Priority: Major > > Looks like there is a behavior change in Dataset.repartition in 3.3.0. For > example, `spark.range(10, 0).repartition(5).rdd.getNumPartitions` returns 5 > in Spark 3.2.0, but 0 in Spark 3.3.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions
[ https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597893#comment-17597893 ] Apache Spark commented on SPARK-39915: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/37730 > Dataset.repartition(N) may not create N partitions > -- > > Key: SPARK-39915 > URL: https://issues.apache.org/jira/browse/SPARK-39915 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Shixiong Zhu >Priority: Major > > Looks like there is a behavior change in Dataset.repartition in 3.3.0. For > example, `spark.range(10, 0).repartition(5).rdd.getNumPartitions` returns 5 > in Spark 3.2.0, but 0 in Spark 3.3.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40278) Used databricks spark-sql-perf with Spark 3.3 to run 3TB tpcds test failed
[ https://issues.apache.org/jira/browse/SPARK-40278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-40278: - Description: I used databricks spark-sql-perf + Spark 3.3 to run 3TB TPCDS q24a or q24b, the test code is as follows: {code:java} val rootDir = "hdfs://${clusterName}/tpcds-data/POCGenData3T" val databaseName = "tpcds_database" val scaleFactor = "3072" val format = "parquet" import com.databricks.spark.sql.perf.tpcds.TPCDSTables val tables = new TPCDSTables( spark.sqlContext,dsdgenDir = "./tpcds-kit/tools", scaleFactor = scaleFactor, useDoubleForDecimal = false,useStringForDate = false) spark.sql(s"create database $databaseName") tables.createTemporaryTables(rootDir, format) spark.sql(s"use $databaseName")// TPCDS 24a or 24b val result = spark.sql(""" with ssales as (select c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, i_current_price, i_manager_id, i_units, i_size, sum(ss_net_paid) netpaid from store_sales, store_returns, store, item, customer, customer_address where ss_ticket_number = sr_ticket_number and ss_item_sk = sr_item_sk and ss_customer_sk = c_customer_sk and ss_item_sk = i_item_sk and ss_store_sk = s_store_sk and c_birth_country = upper(ca_country) and s_zip = ca_zip and s_market_id = 8 group by c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, i_current_price, i_manager_id, i_units, i_size) select c_last_name, c_first_name, s_store_name, sum(netpaid) paid from ssales where i_color = 'pale' group by c_last_name, c_first_name, s_store_name having sum(netpaid) > (select 0.05*avg(netpaid) from ssales)""").collect() sc.stop() {code} The above test may fail due to `Stage cancelled because SparkContext was shut down` in stages 31 and 36 when AQE is enabled, as follows: !image-2022-08-30-21-09-48-763.png! !image-2022-08-30-21-10-24-862.png! !image-2022-08-30-21-10-57-128.png! The DAG corresponding to the SQL is as follows: !image-2022-08-30-21-11-50-895.png! 
The details as follows: {code:java} == Physical Plan == AdaptiveSparkPlan (42) +- == Final Plan == LocalTableScan (1) +- == Initial Plan == Filter (41) +- HashAggregate (40) +- Exchange (39) +- HashAggregate (38) +- HashAggregate (37) +- Exchange (36) +- HashAggregate (35) +- Project (34) +- BroadcastHashJoin Inner BuildRight (33) :- Project (29) : +- BroadcastHashJoin Inner BuildRight (28) : :- Project (24) : : +- BroadcastHashJoin Inner BuildRight (23) : : :- Project (19) : : : +- BroadcastHashJoin Inner BuildRight (18) : : : :- Project (13) : : : : +- SortMergeJoin Inner (12) : : : : :- Sort (6) : : : : : +- Exchange (5) : : : : : +- Project (4) : : : : :+- Filter (3) : : : : : +- Scan parquet (2) : : : : +- Sort (11) : : : :+- Exchange (10) : : : : +- Project (9) : : : : +- Filter (8) : : : : +- Scan parquet (7) : : : +- BroadcastExchange (17) : : :+- Project (16) : : : +- Filter (15) : : : +- Scan parquet (14) : : +- BroadcastExchange (22) : :+- Filter (21) : : +- Scan parquet (20) : +- BroadcastExchange (27) :+- Filter (26) : +- Scan parquet (25) +- BroadcastExchange (32) +- Filter (31) +- Scan parquet (30) (1) LocalTableScan Output [4]: [c_last_name#421, c_first_name#420, s_store_name#669, paid#850] Arguments: , [c_last_name#421, c_first_name#420, s_store_name#669, paid#850] (2) Scan parquet Output [6]: [ss_item_sk#131, ss_customer_sk#132, ss_store_sk#136, ss_ticket_number#138L, ss_net_paid#149, ss_sold_date_sk#152] Batched: true Location: InMemoryFileIndex [afs://tianqi.afs.baidu.com:9902/u
[jira] [Created] (SPARK-40278) Used databricks spark-sql-perf with Spark 3.3 to run 3TB tpcds test failed
Yang Jie created SPARK-40278: Summary: Used databricks spark-sql-perf with Spark 3.3 to run 3TB tpcds test failed Key: SPARK-40278 URL: https://issues.apache.org/jira/browse/SPARK-40278 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0, 3.4.0 Reporter: Yang Jie I used databricks spark-sql-perf + Spark 3.3 to run 3TB TPCDS q24a or q24b, the test code is as follows: ```scala val rootDir = "hdfs://${clusterName}/tpcds-data/POCGenData3T" val databaseName = "tpcds_database" val scaleFactor = "3072" val format = "parquet" import com.databricks.spark.sql.perf.tpcds.TPCDSTables val tables = new TPCDSTables( spark.sqlContext,dsdgenDir = "./tpcds-kit/tools", scaleFactor = scaleFactor, useDoubleForDecimal = false,useStringForDate = false) spark.sql(s"create database $databaseName") tables.createTemporaryTables(rootDir, format) spark.sql(s"use $databaseName") // TPCDS 24a or 24b val result = spark.sql(""" with ssales as (select c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, i_current_price, i_manager_id, i_units, i_size, sum(ss_net_paid) netpaid from store_sales, store_returns, store, item, customer, customer_address where ss_ticket_number = sr_ticket_number and ss_item_sk = sr_item_sk and ss_customer_sk = c_customer_sk and ss_item_sk = i_item_sk and ss_store_sk = s_store_sk and c_birth_country = upper(ca_country) and s_zip = ca_zip and s_market_id = 8 group by c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, i_current_price, i_manager_id, i_units, i_size) select c_last_name, c_first_name, s_store_name, sum(netpaid) paid from ssales where i_color = 'pale' group by c_last_name, c_first_name, s_store_name having sum(netpaid) > (select 0.05*avg(netpaid) from ssales)""").collect() sc.stop() ``` The above test may fail due to `Stage cancelled because SparkContext was shut down` in stages 31 and 36 when AQE is enabled, as follows: !image-2022-08-30-21-09-48-763.png! !image-2022-08-30-21-10-24-862.png! 
!image-2022-08-30-21-10-57-128.png! The DAG corresponding to sql is as follows: !image-2022-08-30-21-11-50-895.png! The details as follows: {code:java} == Physical Plan == AdaptiveSparkPlan (42) +- == Final Plan == LocalTableScan (1) +- == Initial Plan == Filter (41) +- HashAggregate (40) +- Exchange (39) +- HashAggregate (38) +- HashAggregate (37) +- Exchange (36) +- HashAggregate (35) +- Project (34) +- BroadcastHashJoin Inner BuildRight (33) :- Project (29) : +- BroadcastHashJoin Inner BuildRight (28) : :- Project (24) : : +- BroadcastHashJoin Inner BuildRight (23) : : :- Project (19) : : : +- BroadcastHashJoin Inner BuildRight (18) : : : :- Project (13) : : : : +- SortMergeJoin Inner (12) : : : : :- Sort (6) : : : : : +- Exchange (5) : : : : : +- Project (4) : : : : :+- Filter (3) : : : : : +- Scan parquet (2) : : : : +- Sort (11) : : : :+- Exchange (10) : : : : +- Project (9) : : : : +- Filter (8) : : : : +- Scan parquet (7) : : : +- BroadcastExchange (17) : : :+- Project (16) : : : +- Filter (15) : : : +- Scan parquet (14) : : +- BroadcastExchange (22) : :+- Filter (21) : : +- Scan parquet (20) : +- BroadcastExchange (27) :+- Filter (26) : +- Scan parquet (25) +- BroadcastExchange (32) +- Filter (31) +- Scan parquet (30) (1) LocalTableScan Output [4]: [c_last_name#421, c_first_name#420, s_store_name#669, paid#850] Arguments: , [c_last_name#421, c_first_name#420, s_store_name#669, paid#850] (2) Scan parquet
[jira] [Updated] (SPARK-40278) Used databricks spark-sql-perf with Spark 3.3 to run 3TB tpcds test failed
[ https://issues.apache.org/jira/browse/SPARK-40278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-40278: - Affects Version/s: (was: 3.4.0) > Used databricks spark-sql-pref with Spark 3.3 to run 3TB tpcds test failed > -- > > Key: SPARK-40278 > URL: https://issues.apache.org/jira/browse/SPARK-40278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > I used databricks spark-sql-pref + Spark 3.3 to run 3TB TPCDS q24a or q24b, > the test code as follows: > ```scala > val rootDir = "hdfs://${clusterName}/tpcds-data/POCGenData3T" > val databaseName = "tpcds_database" > val scaleFactor = "3072" > val format = "parquet" > import com.databricks.spark.sql.perf.tpcds.TPCDSTables > val tables = new TPCDSTables( > spark.sqlContext,dsdgenDir = "./tpcds-kit/tools", > scaleFactor = scaleFactor, > useDoubleForDecimal = false,useStringForDate = false) > spark.sql(s"create database $databaseName") > tables.createTemporaryTables(rootDir, format) > spark.sql(s"use $databaseName") > // TPCDS 24a or 24b > val result = spark.sql(""" with ssales as > (select c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, > i_current_price, i_manager_id, i_units, i_size, sum(ss_net_paid) > netpaid > from store_sales, store_returns, store, item, customer, customer_address > where ss_ticket_number = sr_ticket_number > and ss_item_sk = sr_item_sk > and ss_customer_sk = c_customer_sk > and ss_item_sk = i_item_sk > and ss_store_sk = s_store_sk > and c_birth_country = upper(ca_country) > and s_zip = ca_zip > and s_market_id = 8 > group by c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, > i_current_price, i_manager_id, i_units, i_size) > select c_last_name, c_first_name, s_store_name, sum(netpaid) paid > from ssales > where i_color = 'pale' > group by c_last_name, c_first_name, s_store_name > having sum(netpaid) > (select 0.05*avg(netpaid) from 
ssales)""").collect() > sc.stop() > ``` > The above test may failed due to `Stage cancelled because SparkContext was > shut down` of stage 31 and stage 36 when AQE enabled as follows: > > !image-2022-08-30-21-09-48-763.png! > !image-2022-08-30-21-10-24-862.png! > !image-2022-08-30-21-10-57-128.png! > > The DAG corresponding to sql is as follows: > !image-2022-08-30-21-11-50-895.png! > The details as follows: > > > {code:java} > == Physical Plan == > AdaptiveSparkPlan (42) > +- == Final Plan == >LocalTableScan (1) > +- == Initial Plan == >Filter (41) >+- HashAggregate (40) > +- Exchange (39) > +- HashAggregate (38) > +- HashAggregate (37) >+- Exchange (36) > +- HashAggregate (35) > +- Project (34) > +- BroadcastHashJoin Inner BuildRight (33) >:- Project (29) >: +- BroadcastHashJoin Inner BuildRight (28) >: :- Project (24) >: : +- BroadcastHashJoin Inner BuildRight (23) >: : :- Project (19) >: : : +- BroadcastHashJoin Inner > BuildRight (18) >: : : :- Project (13) >: : : : +- SortMergeJoin Inner (12) >: : : : :- Sort (6) >: : : : : +- Exchange (5) >: : : : : +- Project (4) >: : : : :+- Filter (3) >: : : : : +- Scan > parquet (2) >: : : : +- Sort (11) >: : : :+- Exchange (10) >: : : : +- Project (9) >: : : : +- Filter (8) >: : : : +- Scan > parquet (7) >: : : +- BroadcastExchange (17) >: : :+- Project (16) >: : : +- Filter (15) >: : : +- Scan parquet (14) >: : +- BroadcastExchange (22) >: :+- Filter (21) >: : +- Scan parquet (20) >: +- BroadcastExchange (27) >
[jira] [Commented] (SPARK-33861) Simplify conditional in predicate
[ https://issues.apache.org/jira/browse/SPARK-33861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597865#comment-17597865 ] Apache Spark commented on SPARK-33861: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/37729 > Simplify conditional in predicate > - > > Key: SPARK-33861 > URL: https://issues.apache.org/jira/browse/SPARK-33861 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.2.0 > > > The use case is: > {noformat} > spark.sql("create table t1 using parquet as select id as a, id as b from > range(10)") > spark.sql("select * from t1 where CASE WHEN a > 2 THEN b + 10 END > > 5").explain() > {noformat} > Before this pr: > {noformat} > == Physical Plan == > *(1) Filter CASE WHEN (a#3L > 2) THEN ((b#4L + 10) > 5) END > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[a#3L,b#4L] Batched: true, DataFilters: > [CASE WHEN (a#3L > 2) THEN ((b#4L + 10) > 5) END], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > {noformat} > After this pr: > {noformat} > == Physical Plan == > *(1) Filter (((isnotnull(a#3L) AND isnotnull(b#4L)) AND (a#3L > 2)) AND > ((b#4L + 10) > 5)) > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[a#3L,b#4L] Batched: true, DataFilters: > [isnotnull(a#3L), isnotnull(b#4L), (a#3L > 2), ((b#4L + 10) > 5)], Format: > Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., > PartitionFilters: [], PushedFilters: [IsNotNull(a), IsNotNull(b), > GreaterThan(a,2)], ReadSchema: struct > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, 
e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33861) Simplify conditional in predicate
[ https://issues.apache.org/jira/browse/SPARK-33861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597862#comment-17597862 ] Apache Spark commented on SPARK-33861: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/37729 > Simplify conditional in predicate > - > > Key: SPARK-33861 > URL: https://issues.apache.org/jira/browse/SPARK-33861 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.2.0 > > > The use case is: > {noformat} > spark.sql("create table t1 using parquet as select id as a, id as b from > range(10)") > spark.sql("select * from t1 where CASE WHEN a > 2 THEN b + 10 END > > 5").explain() > {noformat} > Before this pr: > {noformat} > == Physical Plan == > *(1) Filter CASE WHEN (a#3L > 2) THEN ((b#4L + 10) > 5) END > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[a#3L,b#4L] Batched: true, DataFilters: > [CASE WHEN (a#3L > 2) THEN ((b#4L + 10) > 5) END], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > {noformat} > After this pr: > {noformat} > == Physical Plan == > *(1) Filter (((isnotnull(a#3L) AND isnotnull(b#4L)) AND (a#3L > 2)) AND > ((b#4L + 10) > 5)) > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[a#3L,b#4L] Batched: true, DataFilters: > [isnotnull(a#3L), isnotnull(b#4L), (a#3L > 2), ((b#4L + 10) > 5)], Format: > Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., > PartitionFilters: [], PushedFilters: [IsNotNull(a), IsNotNull(b), > GreaterThan(a,2)], ReadSchema: struct > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, 
e-mail: issues-h...@spark.apache.org
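The optimization above is sound because of SQL's three-valued logic: a CASE WHEN with no ELSE branch yields NULL when its condition is false, and in a WHERE clause a NULL predicate rejects the row just as false does (the extra isnotnull filters in the optimized plan handle NULL column values). A plain-Python sketch checking the equivalence on non-NULL sample inputs (my illustration, not Spark's optimizer code):

```python
def case_when_gt5(a, b):
    # CASE WHEN a > 2 THEN b + 10 END > 5  (no ELSE: yields NULL when a <= 2,
    # and NULL > 5 is NULL, modeled here as None)
    case_result = b + 10 if a > 2 else None
    return None if case_result is None else case_result > 5

def simplified(a, b):
    # Rewritten predicate from the issue: a > 2 AND (b + 10) > 5
    return a > 2 and (b + 10) > 5

# In a WHERE clause, NULL behaves like false: the row is filtered out either way.
for a in range(0, 5):
    for b in range(-10, 10):
        kept_before = case_when_gt5(a, b) is True
        kept_after = simplified(a, b)
        assert kept_before == kept_after
print("equivalent on all sampled rows")
```

Besides being simpler, the rewritten form exposes plain comparisons that can be pushed down to the data source, which is exactly what the `PushedFilters: [IsNotNull(a), IsNotNull(b), GreaterThan(a,2)]` in the "after" plan shows.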
[jira] [Created] (SPARK-40277) Use DataFrame's column for referring to DDL schema for from_csv() and from_json()
Jayant Kumar created SPARK-40277: Summary: Use DataFrame's column for referring to DDL schema for from_csv() and from_json() Key: SPARK-40277 URL: https://issues.apache.org/jira/browse/SPARK-40277 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Jayant Kumar With Spark's DataFrame API one has to explicitly pass the StructType to functions like from_csv and from_json. This works okay in general. In certain circumstances, when the schema depends on one of the DataFrame's fields, it gets complicated and one has to switch to the RDD API. This requires additional libraries with additional parsing logic. I am trying to explore a way to enable such use cases with the DataFrame API and the functions themselves. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
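To make the requested feature concrete, here is a plain-Python sketch (all names hypothetical) of parsing each record's payload with a DDL-style schema carried in another field of the same record — the per-row capability the ticket asks of from_csv/from_json:

```python
import json

def parse_ddl(ddl: str) -> dict:
    """Parse a tiny DDL-like schema string such as 'id INT, name STRING'.

    Hypothetical helper for illustration; Spark's DDL parser is far richer.
    """
    casts = {"INT": int, "STRING": str, "DOUBLE": float}
    fields = {}
    for part in ddl.split(","):
        name, typ = part.strip().split()
        fields[name] = casts[typ.upper()]
    return fields

def from_json_with_column_schema(row: dict) -> dict:
    """Parse row['payload'] using the schema carried in row['schema']."""
    schema = parse_ddl(row["schema"])
    raw = json.loads(row["payload"])
    return {name: cast(raw[name]) for name, cast in schema.items()}

row = {"schema": "id INT, name STRING", "payload": '{"id": "7", "name": "x"}'}
print(from_json_with_column_schema(row))  # {'id': 7, 'name': 'x'}
```

Today this kind of logic forces the switch to RDDs that the ticket describes, because from_json accepts a schema only as a constant (StructType or DDL string), not as a reference to another column.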
[jira] [Commented] (SPARK-40276) reduce the result size of RDD.takeOrdered
[ https://issues.apache.org/jira/browse/SPARK-40276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597852#comment-17597852 ] Apache Spark commented on SPARK-40276: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37728 > reduce the result size of RDD.takeOrdered > - > > Key: SPARK-40276 > URL: https://issues.apache.org/jira/browse/SPARK-40276 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40276) reduce the result size of RDD.takeOrdered
[ https://issues.apache.org/jira/browse/SPARK-40276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597850#comment-17597850 ] Apache Spark commented on SPARK-40276: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37728 > reduce the result size of RDD.takeOrdered > - > > Key: SPARK-40276 > URL: https://issues.apache.org/jira/browse/SPARK-40276 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40276) reduce the result size of RDD.takeOrdered
[ https://issues.apache.org/jira/browse/SPARK-40276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40276: Assignee: (was: Apache Spark) > reduce the result size of RDD.takeOrdered > - > > Key: SPARK-40276 > URL: https://issues.apache.org/jira/browse/SPARK-40276 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40276) reduce the result size of RDD.takeOrdered
[ https://issues.apache.org/jira/browse/SPARK-40276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40276: Assignee: Apache Spark > reduce the result size of RDD.takeOrdered > - > > Key: SPARK-40276 > URL: https://issues.apache.org/jira/browse/SPARK-40276 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40276) reduce the result size of RDD.takeOrdered
Ruifeng Zheng created SPARK-40276: - Summary: reduce the result size of RDD.takeOrdered Key: SPARK-40276 URL: https://issues.apache.org/jira/browse/SPARK-40276 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
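The ticket has no description, but its summary suggests the standard way to bound what takeOrdered ships to the driver: each partition contributes at most its local N smallest elements, and the driver merges those bounded candidate lists. A plain-Python sketch of the pattern (my reconstruction of the idea, not Spark's code):

```python
import heapq

def take_ordered(partitions, n):
    """Return the n smallest elements across partitions, in order.

    Sketch of the takeOrdered pattern: per partition, keep only the
    local top-n (bounding the data sent to the driver), then merge.
    """
    candidates = []
    for part in partitions:
        # Per-partition step: local top-n instead of the whole partition.
        candidates.append(heapq.nsmallest(n, part))
    # Driver-side step: merge the bounded candidate lists.
    return heapq.nsmallest(n, (x for c in candidates for x in c))

parts = [[9, 1, 7], [4, 4, 8, 0], [5]]
print(take_ordered(parts, 3))  # [0, 1, 4]
```

With P partitions, the driver receives at most P*n elements regardless of partition sizes, which is the "result size" being reduced.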
[jira] [Commented] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
[ https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597778#comment-17597778 ] Apache Spark commented on SPARK-40056: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/37727 > Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9 > - > > Key: SPARK-40056 > URL: https://issues.apache.org/jira/browse/SPARK-40056 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40207) Specify the column name when the data type is not supported by datasource
[ https://issues.apache.org/jira/browse/SPARK-40207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-40207: --- Assignee: Yi kaifei (was: Apache Spark) > Specify the column name when the data type is not supported by datasource > - > > Key: SPARK-40207 > URL: https://issues.apache.org/jira/browse/SPARK-40207 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yi kaifei >Assignee: Yi kaifei >Priority: Major > Fix For: 3.4.0 > > > Currently, if the data type is not supported by the data source, the > exception message thrown does not contain the column name, which makes the > problem harder to locate. This Jira aims to improve the error message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40207) Specify the column name when the data type is not supported by datasource
[ https://issues.apache.org/jira/browse/SPARK-40207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-40207. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37574 [https://github.com/apache/spark/pull/37574] > Specify the column name when the data type is not supported by datasource > - > > Key: SPARK-40207 > URL: https://issues.apache.org/jira/browse/SPARK-40207 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yi kaifei >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > > Currently, if the data type is not supported by the data source, the > exception message thrown does not contain the column name, which makes the > problem harder to locate. This Jira aims to improve the error message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40275) Support casting decimal128
jiaan.geng created SPARK-40275: -- Summary: Support casting decimal128 Key: SPARK-40275 URL: https://issues.apache.org/jira/browse/SPARK-40275 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: jiaan.geng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
[ https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 张刘强 updated SPARK-40274: Environment: spark 3.1.2 scala 2.12.10 jdk 11 linux (was: spark 3.1.2 scala 2.12.10 jdk 1.8 linux) > ArrayIndexOutOfBoundsException in BytecodeReadingParanamer > -- > > Key: SPARK-40274 > URL: https://issues.apache.org/jira/browse/SPARK-40274 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.1.2 > Environment: spark 3.1.2 scala 2.12.10 jdk 11 linux >Reporter: 张刘强 >Priority: Major > Attachments: code.scala, error.txt, pom.txt > > spark 3.1.2 scala 2.12.10 jdk 1.8 linux > > when dataframe.count is used, it throws this exception: > > The stack trace is as follows: > > java.lang.ArrayIndexOutOfBoundsException: Index 28499 out of bounds for > length 206 > at > com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532) > at > com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315) > at > com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102) > at > com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:59) > at > scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at 
scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292) > at > scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:59) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:181) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at scala.collection.TraversableLike.map(TraversableLike.scala:285) > at scala.collection.TraversableLike.map$(TraversableLike.scala:278) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14(BeanIntrospector.scala:175) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14$adapted(BeanIntrospector.scala:174) > at scala.collection.immutable.List.flatMap(List.scala:366) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:174) > at > com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:20) > at > com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.fieldName(ScalaAnnotationIntrospectorModule.scala:28) > at > com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.findImplicitPropertyName(ScalaAnnotationIntrospectorModule.scala:80) > at > 
com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findImplicitPropertyName(AnnotationIntrospectorPair.java:490) > at > com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector._addFields(POJOPropertiesCollector.java:380) > at > com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:308) > at > com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.getJsonValueAccessor(POJOPropertiesCollector.java:196) > at > com.fasterxml.jackson.databind.int
[jira] [Updated] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
[ https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 张刘强 updated SPARK-40274: Description: spark 3.1.2 scala 2.12.10 jdk 1.8 linux. Calling dataframe.count throws java.lang.ArrayIndexOutOfBoundsException: Index 28499 out of bounds for length 206, starting at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532).
[jira] [Updated] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
[ https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 张刘强 updated SPARK-40274: Environment: spark 3.1.2 scala 2.12.10 jdk 1.8 linux (was:
com.fasterxml.jackson.core:jackson-core:2.10.5
com.fasterxml.jackson.core:jackson-databind:2.10.5
com.fasterxml.jackson.core:jackson-annotations:2.10.5
com.fasterxml.jackson.module:jackson-module-scala_2.12:2.10.5
com.thoughtworks.paranamer:paranamer:2.8
org.apache.spark:spark-core_2.12:3.1.2 (excluding jackson-core, jackson-databind, jackson-module-scala_2.12)
org.apache.spark:spark-sql_2.12:3.1.2 (excluding jackson-core, jackson-databind, jackson-module-scala_2.12)
)
[jira] [Updated] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
[ https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 张刘强 updated SPARK-40274: Attachment: pom.txt
[jira] [Updated] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
[ https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 张刘强 updated SPARK-40274: Attachment: code.scala
[jira] [Updated] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
[ https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 张刘强 updated SPARK-40274: Attachment: error.txt
[jira] [Updated] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
[ https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 张刘强 updated SPARK-40274: Docs Text: (was:
code like this:
{code:java}
val dataFrame: DataFrame = sparkSession.read
  .format(JDBC)
  .option(SSL, TRUE)
  .option(SSL_VERIFICATION, NONE)
  .option(DRIVER, if (DatasourceTaskType.RESOURCE_DIRECTORY.name == inputDataSourceInfo.getSourceType) {
    COM_TRINO_JDBC_DRIVER
  } else {
    JdbcParamsUtil.getDriver(DbType.valueOf(inputDataSourceInfo.getDatabaseType).getCode)
  })
  .option(URL, if (DatasourceTaskType.RESOURCE_DIRECTORY.name == inputDataSourceInfo.getSourceType) {
    inputDataSourceInfo.getAddress
  } else {
    inputDataSourceInfo.getJdbcUrl
  })
  .option(USER, inputDataSourceInfo.getUser)
  .option(PASSWORD, inputDataSourceInfo.getPassword)
  .option("keepAlive", TRUE)
  .option(QUERY, baseSql)
  .load
columns = dataFrame.columns
val count: Long = dataFrame.count()
{code}
)
[jira] [Created] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
张刘强 created SPARK-40274: --- Summary: ArrayIndexOutOfBoundsException in BytecodeReadingParanamer Key: SPARK-40274 URL: https://issues.apache.org/jira/browse/SPARK-40274 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 3.1.2 Environment:
com.fasterxml.jackson.core:jackson-core:2.10.5
com.fasterxml.jackson.core:jackson-databind:2.10.5
com.fasterxml.jackson.core:jackson-annotations:2.10.5
com.fasterxml.jackson.module:jackson-module-scala_2.12:2.10.5
com.thoughtworks.paranamer:paranamer:2.8
org.apache.spark:spark-core_2.12:3.1.2 (excluding jackson-core, jackson-databind, jackson-module-scala_2.12)
org.apache.spark:spark-sql_2.12:3.1.2 (excluding jackson-core, jackson-databind, jackson-module-scala_2.12)
Reporter: 张刘强

spark 3.1.2 scala 2.12.10 jdk 1.8 linux

Calling dataframe.count throws this exception:

java.lang.ArrayIndexOutOfBoundsException: Index 28499 out of bounds for length 206
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
at com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
at com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45)
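The attached pom pins Jackson 2.10.5 and paranamer 2.8 explicitly, and excludes the Jackson artifacts that spark-core and spark-sql would otherwise bring in. One commonly tried mitigation for BytecodeReadingParanamer failures of this kind (a sketch only, not verified in this thread) is to drop the explicit pins and exclusions so that the Jackson and paranamer versions bundled with the Spark release are used unchanged:

{code:xml}
<!-- Sketch: depend on Spark without Jackson exclusions, so the
     jackson-module-scala / paranamer versions Spark ships with are used -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.1.2</version>
</dependency>
{code}

Version mismatches between jackson-module-scala, paranamer, and the Scala/JDK that compiled the classes being introspected are a frequent cause of this ArrayIndexOutOfBoundsException, so aligning everything on Spark's own versions is a reasonable first step to try.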
[jira] [Assigned] (SPARK-40253) Data read exception in orc format
[ https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40253: Assignee: Apache Spark > Data read exception in orc format > -- > > Key: SPARK-40253 > URL: https://issues.apache.org/jira/browse/SPARK-40253 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 > Environment: os centos7 > spark 2.4.3 > hive 1.2.1 > hadoop 2.7.2 >Reporter: yihangqiao >Assignee: Apache Spark >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > Caused by: java.io.EOFException: Read past end of RLE integer from compressed > stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 > offset: 0 limit: 0 > When running batches with spark-sql using the create table xxx as select > syntax, the select part uses a static literal as a default value (0.00 > as column_name) without specifying its data type. Because the type is not > explicit, the field's type metadata in the written ORC file is incomplete > (the write succeeds), but any read whose projection includes this field > fails to parse the ORC file with the following error: > > {code:java} > create table testgg as select 0.00 as gg;select * from testgg;Caused by: > java.io.IOException: Error reading file: > viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-0-e7df51a1-98b9-4472-9899-3c132b97885b-c000 > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at >
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown > Source) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at > org.apache.spark.scheduler.Task.run(Task.scala:121) at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)Caused by: > java.io.EOFException: Read past end of RLE 
integer from compressed stream > Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 > limit: 0 at > org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61) > at > org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323) > at > org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398) > at > org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderFactory.java:1205) > at > org.apache.orc.impl.TreeReaderFactory$DecimalTreeRead
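A plausible user-side mitigation (an editor's assumption, not a fix stated in the ticket) is to give the literal an explicit type so the ORC writer records complete column metadata. A minimal sketch of the before/after DDL, with `decimal(3, 2)` chosen purely for illustration:

```python
# Sketch: the failing DDL writes an untyped numeric literal; adding an
# explicit CAST gives the column a concrete decimal type, so the ORC
# metadata is written completely. Table/column names come from the ticket;
# the decimal(3, 2) type is an illustrative assumption.
broken_ddl = "create table testgg as select 0.00 as gg"
fixed_ddl = "create table testgg as select cast(0.00 as decimal(3, 2)) as gg"

def has_explicit_type(ddl: str) -> bool:
    # Crude check: does the statement give the literal an explicit type?
    return "cast(" in ddl.lower()

print(has_explicit_type(broken_ddl), has_explicit_type(fixed_ddl))  # False True
```

The same pattern applies to any `create table ... as select` that projects a bare constant.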
[jira] [Assigned] (SPARK-40253) Data read exception in orc format
[ https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40253: Assignee: (was: Apache Spark) > Data read exception in orc format > -- > > Key: SPARK-40253 > URL: https://issues.apache.org/jira/browse/SPARK-40253 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 > Environment: os centos7 > spark 2.4.3 > hive 1.2.1 > hadoop 2.7.2 >Reporter: yihangqiao >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > Caused by: java.io.EOFException: Read past end of RLE integer from compressed > stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 > offset: 0 limit: 0 > When running batches using spark-sql and using the create table xxx as select > syntax, the select query part uses a static value as the default value (0.00 > as column_name) and does not specify the data type of the default value. In > this usage scenario, because the data type is not explicitly specified, the > metadata information of the field in the written ORC file is missing (the > write is successful), but when reading, as long as the query column > contains this field, it will not be able to parse the ORC file, and the > following error occurs: > > {code:java} > create table testgg as select 0.00 as gg;select * from testgg;Caused by: > java.io.IOException: Error reading file: > viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-0-e7df51a1-98b9-4472-9899-3c132b97885b-c000 > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown > Source) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at > org.apache.spark.scheduler.Task.run(Task.scala:121) at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)Caused by: > java.io.EOFException: Read past end of RLE 
integer from compressed stream > Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 > limit: 0 at > org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61) > at > org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323) > at > org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398) > at > org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderFactory.java:1205) > at > org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderF
[jira] [Commented] (SPARK-40253) Data read exception in orc format
[ https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597718#comment-17597718 ] Apache Spark commented on SPARK-40253: -- User 'SelfImpr001' has created a pull request for this issue: https://github.com/apache/spark/pull/37726 > Data read exception in orc format > -- > > Key: SPARK-40253 > URL: https://issues.apache.org/jira/browse/SPARK-40253 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 > Environment: os centos7 > spark 2.4.3 > hive 1.2.1 > hadoop 2.7.2 >Reporter: yihangqiao >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > Caused by: java.io.EOFException: Read past end of RLE integer from compressed > stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 > offset: 0 limit: 0 > When running batches using spark-sql and using the create table xxx as select > syntax, the select query part uses a static value as the default value (0.00 > as column_name) and does not specify the data type of the default value. 
In > this usage scenario, because the data type is not explicitly specified, the > metadata information of the field in the written ORC file is missing (the > write is successful), but when reading, as long as the query column > contains this field, it will not be able to parse the ORC file, and the > following error occurs: > > {code:java} > create table testgg as select 0.00 as gg;select * from testgg;Caused by: > java.io.IOException: Error reading file: > viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-0-e7df51a1-98b9-4472-9899-3c132b97885b-c000 > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown > Source) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at > org.apache.spark.scheduler.Task.run(Task.scala:121) at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)Caused by: > java.io.EOFException: Read past end of RLE integer from compressed stream > Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 > limit: 0 at > org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61) > at > org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323) > at > org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398) > at > org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(Tree
[jira] [Assigned] (SPARK-40273) Fix the documents "Contributing and Maintaining Type Hints".
[ https://issues.apache.org/jira/browse/SPARK-40273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40273: Assignee: Apache Spark > Fix the documents "Contributing and Maintaining Type Hints". > > > Key: SPARK-40273 > URL: https://issues.apache.org/jira/browse/SPARK-40273 > Project: Spark > Issue Type: Test > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > Since we don't use `*.pyi` for type hinting anymore (it's all ported as > inline type hints in the `*.py` files), we also should fix the related > documents accordingly > (https://spark.apache.org/docs/latest/api/python/development/contributing.html#contributing-and-maintaining-type-hints) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40273) Fix the documents "Contributing and Maintaining Type Hints".
[ https://issues.apache.org/jira/browse/SPARK-40273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597690#comment-17597690 ] Apache Spark commented on SPARK-40273: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/37724 > Fix the documents "Contributing and Maintaining Type Hints". > > > Key: SPARK-40273 > URL: https://issues.apache.org/jira/browse/SPARK-40273 > Project: Spark > Issue Type: Test > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Since we don't use `*.pyi` for type hinting anymore (it's all ported as > inline type hints in the `*.py` files), we also should fix the related > documents accordingly > (https://spark.apache.org/docs/latest/api/python/development/contributing.html#contributing-and-maintaining-type-hints)
[jira] [Assigned] (SPARK-40273) Fix the documents "Contributing and Maintaining Type Hints".
[ https://issues.apache.org/jira/browse/SPARK-40273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40273: Assignee: (was: Apache Spark) > Fix the documents "Contributing and Maintaining Type Hints". > > > Key: SPARK-40273 > URL: https://issues.apache.org/jira/browse/SPARK-40273 > Project: Spark > Issue Type: Test > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Since we don't use `*.pyi` for type hinting anymore (it's all ported as > inline type hints in the `*.py` files), we also should fix the related > documents accordingly > (https://spark.apache.org/docs/latest/api/python/development/contributing.html#contributing-and-maintaining-type-hints)
[jira] [Commented] (SPARK-39763) Executor memory footprint substantially increases while reading zstd compressed parquet files
[ https://issues.apache.org/jira/browse/SPARK-39763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597685#comment-17597685 ] Fengyu Cao commented on SPARK-39763: had the same problem with one of our datasets, 75 GB in zstd parquet (134 GB in snappy) {code:java} # 10 executor # Executor Reqs: memoryOverhead: [amount: 3072] cores: [amount: 4] memory: [amount: 10240] offHeap: [amount: 4096] Task Reqs: cpus: [amount: 1.0] df = spark.read.parquet("dataset_zstd") # with spark.sql.parquet.enableVectorizedReader=false df.write.mode("overwrite").format("noop").save() {code} tasks failed with OOM, but with the snappy dataset everything is fine > Executor memory footprint substantially increases while reading zstd > compressed parquet files > - > > Key: SPARK-39763 > URL: https://issues.apache.org/jira/browse/SPARK-39763 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Priority: Minor > > Hi all, > > While transitioning from the default snappy compression to zstd, we noticed a > substantial increase in executor memory whilst *reading* and applying > transformations on *zstd* compressed parquet files. > Memory footprint increased 3-fold in some cases, compared to > reading and applying the same transformations on a parquet file compressed > with snappy. > This behaviour only occurs when reading zstd compressed parquet files. > Writing a zstd parquet file does not result in this behaviour. > To reproduce: > # Set "spark.sql.parquet.compression.codec" to zstd > # Write some parquet files, the compression will default to zstd after > setting the option above > # Read the compressed zstd file and run some transformations. Compare the > memory usage of the executor vs running the same transformation on a parquet > file with snappy compression. 
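The reproduction steps above can be sketched as follows. The runnable part only assembles the configuration keys (which are real Spark settings named in the thread); the actual read/write against a live SparkSession is shown as comments, since it is the reporter's workload, not something reproducible here:

```python
# Spark settings involved in the report; the values mirror the setups
# described in the comment and reproduction steps.
confs = {
    "spark.sql.parquet.compression.codec": "zstd",        # write side
    "spark.sql.parquet.enableVectorizedReader": "false",  # reader variant tried
}

# Hypothetical reproduction against a live SparkSession `spark`
# (path name taken from the comment):
#   df = spark.read.parquet("dataset_zstd")
#   df.write.mode("overwrite").format("noop").save()

# Print spark-submit style flags for the comparison run.
for key, value in confs.items():
    print(f"--conf {key}={value}")
```

Swapping `zstd` for `snappy` in the codec conf gives the baseline to compare executor memory against.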
[jira] [Commented] (SPARK-40273) Fix the documents "Contributing and Maintaining Type Hints".
[ https://issues.apache.org/jira/browse/SPARK-40273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597646#comment-17597646 ] Haejoon Lee commented on SPARK-40273: - I'm working on this > Fix the documents "Contributing and Maintaining Type Hints". > > > Key: SPARK-40273 > URL: https://issues.apache.org/jira/browse/SPARK-40273 > Project: Spark > Issue Type: Test > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Since we don't use `*.pyi` for type hinting anymore (it's all ported as > inline type hints in the `*.py` files), we also should fix the related > documents accordingly > (https://spark.apache.org/docs/latest/api/python/development/contributing.html#contributing-and-maintaining-type-hints)
[jira] [Created] (SPARK-40273) Fix the documents "Contributing and Maintaining Type Hints".
Haejoon Lee created SPARK-40273: --- Summary: Fix the documents "Contributing and Maintaining Type Hints". Key: SPARK-40273 URL: https://issues.apache.org/jira/browse/SPARK-40273 Project: Spark Issue Type: Test Components: Documentation, PySpark Affects Versions: 3.4.0 Reporter: Haejoon Lee Since we don't use `*.pyi` for type hinting anymore (it's all ported as inline type hints in the `*.py` files), we also should fix the related documents accordingly (https://spark.apache.org/docs/latest/api/python/development/contributing.html#contributing-and-maintaining-type-hints)
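For context on the stub-to-inline migration the doc fix tracks, a minimal illustration (the function here is hypothetical, not from the PySpark codebase):

```python
# Before the migration, the signature lived in a separate functions.pyi stub:
#     def repeat(col: str, n: int) -> str: ...
# After it, the hint is written inline in the .py file itself:
def repeat(col: str, n: int) -> str:
    """Hypothetical helper: repeat a string n times."""
    return col * n

print(repeat("ab", 3))  # ababab
```

Type checkers read the inline annotations directly, which is why the contributing guide no longer needs its `*.pyi` instructions.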
[jira] [Commented] (SPARK-40271) Support list type for pyspark.sql.functions.lit
[ https://issues.apache.org/jira/browse/SPARK-40271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597637#comment-17597637 ] Apache Spark commented on SPARK-40271: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/37722 > Support list type for pyspark.sql.functions.lit > --- > > Key: SPARK-40271 > URL: https://issues.apache.org/jira/browse/SPARK-40271 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Currently, `pyspark.sql.functions.lit` doesn't support the Python list type, > as below: > {code:python} > >>> df = spark.range(3).withColumn("c", lit([1,2,3])) > Traceback (most recent call last): > ... > : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] > The feature is not supported: Literal for '[1, 2, 3]' of class > java.util.ArrayList. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302) > at > org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100) > at org.apache.spark.sql.functions$.lit(functions.scala:125) > at org.apache.spark.sql.functions.lit(functions.scala) > at > java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:577) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) > at py4j.Gateway.invoke(Gateway.java:282) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at > py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) > at py4j.ClientServerConnection.run(ClientServerConnection.java:106) > at java.base/java.lang.Thread.run(Thread.java:833) > {code} > We should support it. 
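Until `lit` accepts a list directly, a common workaround (an editor's note, not part of the ticket) is `array(*[lit(x) for x in values])` from `pyspark.sql.functions`. The SQL expression that workaround produces can be assembled in plain Python, with no SparkSession needed:

```python
def array_literal_sql(values):
    # Build the SQL equivalent of array(lit(v1), lit(v2), ...) for a list
    # of ints. A sketch of the workaround's effect, not the proposed API.
    return "array(" + ", ".join(str(v) for v in values) + ")"

print(array_literal_sql([1, 2, 3]))  # array(1, 2, 3)
```

The linked PR instead teaches `lit` itself to perform this expansion, so callers can pass the Python list directly.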