[jira] [Assigned] (SPARK-40285) Simplify the roundTo[Numeric] for Decimal

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40285:


Assignee: Apache Spark

> Simplify the roundTo[Numeric] for Decimal
> -
>
> Key: SPARK-40285
> URL: https://issues.apache.org/jira/browse/SPARK-40285
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Spark Decimal has a lot of methods named roundTo*.
> Except for roundToLong, all of the others are redundant



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40285) Simplify the roundTo[Numeric] for Decimal

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40285:


Assignee: (was: Apache Spark)

> Simplify the roundTo[Numeric] for Decimal
> -
>
> Key: SPARK-40285
> URL: https://issues.apache.org/jira/browse/SPARK-40285
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark Decimal has a lot of methods named roundTo*.
> Except for roundToLong, all of the others are redundant



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40285) Simplify the roundTo[Numeric] for Decimal

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598204#comment-17598204
 ] 

Apache Spark commented on SPARK-40285:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/37736

> Simplify the roundTo[Numeric] for Decimal
> -
>
> Key: SPARK-40285
> URL: https://issues.apache.org/jira/browse/SPARK-40285
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark Decimal has a lot of methods named roundTo*.
> Except for roundToLong, all of the others are redundant



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40284) spark concurrent overwrite mode writes data to files in HDFS format, all request data write success

2022-08-30 Thread Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu updated SPARK-40284:

Description: 
We run Spark as a service, so the same Spark application has to handle multiple 
requests, and I have a problem with this setup.

When multiple requests overwrite the same directory at the same time, the 
results of both overwrite requests can end up written successfully. I don't 
think this meets the definition of an overwrite write.

I ran write SQL1 first and then write SQL2, and found that both results were 
present in the end, which seems unreasonable.
{code:scala}
import org.apache.spark.sql.SaveMode

// register a UDF that delays the first query
sparkSession.udf.register("sleep", (time: Long) => Thread.sleep(time))

// write sql1
sparkSession.sql("select 1 as id, sleep(4) as time")
  .write.mode(SaveMode.Overwrite).parquet("path")

// write sql2
sparkSession.sql("select 2 as id, 1 as time")
  .write.mode(SaveMode.Overwrite).parquet("path")
{code}
Reading the Spark source, I saw that all of this logic lives in the 
InsertIntoHadoopFsRelationCommand class.

When the target directory already exists, Spark deletes it and then writes into 
the _temporary directory of its own job. However, when multiple requests write 
concurrently, all of their data ends up appended. For the write SQL above, the 
sequence is:

1. Write SQL1 starts; Spark creates the _temporary directory for SQL1 and 
continues.

2. Write SQL2 starts; Spark deletes the entire target directory and creates its 
own _temporary directory.

3. SQL2 writes its data.

4. SQL1 finishes its computation. Its _temporary/0/attempt_id directory no 
longer exists, so the request fails. The task is retried, but the _temporary 
directory is not deleted on retry, so the result of SQL1 is appended to the 
target directory.

Based on the process above, could Spark do a directory check before the write 
task, or use some other mechanism to avoid this kind of problem?
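As a stopgap on the application side (not a Spark change), one option is to 
serialize overwrites per target path inside the service, so two jobs never race 
on the same directory. A minimal sketch, assuming every request for a given path 
is handled by this one service process; the OverwriteSerializer object and its 
writeOverwrite helper are hypothetical names:
{code:scala}
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.locks.ReentrantLock

import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper: one lock per target path, so concurrent overwrite
// requests to the same directory run one at a time instead of racing.
object OverwriteSerializer {
  private val locks = new ConcurrentHashMap[String, ReentrantLock]()

  def writeOverwrite(df: DataFrame, path: String): Unit = {
    val lock = locks.computeIfAbsent(path, _ => new ReentrantLock())
    lock.lock()
    try {
      df.write.mode(SaveMode.Overwrite).parquet(path)
    } finally {
      lock.unlock()
    }
  }
}
{code}
This only helps when a single driver process owns all writes to the path; it 
does not address independent Spark applications overwriting the same HDFS/S3 
directory.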

 

 

 

 

  was:
We use Spark as a service. The same Spark service needs to handle multiple 
requests, but I have a problem with this

When multiple requests are overwritten to a directory at the same time, the 
results of two overwrite requests may be written successfully. I think this 
does not meet the definition of overwrite write

First I ran Write SQL1, then I ran Write SQL2, and I found that both data were 
written in the end, which I thought was unreasonable
{code:java}
sparkSession.udf.register("sleep",  (time: Long) => Thread.sleep(time))



-- write sql1
sparkSession.sql("select 1 as id, sleep(4) as 
time").write.mode(SaveMode.Overwrite).parquet("path")

-- write sql2
 sparkSession.sql("select 2 as id, 1 as 
time").write.mode(SaveMode.Overwrite).parquet("path") {code}
When the spark source, and I saw that all these logic in 
InsertIntoHadoopFsRelationCommand this class.

 

When the target directory already exists, Spark directly deletes the target 
directory and writes to the _temporary directory that it requests. However, 
when multiple requests are written, the data will all append in; For example, 
in Write SQL above, this procedure occurs

1. Run write SQL 1, SQL 1 to create the _TEMPORARY directory, and continue

2. Run write SQL 2 to delete the entire target directory and create its own 
_TEMPORARY

3. Sql2 writes data

4. SQL 1 completion. The corresponding _temporary /0/attemp_id directory does 
not exist and fails. However, the task is retried, but the _temporary  
directory is not deleted when the task is retried. Therefore, the execution 
result of SQL1 is sent to the target directory by append

 

Based on the above process, the write process, can you do a directory check 
before the write task or some other way to avoid this kind of problem

 

 

 

 


> spark  concurrent overwrite mode writes data to files in HDFS format, all 
> request data write success
> 
>
> Key: SPARK-40284
> URL: https://issues.apache.org/jira/browse/SPARK-40284
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.1
>Reporter: Liu
>Priority: Major
>
> We use Spark as a service. The same Spark service needs to handle multiple 
> requests, but I have a problem with this
> When multiple requests are overwritten to a directory at the same time, the 
> results of two overwrite requests may be written successfully. I think this 
> does not meet the definition of overwrite write
> First I ran Write SQL1, then I ran Write SQL2, and I found that both data 
> were written in the end, which I thought was unreasonable
> {code:java}
> sparkSession.udf.register("sleep",  (time: Long) => Thread.sleep(time))
> -- write sql1
> sparkSession.sql("select 1 as id, sleep(4) as 
> time").wr

[jira] [Updated] (SPARK-40286) Load Data from S3 deletes data source file

2022-08-30 Thread Drew (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew updated SPARK-40286:
-
Description: 
Hello, 

I'm using Spark to [load 
data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into a 
Hive table through PySpark, and when I load data from a path in Amazon S3, the 
original file gets wiped from the directory. The file is found and does populate 
the table with data. I also tried adding the `LOCAL` clause, but that throws an 
error when looking for the file. The documentation doesn't explicitly state that 
this is the intended behavior.

Thanks in advance!
{code:java}
spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE src"){code}

  was:
Hello, 

I'm using spark to load data into a hive table through Pyspark, and when I load 
data from a path in Amazon S3, the original file is getting wiped from the 
Directory. The file is found, and is populating the table with data. I also 
tried to add the `Local` clause but that throws an error when looking for the 
file. When looking through the documentation it doesn't explicitly state that 
this is the intended behavior.

Thanks in advance!
{code:java}
spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE 
src"){code}


> Load Data from S3 deletes data source file
> --
>
> Key: SPARK-40286
> URL: https://issues.apache.org/jira/browse/SPARK-40286
> Project: Spark
>  Issue Type: Question
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello, 
> I'm using spark to [load 
> data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into 
> a hive table through Pyspark, and when I load data from a path in Amazon S3, 
> the original file is getting wiped from the Directory. The file is found, and 
> is populating the table with data. I also tried to add the `Local` clause but 
> that throws an error when looking for the file. When looking through the 
> documentation it doesn't explicitly state that this is the intended behavior.
> Thanks in advance!
> {code:java}
> spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
> spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE 
> src"){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40286) Load Data from S3 deletes data source file

2022-08-30 Thread Drew (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew updated SPARK-40286:
-
Priority: Major  (was: Trivial)

> Load Data from S3 deletes data source file
> --
>
> Key: SPARK-40286
> URL: https://issues.apache.org/jira/browse/SPARK-40286
> Project: Spark
>  Issue Type: Question
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello, 
> I'm using spark to load data into a hive table through Pyspark, and when I 
> load data from a path in Amazon S3, the original file is getting wiped from 
> the Directory. The file is found, and is populating the table with data. I 
> also tried to add the `Local` clause but that throws an error when looking 
> for the file. When looking through the documentation it doesn't explicitly 
> state that this is the intended behavior.
> Thanks in advance!
> {code:java}
> spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
> spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE 
> src"){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40286) Load Data from S3 deletes data source file

2022-08-30 Thread Drew (Jira)
Drew created SPARK-40286:


 Summary: Load Data from S3 deletes data source file
 Key: SPARK-40286
 URL: https://issues.apache.org/jira/browse/SPARK-40286
 Project: Spark
  Issue Type: Question
  Components: Documentation
Affects Versions: 3.2.1
Reporter: Drew


Hello, 

I'm using spark to load data into a hive table through Pyspark, and when I load 
data from a path in Amazon S3, the original file is getting wiped from the 
Directory. The file is found, and is populating the table with data. I also 
tried to add the `Local` clause but that throws an error when looking for the 
file. When looking through the documentation it doesn't explicitly state that 
this is the intended behavior.

Thanks in advance!
{code:java}
spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE 
src"){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40271) Support list type for pyspark.sql.functions.lit

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598183#comment-17598183
 ] 

Apache Spark commented on SPARK-40271:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/37735

> Support list type for pyspark.sql.functions.lit
> ---
>
> Key: SPARK-40271
> URL: https://issues.apache.org/jira/browse/SPARK-40271
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, `pyspark.sql.functions.lit` doesn't support the Python list type, 
> as shown below:
> {code:python}
> >>> df = spark.range(3).withColumn("c", lit([1,2,3]))
> Traceback (most recent call last):
> ...
> : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] 
> The feature is not supported: Literal for '[1, 2, 3]' of class 
> java.util.ArrayList.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302)
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100)
>   at org.apache.spark.sql.functions$.lit(functions.scala:125)
>   at org.apache.spark.sql.functions.lit(functions.scala)
>   at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:577)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at 
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>   at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> We should make it supported.
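Until that lands, here is a sketch of what already works today. It uses the 
Scala API for brevity (in PySpark one would wrap each element in lit() inside 
array()); the column names are purely illustrative:
{code:scala}
import org.apache.spark.sql.functions.{array, lit, typedLit}

val df = spark.range(3)
  .withColumn("c1", typedLit(Seq(1, 2, 3)))           // typedLit already handles Seq and Map
  .withColumn("c2", array(lit(1), lit(2), lit(3)))    // or build the array element by element

df.show()
{code}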



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40271) Support list type for pyspark.sql.functions.lit

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598182#comment-17598182
 ] 

Apache Spark commented on SPARK-40271:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/37735

> Support list type for pyspark.sql.functions.lit
> ---
>
> Key: SPARK-40271
> URL: https://issues.apache.org/jira/browse/SPARK-40271
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, `pyspark.sql.functions.lit` doesn't support the Python list type, 
> as shown below:
> {code:python}
> >>> df = spark.range(3).withColumn("c", lit([1,2,3]))
> Traceback (most recent call last):
> ...
> : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] 
> The feature is not supported: Literal for '[1, 2, 3]' of class 
> java.util.ArrayList.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302)
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100)
>   at org.apache.spark.sql.functions$.lit(functions.scala:125)
>   at org.apache.spark.sql.functions.lit(functions.scala)
>   at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:577)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at 
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>   at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> We should make it supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39896) The structural integrity of the plan is broken after UnwrapCastInBinaryComparison

2022-08-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-39896:
---

Assignee: Fu Chen

> The structural integrity of the plan is broken after 
> UnwrapCastInBinaryComparison
> -
>
> Key: SPARK-39896
> URL: https://issues.apache.org/jira/browse/SPARK-39896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Fu Chen
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
>
> {code:scala}
> sql("create table t1(a decimal(3, 0)) using parquet")
> sql("insert into t1 values(100), (10), (1)")
> sql("select * from t1 where a in(10, 10, 0, 1.00)").show
> {code}
> {noformat}
> After applying rule 
> org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch 
> Operator Optimization before Inferring Filters, the structural integrity of 
> the plan is broken.
> java.lang.RuntimeException: After applying rule 
> org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch 
> Operator Optimization before Inferring Filters, the structural integrity of 
> the plan is broken.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39896) The structural integrity of the plan is broken after UnwrapCastInBinaryComparison

2022-08-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-39896.
-
Fix Version/s: 3.3.1
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37439
[https://github.com/apache/spark/pull/37439]

> The structural integrity of the plan is broken after 
> UnwrapCastInBinaryComparison
> -
>
> Key: SPARK-39896
> URL: https://issues.apache.org/jira/browse/SPARK-39896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
>
> {code:scala}
> sql("create table t1(a decimal(3, 0)) using parquet")
> sql("insert into t1 values(100), (10), (1)")
> sql("select * from t1 where a in(10, 10, 0, 1.00)").show
> {code}
> {noformat}
> After applying rule 
> org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch 
> Operator Optimization before Inferring Filters, the structural integrity of 
> the plan is broken.
> java.lang.RuntimeException: After applying rule 
> org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch 
> Operator Optimization before Inferring Filters, the structural integrity of 
> the plan is broken.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40285) Simplify the roundTo[Numeric] for Decimal

2022-08-30 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-40285:
--

 Summary: Simplify the roundTo[Numeric] for Decimal
 Key: SPARK-40285
 URL: https://issues.apache.org/jira/browse/SPARK-40285
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng


Spark Decimal has a lot of methods named roundTo*.
Except for roundToLong, all of the others are redundant
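A rough sketch of the kind of consolidation this implies; this is illustrative 
only and not the actual Spark code, it just shows the narrower variants 
delegating to roundToLong with a shared bounds check:
{code:scala}
import org.apache.spark.sql.types.Decimal

// Illustrative helper: round via roundToLong once, then range-check and narrow.
def roundTo[T](d: Decimal, minValue: Long, maxValue: Long)(convert: Long => T): T = {
  val longValue = d.roundToLong()   // the one method the report wants to keep
  if (longValue < minValue || longValue > maxValue) {
    throw new ArithmeticException(s"Rounded value $longValue is out of range")
  }
  convert(longValue)
}

def roundToInt(d: Decimal): Int     = roundTo(d, Int.MinValue, Int.MaxValue)(_.toInt)
def roundToShort(d: Decimal): Short = roundTo(d, Short.MinValue, Short.MaxValue)(_.toShort)
def roundToByte(d: Decimal): Byte   = roundTo(d, Byte.MinValue, Byte.MaxValue)(_.toByte)
{code}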



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40284) spark concurrent overwrite mode writes data to files in HDFS format, all request data write success

2022-08-30 Thread Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu updated SPARK-40284:

Description: 
We use Spark as a service. The same Spark service needs to handle multiple 
requests, but I have a problem with this

When multiple requests are overwritten to a directory at the same time, the 
results of two overwrite requests may be written successfully. I think this 
does not meet the definition of overwrite write

First I ran Write SQL1, then I ran Write SQL2, and I found that both data were 
written in the end, which I thought was unreasonable
{code:java}
sparkSession.udf.register("sleep",  (time: Long) => Thread.sleep(time))



-- write sql1
sparkSession.sql("select 1 as id, sleep(4) as 
time").write.mode(SaveMode.Overwrite).parquet("path")

-- write sql2
 sparkSession.sql("select 2 as id, 1 as 
time").write.mode(SaveMode.Overwrite).parquet("path") {code}
When the spark source, and I saw that all these logic in 
InsertIntoHadoopFsRelationCommand this class.

 

When the target directory already exists, Spark directly deletes the target 
directory and writes to the _temporary directory that it requests. However, 
when multiple requests are written, the data will all append in; For example, 
in Write SQL above, this procedure occurs

1. Run write SQL 1, SQL 1 to create the _TEMPORARY directory, and continue

2. Run write SQL 2 to delete the entire target directory and create its own 
_TEMPORARY

3. Sql2 writes data

4. SQL 1 completion. The corresponding _temporary /0/attemp_id directory does 
not exist and fails. However, the task is retried, but the _temporary  
directory is not deleted when the task is retried. Therefore, the execution 
result of SQL1 is sent to the target directory by append

 

Based on the above process, the write process, can you do a directory check 
before the write task or some other way to avoid this kind of problem

 

 

 

 

  was:
We use Spark as a service. The same Spark service needs to handle multiple 
requests, but I have a problem with this

When multiple requests are overwritten to a directory at the same time, the 
results of two overwrite requests may be written successfully. I think this 
does not meet the definition of overwrite write

First I ran Write SQL1, then I ran Write SQL2, and I found that both data were 
written in the end, which I thought was unreasonable
{code:java}
sparkSession.udf.register("sleep",  (time: Long) => Thread.sleep(time))



-- write sql1
sparkSession.sql("select 1 as id, sleep(4) as 
time").write.mode(SaveMode.Overwrite).parquet("path")

-- write sql2
 sparkSession.sql("select 2 as id, 1 as 
time").write.mode(SaveMode.Overwrite).parquet("path") {code}
When the spark source, and I saw that all these logic in 
InsertIntoHadoopFsRelationCommand this class.

 

When the target directory already exists, Spark directly deletes the target 
directory and writes to the _temporary directory that it requests. However, 
when multiple requests are written, the data will all append in; For example, 
in Write SQL above, this procedure occurs

1. Run write SQL 1, SQL 1 to create the _TEMPORARY directory, and continue

2. Run write SQL 2 to delete the entire target directory and create its own 
_TEMPORARY

3. Sql2 writes data

4. SQL 1 completion. The corresponding _Temporary /0/attemp_id directory does 
not exist and fails. However, the task is retried, but the directory is not 
deleted when the task is retried. Therefore, the execution result of SQL1 is 
sent to the target directory by append

 

Based on the above process, the write process, can you do a directory check 
before the write task or some other way to avoid this kind of problem

 

 

 

 


> spark  concurrent overwrite mode writes data to files in HDFS format, all 
> request data write success
> 
>
> Key: SPARK-40284
> URL: https://issues.apache.org/jira/browse/SPARK-40284
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.1
>Reporter: Liu
>Priority: Major
>
> We use Spark as a service. The same Spark service needs to handle multiple 
> requests, but I have a problem with this
> When multiple requests are overwritten to a directory at the same time, the 
> results of two overwrite requests may be written successfully. I think this 
> does not meet the definition of overwrite write
> First I ran Write SQL1, then I ran Write SQL2, and I found that both data 
> were written in the end, which I thought was unreasonable
> {code:java}
> sparkSession.udf.register("sleep",  (time: Long) => Thread.sleep(time))
> -- write sql1
> sparkSession.sql("select 1 as id, sleep(4) as 
> time").write.mode(SaveMode.Overwrite).parquet("path")
> -- write sql2
>  sparkSess

[jira] [Created] (SPARK-40284) spark concurrent overwrite mode writes data to files in HDFS format, all request data write success

2022-08-30 Thread Liu (Jira)
Liu created SPARK-40284:
---

 Summary: spark  concurrent overwrite mode writes data to files in 
HDFS format, all request data write success
 Key: SPARK-40284
 URL: https://issues.apache.org/jira/browse/SPARK-40284
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 3.0.1
Reporter: Liu


We use Spark as a service. The same Spark service needs to handle multiple 
requests, but I have a problem with this

When multiple requests are overwritten to a directory at the same time, the 
results of two overwrite requests may be written successfully. I think this 
does not meet the definition of overwrite write

First I ran Write SQL1, then I ran Write SQL2, and I found that both data were 
written in the end, which I thought was unreasonable
{code:java}
sparkSession.udf.register("sleep",  (time: Long) => Thread.sleep(time))



-- write sql1
sparkSession.sql("select 1 as id, sleep(4) as 
time").write.mode(SaveMode.Overwrite).parquet("path")

-- write sql2
 sparkSession.sql("select 2 as id, 1 as 
time").write.mode(SaveMode.Overwrite).parquet("path") {code}
When the spark source, and I saw that all these logic in 
InsertIntoHadoopFsRelationCommand this class.

 

When the target directory already exists, Spark directly deletes the target 
directory and writes to the _temporary directory that it requests. However, 
when multiple requests are written, the data will all append in; For example, 
in Write SQL above, this procedure occurs

1. Run write SQL 1, SQL 1 to create the _TEMPORARY directory, and continue

2. Run write SQL 2 to delete the entire target directory and create its own 
_TEMPORARY

3. Sql2 writes data

4. SQL 1 completion. The corresponding _Temporary /0/attemp_id directory does 
not exist and fails. However, the task is retried, but the directory is not 
deleted when the task is retried. Therefore, the execution result of SQL1 is 
sent to the target directory by append

 

Based on the above process, the write process, can you do a directory check 
before the write task or some other way to avoid this kind of problem

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33598) Support Java Class with circular references

2022-08-30 Thread Santokh Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598165#comment-17598165
 ] 

Santokh Singh edited comment on SPARK-33598 at 8/31/22 4:38 AM:


*Facing same exception, Spark Version 3.2.2*

*Using avro mvn plugin to generate java class from below avro schema,*

 

{color:#FF0000}*Exception in thread "main" 
java.lang.UnsupportedOperationException: Cannot have circular references in 
bean class, but got the circular reference of class class 
org.apache.avro.Schema*{color}

*AVRO SCHEMA* 

[
  {
    "type": "record",
    "namespace": "kafka.avro.schema.nested",
    "name": "Address",
    "fields": [
      { "name": "streetaddress", "type": "string" },
      { "name": "city", "type": "string" }
    ]
  },
  {
    "type": "record",
    "name": "person",
    "namespace": "kafka.avro.schema.nested",
    "fields": [
      { "name": "firstname", "type": "string" },
      { "name": "lastname", "type": "string" },
      { "name": "address", "type": ["null", "Address"] }
    ]
  }
]

*---CODE *

import kafka.avro.schema.nested.person;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.streaming.Trigger;
import za.co.absa.abris.avro.functions.*;
import za.co.absa.abris.config.AbrisConfig;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeoutException;

public class KafkaAvroStreamingAbris {

public static void main(String[] args) throws IOException, 
StreamingQueryException, TimeoutException {

SparkSession spark = SparkSession.builder()
.appName("AvroApp")
.master("local")
.getOrCreate();

Dataset df = spark.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", "person")
.option("startingOffsets", "earliest")
.load();

Dataset df2 = df
.select(za.co.absa.abris.avro.functions.from_avro(
org.apache.spark.sql.functions.col("value"),
za.co.absa.abris.config.AbrisConfig
.fromConfluentAvro().downloadReaderSchemaByLatestVersion()
.andTopicNameStrategy("person",false)
.usingSchemaRegistry("http://localhost:8089";)).as("data"));

Dataset df3 = df2.map((MapFunction) row -> {
    String rr = row.toString();
    return null;
}, Encoders.bean(person.class));

StreamingQuery streamingQuery = df2
.writeStream()
.queryName("Kafka-Write")
.format("console")
.outputMode(OutputMode.Append())
.trigger(Trigger.ProcessingTime(Long.parseLong("2000")))
.start();

streamingQuery.awaitTermination();

}
}


was (Author: santokhsdg):
*Facing same exception, Spark Version 3.2.2*

*Using avro mvn plugin to generate java class from below avro schema,*

 

{color:#ff}*Exception in thread "main" 
java.lang.UnsupportedOperationException: Cannot have circular references in 
bean class, but got the circular reference of class class 
org.apache.avro.Schema*{color}{*}{{*}}

*AVRO SCHEMA* 

[
{
"type": "record",
"namespace":"kafka.avro.schema.nested",
"name": "Address",
"fields": [

{ "name": "streetaddress", "type": "string"}

,

{"name": "city", "type": "string" }

]
},
{
"type": "record",
"name": "person",
"namespace":"kafka.avro.schema.nested",
"fields": [

{ "name": "firstname", "type": "string"}

,

{ "name": "lastname", "type": "string" }

,

{ "name": "address", "type": ["null","Address"] }

]
}
]

*---CODE *

import kafka.avro.schema.nested.person;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.streaming.Trigger;
import za.co.absa.abris.avro.functions.*;
import za.co.absa.abris.config.AbrisConfig;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeoutException;

public class KafkaAvroStreamingAbris {

public static void main(String[] args) throws IOException, 
StreamingQueryException, TimeoutException {

SparkSession spark = SparkSession.builder()
.appName("AvroApp")
.master("local")
.getOrCreate();

Dataset df = spark.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", "person")
.option("startingOffsets", "earliest")
.load();

Dataset df2 = df
.select(za.co.absa.abris.avro.functions.from_avro(
org.apache.spark.sql.functions.col("value"),
za.co.absa.abris.config.AbrisConfig
.fromConfluentAvro().downloadReaderS

[jira] [Commented] (SPARK-33598) Support Java Class with circular references

2022-08-30 Thread Santokh Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598165#comment-17598165
 ] 

Santokh Singh commented on SPARK-33598:
---

*Facing same exception, Spark Version 3.2.2*

*Using avro mvn plugin to generate java class from below avro schema,*

 

{color:#FF0000}*Exception in thread "main" 
java.lang.UnsupportedOperationException: Cannot have circular references in 
bean class, but got the circular reference of class class 
org.apache.avro.Schema*{color}

*- AVRO SCHEMA -*

[
{
"type": "record",
"namespace":"kafka.avro.schema.nested",
"name": "Address",
"fields": [{
"name": "streetaddress",
"type": "string"},
{"name": "city",
"type": "string"
}]
},
{
"type": "record",
"name": "person",
"namespace":"kafka.avro.schema.nested",
"fields": [{
"name": "firstname",
"type": "string"},
{
"name": "lastname",
"type": "string"
},{
"name": "address",
"type": ["null","Address"] }]
}
]

*---CODE *

import kafka.avro.schema.nested.person;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.streaming.Trigger;
import za.co.absa.abris.avro.functions.*;
import za.co.absa.abris.config.AbrisConfig;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeoutException;


public class KafkaAvroStreamingAbris {

public static void main(String[] args) throws IOException, 
StreamingQueryException, TimeoutException {

SparkSession spark = SparkSession.builder()
.appName("AvroApp")
.master("local")
.getOrCreate();

Dataset df = spark.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", "person")
.option("startingOffsets", "earliest")
.load();

Dataset df2 = df
.select(za.co.absa.abris.avro.functions.from_avro(
org.apache.spark.sql.functions.col("value"),
za.co.absa.abris.config.AbrisConfig
.fromConfluentAvro().downloadReaderSchemaByLatestVersion()
.andTopicNameStrategy("person",false)
.usingSchemaRegistry("http://localhost:8089";)).as("data"));

Dataset df3 = df2.map((MapFunction) row->{
String rr = row.toString();
return null;
}, Encoders.bean(PersonBean.class));

StreamingQuery streamingQuery = df2
.writeStream()
.queryName("Kafka-Write")
.format("console")
.outputMode(OutputMode.Append())
.trigger(Trigger.ProcessingTime(Long.parseLong("2000")))
.start();

streamingQuery.awaitTermination();

}
}

> Support Java Class with circular references
> ---
>
> Key: SPARK-33598
> URL: https://issues.apache.org/jira/browse/SPARK-33598
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 3.1.2
>Reporter: jacklzg
>Priority: Minor
>
> If the target Java data class has a circular reference, Spark will fail fast 
> when creating the Dataset or running Encoders.
>  
> For example, a protobuf class carries a reference to a Descriptor, so there 
> is no way to build a dataset from the protobuf class.
> From this line
> {color:#7a869a}Encoders.bean(ProtoBuffOuterClass.ProtoBuff.class);{color}
>  
> It will throw out immediately
>  
> {quote}Exception in thread "main" java.lang.UnsupportedOperationException: 
> Cannot have circular references in bean class, but got the circular reference 
> of class class com.google.protobuf.Descriptors$Descriptor
> {quote}
>  
> Can we add  a parameter, for example, 
>  
> {code:java}
> Encoders.bean(Class clas, List fieldsToIgnore);{code}
> 
> or
>  
> {code:java}
> Encoders.bean(Class clas, boolean skipCircularRefField);{code}
>  
>  which, instead of throwing an exception at 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L556],
>  would skip the field.
>  
> {code:java}
> if (seenTypeSet.contains(t)) {
>   if (skipCircularRefField)
>     println("field skipped") // just skip this field
>   else
>     throw new UnsupportedOperationException(
>       s"cannot have circular references in class, but got the circular reference of class $t")
> }
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33598) Support Java Class with circular references

2022-08-30 Thread Santokh Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598165#comment-17598165
 ] 

Santokh Singh edited comment on SPARK-33598 at 8/31/22 4:27 AM:


*Facing same exception, Spark Version 3.2.2*

*Using avro mvn plugin to generate java class from below avro schema,*

 

{color:#ff}*Exception in thread "main" 
java.lang.UnsupportedOperationException: Cannot have circular references in 
bean class, but got the circular reference of class class 
org.apache.avro.Schema*{color}{*}{{*}}

*AVRO SCHEMA* 

[
{
"type": "record",
"namespace":"kafka.avro.schema.nested",
"name": "Address",
"fields": [

{ "name": "streetaddress", "type": "string"}

,

{"name": "city", "type": "string" }

]
},
{
"type": "record",
"name": "person",
"namespace":"kafka.avro.schema.nested",
"fields": [

{ "name": "firstname", "type": "string"}

,

{ "name": "lastname", "type": "string" }

,

{ "name": "address", "type": ["null","Address"] }

]
}
]

*---CODE *

import kafka.avro.schema.nested.person;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.streaming.Trigger;
import za.co.absa.abris.avro.functions.*;
import za.co.absa.abris.config.AbrisConfig;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeoutException;

public class KafkaAvroStreamingAbris {

public static void main(String[] args) throws IOException, 
StreamingQueryException, TimeoutException {

SparkSession spark = SparkSession.builder()
.appName("AvroApp")
.master("local")
.getOrCreate();

Dataset df = spark.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", "person")
.option("startingOffsets", "earliest")
.load();

Dataset df2 = df
.select(za.co.absa.abris.avro.functions.from_avro(
org.apache.spark.sql.functions.col("value"),
za.co.absa.abris.config.AbrisConfig
.fromConfluentAvro().downloadReaderSchemaByLatestVersion()
.andTopicNameStrategy("person",false)
.usingSchemaRegistry("http://localhost:8089";)).as("data"));

Dataset df3 = df2.map((MapFunction) row->

{ String rr = row.toString(); return null; }

, Encoders.bean(PersonBean.class));

StreamingQuery streamingQuery = df2
.writeStream()
.queryName("Kafka-Write")
.format("console")
.outputMode(OutputMode.Append())
.trigger(Trigger.ProcessingTime(Long.parseLong("2000")))
.start();

streamingQuery.awaitTermination();

}
}


was (Author: santokhsdg):
*Facing same exception, Spark Version 3.2.2*

*Using avro mvn plugin to generate java class from below avro schema,*

 

{color:#FF}*Exception in thread "main" 
java.lang.UnsupportedOperationException: Cannot have circular references in 
bean class, but got the circular reference of class class 
org.apache.avro.Schema*{color}{*}{*}

*- AVRO SCHEMA -*

[
{
"type": "record",
"namespace":"kafka.avro.schema.nested",
"name": "Address",
"fields": [{
"name": "streetaddress",
"type": "string"},
{"name": "city",
"type": "string"
}]
},
{
"type": "record",
"name": "person",
"namespace":"kafka.avro.schema.nested",
"fields": [{
"name": "firstname",
"type": "string"},
{
"name": "lastname",
"type": "string"
},{
"name": "address",
"type": ["null","Address"] }]
}
]

*---CODE *

import kafka.avro.schema.nested.person;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.streaming.Trigger;
import za.co.absa.abris.avro.functions.*;
import za.co.absa.abris.config.AbrisConfig;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeoutException;


public class KafkaAvroStreamingAbris {

public static void main(String[] args) throws IOException, 
StreamingQueryException, TimeoutException {

SparkSession spark = SparkSession.builder()
.appName("AvroApp")
.master("local")
.getOrCreate();

Dataset df = spark.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", "person")
.option("startingOffsets", "earliest")
.load();

Dataset df2 = df
.select(za.co.absa.abris.avro.functions.from_avro(
org.apache.spark.sql.functions.col("value"),
za.co.absa.abris.config.AbrisConfig
.fromConfluentAvro().downloadRea

[jira] [Updated] (SPARK-39971) ANALYZE TABLE makes some queries run forever

2022-08-30 Thread Felipe (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felipe updated SPARK-39971:
---
Attachment: explainMode-cost.zip

> ANALYZE TABLE makes some queries run forever
> 
>
> Key: SPARK-39971
> URL: https://issues.apache.org/jira/browse/SPARK-39971
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.2.2
>Reporter: Felipe
>Priority: Major
> Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, 
> 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-enabled.txt, 
> 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, 
> 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, 
> explainMode-cost.zip
>
>
> I'm using TPCDS to run benchmarks, and after running ANALYZE TABLE (without 
> the FOR ALL COLUMNS) some queries became really slow. For example query24 - 
> [https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql] takes 
> between 10~15min before running the ANALYZE TABLE.
> After running ANALYZE TABLE I waited 24h before cancelling the execution.
> If I disable spark.sql.cbo.joinReorder.enabled or 
> spark.sql.cbo.enabled it becomes fast again.
> It seems something in join reordering is not working well when we have table 
> stats, but not column stats.
> Rows Count:
> store_sales - 2879966589
> store_returns - 288009578
> store - 1002
> item - 30
> customer - 1200
> customer_address - 600
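For reference, the mitigation described in the report, expressed as 
session-level settings (a sketch only; the report says either flag on its own 
restores the previous performance):
{code:scala}
// Disable CBO join reordering for the session...
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "false")
// ...or disable cost-based optimization entirely.
spark.conf.set("spark.sql.cbo.enabled", "false")
{code}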



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39971) ANALYZE TABLE makes some queries run forever

2022-08-30 Thread Felipe (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felipe updated SPARK-39971:
---
Attachment: (was: explainMode-cost.zip)

> ANALYZE TABLE makes some queries run forever
> 
>
> Key: SPARK-39971
> URL: https://issues.apache.org/jira/browse/SPARK-39971
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.2.2
>Reporter: Felipe
>Priority: Major
> Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, 
> 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-enabled.txt, 
> 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, 
> 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, 
> explainMode-cost.zip
>
>
> I'm using TPCDS to run benchmarks, and after running ANALYZE TABLE (without 
> the FOR ALL COLUMNS) some queries became really slow. For example query24 - 
> [https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql] takes 
> between 10~15min before running the ANALYZE TABLE.
> After running ANALYZE TABLE I waited 24h before cancelling the execution.
> If I disable spark.sql.cbo.joinReorder.enabled or 
> spark.sql.cbo.enabled it becomes fast again.
> It seems something in join reordering is not working well when we have table 
> stats, but not column stats.
> Rows Count:
> store_sales - 2879966589
> store_returns - 288009578
> store - 1002
> item - 30
> customer - 1200
> customer_address - 600



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-39971) ANALYZE TABLE makes some queries run forever

2022-08-30 Thread Felipe (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felipe reopened SPARK-39971:


Sorry. I found that my change was causing the issue for some of the TPC-DS 
queries, but not all. For query 24 specifically, the behavior is exactly the 
same without my changes, and the query plan is also the same.

> ANALYZE TABLE makes some queries run forever
> 
>
> Key: SPARK-39971
> URL: https://issues.apache.org/jira/browse/SPARK-39971
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.2.2
>Reporter: Felipe
>Priority: Major
> Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, 
> 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-enabled.txt, 
> 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, 
> 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, 
> explainMode-cost.zip
>
>
> I'm using TPCDS to run benchmarks, and after running ANALYZE TABLE (without 
> the FOR ALL COLUMNS) some queries became really slow. For example query24 - 
> [https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql] takes 
> between 10~15min before running the ANALYZE TABLE.
> After running ANALYZE TABLE I waited 24h before cancelling the execution.
> If I disable spark.sql.cbo.joinReorder.enabled or 
> spark.sql.cbo.enabled it becomes fast again.
> It seems something in join reordering is not working well when we have table 
> stats, but not column stats.
> Rows Count:
> store_sales - 2879966589
> store_returns - 288009578
> store - 1002
> item - 30
> customer - 1200
> customer_address - 600



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40271) Support list type for pyspark.sql.functions.lit

2022-08-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-40271.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37722
[https://github.com/apache/spark/pull/37722]

> Support list type for pyspark.sql.functions.lit
> ---
>
> Key: SPARK-40271
> URL: https://issues.apache.org/jira/browse/SPARK-40271
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, `pyspark.sql.functions.lit` doesn't support the Python list type, 
> as shown below:
> {code:python}
> >>> df = spark.range(3).withColumn("c", lit([1,2,3]))
> Traceback (most recent call last):
> ...
> : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] 
> The feature is not supported: Literal for '[1, 2, 3]' of class 
> java.util.ArrayList.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302)
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100)
>   at org.apache.spark.sql.functions$.lit(functions.scala:125)
>   at org.apache.spark.sql.functions.lit(functions.scala)
>   at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:577)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at 
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>   at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> We should make it supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40271) Support list type for pyspark.sql.functions.lit

2022-08-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40271:
-

Assignee: Haejoon Lee

> Support list type for pyspark.sql.functions.lit
> ---
>
> Key: SPARK-40271
> URL: https://issues.apache.org/jira/browse/SPARK-40271
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Currently, `pyspark.sql.functions.lit` doesn't support the Python list type, 
> as shown below:
> {code:python}
> >>> df = spark.range(3).withColumn("c", lit([1,2,3]))
> Traceback (most recent call last):
> ...
> : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] 
> The feature is not supported: Literal for '[1, 2, 3]' of class 
> java.util.ArrayList.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302)
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100)
>   at org.apache.spark.sql.functions$.lit(functions.scala:125)
>   at org.apache.spark.sql.functions.lit(functions.scala)
>   at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:577)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at 
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>   at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> We should make it supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer

2022-08-30 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598138#comment-17598138
 ] 

Hyukjin Kwon commented on SPARK-40274:
--

The error likely comes from other libraries, judging from the error message:

{code}
at 
com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
at 
com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
at 
com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
at 
com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45)
at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59)
at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.s
{code}

> ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
> --
>
> Key: SPARK-40274
> URL: https://issues.apache.org/jira/browse/SPARK-40274
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2
> Environment: spark 3.1.2 scala 2.12.10 jdk 11 linux
>Reporter: 张刘强
>Priority: Major
> Attachments: code.scala, error.txt, pom.txt
>
>
> spark 3.1.2 scala 2.12.10 jdk 1.8 linux
>  
> when using dataframe.count, it will throw this exception:
>  
> stacktrace like this:
>  
> java.lang.ArrayIndexOutOfBoundsException: Index 28499 out of bounds for 
> length 206
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
>         at 
> com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:59)
>         at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292)
>         at scala.collection.Iterator.foreach(Iterator.scala:943)
>         at scala.collection.Iterator.foreach$(Iterator.scala:943)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>         at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>         at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>         at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292)
>         at 
> scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289)
>         at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:181)
>         at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
>         at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>         at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>         at scala.collection.TraversableLike.map(TraversableLike.scala:285)
>         at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
>         at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14(BeanIntrospector.scala:175)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14$adapted(BeanIntrospector.scala:174)
>         at scala.collection.immutable.List.flatMap(List.scala:366)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:174)
>         at 
> com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:20)
>         at 
> com.fa

[jira] [Resolved] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer

2022-08-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40274.
--
Resolution: Invalid

> ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
> --
>
> Key: SPARK-40274
> URL: https://issues.apache.org/jira/browse/SPARK-40274
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2
> Environment: spark 3.1.2 scala 2.12.10 jdk 11 linux
>Reporter: 张刘强
>Priority: Major
> Attachments: code.scala, error.txt, pom.txt
>
>
> spark 3.1.2 scala 2.12.10 jdk 1.8 linux
>  
> when using dataframe.count, it will throw this exception:
>  
> stacktrace like this:
>  
> java.lang.ArrayIndexOutOfBoundsException: Index 28499 out of bounds for 
> length 206
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
>         at 
> com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:59)
>         at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292)
>         at scala.collection.Iterator.foreach(Iterator.scala:943)
>         at scala.collection.Iterator.foreach$(Iterator.scala:943)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>         at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>         at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>         at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292)
>         at 
> scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289)
>         at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:181)
>         at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
>         at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>         at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>         at scala.collection.TraversableLike.map(TraversableLike.scala:285)
>         at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
>         at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14(BeanIntrospector.scala:175)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14$adapted(BeanIntrospector.scala:174)
>         at scala.collection.immutable.List.flatMap(List.scala:366)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:174)
>         at 
> com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:20)
>         at 
> com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.fieldName(ScalaAnnotationIntrospectorModule.scala:28)
>         at 
> com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.findImplicitPropertyName(ScalaAnnotationIntrospectorModule.scala:80)
>         at 
> com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findImplicitPropertyName(AnnotationIntrospectorPair.java:490)
>         at 
> com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector._addFields(POJOPropertiesCollector.java:380)
>         at 
> com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:308)
>         at 
> com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.getJsonValueAccessor(POJOPropertiesCollector.java:196)
>         at 
> com.fasterxml.jackson.databind.introspect.BasicBeanDescription.findJsonValueAccessor(BasicBeanD

[jira] [Commented] (SPARK-40282) DataType argument in StructType.add is incorrectly throwing scala.MatchError

2022-08-30 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598132#comment-17598132
 ] 

Hyukjin Kwon commented on SPARK-40282:
--

We don't have this problem in the languages supported by the official Apache 
Spark. Is it a problem from Kotlin?

> DataType argument in StructType.add is incorrectly throwing scala.MatchError
> 
>
> Key: SPARK-40282
> URL: https://issues.apache.org/jira/browse/SPARK-40282
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: M. Manna
>Priority: Major
> Attachments: SparkApplication.kt, retailstore.csv
>
>
> *Problem Description*
> As part of the contract mentioned here, Spark should be able to support 
> {{IntegerType}} as an argument to the StructType.add method. However, it 
> complains with {{scala.MatchError}} today.
>  
> If we call the overloaded version which accepts a String value as the type, 
> e.g. "Integer", it works.
> *How to Reproduce*
>  # Create a Kotlin Project - I have used Kotlin but Java will also work 
> (needs minor adjustment)
>  # Place the attached CSV file in {{src/main/resources}}   
>  # Compile the project with Java 11
>  # Run - it will give you an error.
> {code:java}
> Exception in thread "main" scala.MatchError: 
> org.apache.spark.sql.types.IntegerType@363fe35a (of class 
> org.apache.spark.sql.types.IntegerType)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:240)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:236)
>     at 
> org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.(objects.scala:1890)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:197)
>     at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>     at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
>     at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
>     at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:192)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:73)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:81)
>     at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:92)
>     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
>     at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
>     at 
> org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:444)
>     at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185) 
> {code}
>  # Now change the line (commented as HERE) to use a String value, i.e. "Integer"
>  # It works
> *Ask*
>  # Why does it not accept IntegerType, StringType as DataType as part of the 
> parameters supplied through {{add}} function in {{StructType}} ?
>  # If this is a bug, do we know when the fix can come?
>    
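
For comparison, here is a minimal Scala sketch of the same schema definition and CSV read. The column names and file path are assumptions (the data is only available as the attached retailstore.csv), and in Scala both the DataType and the String overloads of StructType.add run without the MatchError quoted above, consistent with the comment that the officially supported languages do not hit it.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object StructTypeAddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structtype-add-sketch")
      .master("local[*]")
      .getOrCreate()

    // Schema built with the DataType overload of StructType.add, as in the report.
    // Column names are assumptions; the data is only available as the attached CSV.
    val schema = new StructType()
      .add("CustomerID", IntegerType)
      .add("Age", IntegerType)
      .add("Gender", StringType)
      // The String overload mentioned above as the workaround:
      .add("Country", "string")

    // "retailstore.csv" is the attachment name from the issue; the path is an assumption.
    val df = spark.read
      .option("header", "true")
      .schema(schema)
      .csv("src/main/resources/retailstore.csv")

    df.printSchema()
    df.show(5)
    spark.stop()
  }
}
{code}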



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40282) DataType argument in StructType.add is incorrectly throwing scala.MatchError

2022-08-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40282:
-
Component/s: SQL
 (was: Spark Core)

> DataType argument in StructType.add is incorrectly throwing scala.MatchError
> 
>
> Key: SPARK-40282
> URL: https://issues.apache.org/jira/browse/SPARK-40282
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: M. Manna
>Priority: Major
> Attachments: SparkApplication.kt, retailstore.csv
>
>
> *Problem Description*
> As part of the contract mentioned here, Spark should be able to support 
> {{IntegerType}} as an argument to the StructType.add method. However, it 
> complains with {{scala.MatchError}} today.
>  
> If we call the overloaded version which accepts a String value as the type, 
> e.g. "Integer", it works.
> *How to Reproduce*
>  # Create a Kotlin Project - I have used Kotlin but Java will also work 
> (needs minor adjustment)
>  # Place the attached CSV file in {{src/main/resources}}   
>  # Compile the project with Java 11
>  # Run - it will give you an error.
> {code:java}
> Exception in thread "main" scala.MatchError: 
> org.apache.spark.sql.types.IntegerType@363fe35a (of class 
> org.apache.spark.sql.types.IntegerType)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:240)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:236)
>     at 
> org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.(objects.scala:1890)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:197)
>     at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>     at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
>     at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
>     at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:192)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:73)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:81)
>     at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:92)
>     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
>     at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
>     at 
> org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:444)
>     at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185) 
> {code}
>  # Now change the line (commented as HERE) to use a String value, i.e. "Integer"
>  # It works
> *Ask*
>  # Why does it not accept IntegerType, StringType as DataType as part of the 
> parameters supplied through {{add}} function in {{StructType}} ?
>  # If this is a bug, do we know when the fix can come?
>    



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40282) DataType argument in StructType.add is incorrectly throwing scala.MatchError

2022-08-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40282:
-
Priority: Major  (was: Blocker)

> DataType argument in StructType.add is incorrectly throwing scala.MatchError
> 
>
> Key: SPARK-40282
> URL: https://issues.apache.org/jira/browse/SPARK-40282
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: M. Manna
>Priority: Major
> Attachments: SparkApplication.kt, retailstore.csv
>
>
> *Problem Description*
> As part of the contract mentioned here, Spark should be able to support 
> {{IntegerType}} as an argument to the StructType.add method. However, it 
> complains with {{scala.MatchError}} today.
>  
> If we call the overloaded version which accepts a String value as the type, 
> e.g. "Integer", it works.
> *How to Reproduce*
>  # Create a Kotlin Project - I have used Kotlin but Java will also work 
> (needs minor adjustment)
>  # Place the attached CSV file in {{src/main/resources}}   
>  # Compile the project with Java 11
>  # Run - it will give you an error.
> {code:java}
> Exception in thread "main" scala.MatchError: 
> org.apache.spark.sql.types.IntegerType@363fe35a (of class 
> org.apache.spark.sql.types.IntegerType)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:240)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:236)
>     at 
> org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.(objects.scala:1890)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:197)
>     at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>     at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
>     at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
>     at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:192)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:73)
>     at 
> org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:81)
>     at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:92)
>     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
>     at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
>     at 
> org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:444)
>     at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185) 
> {code}
>  # Now change the line (commented as HERE) to use a String value, i.e. "Integer"
>  # It works
> *Ask*
>  # Why does it not accept IntegerType, StringType as DataType as part of the 
> parameters supplied through {{add}} function in {{StructType}} ?
>  # If this is a bug, do we know when the fix can come?
>    



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40283) Update mima's previousSparkVersion to 3.3.0

2022-08-30 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-40283:
---

 Summary: Update mima's previousSparkVersion to 3.3.0
 Key: SPARK-40283
 URL: https://issues.apache.org/jira/browse/SPARK-40283
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39971) ANALYZE TABLE makes some queries run forever

2022-08-30 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-39971.
-
Resolution: Not A Problem

> ANALYZE TABLE makes some queries run forever
> 
>
> Key: SPARK-39971
> URL: https://issues.apache.org/jira/browse/SPARK-39971
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.2.2
>Reporter: Felipe
>Priority: Major
> Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, 
> 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-enabled.txt, 
> 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, 
> 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, 
> explainMode-cost.zip
>
>
> I'm using TPCDS to run benchmarks, and after running ANALYZE TABLE (without 
> the FOR ALL COLUMNS) some queries became really slow. For example, query24 - 
> [https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql] - takes 
> 10-15 min before running the ANALYZE TABLE.
> After running ANALYZE TABLE, I waited 24h before cancelling the execution.
> If I disable spark.sql.cbo.joinReorder.enabled or 
> spark.sql.cbo.enabled it becomes fast again.
> It seems something in join reordering is not working well when we have table 
> stats, but not column stats.
> Rows Count:
> store_sales - 2879966589
> store_returns - 288009578
> store - 1002
> item - 30
> customer - 1200
> customer_address - 600
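
For readers reproducing the two states described above (table statistics only vs. column statistics) and the configuration workaround, a minimal Scala sketch, assuming the TPC-DS tables already exist in the current catalog:

{code:scala}
import org.apache.spark.sql.SparkSession

object CboStatsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cbo-stats-sketch").getOrCreate()

    // Table-level statistics only -- the state that made query24 slow above:
    spark.sql("ANALYZE TABLE store_sales COMPUTE STATISTICS")

    // Column-level statistics as well, so the CBO also gets distinct counts, min/max, null counts:
    spark.sql("ANALYZE TABLE store_sales COMPUTE STATISTICS FOR ALL COLUMNS")

    // The workaround from the description: disable CBO join reordering, or CBO entirely.
    spark.conf.set("spark.sql.cbo.joinReorder.enabled", "false")
    spark.conf.set("spark.sql.cbo.enabled", "false")

    spark.stop()
  }
}
{code}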



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39971) ANALYZE TABLE makes some queries run forever

2022-08-30 Thread Felipe (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598121#comment-17598121
 ] 

Felipe commented on SPARK-39971:


I found the issue was caused by a customization in my code. We are setting the 
distinctCount stats to the rowCount when we only have the rowCount.

> ANALYZE TABLE makes some queries run forever
> 
>
> Key: SPARK-39971
> URL: https://issues.apache.org/jira/browse/SPARK-39971
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.2.2
>Reporter: Felipe
>Priority: Major
> Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, 
> 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-enabled.txt, 
> 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, 
> 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, 
> explainMode-cost.zip
>
>
> I'm using TPCDS to run benchmarks, and after running ANALYZE TABLE (without 
> the FOR ALL COLUMNS) some queries became really slow. For example, query24 - 
> [https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql] - takes 
> 10-15 min before running the ANALYZE TABLE.
> After running ANALYZE TABLE, I waited 24h before cancelling the execution.
> If I disable spark.sql.cbo.joinReorder.enabled or 
> spark.sql.cbo.enabled it becomes fast again.
> It seems something in join reordering is not working well when we have table 
> stats, but not column stats.
> Rows Count:
> store_sales - 2879966589
> store_returns - 288009578
> store - 1002
> item - 30
> customer - 1200
> customer_address - 600



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31001) Add ability to create a partitioned table via catalog.createTable()

2022-08-30 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598115#comment-17598115
 ] 

Nicholas Chammas commented on SPARK-31001:
--

What's {{{}__partition_columns{}}}? Is that something specific to Delta, or are 
you saying it's a hidden feature of Spark?

> Add ability to create a partitioned table via catalog.createTable()
> ---
>
> Key: SPARK-31001
> URL: https://issues.apache.org/jira/browse/SPARK-31001
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There doesn't appear to be a way to create a partitioned table using the 
> Catalog interface.
> In SQL, however, you can do this via {{{}CREATE TABLE ... PARTITIONED BY{}}}.
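
A minimal Scala sketch of the contrast, with illustrative table and column names: the SQL statement can declare partitioning, while the existing Catalog.createTable overloads only take a source, a schema, and options.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, LongType, StringType, StructType}

object PartitionedTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitioned-table-sketch").getOrCreate()

    // The SQL route that works today (table and column names are illustrative):
    spark.sql(
      """CREATE TABLE IF NOT EXISTS sales (id BIGINT, amount DOUBLE, dt STRING)
        |USING parquet
        |PARTITIONED BY (dt)""".stripMargin)

    // The Catalog route discussed here accepts a source, a schema and options,
    // but no partitioning specification:
    val schema = new StructType()
      .add("id", LongType)
      .add("amount", DoubleType)
      .add("dt", StringType)
    spark.catalog.createTable("sales_unpartitioned", "parquet", schema, Map.empty[String, String])

    spark.stop()
  }
}
{code}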



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40282) DataType argument in StructType.add is incorrectly throwing scala.MatchError

2022-08-30 Thread M. Manna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

M. Manna updated SPARK-40282:
-
Summary: DataType argument in StructType.add is incorrectly throwing 
scala.MatchError  (was: IntegerType is missed in "ExternalDataTypeForInput" 
function)

> DataType argument in StructType.add is incorrectly throwing scala.MatchError
> 
>
> Key: SPARK-40282
> URL: https://issues.apache.org/jira/browse/SPARK-40282
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: M. Manna
>Priority: Blocker
> Attachments: SparkApplication.kt, retailstore.csv
>
>
> *Problem Description*
> As part of the contract mentioned here, Spark should be able to support 
> {{IntegerType}} as an argument to the StructType.add method. However, it 
> complains with {{scala.MatchError}} today.
>  
> If we call the overloaded version which accepts a String value as the type, 
> e.g. "Integer", it works.
> *How to Reproduce*
>  # Create a Kotlin Project - I have used Kotlin but Java will also work 
> (needs minor adjustment)
>  # Place the attached CSV file in {{src/main/resources}}   
>  # Compile the project with Java 11
>  # Run - it will give you an error.
>  # Now change the line (commented as HERE) to use a String value, i.e. "Integer"
>  # It works
> *Ask*
>  # Why does it not accept IntegerType, StringType as DataType as part of the 
> parameters supplied through {{add}} function in {{StructType}} ?
>  # If this is a bug, do we know when the fix can come?
>    



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40282) DataType argument in StructType.add is incorrectly throwing scala.MatchError

2022-08-30 Thread M. Manna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

M. Manna updated SPARK-40282:
-
Description: 
*Problem Description*

As part of the contract mentioned here, Spark should be able to support 
{{IntegerType}} as an argument to the StructType.add method. However, it 
complains with {{scala.MatchError}} today.

 

If we call the overloaded version which accepts a String value as the type, 
e.g. "Integer", it works.

*How to Reproduce*
 # Create a Kotlin Project - I have used Kotlin but Java will also work (needs 
minor adjustment)
 # Place the attached CSV file in {{src/main/resources}}   
 # Compile the project with Java 11
 # Run - it will give you an error.

{code:java}
Exception in thread "main" scala.MatchError: 
org.apache.spark.sql.types.IntegerType@363fe35a (of class 
org.apache.spark.sql.types.IntegerType)
    at 
org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:240)
    at 
org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:236)
    at 
org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.(objects.scala:1890)
    at 
org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:197)
    at 
scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
    at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
    at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
    at 
org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:192)
    at 
org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:73)
    at 
org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:81)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:92)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
    at 
org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:444)
    at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
    at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185) 
{code}



 # Now change the line (commented as HERE) to use a String value, i.e. "Integer"
 # It works

*Ask*
 # Why does it not accept IntegerType, StringType as DataType as part of the 
parameters supplied through {{add}} function in {{StructType}} ?
 # If this is a bug, do we know when the fix can come?
   

  was:
*Problem Description*

As part of the contract mentioned here, Spark should be able to support 
{{IntegerType}} as an argument to the StructType.add method. However, it 
complains with {{scala.MatchError}} today.

 

If we call the overloaded version which accepts a String value as the type, 
e.g. "Integer", it works.


*How to Reproduce*
 # Create a Kotlin Project - I have used Kotlin but Java will also work (needs 
minor adjustment)
 # Place the attached CSV file in {{src/main/resources}}   
 # Compile the project with Java 11
 # Run - it will give you an error.
 # Now change the line (commented as HERE) to use a String value, i.e. "Integer"
 # It works

*Ask*
 # Why does it not accept IntegerType, StringType as DataType as part of the 
parameters supplied through {{add}} function in {{StructType}} ?
 # If this is a bug, do we know when the fix can come?
   


> DataType argument in StructType.add is incorrectly throwing scala.MatchError
> 
>
> Key: SPARK-40282
> URL: https://issues.apache.org/jira/browse/SPARK-40282
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: M. Manna
>Priority: Blocker
> Attachments: SparkApplication.kt, retailstore.csv
>
>
> *Problem Description*
> As part of the contract mentioned here, Spark should be able to support 
> {{IntegerType}} as an argument to the StructType.add method. However, it 
> complains with {{scala.MatchError}} today.
>  
> If we call the overloaded version which accepts a String value as the type, 
> e.g. "Integer", it works.
> *How to Reproduce*
>  # Create a Kotlin Project - I have used Kotlin but Java will also work 
> (needs minor adjustment)
>  # Place the attached CSV file in {{src/main/r

[jira] [Updated] (SPARK-40282) IntegerType is missed in "ExternalDataTypeForInput" function

2022-08-30 Thread M. Manna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

M. Manna updated SPARK-40282:
-
Attachment: SparkApplication.kt

> IntegerType is missed in "ExternalDataTypeForInput" function
> 
>
> Key: SPARK-40282
> URL: https://issues.apache.org/jira/browse/SPARK-40282
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: M. Manna
>Priority: Blocker
> Attachments: SparkApplication.kt, retailstore.csv
>
>
> *Problem Description*
> As part of the contract mentioned here, Spark should be able to support 
> {{IntegerType}} as an argument to the StructType.add method. However, it 
> complains with {{scala.MatchError}} today.
>  
> If we call the overloaded version which accepts a String value as the type, 
> e.g. "Integer", it works.
> *How to Reproduce*
>  # Create a Kotlin Project - I have used Kotlin but Java will also work 
> (needs minor adjustment)
>  # Place the attached CSV file in {{src/main/resources}}   
>  # Compile the project with Java 11
>  # Run - it will give you an error.
>  # Now change the line (commented as HERE) to use a String value, i.e. "Integer"
>  # It works
> *Ask*
>  # Why does it not accept IntegerType, StringType as DataType as part of the 
> parameters supplied through {{add}} function in {{StructType}} ?
>  # If this is a bug, do we know when the fix can come?
>    



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40282) IntegerType is missed in "ExternalDataTypeForInput" function

2022-08-30 Thread M. Manna (Jira)
M. Manna created SPARK-40282:


 Summary: IntegerType is missed in "ExternalDataTypeForInput" 
function
 Key: SPARK-40282
 URL: https://issues.apache.org/jira/browse/SPARK-40282
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: M. Manna
 Attachments: SparkApplication.kt, retailstore.csv

*Problem Description*

As part of the contract mentioned here, Spark should be able to support 
{{IntegerType}} as an argument to the StructType.add method. However, it 
complains with {{scala.MatchError}} today.

 

If we call the overloaded version which accepts a String value as the type, 
e.g. "Integer", it works.


*How to Reproduce*
 # Create a Kotlin Project - I have used Kotlin but Java will also work (needs 
minor adjustment)
 # Place the attached CSV file in {{src/main/resources}}   
 # Compile the project with Java 11
 # Run - it will give you an error.
 # Now change the line (commented as HERE) to use a String value, i.e. "Integer"
 # It works

*Ask*
 # Why does it not accept IntegerType, StringType as DataType as part of the 
parameters supplied through {{add}} function in {{StructType}} ?
 # If this is a bug, do we know when the fix can come?
   



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40282) IntegerType is missed in "ExternalDataTypeForInput" function

2022-08-30 Thread M. Manna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

M. Manna updated SPARK-40282:
-
Attachment: retailstore.csv

> IntegerType is missed in "ExternalDataTypeForInput" function
> 
>
> Key: SPARK-40282
> URL: https://issues.apache.org/jira/browse/SPARK-40282
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: M. Manna
>Priority: Blocker
> Attachments: SparkApplication.kt, retailstore.csv
>
>
> *Problem Description*
> As part of the contract mentioned here, Spark should be able to support 
> {{IntegerType}} as an argument to the StructType.add method. However, it 
> complains with {{scala.MatchError}} today.
>  
> If we call the overloaded version which accepts a String value as the type, 
> e.g. "Integer", it works.
> *How to Reproduce*
>  # Create a Kotlin Project - I have used Kotlin but Java will also work 
> (needs minor adjustment)
>  # Place the attached CSV file in {{src/main/resources}}   
>  # Compile the project with Java 11
>  # Run - it will give you an error.
>  # Now change the line (commented as HERE) to use a String value, i.e. "Integer"
>  # It works
> *Ask*
>  # Why does it not accept IntegerType, StringType as DataType as part of the 
> parameters supplied through {{add}} function in {{StructType}} ?
>  # If this is a bug, do we know when the fix can come?
>    



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40281) Memory Profiler on Executors

2022-08-30 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng reassigned SPARK-40281:


Assignee: (was: Xinrong Meng)

> Memory Profiler on Executors
> 
>
> Key: SPARK-40281
> URL: https://issues.apache.org/jira/browse/SPARK-40281
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Profiling is critical to performance engineering. Memory consumption is a key 
> indicator of how efficient a PySpark program is. There is an existing effort 
> on memory profiling of Python programs, Memory Profiler 
> ([https://pypi.org/project/memory-profiler/).|https://pypi.org/project/memory-profiler/]
> PySpark applications run as independent sets of processes on a cluster, 
> coordinated by the SparkContext object in the driver program. On the driver 
> side, PySpark is a regular Python process, thus, we can profile it as a 
> normal Python program using Memory Profiler.
> However, on the executor side, we are missing such a memory profiler. Since 
> executors are distributed on different nodes in the cluster, we need to 
> aggregate profiles. Furthermore, Python worker processes are spawned per 
> executor for the Python/Pandas UDF execution, which makes the memory 
> profiling more intricate.
> The umbrella proposes to implement a Memory Profiler on Executors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40281) Memory Profiler on Executors

2022-08-30 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-40281:


 Summary: Memory Profiler on Executors
 Key: SPARK-40281
 URL: https://issues.apache.org/jira/browse/SPARK-40281
 Project: Spark
  Issue Type: Umbrella
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Profiling is critical to performance engineering. Memory consumption is a key 
indicator of how efficient a PySpark program is. There is an existing effort on 
memory profiling of Python programs, Memory Profiler 
([https://pypi.org/project/memory-profiler/).|https://pypi.org/project/memory-profiler/]

PySpark applications run as independent sets of processes on a cluster, 
coordinated by the SparkContext object in the driver program. On the driver 
side, PySpark is a regular Python process, thus, we can profile it as a normal 
Python program using Memory Profiler.

However, on the executor side, we are missing such a memory profiler. Since 
executors are distributed on different nodes in the cluster, we need to 
aggregate profiles. Furthermore, Python worker processes are spawned per 
executor for the Python/Pandas UDF execution, which makes the memory profiling 
more intricate.

The umbrella proposes to implement a Memory Profiler on Executors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40281) Memory Profiler on Executors

2022-08-30 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng reassigned SPARK-40281:


Assignee: Xinrong Meng

> Memory Profiler on Executors
> 
>
> Key: SPARK-40281
> URL: https://issues.apache.org/jira/browse/SPARK-40281
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Profiling is critical to performance engineering. Memory consumption is a key 
> indicator of how efficient a PySpark program is. There is an existing effort 
> on memory profiling of Python programs, Memory Profiler 
> ([https://pypi.org/project/memory-profiler/).|https://pypi.org/project/memory-profiler/]
> PySpark applications run as independent sets of processes on a cluster, 
> coordinated by the SparkContext object in the driver program. On the driver 
> side, PySpark is a regular Python process, thus, we can profile it as a 
> normal Python program using Memory Profiler.
> However, on the executor side, we are missing such a memory profiler. Since 
> executors are distributed on different nodes in the cluster, we need to 
> aggregate profiles. Furthermore, Python worker processes are spawned per 
> executor for the Python/Pandas UDF execution, which makes the memory 
> profiling more intricate.
> The umbrella proposes to implement a Memory Profiler on Executors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long

2022-08-30 Thread Prashant Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Singh closed SPARK-40266.
--

> Corrected  console output in quick-start -  Datatype Integer instead of Long
> 
>
> Key: SPARK-40266
> URL: https://issues.apache.org/jira/browse/SPARK-40266
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2, 3.3.0
> Environment: spark 3.3.0
> Windows 10 (OS Build 19044.1889)
>Reporter: Prashant Singh
>Assignee: Prashant Singh
>Priority: Minor
> Fix For: 3.4.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> h3. What changes were proposed in this pull request?
> Corrected the datatype shown in the console output from Long to Int.
> h3. Why are the changes needed?
> The docs currently show an incorrect datatype.
> h3. Does this PR introduce _any_ user-facing change?
> Yes. It proposes changes in documentation for console output.
> [!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
> h3. How was this patch tested?
> Manually checked the changes by previewing markdown output. I tested output 
> by installing spark 3.3.0 locally and running commands present in quick start 
> docs
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40256) Switch base image from openjdk to eclipse-temurin

2022-08-30 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-40256.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37705
[https://github.com/apache/spark/pull/37705]

> Switch base image from openjdk to eclipse-temurin
> -
>
> Key: SPARK-40256
> URL: https://issues.apache.org/jira/browse/SPARK-40256
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> According to https://github.com/docker-library/openjdk/issues/505 and 
> https://github.com/docker-library/docs/pull/2162, openjdk:8/11 is EOL and 
> Eclipse Temurin replaces this.
> The `openjdk` image is [not updated 
> anymore](https://adoptopenjdk.net/upstream.html) (the last releases were 
> 8u342 and 11.0.16) and will `remove the 11 and 8 tags (in October 2022, 
> perhaps)`. Since we are using it in Spark, we have to switch before that 
> happens. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40256) Switch base image from openjdk to eclipse-temurin

2022-08-30 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-40256:
--

Assignee: Yikun Jiang

> Switch base image from openjdk to eclipse-temurin
> -
>
> Key: SPARK-40256
> URL: https://issues.apache.org/jira/browse/SPARK-40256
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
> According to https://github.com/docker-library/openjdk/issues/505 and 
> https://github.com/docker-library/docs/pull/2162, openjdk:8/11 is EOL and 
> Eclipse Temurin replaces this.
> The `openjdk` image is [not updated 
> anymore](https://adoptopenjdk.net/upstream.html) (the last releases were 
> 8u342 and 11.0.16) and will `remove the 11 and 8 tags (in October 2022, 
> perhaps)`. Since we are using it in Spark, we have to switch before that 
> happens. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40264) Add helper function for DL model inference in pyspark.ml.functions

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598067#comment-17598067
 ] 

Apache Spark commented on SPARK-40264:
--

User 'leewyang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37734

> Add helper function for DL model inference in pyspark.ml.functions
> --
>
> Key: SPARK-40264
> URL: https://issues.apache.org/jira/browse/SPARK-40264
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.2
>Reporter: Lee Yang
>Priority: Minor
>
> Add a helper function to create a pandas_udf for inference on a given DL 
> model, where the user provides a predict function that is responsible for 
> loading the model and inferring on a batch of numpy inputs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40264) Add helper function for DL model inference in pyspark.ml.functions

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40264:


Assignee: Apache Spark

> Add helper function for DL model inference in pyspark.ml.functions
> --
>
> Key: SPARK-40264
> URL: https://issues.apache.org/jira/browse/SPARK-40264
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.2
>Reporter: Lee Yang
>Assignee: Apache Spark
>Priority: Minor
>
> Add a helper function to create a pandas_udf for inference on a given DL 
> model, where the user provides a predict function that is responsible for 
> loading the model and inferring on a batch of numpy inputs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40264) Add helper function for DL model inference in pyspark.ml.functions

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40264:


Assignee: (was: Apache Spark)

> Add helper function for DL model inference in pyspark.ml.functions
> --
>
> Key: SPARK-40264
> URL: https://issues.apache.org/jira/browse/SPARK-40264
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.2
>Reporter: Lee Yang
>Priority: Minor
>
> Add a helper function to create a pandas_udf for inference on a given DL 
> model, where the user provides a predict function that is responsible for 
> loading the model and inferring on a batch of numpy inputs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40280) Failure to create parquet predicate push down for ints and longs on some valid files

2022-08-30 Thread Robert Joseph Evans (Jira)
Robert Joseph Evans created SPARK-40280:
---

 Summary: Failure to create parquet predicate push down for ints 
and longs on some valid files
 Key: SPARK-40280
 URL: https://issues.apache.org/jira/browse/SPARK-40280
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0, 3.2.0, 3.1.0, 3.4.0
Reporter: Robert Joseph Evans


The [parquet 
format|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#signed-integers]
 specification states that...

bq. {{{}INT(8, true){}}}, {{{}INT(16, true){}}}, and {{INT(32, true)}} must 
annotate an {{int32}} primitive type and {{INT(64, true)}} must annotate an 
{{int64}} primitive type. {{INT(32, true)}} and {{INT(64, true)}} are implied 
by the {{int32}} and {{int64}} primitive types if no other annotation is 
present and should be considered optional.

But the code inside of 
[ParquetFilters.scala|https://github.com/apache/spark/blob/296fe49ec855ac8c15c080e7bab6d519fe504bd3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L125-L126]
 requires {{int32}} and {{int64}} to carry no annotation. If 
there is an annotation for those columns and they are a part of a predicate 
push down, the hard coded types will not match and the corresponding filter 
ends up being {{None}}.

This can be a huge performance penalty for a valid parquet file.

I am happy to provide files that show the issue if needed for testing.
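
One quick way to observe the symptom is the PushedFilters entry in the physical plan; a minimal Scala sketch, with an assumed file path and column name:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ParquetPushdownCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-pushdown-check").getOrCreate()

    // Path and column name are assumptions, not taken from the issue.
    val df = spark.read.parquet("/tmp/annotated_int32.parquet")

    // For an affected file (int32/int64 columns carrying an INT(32, true) / INT(64, true)
    // annotation), the PushedFilters entry in the printed physical plan stays empty even
    // though the predicate is eligible for push down.
    df.filter(col("id") > 100).explain()

    spark.stop()
  }
}
{code}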



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40233) Unable to load large pandas dataframe to pyspark

2022-08-30 Thread Niranda Perera (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598021#comment-17598021
 ] 

Niranda Perera commented on SPARK-40233:


I believe the issue is related to the executors not being able to load data from 
the Python driver (possibly not having enough memory). I believe one executor 
would first have to load the entire data dump from the Python driver and then 
repartition it.

 

My suggestion is to add a `num_partitions` option to the createDataFrame method so 
that partitioning can be handled at the driver and the data sent to the executors 
as a list of RDDs. Is this an acceptable approach from the POV of Spark internals?

> Unable to load large pandas dataframe to pyspark
> 
>
> Key: SPARK-40233
> URL: https://issues.apache.org/jira/browse/SPARK-40233
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Niranda Perera
>Priority: Major
>
> I've been trying to join two large pandas dataframes with PySpark using the 
> following code. I'm trying to vary the executor cores allocated to the 
> application and measure the scalability of PySpark (strong scaling).
> {code:java}
> r = 10 # 1Bn rows 
> it = 10
> w = 256
> unique = 0.9
> TOTAL_MEM = 240
> TOTAL_NODES = 14
> max_val = r * unique
> rng = default_rng()
> frame_data = rng.integers(0, max_val, size=(r, 2)) 
> frame_data1 = rng.integers(0, max_val, size=(r, 2)) 
> print(f"data generated", flush=True)
> df_l = pd.DataFrame(frame_data).add_prefix("col")
> df_r = pd.DataFrame(frame_data1).add_prefix("col")
> print(f"data loaded", flush=True)
> procs = int(math.ceil(w / TOTAL_NODES))
> mem = int(TOTAL_MEM*0.9)
> print(f"world sz {w} procs per worker {procs} mem {mem} iter {it}", 
> flush=True)
> spark = SparkSession\
> .builder\
> .appName(f'join {r} {w}')\
> .master('spark://node:7077')\
> .config('spark.executor.memory', f'{int(mem*0.6)}g')\
> .config('spark.executor.pyspark.memory', f'{int(mem*0.4)}g')\
> .config('spark.cores.max', w)\
> .config('spark.driver.memory', '100g')\
> .config('sspark.sql.execution.arrow.pyspark.enabled', 'true')\
> .getOrCreate()
> sdf0 = spark.createDataFrame(df_l).repartition(w).cache()
> sdf1 = spark.createDataFrame(df_r).repartition(w).cache()
> print(f"data loaded to spark", flush=True)
> try:   
> for i in range(it):
> t1 = time.time()
> out = sdf0.join(sdf1, on='col0', how='inner')
> count = out.count()
> t2 = time.time()
> print(f"timings {r} 1 {i} {(t2 - t1) * 1000:.0f} ms, {count}", 
> flush=True)
> 
> del out
> del count
> gc.collect()
> finally:
> spark.stop() {code}
> {*}Cluster{*}: I am using a standalone Spark cluster of 15 nodes with 
> 48 cores and 240GB RAM each. I've spawned the master and the driver code on 
> node1, while the other 14 nodes have spawned workers allocating maximum memory. 
> In the Spark context, I am reserving 90% of total memory for the executors, 
> splitting 60% to the JVM and 40% to PySpark.
> {*}Issue{*}: When I run the above program, I can see that the executors are 
> being assigned to the app, but it doesn't move forward even after 60 mins. 
> For a smaller row count (10M), this was working without a problem. Driver output:
> {code:java}
> world sz 256 procs per worker 19 mem 216 iter 8
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 22/08/26 14:52:22 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> /N/u/d/dnperera/.conda/envs/env/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:425:
>  UserWarning: createDataFrame attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Negative initial size: -589934400
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warn(msg) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40267) Add description for ExecutorAllocationManager metrics

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40267:


Assignee: Apache Spark

> Add description for ExecutorAllocationManager metrics
> -
>
> Key: SPARK-40267
> URL: https://issues.apache.org/jira/browse/SPARK-40267
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Assignee: Apache Spark
>Priority: Minor
>
> For some ExecutorAllocationManager metrics, it is hard to tell what they stand 
> for just from the metric name.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40267) Add description for ExecutorAllocationManager metrics

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40267:


Assignee: (was: Apache Spark)

> Add description for ExecutorAllocationManager metrics
> -
>
> Key: SPARK-40267
> URL: https://issues.apache.org/jira/browse/SPARK-40267
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> For some ExecutorAllocationManager metrics, it is hard to tell what they stand 
> for just from the metric name.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40267) Add description for ExecutorAllocationManager metrics

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597996#comment-17597996
 ] 

Apache Spark commented on SPARK-40267:
--

User 'warrenzhu25' has created a pull request for this issue:
https://github.com/apache/spark/pull/37733

> Add description for ExecutorAllocationManager metrics
> -
>
> Key: SPARK-40267
> URL: https://issues.apache.org/jira/browse/SPARK-40267
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> Some ExecutorAllocationManager metrics are hard to understand from the metric 
> name alone.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40260) Use error classes in the compilation errors of GROUP BY a position

2022-08-30 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40260.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37712
[https://github.com/apache/spark/pull/37712]

> Use error classes in the compilation errors of GROUP BY a position
> --
>
> Key: SPARK-40260
> URL: https://issues.apache.org/jira/browse/SPARK-40260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Migrate the following errors in QueryCompilationErrors:
> * groupByPositionRefersToAggregateFunctionError
> * groupByPositionRangeError
> onto error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40253) Data read exception in orc format

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597977#comment-17597977
 ] 

Apache Spark commented on SPARK-40253:
--

User 'SelfImpr001' has created a pull request for this issue:
https://github.com/apache/spark/pull/37732

>  Data read exception in orc format
> --
>
> Key: SPARK-40253
> URL: https://issues.apache.org/jira/browse/SPARK-40253
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: os centos7
> spark 2.4.3
> hive 1.2.1
> hadoop 2.7.2
>Reporter: yihangqiao
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Caused by: java.io.EOFException: Read past end of RLE integer from compressed 
> stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 
> offset: 0 limit: 0
> When running batches with spark-sql using the create table xxx as select 
> syntax, the select part uses a static value as the default value (0.00 
> as column_name) without specifying the data type of that value. Because the 
> data type is not explicitly specified, the type metadata for the field is 
> missing from the written ORC file (the write succeeds), but on read, as soon 
> as the query includes this column, the ORC file cannot be parsed and the 
> following error occurs:
>  
> {code:java}
> create table testgg as select 0.00 as gg;select * from testgg;Caused by: 
> java.io.IOException: Error reading file: 
> viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-0-e7df51a1-98b9-4472-9899-3c132b97885b-c000
>        at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291)    
>    at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227)
>        at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109)
>        at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
>  Source)       at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)       at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>        at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)      
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)       at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:288)       at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)       at 
> org.apache.spark.scheduler.Task.run(Task.scala:121)       at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)   
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)  
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>        at java.lang.Thread.run(Thread.java:748)Caused by: 
> java.io.EOFException: Read past end of RLE integer from compressed stream 
> Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 
> limit: 0       at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398)
>        at 
> org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(Tree

[jira] [Commented] (SPARK-40253) Data read exception in orc format

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597975#comment-17597975
 ] 

Apache Spark commented on SPARK-40253:
--

User 'SelfImpr001' has created a pull request for this issue:
https://github.com/apache/spark/pull/37732

>  Data read exception in orc format
> --
>
> Key: SPARK-40253
> URL: https://issues.apache.org/jira/browse/SPARK-40253
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: os centos7
> spark 2.4.3
> hive 1.2.1
> hadoop 2.7.2
>Reporter: yihangqiao
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Caused by: java.io.EOFException: Read past end of RLE integer from compressed 
> stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 
> offset: 0 limit: 0
> When running batches with spark-sql using the create table xxx as select 
> syntax, the select part uses a static value as the default value (0.00 
> as column_name) without specifying the data type of that value. Because the 
> data type is not explicitly specified, the type metadata for the field is 
> missing from the written ORC file (the write succeeds), but on read, as soon 
> as the query includes this column, the ORC file cannot be parsed and the 
> following error occurs:
>  
> {code:java}
> create table testgg as select 0.00 as gg;select * from testgg;Caused by: 
> java.io.IOException: Error reading file: 
> viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-0-e7df51a1-98b9-4472-9899-3c132b97885b-c000
>        at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291)    
>    at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227)
>        at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109)
>        at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
>  Source)       at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)       at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>        at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)      
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)       at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:288)       at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)       at 
> org.apache.spark.scheduler.Task.run(Task.scala:121)       at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)   
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)  
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>        at java.lang.Thread.run(Thread.java:748)Caused by: 
> java.io.EOFException: Read past end of RLE integer from compressed stream 
> Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 
> limit: 0       at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398)
>        at 
> org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(Tree

[jira] [Resolved] (SPARK-38603) Qualified star selection produces duplicated common columns after join then alias

2022-08-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38603.
-
Resolution: Duplicate

> Qualified star selection produces duplicated common columns after join then 
> alias
> -
>
> Key: SPARK-38603
> URL: https://issues.apache.org/jira/browse/SPARK-38603
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: OS: Ubuntu 18.04.5 LTS
> Scala version: 2.12.15
>Reporter: Yves Li
>Priority: Minor
>
> When joining two DataFrames and then aliasing the result, selecting columns 
> from the resulting Dataset by a qualified star produces duplicates of the 
> joined columns.
> {code:scala}
> scala> val df1 = Seq((1, 10), (2, 20)).toDF("a", "x")
> df1: org.apache.spark.sql.DataFrame = [a: int, x: int]
> scala> val df2 = Seq((2, 200), (3, 300)).toDF("a", "y")
> df2: org.apache.spark.sql.DataFrame = [a: int, y: int]
> scala> val joined = df1.join(df2, "a").alias("joined")
> joined: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, x: 
> int ... 1 more field]
> scala> joined.select("*").show()
> +---+---+---+
> |  a|  x|  y|
> +---+---+---+
> |  2| 20|200|
> +---+---+---+
> scala> joined.select("joined.*").show()
> +---+---+---+---+
> |  a|  a|  x|  y|
> +---+---+---+---+
> |  2|  2| 20|200|
> +---+---+---+---+
> scala> joined.select("*").select("joined.*").show()
> +---+---+---+
> |  a|  x|  y|
> +---+---+---+
> |  2| 20|200|
> +---+---+---+ {code}
> This appears to be introduced by SPARK-34527, leading to some surprising 
> behaviour. Using an earlier version, such as Spark 3.0.2, produces the same 
> output for all three {{{}show(){}}}s.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31001) Add ability to create a partitioned table via catalog.createTable()

2022-08-30 Thread Kevin Appel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597941#comment-17597941
 ] 

Kevin Appel commented on SPARK-31001:
-

[~nchammas] 

I ran into this recently while trying to create a partitioned external table. I 
found a bunch of suggestions, but nothing worked easily without manually 
extracting the DDL and removing the partition column. In 
[https://github.com/delta-io/delta/issues/31], user vprus posted an example that 
uses that option and includes the partition columns.

I tried this, combined with the second item, and it works for me to create the 
external table in Spark over the existing parquet data. Since the table is then 
registered, I am also able to view it in Trino.

spark.catalog.createTable("kevin.ktest1", "/user/kevin/ktest1", 
**{"__partition_columns": "['id']"})
spark.sql("alter table kevin.ktest1 recover partitions")
 
Can you see if this is working for you as well?
 
If this is working for you as well, then possibly we can get the Spark 
documentation updated to cover this usage (a fuller sketch follows below).
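
For reference, a slightly fuller sketch of the approach above. The table name, path, and partition column are hypothetical, and the __partition_columns option comes from the linked delta-io issue rather than from documented Spark API, so treat it as an assumption rather than a supported interface.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical names and paths, for illustration only.
table_name = "kevin.ktest1"
data_path = "/user/kevin/ktest1"

# Register an external table over existing Parquet data, passing the partition
# columns through the (undocumented) __partition_columns option.
spark.catalog.createTable(
    table_name,
    path=data_path,
    source="parquet",
    **{"__partition_columns": "['id']"},
)

# Let the metastore discover the existing partition directories.
spark.sql(f"ALTER TABLE {table_name} RECOVER PARTITIONS")
{code}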

> Add ability to create a partitioned table via catalog.createTable()
> ---
>
> Key: SPARK-31001
> URL: https://issues.apache.org/jira/browse/SPARK-31001
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There doesn't appear to be a way to create a partitioned table using the 
> Catalog interface.
> In SQL, however, you can do this via {{{}CREATE TABLE ... PARTITIONED BY{}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40113) Refactor ParquetScanBuilder DataSourceV2 interface implementation

2022-08-30 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-40113.

Fix Version/s: 3.4.0
 Assignee: miracle
   Resolution: Fixed

> Refactor ParquetScanBuilder DataSourceV2 interface implementation
> 
>
> Key: SPARK-40113
> URL: https://issues.apache.org/jira/browse/SPARK-40113
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.0
>Reporter: Mars
>Assignee: miracle
>Priority: Minor
> Fix For: 3.4.0
>
>
> Currently the `FileScanBuilder` interface is not fully implemented in 
> `ParquetScanBuilder`, unlike 
> `OrcScanBuilder`, `AvroScanBuilder`, and `CSVScanBuilder`.
> To unify the logic of the code and make it clearer, this part of the 
> implementation is brought in line with the other builders.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9

2022-08-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40056.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37727
[https://github.com/apache/spark/pull/37727]

> Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
> -
>
> Key: SPARK-40056
> URL: https://issues.apache.org/jira/browse/SPARK-40056
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Trivial
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9

2022-08-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40056:


Assignee: BingKun Pan

> Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
> -
>
> Key: SPARK-40056
> URL: https://issues.apache.org/jira/browse/SPARK-40056
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40253) Data read exception in orc format

2022-08-30 Thread yihangqiao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597905#comment-17597905
 ] 

yihangqiao commented on SPARK-40253:


Solution: in Literal, one leading significant digit is reserved by default, so 
for example 0.00 is converted to decimal(3,2) by default. This remains 
compatible with calculations at the maximum decimal precision of 38 digits.
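
To make the workaround side of this concrete, here is a minimal sketch, reusing the table and column names from the quoted report below. The explicit cast is an assumption about how to avoid the issue, not the proposed fix: it gives the literal an explicit decimal type so complete type metadata is written to the ORC file.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Cast the literal so its decimal type is explicit in the written table schema.
spark.sql("CREATE TABLE testgg AS SELECT CAST(0.00 AS DECIMAL(3, 2)) AS gg")
spark.sql("SELECT * FROM testgg").show()
{code}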

>  Data read exception in orc format
> --
>
> Key: SPARK-40253
> URL: https://issues.apache.org/jira/browse/SPARK-40253
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: os centos7
> spark 2.4.3
> hive 1.2.1
> hadoop 2.7.2
>Reporter: yihangqiao
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Caused by: java.io.EOFException: Read past end of RLE integer from compressed 
> stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 
> offset: 0 limit: 0
> When running batches with spark-sql using the create table xxx as select 
> syntax, the select part uses a static value as the default value (0.00 
> as column_name) without specifying the data type of that value. Because the 
> data type is not explicitly specified, the type metadata for the field is 
> missing from the written ORC file (the write succeeds), but on read, as soon 
> as the query includes this column, the ORC file cannot be parsed and the 
> following error occurs:
>  
> {code:java}
> create table testgg as select 0.00 as gg;select * from testgg;Caused by: 
> java.io.IOException: Error reading file: 
> viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-0-e7df51a1-98b9-4472-9899-3c132b97885b-c000
>        at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291)    
>    at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227)
>        at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109)
>        at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
>  Source)       at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)       at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>        at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)      
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)       at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:288)       at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)       at 
> org.apache.spark.scheduler.Task.run(Task.scala:121)       at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)   
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)  
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>        at java.lang.Thread.run(Thread.java:748)Caused by: 
> java.io.EOFException: Read past end of RLE integer from compressed stream 
> Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 
> limit: 0       at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:

[jira] [Assigned] (SPARK-40279) Document spark.yarn.report.interval

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40279:


Assignee: Apache Spark

> Document spark.yarn.report.interval
> ---
>
> Key: SPARK-40279
> URL: https://issues.apache.org/jira/browse/SPARK-40279
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: Luca Canali
>Assignee: Apache Spark
>Priority: Minor
>
> This proposes to document the configuration parameter 
> spark.yarn.report.interval -> Interval between reports of the current Spark 
> job status in cluster mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40279) Document spark.yarn.report.interval

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40279:


Assignee: (was: Apache Spark)

> Document spark.yarn.report.interval
> ---
>
> Key: SPARK-40279
> URL: https://issues.apache.org/jira/browse/SPARK-40279
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes to document the configuration parameter 
> spark.yarn.report.interval -> Interval between reports of the current Spark 
> job status in cluster mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40279) Document spark.yarn.report.interval

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597896#comment-17597896
 ] 

Apache Spark commented on SPARK-40279:
--

User 'LucaCanali' has created a pull request for this issue:
https://github.com/apache/spark/pull/37731

> Document spark.yarn.report.interval
> ---
>
> Key: SPARK-40279
> URL: https://issues.apache.org/jira/browse/SPARK-40279
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes to document the configuration parameter 
> spark.yarn.report.interval -> Interval between reports of the current Spark 
> job status in cluster mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40279) Document spark.yarn.report.interval

2022-08-30 Thread Luca Canali (Jira)
Luca Canali created SPARK-40279:
---

 Summary: Document spark.yarn.report.interval
 Key: SPARK-40279
 URL: https://issues.apache.org/jira/browse/SPARK-40279
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.3.0
Reporter: Luca Canali


This proposes to document the configuration parameter spark.yarn.report.interval 
-> Interval between reports of the current Spark job status in cluster mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597894#comment-17597894
 ] 

Apache Spark commented on SPARK-39915:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/37730

> Dataset.repartition(N) may not create N partitions
> --
>
> Key: SPARK-39915
> URL: https://issues.apache.org/jira/browse/SPARK-39915
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> Looks like there is a behavior change in Dataset.repartition in 3.3.0. For 
> example, `spark.range(10, 0).repartition(5).rdd.getNumPartitions` returns 5 
> in Spark 3.2.0, but 0 in Spark 3.3.0.
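
For reference, a PySpark equivalent of the quoted check (behavior as described in the report; note that spark.range(10, 0) is an empty range, so this exercises the empty-relation path):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# start=10, end=0 with the default step produces no rows.
n = spark.range(10, 0).repartition(5).rdd.getNumPartitions()
print(n)  # 5 on Spark 3.2.0, 0 on Spark 3.3.0 according to the report above
{code}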



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597893#comment-17597893
 ] 

Apache Spark commented on SPARK-39915:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/37730

> Dataset.repartition(N) may not create N partitions
> --
>
> Key: SPARK-39915
> URL: https://issues.apache.org/jira/browse/SPARK-39915
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> Looks like there is a behavior change in Dataset.repartition in 3.3.0. For 
> example, `spark.range(10, 0).repartition(5).rdd.getNumPartitions` returns 5 
> in Spark 3.2.0, but 0 in Spark 3.3.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40278) Used databricks spark-sql-pref with Spark 3.3 to run 3TB tpcds test failed

2022-08-30 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-40278:
-
Description: 
I used Databricks spark-sql-perf + Spark 3.3 to run 3TB TPCDS q24a or q24b; the 
test code is as follows:
{code:java}
val rootDir = "hdfs://${clusterName}/tpcds-data/POCGenData3T"
val databaseName = "tpcds_database"
val scaleFactor = "3072"
val format = "parquet"
import com.databricks.spark.sql.perf.tpcds.TPCDSTables
val tables = new TPCDSTables(
      spark.sqlContext,dsdgenDir = "./tpcds-kit/tools",
      scaleFactor = scaleFactor,
      useDoubleForDecimal = false,useStringForDate = false)
spark.sql(s"create database $databaseName")
tables.createTemporaryTables(rootDir, format)
spark.sql(s"use $databaseName")// TPCDS 24a or 24b
val result = spark.sql(""" with ssales as
 (select c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color,
        i_current_price, i_manager_id, i_units, i_size, sum(ss_net_paid) netpaid
 from store_sales, store_returns, store, item, customer, customer_address
 where ss_ticket_number = sr_ticket_number
   and ss_item_sk = sr_item_sk
   and ss_customer_sk = c_customer_sk
   and ss_item_sk = i_item_sk
   and ss_store_sk = s_store_sk
   and c_birth_country = upper(ca_country)
   and s_zip = ca_zip
 and s_market_id = 8
 group by c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color,
          i_current_price, i_manager_id, i_units, i_size)
 select c_last_name, c_first_name, s_store_name, sum(netpaid) paid
 from ssales
 where i_color = 'pale'
 group by c_last_name, c_first_name, s_store_name
 having sum(netpaid) > (select 0.05*avg(netpaid) from ssales)""").collect()
 sc.stop() {code}
The above test may fail due to `Stage cancelled because SparkContext was shut 
down` for stage 31 and stage 36 when AQE is enabled, as follows:

 

!image-2022-08-30-21-09-48-763.png!

!image-2022-08-30-21-10-24-862.png!

!image-2022-08-30-21-10-57-128.png!

 

The DAG corresponding to sql is as follows:

!image-2022-08-30-21-11-50-895.png!

The details as follows:

 

 
{code:java}
== Physical Plan ==
AdaptiveSparkPlan (42)
+- == Final Plan ==
   LocalTableScan (1)
+- == Initial Plan ==
   Filter (41)
   +- HashAggregate (40)
  +- Exchange (39)
 +- HashAggregate (38)
+- HashAggregate (37)
   +- Exchange (36)
  +- HashAggregate (35)
 +- Project (34)
+- BroadcastHashJoin Inner BuildRight (33)
   :- Project (29)
   :  +- BroadcastHashJoin Inner BuildRight (28)
   : :- Project (24)
   : :  +- BroadcastHashJoin Inner BuildRight (23)
   : : :- Project (19)
   : : :  +- BroadcastHashJoin Inner BuildRight 
(18)
   : : : :- Project (13)
   : : : :  +- SortMergeJoin Inner (12)
   : : : : :- Sort (6)
   : : : : :  +- Exchange (5)
   : : : : : +- Project (4)
   : : : : :+- Filter (3)
   : : : : :   +- Scan parquet  
(2)
   : : : : +- Sort (11)
   : : : :+- Exchange (10)
   : : : :   +- Project (9)
   : : : :  +- Filter (8)
   : : : : +- Scan parquet  
(7)
   : : : +- BroadcastExchange (17)
   : : :+- Project (16)
   : : :   +- Filter (15)
   : : :  +- Scan parquet  (14)
   : : +- BroadcastExchange (22)
   : :+- Filter (21)
   : :   +- Scan parquet  (20)
   : +- BroadcastExchange (27)
   :+- Filter (26)
   :   +- Scan parquet  (25)
   +- BroadcastExchange (32)
  +- Filter (31)
 +- Scan parquet  (30)


(1) LocalTableScan
Output [4]: [c_last_name#421, c_first_name#420, s_store_name#669, paid#850]
Arguments: , [c_last_name#421, c_first_name#420, s_store_name#669, 
paid#850]

(2) Scan parquet 
Output [6]: [ss_item_sk#131, ss_customer_sk#132, ss_store_sk#136, 
ss_ticket_number#138L, ss_net_paid#149, ss_sold_date_sk#152]
Batched: true
Location: InMemoryFileIndex 
[afs://tianqi.afs.baidu.com:9902/u

[jira] [Created] (SPARK-40278) Used databricks spark-sql-pref with Spark 3.3 to run 3TB tpcds test failed

2022-08-30 Thread Yang Jie (Jira)
Yang Jie created SPARK-40278:


 Summary: Used databricks spark-sql-pref with Spark 3.3 to run 3TB 
tpcds test failed
 Key: SPARK-40278
 URL: https://issues.apache.org/jira/browse/SPARK-40278
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0, 3.4.0
Reporter: Yang Jie


I used Databricks spark-sql-perf + Spark 3.3 to run 3TB TPCDS q24a or q24b; the 
test code is as follows:

```scala
val rootDir = "hdfs://${clusterName}/tpcds-data/POCGenData3T"
val databaseName = "tpcds_database"
val scaleFactor = "3072"
val format = "parquet"
import com.databricks.spark.sql.perf.tpcds.TPCDSTables
val tables = new TPCDSTables(
      spark.sqlContext,dsdgenDir = "./tpcds-kit/tools",
      scaleFactor = scaleFactor,
      useDoubleForDecimal = false,useStringForDate = false)
spark.sql(s"create database $databaseName")
tables.createTemporaryTables(rootDir, format)
spark.sql(s"use $databaseName")

// TPCDS 24a or 24b
val result = spark.sql(""" with ssales as
 (select c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color,
        i_current_price, i_manager_id, i_units, i_size, sum(ss_net_paid) netpaid
 from store_sales, store_returns, store, item, customer, customer_address
 where ss_ticket_number = sr_ticket_number
   and ss_item_sk = sr_item_sk
   and ss_customer_sk = c_customer_sk
   and ss_item_sk = i_item_sk
   and ss_store_sk = s_store_sk
   and c_birth_country = upper(ca_country)
   and s_zip = ca_zip
 and s_market_id = 8
 group by c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color,
          i_current_price, i_manager_id, i_units, i_size)
 select c_last_name, c_first_name, s_store_name, sum(netpaid) paid
 from ssales
 where i_color = 'pale'
 group by c_last_name, c_first_name, s_store_name
 having sum(netpaid) > (select 0.05*avg(netpaid) from ssales)""").collect()
 sc.stop()
```

The above test may fail due to `Stage cancelled because SparkContext was shut 
down` for stage 31 and stage 36 when AQE is enabled, as follows:

 

!image-2022-08-30-21-09-48-763.png!

!image-2022-08-30-21-10-24-862.png!

!image-2022-08-30-21-10-57-128.png!

 

The DAG corresponding to sql is as follows:

!image-2022-08-30-21-11-50-895.png!

The details as follows:

 

 
{code:java}
== Physical Plan ==
AdaptiveSparkPlan (42)
+- == Final Plan ==
   LocalTableScan (1)
+- == Initial Plan ==
   Filter (41)
   +- HashAggregate (40)
  +- Exchange (39)
 +- HashAggregate (38)
+- HashAggregate (37)
   +- Exchange (36)
  +- HashAggregate (35)
 +- Project (34)
+- BroadcastHashJoin Inner BuildRight (33)
   :- Project (29)
   :  +- BroadcastHashJoin Inner BuildRight (28)
   : :- Project (24)
   : :  +- BroadcastHashJoin Inner BuildRight (23)
   : : :- Project (19)
   : : :  +- BroadcastHashJoin Inner BuildRight 
(18)
   : : : :- Project (13)
   : : : :  +- SortMergeJoin Inner (12)
   : : : : :- Sort (6)
   : : : : :  +- Exchange (5)
   : : : : : +- Project (4)
   : : : : :+- Filter (3)
   : : : : :   +- Scan parquet  
(2)
   : : : : +- Sort (11)
   : : : :+- Exchange (10)
   : : : :   +- Project (9)
   : : : :  +- Filter (8)
   : : : : +- Scan parquet  
(7)
   : : : +- BroadcastExchange (17)
   : : :+- Project (16)
   : : :   +- Filter (15)
   : : :  +- Scan parquet  (14)
   : : +- BroadcastExchange (22)
   : :+- Filter (21)
   : :   +- Scan parquet  (20)
   : +- BroadcastExchange (27)
   :+- Filter (26)
   :   +- Scan parquet  (25)
   +- BroadcastExchange (32)
  +- Filter (31)
 +- Scan parquet  (30)


(1) LocalTableScan
Output [4]: [c_last_name#421, c_first_name#420, s_store_name#669, paid#850]
Arguments: , [c_last_name#421, c_first_name#420, s_store_name#669, 
paid#850]

(2) Scan parquet 

[jira] [Updated] (SPARK-40278) Used databricks spark-sql-pref with Spark 3.3 to run 3TB tpcds test failed

2022-08-30 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-40278:
-
Affects Version/s: (was: 3.4.0)

> Used databricks spark-sql-pref with Spark 3.3 to run 3TB tpcds test failed
> --
>
> Key: SPARK-40278
> URL: https://issues.apache.org/jira/browse/SPARK-40278
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Major
>
> I used Databricks spark-sql-perf + Spark 3.3 to run 3TB TPCDS q24a or q24b; 
> the test code is as follows:
> ```scala
> val rootDir = "hdfs://${clusterName}/tpcds-data/POCGenData3T"
> val databaseName = "tpcds_database"
> val scaleFactor = "3072"
> val format = "parquet"
> import com.databricks.spark.sql.perf.tpcds.TPCDSTables
> val tables = new TPCDSTables(
>       spark.sqlContext,dsdgenDir = "./tpcds-kit/tools",
>       scaleFactor = scaleFactor,
>       useDoubleForDecimal = false,useStringForDate = false)
> spark.sql(s"create database $databaseName")
> tables.createTemporaryTables(rootDir, format)
> spark.sql(s"use $databaseName")
> // TPCDS 24a or 24b
> val result = spark.sql(""" with ssales as
>  (select c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color,
>         i_current_price, i_manager_id, i_units, i_size, sum(ss_net_paid) 
> netpaid
>  from store_sales, store_returns, store, item, customer, customer_address
>  where ss_ticket_number = sr_ticket_number
>    and ss_item_sk = sr_item_sk
>    and ss_customer_sk = c_customer_sk
>    and ss_item_sk = i_item_sk
>    and ss_store_sk = s_store_sk
>    and c_birth_country = upper(ca_country)
>    and s_zip = ca_zip
>  and s_market_id = 8
>  group by c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color,
>           i_current_price, i_manager_id, i_units, i_size)
>  select c_last_name, c_first_name, s_store_name, sum(netpaid) paid
>  from ssales
>  where i_color = 'pale'
>  group by c_last_name, c_first_name, s_store_name
>  having sum(netpaid) > (select 0.05*avg(netpaid) from ssales)""").collect()
>  sc.stop()
> ```
> The above test may fail due to `Stage cancelled because SparkContext was 
> shut down` for stage 31 and stage 36 when AQE is enabled, as follows:
>  
> !image-2022-08-30-21-09-48-763.png!
> !image-2022-08-30-21-10-24-862.png!
> !image-2022-08-30-21-10-57-128.png!
>  
> The DAG corresponding to sql is as follows:
> !image-2022-08-30-21-11-50-895.png!
> The details as follows:
>  
>  
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan (42)
> +- == Final Plan ==
>LocalTableScan (1)
> +- == Initial Plan ==
>Filter (41)
>+- HashAggregate (40)
>   +- Exchange (39)
>  +- HashAggregate (38)
> +- HashAggregate (37)
>+- Exchange (36)
>   +- HashAggregate (35)
>  +- Project (34)
> +- BroadcastHashJoin Inner BuildRight (33)
>:- Project (29)
>:  +- BroadcastHashJoin Inner BuildRight (28)
>: :- Project (24)
>: :  +- BroadcastHashJoin Inner BuildRight (23)
>: : :- Project (19)
>: : :  +- BroadcastHashJoin Inner 
> BuildRight (18)
>: : : :- Project (13)
>: : : :  +- SortMergeJoin Inner (12)
>: : : : :- Sort (6)
>: : : : :  +- Exchange (5)
>: : : : : +- Project (4)
>: : : : :+- Filter (3)
>: : : : :   +- Scan 
> parquet  (2)
>: : : : +- Sort (11)
>: : : :+- Exchange (10)
>: : : :   +- Project (9)
>: : : :  +- Filter (8)
>: : : : +- Scan 
> parquet  (7)
>: : : +- BroadcastExchange (17)
>: : :+- Project (16)
>: : :   +- Filter (15)
>: : :  +- Scan parquet  (14)
>: : +- BroadcastExchange (22)
>: :+- Filter (21)
>: :   +- Scan parquet  (20)
>: +- BroadcastExchange (27)
>   

[jira] [Commented] (SPARK-33861) Simplify conditional in predicate

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597865#comment-17597865
 ] 

Apache Spark commented on SPARK-33861:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37729

> Simplify conditional in predicate
> -
>
> Key: SPARK-33861
> URL: https://issues.apache.org/jira/browse/SPARK-33861
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> The use case is:
> {noformat}
> spark.sql("create table t1 using parquet as select id as a, id as b from 
> range(10)")
> spark.sql("select * from t1 where CASE WHEN a > 2 THEN b + 10 END > 
> 5").explain()
> {noformat}
> Before this pr:
> {noformat}
> == Physical Plan ==
> *(1) Filter CASE WHEN (a#3L > 2) THEN ((b#4L + 10) > 5) END
> +- *(1) ColumnarToRow
>+- FileScan parquet default.t1[a#3L,b#4L] Batched: true, DataFilters: 
> [CASE WHEN (a#3L > 2) THEN ((b#4L + 10) > 5) END], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
> {noformat}
> After this pr:
> {noformat}
> == Physical Plan ==
> *(1) Filter (((isnotnull(a#3L) AND isnotnull(b#4L)) AND (a#3L > 2)) AND 
> ((b#4L + 10) > 5))
> +- *(1) ColumnarToRow
>+- FileScan parquet default.t1[a#3L,b#4L] Batched: true, DataFilters: 
> [isnotnull(a#3L), isnotnull(b#4L), (a#3L > 2), ((b#4L + 10) > 5)], Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(a), IsNotNull(b), 
> GreaterThan(a,2)], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33861) Simplify conditional in predicate

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597862#comment-17597862
 ] 

Apache Spark commented on SPARK-33861:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37729

> Simplify conditional in predicate
> -
>
> Key: SPARK-33861
> URL: https://issues.apache.org/jira/browse/SPARK-33861
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> The use case is:
> {noformat}
> spark.sql("create table t1 using parquet as select id as a, id as b from 
> range(10)")
> spark.sql("select * from t1 where CASE WHEN a > 2 THEN b + 10 END > 
> 5").explain()
> {noformat}
> Before this pr:
> {noformat}
> == Physical Plan ==
> *(1) Filter CASE WHEN (a#3L > 2) THEN ((b#4L + 10) > 5) END
> +- *(1) ColumnarToRow
>+- FileScan parquet default.t1[a#3L,b#4L] Batched: true, DataFilters: 
> [CASE WHEN (a#3L > 2) THEN ((b#4L + 10) > 5) END], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
> {noformat}
> After this pr:
> {noformat}
> == Physical Plan ==
> *(1) Filter (((isnotnull(a#3L) AND isnotnull(b#4L)) AND (a#3L > 2)) AND 
> ((b#4L + 10) > 5))
> +- *(1) ColumnarToRow
>+- FileScan parquet default.t1[a#3L,b#4L] Batched: true, DataFilters: 
> [isnotnull(a#3L), isnotnull(b#4L), (a#3L > 2), ((b#4L + 10) > 5)], Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(a), IsNotNull(b), 
> GreaterThan(a,2)], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40277) Use DataFrame's column for referring to DDL schema for from_csv() and from_json()

2022-08-30 Thread Jayant Kumar (Jira)
Jayant Kumar created SPARK-40277:


 Summary: Use DataFrame's column for referring to DDL schema for 
from_csv() and from_json()
 Key: SPARK-40277
 URL: https://issues.apache.org/jira/browse/SPARK-40277
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Jayant Kumar


With Spark's DataFrame API one has to explicitly pass the StructType to 
functions like from_csv and from_json. This works okay in general.

In certain circumstances, when the schema depends on one of the DataFrame's 
fields, it gets complicated and one has to switch to RDDs. This requires 
additional libraries and additional parsing logic.

I am trying to explore a way to enable such use cases with the DataFrame API 
and the functions themselves.
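
To make the current limitation concrete, here is a minimal PySpark sketch. The column names and data are hypothetical; the point is that today the schema passed to from_json must be a constant (a StructType or a DDL string literal) known up front, not a value taken from another column of the DataFrame.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows: each carries a JSON payload plus, ideally, the DDL schema
# that should be used to parse it.
df = spark.createDataFrame(
    [("a INT, b STRING", '{"a": 1, "b": "x"}')],
    ["schema_ddl", "payload"],
)

# The schema argument must be a literal; it cannot refer to the schema_ddl column.
parsed = df.select(F.from_json("payload", "a INT, b STRING").alias("parsed"))
parsed.show(truncate=False)
{code}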



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40276) reduce the result size of RDD.takeOrdered

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597852#comment-17597852
 ] 

Apache Spark commented on SPARK-40276:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37728

> reduce the result size of RDD.takeOrdered
> -
>
> Key: SPARK-40276
> URL: https://issues.apache.org/jira/browse/SPARK-40276
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40276) reduce the result size of RDD.takeOrdered

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597850#comment-17597850
 ] 

Apache Spark commented on SPARK-40276:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37728

> reduce the result size of RDD.takeOrdered
> -
>
> Key: SPARK-40276
> URL: https://issues.apache.org/jira/browse/SPARK-40276
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40276) reduce the result size of RDD.takeOrdered

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40276:


Assignee: (was: Apache Spark)

> reduce the result size of RDD.takeOrdered
> -
>
> Key: SPARK-40276
> URL: https://issues.apache.org/jira/browse/SPARK-40276
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40276) reduce the result size of RDD.takeOrdered

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40276:


Assignee: Apache Spark

> reduce the result size of RDD.takeOrdered
> -
>
> Key: SPARK-40276
> URL: https://issues.apache.org/jira/browse/SPARK-40276
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40276) reduce the result size of RDD.takeOrdered

2022-08-30 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40276:
-

 Summary: reduce the result size of RDD.takeOrdered
 Key: SPARK-40276
 URL: https://issues.apache.org/jira/browse/SPARK-40276
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597778#comment-17597778
 ] 

Apache Spark commented on SPARK-40056:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37727

> Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
> -
>
> Key: SPARK-40056
> URL: https://issues.apache.org/jira/browse/SPARK-40056
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40207) Specify the column name when the data type is not supported by datasource

2022-08-30 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-40207:
---

Assignee: Yi kaifei  (was: Apache Spark)

> Specify the column name when the data type is not supported by datasource
> -
>
> Key: SPARK-40207
> URL: https://issues.apache.org/jira/browse/SPARK-40207
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yi kaifei
>Assignee: Yi kaifei
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, if the data type is not supported by the data source, the 
> exception message thrown does not contain the column name, which makes it 
> harder to locate the problem. This Jira aims to improve the error message 
> description.
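
For context, a minimal way to hit this class of error is sketched below. This is a hedged illustration: the CSV source rejecting nested types is just one example of an unsupported data type, the output path is hypothetical, and the exact message text may differ by version.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CSV does not support nested types, so this write fails with an
# "unsupported data type" AnalysisException, which (per the issue above)
# does not currently name the offending column.
df = spark.range(1).selectExpr("named_struct('a', id) AS s")
df.write.mode("overwrite").csv("/tmp/unsupported_type_demo")  # hypothetical path
{code}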



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40207) Specify the column name when the data type is not supported by datasource

2022-08-30 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-40207.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37574
[https://github.com/apache/spark/pull/37574]

> Specify the column name when the data type is not supported by datasource
> -
>
> Key: SPARK-40207
> URL: https://issues.apache.org/jira/browse/SPARK-40207
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yi kaifei
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, if the data type is not supported by the data source, the 
> exception message thrown does not contain the column name, which makes it 
> harder to locate the problem. This Jira aims to improve the error message 
> description.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40275) Support casting decimal128

2022-08-30 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-40275:
--

 Summary: Support casting decimal128
 Key: SPARK-40275
 URL: https://issues.apache.org/jira/browse/SPARK-40275
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer

2022-08-30 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

张刘强 updated SPARK-40274:

Environment: spark 3.1.2 scala 2.12.10 jdk 11 linux  (was: spark 3.1.2 
scala 2.12.10 jdk 1.8 linux)

> ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
> --
>
> Key: SPARK-40274
> URL: https://issues.apache.org/jira/browse/SPARK-40274
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2
> Environment: spark 3.1.2 scala 2.12.10 jdk 11 linux
>Reporter: 张刘强
>Priority: Major
> Attachments: code.scala, error.txt, pom.txt
>
>
> spark 3.1.2 scala 2.12.10 jdk 1.8 linux
>  
> when using dataframe.count, the following exception is thrown:
>  
> the stack trace looks like this:
>  
> java.lang.ArrayIndexOutOfBoundsException: Index 28499 out of bounds for 
> length 206
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
>         at 
> com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:59)
>         at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292)
>         at scala.collection.Iterator.foreach(Iterator.scala:943)
>         at scala.collection.Iterator.foreach$(Iterator.scala:943)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>         at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>         at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>         at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292)
>         at 
> scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289)
>         at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:181)
>         at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
>         at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>         at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>         at scala.collection.TraversableLike.map(TraversableLike.scala:285)
>         at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
>         at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14(BeanIntrospector.scala:175)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14$adapted(BeanIntrospector.scala:174)
>         at scala.collection.immutable.List.flatMap(List.scala:366)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:174)
>         at 
> com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:20)
>         at 
> com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.fieldName(ScalaAnnotationIntrospectorModule.scala:28)
>         at 
> com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.findImplicitPropertyName(ScalaAnnotationIntrospectorModule.scala:80)
>         at 
> com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findImplicitPropertyName(AnnotationIntrospectorPair.java:490)
>         at 
> com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector._addFields(POJOPropertiesCollector.java:380)
>         at 
> com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:308)
>         at 
> com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.getJsonValueAccessor(POJOPropertiesCollector.java:196)
>         at 
> com.fasterxml.jackson.databind.int

[jira] [Updated] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer

2022-08-30 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

张刘强 updated SPARK-40274:

Description: 
spark 3.1.2 scala 2.12.10 jdk 1.8 linux

 

calling dataframe.count throws this exception:

 

stacktrace like this:

 

java.lang.ArrayIndexOutOfBoundsException: Index 28499 out of bounds for length 
206
        at 
com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
        at 
com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
        at 
com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
        at 
com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
        at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45)
        at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59)
        at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:59)
        at 
scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292)
        at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289)
        at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
        at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:59)
        at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:181)
        at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
        at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
        at scala.collection.TraversableLike.map(TraversableLike.scala:285)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
        at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14(BeanIntrospector.scala:175)
        at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14$adapted(BeanIntrospector.scala:174)
        at scala.collection.immutable.List.flatMap(List.scala:366)
        at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:174)
        at 
com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:20)
        at 
com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.fieldName(ScalaAnnotationIntrospectorModule.scala:28)
        at 
com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.findImplicitPropertyName(ScalaAnnotationIntrospectorModule.scala:80)
        at 
com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findImplicitPropertyName(AnnotationIntrospectorPair.java:490)
        at 
com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector._addFields(POJOPropertiesCollector.java:380)
        at 
com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:308)
        at 
com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.getJsonValueAccessor(POJOPropertiesCollector.java:196)
        at 
com.fasterxml.jackson.databind.introspect.BasicBeanDescription.findJsonValueAccessor(BasicBeanDescription.java:252)
        at 
com.fasterxml.jackson.databind.ser.BasicSerializerFactory.findSerializerByAnnotations(BasicSerializerFactory.java:346)
        at 
com.fasterxml.jackson.databind.ser.BeanSerializerFactory._createSerializer2(BeanSerializerFactory.java:216)
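
The stack trace above goes through jackson-module-scala's BeanIntrospector and paranamer, so a quick way to narrow the problem down is to exercise that path without Spark. The snippet below is only a sketch under the assumption that the Jackson/paranamer versions on the classpath are the suspect (the application pins Jackson 2.10.5 while Spark 3.1.2 bundles its own); if it throws the same ArrayIndexOutOfBoundsException, the failure is reproducible independently of dataframe.count.

{code:scala}
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

object ParanamerRepro {
  // A simple case class is enough to trigger constructor-parameter
  // introspection through BytecodeReadingParanamer.
  final case class Record(id: Long, name: String)

  def main(args: Array[String]): Unit = {
    val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule)
    // Serializing the case class follows the same BeanIntrospector path
    // that appears in the reported stack trace.
    println(mapper.writeValueAsString(Record(1L, "a")))
    // Print the jackson-databind version actually loaded at runtime.
    println(mapper.version())
  }
}
{code}

If the standalone run fails the same way, aligning the application's Jackson and paranamer versions with the ones bundled in the Spark 3.1.2 distribution would be the first thing to try.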

 

pom like this:

       
            com.fasterxml.jackson.core
            jackson-core
            2.10.5
        
        
            com.fasterxml.jackson.core
            jackson-databind
            2.10.5
        
        
            com.fasterxml.jackson.core
            jackson-annotations
            2.10.5
        
        
            com.fasterxml.jackson.module
            jackson-module-scala_2.12
   

[jira] [Updated] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer

2022-08-30 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

张刘强 updated SPARK-40274:

Environment: spark 3.1.2 scala 2.12.10 jdk 1.8 linux  (was: 
            com.fasterxml.jackson.core
            jackson-core
            2.10.5
        
        
            com.fasterxml.jackson.core
            jackson-databind
            2.10.5
        
        
            com.fasterxml.jackson.core
            jackson-annotations
            2.10.5
        
        
            com.fasterxml.jackson.module
            jackson-module-scala_2.12
            2.10.5
        

        
            com.thoughtworks.paranamer
            paranamer
            2.8
        
        
            org.apache.spark
            spark-core_2.12
            3.1.2
            
                
                    com.fasterxml.jackson.core
                    jackson-core
                
                
                    com.fasterxml.jackson.core
                    jackson-databind
                
                
                    com.fasterxml.jackson.module
                    jackson-module-scala_2.12
                
            
        
        
            org.apache.spark
            spark-sql_2.12
            3.1.2
            
                
                    com.fasterxml.jackson.core
                    jackson-core
                
                
                    com.fasterxml.jackson.core
                    jackson-databind
                
                
                    com.fasterxml.jackson.module
                    jackson-module-scala_2.12
                
            
        )

> ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
> --
>
> Key: SPARK-40274
> URL: https://issues.apache.org/jira/browse/SPARK-40274
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2
> Environment: spark 3.1.2 scala 2.12.10 jdk 1.8 linux
>Reporter: 张刘强
>Priority: Major
> Attachments: code.scala, error.txt, pom.txt
>
>
> spark 3.1.2 scala 2.12.10 jdk 1.8 linux
>  
> calling dataframe.count throws this exception:
>  
> stacktrace like this:
>  
> java.lang.ArrayIndexOutOfBoundsException: Index 28499 out of bounds for 
> length 206
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
>         at 
> com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:59)
>         at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292)
>         at scala.collection.Iterator.foreach(Iterator.scala:943)
>         at scala.collection.Iterator.foreach$(Iterator.scala:943)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>         at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>         at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>         at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292)
>         at 
> scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289)
>         at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:181)
>         at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
>         at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>         at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>         at scala.collection.TraversableLike.map(TraversableLike.scala:285)
>         at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
>         at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>         at 
> com.fasterxml.jackson.module.sca

[jira] [Updated] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer

2022-08-30 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

张刘强 updated SPARK-40274:

Attachment: pom.txt

> ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
> --
>
> Key: SPARK-40274
> URL: https://issues.apache.org/jira/browse/SPARK-40274
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2
> Environment: 
>             com.fasterxml.jackson.core
>             jackson-core
>             2.10.5
>         
>         
>             com.fasterxml.jackson.core
>             jackson-databind
>             2.10.5
>         
>         
>             com.fasterxml.jackson.core
>             jackson-annotations
>             2.10.5
>         
>         
>             com.fasterxml.jackson.module
>             jackson-module-scala_2.12
>             2.10.5
>         
>         
>             com.thoughtworks.paranamer
>             paranamer
>             2.8
>         
>         
>             org.apache.spark
>             spark-core_2.12
>             3.1.2
>             
>                 
>                     com.fasterxml.jackson.core
>                     jackson-core
>                 
>                 
>                     com.fasterxml.jackson.core
>                     jackson-databind
>                 
>                 
>                     com.fasterxml.jackson.module
>                     jackson-module-scala_2.12
>                 
>             
>         
>         
>             org.apache.spark
>             spark-sql_2.12
>             3.1.2
>             
>                 
>                     com.fasterxml.jackson.core
>                     jackson-core
>                 
>                 
>                     com.fasterxml.jackson.core
>                     jackson-databind
>                 
>                 
>                     com.fasterxml.jackson.module
>                     jackson-module-scala_2.12
>                 
>             
>         
>Reporter: 张刘强
>Priority: Major
> Attachments: code.scala, error.txt, pom.txt
>
>
> spark 3.1.2 scala 2.12.10 jdk 1.8 linux
>  
> calling dataframe.count throws this exception:
>  
> stacktrace like this:
>  
> java.lang.ArrayIndexOutOfBoundsException: Index 28499 out of bounds for 
> length 206
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
>         at 
> com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:59)
>         at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292)
>         at scala.collection.Iterator.foreach(Iterator.scala:943)
>         at scala.collection.Iterator.foreach$(Iterator.scala:943)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>         at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>         at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>         at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292)
>         at 
> scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289)
>         at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:181)
>         at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
>         at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>         at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>         at scala.collection.TraversableLike.map(TraversableLike.scala:285)
>         at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
>         at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>     

[jira] [Updated] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer

2022-08-30 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

张刘强 updated SPARK-40274:

Attachment: code.scala

> ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
> --
>
> Key: SPARK-40274
> URL: https://issues.apache.org/jira/browse/SPARK-40274
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2
> Environment: 
>             com.fasterxml.jackson.core
>             jackson-core
>             2.10.5
>         
>         
>             com.fasterxml.jackson.core
>             jackson-databind
>             2.10.5
>         
>         
>             com.fasterxml.jackson.core
>             jackson-annotations
>             2.10.5
>         
>         
>             com.fasterxml.jackson.module
>             jackson-module-scala_2.12
>             2.10.5
>         
>         
>             com.thoughtworks.paranamer
>             paranamer
>             2.8
>         
>         
>             org.apache.spark
>             spark-core_2.12
>             3.1.2
>             
>                 
>                     com.fasterxml.jackson.core
>                     jackson-core
>                 
>                 
>                     com.fasterxml.jackson.core
>                     jackson-databind
>                 
>                 
>                     com.fasterxml.jackson.module
>                     jackson-module-scala_2.12
>                 
>             
>         
>         
>             org.apache.spark
>             spark-sql_2.12
>             3.1.2
>             
>                 
>                     com.fasterxml.jackson.core
>                     jackson-core
>                 
>                 
>                     com.fasterxml.jackson.core
>                     jackson-databind
>                 
>                 
>                     com.fasterxml.jackson.module
>                     jackson-module-scala_2.12
>                 
>             
>         
>Reporter: 张刘强
>Priority: Major
> Attachments: code.scala, error.txt, pom.txt
>
>
> spark 3.1.2 scala 2.12.10 jdk 1.8 linux
>  
> calling dataframe.count throws this exception:
>  
> stacktrace like this:
>  
> java.lang.ArrayIndexOutOfBoundsException: Index 28499 out of bounds for 
> length 206
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
>         at 
> com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:59)
>         at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292)
>         at scala.collection.Iterator.foreach(Iterator.scala:943)
>         at scala.collection.Iterator.foreach$(Iterator.scala:943)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>         at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>         at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>         at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292)
>         at 
> scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289)
>         at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:181)
>         at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
>         at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>         at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>         at scala.collection.TraversableLike.map(TraversableLike.scala:285)
>         at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
>         at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>  

[jira] [Updated] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer

2022-08-30 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

张刘强 updated SPARK-40274:

Attachment: error.txt

> ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
> --
>
> Key: SPARK-40274
> URL: https://issues.apache.org/jira/browse/SPARK-40274
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2
> Environment: 
>             com.fasterxml.jackson.core
>             jackson-core
>             2.10.5
>         
>         
>             com.fasterxml.jackson.core
>             jackson-databind
>             2.10.5
>         
>         
>             com.fasterxml.jackson.core
>             jackson-annotations
>             2.10.5
>         
>         
>             com.fasterxml.jackson.module
>             jackson-module-scala_2.12
>             2.10.5
>         
>         
>             com.thoughtworks.paranamer
>             paranamer
>             2.8
>         
>         
>             org.apache.spark
>             spark-core_2.12
>             3.1.2
>             
>                 
>                     com.fasterxml.jackson.core
>                     jackson-core
>                 
>                 
>                     com.fasterxml.jackson.core
>                     jackson-databind
>                 
>                 
>                     com.fasterxml.jackson.module
>                     jackson-module-scala_2.12
>                 
>             
>         
>         
>             org.apache.spark
>             spark-sql_2.12
>             3.1.2
>             
>                 
>                     com.fasterxml.jackson.core
>                     jackson-core
>                 
>                 
>                     com.fasterxml.jackson.core
>                     jackson-databind
>                 
>                 
>                     com.fasterxml.jackson.module
>                     jackson-module-scala_2.12
>                 
>             
>         
>Reporter: 张刘强
>Priority: Major
> Attachments: code.scala, error.txt, pom.txt
>
>
> spark 3.1.2 scala 2.12.10 jdk 1.8 linux
>  
> calling dataframe.count throws this exception:
>  
> stacktrace like this:
>  
> java.lang.ArrayIndexOutOfBoundsException: Index 28499 out of bounds for 
> length 206
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
>         at 
> com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:59)
>         at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292)
>         at scala.collection.Iterator.foreach(Iterator.scala:943)
>         at scala.collection.Iterator.foreach$(Iterator.scala:943)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>         at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>         at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>         at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292)
>         at 
> scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289)
>         at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:181)
>         at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
>         at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>         at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>         at scala.collection.TraversableLike.map(TraversableLike.scala:285)
>         at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
>         at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>   

[jira] [Updated] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer

2022-08-30 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-40274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

张刘强 updated SPARK-40274:

Docs Text:   (was: code like this:

val dataFrame: DataFrame = sparkSession.read
  .format(JDBC)
  .option(SSL, TRUE)
  .option(SSL_VERIFICATION, NONE)
  .option(DRIVER,
if (DatasourceTaskType.RESOURCE_DIRECTORY.name == 
inputDataSourceInfo.getSourceType) {
  COM_TRINO_JDBC_DRIVER
} else {
  
JdbcParamsUtil.getDriver(DbType.valueOf(inputDataSourceInfo.getDatabaseType).getCode)
})
  .option(URL, if (DatasourceTaskType.RESOURCE_DIRECTORY.name == 
inputDataSourceInfo.getSourceType) {
inputDataSourceInfo.getAddress
  } else {
inputDataSourceInfo.getJdbcUrl
  })
  .option(USER, inputDataSourceInfo.getUser)
  .option(PASSWORD, inputDataSourceInfo.getPassword)
  .option("keepAlive", TRUE)
  .option(QUERY, baseSql)
  .load

columns = dataFrame.columns
val count: Long = dataFrame.count())

> ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
> --
>
> Key: SPARK-40274
> URL: https://issues.apache.org/jira/browse/SPARK-40274
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2
> Environment: 
>             com.fasterxml.jackson.core
>             jackson-core
>             2.10.5
>         
>         
>             com.fasterxml.jackson.core
>             jackson-databind
>             2.10.5
>         
>         
>             com.fasterxml.jackson.core
>             jackson-annotations
>             2.10.5
>         
>         
>             com.fasterxml.jackson.module
>             jackson-module-scala_2.12
>             2.10.5
>         
>         
>             com.thoughtworks.paranamer
>             paranamer
>             2.8
>         
>         
>             org.apache.spark
>             spark-core_2.12
>             3.1.2
>             
>                 
>                     com.fasterxml.jackson.core
>                     jackson-core
>                 
>                 
>                     com.fasterxml.jackson.core
>                     jackson-databind
>                 
>                 
>                     com.fasterxml.jackson.module
>                     jackson-module-scala_2.12
>                 
>             
>         
>         
>             org.apache.spark
>             spark-sql_2.12
>             3.1.2
>             
>                 
>                     com.fasterxml.jackson.core
>                     jackson-core
>                 
>                 
>                     com.fasterxml.jackson.core
>                     jackson-databind
>                 
>                 
>                     com.fasterxml.jackson.module
>                     jackson-module-scala_2.12
>                 
>             
>         
>Reporter: 张刘强
>Priority: Major
>
> spark 3.1.2 scala 2.12.10 jdk 1.8 linux
>  
> calling dataframe.count throws this exception:
>  
> stacktrace like this:
>  
> java.lang.ArrayIndexOutOfBoundsException: Index 28499 out of bounds for 
> length 206
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
>         at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
>         at 
> com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59)
>         at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:59)
>         at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292)
>         at scala.collection.Iterator.foreach(Iterator.scala:943)
>         at scala.collection.Iterator.foreach$(Iterator.scala:943)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>         at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>         at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>         at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292)
>         at 
> scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289)
>         at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
>         at 
> co

[jira] [Created] (SPARK-40274) ArrayIndexOutOfBoundsException in BytecodeReadingParanamer

2022-08-30 Thread Jira
张刘强 created SPARK-40274:
---

 Summary: ArrayIndexOutOfBoundsException in BytecodeReadingParanamer
 Key: SPARK-40274
 URL: https://issues.apache.org/jira/browse/SPARK-40274
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 3.1.2
 Environment: 
            com.fasterxml.jackson.core
            jackson-core
            2.10.5
        
        
            com.fasterxml.jackson.core
            jackson-databind
            2.10.5
        
        
            com.fasterxml.jackson.core
            jackson-annotations
            2.10.5
        
        
            com.fasterxml.jackson.module
            jackson-module-scala_2.12
            2.10.5
        

        
            com.thoughtworks.paranamer
            paranamer
            2.8
        
        
            org.apache.spark
            spark-core_2.12
            3.1.2
            
                
                    com.fasterxml.jackson.core
                    jackson-core
                
                
                    com.fasterxml.jackson.core
                    jackson-databind
                
                
                    com.fasterxml.jackson.module
                    jackson-module-scala_2.12
                
            
        
        
            org.apache.spark
            spark-sql_2.12
            3.1.2
            
                
                    com.fasterxml.jackson.core
                    jackson-core
                
                
                    com.fasterxml.jackson.core
                    jackson-databind
                
                
                    com.fasterxml.jackson.module
                    jackson-module-scala_2.12
                
            
        
Reporter: 张刘强


spark 3.1.2 scala 2.12.10 jdk 1.8 linux

 

calling dataframe.count throws this exception:

 

stacktrace like this:

 

java.lang.ArrayIndexOutOfBoundsException: Index 28499 out of bounds for length 
206
        at 
com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
        at 
com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
        at 
com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
        at 
com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
        at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45)
        at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59)
        at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:59)
        at 
scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292)
        at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289)
        at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
        at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:59)
        at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:181)
        at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
        at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
        at scala.collection.TraversableLike.map(TraversableLike.scala:285)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
        at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14(BeanIntrospector.scala:175)
        at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14$adapted(BeanIntrospector.scala:174)
        at scala.collection.immutable.List.flatMap(List.scala:366)
        at 
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:174)
        at 
com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntro

[jira] [Assigned] (SPARK-40253) Data read exception in orc format

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40253:


Assignee: Apache Spark

>  Data read exception in orc format
> --
>
> Key: SPARK-40253
> URL: https://issues.apache.org/jira/browse/SPARK-40253
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: os centos7
> spark 2.4.3
> hive 1.2.1
> hadoop 2.7.2
>Reporter: yihangqiao
>Assignee: Apache Spark
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Caused by: java.io.EOFException: Read past end of RLE integer from compressed 
> stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 
> offset: 0 limit: 0
> When running batches with spark-sql and using the create table xxx as select 
> syntax, the select part uses a static value as a default column value (0.00 
> as column_name) without specifying its data type. Because the data type is 
> not explicitly specified, the column's metadata is missing from the written 
> ORC file (the write itself succeeds), but on read, any query that includes 
> this column fails to parse the ORC file with the following error:
>  
> {code:java}
> create table testgg as select 0.00 as gg;
> select * from testgg;
> Caused by: 
> java.io.IOException: Error reading file: 
> viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-0-e7df51a1-98b9-4472-9899-3c132b97885b-c000
>        at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291)    
>    at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227)
>        at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109)
>        at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
>  Source)       at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)       at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>        at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)      
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)       at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:288)       at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)       at 
> org.apache.spark.scheduler.Task.run(Task.scala:121)       at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)   
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)  
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>        at java.lang.Thread.run(Thread.java:748)Caused by: 
> java.io.EOFException: Read past end of RLE integer from compressed stream 
> Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 
> limit: 0       at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398)
>        at 
> org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderFactory.java:1205)
>        at 
> org.apache.orc.impl.TreeReaderFactory$DecimalTreeRead
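
A minimal sketch of the workaround implied by the description, under the assumption that giving the literal an explicit type lets the ORC writer record complete column metadata; the table name testgg_typed is made up for illustration.

{code:scala}
import org.apache.spark.sql.SparkSession

object OrcTypedLiteral {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("orc-typed-literal")
      .enableHiveSupport()
      .getOrCreate()

    // Cast the constant so the column gets an explicit DECIMAL(10,2) type
    // instead of relying on the default type of the bare literal 0.00.
    spark.sql(
      "CREATE TABLE testgg_typed STORED AS ORC AS " +
        "SELECT CAST(0.00 AS DECIMAL(10, 2)) AS gg")
    spark.sql("SELECT * FROM testgg_typed").show()

    spark.stop()
  }
}
{code}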

[jira] [Assigned] (SPARK-40253) Data read exception in orc format

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40253:


Assignee: (was: Apache Spark)

>  Data read exception in orc format
> --
>
> Key: SPARK-40253
> URL: https://issues.apache.org/jira/browse/SPARK-40253
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: os centos7
> spark 2.4.3
> hive 1.2.1
> hadoop 2.7.2
>Reporter: yihangqiao
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Caused by: java.io.EOFException: Read past end of RLE integer from compressed 
> stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 
> offset: 0 limit: 0
> When running batches with spark-sql and using the create table xxx as select 
> syntax, the select part uses a static value as a default column value (0.00 
> as column_name) without specifying its data type. Because the data type is 
> not explicitly specified, the column's metadata is missing from the written 
> ORC file (the write itself succeeds), but on read, any query that includes 
> this column fails to parse the ORC file with the following error:
>  
> {code:java}
> create table testgg as select 0.00 as gg;
> select * from testgg;
> Caused by: 
> java.io.IOException: Error reading file: 
> viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-0-e7df51a1-98b9-4472-9899-3c132b97885b-c000
>        at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291)    
>    at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227)
>        at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109)
>        at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
>  Source)       at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)       at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>        at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)      
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)       at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:288)       at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)       at 
> org.apache.spark.scheduler.Task.run(Task.scala:121)       at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)   
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)  
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>        at java.lang.Thread.run(Thread.java:748)Caused by: 
> java.io.EOFException: Read past end of RLE integer from compressed stream 
> Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 
> limit: 0       at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398)
>        at 
> org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderFactory.java:1205)
>        at 
> org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderF

[jira] [Commented] (SPARK-40253) Data read exception in orc format

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597718#comment-17597718
 ] 

Apache Spark commented on SPARK-40253:
--

User 'SelfImpr001' has created a pull request for this issue:
https://github.com/apache/spark/pull/37726

>  Data read exception in orc format
> --
>
> Key: SPARK-40253
> URL: https://issues.apache.org/jira/browse/SPARK-40253
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: os centos7
> spark 2.4.3
> hive 1.2.1
> hadoop 2.7.2
>Reporter: yihangqiao
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Caused by: java.io.EOFException: Read past end of RLE integer from compressed 
> stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 
> offset: 0 limit: 0
> When running batches with spark-sql and using the create table xxx as select 
> syntax, the select part uses a static value as a default column value (0.00 
> as column_name) without specifying its data type. Because the data type is 
> not explicitly specified, the column's metadata is missing from the written 
> ORC file (the write itself succeeds), but on read, any query that includes 
> this column fails to parse the ORC file with the following error:
>  
> {code:java}
> create table testgg as select 0.00 as gg;
> select * from testgg;
> Caused by: 
> java.io.IOException: Error reading file: 
> viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-0-e7df51a1-98b9-4472-9899-3c132b97885b-c000
>        at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291)    
>    at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227)
>        at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109)
>        at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
>  Source)       at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)       at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>        at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)      
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)       at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:288)       at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)       at 
> org.apache.spark.scheduler.Task.run(Task.scala:121)       at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)   
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)  
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>        at java.lang.Thread.run(Thread.java:748)Caused by: 
> java.io.EOFException: Read past end of RLE integer from compressed stream 
> Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 
> limit: 0       at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398)
>        at 
> org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(Tree

[jira] [Commented] (SPARK-40253) Data read exception in orc format

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597717#comment-17597717
 ] 

Apache Spark commented on SPARK-40253:
--

User 'SelfImpr001' has created a pull request for this issue:
https://github.com/apache/spark/pull/37726

>  Data read exception in orc format
> --
>
> Key: SPARK-40253
> URL: https://issues.apache.org/jira/browse/SPARK-40253
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: os centos7
> spark 2.4.3
> hive 1.2.1
> hadoop 2.7.2
>Reporter: yihangqiao
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Caused by: java.io.EOFException: Read past end of RLE integer from compressed 
> stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 
> offset: 0 limit: 0
> When running batches with spark-sql and using the create table xxx as select 
> syntax, the select part uses a static value as a default column value (0.00 
> as column_name) without specifying its data type. Because the data type is 
> not explicitly specified, the column's metadata is missing from the written 
> ORC file (the write itself succeeds), but on read, any query that includes 
> this column fails to parse the ORC file with the following error:
>  
> {code:java}
> create table testgg as select 0.00 as gg;
> select * from testgg;
> Caused by: 
> java.io.IOException: Error reading file: 
> viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-0-e7df51a1-98b9-4472-9899-3c132b97885b-c000
>        at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291)    
>    at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227)
>        at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109)
>        at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
>  Source)       at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)       at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>        at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)      
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)       at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:288)       at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)       at 
> org.apache.spark.scheduler.Task.run(Task.scala:121)       at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)   
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)  
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>        at java.lang.Thread.run(Thread.java:748)Caused by: 
> java.io.EOFException: Read past end of RLE integer from compressed stream 
> Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 
> limit: 0       at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398)
>        at 
> org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(Tree

[jira] [Assigned] (SPARK-40273) Fix the documents "Contributing and Maintaining Type Hints".

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40273:


Assignee: Apache Spark

> Fix the documents "Contributing and Maintaining Type Hints".
> 
>
> Key: SPARK-40273
> URL: https://issues.apache.org/jira/browse/SPARK-40273
> Project: Spark
>  Issue Type: Test
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> Since we don't use `*.pyi` for type hinting anymore (it's all ported as 
> inline type hints in the `*.py` files), we also should fix the related 
> documents accordingly 
> (https://spark.apache.org/docs/latest/api/python/development/contributing.html#contributing-and-maintaining-type-hints)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40273) Fix the documents "Contributing and Maintaining Type Hints".

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597690#comment-17597690
 ] 

Apache Spark commented on SPARK-40273:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/37724

> Fix the documents "Contributing and Maintaining Type Hints".
> 
>
> Key: SPARK-40273
> URL: https://issues.apache.org/jira/browse/SPARK-40273
> Project: Spark
>  Issue Type: Test
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Since we don't use `*.pyi` for type hinting anymore (it's all ported as 
> inline type hints in the `*.py` files), we also should fix the related 
> documents accordingly 
> (https://spark.apache.org/docs/latest/api/python/development/contributing.html#contributing-and-maintaining-type-hints)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40273) Fix the documents "Contributing and Maintaining Type Hints".

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40273:


Assignee: (was: Apache Spark)

> Fix the documents "Contributing and Maintaining Type Hints".
> 
>
> Key: SPARK-40273
> URL: https://issues.apache.org/jira/browse/SPARK-40273
> Project: Spark
>  Issue Type: Test
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Since we don't use `*.pyi` for type hinting anymore (it's all ported as 
> inline type hints in the `*.py` files), we also should fix the related 
> documents accordingly 
> (https://spark.apache.org/docs/latest/api/python/development/contributing.html#contributing-and-maintaining-type-hints)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39763) Executor memory footprint substantially increases while reading zstd compressed parquet files

2022-08-30 Thread Fengyu Cao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597685#comment-17597685
 ] 

Fengyu Cao commented on SPARK-39763:


We had the same problem.

One of our datasets is 75 GB in zstd parquet (134 GB in snappy):
{code:python}
# 10 executors
# Executor Reqs: memoryOverhead: [amount: 3072] cores: [amount: 4]
#                memory: [amount: 10240] offHeap: [amount: 4096]
# Task Reqs: cpus: [amount: 1.0]

# read with spark.sql.parquet.enableVectorizedReader=false
df = spark.read.parquet("dataset_zstd")
df.write.mode("overwrite").format("noop").save()
{code}
Tasks failed with OOM, but with the snappy dataset everything is fine.

 

> Executor memory footprint substantially increases while reading zstd 
> compressed parquet files
> -
>
> Key: SPARK-39763
> URL: https://issues.apache.org/jira/browse/SPARK-39763
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yeachan Park
>Priority: Minor
>
> Hi all,
>  
> While transitioning from the default snappy compression to zstd, we noticed a 
> substantial increase in executor memory whilst *reading* and applying 
> transformations on *zstd* compressed parquet files.
> Memory footprint increased 3-fold in some cases, compared to 
> reading and applying the same transformations on a parquet file compressed 
> with snappy.
> This behaviour only occurs when reading zstd compressed parquet files. 
> Writing a zstd parquet file does not result in this behaviour.
> To reproduce:
>  # Set "spark.sql.parquet.compression.codec" to zstd
> # Write some parquet files; the compression will default to zstd after 
> setting the option above
>  # Read the compressed zstd file and run some transformations. Compare the 
> memory usage of the executor vs running the same transformation on a parquet 
> file with snappy compression.
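
A minimal sketch of the reproduction steps above, assuming an active `spark` 
session; the path, row count, and transformation are hypothetical placeholders:

{code:python}
from pyspark.sql import functions as F

# 1. Default all parquet writes to zstd
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

# 2. Write a sample dataset (hypothetical path)
spark.range(10_000_000) \
    .withColumn("payload", F.sha2(F.col("id").cast("string"), 256)) \
    .write.mode("overwrite").parquet("/tmp/parquet_zstd")

# 3. Read it back, apply a transformation, and watch executor memory;
#    repeat with "snappy" in step 1 to compare the footprints
df = spark.read.parquet("/tmp/parquet_zstd")
df.groupBy((F.col("id") % 100).alias("bucket")).count() \
    .write.mode("overwrite").format("noop").save()
{code}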



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40273) Fix the documents "Contributing and Maintaining Type Hints".

2022-08-30 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597646#comment-17597646
 ] 

Haejoon Lee commented on SPARK-40273:
-

I'm working on this

> Fix the documents "Contributing and Maintaining Type Hints".
> 
>
> Key: SPARK-40273
> URL: https://issues.apache.org/jira/browse/SPARK-40273
> Project: Spark
>  Issue Type: Test
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Since we don't use `*.pyi` for type hinting anymore (it's all ported as 
> inline type hints in the `*.py` files), we also should fix the related 
> documents accordingly 
> (https://spark.apache.org/docs/latest/api/python/development/contributing.html#contributing-and-maintaining-type-hints)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40273) Fix the documents "Contributing and Maintaining Type Hints".

2022-08-30 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40273:
---

 Summary: Fix the documents "Contributing and Maintaining Type 
Hints".
 Key: SPARK-40273
 URL: https://issues.apache.org/jira/browse/SPARK-40273
 Project: Spark
  Issue Type: Test
  Components: Documentation, PySpark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


Since we don't use `*.pyi` for type hinting anymore (it's all ported as inline 
type hints in the `*.py` files), we also should fix the related documents 
accordingly 
(https://spark.apache.org/docs/latest/api/python/development/contributing.html#contributing-and-maintaining-type-hints)
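
For illustration only, a sketch of the inline annotation style the updated 
docs should describe; the module and function names here are hypothetical, 
not actual PySpark APIs:

{code:python}
# Previously, annotations lived in a separate stub file, e.g. util.pyi:
#     def repeat(s: str, n: int) -> str: ...
#
# Now the same annotations are written inline in util.py itself:
from typing import Optional

def repeat(s: str, n: int, sep: Optional[str] = None) -> str:
    """Repeat `s` n times, optionally joined by `sep`."""
    return (sep or "").join([s] * n)
{code}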



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40271) Support list type for pyspark.sql.functions.lit

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597637#comment-17597637
 ] 

Apache Spark commented on SPARK-40271:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/37722

> Support list type for pyspark.sql.functions.lit
> ---
>
> Key: SPARK-40271
> URL: https://issues.apache.org/jira/browse/SPARK-40271
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently, `pyspark.sql.functions.lit` doesn't support the Python list type, 
> as shown below:
> {code:python}
> >>> df = spark.range(3).withColumn("c", lit([1,2,3]))
> Traceback (most recent call last):
> ...
> : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] 
> The feature is not supported: Literal for '[1, 2, 3]' of class 
> java.util.ArrayList.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302)
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100)
>   at org.apache.spark.sql.functions$.lit(functions.scala:125)
>   at org.apache.spark.sql.functions.lit(functions.scala)
>   at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:577)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at 
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>   at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> We should add support for it.
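
Until that is supported, a workaround sketch (assuming an active `spark` 
session) is to build the array column from per-element literals:

{code:python}
from pyspark.sql.functions import array, lit

# array() accepts Column arguments, so wrap each element in lit() instead of
# passing the Python list directly to lit().
df = spark.range(3).withColumn("c", array(*[lit(x) for x in [1, 2, 3]]))
df.show()
{code}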



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40271) Support list type for pyspark.sql.functions.lit

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597634#comment-17597634
 ] 

Apache Spark commented on SPARK-40271:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/37722

> Support list type for pyspark.sql.functions.lit
> ---
>
> Key: SPARK-40271
> URL: https://issues.apache.org/jira/browse/SPARK-40271
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently, `pyspark.sql.functions.lit` doesn't support the Python list type, 
> as shown below:
> {code:python}
> >>> df = spark.range(3).withColumn("c", lit([1,2,3]))
> Traceback (most recent call last):
> ...
> : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] 
> The feature is not supported: Literal for '[1, 2, 3]' of class 
> java.util.ArrayList.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302)
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100)
>   at org.apache.spark.sql.functions$.lit(functions.scala:125)
>   at org.apache.spark.sql.functions.lit(functions.scala)
>   at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:577)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at 
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>   at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> We should add support for it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


