[jira] [Comment Edited] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

2021-05-10 Thread Shubham Chaurasia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342311#comment-17342311
 ] 

Shubham Chaurasia edited comment on SPARK-34675 at 5/11/21, 5:53 AM:
-

Thanks for the previous investigations [~maxgekk] [~dongjoon].

I saw something strange on the latest master (up to commit 2c8ced95905b8a9f7b98c4913a385d25da455e7a).

I repeated the same experiment and found that if we pick a timezone T1 and set it either by changing the shell timezone with export TZ or by passing it through {{extraJavaOptions}}, we see inconsistent timestamp values across file formats, as shown below.

My system was in UTC.

1) I get the following values
{code}
user.timezone - America/Los_Angeles
TimeZone.getDefault - America/Los_Angeles
spark.sql.session.timeZone - America/Los_Angeles
+------------------------+-------------------+--------+
|type                    |timestamp          |millis  |
+------------------------+-------------------+--------+
|FROM BEELINE-EXT PARQUET|1989-01-04 16:00:00|59996160|
|FROM BEELINE-EXT ORC    |1989-01-05 00:00:00|5040    |
|FROM BEELINE-EXT AVRO   |1989-01-04 16:00:00|59996160|
|FROM BEELINE-EXT TEXT   |1989-01-05 00:00:00|5040    |
+------------------------+-------------------+--------+
{code}
when I either change the shell timezone like 
{code}
export TZ=America/Los_Angeles
{code}
or pass it using {{extraJavaOptions}} like
{code}
 bin/spark-shell --master local --conf 
spark.driver.extraJavaOptions='-Duser.timezone=America/Los_Angeles' --conf 
spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'
{code}

I tested not only with America/Los_Angeles but also with Asia/Kolkata, and saw the same behavior with the above steps.
Result with Asia/Kolkata -
{code}
user.timezone - Asia/Kolkata
TimeZone.getDefault - Asia/Kolkata
spark.sql.session.timeZone - Asia/Kolkata
+------------------------+-------------------+--------+
|type                    |timestamp          |millis  |
+------------------------+-------------------+--------+
|FROM BEELINE-EXT PARQUET|1989-01-05 05:30:00|59996160|
|FROM BEELINE-EXT ORC    |1989-01-05 00:00:00|59994180|
|FROM BEELINE-EXT AVRO   |1989-01-05 05:30:00|59996160|
|FROM BEELINE-EXT TEXT   |1989-01-05 00:00:00|59994180|
+------------------------+-------------------+--------+
{code}
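
For reference, the three timezone values compared above can be printed directly from the spark-shell; this is a minimal sketch of the same checks that the {{showTs}} helper in the issue description performs, assuming a running spark-shell session where {{spark}} is the active SparkSession:
{code:scala}
// Print the JVM default timezone and the Spark SQL session timezone
println("user.timezone - " + System.getProperty("user.timezone"))
println("TimeZone.getDefault - " + java.util.TimeZone.getDefault.getID)
println("spark.sql.session.timeZone - " + spark.conf.get("spark.sql.session.timeZone"))
{code}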

Reopening the JIRA to discuss this.


was (Author: shubhamchaurasia):
Thanks for the previous investigations [~maxgekk] [~dongjoon].

I saw something strange on the latest master. 

I was doing the same experiment and found that if we have say timezone T1 and 
we change either shell timezone using export TZ or if we pass using 
{{extraJavaOptions}}, we see different values of timezones.

My system was in UTC. 

1) I get following values 
{code}
user.timezone - America/Los_Angeles
TimeZone.getDefault - America/Los_Angeles
spark.sql.session.timeZone - America/Los_Angeles
++---++ 
|type|timestamp  |millis  |
++---++
|FROM BEELINE-EXT PARQUET|1989-01-04 16:00:00|59996160|
|FROM BEELINE-EXT ORC|1989-01-05 00:00:00|5040|
|FROM BEELINE-EXT AVRO   |1989-01-04 16:00:00|59996160|
|FROM BEELINE-EXT TEXT   |1989-01-05 00:00:00|5040|
++---++
{code}
when I either change the shell timezone like 
{code}
export TZ=America/Los_Angeles
{code}
or if I pass using extraJavaOptions like
{code}
 bin/spark-shell --master local --conf 
spark.driver.extraJavaOptions='-Duser.timezone=America/Los_Angeles' --conf 
spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'
{code}

Not only with America/Los_Angeles timezone, I tested with Asia/Kolkata as well 
and was seeing the same behavior with above steps.
Result with Asia/Kolkata - 
{code}
user.timezone - Asia/Kolkata
TimeZone.getDefault - Asia/Kolkata
spark.sql.session.timeZone - Asia/Kolkata
++---++ 
|type|timestamp  |millis  |
++---++
|FROM BEELINE-EXT PARQUET|1989-01-05 05:30:00|59996160|
|FROM BEELINE-EXT ORC|1989-01-05 00:00:00|59994180|
|FROM BEELINE-EXT AVRO   |1989-01-05 05:30:00|59996160|
|FROM BEELINE-EXT TEXT   |1989-01-05 00:00:00|59994180|
++---++
{code}

Reopening the jira to discuss the same.

> TimeZone inconsistencies when JVM and session timezones are different
> -
>
> Key: SPARK-34675
> URL: https://issues.apache.org/jira/browse/SPARK-34675
> Project: Spark

[jira] [Comment Edited] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

2021-05-10 Thread Shubham Chaurasia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342311#comment-17342311
 ] 

Shubham Chaurasia edited comment on SPARK-34675 at 5/11/21, 5:51 AM:
-

Thanks for the previous investigations [~maxgekk] [~dongjoon].

I saw something strange on the latest master. 

I repeated the same experiment and found that if we pick a timezone T1 and set it either by changing the shell timezone with export TZ or by passing it through {{extraJavaOptions}}, we see inconsistent timestamp values across file formats, as shown below.

My system was in UTC. 

1) I get following values 
{code}
user.timezone - America/Los_Angeles
TimeZone.getDefault - America/Los_Angeles
spark.sql.session.timeZone - America/Los_Angeles
++---++ 
|type|timestamp  |millis  |
++---++
|FROM BEELINE-EXT PARQUET|1989-01-04 16:00:00|59996160|
|FROM BEELINE-EXT ORC|1989-01-05 00:00:00|5040|
|FROM BEELINE-EXT AVRO   |1989-01-04 16:00:00|59996160|
|FROM BEELINE-EXT TEXT   |1989-01-05 00:00:00|5040|
++---++
{code}
when I either change the shell timezone like 
{code}
export TZ=America/Los_Angeles
{code}
or if I pass using extraJavaOptions like
{code}
 bin/spark-shell --master local --conf 
spark.driver.extraJavaOptions='-Duser.timezone=America/Los_Angeles' --conf 
spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'
{code}

Not only with America/Los_Angeles timezone, I tested with Asia/Kolkata as well 
and was seeing the same behavior with above steps.
Result with Asia/Kolkata - 
{code}
user.timezone - Asia/Kolkata
TimeZone.getDefault - Asia/Kolkata
spark.sql.session.timeZone - Asia/Kolkata
++---++ 
|type|timestamp  |millis  |
++---++
|FROM BEELINE-EXT PARQUET|1989-01-05 05:30:00|59996160|
|FROM BEELINE-EXT ORC|1989-01-05 00:00:00|59994180|
|FROM BEELINE-EXT AVRO   |1989-01-05 05:30:00|59996160|
|FROM BEELINE-EXT TEXT   |1989-01-05 00:00:00|59994180|
++---++
{code}

Reopening the jira to discuss the same.


was (Author: shubhamchaurasia):
Thanks for the previous investigations [~maxgekk] [~dongjoon].

I saw something strange on the latest master. 

I was doing the same experiment and found that if we have say timezone T1 and 
we change either shell timezone using export TZ or if we pass using 
{{extraJavaOptions}}, we see different values of timezones.

My system was in UTC. 

1) I get following values 
{code}
user.timezone - America/Los_Angeles
TimeZone.getDefault - America/Los_Angeles
spark.sql.session.timeZone - America/Los_Angeles
++---++ 
|type|timestamp  |millis  |
++---++
|FROM BEELINE-EXT PARQUET|1989-01-04 16:00:00|59996160|
|FROM BEELINE-EXT ORC|1989-01-05 00:00:00|5040|
|FROM BEELINE-EXT AVRO   |1989-01-04 16:00:00|59996160|
|FROM BEELINE-EXT TEXT   |1989-01-05 00:00:00|5040|
++---++
{code}
when I either change the shell timezone like 
{code}
export TZ=America/Los_Angeles
{code}
or if I pass using extraJavaOptions like
{code}
 bin/spark-shell --master local --conf 
spark.driver.extraJavaOptions='-Duser.timezone=America/Los_Angeles' --conf 
spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'
{code}

Not only with America/Los_Angeles timezone, I tested with Asia/Kolkata as well 
and was seeing the same behavior with above steps.
Result with Asia/Kolkata - 
{code}
user.timezone - Asia/Kolkata
TimeZone.getDefault - Asia/Kolkata
spark.sql.session.timeZone - Asia/Kolkata
++---++ 
|type|timestamp  |millis  |
++---++
|FROM BEELINE-EXT PARQUET|1989-01-05 05:30:00|59996160|
|FROM BEELINE-EXT ORC|1989-01-05 00:00:00|59994180|
|FROM BEELINE-EXT AVRO   |1989-01-05 05:30:00|59996160|
|FROM BEELINE-EXT TEXT   |1989-01-05 00:00:00|59994180|
++---++
{code}

> TimeZone inconsistencies when JVM and session timezones are different
> -
>
> Key: SPARK-34675
> URL: https://issues.apache.org/jira/browse/SPARK-34675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 

[jira] [Reopened] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

2021-05-10 Thread Shubham Chaurasia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubham Chaurasia reopened SPARK-34675:
---

> TimeZone inconsistencies when JVM and session timezones are different
> -
>
> Key: SPARK-34675
> URL: https://issues.apache.org/jira/browse/SPARK-34675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7
>Reporter: Shubham Chaurasia
>Priority: Major
> Fix For: 3.2.0
>
>
> Inserted following data with UTC as both JVM and session timezone.
> Spark-shell launch command
> {code}
> bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf 
> spark.sql.catalogImplementation=hive --conf 
> spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf 
> spark.driver.extraJavaOptions=' -Duser.timezone=UTC' --conf 
> spark.executor.extraJavaOptions='-Duser.timezone=UTC'
> {code}
> Table creation  
> {code:scala}
> sql("use ts").show
> sql("create table spark_parquet(type string, t timestamp) stored as 
> parquet").show
> sql("create table spark_orc(type string, t timestamp) stored as orc").show
> sql("create table spark_avro(type string, t timestamp) stored as avro").show
> sql("create table spark_text(type string, t timestamp) stored as 
> textfile").show
> sql("insert into spark_parquet values ('FROM SPARK-EXT PARQUET', '1989-01-05 
> 01:02:03')").show
> sql("insert into spark_orc values ('FROM SPARK-EXT ORC', '1989-01-05 
> 01:02:03')").show
> sql("insert into spark_avro values ('FROM SPARK-EXT AVRO', '1989-01-05 
> 01:02:03')").show
> sql("insert into spark_text values ('FROM SPARK-EXT TEXT', '1989-01-05 
> 01:02:03')").show
> {code}
> Used following function to check and verify the returned timestamps
> {code:scala}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> def showTs(
> db: String,
> tables: String*
> ): org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = {
>   sql("use " + db).show
>   import scala.collection.mutable.ListBuffer
>   var results = new ListBuffer[org.apache.spark.sql.DataFrame]()
>   for (tbl <- tables) {
> val query = "select * from " + tbl
> println("Executing - " + query);
> results += sql(query)
>   }
>   println("user.timezone - " + System.getProperty("user.timezone"))
>   println("TimeZone.getDefault - " + java.util.TimeZone.getDefault.getID)
>   println("spark.sql.session.timeZone - " + 
> spark.conf.get("spark.sql.session.timeZone"))
>   var unionDf = results(0)
>   for (i <- 1 until results.length) {
> unionDf = unionDf.unionAll(results(i))
>   }
>   val augmented = unionDf.map(r => (r.getString(0), r.getTimestamp(1), 
> r.getTimestamp(1).getTime))
>   val renamed = augmented.withColumnRenamed("_1", 
> "type").withColumnRenamed("_2", "ts").withColumnRenamed("_3", "millis")
> renamed.show(false)
>   return renamed
> }
> // Exiting paste mode, now interpreting.
> scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
> Hive Session ID = daa82b83-b50d-4038-97ee-1ecb2d01b368
> ++
> ||
> ++
> ++
> Executing - select * from spark_parquet
> Executing - select * from spark_orc
> Executing - select * from spark_avro
> Executing - select * from spark_text
> user.timezone - UTC
> TimeZone.getDefault - UTC
> spark.sql.session.timeZone - UTC
> +--+---++ 
>   
> |type  |ts |millis  |
> +--+---++
> |FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT ORC|1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599965323000|
> +--+---++
> {code}
> 1. Set session timezone to America/Los_Angeles
> {code:scala}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
> ++
> ||
> ++
> ++
> Executing - select * from spark_parquet
> Executing - select * from spark_orc
> Executing - select * from spark_avro
> Executing - select * from spark_text
> user.timezone - UTC
> TimeZone.getDefault - UTC
> spark.sql.session.timeZone - America/Los_Angeles
> +--+---++
> |type  |ts |millis  |
> +--+---++
> |FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT ORC|1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT AVRO   |1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT TEXT   |1989-01-04 17:02:03|599965323000|
> 
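
As a cross-check of the numbers above (not part of the original report): 599965323000 ms after the epoch is the same instant rendered in both zones, which matches the two {{ts}} columns. A minimal java.time sketch:
{code:scala}
import java.time.{Instant, ZoneId}

// The millis value shown in the tables above, rendered in UTC and in America/Los_Angeles
val instant = Instant.ofEpochMilli(599965323000L)
println(instant.atZone(ZoneId.of("UTC")))                 // 1989-01-05T01:02:03Z[UTC]
println(instant.atZone(ZoneId.of("America/Los_Angeles"))) // 1989-01-04T17:02:03-08:00[America/Los_Angeles]
{code}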

[jira] [Updated] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

2021-05-10 Thread Shubham Chaurasia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubham Chaurasia updated SPARK-34675:
--
Fix Version/s: (was: 3.2.0)

> TimeZone inconsistencies when JVM and session timezones are different
> -
>
> Key: SPARK-34675
> URL: https://issues.apache.org/jira/browse/SPARK-34675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7
>Reporter: Shubham Chaurasia
>Priority: Major
>
> Inserted following data with UTC as both JVM and session timezone.
> Spark-shell launch command
> {code}
> bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf 
> spark.sql.catalogImplementation=hive --conf 
> spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf 
> spark.driver.extraJavaOptions=' -Duser.timezone=UTC' --conf 
> spark.executor.extraJavaOptions='-Duser.timezone=UTC'
> {code}
> Table creation  
> {code:scala}
> sql("use ts").show
> sql("create table spark_parquet(type string, t timestamp) stored as 
> parquet").show
> sql("create table spark_orc(type string, t timestamp) stored as orc").show
> sql("create table spark_avro(type string, t timestamp) stored as avro").show
> sql("create table spark_text(type string, t timestamp) stored as 
> textfile").show
> sql("insert into spark_parquet values ('FROM SPARK-EXT PARQUET', '1989-01-05 
> 01:02:03')").show
> sql("insert into spark_orc values ('FROM SPARK-EXT ORC', '1989-01-05 
> 01:02:03')").show
> sql("insert into spark_avro values ('FROM SPARK-EXT AVRO', '1989-01-05 
> 01:02:03')").show
> sql("insert into spark_text values ('FROM SPARK-EXT TEXT', '1989-01-05 
> 01:02:03')").show
> {code}
> Used following function to check and verify the returned timestamps
> {code:scala}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> def showTs(
> db: String,
> tables: String*
> ): org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = {
>   sql("use " + db).show
>   import scala.collection.mutable.ListBuffer
>   var results = new ListBuffer[org.apache.spark.sql.DataFrame]()
>   for (tbl <- tables) {
> val query = "select * from " + tbl
> println("Executing - " + query);
> results += sql(query)
>   }
>   println("user.timezone - " + System.getProperty("user.timezone"))
>   println("TimeZone.getDefault - " + java.util.TimeZone.getDefault.getID)
>   println("spark.sql.session.timeZone - " + 
> spark.conf.get("spark.sql.session.timeZone"))
>   var unionDf = results(0)
>   for (i <- 1 until results.length) {
> unionDf = unionDf.unionAll(results(i))
>   }
>   val augmented = unionDf.map(r => (r.getString(0), r.getTimestamp(1), 
> r.getTimestamp(1).getTime))
>   val renamed = augmented.withColumnRenamed("_1", 
> "type").withColumnRenamed("_2", "ts").withColumnRenamed("_3", "millis")
> renamed.show(false)
>   return renamed
> }
> // Exiting paste mode, now interpreting.
> scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
> Hive Session ID = daa82b83-b50d-4038-97ee-1ecb2d01b368
> ++
> ||
> ++
> ++
> Executing - select * from spark_parquet
> Executing - select * from spark_orc
> Executing - select * from spark_avro
> Executing - select * from spark_text
> user.timezone - UTC
> TimeZone.getDefault - UTC
> spark.sql.session.timeZone - UTC
> +--+---++ 
>   
> |type  |ts |millis  |
> +--+---++
> |FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT ORC|1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599965323000|
> +--+---++
> {code}
> 1. Set session timezone to America/Los_Angeles
> {code:scala}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
> ++
> ||
> ++
> ++
> Executing - select * from spark_parquet
> Executing - select * from spark_orc
> Executing - select * from spark_avro
> Executing - select * from spark_text
> user.timezone - UTC
> TimeZone.getDefault - UTC
> spark.sql.session.timeZone - America/Los_Angeles
> +--+---++
> |type  |ts |millis  |
> +--+---++
> |FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT ORC|1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT AVRO   |1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT TEXT   |1989-01-04 17:02:03|599965323000|
> 

[jira] [Commented] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

2021-05-10 Thread Shubham Chaurasia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342311#comment-17342311
 ] 

Shubham Chaurasia commented on SPARK-34675:
---

Thanks for the previous investigations [~maxgekk] [~dongjoon].

I saw something strange on the latest master. 

I repeated the same experiment and found that if we pick a timezone T1 and set it either by changing the shell timezone with export TZ or by passing it through {{extraJavaOptions}}, we see inconsistent timestamp values across file formats, as shown below.

My system was in UTC. 

1) I get following values 
{code}
user.timezone - America/Los_Angeles
TimeZone.getDefault - America/Los_Angeles
spark.sql.session.timeZone - America/Los_Angeles
++---++ 
|type|timestamp  |millis  |
++---++
|FROM BEELINE-EXT PARQUET|1989-01-04 16:00:00|59996160|
|FROM BEELINE-EXT ORC|1989-01-05 00:00:00|5040|
|FROM BEELINE-EXT AVRO   |1989-01-04 16:00:00|59996160|
|FROM BEELINE-EXT TEXT   |1989-01-05 00:00:00|5040|
++---++
{code}
when I either change the shell timezone like 
{code}
export TZ=America/Los_Angeles
{code}
or if I pass using extraJavaOptions like
{code}
 bin/spark-shell --master local --conf 
spark.driver.extraJavaOptions='-Duser.timezone=America/Los_Angeles' --conf 
spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'
{code}

Not only with America/Los_Angeles timezone, I tested with Asia/Kolkata as well 
and was seeing the same behavior with above steps.
Result with Asia/Kolkata - 
{code}
user.timezone - Asia/Kolkata
TimeZone.getDefault - Asia/Kolkata
spark.sql.session.timeZone - Asia/Kolkata
++---++ 
|type|timestamp  |millis  |
++---++
|FROM BEELINE-EXT PARQUET|1989-01-05 05:30:00|59996160|
|FROM BEELINE-EXT ORC|1989-01-05 00:00:00|59994180|
|FROM BEELINE-EXT AVRO   |1989-01-05 05:30:00|59996160|
|FROM BEELINE-EXT TEXT   |1989-01-05 00:00:00|59994180|
++---++
{code}

> TimeZone inconsistencies when JVM and session timezones are different
> -
>
> Key: SPARK-34675
> URL: https://issues.apache.org/jira/browse/SPARK-34675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7
>Reporter: Shubham Chaurasia
>Priority: Major
> Fix For: 3.2.0
>
>
> Inserted following data with UTC as both JVM and session timezone.
> Spark-shell launch command
> {code}
> bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf 
> spark.sql.catalogImplementation=hive --conf 
> spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf 
> spark.driver.extraJavaOptions=' -Duser.timezone=UTC' --conf 
> spark.executor.extraJavaOptions='-Duser.timezone=UTC'
> {code}
> Table creation  
> {code:scala}
> sql("use ts").show
> sql("create table spark_parquet(type string, t timestamp) stored as 
> parquet").show
> sql("create table spark_orc(type string, t timestamp) stored as orc").show
> sql("create table spark_avro(type string, t timestamp) stored as avro").show
> sql("create table spark_text(type string, t timestamp) stored as 
> textfile").show
> sql("insert into spark_parquet values ('FROM SPARK-EXT PARQUET', '1989-01-05 
> 01:02:03')").show
> sql("insert into spark_orc values ('FROM SPARK-EXT ORC', '1989-01-05 
> 01:02:03')").show
> sql("insert into spark_avro values ('FROM SPARK-EXT AVRO', '1989-01-05 
> 01:02:03')").show
> sql("insert into spark_text values ('FROM SPARK-EXT TEXT', '1989-01-05 
> 01:02:03')").show
> {code}
> Used following function to check and verify the returned timestamps
> {code:scala}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> def showTs(
> db: String,
> tables: String*
> ): org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = {
>   sql("use " + db).show
>   import scala.collection.mutable.ListBuffer
>   var results = new ListBuffer[org.apache.spark.sql.DataFrame]()
>   for (tbl <- tables) {
> val query = "select * from " + tbl
> println("Executing - " + query);
> results += sql(query)
>   }
>   println("user.timezone - " + System.getProperty("user.timezone"))
>   println("TimeZone.getDefault - " + java.util.TimeZone.getDefault.getID)
>   println("spark.sql.session.timeZone - " + 
> spark.conf.get("spark.sql.session.timeZone"))
>   var unionDf = results(0)
>   for (i <- 1 until results.length) {
> unionDf = unionDf.unionAll(results(i))
>   }
>   val augmented = unionDf.map(r => 

[jira] [Commented] (SPARK-35366) Avoid using deprecated `buildForBatch` and `buildForStreaming`

2021-05-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342292#comment-17342292
 ] 

Apache Spark commented on SPARK-35366:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/32497

> Avoid using deprecated `buildForBatch` and `buildForStreaming`
> --
>
> Key: SPARK-35366
> URL: https://issues.apache.org/jira/browse/SPARK-35366
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2
>Reporter: Linhong Liu
>Priority: Major
>
> In DSv2 we are still using the deprecated functions; we need to avoid this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35366) Avoid using deprecated `buildForBatch` and `buildForStreaming`

2021-05-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342291#comment-17342291
 ] 

Apache Spark commented on SPARK-35366:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/32497

> Avoid using deprecated `buildForBatch` and `buildForStreaming`
> --
>
> Key: SPARK-35366
> URL: https://issues.apache.org/jira/browse/SPARK-35366
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2
>Reporter: Linhong Liu
>Priority: Major
>
> In DSv2 we are still using the deprecated functions; we need to avoid this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35366) Avoid using deprecated `buildForBatch` and `buildForStreaming`

2021-05-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35366:


Assignee: Apache Spark

> Avoid using deprecated `buildForBatch` and `buildForStreaming`
> --
>
> Key: SPARK-35366
> URL: https://issues.apache.org/jira/browse/SPARK-35366
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2
>Reporter: Linhong Liu
>Assignee: Apache Spark
>Priority: Major
>
> In DSv2 we are still using the deprecated functions; we need to avoid this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35366) Avoid using deprecated `buildForBatch` and `buildForStreaming`

2021-05-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35366:


Assignee: (was: Apache Spark)

> Avoid using deprecated `buildForBatch` and `buildForStreaming`
> --
>
> Key: SPARK-35366
> URL: https://issues.apache.org/jira/browse/SPARK-35366
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2
>Reporter: Linhong Liu
>Priority: Major
>
> In DSv2 we are still using the deprecated functions; we need to avoid this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35366) Avoid using deprecated `buildForBatch` and `buildForStreaming`

2021-05-10 Thread Linhong Liu (Jira)
Linhong Liu created SPARK-35366:
---

 Summary: Avoid using deprecated `buildForBatch` and 
`buildForStreaming`
 Key: SPARK-35366
 URL: https://issues.apache.org/jira/browse/SPARK-35366
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2
Reporter: Linhong Liu


In DSv2 we are still using the deprecated functions; we need to avoid this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35365) spark3.1.1 use too long to analyze table fields

2021-05-10 Thread yao (Jira)
yao created SPARK-35365:
---

 Summary: spark3.1.1 use too long to analyze table fields
 Key: SPARK-35365
 URL: https://issues.apache.org/jira/browse/SPARK-35365
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: yao


I have a big SQL query that joins a few wide tables with complex logic. When I run it in Spark 2.4, the analyze phase takes 20 minutes; when I use Spark 3.1.1, it takes about 40 minutes.

I need to set spark.sql.analyzer.maxIterations=1000 in Spark 3.1.1,

or spark.sql.optimizer.maxIterations=1000 in Spark 2.4.

There is no other special setting for this.

When I check the Spark UI, I find that no job is generated and no executor has active tasks. When I set the log level to debug, I find that the query is stuck in the analyze phase, analyzing the field references.

This phase takes too long.
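
For completeness, a minimal sketch of how these limits can be raised at launch, in the same style as the --conf options used elsewhere in this digest (the exact invocation is an assumption, not stated in the report):
{code}
# Spark 3.1.1 (assumed spark-shell invocation)
bin/spark-shell --conf spark.sql.analyzer.maxIterations=1000

# Spark 2.4 (assumed spark-shell invocation)
bin/spark-shell --conf spark.sql.optimizer.maxIterations=1000
{code}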



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30186) support Dynamic Partition Pruning in Adaptive Execution

2021-05-10 Thread weixiuli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342266#comment-17342266
 ] 

weixiuli commented on SPARK-30186:
--

https://github.com/apache/spark/pull/31941

> support Dynamic Partition Pruning in Adaptive Execution
> ---
>
> Key: SPARK-30186
> URL: https://issues.apache.org/jira/browse/SPARK-30186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiaoju Wu
>Priority: Major
>
> Currently Adaptive Execution cannot work if Dynamic Partition Pruning is 
> applied.
> private def supportAdaptive(plan: SparkPlan): Boolean = {
>  // TODO migrate dynamic-partition-pruning onto adaptive execution.
>  sanityCheck(plan) &&
>  !plan.logicalLink.exists(_.isStreaming) &&
>  *!plan.expressions.exists(_.find(_.isInstanceOf[DynamicPruningSubquery]).isDefined)* &&
>  plan.children.forall(supportAdaptive)
> }
> It means we cannot get the performance benefits of both AE and DPP.
> This ticket targets making DPP + AE work together.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35364) Renaming the existing Koalas related codes.

2021-05-10 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342262#comment-17342262
 ] 

Haejoon Lee commented on SPARK-35364:
-

I'm working on this

> Renaming the existing Koalas related codes.
> ---
>
> Key: SPARK-35364
> URL: https://issues.apache.org/jira/browse/SPARK-35364
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should rename several Koalas-related names in the pandas APIs on Spark:
>  * kdf -> psdf
>  * kser -> psser
>  * kidx -> psidx
>  * kmidx -> psmidx
>  * sdf.to_koalas() -> sdf.to_pandas_on_spark()



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35364) Renaming the existing Koalas related codes.

2021-05-10 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-35364:
---

 Summary: Renaming the existing Koalas related codes.
 Key: SPARK-35364
 URL: https://issues.apache.org/jira/browse/SPARK-35364
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Haejoon Lee


We should rename several Koalas-related names in the pandas APIs on Spark:
 * kdf -> psdf
 * kser -> psser
 * kidx -> psidx
 * kmidx -> psmidx
 * sdf.to_koalas() -> sdf.to_pandas_on_spark()



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35363) Refactor sort merge join code-gen be agnostic to join type

2021-05-10 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-35363.
--
Fix Version/s: 3.2.0
 Assignee: Cheng Su
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/32495

> Refactor sort merge join code-gen be agnostic to join type
> --
>
> Key: SPARK-35363
> URL: https://issues.apache.org/jira/browse/SPARK-35363
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.2.0
>
>
> This is a pre-requisite of [https://github.com/apache/spark/pull/32476,] in 
> discussion of 
> [https://github.com/apache/spark/pull/32476#issuecomment-836469779] . This is 
> to refactor sort merge join code-gen to depend on streamed/buffered 
> terminology, which makes the code-gen agnostic to different join types and 
> can be extended to support other join types than inner join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35359) Insert data with char/varchar datatype will fail when data length exceed length limitation

2021-05-10 Thread YuanGuanhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342233#comment-17342233
 ] 

YuanGuanhu commented on SPARK-35359:


I'd like to work on this.

> Insert data with char/varchar datatype will fail when data length exceed 
> length limitation
> --
>
> Key: SPARK-35359
> URL: https://issues.apache.org/jira/browse/SPARK-35359
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: YuanGuanhu
>Priority: Major
>
> Spark 3.1.1 supports the char/varchar types, but inserting data into a char/varchar column fails when the data length exceeds the length limitation, even when spark.sql.legacy.charVarcharAsString is true.
> reproduce:
> create table chartb01(a char(3));
> insert into chartb01 select 'a';



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35079) Transform with udf gives incorrect result

2021-05-10 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-35079.
--
Fix Version/s: 3.2.0
   3.1.2
   Resolution: Fixed

> Transform with udf gives incorrect result
> -
>
> Key: SPARK-35079
> URL: https://issues.apache.org/jira/browse/SPARK-35079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: koert kuipers
>Priority: Minor
> Fix For: 3.1.2, 3.2.0
>
>
> i think this is a correctness bug in spark 3.1.1
> the behavior is correct in spark 3.0.1
> in spark 3.0.1:
> {code:java}
> scala> import spark.implicits._
> scala> import org.apache.spark.sql.functions._
> scala> val x = Seq(Seq("aa", "bb", "cc")).toDF
> x: org.apache.spark.sql.DataFrame = [value: array]
> scala> x.select(transform(col("value"), col => udf((_: 
> String).drop(1)).apply(col))).show
> +---+
> |transform(value, lambdafunction(UDF(lambda 'x), x))|
> +---+
> |  [a, b, c]|
> +---+
> {code}
> in spark 3.1.1:
> {code:java}
> scala> import spark.implicits._
> scala> import org.apache.spark.sql.functions._
> scala> val x = Seq(Seq("aa", "bb", "cc")).toDF
> x: org.apache.spark.sql.DataFrame = [value: array]
> scala> x.select(transform(col("value"), col => udf((_: 
> String).drop(1)).apply(col))).show
> +---+
> |transform(value, lambdafunction(UDF(lambda 'x), x))|
> +---+
> |  [c, c, c]|
> +---+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35079) Transform with udf gives incorrect result

2021-05-10 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342230#comment-17342230
 ] 

Takeshi Yamamuro commented on SPARK-35079:
--

I've checked it, and the issue has already been resolved in the latest branch-3.1:
{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2-SNAPSHOT
      /_/
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import spark.implicits._

scala> import org.apache.spark.sql.functions._

scala> val x = Seq(Seq("aa", "bb", "cc")).toDF
x: org.apache.spark.sql.DataFrame = [value: array<string>]


scala> x.select(transform(col("value"), col => udf((_: 
String).drop(1)).apply(col))).show
+---+
|transform(value, lambdafunction(UDF(lambda 'x_0), x_0))|
+---+
|                                              [a, b, c]|
+---+
 {code}
So, I will close this. Anyway, thank you for the report.

> Transform with udf gives incorrect result
> -
>
> Key: SPARK-35079
> URL: https://issues.apache.org/jira/browse/SPARK-35079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: koert kuipers
>Priority: Minor
>
> i think this is a correctness bug in spark 3.1.1
> the behavior is correct in spark 3.0.1
> in spark 3.0.1:
> {code:java}
> scala> import spark.implicits._
> scala> import org.apache.spark.sql.functions._
> scala> val x = Seq(Seq("aa", "bb", "cc")).toDF
> x: org.apache.spark.sql.DataFrame = [value: array]
> scala> x.select(transform(col("value"), col => udf((_: 
> String).drop(1)).apply(col))).show
> +---+
> |transform(value, lambdafunction(UDF(lambda 'x), x))|
> +---+
> |  [a, b, c]|
> +---+
> {code}
> in spark 3.1.1:
> {code:java}
> scala> import spark.implicits._
> scala> import org.apache.spark.sql.functions._
> scala> val x = Seq(Seq("aa", "bb", "cc")).toDF
> x: org.apache.spark.sql.DataFrame = [value: array]
> scala> x.select(transform(col("value"), col => udf((_: 
> String).drop(1)).apply(col))).show
> +---+
> |transform(value, lambdafunction(UDF(lambda 'x), x))|
> +---+
> |  [c, c, c]|
> +---+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35301) Document migration guide from Koalas to pandas APIs on Spark

2021-05-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-35301:
-
Summary: Document migration guide from Koalas to pandas APIs on Spark  
(was: Document migration from Koalas to pandas APIs on Spark)

> Document migration guide from Koalas to pandas APIs on Spark
> 
>
> Key: SPARK-35301
> URL: https://issues.apache.org/jira/browse/SPARK-35301
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA aims to document the migration from the third-party Koalas package to Apache Spark's pandas APIs on Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35207) hash() and other hash builtins do not normalize negative zero

2021-05-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35207:


Assignee: (was: Apache Spark)

> hash() and other hash builtins do not normalize negative zero
> -
>
> Key: SPARK-35207
> URL: https://issues.apache.org/jira/browse/SPARK-35207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Tim Armstrong
>Priority: Major
>  Labels: correctness
>
> I would generally expect that {{x = y => hash( x ) = hash( y )}}. However +-0 
> hash to different values for floating point types. 
> {noformat}
> scala> spark.sql("select hash(cast('0.0' as double)), hash(cast('-0.0' as 
> double))").show
> +-+--+
> |hash(CAST(0.0 AS DOUBLE))|hash(CAST(-0.0 AS DOUBLE))|
> +-+--+
> |  -1670924195|-853646085|
> +-+--+
> scala> spark.sql("select cast('0.0' as double) == cast('-0.0' as 
> double)").show
> ++
> |(CAST(0.0 AS DOUBLE) = CAST(-0.0 AS DOUBLE))|
> ++
> |true|
> ++
> {noformat}
> I'm not sure how likely this is to cause issues in practice, since only a 
> limited number of calculations can produce -0 and joining or aggregating with 
> floating point keys is a bad practice as a general rule, but I think it would 
> be safer if we normalised -0.0 to +0.0.
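
A possible user-side workaround, sketched under the assumption that normalizing before hashing is acceptable (this is not the fix proposed in the ticket):
{code:scala}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(0.0, -0.0).toDF("d")
// Map -0.0 to +0.0 before hashing; since -0.0 = 0.0 evaluates to true (as shown above),
// both zeros fall into the `when` branch and hash to the same value
val normalized = when($"d" === 0.0, lit(0.0)).otherwise($"d")
df.select($"d", hash($"d").as("raw_hash"), hash(normalized).as("normalized_hash")).show()
{code}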



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35207) hash() and other hash builtins do not normalize negative zero

2021-05-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35207:


Assignee: Apache Spark

> hash() and other hash builtins do not normalize negative zero
> -
>
> Key: SPARK-35207
> URL: https://issues.apache.org/jira/browse/SPARK-35207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Tim Armstrong
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
>
> I would generally expect that {{x = y => hash( x ) = hash( y )}}. However +-0 
> hash to different values for floating point types. 
> {noformat}
> scala> spark.sql("select hash(cast('0.0' as double)), hash(cast('-0.0' as 
> double))").show
> +-+--+
> |hash(CAST(0.0 AS DOUBLE))|hash(CAST(-0.0 AS DOUBLE))|
> +-+--+
> |  -1670924195|-853646085|
> +-+--+
> scala> spark.sql("select cast('0.0' as double) == cast('-0.0' as 
> double)").show
> ++
> |(CAST(0.0 AS DOUBLE) = CAST(-0.0 AS DOUBLE))|
> ++
> |true|
> ++
> {noformat}
> I'm not sure how likely this is to cause issues in practice, since only a 
> limited number of calculations can produce -0 and joining or aggregating with 
> floating point keys is a bad practice as a general rule, but I think it would 
> be safer if we normalised -0.0 to +0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35207) hash() and other hash builtins do not normalize negative zero

2021-05-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342207#comment-17342207
 ] 

Apache Spark commented on SPARK-35207:
--

User 'planga82' has created a pull request for this issue:
https://github.com/apache/spark/pull/32496

> hash() and other hash builtins do not normalize negative zero
> -
>
> Key: SPARK-35207
> URL: https://issues.apache.org/jira/browse/SPARK-35207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Tim Armstrong
>Priority: Major
>  Labels: correctness
>
> I would generally expect that {{x = y => hash( x ) = hash( y )}}. However +-0 
> hash to different values for floating point types. 
> {noformat}
> scala> spark.sql("select hash(cast('0.0' as double)), hash(cast('-0.0' as 
> double))").show
> +-+--+
> |hash(CAST(0.0 AS DOUBLE))|hash(CAST(-0.0 AS DOUBLE))|
> +-+--+
> |  -1670924195|-853646085|
> +-+--+
> scala> spark.sql("select cast('0.0' as double) == cast('-0.0' as 
> double)").show
> ++
> |(CAST(0.0 AS DOUBLE) = CAST(-0.0 AS DOUBLE))|
> ++
> |true|
> ++
> {noformat}
> I'm not sure how likely this is to cause issues in practice, since only a 
> limited number of calculations can produce -0 and joining or aggregating with 
> floating point keys is a bad practice as a general rule, but I think it would 
> be safer if we normalised -0.0 to +0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35079) Transform with udf gives incorrect result

2021-05-10 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342182#comment-17342182
 ] 

Takeshi Yamamuro commented on SPARK-35079:
--

Could you check if branch-3.1 has the issue?

> Transform with udf gives incorrect result
> -
>
> Key: SPARK-35079
> URL: https://issues.apache.org/jira/browse/SPARK-35079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: koert kuipers
>Priority: Minor
>
> i think this is a correctness bug in spark 3.1.1
> the behavior is correct in spark 3.0.1
> in spark 3.0.1:
> {code:java}
> scala> import spark.implicits._
> scala> import org.apache.spark.sql.functions._
> scala> val x = Seq(Seq("aa", "bb", "cc")).toDF
> x: org.apache.spark.sql.DataFrame = [value: array]
> scala> x.select(transform(col("value"), col => udf((_: 
> String).drop(1)).apply(col))).show
> +---+
> |transform(value, lambdafunction(UDF(lambda 'x), x))|
> +---+
> |  [a, b, c]|
> +---+
> {code}
> in spark 3.1.1:
> {code:java}
> scala> import spark.implicits._
> scala> import org.apache.spark.sql.functions._
> scala> val x = Seq(Seq("aa", "bb", "cc")).toDF
> x: org.apache.spark.sql.DataFrame = [value: array]
> scala> x.select(transform(col("value"), col => udf((_: 
> String).drop(1)).apply(col))).show
> +---+
> |transform(value, lambdafunction(UDF(lambda 'x), x))|
> +---+
> |  [c, c, c]|
> +---+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35363) Refactor sort merge join code-gen be agnostic to join type

2021-05-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342177#comment-17342177
 ] 

Apache Spark commented on SPARK-35363:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/32495

> Refactor sort merge join code-gen be agnostic to join type
> --
>
> Key: SPARK-35363
> URL: https://issues.apache.org/jira/browse/SPARK-35363
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> This is a pre-requisite of [https://github.com/apache/spark/pull/32476,] in 
> discussion of 
> [https://github.com/apache/spark/pull/32476#issuecomment-836469779] . This is 
> to refactor sort merge join code-gen to depend on streamed/buffered 
> terminology, which makes the code-gen agnostic to different join types and 
> can be extended to support other join types than inner join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35363) Refactor sort merge join code-gen be agnostic to join type

2021-05-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35363:


Assignee: (was: Apache Spark)

> Refactor sort merge join code-gen be agnostic to join type
> --
>
> Key: SPARK-35363
> URL: https://issues.apache.org/jira/browse/SPARK-35363
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> This is a pre-requisite of [https://github.com/apache/spark/pull/32476,] in 
> discussion of 
> [https://github.com/apache/spark/pull/32476#issuecomment-836469779] . This is 
> to refactor sort merge join code-gen to depend on streamed/buffered 
> terminology, which makes the code-gen agnostic to different join types and 
> can be extended to support other join types than inner join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35363) Refactor sort merge join code-gen be agnostic to join type

2021-05-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35363:


Assignee: Apache Spark

> Refactor sort merge join code-gen be agnostic to join type
> --
>
> Key: SPARK-35363
> URL: https://issues.apache.org/jira/browse/SPARK-35363
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Minor
>
> This is a pre-requisite of [https://github.com/apache/spark/pull/32476,] in 
> discussion of 
> [https://github.com/apache/spark/pull/32476#issuecomment-836469779] . This is 
> to refactor sort merge join code-gen to depend on streamed/buffered 
> terminology, which makes the code-gen agnostic to different join types and 
> can be extended to support other join types than inner join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35363) Refactor sort merge join code-gen be agnostic to join type

2021-05-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342175#comment-17342175
 ] 

Apache Spark commented on SPARK-35363:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/32495

> Refactor sort merge join code-gen be agnostic to join type
> --
>
> Key: SPARK-35363
> URL: https://issues.apache.org/jira/browse/SPARK-35363
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> This is a pre-requisite of [https://github.com/apache/spark/pull/32476,] in 
> discussion of 
> [https://github.com/apache/spark/pull/32476#issuecomment-836469779] . This is 
> to refactor sort merge join code-gen to depend on streamed/buffered 
> terminology, which makes the code-gen agnostic to different join types and 
> can be extended to support other join types than inner join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35363) Refactor sort merge join code-gen be agnostic to join type

2021-05-10 Thread Cheng Su (Jira)
Cheng Su created SPARK-35363:


 Summary: Refactor sort merge join code-gen be agnostic to join type
 Key: SPARK-35363
 URL: https://issues.apache.org/jira/browse/SPARK-35363
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Cheng Su


This is a pre-requisite of [https://github.com/apache/spark/pull/32476,] in 
discussion of 
[https://github.com/apache/spark/pull/32476#issuecomment-836469779] . This is 
to refactor sort merge join code-gen to depend on streamed/buffered 
terminology, which makes the code-gen agnostic to different join types and can 
be extended to support other join types than inner join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-35079) Transform with udf gives incorrect result

2021-05-10 Thread shahid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342171#comment-17342171
 ] 

shahid edited comment on SPARK-35079 at 5/10/21, 10:29 PM:
---

Seems it is not reproducible with the master branch:

{code:java}
+-+
|transform(value, lambdafunction(UDF(lambda x_0#3993), namedlambdavariable()))|
+-+
|[a, b, c]|
+-+
{code}




was (Author: shahid):
Seems It is not reproducible with master branch?
+-+
|transform(value, lambdafunction(UDF(lambda x_0#3993), namedlambdavariable()))|
+-+
|[a, b, c]|
+-+



> Transform with udf gives incorrect result
> -
>
> Key: SPARK-35079
> URL: https://issues.apache.org/jira/browse/SPARK-35079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: koert kuipers
>Priority: Minor
>
> i think this is a correctness bug in spark 3.1.1
> the behavior is correct in spark 3.0.1
> in spark 3.0.1:
> {code:java}
> scala> import spark.implicits._
> scala> import org.apache.spark.sql.functions._
> scala> val x = Seq(Seq("aa", "bb", "cc")).toDF
> x: org.apache.spark.sql.DataFrame = [value: array]
> scala> x.select(transform(col("value"), col => udf((_: 
> String).drop(1)).apply(col))).show
> +---+
> |transform(value, lambdafunction(UDF(lambda 'x), x))|
> +---+
> |  [a, b, c]|
> +---+
> {code}
> in spark 3.1.1:
> {code:java}
> scala> import spark.implicits._
> scala> import org.apache.spark.sql.functions._
> scala> val x = Seq(Seq("aa", "bb", "cc")).toDF
> x: org.apache.spark.sql.DataFrame = [value: array]
> scala> x.select(transform(col("value"), col => udf((_: 
> String).drop(1)).apply(col))).show
> +---+
> |transform(value, lambdafunction(UDF(lambda 'x), x))|
> +---+
> |  [c, c, c]|
> +---+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35079) Transform with udf gives incorrect result

2021-05-10 Thread shahid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342171#comment-17342171
 ] 

shahid commented on SPARK-35079:


Seems It is not reproducible with master branch?
+-+
|transform(value, lambdafunction(UDF(lambda x_0#3993), namedlambdavariable()))|
+-+
|[a, b, c]|
+-+



> Transform with udf gives incorrect result
> -
>
> Key: SPARK-35079
> URL: https://issues.apache.org/jira/browse/SPARK-35079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: koert kuipers
>Priority: Minor
>
> i think this is a correctness bug in spark 3.1.1
> the behavior is correct in spark 3.0.1
> in spark 3.0.1:
> {code:java}
> scala> import spark.implicits._
> scala> import org.apache.spark.sql.functions._
> scala> val x = Seq(Seq("aa", "bb", "cc")).toDF
> x: org.apache.spark.sql.DataFrame = [value: array]
> scala> x.select(transform(col("value"), col => udf((_: 
> String).drop(1)).apply(col))).show
> +---+
> |transform(value, lambdafunction(UDF(lambda 'x), x))|
> +---+
> |  [a, b, c]|
> +---+
> {code}
> in spark 3.1.1:
> {code:java}
> scala> import spark.implicits._
> scala> import org.apache.spark.sql.functions._
> scala> val x = Seq(Seq("aa", "bb", "cc")).toDF
> x: org.apache.spark.sql.DataFrame = [value: array]
> scala> x.select(transform(col("value"), col => udf((_: 
> String).drop(1)).apply(col))).show
> +---+
> |transform(value, lambdafunction(UDF(lambda 'x), x))|
> +---+
> |  [c, c, c]|
> +---+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35362) Update null count in the column stats for UNION stats estimation

2021-05-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35362:


Assignee: (was: Apache Spark)

> Update null count in the column stats for UNION stats estimation
> 
>
> Key: SPARK-35362
> URL: https://issues.apache.org/jira/browse/SPARK-35362
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.0.2
>Reporter: shahid
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35362) Update null count in the column stats for UNION stats estimation

2021-05-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342165#comment-17342165
 ] 

Apache Spark commented on SPARK-35362:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/32494

> Update null count in the column stats for UNION stats estimation
> 
>
> Key: SPARK-35362
> URL: https://issues.apache.org/jira/browse/SPARK-35362
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.0.2
>Reporter: shahid
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35362) Update null count in the column stats for UNION stats estimation

2021-05-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35362:


Assignee: Apache Spark

> Update null count in the column stats for UNION stats estimation
> 
>
> Key: SPARK-35362
> URL: https://issues.apache.org/jira/browse/SPARK-35362
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.0.2
>Reporter: shahid
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35362) Update null count in the column stats for UNION stats estimation

2021-05-10 Thread shahid (Jira)
shahid created SPARK-35362:
--

 Summary: Update null count in the column stats for UNION stats 
estimation
 Key: SPARK-35362
 URL: https://issues.apache.org/jira/browse/SPARK-35362
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 3.0.2
Reporter: shahid






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35353) Cross-building docker images to ARM64 is failing (with Ubuntu host)

2021-05-10 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated SPARK-35353:
---
Summary: Cross-building docker images to ARM64 is failing (with Ubuntu 
host)  (was: Cross-building docker images to ARM64 is failing)

> Cross-building docker images to ARM64 is failing (with Ubuntu host)
> ---
>
> Key: SPARK-35353
> URL: https://issues.apache.org/jira/browse/SPARK-35353
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Andy Grove
>Priority: Minor
>
> I was trying to cross-build Spark 3.1.1 for ARM64 so that I could deploy to a 
> Raspberry Pi Kubernetes cluster this weekend and the Docker build fails.
> Here are the commands I used:
> {code:java}
> docker buildx create --use
> ./bin/docker-image-tool.sh -n -r andygrove -t 3.1.1 -X build {code}
> The Docker build for ARM64 fails on the following command:
> {code:java}
>  apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps{code}
> The install fails with "Error while loading /usr/sbin/dpkg-split: No such 
> file or directory".
> Here is a fragment of the output showing the relevant error message.
> {code:java}
> #6 6.034 Get:35 https://deb.debian.org/debian buster/main arm64 libnss3 arm64 
> 2:3.42.1-1+deb10u3 [1082 kB]
> #6 6.102 Get:36 https://deb.debian.org/debian buster/main arm64 psmisc arm64 
> 23.2-1 [122 kB]
> #6 6.109 Get:37 https://deb.debian.org/debian buster/main arm64 tini arm64 
> 0.18.0-1 [194 kB]
> #6 6.767 debconf: delaying package configuration, since apt-utils is not 
> installed
> #6 6.883 Fetched 18.1 MB in 1s (13.4 MB/s)
> #6 6.956 Error while loading /usr/sbin/dpkg-split: No such file or directory
> #6 6.959 Error while loading /usr/sbin/dpkg-deb: No such file or directory
> #6 6.961 dpkg: error processing archive 
> /tmp/apt-dpkg-install-NdOR40/00-libncurses6_6.1+20181013-2+deb10u2_arm64.deb 
> (--unpack):
>  {code}
> My host environment details:
>  * Ubuntu 18.04.5 LTS
>  * Docker version 20.10.6, build 370c289
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35353) Cross-building docker images to ARM64 is failing

2021-05-10 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342066#comment-17342066
 ] 

Andy Grove commented on SPARK-35353:


The issue seems specific to running on an Ubuntu host. I reproduced the issue 
on two computers. It works fine on my Macbook Pro though.

> Cross-building docker images to ARM64 is failing
> 
>
> Key: SPARK-35353
> URL: https://issues.apache.org/jira/browse/SPARK-35353
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Andy Grove
>Priority: Minor
>
> I was trying to cross-build Spark 3.1.1 for ARM64 so that I could deploy to a 
> Raspberry Pi Kubernetes cluster this weekend and the Docker build fails.
> Here are the commands I used:
> {code:java}
> docker buildx create --use
> ./bin/docker-image-tool.sh -n -r andygrove -t 3.1.1 -X build {code}
> The Docker build for ARM64 fails on the following command:
> {code:java}
>  apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps{code}
> The install fails with "Error while loading /usr/sbin/dpkg-split: No such 
> file or directory".
> Here is a fragment of the output showing the relevant error message.
> {code:java}
> #6 6.034 Get:35 https://deb.debian.org/debian buster/main arm64 libnss3 arm64 
> 2:3.42.1-1+deb10u3 [1082 kB]
> #6 6.102 Get:36 https://deb.debian.org/debian buster/main arm64 psmisc arm64 
> 23.2-1 [122 kB]
> #6 6.109 Get:37 https://deb.debian.org/debian buster/main arm64 tini arm64 
> 0.18.0-1 [194 kB]
> #6 6.767 debconf: delaying package configuration, since apt-utils is not 
> installed
> #6 6.883 Fetched 18.1 MB in 1s (13.4 MB/s)
> #6 6.956 Error while loading /usr/sbin/dpkg-split: No such file or directory
> #6 6.959 Error while loading /usr/sbin/dpkg-deb: No such file or directory
> #6 6.961 dpkg: error processing archive 
> /tmp/apt-dpkg-install-NdOR40/00-libncurses6_6.1+20181013-2+deb10u2_arm64.deb 
> (--unpack):
>  {code}
> My host environment details:
>  * Ubuntu 18.04.5 LTS
>  * Docker version 20.10.6, build 370c289
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35361) Improve performance for ApplyFunctionExpression

2021-05-10 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35361:
-
Priority: Minor  (was: Major)

> Improve performance for ApplyFunctionExpression
> ---
>
> Key: SPARK-35361
> URL: https://issues.apache.org/jira/browse/SPARK-35361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> When the `ScalarFunction` itself is trivial, `ApplyFunctionExpression` can 
> incur significant runtime cost from the `zipWithIndex` call. This issue 
> proposes moving that call outside the loop over each input row.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35361) Improve performance for ApplyFunctionExpression

2021-05-10 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35361:
-
Priority: Major  (was: Minor)

> Improve performance for ApplyFunctionExpression
> ---
>
> Key: SPARK-35361
> URL: https://issues.apache.org/jira/browse/SPARK-35361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> When the `ScalarFunction` itself is trivial, `ApplyFunctionExpression` can 
> incur significant runtime cost from the `zipWithIndex` call. This issue 
> proposes moving that call outside the loop over each input row.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35361) Improve performance for ApplyFunctionExpression

2021-05-10 Thread Chao Sun (Jira)
Chao Sun created SPARK-35361:


 Summary: Improve performance for ApplyFunctionExpression
 Key: SPARK-35361
 URL: https://issues.apache.org/jira/browse/SPARK-35361
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


When the `ScalarFunction` itself is trivial, `ApplyFunctionExpression` can 
incur significant runtime cost from the `zipWithIndex` call. This issue proposes 
moving that call outside the loop over each input row.
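
A minimal sketch of the kind of hoisting being described, using hypothetical names rather than the actual `ApplyFunctionExpression` internals:

{code:java}
// Illustrative sketch only (hypothetical names, not ApplyFunctionExpression):
// build the zipped (child, index) sequence once instead of on every row.
val children: Seq[String => Any] = Seq(s => s.length, s => s.toUpperCase)

// Before: a fresh zipWithIndex sequence is allocated for every input row.
def evalPerRow(row: String): Seq[Any] =
  children.zipWithIndex.map { case (f, _) => f(row) }

// After: the zipped sequence is computed once and reused across all rows.
val zippedChildren = children.zipWithIndex
def evalHoisted(row: String): Seq[Any] =
  zippedChildren.map { case (f, _) => f(row) }
{code}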



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes and move to docker 'virtualization' layer

2021-05-10 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342040#comment-17342040
 ] 

Attila Zsolt Piros commented on SPARK-34738:


[~shaneknapp] I am sorry to hear about the sickness part but I am glad you are 
now better. Take your time and don't worry about this! 
Stay safe!

> Upgrade Minikube and kubernetes and move to docker 'virtualization' layer
> -
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
> Attachments: integration-tests.log
>
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]
>  Minikube can be upgraded to the latest (v1.18.1) and kubernetes version 
> should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.
>  
> Added by Shane:
> we also need to move from the kvm2 virtualization layer to docker.  docker is 
> a recommended driver w/the latest versions of minikube, and this will allow 
> devs to more easily recreate the minikube/k8s env on their local workstations 
> and run the integration tests in an identical environment as jenkins.
> the TL;DR is that upgrading to docker works, except that the PV integration 
> tests are failing due to a couple of possible reasons:
> 1) the 'spark-kubernetes-driver' isn't properly being loaded 
> (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312517=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312517)
> 2) during the PV test run, the error message 'Given path 
> (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist' shows up in 
> the logs.  however, the mk cluster *does* mount successfully to the local 
> bare-metal filesystem *and* if i 'minikube ssh' in to it, i can see the mount 
> and read/write successfully to it 
> (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312548=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312548)
> i could really use some help, and if it's useful, i can create some local 
> accounts manually and allow ssh access for a couple of people to assist me.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34883) Setting CSV reader option "multiLine" to "true" causes URISyntaxException when colon is in file path

2021-05-10 Thread Brady Tello (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342031#comment-17342031
 ] 

Brady Tello commented on SPARK-34883:
-

Update to this thread:

I found that if I provide a schema to the csv reader, it works fine even if I 
use the multiLine option.  I've only verified this on my laptop so far but I 
plan to see if it works on something like Databricks soon. I suspect the bug is 
isolated to `inferSchema`:
{code:java}
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:62)
{code}
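
A sketch of that schema-based workaround in Scala (the path is taken from the report below; the column names and types are illustrative):

{code:java}
// Sketch of the workaround described above: supply an explicit schema so the
// reader skips schema inference. Column names and types are illustrative.
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("timestamp", StringType),
  StructField("site", StringType),
  StructField("requests", IntegerType)))

val csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"
val tempDF = spark.read
  .schema(schema)
  .option("sep", "\t")
  .option("multiLine", "true")
  .csv(csvFile)
{code}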

> Setting CSV reader option "multiLine" to "true" causes URISyntaxException 
> when colon is in file path
> 
>
> Key: SPARK-34883
> URL: https://issues.apache.org/jira/browse/SPARK-34883
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.1.1
>Reporter: Brady Tello
>Priority: Major
>
> Setting the CSV reader's "multiLine" option to "True" throws the following 
> exception when a ':' character is in the file path.
>  
> {code:java}
> java.net.URISyntaxException: Relative path in absolute URI: test:dir
> {code}
> I've tested this in both Spark 3.0.0 and Spark 3.1.1 and I get the same error 
> whether I use Scala, Python, or SQL.
> The following code works fine:
>  
> {code:java}
> csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv" 
> tempDF = (spark.read.option("sep", "\t").csv(csvFile))
> {code}
> While the following code fails:
>  
> {code:java}
> csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"
> tempDF = (spark.read.option("sep", "\t").option("multiLine", 
> "True").csv(csvFile)
> {code}
> Full Stack Trace from Python:
>  
> {code:java}
> ---------------------------------------------------------------------------
> IllegalArgumentException                  Traceback (most recent call last)
> in
>       3 csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"
>       4 
> ----> 5 tempDF = (spark.read.option("sep", "\t").option("multiLine", "True")
> /databricks/spark/python/pyspark/sql/readwriter.py in csv(self, path, schema, 
> sep, encoding, quote, escape, comment, header, inferSchema, 
> ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, 
> positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, 
> maxCharsPerColumn, maxMalformedLogPerPartition, mode, 
> columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, 
> samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, 
> recursiveFileLookup, modifiedBefore, modifiedAfter, unescapedQuoteHandling)
>     735             path = [path]
>     736         if type(path) == list:
> --> 737             return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>     738         elif isinstance(path, RDD):
>     739             def func(iterator):
> /databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
>    1302 
>    1303         answer = self.gateway_client.send_command(command)
> -> 1304         return_value = get_return_value(
>    1305             answer, self.gateway_client, self.target_id, self.name)
>    1306 
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>     114             # Hide where the exception came from that shows a non-Pythonic
>     115             # JVM exception message.
> --> 116             raise converted from None
>     117         else:
>     118             raise
> IllegalArgumentException: java.net.URISyntaxException: Relative path in 
> absolute URI: test:dir
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes and move to docker 'virtualization' layer

2021-05-10 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342024#comment-17342024
 ] 

Shane Knapp commented on SPARK-34738:
-

sorry i dropped off the radar...  i've been dealing w/a serious health issue 
these past few weeks (which is sorted), and i will update the remaining workers 
this week.

> Upgrade Minikube and kubernetes and move to docker 'virtualization' layer
> -
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
> Attachments: integration-tests.log
>
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]
>  Minikube can be upgraded to the latest (v1.18.1) and kubernetes version 
> should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.
>  
> Added by Shane:
> we also need to move from the kvm2 virtualization layer to docker.  docker is 
> a recommended driver w/the latest versions of minikube, and this will allow 
> devs to more easily recreate the minikube/k8s env on their local workstations 
> and run the integration tests in an identical environment as jenkins.
> the TL;DR is that upgrading to docker works, except that the PV integration 
> tests are failing due to a couple of possible reasons:
> 1) the 'spark-kubernetes-driver' isn't properly being loaded 
> (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312517=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312517)
> 2) during the PV test run, the error message 'Given path 
> (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist' shows up in 
> the logs.  however, the mk cluster *does* mount successfully to the local 
> bare-metal filesystem *and* if i 'minikube ssh' in to it, i can see the mount 
> and read/write successfully to it 
> (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312548=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312548)
> i could really use some help, and if it's useful, i can create some local 
> accounts manually and allow ssh access for a couple of people to assist me.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes and move to docker 'virtualization' layer

2021-05-10 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342018#comment-17342018
 ] 

Attila Zsolt Piros commented on SPARK-34738:


[~shaneknapp] I have merged my commit; this means the "PVs with local storage" 
test is skipped. 
Could you please check whether the tests on Minikube with the docker driver are 
now passing successfully on Jenkins? 

> Upgrade Minikube and kubernetes and move to docker 'virtualization' layer
> -
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
> Attachments: integration-tests.log
>
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]
>  Minikube can be upgraded to the latest (v1.18.1) and kubernetes version 
> should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.
>  
> Added by Shane:
> we also need to move from the kvm2 virtualization layer to docker.  docker is 
> a recommended driver w/the latest versions of minikube, and this will allow 
> devs to more easily recreate the minikube/k8s env on their local workstations 
> and run the integration tests in an identical environment as jenkins.
> the TL;DR is that upgrading to docker works, except that the PV integration 
> tests are failing due to a couple of possible reasons:
> 1) the 'spark-kubernetes-driver' isn't properly being loaded 
> (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312517=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312517)
> 2) during the PV test run, the error message 'Given path 
> (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist' shows up in 
> the logs.  however, the mk cluster *does* mount successfully to the local 
> bare-metal filesystem *and* if i 'minikube ssh' in to it, i can see the mount 
> and read/write successfully to it 
> (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312548=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312548)
> i could really use some help, and if it's useful, i can create some local 
> accounts manually and allow ssh access for a couple of people to assist me.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34736) Kubernetes and Minikube version upgrade for integration tests

2021-05-10 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros resolved SPARK-34736.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31829
[https://github.com/apache/spark/pull/31829]

> Kubernetes and Minikube version upgrade for integration tests
> -
>
> Key: SPARK-34736
> URL: https://issues.apache.org/jira/browse/SPARK-34736
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.2.0
>
>
> As [discussed in the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]:
>  upgrading Minikube version from v0.34.1 to v1.7.3 and kubernetes version 
> from v1.15.12 to v1.17.3.
> Moreover, the Minikube version will be checked.
> By making this upgrade we can simplify how the kubernetes client is 
> configured for Minikube.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34736) Kubernetes and Minikube version upgrade for integration tests

2021-05-10 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros reassigned SPARK-34736:
--

Assignee: Attila Zsolt Piros

> Kubernetes and Minikube version upgrade for integration tests
> -
>
> Key: SPARK-34736
> URL: https://issues.apache.org/jira/browse/SPARK-34736
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
>
> As [discussed in the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]:
>  upgrading Minikube version from v0.34.1 to v1.7.3 and kubernetes version 
> from v1.15.12 to v1.17.3.
> Moreover, the Minikube version will be checked.
> By making this upgrade we can simplify how the kubernetes client is 
> configured for Minikube.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34246) New type coercion syntax rules in ANSI mode

2021-05-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341909#comment-17341909
 ] 

Apache Spark commented on SPARK-34246:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32493

> New type coercion syntax rules in ANSI mode
> ---
>
> Key: SPARK-34246
> URL: https://issues.apache.org/jira/browse/SPARK-34246
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Add new implicit cast syntax rules in ANSI mode.
> In Spark ANSI mode, the type coercion rules are based on the type precedence 
> lists of the input data types. 
> As per the section "Type precedence list determination" of "ISO/IEC 
> 9075-2:2011
> Information technology — Database languages - SQL — Part 2: Foundation 
> (SQL/Foundation)", the type precedence lists of primitive data types are as 
> follows:
> * Byte: Byte, Short, Int, Long, Decimal, Float, Double
> * Short: Short, Int, Long, Decimal, Float, Double
> * Int: Int, Long, Decimal, Float, Double
> * Long: Long, Decimal, Float, Double
> * Decimal: Any wider Numeric type
> * Float: Float, Double
> * Double: Double
> * String: String
> * Date: Date, Timestamp
> * Timestamp: Timestamp
> * Binary: Binary
> * Boolean: Boolean
> * Interval: Interval
> As for complex data types, Spark determines the precedence list recursively 
> based on their sub-types.
> With the definition of the type precedence list, the general type coercion 
> rules are as follows:
> * Data type S is allowed to be implicitly cast as type T iff T is in the 
> precedence list of S
> * Comparison is allowed iff the data type precedence lists of both sides have 
> at least one common element. When evaluating the comparison, Spark casts both 
> sides to the tightest common data type of their precedence lists.
> * There should be at least one common data type among all the children's 
> precedence lists for the following operators. The data type of the operator 
> is the tightest common data type of those precedence lists.
> {code:java}
> In
> Except(odd)
> Intersect
> Greatest
> Least
> Union
> If
> CaseWhen
> CreateArray
> Array Concat
> Sequence
> MapConcat
> CreateMap
> {code}
> * For complex types (struct, array, map), Spark recursively looks into the 
> element type and applies the rules above. If the element nullability is 
> converted from true to false, a runtime null check is added to the elements.
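
A compact sketch of how these rules can be read (toy string-based types only, not Spark's actual type coercion code): implicit casting from S to T is allowed iff T appears in the precedence list of S, and the tightest common type of two types is the first entry of one list that also appears in the other.

{code:java}
// Toy sketch of the precedence-list rules above; not Spark's actual coercion
// implementation. Types are represented as plain strings.
val precedence: Map[String, Seq[String]] = Map(
  "Byte"  -> Seq("Byte", "Short", "Int", "Long", "Decimal", "Float", "Double"),
  "Int"   -> Seq("Int", "Long", "Decimal", "Float", "Double"),
  "Float" -> Seq("Float", "Double"),
  "Date"  -> Seq("Date", "Timestamp"))

// Implicit cast S -> T is allowed iff T is in the precedence list of S.
def canImplicitlyCast(from: String, to: String): Boolean =
  precedence.getOrElse(from, Seq(from)).contains(to)

// Tightest common type: the first type in a's list that b's list also contains.
def tightestCommonType(a: String, b: String): Option[String] = {
  val bList = precedence.getOrElse(b, Seq(b))
  precedence.getOrElse(a, Seq(a)).find(t => bList.contains(t))
}

// e.g. tightestCommonType("Int", "Float") returns Some("Float")
{code}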



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34246) New type coercion syntax rules in ANSI mode

2021-05-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341908#comment-17341908
 ] 

Apache Spark commented on SPARK-34246:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32493

> New type coercion syntax rules in ANSI mode
> ---
>
> Key: SPARK-34246
> URL: https://issues.apache.org/jira/browse/SPARK-34246
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Add new implicit cast syntax rules in ANSI mode.
> In Spark ANSI mode, the type coercion rules are based on the type precedence 
> lists of the input data types. 
> As per the section "Type precedence list determination" of "ISO/IEC 
> 9075-2:2011
> Information technology — Database languages - SQL — Part 2: Foundation 
> (SQL/Foundation)", the type precedence lists of primitive data types are as 
> follows:
> * Byte: Byte, Short, Int, Long, Decimal, Float, Double
> * Short: Short, Int, Long, Decimal, Float, Double
> * Int: Int, Long, Decimal, Float, Double
> * Long: Long, Decimal, Float, Double
> * Decimal: Any wider Numeric type
> * Float: Float, Double
> * Double: Double
> * String: String
> * Date: Date, Timestamp
> * Timestamp: Timestamp
> * Binary: Binary
> * Boolean: Boolean
> * Interval: Interval
> As for complex data types, Spark determines the precedence list recursively 
> based on their sub-types.
> With the definition of the type precedence list, the general type coercion 
> rules are as follows:
> * Data type S is allowed to be implicitly cast as type T iff T is in the 
> precedence list of S
> * Comparison is allowed iff the data type precedence lists of both sides have 
> at least one common element. When evaluating the comparison, Spark casts both 
> sides to the tightest common data type of their precedence lists.
> * There should be at least one common data type among all the children's 
> precedence lists for the following operators. The data type of the operator 
> is the tightest common data type of those precedence lists.
> {code:java}
> In
> Except(odd)
> Intersect
> Greatest
> Least
> Union
> If
> CaseWhen
> CreateArray
> Array Concat
> Sequence
> MapConcat
> CreateMap
> {code}
> * For complex types (struct, array, map), Spark recursively looks into the 
> element type and applies the rules above. If the element nullability is 
> converted from true to false, a runtime null check is added to the elements.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35360) Spark make add partition batch size configurable when call RepairTableCommand

2021-05-10 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-35360:


Assignee: angerszhu

> Spark make add partition batch size configurable when call RepairTableCommand
> -
>
> Key: SPARK-35360
> URL: https://issues.apache.org/jira/browse/SPARK-35360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> Currently, when we run MSCK REPAIR TABLE, partitions are added in batches of 
> 100; this batch size should be configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35360) Spark make add partition batch size configurable when call RepairTableCommand

2021-05-10 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-35360.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32489
[https://github.com/apache/spark/pull/32489]

> Spark make add partition batch size configurable when call RepairTableCommand
> -
>
> Key: SPARK-35360
> URL: https://issues.apache.org/jira/browse/SPARK-35360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, when we run MSCK REPAIR TABLE, partitions are added in batches of 
> 100; this batch size should be configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35088) Accept ANSI intervals by the Sequence expression

2021-05-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341794#comment-17341794
 ] 

Apache Spark commented on SPARK-35088:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/32492

> Accept ANSI intervals by the Sequence expression
> 
>
> Key: SPARK-35088
> URL: https://issues.apache.org/jira/browse/SPARK-35088
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, the expression accepts only CalendarIntervalType as the step 
> expression. It should support ANSI intervals as well.
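
For illustration, the kind of query this enables (assuming a build where the ANSI interval work has landed, e.g. a 3.2.0 snapshot):

{code:java}
// Illustrative only, assuming ANSI interval literals are available:
// sequence() with a year-month interval as the step expression.
spark.sql(
  "SELECT sequence(DATE'2021-01-01', DATE'2021-04-01', INTERVAL '1' MONTH)"
).show(false)
// expected: [2021-01-01, 2021-02-01, 2021-03-01, 2021-04-01]
{code}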



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35358) Set maximum Java heap used for release build

2021-05-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35358.
---
Fix Version/s: 3.2.0
   3.1.2
   3.0.3
   Resolution: Fixed

> Set maximum Java heap used for release build
> 
>
> Key: SPARK-35358
> URL: https://issues.apache.org/jira/browse/SPARK-35358
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> When I was cutting RCs for 2.4.8, I frequently encountered OOMs while building 
> with mvn. It happened many times until I increased the heap memory setting.
> I am not sure if other release managers encounter the same issue. I will try 
> to increase the heap memory setting.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35360) Spark make add partition batch size configurable when call RepairTableCommand

2021-05-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341704#comment-17341704
 ] 

Apache Spark commented on SPARK-35360:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32489

> Spark make add partition batch size configurable when call RepairTableCommand
> -
>
> Key: SPARK-35360
> URL: https://issues.apache.org/jira/browse/SPARK-35360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: angerszhu
>Priority: Major
>
> Currently, when we run MSCK REPAIR TABLE, partitions are added in batches of 
> 100; this batch size should be configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35360) Spark make add partition batch size configurable when call RepairTableCommand

2021-05-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341703#comment-17341703
 ] 

Apache Spark commented on SPARK-35360:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32489

> Spark make add partition batch size configurable when call RepairTableCommand
> -
>
> Key: SPARK-35360
> URL: https://issues.apache.org/jira/browse/SPARK-35360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: angerszhu
>Priority: Major
>
> Currently, when we run MSCK REPAIR TABLE, partitions are added in batches of 
> 100; this batch size should be configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35360) Spark make add partition batch size configurable when call RepairTableCommand

2021-05-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35360:


Assignee: Apache Spark

> Spark make add partition batch size configurable when call RepairTableCommand
> -
>
> Key: SPARK-35360
> URL: https://issues.apache.org/jira/browse/SPARK-35360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> Currently, when we run MSCK REPAIR TABLE, partitions are added in batches of 
> 100; this batch size should be configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35360) Spark make add partition batch size configurable when call RepairTableCommand

2021-05-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35360:


Assignee: (was: Apache Spark)

> Spark make add partition batch size configurable when call RepairTableCommand
> -
>
> Key: SPARK-35360
> URL: https://issues.apache.org/jira/browse/SPARK-35360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: angerszhu
>Priority: Major
>
> Currently, when we run MSCK REPAIR TABLE, partitions are added in batches of 
> 100; this batch size should be configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35360) Spark make add partition batch size configurable when call RepairTableCommand

2021-05-10 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-35360:
--
Summary: Spark make add partition batch size configurable when call 
RepairTableCommand  (was: Spark make add partition batch size configurable)

> Spark make add partition batch size configurable when call RepairTableCommand
> -
>
> Key: SPARK-35360
> URL: https://issues.apache.org/jira/browse/SPARK-35360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: angerszhu
>Priority: Major
>
> Currently, when we run MSCK REPAIR TABLE, partitions are added in batches of 
> 100; this batch size should be configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35360) Spark make add partition batch size configurable

2021-05-10 Thread angerszhu (Jira)
angerszhu created SPARK-35360:
-

 Summary: Spark make add partition batch size configurable
 Key: SPARK-35360
 URL: https://issues.apache.org/jira/browse/SPARK-35360
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: angerszhu


Currently, when we run MSCK REPAIR TABLE, partitions are added in batches of 
100; this batch size should be configurable.
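
A rough sketch of the intent (the config key below is hypothetical and only for illustration):

{code:java}
// Rough sketch only; the config key is hypothetical and the real change lives
// in RepairTableCommand. Discovered partitions are sent to the metastore in
// batches whose size comes from a config instead of a hard-coded 100.
val batchSize = spark.conf
  .getOption("spark.sql.addPartitionInBatch.size")  // hypothetical key
  .map(_.toInt)
  .getOrElse(100)

val partitionSpecs: Seq[Map[String, String]] =
  Seq(Map("dt" -> "2021-05-09"), Map("dt" -> "2021-05-10"), Map("dt" -> "2021-05-11"))

partitionSpecs.grouped(batchSize).foreach { batch =>
  // each batch would become one ALTER TABLE ... ADD PARTITION metastore call
  println(s"adding ${batch.size} partition(s) in one metastore call")
}
{code}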



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org