[jira] [Commented] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

2021-05-10 Thread Shubham Chaurasia (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342311#comment-17342311 ]

Shubham Chaurasia commented on SPARK-34675:
---

Thanks for the previous investigations [~maxgekk] [~dongjoon].

I saw something strange on the latest master. 

I was repeating the same experiment and found that if we set a timezone T1, either by changing the shell timezone with export TZ or by passing it via {{extraJavaOptions}}, we see different timestamp values depending on the file format.

My system timezone was UTC.

I get the following values
{code}
user.timezone - America/Los_Angeles
TimeZone.getDefault - America/Los_Angeles
spark.sql.session.timeZone - America/Los_Angeles
+------------------------+-------------------+------------+
|type                    |timestamp          |millis      |
+------------------------+-------------------+------------+
|FROM BEELINE-EXT PARQUET|1989-01-04 16:00:00|599961600000|
|FROM BEELINE-EXT ORC    |1989-01-05 00:00:00|599990400000|
|FROM BEELINE-EXT AVRO   |1989-01-04 16:00:00|599961600000|
|FROM BEELINE-EXT TEXT   |1989-01-05 00:00:00|599990400000|
+------------------------+-------------------+------------+
{code}
when I either change the shell timezone like 
{code}
export TZ=America/Los_Angeles
{code}
or pass it via extraJavaOptions, like
{code}
bin/spark-shell --master local --conf spark.driver.extraJavaOptions='-Duser.timezone=America/Los_Angeles' --conf spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'
{code}
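As context for these two mechanisms (an editor-added illustration, not part of the original thread): both `export TZ` set before JVM startup and `-Duser.timezone` feed into the JVM's default timezone, which can be inspected with the same two APIs the thread prints:

```java
import java.util.TimeZone;

public class DefaultTzCheck {
    public static void main(String[] args) {
        // Both `export TZ=...` (exported before the JVM starts) and
        // `-Duser.timezone=...` determine the JVM default timezone,
        // visible through the system property and through java.util.TimeZone:
        System.out.println("user.timezone - " + System.getProperty("user.timezone"));
        System.out.println("TimeZone.getDefault - " + TimeZone.getDefault().getID());
    }
}
```

Run with `java -Duser.timezone=America/Los_Angeles DefaultTzCheck` (or with `TZ` exported first) to see both values reflect the chosen zone.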

This is not specific to America/Los_Angeles; I tested with Asia/Kolkata as well and saw the same behavior with the above steps.
Result with Asia/Kolkata:
{code}
user.timezone - Asia/Kolkata
TimeZone.getDefault - Asia/Kolkata
spark.sql.session.timeZone - Asia/Kolkata
+------------------------+-------------------+------------+
|type                    |timestamp          |millis      |
+------------------------+-------------------+------------+
|FROM BEELINE-EXT PARQUET|1989-01-05 05:30:00|599961600000|
|FROM BEELINE-EXT ORC    |1989-01-05 00:00:00|599941800000|
|FROM BEELINE-EXT AVRO   |1989-01-05 05:30:00|599961600000|
|FROM BEELINE-EXT TEXT   |1989-01-05 00:00:00|599941800000|
+------------------------+-------------------+------------+
{code}
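The per-zone timestamps above are consistent with a single underlying epoch instant being rendered in different display timezones. A minimal, editor-added Java sketch, assuming the stored instant is epoch millisecond 599961600000 (1989-01-05 00:00:00 UTC), which matches the PARQUET/AVRO rows:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class TzRenderDemo {
    public static void main(String[] args) {
        // One stored instant: 599961600000 ms = 1989-01-05 00:00:00 UTC.
        Instant stored = Instant.ofEpochMilli(599961600000L);
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

        // The same instant renders differently depending on the display timezone:
        System.out.println(fmt.format(stored.atZone(ZoneId.of("UTC"))));                 // 1989-01-05 00:00:00
        System.out.println(fmt.format(stored.atZone(ZoneId.of("America/Los_Angeles")))); // 1989-01-04 16:00:00
        System.out.println(fmt.format(stored.atZone(ZoneId.of("Asia/Kolkata"))));        // 1989-01-05 05:30:00
    }
}
```

This reproduces the 16:00 (UTC-8) and 05:30 (UTC+5:30) wall-clock values seen in the two result tables.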

> TimeZone inconsistencies when JVM and session timezones are different
> ---------------------------------------------------------------------
>
> Key: SPARK-34675
> URL: https://issues.apache.org/jira/browse/SPARK-34675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7
>Reporter: Shubham Chaurasia
>Priority: Major
> Fix For: 3.2.0
>
>
> Inserted the following data with UTC as both the JVM and the session timezone.
> Spark-shell launch command:
> {code}
> bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf spark.driver.extraJavaOptions='-Duser.timezone=UTC' --conf spark.executor.extraJavaOptions='-Duser.timezone=UTC'
> {code}
> Table creation:
> {code:scala}
> sql("use ts").show
> sql("create table spark_parquet(type string, t timestamp) stored as parquet").show
> sql("create table spark_orc(type string, t timestamp) stored as orc").show
> sql("create table spark_avro(type string, t timestamp) stored as avro").show
> sql("create table spark_text(type string, t timestamp) stored as textfile").show
> sql("insert into spark_parquet values ('FROM SPARK-EXT PARQUET', '1989-01-05 01:02:03')").show
> sql("insert into spark_orc values ('FROM SPARK-EXT ORC', '1989-01-05 01:02:03')").show
> sql("insert into spark_avro values ('FROM SPARK-EXT AVRO', '1989-01-05 01:02:03')").show
> sql("insert into spark_text values ('FROM SPARK-EXT TEXT', '1989-01-05 01:02:03')").show
> {code}
> Used the following function to check and verify the returned timestamps:
> {code:scala}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> def showTs(
>     db: String,
>     tables: String*
> ): org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = {
>   sql("use " + db).show
>   import scala.collection.mutable.ListBuffer
>   val results = new ListBuffer[org.apache.spark.sql.DataFrame]()
>   for (tbl <- tables) {
>     val query = "select * from " + tbl
>     println("Executing - " + query)
>     results += sql(query)
>   }
>   println("user.timezone - " + System.getProperty("user.timezone"))
>   println("TimeZone.getDefault - " + java.util.TimeZone.getDefault.getID)
>   println("spark.sql.session.timeZone - " + spark.conf.get("spark.sql.session.timeZone"))
>   var unionDf = results(0)
>   for (i <- 1 until results.length) {
>     unionDf = unionDf.unionAll(results(i))
>   }
>   val augmented = unionDf.map(r => (r.getString(0), r.getTimestamp(1), r.getTimestamp(1).getTime))
>   val renamed = augmented.withColumnRenamed("_1", "type").withColumnRenamed("_2", "ts").withColumnRenamed("_3", "millis")
>   renamed.show(false)
>   renamed
> }
> // Exiting paste mode, now interpreting.
> scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
> Hive Session ID = daa82b83-b50d-4038-97ee-1ecb2d01b368
> ++
> ||
> ++
> ++
> Executing - select * from spark_parquet
> Executing - select * from spark_orc
> Executing - select * from spark_avro
> Executing - select * from spark_text
> user.timezone - UTC
> TimeZone.getDefault - UTC
> spark.sql.session.timeZone - UTC
> +----------------------+-------------------+------------+
> |type                  |ts                 |millis      |
> +----------------------+-------------------+------------+
> |FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599965323000|
> +----------------------+-------------------+------------+
> {code}
> 1. Set session timezone to America/Los_Angeles
> {code:scala}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
> ++
> ||
> ++
> ++
> Executing - select * from spark_parquet
> Executing - select * from spark_orc
> Executing - select * from spark_avro
> Executing - select * from spark_text
> user.timezone - UTC
> TimeZone.getDefault - UTC
> spark.sql.session.timeZone - America/Los_Angeles
> +----------------------+-------------------+------------+
> |type                  |ts                 |millis      |
> +----------------------+-------------------+------------+
> |FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT ORC    |1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT AVRO   |1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT T
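The millis value 599965323000 reported for the UTC run can be cross-checked without Spark; a small editor-added sketch (not part of the original thread):

```java
import java.time.Instant;

public class MillisCheck {
    public static void main(String[] args) {
        // The reported epoch millis for the inserted timestamp,
        // interpreted as an absolute UTC instant:
        Instant ts = Instant.ofEpochMilli(599965323000L);
        System.out.println(ts); // 1989-01-05T01:02:03Z
    }
}
```

This confirms that with JVM and session timezone both UTC, the stored instant is exactly the inserted wall-clock value read as UTC.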

[jira] [Commented] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

2021-03-10 Thread Dongjoon Hyun (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299373#comment-17299373 ]

Dongjoon Hyun commented on SPARK-34675:
---

Thank you for checking those, [~maxgekk]. In that case, it looks sufficient to resolve this issue with your analysis above (the result on master and your comment).


[jira] [Commented] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

2021-03-10 Thread Dongjoon Hyun (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299376#comment-17299376 ]

Dongjoon Hyun commented on SPARK-34675:
---

I linked those issues via `Superseded by` links.


[jira] [Commented] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

2021-03-10 Thread Maxim Gekk (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299345#comment-17299345 ]

Maxim Gekk commented on SPARK-34675:


> Could you link the original related patch and close this issue, [~maxgekk]?
 
I think the issue has been fixed by multiple commits for sub-tasks of https://issues.apache.org/jira/browse/SPARK-26651, https://issues.apache.org/jira/browse/SPARK-31404 and https://issues.apache.org/jira/browse/SPARK-30951. It is hard to identify the particular patches that fixed the issue.


[jira] [Commented] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

2021-03-10 Thread Dongjoon Hyun (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299115#comment-17299115 ]

Dongjoon Hyun commented on SPARK-34675:
---

According to [~maxgekk]'s result, it seems that we can close this issue for 3.2.0 as a `Duplicate`.

Could you link the original related patch and close this issue, [~maxgekk]?


[jira] [Commented] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

2021-03-10 Thread Maxim Gekk (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299086#comment-17299086 ]

Maxim Gekk commented on SPARK-34675:


Here is the output on the current master (the same result for all data sources):

{code}
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> showTs("default", "spark_parquet", "spark_orc", "spark_avro", 
"spark_text")
++
||
++
++

Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - America/Los_Angeles
TimeZone.getDefault - America/Los_Angeles
spark.sql.session.timeZone - America/Los_Angeles
+----------------------+-------------------+------------+
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|599994123000|
|FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599994123000|
|FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599994123000|
|FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599994123000|
+----------------------+-------------------+------------+

res18: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [type: string, 
ts: timestamp ... 1 more field]
{code}
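For reference, the epoch value corresponding to the wall-clock 1989-01-05 01:02:03 in America/Los_Angeles (what a run with both JVM and session timezone set to America/Los_Angeles stores) can be recomputed with plain java.time; an editor-added sketch:

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class LaEpoch {
    public static void main(String[] args) {
        // 1989-01-05 01:02:03 as local wall-clock time in America/Los_Angeles,
        // which observes UTC-8 (PST, no daylight saving) in January:
        ZonedDateTime la = ZonedDateTime.of(1989, 1, 5, 1, 2, 3, 0,
                ZoneId.of("America/Los_Angeles"));
        System.out.println(la.toInstant().toEpochMilli()); // 599994123000
    }
}
```

This is exactly 8 hours (28800000 ms) later than the 599965323000 stored in the UTC run, matching the zone offset.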

 


[jira] [Commented] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

2021-03-10 Thread L. C. Hsieh (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299059#comment-17299059 ]

L. C. Hsieh commented on SPARK-34675:
-

Thanks for pinging me, [~dongjoon].

So it looks like even in Spark 3 (not sure which version you used; is it the current master?), there is still some result difference between data sources?

And I agree with [~dongjoon] that we cannot change Spark 2.4 for that.


[jira] [Commented] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

2021-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298962#comment-17298962
 ] 

Dongjoon Hyun commented on SPARK-34675:
---

Thank you for verification, [~ShubhamChaurasia]. The result difference between 
data sources seems to be the real problem here because we cannot change Spark 
2.4. BTW, the Apache Spark community is preparing Apache Spark 2.4.8 as an 
official EOL release this month (Release Manager: [~viirya]).

> TimeZone inconsistencies when JVM and session timezones are different
> -
>
> Key: SPARK-34675
> URL: https://issues.apache.org/jira/browse/SPARK-34675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7
>Reporter: Shubham Chaurasia
>Priority: Major
>
> Inserted following data with UTC as both JVM and session timezone.
> Spark-shell launch command
> {code}
> bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive \
>   --conf spark.sql.catalogImplementation=hive \
>   --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 \
>   --conf spark.driver.extraJavaOptions='-Duser.timezone=UTC' \
>   --conf spark.executor.extraJavaOptions='-Duser.timezone=UTC'
> {code}
> Table creation  
> {code:scala}
> sql("use ts").show
> sql("create table spark_parquet(type string, t timestamp) stored as parquet").show
> sql("create table spark_orc(type string, t timestamp) stored as orc").show
> sql("create table spark_avro(type string, t timestamp) stored as avro").show
> sql("create table spark_text(type string, t timestamp) stored as textfile").show
> sql("insert into spark_parquet values ('FROM SPARK-EXT PARQUET', '1989-01-05 01:02:03')").show
> sql("insert into spark_orc values ('FROM SPARK-EXT ORC', '1989-01-05 01:02:03')").show
> sql("insert into spark_avro values ('FROM SPARK-EXT AVRO', '1989-01-05 01:02:03')").show
> sql("insert into spark_text values ('FROM SPARK-EXT TEXT', '1989-01-05 01:02:03')").show
> {code}
> Used following function to check and verify the returned timestamps
> {code:scala}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> def showTs(
>     db: String,
>     tables: String*): org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = {
>   sql("use " + db).show
>   import scala.collection.mutable.ListBuffer
>   var results = new ListBuffer[org.apache.spark.sql.DataFrame]()
>   for (tbl <- tables) {
>     val query = "select * from " + tbl
>     println("Executing - " + query)
>     results += sql(query)
>   }
>   println("user.timezone - " + System.getProperty("user.timezone"))
>   println("TimeZone.getDefault - " + java.util.TimeZone.getDefault.getID)
>   println("spark.sql.session.timeZone - " + spark.conf.get("spark.sql.session.timeZone"))
>   var unionDf = results(0)
>   for (i <- 1 until results.length) {
>     unionDf = unionDf.unionAll(results(i))
>   }
>   val augmented = unionDf.map(r => (r.getString(0), r.getTimestamp(1), r.getTimestamp(1).getTime))
>   val renamed = augmented.withColumnRenamed("_1", "type").withColumnRenamed("_2", "ts").withColumnRenamed("_3", "millis")
>   renamed.show(false)
>   return renamed
> }
> // Exiting paste mode, now interpreting.
> scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
> Hive Session ID = daa82b83-b50d-4038-97ee-1ecb2d01b368
> ++
> ||
> ++
> ++
> Executing - select * from spark_parquet
> Executing - select * from spark_orc
> Executing - select * from spark_avro
> Executing - select * from spark_text
> user.timezone - UTC
> TimeZone.getDefault - UTC
> spark.sql.session.timeZone - UTC
> +----------------------+-------------------+------------+
> |type                  |ts                 |millis      |
> +----------------------+-------------------+------------+
> |FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599965323000|
> +----------------------+-------------------+------------+
> {code}
> 1. Set session timezone to America/Los_Angeles
> {code:scala}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
> ++
> ||
> ++
> ++
> Executing - select * from spark_parquet
> Executing - select * from spark_orc
> Executing - select * from spark_avro
> Executing - select * from spark_text
> user.timezone - UTC
> TimeZone.getDefault - UTC
> spark.sql.session.timeZone - America/Los_Angeles
> +----------------------+-------------------+------------+
> |type                  |ts                 |millis      |
> +----------------------+-------------------+------------+
> |FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT ORC    |1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT AVRO   |1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT TEXT   |1989-01-04 17:02:03|599965323000|
> +----------------------+-------------------+------------+
> {code}

[jira] [Commented] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

2021-03-09 Thread Shubham Chaurasia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298524#comment-17298524
 ] 

Shubham Chaurasia commented on SPARK-34675:
---

Thanks [~dongjoon] [~maxgekk]

Experiments with Spark 3 
{code:scala}
bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 \
  --conf spark.driver.extraJavaOptions='-Duser.timezone=UTC' \
  --conf spark.executor.extraJavaOptions='-Duser.timezone=UTC'


scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
++
||
++
++

Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - UTC
TimeZone.getDefault - UTC
spark.sql.session.timeZone - UTC
+----------------------+-------------------+------------+
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599965323000|
+----------------------+-------------------+------------+

res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [type: string, ts: timestamp ... 1 more field]

scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
++
||
++
++

Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - UTC
TimeZone.getDefault - UTC
spark.sql.session.timeZone - America/Los_Angeles
+----------------------+-------------------+------------+
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT AVRO   |1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT TEXT   |1989-01-04 17:02:03|599965323000|
+----------------------+-------------------+------------+
{code}
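For what it's worth, the table above is consistent with Spark 3 keeping the stored UTC instant and only changing the zone used for rendering: millis stays 599965323000 while ts shifts to 17:02:03. A minimal standalone sketch of that rendering step with plain java.time (illustrative only, not actual Spark internals):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class SessionZoneRender {
    // Render a fixed epoch instant in a given display zone. The instant
    // (599965323000 ms = 1989-01-05 01:02:03 UTC, from the tables above)
    // never changes; only the zone used for formatting does.
    static String render(long epochMillis, String zoneId) {
        return DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                .withZone(ZoneId.of(zoneId))
                .format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        long millis = 599965323000L;
        System.out.println(render(millis, "UTC"));                 // 1989-01-05 01:02:03
        System.out.println(render(millis, "America/Los_Angeles")); // 1989-01-04 17:02:03
    }
}
```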

When the JVM (and hence session) timezone is {{America/Los_Angeles}}, Avro is 
now aligned with Parquet (unlike in 2.4.x above), while ORC and text behave as before:
{code:scala}
bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 \
  --conf spark.driver.extraJavaOptions='-Duser.timezone=America/Los_Angeles' \
  --conf spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'

scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
++
||
++
++

Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - America/Los_Angeles
TimeZone.getDefault - America/Los_Angeles
spark.sql.session.timeZone - America/Los_Angeles
+----------------------+-------------------+------------+
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599994123000|
|FROM SPARK-EXT AVRO   |1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599994123000|
+----------------------+-------------------+------------+
{code}
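The discrepancy in the last table boils down to wall-clock vs. instant semantics: Parquet/Avro keep the original UTC instant (so the displayed wall clock shifts to 17:02:03), while ORC and text keep the stored wall clock (so the underlying instant, and hence {{getTime()}}, shifts by the zone offset). A standalone sketch of that arithmetic with plain java.time (illustrative, not Spark code; the 8-hour delta matches PST in January 1989):

```java
import java.time.LocalDateTime;
import java.time.ZoneId;

public class WallClockVsInstant {
    // Epoch millis obtained when the wall clock "1989-01-05 01:02:03"
    // is interpreted in the given time zone.
    static long epochMillis(String zoneId) {
        return LocalDateTime.parse("1989-01-05T01:02:03")
                .atZone(ZoneId.of(zoneId))
                .toInstant()
                .toEpochMilli();
    }

    public static void main(String[] args) {
        long utc = epochMillis("UTC");                 // instant preserved by Parquet/Avro
        long la  = epochMillis("America/Los_Angeles"); // wall clock re-bound by ORC/text
        System.out.println(utc);      // 599965323000
        System.out.println(la - utc); // 28800000 (8 hours: PST offset in January)
    }
}
```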



[jira] [Commented] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

2021-03-09 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298233#comment-17298233
 ] 

Maxim Gekk commented on SPARK-34675:


> Set session timezone to America/Los_Angeles
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

Processing of dates/timestamps in Spark 2.4.x is based on the Java 7 time APIs, 
where the JVM time zone is hard-coded in the classes 
java.sql.Date/java.sql.Timestamp. So Spark 2.4.x cannot apply the session time 
zone in some cases. In Spark 3.x, most of these problems were solved; I would 
recommend trying the same experiment there.
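A standalone illustration of that hard-coding (plain JDK code, not Spark): java.sql.Timestamp renders itself with TimeZone.getDefault(), and the API offers no per-call way to supply a different zone, so a session zone that differs from the JVM zone cannot influence these values:

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class JvmZoneDemo {
    // toString() of java.sql.Timestamp always formats in the JVM default
    // time zone; there is no overload that accepts another zone.
    static String render(long epochMillis, String zoneId) {
        TimeZone.setDefault(TimeZone.getTimeZone(zoneId));
        return new Timestamp(epochMillis).toString();
    }

    public static void main(String[] args) {
        long millis = 599965323000L; // 1989-01-05 01:02:03 UTC (from the tables above)
        System.out.println(render(millis, "UTC"));                 // 1989-01-05 01:02:03.0
        System.out.println(render(millis, "America/Los_Angeles")); // 1989-01-04 17:02:03.0
    }
}
```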
 


[jira] [Commented] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different

2021-03-09 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298177#comment-17298177
 ] 

Dongjoon Hyun commented on SPARK-34675:
---

cc [~maxgekk]
