[jira] [Commented] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different
[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342311#comment-17342311 ] Shubham Chaurasia commented on SPARK-34675:
---
Thanks for the previous investigations [~maxgekk] [~dongjoon]. I saw something strange on the latest master. I repeated the same experiment and found that if we change the JVM timezone, either with export TZ in the shell or via {{extraJavaOptions}}, we see different timezone values across datasources. My system timezone was UTC.

When I either change the shell timezone:
{code}
export TZ=America/Los_Angeles
{code}
or pass it via extraJavaOptions:
{code}
bin/spark-shell --master local --conf spark.driver.extraJavaOptions='-Duser.timezone=America/Los_Angeles' --conf spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'
{code}
I get the following values:
{code}
user.timezone - America/Los_Angeles
TimeZone.getDefault - America/Los_Angeles
spark.sql.session.timeZone - America/Los_Angeles
+------------------------+-------------------+--------+
|type                    |timestamp          |millis  |
+------------------------+-------------------+--------+
|FROM BEELINE-EXT PARQUET|1989-01-04 16:00:00|59996160|
|FROM BEELINE-EXT ORC    |1989-01-05 00:00:00|5040    |
|FROM BEELINE-EXT AVRO   |1989-01-04 16:00:00|59996160|
|FROM BEELINE-EXT TEXT   |1989-01-05 00:00:00|5040    |
+------------------------+-------------------+--------+
{code}
This is not specific to America/Los_Angeles: I tested with Asia/Kolkata as well and saw the same behavior with the above steps.
Result with Asia/Kolkata:
{code}
user.timezone - Asia/Kolkata
TimeZone.getDefault - Asia/Kolkata
spark.sql.session.timeZone - Asia/Kolkata
+------------------------+-------------------+--------+
|type                    |timestamp          |millis  |
+------------------------+-------------------+--------+
|FROM BEELINE-EXT PARQUET|1989-01-05 05:30:00|59996160|
|FROM BEELINE-EXT ORC    |1989-01-05 00:00:00|59994180|
|FROM BEELINE-EXT AVRO   |1989-01-05 05:30:00|59996160|
|FROM BEELINE-EXT TEXT   |1989-01-05 00:00:00|59994180|
+------------------------+-------------------+--------+
{code}

> TimeZone inconsistencies when JVM and session timezones are different
> ---------------------------------------------------------------------
>
> Key: SPARK-34675
> URL: https://issues.apache.org/jira/browse/SPARK-34675
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.7
> Reporter: Shubham Chaurasia
> Priority: Major
> Fix For: 3.2.0
>
> Inserted the following data with UTC as both the JVM and session timezone.
> Spark-shell launch command
> {code}
> bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf spark.driver.extraJavaOptions='-Duser.timezone=UTC' --conf spark.executor.extraJavaOptions='-Duser.timezone=UTC'
> {code}
> Table creation
> {code:scala}
> sql("use ts").show
> sql("create table spark_parquet(type string, t timestamp) stored as parquet").show
> sql("create table spark_orc(type string, t timestamp) stored as orc").show
> sql("create table spark_avro(type string, t timestamp) stored as avro").show
> sql("create table spark_text(type string, t timestamp) stored as textfile").show
> sql("insert into spark_parquet values ('FROM SPARK-EXT PARQUET', '1989-01-05 01:02:03')").show
> sql("insert into spark_orc values ('FROM SPARK-EXT ORC', '1989-01-05 01:02:03')").show
> sql("insert into spark_avro values ('FROM SPARK-EXT AVRO', '1989-01-05 01:02:03')").show
> sql("insert into spark_text values ('FROM SPARK-EXT TEXT', '1989-01-05 01:02:03')").show
> {code}
> Used the following function to check and verify the returned timestamps
> {code:scala}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> def showTs(db: String, tables: String*): org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = {
>   sql("use " + db).show
>   import scala.collection.mutable.ListBuffer
>   val results = new ListBuffer[org.apache.spark.sql.DataFrame]()
>   for (tbl <- tables) {
>     val query = "select * from " + tbl
>     println("Executing - " + query)
>     results += sql(query)
>   }
>   println("user.timezone - " + System.getProperty("user.timezone"))
>   println("TimeZone.getDefault - " + java.util.TimeZone.getDefault.getID)
>   println("spark.sql.session.timeZone - " + spark.conf.get("spark.sql.session.timeZone"))
>   var unionDf = results(0)
>   for (i <- 1 until results.length) {
>     unionDf = unionDf.unionAll(results(i))
>   }
>   val augmented = unionDf.map(r => (r.getString(0), r.getTimestamp(1), r.getTimestamp(1).getTime))
>   val renamed = augmented.withColumnRenamed("_1", "type").withColumnRenamed("_2", "ts").withColumnRenamed("_3", "millis")
>   renamed.show(false)
>   renamed
> }
> // Exiting paste mode, now interpreting.
>
> scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
> Hive Session ID = daa82b83-b50d-4038-97ee-1ecb2d01b368
> ++
> ||
> ++
> ++
> Executing - select * from spark_parquet
> Executing - select * from spark_orc
> Executing - select * from spark_avro
> Executing - select * from spark_text
> user.timezone - UTC
> TimeZone.getDefault - UTC
> spark.sql.session.timeZone - UTC
> +----------------------+-------------------+------------+
> |type                  |ts                 |millis      |
> +----------------------+-------------------+------------+
> |FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599965323000|
> +----------------------+-------------------+------------+
> {code}
> 1. Set session timezone to America/Los_Angeles
> {code:scala}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>
> scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
> ++
> ||
> ++
> ++
> Executing - select * from spark_parquet
> Executing - select * from spark_orc
> Executing - select * from spark_avro
> Executing - select * from spark_text
> user.timezone - UTC
> TimeZone.getDefault - UTC
> spark.sql.session.timeZone - America/Los_Angeles
> +----------------------+-------------------+------------+
> |type                  |ts                 |millis      |
> +----------------------+-------------------+------------+
> |FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT ORC    |1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT AVRO   |1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT TEXT   |1989-01-04 17:02:03|599965323000|
> +----------------------+-------------------+------------+
> {code}
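The shift in the ts column above can be reproduced with plain java.time, without Spark: epoch millis 599965323000 is one fixed instant, and only its wall-clock rendering depends on the zone. A minimal sketch (the {{render}} helper is ours, not part of the issue):

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// The millis value shown by showTs for all four tables in the UTC run.
val instant = Instant.ofEpochMilli(599965323000L)
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Render the same instant as a wall-clock string in a given zone.
def render(zone: String): String = fmt.format(instant.atZone(ZoneId.of(zone)))

println(render("UTC"))                 // 1989-01-05 01:02:03
println(render("America/Los_Angeles")) // 1989-01-04 17:02:03 (PST, UTC-8)
println(render("Asia/Kolkata"))        // 1989-01-05 06:32:03 (IST, UTC+5:30)
```

This matches the tables: the millis column is zone-independent, while the rendered ts column follows spark.sql.session.timeZone.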
[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299373#comment-17299373 ] Dongjoon Hyun commented on SPARK-34675:
---
Thank you for checking those, [~maxgekk]. In that case, it looks sufficient to resolve this issue based on your analysis above (the result on master and your comment).
[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299376#comment-17299376 ] Dongjoon Hyun commented on SPARK-34675:
---
I linked those issues as `Superceded by` links.
[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299345#comment-17299345 ] Maxim Gekk commented on SPARK-34675:
> Could you link the original related patch and close this issue, [~maxgekk]?
I think the issue has been fixed by multiple commits for sub-tasks of https://issues.apache.org/jira/browse/SPARK-26651, https://issues.apache.org/jira/browse/SPARK-31404 and https://issues.apache.org/jira/browse/SPARK-30951. It is hard to identify the particular patches that fix the issue.
[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299115#comment-17299115 ] Dongjoon Hyun commented on SPARK-34675:
---
According to [~maxgekk]'s result, it seems that we can close this issue for 3.2.0 as a `Duplicate`. Could you link the original related patch and close this issue, [~maxgekk]?
[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299086#comment-17299086 ] Maxim Gekk commented on SPARK-34675:
Here is the output on the current master (the same result for all datasources):
{code}
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> showTs("default", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
++
||
++
++
Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - America/Los_Angeles
TimeZone.getDefault - America/Los_Angeles
spark.sql.session.timeZone - America/Los_Angeles
+----------------------+-------------------+--------+
|type                  |ts                 |millis  |
+----------------------+-------------------+--------+
|FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|54123000|
|FROM SPARK-EXT ORC    |1989-01-05 01:02:03|54123000|
|FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|54123000|
|FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|54123000|
+----------------------+-------------------+--------+
res18: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [type: string, ts: timestamp ... 1 more field]
{code}
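As a side note on where the printed zone values enter the picture: {{java.sql.Timestamp.valueOf}} interprets a wall-clock string in the JVM default zone, which is what {{-Duser.timezone}} controls, so the JVM zone feeds directly into the epoch millis a timestamp literal maps to. A minimal standalone sketch (plain JDK, no Spark; this is an illustration of the mechanism, not the Spark code path itself):

```scala
import java.sql.Timestamp
import java.util.TimeZone

// The same wall-clock literal maps to different instants depending on the
// JVM default zone. Save and restore the default so the demo is side-effect free.
val saved = TimeZone.getDefault
val (utcMillis, laMillis) =
  try {
    TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
    val u = Timestamp.valueOf("1989-01-05 01:02:03").getTime
    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
    val la = Timestamp.valueOf("1989-01-05 01:02:03").getTime
    (u, la)
  } finally TimeZone.setDefault(saved)

println(utcMillis)            // 599965323000, the millis column of the UTC run
println(laMillis - utcMillis) // 28800000, the 8-hour PST offset in January
```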
[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299059#comment-17299059 ] L. C. Hsieh commented on SPARK-34675:
---
Thanks for pinging me, [~dongjoon]. So it looks like even in Spark 3 (not sure which version you used; is it the current master?), there are still some result differences between data sources? And I agree with [~dongjoon] that we cannot change Spark 2.4 for that.
[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298962#comment-17298962 ] Dongjoon Hyun commented on SPARK-34675:
---
Thank you for the verification, [~ShubhamChaurasia]. The result difference between data sources seems to be the real problem here because we cannot change Spark 2.4. BTW, the Apache Spark community is preparing Apache Spark 2.4.8 as an official EOL release this month (Release Manager: [~viirya]).
function to check and verify the returned timestamps > {code:scala} > scala> :paste > // Entering paste mode (ctrl-D to finish) > def showTs( > db: String, > tables: String* > ): org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = { > sql("use " + db).show > import scala.collection.mutable.ListBuffer > var results = new ListBuffer[org.apache.spark.sql.DataFrame]() > for (tbl <- tables) { > val query = "select * from " + tbl > println("Executing - " + query); > results += sql(query) > } > println("user.timezone - " + System.getProperty("user.timezone")) > println("TimeZone.getDefault - " + java.util.TimeZone.getDefault.getID) > println("spark.sql.session.timeZone - " + > spark.conf.get("spark.sql.session.timeZone")) > var unionDf = results(0) > for (i <- 1 until results.length) { > unionDf = unionDf.unionAll(results(i)) > } > val augmented = unionDf.map(r => (r.getString(0), r.getTimestamp(1), > r.getTimestamp(1).getTime)) > val renamed = augmented.withColumnRenamed("_1", > "type").withColumnRenamed("_2", "ts").withColumnRenamed("_3", "millis") > renamed.show(false) > return renamed > } > // Exiting paste mode, now interpreting. > scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text") > Hive Session ID = daa82b83-b50d-4038-97ee-1ecb2d01b368 > ++ > || > ++ > ++ > Executing - select * from spark_parquet > Executing - select * from spark_orc > Executing - select * from spark_avro > Executing - select * from spark_text > user.timezone - UTC > TimeZone.getDefault - UTC > spark.sql.session.timeZone - UTC > +--+---++ > > |type |ts |millis | > +--+---++ > |FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|599965323000| > |FROM SPARK-EXT ORC|1989-01-05 01:02:03|599965323000| > |FROM SPARK-EXT AVRO |1989-01-05 01:02:03|599965323000| > |FROM SPARK-EXT TEXT |1989-01-05 01:02:03|599965323000| > +--+---++ > {code} > 1. 
Set session timezone to America/Los_Angeles > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text") > ++ > || > ++ > ++ > Executing - select * from spark_parquet > Executing - select * from spark_orc > Executing - select * from spark_avro > Executing - select * from spark_text > user.timezone - UTC > TimeZone.getDefault - UTC > spark.sql.session.timeZone - America/Los_Angeles > +--+---++ > |type |ts |millis | > +--+--
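The UTC and America/Los_Angeles results above differ only in the rendered wall clock, not in the stored instant: the millis column stays 599965323000 while the displayed timestamp shifts by the zone offset. A minimal java.time sketch (illustrative only, not Spark code; `render` is a hypothetical helper) showing the same arithmetic:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class RenderInstant {
    // Render one absolute epoch-millis instant as a wall-clock string in the given zone.
    static String render(long millis, String zone) {
        return DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                .format(Instant.ofEpochMilli(millis).atZone(ZoneId.of(zone)));
    }

    public static void main(String[] args) {
        // The value the tables above report for '1989-01-05 01:02:03' written under UTC.
        long millis = 599965323000L;
        System.out.println(render(millis, "UTC"));                 // 1989-01-05 01:02:03
        System.out.println(render(millis, "America/Los_Angeles")); // 1989-01-04 17:02:03
    }
}
```

This is the behavior one would expect from a session-timezone change: one instant, two renderings.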
[jira] [Commented] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different
[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298524#comment-17298524 ]

Shubham Chaurasia commented on SPARK-34675:
-------------------------------------------

Thanks [~dongjoon] [~maxgekk]

Experiments with Spark 3:

{code:scala}
bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf spark.driver.extraJavaOptions='-Duser.timezone=UTC' --conf spark.executor.extraJavaOptions='-Duser.timezone=UTC'

scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
++
||
++
++
Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - UTC
TimeZone.getDefault - UTC
spark.sql.session.timeZone - UTC
+----------------------+-------------------+------------+
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599965323000|
+----------------------+-------------------+------------+
res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [type: string, ts: timestamp ... 1 more field]

scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
++
||
++
++
Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - UTC
TimeZone.getDefault - UTC
spark.sql.session.timeZone - America/Los_Angeles
+----------------------+-------------------+------------+
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT AVRO   |1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT TEXT   |1989-01-04 17:02:03|599965323000|
+----------------------+-------------------+------------+
{code}

When the JVM (and hence session) timezone = {{America/Los_Angeles}} (Avro is now aligned with Parquet, unlike on 2.4.x above; ORC and text behave as before):

{code:scala}
bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf spark.driver.extraJavaOptions='-Duser.timezone=America/Los_Angeles' --conf spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'

scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
++
||
++
++
Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - America/Los_Angeles
TimeZone.getDefault - America/Los_Angeles
spark.sql.session.timeZone - America/Los_Angeles
+----------------------+-------------------+------------+
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599994123000|
|FROM SPARK-EXT AVRO   |1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599994123000|
+----------------------+-------------------+------------+
{code}
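In the JVM=America/Los_Angeles run, ORC and text keep the original wall clock but map it to a different instant than Parquet and Avro. Interpreting the literal `1989-01-05 01:02:03` in each of the two zones reproduces the gap between the two millis values; a small java.time check (illustrative only, not Spark code; `toMillis` is a hypothetical helper):

```java
import java.time.LocalDateTime;
import java.time.ZoneId;

public class WallClockShift {
    // Epoch millis of the same wall-clock string interpreted in a given zone.
    static long toMillis(String wallClock, String zone) {
        return LocalDateTime.parse(wallClock)
                .atZone(ZoneId.of(zone))
                .toInstant()
                .toEpochMilli();
    }

    public static void main(String[] args) {
        long utc = toMillis("1989-01-05T01:02:03", "UTC");
        long la  = toMillis("1989-01-05T01:02:03", "America/Los_Angeles");
        System.out.println(utc);      // 599965323000
        System.out.println(la);       // 599994123000
        System.out.println(la - utc); // 28800000 ms = 8 hours (PST is UTC-8 in January)
    }
}
```

So the two groups of formats disagree on *which instant* the stored literal denotes, not merely on how to display it.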
[jira] [Commented] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different
[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298233#comment-17298233 ]

Maxim Gekk commented on SPARK-34675:
------------------------------------

> Set session timezone to America/Los_Angeles
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

Processing of dates/timestamps in Spark 2.4.x is based on Java 7 time APIs, where the JVM time zone is "hard coded" into the java.sql.Date/java.sql.Timestamp classes. So Spark 2.4.x cannot apply the session time zone in some cases. In Spark 3.x, most of these problems were solved; I would recommend trying the same experiment there.
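The coupling to the JVM default zone that Maxim describes can be reproduced outside Spark: `java.sql.Timestamp.valueOf` interprets its argument in the JVM default time zone, so the same literal yields different instants under different `-Duser.timezone` settings. A minimal sketch (illustrative only; `parseUnderZone` is a hypothetical helper that temporarily swaps the default zone):

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class JvmZoneCoupling {
    // Parse the same timestamp literal with a given JVM default zone in effect,
    // restoring the previous default afterwards.
    static long parseUnderZone(String literal, String zone) {
        TimeZone saved = TimeZone.getDefault();
        try {
            TimeZone.setDefault(TimeZone.getTimeZone(zone));
            return Timestamp.valueOf(literal).getTime();
        } finally {
            TimeZone.setDefault(saved);
        }
    }

    public static void main(String[] args) {
        long utc = parseUnderZone("1989-01-05 01:02:03", "UTC");
        long la  = parseUnderZone("1989-01-05 01:02:03", "America/Los_Angeles");
        // Same string, two different instants, 8 hours apart.
        System.out.println(la - utc); // 28800000
    }
}
```

This is why a session-timezone setting alone cannot fix 2.4.x paths that round-trip through java.sql.Timestamp.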
[jira] [Commented] (SPARK-34675) TimeZone inconsistencies when JVM and session timezones are different
[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298177#comment-17298177 ]

Dongjoon Hyun commented on SPARK-34675:
---------------------------------------

cc [~maxgekk]