[GitHub] [spark] juliuszsompolski opened a new pull request #28671: [SPARK-31859][SPARK-31861][SPARK-31863] Fix Thriftserver session timezone issues

GitBox Thu, 28 May 2020 15:54:14 -0700


juliuszsompolski opened a new pull request #28671:
URL: https://github.com/apache/spark/pull/28671



   ### What changes were proposed in this pull request?
   
   Timestamp literals in Spark are interpreted as timestamps in the 
session-local timezone, spark.sql.session.timeZone.
   
   If a JDBC client is e.g. in timezone UTC-7, sets 
spark.sql.session.timeZone to PST, and sends the query "SELECT timestamp 
'2020-05-20 12:00:00'", while the JVM timezone of the Spark cluster is e.g. 
UTC+2, then what currently happens is:
   * The timestamp literal in the query is interpreted as 12:00:00 UTC-7, i.e. 
19:00:00 UTC.
   * When it's returned from the query, it is collected as a java.sql.Timestamp 
object with Dataset.collect(), and put into a Thriftserver RowSet.
   * Before sending it over the wire, the Timestamp is converted to String. 
This happens explicitly in ColumnValue for RowBasedSet, and implicitly in 
ColumnBuffer for ColumnBasedSet (all non-primitive types are converted with 
toString() there). The toString() conversion uses the JVM timezone, which 
results in a "21:00:00" (UTC+2) string representation.
   * The client JDBC application gets a "21:00:00" Timestamp back (in 
its JVM timezone; if the JDBC application cares about the correct UTC internal 
value, it should set spark.sql.session.timeZone to be consistent with its JVM 
timezone).
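
   The steps above can be reproduced outside Spark with plain JDK classes. The following is only a sketch of the effect, not Thriftserver code: the class and method names are made up for illustration, a fixed UTC-7 offset stands in for the session timezone, and the cluster's JVM timezone is simulated by temporarily changing the JVM default.

```java
import java.sql.Timestamp;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.util.TimeZone;

// Hypothetical demo class; none of these names exist in Spark.
class TimestampToStringDemo {

    // Interpret a wall-clock literal in the given session timezone (first bullet).
    static Instant parseInSessionZone(String literal, ZoneId sessionZone) {
        return LocalDateTime.parse(literal.replace(' ', 'T'))
                .atZone(sessionZone)
                .toInstant();
    }

    // Render through java.sql.Timestamp.toString(), which always uses the
    // JVM default timezone (third bullet). The default is restored afterwards.
    static String renderWithJvmZone(Instant instant, TimeZone jvmZone) {
        TimeZone saved = TimeZone.getDefault();
        TimeZone.setDefault(jvmZone);
        try {
            return Timestamp.from(instant).toString();
        } finally {
            TimeZone.setDefault(saved);
        }
    }

    public static void main(String[] args) {
        Instant instant =
                parseInSessionZone("2020-05-20 12:00:00", ZoneOffset.ofHours(-7));
        System.out.println(instant); // 2020-05-20T19:00:00Z
        // Simulated server JVM in UTC+2: the wall-clock value shifts to 21:00.
        System.out.println(
                renderWithJvmZone(instant, TimeZone.getTimeZone("GMT+2")));
    }
}
```

   The instant itself is correct throughout; only the string rendering depends on which timezone the rendering side happens to use.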
   
   The problem is caused by the conversion happening in Thriftserver RowSet 
with the generic toString() function, instead of using 
HiveResults.toHiveString(), which takes care of correct, timezone-respecting 
conversions. This PR fixes it by converting the Timestamp values to String 
earlier, in SparkExecuteStatementOperation, using that function. This fixes 
SPARK-31861.
   
   Thriftserver also did not work with spark.sql.datetime.java8API.enabled, 
because the conversions in RowSet expected a Timestamp object instead of an 
Instant object. Using HiveResults.toHiveString() also fixes that. For this 
reason, we also convert Date values in SparkExecuteStatementOperation, so that 
HiveResults.toHiveString() handles LocalDate as well. This fixes SPARK-31859.
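
   To illustrate the idea of a single conversion point that accepts both the legacy and the Java 8 date/time types and formats timestamps in an explicit session timezone, here is a rough sketch. It is not HiveResults.toHiveString(): the class name, the method toDisplayString, and the output pattern are assumptions chosen for the example.

```java
import java.sql.Date;
import java.sql.Timestamp;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical stand-in for a timezone-aware value-to-string conversion.
class DateTimeValueFormatDemo {
    // Assumed output pattern for the example, without fractional seconds.
    private static final DateTimeFormatter TS_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    static String toDisplayString(Object value, ZoneId sessionZone) {
        if (value instanceof Timestamp) {
            // Legacy type: convert to an instant, then format in the session zone.
            Instant instant = ((Timestamp) value).toInstant();
            return TS_FORMAT.withZone(sessionZone).format(instant);
        } else if (value instanceof Instant) {
            // Java 8 type, as produced when the java8API setting is enabled.
            return TS_FORMAT.withZone(sessionZone).format((Instant) value);
        } else if (value instanceof Date) {
            // Legacy date type (java.sql.Date).
            return ((Date) value).toLocalDate().toString();
        } else if (value instanceof LocalDate) {
            // Java 8 date type; toString() is already ISO yyyy-MM-dd.
            return value.toString();
        }
        return String.valueOf(value);
    }

    public static void main(String[] args) {
        ZoneId sessionZone = ZoneOffset.ofHours(-7);
        Instant instant = Instant.parse("2020-05-20T19:00:00Z");
        System.out.println(toDisplayString(instant, sessionZone));
        System.out.println(toDisplayString(Timestamp.from(instant), sessionZone));
        System.out.println(toDisplayString(LocalDate.of(2020, 5, 20), sessionZone));
    }
}
```

   Because the timezone is passed in explicitly, the timestamp output no longer depends on the JVM default of the process doing the conversion.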
   
   Thriftserver also did not correctly set the active SparkSession. Because of 
that, configuration obtained using SQLConf.get was not the correct session 
configuration. This affected getting the correct spark.sql.session.timeZone. It 
is fixed by extending the use of 
SparkExecuteStatementOperation.withSchedulerPool to also set the correct active 
SparkSession. When the correct session is set, we also no longer need to 
maintain the pool mapping in a sessionToActivePool map. The scheduler pool can 
simply be retrieved from the session config. "withSchedulerPool" is 
renamed to "withLocalProperties" and moved into a mixin helper trait, because 
it should be applied with every operation. This fixes SPARK-31863.
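
   The general pattern behind such a helper, setting per-thread state for the duration of an operation and restoring it afterwards, can be sketched in plain Java. This is only an analogy under stated assumptions: the ThreadLocal below stands in for the active session's configuration, and withSessionTimeZone is a hypothetical name, not the withLocalProperties method from this PR.

```java
import java.util.function.Supplier;

// Hypothetical illustration of a "run body with thread-local state set" helper.
class ActiveSessionDemo {
    // Stand-in for per-thread active session config; defaults to "UTC".
    private static final ThreadLocal<String> ACTIVE_TIME_ZONE =
            ThreadLocal.withInitial(() -> "UTC");

    // Analogous to code that reads configuration from the active session.
    static String currentTimeZone() {
        return ACTIVE_TIME_ZONE.get();
    }

    // Set the thread-local for the duration of body, then restore the old value
    // even if body throws.
    static <T> T withSessionTimeZone(String zone, Supplier<T> body) {
        String previous = ACTIVE_TIME_ZONE.get();
        ACTIVE_TIME_ZONE.set(zone);
        try {
            return body.get();
        } finally {
            ACTIVE_TIME_ZONE.set(previous);
        }
    }

    public static void main(String[] args) {
        String inside = withSessionTimeZone("PST", ActiveSessionDemo::currentTimeZone);
        System.out.println(inside);            // value seen inside the operation
        System.out.println(currentTimeZone()); // value restored afterwards
    }
}
```

   Wrapping every operation this way means code deep in the call stack sees the right session settings without them being passed through each call.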

