Toby Harradine created SPARK-32123:
--------------------------------------

             Summary: [Python] Setting `spark.sql.session.timeZone` only partially respected
                 Key: SPARK-32123
                 URL: https://issues.apache.org/jira/browse/SPARK-32123
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.3.1
            Reporter: Toby Harradine
The setting `spark.sql.session.timeZone` is respected by PySpark when converting to and from Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Python `datetime` objects, it is ignored and the system's timezone is used instead. This can be verified with the following code snippet:

{code:java}
import pyspark.sql

spark = (pyspark
         .sql
         .SparkSession
         .builder
         .master('local[1]')
         .config("spark.sql.session.timeZone", "UTC")
         .getOrCreate())

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0, 0])
print(df.collect()[0][0])
{code}

For me this prints the following (the exact result depends on your system's timezone; mine is Europe/Berlin):

{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}

Hence, the method `toPandas` respects the timezone setting (UTC), but the method `collect` ignores it and converts the timestamp to my system's timezone.

The cause of this behaviour is that the methods `toInternal` and `fromInternal` of PySpark's `TimestampType` class do not take the setting `spark.sql.session.timeZone` into account and instead use the system timezone.

If the maintainers agree that this should be fixed, I would try to come up with a patch.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
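The root cause described above can be illustrated without Spark at all: Python's `datetime.fromtimestamp` interprets an epoch value in the system's local timezone unless an explicit `tz` argument is passed, which is essentially the conversion `TimestampType.fromInternal` relies on. A minimal sketch (the epoch value below is chosen for illustration and corresponds to 2018-06-01 01:00:00 UTC):

{code:java}
from datetime import datetime, timezone

# Epoch seconds for 2018-06-01 01:00:00 UTC (illustrative value)
epoch_seconds = 1527814800

# No tz argument: the result is rendered in the *system* timezone,
# e.g. 2018-06-01 03:00:00 on a Europe/Berlin machine.
local_dt = datetime.fromtimestamp(epoch_seconds)

# Explicit tz argument: the result is rendered in the requested zone,
# independent of the machine's settings.
utc_dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

print(local_dt)  # depends on the system timezone
print(utc_dt)    # 2018-06-01 01:00:00+00:00
{code}

A fix along these lines would presumably have `fromInternal` pass the zone derived from `spark.sql.session.timeZone` instead of relying on the no-argument default.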