[jira] [Created] (SPARK-30632) to_timestamp() doesn't work with certain timezones
Anton Daitche created SPARK-30632:
----------------------------------

             Summary: to_timestamp() doesn't work with certain timezones
                 Key: SPARK-30632
                 URL: https://issues.apache.org/jira/browse/SPARK-30632
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.4, 2.3.0
            Reporter: Anton Daitche


It seems that to_timestamp() doesn't work with timezones of the type <area>/<city>, e.g. America/Los_Angeles. The code

{code:scala}
val df = Seq(
  ("2019-01-24 11:30:00.123", "America/Los_Angeles"),
  ("2020-01-01 01:30:00.123", "PST")
).toDF("ts_str", "tz_name")

val ts_parsed = to_timestamp(
  concat_ws(" ", $"ts_str", $"tz_name"), "yyyy-MM-dd HH:mm:ss.SSS z"
).as("timestamp")

df.select(ts_parsed).show(false)
{code}

prints

{code}
+-------------------+
|timestamp          |
+-------------------+
|null               |
|2020-01-01 10:30:00|
+-------------------+
{code}

So, the datetime string with timezone PST is properly parsed, whereas the one with America/Los_Angeles is converted to null. According to [this|https://github.com/apache/spark/pull/24195#issuecomment-578055146] response on GitHub, this code works when run on the recent master version. See also the discussion in [this|https://github.com/apache/spark/pull/24195#issue] issue for more context.
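A possible user-side workaround on the affected 2.x versions is sketched below in PySpark: parse the <area>/<city> zone ID in a Python UDF instead of relying on the `z` pattern of to_timestamp(). This is only a sketch, not the fix referenced above; it assumes a running SparkSession `spark` and the pytz package on the executors, and `parse_with_zone` is a made-up helper name. Zone abbreviations such as PST are not IANA zone IDs and are not covered by it.

{code:python}
# Workaround sketch (not from the issue): parse <area>/<city> zone IDs in a
# Python UDF, since to_timestamp() on Spark 2.x returns null for them.
# Assumes a running SparkSession `spark` and pytz on the executors.
from datetime import datetime

import pytz
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

@udf(TimestampType())
def parse_with_zone(ts_str, tz_name):
    # Interpret the naive timestamp string in the given IANA zone; Spark
    # stores the returned timezone-aware datetime internally as a UTC instant.
    naive = datetime.strptime(ts_str, "%Y-%m-%d %H:%M:%S.%f")
    return pytz.timezone(tz_name).localize(naive)

df = spark.createDataFrame(
    [("2019-01-24 11:30:00.123", "America/Los_Angeles")],
    ["ts_str", "tz_name"])

df.select(parse_with_zone("ts_str", "tz_name").alias("timestamp")).show(truncate=False)
{code}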
[jira] [Updated] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Daitche updated SPARK-25244:
----------------------------------
    Description: 
The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Python's `datetime` objects, it is ignored and the system's timezone is used.

This can be checked by the following code snippet

{code:java}
import pyspark.sql

spark = (pyspark
         .sql
         .SparkSession
         .builder
         .master('local[1]')
         .config("spark.sql.session.timeZone", "UTC")
         .getOrCreate()
         )

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}

which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin)

{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}

Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my system's timezone.

The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySpark's `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone.

If the maintainers agree that this should be fixed, I would try to come up with a patch.

  was:
The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Python's `datetime` objects, it is ignored and the system's timezone is used.

This can be checked by the following code snippet

{code:java}
import pyspark.sql

spark = (pyspark
         .sql
         .SparkSession
         .builder
         .master('local[1]')
         .config("spark.sql.session.timeZone", "UTC")
         .getOrCreate()
         )

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}

which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin)

{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}

Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my system's timezone.

The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySpark's `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone.

If the maintainers agree that this should be fixed, I would be happy to contribute a patch.


> [Python] Setting `spark.sql.session.timeZone` only partially respected
> -----------------------------------------------------------------------
>
>                 Key: SPARK-25244
>                 URL: https://issues.apache.org/jira/browse/SPARK-25244
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>            Reporter: Anton Daitche
>            Priority: Major
>
> The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
> However, when timestamps are converted directly to Python's `datetime` objects, it is ignored and the system's timezone is used.
> This can be checked by the following code snippet
> {code:java}
> import pyspark.sql
>
> spark = (pyspark
>          .sql
>          .SparkSession
>          .builder
>          .master('local[1]')
>          .config("spark.sql.session.timeZone", "UTC")
>          .getOrCreate()
>          )
>
> df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
> df = df.withColumn("ts", df["ts"].astype("timestamp"))
>
> print(df.toPandas().iloc[0,0])
> print(df.collect()[0][0])
> {code}
> which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin)
> {code:java}
> 2018-06-01 01:00:00
> 2018-06-01 03:00:00
> {code}
> Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my system's timezone.
> The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySpark's `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone.
> If the maintainers agree that this should be fixed, I would try to come up with a patch.
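A sketch of a possible user-side mitigation for the behaviour described above, assuming the collected values are indeed rendered in the system time zone, that the session time zone has been set explicitly, and Python 3.6+; `as_session_tz` is a made-up helper name and pytz is assumed to be installed.

{code:python}
# Mitigation sketch (not from the issue): re-render naive datetimes returned
# by collect() in the configured session time zone.
import pytz

def as_session_tz(naive_dt, spark):
    """Re-render a naive datetime returned by collect() in the session time zone."""
    session_tz = pytz.timezone(spark.conf.get("spark.sql.session.timeZone"))
    # collect() produced the value in the system zone (per the report above):
    # attach the system zone, convert the instant, and drop tzinfo again.
    return naive_dt.astimezone().astimezone(session_tz).replace(tzinfo=None)

# Using the `df` and `spark` objects from the snippet in the issue:
print(as_session_tz(df.collect()[0][0], spark))  # expected: 2018-06-01 01:00:00
{code}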
[jira] [Created] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected
Anton Daitche created SPARK-25244:
----------------------------------

             Summary: [Python] Setting `spark.sql.session.timeZone` only partially respected
                 Key: SPARK-25244
                 URL: https://issues.apache.org/jira/browse/SPARK-25244
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.3.1
            Reporter: Anton Daitche


The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Python's `datetime` objects, it is ignored and the system's timezone is used.

This can be checked by the following code snippet

{code:java}
import pyspark.sql

spark = (pyspark
         .sql
         .SparkSession
         .builder
         .master('local[1]')
         .config("spark.sql.session.timeZone", "UTC")
         .getOrCreate()
         )

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}

which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin)

{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}

Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my system's timezone.

The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySpark's `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone.

If the maintainers agree that this should be fixed, I would be happy to contribute a patch.
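To make the reported cause more concrete, the sketch below illustrates (it is not PySpark's actual code) how a conversion based on `datetime.fromtimestamp()` renders an epoch value in the system time zone, whereas rendering the same instant in an explicitly chosen zone gives the value one would expect with `spark.sql.session.timeZone` set to UTC. The helper names and the hard-coded epoch value are made up for this example; pytz is assumed to be installed.

{code:python}
# Illustration (not from the issue): a system-zone conversion vs. an
# explicit-zone conversion of the same instant.
import datetime

import pytz

# 2018-06-01 01:00:00 UTC, expressed as microseconds since the epoch
# (the value used in the snippet above, computed by hand for this example).
epoch_micros = 1527814800 * 1000000

def render_in_system_zone(micros):
    # datetime.fromtimestamp() uses the system time zone, which matches the
    # behaviour the report describes for collect().
    return datetime.datetime.fromtimestamp(micros // 1000000)

def render_in_zone(micros, tz_name):
    # Rendering the same instant in an explicitly chosen zone instead.
    aware = datetime.datetime.fromtimestamp(micros // 1000000, tz=pytz.utc)
    return aware.astimezone(pytz.timezone(tz_name)).replace(tzinfo=None)

print(render_in_system_zone(epoch_micros))  # e.g. 2018-06-01 03:00:00 on Europe/Berlin
print(render_in_zone(epoch_micros, "UTC"))  # 2018-06-01 01:00:00
{code}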
[jira] [Updated] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Daitche updated SPARK-25244:
----------------------------------
    Description: 
The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Python's `datetime` objects, it is ignored and the system's timezone is used.

This can be checked by the following code snippet

{code:java}
import pyspark.sql

spark = (pyspark
         .sql
         .SparkSession
         .builder
         .master('local[1]')
         .config("spark.sql.session.timeZone", "UTC")
         .getOrCreate()
         )

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}

which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin)

{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}

Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my system's timezone.

The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySpark's `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone.

If the maintainers agree that this should be fixed, I would be happy to contribute a patch.

  was:
The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|[http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].] However, when timestamps are converted directly to Python's `datetime` objects, it is ignored and the system's timezone is used.

This can be checked by the following code snippet

{code:java}
import pyspark.sql

spark = (pyspark
         .sql
         .SparkSession
         .builder
         .master('local[1]')
         .config("spark.sql.session.timeZone", "UTC")
         .getOrCreate()
         )

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}

which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin)

{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}

Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my system's timezone.

The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySpark's `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone.

If the maintainers agree that this should be fixed, I would be happy to contribute a patch.


> [Python] Setting `spark.sql.session.timeZone` only partially respected
> -----------------------------------------------------------------------
>
>                 Key: SPARK-25244
>                 URL: https://issues.apache.org/jira/browse/SPARK-25244
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>            Reporter: Anton Daitche
>            Priority: Major
>
> The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
> However, when timestamps are converted directly to Python's `datetime` objects, it is ignored and the system's timezone is used.
> This can be checked by the following code snippet
> {code:java}
> import pyspark.sql
>
> spark = (pyspark
>          .sql
>          .SparkSession
>          .builder
>          .master('local[1]')
>          .config("spark.sql.session.timeZone", "UTC")
>          .getOrCreate()
>          )
>
> df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
> df = df.withColumn("ts", df["ts"].astype("timestamp"))
>
> print(df.toPandas().iloc[0,0])
> print(df.collect()[0][0])
> {code}
> which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin)
> {code:java}
> 2018-06-01 01:00:00
> 2018-06-01 03:00:00
> {code}
> Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my system's timezone.
> The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySpark's `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone.
> If the maintainers agree that this should be fixed, I would be happy to contribute a patch.
[jira] [Updated] (SPARK-25130) Wrong timestamp returned by toPandas
[ https://issues.apache.org/jira/browse/SPARK-25130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Daitche updated SPARK-25130:
----------------------------------
    Description: 
The code snippet

{code:java}
import datetime

df = spark.createDataFrame([(datetime.datetime(1901, 1, 1, 0, 0, 0),)], ["ts"])

print("collect:", df.collect()[0][0])
print("toPandas:", df.toPandas().iloc[0, 0])
{code}

prints

{code:java}
collect: 1901-01-01 00:00:00
toPandas: 1900-12-31 23:53:00
{code}

Hence the toPandas method seems to convert the timestamp wrongly. The problem disappears for "1902-01-01 00:00:00" and later dates (I didn't do an exhaustive test though).

  was:
The code snippet

{code:java}
import datetime

df = spark.createDataFrame([(datetime.datetime(1901, 1, 1, 0, 0, 0),)], ["ts"])

print(df.collect())
print(df.toPandas())
{code}

prints

{code:java}
collect: 1901-01-01 00:00:00
toPandas: 1900-12-31 23:53:00
{code}

Hence the toPandas method seems to convert the timestamp wrongly. The problem disappears for "1902-01-01 00:00:00" and later dates (I didn't do an exhaustive test though).


> Wrong timestamp returned by toPandas
> ------------------------------------
>
>                 Key: SPARK-25130
>                 URL: https://issues.apache.org/jira/browse/SPARK-25130
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.3.0, 2.3.1
>         Environment: Tested with version 2.3.1 on OSX and 2.3.0 on Linux.
>            Reporter: Anton Daitche
>            Priority: Major
>
> The code snippet
> {code:java}
> import datetime
> df = spark.createDataFrame([(datetime.datetime(1901, 1, 1, 0, 0, 0),)], ["ts"])
> print("collect:", df.collect()[0][0])
> print("toPandas:", df.toPandas().iloc[0, 0])
> {code}
> prints
> {code:java}
> collect: 1901-01-01 00:00:00
> toPandas: 1900-12-31 23:53:00
> {code}
> Hence the toPandas method seems to convert the timestamp wrongly.
> The problem disappears for "1902-01-01 00:00:00" and later dates (I didn't do an exhaustive test though).
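Since the report notes that the cutoff was not tested exhaustively, below is a small diagnostic sketch (not from the issue) that scans New Year's Day over a range of years and prints those for which collect() and toPandas() disagree. It assumes a running SparkSession `spark` and that Arrow-based conversion is not enabled.

{code:python}
# Diagnostic sketch: find years around the reported cutoff where collect()
# and toPandas() return different timestamps for the same stored value.
import datetime

for year in range(1880, 1910):
    dt = datetime.datetime(year, 1, 1)
    df = spark.createDataFrame([(dt,)], ["ts"])
    collected = df.collect()[0][0]
    via_pandas = df.toPandas().iloc[0, 0].to_pydatetime()
    if collected != via_pandas:
        print(year, collected, via_pandas)
{code}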
[jira] [Updated] (SPARK-25130) [Python] Wrong timestamp returned by toPandas
[ https://issues.apache.org/jira/browse/SPARK-25130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Daitche updated SPARK-25130:
----------------------------------
    Summary: [Python] Wrong timestamp returned by toPandas  (was: Wrong timestamp returned by toPandas)


> [Python] Wrong timestamp returned by toPandas
> ---------------------------------------------
>
>                 Key: SPARK-25130
>                 URL: https://issues.apache.org/jira/browse/SPARK-25130
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.3.0, 2.3.1
>         Environment: Tested with version 2.3.1 on OSX and 2.3.0 on Linux.
>            Reporter: Anton Daitche
>            Priority: Major
>
> The code snippet
> {code:java}
> import datetime
> df = spark.createDataFrame([(datetime.datetime(1901, 1, 1, 0, 0, 0),)], ["ts"])
> print("collect:", df.collect()[0][0])
> print("toPandas:", df.toPandas().iloc[0, 0])
> {code}
> prints
> {code:java}
> collect: 1901-01-01 00:00:00
> toPandas: 1900-12-31 23:53:00
> {code}
> Hence the toPandas method seems to convert the timestamp wrongly.
> The problem disappears for "1902-01-01 00:00:00" and later dates (I didn't do an exhaustive test though).
[jira] [Created] (SPARK-25130) Wrong timestamp returned by toPandas
Anton Daitche created SPARK-25130:
----------------------------------

             Summary: Wrong timestamp returned by toPandas
                 Key: SPARK-25130
                 URL: https://issues.apache.org/jira/browse/SPARK-25130
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.3.1, 2.3.0
         Environment: Tested with version 2.3.1 on OSX and 2.3.0 on Linux.
            Reporter: Anton Daitche


The code snippet

{code:java}
import datetime

df = spark.createDataFrame([(datetime.datetime(1901, 1, 1, 0, 0, 0),)], ["ts"])

print(df.collect())
print(df.toPandas())
{code}

prints

{code:java}
collect: 1901-01-01 00:00:00
toPandas: 1900-12-31 23:53:00
{code}

Hence the toPandas method seems to convert the timestamp wrongly. The problem disappears for "1902-01-01 00:00:00" and later dates (I didn't do an exhaustive test though).