[jira] [Created] (SPARK-30632) to_timestamp() doesn't work with certain timezones

2020-01-24 Thread Anton Daitche (Jira)
Anton Daitche created SPARK-30632:
-

 Summary: to_timestamp() doesn't work with certain timezones
 Key: SPARK-30632
 URL: https://issues.apache.org/jira/browse/SPARK-30632
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.4, 2.3.0
Reporter: Anton Daitche


It seems that to_timestamp() doesn't work with timezones of the region/city 
form, e.g. America/Los_Angeles.

The code

{code:scala}
import org.apache.spark.sql.functions.{concat_ws, to_timestamp}
import spark.implicits._  // `spark` is the active SparkSession, as in spark-shell

val df = Seq(
  ("2019-01-24 11:30:00.123", "America/Los_Angeles"),
  ("2020-01-01 01:30:00.123", "PST")
).toDF("ts_str", "tz_name")

val ts_parsed = to_timestamp(
  concat_ws(" ", $"ts_str", $"tz_name"), "yyyy-MM-dd HH:mm:ss.SSS z"
).as("timestamp")

df.select(ts_parsed).show(false)
{code}

prints


{code}
+-------------------+
|timestamp          |
+-------------------+
|null               |
|2020-01-01 10:30:00|
+-------------------+
{code}

So the datetime string with timezone PST is parsed correctly, whereas the one 
with America/Los_Angeles is converted to null. According to 
[this response|https://github.com/apache/spark/pull/24195#issuecomment-578055146] 
on GitHub, the code works when run on a recent master build.

See also the discussion in 
[this pull request|https://github.com/apache/spark/pull/24195#issue] for more context.
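
Until then, a possible workaround (an untested PySpark sketch of mine, assuming 
Spark >= 2.4 and that only the parsing of region/city zone names is broken): 
parse the string without the zone designator and hand the zone column to 
to_utc_timestamp, which accepts region/city IDs and, since 2.4, a per-row 
Column. Note the result is the instant expressed in UTC rather than in the 
session timezone:

{code:python}
# Hedged workaround sketch (PySpark >= 2.4): parse without the zone, then let
# to_utc_timestamp interpret the naive timestamp in the per-row zone.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = spark.createDataFrame(
    [("2019-01-24 11:30:00.123", "America/Los_Angeles"),
     ("2020-01-01 01:30:00.123", "PST")],
    ["ts_str", "tz_name"])

parsed = F.to_timestamp(F.col("ts_str"), "yyyy-MM-dd HH:mm:ss.SSS")
df.select(F.to_utc_timestamp(parsed, F.col("tz_name")).alias("ts_utc")).show(truncate=False)
{code}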




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected

2018-08-26 Thread Anton Daitche (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Daitche updated SPARK-25244:
--
Description: 
The setting `spark.sql.session.timeZone` is respected by PySpark when 
converting from and to Pandas, as described 
[here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
However, when timestamps are converted directly to Python's `datetime` objects, 
it is ignored and the system's timezone is used.

This can be checked with the following code snippet:
{code:python}
import pyspark.sql

spark = (pyspark
 .sql
 .SparkSession
 .builder
 .master('local[1]')
 .config("spark.sql.session.timeZone", "UTC")
 .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}
For me, this prints (the exact result depends on your system's timezone; 
mine is Europe/Berlin):
{code}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method `toPandas` respected the timezone setting (UTC), but the 
method `collect` ignored it and converted the timestamp to my system's timezone.

The cause of this behaviour is that the methods `toInternal` and 
`fromInternal` of PySpark's `TimestampType` class don't take the setting 
`spark.sql.session.timeZone` into account and use the system timezone instead.
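
As an interim workaround (an untested sketch of mine, not part of the original 
report): formatting on the JVM side does honour the session timezone, so 
collecting strings instead of `datetime` objects sidesteps the conversion in 
`fromInternal`:

{code:python}
# Hedged workaround sketch: date_format runs on the JVM and honours
# spark.sql.session.timeZone, so the collected string reflects the session
# zone (UTC here) rather than the driver's system timezone.
from pyspark.sql import functions as F

row = df.select(
    F.date_format("ts", "yyyy-MM-dd HH:mm:ss").alias("ts_str")
).collect()[0]
print(row.ts_str)  # expected: 2018-06-01 01:00:00
{code}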

If the maintainers agree that this should be fixed, I would try to come up 
with a patch.


[jira] [Created] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected

2018-08-26 Thread Anton Daitche (JIRA)
Anton Daitche created SPARK-25244:
-

 Summary: [Python] Setting `spark.sql.session.timeZone` only 
partially respected
 Key: SPARK-25244
 URL: https://issues.apache.org/jira/browse/SPARK-25244
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.1
Reporter: Anton Daitche


The setting `spark.sql.session.timeZone` is respected by PySpark when 
converting from and to Pandas, as described 
[here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
However, when timestamps are converted directly to Python's `datetime` objects, 
it is ignored and the system's timezone is used.

This can be checked with the following code snippet:
{code:python}
import pyspark.sql

spark = (pyspark
 .sql
 .SparkSession
 .builder
 .master('local[1]')
 .config("spark.sql.session.timeZone", "UTC")
 .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}
For me, this prints (the exact result depends on your system's timezone; 
mine is Europe/Berlin):
{code}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method `toPandas` respected the timezone setting (UTC), but the 
method `collect` ignored it and converted the timestamp to my system's timezone.

The cause of this behaviour is that the methods `toInternal` and 
`fromInternal` of PySpark's `TimestampType` class don't take the setting 
`spark.sql.session.timeZone` into account and use the system timezone instead.
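
For illustration, a condensed sketch of the suspected mechanism (my 
simplification of the conversions in `pyspark/sql/types.py`; the real methods 
also handle `None` and timezone-aware inputs):

{code:python}
# Condensed sketch of TimestampType's conversions (assumption: simplified,
# no error handling). Both directions go through the *system* timezone;
# nothing here can see spark.sql.session.timeZone.
import datetime
import time

def to_internal(dt):
    # time.mktime interprets the naive datetime in the system timezone
    seconds = time.mktime(dt.timetuple())
    return int(seconds) * 1000000 + dt.microsecond

def from_internal(ts):
    # datetime.fromtimestamp converts back via the system timezone as well
    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
{code}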

If the maintainers agree that this should be fixed, I would be happy to 
contribute a patch.






[jira] [Updated] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected

2018-08-26 Thread Anton Daitche (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Daitche updated SPARK-25244:
--
Description: 
The setting `spark.sql.session.timeZone` is respected by PySpark when 
converting from and to Pandas, as described 
[here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
However, when timestamps are converted directly to Python's `datetime` objects, 
it is ignored and the system's timezone is used.

This can be checked with the following code snippet:
{code:python}
import pyspark.sql

spark = (pyspark
 .sql
 .SparkSession
 .builder
 .master('local[1]')
 .config("spark.sql.session.timeZone", "UTC")
 .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}
For me, this prints (the exact result depends on your system's timezone; 
mine is Europe/Berlin):
{code}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method `toPandas` respected the timezone setting (UTC), but the 
method `collect` ignored it and converted the timestamp to my system's timezone.

The cause of this behaviour is that the methods `toInternal` and 
`fromInternal` of PySpark's `TimestampType` class don't take the setting 
`spark.sql.session.timeZone` into account and use the system timezone instead.

If the maintainers agree that this should be fixed, I would be happy to 
contribute a patch.


[jira] [Updated] (SPARK-25130) Wrong timestamp returned by toPandas

2018-08-16 Thread Anton Daitche (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Daitche updated SPARK-25130:
--
Description: 
The code snippet
{code:python}
import datetime

# assumes an existing SparkSession named `spark`, e.g. from the pyspark shell
df = spark.createDataFrame([(datetime.datetime(1901, 1, 1, 0, 0, 0),)], ["ts"])
print("collect:", df.collect()[0][0])
print("toPandas:", df.toPandas().iloc[0, 0])
{code}
prints
{code}
collect: 1901-01-01 00:00:00
toPandas: 1900-12-31 23:53:00
{code}
Hence the toPandas method seems to convert the timestamp incorrectly.

The problem disappears for "1902-01-01 00:00:00" and later dates (I didn't do 
an exhaustive test though).
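
The 1902 cutoff is suggestive (my own diagnosis attempt, not taken from the 
Spark source): many timezone transition tables start at 1901-12-13, the 32-bit 
time_t minimum, and instants before a zone's first recorded transition fall 
back to Local Mean Time, which pytz rounds to whole minutes. For Europe/Berlin 
that is UTC+00:53, exactly 7 minutes short of CET, matching the shift above. 
A quick check with pytz:

{code:python}
# Hedged illustration: pre-1902 instants pick up the zone's LMT offset.
import datetime
import pytz

tz = pytz.timezone("Europe/Berlin")
for year in (1901, 1902):
    dt = tz.localize(datetime.datetime(year, 1, 1))
    print(year, dt.tzname(), dt.utcoffset())
# expected: 1901 LMT 0:53:00  /  1902 CET 1:00:00
{code}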







[jira] [Updated] (SPARK-25130) [Python] Wrong timestamp returned by toPandas

2018-08-16 Thread Anton Daitche (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Daitche updated SPARK-25130:
--
Summary: [Python] Wrong timestamp returned by toPandas  (was: Wrong 
timestamp returned by toPandas)







[jira] [Created] (SPARK-25130) Wrong timestamp returned by toPandas

2018-08-16 Thread Anton Daitche (JIRA)
Anton Daitche created SPARK-25130:
-

 Summary: Wrong timestamp returned by toPandas
 Key: SPARK-25130
 URL: https://issues.apache.org/jira/browse/SPARK-25130
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.3.1, 2.3.0
 Environment: Tested with version 2.3.1 on OSX and 2.3.0 on Linux.
Reporter: Anton Daitche


The code snippet
{code:python}
import datetime

# assumes an existing SparkSession named `spark`, e.g. from the pyspark shell
df = spark.createDataFrame([(datetime.datetime(1901, 1, 1, 0, 0, 0),)], ["ts"])
print("collect:", df.collect()[0][0])
print("toPandas:", df.toPandas().iloc[0, 0])
{code}
prints
{code}
collect: 1901-01-01 00:00:00
toPandas: 1900-12-31 23:53:00
{code}
Hence the toPandas method seems to convert the timestamp incorrectly.

The problem disappears for "1902-01-01 00:00:00" and later dates (I didn't do 
an exhaustive test though).
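
A possible workaround for affected dates (an untested sketch, not verified 
against this exact setup): format the timestamp to a string on the JVM side, 
where the old dates round-trip correctly per the collect() output above, and 
re-parse it in pandas:

{code:python}
# Hedged workaround sketch: bypass toPandas' epoch-based conversion by
# collecting a JVM-formatted string and parsing it with pandas.
from pyspark.sql import functions as F
import pandas as pd

pdf = df.select(F.date_format("ts", "yyyy-MM-dd HH:mm:ss").alias("ts_str")).toPandas()
pdf["ts"] = pd.to_datetime(pdf["ts_str"])
print(pdf["ts"].iloc[0])  # expected: 1901-01-01 00:00:00
{code}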


