[ https://issues.apache.org/jira/browse/SPARK-33863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nasir Ali updated SPARK-33863:
------------------------------
    Description: 
*Problem*: If I create a new column using a UDF, the PySpark UDF changes timestamps into UTC time. I have used the following configs to let Spark know the timestamps are in UTC:

{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}

Below is a code snippet to reproduce the error:

{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import datetime

spark = SparkSession.builder.config("spark.sql.session.timeZone", "UTC").getOrCreate()

df = spark.createDataFrame([("usr1", 17.00, "2018-03-10T15:27:18+00:00"),
                            ("usr1", 13.00, "2018-03-11T12:27:18+00:00"),
                            ("usr1", 25.00, "2018-03-12T11:27:18+00:00"),
                            ("usr1", 20.00, "2018-03-13T15:27:18+00:00"),
                            ("usr1", 17.00, "2018-03-14T12:27:18+00:00"),
                            ("usr2", 99.00, "2018-03-15T11:27:18+00:00"),
                            ("usr2", 156.00, "2018-03-22T11:27:18+00:00"),
                            ("usr2", 17.00, "2018-03-31T11:27:18+00:00"),
                            ("usr2", 25.00, "2018-03-15T11:27:18+00:00"),
                            ("usr2", 25.00, "2018-03-16T11:27:18+00:00")],
                           ["user", "id", "ts"])
df = df.withColumn('ts', df.ts.cast('timestamp'))
df.show(truncate=False)

def some_time_udf(i):
    tmp = ""
    if datetime.time(5, 0) <= i.time() < datetime.time(12, 0):
        tmp = "Morning -> " + str(i)
    return tmp

udf = F.udf(some_time_udf, StringType())
df.withColumn("day_part", udf(df.ts)).show(truncate=False)
{code}

I have concatenated the timestamps with a string to show that PySpark passes the timestamps as UTC.

  was:
*Problem*: If I create a new column using a UDF, the PySpark UDF changes timestamps into UTC time.
I have used the following configs to let Spark know the timestamps are in UTC:

{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}

Below is a code snippet to reproduce the error:

{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import datetime

spark = SparkSession.builder.config("spark.sql.session.timeZone", "UTC").getOrCreate()

df = spark.createDataFrame([("usr1", 17.00, "2018-03-10T15:27:18+00:00"),
                            ("usr1", 13.00, "2018-03-11T12:27:18+00:00"),
                            ("usr1", 25.00, "2018-03-12T11:27:18+00:00"),
                            ("usr1", 20.00, "2018-03-13T15:27:18+00:00"),
                            ("usr1", 17.00, "2018-03-14T12:27:18+00:00"),
                            ("usr2", 99.00, "2018-03-15T11:27:18+00:00"),
                            ("usr2", 156.00, "2018-03-22T11:27:18+00:00"),
                            ("usr2", 17.00, "2018-03-31T11:27:18+00:00"),
                            ("usr2", 25.00, "2018-03-15T11:27:18+00:00"),
                            ("usr2", 25.00, "2018-03-16T11:27:18+00:00")],
                           ["user", "id", "ts"])
df = df.withColumn('ts', df.ts.cast('timestamp'))
df.show(truncate=False)

def some_time_udf(i):
    tmp = ""
    if datetime.time(5, 0) <= i.time() < datetime.time(12, 0):
        tmp = "Morning -> " + str(i)
    elif datetime.time(12, 0) <= i.time() < datetime.time(17, 0):
        tmp = "Afternoon -> " + str(i)
    elif datetime.time(17, 0) <= i.time() < datetime.time(21, 0):
        tmp = "Evening -> " + str(i)
    elif datetime.time(21, 0) <= i.time():
        # Note: the original upper bound `< datetime.time(0, 0)` could never
        # match, because time(0, 0) is midnight, the smallest possible time.
        tmp = "Night -> " + str(i)
    elif datetime.time(0, 0) <= i.time() < datetime.time(5, 0):
        tmp = "Night -> " + str(i)
    return tmp

sometimeudf = F.udf(some_time_udf, StringType())
df.withColumn("day_part", sometimeudf("ts")).show(truncate=False)
{code}

I have concatenated the timestamps with a string to show that PySpark passes the timestamps as UTC.
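As a plain-Python restatement of the UDF above (an editorial sketch, not part of the original report): the day-part bucketing logic can be checked on the driver without Spark at all, which helps isolate whether the shift comes from PySpark's timestamp transfer rather than from the bucketing itself. The `day_part` helper name is hypothetical; the hour boundaries follow the reporter's UDF, with the unreachable `21:00 <= t < 00:00` branch corrected:

```python
import datetime

def day_part(ts):
    """Classify a timestamp into a day part, mirroring the reporter's UDF."""
    t = ts.time()
    if datetime.time(5, 0) <= t < datetime.time(12, 0):
        return "Morning -> " + str(ts)
    if datetime.time(12, 0) <= t < datetime.time(17, 0):
        return "Afternoon -> " + str(ts)
    if datetime.time(17, 0) <= t < datetime.time(21, 0):
        return "Evening -> " + str(ts)
    # Everything from 21:00-24:00 and 00:00-05:00 is night. The original
    # `time(21, 0) <= t < time(0, 0)` condition was always False, since
    # time(0, 0) is midnight, the minimum time value.
    return "Night -> " + str(ts)

print(day_part(datetime.datetime(2018, 3, 10, 15, 27, 18)))
```

Feeding this the naive `datetime` that PySpark hands to the UDF makes it easy to see which wall-clock time the worker actually received.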
> Pyspark UDF changes timestamps to UTC
> -------------------------------------
>
>                 Key: SPARK-33863
>                 URL: https://issues.apache.org/jira/browse/SPARK-33863
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.1
>         Environment: MAC/Linux
>                      Standalone cluster / local machine
>            Reporter: Nasir Ali
>            Priority: Major
>
> *Problem*:
> If I create a new column using udf, pyspark udf changes timestamps into UTC
> time. I have used following configs to let spark know the timestamps are in
> UTC:
>
> {code:java}
> --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC
> --conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
> --conf spark.sql.session.timeZone=UTC
> {code}
>
> Below is a code snippet to reproduce the error:
>
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from pyspark.sql.types import StringType
> import datetime
>
> spark = SparkSession.builder.config("spark.sql.session.timeZone",
>                                     "UTC").getOrCreate()
>
> df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
>                             ("usr1",13.00, "2018-03-11T12:27:18+00:00"),
>                             ("usr1",25.00, "2018-03-12T11:27:18+00:00"),
>                             ("usr1",20.00, "2018-03-13T15:27:18+00:00"),
>                             ("usr1",17.00, "2018-03-14T12:27:18+00:00"),
>                             ("usr2",99.00, "2018-03-15T11:27:18+00:00"),
>                             ("usr2",156.00, "2018-03-22T11:27:18+00:00"),
>                             ("usr2",17.00, "2018-03-31T11:27:18+00:00"),
>                             ("usr2",25.00, "2018-03-15T11:27:18+00:00"),
>                             ("usr2",25.00, "2018-03-16T11:27:18+00:00")
>                            ],
>                            ["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.show(truncate=False)
>
> def some_time_udf(i):
>     tmp=""
>     if datetime.time(5, 0)<=i.time() < datetime.time(12, 0):
>         tmp="Morning -> "+str(i)
>     return tmp
>
> udf = F.udf(some_time_udf,StringType())
> df.withColumn("day_part", udf(df.ts)).show(truncate=False)
> {code}
>
> I have concatenated timestamps with the string to show that pyspark pass
> timestamps as UTC.
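The reported symptom hinges on the fact that the same instant has a different wall-clock time in each timezone, so rendering a timestamp in the worker's local zone instead of the session's UTC zone can flip it across the 12:00 Morning/Afternoon boundary used by the UDF. A stdlib-only sketch of that shift (editorial addition; the +01:00 offset is an arbitrary illustrative zone, not one from the report):

```python
import datetime

# One of the reporter's timestamps, as an aware UTC datetime.
utc_ts = datetime.datetime(2018, 3, 12, 11, 27, 18,
                           tzinfo=datetime.timezone.utc)

# Render the same instant in a +01:00 zone: the wall-clock hour shifts,
# crossing the UDF's 12:00 Morning/Afternoon boundary.
plus_one = datetime.timezone(datetime.timedelta(hours=1))
local_ts = utc_ts.astimezone(plus_one)

print(utc_ts.time())    # 11:27:18 -> "Morning" under the UDF's rules
print(local_ts.time())  # 12:27:18 -> "Afternoon", yet the instant is identical
```

This is why forcing `user.timezone=UTC` on the driver and executors, plus `spark.sql.session.timeZone=UTC`, is expected to make the UDF see the UTC wall-clock time; the report is that the string produced inside the UDF does not honor that.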
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org