Hello everyone,

I'm having a hard time with time zones.

I have a Long representing a timestamp in milliseconds, 1496361600000, and I 
want the output to be 
2017-06-02 00:00:00


Based on 
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html

The only function that helps with formatting a timestamp is from_unixtime, 
but it renders its output in the system's default time zone, and neither my 
time zone nor that of the server I'm working on is UTC.
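
Roughly what I tried, as a minimal sketch (assuming a spark session is in 
scope; from_unixtime expects seconds, hence the division by 1000):

import org.apache.spark.sql.functions._

// from_unixtime takes *seconds*, so the millisecond value is divided by 1000.
// The resulting string is rendered in the system's default time zone,
// which is exactly my problem: on my server that is not UTC.
spark.range(1)
  .select(from_unixtime(lit(1496361600000L / 1000), "yyyy-MM-dd HH:mm:ss"))
  .show(false)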

I couldn't find a solution among the dozens of JIRA tickets, Stack Overflow 
answers, and blog posts I went through.


In the end, I decided to write a UDF:

def formatMs(ms: Long): String = {
    // SimpleDateFormat is not thread-safe, so a fresh instance is created per call.
    val formatter = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    // Pin the formatter to GMT so the output is independent of the system time zone.
    formatter.setTimeZone(java.util.TimeZone.getTimeZone("GMT"))
    formatter.format(new java.util.Date(ms))
}

spark.udf.register("formatMs", formatMs _)
spark.sql("SELECT formatMs(1496361600000)").show  // 2017-06-02 00:00:00


But if I really go down that route, it will hurt the performance of my 
application, right?

For instance, I need to aggregate some data based on such a column. Since my 
function is a black box to the optimizer, Spark will apply it first and 
aggregate on the String output of the UDF, even though aggregating on the 
initial Long value and then applying the UDF would lead to the same result.
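
In code, the reordering I have in mind looks something like this (a sketch 
only; df and the ts_ms column are placeholders for my actual data):

import org.apache.spark.sql.functions.{col, udf}

// Hypothetical input: df has a Long column "ts_ms" holding millisecond timestamps.
val formatMsUdf = udf(formatMs _)

// Aggregate on the raw Long first, then format each group key once,
// instead of formatting every row and grouping on the String output.
val aggregated = df.groupBy("ts_ms").count()
  .withColumn("ts_str", formatMsUdf(col("ts_ms")))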


I know I could also leave the UDF out of the SQL query and instead apply 
withColumn() and withColumnRenamed() on the resulting DataFrame, but that is 
something I would have to benchmark.
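
That would look roughly like this (again a sketch, reusing formatMsUdf from 
above; the column names are made up):

// DataFrame-side version: add the formatted column, then rename it.
val withFormatted = df
  .withColumn("ts_formatted", formatMsUdf(col("ts_ms")))
  .withColumnRenamed("ts_formatted", "event_time")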


Did I miss a way to do this within a Spark SQL query using built-in 
functions, or anything more performant than what I can think of?


Regards,

Yohann
