Thanks Mich, I figured that might be the case. Regardless, appreciate the help :)
On Thu, Sep 7, 2023 at 8:36 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> As far as I am aware there is no Spark or JVM setting that can make Spark
> assume a different timezone during the initial load from Parquet, as Parquet
> files store timestamps in UTC. The timezone conversion can be done (as I
> described before) after the load.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London, United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Thu, 7 Sept 2023 at 01:42, Jack Goodson <jackagood...@gmail.com> wrote:
>
>> Thanks Mich. Sorry, I might have been a bit unclear in my original email.
>> The timestamps are getting loaded as 2003-11-24T09:02:32+0000, for
>> example, but I want them loaded as 2003-11-24T09:02:32+1300. I know how
>> to do this with various transformations, however I'm wondering if there
>> are any Spark or JVM settings I can change so it assumes +1300 on load
>> instead of +0000 (as the time in the column is relative to NZ local time,
>> not UTC). I inspected the parquet column holding my created date with
>> pyarrow, with the results below.
>>
>> I had a look in
>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md and
>> it looks like I need isAdjustedUTC=false (maybe?) but am at a loss on how
>> to set it.
>>
>> <pyarrow._parquet.ColumnChunkMetaData object at 0xdeadbeef>
>>   file_offset: 6019
>>   file_path:
>>   physical_type: INT96
>>   num_values: 4
>>   path_in_schema: created
>>   is_stats_set: False
>>   statistics: None
>>   compression: SNAPPY
>>   encodings: ('BIT_PACKED', 'PLAIN', 'RLE')
>>   has_dictionary_page: False
>>   dictionary_page_offset: None
>>   data_page_offset: 6019
>>   total_compressed_size: 90
>>   total_uncompressed_size: 103
>>
>> On Wed, Sep 6, 2023 at 8:14 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi Jack,
>>>
>>> You may use from_utc_timestamp and to_utc_timestamp to see if they help.
>>>
>>> from pyspark.sql.functions import from_utc_timestamp
>>>
>>> You can read your Parquet file into a DataFrame:
>>>
>>> df = spark.read.parquet('parquet_file_path')
>>>
>>> # Convert timestamps (assuming your column is named 'timestamp') from
>>> # UTC to the Pacific/Auckland timezone
>>> df_with_local_timezone = df.withColumn(
>>>     'timestamp',
>>>     from_utc_timestamp(df['timestamp'], 'Pacific/Auckland')
>>> )
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect & Engineer
>>> London, United Kingdom
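
For anyone finding this thread in the archives, a note on the isAdjustedUTC
question above: the metadata dump shows physical_type: INT96, a legacy
timestamp encoding that predates the TIMESTAMP logical type, so there is no
isAdjustedUTC flag to flip on the existing files; the flag only exists on
INT64-backed TIMESTAMP columns. A minimal pyarrow sketch (file names are
hypothetical) of rewriting a file so the column comes out as
isAdjustedUTC=false; whether a given Spark/Databricks version then reads
that column differently is worth testing rather than assuming:

import pyarrow as pa
import pyarrow.parquet as pq

# INT96 columns load into pyarrow as timezone-naive timestamp[ns]
table = pq.read_table('created.parquet')

# Cast 'created' to a timezone-naive INT64 timestamp; pyarrow writes
# tz-naive timestamps with isAdjustedUTC=false
naive = table.set_column(
    table.schema.get_field_index('created'),
    'created',
    table.column('created').cast(pa.timestamp('us'))
)
pq.write_table(naive, 'created_naive.parquet')

# Confirm the rewritten type carries no timezone
print(pq.read_schema('created_naive.parquet').field('created').type)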
>>> On Wed, 6 Sept 2023 at 04:19, Jack Goodson <jackagood...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've got a number of tables that I'm loading in from a SQL Server. A
>>>> timestamp in SQL Server is stored like 2003-11-24T09:02:32. I get these
>>>> as parquet files in our raw storage location and pick them up in
>>>> Databricks. When I load the data in Databricks, the dataframe/Spark
>>>> assumes UTC, or +0000, on the timestamp, like 2003-11-24T09:02:32+0000.
>>>> The time and date are the same as what's in SQL Server, however the
>>>> offset is incorrect.
>>>>
>>>> I've tried various methods, like the code below, to set the JVM
>>>> timezone to my local timezone, but when viewing the data it seems to
>>>> just subtract the offset from the timestamp and add it to the offset
>>>> part, like 2003-11-24T09:02:32+0000 -> 2003-11-23T20:02:32+1300 (NZ is
>>>> UTC+13 during daylight saving).
>>>>
>>>> spark = pyspark.sql.SparkSession \
>>>>     .builder \
>>>>     .appName('test') \
>>>>     .master('local') \
>>>>     .config('spark.driver.extraJavaOptions', '-Duser.timezone=Pacific/Auckland') \
>>>>     .config('spark.executor.extraJavaOptions', '-Duser.timezone=Pacific/Auckland') \
>>>>     .config('spark.sql.session.timeZone', 'Pacific/Auckland') \
>>>>     .getOrCreate()
>>>>
>>>> I understand that in Parquet these are stored as UNIX time and aren't
>>>> timezone aware, however are there any settings I can set in Spark that
>>>> would implicitly convert/assume the timestamp from
>>>> 2003-11-24T09:02:32+0000 to 2003-11-24T09:02:32+1300? I know this can
>>>> be done with transformations, however I'm trying to avoid writing
>>>> transformations for every timestamp column on 100s of tables.
>>>>
>>>> Any help much appreciated, thanks,
>>>>
>>>> Jack
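
Also for the archives: if the goal is to assume +1300 on load without
hand-writing a transformation per column, one workaround is to drive Mich's
suggestion off the schema. Of the two functions he mentions,
to_utc_timestamp is the direction this case needs: it interprets the stored
wall-clock value as Pacific/Auckland local time and shifts it to the
corresponding UTC instant, which then displays as
2003-11-24T09:02:32+1300 once spark.sql.session.timeZone is set to
Pacific/Auckland. A minimal sketch; the helper name is made up, and the
path is the placeholder from Mich's example:

from pyspark.sql.functions import to_utc_timestamp
from pyspark.sql.types import TimestampType

def assume_local(df, tz='Pacific/Auckland'):
    # Hypothetical helper: reinterpret every timestamp column as tz-local
    # wall-clock time by shifting it to the matching UTC instant.
    for field in df.schema.fields:
        if isinstance(field.dataType, TimestampType):
            df = df.withColumn(field.name, to_utc_timestamp(df[field.name], tz))
    return df

# Assuming an existing SparkSession named `spark`
df = assume_local(spark.read.parquet('parquet_file_path'))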