Thanks Mich, I figured that might be the case. Regardless, appreciate the help :)
On Thu, Sep 7, 2023 at 8:36 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> As far as I am aware there is no Spark or JVM setting that can make Spark
> assume a different timezone during the initial load from Parquet, as Parquet
> files store timestamps in UTC. The timezone conversion can be done (as I
> described before) after the load.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London, United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Thu, 7 Sept 2023 at 01:42, Jack Goodson <jackagood...@gmail.com> wrote:
>
>> Thanks Mich. Sorry, I might have been a bit unclear in my original email.
>> The timestamps are getting loaded as 2003-11-24T09:02:32+0000, for
>> example, but I want them loaded as 2003-11-24T09:02:32+1300. I know how
>> to do this with various transformations, however I'm wondering if there
>> are any Spark or JVM settings I can change so it assumes +1300 on load
>> instead of +0000 (as the time in the column is relative to NZ local time,
>> not UTC). I inspected the parquet column holding my created date with
>> pyarrow, with the results below.
>>
>> I had a look in
>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md and
>> it looks like I need isAdjustedUTC=false (maybe?) but am at a loss on how
>> to set it.
>>
>> <pyarrow._parquet.ColumnChunkMetaData object at 0xdeadbeef>
>>   file_offset: 6019
>>   file_path:
>>   physical_type: INT96
>>   num_values: 4
>>   path_in_schema: created
>>   is_stats_set: False
>>   statistics: None
>>   compression: SNAPPY
>>   encodings: ('BIT_PACKED', 'PLAIN', 'RLE')
>>   has_dictionary_page: False
>>   dictionary_page_offset: None
>>   data_page_offset: 6019
>>   total_compressed_size: 90
>>   total_uncompressed_size: 103
>>
>> On Wed, Sep 6, 2023 at 8:14 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi Jack,
>>>
>>> You may use from_utc_timestamp and to_utc_timestamp to see if they help.
>>>
>>> from pyspark.sql.functions import from_utc_timestamp
>>>
>>> You can read your Parquet file into a DataFrame:
>>>
>>> df = spark.read.parquet('parquet_file_path')
>>>
>>> # Convert timestamps (assuming your column is named 'timestamp') from
>>> # UTC to the Pacific/Auckland timezone
>>> df_with_local_timezone = df.withColumn(
>>>     'timestamp',
>>>     from_utc_timestamp(df['timestamp'], 'Pacific/Auckland')
>>> )
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect & Engineer
>>> London, United Kingdom
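
For anyone finding this thread in the archives, a note on the isAdjustedUTC
question above: the metadata dump shows physical_type: INT96, a legacy
timestamp encoding that predates the TIMESTAMP logical type, so there is no
isAdjustedUTC flag to flip on the existing files; the flag only exists on
INT64-backed TIMESTAMP columns. A minimal pyarrow sketch (file names are
hypothetical) of rewriting a file so the column comes out as
isAdjustedUTC=false; whether a given Spark/Databricks version then reads
that column differently is worth testing rather than assuming:

import pyarrow as pa
import pyarrow.parquet as pq

# INT96 columns load into pyarrow as timezone-naive timestamp[ns]
table = pq.read_table('created.parquet')

# Cast 'created' to a timezone-naive INT64 timestamp; pyarrow writes
# tz-naive timestamps with isAdjustedUTC=false
naive = table.set_column(
    table.schema.get_field_index('created'),
    'created',
    table.column('created').cast(pa.timestamp('us'))
)
pq.write_table(naive, 'created_naive.parquet')

# Confirm the rewritten type carries no timezone
print(pq.read_schema('created_naive.parquet').field('created').type)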
>>> On Wed, 6 Sept 2023 at 04:19, Jack Goodson <jackagood...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've got a number of tables that I'm loading in from a SQL Server. A
>>>> timestamp in SQL Server is stored like 2003-11-24T09:02:32. I get these
>>>> as parquet files in our raw storage location and pick them up in
>>>> Databricks. When I load the data in Databricks, the dataframe/Spark
>>>> assumes UTC, or +0000, on the timestamp, like 2003-11-24T09:02:32+0000.
>>>> The time and date are the same as what's in SQL Server, however the
>>>> offset is incorrect.
>>>>
>>>> I've tried various methods, like the code below, to set the JVM
>>>> timezone to my local timezone, but when viewing the data it seems to
>>>> just subtract the offset from the timestamp and add it to the offset
>>>> part, like 2003-11-24T09:02:32+0000 -> 2003-11-23T20:02:32+1300 (NZ is
>>>> UTC+13 during daylight saving).
>>>>
>>>> spark = pyspark.sql.SparkSession \
>>>>     .builder \
>>>>     .appName('test') \
>>>>     .master('local') \
>>>>     .config('spark.driver.extraJavaOptions', '-Duser.timezone=Pacific/Auckland') \
>>>>     .config('spark.executor.extraJavaOptions', '-Duser.timezone=Pacific/Auckland') \
>>>>     .config('spark.sql.session.timeZone', 'Pacific/Auckland') \
>>>>     .getOrCreate()
>>>>
>>>> I understand that in Parquet these are stored as UNIX time and aren't
>>>> timezone aware, however are there any settings I can set in Spark that
>>>> would implicitly convert/assume the timestamp from
>>>> 2003-11-24T09:02:32+0000 to 2003-11-24T09:02:32+1300? I know this can
>>>> be done with transformations, however I'm trying to avoid writing
>>>> transformations for every timestamp column on 100s of tables.
>>>>
>>>> Any help much appreciated, thanks,
>>>>
>>>> Jack
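
Also for the archives: if the goal is to assume +1300 on load without
hand-writing a transformation per column, one workaround is to drive Mich's
suggestion off the schema. Of the two functions he mentions,
to_utc_timestamp is the direction this case needs: it interprets the stored
wall-clock value as Pacific/Auckland local time and shifts it to the
corresponding UTC instant, which then displays as
2003-11-24T09:02:32+1300 once spark.sql.session.timeZone is set to
Pacific/Auckland. A minimal sketch; the helper name is made up, and the
path is the placeholder from Mich's example:

from pyspark.sql.functions import to_utc_timestamp
from pyspark.sql.types import TimestampType

def assume_local(df, tz='Pacific/Auckland'):
    # Hypothetical helper: reinterpret every timestamp column as tz-local
    # wall-clock time by shifting it to the matching UTC instant.
    for field in df.schema.fields:
        if isinstance(field.dataType, TimestampType):
            df = df.withColumn(field.name, to_utc_timestamp(df[field.name], tz))
    return df

# Assuming an existing SparkSession named `spark`
df = assume_local(spark.read.parquet('parquet_file_path'))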