I think the Parquet layer should probably restore a non-UTC timezone on read.
We store enough metadata that this should be possible, but at the moment the
timezone comes back as UTC:
In [20]: df = pd.DataFrame({'a': pd.Series(np.arange(0, 10000, 1000))
    ...:     .astype(pd.DatetimeTZDtype('ns', 'America/Los_Angeles'))})
In [21]: t = pa.table(df)
In [22]: t
Out[22]:
pyarrow.Table
a: timestamp[ns, tz=America/Los_Angeles]
In [23]: pq.write_table(t, 'test.parquet')
In [24]: pq.read_table('test.parquet')
Out[24]:
pyarrow.Table
a: timestamp[us, tz=UTC]
In [25]: pq.read_table('test.parquet')[0]
Out[25]:
<pyarrow.lib.ChunkedArray object at 0x7f72eb4b68f0>
[
[
1970-01-01 00:00:00.000000,
1970-01-01 00:00:00.000001,
1970-01-01 00:00:00.000002,
1970-01-01 00:00:00.000003,
1970-01-01 00:00:00.000004,
1970-01-01 00:00:00.000005,
1970-01-01 00:00:00.000006,
1970-01-01 00:00:00.000007,
1970-01-01 00:00:00.000008,
1970-01-01 00:00:00.000009
]
]
I opened https://issues.apache.org/jira/browse/ARROW-9634 so someone
can look into it.
On Mon, Aug 3, 2020 at 10:10 AM David Gallagher wrote:
>
> Hi – I have a pandas dataframe that I want to output to parquet. The
> dataframe has a timestamp field with timezone information. I need control
> over the schema at output, so I am using ParquetWriter and a schema with the
> timestamp column defined as:
>
> ('timestamp', pa.timestamp('s', tz=self._timezone)),
>
> Where timezone is a string, e.g. ‘America/Los_Angeles’. I’m then writing out
> the file using this code:
>
> schema = pa.schema(fields)
> table = pa.Table.from_pandas(self._df, schema,
>                              preserve_index=False).replace_schema_metadata()
> writer = pq.ParquetWriter(
>     os.path.join(file_path, '{}.parquet'.format(self._file_name)),
>     schema=schema)
> writer.write_table(table)
> writer.close()
>
> However, upon reading the resulting file, the timestamp is in UTC:
>
> timestamp datetime64[ns, UTC]
>
> But, if I output the same pandas dataframe to parquet directly, the timestamp
> is localized. Is this expected behavior? I’m using pyarrow 1.0.0. I tried
> playing with the ‘flavor’ argument of ParquetWriter, but this just seemed to
> generate naïve UTC timestamps.
>
> Thanks,
>
> Dave
>