Hi all,

This is my suggestion: use Spark SQL, instead of Impala, to read the Hive
tables, so that you get correct timestamp values all the time. The
situation is explained below:


I have come across a situation where a multi-tenant cluster is being used
to read and write Parquet files.

This causes some issues because, as I understand it, when Hive stores a
timestamp in Parquet format, it converts the local time to UTC, and when it
reads the data back out, it converts it back to local time.

Impala, on the other hand, does not do any conversion when it reads the
timestamp column from the Parquet file, so the UTC time is returned instead
of the local time.
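
To illustrate the mechanics, here is a minimal sketch in Scala using only
java.time (the time zone and example value are hypothetical; this just
mimics what I understand the two engines to be doing):

import java.time._

object HiveTimestampRoundTrip {
  def main(args: Array[String]): Unit = {
    // hypothetical session/local time zone for illustration
    val localZone = ZoneId.of("Europe/London")
    val local     = LocalDateTime.of(2016, 6, 1, 10, 0, 0)

    // write path: Hive converts the session-local time to UTC before
    // storing it in Parquet
    val storedUtc = local.atZone(localZone)
      .withZoneSameInstant(ZoneOffset.UTC)
      .toLocalDateTime

    // Hive read path: converts back to local time, recovering the original
    val hiveReads = storedUtc.atZone(ZoneOffset.UTC)
      .withZoneSameInstant(localZone)
      .toLocalDateTime

    println(s"stored in Parquet (UTC): $storedUtc")  // 2016-06-01T09:00
    println(s"Hive returns (local):    $hiveReads")  // 2016-06-01T10:00
    // Impala read path: no conversion, the raw UTC value comes back
    println(s"Impala returns (as-is):  $storedUtc")  // 2016-06-01T09:00
  }
}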

So there are multiple issues:

- Data read by Impala is not converted from UTC to local time.
- A flag can be set to make Impala do the conversion, but it applies at the
  cluster level only (the flag is shown below).
- One group is saying they don't want to do the conversion at the
  application level.
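
For what it is worth, I believe the flag being referred to is the impalad
startup option below (this is from memory, so please verify it against your
Impala version); it is a daemon-level setting, hence cluster-wide:

impalad ... -convert_legacy_hive_parquet_utc_timestamps=true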

So setting the flag will cure certain problems but will make the other
tenants, who do not want the conversion, less happy.

Now, my understanding is that this issue comes about because Impala
bypasses the Hive metadata and goes directly to the Parquet files.

There is an impact to the business.

My suggestion is that if they want performant reads, they should use Spark
SQL on the Hive tables. It will always return the same values as stored by
Hive.
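
As a rough sketch of what I mean (Scala, assuming a Spark build with Hive
support; the database, table, and column names here are hypothetical):

import org.apache.spark.sql.SparkSession

object ReadHiveTimestamps {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport makes Spark SQL go through the Hive metastore
    // and apply Hive's Parquet timestamp semantics on read
    val spark = SparkSession.builder()
      .appName("ReadHiveTimestamps")
      .enableHiveSupport()
      .getOrCreate()

    // mydb.events and event_ts are hypothetical names
    spark.sql("SELECT event_ts FROM mydb.events LIMIT 10")
      .show(truncate = false)

    spark.stop()
  }
}

The point is that Spark SQL honours Hive's conversion on read rather than
going straight to the raw Parquet values, so the timestamps match what Hive
itself returns.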


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
