Hi Stefan,

I am not sure I fully understand your question: "why you don't seem to be storing your dates in Parquet Date files."

As far as I am aware, all the date types in Parquet (i.e. DATE, TIME_MILLIS, TIMESTAMP_MILLIS) are stored as annotated int32 or int64 values. The only other option is to store them as strings (or VARCHAR) and interpret them as required when selecting from the Parquet files. Please let me know if I have understood your question correctly or not.

What I have not achieved yet is to use Avro schema definitions (via JSON) to define a TIMESTAMP type, which is why we have gone the route of defining a VIEW for each Parquet file. By doing this we reduce the amount of casting we have to do when building the query, since the VIEW effectively does all the casting for us behind the scenes.

We are currently looking at the possibility of defining Parquet schemas directly (using Java) without going the Avro route; in other words, producing a parser from JSON to Parquet, similar to the Avro parser but supporting some other logical types.
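The rough idea is something like the following. This is an untested sketch, not our actual code: it assumes a parquet-mr release whose OriginalType enum includes TIMESTAMP_MILLIS, and the field names are only illustrative.

example java snippet:

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

// build the Parquet schema directly, annotating the physical int64/int32
// columns with the logical date/time types -- no Avro involved
MessageType schema = Types.buildMessage()
    .required(PrimitiveTypeName.INT64).as(OriginalType.TIMESTAMP_MILLIS).named("z_timestamp")
    .required(PrimitiveTypeName.INT64).as(OriginalType.TIMESTAMP_MILLIS).named("datetime")
    .required(PrimitiveTypeName.INT32).as(OriginalType.DATE).named("date")
    .required(PrimitiveTypeName.INT64).named("granularity")
    .named("timestamp_test");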
Some background to our Drill trials: we are generating billions of columns from machine-generated data every day. There are quite a number of different types of machine generating this data, and the format of the data varies between machines. Some produce string-format dates/timestamps and others numeric (Unix epoch style), and we have to normalise this data to a common format; for dates this format is the Parquet TIMESTAMP_MILLIS type, because we need to use millisecond granularity when available.

Quote from the Parquet docs:

"TIMESTAMP_MILLIS Logical date and time. Annotates an int64 that stores the number of milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC."

This type corresponds nicely to the SQL type TIMESTAMP (which is why we cast).
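In case it is useful, the normalisation itself boils down to something like this (a simplified sketch; toEpochMillis is not our real method name, and the seconds-versus-millis cutoff is just a heuristic):

example java snippet:

import java.text.ParseException;
import java.text.SimpleDateFormat;

// normalise one incoming value to epoch milliseconds, whatever the machine produced
public static long toEpochMillis(String raw, SimpleDateFormat sdf) throws ParseException {
    if (raw.matches("\\d+")) {                    // numeric: unix epoch seconds or millis
        long v = Long.parseLong(raw);
        return v < 10000000000L ? v * 1000L : v;  // heuristic: values below ~10^10 are seconds
    }
    return sdf.parse(raw).getTime();              // string date: machine-specific format
}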
Again, hope this helps.

Cheers -- Chris

> On 15 Oct 2015, at 14:46, Stefán Baxter <[email protected]> wrote:
>
> Thank you Chris, this clarifies a whole lot :).
>
> I wanted to try to avoid the cast in the CTAS on the way from Avro to
> Parquet (not possible) and then avoid casting as much as possible when
> selecting from the Parquet files.
>
> What is still unclear to me is why you don't seem to be storing your dates
> in Parquet Date files.
>
> Can you please elaborate a bit on the pros/cons?
>
> Regards,
> -Stefan
>
> On Thu, Oct 15, 2015 at 10:59 AM, Chris Mathews <[email protected]> wrote:
>
>> Hello Stefan
>>
>> We use Avro to define our schemas for Parquet files, and we find that
>> using long for dates and converting the dates to long using milliseconds
>> works. We then CAST the long to a TIMESTAMP on the way out during the
>> SELECT statement (or by using a VIEW).
>>
>> example java snippet:
>>
>> // various date and time formats
>> public static final String FORMAT_Z_TIMESTAMP = "yyyy-MM-dd HH:mm:ss.SSS";
>> public static final String FORMAT_DATETIME = "yyyy-MM-dd HH:mm:ss";
>> public static final String FORMAT_DATE = "yyyy-MM-dd";
>>
>> …
>>
>> // parser for each format
>> public final SimpleDateFormat sdf_z_timestamp = new SimpleDateFormat( FORMAT_Z_TIMESTAMP );
>> public final SimpleDateFormat sdf_datetime = new SimpleDateFormat( FORMAT_DATETIME );
>> public final SimpleDateFormat sdf_date = new SimpleDateFormat( FORMAT_DATE );
>>
>> …
>>
>> // choose parser based on column name / string format
>> public SimpleDateFormat sdf = (NAME_Z_TIMESTAMP.equals(name())) ? sdf_z_timestamp
>>                             : (NAME_DATETIME.equals(name()))    ? sdf_datetime
>>                             : (NAME_DATE.equals(name()))        ? sdf_date
>>                             : null;
>> …
>>
>> Date date = sdf.parse(str);
>> long millis = date.getTime();
>> Object value = new java.lang.Long(millis);
>>
>> We then use something like
>>
>> AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<>(hdfs_path, schema);
>> GenericRecord data = new GenericData.Record(schema);
>> data.put( name(), value );
>> writer.write(data);
>>
>> to write the records out directly to a Parquet file.
>>
>> example schema:
>>
>> {
>>   "type": "record",
>>   "name": "timestamp_test",
>>   "doc": "Avro -> Parquet long to timestamp test",
>>   "fields":
>>   [
>>      { "name": "z_timestamp", "type": "long" }
>>     ,{ "name": "datetime",    "type": "long" }
>>     ,{ "name": "date",        "type": "long" }
>>     ,{ "name": "granularity", "type": "long" }
>>   ]
>> }
>>
>> Then to get the data back we either define a VIEW, or cast during the
>> SELECT statement.
>>
>> example view:
>>
>> use hdfs.cjm;
>>
>> create or replace view TIMESTAMP_TEST_VIEW as
>> SELECT
>>    CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
>>   ,CAST(`datetime` AS TIMESTAMP)    AS `datetime`
>>   ,CAST(`date` AS DATE)             AS `date`
>>   ,CAST(`granularity` AS BIGINT)    AS `granularity`
>> FROM hdfs.cjm.TIMESTAMP_TEST;
>>
>> Then execute the following to get results:
>>
>> 0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST limit 1;
>> +----------------+----------------+----------------+--------------+
>> |  z_timestamp   |    datetime    |      date      | granularity  |
>> +----------------+----------------+----------------+--------------+
>> | 1429592511991  | 1429520400000  | 1421625600000  | 3600         |
>> +----------------+----------------+----------------+--------------+
>> 1 row selected (2.593 seconds)
>>
>> 0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST_VIEW limit 1;
>> +--------------------------+------------------------+-------------+--------------+
>> |       z_timestamp        |        datetime        |    date     | granularity  |
>> +--------------------------+------------------------+-------------+--------------+
>> | 2015-04-22 05:16:22.173  | 2015-04-21 12:00:00.0  | 2015-01-20  | 3600         |
>> +--------------------------+------------------------+-------------+--------------+
>> 1 row selected (3.282 seconds)
>>
>> 0: jdbc:drill:> SELECT
>> . . . . . . . >    CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
>> . . . . . . . >   ,CAST(`datetime` AS TIMESTAMP)    AS `datetime`
>> . . . . . . . >   ,CAST(`date` AS DATE)             AS `date`
>> . . . . . . . >   ,CAST(`granularity` AS BIGINT)    AS `granularity`
>> . . . . . . . > from TIMESTAMP_TEST limit 1;
>> +--------------------------+------------------------+-------------+--------------+
>> |       z_timestamp        |        datetime        |    date     | granularity  |
>> +--------------------------+------------------------+-------------+--------------+
>> | 2015-04-22 05:16:22.173  | 2015-04-21 09:00:00.0  | 2015-01-20  | 3600         |
>> +--------------------------+------------------------+-------------+--------------+
>> 1 row selected (3.071 seconds)
>>
>> Hope this helps.
>>
>> Cheers — Chris
>>
>>> On 14 Oct 2015, at 16:07, Stefán Baxter <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> What is the best practice when working with dates in an Avro/Parquet
>>> scenario?
>>>
>>> Avro does not support dates directly (only longs) and I'm wondering how
>>> they get persisted in Parquet.
>>>
>>> Perhaps Parquet does not distinguish between long and date-long in any
>>> significant way.
>>>
>>> Regards,
>>> -Stefan
