Thank you, Jacques - yes, this is exactly the issue I am having. We are currently using Avro to define the schemas for our Parquet files, and as you correctly point out there is no way of defining date types in Avro. Due to the volumes of data we are dealing with, using CTAS is not an option for us: we need to create the Parquet files on the fly rather than create intermediate files for use as input to a CTAS.
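For what it is worth, the direction we are exploring looks roughly like the sketch below: build the Parquet MessageType directly with the parquet-mr Types builder (no Avro anywhere in the path), annotate the int64/int32 columns as TIMESTAMP_MILLIS and DATE, and write rows through the example Group API. This is an untested outline rather than code from our system - the file path, field names and the use of ExampleParquetWriter are just for illustration:

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class NativeDateWriteSketch {

    public static void main(String[] args) throws Exception {

        // Schema built directly in Java; the date/time columns carry their
        // logical-type annotations, so no casting should be needed on the Drill side.
        MessageType schema = Types.buildMessage()
                .required(PrimitiveTypeName.INT64).as(OriginalType.TIMESTAMP_MILLIS).named("z_timestamp")
                .required(PrimitiveTypeName.INT64).as(OriginalType.TIMESTAMP_MILLIS).named("datetime")
                .required(PrimitiveTypeName.INT32).as(OriginalType.DATE).named("date")
                .required(PrimitiveTypeName.INT64).named("granularity")
                .named("timestamp_test");

        SimpleGroupFactory factory = new SimpleGroupFactory(schema);

        // ExampleParquetWriter is the simple Group-based writer bundled with
        // parquet-mr; a production pipeline would plug in its own WriteSupport.
        try (ParquetWriter<Group> writer =
                     ExampleParquetWriter.builder(new Path("/tmp/timestamp_test.parquet"))
                                         .withType(schema)
                                         .build()) {

            long nowMillis = System.currentTimeMillis();
            int daysSinceEpoch = (int) (nowMillis / 86400000L);   // DATE is days since 1970-01-01

            Group row = factory.newGroup()
                               .append("z_timestamp", nowMillis)
                               .append("datetime", nowMillis)
                               .append("date", daysSinceEpoch)
                               .append("granularity", 3600L);
            writer.write(row);
        }
    }
}

With the logical-type annotations written into the file footer, Drill should surface z_timestamp and datetime as TIMESTAMP and date as DATE directly, so the VIEW layer described further down would no longer be needed.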
Since Avro is the restriction here, we need to come up with some way of defining our Parquet schemas so that the date types can be defined natively (along the lines of the sketch above); as you say, there would then be no need for any casting. What we really need is some way of creating a schema for use in a Java app, similar to the way Avro works using JSON, for ease of implementation but without the restrictions. We are still investigating this approach, but for now we are casting!

Cheers — Chris

> On 15 Oct 2015, at 16:19, Jacques Nadeau <[email protected]> wrote:
>
> A little clarification here:
>
> Parquet has native support for date types. Drill does too. However, since
> Avro does not, there is no way that I know of to write a Parquet file via
> the Avro adapter that will not require a cast. If you did a CTAS in Drill
> and cast the data types correctly in the CTAS, then Drill will output
> Parquet files that never need casting again.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Thu, Oct 15, 2015 at 7:31 AM, Stefán Baxter <[email protected]> wrote:
>
>> Hi Chris,
>>
>> I understand now, thank you.
>>
>> What threw me off was that, in our standard use-case, we are not using cast
>> for our TIMESTAMP_MILLIS fields and I thought we were getting them directly
>> formatted from Parquet, but then I overlooked our UDF that is handling the
>> casting... sorry :).
>>
>> Thank you for taking the time!
>>
>> Regards,
>> -Stefan
>>
>>
>> On Thu, Oct 15, 2015 at 2:21 PM, Chris Mathews <[email protected]> wrote:
>>
>>> Hi Stefan
>>>
>>> I am not sure I fully understand your question 'why you don't seem to be
>>> storing your dates in Parquet Date files.'
>>>
>>> As far as I am aware, all date types in Parquet (i.e. DATE, TIME_MILLIS,
>>> TIMESTAMP_MILLIS) are stored as annotated int32 or int64 types. The only
>>> other option is to store them as strings (or VARCHAR) and interpret them
>>> as required when selecting from the Parquet files.
>>>
>>> Please let me know if I have understood your question correctly or not.
>>>
>>> What I have not achieved yet is to use Avro schema definitions (via JSON)
>>> to define a TIMESTAMP type, which is why we have gone the route of defining
>>> a VIEW for each Parquet file. By doing this we reduce the amount of casting
>>> we have to do when building the query, since the VIEW effectively does all
>>> the casting for us behind the scenes.
>>>
>>> We are currently looking at the possibility of defining Parquet schemas
>>> directly (using Java) without going the Avro route; in other words, produce
>>> a parser from JSON to Parquet, similar to the Avro parser but supporting
>>> some other logical types.
>>>
>>> Some background to our Drill trials:
>>>
>>> We are generating billions of columns from machine-generated data every
>>> day. There are quite a number of different types of machine generating this
>>> data, and the format of the data varies between machines. Some produce
>>> string-format dates/timestamps and others numeric (Unix epoch style), and
>>> we have to normalise this data to a common format; for dates this format is
>>> the Parquet TIMESTAMP_MILLIS type, because we need millisecond granularity
>>> when available.
>>>
>>> Quote from the Parquet docs:
>>>
>>> "TIMESTAMP_MILLIS Logical date and time. Annotates an int64 that stores
>>> the number of milliseconds from the Unix epoch, 00:00:00.000 on 1 January
>>> 1970, UTC."
>>>
>>> This type corresponds nicely to the SQL type TIMESTAMP (which is why we
>>> cast).
>>>
>>> Again, hope this helps.
>>>
>>> Cheers -- Chris
>>>
>>>> On 15 Oct 2015, at 14:46, Stefán Baxter <[email protected]> wrote:
>>>>
>>>> Thank you Chris, this clarifies a whole lot :).
>>>>
>>>> I wanted to try to avoid the cast in the CTAS on the way from Avro to
>>>> Parquet (not possible) and then avoid casting as much as possible when
>>>> selecting from the Parquet files.
>>>>
>>>> What is still unclear to me is why you don't seem to be storing your dates
>>>> in Parquet Date files.
>>>>
>>>> Can you please elaborate a bit on the pros/cons?
>>>>
>>>> Regards,
>>>> -Stefan
>>>>
>>>> On Thu, Oct 15, 2015 at 10:59 AM, Chris Mathews <[email protected]> wrote:
>>>>
>>>>> Hello Stefan
>>>>>
>>>>> We use Avro to define our schemas for Parquet files, and we find that
>>>>> using long for dates and converting the dates to long using milliseconds
>>>>> works. We then CAST the long to a TIMESTAMP on the way out during the
>>>>> SELECT statement (or by using a VIEW).
>>>>>
>>>>> example Java snippet:
>>>>>
>>>>> // various date and time formats
>>>>> public static final String FORMAT_Z_TIMESTAMP = "yyyy-MM-dd HH:mm:ss.SSS";
>>>>> public static final String FORMAT_DATETIME = "yyyy-MM-dd HH:mm:ss";
>>>>> public static final String FORMAT_DATE = "yyyy-MM-dd";
>>>>>
>>>>> …
>>>>>
>>>>> // parser for each format
>>>>> public final SimpleDateFormat sdf_z_timestamp = new SimpleDateFormat( FORMAT_Z_TIMESTAMP );
>>>>> public final SimpleDateFormat sdf_datetime = new SimpleDateFormat( FORMAT_DATETIME );
>>>>> public final SimpleDateFormat sdf_date = new SimpleDateFormat( FORMAT_DATE );
>>>>>
>>>>> …
>>>>>
>>>>> // choose parser based on column name / string format
>>>>> public SimpleDateFormat sdf = (NAME_Z_TIMESTAMP.equals(name())) ? sdf_z_timestamp
>>>>>                             : (NAME_DATETIME.equals(name()))    ? sdf_datetime
>>>>>                             : (NAME_DATE.equals(name()))        ? sdf_date
>>>>>                             : null;
>>>>> …
>>>>>
>>>>> Date date = sdf.parse(str);
>>>>> long millis = date.getTime();
>>>>> Object value = new java.lang.Long(millis);
>>>>>
>>>>> We then use something like
>>>>>
>>>>> AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<>(hdfs_path, schema);
>>>>> GenericRecord data = new GenericData.Record(schema);
>>>>> data.put( name(), value );
>>>>> writer.write(data);
>>>>>
>>>>> to write the records out directly to a Parquet file.
>>>>>
>>>>> example schema:
>>>>>
>>>>> {
>>>>>   "type": "record",
>>>>>   "name": "timestamp_test",
>>>>>   "doc": "Avro -> Parquet long to timestamp test",
>>>>>   "fields":
>>>>>   [
>>>>>     { "name": "z_timestamp", "type": "long" }
>>>>>    ,{ "name": "datetime", "type": "long" }
>>>>>    ,{ "name": "date", "type": "long" }
>>>>>    ,{ "name": "granularity", "type": "long" }
>>>>>   ]
>>>>> }
>>>>>
>>>>> Then to get the data back we either define a VIEW, or cast during the
>>>>> SELECT statement.
>>>>>
>>>>> example view:
>>>>>
>>>>> use hdfs.cjm;
>>>>>
>>>>> create or replace view TIMESTAMP_TEST_VIEW as
>>>>> SELECT
>>>>>   CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
>>>>>  ,CAST(`datetime` AS TIMESTAMP) AS `datetime`
>>>>>  ,CAST(`date` AS DATE) AS `date`
>>>>>  ,CAST(`granularity` AS BIGINT) AS `granularity`
>>>>> FROM hdfs.cjm.TIMESTAMP_TEST;
>>>>>
>>>>> Then execute the following to get results:
>>>>>
>>>>> 0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from
>>>>> TIMESTAMP_TEST limit 1;
>>>>> +----------------+----------------+----------------+--------------+
>>>>> |  z_timestamp   |    datetime    |      date      | granularity  |
>>>>> +----------------+----------------+----------------+--------------+
>>>>> | 1429592511991  | 1429520400000  | 1421625600000  | 3600         |
>>>>> +----------------+----------------+----------------+--------------+
>>>>> 1 row selected (2.593 seconds)
>>>>>
>>>>> 0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from
>>>>> TIMESTAMP_TEST_VIEW limit 1;
>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>> |       z_timestamp        |        datetime        |    date     | granularity  |
>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>> | 2015-04-22 05:16:22.173  | 2015-04-21 12:00:00.0  | 2015-01-20  | 3600         |
>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>> 1 row selected (3.282 seconds)
>>>>>
>>>>> 0: jdbc:drill:> SELECT
>>>>> . . . . . . . >   CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
>>>>> . . . . . . . >  ,CAST(`datetime` AS TIMESTAMP) AS `datetime`
>>>>> . . . . . . . >  ,CAST(`date` AS DATE) AS `date`
>>>>> . . . . . . . >  ,CAST(`granularity` AS BIGINT) AS `granularity`
>>>>> . . . . . . . > from TIMESTAMP_TEST limit 1;
>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>> |       z_timestamp        |        datetime        |    date     | granularity  |
>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>> | 2015-04-22 05:16:22.173  | 2015-04-21 09:00:00.0  | 2015-01-20  | 3600         |
>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>> 1 row selected (3.071 seconds)
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Cheers — Chris
>>>>>
>>>>>> On 14 Oct 2015, at 16:07, Stefán Baxter <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> What is the best practice when working with dates in an Avro/Parquet
>>>>>> scenario?
>>>>>>
>>>>>> Avro does not support dates directly (only longs) and I'm wondering how
>>>>>> they get persisted in Parquet.
>>>>>>
>>>>>> Perhaps Parquet does not distinguish between long and date-long in any
>>>>>> significant way.
>>>>>>
>>>>>> Regards,
>>>>>> -Stefan
>>>>>
>>>>>
>>>
>>>
>>
