Hi Ryan

Thanks for this - it sounds like just what we need.

How do we go about doing a trial of the local copies with our code?
It would be good to check this all out now if 1.8.0 is going to be delayed for a while.

Contact me via https://drillers.slack.com/messages/dev/team/cmathews/ to discuss.

Cheers — Chris

> On 15 Oct 2015, at 17:59, Julien Le Dem <[email protected]> wrote:
> 
> thanks Ryan!
> (cc parquet dev list as well)
> 
> On Thu, Oct 15, 2015 at 9:46 AM, Ryan Blue <[email protected]> wrote:
> 
>> Hi Chris,
>> 
>> Avro does have support for dates, but it hasn't been released yet because
>> 1.8.0 was blocked on license issues (AVRO-1722). I have a branch with
>> preliminary parquet-avro support for Decimal (which uses the same Avro
>> construct) if you would like to take a look at it.
>> 
>> What we can do in the meantime, before Avro's logical type support is
>> released, is to copy the logical types and conversions into parquet-avro
>> and use local copies until we can use the correct upstream ones. If you're
>> interested, ping me and we can get it working.
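>> 
>> As a rough sketch of what this looks like from user code (assuming the
>> unreleased 1.8.0 LogicalTypes API; the local copies would mirror these
>> names until the upstream release):
>> 
>> import org.apache.avro.LogicalTypes;
>> import org.apache.avro.Schema;
>> 
>> // Annotate an Avro long as timestamp-millis and an int as date;
>> // parquet-avro can then map these to the corresponding annotated
>> // Parquet types (TIMESTAMP_MILLIS, DATE).
>> Schema tsMillis = LogicalTypes.timestampMillis()
>>     .addToSchema(Schema.create(Schema.Type.LONG));
>> Schema date = LogicalTypes.date()
>>     .addToSchema(Schema.create(Schema.Type.INT));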
>> 
>> rb
>> 
>> On 10/15/2015 09:24 AM, Julien Le Dem wrote:
>> 
>>> Hi Chris,
>>> You could probably contribute some sort of type annotation to
>>> parquet-avro so that it produces the data type in the Parquet schema.
>>> This class generates a Parquet schema from the Avro schema:
>>> 
>>> https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java
>>> It looks like Avro has some annotation support but I have never used it.
>>> http://avro.apache.org/docs/1.4.1/idl.html#minutiae_annotations
>>> (CC'ing Ryan, who knows the avro integration much better)
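>>> 
>>> A minimal sketch of that conversion path (jsonSchemaString is a
>>> placeholder for your Avro schema JSON):
>>> 
>>> import org.apache.avro.Schema;
>>> import org.apache.parquet.avro.AvroSchemaConverter;
>>> import org.apache.parquet.schema.MessageType;
>>> 
>>> // Parse the Avro schema from its JSON form, then let parquet-avro
>>> // derive the equivalent Parquet message type.
>>> Schema avroSchema = new Schema.Parser().parse(jsonSchemaString);
>>> MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
>>> System.out.println(parquetSchema);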
>>> 
>>> On Thu, Oct 15, 2015 at 8:45 AM, Chris Mathews <[email protected]> wrote:
>>> 
>>>    Thank you Jacques - yes this is exactly the issue I am having.
>>> 
>>>    We are currently using Avro to define schemas for our Parquet files,
>>>    and as you correctly point out there is no way of defining date
>>>    types in Avro.  Due to the volumes of data we are dealing with,
>>>    using CTAS is not an option for us as we need to create the Parquet
>>>    files on the fly rather than create intermediate files for use as
>>>    input to the CTAS.
>>> 
>>>    Since Avro is the restriction here we need to come up with some way
>>>    of defining our Parquet schemas so the date types can be defined
>>>    natively; and as you say there will be no need for any casting again.
>>> 
>>>    What we really need is some way of creating a schema for use in a
>>>    Java app, similar to the way Avro works using JSON, for ease of
>>>    implementation but without the restrictions.
>>> 
>>>    We are still investigating this approach, but for now we are casting!
>>> 
>>>    Cheers — Chris
>>> 
>>> 
>>>> On 15 Oct 2015, at 16:19, Jacques Nadeau <[email protected]> wrote:
>>>> 
>>>> A little clarification here:
>>>> 
>>>> Parquet has native support for date types. Drill does too. However,
>>>> since Avro does not, there is no way that I know of to write a Parquet
>>>> file via the Avro adapter that will not require a cast. If you did a
>>>> CTAS in Drill and cast the data types correctly in the CTAS, then Drill
>>>> will output Parquet files that never need casting again.
>>>> 
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
>>>> 
>>>> On Thu, Oct 15, 2015 at 7:31 AM, Stefán Baxter <[email protected]> wrote:
>>>> 
>>>>> Hi Chris,
>>>>> 
>>>>> I understand now, thank you.
>>>>> 
>>>>> What threw me off was that, in our standard use-case, we are not using
>>>>> cast for our TIMESTAMP_MILLIS fields, and I thought we were getting
>>>>> them directly formatted from Parquet, but then I overlooked our UDF
>>>>> that is handling the casting... sorry :).
>>>>> 
>>>>> Thank you for taking the time!
>>>>> 
>>>>> Regards,
>>>>> -Stefan
>>>>> 
>>>>> 
>>>>> On Thu, Oct 15, 2015 at 2:21 PM, Chris Mathews <[email protected]> wrote:
>>>>> 
>>>>>> Hi Stefan
>>>>>> 
>>>>>> I am not sure I fully understand your question: 'why you don't seem
>>>>>> to be storing your dates in Parquet date fields.'
>>>>>> 
>>>>>> As far as I am aware, all date types in Parquet (i.e. DATE,
>>>>>> TIME_MILLIS, TIMESTAMP_MILLIS) are stored as annotated int32 or
>>>>>> int64 types. The only other option is to store them as strings (or
>>>>>> VARCHAR) and interpret them as required when selecting from the
>>>>>> Parquet files.
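>>>>>> 
>>>>>> For illustration, this is roughly how those annotated types are
>>>>>> declared as a Parquet message type (a sketch using parquet-mr's
>>>>>> MessageTypeParser, not something we currently run):
>>>>>> 
>>>>>> import org.apache.parquet.schema.MessageType;
>>>>>> import org.apache.parquet.schema.MessageTypeParser;
>>>>>> 
>>>>>> // DATE and TIME_MILLIS annotate int32; TIMESTAMP_MILLIS annotates
>>>>>> // int64 (milliseconds from the Unix epoch, UTC).
>>>>>> MessageType dates = MessageTypeParser.parseMessageType(
>>>>>>     "message dates { "
>>>>>>   + "  required int32 d (DATE); "
>>>>>>   + "  required int32 t (TIME_MILLIS); "
>>>>>>   + "  required int64 ts (TIMESTAMP_MILLIS); "
>>>>>>   + "}");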
>>>>>> 
>>>>>> Please let me know if I have understood your question correctly or not.
>>>>>> 
>>>>>> What I have not achieved yet is to use Avro schema definitions (via
>>>>>> JSON) to define a TIMESTAMP type, which is why we have gone the route
>>>>>> of defining a VIEW for each Parquet file. By doing this we reduce the
>>>>>> amount of casting we have to do when building the query, since the
>>>>>> VIEW effectively does all the casting for us behind the scenes.
>>>>>> 
>>>>>> We are currently looking at the possibility of defining Parquet
>>>>>> schemas directly (using Java) without going the Avro route; in other
>>>>>> words, producing a parser from JSON to Parquet, similar to the Avro
>>>>>> parser but supporting some other logical types.
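>>>>>> 
>>>>>> Something like this, using parquet-mr's Types builder (a rough
>>>>>> sketch only; the field names just mirror our test schema):
>>>>>> 
>>>>>> import org.apache.parquet.schema.MessageType;
>>>>>> import org.apache.parquet.schema.OriginalType;
>>>>>> import org.apache.parquet.schema.Types;
>>>>>> import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64;
>>>>>> 
>>>>>> // Build the message type directly instead of converting an Avro
>>>>>> // schema; the TIMESTAMP_MILLIS annotation is declared natively.
>>>>>> MessageType schema = Types.buildMessage()
>>>>>>     .required(INT64).as(OriginalType.TIMESTAMP_MILLIS).named("z_timestamp")
>>>>>>     .required(INT64).named("granularity")
>>>>>>     .named("timestamp_test");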
>>>>>> 
>>>>>> Some background to our Drill trials:
>>>>>> 
>>>>>> We are generating billions of columns from machine-generated data
>>>>>> every day. There are quite a number of different types of machines
>>>>>> generating this data, and the format of the data varies between
>>>>>> machines. Some produce string-format dates/timestamps and others
>>>>>> numeric (Unix epoch style), and we have to normalise this data to a
>>>>>> common format; for dates this format is the Parquet TIMESTAMP_MILLIS
>>>>>> type, because we need to use millisecond granularity when available.
>>>>>> 
>>>>>> A quote from the Parquet docs:
>>>>>> 
>>>>>> "TIMESTAMP_MILLIS Logical date and time. Annotates an int64 that
>>>>>> stores the number of milliseconds from the Unix epoch, 00:00:00.000
>>>>>> on 1 January 1970, UTC."
>>>>>> 
>>>>>> This type corresponds nicely to the SQL type TIMESTAMP (which is why
>>>>>> we cast).
>>>>>> 
>>>>>> Again, hope this helps.
>>>>>> 
>>>>>> Cheers -- Chris
>>>>>> 
>>>>>>> On 15 Oct 2015, at 14:46, Stefán Baxter <[email protected]> wrote:
>>>>>>> 
>>>>>>> Thank you Chris, this clarifies a whole lot :).
>>>>>>> 
>>>>>>> I wanted to try to avoid the cast in the CTAS on the way from Avro
>>>>>>> to Parquet (not possible) and then avoid casting as much as possible
>>>>>>> when selecting from the Parquet files.
>>>>>>> 
>>>>>>> What is still unclear to me is why you don't seem to be storing your
>>>>>>> dates in Parquet date fields.
>>>>>>> 
>>>>>>> Can you please elaborate a bit on the pros/cons?
>>>>>>> 
>>>>>>> Regards,
>>>>>>> -Stefan
>>>>>>> 
>>>>>>> On Thu, Oct 15, 2015 at 10:59 AM, Chris Mathews <[email protected]> wrote:
>>>>>>> 
>>>>>>>> Hello Stefan
>>>>>>>> 
>>>>>>>> We use Avro to define our schemas for Parquet files, and we find
>>>>>>>> that storing dates as longs (converted to milliseconds) works well.
>>>>>>>> We then CAST the long to a TIMESTAMP on the way out during the
>>>>>>>> SELECT statement (or by using a VIEW).
>>>>>>>> 
>>>>>>>> example java snippet:
>>>>>>>> 
>>>>>>>> //  various date and time formats
>>>>>>>> public static final String FORMAT_Z_TIMESTAMP = "yyyy-MM-dd HH:mm:ss.SSS";
>>>>>>>> public static final String FORMAT_DATETIME    = "yyyy-MM-dd HH:mm:ss";
>>>>>>>> public static final String FORMAT_DATE        = "yyyy-MM-dd";
>>>>>>>> 
>>>>>>>> …
>>>>>>>> 
>>>>>>>> // parser for each format
>>>>>>>> public final SimpleDateFormat sdf_z_timestamp = new SimpleDateFormat( FORMAT_Z_TIMESTAMP );
>>>>>>>> public final SimpleDateFormat sdf_datetime    = new SimpleDateFormat( FORMAT_DATETIME );
>>>>>>>> public final SimpleDateFormat sdf_date        = new SimpleDateFormat( FORMAT_DATE );
>>>>>>>> 
>>>>>>>> …
>>>>>>>> 
>>>>>>>> //  choose parser based on column name / string format
>>>>>>>> public SimpleDateFormat sdf = (NAME_Z_TIMESTAMP.equals(name())) ? sdf_z_timestamp
>>>>>>>>                             : (NAME_DATETIME.equals(name()))    ? sdf_datetime
>>>>>>>>                             : (NAME_DATE.equals(name()))        ? sdf_date
>>>>>>>>                             : null;
>>>>>>>> …
>>>>>>>> 
>>>>>>>> // parse the incoming string and keep the epoch milliseconds as a Long
>>>>>>>> Date date = sdf.parse(str);
>>>>>>>> long millis = date.getTime();
>>>>>>>> Object value = new java.lang.Long(millis);
>>>>>>>> 
>>>>>>>> We then use something like
>>>>>>>> 
>>>>>>>> AvroParquetWriter<GenericRecord> writer =
>>>>>>>>     new AvroParquetWriter<>(hdfs_path, schema);
>>>>>>>> GenericRecord data = new GenericData.Record(schema);
>>>>>>>> data.put( name(), value );
>>>>>>>> writer.write(data);
>>>>>>>> 
>>>>>>>> to write the records out directly to a Parquet file.
>>>>>>>> 
>>>>>>>> example schema:
>>>>>>>> 
>>>>>>>> {
>>>>>>>> "type": "record",
>>>>>>>> "name": "timestamp_test",
>>>>>>>> "doc": "Avro -> Parquet long to timestamp test",
>>>>>>>> "fields":
>>>>>>>> [
>>>>>>>>  { "name": "z_timestamp",  "type": "long" }
>>>>>>>> ,{ "name": "datetime",     "type": "long" }
>>>>>>>> ,{ "name": "date",         "type": "long" }
>>>>>>>> ,{ "name": "granularity",  "type": "long" }
>>>>>>>> ]
>>>>>>>> }
>>>>>>>> 
>>>>>>>> Then to get the data back we either define a VIEW, or cast during
>>>>>>>> the SELECT statement.
>>>>>>>> 
>>>>>>>> example view:
>>>>>>>> 
>>>>>>>> use hdfs.cjm;
>>>>>>>> 
>>>>>>>> create or replace view TIMESTAMP_TEST_VIEW as
>>>>>>>> SELECT
>>>>>>>> CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
>>>>>>>> ,CAST(`datetime` AS TIMESTAMP) AS `datetime`
>>>>>>>> ,CAST(`date` AS DATE) AS `date`
>>>>>>>> ,CAST(`granularity` AS BIGINT) AS `granularity`
>>>>>>>> 
>>>>>>>> FROM hdfs.cjm.TIMESTAMP_TEST;
>>>>>>>> 
>>>>>>>> Then execute the following to get results:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST limit 1;
>>>>>>>> +----------------+----------------+----------------+--------------+
>>>>>>>> |  z_timestamp   |    datetime    |      date      | granularity  |
>>>>>>>> +----------------+----------------+----------------+--------------+
>>>>>>>> | 1429592511991  | 1429520400000  | 1421625600000  | 3600         |
>>>>>>>> +----------------+----------------+----------------+--------------+
>>>>>>>> 1 row selected (2.593 seconds)
>>>>>>>> 
>>>>>>>> 0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST_VIEW limit 1;
>>>>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>>>>> |       z_timestamp        |        datetime        |    date     | granularity  |
>>>>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>>>>> | 2015-04-22 05:16:22.173  | 2015-04-21 12:00:00.0  | 2015-01-20  | 3600         |
>>>>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>>>>> 1 row selected (3.282 seconds)
>>>>>>>> 
>>>>>>>> 0: jdbc:drill:> SELECT
>>>>>>>> . . . . . . . >   CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
>>>>>>>> . . . . . . . >  ,CAST(`datetime` AS TIMESTAMP) AS `datetime`
>>>>>>>> . . . . . . . >  ,CAST(`date` AS DATE) AS `date`
>>>>>>>> . . . . . . . >  ,CAST(`granularity` AS BIGINT) AS `granularity`
>>>>>>>> . . . . . . . >  from TIMESTAMP_TEST limit 1;
>>>>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>>>>> |       z_timestamp        |        datetime        |    date     | granularity  |
>>>>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>>>>> | 2015-04-22 05:16:22.173  | 2015-04-21 09:00:00.0  | 2015-01-20  | 3600         |
>>>>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>>>>> 1 row selected (3.071 seconds)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hope this helps.
>>>>>>>> 
>>>>>>>> Cheers — Chris
>>>>>>>> 
>>>>>>>>> On 14 Oct 2015, at 16:07, Stefán Baxter <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> What is the best practice when working with dates in an
>>>>>>>>> Avro/Parquet scenario?
>>>>>>>>> 
>>>>>>>>> Avro does not support dates directly (only longs) and I'm
>>>>>>>>> wondering how they get persisted in Parquet.
>>>>>>>>> 
>>>>>>>>> Perhaps Parquet does not distinguish between long and date-long in
>>>>>>>>> any significant way.
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> -Stefan
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Julien
>>> 
>> 
>> 
>> --
>> Ryan Blue
>> Software Engineer
>> Cloudera, Inc.
>> 
> 
> 
> 
> -- 
> Julien
