thanks Ryan! (cc parquet dev list as well)

-- Julien

On Thu, Oct 15, 2015 at 9:46 AM, Ryan Blue <[email protected]> wrote:
Hi Chris,

Avro does have support for dates, but it hasn't been released yet because 1.8.0 was blocked on license issues (AVRO-1722). I have a branch with preliminary parquet-avro support for Decimal (which uses the same Avro construct) if you would like to take a look at it.

What we can do in the meantime, before Avro's logical type support is released, is to copy the logical types and conversions into parquet-avro and use local copies until we can use the correct upstream ones. If you're interested, ping me and we can get it working.

rb

--
Ryan Blue
Software Engineer
Cloudera, Inc.

On 10/15/2015 09:24 AM, Julien Le Dem wrote:

Hi Chris,
You could probably contribute some sort of type annotation to parquet-avro so that it produces the data type in the Parquet schema. This class generates a Parquet schema from the Avro schema:
https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java
It looks like Avro has some annotation support, but I have never used it:
http://avro.apache.org/docs/1.4.1/idl.html#minutiae_annotations
(CC'ing Ryan, who knows the Avro integration much better.)

--
Julien

On Thu, Oct 15, 2015 at 8:45 AM, Chris Mathews <[email protected]> wrote:

Thank you Jacques - yes, this is exactly the issue I am having.

We are currently using Avro to define schemas for our Parquet files, and as you correctly point out there is no way of defining date types in Avro. Due to the volumes of data we are dealing with, using CTAS is not an option for us, as we need to create the Parquet files on the fly rather than create intermediate files for use as input to the CTAS.

Since Avro is the restriction here, we need to come up with some way of defining our Parquet schemas so the date types can be defined natively; and as you say, there will then be no need for any casting again.

What we really need is some way of creating a schema for use in a Java app, similar to the way Avro works using JSON, for ease of implementation but without the restrictions.

We are still investigating this approach, but for now we are casting!

Cheers — Chris

On 15 Oct 2015, at 16:19, Jacques Nadeau <[email protected]> wrote:

A little clarification here:

Parquet has native support for date types. Drill does too. However, since Avro does not, there is no way that I know of to write a Parquet file via the Avro adapter that will not require a cast. If you did a CTAS in Drill and cast the data types correctly in the CTAS, then Drill will output Parquet files that never need casting again.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Oct 15, 2015 at 7:31 AM, Stefán Baxter <[email protected]> wrote:

Hi Chris,

I understand now, thank you.

What threw me off was that, in our standard use case, we are not using CAST for our TIMESTAMP_MILLIS fields. I thought we were getting them directly formatted from Parquet, but I had overlooked our UDF that handles the casting... sorry :).

Thank you for taking the time!

Regards,
-Stefan
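For readers following the thread: once the Avro logical type support Ryan mentions is released (it landed in Avro 1.8), attaching timestamp-millis to a plain long field looks roughly like the sketch below. This is a minimal illustration of the org.apache.avro.LogicalTypes API, not code from this thread; the record and field names are made up.

    import org.apache.avro.LogicalTypes;
    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    public class TimestampSchemaSketch {
        public static Schema build() {
            // A plain Avro long, annotated with the timestamp-millis logical type.
            Schema tsMillis = LogicalTypes.timestampMillis()
                    .addToSchema(Schema.create(Schema.Type.LONG));

            // Wrap it in a record; "event" and "ts" are illustrative names.
            return SchemaBuilder.record("event").fields()
                    .name("ts").type(tsMillis).noDefault()
                    .endRecord();
        }
    }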
On Thu, Oct 15, 2015 at 2:21 PM, Chris Mathews <[email protected]> wrote:

Hi Stefan

I am not sure I fully understand your question 'why you don't seem to be storing your dates in Parquet Date files.'

As far as I am aware, all date types in Parquet (i.e. DATE, TIME_MILLIS, TIMESTAMP_MILLIS) are stored as annotated int32 or int64 types. The only other option is to store them as strings (or VARCHAR) and interpret them as required when selecting from the Parquet files.

Please let me know if I have understood your question correctly or not.

What I have not achieved yet is to use Avro schema definitions (via JSON) to define a TIMESTAMP type, which is why we have gone the route of defining a VIEW for each Parquet file. By doing this we reduce the amount of casting we have to do when building the query, since the VIEW effectively does all the casting for us behind the scenes.

We are currently looking at the possibility of defining Parquet schemas directly (using Java) without going the Avro route; in other words, produce a parser from JSON to Parquet, similar to the Avro parser but supporting some other logical types.

Some background to our Drill trials:

We are generating billions of columns from machine-generated data every day. There are quite a number of different types of machines generating this data, and the format of the data varies between machines. Some produce string-format dates/timestamps and others numeric (Unix epoch style), and we have to normalise this data to a common format; for dates this format is the Parquet TIMESTAMP_MILLIS type, because we need to use millisecond granularity when available.

Quote from the Parquet docs:

"TIMESTAMP_MILLIS Logical date and time. Annotates an int64 that stores the number of milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC."

This type corresponds nicely to the SQL type TIMESTAMP (which is why we cast).

Again, hope this helps.

Cheers -- Chris

On 15 Oct 2015, at 14:46, Stefán Baxter <[email protected]> wrote:

Thank you Chris, this clarifies a whole lot :).

I wanted to try to avoid the cast in the CTAS on the way from Avro to Parquet (not possible) and then avoid casting as much as possible when selecting from the Parquet files.

What is still unclear to me is why you don't seem to be storing your dates in Parquet Date files.

Can you please elaborate a bit on the pros/cons?

Regards,
-Stefan
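As a minimal sketch of the "define the Parquet schema directly in Java" idea Chris mentions above: parquet-mr's Types builder can annotate int64/int32 fields with the TIMESTAMP_MILLIS and DATE logical types directly, with no Avro involved. The message and field names here are illustrative, not from the thread.

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.OriginalType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Types;

    public class ParquetSchemaSketch {
        public static MessageType build() {
            // TIMESTAMP_MILLIS annotates an int64; DATE annotates an int32,
            // matching the annotated-primitive storage Chris describes.
            return Types.buildMessage()
                    .required(PrimitiveTypeName.INT64)
                        .as(OriginalType.TIMESTAMP_MILLIS).named("ts")
                    .required(PrimitiveTypeName.INT32)
                        .as(OriginalType.DATE).named("day")
                    .named("events");
        }
    }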
On Thu, Oct 15, 2015 at 10:59 AM, Chris Mathews <[email protected]> wrote:

Hello Stefan

We use Avro to define our schemas for Parquet files, and we find that using long for dates and converting the dates to long using milliseconds works. We then CAST the long to a TIMESTAMP on the way out during the SELECT statement (or by using a VIEW).

Example Java snippet:

    // various date and time formats
    public static final String FORMAT_Z_TIMESTAMP = "yyyy-MM-dd HH:mm:ss.SSS";
    public static final String FORMAT_DATETIME    = "yyyy-MM-dd HH:mm:ss";
    public static final String FORMAT_DATE        = "yyyy-MM-dd";

    ...

    // parser for each format
    public final SimpleDateFormat sdf_z_timestamp = new SimpleDateFormat( FORMAT_Z_TIMESTAMP );
    public final SimpleDateFormat sdf_datetime    = new SimpleDateFormat( FORMAT_DATETIME );
    public final SimpleDateFormat sdf_date        = new SimpleDateFormat( FORMAT_DATE );

    ...

    // choose parser based on column name / string format
    public SimpleDateFormat sdf = (NAME_Z_TIMESTAMP.equals(name())) ? sdf_z_timestamp
                                : (NAME_DATETIME.equals(name()))    ? sdf_datetime
                                : (NAME_DATE.equals(name()))        ? sdf_date
                                : null;
    ...

    // parse the incoming string and keep the epoch milliseconds as a boxed Long
    Date date = sdf.parse(str);
    long millis = date.getTime();
    Object value = Long.valueOf(millis);

We then use something like

    AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<>(hdfs_path, schema);
    GenericRecord data = new GenericData.Record(schema);
    data.put(name(), value);
    writer.write(data);

to write the records out directly to a Parquet file.

Example schema:

    {
      "type": "record",
      "name": "timestamp_test",
      "doc": "Avro -> Parquet long to timestamp test",
      "fields":
      [
         { "name": "z_timestamp", "type": "long" }
        ,{ "name": "datetime", "type": "long" }
        ,{ "name": "date", "type": "long" }
        ,{ "name": "granularity", "type": "long" }
      ]
    }
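Pulling Chris's fragments together, a self-contained write path might look like the sketch below, assuming avro, parquet-avro, and hadoop-common on the classpath. The output path is hypothetical, and the AvroParquetWriter(Path, Schema) constructor shown is the one available at the time of this thread.

    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;

    public class WriteTimestampTest {
        public static void main(String[] args) throws IOException {
            // The same schema as above, all date fields declared as plain longs.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"timestamp_test\",\"fields\":["
              + "{\"name\":\"z_timestamp\",\"type\":\"long\"},"
              + "{\"name\":\"datetime\",\"type\":\"long\"},"
              + "{\"name\":\"date\",\"type\":\"long\"},"
              + "{\"name\":\"granularity\",\"type\":\"long\"}]}");

            // Hypothetical output location.
            Path path = new Path("hdfs:///tmp/timestamp_test.parquet");

            AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<>(path, schema);
            try {
                GenericRecord rec = new GenericData.Record(schema);
                long now = System.currentTimeMillis();
                rec.put("z_timestamp", now);   // epoch millis, stored as plain int64
                rec.put("datetime", now);
                rec.put("date", now);
                rec.put("granularity", 3600L);
                writer.write(rec);
            } finally {
                writer.close();
            }
        }
    }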
Then to get the data back, we either define a VIEW or cast during the SELECT statement.

Example view:

    use hdfs.cjm;

    create or replace view TIMESTAMP_TEST_VIEW as
    SELECT
         CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
        ,CAST(`datetime` AS TIMESTAMP) AS `datetime`
        ,CAST(`date` AS DATE) AS `date`
        ,CAST(`granularity` AS BIGINT) AS `granularity`
    FROM hdfs.cjm.TIMESTAMP_TEST;

Then execute the following to get results:

    0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST limit 1;
    +----------------+----------------+----------------+--------------+
    |  z_timestamp   |    datetime    |      date      | granularity  |
    +----------------+----------------+----------------+--------------+
    | 1429592511991  | 1429520400000  | 1421625600000  | 3600         |
    +----------------+----------------+----------------+--------------+
    1 row selected (2.593 seconds)

    0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST_VIEW limit 1;
    +--------------------------+------------------------+-------------+--------------+
    |       z_timestamp        |        datetime        |    date     | granularity  |
    +--------------------------+------------------------+-------------+--------------+
    | 2015-04-22 05:16:22.173  | 2015-04-21 12:00:00.0  | 2015-01-20  | 3600         |
    +--------------------------+------------------------+-------------+--------------+
    1 row selected (3.282 seconds)

    0: jdbc:drill:> SELECT
    . . . . . . . >      CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
    . . . . . . . >     ,CAST(`datetime` AS TIMESTAMP) AS `datetime`
    . . . . . . . >     ,CAST(`date` AS DATE) AS `date`
    . . . . . . . >     ,CAST(`granularity` AS BIGINT) AS `granularity`
    . . . . . . . > from TIMESTAMP_TEST limit 1;
    +--------------------------+------------------------+-------------+--------------+
    |       z_timestamp        |        datetime        |    date     | granularity  |
    +--------------------------+------------------------+-------------+--------------+
    | 2015-04-22 05:16:22.173  | 2015-04-21 09:00:00.0  | 2015-01-20  | 3600         |
    +--------------------------+------------------------+-------------+--------------+
    1 row selected (3.071 seconds)

Hope this helps.

Cheers — Chris

On 14 Oct 2015, at 16:07, Stefán Baxter <[email protected]> wrote:

Hi,

What is the best practice when working with dates in an Avro/Parquet scenario?

Avro does not support dates directly (only longs), and I'm wondering how they get persisted in Parquet.

Perhaps Parquet does not distinguish between long and date-long in any significant way.

Regards,
-Stefan
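One way to check Stefán's opening question empirically is to read a file's footer and print its schema: with the all-long Avro schema used in this thread, the date fields should appear as plain int64 with no logical annotation. A minimal sketch, assuming parquet-hadoop on the classpath and using the readFooter API available at the time (since deprecated); the path argument is whatever file the writer above produced.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.schema.MessageType;

    public class InspectSchema {
        public static void main(String[] args) throws Exception {
            // Read the footer and print the Parquet message type.
            MessageType schema = ParquetFileReader
                    .readFooter(new Configuration(), new Path(args[0]))
                    .getFileMetaData().getSchema();
            System.out.println(schema);
        }
    }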
