Hi Chris,

You could probably contribute some sort of type annotation to parquet-avro so that it produces the proper data type in the Parquet schema. This class generates a Parquet schema from the Avro schema:

https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java

It looks like Avro has some annotation support, but I have never used it:

http://avro.apache.org/docs/1.4.1/idl.html#minutiae_annotations

(CC'ing Ryan, who knows the Avro integration much better.)
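
For illustration, a hypothetical (untested) sketch of what such an annotation could look like, assuming Avro grows a logical-type API along these lines; current Avro releases do not expose this, so every call here is an assumption:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

// Tag a plain Avro long as a millisecond timestamp.
Schema tsMillis = LogicalTypes.timestampMillis()
        .addToSchema(Schema.create(Schema.Type.LONG));

// The field then carries enough metadata for a converter such as
// AvroSchemaConverter to emit int64 (TIMESTAMP_MILLIS) in the Parquet
// schema instead of a plain int64.
Schema record = SchemaBuilder.record("timestamp_test")
        .fields()
        .name("z_timestamp").type(tsMillis).noDefault()
        .endRecord();

Teaching the converter to honour the annotation would be the part to contribute to parquet-avro.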
On Thu, Oct 15, 2015 at 8:45 AM, Chris Mathews <[email protected]> wrote:

> Thank you Jacques - yes, this is exactly the issue I am having.
>
> We are currently using Avro to define schemas for our Parquet files, and as you correctly point out there is no way of defining date types in Avro. Due to the volumes of data we are dealing with, using CTAS is not an option for us, as we need to create the Parquet files on the fly rather than create intermediate files for use as input to the CTAS.
>
> Since Avro is the restriction here, we need to come up with some way of defining our Parquet schemas so the date types can be defined natively; and, as you say, there will then be no need for any casting again.
>
> What we really need is some way of creating a schema for use in a Java app, similar to the way Avro works using JSON, for ease of implementation but without the restrictions.
>
> We are still investigating this approach, but for now we are casting!
>
> Cheers — Chris
>
> On 15 Oct 2015, at 16:19, Jacques Nadeau <[email protected]> wrote:
> >
> > A little clarification here:
> >
> > Parquet has native support for date types. Drill does too. However, since Avro does not, there is no way that I know of to write a Parquet file via the Avro adapter that will not require a cast. If you did a CTAS in Drill and cast the data types correctly in the CTAS, then Drill will output Parquet files that never need casting again.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Thu, Oct 15, 2015 at 7:31 AM, Stefán Baxter <[email protected]> wrote:
> >
> >> Hi Chris,
> >>
> >> I understand now, thank you.
> >>
> >> What threw me off was that, in our standard use-case, we are not using cast for our TIMESTAMP_MILLIS fields and I thought we were getting them directly formatted from Parquet, but then I overlooked our UDF that is handling the casting... sorry :).
> >>
> >> Thank you for taking the time!
> >>
> >> Regards,
> >> -Stefan
> >>
> >> On Thu, Oct 15, 2015 at 2:21 PM, Chris Mathews <[email protected]> wrote:
> >>
> >>> Hi Stefan
> >>>
> >>> I am not sure I fully understand your question 'why you don't seem to be storing your dates in Parquet Date files.'
> >>>
> >>> As far as I am aware, all date types in Parquet (i.e. DATE, TIME_MILLIS, TIMESTAMP_MILLIS) are stored as annotated int32 or int64 types. The only other option is to store them as strings (or VARCHAR) and interpret them as required when selecting from the Parquet files.
> >>>
> >>> Please let me know if I have understood your question correctly or not.
> >>>
> >>> What I have not achieved yet is to use Avro schema definitions (via JSON) to define a TIMESTAMP type, which is why we have gone the route of defining a VIEW for each Parquet file. By doing this we reduce the amount of casting we have to do when building the query, since the VIEW effectively does all the casting for us behind the scenes.
> >>>
> >>> We are currently looking at the possibility of defining Parquet schemas directly (using Java) without going the Avro route; in other words, produce a parser from JSON to Parquet, similar to the Avro parser but supporting some other logical types.
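> >>>
> >>> As a rough illustration of the idea, an untested sketch using the parquet-mr schema builder might look like the following; treat the exact API calls as assumptions on our part:
> >>>
> >>> import org.apache.parquet.schema.MessageType;
> >>> import org.apache.parquet.schema.OriginalType;
> >>> import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
> >>> import org.apache.parquet.schema.Types;
> >>>
> >>> // The same columns as our Avro schema further down the thread, but
> >>> // with the date/time columns annotated natively in Parquet.
> >>> MessageType schema = Types.buildMessage()
> >>>     .required(PrimitiveTypeName.INT64)
> >>>         .as(OriginalType.TIMESTAMP_MILLIS).named("z_timestamp")
> >>>     .required(PrimitiveTypeName.INT64)
> >>>         .as(OriginalType.TIMESTAMP_MILLIS).named("datetime")
> >>>     .required(PrimitiveTypeName.INT32)
> >>>         .as(OriginalType.DATE).named("date")  // DATE annotates int32 (days)
> >>>     .required(PrimitiveTypeName.INT64).named("granularity")
> >>>     .named("timestamp_test");
> >>>
> >>> A schema like this would make the casts (and the VIEW) unnecessary, since the types are declared in the Parquet file itself.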
> >>> Some background to our drill trials:
> >>>
> >>> We are generating billions of columns from machine-generated data every day. There are quite a number of different types of machine generating this data, and the format of the data varies between machines. Some produce string-format dates/timestamps and others numeric (Unix epoch style), and we have to normalise this data to a common format; for dates this format is the Parquet TIMESTAMP_MILLIS type, because we need to use millisecond granularity when available.
> >>>
> >>> Quote from the Parquet docs:
> >>>
> >>> "TIMESTAMP_MILLIS Logical date and time. Annotates an int64 that stores the number of milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC."
> >>>
> >>> This type corresponds nicely to the SQL type TIMESTAMP (which is why we cast).
> >>>
> >>> Again, hope this helps.
> >>>
> >>> Cheers -- Chris
> >>>
> >>>> On 15 Oct 2015, at 14:46, Stefán Baxter <[email protected]> wrote:
> >>>>
> >>>> Thank you Chris, this clarifies a whole lot :).
> >>>>
> >>>> I wanted to try to avoid the cast in the CTAS on the way from Avro to Parquet (not possible) and then avoid casting as much as possible when selecting from the Parquet files.
> >>>>
> >>>> What is still unclear to me is why you don't seem to be storing your dates in Parquet Date files.
> >>>>
> >>>> Can you please elaborate a bit on the pros/cons?
> >>>>
> >>>> Regards,
> >>>> -Stefan
> >>>>
> >>>> On Thu, Oct 15, 2015 at 10:59 AM, Chris Mathews <[email protected]> wrote:
> >>>>
> >>>>> Hello Stefan
> >>>>>
> >>>>> We use Avro to define our schemas for Parquet files, and we find that using long for dates and converting the dates to long using milliseconds works. We then CAST the long to a TIMESTAMP on the way out during the SELECT statement (or by using a VIEW).
> >>>>>
> >>>>> Example Java snippet:
> >>>>>
> >>>>> // various date and time formats
> >>>>> public static final String FORMAT_Z_TIMESTAMP = "yyyy-MM-dd HH:mm:ss.SSS";
> >>>>> public static final String FORMAT_DATETIME    = "yyyy-MM-dd HH:mm:ss";
> >>>>> public static final String FORMAT_DATE        = "yyyy-MM-dd";
> >>>>>
> >>>>> ...
> >>>>>
> >>>>> // parser for each format
> >>>>> public final SimpleDateFormat sdf_z_timestamp = new SimpleDateFormat( FORMAT_Z_TIMESTAMP );
> >>>>> public final SimpleDateFormat sdf_datetime    = new SimpleDateFormat( FORMAT_DATETIME );
> >>>>> public final SimpleDateFormat sdf_date        = new SimpleDateFormat( FORMAT_DATE );
> >>>>>
> >>>>> ...
> >>>>>
> >>>>> // choose parser based on column name / string format
> >>>>> public SimpleDateFormat sdf = (NAME_Z_TIMESTAMP.equals(name())) ? sdf_z_timestamp
> >>>>>                             : (NAME_DATETIME.equals(name()))    ? sdf_datetime
> >>>>>                             : (NAME_DATE.equals(name()))        ? sdf_date
> >>>>>                             : null;
> >>>>> ...
> >>>>>
> >>>>> // parse the incoming string and keep the epoch milliseconds
> >>>>> Date date = sdf.parse(str);
> >>>>> long millis = date.getTime();
> >>>>> Object value = new java.lang.Long(millis);
> >>>>>
> >>>>> We then use something like
> >>>>>
> >>>>> AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<>(hdfs_path, schema);
> >>>>> GenericRecord data = new GenericData.Record(schema);
> >>>>> data.put( name(), value );
> >>>>> writer.write(data);
> >>>>> writer.close();  // close to flush data and write the Parquet footer
> >>>>>
> >>>>> to write the records out directly to a Parquet file.
> >>>>>
> >>>>> Example schema:
> >>>>>
> >>>>> {
> >>>>>   "type": "record",
> >>>>>   "name": "timestamp_test",
> >>>>>   "doc": "Avro -> Parquet long to timestamp test",
> >>>>>   "fields":
> >>>>>   [
> >>>>>      { "name": "z_timestamp", "type": "long" }
> >>>>>     ,{ "name": "datetime",    "type": "long" }
> >>>>>     ,{ "name": "date",        "type": "long" }
> >>>>>     ,{ "name": "granularity", "type": "long" }
> >>>>>   ]
> >>>>> }
> >>>>>
> >>>>> Then to get the data back we either define a VIEW, or cast during the SELECT statement.
> >>>>>
> >>>>> Example view:
> >>>>>
> >>>>> use hdfs.cjm;
> >>>>>
> >>>>> create or replace view TIMESTAMP_TEST_VIEW as
> >>>>> SELECT
> >>>>>    CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
> >>>>>   ,CAST(`datetime`    AS TIMESTAMP) AS `datetime`
> >>>>>   ,CAST(`date`        AS DATE)      AS `date`
> >>>>>   ,CAST(`granularity` AS BIGINT)    AS `granularity`
> >>>>> FROM hdfs.cjm.TIMESTAMP_TEST;
> >>>>>
> >>>>> Then execute the following to get results:
> >>>>>
> >>>>> 0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST limit 1;
> >>>>> +----------------+----------------+----------------+--------------+
> >>>>> |  z_timestamp   |    datetime    |      date      | granularity  |
> >>>>> +----------------+----------------+----------------+--------------+
> >>>>> | 1429592511991  | 1429520400000  | 1421625600000  | 3600         |
> >>>>> +----------------+----------------+----------------+--------------+
> >>>>> 1 row selected (2.593 seconds)
> >>>>>
> >>>>> 0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST_VIEW limit 1;
> >>>>> +--------------------------+------------------------+-------------+--------------+
> >>>>> |       z_timestamp        |        datetime        |    date     | granularity  |
> >>>>> +--------------------------+------------------------+-------------+--------------+
> >>>>> | 2015-04-22 05:16:22.173  | 2015-04-21 12:00:00.0  | 2015-01-20  | 3600         |
> >>>>> +--------------------------+------------------------+-------------+--------------+
> >>>>> 1 row selected (3.282 seconds)
> >>>>>
> >>>>> 0: jdbc:drill:> SELECT
> >>>>> . . . . . . . >    CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
> >>>>> . . . . . . . >   ,CAST(`datetime`    AS TIMESTAMP) AS `datetime`
> >>>>> . . . . . . . >   ,CAST(`date`        AS DATE)      AS `date`
> >>>>> . . . . . . . >   ,CAST(`granularity` AS BIGINT)    AS `granularity`
> >>>>> . . . . . . . > from TIMESTAMP_TEST limit 1;
> >>>>> +--------------------------+------------------------+-------------+--------------+
> >>>>> |       z_timestamp        |        datetime        |    date     | granularity  |
> >>>>> +--------------------------+------------------------+-------------+--------------+
> >>>>> | 2015-04-22 05:16:22.173  | 2015-04-21 09:00:00.0  | 2015-01-20  | 3600         |
> >>>>> +--------------------------+------------------------+-------------+--------------+
> >>>>> 1 row selected (3.071 seconds)
> >>>>>
> >>>>> Hope this helps.
> >>>>>
> >>>>> Cheers — Chris
> >>>>>
> >>>>>> On 14 Oct 2015, at 16:07, Stefán Baxter <[email protected]> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> What is the best practice when working with dates in an Avro/Parquet scenario?
> >>>>>>
> >>>>>> Avro does not support dates directly (only longs) and I'm wondering how they get persisted in Parquet.
> >>>>>>
> >>>>>> Perhaps Parquet does not distinguish between long and date-long in any significant way.
> >>>>>>
> >>>>>> Regards,
> >>>>>> -Stefan

--
Julien
