Hi Jacques,

You describe my small dilemma much better than I do, thank you.
Regards,
-Stefan

On Thu, Oct 15, 2015 at 3:19 PM, Jacques Nadeau <[email protected]> wrote:

A little clarification here:

Parquet has native support for date types. Drill does too. However, since Avro does not, there is no way that I know of to write a Parquet file via the Avro adapter that will not require a cast. If you did a CTAS in Drill and cast the data types correctly in the CTAS, then Drill will output Parquet files that never need casting again.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Oct 15, 2015 at 7:31 AM, Stefán Baxter <[email protected]> wrote:

Hi Chris,

I understand now, thank you.

What threw me off was that, in our standard use-case, we are not using cast for our TIMESTAMP_MILLIS fields and I thought we were getting them directly formatted from Parquet, but then I overlooked our UDF that is handling the casting... sorry :).

Thank you for taking the time!

Regards,
-Stefan

On Thu, Oct 15, 2015 at 2:21 PM, Chris Mathews <[email protected]> wrote:

Hi Stefan

I am not sure I fully understand your question 'why you don't seem to be storing your dates in Parquet Date files.'

As far as I am aware, all date types in Parquet (i.e. DATE, TIME_MILLIS, TIMESTAMP_MILLIS) are stored as annotated int32 or int64 types. The only other option is to store them as strings (or VARCHAR) and interpret them as required when selecting from the Parquet files.

Please let me know if I have understood your question correctly or not.

What I have not achieved yet is to use Avro schema definitions (via JSON) to define a TIMESTAMP type, which is why we have gone the route of defining a VIEW for each Parquet file. By doing this we reduce the amount of casting we have to do when building the query, since the VIEW effectively does all the casting for us behind the scenes.

We are currently looking at the possibility of defining Parquet schemas directly (using Java) without going the Avro route; in other words, produce a parser from JSON to Parquet, similar to the Avro parser but supporting some other logical types.

Some background to our Drill trials:

We are generating billions of columns from machine-generated data every day. There are quite a number of different types of machine generating this data and the format of the data varies between machines. Some produce string-format dates/timestamps and others numeric (Unix epoch style), and we have to normalise this data to a common format; for dates this format is the Parquet TIMESTAMP_MILLIS type because we need to use millisecond granularity when available.

Quote from the Parquet docs:

"TIMESTAMP_MILLIS Logical date and time. Annotates an int64 that stores the number of milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC."

This type corresponds nicely to the SQL type TIMESTAMP (which is why we cast).

Again, hope this helps.

Cheers -- Chris

On 15 Oct 2015, at 14:46, Stefán Baxter <[email protected]> wrote:

Thank you Chris, this clarifies a whole lot :).
I wanted to try to avoid the cast in the CTAS on the way from Avro to Parquet (not possible) and then avoid casting as much as possible when selecting from the Parquet files.

What is still unclear to me is why you don't seem to be storing your dates in Parquet Date files.

Can you please elaborate a bit on the pros/cons?

Regards,
-Stefan

On Thu, Oct 15, 2015 at 10:59 AM, Chris Mathews <[email protected]> wrote:

Hello Stefan

We use Avro to define our schemas for Parquet files, and we find that using long for dates and converting the dates to long using milliseconds works. We then CAST the long to a TIMESTAMP on the way out during the SELECT statement (or by using a VIEW).

Example Java snippet:

    // various date and time formats
    public static final String FORMAT_Z_TIMESTAMP = "yyyy-MM-dd HH:mm:ss.SSS";
    public static final String FORMAT_DATETIME    = "yyyy-MM-dd HH:mm:ss";
    public static final String FORMAT_DATE        = "yyyy-MM-dd";

    …

    // parser for each format
    public final SimpleDateFormat sdf_z_timestamp = new SimpleDateFormat( FORMAT_Z_TIMESTAMP );
    public final SimpleDateFormat sdf_datetime    = new SimpleDateFormat( FORMAT_DATETIME );
    public final SimpleDateFormat sdf_date        = new SimpleDateFormat( FORMAT_DATE );

    …

    // choose parser based on column name / string format
    public SimpleDateFormat sdf = (NAME_Z_TIMESTAMP.equals(name())) ? sdf_z_timestamp
                                : (NAME_DATETIME.equals(name()))    ? sdf_datetime
                                : (NAME_DATE.equals(name()))        ? sdf_date
                                : null;
    …

    Date date = sdf.parse(str);
    long millis = date.getTime();
    Object value = new java.lang.Long(millis);

We then use something like

    AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<>(hdfs_path, schema);
    GenericRecord data = new GenericData.Record(schema);
    data.put( name(), value );
    writer.write(data);

to write the records out directly to a Parquet file.

Example schema:

    {
      "type": "record",
      "name": "timestamp_test",
      "doc": "Avro -> Parquet long to timestamp test",
      "fields":
      [
         { "name": "z_timestamp", "type": "long" }
        ,{ "name": "datetime", "type": "long" }
        ,{ "name": "date", "type": "long" }
        ,{ "name": "granularity", "type": "long" }
      ]
    }

Then to get the data back we either define a VIEW, or cast during the SELECT statement.
Example view:

    use hdfs.cjm;

    create or replace view TIMESTAMP_TEST_VIEW as
    SELECT
       CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
      ,CAST(`datetime` AS TIMESTAMP) AS `datetime`
      ,CAST(`date` AS DATE) AS `date`
      ,CAST(`granularity` AS BIGINT) AS `granularity`
    FROM hdfs.cjm.TIMESTAMP_TEST;

Then execute the following to get results:

    0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST limit 1;
    +----------------+----------------+----------------+--------------+
    |  z_timestamp   |    datetime    |      date      | granularity  |
    +----------------+----------------+----------------+--------------+
    | 1429592511991  | 1429520400000  | 1421625600000  | 3600         |
    +----------------+----------------+----------------+--------------+
    1 row selected (2.593 seconds)

    0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST_VIEW limit 1;
    +--------------------------+------------------------+-------------+--------------+
    |       z_timestamp        |        datetime        |    date     | granularity  |
    +--------------------------+------------------------+-------------+--------------+
    | 2015-04-22 05:16:22.173  | 2015-04-21 12:00:00.0  | 2015-01-20  | 3600         |
    +--------------------------+------------------------+-------------+--------------+
    1 row selected (3.282 seconds)

    0: jdbc:drill:> SELECT
    . . . . . . . >    CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
    . . . . . . . >   ,CAST(`datetime` AS TIMESTAMP) AS `datetime`
    . . . . . . . >   ,CAST(`date` AS DATE) AS `date`
    . . . . . . . >   ,CAST(`granularity` AS BIGINT) AS `granularity`
    . . . . . . . > from TIMESTAMP_TEST limit 1;
    +--------------------------+------------------------+-------------+--------------+
    |       z_timestamp        |        datetime        |    date     | granularity  |
    +--------------------------+------------------------+-------------+--------------+
    | 2015-04-22 05:16:22.173  | 2015-04-21 09:00:00.0  | 2015-01-20  | 3600         |
    +--------------------------+------------------------+-------------+--------------+
    1 row selected (3.071 seconds)

Hope this helps.

Cheers -- Chris

On 14 Oct 2015, at 16:07, Stefán Baxter <[email protected]> wrote:

Hi,

What is the best practice when working with dates in an Avro/Parquet scenario?

Avro does not support dates directly (only longs) and I'm wondering how they get persisted in Parquet.

Perhaps Parquet does not distinguish between long and date-long in any significant way.

Regards,
-Stefan
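A minimal sketch of the CTAS route Jacques describes above, in Drill SQL. The workspace and table names (dfs.tmp.`timestamp_test_avro`, dfs.tmp.`timestamp_test_parquet`) are placeholders, and the column list simply reuses the fields from Chris's example; the point is that casting once inside the CTAS makes Drill write native Parquet TIMESTAMP/DATE/BIGINT columns, so later SELECTs need no casts and no view.

    -- assumed names: adjust the workspace and table paths to your setup
    ALTER SESSION SET `store.format` = 'parquet';  -- parquet is the default CTAS output format; set explicitly here

    CREATE TABLE dfs.tmp.`timestamp_test_parquet` AS
    SELECT
       CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
      ,CAST(`datetime` AS TIMESTAMP) AS `datetime`
      ,CAST(`date` AS DATE) AS `date`
      ,CAST(`granularity` AS BIGINT) AS `granularity`
    FROM dfs.tmp.`timestamp_test_avro`;

    -- the new Parquet files carry the types natively, so this needs no CAST:
    SELECT z_timestamp, `datetime`, `date`, granularity
    FROM dfs.tmp.`timestamp_test_parquet`
    LIMIT 1;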
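For the "define Parquet schemas directly in Java" route Chris mentions, a rough sketch using the parquet-mr example Group API (org.apache.parquet.hadoop.example). The file path, field names, and single test row are illustrative only, and builder details vary between parquet-mr versions; the relevant part is annotating an int64 column with TIMESTAMP_MILLIS and an int32 column with DATE, so readers such as Drill see real timestamp/date types without any casting.

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.example.data.simple.SimpleGroupFactory;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class DirectParquetTimestampSketch {
        public static void main(String[] args) throws Exception {
            // int64 + TIMESTAMP_MILLIS = milliseconds since the Unix epoch;
            // int32 + DATE = days since the Unix epoch
            MessageType schema = MessageTypeParser.parseMessageType(
                  "message timestamp_test { "
                + "  required int64 z_timestamp (TIMESTAMP_MILLIS); "
                + "  required int32 event_date (DATE); "
                + "  required int64 granularity; "
                + "}");

            try (ParquetWriter<Group> writer = ExampleParquetWriter
                    .builder(new Path("/tmp/timestamp_test.parquet"))  // illustrative path
                    .withType(schema)
                    .build()) {
                long nowMillis = System.currentTimeMillis();
                Group row = new SimpleGroupFactory(schema).newGroup();
                row.add("z_timestamp", nowMillis);                      // millis since epoch
                row.add("event_date", (int) (nowMillis / 86_400_000L)); // days since epoch
                row.add("granularity", 3600L);
                writer.write(row);
            }
        }
    }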
