Thank you Chris, this clarifies a whole lot :). I wanted to try to avoid the cast in the CTAS on the way from Avro to Parquet (not possible) and then avoid casting as much as possible when selecting from the Parquet files.
What is still unclear to me is why you don't seem to be storing your dates
using the Parquet DATE type. Can you please elaborate a bit on the pros/cons?

Regards,
-Stefan

On Thu, Oct 15, 2015 at 10:59 AM, Chris Mathews <[email protected]> wrote:

> Hello Stefan
>
> We use Avro to define our schemas for Parquet files, and we find that
> using long for dates and converting the dates to long using milliseconds
> works. We then CAST the long to a TIMESTAMP on the way out during the
> SELECT statement (or by using a VIEW).
>
> example java snippet:
>
> // various date and time formats
> public static final String FORMAT_Z_TIMESTAMP = "yyyy-MM-dd HH:mm:ss.SSS";
> public static final String FORMAT_DATETIME = "yyyy-MM-dd HH:mm:ss";
> public static final String FORMAT_DATE = "yyyy-MM-dd";
>
> …
>
> // parser for each format
> public final SimpleDateFormat sdf_z_timestamp = new SimpleDateFormat( FORMAT_Z_TIMESTAMP );
> public final SimpleDateFormat sdf_datetime = new SimpleDateFormat( FORMAT_DATETIME );
> public final SimpleDateFormat sdf_date = new SimpleDateFormat( FORMAT_DATE );
>
> …
>
> // choose parser based on column name / string format
> public SimpleDateFormat sdf = (NAME_Z_TIMESTAMP.equals(name())) ? sdf_z_timestamp
>                             : (NAME_DATETIME.equals(name()))    ? sdf_datetime
>                             : (NAME_DATE.equals(name()))        ? sdf_date
>                             : null;
>
> …
>
> Date date = sdf.parse(str);
> long millis = date.getTime();
> Object value = new java.lang.Long(millis);
>
> We then use something like
>
> AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<>(hdfs_path, schema);
> GenericRecord data = new GenericData.Record(schema);
> data.put( name(), value);
> writer.write(data);
>
> to write the records out directly to a Parquet file.
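[Editor's note: the string-to-millis round trip described above can be checked in isolation, outside of Drill. A minimal standalone sketch; the format pattern is the one from Chris's snippet, while the class name and the UTC timezone pinning are my additions to make the result deterministic (without an explicit timezone, `SimpleDateFormat` uses the JVM default, which shifts the epoch value by the local offset):]

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateToMillisCheck {
    public static void main(String[] args) throws ParseException {
        // same pattern as FORMAT_DATE in the snippet above
        SimpleDateFormat sdfDate = new SimpleDateFormat("yyyy-MM-dd");

        // pin the parser to UTC so the epoch value is reproducible
        sdfDate.setTimeZone(TimeZone.getTimeZone("UTC"));

        // string -> Date -> epoch millis (the value stored in the Avro long field)
        Date date = sdfDate.parse("2015-01-20");
        long millis = date.getTime();
        System.out.println(millis);   // 1421712000000 (midnight UTC)

        // and back again, roughly what CAST(`date` AS DATE) renders
        System.out.println(sdfDate.format(new Date(millis)));   // 2015-01-20
    }
}
```

[The email's stored value for the same calendar date (1421625600000) differs from the UTC figure by exactly one day's worth of milliseconds, which is the kind of skew a non-UTC default timezone produces.]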
>
> example schema:
>
> {
>     "type": "record",
>     "name": "timestamp_test",
>     "doc": "Avro -> Parquet long to timestamp test",
>     "fields":
>     [
>          { "name": "z_timestamp", "type": "long" }
>         ,{ "name": "datetime", "type": "long" }
>         ,{ "name": "date", "type": "long" }
>         ,{ "name": "granularity", "type": "long" }
>     ]
> }
>
> Then to get the data back we either define a VIEW, or cast during the
> SELECT statement.
>
> example view:
>
> use hdfs.cjm;
>
> create or replace view TIMESTAMP_TEST_VIEW as
> SELECT
>     CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
>     ,CAST(`datetime` AS TIMESTAMP) AS `datetime`
>     ,CAST(`date` AS DATE) AS `date`
>     ,CAST(`granularity` AS BIGINT) AS `granularity`
> FROM hdfs.cjm.TIMESTAMP_TEST;
>
> Then execute the following to get results:
>
> 0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST limit 1;
> +----------------+----------------+----------------+--------------+
> |  z_timestamp   |    datetime    |      date      | granularity  |
> +----------------+----------------+----------------+--------------+
> | 1429592511991  | 1429520400000  | 1421625600000  | 3600         |
> +----------------+----------------+----------------+--------------+
> 1 row selected (2.593 seconds)
>
> 0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST_VIEW limit 1;
> +--------------------------+------------------------+-------------+--------------+
> |       z_timestamp        |        datetime        |    date     | granularity  |
> +--------------------------+------------------------+-------------+--------------+
> | 2015-04-22 05:16:22.173  | 2015-04-21 12:00:00.0  | 2015-01-20  | 3600         |
> +--------------------------+------------------------+-------------+--------------+
> 1 row selected (3.282 seconds)
>
> 0: jdbc:drill:> SELECT
> . . . . . . . > CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
> . . . . . . . > ,CAST(`datetime` AS TIMESTAMP) AS `datetime`
> . . . . . . . > ,CAST(`date` AS DATE) AS `date`
> . . . . . . . > ,CAST(`granularity` AS BIGINT) AS `granularity`
> . . . . . . . > from TIMESTAMP_TEST limit 1;
> +--------------------------+------------------------+-------------+--------------+
> |       z_timestamp        |        datetime        |    date     | granularity  |
> +--------------------------+------------------------+-------------+--------------+
> | 2015-04-22 05:16:22.173  | 2015-04-21 09:00:00.0  | 2015-01-20  | 3600         |
> +--------------------------+------------------------+-------------+--------------+
> 1 row selected (3.071 seconds)
>
> Hope this helps.
>
> Cheers — Chris
>
>
> On 14 Oct 2015, at 16:07, Stefán Baxter <[email protected]> wrote:
> >
> > Hi,
> >
> > What is the best practice when working with dates in an Avro/Parquet
> > scenario?
> >
> > Avro does not support dates directly (only longs) and I'm wondering how
> > they get persisted in Parquet.
> >
> > Perhaps Parquet does not distinguish between long and date-long in any
> > significant way.
> >
> > Regards,
> > -Stefan
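[Editor's note: on Stefan's open question — Avro 1.8 added logical types, which annotate the underlying int/long so that schema-aware tooling can surface real DATE/TIMESTAMP columns instead of bare longs. A hedged variant of Chris's schema is sketched below; it is untested against this pipeline, and whether a given Drill version honors the annotations when reading Parquet would need checking. Note also that the logical `date` type is days since epoch on an int, not milliseconds, so the writer code would have to store days rather than `getTime()` millis for that field:]

```json
{
  "type": "record",
  "name": "timestamp_test",
  "doc": "Avro -> Parquet with date/timestamp logical types",
  "fields":
  [
     { "name": "z_timestamp", "type": { "type": "long", "logicalType": "timestamp-millis" } }
    ,{ "name": "datetime",    "type": { "type": "long", "logicalType": "timestamp-millis" } }
    ,{ "name": "date",        "type": { "type": "int",  "logicalType": "date" } }
    ,{ "name": "granularity", "type": "long" }
  ]
}
```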
