Hello Stefan

We use Avro to define our schemas for Parquet files, and we find that storing
dates as long values (milliseconds since the Unix epoch) works well.  We then
CAST the long to a TIMESTAMP on the way out in the SELECT statement (or via a
VIEW).

example java snippet:

//  various date and time formats
public static final String FORMAT_Z_TIMESTAMP = "yyyy-MM-dd HH:mm:ss.SSS";
public static final String FORMAT_DATETIME    = "yyyy-MM-dd HH:mm:ss";
public static final String FORMAT_DATE        = "yyyy-MM-dd";

…

// parser for each format
public final SimpleDateFormat  sdf_z_timestamp = new SimpleDateFormat( FORMAT_Z_TIMESTAMP );
public final SimpleDateFormat  sdf_datetime    = new SimpleDateFormat( FORMAT_DATETIME );
public final SimpleDateFormat  sdf_date        = new SimpleDateFormat( FORMAT_DATE );

…

//  choose parser based on column name / string format
public SimpleDateFormat  sdf = (NAME_Z_TIMESTAMP.equals(name())) ? sdf_z_timestamp
                             : (NAME_DATETIME.equals(name()))    ? sdf_datetime
                             : (NAME_DATE.equals(name()))        ? sdf_date
                             : null;
…

Date date = sdf.parse(str);           // may throw ParseException
long millis = date.getTime();         // epoch milliseconds
Object value = Long.valueOf(millis);  // boxed long to put into the Avro record
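
One caveat worth noting: SimpleDateFormat parses in the JVM's default time
zone unless you set one explicitly, so the same input string can yield
different millis values on machines in different zones (it is also not
thread-safe, so don't share the instances across threads).  A minimal sketch
of pinning a parser to UTC; this is an illustration, not our production code:

import java.text.SimpleDateFormat;
import java.util.TimeZone;

// pin the parser to UTC so the epoch-millis value does not depend on the
// writer machine's default time zone
SimpleDateFormat sdf_date = new SimpleDateFormat( "yyyy-MM-dd" );
sdf_date.setTimeZone( TimeZone.getTimeZone("UTC") );

long millis = sdf_date.parse("2015-01-20").getTime();  // midnight UTC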

We then use something like

AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<>(hdfs_path, schema);
GenericRecord data = new GenericData.Record(schema);
data.put( name(), value);
writer.write(data);

to write the records out directly to a Parquet file.
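
One gotcha: Parquet only writes its footer when the writer is closed, so the
file is not readable until then.  AvroParquetWriter extends ParquetWriter,
which is Closeable, so it is worth wrapping the writes.  A minimal sketch
along the lines of the snippet above:

AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<>(hdfs_path, schema);
try {
    GenericRecord data = new GenericData.Record(schema);
    data.put( name(), value );
    writer.write(data);
} finally {
    writer.close();  // flushes row groups and writes the Parquet footer
}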

example schema:

{
  "type": "record",
  "name": "timestamp_test",
   "doc": "Avro -> Parquet long to timestamp test",
  "fields":
  [
    { "name": "z_timestamp",  "type": "long" }
   ,{ "name": "datetime",     "type": "long" }
   ,{ "name": "date",         "type": "long" }
   ,{ "name": "granularity",  "type": "long" }
  ]
}
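
If it helps, the schema JSON above can be loaded in Java with Avro's
Schema.Parser; here I assume it is saved as timestamp_test.avsc (the filename
is mine):

import java.io.File;
import org.apache.avro.Schema;

// parse the .avsc file into the Schema object used by the writer snippet above
Schema schema = new Schema.Parser().parse( new File("timestamp_test.avsc") );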

Then, to get the data back, we either define a VIEW or cast during the SELECT
statement.

example view:

use hdfs.cjm;

create or replace view TIMESTAMP_TEST_VIEW as
SELECT
  CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
 ,CAST(`datetime` AS TIMESTAMP) AS `datetime`
 ,CAST(`date` AS DATE) AS `date`
 ,CAST(`granularity` AS BIGINT) AS `granularity`

FROM hdfs.cjm.TIMESTAMP_TEST;

Then execute the following to get results:


0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST limit 1;
+----------------+----------------+----------------+--------------+
|  z_timestamp   |    datetime    |      date      | granularity  |
+----------------+----------------+----------------+--------------+
| 1429592511991  | 1429520400000  | 1421625600000  | 3600         |
+----------------+----------------+----------------+--------------+
1 row selected (2.593 seconds)

0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST_VIEW limit 1;
+--------------------------+------------------------+-------------+--------------+
|       z_timestamp        |        datetime        |    date     | granularity  |
+--------------------------+------------------------+-------------+--------------+
| 2015-04-22 05:16:22.173  | 2015-04-21 12:00:00.0  | 2015-01-20  | 3600         |
+--------------------------+------------------------+-------------+--------------+
1 row selected (3.282 seconds)

0: jdbc:drill:> SELECT
. . . . . . . >   CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
. . . . . . . >  ,CAST(`datetime` AS TIMESTAMP) AS `datetime`
. . . . . . . >  ,CAST(`date` AS DATE) AS `date`
. . . . . . . >  ,CAST(`granularity` AS BIGINT) AS `granularity`
. . . . . . . >  from TIMESTAMP_TEST limit 1;
+--------------------------+------------------------+-------------+--------------+
|       z_timestamp        |        datetime        |    date     | granularity  |
+--------------------------+------------------------+-------------+--------------+
| 2015-04-22 05:16:22.173  | 2015-04-21 09:00:00.0  | 2015-01-20  | 3600         |
+--------------------------+------------------------+-------------+--------------+
1 row selected (3.071 seconds)
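
If you are consuming the casted results from Java, the Drill JDBC driver
hands them back as standard java.sql types.  A rough sketch, with the
connection URL assumed (adjust the ZooKeeper quorum to your cluster):

import java.sql.*;

try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=localhost:2181");
     Statement st = conn.createStatement();
     ResultSet rs = st.executeQuery(
         "SELECT `z_timestamp`, `date` FROM hdfs.cjm.TIMESTAMP_TEST_VIEW LIMIT 1")) {
    while (rs.next()) {
        Timestamp ts = rs.getTimestamp("z_timestamp");  // e.g. 2015-04-22 05:16:22.173
        Date d       = rs.getDate("date");              // e.g. 2015-01-20
        System.out.println(ts + "  " + d);
    }
}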


Hope this helps.

Cheers — Chris

> On 14 Oct 2015, at 16:07, Stefán Baxter <[email protected]> wrote:
> 
> Hi,
> 
> What is the best practice when working with dates in an Avro/Parquet
> scenario?
> 
> Avro does not support dates directly (only longs) and I'm wondering how they
> get persisted in Parquet.
> 
> Perhaps Parquet does not distinguish between long and date-long in any
> significant way.
> 
> Regards,
> -Stefan
