Hi Jacques,

You describe my small dilemma much better than I do, thank you.
Regards,
-Stefan

On Thu, Oct 15, 2015 at 3:19 PM, Jacques Nadeau <[email protected]> wrote:

A little clarification here:

Parquet has native support for date types. Drill does too. However, since Avro does not, there is no way that I know of to write a Parquet file via the Avro adapter that will not require a cast. If you did a CTAS in Drill and cast the data types correctly in the CTAS, then Drill will output Parquet files that never need casting again.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Oct 15, 2015 at 7:31 AM, Stefán Baxter <[email protected]> wrote:

Hi Chris,

I understand now, thank you.

What threw me off was that, in our standard use-case, we are not using cast for our TIMESTAMP_MILLIS fields and I thought we were getting them directly formatted from Parquet, but then I overlooked our UDF that is handling the casting... sorry :).

Thank you for taking the time!

Regards,
-Stefan

On Thu, Oct 15, 2015 at 2:21 PM, Chris Mathews <[email protected]> wrote:

Hi Stefan

I am not sure I fully understand your question 'why you don't seem to be storing your dates in Parquet Date files.'

As far as I am aware, all date types in Parquet (i.e. DATE, TIME_MILLIS, TIMESTAMP_MILLIS) are stored as annotated int32 or int64 types. The only other option is to store them as strings (or VARCHAR) and interpret them as required when selecting from the Parquet files.

Please let me know if I have understood your question correctly or not.

What I have not achieved yet is to use Avro schema definitions (via JSON) to define a TIMESTAMP type, which is why we have gone the route of defining a VIEW for each Parquet file. By doing this we reduce the amount of casting we have to do when building the query, since the VIEW effectively does all the casting for us behind the scenes.

We are currently looking at the possibility of defining Parquet schemas directly (using Java) without going the Avro route; in other words, produce a parser from JSON to Parquet, similar to the Avro parser but supporting some other logical types.

Some background to our Drill trials:

We are generating billions of columns from machine-generated data every day. There are quite a number of different types of machine generating this data and the format of the data varies between machines. Some produce string-format dates/timestamps and others numeric (Unix epoch style), and we have to normalise this data to a common format; for dates this format is the Parquet TIMESTAMP_MILLIS type because we need to use millisecond granularity when available.

Quote from the Parquet docs:

"TIMESTAMP_MILLIS Logical date and time. Annotates an int64 that stores the number of milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC."

This type corresponds nicely to the SQL type TIMESTAMP (which is why we cast).

Again, hope this helps.

Cheers -- Chris

On 15 Oct 2015, at 14:46, Stefán Baxter <[email protected]> wrote:

Thank you Chris, this clarifies a whole lot :).
I wanted to try to avoid the cast in the CTAS on the way from Avro to Parquet (not possible) and then avoid casting as much as possible when selecting from the Parquet files.

What is still unclear to me is why you don't seem to be storing your dates in Parquet Date files.

Can you please elaborate a bit on the pros/cons?

Regards,
-Stefan

On Thu, Oct 15, 2015 at 10:59 AM, Chris Mathews <[email protected]> wrote:

Hello Stefan

We use Avro to define our schemas for Parquet files, and we find that using long for dates and converting the dates to long using milliseconds works. We then CAST the long to a TIMESTAMP on the way out during the SELECT statement (or by using a VIEW).

Example Java snippet:

    // various date and time formats
    public static final String FORMAT_Z_TIMESTAMP = "yyyy-MM-dd HH:mm:ss.SSS";
    public static final String FORMAT_DATETIME    = "yyyy-MM-dd HH:mm:ss";
    public static final String FORMAT_DATE        = "yyyy-MM-dd";

    …

    // parser for each format
    public final SimpleDateFormat sdf_z_timestamp = new SimpleDateFormat( FORMAT_Z_TIMESTAMP );
    public final SimpleDateFormat sdf_datetime    = new SimpleDateFormat( FORMAT_DATETIME );
    public final SimpleDateFormat sdf_date        = new SimpleDateFormat( FORMAT_DATE );

    …

    // choose parser based on column name / string format
    public SimpleDateFormat sdf = (NAME_Z_TIMESTAMP.equals(name())) ? sdf_z_timestamp
                                : (NAME_DATETIME.equals(name()))    ? sdf_datetime
                                : (NAME_DATE.equals(name()))        ? sdf_date
                                : null;
    …

    Date date = sdf.parse(str);
    long millis = date.getTime();
    Object value = new java.lang.Long(millis);

We then use something like

    AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<>(hdfs_path, schema);
    GenericRecord data = new GenericData.Record(schema);
    data.put( name(), value );
    writer.write(data);

to write the records out directly to a Parquet file.

Example schema:

    {
      "type": "record",
      "name": "timestamp_test",
      "doc": "Avro -> Parquet long to timestamp test",
      "fields":
      [
         { "name": "z_timestamp", "type": "long" }
        ,{ "name": "datetime", "type": "long" }
        ,{ "name": "date", "type": "long" }
        ,{ "name": "granularity", "type": "long" }
      ]
    }

Then to get the data back we either define a VIEW, or cast during the SELECT statement.
Example view:

    use hdfs.cjm;

    create or replace view TIMESTAMP_TEST_VIEW as
    SELECT
       CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
      ,CAST(`datetime` AS TIMESTAMP) AS `datetime`
      ,CAST(`date` AS DATE) AS `date`
      ,CAST(`granularity` AS BIGINT) AS `granularity`
    FROM hdfs.cjm.TIMESTAMP_TEST;

Then execute the following to get results:

    0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST limit 1;
    +----------------+----------------+----------------+--------------+
    |  z_timestamp   |    datetime    |      date      | granularity  |
    +----------------+----------------+----------------+--------------+
    | 1429592511991  | 1429520400000  | 1421625600000  | 3600         |
    +----------------+----------------+----------------+--------------+
    1 row selected (2.593 seconds)

    0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST_VIEW limit 1;
    +--------------------------+------------------------+-------------+--------------+
    |       z_timestamp        |        datetime        |    date     | granularity  |
    +--------------------------+------------------------+-------------+--------------+
    | 2015-04-22 05:16:22.173  | 2015-04-21 12:00:00.0  | 2015-01-20  | 3600         |
    +--------------------------+------------------------+-------------+--------------+
    1 row selected (3.282 seconds)

    0: jdbc:drill:> SELECT
    . . . . . . . >    CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
    . . . . . . . >   ,CAST(`datetime` AS TIMESTAMP) AS `datetime`
    . . . . . . . >   ,CAST(`date` AS DATE) AS `date`
    . . . . . . . >   ,CAST(`granularity` AS BIGINT) AS `granularity`
    . . . . . . . > from TIMESTAMP_TEST limit 1;
    +--------------------------+------------------------+-------------+--------------+
    |       z_timestamp        |        datetime        |    date     | granularity  |
    +--------------------------+------------------------+-------------+--------------+
    | 2015-04-22 05:16:22.173  | 2015-04-21 09:00:00.0  | 2015-01-20  | 3600         |
    +--------------------------+------------------------+-------------+--------------+
    1 row selected (3.071 seconds)

Hope this helps.

Cheers -- Chris

On 14 Oct 2015, at 16:07, Stefán Baxter <[email protected]> wrote:

Hi,

What is the best practice when working with dates in an Avro/Parquet scenario?

Avro does not support dates directly (only longs) and I'm wondering how they get persisted in Parquet.

Perhaps Parquet does not distinguish between long and date-long in any significant way.

Regards,
-Stefan
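A minimal sketch of the CTAS route Jacques describes above, in Drill SQL. The workspace and table names (dfs.tmp.`timestamp_test_avro`, dfs.tmp.`timestamp_test_parquet`) are placeholders, and the column list simply reuses the fields from Chris's example; the point is that casting once inside the CTAS makes Drill write native Parquet TIMESTAMP/DATE/BIGINT columns, so later SELECTs need no casts and no view.

    -- assumed names: adjust the workspace and table paths to your setup
    ALTER SESSION SET `store.format` = 'parquet';  -- parquet is the default CTAS output format; set explicitly here

    CREATE TABLE dfs.tmp.`timestamp_test_parquet` AS
    SELECT
       CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
      ,CAST(`datetime` AS TIMESTAMP) AS `datetime`
      ,CAST(`date` AS DATE) AS `date`
      ,CAST(`granularity` AS BIGINT) AS `granularity`
    FROM dfs.tmp.`timestamp_test_avro`;

    -- the new Parquet files carry the types natively, so this needs no CAST:
    SELECT z_timestamp, `datetime`, `date`, granularity
    FROM dfs.tmp.`timestamp_test_parquet`
    LIMIT 1;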
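For the "define Parquet schemas directly in Java" route Chris mentions, a rough sketch using the parquet-mr example Group API (org.apache.parquet.hadoop.example). The file path, field names, and single test row are illustrative only, and builder details vary between parquet-mr versions; the relevant part is annotating an int64 column with TIMESTAMP_MILLIS and an int32 column with DATE, so readers such as Drill see real timestamp/date types without any casting.

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.example.data.simple.SimpleGroupFactory;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class DirectParquetTimestampSketch {
        public static void main(String[] args) throws Exception {
            // int64 + TIMESTAMP_MILLIS = milliseconds since the Unix epoch;
            // int32 + DATE = days since the Unix epoch
            MessageType schema = MessageTypeParser.parseMessageType(
                  "message timestamp_test { "
                + "  required int64 z_timestamp (TIMESTAMP_MILLIS); "
                + "  required int32 event_date (DATE); "
                + "  required int64 granularity; "
                + "}");

            try (ParquetWriter<Group> writer = ExampleParquetWriter
                    .builder(new Path("/tmp/timestamp_test.parquet"))  // illustrative path
                    .withType(schema)
                    .build()) {
                long nowMillis = System.currentTimeMillis();
                Group row = new SimpleGroupFactory(schema).newGroup();
                row.add("z_timestamp", nowMillis);                      // millis since epoch
                row.add("event_date", (int) (nowMillis / 86_400_000L)); // days since epoch
                row.add("granularity", 3600L);
                writer.write(row);
            }
        }
    }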
