Hi Chris,

You could probably contribute some sort of type annotation to parquet-avro so that it produces the proper data type in the Parquet schema. This class generates a Parquet schema from the Avro schema:

https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java

It looks like Avro has some annotation support, but I have never used it:

http://avro.apache.org/docs/1.4.1/idl.html#minutiae_annotations

(CC'ing Ryan, who knows the Avro integration much better.)
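
For illustration, a hypothetical (untested) sketch of what such an annotation could look like, assuming Avro grows a logical-type API along these lines; current Avro releases do not expose this, so every call here is an assumption:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

// Tag a plain Avro long as a millisecond timestamp.
Schema tsMillis = LogicalTypes.timestampMillis()
        .addToSchema(Schema.create(Schema.Type.LONG));

// The field then carries enough metadata for a converter such as
// AvroSchemaConverter to emit int64 (TIMESTAMP_MILLIS) in the Parquet
// schema instead of a plain int64.
Schema record = SchemaBuilder.record("timestamp_test")
        .fields()
        .name("z_timestamp").type(tsMillis).noDefault()
        .endRecord();

Teaching the converter to honour the annotation would be the part to contribute to parquet-avro.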
On Thu, Oct 15, 2015 at 8:45 AM, Chris Mathews <[email protected]> wrote:

> Thank you Jacques - yes, this is exactly the issue I am having.
>
> We are currently using Avro to define schemas for our Parquet files, and as you correctly point out there is no way of defining date types in Avro. Due to the volumes of data we are dealing with, using CTAS is not an option for us, as we need to create the Parquet files on the fly rather than create intermediate files for use as input to the CTAS.
>
> Since Avro is the restriction here, we need to come up with some way of defining our Parquet schemas so the date types can be defined natively; and, as you say, there will then be no need for any casting again.
>
> What we really need is some way of creating a schema for use in a Java app, similar to the way Avro works using JSON, for ease of implementation but without the restrictions.
>
> We are still investigating this approach, but for now we are casting!
>
> Cheers — Chris
>
> On 15 Oct 2015, at 16:19, Jacques Nadeau <[email protected]> wrote:
> >
> > A little clarification here:
> >
> > Parquet has native support for date types. Drill does too. However, since Avro does not, there is no way that I know of to write a Parquet file via the Avro adapter that will not require a cast. If you did a CTAS in Drill and cast the data types correctly in the CTAS, then Drill will output Parquet files that never need casting again.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Thu, Oct 15, 2015 at 7:31 AM, Stefán Baxter <[email protected]> wrote:
> >
> >> Hi Chris,
> >>
> >> I understand now, thank you.
> >>
> >> What threw me off was that, in our standard use-case, we are not using cast for our TIMESTAMP_MILLIS fields and I thought we were getting them directly formatted from Parquet, but then I overlooked our UDF that is handling the casting... sorry :).
> >>
> >> Thank you for taking the time!
> >>
> >> Regards,
> >> -Stefan
> >>
> >> On Thu, Oct 15, 2015 at 2:21 PM, Chris Mathews <[email protected]> wrote:
> >>
> >>> Hi Stefan
> >>>
> >>> I am not sure I fully understand your question 'why you don't seem to be storing your dates in Parquet Date files.'
> >>>
> >>> As far as I am aware, all date types in Parquet (i.e. DATE, TIME_MILLIS, TIMESTAMP_MILLIS) are stored as annotated int32 or int64 types. The only other option is to store them as strings (or VARCHAR) and interpret them as required when selecting from the Parquet files.
> >>>
> >>> Please let me know if I have understood your question correctly or not.
> >>>
> >>> What I have not achieved yet is to use Avro schema definitions (via JSON) to define a TIMESTAMP type, which is why we have gone the route of defining a VIEW for each Parquet file. By doing this we reduce the amount of casting we have to do when building the query, since the VIEW effectively does all the casting for us behind the scenes.
> >>>
> >>> We are currently looking at the possibility of defining Parquet schemas directly (using Java) without going the Avro route; in other words, produce a parser from JSON to Parquet, similar to the Avro parser but supporting some other logical types.
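> >>>
> >>> As a rough illustration of the idea, an untested sketch using the parquet-mr schema builder might look like the following; treat the exact API calls as assumptions on our part:
> >>>
> >>> import org.apache.parquet.schema.MessageType;
> >>> import org.apache.parquet.schema.OriginalType;
> >>> import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
> >>> import org.apache.parquet.schema.Types;
> >>>
> >>> // The same columns as our Avro schema further down the thread, but
> >>> // with the date/time columns annotated natively in Parquet.
> >>> MessageType schema = Types.buildMessage()
> >>>     .required(PrimitiveTypeName.INT64)
> >>>         .as(OriginalType.TIMESTAMP_MILLIS).named("z_timestamp")
> >>>     .required(PrimitiveTypeName.INT64)
> >>>         .as(OriginalType.TIMESTAMP_MILLIS).named("datetime")
> >>>     .required(PrimitiveTypeName.INT32)
> >>>         .as(OriginalType.DATE).named("date")  // DATE annotates int32 (days)
> >>>     .required(PrimitiveTypeName.INT64).named("granularity")
> >>>     .named("timestamp_test");
> >>>
> >>> A schema like this would make the casts (and the VIEW) unnecessary, since the types are declared in the Parquet file itself.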
> >>> Some background to our drill trials:
> >>>
> >>> We are generating billions of columns from machine-generated data every day. There are quite a number of different types of machine generating this data, and the format of the data varies between machines. Some produce string-format dates/timestamps and others numeric (Unix epoch style), and we have to normalise this data to a common format; for dates this format is the Parquet TIMESTAMP_MILLIS type, because we need to use millisecond granularity when available.
> >>>
> >>> Quote from the Parquet docs:
> >>>
> >>> "TIMESTAMP_MILLIS Logical date and time. Annotates an int64 that stores the number of milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC."
> >>>
> >>> This type corresponds nicely to the SQL type TIMESTAMP (which is why we cast).
> >>>
> >>> Again, hope this helps.
> >>>
> >>> Cheers -- Chris
> >>>
> >>>> On 15 Oct 2015, at 14:46, Stefán Baxter <[email protected]> wrote:
> >>>>
> >>>> Thank you Chris, this clarifies a whole lot :).
> >>>>
> >>>> I wanted to try to avoid the cast in the CTAS on the way from Avro to Parquet (not possible) and then avoid casting as much as possible when selecting from the Parquet files.
> >>>>
> >>>> What is still unclear to me is why you don't seem to be storing your dates in Parquet Date files.
> >>>>
> >>>> Can you please elaborate a bit on the pros/cons?
> >>>>
> >>>> Regards,
> >>>> -Stefan
> >>>>
> >>>> On Thu, Oct 15, 2015 at 10:59 AM, Chris Mathews <[email protected]> wrote:
> >>>>
> >>>>> Hello Stefan
> >>>>>
> >>>>> We use Avro to define our schemas for Parquet files, and we find that using long for dates and converting the dates to long using milliseconds works. We then CAST the long to a TIMESTAMP on the way out during the SELECT statement (or by using a VIEW).
> >>>>>
> >>>>> Example Java snippet:
> >>>>>
> >>>>> // various date and time formats
> >>>>> public static final String FORMAT_Z_TIMESTAMP = "yyyy-MM-dd HH:mm:ss.SSS";
> >>>>> public static final String FORMAT_DATETIME    = "yyyy-MM-dd HH:mm:ss";
> >>>>> public static final String FORMAT_DATE        = "yyyy-MM-dd";
> >>>>>
> >>>>> ...
> >>>>>
> >>>>> // parser for each format
> >>>>> public final SimpleDateFormat sdf_z_timestamp = new SimpleDateFormat( FORMAT_Z_TIMESTAMP );
> >>>>> public final SimpleDateFormat sdf_datetime    = new SimpleDateFormat( FORMAT_DATETIME );
> >>>>> public final SimpleDateFormat sdf_date        = new SimpleDateFormat( FORMAT_DATE );
> >>>>>
> >>>>> ...
> >>>>>
> >>>>> // choose parser based on column name / string format
> >>>>> public SimpleDateFormat sdf = (NAME_Z_TIMESTAMP.equals(name())) ? sdf_z_timestamp
> >>>>>                             : (NAME_DATETIME.equals(name()))    ? sdf_datetime
> >>>>>                             : (NAME_DATE.equals(name()))        ? sdf_date
> >>>>>                             : null;
> >>>>> ...
> >>>>>
> >>>>> // parse the incoming string and keep the epoch milliseconds
> >>>>> Date date = sdf.parse(str);
> >>>>> long millis = date.getTime();
> >>>>> Object value = new java.lang.Long(millis);
> >>>>>
> >>>>> We then use something like
> >>>>>
> >>>>> AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<>(hdfs_path, schema);
> >>>>> GenericRecord data = new GenericData.Record(schema);
> >>>>> data.put( name(), value );
> >>>>> writer.write(data);
> >>>>> writer.close();  // close to flush data and write the Parquet footer
> >>>>>
> >>>>> to write the records out directly to a Parquet file.
> >>>>>
> >>>>> Example schema:
> >>>>>
> >>>>> {
> >>>>>   "type": "record",
> >>>>>   "name": "timestamp_test",
> >>>>>   "doc": "Avro -> Parquet long to timestamp test",
> >>>>>   "fields":
> >>>>>   [
> >>>>>      { "name": "z_timestamp", "type": "long" }
> >>>>>     ,{ "name": "datetime",    "type": "long" }
> >>>>>     ,{ "name": "date",        "type": "long" }
> >>>>>     ,{ "name": "granularity", "type": "long" }
> >>>>>   ]
> >>>>> }
> >>>>>
> >>>>> Then to get the data back we either define a VIEW, or cast during the SELECT statement.
> >>>>>
> >>>>> Example view:
> >>>>>
> >>>>> use hdfs.cjm;
> >>>>>
> >>>>> create or replace view TIMESTAMP_TEST_VIEW as
> >>>>> SELECT
> >>>>>    CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
> >>>>>   ,CAST(`datetime`    AS TIMESTAMP) AS `datetime`
> >>>>>   ,CAST(`date`        AS DATE)      AS `date`
> >>>>>   ,CAST(`granularity` AS BIGINT)    AS `granularity`
> >>>>> FROM hdfs.cjm.TIMESTAMP_TEST;
> >>>>>
> >>>>> Then execute the following to get results:
> >>>>>
> >>>>> 0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST limit 1;
> >>>>> +----------------+----------------+----------------+--------------+
> >>>>> |  z_timestamp   |    datetime    |      date      | granularity  |
> >>>>> +----------------+----------------+----------------+--------------+
> >>>>> | 1429592511991  | 1429520400000  | 1421625600000  | 3600         |
> >>>>> +----------------+----------------+----------------+--------------+
> >>>>> 1 row selected (2.593 seconds)
> >>>>>
> >>>>> 0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST_VIEW limit 1;
> >>>>> +--------------------------+------------------------+-------------+--------------+
> >>>>> |       z_timestamp        |        datetime        |    date     | granularity  |
> >>>>> +--------------------------+------------------------+-------------+--------------+
> >>>>> | 2015-04-22 05:16:22.173  | 2015-04-21 12:00:00.0  | 2015-01-20  | 3600         |
> >>>>> +--------------------------+------------------------+-------------+--------------+
> >>>>> 1 row selected (3.282 seconds)
> >>>>>
> >>>>> 0: jdbc:drill:> SELECT
> >>>>> . . . . . . . >    CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
> >>>>> . . . . . . . >   ,CAST(`datetime`    AS TIMESTAMP) AS `datetime`
> >>>>> . . . . . . . >   ,CAST(`date`        AS DATE)      AS `date`
> >>>>> . . . . . . . >   ,CAST(`granularity` AS BIGINT)    AS `granularity`
> >>>>> . . . . . . . > from TIMESTAMP_TEST limit 1;
> >>>>> +--------------------------+------------------------+-------------+--------------+
> >>>>> |       z_timestamp        |        datetime        |    date     | granularity  |
> >>>>> +--------------------------+------------------------+-------------+--------------+
> >>>>> | 2015-04-22 05:16:22.173  | 2015-04-21 09:00:00.0  | 2015-01-20  | 3600         |
> >>>>> +--------------------------+------------------------+-------------+--------------+
> >>>>> 1 row selected (3.071 seconds)
> >>>>>
> >>>>> Hope this helps.
> >>>>>
> >>>>> Cheers — Chris
> >>>>>
> >>>>>> On 14 Oct 2015, at 16:07, Stefán Baxter <[email protected]> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> What is the best practice when working with dates in an Avro/Parquet scenario?
> >>>>>>
> >>>>>> Avro does not support dates directly (only longs) and I'm wondering how they get persisted in Parquet.
> >>>>>>
> >>>>>> Perhaps Parquet does not distinguish between long and date-long in any significant way.
> >>>>>>
> >>>>>> Regards,
> >>>>>> -Stefan

--
Julien
