Hi Ryan,

Thanks for this - it sounds like just what we need.

How do we go about doing a trial of the local copies with our code? It would be good to check this all out now if 1.8.0 is delayed for a while. Contact me via https://drillers.slack.com/messages/dev/team/cmathews/ to discuss.

Cheers — Chris

> On 15 Oct 2015, at 17:59, Julien Le Dem <[email protected]> wrote:
>
> thanks Ryan!
> (cc parquet dev list as well)
>
> On Thu, Oct 15, 2015 at 9:46 AM, Ryan Blue <[email protected]> wrote:
>
>> Hi Chris,
>>
>> Avro does have support for dates, but it hasn't been released yet because
>> 1.8.0 was blocked on license issues (AVRO-1722). I have a branch with
>> preliminary parquet-avro support for Decimal (which uses the same Avro
>> construct) if you would like to take a look at it.
>>
>> What we can do in the meantime, before Avro's logical type support is
>> released, is to copy the logical types and conversions into parquet-avro
>> and use local copies until we can use the correct upstream ones. If you're
>> interested, ping me and we can get it working.
>>
>> rb
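[For reference, the upstream construct Ryan describes would look roughly like this once Avro's logical types ship. This is a sketch against the then-unreleased Avro 1.8 LogicalTypes API, not code from his branch:]

    import org.apache.avro.LogicalTypes;
    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    public class LogicalTypeSketch {
        public static void main(String[] args) {
            // Annotate a plain Avro long as timestamp-millis: the physical type
            // stays int64, the logical type tells readers to interpret it as
            // milliseconds since the Unix epoch, UTC (the same construct
            // parquet-avro can map to Parquet's TIMESTAMP_MILLIS).
            Schema timestampMillis =
                LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));

            // The same idea for DATE: an int annotated as days since the epoch.
            Schema date =
                LogicalTypes.date().addToSchema(Schema.create(Schema.Type.INT));

            Schema record = SchemaBuilder.record("timestamp_test")
                .fields()
                .name("z_timestamp").type(timestampMillis).noDefault()
                .name("date").type(date).noDefault()
                .endRecord();

            System.out.println(record.toString(true));
        }
    }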
>> On 10/15/2015 09:24 AM, Julien Le Dem wrote:
>>
>>> Hi Chris,
>>> You could probably contribute some sort of type annotation to
>>> parquet-avro so that it produces the data type in the Parquet schema.
>>> This class generates a Parquet schema from the Avro schema:
>>> https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java
>>> It looks like Avro has some annotation support but I have never used it.
>>> http://avro.apache.org/docs/1.4.1/idl.html#minutiae_annotations
>>> (CC'ing Ryan, who knows the avro integration much better)
>>>
>>> On Thu, Oct 15, 2015 at 8:45 AM, Chris Mathews <[email protected]> wrote:
>>>
>>> Thank you Jacques - yes, this is exactly the issue I am having.
>>>
>>> We are currently using Avro to define schemas for our Parquet files,
>>> and as you correctly point out there is no way of defining date
>>> types in Avro. Due to the volumes of data we are dealing with,
>>> using CTAS is not an option for us, as we need to create the Parquet
>>> files on the fly rather than create intermediate files for use as
>>> input to the CTAS.
>>>
>>> Since Avro is the restriction here, we need to come up with some way
>>> of defining our Parquet schemas so the date types can be defined
>>> natively; and as you say, there will then be no need for any casting.
>>>
>>> What we really need is some way of creating a schema for use in a
>>> Java app, similar to the way Avro works using JSON, for ease of
>>> implementation but without the restrictions.
>>>
>>> We are still investigating this approach, but for now we are casting!
>>>
>>> Cheers — Chris
>>>
>>>> On 15 Oct 2015, at 16:19, Jacques Nadeau <[email protected]> wrote:
>>>>
>>>> A little clarification here:
>>>>
>>>> Parquet has native support for date types. Drill does too. However, since
>>>> Avro does not, there is no way that I know of to write a Parquet file via
>>>> the Avro adapter that will not require a cast. If you did a CTAS in Drill
>>>> and cast the data types correctly in the CTAS, then Drill will output
>>>> Parquet files that never need casting again.
>>>>
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
>>>>
>>>> On Thu, Oct 15, 2015 at 7:31 AM, Stefán Baxter <[email protected]> wrote:
>>>>
>>>>> Hi Chris,
>>>>>
>>>>> I understand now, thank you.
>>>>>
>>>>> What threw me off was that, in our standard use-case, we are not using cast
>>>>> for our TIMESTAMP_MILLIS fields. I thought we were getting them directly
>>>>> formatted from Parquet, but I had overlooked our UDF that handles the
>>>>> casting... sorry :).
>>>>>
>>>>> Thank you for taking the time!
>>>>>
>>>>> Regards,
>>>>> -Stefan
>>>>>
>>>>> On Thu, Oct 15, 2015 at 2:21 PM, Chris Mathews <[email protected]> wrote:
>>>>>
>>>>>> Hi Stefan,
>>>>>>
>>>>>> I am not sure I fully understand your question 'why you don't seem to be
>>>>>> storing your dates in Parquet Date files.'
>>>>>>
>>>>>> As far as I am aware, all date types in Parquet (i.e. DATE, TIME_MILLIS,
>>>>>> TIMESTAMP_MILLIS) are stored as annotated int32 or int64 types. The only
>>>>>> other option is to store them as strings (or VARCHAR) and interpret them
>>>>>> as required when selecting from the Parquet files.
>>>>>>
>>>>>> Please let me know if I have understood your question correctly or not.
>>>>>>
>>>>>> What I have not achieved yet is to use Avro schema definitions (via JSON)
>>>>>> to define a TIMESTAMP type, which is why we have gone the route of defining
>>>>>> a VIEW for each Parquet file. By doing this we reduce the amount of casting
>>>>>> we have to do when building the query, since the VIEW effectively does all
>>>>>> the casting for us behind the scenes.
>>>>>>
>>>>>> We are currently looking at the possibility of defining Parquet schemas
>>>>>> directly (using Java) without going the Avro route; in other words, produce
>>>>>> a parser from JSON to Parquet, similar to the Avro parser but supporting
>>>>>> some other logical types.
>>>>>>
>>>>>> Some background to our Drill trials:
>>>>>>
>>>>>> We are generating billions of columns from machine-generated data every
>>>>>> day. There are quite a number of different types of machine generating this
>>>>>> data, and the format of the data varies between machines. Some produce
>>>>>> string format dates/timestamps and others numeric (Unix epoch style), and
>>>>>> we have to normalise this data to a common format; for dates this format is
>>>>>> the Parquet TIMESTAMP_MILLIS type, because we need to use millisecond
>>>>>> granularity when available.
>>>>>>
>>>>>> Quote from the Parquet docs:
>>>>>>
>>>>>> "TIMESTAMP_MILLIS Logical date and time. Annotates an int64 that stores
>>>>>> the number of milliseconds from the Unix epoch, 00:00:00.000 on 1 January
>>>>>> 1970, UTC."
>>>>>>
>>>>>> This type corresponds nicely to the SQL type TIMESTAMP (which is why we
>>>>>> cast).
>>>>>>
>>>>>> Again, hope this helps.
>>>>>>
>>>>>> Cheers -- Chris
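[The normalisation step Chris describes, flattening mixed string and epoch-style machine timestamps to epoch millis, could look something like the following java.time sketch; the input formats and method names here are illustrative assumptions, not code from the thread:]

    import java.time.LocalDateTime;
    import java.time.ZoneOffset;
    import java.time.format.DateTimeFormatter;

    public class NormaliseToMillis {

        private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

        // String-format machine timestamp -> TIMESTAMP_MILLIS value (UTC assumed).
        static long fromString(String raw) {
            return LocalDateTime.parse(raw, TS).toInstant(ZoneOffset.UTC).toEpochMilli();
        }

        // Unix-epoch-style machine timestamp in whole seconds -> millis
        // (second-granularity sources lose nothing in the widening).
        static long fromEpochSeconds(long seconds) {
            return seconds * 1000L;
        }

        public static void main(String[] args) {
            System.out.println(fromString("2015-04-22 05:16:22.173"));
            System.out.println(fromEpochSeconds(1429592511L));
        }
    }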
>>>>>>
>>>>>>> On 15 Oct 2015, at 14:46, Stefán Baxter <[email protected]> wrote:
>>>>>>>
>>>>>>> Thank you Chris, this clarifies a whole lot :).
>>>>>>>
>>>>>>> I wanted to try to avoid the cast in the CTAS on the way from Avro to
>>>>>>> Parquet (not possible) and then avoid casting as much as possible when
>>>>>>> selecting from the Parquet files.
>>>>>>>
>>>>>>> What is still unclear to me is why you don't seem to be storing your
>>>>>>> dates in Parquet Date files.
>>>>>>>
>>>>>>> Can you please elaborate a bit on the pros/cons?
>>>>>>>
>>>>>>> Regards,
>>>>>>> -Stefan
>>>>>>>
>>>>>>> On Thu, Oct 15, 2015 at 10:59 AM, Chris Mathews <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello Stefan,
>>>>>>>>
>>>>>>>> We use Avro to define our schemas for Parquet files, and we find that
>>>>>>>> using long for dates and converting the dates to long as milliseconds
>>>>>>>> works. We then CAST the long to a TIMESTAMP on the way out during the
>>>>>>>> SELECT statement (or by using a VIEW).
>>>>>>>>
>>>>>>>> example java snippet:
>>>>>>>>
>>>>>>>> // various date and time formats
>>>>>>>> public static final String FORMAT_Z_TIMESTAMP = "yyyy-MM-dd HH:mm:ss.SSS";
>>>>>>>> public static final String FORMAT_DATETIME    = "yyyy-MM-dd HH:mm:ss";
>>>>>>>> public static final String FORMAT_DATE        = "yyyy-MM-dd";
>>>>>>>>
>>>>>>>> …
>>>>>>>>
>>>>>>>> // parser for each format
>>>>>>>> public final SimpleDateFormat sdf_z_timestamp = new SimpleDateFormat( FORMAT_Z_TIMESTAMP );
>>>>>>>> public final SimpleDateFormat sdf_datetime    = new SimpleDateFormat( FORMAT_DATETIME );
>>>>>>>> public final SimpleDateFormat sdf_date        = new SimpleDateFormat( FORMAT_DATE );
>>>>>>>>
>>>>>>>> …
>>>>>>>>
>>>>>>>> // choose parser based on column name / string format
>>>>>>>> public SimpleDateFormat sdf = (NAME_Z_TIMESTAMP.equals(name())) ? sdf_z_timestamp
>>>>>>>>                             : (NAME_DATETIME.equals(name()))    ? sdf_datetime
>>>>>>>>                             : (NAME_DATE.equals(name()))        ? sdf_date
>>>>>>>>                             : null;
>>>>>>>> …
>>>>>>>>
>>>>>>>> Date date = sdf.parse(str);
>>>>>>>> long millis = date.getTime();
>>>>>>>> Object value = new java.lang.Long(millis);
>>>>>>>>
>>>>>>>> We then use something like
>>>>>>>>
>>>>>>>> AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<>(hdfs_path, schema);
>>>>>>>> GenericRecord data = new GenericData.Record(schema);
>>>>>>>> data.put( name(), value );
>>>>>>>> writer.write(data);
>>>>>>>>
>>>>>>>> to write the records out directly to a Parquet file.
>>>>>>>>
>>>>>>>> example schema:
>>>>>>>>
>>>>>>>> {
>>>>>>>>     "type": "record",
>>>>>>>>     "name": "timestamp_test",
>>>>>>>>     "doc": "Avro -> Parquet long to timestamp test",
>>>>>>>>     "fields":
>>>>>>>>     [
>>>>>>>>          { "name": "z_timestamp", "type": "long" }
>>>>>>>>         ,{ "name": "datetime",    "type": "long" }
>>>>>>>>         ,{ "name": "date",        "type": "long" }
>>>>>>>>         ,{ "name": "granularity", "type": "long" }
>>>>>>>>     ]
>>>>>>>> }
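[Pulled together, the write path above compiles into a self-contained program along these lines; the imports, the two-argument AvroParquetWriter constructor from the parquet-avro releases of this period (since deprecated in favour of a builder), and the output path are assumptions:]

    import java.text.SimpleDateFormat;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;

    public class TimestampTestWriter {
        public static void main(String[] args) throws Exception {
            // Cut-down version of the example schema: every date-like field is a long.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"timestamp_test\",\"fields\":["
              + "{\"name\":\"z_timestamp\",\"type\":\"long\"},"
              + "{\"name\":\"granularity\",\"type\":\"long\"}]}");

            // String timestamp -> epoch millis, as in the snippet above.
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
            long millis = sdf.parse("2015-04-22 05:16:22.173").getTime();

            AvroParquetWriter<GenericRecord> writer =
                new AvroParquetWriter<>(new Path("/tmp/timestamp_test.parquet"), schema);

            GenericRecord record = new GenericData.Record(schema);
            record.put("z_timestamp", millis);
            record.put("granularity", 3600L);
            writer.write(record);
            writer.close();
        }
    }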
>>>>>>>>
>>>>>>>> Then to get the data back we either define a VIEW, or cast during the
>>>>>>>> SELECT statement.
>>>>>>>>
>>>>>>>> example view:
>>>>>>>>
>>>>>>>> use hdfs.cjm;
>>>>>>>>
>>>>>>>> create or replace view TIMESTAMP_TEST_VIEW as
>>>>>>>> SELECT
>>>>>>>>      CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
>>>>>>>>     ,CAST(`datetime`    AS TIMESTAMP) AS `datetime`
>>>>>>>>     ,CAST(`date`        AS DATE)      AS `date`
>>>>>>>>     ,CAST(`granularity` AS BIGINT)    AS `granularity`
>>>>>>>> FROM hdfs.cjm.TIMESTAMP_TEST;
>>>>>>>>
>>>>>>>> Then execute the following to get results:
>>>>>>>>
>>>>>>>> 0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST limit 1;
>>>>>>>> +----------------+----------------+----------------+--------------+
>>>>>>>> |  z_timestamp   |    datetime    |      date      | granularity  |
>>>>>>>> +----------------+----------------+----------------+--------------+
>>>>>>>> | 1429592511991  | 1429520400000  | 1421625600000  | 3600         |
>>>>>>>> +----------------+----------------+----------------+--------------+
>>>>>>>> 1 row selected (2.593 seconds)
>>>>>>>>
>>>>>>>> 0: jdbc:drill:> select z_timestamp, `datetime`, `date`, granularity from TIMESTAMP_TEST_VIEW limit 1;
>>>>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>>>>> |       z_timestamp        |        datetime        |    date     | granularity  |
>>>>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>>>>> | 2015-04-22 05:16:22.173  | 2015-04-21 12:00:00.0  | 2015-01-20  | 3600         |
>>>>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>>>>> 1 row selected (3.282 seconds)
>>>>>>>>
>>>>>>>> 0: jdbc:drill:> SELECT
>>>>>>>> . . . . . . . >      CAST(`z_timestamp` AS TIMESTAMP) AS `z_timestamp`
>>>>>>>> . . . . . . . >     ,CAST(`datetime`    AS TIMESTAMP) AS `datetime`
>>>>>>>> . . . . . . . >     ,CAST(`date`        AS DATE)      AS `date`
>>>>>>>> . . . . . . . >     ,CAST(`granularity` AS BIGINT)    AS `granularity`
>>>>>>>> . . . . . . . > from TIMESTAMP_TEST limit 1;
>>>>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>>>>> |       z_timestamp        |        datetime        |    date     | granularity  |
>>>>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>>>>> | 2015-04-22 05:16:22.173  | 2015-04-21 09:00:00.0  | 2015-01-20  | 3600         |
>>>>>>>> +--------------------------+------------------------+-------------+--------------+
>>>>>>>> 1 row selected (3.071 seconds)
>>>>>>>>
>>>>>>>> Hope this helps.
>>>>>>>>
>>>>>>>> Cheers — Chris
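[The same view can also be read programmatically over Drill's JDBC driver, which is what the sqlline prompt above is using; a minimal sketch, where the connection string and view path are assumptions for illustration:]

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.sql.Timestamp;

    public class ReadBackViaJdbc {
        public static void main(String[] args) throws Exception {
            // "zk=local" targets an embedded/local Drillbit; point at ZooKeeper for a cluster.
            try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT z_timestamp, granularity FROM hdfs.cjm.TIMESTAMP_TEST_VIEW LIMIT 1")) {
                while (rs.next()) {
                    // Because the view does the CAST, the driver hands back a real
                    // TIMESTAMP rather than the underlying int64 millis.
                    Timestamp ts = rs.getTimestamp("z_timestamp");
                    long granularity = rs.getLong("granularity");
                    System.out.println(ts + " / " + granularity);
                }
            }
        }
    }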
>>>>>>>>
>>>>>>>>> On 14 Oct 2015, at 16:07, Stefán Baxter <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> What is the best practice when working with dates in an Avro/Parquet
>>>>>>>>> scenario?
>>>>>>>>>
>>>>>>>>> Avro does not support dates directly (only longs) and I'm wondering how
>>>>>>>>> they get persisted in Parquet.
>>>>>>>>>
>>>>>>>>> Perhaps Parquet does not distinguish between long and date-long in any
>>>>>>>>> significant way.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> -Stefan
>>>
>>> --
>>> Julien
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Cloudera, Inc.
>
> --
> Julien
