We haven't turned on the 2.0 encodings in Drill's Parquet writer, so they
have not been thoroughly tested. That being said, we do use the standard
parquet-mr interfaces for reading Parquet files in our complex Parquet
reader, and we currently depend on parquet-mr 1.8.1 in Drill, so it should
be compatible.

I think it would be safest to run with `store.parquet.use_new_reader` set
to true if you are going to be working with Parquet 2.0 files right now.

- Jason
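For reference, this is roughly what the Parquet 2.0 writer setup Stefán
describes below looks like through parquet-avro. A minimal sketch only,
assuming parquet-mr/parquet-avro 1.8.1 on the classpath; the schema, field
name, and output path are invented for illustration:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.column.ParquetProperties;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class Parquet2WriterSketch {
      public static void main(String[] args) throws Exception {
        // Invented single-field schema for illustration.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
            + "[{\"name\":\"occurred_at\",\"type\":\"long\"}]}");

        // Request the Parquet 2.0 writer version; note that dictionary
        // encoding is toggled globally here, not per column.
        try (ParquetWriter<GenericRecord> writer =
            AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/events.parquet"))
                .withSchema(schema)
                .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
                .withDictionaryEncoding(true)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
          GenericRecord record = new GenericData.Record(schema);
          record.put("occurred_at", System.currentTimeMillis());
          writer.write(record);
        }
      }
    }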
On Thu, Feb 4, 2016 at 3:40 PM, Stefán Baxter <[email protected]> wrote:

> OK, the automatic handling and encoding options improve a lot in Parquet
> 2.0. (Manual override is not an option.)
>
> I'm using parquet-mr/parquet-avro to create Parquet 2.0 files
> (ParquetProperties.WriterVersion.PARQUET_2_0).
>
> Drill seems to read them just fine, but I wonder if there are any gotchas.
>
> Regards,
> -Stefán
>
>
> On Thu, Feb 4, 2016 at 4:51 PM, Stefán Baxter <[email protected]> wrote:
>
> > Hi again,
> >
> > I did a little test: ~5 million fairly wide records take 791 MB in
> > Parquet without dictionary encoding and 550 MB with dictionary encoding
> > enabled (the non-dictionary-encoded file is a whopping ~44% bigger).
> > The plain, non-dictionary-encoded file returns results for identical
> > queries in ~20% less time than the one that uses dictionary encoding.
> >
> > Regards,
> > -Stefán
> >
> >
> > On Thu, Feb 4, 2016 at 3:48 PM, Stefán Baxter <[email protected]> wrote:
> >
> >> Hi Jason,
> >>
> >> Thank you for the explanation.
> >>
> >> I have several *low* cardinality fields that contain semi-long values
> >> and they are, I think, a perfect candidate for dictionary encoding.
> >>
> >> I assumed that the choice to use dictionary encoding was a bit smarter
> >> than this and would rely on signals such as a String-typed column where
> >> x% repeated values clearly favor its use.
> >>
> >> If you can outline what needs to be done and where, then I will gladly
> >> take a stab at it :).
> >>
> >> Several questions along those lines:
> >>
> >> - Does the Parquet library that Drill uses allow for programmatic
> >>   per-column selection?
> >> - What metadata, regarding the column content, is available when the
> >>   choice is made?
> >> - Where in the Parquet part of Drill is this logic?
> >> - Is there no ongoing effort in parquet-mr to make the automatic
> >>   handling smarter?
> >> - Are all Parquet encoding options being used by Drill?
> >>   - For example, the delta encoding of longs, where the difference
> >>     between subsequent numbers is stored. (As I understand it.)
> >>
> >> Thanks again.
> >>
> >> Regards,
> >> -Stefan
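To make that last question concrete: delta encoding stores a base value
and then only the differences between neighboring values, which stay small
for slowly growing longs such as epoch timestamps. A toy sketch of the
idea only; Parquet's actual DELTA_BINARY_PACKED encoding additionally
bit-packs the deltas in blocks and miniblocks:

    // Toy illustration of delta encoding for longs; not parquet-mr code.
    // The point: for ~monotonic values like epoch timestamps, the deltas
    // need far fewer bits than the absolute INT64 values.
    public final class DeltaToy {
      static long[] encode(long[] values) {
        long[] deltas = new long[values.length];
        for (int i = 1; i < values.length; i++) {
          deltas[i] = values[i] - values[i - 1];
        }
        if (values.length > 0) {
          deltas[0] = values[0]; // keep the first value as the base
        }
        return deltas;
      }

      static long[] decode(long[] deltas) {
        long[] values = deltas.clone(); // values[0] is already the base
        for (int i = 1; i < values.length; i++) {
          values[i] = values[i - 1] + deltas[i];
        }
        return values;
      }
    }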
> >> On Thu, Feb 4, 2016 at 3:36 PM, Jason Altekruse <[email protected]> wrote:
> >>
> >>> Hi Stefan,
> >>>
> >>> There is a reason that dictionary encoding is disabled by default. The
> >>> parquet-mr library we leverage for writing Parquet files currently
> >>> writes nearly all columns as dictionary-encoded, for all types, when
> >>> dictionary encoding is enabled. This includes columns with integers,
> >>> doubles, dates and timestamps.
> >>>
> >>> Do you have some data in the dataset that you believe is well suited
> >>> for dictionary encoding? I think there are good uses for it, such as
> >>> data coming from systems that support enumerations, which might be
> >>> represented as strings when exported from a database for use with Big
> >>> Data tools like Drill. Unfortunately, we do not currently provide a
> >>> mechanism for requesting dictionary encoding on only some columns, and
> >>> we don't do anything like buffer values to determine whether a given
> >>> column is well suited for dictionary encoding before starting to
> >>> write them.
> >>>
> >>> In many cases it obviously is not a good choice, and so we actually
> >>> take a performance hit re-materializing the data out of the dictionary
> >>> upon read.
> >>>
> >>> If you would be interested in trying to contribute such an enhancement,
> >>> I would be willing to help you get started with it.
> >>>
> >>> - Jason
> >>>
> >>> On Wed, Feb 3, 2016 at 5:15 AM, Stefán Baxter <[email protected]> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > I'm converting Avro to Parquet and I'm getting this log entry back
> >>> > for a timestamp field:
> >>> >
> >>> > Written 1,008,842B for [occurred_at] INT64: 591,435 values,
> >>> > 2,169,557B raw, 1,008,606B comp, 5 pages, encodings: [BIT_PACKED,
> >>> > PLAIN, PLAIN_DICTIONARY, RLE], dic { 123,832 entries, 990,656B raw,
> >>> > 123,832B comp}
> >>> >
> >>> > Can someone please tell me if this is the expected encoding for a
> >>> > timestamp field?
> >>> >
> >>> > I'm a bit surprised that it seems to be dictionary-based. (Yes, I
> >>> > have enabled dictionary encoding for Parquet files.)
> >>> >
> >>> > Regards,
> >>> > -Stefán
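As for the enhancement Jason invites above, one plausible shape for it is
to buffer a sample of each column's values and only choose dictionary
encoding when the distinct-to-total ratio is low enough. A hypothetical
sketch, not existing Drill or parquet-mr code; the class, method, and
threshold are all invented:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical per-column heuristic: dictionary-encode only when a
    // sampled prefix of the column shows enough repetition.
    final class DictionaryHeuristic {
      // Returns true when the distinct values in the sample are at most
      // maxDistinctRatio of the total, e.g. 0.1 for "at most 10% unique".
      static boolean shouldDictionaryEncode(List<String> sample,
                                            double maxDistinctRatio) {
        if (sample.isEmpty()) {
          return false;
        }
        Set<String> distinct = new HashSet<>(sample);
        return (double) distinct.size() / sample.size() <= maxDistinctRatio;
      }
    }

The open design questions are the ones raised in the thread: how many
values to buffer before the writer commits to an encoding, and where in
Drill's Parquet writer that decision would hook in.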
