Hi Jason,

Thank you for the explanation.
I have several *low* cardinality fields that contain semi-long values, and
I think they are a perfect candidate for dictionary encoding. I assumed
that the choice to use dictionary encoding was a bit smarter than this,
e.g. treating a string column where x% of the values are repeated as a
clear signal for its use.

If you can outline what needs to be done and where, then I will gladly
take a stab at it :). Several questions along those lines:

- Does the Parquet library that Drill uses allow for programmatic
  selection of encodings?
- What metadata regarding the column content is available when the choice
  is made?
- Where in the Parquet part of Drill is this logic?
- Is there no ongoing effort in parquet-mr to make the automatic handling
  smarter?
- Are all Parquet encoding options being used by Drill? For example, the
  encoding of longs where the delta between subsequent numbers is stored
  (as I understand it).

To make sure I understand the buffering idea and the delta encoding
correctly, I have appended two small sketches below the quoted thread.

Thanks again.

Regards,
 -Stefan

On Thu, Feb 4, 2016 at 3:36 PM, Jason Altekruse <[email protected]> wrote:

> Hi Stefan,
>
> There is a reason that dictionary encoding is disabled by default. The
> parquet-mr library we leverage for writing Parquet files currently
> writes nearly all columns as dictionary encoded, for all types, when
> dictionary encoding is enabled. This includes columns with integers,
> doubles, dates and timestamps.
>
> Do you have some data that you believe is well suited for dictionary
> encoding in the dataset? I think there are good uses for it, such as
> data coming from systems that support enumerations, which might be
> represented as strings when exported from a database for use with Big
> Data tools like Drill. Unfortunately we do not currently provide a
> mechanism for requesting dictionary encoding on only some columns, and
> we don't do anything like buffer values to determine if a given column
> is well suited for dictionary encoding before starting to write them.
>
> In many cases it obviously is not a good choice, and so we actually take
> a performance hit re-materializing the data out of the dictionary upon
> read.
>
> If you would be interested in trying to contribute such an enhancement I
> would be willing to help you get started with it.
>
> - Jason
>
> On Wed, Feb 3, 2016 at 5:15 AM, Stefán Baxter <[email protected]>
> wrote:
>
> > Hi,
> >
> > I'm converting Avro to Parquet and I'm getting this log entry back for
> > a timestamp field:
> >
> > Written 1,008,842B for [occurred_at] INT64: 591,435 values, 2,169,557B
> > raw, 1,008,606B comp, 5 pages, encodings: [BIT_PACKED, PLAIN,
> > PLAIN_DICTIONARY, RLE], dic { 123,832 entries, 990,656B raw, 123,832B
> > comp }
> >
> > Can someone please tell me if this is the expected encoding for a
> > timestamp field.
> >
> > I'm a bit surprised that it seems to be dictionary based. (Yes, I have
> > enabled dictionary encoding for Parquet files.)
> >
> > Regards,
> > -Stefán
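
P.S. To make sure I understand the buffering idea Jason describes as
missing: here is a minimal, self-contained Java sketch of buffering the
first N values of a column and only choosing dictionary encoding when the
ratio of distinct values is low. All names here are made up for
illustration; this is not actual Drill or parquet-mr code.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical heuristic: buffer the first SAMPLE_SIZE values of a column
 * and pick dictionary encoding only when the distinct-to-total ratio is
 * below a threshold. Illustration only; not Drill or parquet-mr code.
 */
public class DictionaryHeuristic {

    private static final int SAMPLE_SIZE = 10_000;
    private static final double MAX_DISTINCT_RATIO = 0.1; // <= 10% distinct

    private final List<String> buffered = new ArrayList<>();
    private final Set<String> distinct = new HashSet<>();

    /** Buffer a value; returns true once enough values are seen to decide. */
    public boolean offer(String value) {
        buffered.add(value);
        distinct.add(value);
        return buffered.size() >= SAMPLE_SIZE;
    }

    /** Decide after buffering: few distinct values => dictionary pays off. */
    public boolean useDictionary() {
        double ratio = (double) distinct.size() / buffered.size();
        return ratio <= MAX_DISTINCT_RATIO;
    }

    public static void main(String[] args) {
        DictionaryHeuristic h = new DictionaryHeuristic();
        // A low-cardinality column: 4 distinct values repeated many times.
        String[] enumLike = {"NEW", "OPEN", "CLOSED", "ARCHIVED"};
        for (int i = 0; i < 10_000; i++) {
            h.offer(enumLike[i % enumLike.length]);
        }
        System.out.println("use dictionary? " + h.useDictionary()); // true
    }
}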

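P.P.S. And the delta idea for longs I was referring to, as I understand
it. This only shows the delta step; the real Parquet delta encoding for
integers additionally bit-packs the deltas in blocks, so treat this as a
simplified sketch rather than the actual format.

import java.util.Arrays;

/**
 * Simplified illustration of delta encoding for longs: store the first
 * value as-is plus the difference between each subsequent pair. Assumes a
 * non-empty input array. Not the actual Parquet encoding, which also
 * bit-packs the deltas.
 */
public class DeltaEncodingSketch {

    static long[] encode(long[] values) {
        long[] out = new long[values.length];
        out[0] = values[0];                     // first value stored as-is
        for (int i = 1; i < values.length; i++) {
            out[i] = values[i] - values[i - 1]; // small deltas pack tightly
        }
        return out;
    }

    static long[] decode(long[] deltas) {
        long[] out = new long[deltas.length];
        out[0] = deltas[0];
        for (int i = 1; i < deltas.length; i++) {
            out[i] = out[i - 1] + deltas[i];    // running sum restores values
        }
        return out;
    }

    public static void main(String[] args) {
        // Timestamps a few milliseconds apart: large values, tiny deltas.
        long[] ts = {1454600000000L, 1454600000007L, 1454600000012L, 1454600000020L};
        long[] encoded = encode(ts);
        System.out.println(Arrays.toString(encoded)); // [1454600000000, 7, 5, 8]
        System.out.println(Arrays.equals(ts, decode(encoded))); // true
    }
}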