Hi Stefán,

There is a reason that dictionary encoding is disabled by default. The parquet-mr library we leverage for writing Parquet files currently writes nearly all columns as dictionary-encoded, for all types, when dictionary encoding is enabled. This includes columns of integers, doubles, dates, and timestamps.
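To make concrete what that encoding does, here is a minimal sketch (plain Python illustrating the idea, not parquet-mr internals): the writer stores each distinct value once in a dictionary and replaces the column's values with small integer indices, which only saves space when the distinct count is much lower than the value count.

```python
def dictionary_encode(values):
    """Replace each value with an index into a dictionary of distinct values."""
    dictionary = []   # distinct values, in first-seen order
    index_of = {}     # value -> its position in the dictionary
    indices = []      # the encoded column
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices

# A low-cardinality, enumeration-like column: the dictionary stays tiny.
dictionary, indices = dictionary_encode(["red", "blue", "red", "red", "blue"])
# dictionary == ["red", "blue"], indices == [0, 1, 0, 0, 1]
```

For a nearly-unique column such as raw timestamps, the dictionary grows almost as large as the data itself, so the indirection costs space and CPU instead of saving it.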
Do you have some data in the dataset that you believe is well suited to dictionary encoding? I think there are good uses for it, such as data coming from systems that support enumerations, which might be represented as strings when exported from a database for use with Big Data tools like Drill. Unfortunately, we do not currently provide a mechanism for requesting dictionary encoding on only some columns, and we do not do anything like buffering values to determine whether a given column is well suited to dictionary encoding before starting to write it. In many cases it is clearly a poor choice, and we then actually take a performance hit re-materializing the data out of the dictionary on read. If you would be interested in trying to contribute such an enhancement, I would be willing to help you get started with it.

- Jason

On Wed, Feb 3, 2016 at 5:15 AM, Stefán Baxter <[email protected]> wrote:

> Hi,
>
> I'm converting Avro to Parquet and I'm getting this log entry back for a
> timestamp field:
>
> Written 1,008,842B for [occurred_at] INT64: 591,435 values, 2,169,557B raw,
> 1,008,606B comp, 5 pages, encodings: [BIT_PACKED, PLAIN, PLAIN_DICTIONARY,
> RLE], dic { 123,832 entries, 990,656B raw, 123,832B comp}
>
> Can someone please tell me whether this is the expected encoding for a
> timestamp field.
>
> I'm a bit surprised that it seems to be dictionary based. (Yes, I have
> enabled dictionary encoding for Parquet files.)
>
> Regards,
> -Stefán
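A sketch of the kind of check Jason describes, buffering an initial sample of a column's values and only choosing dictionary encoding when the distinct-value ratio is low, might look like the following. This is plain Python illustrating the idea, not a parquet-mr API, and the 0.5 threshold is an arbitrary assumption for the example.

```python
def worth_dictionary_encoding(sample, max_distinct_ratio=0.5):
    """Heuristic: dictionary-encode only if the sampled values repeat enough.

    max_distinct_ratio is an assumed tuning knob, not a parquet-mr setting.
    """
    if not sample:
        return False
    return len(set(sample)) / len(sample) <= max_distinct_ratio

# An enumeration-like string column repeats heavily: good candidate.
print(worth_dictionary_encoding(["US", "DE", "US", "IS", "US", "DE"]))
# A run of unique values, like raw timestamps often are: poor candidate.
print(worth_dictionary_encoding(list(range(100))))
```

In Stefán's log line the heuristic is less clear-cut (123,832 distinct entries out of 591,435 values is roughly a 21% distinct ratio), which is part of why a per-column, data-driven decision is harder than a simple global toggle.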
