Hi Stefán,

There is a reason that dictionary encoding is disabled by default. The parquet-mr library we leverage for writing Parquet files currently writes nearly all columns as dictionary-encoded, for all types, when dictionary encoding is enabled. This includes columns of integers, doubles, dates, and timestamps.
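To make concrete what that encoding does, here is a minimal sketch (plain Python illustrating the idea, not parquet-mr internals): the writer stores each distinct value once in a dictionary and replaces the column's values with small integer indices, which only saves space when the distinct count is much lower than the value count.

```python
def dictionary_encode(values):
    """Replace each value with an index into a dictionary of distinct values."""
    dictionary = []   # distinct values, in first-seen order
    index_of = {}     # value -> its position in the dictionary
    indices = []      # the encoded column
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices

# A low-cardinality, enumeration-like column: the dictionary stays tiny.
dictionary, indices = dictionary_encode(["red", "blue", "red", "red", "blue"])
# dictionary == ["red", "blue"], indices == [0, 1, 0, 0, 1]
```

For a nearly-unique column such as raw timestamps, the dictionary grows almost as large as the data itself, so the indirection costs space and CPU instead of saving it.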
Do you have some data in the dataset that you believe is well suited to dictionary encoding? I think there are good uses for it, such as data coming from systems that support enumerations, which might be represented as strings when exported from a database for use with Big Data tools like Drill. Unfortunately, we do not currently provide a mechanism for requesting dictionary encoding on only some columns, and we do not do anything like buffering values to determine whether a given column is well suited to dictionary encoding before starting to write it. In many cases it is clearly a poor choice, and we then actually take a performance hit re-materializing the data out of the dictionary on read. If you would be interested in trying to contribute such an enhancement, I would be willing to help you get started with it.

- Jason

On Wed, Feb 3, 2016 at 5:15 AM, Stefán Baxter <[email protected]> wrote:

> Hi,
>
> I'm converting Avro to Parquet and I'm getting this log entry back for a
> timestamp field:
>
> Written 1,008,842B for [occurred_at] INT64: 591,435 values, 2,169,557B raw,
> 1,008,606B comp, 5 pages, encodings: [BIT_PACKED, PLAIN, PLAIN_DICTIONARY,
> RLE], dic { 123,832 entries, 990,656B raw, 123,832B comp}
>
> Can someone please tell me whether this is the expected encoding for a
> timestamp field.
>
> I'm a bit surprised that it seems to be dictionary based. (Yes, I have
> enabled dictionary encoding for Parquet files.)
>
> Regards,
> -Stefán
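A sketch of the kind of check Jason describes, buffering an initial sample of a column's values and only choosing dictionary encoding when the distinct-value ratio is low, might look like the following. This is plain Python illustrating the idea, not a parquet-mr API, and the 0.5 threshold is an arbitrary assumption for the example.

```python
def worth_dictionary_encoding(sample, max_distinct_ratio=0.5):
    """Heuristic: dictionary-encode only if the sampled values repeat enough.

    max_distinct_ratio is an assumed tuning knob, not a parquet-mr setting.
    """
    if not sample:
        return False
    return len(set(sample)) / len(sample) <= max_distinct_ratio

# An enumeration-like string column repeats heavily: good candidate.
print(worth_dictionary_encoding(["US", "DE", "US", "IS", "US", "DE"]))
# A run of unique values, like raw timestamps often are: poor candidate.
print(worth_dictionary_encoding(list(range(100))))
```

In Stefán's log line the heuristic is less clear-cut (123,832 distinct entries out of 591,435 values is roughly a 21% distinct ratio), which is part of why a per-column, data-driven decision is harder than a simple global toggle.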
