Selection of encoding scheme

Robin Aly Fri, 01 Nov 2019 02:32:58 -0700

Hi,

I have a conceptual question about the selection of encoding schemes for 
parquet columns. Hopefully I didn’t miss this question in the archive.


If I understand correctly, Arrow implements “all” the encoding schemes that 
Parquet supports. But how is one selected for the data in a given column or 
dataset? Is the selection data-driven (e.g. tested on a small subset)? Can I 
somehow influence the selection?

Background: I am using Python to store a pandas DataFrame with relatively 
standard IoT data (device_id, timestamp, value).

device_id           timestamp     value
        0 2016-02-18 21:01:27  0.797649
        0 2016-02-18 23:01:27  0.485878
        0 2016-02-19 01:01:27  0.738183
        0 2016-02-19 03:01:27  0.866196
        0 2016-02-19 05:01:27  0.731805
      ...                 ...       ...
     9999 2016-04-17 08:49:21  0.794262
     9999 2016-04-17 10:49:21  0.659690
     9999 2016-04-17 12:49:21  0.885828
     9999 2016-04-17 14:49:21  0.000009
     9999 2016-04-17 16:49:21  0.805664
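
To make the regularity in this sample concrete, here is a minimal sketch in plain Python. It is only an illustration of the general idea behind delta encoding, not Parquet's actual delta encoding implementation, and the helper names `delta_encode` and `delta_decode` are made up for this example. It uses the five timestamps shown above for device_id 0.

```python
from datetime import datetime

# Timestamps for device_id 0, copied from the sample rows above.
raw = [
    "2016-02-18 21:01:27",
    "2016-02-18 23:01:27",
    "2016-02-19 01:01:27",
    "2016-02-19 03:01:27",
    "2016-02-19 05:01:27",
]
stamps = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in raw]

# Express each timestamp as whole seconds since the first sample.
seconds = [int((t - stamps[0]).total_seconds()) for t in stamps]


def delta_encode(values):
    """Hypothetical helper: return the first value and the successive differences."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[0], deltas


def delta_decode(first, deltas):
    """Hypothetical helper: rebuild the original values from first value + deltas."""
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out


first, deltas = delta_encode(seconds)
print(deltas)  # [7200, 7200, 7200, 7200]
```

Within one device the differences are a constant 7200 seconds (two hours), so the column reduces to one starting value plus a repeated delta. That repetition is what a delta or run-length scheme could exploit.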

I am surprised that pyarrow doesn’t choose the delta / RLE encoding for 
timestamp, as it increases in fixed deltas per device_id:


row group 0
--------------------------------------------------------------------------------
device_id:  INT64 GZIP DO:0 FPO:4 SZ:156990/83620663/532.65 VC:10451833 [more]...
timestamp:  INT64 GZIP DO:0 FPO:157081 SZ:54258488/83620743/1.54 VC:10451833 [more]...
value:      DOUBLE GZIP DO:0 FPO:54415661 SZ:78769352/83620743/1.06 VC:10451833 [more]...
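
For reading the dump above, the following sketch pulls out the `SZ` fields and recomputes the ratios. It assumes that `SZ` is laid out as compressed bytes / uncompressed bytes / ratio, which is consistent with the numbers shown but is an assumption about the tool's output format. The function name `size_ratios` is hypothetical.

```python
import re

# Column lines copied from the row group dump above (trailing "[more]..." dropped).
dump = [
    "device_id:  INT64 GZIP DO:0 FPO:4 SZ:156990/83620663/532.65 VC:10451833",
    "timestamp:  INT64 GZIP DO:0 FPO:157081 SZ:54258488/83620743/1.54 VC:10451833",
    "value:      DOUBLE GZIP DO:0 FPO:54415661 SZ:78769352/83620743/1.06 VC:10451833",
]

# Assumed layout: "<column>: ... SZ:<compressed>/<uncompressed>/<ratio> ..."
SZ_PATTERN = re.compile(r"^(\w+):.*\bSZ:(\d+)/(\d+)/")


def size_ratios(lines):
    """Hypothetical helper: map column name -> uncompressed/compressed ratio."""
    result = {}
    for line in lines:
        match = SZ_PATTERN.search(line)
        if match:
            name = match.group(1)
            compressed = int(match.group(2))
            uncompressed = int(match.group(3))
            result[name] = round(uncompressed / compressed, 2)
    return result


ratios = size_ratios(dump)
for column, ratio in ratios.items():
    print(f"{column}: {ratio}x")
```

Read this way, device_id shrinks by a factor of several hundred while timestamp only shrinks by about 1.5x, even though both columns are highly regular.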

Any help or pointers are welcome.

Cheers
Robin
