Hi,
I have a conceptual question about the selection of encoding schemes for
parquet columns. Hopefully I didn’t miss this question in the archive.
If I understand correctly, Arrow implements “all” encoding schemes that Parquet
supports. But how is a scheme selected for the data of a given column/dataset?
Is this selection data driven (e.g. tested on a small subset)? Can I somehow
influence the selection?
Background: I am using Python to store a pandas DataFrame with relatively
standard IoT data (device_id, timestamp, value):
device_id            timestamp     value
        0  2016-02-18 21:01:27  0.797649
        0  2016-02-18 23:01:27  0.485878
        0  2016-02-19 01:01:27  0.738183
        0  2016-02-19 03:01:27  0.866196
        0  2016-02-19 05:01:27  0.731805
      ...                  ...       ...
     9999  2016-04-17 08:49:21  0.794262
     9999  2016-04-17 10:49:21  0.659690
     9999  2016-04-17 12:49:21  0.885828
     9999  2016-04-17 14:49:21  0.000009
     9999  2016-04-17 16:49:21  0.805664
I am surprised that pyarrow doesn’t choose a delta / RLE encoding for
timestamp, as it increases in fixed deltas per device_id:
row group 0
--------------------------------------------------------------------------------
device_id: INT64 GZIP DO:0 FPO:4 SZ:156990/83620663/532.65 VC:10451833
[more]...
timestamp: INT64 GZIP DO:0 FPO:157081 SZ:54258488/83620743/1.54 VC:10451833
[more]...
value: DOUBLE GZIP DO:0 FPO:54415661 SZ:78769352/83620743/1.06 VC:10451833
[more]...
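To illustrate what I mean by fixed deltas, here is a minimal sketch using only the Python standard library. It rebuilds the first five timestamps of device 0 from the sample above and shows that their successive differences collapse to a single value. The microsecond unit for the INT64 representation is an assumption on my side, since the actual unit depends on how the file was written, and the variable names are only illustrative.

```python
from datetime import datetime, timedelta

# Rebuild the first five timestamps of device 0 from the sample above:
# start at 2016-02-18 21:01:27 and step by two hours.
start = datetime(2016, 2, 18, 21, 1, 27)
step = timedelta(hours=2)
timestamps = [start + i * step for i in range(5)]

# Assumption: timestamps are stored as INT64 microseconds since the epoch.
epoch = datetime(1970, 1, 1)
as_int64 = [(ts - epoch) // timedelta(microseconds=1) for ts in timestamps]

# A delta encoding keeps the first value plus the successive differences.
deltas = [b - a for a, b in zip(as_int64, as_int64[1:])]

# Within one device every difference is the same, so the deltas reduce to
# one repeated value, which is what makes delta / RLE attractive here.
distinct_deltas = set(deltas)
print(distinct_deltas)
```

The only irregular deltas would occur at the boundary between two device_ids, so I would expect such an encoding to compress this column far better than the 1.54 ratio shown above.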
Any help / pointers are welcome.
Cheers
Robin