Does Arrow / Parquet have any support for delta encoding?
Some data series compress better when their differences are stored rather than
the values themselves.
Here's an example where the differences are mostly equal to 7 but occasionally
more:
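import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

N = 500000

# Differences: start at 7 everywhere, then bump random positions by 1, ten times over
delta_r = np.full(N, 7)
np.random.seed(123)
for _ in range(10):
    delta_r[np.random.randint(N, size=N//100)] += 1

# The series itself is the running sum of the differences
r = np.cumsum(delta_r)

# Sanity check: differencing the series recovers the original deltas
drcheck = np.diff(r, prepend=0)
assert (delta_r == drcheck).all()

# Write the raw series and its differences to separate Parquet files
a = pa.array(r)
adiff = pa.array(delta_r)
t = pa.Table.from_arrays([a], ['r'])
tdiff = pa.Table.from_arrays([adiff], ['delta_r'])
pq.write_table(t, 't.pq')
pq.write_table(tdiff, 'tdiff.pq')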
When I look at the resulting files, the one holding the differences is about 32 times smaller:
-rw-rw-rw- 1 user group 2591101 Nov 16 13:29 t.pq
-rw-rw-rw- 1 user group 81049 Nov 16 13:29 tdiff.pq
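The same effect can be seen outside Parquet. The following is only a rough sketch using the Python standard library: `delta_encode`, `delta_decode`, and `packed_size` are hypothetical helper names, and `zlib` is an assumed stand-in for a general-purpose compressor, not what Parquet itself does. It builds a smaller series of the same shape, checks that the deltas round-trip, and compares compressed sizes.

```python
import random
import struct
import zlib
from itertools import accumulate


def delta_encode(values):
    """Return each value's difference from its predecessor (first one from 0)."""
    deltas = []
    prev = 0
    for v in values:
        deltas.append(v - prev)
        prev = v
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by taking the running sum."""
    return list(accumulate(deltas))


def packed_size(values):
    """Size in bytes of the values packed as little-endian int64, then zlib-compressed."""
    raw = struct.pack(f"<{len(values)}q", *values)
    return len(zlib.compress(raw))


# A smaller series in the same spirit: steps of 7, occasionally 8
random.seed(123)
steps = [7 + (1 if random.random() < 0.1 else 0) for _ in range(50_000)]
values = list(accumulate(steps))

# Delta encoding is lossless: decoding gives back the original series
assert delta_decode(delta_encode(values)) == values

raw_size = packed_size(values)
delta_size = packed_size(delta_encode(values))
print(f"raw: {raw_size} bytes, delta-encoded: {delta_size} bytes")
```

The deltas compress much better because they are highly repetitive, while the cumulative values almost never repeat.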