Does Arrow / Parquet have any support for delta encoding?

Some data series compress better when their differences are stored rather than 
the values themselves.
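
To illustrate the general effect outside of Parquet, here is a minimal sketch that compares the zlib-compressed size of a series with that of its first differences. The helper name `compressed_sizes` and the step-7 series are hypothetical, and zlib is only a stand-in for whatever codec a file format might use.

```python
import zlib

import numpy as np


def compressed_sizes(values):
    """Return (raw_size, delta_size): zlib-compressed byte counts of the
    values themselves and of their first differences."""
    values = np.asarray(values, dtype=np.int64)
    # prepend=0 makes the first difference equal to the first value,
    # so a cumulative sum of the differences restores the series.
    deltas = np.diff(values, prepend=0)
    raw_size = len(zlib.compress(values.tobytes()))
    delta_size = len(zlib.compress(deltas.tobytes()))
    return raw_size, delta_size


# Hypothetical series: a counter that increases by 7 at every step.
series = np.arange(0, 7 * 100_000, 7, dtype=np.int64)
raw_size, delta_size = compressed_sizes(series)
print(raw_size, delta_size)
```

The differences are almost all the same value, so a general-purpose compressor sees long repeated runs, whereas the raw values never repeat.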

Here's an example where the differences are mostly equal to 7 but occasionally 
more:

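import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

N = 500000
# Start with every difference equal to 7.
delta_r = np.full(N, 7)
np.random.seed(123)
# Ten passes, each bumping up to N//100 randomly chosen positions by 1.
for _ in range(10):
    delta_r[np.random.randint(N, size=N//100)] += 1
# The series itself is the running sum of the differences.
r = np.cumsum(delta_r)
# Sanity check: differencing r recovers delta_r.
drcheck = np.diff(r, prepend=0)
assert (delta_r == drcheck).all()

# Write the values and the differences to separate single-column files.
a = pa.array(r)
adiff = pa.array(delta_r)
t = pa.Table.from_arrays([a], ['r'])
tdiff = pa.Table.from_arrays([adiff], ['delta_r'])
pq.write_table(t, 't.pq')
pq.write_table(tdiff, 'tdiff.pq')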

=====

When I look at the resulting files, the file of differences is about 32 times smaller:

-rw-rw-rw-   1 user     group     2591101 Nov 16 13:29 t.pq
-rw-rw-rw-   1 user     group       81049 Nov 16 13:29 tdiff.pq
