Ah, got it.

Thanks, I found the encodings section of the Parquet format specification:
https://github.com/apache/parquet-format/blob/ee02ef8c8f33bd3d5ed0582ded7e20439e12d933/Encodings.md
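
For anyone reading this in the archive, here is a rough sketch of the idea behind the delta encoding described in that document, in plain Python. It is a simplified illustration only, not the actual Parquet layout (the spec works in blocks and miniblocks with its own header format), and the function names are hypothetical, not part of pyarrow or any Parquet library.

```python
from itertools import accumulate


def delta_encode(values):
    """Split a list of integers into (first value, list of differences)."""
    if not values:
        return None, []
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[0], deltas


def delta_decode(first, deltas):
    """Rebuild the original list by taking a running sum of the differences."""
    if first is None:
        return []
    return list(accumulate(deltas, initial=first))


def bits_per_delta(deltas):
    """Bits needed per delta once the minimum delta is subtracted out."""
    if not deltas:
        return 0
    min_delta = min(deltas)
    return max(d - min_delta for d in deltas).bit_length()


# A series whose steps are mostly 7 and occasionally 8, as in the example below.
values = [7, 14, 21, 29, 36, 43, 51, 58]
first, deltas = delta_encode(values)
width = bits_per_delta(deltas)

# Decoding must give back exactly what was encoded.
assert delta_decode(first, deltas) == values
```

Subtracting the minimum delta before packing is what makes a nearly constant step cheap to store: the residuals here are only 0 or 1, even though the values themselves keep growing.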

On 2020/11/16 20:33:38, Micah Kornfield <[email protected]> wrote: 
> Delta encoding hasn't been implemented in the C++ code that pyarrow binds
> to.  It is supported in the Parquet specification.
> 
> On Mon, Nov 16, 2020 at 12:30 PM Jason Sachs <[email protected]> wrote:
> 
> > Does Arrow / Parquet have any support for delta encoding?
> >
> > Some data series compress better when their differences are stored rather
> > than the values themselves.
> >
> > Here's an example where the differences are mostly equal to 7 but
> > occasionally more:
> >
> > import numpy as np
> > import pyarrow as pa
> > import pyarrow.parquet as pq
> >
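> > N = 500000
> > # Start with a constant step of 7 between consecutive values.
> > delta_r = np.full(N, 7)
> > np.random.seed(123)
> > # Ten times over, add 1 to up to N // 100 randomly chosen steps.
> > for _ in range(10):
> >     delta_r[np.random.randint(N, size=N // 100)] += 1
> > # The series itself is the running sum of the steps.
> > r = np.cumsum(delta_r)
> > # Sanity check: differencing the series recovers the steps.
> > drcheck = np.diff(r, prepend=0)
> > assert (delta_r == drcheck).all()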
> >
> > a = pa.array(r)
> > adiff = pa.array(delta_r)
> > t = pa.Table.from_arrays([a],['r'])
> > tdiff = pa.Table.from_arrays([adiff],['delta_r'])
> > pq.write_table(t,'t.pq')
> > pq.write_table(tdiff,'tdiff.pq')
> >
> > =====
> >
> > and when I look at the resulting files:
> >
> > -rw-rw-rw-   1 user     group     2591101 Nov 16 13:29 t.pq
> > -rw-rw-rw-   1 user     group       81049 Nov 16 13:29 tdiff.pq
> >
> >
> 
