Hi Nikhil,

> I would like to know if pyarrow has support for writing parquet files with
> run-length encoding? There is mention of this in the Python Docs under the
> compression section.

The C++ API might not have enough validation around it to be properly
exposed to high-level APIs. The Parquet spec clarifies this further [1]:

> Note that the RLE encoding method is only supported for the following
> types of data:
> - Repetition and definition levels
> - Dictionary indices
> - Boolean values in data pages, as an alternative to PLAIN encoding

IIRC, the way writing works in both pyarrow and C++ is that the writer tries
to dictionary-encode values and use RLE until the dictionary grows too large.
You can verify this by using pyarrow to see which encodings were used for a
column [2].

The Arrow specification recently adopted run-end encoding, which is very
similar to RLE [3]. If you don't want to transfer Parquet files, this might
be a good fit for your use case.

Thanks,
Micah

[1] https://parquet.apache.org/docs/file-format/data-pages/encodings/
[2] https://arrow.apache.org/docs/python/parquet.html#inspecting-the-parquet-file-metadata
[3] https://arrow.apache.org/docs/format/Columnar.html#run-end-encoded-layout

On Wed, Jan 11, 2023 at 4:04 PM Nikhil Makan <[email protected]> wrote:

> Hi Team,
>
> Question 1:
> I would like to know if pyarrow has support for writing parquet files with
> run-length encoding? There is mention of this in the Python Docs under the
> compression section.
>
> 'can be compressed after the encoding passes (dictionary, RLE encoding)'
>
> https://arrow.apache.org/docs/python/parquet.html#compression-encoding-and-file-compatibility
>
> However I am not seeing the option in the API reference:
>
> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
>
> I do note it's covered off in the C++ documentation, anyway we can access
> this in python?
> https://arrow.apache.org/docs/cpp/parquet.html
>
> Question 2:
> In addition to the above, I am interested to know if there are any methods
> to apply this type of encoding to data in transit over a network. Our
> actual use case has a large amount of data and would GREATLY benefit
> from run-length encoding due to the repetition (sensors not changing values
> that often). We are trying to send this data from a warehouse (the
> warehouse has not been selected as yet) to an application back end, which
> ultimately gets sent onto an application front end to visualise.
>
> Kind regards
> Nikhil Makan

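For the in-transit question, the idea behind the run-end encoded layout [3]
can be illustrated in a few lines of plain Python. This is only a sketch of
the concept, not the pyarrow API: the helper names run_end_encode and
run_end_decode and the sample readings are invented here. Each run is stored
as one value plus the exclusive logical index at which that run ends.

```python
from itertools import groupby


def run_end_encode(values):
    """Return (run_ends, run_values) for a sequence of values."""
    run_ends, run_values = [], []
    position = 0
    for value, group in groupby(values):
        position += sum(1 for _ in group)
        run_ends.append(position)  # cumulative (exclusive) end index of this run
        run_values.append(value)
    return run_ends, run_values


def run_end_decode(run_ends, run_values):
    """Expand (run_ends, run_values) back into the original list."""
    out = []
    start = 0
    for end, value in zip(run_ends, run_values):
        out.extend([value] * (end - start))
        start = end
    return out


# Hypothetical sensor readings that rarely change.
readings = [20.5, 20.5, 20.5, 21.0, 21.0, 20.5]
run_ends, run_values = run_end_encode(readings)
print(run_ends)    # [3, 5, 6]
print(run_values)  # [20.5, 21.0, 20.5]
```

The fewer distinct runs the data has, the shorter both lists become relative
to the original, which is why slowly changing sensor values are a good match.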