Looking through the code, it appears that this isn't exposed to users. ArrowWriter doesn't use the VectorUnloader constructor that takes a compressor, so no one is using this from Java. I did find some comments in the Go code that are *super* helpful about the compressed buffer's format.
Who is using compression? Are you using it via the C++ dataset pathways or one of the various language wrappers? What about putting a nice C interface on top of all the C++ and then basing R, Python, Julia, and Java (via JNA, JNR, or JDK 17's FFI pathway) on top of one C interface? Seems like a hell of a lot less work than the bespoke language wrappers - and everyone gets access to all the features at the same time.

On Thu, Jan 13, 2022 at 4:11 PM Chris Nuernberger <[email protected]> wrote:

> Great, thanks, I just hadn't noticed until now - thanks!
>
> On Thu, Jan 13, 2022 at 4:09 PM Micah Kornfield <[email protected]> wrote:
>
>> Hi Chris,
>>
>>> Upgrading to 6.0.X I noticed that record batches can have body
>>> compression which I think is great.
>>
>> Small nit: this was released in Arrow 4.
>>
>>> I had trouble finding examples in python or R (or java) of writing an IPC
>>> file with various types of compression used for the record batch.
>>
>> Java code is at [1], with implementations for compression codecs living in [2].
>>
>>> Is the compression applied per-column or upon the record batch after the
>>> buffers have been serialized to the batch? If it is applied per column,
>>> which buffers - given that text for example can consist of 3 buffers
>>> (validity, offset, data), is compression applied to all three, or just data,
>>> or data and offset?
>>
>> It is applied per buffer; all buffers are compressed.
>>
>> Cheers,
>> Micah
>>
>> [1]
>> https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java#L100
>> [2]
>> https://github.com/apache/arrow/tree/971a9d352e456882aa5b70ac722607840cdb9df7/java/compression/src
>>
>> On Thu, Jan 13, 2022 at 2:55 PM Chris Nuernberger <[email protected]> wrote:
>>
>>> Upgrading to 6.0.X I noticed that record batches can have body
>>> compression which I think is great.
>>>
>>> I had trouble finding examples in python or R (or java) of writing an
>>> IPC file with various types of compression used for the record batch.
>>>
>>> Is the compression applied per-column or upon the record batch after the
>>> buffers have been serialized to the batch? If it is applied per column,
>>> which buffers - given that text for example can consist of 3 buffers
>>> (validity, offset, data), is compression applied to all three, or just data,
>>> or data and offset?
