Hi Chris,

> Are there compression constants for snappy and brotli I am not seeing?
> The flatbuffer definition of the constants only contains lz4 and zstd.
These are not in the spec. We chose to limit the number of compression
standards unless there was a real demand for them. When I last looked, I
believe lz4 is pretty much strictly better than snappy with a proper
implementation. I believe Brotli might provide some space advantages over
ZSTD but is generally slower. If there is a strong use case for other
codecs, I would suggest discussing it on the dev@ mailing list as an
addition.

> That performance discussion is interesting. It is disappointing that as
> far as java libraries are concerned tech.ml.dataset isn't brought up as it
> is both faster and supports more features than the base arrow java SDK.

Generally, outside of LZ4, performance of the Java library hasn't been
brought up much. The only reason why I mentioned javacpp was because the
question about native bindings was asked (and you of course know about
tech.ml.dataset).
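Going back to the compression question: for anyone who wants to experiment,
here is roughly what the write-side hook looks like in the Java library
today. Treat it strictly as an untested sketch - the codec classes live in
the separate arrow-compression module and the exact VectorUnloader
constructor signature is from memory, so double-check against the sources
linked further down the thread before relying on it.

    import java.util.Collections;

    import org.apache.arrow.compression.Lz4CompressionCodec;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.VectorUnloader;
    import org.apache.arrow.vector.compression.CompressionCodec;
    import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.arrow.vector.types.pojo.Schema;

    public class CompressedBatchSketch {
      public static void main(String[] args) {
        // Single nullable int32 column, just to have something to unload.
        Schema schema = new Schema(Collections.singletonList(
            Field.nullable("x", new ArrowType.Int(32, true))));
        try (RootAllocator allocator = new RootAllocator();
             VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
          // ... populate the vectors in `root` and call root.setRowCount(...) ...

          // Codec implementations live in the arrow-compression artifact; the
          // class name and constructor here are from memory and may differ.
          CompressionCodec codec = new Lz4CompressionCodec();

          // The constructor that takes a compressor: unloading through it
          // produces a batch with BodyCompression metadata and compressed buffers.
          VectorUnloader unloader =
              new VectorUnloader(root, /*includeNullCount=*/true, codec, /*alignBuffers=*/true);
          try (ArrowRecordBatch batch = unloader.getRecordBatch()) {
            // `batch` still has to be handed to the lower-level IPC write path
            // by hand; ArrowFileWriter/ArrowStreamWriter don't take a codec today.
          }
        }
      }
    }

As discussed further down the thread, wiring a codec option directly into
ArrowFileWriter/ArrowStreamWriter would be a welcome contribution.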
Cheers,
Micah

On Thu, Jan 20, 2022 at 8:32 AM Chris Nuernberger <[email protected]> wrote:

> Are there compression constants for snappy and brotli I am not seeing?
> The flatbuffer definition of the constants only contains lz4 and zstd.
>
> That performance discussion is interesting. It is disappointing that as
> far as java libraries are concerned tech.ml.dataset isn't brought up as it
> is both faster and supports more features than the base arrow java SDK.
>
> On Sun, Jan 16, 2022 at 9:33 PM Micah Kornfield <[email protected]>
> wrote:
>
>> Hi Chris,
>>
>>> Looking through the code it appears that this isn't exposed to users.
>>> ArrowWriter doesn't use the VectorUnloader constructor that includes a
>>> compressor so no one is using this in Java. I found some great comments in
>>> the go code that are *super* helpful about the compressed buffer's format.
>>
>> Unfortunately, the addition of compression didn't go through the normal
>> path for integrating new features (integration tests between two or more
>> languages actively running). Right now only the read path is tested from a
>> statically generated file in C++, so this gap wasn't caught. A contribution
>> to fix this would be welcome.
>>
>>> Who is using compression? Are you using it via the c++ dataset pathways
>>> or one of the various language wrappers?
>>
>> We use the decode path in Java (and other languages) to connect to a
>> service my team owns that serves arrow data with optional compression.
>> Note that LZ4 is very slow in Java today [1].
>>
>>> What about putting a nice C interface on top of all the c++ and then
>>> basing R, python, Julia, and Java via JNA, JNR, or JDK-17's FFI pathway on
>>> top of one C interface? Seems like a hell of a lot less work than the
>>> bespoke language wrappers - and everyone gets access to all the features at
>>> the same time.
>>
>> There is already a gobject [2] interface on top of arrow C++ that is used
>> in the Ruby bindings. R and Python bind directly to C++ already. In terms
>> of other implementations there is value in not having every implementation
>> have the same core, as it helps ensure the specification is understandable
>> and can be implemented outside of the project if necessary. Also for some
>> languages it makes prebuilt distribution easier if native code is not
>> required.
>>
>> If you are looking for auto-generated bindings for C++ Arrow in Java,
>> there is a project [3] that does that. I have never used it so I can't
>> comment on its quality.
>>
>> -Micah
>>
>> [1] https://issues.apache.org/jira/browse/ARROW-11901
>> [2] https://en.wikipedia.org/wiki/GObject
>> [3] https://github.com/bytedeco/javacpp-presets/tree/master/arrow
>>
>> On Fri, Jan 14, 2022 at 5:35 AM Chris Nuernberger <[email protected]>
>> wrote:
>>
>>> Looking through the code it appears that this isn't exposed to users.
>>> ArrowWriter doesn't use the VectorUnloader constructor that includes a
>>> compressor so no one is using this in Java. I found some great comments in
>>> the go code that are *super* helpful about the compressed buffer's format.
>>>
>>> Who is using compression? Are you using it via the c++ dataset pathways
>>> or one of the various language wrappers?
>>>
>>> What about putting a nice C interface on top of all the c++ and then
>>> basing R, python, Julia, and Java via JNA, JNR, or JDK-17's FFI pathway on
>>> top of one C interface? Seems like a hell of a lot less work than the
>>> bespoke language wrappers - and everyone gets access to all the features at
>>> the same time.
>>>
>>> On Thu, Jan 13, 2022 at 4:11 PM Chris Nuernberger <[email protected]>
>>> wrote:
>>>
>>>> Great, thanks, I just hadn't noticed until now - thanks!
>>>>
>>>> On Thu, Jan 13, 2022 at 4:09 PM Micah Kornfield <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Chris,
>>>>>
>>>>>> Upgrading to 6.0.X I noticed that record batches can have body
>>>>>> compression which I think is great.
>>>>>
>>>>> Small nit: this was released in Arrow 4.
>>>>>
>>>>>> I had trouble finding examples in python or R (or java) of writing an
>>>>>> IPC file with various types of compression used for the record batch.
>>>>>
>>>>> Java code is at [1] with implementations for compression codec living
>>>>> in [2].
>>>>>
>>>>>> Is the compression applied per-column or upon the record batch after
>>>>>> the buffers have been serialized to the batch? If it is applied per
>>>>>> column, which buffers - given that text, for example, can consist of 3
>>>>>> buffers (validity, offset, data), is compression applied to all three,
>>>>>> or just data, or data and offset?
>>>>>
>>>>> It is applied per buffer; all buffers are compressed.
>>>>>
>>>>> Cheers,
>>>>> Micah
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java#L100
>>>>> [2]
>>>>> https://github.com/apache/arrow/tree/971a9d352e456882aa5b70ac722607840cdb9df7/java/compression/src
>>>>>
>>>>> On Thu, Jan 13, 2022 at 2:55 PM Chris Nuernberger <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Upgrading to 6.0.X I noticed that record batches can have body
>>>>>> compression which I think is great.
>>>>>>
>>>>>> I had trouble finding examples in python or R (or java) of writing an
>>>>>> IPC file with various types of compression used for the record batch.
>>>>>>
>>>>>> Is the compression applied per-column or upon the record batch after
>>>>>> the buffers have been serialized to the batch? If it is applied per
>>>>>> column, which buffers - given that text, for example, can consist of 3
>>>>>> buffers (validity, offset, data), is compression applied to all three,
>>>>>> or just data, or data and offset?
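P.S. For the archives, since the "applied per buffer" answer above is easy to
misread: my understanding of the framing (from the comments in Message.fbs and
the go code Chris mentioned) is that when BodyCompression is present, each
constituent buffer in the record batch body is written as an 8-byte
little-endian uncompressed length followed by the compressed bytes, and a
length of -1 means the bytes that follow are stored uncompressed. Below is a
small, untested sketch of reading that prefix; the Decompressor interface here
is hypothetical and just stands in for the real codecs in the compression
module.

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    final class CompressedBufferFraming {
      /** Hypothetical stand-in for the real codecs in the arrow-compression module. */
      interface Decompressor {
        byte[] decompress(byte[] compressed, int uncompressedLength);
      }

      /**
       * Decode one buffer from a compressed record batch body:
       * [int64 little-endian uncompressed length][bytes], where a length of -1
       * means the bytes that follow were written uncompressed.
       */
      static byte[] decodeBuffer(ByteBuffer raw, Decompressor decompressor) {
        raw.order(ByteOrder.LITTLE_ENDIAN);
        long uncompressedLength = raw.getLong();
        byte[] body = new byte[raw.remaining()];
        raw.get(body);
        if (uncompressedLength == -1L) {
          return body; // stored as-is, no compression applied to this buffer
        }
        return decompressor.decompress(body, (int) uncompressedLength);
      }
    }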
