Makes sense. The Apache LZ4 compressor is very slow according to my timings -- at least 10x, and more like 50x, slower than zstd -- so I can totally understand the basis of the questions.
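
For anyone who wants to sanity-check that comparison, the codec classes the Java IPC path uses can be timed directly. This is only a rough sketch -- payload, size, and iteration count are arbitrary, it assumes the arrow-compression and arrow-memory-netty artifacts are on the classpath (and an Arrow version whose compression module implements both codecs), and a serious measurement would use JMH:

import org.apache.arrow.compression.CommonsCompressionFactory;
import org.apache.arrow.memory.ArrowBuf;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.compression.CompressionCodec;
import org.apache.arrow.vector.compression.CompressionUtil;

public class CodecTiming {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator()) {
      // Arbitrary, compressible payload: 1 MiB of repeating bytes.
      byte[] data = new byte[1 << 20];
      for (int i = 0; i < data.length; i++) {
        data[i] = (byte) (i % 16);
      }
      time(allocator, CompressionUtil.CodecType.LZ4_FRAME, data);
      time(allocator, CompressionUtil.CodecType.ZSTD, data);
    }
  }

  static void time(BufferAllocator allocator, CompressionUtil.CodecType type, byte[] data) {
    CompressionCodec codec = CommonsCompressionFactory.INSTANCE.createCodec(type);
    long start = System.nanoTime();
    for (int iter = 0; iter < 100; iter++) {
      ArrowBuf raw = allocator.buffer(data.length);
      raw.setBytes(0, data);
      raw.writerIndex(data.length);
      // compress()/decompress() take ownership of their input buffer and
      // return a new one; the compressed form carries the 8-byte
      // uncompressed-length prefix from the IPC spec.
      ArrowBuf compressed = codec.compress(allocator, raw);
      ArrowBuf roundTripped = codec.decompress(allocator, compressed);
      roundTripped.close();
    }
    System.out.printf("%s: %.1f ms%n", type, (System.nanoTime() - start) / 1e6);
  }
}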
On Thu, Jan 20, 2022 at 10:30 AM Micah Kornfield <[email protected]> wrote:

> Hi Chris,
>
>> Are there compression constants for snappy and brotli I am not seeing?
>> The flatbuffer definition of the constants only contains lz4 and zstd.
>
> These are not in the spec. We chose to limit the number of compression
> standards unless there was a real demand for them. When I last looked, I
> believe lz4 is pretty much strictly better than snappy with a proper
> implementation. I believe Brotli might provide some space advantages over
> ZSTD but is generally slower. If there is a strong use case for other
> codecs, I would suggest discussing the addition on the dev@ mailing list.
>
>> That performance discussion is interesting. It is disappointing that as
>> far as java libraries are concerned, tech.ml.dataset isn't brought up, as
>> it is both faster and supports more features than the base arrow java SDK.
>
> Generally, outside of LZ4, performance of the java library hasn't been
> brought up much. The only reason I mentioned javacpp was because the
> question about native bindings was asked (and you of course know about
> tech.ml.dataset).
>
> Cheers,
> Micah
>
> On Thu, Jan 20, 2022 at 8:32 AM Chris Nuernberger <[email protected]> wrote:
>
>> Are there compression constants for snappy and brotli I am not seeing?
>> The flatbuffer definition of the constants only contains lz4 and zstd.
>>
>> That performance discussion is interesting. It is disappointing that as
>> far as java libraries are concerned, tech.ml.dataset isn't brought up, as
>> it is both faster and supports more features than the base arrow java SDK.
>>
>> On Sun, Jan 16, 2022 at 9:33 PM Micah Kornfield <[email protected]> wrote:
>>
>>> Hi Chris,
>>>
>>>> Looking through the code, it appears that this isn't exposed to users.
>>>> ArrowWriter doesn't use the VectorUnloader constructor that includes a
>>>> compressor, so no one is using this in Java. I found some great comments
>>>> in the go code that are *super* helpful about the compressed buffer's
>>>> format.
>>>
>>> Unfortunately, the addition of compression didn't go through the normal
>>> path for integrating new features (integration tests actively running
>>> between two or more languages). Right now only the read path is tested,
>>> from a statically generated file in C++, so this gap wasn't caught. A
>>> contribution to fix this would be welcome.
>>>
>>>> Who is using compression? Are you using it via the C++ dataset pathways
>>>> or one of the various language wrappers?
>>>
>>> We use the decode path in Java (and other languages) to connect to a
>>> service my team owns that serves arrow data with optional compression.
>>> Note that LZ4 is very slow in Java today [1].
>>>
>>>> What about putting a nice C interface on top of all the C++ and then
>>>> basing R, python, Julia, and Java on top of that one C interface via
>>>> JNA, JNR, or JDK 17's FFI pathway? Seems like a hell of a lot less work
>>>> than the bespoke language wrappers -- and everyone gets access to all
>>>> the features at the same time.
>>>
>>> There is already a gobject [2] interface on top of arrow C++ that is
>>> used in the Ruby bindings. R and Python bind directly to C++ already. In
>>> terms of other implementations, there is value in not having every
>>> implementation share the same core, as it helps ensure the specification
>>> is understandable and can be implemented outside the project if
>>> necessary. Also, for some languages it makes prebuilt distribution easier
>>> if native code is not required.
>>>
>>> If you are looking for auto-generated bindings for Arrow C++ in Java,
>>> there is a project [3] that does that. I have never used it, so I can't
>>> comment on its quality.
>>>
>>> -Micah
>>>
>>> [1] https://issues.apache.org/jira/browse/ARROW-11901
>>> [2] https://en.wikipedia.org/wiki/GObject
>>> [3] https://github.com/bytedeco/javacpp-presets/tree/master/arrow
>>>
>>> On Fri, Jan 14, 2022 at 5:35 AM Chris Nuernberger <[email protected]> wrote:
>>>
>>>> Looking through the code, it appears that this isn't exposed to users.
>>>> ArrowWriter doesn't use the VectorUnloader constructor that includes a
>>>> compressor, so no one is using this in Java. I found some great comments
>>>> in the go code that are *super* helpful about the compressed buffer's
>>>> format.
>>>>
>>>> Who is using compression? Are you using it via the C++ dataset
>>>> pathways or one of the various language wrappers?
>>>>
>>>> What about putting a nice C interface on top of all the C++ and then
>>>> basing R, python, Julia, and Java on top of that one C interface via
>>>> JNA, JNR, or JDK 17's FFI pathway? Seems like a hell of a lot less work
>>>> than the bespoke language wrappers -- and everyone gets access to all
>>>> the features at the same time.
>>>>
>>>> On Thu, Jan 13, 2022 at 4:11 PM Chris Nuernberger <[email protected]> wrote:
>>>>
>>>>> Great, I just hadn't noticed until now - thanks!
>>>>>
>>>>> On Thu, Jan 13, 2022 at 4:09 PM Micah Kornfield <[email protected]> wrote:
>>>>>
>>>>>> Hi Chris,
>>>>>>
>>>>>>> Upgrading to 6.0.X, I noticed that record batches can have body
>>>>>>> compression, which I think is great.
>>>>>>
>>>>>> Small nit: this was released in Arrow 4.
>>>>>>
>>>>>>> I had trouble finding examples in python or R (or java) of writing
>>>>>>> an IPC file with various types of compression used for the record
>>>>>>> batch.
>>>>>>
>>>>>> Java code is at [1], with the compression codec implementations
>>>>>> living in [2].
>>>>>>
>>>>>>> Is the compression applied per column, or to the record batch after
>>>>>>> the buffers have been serialized to the batch? If it is applied per
>>>>>>> column, which buffers? Given that text, for example, can consist of
>>>>>>> 3 buffers (validity, offset, data), is compression applied to all
>>>>>>> three, or just data, or data and offset?
>>>>>>
>>>>>> It is applied per buffer; all buffers are compressed.
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>> [1] https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java#L100
>>>>>> [2] https://github.com/apache/arrow/tree/971a9d352e456882aa5b70ac722607840cdb9df7/java/compression/src
>>>>>>
>>>>>> On Thu, Jan 13, 2022 at 2:55 PM Chris Nuernberger <[email protected]> wrote:
>>>>>>
>>>>>>> Upgrading to 6.0.X, I noticed that record batches can have body
>>>>>>> compression, which I think is great.
>>>>>>>
>>>>>>> I had trouble finding examples in python or R (or java) of writing
>>>>>>> an IPC file with various types of compression used for the record
>>>>>>> batch.
>>>>>>>
>>>>>>> Is the compression applied per column, or to the record batch after
>>>>>>> the buffers have been serialized to the batch? If it is applied per
>>>>>>> column, which buffers? Given that text, for example, can consist of
>>>>>>> 3 buffers (validity, offset, data), is compression applied to all
>>>>>>> three, or just data, or data and offset?
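
To make the pathway under discussion concrete: the compressing VectorUnloader constructor linked at [1] above exists, but ArrowWriter never calls it, so producing a compressed batch currently means unloading by hand. A minimal sketch against the Arrow 6 Java API (codec classes from [2] above; vector contents arbitrary, error handling omitted):

import org.apache.arrow.compression.CommonsCompressionFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.compression.CompressionCodec;
import org.apache.arrow.vector.compression.CompressionUtil;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;

public class CompressedUnload {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator();
         VarCharVector names = new VarCharVector("name", allocator)) {
      names.allocateNew();
      names.setSafe(0, "alpha".getBytes());
      names.setSafe(1, "beta".getBytes());
      names.setValueCount(2);
      VectorSchemaRoot root = VectorSchemaRoot.of(names);

      CompressionCodec codec =
          CommonsCompressionFactory.INSTANCE.createCodec(CompressionUtil.CodecType.ZSTD);

      // The constructor ArrowWriter never calls. A VarCharVector has
      // validity, offset, and data buffers; per the answer above, all
      // three are compressed individually when the batch is unloaded.
      VectorUnloader unloader =
          new VectorUnloader(root, /*includeNullCount=*/ true, codec, /*alignBuffers=*/ true);
      try (ArrowRecordBatch batch = unloader.getRecordBatch()) {
        System.out.println("compressed body length: " + batch.computeBodyLength());
      }
    }
  }
}

On the decode side, I believe the readers accept an optional CompressionCodec.Factory (e.g. passing CommonsCompressionFactory.INSTANCE when constructing ArrowFileReader), which is the path exercised by the statically generated file test mentioned above.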

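For completeness, the compressed-buffer format those go-code comments describe: with the BUFFER compression method, each buffer of the body is compressed independently and written with its uncompressed length as a leading 64-bit little-endian integer, where -1 means the bytes that follow were left uncompressed. In plain Java terms (a hypothetical helper, purely illustrative):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class CompressedBufferLayout {
  // BUFFER body compression, per the format spec:
  //   [int64 little-endian uncompressed length][compressed bytes][padding]
  // A length of -1 signals the bytes that follow were left uncompressed
  // (used when compression would not actually save space).
  public static long uncompressedLength(ByteBuffer bufferBody) {
    return bufferBody.order(ByteOrder.LITTLE_ENDIAN).getLong(0);
  }

  public static boolean isCompressed(ByteBuffer bufferBody) {
    return uncompressedLength(bufferBody) != -1L;
  }
}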