OK, well, for the record: tech.ml.dataset supports three major features the official SDK does not - mmap, JDK-17 support, and (just now) compression as a user-accessible option during write - file:///home/chrisn/dev/tech.all/tech.ml.dataset/docs/tech.v3.libs.arrow.html.
Other lz4 compressors are faster than Apache's, but regardless, zstd gets the best compression ratio for the very simple test files I tested.

On Thu, Jan 20, 2022 at 10:44 AM Chris Nuernberger <[email protected]> wrote:

Makes sense. The Apache lz4 compressor is very slow according to my timings -- at least 10x and more like 50x slower than zstd -- so I can totally understand the basis of the questions.

On Thu, Jan 20, 2022 at 10:30 AM Micah Kornfield <[email protected]> wrote:

Hi Chris,

> Are there compression constants for snappy and brotli I am not seeing?
> The flatbuffer definition of the constants only contains lz4 and zstd.

These are not in the spec. We chose to limit the number of compression standards unless there was real demand for them. When I last looked, I believe lz4 is pretty much strictly better than snappy with a proper implementation. I believe Brotli might provide some space advantages over ZSTD but is generally slower. If there is a strong use case for other codecs, I would suggest discussing it on the dev@ mailing list as an addition.

> That performance discussion is interesting. It is disappointing that, as
> far as java libraries are concerned, tech.ml.dataset isn't brought up, as
> it is both faster and supports more features than the base arrow java SDK.

Generally, outside of LZ4, performance of the java library hasn't been brought up much. The only reason I mentioned javacpp was because the question about native bindings was asked (and you of course know about tech.ml.dataset).

Cheers,
Micah

On Sun, Jan 16, 2022 at 9:33 PM Micah Kornfield <[email protected]> wrote:

Hi Chris,

> Looking through the code it appears that this isn't exposed to users.
> ArrowWriter doesn't use the VectorUnloader constructor that includes a
> compressor, so no one is using this in Java. I found some great comments
> in the go code that are *super* helpful about the compressed buffer's format.

Unfortunately, the addition of compression didn't go through the normal path for integrating new features (integration tests between two or more languages actively running). Right now only the read path is tested, from a statically generated file in C++, so this gap wasn't caught. A contribution to fix this would be welcome.

> Who is using compression? Are you using it via the c++ dataset pathways
> or one of the various language wrappers?

We use the decode path in Java (and other languages) to connect to a service my team owns that serves arrow data with optional compression. Note that LZ4 is very slow in Java today [1].

> What about putting a nice C interface on top of all the c++ and then
> basing R, python, Julia, and Java via JNA, JNR, or JDK-17's FFI pathway
> on top of one C interface? Seems like a hell of a lot less work than the
> bespoke language wrappers - and everyone gets access to all the features
> at the same time.

There is already a gobject [2] interface on top of arrow C++ that is used in the Ruby bindings. R and Python bind directly to C++ already. In terms of other implementations, there is value in not having every implementation share the same core, as it helps ensure the specification is understandable and can be implemented outside of the project if necessary. Also, for some languages it makes prebuilt distribution easier if native code is not required.

If you are looking for auto-generated bindings for C++ Arrow in Java, there is a project [3] that does that. I have never used it, so I can't comment on its quality.

-Micah

[1] https://issues.apache.org/jira/browse/ARROW-11901
[2] https://en.wikipedia.org/wiki/GObject
[3] https://github.com/bytedeco/javacpp-presets/tree/master/arrow
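For concreteness, driving that VectorUnloader constructor directly looks roughly like the sketch below. This is a minimal illustration, not the official write path: it assumes Arrow 6-era signatures and the optional arrow-compression module (CommonsCompressionFactory, CompressionUtil.CodecType.ZSTD); the class name and the round-trip shape are illustrative.

    import org.apache.arrow.compression.CommonsCompressionFactory;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.arrow.vector.VectorLoader;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.VectorUnloader;
    import org.apache.arrow.vector.compression.CompressionCodec;
    import org.apache.arrow.vector.compression.CompressionUtil;
    import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;

    public class CompressedBatchRoundTrip {
      public static void main(String[] args) {
        int rows = 1 << 16;
        try (BufferAllocator allocator = new RootAllocator();
             IntVector vec = new IntVector("x", allocator)) {
          vec.allocateNew(rows);
          for (int i = 0; i < rows; i++) {
            vec.setSafe(i, i % 7); // repetitive values so compression has something to do
          }
          vec.setValueCount(rows);
          try (VectorSchemaRoot root = VectorSchemaRoot.of(vec)) {
            // The constructor ArrowWriter does not use: compress every buffer
            // with zstd while unloading the root into a record batch.
            CompressionCodec zstd = CommonsCompressionFactory.INSTANCE
                .createCodec(CompressionUtil.CodecType.ZSTD);
            VectorUnloader unloader =
                new VectorUnloader(root, /*includeNullCount=*/true, zstd,
                                   /*alignBuffers=*/true);
            try (ArrowRecordBatch batch = unloader.getRecordBatch();
                 VectorSchemaRoot target =
                     VectorSchemaRoot.create(root.getSchema(), allocator)) {
              // Round trip: VectorLoader takes a codec factory for decompression.
              new VectorLoader(target, CommonsCompressionFactory.INSTANCE).load(batch);
              System.out.println("rows after round trip: " + target.getRowCount());
            }
          }
        }
      }
    }

Wiring such compressed batches into ArrowFileWriter so they actually land in an IPC file is exactly the part the thread says isn't exposed yet.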
On Thu, Jan 13, 2022 at 4:11 PM Chris Nuernberger <[email protected]> wrote:

Great - I just hadn't noticed until now. Thanks!

On Thu, Jan 13, 2022 at 4:09 PM Micah Kornfield <[email protected]> wrote:

Hi Chris,

> Upgrading to 6.0.X I noticed that record batches can have body
> compression, which I think is great.

Small nit: this was released in Arrow 4.

> I had trouble finding examples in python or R (or java) of writing an
> IPC file with various types of compression used for the record batch.

Java code is at [1], with implementations for compression codecs living in [2].

> Is the compression applied per-column or upon the record batch after the
> buffers have been serialized to the batch? If it is applied per column,
> which buffers - given that text, for example, can consist of three buffers
> (validity, offset, data) - is compression applied to all three, or just
> data, or data and offset?

It is applied per buffer; all buffers are compressed.

Cheers,
Micah

[1] https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java#L100
[2] https://github.com/apache/arrow/tree/971a9d352e456882aa5b70ac722607840cdb9df7/java/compression/src
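The buffer layout those go comments describe is simple: per the format spec, each buffer in a compressed record batch body is prefixed with its uncompressed length as a 64-bit little-endian integer, and a prefix of -1 means the bytes that follow were left uncompressed (used when compressing would not shrink the buffer). A plain-Java illustration with hypothetical helper names:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // Hypothetical helper (not part of the SDK) showing the framing of one
    // buffer inside a compressed record batch body: an 8-byte little-endian
    // uncompressed length, then the payload. A length of -1 marks a buffer
    // whose bytes were left uncompressed.
    final class BodyCompressionFraming {
      static final long NO_COMPRESSION_LENGTH = -1L;

      // Uncompressed length encoded in the buffer's 8-byte prefix.
      static long uncompressedLength(ByteBuffer buffer) {
        return buffer.duplicate().order(ByteOrder.LITTLE_ENDIAN)
                     .getLong(buffer.position());
      }

      // True when the payload after the prefix is actually compressed.
      static boolean isCompressed(ByteBuffer buffer) {
        return uncompressedLength(buffer) != NO_COMPRESSION_LENGTH;
      }

      // The payload (compressed or raw) starts right after the prefix.
      static ByteBuffer payload(ByteBuffer buffer) {
        ByteBuffer body = buffer.duplicate();
        body.position(buffer.position() + Long.BYTES);
        return body.slice();
      }
    }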
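The decode path Micah mentions is user-accessible today: the IPC readers accept a CompressionCodec.Factory for decompressing record batch bodies. A minimal sketch, assuming the ArrowFileReader overload that takes a codec factory; the file path is a placeholder:

    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import org.apache.arrow.compression.CommonsCompressionFactory;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.ArrowFileReader;

    public class ReadCompressedIpcFile {
      public static void main(String[] args) throws Exception {
        try (BufferAllocator allocator = new RootAllocator();
             FileChannel channel = FileChannel.open(
                 Paths.get("/tmp/compressed.arrow"), StandardOpenOption.READ);
             // The codec factory is what lets the reader decompress lz4/zstd
             // record batch bodies; the default factory rejects them.
             ArrowFileReader reader = new ArrowFileReader(
                 channel, allocator, CommonsCompressionFactory.INSTANCE)) {
          VectorSchemaRoot root = reader.getVectorSchemaRoot();
          while (reader.loadNextBatch()) {
            System.out.println("batch rows: " + root.getRowCount());
          }
        }
      }
    }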
