OK, well, for the record: tech.ml.dataset supports three major features the official SDK does not - mmap, JDK-17 support, and (just now) compression as a user-accessible option during write - file:///home/chrisn/dev/tech.all/tech.ml.dataset/docs/tech.v3.libs.arrow.html.
Other lz4 compressors are faster than Apache's, but regardless, zstd gets the best compression ratio for the very simple test files I tested.

On Thu, Jan 20, 2022 at 10:44 AM Chris Nuernberger <[email protected]> wrote:

Makes sense. The Apache lz4 compressor is very slow according to my timings -- at least 10x and more like 50x slower than zstd -- so I can totally understand the basis of the questions.

On Thu, Jan 20, 2022 at 10:30 AM Micah Kornfield <[email protected]> wrote:

Hi Chris,

> Are there compression constants for snappy and brotli I am not seeing?
> The flatbuffer definition of the constants only contains lz4 and zstd.

These are not in the spec. We chose to limit the number of compression standards unless there was real demand for them. When I last looked, I believe lz4 is pretty much strictly better than snappy with a proper implementation. I believe Brotli might provide some space advantages over ZSTD but is generally slower. If there is a strong use case for other codecs, I would suggest discussing it on the dev@ mailing list as an addition.

> That performance discussion is interesting. It is disappointing that, as
> far as java libraries are concerned, tech.ml.dataset isn't brought up, as
> it is both faster and supports more features than the base arrow java SDK.

Generally, outside of LZ4, performance of the java library hasn't been brought up much. The only reason I mentioned javacpp was because the question about native bindings was asked (and you of course know about tech.ml.dataset).

Cheers,
Micah

On Sun, Jan 16, 2022 at 9:33 PM Micah Kornfield <[email protected]> wrote:

Hi Chris,

> Looking through the code it appears that this isn't exposed to users.
> ArrowWriter doesn't use the VectorUnloader constructor that includes a
> compressor, so no one is using this in Java. I found some great comments
> in the go code that are *super* helpful about the compressed buffer's format.

Unfortunately, the addition of compression didn't go through the normal path for integrating new features (integration tests between two or more languages actively running). Right now only the read path is tested, from a statically generated file in C++, so this gap wasn't caught. A contribution to fix this would be welcome.

> Who is using compression? Are you using it via the c++ dataset pathways
> or one of the various language wrappers?

We use the decode path in Java (and other languages) to connect to a service my team owns that serves arrow data with optional compression. Note that LZ4 is very slow in Java today [1].

> What about putting a nice C interface on top of all the c++ and then
> basing R, python, Julia, and Java via JNA, JNR, or JDK-17's FFI pathway
> on top of one C interface? Seems like a hell of a lot less work than the
> bespoke language wrappers - and everyone gets access to all the features
> at the same time.

There is already a gobject [2] interface on top of arrow C++ that is used in the Ruby bindings. R and Python bind directly to C++ already. In terms of other implementations, there is value in not having every implementation share the same core, as it helps ensure the specification is understandable and can be implemented outside of the project if necessary. Also, for some languages it makes prebuilt distribution easier if native code is not required.

If you are looking for auto-generated bindings for C++ Arrow in Java, there is a project [3] that does that. I have never used it, so I can't comment on its quality.

-Micah

[1] https://issues.apache.org/jira/browse/ARROW-11901
[2] https://en.wikipedia.org/wiki/GObject
[3] https://github.com/bytedeco/javacpp-presets/tree/master/arrow
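For concreteness, driving that VectorUnloader constructor directly looks roughly like the sketch below. This is a minimal illustration, not the official write path: it assumes Arrow 6-era signatures and the optional arrow-compression module (CommonsCompressionFactory, CompressionUtil.CodecType.ZSTD); the class name and the round-trip shape are illustrative.

    import org.apache.arrow.compression.CommonsCompressionFactory;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.arrow.vector.VectorLoader;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.VectorUnloader;
    import org.apache.arrow.vector.compression.CompressionCodec;
    import org.apache.arrow.vector.compression.CompressionUtil;
    import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;

    public class CompressedBatchRoundTrip {
      public static void main(String[] args) {
        int rows = 1 << 16;
        try (BufferAllocator allocator = new RootAllocator();
             IntVector vec = new IntVector("x", allocator)) {
          vec.allocateNew(rows);
          for (int i = 0; i < rows; i++) {
            vec.setSafe(i, i % 7); // repetitive values so compression has something to do
          }
          vec.setValueCount(rows);
          try (VectorSchemaRoot root = VectorSchemaRoot.of(vec)) {
            // The constructor ArrowWriter does not use: compress every buffer
            // with zstd while unloading the root into a record batch.
            CompressionCodec zstd = CommonsCompressionFactory.INSTANCE
                .createCodec(CompressionUtil.CodecType.ZSTD);
            VectorUnloader unloader =
                new VectorUnloader(root, /*includeNullCount=*/true, zstd,
                                   /*alignBuffers=*/true);
            try (ArrowRecordBatch batch = unloader.getRecordBatch();
                 VectorSchemaRoot target =
                     VectorSchemaRoot.create(root.getSchema(), allocator)) {
              // Round trip: VectorLoader takes a codec factory for decompression.
              new VectorLoader(target, CommonsCompressionFactory.INSTANCE).load(batch);
              System.out.println("rows after round trip: " + target.getRowCount());
            }
          }
        }
      }
    }

Wiring such compressed batches into ArrowFileWriter so they actually land in an IPC file is exactly the part the thread says isn't exposed yet.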
On Thu, Jan 13, 2022 at 4:11 PM Chris Nuernberger <[email protected]> wrote:

Great - I just hadn't noticed until now. Thanks!

On Thu, Jan 13, 2022 at 4:09 PM Micah Kornfield <[email protected]> wrote:

Hi Chris,

> Upgrading to 6.0.X I noticed that record batches can have body
> compression, which I think is great.

Small nit: this was released in Arrow 4.

> I had trouble finding examples in python or R (or java) of writing an
> IPC file with various types of compression used for the record batch.

Java code is at [1], with implementations for compression codecs living in [2].

> Is the compression applied per-column or upon the record batch after the
> buffers have been serialized to the batch? If it is applied per column,
> which buffers - given that text, for example, can consist of three buffers
> (validity, offset, data) - is compression applied to all three, or just
> data, or data and offset?

It is applied per buffer; all buffers are compressed.

Cheers,
Micah

[1] https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java#L100
[2] https://github.com/apache/arrow/tree/971a9d352e456882aa5b70ac722607840cdb9df7/java/compression/src
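The buffer layout those go comments describe is simple: per the format spec, each buffer in a compressed record batch body is prefixed with its uncompressed length as a 64-bit little-endian integer, and a prefix of -1 means the bytes that follow were left uncompressed (used when compressing would not shrink the buffer). A plain-Java illustration with hypothetical helper names:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // Hypothetical helper (not part of the SDK) showing the framing of one
    // buffer inside a compressed record batch body: an 8-byte little-endian
    // uncompressed length, then the payload. A length of -1 marks a buffer
    // whose bytes were left uncompressed.
    final class BodyCompressionFraming {
      static final long NO_COMPRESSION_LENGTH = -1L;

      // Uncompressed length encoded in the buffer's 8-byte prefix.
      static long uncompressedLength(ByteBuffer buffer) {
        return buffer.duplicate().order(ByteOrder.LITTLE_ENDIAN)
                     .getLong(buffer.position());
      }

      // True when the payload after the prefix is actually compressed.
      static boolean isCompressed(ByteBuffer buffer) {
        return uncompressedLength(buffer) != NO_COMPRESSION_LENGTH;
      }

      // The payload (compressed or raw) starts right after the prefix.
      static ByteBuffer payload(ByteBuffer buffer) {
        ByteBuffer body = buffer.duplicate();
        body.position(buffer.position() + Long.BYTES);
        return body.slice();
      }
    }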
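The decode path Micah mentions is user-accessible today: the IPC readers accept a CompressionCodec.Factory for decompressing record batch bodies. A minimal sketch, assuming the ArrowFileReader overload that takes a codec factory; the file path is a placeholder:

    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import org.apache.arrow.compression.CommonsCompressionFactory;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.ArrowFileReader;

    public class ReadCompressedIpcFile {
      public static void main(String[] args) throws Exception {
        try (BufferAllocator allocator = new RootAllocator();
             FileChannel channel = FileChannel.open(
                 Paths.get("/tmp/compressed.arrow"), StandardOpenOption.READ);
             // The codec factory is what lets the reader decompress lz4/zstd
             // record batch bodies; the default factory rejects them.
             ArrowFileReader reader = new ArrowFileReader(
                 channel, allocator, CommonsCompressionFactory.INSTANCE)) {
          VectorSchemaRoot root = reader.getVectorSchemaRoot();
          while (reader.loadNextBatch()) {
            System.out.println("batch rows: " + root.getRowCount());
          }
        }
      }
    }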
