Makes sense. The Apache LZ4 compressor is very slow according to my timings -- at least 10x, and more like 50x, slower than zstd -- so I can totally understand the basis of the questions.
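
For anyone who wants to sanity-check that comparison, the codec classes the Java IPC path uses can be timed directly. This is only a rough sketch -- payload, size, and iteration count are arbitrary, it assumes the arrow-compression and arrow-memory-netty artifacts are on the classpath (and an Arrow version whose compression module implements both codecs), and a serious measurement would use JMH:

import org.apache.arrow.compression.CommonsCompressionFactory;
import org.apache.arrow.memory.ArrowBuf;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.compression.CompressionCodec;
import org.apache.arrow.vector.compression.CompressionUtil;

public class CodecTiming {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator()) {
      // Arbitrary, compressible payload: 1 MiB of repeating bytes.
      byte[] data = new byte[1 << 20];
      for (int i = 0; i < data.length; i++) {
        data[i] = (byte) (i % 16);
      }
      time(allocator, CompressionUtil.CodecType.LZ4_FRAME, data);
      time(allocator, CompressionUtil.CodecType.ZSTD, data);
    }
  }

  static void time(BufferAllocator allocator, CompressionUtil.CodecType type, byte[] data) {
    CompressionCodec codec = CommonsCompressionFactory.INSTANCE.createCodec(type);
    long start = System.nanoTime();
    for (int iter = 0; iter < 100; iter++) {
      ArrowBuf raw = allocator.buffer(data.length);
      raw.setBytes(0, data);
      raw.writerIndex(data.length);
      // compress()/decompress() take ownership of their input buffer and
      // return a new one; the compressed form carries the 8-byte
      // uncompressed-length prefix from the IPC spec.
      ArrowBuf compressed = codec.compress(allocator, raw);
      ArrowBuf roundTripped = codec.decompress(allocator, compressed);
      roundTripped.close();
    }
    System.out.printf("%s: %.1f ms%n", type, (System.nanoTime() - start) / 1e6);
  }
}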
On Thu, Jan 20, 2022 at 10:30 AM Micah Kornfield <[email protected]> wrote:

> Hi Chris,
>
>> Are there compression constants for snappy and brotli I am not seeing?
>> The flatbuffer definition of the constants only contains lz4 and zstd.
>
> These are not in the spec. We chose to limit the number of compression
> standards unless there was a real demand for them. When I last looked, I
> believe lz4 is pretty much strictly better than snappy with a proper
> implementation. I believe Brotli might provide some space advantages over
> ZSTD but is generally slower. If there is a strong use case for other
> codecs, I would suggest discussing the addition on the dev@ mailing list.
>
>> That performance discussion is interesting. It is disappointing that as
>> far as java libraries are concerned, tech.ml.dataset isn't brought up, as
>> it is both faster and supports more features than the base arrow java SDK.
>
> Generally, outside of LZ4, performance of the java library hasn't been
> brought up much. The only reason I mentioned javacpp was because the
> question about native bindings was asked (and you of course know about
> tech.ml.dataset).
>
> Cheers,
> Micah
>
> On Thu, Jan 20, 2022 at 8:32 AM Chris Nuernberger <[email protected]> wrote:
>
>> Are there compression constants for snappy and brotli I am not seeing?
>> The flatbuffer definition of the constants only contains lz4 and zstd.
>>
>> That performance discussion is interesting. It is disappointing that as
>> far as java libraries are concerned, tech.ml.dataset isn't brought up, as
>> it is both faster and supports more features than the base arrow java SDK.
>>
>> On Sun, Jan 16, 2022 at 9:33 PM Micah Kornfield <[email protected]> wrote:
>>
>>> Hi Chris,
>>>
>>>> Looking through the code, it appears that this isn't exposed to users.
>>>> ArrowWriter doesn't use the VectorUnloader constructor that includes a
>>>> compressor, so no one is using this in Java. I found some great comments
>>>> in the go code that are *super* helpful about the compressed buffer's
>>>> format.
>>>
>>> Unfortunately, the addition of compression didn't go through the normal
>>> path for integrating new features (integration tests actively running
>>> between two or more languages). Right now only the read path is tested,
>>> from a statically generated file in C++, so this gap wasn't caught. A
>>> contribution to fix this would be welcome.
>>>
>>>> Who is using compression? Are you using it via the C++ dataset pathways
>>>> or one of the various language wrappers?
>>>
>>> We use the decode path in Java (and other languages) to connect to a
>>> service my team owns that serves arrow data with optional compression.
>>> Note that LZ4 is very slow in Java today [1].
>>>
>>>> What about putting a nice C interface on top of all the C++ and then
>>>> basing R, python, Julia, and Java on top of that one C interface via
>>>> JNA, JNR, or JDK 17's FFI pathway? Seems like a hell of a lot less work
>>>> than the bespoke language wrappers -- and everyone gets access to all
>>>> the features at the same time.
>>>
>>> There is already a gobject [2] interface on top of arrow C++ that is
>>> used in the Ruby bindings. R and Python bind directly to C++ already. In
>>> terms of other implementations, there is value in not having every
>>> implementation share the same core, as it helps ensure the specification
>>> is understandable and can be implemented outside the project if
>>> necessary. Also, for some languages it makes prebuilt distribution easier
>>> if native code is not required.
>>>
>>> If you are looking for auto-generated bindings for Arrow C++ in Java,
>>> there is a project [3] that does that. I have never used it, so I can't
>>> comment on its quality.
>>>
>>> -Micah
>>>
>>> [1] https://issues.apache.org/jira/browse/ARROW-11901
>>> [2] https://en.wikipedia.org/wiki/GObject
>>> [3] https://github.com/bytedeco/javacpp-presets/tree/master/arrow
>>>
>>> On Fri, Jan 14, 2022 at 5:35 AM Chris Nuernberger <[email protected]> wrote:
>>>
>>>> Looking through the code, it appears that this isn't exposed to users.
>>>> ArrowWriter doesn't use the VectorUnloader constructor that includes a
>>>> compressor, so no one is using this in Java. I found some great comments
>>>> in the go code that are *super* helpful about the compressed buffer's
>>>> format.
>>>>
>>>> Who is using compression? Are you using it via the C++ dataset
>>>> pathways or one of the various language wrappers?
>>>>
>>>> What about putting a nice C interface on top of all the C++ and then
>>>> basing R, python, Julia, and Java on top of that one C interface via
>>>> JNA, JNR, or JDK 17's FFI pathway? Seems like a hell of a lot less work
>>>> than the bespoke language wrappers -- and everyone gets access to all
>>>> the features at the same time.
>>>>
>>>> On Thu, Jan 13, 2022 at 4:11 PM Chris Nuernberger <[email protected]> wrote:
>>>>
>>>>> Great, I just hadn't noticed until now - thanks!
>>>>>
>>>>> On Thu, Jan 13, 2022 at 4:09 PM Micah Kornfield <[email protected]> wrote:
>>>>>
>>>>>> Hi Chris,
>>>>>>
>>>>>>> Upgrading to 6.0.X, I noticed that record batches can have body
>>>>>>> compression, which I think is great.
>>>>>>
>>>>>> Small nit: this was released in Arrow 4.
>>>>>>
>>>>>>> I had trouble finding examples in python or R (or java) of writing
>>>>>>> an IPC file with various types of compression used for the record
>>>>>>> batch.
>>>>>>
>>>>>> Java code is at [1], with the compression codec implementations
>>>>>> living in [2].
>>>>>>
>>>>>>> Is the compression applied per column, or to the record batch after
>>>>>>> the buffers have been serialized to the batch? If it is applied per
>>>>>>> column, which buffers? Given that text, for example, can consist of
>>>>>>> 3 buffers (validity, offset, data), is compression applied to all
>>>>>>> three, or just data, or data and offset?
>>>>>>
>>>>>> It is applied per buffer; all buffers are compressed.
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>> [1] https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java#L100
>>>>>> [2] https://github.com/apache/arrow/tree/971a9d352e456882aa5b70ac722607840cdb9df7/java/compression/src
>>>>>>
>>>>>> On Thu, Jan 13, 2022 at 2:55 PM Chris Nuernberger <[email protected]> wrote:
>>>>>>
>>>>>>> Upgrading to 6.0.X, I noticed that record batches can have body
>>>>>>> compression, which I think is great.
>>>>>>>
>>>>>>> I had trouble finding examples in python or R (or java) of writing
>>>>>>> an IPC file with various types of compression used for the record
>>>>>>> batch.
>>>>>>>
>>>>>>> Is the compression applied per column, or to the record batch after
>>>>>>> the buffers have been serialized to the batch? If it is applied per
>>>>>>> column, which buffers? Given that text, for example, can consist of
>>>>>>> 3 buffers (validity, offset, data), is compression applied to all
>>>>>>> three, or just data, or data and offset?
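
To make the pathway under discussion concrete: the compressing VectorUnloader constructor linked at [1] above exists, but ArrowWriter never calls it, so producing a compressed batch currently means unloading by hand. A minimal sketch against the Arrow 6 Java API (codec classes from [2] above; vector contents arbitrary, error handling omitted):

import org.apache.arrow.compression.CommonsCompressionFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.compression.CompressionCodec;
import org.apache.arrow.vector.compression.CompressionUtil;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;

public class CompressedUnload {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator();
         VarCharVector names = new VarCharVector("name", allocator)) {
      names.allocateNew();
      names.setSafe(0, "alpha".getBytes());
      names.setSafe(1, "beta".getBytes());
      names.setValueCount(2);
      VectorSchemaRoot root = VectorSchemaRoot.of(names);

      CompressionCodec codec =
          CommonsCompressionFactory.INSTANCE.createCodec(CompressionUtil.CodecType.ZSTD);

      // The constructor ArrowWriter never calls. A VarCharVector has
      // validity, offset, and data buffers; per the answer above, all
      // three are compressed individually when the batch is unloaded.
      VectorUnloader unloader =
          new VectorUnloader(root, /*includeNullCount=*/ true, codec, /*alignBuffers=*/ true);
      try (ArrowRecordBatch batch = unloader.getRecordBatch()) {
        System.out.println("compressed body length: " + batch.computeBodyLength());
      }
    }
  }
}

On the decode side, I believe the readers accept an optional CompressionCodec.Factory (e.g. passing CommonsCompressionFactory.INSTANCE when constructing ArrowFileReader), which is the path exercised by the statically generated file test mentioned above.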

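For completeness, the compressed-buffer format those go-code comments describe: with the BUFFER compression method, each buffer of the body is compressed independently and written with its uncompressed length as a leading 64-bit little-endian integer, where -1 means the bytes that follow were left uncompressed. In plain Java terms (a hypothetical helper, purely illustrative):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class CompressedBufferLayout {
  // BUFFER body compression, per the format spec:
  //   [int64 little-endian uncompressed length][compressed bytes][padding]
  // A length of -1 signals the bytes that follow were left uncompressed
  // (used when compression would not actually save space).
  public static long uncompressedLength(ByteBuffer bufferBody) {
    return bufferBody.order(ByteOrder.LITTLE_ENDIAN).getLong(0);
  }

  public static boolean isCompressed(ByteBuffer bufferBody) {
    return uncompressedLength(bufferBody) != -1L;
  }
}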