Hi Chris,

> Are there compression constants for snappy and brotli I am not seeing?
> The flatbuffer definition of the constants only contains lz4 and zstd.

These are not in the spec.  We chose to keep the set of supported
compression codecs small unless there is real demand for more.  When I
last looked, LZ4 with a proper implementation was pretty much strictly
better than Snappy.  I believe Brotli might provide some space advantages
over ZSTD, but it is generally slower.  If there is a strong use case for
another codec, I would suggest proposing the addition on the dev@ mailing
list.
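
For reference, the generated Java constants mirror the spec; a rough
sketch of what org.apache.arrow.flatbuf.CompressionType contains
(reproduced from memory, so double-check the generated sources):

    // Generated from format/Message.fbs; these are the only codecs defined.
    public final class CompressionType {
      public static final byte LZ4_FRAME = 0;
      public static final byte ZSTD = 1;
    }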


> That performance discussion is interesting.  It is disappointing that as
> far as java libraries are concerned tech.ml.dataset isn't brought up as it
> is both faster and supports more features than the base arrow java SDK.

Generally, outside of LZ4, the performance of the Java library hasn't come
up much.  The only reason I mentioned javacpp was that a question about
native bindings was asked (and you of course already know about
tech.ml.dataset).

Cheers,
Micah


On Thu, Jan 20, 2022 at 8:32 AM Chris Nuernberger <[email protected]>
wrote:

> Are there compression constants for snappy and brotli I am not seeing?
> The flatbuffer definition of the constants only contains lz4 and zstd.
>
> That performance discussion is interesting.  It is disappointing that as
> far as java libraries are concerned tech.ml.dataset isn't brought up as it
> is both faster and supports more features than the base arrow java SDK.
>
> On Sun, Jan 16, 2022 at 9:33 PM Micah Kornfield <[email protected]>
> wrote:
>
>> Hi Chris,
>>
>>> Looking through the code it appears that this isn't exposed to users.
>>> ArrowWriter doesn't use the VectorUnloader constructor that includes a
>>> compressor so no one is using this in Java.  I found some great comments in
>>> the go code that are *super* helpful about the compressed buffer's format.
>>
>> Unfortunately, the addition of compression didn't go through the normal
>> path for integrating new features (integration tests between two or more
>> languages actively running).  Right now only the read path is tested from a
>> statically generated file in C++ so this gap wasn't caught.  A contribution
>> to fix this would be welcome.
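>>
>> In the meantime, if you want to experiment, here is a rough sketch of
>> driving the VectorUnloader constructor that takes a codec directly
>> (class and parameter names are from memory of the 6.x sources, so treat
>> this as approximate rather than a supported API):
>>
>>     // Unload a populated VectorSchemaRoot into an ArrowRecordBatch whose
>>     // buffers are compressed with LZ4.  Lz4CompressionCodec lives in the
>>     // arrow-compression module (org.apache.arrow.compression).
>>     VectorSchemaRoot root = ...;  // batch to be written
>>     VectorUnloader unloader =
>>         new VectorUnloader(root, /*includeNullCount=*/true,
>>             new Lz4CompressionCodec(), /*alignBuffers=*/true);
>>     ArrowRecordBatch compressed = unloader.getRecordBatch();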
>>
>>> Who is using compression?  Are you using it via the c++ dataset pathways
>>> or one of the various language wrappers?
>>
>> We use the decode path in Java (and other languages) to connect to a
>> service my team owns that serves arrow data with optional compression.
>>  Note that LZ4 is very slow in Java today [1].
>>
>>> What about putting a nice C interface on top of all the c++ and then
>>> basing R, python, Julia, and Java via JNA, JNR, or JDK 17's FFI pathway on
>>> top of one C interface?  Seems like a hell of a lot less work than the
>>> bespoke language wrappers - and everyone gets access to all the features at
>>> the same time.
>>
>> There is already a gobject [2] interface on top of arrow C++ that is used
>> in the Ruby bindings.  R and Python bind directly to C++ already.  In terms
>> of other implementations there is value in not having every implementation
>> have the same core, as it helps ensure the specification is understandable
>> and can be implemented outside of the project if necessary.  Also for some
>> languages it makes prebuilt distribution easier if native code is not
>> required.
>>
>> If you are looking for auto-generated bindings for Arrow C++ in Java,
>> there is a project [3] that does that.  I have never used it so I can't
>> comment on its quality.
>>
>> -Micah
>>
>>
>> [1] https://issues.apache.org/jira/browse/ARROW-11901
>> [2] https://en.wikipedia.org/wiki/GObject
>> [3] https://github.com/bytedeco/javacpp-presets/tree/master/arrow
>>
>>
>> On Fri, Jan 14, 2022 at 5:35 AM Chris Nuernberger <[email protected]>
>> wrote:
>>
>>> Looking through the code it appears that this isn't exposed to users.
>>> ArrowWriter doesn't use the VectorUnloader constructor that includes a
>>> compressor so no one is using this in Java.  I found some great comments in
>>> the go code that are *super* helpful about the compressed buffer's format.
>>>
>>> Who is using compression?  Are you using it via the c++ dataset pathways
>>> or one of the various language wrappers?
>>>
>>> What about putting a nice C interface on top of all the c++ and then
>>> basing R, python, Julia, and Java via JNA, JNR, or JDK 17's FFI pathway on
>>> top of one C interface?  Seems like a hell of a lot less work than the
>>> bespoke language wrappers - and everyone gets access to all the features at
>>> the same time.
>>>
>>> On Thu, Jan 13, 2022 at 4:11 PM Chris Nuernberger <[email protected]>
>>> wrote:
>>>
>>>> Great, thanks, I just hadn't noticed until now - thanks!
>>>>
>>>> On Thu, Jan 13, 2022 at 4:09 PM Micah Kornfield <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Chris,
>>>>>
>>>>>> Upgrading to 6.0.X I noticed that record batches can have body
>>>>>> compression which I think is great.
>>>>>
>>>>>
>>>>> Small nit: this was released in Arrow 4.
>>>>>
>>>>>> I had trouble finding examples in python or R (or java) of writing an
>>>>>> IPC file with various types of compression used for the record batch.
>>>>>
>>>>>
>>>>> Java code is at [1] with implementations for compression codec living
>>>>> in [2].
>>>>>
>>>>>> Is the compression applied per-column or upon the record batch after
>>>>>> the buffers have been serialized to the batch?  If it is applied per
>>>>>> column which buffers - given that text for example can consist of 3
>>>>>> buffers (validity, offset, data) is compression applied to all three
>>>>>> or just data or data and offset?
>>>>>
>>>>> It is applied per buffer, all buffers are compressed.
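>>>>>
>>>>> Concretely (paraphrasing the IPC spec from memory, so please verify
>>>>> against the format docs): each buffer in the body is compressed
>>>>> independently and written as
>>>>>
>>>>>     int64 uncompressed_length   // little-endian; -1 means this buffer
>>>>>                                 // was left uncompressed
>>>>>     <compressed bytes>
>>>>>
>>>>> which should line up with the comments you found in the go code.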
>>>>>
>>>>> Cheers,
>>>>> Micah
>>>>>
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java#L100
>>>>> [2]
>>>>> https://github.com/apache/arrow/tree/971a9d352e456882aa5b70ac722607840cdb9df7/java/compression/src
>>>>>
>>>>> On Thu, Jan 13, 2022 at 2:55 PM Chris Nuernberger <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Upgrading to 6.0.X I noticed that record batches can have body
>>>>>> compression which I think is great.
>>>>>>
>>>>>> I had trouble finding examples in python or R (or java) of writing an
>>>>>> IPC file with various types of compression used for the record batch.
>>>>>>
>>>>>> Is the compression applied per-column or upon the record batch after
>>>>>> the buffers have been serialized to the batch?  If it is applied per
>>>>>> column which buffers - given that text for example can consist of 3
>>>>>> buffers (validity, offset, data) is compression applied to all three
>>>>>> or just data or data and offset?
>>>>>>
>>>>>
