Hi Chris,

> Looking through the code it appears that this isn't exposed to users.
> ArrowWriter doesn't use the VectorUnloader constructor that includes a
> compressor so no one is using this in Java. I found some great comments
> in the go code that are *super* helpful about the compressed buffer's
> format.

Unfortunately, the addition of compression didn't go through the normal
path for integrating new features (actively running integration tests
between two or more languages). Right now only the read path is tested,
from a statically generated file in C++, so this gap wasn't caught. A
contribution to fix this would be welcome.
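In the meantime, the write path can be exercised by hand through the
constructor you found. A minimal sketch, from memory and untested, so
treat the exact arguments as my reading of the 6.0.x code rather than a
documented recipe (the codec choice is just illustrative):

    import org.apache.arrow.compression.CommonsCompressionFactory;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.VectorUnloader;
    import org.apache.arrow.vector.compression.CompressionCodec;
    import org.apache.arrow.vector.compression.CompressionUtil;
    import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;

    // Produce a compressed ArrowRecordBatch from a populated root.
    static ArrowRecordBatch compressBatch(VectorSchemaRoot root) {
      CompressionCodec codec = CommonsCompressionFactory.INSTANCE
          .createCodec(CompressionUtil.CodecType.LZ4_FRAME);
      VectorUnloader unloader = new VectorUnloader(
          root, /*includeNullCount=*/true, codec, /*alignBuffers=*/true);
      return unloader.getRecordBatch();  // buffers are compressed here
    }

You still have to feed the resulting batch to the IPC writer machinery
yourself; that plumbing is exactly what ArrowWriter is missing today.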
> Who is using compression? Are you using it via the c++ dataset pathways
> or one of the various language wrappers?

We use the decode path in Java (and other languages) to connect to a
service my team owns that serves Arrow data with optional compression.
Note that LZ4 is very slow in Java today [1].
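On the decode side, things mostly just work as long as you hand the
reader a codec factory; the default no-compression factory cannot read
compressed batches. Again a sketch from memory rather than tested code:

    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.arrow.compression.CommonsCompressionFactory;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.ipc.ArrowFileReader;

    // Read an IPC file whose record batches may be compressed.
    static void readCompressed(String path) throws IOException {
      try (RootAllocator allocator = new RootAllocator();
           FileInputStream in = new FileInputStream(path);
           ArrowFileReader reader = new ArrowFileReader(
               in.getChannel(), allocator,
               CommonsCompressionFactory.INSTANCE)) {
        while (reader.loadNextBatch()) {
          // Buffers are decompressed as each batch is loaded.
          System.out.println(reader.getVectorSchemaRoot().getRowCount());
        }
      }
    }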
> What about putting a nice C interface on top of all the c++ and then
> basing R, python, Julia, and Java via JNA, JNR, or JDK 17's FFI pathway
> on top of one C interface? Seems like a hell of a lot less work than the
> bespoke language wrappers - and everyone gets access to all the features
> at the same time.

There is already a GObject [2] interface on top of Arrow C++ that is used
in the Ruby bindings. R and Python bind directly to C++ already. For the
other implementations, there is value in not having every implementation
share the same core: it helps ensure the specification is understandable
and can be implemented outside of the project if necessary. Also, for
some languages, prebuilt distribution is easier if no native code is
required.

If you are looking for auto-generated bindings to Arrow C++ in Java,
there is a project [3] that does that. I have never used it, so I can't
comment on its quality.

-Micah

[1] https://issues.apache.org/jira/browse/ARROW-11901
[2] https://en.wikipedia.org/wiki/GObject
[3] https://github.com/bytedeco/javacpp-presets/tree/master/arrow

On Fri, Jan 14, 2022 at 5:35 AM Chris Nuernberger <[email protected]>
wrote:

> Looking through the code it appears that this isn't exposed to users.
> ArrowWriter doesn't use the VectorUnloader constructor that includes a
> compressor so no one is using this in Java. I found some great comments
> in the go code that are *super* helpful about the compressed buffer's
> format.
>
> Who is using compression? Are you using it via the c++ dataset pathways
> or one of the various language wrappers?
>
> What about putting a nice C interface on top of all the c++ and then
> basing R, python, Julia, and Java via JNA, JNR, or JDK 17's FFI pathway
> on top of one C interface? Seems like a hell of a lot less work than the
> bespoke language wrappers - and everyone gets access to all the features
> at the same time.
>
> On Thu, Jan 13, 2022 at 4:11 PM Chris Nuernberger <[email protected]>
> wrote:
>
>> Great, I just hadn't noticed until now - thanks!
>>
>> On Thu, Jan 13, 2022 at 4:09 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> Hi Chris,
>>>
>>>> Upgrading to 6.0.X I noticed that record batches can have body
>>>> compression which I think is great.
>>>
>>> Small nit: this was released in Arrow 4.
>>>
>>>> I had trouble finding examples in python or R (or java) of writing
>>>> an IPC file with various types of compression used for the record
>>>> batch.
>>>
>>> Java code is at [1] with implementations for compression codec living
>>> in [2].
>>>
>>>> Is the compression applied per-column or upon the record batch after
>>>> the buffers have been serialized to the batch? If it is applied per
>>>> column, which buffers? Given that text, for example, can consist of
>>>> 3 buffers (validity, offset, data), is compression applied to all
>>>> three, or just data, or data and offset?
>>>
>>> It is applied per buffer; all buffers are compressed.
>>>
>>> Cheers,
>>> Micah
>>>
>>> [1]
>>> https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java#L100
>>> [2]
>>> https://github.com/apache/arrow/tree/971a9d352e456882aa5b70ac722607840cdb9df7/java/compression/src
>>>
>>> On Thu, Jan 13, 2022 at 2:55 PM Chris Nuernberger
>>> <[email protected]> wrote:
>>>
>>>> Upgrading to 6.0.X I noticed that record batches can have body
>>>> compression which I think is great.
>>>>
>>>> I had trouble finding examples in python or R (or java) of writing
>>>> an IPC file with various types of compression used for the record
>>>> batch.
>>>>
>>>> Is the compression applied per-column or upon the record batch after
>>>> the buffers have been serialized to the batch? If it is applied per
>>>> column, which buffers? Given that text, for example, can consist of
>>>> 3 buffers (validity, offset, data), is compression applied to all
>>>> three, or just data, or data and offset?
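P.S. Since the go code comments about the buffer format came up: to make
"applied per buffer" concrete, each buffer in the compressed body is, as
I read the IPC spec, independently compressed and prefixed with its
uncompressed length as a little-endian int64, with -1 meaning the bytes
that follow are stored uncompressed. The Decompressor interface below is
a made-up stand-in for the real codec implementations in the Java
compression module:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    final class CompressedBufferLayout {
      // Made-up stand-in for a real LZ4-frame or zstd decoder.
      interface Decompressor {
        byte[] decompress(ByteBuffer compressed, long uncompressedLength);
      }

      static byte[] readBuffer(ByteBuffer raw, Decompressor codec) {
        raw.order(ByteOrder.LITTLE_ENDIAN);
        long uncompressedLength = raw.getLong(); // 8-byte prefix per buffer
        if (uncompressedLength == -1) {          // -1 => stored as-is
          byte[] out = new byte[raw.remaining()];
          raw.get(out);
          return out;
        }
        return codec.decompress(raw.slice(), uncompressedLength);
      }
    }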
