Looking through the code it appears that this isn't exposed to users.
ArrowWriter doesn't use the VectorUnloader constructor that includes a
compressor, so no one is using this in Java.  I found some great comments in
the Go code that are *super* helpful about the compressed buffer's format.

Who is using compression?  Are you using it via the C++ dataset pathways or
one of the various language wrappers?

What about putting a nice C interface on top of all the C++ and then basing
R, Python, Julia, and Java (via JNA, JNR, or JDK 17's FFI pathway) on top of
that one C interface?  Seems like a hell of a lot less work than the bespoke
language wrappers - and everyone gets access to all the features at the
same time.

On Thu, Jan 13, 2022 at 4:11 PM Chris Nuernberger <[email protected]>
wrote:

> Great, thanks, I just hadn't noticed until now - thanks!
>
> On Thu, Jan 13, 2022 at 4:09 PM Micah Kornfield <[email protected]>
> wrote:
>
>> Hi Chris,
>>
>>> Upgrading to 6.0.X I noticed that record batches can have body
>>> compression which I think is great.
>>
>>
>> Small nit: this was released in Arrow 4.
>>
>> I had trouble finding examples in python or R (or java) of writing an IPC
>>> file with various types of compression used for the record batch.
>>
>>
>> Java code is at [1] with implementations for compression codec living in
>> [2].
>>
>> Is the compression applied per-column or upon the record batch after the
>>> buffers have been serialized to the batch?  If it is applied per column
>>> which buffers - given that text for example can consist of 3 buffers
>>> (validity, offset, data) is compression applied to all three or just data
>>> or data and offset?
>>
>> It is applied per buffer, all buffers are compressed.
>>
>> Cheers,
>> Micah
>>
>>
>> [1]
>> https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java#L100
>> [2]
>> https://github.com/apache/arrow/tree/971a9d352e456882aa5b70ac722607840cdb9df7/java/compression/src
>>
>> On Thu, Jan 13, 2022 at 2:55 PM Chris Nuernberger <[email protected]>
>> wrote:
>>
>>> Upgrading to 6.0.X I noticed that record batches can have body
>>> compression which I think is great.
>>>
>>> I had trouble finding examples in python or R (or java) of writing an
>>> IPC file with various types of compression used for the record batch.
>>>
>>> Is the compression applied per-column or upon the record batch after the
>>> buffers have been serialized to the batch?  If it is applied per column
>>> which buffers - given that text for example can consist of 3 buffers
>>> (validity, offset, data) is compression applied to all three or just data
>>> or data and offset?
>>>
>>