Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-04-16 Thread Wes McKinney
It seems like there is reasonable consensus in the PR. If there are no further comments, I'll start a vote about this within the next several days. On Mon, Apr 6, 2020 at 10:55 PM Wes McKinney wrote: > > I updated the Format proposal again, please have a look > > https://github.com/apache/arrow/pul

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-04-06 Thread Wes McKinney
I updated the Format proposal again, please have a look https://github.com/apache/arrow/pull/6707 On Wed, Apr 1, 2020 at 10:15 AM Wes McKinney wrote: > > For uncompressed, memory mapping is disabled, so all of the bytes are > being read into RAM. I wanted to show that even when your IO pipe is >

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-04-01 Thread Wes McKinney
For uncompressed, memory mapping is disabled, so all of the bytes are being read into RAM. I wanted to show that even when your IO pipe is very fast (in the case with an NVMe SSD like I have, > 1GB/s for read from disk) that you can still load faster with compressed files. Here were the prior Read

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-04-01 Thread Antoine Pitrou
The read times are still with memory mapping for the uncompressed case? If so, impressive! Regards Antoine. Le 01/04/2020 à 16:44, Wes McKinney a écrit : > Several pieces of work got done in the last few days: > > * Changing from LZ4 raw to LZ4 frame format (what is recommended for > intero

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-04-01 Thread Wes McKinney
Several pieces of work got done in the last few days: * Changing from LZ4 raw to LZ4 frame format (what is recommended for interoperability) * Parallelizing both compression and decompression at the field level Here are the results (using 8 threads on an 8-core laptop). I disabled the "memory map
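To illustrate the field-level parallelism mentioned above, here is a minimal Python sketch. It is not the C++ code from the PR. It assumes each field's body is available as an independent bytes object, and it uses zlib only because it ships with the standard library, as a stand-in for LZ4/ZSTD. The names `compress_fields`, `decompress_fields` and the sample field names are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def compress_fields(fields: Dict[str, bytes], max_workers: int = 8) -> Dict[str, bytes]:
    """Compress every field body independently, one task per field."""
    names = list(fields)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so names and results line up.
        compressed = list(pool.map(zlib.compress, (fields[n] for n in names)))
    return dict(zip(names, compressed))


def decompress_fields(fields: Dict[str, bytes], max_workers: int = 8) -> Dict[str, bytes]:
    """Inverse of compress_fields: decompress every field body in parallel."""
    names = list(fields)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        raw = list(pool.map(zlib.decompress, (fields[n] for n in names)))
    return dict(zip(names, raw))


# Hypothetical sample data: two "fields" with repetitive contents.
sample = {"fare": b"\x00\x01" * 4096, "vendor": b"ABC" * 2048}
packed = compress_fields(sample)
restored = decompress_fields(packed)
```

Because each field is compressed on its own, no task depends on another, which is what makes a one-task-per-field split possible in the first place.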

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-26 Thread Wes McKinney
Here are the results: File size: https://ibb.co/71sBsg3 Read time: https://ibb.co/4ZncdF8 Write time: https://ibb.co/xhNkRS2 Code: https://github.com/wesm/notebooks/blob/master/20190919file_benchmarks/FeatherCompression.ipynb (based on https://github.com/apache/arrow/pull/6694) High level summa

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-26 Thread Wes McKinney
I'll run a grid of batch sizes (from 1024 to 64K or 128K) and let you know the read/write times and compression ratios. Shouldn't take too long. On Wed, Mar 25, 2020 at 10:37 PM Fan Liya wrote: > > Thanks a lot for sharing the good results. > > As investigated by Wes, we have existing zstd library

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-25 Thread Fan Liya
Thanks a lot for sharing the good results. As investigated by Wes, we have existing zstd library for Java (zstd-jni) [1], and lz4 library for Java (lz4-java) [2]. +1 for the 1024 batch size, as it represents an important scenario where the batch fits into the L1 cache (IMO). Best, Liya Fan [1] h

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-25 Thread Micah Kornfield
If it isn't hard, could you run with batch sizes of 1024 or 2048 records? I think a question was raised previously about whether there is a benefit for smaller buffer sizes. Thanks, Micah On Wed, Mar 25, 2020 at 8:59 AM Wes McKinney wrote: > On Tue, Mar 24, 2020 at 9:22 PM Micah Kornfield > wrote:

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-25 Thread Wes McKinney
On Tue, Mar 24, 2020 at 9:22 PM Micah Kornfield wrote: > > > > > Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on > > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae > > dataset. So that's a huge space savings > > One more question on this. What was the a

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-25 Thread Sebastien Binet
On Wed, Mar 25, 2020 at 2:32 AM Wes McKinney wrote: > From what I've found searching on the internet > > - Java: > * ZSTD -- JNI-based library available > * LZ4 -- both JNI and native Java available > > - Go: ZSTD is a C binding, while there is an LZ4 native Go implementation > AFAIK, one has acc

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-24 Thread Micah Kornfield
> > Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae > dataset. So that's a huge space savings One more question on this. What was the average row-batch size used? I see in the proposal some buffers might

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-24 Thread Wes McKinney
From what I've found searching on the internet - Java: * ZSTD -- JNI-based library available * LZ4 -- both JNI and native Java available - Go: ZSTD is a C binding, while there is an LZ4 native Go implementation - Rust: bindings to both C libraries available - C# wrapper libraries seem to be av

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-24 Thread Micah Kornfield
Thanks Wes, It would be nice if contributors to other languages could express their opinions on the two compression formats selected (in particular if they represent challenges in using a suitable library for decompressing) -Micah On Tue, Mar 24, 2020 at 3:08 PM Wes McKinney wrote: > I just

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-24 Thread Wes McKinney
I just opened this pull request with the proposed format additions based on this discussion: https://github.com/apache/arrow/pull/6707 If there is more feedback about the details, it would be good to know it now. In a couple of days I would like to call a vote to see if there is interest in forma

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-23 Thread Antoine Pitrou
Le 24/03/2020 à 00:39, Wes McKinney a écrit : > > As far as what Micah said about having a limited number of > compressors: I would be in favor of having just LZ4 and ZSTD. +1, exactly my thought as well. Regards Antoine.

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-23 Thread Wes McKinney
hi folks, Sorry it's taken me a little while to produce supporting benchmarks. * I implemented experimental trivial body buffer compression in https://github.com/apache/arrow/pull/6638 * I hooked up the Arrow IPC file format with compression as the new Feather V2 format in https://github.com/apac

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-06 Thread Fan Liya
Hi Wes, Thanks a lot for the additional information. Looking forward to seeing the good results from your experiments. Best, Liya Fan On Thu, Mar 5, 2020 at 11:42 PM Wes McKinney wrote: > I see, thank you. > > For such a scenario, implementations would need to define a > "UserDefinedCodec" interf

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-05 Thread Wes McKinney
I see, thank you. For such a scenario, implementations would need to define a "UserDefinedCodec" interface to enable codecs to be registered from third party code, similar to what is done for extension types [1] I'll update this thread when I get my experimental C++ patch up to see what I'm think
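One way to picture a registry for third-party codecs is the Python sketch below. It is purely illustrative: the thread does not define what a "UserDefinedCodec" interface would look like, so `Codec`, `register_codec`, `get_codec` and the key "example-zlib" are all hypothetical names, and zlib is used only because it is in the standard library.

```python
import zlib
from typing import Callable, Dict, NamedTuple


class Codec(NamedTuple):
    """A compress/decompress pair supplied by third-party code (hypothetical shape)."""
    compress: Callable[[bytes], bytes]
    decompress: Callable[[bytes], bytes]


# Process-wide table mapping a codec identifier to its implementation.
_REGISTRY: Dict[str, Codec] = {}


def register_codec(name: str, codec: Codec) -> None:
    """Register a codec under a unique name; duplicates are rejected."""
    if name in _REGISTRY:
        raise ValueError(f"codec {name!r} is already registered")
    _REGISTRY[name] = codec


def get_codec(name: str) -> Codec:
    """Look up a codec by the identifier carried alongside the data."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise KeyError(f"no codec registered under {name!r}") from None


# Hypothetical registration by user code; the key is made up for this example.
register_codec("example-zlib", Codec(zlib.compress, zlib.decompress))
```

A receiver would call `get_codec` with whatever identifier the sender attached and fail loudly if nothing is registered under it, rather than guessing how to decode the buffers.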

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-05 Thread Fan Liya
Hi Wes, Thanks a lot for your further clarification. Some of my preliminary thoughts: 1. We assign a unique GUID to each pair of compression/decompression strategies. The GUID is stored as part of the Message.custom_metadata. When receiving the GUID, the receiver knows which decompression strateg

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-04 Thread Wes McKinney
Okay, I guess my question is how the receiver is going to be able to determine how to "rehydrate" the record batch buffers: What I've proposed amounts to the following: * UNCOMPRESSED: the current behavior * ZSTD/LZ4/...: each buffer is compressed and written with an int64 length prefix (I'm clo
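The "compressed buffer with an int64 length prefix" idea can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the proposed wire format: the message is cut off before it says what the prefix holds, so the sketch assumes it records the uncompressed size, assumes little-endian byte order, and uses zlib from the standard library as a stand-in for ZSTD/LZ4. The function names are hypothetical.

```python
import struct
import zlib

_PREFIX = struct.Struct("<q")  # assumption: little-endian signed 64-bit length


def compress_buffer(raw: bytes) -> bytes:
    """Compress one buffer and prepend its (assumed) uncompressed length."""
    return _PREFIX.pack(len(raw)) + zlib.compress(raw)


def decompress_buffer(framed: bytes) -> bytes:
    """Read the length prefix, decompress the rest, and check the size."""
    (uncompressed_len,) = _PREFIX.unpack_from(framed, 0)
    raw = zlib.decompress(framed[_PREFIX.size:])
    if len(raw) != uncompressed_len:
        raise ValueError("length prefix does not match decompressed size")
    return raw


# Round trip on a made-up buffer.
original = b"\x01\x00\x00\x00" * 1024
framed = compress_buffer(original)
roundtrip = decompress_buffer(framed)
```

Under that assumption, a reader can allocate the output buffer from the prefix before decompressing, which is one plausible reason to carry it in front of each buffer.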

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-04 Thread Fan Liya
Hi Wes, I am thinking of adding an option named "USER_DEFINED" (or something similar) to enum CompressionType in your proposal. IMO, this option should be used primarily in Flight. Best, Liya Fan On Wed, Mar 4, 2020 at 11:12 AM Wes McKinney wrote: > On Tue, Mar 3, 2020, 8:11 PM Fan Liya wrote

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-03 Thread Wes McKinney
On Tue, Mar 3, 2020, 8:11 PM Fan Liya wrote: > Sure. I agree with you that we should not overdo this. > I am wondering if we should provide an option to allow users to plugin > their customized compression strategies. > Can you provide a patch showing changes to Message.fbs (or Schema.fbs) that

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-03 Thread Fan Liya
Sure. I agree with you that we should not overdo this. I am wondering if we should provide an option to allow users to plug in their customized compression strategies. Best, Liya Fan On Tue, Mar 3, 2020 at 9:47 PM Wes McKinney wrote: > On Tue, Mar 3, 2020, 7:36 AM Fan Liya wrote: > > > I am so

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-03 Thread Wes McKinney
On Tue, Mar 3, 2020, 7:36 AM Fan Liya wrote: > I am so glad to see this discussion, and I am willing to provide help from > the Java side. > > In the proposal, I see the support for basic compression strategies > (e.g.gzip, snappy). > IMO, applying a single basic strategy is not likely to achieve

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-03 Thread Antoine Pitrou
Well, we shouldn't overdo this either. We are not trying to replicate the Parquet format. Regards Antoine. Le 03/03/2020 à 14:36, Fan Liya a écrit : > I am so glad to see this discussion, and I am willing to provide help from > the Java side. > > In the proposal, I see the support for basic

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-03 Thread Fan Liya
I am so glad to see this discussion, and I am willing to provide help from the Java side. In the proposal, I see the support for basic compression strategies (e.g. gzip, snappy). IMO, applying a single basic strategy is not likely to achieve performance improvement for most scenarios. The optimal c

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-03 Thread Antoine Pitrou
If we want to use an HTTP header, it would be more of an Accept-Encoding header, no? In any case, we would have to put non-standard values there (e.g. lz4), so I'm not sure how desirable it is to repurpose HTTP headers for that, rather than add some dedicated field to the Flight messages. Regards

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-03 Thread David Li
gRPC supports headers so for Flight, we could send essentially an Accept header and perhaps a Content-Type header. David On Mon, Mar 2, 2020, 23:15 Micah Kornfield wrote: > Hi Wes, > A few thoughts on this. In general, I think it is a good idea. But before > proceeding, I think the following

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-02 Thread Micah Kornfield
Hi Wes, A few thoughts on this. In general, I think it is a good idea. But before proceeding, I think the following points are worth discussing: 1. Does this actually improve throughput/latency for Flight? (I think you mentioned you would follow-up with benchmarks). 2. I think we should limit t

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-01 Thread Wes McKinney
On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou wrote: > > > Le 01/03/2020 à 22:01, Wes McKinney a écrit : > > In the context of a "next version of the Feather format" ARROW-5510 > > (which is consumed only by Python and R at the moment), I have been > > looking at compressing buffers using fast com

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-01 Thread Neville Dipale
I also support compression at the buffer level, and making it an extra message. Talking about compression and flight, has anyone tested using grpc's compression to compress at the transport level (if that's a correct way to describe it)? I believe only gzip and brotli are currently supported, so t

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-01 Thread Antoine Pitrou
Le 01/03/2020 à 22:01, Wes McKinney a écrit : > In the context of a "next version of the Feather format" ARROW-5510 > (which is consumed only by Python and R at the moment), I have been > looking at compressing buffers using fast compressors like ZSTD when > writing the RecordBatch bodies. This c

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-01 Thread Wes McKinney
On Sun, Mar 1, 2020 at 3:01 PM Wes McKinney wrote: > > In the context of a "next version of the Feather format" ARROW-5510 > (which is consumed only by Python and R at the moment), I have been > looking at compressing buffers using fast compressors like ZSTD when > writing the RecordBatch bodies.

[DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-01 Thread Wes McKinney
In the context of a "next version of the Feather format" ARROW-5510 (which is consumed only by Python and R at the moment), I have been looking at compressing buffers using fast compressors like ZSTD when writing the RecordBatch bodies. This could be handled privately as an implementation detail of