[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959827#comment-16959827 ]

Wes McKinney commented on ARROW-300:

There is an ongoing discussion at https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E

> [Format] Add buffer compression option to IPC file format
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Format
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Priority: Major
> Fix For: 1.0.0
>
> It may be useful if data is to be sent over the wire to compress the data buffers themselves as they are being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer compression setting in the file Footer. Probably only two compressors worth supporting out of the box would be zlib (higher compression ratios) and lz4 (better performance).
> What does everyone think?

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959552#comment-16959552 ]

Yuan Zhou commented on ARROW-300:

Hi [~wesm], thanks for providing the general idea. I'm quite interested in this feature. Do you happen to have any updates on the detailed proposal?

Cheers, -yuan
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617883#comment-16617883 ]

Wes McKinney commented on ARROW-300:

Moving this to 0.12. I will make a proposal for compressed record batches after the 0.11 release goes out.

My gut instinct on this would be to create a {{CompressedBuffer}} metadata type and a {{CompressedRecordBatch}} message. Some reasons:

* Does not complicate or bloat the existing RecordBatch message type
* Supports buffer-level compression (each buffer can be compressed or not)

Readers can choose to materialize right away or on demand. In C++, we can create an {{arrow::CompressedRecordBatch}} class, if we want, that does late materialization.

This does not necessarily accommodate other kinds of type-specific compression, like RLE encoding, or it might be that RLE can be used on the values buffer of primitive types, e.g.

{code}
CompressedBuffer {
  CompressionType type;
  int64 offset;
  int64 compressed_size;
  int64 uncompressed_size;
}
{code}

So if we wanted to use the Parquet RLE_BITPACKED_HYBRID compression style for integers, say, we could do that.

Another question here is how to handle compressions which may have additional parameters. {{CompressionType}} or {{Compression}} could be a union, but that would make the message sizes larger (but maybe that's OK).
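The {{CompressedBuffer}} idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Arrow metadata or API: the enum values, the `write_buffers` and `read_buffer` helpers, and the `min_saving` threshold are all hypothetical, and zlib stands in for whichever codecs are eventually chosen. It shows each buffer being compressed or left alone independently, with a reader materializing a single buffer on demand.

```python
import zlib
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class CompressionType(Enum):
    # Hypothetical codec ids; real metadata would define its own values.
    NONE = 0
    ZLIB = 1


@dataclass
class CompressedBuffer:
    # Mirrors the four fields proposed in the comment above.
    type: CompressionType
    offset: int             # start of this buffer within the message body
    compressed_size: int    # bytes actually stored in the body
    uncompressed_size: int  # bytes after decompression


def write_buffers(
    buffers: List[bytes], min_saving: float = 0.1
) -> Tuple[List[CompressedBuffer], bytes]:
    """Pack buffers into one body, compressing a buffer only if it shrinks enough."""
    metadata: List[CompressedBuffer] = []
    body = bytearray()
    for raw in buffers:
        packed = zlib.compress(raw)
        if len(packed) <= len(raw) * (1 - min_saving):
            codec, stored = CompressionType.ZLIB, packed
        else:
            codec, stored = CompressionType.NONE, raw
        metadata.append(CompressedBuffer(codec, len(body), len(stored), len(raw)))
        body += stored
    return metadata, bytes(body)


def read_buffer(meta: CompressedBuffer, body: bytes) -> bytes:
    """Materialize one buffer on demand from its metadata entry."""
    stored = body[meta.offset:meta.offset + meta.compressed_size]
    if meta.type is CompressionType.ZLIB:
        stored = zlib.decompress(stored)
    if len(stored) != meta.uncompressed_size:
        raise ValueError("buffer size does not match metadata")
    return stored
```

Because every entry carries its own offset and sizes, a reader can skip buffers it does not need, which is the late-materialization behavior described above.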
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392262#comment-16392262 ]

Wes McKinney commented on ARROW-300:

We haven't done any work on this yet. I think the first step would be to propose additional metadata (in the Flatbuffers files) for record batches to indicate the style of compression.

> Fix For: 0.10.0
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392124#comment-16392124 ]

Lawrence Chan commented on ARROW-300:

What did we decide on this? IMHO there's still a use case for compressed Arrow files due to the limited storage types in Parquet. I don't really love the idea of storing 8-bit or 16-bit ints in an INT32 and hand-waving it away with compression. My current workaround uses a fixed-length byte array, but it's pretty clunky to do this efficiently, at least in the parquet-cpp implementation. There are maybe also some alignment concerns with that latter approach that I'm just ignoring right now.

Happy to help, but I'm not familiar enough with the code base to place it in the right spot. If we make a branch with some TODOs/placeholders I can probably plug in more easily.

> Fix For: 0.10.0
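To make the INT32 widening concern concrete, here is a rough Python sketch that packs the same small integers as 8-bit and as 32-bit values and compresses both with zlib. The sample data is invented for illustration and zlib is only a stand-in codec; how much of the padding a compressor wins back depends heavily on the data and the codec, so no particular ratio is implied.

```python
import struct
import zlib

# Hypothetical sample column: small values that all fit in a signed 8-bit int.
values = [(i * 37) % 100 for i in range(10_000)]

# Native-width storage: 1 byte per value.
as_int8 = struct.pack(f"<{len(values)}b", *values)
# Widened storage, as when 8-bit data is kept in an INT32 column: 4 bytes per value.
as_int32 = struct.pack(f"<{len(values)}i", *values)

print("raw bytes:       ", len(as_int8), "vs", len(as_int32))
print("zlib-compressed: ", len(zlib.compress(as_int8)), "vs", len(zlib.compress(as_int32)))
```

Even where compression removes most of the padding on disk, the widened column still costs four times the memory once decompressed, plus the CPU time to compress and decompress the extra bytes.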
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16176938#comment-16176938 ]

Wes McKinney commented on ARROW-300:

Since 0.6.0, we have had all of the pieces in place in C++ that we need to do this. I will propose metadata extensions to support compressed record batches, along with a trial implementation.

> Fix For: 0.8.0
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008315#comment-16008315 ]

Kazuaki Ishizaki commented on ARROW-300:

Thank you for your response. I was also busy preparing materials for GTC. Now is a good time to write a document. Preparing a Google document to collect public comments sounds good. I will start drafting a document covering purpose, scope, and design.
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999611#comment-15999611 ]

Wes McKinney commented on ARROW-300:

I'm sorry for the delay. With the 0.3 Arrow release done, it would be good to make a push on compression and encoding. How about we start a Google Document that supports public comments, and you can give edit access to whomever you like? Once we agree on the design, one of us can make a pull request containing the Flatbuffers metadata for the compression / encoding details. Does that sound good?
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975288#comment-15975288 ]

Kazuaki Ishizaki commented on ARROW-300:

[~wesmckinn] Thank you for your kind and positive comment. I will work on preparing a proposal. (It will take some time, since I also have to prepare a presentation for GTC.)

[~xhochy] IIUC, Parquet is used as a persistent file format, while Arrow is used as an in-memory format.

What level of proposal do you expect? For example:

* What we want to do (e.g. RLE, Delta-encoding)
* New metadata format to support new compression schemes (new .fbs file)
* Data format for new compression schemes
* Prototype implementation
* Others

Also, should the proposal be posted in a separate JIRA entry or as a comment in this one?
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967616#comment-15967616 ]

Wes McKinney commented on ARROW-300:

[~kiszk] I agree that having in-memory compression schemes like in Spark is a good idea, in addition to simpler snappy/lz4/zlib buffer compression. Would you like to make a proposal for improvements to the Arrow metadata to support these compression schemes? We should indicate that Arrow implementations are not required to implement these in general, so for now they can be marked as experimental and optional for implementations (e.g. we wouldn't necessarily integration test them).

For scan-based in-memory columnar workloads, these encodings can yield better scan throughput because of better cache efficiency, and many column-oriented databases rely on this to be able to achieve high performance, so having it natively in the Arrow libraries seems useful.
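One way to picture the scan-throughput argument: with run-length encoded data, some aggregates can be computed per run rather than per value, so the scan touches far less memory. The following Python sketch assumes a column already stored as (run_length, value) pairs; the function names are hypothetical and are not part of any Arrow library.

```python
from typing import List, Tuple

Run = Tuple[int, int]  # (run_length, value)


def rle_sum(runs: List[Run]) -> int:
    """Sum the logical column with one multiply-add per run, without decoding."""
    return sum(count * value for count, value in runs)


def rle_count_equal(runs: List[Run], target: int) -> int:
    """Count logical rows equal to target by inspecting each run's value once."""
    return sum(count for count, value in runs if value == target)


# A 7-row logical column 1,1,1,2,2,2,2 held as two runs.
runs = [(3, 1), (4, 2)]
total = rle_sum(runs)
matches = rle_count_equal(runs, 2)
```

The work is proportional to the number of runs rather than the number of rows, which is where the cache-efficiency benefit comes from when runs are long.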
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967452#comment-15967452 ]

Uwe L. Korn commented on ARROW-300:

Adding methods like RLE or delta encoding brings us very much into the space of Parquet. Given that some of these methods are really fast, it might make sense to support them for IPC. But then I fear that we will end up in a region where there is no clear distinction between Arrow and Parquet anymore.
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963751#comment-15963751 ]

Kazuaki Ishizaki commented on ARROW-300:

Apache Spark currently supports [the following compression schemes|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/CompressionScheme.scala#L66] for in-memory columnar storage. Compressed in-memory columnar storage is used when the DataFrame.cache or Dataset.cache method is executed. Would it be possible to support these schemes in addition to LZ4/(current)DictionaryEncoding?

* RunLengthEncoding: Generic run-length encoding (e.g. 1,1,1,2,2,2,2 -> [3, 1], [4, 2])
* IntDelta: Represents a sequence of ints as a base value followed by byte-sized deltas from the previous value (e.g. 1,3,5,7,10 -> [1, 2, 2, 2, 3])
* LongDelta: Represents a sequence of longs as a base value followed by byte-sized deltas from the previous value (e.g. 1,3,5,7,10 -> [1, 2, 2, 2, 3])
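The two encodings listed above can be sketched in Python as follows. This is a simplified illustration, not Spark's implementation: the function names are made up, the output is plain Python lists rather than a packed byte layout, and the sketch does not enforce the byte-sized limit on deltas that IntDelta and LongDelta rely on.

```python
from itertools import accumulate, groupby
from typing import List, Tuple


def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (run_length, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(values)]


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (run_length, value) pairs back into the original sequence."""
    return [value for count, value in runs for _ in range(count)]


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value as the base, then store each difference from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded: List[int]) -> List[int]:
    """A running sum over [base, delta, delta, ...] restores the original values."""
    return list(accumulate(encoded))


print(rle_encode([1, 1, 1, 2, 2, 2, 2]))   # [(3, 1), (4, 2)]
print(delta_encode([1, 3, 5, 7, 10]))      # [1, 2, 2, 2, 3]
```

Both outputs match the examples given in the comment, with each RLE pair read as (run length, value).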
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15608798#comment-15608798 ]

Uwe L. Korn commented on ARROW-300:

+1. Compression makes sense to me, as does the list of initial algorithms. High compression ratios probably only make sense once you have cross-datacenter traffic.
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15608768#comment-15608768 ]

Wes McKinney commented on ARROW-300:

It may make sense to limit to compressors designed for fast decompression performance: snappy, zstd, lz4. High compression ratios might be less interesting, but I'm interested in more feedback on use cases.