[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2019-10-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959827#comment-16959827
 ] 

Wes McKinney commented on ARROW-300:


There is some discussion ongoing at
https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> It may be useful if data is to be sent over the wire to compress the data
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2019-10-25 Thread Yuan Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959552#comment-16959552
 ] 

Yuan Zhou commented on ARROW-300:
-

Hi [~wesm], thanks for providing the general idea. I'm quite interested in this
feature. Do you happen to have any updates on the detailed proposal?

Cheers, -yuan



[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2018-09-17 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617883#comment-16617883
 ] 

Wes McKinney commented on ARROW-300:


Moving this to 0.12. I will make a proposal for compressed record batches after 
the 0.11 release goes out.

My gut instinct on this would be to create a {{CompressedBuffer}} metadata type 
and a {{CompressedRecordBatch}} message. Some reasons:

* Does not complicate or bloat the existing RecordBatch message type
* Support buffer-level compression (each buffer can be compressed or not)

Readers can choose to materialize right away or on demand. In C++, we can
create an {{arrow::CompressedRecordBatch}} class that does late
materialization, if we want.

This does not necessarily accommodate other kinds of type-specific compression, 
like RLE-encoding, or it might be that RLE can be used on the values buffer of 
primitive types, e.g.

{code}
CompressedBuffer {
  CompressionType type;
  int64 offset;
  int64 compressed_size;
  int64 uncompressed_size;
}
{code}

So if we wanted to use the Parquet RLE_BITPACKED_HYBRID compression style for 
integers, say, we could do that.

Another question here is how to handle compression schemes that may have
additional parameters. {{CompressionType}} or {{Compression}} could be a union,
but that would make the message sizes larger (but maybe that's OK).
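
A minimal Python sketch of the buffer-level idea, under stated assumptions: the field names follow the {{CompressedBuffer}} sketch in this comment, but the {{CompressionType}} values, the helper names {{write_buffers}} and {{read_buffer}}, and the use of zlib are hypothetical and chosen only so the example runs with the standard library. It is not the Arrow metadata or API.

```python
import zlib
from dataclasses import dataclass
from enum import Enum


class CompressionType(Enum):
    # Hypothetical codec identifiers; only stdlib zlib is wired up here.
    NONE = 0
    ZLIB = 1


@dataclass
class CompressedBuffer:
    # Same fields as the metadata sketch in the comment.
    type: CompressionType
    offset: int             # byte offset of the buffer within the body
    compressed_size: int    # bytes the buffer occupies in the body
    uncompressed_size: int  # bytes after decompression


def write_buffers(buffers):
    """Compress each buffer independently; store it raw if that is smaller."""
    body = bytearray()
    metadata = []
    for raw in buffers:
        packed = zlib.compress(raw)
        if len(packed) < len(raw):
            kind, payload = CompressionType.ZLIB, packed
        else:
            kind, payload = CompressionType.NONE, raw
        metadata.append(CompressedBuffer(kind, len(body), len(payload), len(raw)))
        body += payload
    return metadata, bytes(body)


def read_buffer(meta, body):
    """Materialize a single buffer on demand from its metadata entry."""
    chunk = body[meta.offset:meta.offset + meta.compressed_size]
    if meta.type is CompressionType.ZLIB:
        chunk = zlib.decompress(chunk)
    if len(chunk) != meta.uncompressed_size:
        raise ValueError("buffer size does not match metadata")
    return chunk
```

Because each entry carries its own type, offset, and sizes, a reader can decompress one buffer without touching the others, which is what makes late materialization possible.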

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> It may be useful if data is to be sent over the wire to compress the data
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?





[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2018-03-08 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392262#comment-16392262
 ] 

Wes McKinney commented on ARROW-300:


We haven't done any work on this yet. I think the first step would be to 
propose additional metadata (in the Flatbuffers files) for record batches to 
indicate the style of compression. 

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> It may be useful if data is to be sent over the wire to compress the data
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?





[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2018-03-08 Thread Lawrence Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392124#comment-16392124
 ] 

Lawrence Chan commented on ARROW-300:
-

What did we decide on this? IMHO there's still a use case for compressed
arrow files due to the limited storage types in parquet. I don't really love 
the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away 
with compression. My current workaround uses a fixed length byte array but it's 
pretty clunky to do this efficiently, at least in the parquet-cpp 
implementation. There are maybe also some alignment concerns with that latter 
approach that I'm just ignoring right now.

Happy to help, but I'm not familiar enough with the code base to place it in 
the right spot. If we make a branch with some TODOs/placeholders I can probably 
plug in more easily.



[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2017-09-22 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16176938#comment-16176938
 ] 

Wes McKinney commented on ARROW-300:


We have had all of the pieces in place in C++ that we need to do this since
0.6.0. I will propose metadata extensions to support compressed record batches
and a trial implementation.

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
> Fix For: 0.8.0
>
>
> It may be useful if data is to be sent over the wire to compress the data
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?





[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2017-05-12 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008315#comment-16008315
 ] 

Kazuaki Ishizaki commented on ARROW-300:


Thank you for your response. I was also busy preparing materials for GTC.
Now is a good time to write a document.

Preparing a Google document for collecting public comments sounds good. I
will start creating a document covering purpose, scope, and design.

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>
> It may be useful if data is to be sent over the wire to compress the data
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?





[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2017-05-06 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999611#comment-15999611
 ] 

Wes McKinney commented on ARROW-300:


I'm sorry for the delay. With the 0.3 Arrow release done, it would be good to 
make a push on compression and encoding. 

How about we start a Google Document that supports public comments and you can
give edit access to whomever you like? Once we agree on the design, one of us
can make a pull request containing the Flatbuffers metadata for the compression
/ encoding details. Does that sound good?



[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2017-04-19 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975288#comment-15975288
 ] 

Kazuaki Ishizaki commented on ARROW-300:


[~wesmckinn] Thank you for your kind and positive comment. I will work on
preparing a proposal (it may take some time since I also have to prepare a
presentation for GTC).
[~xhochy] IIUC, Parquet is a persistent file format, while Arrow is an
in-memory format.

What level of proposal do you expect? For example,
* What we want to do (e.g. RLE, Delta-encoding)
* New metadata format to support new compression schemes (new .fbs file)
* Data format for new compression schemes
* Prototype implementation
* others

Also, will that proposal be posted into another JIRA entry or a comment in this 
JIRA entry?






[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2017-04-13 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967616#comment-15967616
 ] 

Wes McKinney commented on ARROW-300:


[~kiszk] I agree that having in-memory compression schemes like in Spark is a 
good idea, in addition to simpler snappy/lz4/zlib buffer compression. Would you 
like to make a proposal for improvements to the Arrow metadata to support these 
compression schemes? We should indicate that Arrow implementations are not 
required to implement these in general, so for now they can be marked as 
experimental and optional for implementations (e.g. we wouldn't necessarily 
integration test them). For scan-based in-memory columnar workloads, these 
encodings can yield better scan throughput because of better cache efficiency, 
and many column-oriented databases rely on this to be able to achieve high 
performance, so having it natively in the Arrow libraries seems useful. 
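
As a rough illustration of why such encodings can help scans, here is a small Python sketch that aggregates directly over run-length-encoded data without decoding it. The (count, value) run layout and the function names are assumptions made for this example, not an Arrow or Spark API.

```python
def sum_rle(runs):
    """Sum a column stored as (count, value) runs without expanding it."""
    return sum(count * value for count, value in runs)


def count_equal_rle(runs, target):
    """Count rows equal to target by visiting each run once."""
    return sum(count for count, value in runs if value == target)


# Seven logical rows (1,1,1,2,2,2,2) are stored as two runs.
runs = [(3, 1), (4, 2)]
total = sum_rle(runs)
matches = count_equal_rle(runs, 2)
```

The scan touches one entry per run rather than one per row, so less memory is read for the same logical result when the data has long runs.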



[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2017-04-13 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967452#comment-15967452
 ] 

Uwe L. Korn commented on ARROW-300:
---

Adding methods like RLE or delta encoding brings us very much into the space of
Parquet. Given that some of these methods are really fast, it might make sense
to support them for IPC. But then I fear that we will end up in a region
where there is no clear distinction between Arrow and Parquet anymore.



[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2017-04-10 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963751#comment-15963751
 ] 

Kazuaki Ishizaki commented on ARROW-300:


Apache Spark currently supports [the following compression
schemes|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/CompressionScheme.scala#L66]
for in-memory columnar storage. Currently, compressed in-memory columnar
storage is used when the DataFrame.cache or Dataset.cache method is called.
Would it be possible to support these schemes in addition to
LZ4/(current) DictionaryEncoding?

* RunLengthEncoding: Generic run-length encoding (e.g. 1,1,1,2,2,2,2 -> [3, 1], 
[4, 2])
* IntDelta: Represent a sequence as a base value followed by byte deltas from
the previous value. (e.g. 1,3,5,7,10 -> [1, 2, 2, 2, 3])
* LongDelta: Represent a sequence as a base value followed by byte deltas from
the previous value. (e.g. 1,3,5,7,10 -> [1, 2, 2, 2, 3])
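
The examples in the list above can be reproduced with a short Python sketch. The function names are hypothetical, and the sketch ignores the byte-width limit on deltas that a real implementation would need to handle; it only shows the transformations themselves.

```python
from itertools import accumulate, groupby


def rle_encode(values):
    """Collapse consecutive repeats into [count, value] pairs."""
    return [[len(list(group)), value] for value, group in groupby(values)]


def rle_decode(runs):
    """Expand [count, value] pairs back into the original sequence."""
    return [value for count, value in runs for _ in range(count)]


def delta_encode(values):
    """Keep the first value as the base, then store each difference."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """A running sum over the base and deltas restores the sequence."""
    return list(accumulate(deltas))


print(rle_encode([1, 1, 1, 2, 2, 2, 2]))  # [[3, 1], [4, 2]]
print(delta_encode([1, 3, 5, 7, 10]))     # [1, 2, 2, 2, 3]
```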




[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2016-10-26 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15608798#comment-15608798
 ] 

Uwe L. Korn commented on ARROW-300:
---

+1 Compression makes sense to me and also the list of initial algorithms. High 
compression ratios probably only make sense once you have cross-datacenter 
traffic.



[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2016-10-26 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15608768#comment-15608768
 ] 

Wes McKinney commented on ARROW-300:


It may make sense to limit to compressors designed for fast decompression 
performance: snappy, zstd, lz4. High compression ratios might be less 
interesting, but I'm interested in more feedback on use cases. 
