[jira] [Comment Edited] (ARROW-300) [Format] Add buffer compression option to IPC file format

2018-03-08 Thread Lawrence Chan (JIRA)

[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392124#comment-16392124 ]

Lawrence Chan edited comment on ARROW-300 at 3/9/18 2:09 AM:
-------------------------------------------------------------

What did we decide with this? IMHO there's still a use case for compressed
Arrow files due to the limited storage types in Parquet. I don't really love
the idea of storing 8-bit or 16-bit ints in an INT32 and hand-waving it away
with compression. I tried to hack it up with FixedLenByteArray, but there's a
slew of complications with that, not to mention alignment concerns etc.

Anyway, I'm happy to help with this, but I'm not familiar enough with the code
base to place it in the right spot. If we make a branch with some
TODOs/placeholders, I can probably plug in more easily.


was (Author: llchan):
What did we decide with this? IMHO there's still a use case for compressed
Arrow files due to the limited storage types in Parquet. I don't really love
the idea of storing 8-bit or 16-bit ints in an INT32 and hand-waving it away
with compression.

Happy to help, but I'm not familiar enough with the code base to place it in 
the right spot. If we make a branch with some TODOs/placeholders I can probably 
plug in more easily.

> [Format] Add buffer compression option to IPC file format
> ---------------------------------------------------------
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> It may be useful, if data is to be sent over the wire, to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple, with a global buffer 
> compression setting in the file Footer. Probably the only two compressors 
> worth supporting out of the box would be zlib (higher compression ratios) and 
> lz4 (better performance).
> What does everyone think?
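A minimal sketch of what such a Footer-level setting could look like, using Python's stdlib zlib as a stand-in codec. The codec registry, function names, and framing below are hypothetical illustrations of the idea, not Arrow's actual wire format:

```python
import zlib

# Hypothetical sketch: one global codec choice (as the proposed Footer
# setting would record) applied to every buffer independently, so a
# reader can still locate and decompress individual buffers.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    # "lz4" would plug in here the same way, via a third-party binding.
}

def write_buffers(buffers, codec="zlib"):
    compress, _ = CODECS[codec]
    return codec, [compress(b) for b in buffers]

def read_buffers(codec, compressed):
    _, decompress = CODECS[codec]
    return [decompress(b) for b in compressed]

bufs = [bytes(1024), b"\x01\x02\x03\x04" * 256]  # two toy buffers
codec, payload = write_buffers(bufs)
assert read_buffers(codec, payload) == bufs
assert len(payload[0]) < len(bufs[0])  # a zero-filled buffer compresses well
```

Keeping the setting global (rather than per-buffer) keeps readers simple: one codec lookup per file instead of per-buffer metadata.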



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-300) [Format] Add buffer compression option to IPC file format

2016-11-15 Thread Uwe L. Korn (JIRA)

[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15667983#comment-15667983 ]

Uwe L. Korn edited comment on ARROW-300 at 11/15/16 7:17 PM:
-------------------------------------------------------------

I'm not so sure about the benefit of a compressed Arrow file format. For me the 
main distinction is that Parquet provides efficient storage (with the tradeoff 
of not being able to randomly access a single row) while Arrow provides random 
access, both for columnar data.

The one point where I see an Arrow file format as beneficial is where you need 
random access to its data but cannot load it fully into RAM, so you instead use 
a memory-mapped file. If you add compression (either column-wise or at the 
whole-file level), you can no longer memory-map it.
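The memory-mapping argument can be made concrete with a small stdlib sketch (plain Python, not Arrow code): with a fixed-width, uncompressed on-disk buffer, row i sits at a computable offset and can be read via mmap without loading the whole file, and that O(1) addressability is exactly what per-buffer compression forfeits.

```python
import mmap
import os
import struct
import tempfile

# Write an uncompressed int64 "column" to disk.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(struct.pack("<1000q", *range(1000)))

# Random access via mmap: row i lives at offset i * 8, no full read needed.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (value,) = struct.unpack_from("<q", mm, 123 * 8)  # O(1) lookup
    mm.close()
os.remove(path)

assert value == 123  # compressed buffers would have no such fixed offsets
```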

The only case in which I can see per-column compression for Arrow batches 
beating compression at the whole-batch level is if it actually produces better 
compression behaviour. Compression on a per-column basis can be parallelised 
independently of the underlying algorithm, leading to better CPU usage. 
Furthermore, compression may be more effective at the column level (given a 
sufficient number of rows), since the data inside a column is very similar, 
leading to smaller compression dictionaries and better compression ratios in 
the end. Both points are just assumptions that should be tested before being 
implemented.
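One of those assumptions, that homogeneous column-contiguous data compresses better than the same values interleaved row by row, can be sanity-checked with only the standard library. This is a toy illustration with synthetic data, not Arrow code:

```python
import random
import zlib

rng = random.Random(0)
n_rows = 20000
# A very regular column (one repeated 4-byte value) next to a
# high-entropy payload column of 16 bytes per row.
col_ids = b"\x2a\x00\x00\x00" * n_rows
col_blob = bytes(rng.getrandbits(8) for _ in range(16 * n_rows))

# Compress each column buffer independently...
per_column = len(zlib.compress(col_ids)) + len(zlib.compress(col_blob))

# ...versus compressing the same values interleaved row by row.
rows = b"".join(
    col_ids[4 * i:4 * i + 4] + col_blob[16 * i:16 * i + 16]
    for i in range(n_rows)
)
row_wise = len(zlib.compress(rows))

# The regular column compresses to almost nothing on its own, while the
# interleaved layout breaks up its runs with incompressible bytes.
assert per_column < row_wise
```

This only tests one codec and one synthetic distribution; as the comment says, the parallelism and ratio claims would need real benchmarks before being built into the format.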


was (Author: xhochy):
Given my latest (sadly internal) performance tests, I'm not so sure about the 
benefit of a compressed Arrow file format. For me the main distinction is that 
Parquet provides efficient storage (with the tradeoff of not being able to 
randomly access a single row) while Arrow provides random access, both for 
columnar data.

The one point where I see an Arrow file format as beneficial is where you need 
random access to its data but cannot load it fully into RAM, so you instead use 
a memory-mapped file. If you add compression (either column-wise or at the 
whole-file level), you can no longer memory-map it.

The only case in which I can see per-column compression for Arrow batches 
beating compression at the whole-batch level is if it actually produces better 
compression behaviour. Compression on a per-column basis can be parallelised 
independently of the underlying algorithm, leading to better CPU usage. 
Furthermore, compression may be more effective at the column level (given a 
sufficient number of rows), since the data inside a column is very similar, 
leading to smaller compression dictionaries and better compression ratios in 
the end. Both points are just assumptions that should be tested before being 
implemented.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)