Re: [DISCUSS] Format additions for encoding/compression

2020-01-24 Thread Micah Kornfield
Great, John. I'd be interested to hear about progress. Also, IMO we should focus only on encodings that have the potential to be exploited for computational benefits (not just compressibility). I think this is what distinguishes Arrow from other formats like Parquet. I think this ech
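A minimal sketch (not from the thread) of the kind of computational benefit Micah is pointing at, using dictionary encoding as the example: an equality predicate is evaluated once against the small dictionary, after which the scan touches only integer codes. All names here are illustrative.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Illustrative dictionary-encoded string column.
    struct DictColumn {
      std::vector<std::string> dictionary;  // distinct values
      std::vector<int32_t> codes;           // one index per row
    };

    // Count rows equal to `needle` without decoding any row to a string.
    size_t CountEqual(const DictColumn& col, const std::string& needle) {
      int32_t match = -1;
      for (size_t i = 0; i < col.dictionary.size(); ++i) {
        if (col.dictionary[i] == needle) { match = static_cast<int32_t>(i); break; }
      }
      if (match < 0) return 0;  // value absent from dictionary: nothing matches
      size_t n = 0;
      for (int32_t c : col.codes) n += (c == match);
      return n;
    }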

Re: [DISCUSS] Format additions for encoding/compression

2020-01-24 Thread John Muehlhausen
Thanks Micah, I will see if I can find some time to explore this further. On Thu, Jan 23, 2020 at 10:56 PM Micah Kornfield wrote: > Hi John, > Not Wes, but my thoughts on this are as follows: > > 1. Alternate bit/byte arrangements can also be useful for processing [1] in > addition to compressio

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Micah Kornfield
Hi John, Not Wes, but my thoughts on this are as follows: 1. Alternate bit/byte arrangements can also be useful for processing [1] in addition to compression. 2. I think they are quite a bit more complicated than the existing schemes proposed in [2], so I think it would be more expedient to get th

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread John Muehlhausen
Wes, what do you think about Arrow supporting a new suite of fixed-length data types that unshuffle on column->Value(i) calls? This would allow memory/swap compressors and memory maps backed by compressing filesystems (ZFS) or block devices (VDO) to operate more efficiently. By doing it with new
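A sketch of what John describes, with a hypothetical layout (this is not an existing Arrow type): values are stored byte-shuffled so that compressing filesystems and swap compressors see long runs of similar bytes, and each Value(i) call gathers the bytes back on access.

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Hypothetical fixed-length column that stores doubles byte-shuffled:
    // all byte-0s first, then all byte-1s, and so on.
    class ShuffledDoubleColumn {
     public:
      explicit ShuffledDoubleColumn(const std::vector<double>& values)
          : n_(values.size()), planes_(sizeof(double) * values.size()) {
        const uint8_t* src = reinterpret_cast<const uint8_t*>(values.data());
        for (size_t i = 0; i < n_; ++i)
          for (size_t b = 0; b < sizeof(double); ++b)
            planes_[b * n_ + i] = src[i * sizeof(double) + b];  // shuffle
      }

      // The unshuffle-on-access John proposes for column->Value(i):
      // gather one byte from each plane and reassemble the value.
      double Value(size_t i) const {
        uint8_t bytes[sizeof(double)];
        for (size_t b = 0; b < sizeof(double); ++b)
          bytes[b] = planes_[b * n_ + i];
        double out;
        std::memcpy(&out, bytes, sizeof(double));
        return out;
      }

     private:
      size_t n_;
      std::vector<uint8_t> planes_;  // the buffer a compressor or mmap would see
    };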

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Wes McKinney
On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen wrote: > > Again, I know very little about Parquet, so your patience is appreciated. > > At the moment I can Arrow/mmap a file without having anywhere near as > much available memory as the file size. I can visit random places in the > file (such

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread John Muehlhausen
Again, I know very little about Parquet, so your patience is appreciated. At the moment I can Arrow/mmap a file without having anywhere near as much available memory as the file size. I can visit random places in the file (such as a binary search if it is ordered) and only the locations visited
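A minimal sketch of that access pattern (plain POSIX, no Arrow API): mmap a file of sorted int64s and binary-search it; only the O(log n) pages the search touches are faulted in, so the file can be far larger than available memory.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char** argv) {
      if (argc != 3) { fprintf(stderr, "usage: %s file needle\n", argv[0]); return 1; }
      int fd = open(argv[1], O_RDONLY);
      if (fd < 0) { perror("open"); return 1; }
      struct stat st;
      if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
      size_t n = st.st_size / sizeof(int64_t);
      void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
      if (p == MAP_FAILED) { perror("mmap"); return 1; }
      const int64_t* data = static_cast<const int64_t*>(p);

      int64_t needle = std::atoll(argv[2]);
      size_t lo = 0, hi = n;           // classic lower_bound: pages are
      while (lo < hi) {                // faulted in only as they are visited
        size_t mid = lo + (hi - lo) / 2;
        if (data[mid] < needle) lo = mid + 1; else hi = mid;
      }
      printf("lower_bound index: %zu\n", lo);
      munmap(p, st.st_size);
      close(fd);
      return 0;
    }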

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Wes McKinney
Parquet is most relevant in scenarios where filesystem IO is constrained (spinning rust HDD, network FS, cloud storage / S3 / GCS). For those use cases memory-mapped Arrow is not viable. Against local NVMe (> 2000 MB/s read throughput) your mileage may vary. On Thu, Jan 23, 2020 at 12:06 PM Francois Sa

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Francois Saint-Jacques
What's the point of having zero copy if the OS is doing the decompression in the kernel (which trumps the zero-copy argument)? You might as well just use Parquet without filesystem compression. I prefer to have a compression algorithm the columnar engine can benefit from [1] rather than marginally impr

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread John Muehlhausen
This could also have utility in memory via things like zram/zswap, right? Mac also has a memory compressor? I don't think Parquet is an option for me unless the integration with Arrow is tighter than I imagine (i.e. zero-copy). That said, I confess I know next to nothing about Parquet. On Thu, J

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Antoine Pitrou
Forgot to give the URL: https://github.com/apache/arrow/pull/6005 Regards, Antoine. On 23/01/2020 at 18:23, Antoine Pitrou wrote: > > On 23/01/2020 at 18:16, John Muehlhausen wrote: >> Perhaps related to this thread, are there any current or proposed tools to >> transform columns for fixe

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Antoine Pitrou
On 23/01/2020 at 18:16, John Muehlhausen wrote: > Perhaps related to this thread, are there any current or proposed tools to > transform columns for fixed-length data types according to a "shuffle"? > For precedent see the implementation of the shuffle filter in HDF5. > https://support.hdfgrou

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2020-01-23 Thread John Muehlhausen
Perhaps related to this thread, are there any current or proposed tools to transform columns for fixed-length data types according to a "shuffle"? For precedent see the implementation of the shuffle filter in HDF5. https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-alg
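For reference, the shuffle filter itself is a simple byte regrouping; a sketch (illustrative, not the HDF5 source):

    #include <cstdint>
    #include <vector>

    // Regroup bytes by significance: all byte-0s, then all byte-1s, ...
    // An array of small little-endian int32s then yields three nearly
    // all-zero byte planes -- long runs that generic compressors handle well.
    std::vector<uint8_t> Shuffle(const uint8_t* src, size_t n_values, size_t width) {
      std::vector<uint8_t> dst(n_values * width);
      for (size_t i = 0; i < n_values; ++i)
        for (size_t b = 0; b < width; ++b)
          dst[b * n_values + i] = src[i * width + b];
      return dst;
    }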

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-08-25 Thread Micah Kornfield
Hi Ippokratis, Thank you for the feedback; I have some questions based on the links you provided. > I think that lightweight encodings (like the FrameOfReference Micah > suggests) do make a lot of sense for Arrow. There are a few implementations > of those in commercial systems. One related paper

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-08-25 Thread Ippokratis Pandis
I think that lightweight encodings (like the FrameOfReference encoding Micah suggests) do make a lot of sense for Arrow. There are a few implementations of those in commercial systems. One related paper in the literature is http://www.cs.columbia.edu/~orestis/damon15.pdf I would actually also look into som
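A minimal frame-of-reference sketch (illustrative names and widths, not a proposed Arrow layout): the block minimum is stored once and each value becomes a narrow delta. Decoding is a single add, and a predicate such as v >= x can be rewritten as delta >= x - reference without materializing values.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Assumes a non-empty block whose deltas fit in 16 bits.
    struct ForBlock {
      int64_t reference;             // block minimum
      std::vector<uint16_t> deltas;  // value - reference
    };

    ForBlock Encode(const std::vector<int64_t>& values) {
      ForBlock b;
      b.reference = *std::min_element(values.begin(), values.end());
      b.deltas.reserve(values.size());
      for (int64_t v : values)
        b.deltas.push_back(static_cast<uint16_t>(v - b.reference));
      return b;
    }

    int64_t Decode(const ForBlock& b, size_t i) {
      return b.reference + static_cast<int64_t>(b.deltas[i]);
    }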

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-25 Thread Micah Kornfield
> > It's not just computation libraries, it's any library peeking inside > Arrow data. Currently, the Arrow data types are simple, which makes it > easy and non-intimidating to build data processing utilities around > them. If we start adding sophisticated encodings, we also raise the > cost of s

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-22 Thread Antoine Pitrou
On Mon, 22 Jul 2019 08:40:08 -0700 Brian Hulette wrote: > To me, the most important aspect of this proposal is the addition of sparse > encodings, and I'm curious if there are any more objections to that > specifically. So far I believe the only one is that it will make > computation libraries mor

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-22 Thread Brian Hulette
To me, the most important aspect of this proposal is the addition of sparse encodings, and I'm curious if there are any more objections to that specifically. So far I believe the only one is that it will make computation libraries more complicated. This is absolutely true, but I think it's worth th

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-13 Thread Wes McKinney
On Sat, Jul 13, 2019 at 11:23 AM Antoine Pitrou wrote: > > On Fri, 12 Jul 2019 20:37:15 -0700 > Micah Kornfield wrote: > > > > If the latter, I wonder why Parquet cannot simply be used instead of > > > reinventing something similar but different. > > > > This is a reasonable point. However there

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-13 Thread Antoine Pitrou
On Fri, 12 Jul 2019 20:37:15 -0700 Micah Kornfield wrote: > > If the latter, I wonder why Parquet cannot simply be used instead of > > reinventing something similar but different. > > This is a reasonable point. However there is continuum here between file > size and read and write times. P

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-12 Thread Micah Kornfield
Hi Antoine, I think Liya Fan raised some good points in his reply but I'd like to answer your questions directly. > So the question is whether this really needs to be in the in-memory > format, i.e. is it desired to operate directly on this compressed > format, or is it solely for transport? I t

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-12 Thread Fan Liya
@Antoine Pitrou, Good question. I think the answer depends on the concrete encoding scheme. For some encoding schemes, it is not a good idea to use them for in-memory data compression. For others, it is beneficial to operate directly on the compressed data. For example, it is beneficial to dire
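Run-length encoding is the usual illustration of that point; a sketch (not from the thread): a column of n rows stored as (value, run length) pairs can be aggregated in time proportional to the number of runs, never touching decompressed data.

    #include <cstdint>
    #include <vector>

    struct Run { int64_t value; uint32_t length; };  // one entry per run

    // Sum all logical rows directly on the compressed representation.
    int64_t SumRle(const std::vector<Run>& runs) {
      int64_t sum = 0;
      for (const Run& r : runs)
        sum += r.value * static_cast<int64_t>(r.length);
      return sum;
    }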

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-12 Thread Antoine Pitrou
On 12/07/2019 at 10:08, Micah Kornfield wrote: > OK, I've created a separate thread for data integrity/digests [1], and > retitled this thread to continue the discussion on compression and > encodings. As a reminder, the PR for the format additions [2] suggested a > new SparseRecordBatch that w

[DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-12 Thread Micah Kornfield
OK, I've created a separate thread for data integrity/digests [1], and retitled this thread to continue the discussion on compression and encodings. As a reminder, the PR for the format additions [2] suggested a new SparseRecordBatch that would allow for the following features: 1. Different data e
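For a rough intuition of the sparse side of the proposal, here is a hypothetical sparse column (illustrative only; not the actual SparseRecordBatch definition in the PR): a mostly-null column keeps just the populated positions and their values rather than a dense values buffer plus validity bitmap.

    #include <algorithm>
    #include <cstdint>
    #include <optional>
    #include <vector>

    struct SparseInt64Column {
      int64_t length;                // logical row count
      std::vector<int64_t> indices;  // sorted positions of non-null rows
      std::vector<int64_t> values;   // values[i] belongs to row indices[i]
    };

    // Null rows cost nothing; lookup is a binary search over the indices.
    std::optional<int64_t> Value(const SparseInt64Column& col, int64_t row) {
      auto it = std::lower_bound(col.indices.begin(), col.indices.end(), row);
      if (it == col.indices.end() || *it != row) return std::nullopt;  // null
      return col.values[it - col.indices.begin()];
    }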