Re: Current status of Data Page V2?

2020-10-26 Thread Gabor Szadovszky
Hi Micah, V2 pages are not only about the new encodings but a couple of other things the community thought would be better than V1. One of these improvements was to break pages at row boundaries. This one is outdated because during the development of column-indexes we had to implement the same

Re: Current status of Data Page V2?

2020-10-22 Thread Micah Kornfield
Hi Gabor, > It is still not clear to me if we want to recommend V2 for production use > at all Again, I'm missing context here, but what is blocking V2 for production use? Is it specification finalization, implementation finalization? Something else? or simply introduce the new encodings for

Re: Current status of Data Page V2?

2020-10-22 Thread Gabor Szadovszky
It is still not clear to me if we want to recommend V2 for production use at all or simply introduce the new encodings for V1. I would suggest discussing this topic on the parquet sync next Tuesday. On Thu, Oct 22, 2020 at 6:04 AM Micah Kornfield wrote: > I've created

Re: Current status of Data Page V2?

2020-10-21 Thread Micah Kornfield
I've created https://github.com/apache/parquet-format/pull/163 to try to document these (note I really don't have historical context here so please review carefully). I would appreciate it if someone could point me to a reference on the current status of V2. What is left unsettled? When

Re: Current status of Data Page V2?

2020-10-13 Thread Micah Kornfield
> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of confusion between the v1/v2 pages and the parquet-mr releases. Maybe the parquet-format releases are also part of it.
+1 to the confusion part. The reason why I originally started this thread is that none of this

Re: Current status of Data Page V2?

2020-10-13 Thread Gabor Szadovszky
I am not sure 2.0 means the v2 pages here. I think there was/is a bit of confusion between the v1/v2 pages and the parquet-mr releases. Maybe the parquet-format releases are also part of it. In this table many features are not related to the pages so I don't think the "Expected release" meant the

Re: Current status of Data Page V2?

2020-10-12 Thread Ryan Blue
I remembered that there used to be a table. Looks like it was removed: https://github.com/apache/parquet-mr/commit/aed9097640c7adffe1151b32e86b5efc3702c657?short_path=b335630#diff-04c6e90faac2675aa89e2176d2eec7d8 The table used to list delta as a 2.0 feature. On Mon, Oct 12, 2020 at 1:38 AM

Re: Current status of Data Page V2?

2020-10-12 Thread Gabor Szadovszky
The answer I wrote in the other thread was based on the current code, so that is how parquet-mr works now. That does not mean, though, that this is how it should work or how it works in other implementations. Unfortunately, the spec does not say anything about v1 and v2 in the context of encodings. Meanwhile,

Re: Current status of Data Page V2?

2020-10-09 Thread Jacques Nadeau
Gabor seems to agree that delta is V2 only.
> To summarize, no delta encodings are used for V1 pages. They are available for V2 only.
https://www.mail-archive.com/dev@parquet.apache.org/msg11826.html
On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau wrote: > Good point. I had mentally

Re: Current status of Data Page V2?

2020-10-09 Thread Jacques Nadeau
Good point. I had mentally categorized this as V2, not based on the docs? I don't think most tools write this but I can't see anywhere that it says it is limited to v2 readers/writers. I'm not sure how many tools vectorize read it versus delegate to the legacy mr path (at least in java), either.

Re: Current status of Data Page V2?

2020-10-09 Thread Micah Kornfield
> The big win in v2 pages (if I remember correctly) is that the variable length encoding is no longer interleaved. That would provide a big performance lift when pulling into arrow vectors (and variable length decoding typically dominates total read processing time, on average I've seen

Re: Current status of Data Page V2?

2020-10-09 Thread Jacques Nadeau
The big win in v2 pages (if I remember correctly) is that the variable length encoding is no longer interleaved. That would provide a big performance lift when pulling into arrow vectors (and variable length decoding typically dominates total read processing time, on average I've seen 5-10x per

Re: Current status of Data Page V2?

2020-10-08 Thread Micah Kornfield
Thanks for the quick reply Ryan.
> We only use v1 and it still works well. That said, I'd love to make some progress on better encodings and finalizing v2 so we can use them!
Are there JIRAs or other documentation that is tracking this work? Thanks, Micah On Thu, Oct 8, 2020 at 12:55 PM

Re: Current status of Data Page V2?

2020-10-08 Thread Ryan Blue
While there isn't anything wrong with it, the same challenges have been solved in different ways with v1 pages. The main difference is that v2 pages are broken at record boundaries, and v1 pages weren't guaranteed to be. But, in order to write page indexes near the footer, breaking pages at record

Current status of Data Page V2?

2020-10-08 Thread Micah Kornfield
What is the current status of support for Data Page V2? Is it recommended for production workloads? Thanks, Micah