Hi Micah,
V2 pages are not only about the new encodings but also about a couple of
other things that the community thought would be better than in V1.
One of these improvements was to break pages at row boundaries. This one is
outdated because during the development of column-indexes we had to
implement the same
Hi Gabor,
> It is still not clear to me if we want to recommend V2 for production use
> at all
Again, I'm missing context here, but what is blocking V2 for production
use? Is it specification finalization, implementation finalization?
Something else?
It is still not clear to me if we want to recommend V2 for production use
at all or simply introduce the new encodings for V1. I would suggest
discussing this topic on the parquet sync next Tuesday.
On Thu, Oct 22, 2020 at 6:04 AM Micah Kornfield
wrote:
I've created https://github.com/apache/parquet-format/pull/163 to try to
document these (note I really don't have historical context here so please
review carefully).
I would appreciate it if someone could point me to a reference on what the
current status of V2 is? What is left unsettled? When
>
> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
> parquet-format releases are also part of it.
+1 to the confusion part. The reason why I originally started this thread
is that none of this
I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
parquet-format releases are also part of it.
In this table many features are not related to the pages so I don't think
the "Expected release" meant the
I remembered that there used to be a table. Looks like it was removed:
https://github.com/apache/parquet-mr/commit/aed9097640c7adffe1151b32e86b5efc3702c657?short_path=b335630#diff-04c6e90faac2675aa89e2176d2eec7d8
The table used to list delta as a 2.0 feature.
On Mon, Oct 12, 2020 at 1:38 AM
That answer I wrote to the other thread was based on the current code, so
that is how parquet-mr works now. It does not say how it should work, or
how it works in other implementations. Unfortunately, the spec does
not say anything about v1 and v2 in the context of encodings.
Meanwhile,
Gabor seems to agree that delta is V2 only.
> To summarize, no delta encodings are used for V1 pages. They are available
> for V2 only.
https://www.mail-archive.com/dev@parquet.apache.org/msg11826.html
On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau wrote:
Good point. I had mentally categorized this as V2, not based on the docs?
I don't think most tools write this but I can't see anywhere that it says
it is limited to v2 readers/writers. I'm also not sure how many tools read it
with a vectorized path versus delegating to the legacy mr path (at least in
Java).
The big win in v2 pages (if I remember correctly) is that the variable
length encoding is no longer interleaved. That would provide a big
performance lift when pulling into arrow vectors (and variable length
decoding typically dominates total read processing time, on average I've
seen 5-10x per
Thanks for the quick reply Ryan.
> We only use v1 and it still works well. That said, I'd love to make some
> progress on better encodings and finalizing v2 so we can use them!
Are there JIRAs or other documentation that is tracking this work?
Thanks,
Micah
On Thu, Oct 8, 2020 at 12:55 PM
While there isn't anything wrong with it, the same challenges have been
solved in different ways with v1 pages. The main difference is that v2
pages are broken at record boundaries, and v1 pages weren't guaranteed to
be. But, in order to write page indexes near the footer, breaking pages at
record
What is the current status of support for Data Page V2? Is it recommended
for production workloads?
Thanks,
Micah