In case anyone else is interested, I think the relevant parts of
parquet.thrift are [1] and [2].
I agree with Gang's interpretation: `num_nulls` in `DataPageHeaderV2` is
required, whereas `null_count` in `Statistics` is optional.
Since `Statistics` is used in other places (e.g. `ColumnMetaData` [3]), I
don't think we could make `null_count` required there (not that you were
proposing this).
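
To make the required/optional distinction concrete, here is a minimal
Python sketch. The dataclasses only mirror the two thrift fields under
discussion and are not generated bindings; `page_null_count` is a
hypothetical helper, and the consistency check assumes (as this thread
concludes) that both fields carry the same meaning.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Statistics:
    # Optional in parquet.thrift, so a writer may leave it unset.
    null_count: Optional[int] = None


@dataclass
class DataPageHeaderV2:
    num_values: int
    # Required in parquet.thrift, so it is always present in a V2 header.
    num_nulls: int
    # The whole Statistics struct is itself optional in the page header.
    statistics: Optional[Statistics] = None


def page_null_count(header: DataPageHeaderV2) -> int:
    """Hypothetical helper: return the page's null count.

    Uses the required `num_nulls` and, when the optional statistics value
    is also present, checks that the two agree.
    """
    stats = header.statistics
    if stats is not None and stats.null_count is not None:
        if stats.null_count != header.num_nulls:
            raise ValueError(
                f"num_nulls={header.num_nulls} disagrees with "
                f"statistics.null_count={stats.null_count}"
            )
    return header.num_nulls
```

A reader written this way never depends on the optional field, and only
uses it as a cross-check when a writer happened to populate it.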
A bit off topic, but I think including Statistics in page headers is of
limited use in general: to read them you need to have already fetched the
page, so by the time you have the page header the amount of work that can
still be skipped is often pretty low. A better way is to include the
statistics in the ColumnIndex [4], which can be fetched independently and
then used to skip many pages at once.
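
As a rough illustration of that kind of skipping, here is a small Python
sketch. `ColumnIndexSketch` and `pages_to_read` are hypothetical names; the
class is a simplification that keeps only per-page `null_pages`,
`min_values` and `max_values`, and it assumes an integer column with
already-decoded bounds rather than the binary-encoded values stored in the
real struct.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnIndexSketch:
    # One entry per data page in the column chunk.
    null_pages: List[bool]   # True if the page contains only nulls
    min_values: List[int]    # decoded per-page minimum (simplified)
    max_values: List[int]    # decoded per-page maximum (simplified)


def pages_to_read(index: ColumnIndexSketch, lo: int, hi: int) -> List[int]:
    """Return ordinals of pages that may hold a value in [lo, hi]."""
    keep = []
    for i, all_null in enumerate(index.null_pages):
        if all_null:
            continue  # an all-null page cannot match a range predicate
        if index.max_values[i] < lo or index.min_values[i] > hi:
            continue  # the page's [min, max] does not overlap [lo, hi]
        keep.append(i)
    return keep


index = ColumnIndexSketch(
    null_pages=[False, False, True, False],
    min_values=[0, 100, 0, 200],
    max_values=[99, 199, 0, 299],
)
print(pages_to_read(index, 150, 250))  # prints [1, 3]
```

The point is that the decision is made for every page from one small
structure, before any page header or page body has been fetched.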
Andrew
[1]:
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L724
[2]:
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L291
[3]:
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L912
[4]:
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L1163
On Thu, Oct 9, 2025 at 2:15 AM Gang Wu wrote:
> I think you're right.
>
> The only difference is that statistics is optional but the field in the
> header is required.
>
> Best,
> Gang
>
> On Wed, Oct 8, 2025 at 8:12 PM Antoine Pitrou wrote:
>
> >
> > Hello,
> >
> > It seems a V2 data page can have its number of nulls recorded in two
> > adjacent locations:
> > 1. the `num_nulls` field in `DataPageHeaderV2`
> > 2. the `null_count` field in `DataPageHeaderV2.statistics`
> >
> > Is this interpretation right? Or do those two fields actually have
> > different semantics.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>