Re: [DISCUSS] flatbuf footer

2026-02-08 Thread Alkis Evlogimenos via dev
Thank you, Micah. I will follow up on the PR.



Re: [DISCUSS] flatbuf footer

2026-02-08 Thread Micah Kornfield
Just wanted to follow up. I did a first-pass review of the
flatbuf definitions.

Cheers,
Micah



Re: [DISCUSS] flatbuf footer

2025-12-11 Thread Alkis Evlogimenos via dev
PR for linking proposal here:
https://github.com/apache/parquet-format/pull/543
PR for parquet footer flatbuf definition:
https://github.com/apache/parquet-format/pull/544



Re: [DISCUSS] flatbuf footer

2025-12-08 Thread Julien Le Dem
Hello Alkis,
Do you think you could add your footer proposal to the proposals page?

https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
That way it gets more visibility.
Cheers
Julien

On Tue, Oct 21, 2025 at 11:49 AM Steve Loughran wrote:

> On Mon, 20 Oct 2025 at 18:24, Ed Seidl wrote:
>
> > IIUC a flatbuffer aware decoder would read the last 36 bytes or so of the
> > file and look for a known UUID along with size information. With this it
> > could then read only the flatbuffer bytes. I think this would work as well
> > as current systems that prefetch some number of bytes in an attempt to get
> > the whole footer in a single get.
> >
> > Old readers, however, will have to fetch both footers, but won't have any
> > additional decoding work because the new footer is a binary field that can
> > be easily skipped.
> >
>
> really depends what the readers do with footer prefetching. For the java
> clients
>
>    1. s3a classic stream: the backwards seek() switches it to random IO
>    mode, next read() from base of thrift will pull in
>    fs.s3a.readahead.range of data. No penalty
>    2. google gs://. There's a footer cache option which will need to be set
>    to a larger value
>    3. azure abfs:// there's a footer cache option which will need to be set
>    to a larger value
>    4. s3a + amazon analytics stream. This stream is *parquet aware* and
>    actually parses the footer to know what to predictively prefetch. The
>    AWS developers do know of this work -moving to support the new footer
>    would be the ideal strategy here.
>    5. Iceberg classic input. no idea.
>    6. iceberg + amazon analytics. same as S3A though without some of the
>    tuning we've been doing for vector reads.
>
> I wouldn't worry too much about the impact of that footer size increase.
> Some extra footer prefetch options should compensate, and once apps move to
> a parquet v3 reader they've got a faster parse time. Of course, ironically,
> read time then may dominate even more there -it'll be important to do that
> read as efficiently as possible (use a readFully() into a buffer, not lots
> of single byte read() calls)
>
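
As a rough illustration of the tail read Ed describes and the single-buffer
read Steve recommends, here is a minimal Python sketch. The marker value, the
20-byte tail layout (a u32 length followed by a 16-byte marker), and the
function names are all hypothetical and chosen only for this example; the
real layout is whatever the proposal ends up specifying.

```python
import io
import struct
import uuid

# Hypothetical marker and tail layout, for illustration only:
#   [flatbuf footer bytes][u32 little-endian footer length][16-byte marker]
MARKER = uuid.UUID("00000000-0000-0000-0000-000000000001").bytes
TAIL_LEN = 4 + len(MARKER)


def read_exact(f, n):
    """Fill one buffer of n bytes (the readFully() idea), not byte-by-byte."""
    buf = bytearray(n)
    view = memoryview(buf)
    got = 0
    while got < n:
        k = f.readinto(view[got:])
        if not k:
            raise EOFError(f"wanted {n} bytes, got {got}")
        got += k
    return bytes(buf)


def read_flatbuf_footer(f, prefetch=64 * 1024):
    """Return the footer bytes, or None if the marker is absent."""
    size = f.seek(0, io.SEEK_END)
    if size < TAIL_LEN:
        return None
    # One speculative read of the file tail, as prefetching readers do today.
    start = max(0, size - prefetch)
    f.seek(start)
    tail = read_exact(f, size - start)
    if tail[-len(MARKER):] != MARKER:
        return None
    (footer_len,) = struct.unpack_from("<I", tail, len(tail) - TAIL_LEN)
    footer_end = size - TAIL_LEN
    footer_start = footer_end - footer_len
    if footer_start < 0:
        raise ValueError("corrupt footer length")
    if footer_start >= start:
        # The speculative read already covers the whole footer.
        off = footer_start - start
        return tail[off:off + footer_len]
    # Otherwise issue exactly one more read for the footer itself.
    f.seek(footer_start)
    return read_exact(f, footer_len)
```

If the prefetch size is large enough the footer arrives in a single read;
otherwise the reader pays for one extra read, which is the cost that the
larger footer cache settings mentioned above are meant to avoid.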


Re: [DISCUSS] flatbuf footer: offsets

2025-11-03 Thread Alkis Evlogimenos
Assuming LZ4 decompression at 2 GB/s (per core) and network bandwidth of
1 GB/s, and taking as an example the 367 MB Thrift footer in the proposal,
the tradeoff is as follows:
T = Thrift, F32 = flatbuf with 32-bit offsets, F64 = flatbuf with 64-bit
offsets

T (367 MB): 50 ms latency + 370 ms transfer --> 420 ms (ignoring parse time)
F32 (113 MB raw / 50 MB LZ4): 50 ms latency + 50 ms transfer + 56 ms
decompression --> 156 ms
F64 (155 MB raw / 52 MB LZ4): 50 ms latency + 52 ms transfer + 78 ms
decompression --> 180 ms

Going with 64-bit offsets leaves some performance on the table, and it will
make LZ4 compression pretty much required for most footers above 256 KB.
That said, a flatbuf footer with 64-bit offsets is still much faster to
transfer than Thrift, even ignoring the horrendous parse times.
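
The arithmetic above can be reproduced with a small Python sketch. The
function name and parameters are made up for illustration; the model is just
latency plus transfer of the on-wire bytes plus LZ4 decompression of the raw
bytes, with 1 GB taken as 1000 MB.

```python
def fetch_ms(raw_mb, wire_mb=None, latency_ms=50.0,
             net_gb_s=1.0, lz4_gb_s=2.0):
    """Latency + transfer (+ decompression if compressed), in milliseconds."""
    compressed = wire_mb is not None
    wire = wire_mb if compressed else raw_mb
    # MB divided by GB/s gives milliseconds when 1 GB = 1000 MB.
    transfer = wire / net_gb_s
    # Decompression cost is driven by the raw (output) size.
    decompress = raw_mb / lz4_gb_s if compressed else 0.0
    return latency_ms + transfer + decompress


for name, raw, wire in [("T", 367, None), ("F32", 113, 50), ("F64", 155, 52)]:
    print(f"{name}: {fetch_ms(raw, wire):.1f} ms")
```

This gives 417.0 ms, 156.5 ms and 179.5 ms, which match the rounded figures
of 420 ms, 156 ms and 180 ms quoted above.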

For simplicity I am still slightly in favor of 64-bit offsets, but I am open
to arguments for 32-bit relative offsets plus alignment to bring the maximum
row group size to 64 GB.
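
For reference, the 32-bit-plus-alignment alternative could look roughly like
the Python sketch below. The alignment of 16 and all names are assumptions
for illustration, not part of the proposal. With an unsigned 32-bit stored
value and 16-byte alignment the addressable range is 2^32 * 16 bytes, i.e.
64 GiB; with a signed 32-bit value it is half of that.

```python
ALIGNMENT = 16          # assumed; any power of two works the same way
U32_MAX = 2**32 - 1


def pad_to_alignment(pos, alignment=ALIGNMENT):
    """Next position >= pos that is a multiple of alignment.

    The padding added is at most alignment - 1 bytes.
    """
    return (pos + alignment - 1) // alignment * alignment


def encode_offset(offset, alignment=ALIGNMENT):
    """Store an aligned byte offset as offset / alignment in 32 bits."""
    if offset % alignment:
        raise ValueError("offset is not aligned")
    stored = offset // alignment
    if stored > U32_MAX:
        raise OverflowError("offset does not fit in 32 bits")
    return stored


def decode_offset(stored, alignment=ALIGNMENT):
    """Recover the byte offset from the stored 32-bit value."""
    return stored * alignment
```

The writer would call pad_to_alignment() before starting each addressed
region, so the cost is a few padding bytes per region in exchange for keeping
the stored field at 32 bits.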

Thoughts?



Re: [DISCUSS] flatbuf footer: offsets

2025-10-28 Thread Antoine Pitrou


Hi,

I expect LZ4 to be optional, but enabled by default by most writers.
LZ4 decompression is extremely fast, typically several GB/s on a modern
CPU.

Regards

Antoine.


Re: [DISCUSS] flatbuf footer: offsets

2025-10-27 Thread Jan Finis
You are right that even without LZ4, we would still need I/O for the whole
footer. And I guess LZ4 is way faster than thrift, so flatbuf+LZ4 would be
an improvement over thrift. If you want superb partial decoding, we would
indeed need to somehow support only reading part of the footer from
storage. In the end, it's a trade-off. The more flexibility we want w.r.t.
partial reads, the more complexity we have to introduce. Maybe flatbuf
alone is already the sweet spot here and we shouldn't introduce additional
complexity. LZ4 compression would after all still be optional, right?

Someone mentioned that they have footers with millions of columns. Maybe
they should comment on how much partial reading would be required for their
use case. I guess the answer will be "the more support for partial
reading/decoding the better".

You could argue that if you have such a wide file, just don't use LZ4 then
and that's probably a valid argument.

Cheers,
Jan




Re: [DISCUSS] flatbuf footer: offsets

2025-10-27 Thread Antoine Pitrou


Hmmm... does it?

I may be mistaken, but I had the impression that what you call "read
only the parts of the footer I'm interested in" is actually "*decode*
only the parts of the footer I'm interested in".

That is, you still read the entire footer, which is a larger IO than
doing smaller reads, but it's also a single IO rather than several
smaller ones.

Of course, if we want to make things more flexible, we can have
individual Flatbuffers metadata pieces for each column, each
LZ4-compressed. And embed two sizes at the end of the file: the size of
the "core footer" metadata (without columns) and the size of the "full
footer" metadata (with columns); so that readers can choose their
preferred strategy.
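
To make the two-size idea concrete, here is a minimal Python sketch of how a
reader might pick its byte range. Everything about the layout is an
assumption for illustration: an 8-byte trailer holding two little-endian u32
sizes, and a core footer placed at the very end of the full footer so that
both ranges end at the same position.

```python
import struct

# Hypothetical trailer: core footer size, then full footer size.
TRAILER = struct.Struct("<II")


def footer_range(file_size, trailer_bytes, want_columns):
    """Return the (start, end) byte range to fetch for the chosen strategy."""
    core_size, full_size = TRAILER.unpack(trailer_bytes)
    if core_size > full_size:
        raise ValueError("core footer cannot be larger than full footer")
    end = file_size - TRAILER.size
    size = full_size if want_columns else core_size
    if size > end:
        raise ValueError("footer size exceeds file size")
    return end - size, end
```

A reader that only needs schema-level metadata would pass
want_columns=False and fetch the smaller range; a reader that needs column
metadata would fetch the full range in one IO.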

Regards

Antoine.


On Sat, 25 Oct 2025 14:39:37 +0200
Jan Finis  wrote:
> Note that LZ4 compression destroys the whole "I can read only the parts of
> the footer I'm interested in", so I wouldn't say that LZ4 can be the
> solution to everything.
> 
> Cheers,
> Jan
> 
> On Sat, Oct 25, 2025, 12:33 Antoine Pitrou wrote:
>
> > On Fri, 24 Oct 2025 12:12:02 -0700
> > Julien Le Dem wrote:
> > > I had an idea about this topic.
> > > What if we say the offset is always a multiple of 16? (I'm saying 16, but
> > > it works with 8 or 32 or any other power of 2).
> > > Then we store in the footer the offset divided by 16.
> > > That means you need to pad each row group by up to 16 bytes.
> > > But now the max size of the file is 32GB.
> > >
> > > Personally, I still don't like having arbitrary limits but 32GB seems a
> > > lot less like a restricting limit than 2GB.
> > > If we get crazy, we add this to the footer as metadata and the writer
> > > gets to pick whether you multiply offsets by 32, 64 or 128 if ten years
> > > from now we start having much bigger files.
> > > The size of the padding becomes negligible over the size of the file.
> > >
> > > Thoughts?
> >
> > That's an interesting suggestion. I would be fine with it personally,
> > provided the multiplier is either large enough (say, 64) or embedded in
> > the footer.
> >
> > That said, I would first wait for the outcome of the experiment with
> > LZ4 compression. If it negates the additional cost of 64-bit offsets,
> > then we should not bother with this multiplier mechanism.
> >
> > Regards
> >
> > Antoine.
> >
> >  
> > >
> > >
> > > On Tue, Oct 21, 2025 at 6:19 AM Alkis Evlogimenos wrote:
> > >
> > > > We've analyzed a large footer from our production environment to
> > > > understand byte distribution across its fields. The detailed analysis
> > > > is available in the proposal document here:
> > > >
> > > > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.o2lsuuyi8rw6#heading=h.26i914tjp4fk
> > > > .
> > > >
> > > > To illustrate the impact of 64-bit fields, we conducted an experiment
> > > > where all proposed 32-bit fields in the Flatbuf footer were changed to
> > > > 64-bit. This resulted in a *40% increase* in footer size.
> > > >
> > > > That said, LZ4 manages to compress this away. We will do some more
> > > > testing with 64 bit offsets/numvals/sizes and revert back. If it all
> > > > goes well we can resolve this by going 64 bit.
> > > >
> > > >
> > > > On Wed, Oct 15, 2025 at 12:49 PM Jan Finis <[email protected]> wrote:
> > > >
> > > > > Hi Alkis,
> > > > >
> > > > > one more very simple argument why you want these offsets to be i64:
> > > > > What if you want to store a single value larger than 4GB? I know this
> > > > > sounds absurd at first, but some use cases might want to store data
> > > > > that can sometimes be very large (e.g. blob data, or insanely complex
> > > > > geo data). And it would be a shame if that would mean that they
> > > > > cannot use Parquet at all.
> > > > >
> > > > > Thus, my opinion here is that we can limit to i32 all fields that the
> > > > > file writer has under control, e.g., the number of rows within a row
> > > > > group, but we shouldn't limit any values that a file writer doesn't
> > > > > have under control, as they fully depend on the input data.
> > > > >
> > > > > Note though that this means that the number of values in a column
> > > > > chunk could also exceed i32, if a user has nested data with more than
> > > > > 4 billion entries. With such data, the file writer again couldn't do
> > > > > anything to avoid writing a row group with more than i32 values, as a
> > > > > single row may not span multiple row groups. That being said, I think
> > > > > that nested data with more than 4 billion entries is less likely than
> > > > > a single large blob of 4 billion bytes.
> > > > >
> > > > > I know that smaller row groups is what most / all engines prefer, but
> > > > > we have to make sure 

Re: [DISCUSS] flatbuf footer: offsets

2025-10-25 Thread Jan Finis
Note that LZ4 compression destroys the whole "I can read only the parts of
the footer I'm interested in", so I wouldn't say that LZ4 can be the
solution to everything.

Cheers,
Jan

On Sat, Oct 25, 2025, 12:33 Antoine Pitrou  wrote:

> On Fri, 24 Oct 2025 12:12:02 -0700
> Julien Le Dem  wrote:
> > I had an idea about this topic.
> > What if we say the offset is always a multiple of 16? (I'm saying 16, but
> > it works with 8 or 32 or any other power of 2).
> > Then we store in the footer the offset divided by 16.
> > That means you need to pad each row group by up to 16 bytes.
> > But now the max size of the file is 32GB.
> >
> > Personally, I still don't like having arbitrary limits but 32GB seems a
> > lot less like a restricting limit than 2GB.
> > If we get crazy, we add this to the footer as metadata and the writer
> > gets to pick whether you multiply offsets by 32, 64 or 128 if ten years
> > from now we start having much bigger files.
> > The size of the padding becomes negligible over the size of the file.
> >
> > Thoughts?
>
> That's an interesting suggestion. I would be fine with it personally,
> provided the multiplier is either large enough (say, 64) or embedded in
> the footer.
>
> That said, I would first wait for the outcome of the experiment with
> LZ4 compression. If it negates the additional cost of 64-bit offsets,
> then we should not bother with this multiplier mechanism.
>
> Regards
>
> Antoine.
>
>
> >
> >
> > On Tue, Oct 21, 2025 at 6:19 AM Alkis Evlogimenos
> >  wrote:
> >
> > > We've analyzed a large footer from our production environment to
> understand
> > > byte distribution across its fields. The detailed analysis is
> available in
> > > the proposal document here:
> > >
> > >
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.o2lsuuyi8rw6#heading=h.26i914tjp4fk
> > > .
> > >
> > > To illustrate the impact of 64-bit fields, we conducted an experiment
> where
> > > all proposed 32-bit fields in the Flatbuf footer were changed to
> 64-bit.
> > > This resulted in a *40% increase* in footer size.
> > >
> > > That said, LZ4 manages to compress this away. We will do some more
> testing
> > > with 64 bit offsets/numvals/sizes and revert back. If it all goes well
> we
> > > can resolve this by going 64 bit.
> > >
> > >
> > > On Wed, Oct 15, 2025 at 12:49 PM Jan Finis <
> [email protected]> wrote:
> > >
> > > > Hi Alkis,
> > > >
> > > > one more very simple argument why you want these offsets to be i64:
> > > > What if you want to store a single value larger than 4GB? I know this
> > > > sounds absurd at first, but some use cases might want to store data
> that
> > > > can sometimes be very large (e.g. blob data, or insanely complex
> geo
> > > data).
> > > > And it would be a shame if that would mean that they cannot use
> Parquet
> > > at
> > > > all.
> > > >
> > > > Thus, my opinion here is that we can limit to i32 all fields that
> the
> > > file
> > > > writer has under control, e.g., the number of rows within a row
> group,
> > > but
> > > > we shouldn't limit any values that a file writer doesn't have under
> > > > control, as they fully depend on the input data.
> > > >
> > > > Note though that this means that the number of values in a column
> chunk
> > > > could also exceed i32, if a user has nested data with more than 4
> billion
> > > > entries. With such data, the file writer again couldn't do anything
> to
> > > > avoid writing a row group with more
> > > > than i32 values, as a single row may not span multiple row groups.
> That
> > > > being said, I think that nested data with more than 4 billion
> entries is
> > > > less likely than a single large blob of 4 billion bytes.
> > > >
> > > > I know that smaller row groups is what most / all engines prefer,
> but we
> > > > have to make sure the format also works for edge cases.
> > > >
> > > > Cheers,
> > > > Jan
> > > >
> > > > On Wed, Oct 15, 2025 at 5:05 AM, Adam Reeve wrote:
> > > >
> > > > > Hi Alkis
> > > > >
> > > > > Thanks for all your work on this proposal.
> > > > >
> > > > > I'd be in favour of keeping the offsets as i64 and not reducing
> the
> > > > maximum
> > > > > row group size, even if this results in slightly larger footers.
> I've
> > > > heard
> > > > > from some of our users within G-Research that they do have files
> with
> > > row
> > > > > groups > 2 GiB. This is often when they use lower-level APIs to
> write
> > > > > Parquet that don't automatically split data into row groups, and
> they
> > > > > either write a single row group for simplicity or have some logical
> > > > > partitioning of data into row groups. They might also have wide
> tables
> > > > with
> > > > > many columns, or wide array/tensor valued columns that lead to
> large
> > > row
> > > > > groups.
> > > > >
> > > > > In many workflows we don't read Parquet with a query engine that
> > > supports
> > > > > filters and skipping row groups, 

Re: [DISCUSS] flatbuf footer: offsets

2025-10-25 Thread Antoine Pitrou
On Fri, 24 Oct 2025 12:12:02 -0700
Julien Le Dem  wrote:
> I had an idea about this topic.
> What if we say the offset is always a multiple of 16? (I'm saying 16, but
> it works with 8 or 32 or any other power of 2).
> Then we store in the footer the offset divided by 16.
> That means you need to pad each row group by up to 16 bytes.
> But now the max size of the file is 32GB.
> 
> Personally, I still don't like having arbitrary limits but 32GB seems a lot
> less like a restricting limit than 2GB.
> If we get crazy, we add this to the footer as metadata and the writer gets
> to pick whether you multiply offsets by 32, 64 or 128 if ten years from now
> we start having much bigger files.
> The size of the padding becomes negligible over the size of the file.
> 
> Thoughts?

That's an interesting suggestion. I would be fine with it personally,
provided the multiplier is either large enough (say, 64) or embedded in
the footer.

That said, I would first wait for the outcome of the experiment with
LZ4 compression. If it negates the additional cost of 64-bit offsets,
then we should not bother with this multiplier mechanism.

Regards

Antoine.


> 
> 
> On Tue, Oct 21, 2025 at 6:19 AM Alkis Evlogimenos
>  wrote:
> 
> > We've analyzed a large footer from our production environment to understand
> > byte distribution across its fields. The detailed analysis is available in
> > the proposal document here:
> >
> > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.o2lsuuyi8rw6#heading=h.26i914tjp4fk
> > .
> >
> > To illustrate the impact of 64-bit fields, we conducted an experiment where
> > all proposed 32-bit fields in the Flatbuf footer were changed to 64-bit.
> > This resulted in a *40% increase* in footer size.
> >
> > That said, LZ4 manages to compress this away. We will do some more testing
> > with 64 bit offsets/numvals/sizes and revert back. If it all goes well we
> > can resolve this by going 64 bit.
> >
> >
> > On Wed, Oct 15, 2025 at 12:49 PM Jan Finis 
> >  wrote:
> >  
> > > Hi Alkis,
> > >
> > > one more very simple argument why you want these offsets to be i64:
> > > What if you want to store a single value larger than 4GB? I know this
> > > sounds absurd at first, but some use cases might want to store data that
> > > can sometimes be very large (e.g. blob data, or insanely complex geo  
> > data).  
> > > And it would be a shame if that would mean that they cannot use Parquet  
> > at  
> > > all.
> > >
> > > Thus, my opinion here is that we can limit to i32 all fields that the  
> > file  
> > > writer has under control, e.g., the number of rows within a row group,  
> > but  
> > > we shouldn't limit any values that a file writer doesn't have under
> > > control, as they fully depend on the input data.
> > >
> > > Note though that this means that the number of values in a column chunk
> > > could also exceed i32, if a user has nested data with more than 4 billion
> > > entries. With such data, the file writer again couldn't do anything to
> > > avoid writing a row group with more
> > > than i32 values, as a single row may not span multiple row groups. That
> > > being said, I think that nested data with more than 4 billion entries is
> > > less likely than a single large blob of 4 billion bytes.
> > >
> > > I know that smaller row groups is what most / all engines prefer, but we
> > > have to make sure the format also works for edge cases.
> > >
> > > Cheers,
> > > Jan
> > >
> > > On Wed, Oct 15, 2025 at 5:05 AM, Adam Reeve wrote:
> > >  
> > > > Hi Alkis
> > > >
> > > > Thanks for all your work on this proposal.
> > > >
> > > > I'd be in favour of keeping the offsets as i64 and not reducing the  
> > > maximum  
> > > > row group size, even if this results in slightly larger footers. I've  
> > > heard  
> > > > from some of our users within G-Research that they do have files with  
> > row  
> > > > groups > 2 GiB. This is often when they use lower-level APIs to write
> > > > Parquet that don't automatically split data into row groups, and they
> > > > either write a single row group for simplicity or have some logical
> > > > partitioning of data into row groups. They might also have wide tables  
> > > with  
> > > > many columns, or wide array/tensor valued columns that lead to large  
> > row  
> > > > groups.
> > > >
> > > > In many workflows we don't read Parquet with a query engine that  
> > supports  
> > > > filters and skipping row groups, but just read all rows, or directly
> > > > specify the row groups to read if there is some known logical  
> > > partitioning  
> > > > into row groups. I'm sure we could work around a 2 or 4 GiB row group  
> > > size  
> > > > limitation if we had to, but it's a new constraint that reduces the
> > > > flexibility of the format and makes more work for users who now need to
> > > > ensure they don't hit this limit.
> > > >
> > > > Do you have any measurements of how much of a difference 4 byte

Re: [DISCUSS] flatbuf footer: offsets

2025-10-24 Thread Julien Le Dem
I had an idea about this topic.
What if we say the offset is always a multiple of 16? (I'm saying 16, but
it works with 8 or 32 or any other power of 2).
Then we store in the footer the offset divided by 16.
That means you need to pad each row group by up to 16 bytes.
But now the max size of the file is 32GB.

Personally, I still don't like having arbitrary limits but 32GB seems a lot
less like a restricting limit than 2GB.
If we get crazy, we add this to the footer as metadata and the writer gets
to pick whether you multiply offsets by 32, 64 or 128 if ten years from now
we start having much bigger files.
The size of the padding becomes negligible over the size of the file.

Thoughts?
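
A minimal sketch of the scaled-offset idea above, in Python. The function
names and the assumption that the footer field is a signed 32-bit integer
are illustrative only and not part of the proposal. With an alignment of
16, the padding per row group is at most 15 bytes and the largest
addressable offset is (2^31 - 1) * 16, just under 32 GiB.

```python
I32_MAX = 2**31 - 1  # assumption: offsets are stored in a signed 32-bit field


def pad_to_alignment(offset: int, alignment: int = 16) -> int:
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)


def encode_offset(offset: int, alignment: int = 16) -> int:
    """Return the value stored in the footer: the offset divided by alignment."""
    if offset % alignment != 0:
        raise ValueError("offset must be a multiple of the alignment")
    scaled = offset // alignment
    if scaled > I32_MAX:
        raise OverflowError("offset does not fit in 32 bits at this alignment")
    return scaled


def decode_offset(scaled: int, alignment: int = 16) -> int:
    """Recover the byte offset from the stored value."""
    return scaled * alignment


if __name__ == "__main__":
    start = pad_to_alignment(3 * 2**30 + 5)  # a row group starting past 3 GiB
    stored = encode_offset(start)
    print(start, stored, decode_offset(stored))
```

If the multiplier were embedded in the footer, as suggested for much
bigger files, the reader would pass that value as `alignment` instead of
relying on a fixed constant.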


On Tue, Oct 21, 2025 at 6:19 AM Alkis Evlogimenos
 wrote:

> We've analyzed a large footer from our production environment to understand
> byte distribution across its fields. The detailed analysis is available in
> the proposal document here:
>
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.o2lsuuyi8rw6#heading=h.26i914tjp4fk
> .
>
> To illustrate the impact of 64-bit fields, we conducted an experiment where
> all proposed 32-bit fields in the Flatbuf footer were changed to 64-bit.
> This resulted in a *40% increase* in footer size.
>
> That said, LZ4 manages to compress this away. We will do some more testing
> with 64 bit offsets/numvals/sizes and revert back. If it all goes well we
> can resolve this by going 64 bit.
>
>
> On Wed, Oct 15, 2025 at 12:49 PM Jan Finis  wrote:
>
> > Hi Alkis,
> >
> > one more very simple argument why you want these offsets to be i64:
> > What if you want to store a single value larger than 4GB? I know this
> > sounds absurd at first, but some use cases might want to store data that
> > can sometimes be very large (e.g. blob data, or insanely complex geo
> data).
> > And it would be a shame if that would mean that they cannot use Parquet
> at
> > all.
> >
> > Thus, my opinion here is that we can limit to i32 all fields that the
> file
> > writer has under control, e.g., the number of rows within a row group,
> but
> > we shouldn't limit any values that a file writer doesn't have under
> > control, as they fully depend on the input data.
> >
> > Note though that this means that the number of values in a column chunk
> > could also exceed i32, if a user has nested data with more than 4 billion
> > entries. With such data, the file writer again couldn't do anything to
> > avoid writing a row group with more
> > than i32 values, as a single row may not span multiple row groups. That
> > being said, I think that nested data with more than 4 billion entries is
> > less likely than a single large blob of 4 billion bytes.
> >
> > I know that smaller row groups is what most / all engines prefer, but we
> > have to make sure the format also works for edge cases.
> >
> > Cheers,
> > Jan
> >
> > On Wed, Oct 15, 2025 at 5:05 AM, Adam Reeve wrote:
> >
> > > Hi Alkis
> > >
> > > Thanks for all your work on this proposal.
> > >
> > > I'd be in favour of keeping the offsets as i64 and not reducing the
> > maximum
> > > row group size, even if this results in slightly larger footers. I've
> > heard
> > > from some of our users within G-Research that they do have files with
> row
> > > groups > 2 GiB. This is often when they use lower-level APIs to write
> > > Parquet that don't automatically split data into row groups, and they
> > > either write a single row group for simplicity or have some logical
> > > partitioning of data into row groups. They might also have wide tables
> > with
> > > many columns, or wide array/tensor valued columns that lead to large
> row
> > > groups.
> > >
> > > In many workflows we don't read Parquet with a query engine that
> supports
> > > filters and skipping row groups, but just read all rows, or directly
> > > specify the row groups to read if there is some known logical
> > partitioning
> > > into row groups. I'm sure we could work around a 2 or 4 GiB row group
> > size
> > > limitation if we had to, but it's a new constraint that reduces the
> > > flexibility of the format and makes more work for users who now need to
> > > ensure they don't hit this limit.
> > >
> > > Do you have any measurements of how much of a difference 4 byte offsets
> > > make to footer sizes in your data, with and without the optional LZ4
> > > compression?
> > >
> > > Thanks,
> > > Adam
> > >
> > > On Tue, 14 Oct 2025 at 21:02, Alkis Evlogimenos
> > >  wrote:
> > >
> > > > Hi all,
> > > >
> > > > From the comments on the [EXTERNAL] Parquet metadata
> > > > <
> > > >
> > >
> >
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0
> > > > >
> > > > document,
> > > > it appears there's a general consensus on most aspects, with the
> > > exception
> > > > of the relative 32-bit offsets for column chunks.
> > > >
> > > > I'm starting this thread to discuss this topic further and work
> > towards a
> > > > reso

Re: [DISCUSS] flatbuf footer

2025-10-21 Thread Steve Loughran
On Mon, 20 Oct 2025 at 18:24, Ed Seidl  wrote:

> IIUC a flatbuffer aware decoder would read the last 36 bytes or so of the
> file and look for a known UUID along with size information. With this it
> could then read only the flatbuffer bytes. I think this would work as well
> as current systems that prefetch some number of bytes in an attempt to get
> the whole footer in a single get.
>
> Old readers, however, will have to fetch both footers, but won't have any
> additional decoding work because the new footer is a binary field that can
> be easily skipped.
>

really depends what the readers do with footer prefetching. For the java
clients


   1. s3a classic stream: the backwards seek() switches it to random IO
   mode, next read() from base of thrift will pull in fs.s3a.readahead.range
   of data. No penalty.
   2. google gs://. There's a footer cache option which will need to be set
   to a larger value
   3. azure abfs:// there's a footer cache option which will need to be set
   to a larger value
   4. s3a + amazon analytics stream. This stream is *parquet aware* and
   actually parses the footer to know what to predictively prefetch. The AWS
   developers do know of this work; moving to support the new footer would be
   the ideal strategy here.
   5. Iceberg classic input: no idea.
   6. iceberg + amazon analytics. same as S3A though without some of the
   tuning we've been doing for vector reads.

I wouldn't worry too much about the impact of that footer size increase.
Some extra footer prefetch options should compensate, and once apps move to
a parquet v3 reader they've got a faster parse time. Of course, ironically,
read time may then dominate even more, so it'll be important to do that
read as efficiently as possible (use a readFully() into a buffer, not lots
of single byte read() calls)
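
To illustrate the readFully() point, here is a rough Python analogue. The
helper name read_fully is made up for this sketch; the Java clients above
would use their own stream APIs. The idea is one positioned read into a
preallocated buffer, looping only if the stream returns a short read,
instead of many single-byte read() calls.

```python
import io


def read_fully(stream, position: int, length: int) -> bytes:
    """Read exactly `length` bytes starting at `position`, or raise EOFError."""
    buf = bytearray(length)
    view = memoryview(buf)
    stream.seek(position)
    filled = 0
    while filled < length:
        # readinto() may return fewer bytes than requested, so keep going.
        n = stream.readinto(view[filled:])
        if not n:
            raise EOFError("stream ended before the buffer was filled")
        filled += n
    return bytes(buf)


if __name__ == "__main__":
    data = bytes(range(256)) * 4
    tail = read_fully(io.BytesIO(data), len(data) - 24, 24)
    print(len(tail))
```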


Re: [DISCUSS] flatbuf footer: offsets

2025-10-21 Thread Alkis Evlogimenos
We've analyzed a large footer from our production environment to understand
byte distribution across its fields. The detailed analysis is available in
the proposal document here:
https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.o2lsuuyi8rw6#heading=h.26i914tjp4fk
.

To illustrate the impact of 64-bit fields, we conducted an experiment where
all proposed 32-bit fields in the Flatbuf footer were changed to 64-bit.
This resulted in a *40% increase* in footer size.

That said, LZ4 manages to compress this away. We will do some more testing
with 64-bit offsets/numvals/sizes and report back. If it all goes well we
can resolve this by going 64 bit.
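
A toy illustration of why widening fields can compress away, not a
reproduction of the experiment above. It uses made-up random values and
zlib from the Python standard library as a stand-in for LZ4, so the
numbers it prints say nothing about real footers. Widening values that fit
in 32 bits to 64 bits doubles the raw size, but the added bytes are all
zero and a general-purpose compressor removes most of them.

```python
import random
import struct
import zlib


def widen_and_compress(values):
    """Pack values as little-endian i32 and i64 and compare sizes."""
    raw32 = struct.pack(f"<{len(values)}i", *values)
    raw64 = struct.pack(f"<{len(values)}q", *values)
    return {
        "raw32": len(raw32),
        "raw64": len(raw64),
        # zlib is only a stand-in here; the thread discusses LZ4.
        "comp32": len(zlib.compress(raw32)),
        "comp64": len(zlib.compress(raw64)),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical field values, all small enough for a signed 32-bit field.
    values = [rng.randrange(2**31) for _ in range(10_000)]
    print(widen_and_compress(values))
```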


On Wed, Oct 15, 2025 at 12:49 PM Jan Finis  wrote:

> Hi Alkis,
>
> one more very simple argument why you want these offsets to be i64:
> What if you want to store a single value larger than 4GB? I know this
> sounds absurd at first, but some use cases might want to store data that
> can sometimes be very large (e.g. blob data, or insanely complex geo data).
> And it would be a shame if that would mean that they cannot use Parquet at
> all.
>
> Thus, my opinion here is that we can limit to i32 all fields that the file
> writer has under control, e.g., the number of rows within a row group, but
> we shouldn't limit any values that a file writer doesn't have under
> control, as they fully depend on the input data.
>
> Note though that this means that the number of values in a column chunk
> could also exceed i32, if a user has nested data with more than 4 billion
> entries. With such data, the file writer again couldn't do anything to
> avoid writing a row group with more
> than i32 values, as a single row may not span multiple row groups. That
> being said, I think that nested data with more than 4 billion entries is
> less likely than a single large blob of 4 billion bytes.
>
> I know that smaller row groups is what most / all engines prefer, but we
> have to make sure the format also works for edge cases.
>
> Cheers,
> Jan
>
> On Wed, Oct 15, 2025 at 5:05 AM, Adam Reeve wrote:
>
> > Hi Alkis
> >
> > Thanks for all your work on this proposal.
> >
> > I'd be in favour of keeping the offsets as i64 and not reducing the
> maximum
> > row group size, even if this results in slightly larger footers. I've
> heard
> > from some of our users within G-Research that they do have files with row
> > groups > 2 GiB. This is often when they use lower-level APIs to write
> > Parquet that don't automatically split data into row groups, and they
> > either write a single row group for simplicity or have some logical
> > partitioning of data into row groups. They might also have wide tables
> with
> > many columns, or wide array/tensor valued columns that lead to large row
> > groups.
> >
> > In many workflows we don't read Parquet with a query engine that supports
> > filters and skipping row groups, but just read all rows, or directly
> > specify the row groups to read if there is some known logical
> partitioning
> > into row groups. I'm sure we could work around a 2 or 4 GiB row group
> size
> > limitation if we had to, but it's a new constraint that reduces the
> > flexibility of the format and makes more work for users who now need to
> > ensure they don't hit this limit.
> >
> > Do you have any measurements of how much of a difference 4 byte offsets
> > make to footer sizes in your data, with and without the optional LZ4
> > compression?
> >
> > Thanks,
> > Adam
> >
> > On Tue, 14 Oct 2025 at 21:02, Alkis Evlogimenos
> >  wrote:
> >
> > > Hi all,
> > >
> > > From the comments on the [EXTERNAL] Parquet metadata
> > > <
> > >
> >
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0
> > > >
> > > document,
> > > it appears there's a general consensus on most aspects, with the
> > exception
> > > of the relative 32-bit offsets for column chunks.
> > >
> > > I'm starting this thread to discuss this topic further and work
> towards a
> > > resolution. Adam Reeve suggested raising the limitation to 2^32, and he
> > > confirmed that Java does not have any issues with this. I am open to
> this
> > > change as it increases the limit without introducing any drawbacks.
> > >
> > > However, some still feel that a 2^32-byte limit for a row group is too
> > > restrictive. I'd like to understand these specific use cases better.
> From
> > > my perspective, for most engines, the row group is the primary unit of
> > > skipping, making very large row groups less desirable. In our fleet's
> > > workloads, it's rare to see row groups larger than 100MB, as anything
> > > larger tends to make statistics-based skipping ineffective.
> > >
> > > Cheers,
> > >
> >
>


Re: [DISCUSS] flatbuf footer

2025-10-20 Thread Ed Seidl
IIUC a flatbuffer aware decoder would read the last 36 bytes or so of the file 
and look for a known UUID along with size information. With this it could then 
read only the flatbuffer bytes. I think this would work as well as current 
systems that prefetch some number of bytes in an attempt to get the whole 
footer in a single get.

Old readers, however, will have to fetch both footers, but won't have any 
additional decoding work because the new footer is a binary field that can be 
easily skipped.
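
A sketch of the tail lookup described above, in Python. The trailer layout
here is invented for illustration: a 16-byte marker UUID followed by a
little-endian u32 length, 20 bytes in total, sitting at the very end of
the data. The real proposal defines its own layout and marker value, and a
real reader would issue a ranged read for the tail rather than hold the
whole file in memory.

```python
import struct
import uuid

# Hypothetical marker value and trailer layout, for illustration only.
MARKER = uuid.UUID("00000000-0000-0000-0000-000000000001").bytes
TRAILER = struct.Struct("<16sI")  # 16-byte UUID + u32 flatbuffer length


def locate_flatbuf(data: bytes):
    """Return (start, end) of the flatbuffer bytes, or None if no marker."""
    if len(data) < TRAILER.size:
        return None
    end = len(data) - TRAILER.size
    marker, length = TRAILER.unpack_from(data, end)
    if marker != MARKER:
        return None  # no marker: fall back to the thrift footer
    start = end - length
    if start < 0:
        return None  # corrupt length field
    return start, end


if __name__ == "__main__":
    payload = b"flatbuffer-bytes"
    data = b"x" * 100 + payload + TRAILER.pack(MARKER, len(payload))
    start, end = locate_flatbuf(data)
    print(data[start:end])
```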

On 2025/10/20 15:59:14 Adrian Garcia Badaracco wrote:
> If we embed both a flat buffer footer and a thrift footer, will readers be 
> able to completely skip the thrift footer to read the flat buffer footer? Or 
> will they have to download / read both? Especially if they have to download 
> the bytes for both I’m not sure how big the win will be, on object storage 
> slow IO can be what dominates.
> 
> > On Oct 20, 2025, at 9:49 AM, Raphael Taylor-Davies 
> >  wrote:
> > 
> > I don't disagree that two files is much harder than one file, but is that 
> > the use-case that the flatbuffer format is solving for, or is that 
> > adequately serviced by the existing thrift-based footer? I had interpreted 
> > the flatbuffer more as a way to accelerate larger datasets consisting of 
> > many files, and of less utility for the single-file use-case.
> > 
> > That being said I misread the proposal, I thought it was proposing 
> > replacing the thrift based footer with a flatbuffer, which would be very 
> > disruptive. However, it looks like instead the (new?) proposal is to just 
> > create a duplicate flatbuffer footer embedded within the thrift footer, 
> > which can just be ignored by readers. The proposal is a bit vague when it 
> > comes to whether all information would be duplicated, or whether some 
> > information would only be embedded in the flatbuffer payload, but presuming 
> > it is a true duplicate, many of my points don't apply.
> > 
> > Kind Regards,
> > 
> > Raphael
> > 
> > On 20/10/2025 15:28, Antoine Pitrou wrote:
> >> I don't think it's a "small price to pay". Parquet files are widely
> >> used to share or transfer data of all kinds (in a way, they replace CSV
> >> with much better characteristics). Sharing a single file is easy,
> >> sharing two related files while keeping their relationship intact is an
> >> order of magnitude more difficult.
> >> 
> >> Regards
> >> 
> >> Antoine.
> >> 
> >> 
> >> On Mon, 20 Oct 2025 12:23:20 +0100
> >> Personal
> >> 
> >> wrote:
> >>> Apologies if this has already been discussed, but have we considered 
> >>> simply storing these flatbuffers as separate files alongside existing 
> >>> parquet files. I think this would have a number of quite compelling 
> >>> advantages:
> >>> 
> >>> - no breaking format changes, all readers can continue to still read all 
> >>> parquet files
> >>> - people can generate these "index" files for existing datasets without 
> >>> having to rewrite all their files
> >>> - older and newer readers can coexist - no stop the world migrations
> >>> - can potentially combine multiple flatbuffers into a single file for 
> >>> better IO when scanning collections of files - potentially very valuable 
> >>> for object stores, and would also help for people on HDFS and other 
> >>> systems that struggle with small files
> >>> - could potentially even inline these flatbuffers into catalogs like 
> >>> iceberg
> >>> - can continue to iterate at a faster rate, without the constraints of 
> >>> needing to move in lockstep with parquet versioning
> >>> - potentially less confusing for users, parquet files are still the same, 
> >>> they just can be accelerated by these new index files
> >>> 
> >>> This would mean some data duplication, but that seems a small price to 
> >>> pay, and would be strictly opt-in for users with use-cases that justify 
> >>> it?
> >>> 
> >>> Kind Regards,
> >>> 
> >>> Raphael
> >>> 
> >>> On 20 October 2025 11:08:59 BST, Alkis Evlogimenos 
> >>>  wrote:
> >>>> > Thank you, these are interesting. Can you share instructions on how to
> >>>> > reproduce the reported numbers? I am interested to review the code used
> >>>> > to generate these results (esp the C++ thrift code)
> >>>>
> >>>> The numbers are based on internal code (Photon). They are not very far
> >>>> off from https://github.com/apache/arrow/pull/43793. I will update that
> >>>> PR in the coming weeks so that we can repro the same benchmarks with open
> >>>> source code too.
> >>>>
> >>>> On Fri, Oct 17, 2025 at 5:52 PM Andrew Lamb wrote:
> >>>>
> >>>> > Thanks Alkis, that is interesting data.
> >>>> >
> >>>> >> We found that the reported numbers were not reproducible on AWS
> >>>> >> instances
> >>>> > I just updated the benchmark results[1] to include results from
> >>>> > AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
> >>>> > run on my 2023 Mac laptop)
> >>>> >
> >>>> >> You can find the summary of our f

Re: [DISCUSS] flatbuf footer

2025-10-20 Thread Adrian Garcia Badaracco
If we embed both a flat buffer footer and a thrift footer, will readers be able 
to completely skip the thrift footer to read the flat buffer footer? Or will 
they have to download / read both? Especially if they have to download the 
bytes for both I’m not sure how big the win will be, on object storage slow IO 
can be what dominates.

> On Oct 20, 2025, at 9:49 AM, Raphael Taylor-Davies 
>  wrote:
> 
> I don't disagree that two files is much harder than one file, but is that the 
> use-case that the flatbuffer format is solving for, or is that adequately 
> serviced by the existing thrift-based footer? I had interpreted the 
> flatbuffer more as a way to accelerate larger datasets consisting of many 
> files, and of less utility for the single-file use-case.
> 
> That being said I misread the proposal, I thought it was proposing replacing 
> the thrift based footer with a flatbuffer, which would be very disruptive. 
> However, it looks like instead the (new?) proposal is to just create a 
> duplicate flatbuffer footer embedded within the thrift footer, which can just 
> be ignored by readers. The proposal is a bit vague when it comes to whether 
> all information would be duplicated, or whether some information would only 
> be embedded in the flatbuffer payload, but presuming it is a true duplicate, 
> many of my points don't apply.
> 
> Kind Regards,
> 
> Raphael
> 
> On 20/10/2025 15:28, Antoine Pitrou wrote:
>> I don't think it's a "small price to pay". Parquet files are widely
>> used to share or transfer data of all kinds (in a way, they replace CSV
>> with much better characteristics). Sharing a single file is easy,
>> sharing two related files while keeping their relationship intact is an
>> order of magnitude more difficult.
>> 
>> Regards
>> 
>> Antoine.
>> 
>> 
>> On Mon, 20 Oct 2025 12:23:20 +0100
>> Personal
>> 
>> wrote:
>>> Apologies if this has already been discussed, but have we considered simply 
>>> storing these flatbuffers as separate files alongside existing parquet 
>>> files. I think this would have a number of quite compelling advantages:
>>> 
>>> - no breaking format changes, all readers can continue to still read all 
>>> parquet files
>>> - people can generate these "index" files for existing datasets without 
>>> having to rewrite all their files
>>> - older and newer readers can coexist - no stop the world migrations
>>> - can potentially combine multiple flatbuffers into a single file for 
>>> better IO when scanning collections of files - potentially very valuable 
>>> for object stores, and would also help for people on HDFS and other systems 
>>> that struggle with small files
>>> - could potentially even inline these flatbuffers into catalogs like iceberg
>>> - can continue to iterate at a faster rate, without the constraints of 
>>> needing to move in lockstep with parquet versioning
>>> - potentially less confusing for users, parquet files are still the same, 
>>> they just can be accelerated by these new index files
>>> 
>>> This would mean some data duplication, but that seems a small price to pay, 
>>> and would be strictly opt-in for users with use-cases that justify it?
>>> 
>>> Kind Regards,
>>> 
>>> Raphael
>>> 
>>> On 20 October 2025 11:08:59 BST, Alkis Evlogimenos 
>>>  wrote:
>>>> > Thank you, these are interesting. Can you share instructions on how to
>>>> > reproduce the reported numbers? I am interested to review the code used to
>>>> > generate these results (esp the C++ thrift code)
>>>>
>>>> The numbers are based on internal code (Photon). They are not very far off
>>>> from https://github.com/apache/arrow/pull/43793. I will update that PR in
>>>> the coming weeks so that we can repro the same benchmarks with open source
>>>> code too.
>>>>
>>>> On Fri, Oct 17, 2025 at 5:52 PM Andrew Lamb wrote:
>>>>
>>>> > Thanks Alkis, that is interesting data.
>>>> >
>>>> >> We found that the reported numbers were not reproducible on AWS instances
>>>> > I just updated the benchmark results[1] to include results from
>>>> > AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
>>>> > run on my 2023 Mac laptop)
>>>> >
>>>> >> You can find the summary of our findings in a separate tab in the
>>>> >> proposal document:
>>>> >
>>>> > Thank you, these are interesting. Can you share instructions on how to
>>>> > reproduce the reported numbers? I am interested to review the code used to
>>>> > generate these results (esp the C++ thrift code)
>>>> >
>>>> > Thanks
>>>> > Andrew
>>>> >
>>>> > [1]:
>>>> > https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux
>>>> >
>>>> > On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos wrote:
>>>> >
>>>> >> Thank you Andrew for putting the code in open source so that we can repro
>>>> >> it.
>>>> >>
>>>> >> We have run the rust benchmarks and also run the flatbuf proposal with
>>>> >> our
>>>> >> C++ thrift parser, the flatbuf footer with Thrift conversion, the
>>>> >> flatbuf foote

Re: [DISCUSS] flatbuf footer

2025-10-20 Thread Raphael Taylor-Davies
I don't disagree that two files is much harder than one file, but is 
that the use-case that the flatbuffer format is solving for, or is that 
adequately serviced by the existing thrift-based footer? I had 
interpreted the flatbuffer more as a way to accelerate larger datasets 
consisting of many files, and of less utility for the single-file use-case.


That being said I misread the proposal, I thought it was proposing 
replacing the thrift based footer with a flatbuffer, which would be very 
disruptive. However, it looks like instead the (new?) proposal is to 
just create a duplicate flatbuffer footer embedded within the thrift 
footer, which can just be ignored by readers. The proposal is a bit 
vague when it comes to whether all information would be duplicated, or 
whether some information would only be embedded in the flatbuffer 
payload, but presuming it is a true duplicate, many of my points don't 
apply.


Kind Regards,

Raphael

On 20/10/2025 15:28, Antoine Pitrou wrote:

I don't think it's a "small price to pay". Parquet files are widely
used to share or transfer data of all kinds (in a way, they replace CSV
with much better characteristics). Sharing a single file is easy,
sharing two related files while keeping their relationship intact is an
order of magnitude more difficult.

Regards

Antoine.


On Mon, 20 Oct 2025 12:23:20 +0100
Personal

wrote:

Apologies if this has already been discussed, but have we considered simply
storing these flatbuffers as separate files alongside existing parquet files? I
think this would have a number of quite compelling advantages:

- no breaking format changes, all readers can continue to still read all 
parquet files
- people can generate these "index" files for existing datasets without having 
to rewrite all their files
- older and newer readers can coexist - no stop the world migrations
- can potentially combine multiple flatbuffers into a single file for better IO 
when scanning collections of files - potentially very valuable for object 
stores, and would also help for people on HDFS and other systems that struggle 
with small files
- could potentially even inline these flatbuffers into catalogs like iceberg
- can continue to iterate at a faster rate, without the constraints of needing 
to move in lockstep with parquet versioning
- potentially less confusing for users, parquet files are still the same, they 
just can be accelerated by these new index files

This would mean some data duplication, but that seems a small price to pay, and 
would be strictly opt-in for users with use-cases that justify it?

Kind Regards,

Raphael

On 20 October 2025 11:08:59 BST, Alkis Evlogimenos 
 wrote:

Thank you, these are interesting. Can you share instructions on how to
reproduce the reported numbers? I am interested to review the code used to
generate these results (esp the C++ thrift code)


The numbers are based on internal code (Photon). They are not very far off
from https://github.com/apache/arrow/pull/43793. I will update that PR in
the coming weeks so that we can repro the same benchmarks with open source
code too.

On Fri, Oct 17, 2025 at 5:52 PM Andrew Lamb  wrote:
  

Thanks Alkis, that is interesting data.
  

We found that the reported numbers were not reproducible on AWS instances

I just updated the benchmark results[1] to include results from
AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
run on my 2023 Mac laptop)
  

You can find the summary of our findings in a separate tab in the
proposal document:

Thank you, these are interesting. Can you share instructions on how to
reproduce the reported numbers? I am interested to review the code used to
generate these results (esp the C++ thrift code)

Thanks
Andrew


[1]:

https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux


On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
 wrote:
  

Thank you Andrew for putting the code in open source so that we can repro
it.

We have run the rust benchmarks and also run the flatbuf proposal with our
C++ thrift parser, the flatbuf footer with Thrift conversion, the
flatbuf footer without Thrift conversion, and the flatbuf footer
without Thrift conversion and without verification. You can find the
summary of our findings in a separate tab in the proposal document:

  

https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s

The TLDR is that flatbuf is 5x faster with the Thrift conversion vs the
optimized Thrift parsing. It also remains faster than the Thrift parser
even if the Thrift parser skips statistics. Furthermore if Thrift
conversion is skipped, the speedup is 50x, and if verification is skipped
it goes beyond 150x.


On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb 
wrote:
  

Hello,

I did some benchmarking for the new parser[2] we are working on in
arrow-rs.

This benchmark achieves nearly an order of magnitude improvement (7x)
pars

Re: [DISCUSS] flatbuf footer

2025-10-20 Thread Antoine Pitrou


I don't think it's a "small price to pay". Parquet files are widely
used to share or transfer data of all kinds (in a way, they replace CSV
with much better characteristics). Sharing a single file is easy,
sharing two related files while keeping their relationship intact is an
order of magnitude more difficult.

Regards

Antoine.


On Mon, 20 Oct 2025 12:23:20 +0100
Personal

wrote:
> Apologies if this has already been discussed, but have we considered simply
> storing these flatbuffers as separate files alongside existing parquet files?
> I think this would have a number of quite compelling advantages:
> 
> - no breaking format changes, all readers can continue to still read all 
> parquet files
> - people can generate these "index" files for existing datasets without 
> having to rewrite all their files
> - older and newer readers can coexist - no stop the world migrations
> - can potentially combine multiple flatbuffers into a single file for better 
> IO when scanning collections of files - potentially very valuable for object 
> stores, and would also help for people on HDFS and other systems that 
> struggle with small files
> - could potentially even inline these flatbuffers into catalogs like iceberg
> - can continue to iterate at a faster rate, without the constraints of 
> needing to move in lockstep with parquet versioning
> - potentially less confusing for users, parquet files are still the same, 
> they just can be accelerated by these new index files
> 
> This would mean some data duplication, but that seems a small price to pay, 
> and would be strictly opt-in for users with use-cases that justify it?
> 
> Kind Regards,
> 
> Raphael
> 
> On 20 October 2025 11:08:59 BST, Alkis Evlogimenos 
>  wrote:
> >>
> >> Thank you, these are interesting. Can you share instructions on how to
> >> reproduce the reported numbers? I am interested to review the code used to
> >> generate these results (esp the C++ thrift code)  
> >
> >
> >The numbers are based on internal code (Photon). They are not very far off
> >from https://github.com/apache/arrow/pull/43793. I will update that PR in
> >the coming weeks so that we can repro the same benchmarks with open source
> >code too.
> >
> >On Fri, Oct 17, 2025 at 5:52 PM Andrew Lamb  wrote:
> >  
> >> Thanks Alkis, that is interesting data.
> >>  
> >> > We found that the reported numbers were not reproducible on AWS 
> >> > instances  
> >>
> >> I just updated the benchmark results[1] to include results from
> >> AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
> >> run on my 2023 Mac laptop)
> >>  
> >> > You can find the summary of our findings in a separate tab in the  
> >> proposal document:
> >>
> >> Thank you, these are interesting. Can you share instructions on how to
> >> reproduce the reported numbers? I am interested to review the code used to
> >> generate these results (esp the C++ thrift code)
> >>
> >> Thanks
> >> Andrew
> >>
> >>
> >> [1]:
> >>
> >> https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux
> >>
> >>
> >> On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
> >>  wrote:
> >>  
> >> > Thank you Andrew for putting the code in open source so that we can repro
> >> > it.
> >> >
> >> > We have run the rust benchmarks and also run the flatbuf proposal with  
> >> our  
> >> > C++ thrift parser, the flatbuf footer with Thrift conversion, the
> >> > flatbuf footer without Thrift conversion, and the flatbuf footer
> >> > without Thrift conversion and without verification. You can find the
> >> > summary of our findings in a separate tab in the proposal document:
> >> >
> >> >  
> >> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s
> >>   
> >> >
> >> > The TLDR is that flatbuf is 5x faster with the Thrift conversion vs the
> >> > optimized Thrift parsing. It also remains faster than the Thrift parser
> >> > even if the Thrift parser skips statistics. Furthermore if Thrift
> >> > conversion is skipped, the speedup is 50x, and if verification is skipped
> >> > it goes beyond 150x.
> >> >
> >> >
> >> > On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb 
> >> > wrote:
> >> >  
> >> > > Hello,
> >> > >
> >> > > I did some benchmarking for the new parser[2] we are working on in
> >> > > arrow-rs.
> >> > >
> >> > > This benchmark achieves nearly an order of magnitude improvement (7x)
> >> > > parsing Parquet metadata with no changes to the Parquet format, by  
> >> simply  
> >> > > writing a more efficient thrift decoder (which can also skip  
> >> statistics).  
> >> > >
> >> > > While we have not implemented a similar decoder in other languages 
> >> > > such  
> >> > as  
> >> > > C/C++ or Java, given the similarities in the existing thrift libraries 
> >> > >  
> >> > and  
> >> > > usage, we expect similar improvements are possible in those languages  
> >> as  
> >> > > well.
> >> > >
> >> > > Here are some inline image

Re: [DISCUSS] flatbuf footer

2025-10-20 Thread Andrew Lamb
>  I don't see any issue here:
https://github.com/apache/parquet-format/issues

That is a good call -- I filed
https://github.com/apache/parquet-format/issues/530 to track this.

On Mon, Oct 20, 2025 at 8:17 AM Andrew Bell 
wrote:

> On Mon, Oct 20, 2025 at 6:07 AM Alkis Evlogimenos
>  wrote:
>
> > Flatbuf parsing is trivial compared to thrift. Thrift walks the bytes of
> > the serialized form and picks fields out of it one by one. Flatbuf
> instead
> > takes the serialized form and uses offsets already embedded in it to
> > extract fields from the serialized form directly. In other words there is
> > no parsing done. We have 3 ways to use the flatbuf each of which adds
> more
> > overhead
> >
> ...
>
> Maybe I was confused by this:
>
> > This benchmark achieves nearly an order of magnitude improvement (7x)
> > parsing Parquet metadata with no changes to the Parquet format, by simply
> > writing a more efficient thrift decoder (which can also skip statistics).
>
> It was unclear to me if this was still about flatbuf or about writing a
> better thrift decoder. Is there a write-up describing exactly what's being
> proposed? I don't see any issue here:
> https://github.com/apache/parquet-format/issues
>
> --
> Andrew Bell
> [email protected]
>


Re: [DISCUSS] flatbuf footer

2025-10-20 Thread Andrew Lamb
> Maybe I was confused by this:

There are (at least) two parallel things going on:

1. Work in the Rust Parquet implementation to speed up the parsing of
thrift footers (no change to Parquet format)[1][2]
2. A proposal to change the Parquet format to add an optional
FlatBuffers-based footer [3]

[1]: https://github.com/apache/arrow-rs/issues/5854
[2]: https://github.com/alamb/parquet_footer_parsing
[3]:
https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0#heading=h.ccu4zzsy0tm5

Andrew

On Mon, Oct 20, 2025 at 8:17 AM Andrew Bell 
wrote:

> On Mon, Oct 20, 2025 at 6:07 AM Alkis Evlogimenos
>  wrote:
>
> > Flatbuf parsing is trivial compared to thrift. Thrift walks the bytes of
> > the serialized form and picks fields out of it one by one. Flatbuf
> instead
> > takes the serialized form and uses offsets already embedded in it to
> > extract fields from the serialized form directly. In other words there is
> > no parsing done. We have 3 ways to use the flatbuf each of which adds
> more
> > overhead
> >
> ...
>
> Maybe I was confused by this:
>
> > This benchmark achieves nearly an order of magnitude improvement (7x)
> > parsing Parquet metadata with no changes to the Parquet format, by simply
> > writing a more efficient thrift decoder (which can also skip statistics).
>
> It was unclear to me if this was still about flatbuf or about writing a
> better thrift decoder. Is there a write-up describing exactly what's being
> proposed? I don't see any issue here:
> https://github.com/apache/parquet-format/issues
>
> --
> Andrew Bell
> [email protected]
>


Re: [DISCUSS] flatbuf footer

2025-10-20 Thread Andrew Bell
On Mon, Oct 20, 2025 at 6:07 AM Alkis Evlogimenos
 wrote:

> Flatbuf parsing is trivial compared to thrift. Thrift walks the bytes of
> the serialized form and picks fields out of it one by one. Flatbuf instead
> takes the serialized form and uses offsets already embedded in it to
> extract fields from the serialized form directly. In other words there is
> no parsing done. We have 3 ways to use the flatbuf each of which adds more
> overhead
>
...

Maybe I was confused by this:

> This benchmark achieves nearly an order of magnitude improvement (7x)
> parsing Parquet metadata with no changes to the Parquet format, by simply
> writing a more efficient thrift decoder (which can also skip statistics).

It was unclear to me if this was still about flatbuf or about writing a
better thrift decoder. Is there a write-up describing exactly what's being
proposed? I don't see any issue here:
https://github.com/apache/parquet-format/issues

-- 
Andrew Bell
[email protected]


Re: [DISCUSS] flatbuf footer

2025-10-20 Thread Personal
Apologies if this has already been discussed, but have we considered simply
storing these flatbuffers as separate files alongside existing parquet files? I
think this would have a number of quite compelling advantages:

- no breaking format changes, all readers can continue to still read all 
parquet files
- people can generate these "index" files for existing datasets without having 
to rewrite all their files
- older and newer readers can coexist - no stop the world migrations
- can potentially combine multiple flatbuffers into a single file for better IO 
when scanning collections of files - potentially very valuable for object 
stores, and would also help for people on HDFS and other systems that struggle 
with small files
- could potentially even inline these flatbuffers into catalogs like iceberg
- can continue to iterate at a faster rate, without the constraints of needing 
to move in lockstep with parquet versioning
- potentially less confusing for users, parquet files are still the same, they 
just can be accelerated by these new index files

This would mean some data duplication, but that seems a small price to pay, and 
would be strictly opt-in for users with use-cases that justify it?

Kind Regards,

Raphael

On 20 October 2025 11:08:59 BST, Alkis Evlogimenos 
 wrote:
>>
>> Thank you, these are interesting. Can you share instructions on how to
>> reproduce the reported numbers? I am interested to review the code used to
>> generate these results (esp the C++ thrift code)
>
>
>The numbers are based on internal code (Photon). They are not very far off
>from https://github.com/apache/arrow/pull/43793. I will update that PR in
>the coming weeks so that we can repro the same benchmarks with open source
>code too.
>
>On Fri, Oct 17, 2025 at 5:52 PM Andrew Lamb  wrote:
>
>> Thanks Alkis, that is interesting data.
>>
>> > We found that the reported numbers were not reproducible on AWS instances
>>
>> I just updated the benchmark results[1] to include results from
>> AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
>> run on my 2023 Mac laptop)
>>
>> > You can find the summary of our findings in a separate tab in the
>> proposal document:
>>
>> Thank you, these are interesting. Can you share instructions on how to
>> reproduce the reported numbers? I am interested to review the code used to
>> generate these results (esp the C++ thrift code)
>>
>> Thanks
>> Andrew
>>
>>
>> [1]:
>>
>> https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux
>>
>>
>> On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
>>  wrote:
>>
>> > Thank you Andrew for putting the code in open source so that we can repro
>> > it.
>> >
>> > We have run the rust benchmarks and also run the flatbuf proposal with
>> our
>> > C++ thrift parser, the flatbuf footer with Thrift conversion, the
>> > flatbuf footer without Thrift conversion, and the flatbuf footer
>> > without Thrift conversion and without verification. You can find the
>> > summary of our findings in a separate tab in the proposal document:
>> >
>> >
>> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s
>> >
>> > The TLDR is that flatbuf is 5x faster with the Thrift conversion vs the
>> > optimized Thrift parsing. It also remains faster than the Thrift parser
>> > even if the Thrift parser skips statistics. Furthermore if Thrift
>> > conversion is skipped, the speedup is 50x, and if verification is skipped
>> > it goes beyond 150x.
>> >
>> >
>> > On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb 
>> > wrote:
>> >
>> > > Hello,
>> > >
>> > > I did some benchmarking for the new parser[2] we are working on in
>> > > arrow-rs.
>> > >
>> > > This benchmark achieves nearly an order of magnitude improvement (7x)
>> > > parsing Parquet metadata with no changes to the Parquet format, by
>> simply
>> > > writing a more efficient thrift decoder (which can also skip
>> statistics).
>> > >
>> > > While we have not implemented a similar decoder in other languages such
>> > as
>> > > C/C++ or Java, given the similarities in the existing thrift libraries
>> > and
>> > > usage, we expect similar improvements are possible in those languages
>> as
>> > > well.
>> > >
>> > > Here are some inline images:
>> > > [image: image.png]
>> > > [image: image.png]
>> > >
>> > >
>> > > You can find full details here [1]
>> > >
>> > > Andrew
>> > >
>> > >
>> > > [1]: https://github.com/alamb/parquet_footer_parsing
>> > > [2]: https://github.com/apache/arrow-rs/issues/5854
>> > >
>> > >
>> > > On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl  wrote:
>> > >
>> > >> > Concerning Thrift optimization, while a 2-3x improvement might be
>> > >> > achievable, Flatbuffers are currently demonstrating a 10x
>> improvement.
>> > >> > Andrew, do you have a more precise estimate for the speedup we could
>> > >> expect
>> > >> > in C++?
>> > >>
>> > >> Given my past experience on cuDF, I'd estimate about 2X th

Re: [DISCUSS] flatbuf footer

2025-10-20 Thread Alkis Evlogimenos
>
> Thank you, these are interesting. Can you share instructions on how to
> reproduce the reported numbers? I am interested to review the code used to
> generate these results (esp the C++ thrift code)


The numbers are based on internal code (Photon). They are not very far off
from https://github.com/apache/arrow/pull/43793. I will update that PR in
the coming weeks so that we can repro the same benchmarks with open source
code too.

On Fri, Oct 17, 2025 at 5:52 PM Andrew Lamb  wrote:

> Thanks Alkis, that is interesting data.
>
> > We found that the reported numbers were not reproducible on AWS instances
>
> I just updated the benchmark results[1] to include results from
> AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
> run on my 2023 Mac laptop)
>
> > You can find the summary of our findings in a separate tab in the
> proposal document:
>
> Thank you, these are interesting. Can you share instructions on how to
> reproduce the reported numbers? I am interested to review the code used to
> generate these results (esp the C++ thrift code)
>
> Thanks
> Andrew
>
>
> [1]:
>
> https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux
>
>
> On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
>  wrote:
>
> > Thank you Andrew for putting the code in open source so that we can repro
> > it.
> >
> > We have run the rust benchmarks and also run the flatbuf proposal with
> our
> > C++ thrift parser, the flatbuf footer with Thrift conversion, the
> > flatbuf footer without Thrift conversion, and the flatbuf footer
> > without Thrift conversion and without verification. You can find the
> > summary of our findings in a separate tab in the proposal document:
> >
> >
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s
> >
> > The TLDR is that flatbuf is 5x faster with the Thrift conversion vs the
> > optimized Thrift parsing. It also remains faster than the Thrift parser
> > even if the Thrift parser skips statistics. Furthermore if Thrift
> > conversion is skipped, the speedup is 50x, and if verification is skipped
> > it goes beyond 150x.
> >
> >
> > On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb 
> > wrote:
> >
> > > Hello,
> > >
> > > I did some benchmarking for the new parser[2] we are working on in
> > > arrow-rs.
> > >
> > > This benchmark achieves nearly an order of magnitude improvement (7x)
> > > parsing Parquet metadata with no changes to the Parquet format, by
> simply
> > > writing a more efficient thrift decoder (which can also skip
> statistics).
> > >
> > > While we have not implemented a similar decoder in other languages such
> > as
> > > C/C++ or Java, given the similarities in the existing thrift libraries
> > and
> > > usage, we expect similar improvements are possible in those languages
> as
> > > well.
> > >
> > > Here are some inline images:
> > > [image: image.png]
> > > [image: image.png]
> > >
> > >
> > > You can find full details here [1]
> > >
> > > Andrew
> > >
> > >
> > > [1]: https://github.com/alamb/parquet_footer_parsing
> > > [2]: https://github.com/apache/arrow-rs/issues/5854
> > >
> > >
> > > On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl  wrote:
> > >
> > >> > Concerning Thrift optimization, while a 2-3x improvement might be
> > >> > achievable, Flatbuffers are currently demonstrating a 10x
> improvement.
> > >> > Andrew, do you have a more precise estimate for the speedup we could
> > >> expect
> > >> > in C++?
> > >>
> > >> Given my past experience on cuDF, I'd estimate about 2X there as well.
> > >> cuDF has its own metadata parser that I once benchmarked against the
> > >> thrift generated parser.
> > >>
> > >> And I'd point out that beyond the initial 2X improvement, rolling your
> > >> own parser frees you of having to parse out every structure in the
> > metadata.
> > >>
> > >
> >
>


Re: [DISCUSS] flatbuf footer

2025-10-20 Thread Alkis Evlogimenos
Flatbuf parsing is trivial compared to thrift. Thrift walks the bytes of
the serialized form and picks fields out of it one by one. Flatbuf instead
takes the serialized form and uses offsets already embedded in it to
extract fields from the serialized form directly. In other words there is
no parsing done. We have 3 ways to use the flatbuf, each of which adds more
overhead:
1. raw flatbuf without postprocessing: +150x speedup
2. verified flatbuf: +50x speedup. Verified flatbuf means that before any
flatbuf access, all offsets are bounds-checked so that they cannot cause
memory accesses outside the flatbuf-encoded blob
3. verified flatbuf + conversion to FileMetadata struct generated by thrift
compiler: +5x speedup. This is the easy migration path for most engines
where we take flatbuf, verify it and then put together the same
FileMetadata struct that would come out of thrift parser had we parsed the
thrift representation.
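
To make the distinction above concrete, here is a minimal Python sketch. It
is an illustration only: the two byte layouts are toy formats invented for
this example, not the real Thrift compact protocol or the FlatBuffers wire
format, and all function names (encode_sequential, read_offset, and so on)
are hypothetical. It contrasts walking a tag/length/value stream with jumping
through an offset table, and shows the kind of bounds check that "verified"
access adds.

```python
import struct

# Toy layouts only: neither is the real Thrift or FlatBuffers wire format.

def encode_sequential(fields):
    """Tag/length/value stream: a reader has to walk it field by field."""
    out = bytearray()
    for tag, payload in fields.items():
        out += struct.pack("<HI", tag, len(payload)) + payload
    return bytes(out)

def read_sequential(buf, wanted_tag):
    """Walk field headers until the wanted tag is found; count the steps."""
    pos = 0
    steps = 0
    while pos < len(buf):
        tag, size = struct.unpack_from("<HI", buf, pos)
        pos += 6  # size of the "<HI" header
        steps += 1
        if tag == wanted_tag:
            return buf[pos:pos + size], steps
        pos += size
    raise KeyError(wanted_tag)

def encode_offsets(fields):
    """Offset table up front: slot i holds (offset, size) of field i."""
    n = len(fields)
    header_size = 4 + 8 * n
    table = bytearray(struct.pack("<I", n))
    body = bytearray()
    for slot in range(n):
        payload = fields[slot]
        table += struct.pack("<II", header_size + len(body), len(payload))
        body += payload
    return bytes(table + body)

def read_offset(buf, slot, verify=True):
    """Jump straight to one field; optionally bounds-check before access."""
    (n,) = struct.unpack_from("<I", buf, 0)
    if verify and not 0 <= slot < n:
        raise ValueError("slot out of range")
    offset, size = struct.unpack_from("<II", buf, 4 + 8 * slot)
    if verify and offset + size > len(buf):
        raise ValueError("offset points outside the buffer")
    return buf[offset:offset + size]

fields = {i: f"column-{i}".encode() for i in range(1000)}
seq = encode_sequential(fields)
off = encode_offsets(fields)

value, steps = read_sequential(seq, 999)
assert value == b"column-999" and steps == 1000
assert read_offset(off, 999) == b"column-999"
```

In the toy stream, reaching the last of 1000 fields touches 1000 headers,
while the offset table needs one lookup regardless of position. Real numbers
depend on the actual formats and implementations.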


On Fri, Oct 17, 2025 at 5:19 PM Andrew Bell 
wrote:

> On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
>  wrote:
>
> > Thank you Andrew for putting the code in open source so that we can repro
> > it.
> >
> > The TLDR is that flatbuf is 5x faster with the Thrift conversion vs the
> > optimized Thrift parsing. It also remains faster than the Thrift parser
> > even if the Thrift parser skips statistics. Furthermore if Thrift
> > conversion is skipped, the speedup is 50x, and if verification is skipped
> > it goes beyond 150x.
>
>
> Can you explain a bit the differences/changes in the parser that provides
> such a speedup?
>
> --
> Andrew Bell
> [email protected]
>


Re: [DISCUSS] flatbuf footer

2025-10-18 Thread Pierre Lacave
We have to deal with very wide files (up to a million columns).

The approach we took is very similar to flatbuffer metadata + skiplist.

Seeing this happen in parquet opens interesting possibilities.

This is explained in this blog:

Husky: Efficient compaction at Datadog scale | Datadog
https://www.datadoghq.com/blog/engineering/husky-storage-compaction/

https://imgix.datadoghq.com/img/blog/engineering/husky-storage-compaction/compaction_static_diagram_2_rev.png


On Sat, Oct 18, 2025, 9:58 PM Ed Seidl  wrote:

> Of course there's nothing to preclude adding just such an index to the
> current format.
>
> On 2025/10/17 22:10:36 Corwin Joy wrote:
> > For us, the exciting thing about the flatbuf footer approach is the
> > potential for fast random access. For wide tables, the metadata becomes
> > huge, and there is a lot of overhead to access a particular rowgroup.
> (See
> > previous discussions, e.g., https://github.com/apache/arrow/issues/38149
> ).
> > Even if we can get a faster thrift parser, this is still limited, because
> > you have to parse the entire metadata, which is inherently slow. Pulling
> > information for a selected rowgroup is a lot faster.
> > Right now, we have a workaround: we create an external index to get fast
> > random access. (https://github.com/G-Research/PalletJack). But, having a
> > fast internal random access index like the proposed flatbuf footer would
> be
> > a big step forward.
> >
> > On Fri, Oct 17, 2025 at 8:50 AM Andrew Lamb 
> wrote:
> >
> > > Thanks Alkis, that is interesting data.
> > >
> > > > We found that the reported numbers were not reproducible on AWS
> instances
> > >
> > > I just updated the benchmark results[1] to include results from
> > > AWS m6id.8xlarge instance (and they are indeed about 2x slower than
> when
> > > run on my 2023 Mac laptop)
> > >
> > > > You can find the summary of our findings in a separate tab in the
> > > proposal document:
> > >
> > > Thank you, these are interesting. Can you share instructions on how to
> > > reproduce the reported numbers? I am interested to review the code
> used to
> > > generate these results (esp the C++ thrift code)
> > >
> > > Thanks
> > > Andrew
> > >
> > >
> > > [1]:
> > >
> > >
> https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux
> > >
> > >
> > > On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
> > >  wrote:
> > >
> > > > Thank you Andrew for putting the code in open source so that we can
> repro
> > > > it.
> > > >
> > > > We have run the rust benchmarks and also run the flatbuf proposal
> with
> > > our
> > > > C++ thrift parser, the flatbuf footer with Thrift conversion, the
> > > > flatbuf footer without Thrift conversion, and the flatbuf footer
> > > > without Thrift conversion and without verification. You can find the
> > > > summary of our findings in a separate tab in the proposal document:
> > > >
> > > >
> > >
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s
> > > >
> > > > The TLDR is that flatbuf is 5x faster with the Thrift conversion vs
> the
> > > > optimized Thrift parsing. It also remains faster than the Thrift
> parser
> > > > even if the Thrift parser skips statistics. Furthermore if Thrift
> > > > conversion is skipped, the speedup is 50x, and if verification is
> skipped
> > > > it goes beyond 150x.
> > > >
> > > >
> > > > On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb 
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I did some benchmarking for the new parser[2] we are working on in
> > > > > arrow-rs.
> > > > >
> > > > > This benchmark achieves nearly an order of magnitude improvement
> (7x)
> > > > > parsing Parquet metadata with no changes to the Parquet format, by
> > > simply
> > > > > writing a more efficient thrift decoder (which can also skip
> > > statistics).
> > > > >
> > > > > While we have not implemented a similar decoder in other languages
> such
> > > > as
> > > > > C/C++ or Java, given the similarities in the existing thrift
> libraries
> > > > and
> > > > > usage, we expect similar improvements are possible in those
> languages
> > > as
> > > > > well.
> > > > >
> > > > > Here are some inline images:
> > > > > [image: image.png]
> > > > > [image: image.png]
> > > > >
> > > > >
> > > > > You can find full details here [1]
> > > > >
> > > > > Andrew
> > > > >
> > > > >
> > > > > [1]: https://github.com/alamb/parquet_footer_parsing
> > > > > [2]: https://github.com/apache/arrow-rs/issues/5854
> > > > >
> > > > >
> > > > > On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl 
> wrote:
> > > > >
> > > > >> > Concerning Thrift optimization, while a 2-3x improvement might
> be
> > > > >> > achievable, Flatbuffers are currently demonstrating a 10x
> > > improvement.
> > > > >> > Andrew, do you have a more precise estimate for the speedup we
> could
> > > > >> expect
> > > > >> > in C++?
> > > > >>
> > > > >> Given my past experience on cuDF, I'd estimate about

Re: [DISCUSS] flatbuf footer

2025-10-18 Thread Ed Seidl
Of course there's nothing to preclude adding just such an index to the current 
format.
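
As a rough sketch of what such an index could look like (an assumption for
illustration, not the layout used by PalletJack or by the flatbuf proposal):
each row group's metadata is serialized on its own, and a small index records
the (offset, length) of each piece, so a reader decodes only the row group it
needs. JSON stands in for the real encoding, and the names
build_indexed_metadata and load_row_group are hypothetical.

```python
import json

def build_indexed_metadata(row_groups):
    """Serialize each row group separately and record where each one lands."""
    blobs = [json.dumps(rg).encode() for rg in row_groups]
    index = []
    pos = 0
    for blob in blobs:
        index.append((pos, len(blob)))  # (offset, length) of this row group
        pos += len(blob)
    return index, b"".join(blobs)

def load_row_group(index, blob, i):
    """Decode only row group i; the other row groups are never parsed."""
    offset, length = index[i]
    return json.loads(blob[offset:offset + length])

row_groups = [
    {"id": i, "num_rows": 1000 + i, "columns": [f"c{j}" for j in range(50)]}
    for i in range(200)
]
index, blob = build_indexed_metadata(row_groups)

rg = load_row_group(index, blob, 137)
assert rg["num_rows"] == 1137
```

The cost of reading one row group then scales with the size of that row
group's metadata rather than with the size of the whole footer.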

On 2025/10/17 22:10:36 Corwin Joy wrote:
> For us, the exciting thing about the flatbuf footer approach is the
> potential for fast random access. For wide tables, the metadata becomes
> huge, and there is a lot of overhead to access a particular rowgroup. (See
> previous discussions, e.g., https://github.com/apache/arrow/issues/38149).
> Even if we can get a faster thrift parser, this is still limited, because
> you have to parse the entire metadata, which is inherently slow. Pulling
> information for a selected rowgroup is a lot faster.
> Right now, we have a workaround: we create an external index to get fast
> random access. (https://github.com/G-Research/PalletJack). But, having a
> fast internal random access index like the proposed flatbuf footer would be
> a big step forward.
> 
> On Fri, Oct 17, 2025 at 8:50 AM Andrew Lamb  wrote:
> 
> > Thanks Alkis, that is interesting data.
> >
> > > We found that the reported numbers were not reproducible on AWS instances
> >
> > I just updated the benchmark results[1] to include results from
> > AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
> > run on my 2023 Mac laptop)
> >
> > > You can find the summary of our findings in a separate tab in the
> > proposal document:
> >
> > Thank you, these are interesting. Can you share instructions on how to
> > reproduce the reported numbers? I am interested to review the code used to
> > generate these results (esp the C++ thrift code)
> >
> > Thanks
> > Andrew
> >
> >
> > [1]:
> >
> > https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux
> >
> >
> > On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
> >  wrote:
> >
> > > Thank you Andrew for putting the code in open source so that we can repro
> > > it.
> > >
> > > We have run the rust benchmarks and also run the flatbuf proposal with
> > our
> > > C++ thrift parser, the flatbuf footer with Thrift conversion, the
> > > flatbuf footer without Thrift conversion, and the flatbuf footer
> > > without Thrift conversion and without verification. You can find the
> > > summary of our findings in a separate tab in the proposal document:
> > >
> > >
> > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s
> > >
> > > The TLDR is that flatbuf is 5x faster with the Thrift conversion vs the
> > > optimized Thrift parsing. It also remains faster than the Thrift parser
> > > even if the Thrift parser skips statistics. Furthermore if Thrift
> > > conversion is skipped, the speedup is 50x, and if verification is skipped
> > > it goes beyond 150x.
> > >
> > >
> > > On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb 
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I did some benchmarking for the new parser[2] we are working on in
> > > > arrow-rs.
> > > >
> > > > This benchmark achieves nearly an order of magnitude improvement (7x)
> > > > parsing Parquet metadata with no changes to the Parquet format, by
> > simply
> > > > writing a more efficient thrift decoder (which can also skip
> > statistics).
> > > >
> > > > While we have not implemented a similar decoder in other languages such
> > > as
> > > > C/C++ or Java, given the similarities in the existing thrift libraries
> > > and
> > > > usage, we expect similar improvements are possible in those languages
> > as
> > > > well.
> > > >
> > > > Here are some inline images:
> > > > [image: image.png]
> > > > [image: image.png]
> > > >
> > > >
> > > > You can find full details here [1]
> > > >
> > > > Andrew
> > > >
> > > >
> > > > [1]: https://github.com/alamb/parquet_footer_parsing
> > > > [2]: https://github.com/apache/arrow-rs/issues/5854
> > > >
> > > >
> > > > On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl  wrote:
> > > >
> > > >> > Concerning Thrift optimization, while a 2-3x improvement might be
> > > >> > achievable, Flatbuffers are currently demonstrating a 10x
> > improvement.
> > > >> > Andrew, do you have a more precise estimate for the speedup we could
> > > >> expect
> > > >> > in C++?
> > > >>
> > > >> Given my past experience on cuDF, I'd estimate about 2X there as well.
> > > > >> cuDF has its own metadata parser that I once benchmarked against the
> > > >> thrift generated parser.
> > > >>
> > > >> And I'd point out that beyond the initial 2X improvement, rolling your
> > > >> own parser frees you of having to parse out every structure in the
> > > metadata.
> > > >>
> > > >
> > >
> >
> 


Re: [DISCUSS] flatbuf footer

2025-10-18 Thread Andrew Lamb
Hello,

I did some benchmarking for the new parser[2] we are working on in
arrow-rs.

This benchmark achieves nearly an order of magnitude improvement (7x)
parsing Parquet metadata with no changes to the Parquet format, by simply
writing a more efficient thrift decoder (which can also skip statistics).

While we have not implemented a similar decoder in other languages such as
C/C++ or Java, given the similarities in the existing thrift libraries and
usage, we expect similar improvements are possible in those languages as
well.

Here are some inline images:
[image: image.png]
[image: image.png]


You can find full details here [1]

Andrew


[1]: https://github.com/alamb/parquet_footer_parsing
[2]: https://github.com/apache/arrow-rs/issues/5854


On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl  wrote:

> > Concerning Thrift optimization, while a 2-3x improvement might be
> > achievable, Flatbuffers are currently demonstrating a 10x improvement.
> > Andrew, do you have a more precise estimate for the speedup we could
> expect
> > in C++?
>
> Given my past experience on cuDF, I'd estimate about 2X there as well.
> cuDF has its own metadata parser that I once benchmarked against the
> thrift generated parser.
>
> And I'd point out that beyond the initial 2X improvement, rolling your own
> parser frees you of having to parse out every structure in the metadata.
>


Re: [DISCUSS] flatbuf footer: offsets

2025-10-18 Thread Adam Reeve
Hi Alkis

Thanks for all your work on this proposal.

I'd be in favour of keeping the offsets as i64 and not reducing the maximum
row group size, even if this results in slightly larger footers. I've heard
from some of our users within G-Research that they do have files with row
groups > 2 GiB. This is often when they use lower-level APIs to write
Parquet that don't automatically split data into row groups, and they
either write a single row group for simplicity or have some logical
partitioning of data into row groups. They might also have wide tables with
many columns, or wide array/tensor valued columns that lead to large row
groups.

In many workflows we don't read Parquet with a query engine that supports
filters and skipping row groups, but just read all rows, or directly
specify the row groups to read if there is some known logical partitioning
into row groups. I'm sure we could work around a 2 or 4 GiB row group size
limitation if we had to, but it's a new constraint that reduces the
flexibility of the format and makes more work for users who now need to
ensure they don't hit this limit.

Do you have any measurements of how much of a difference 4 byte offsets
make to footer sizes in your data, with and without the optional LZ4
compression?
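
For a rough sense of scale, the sketch below compares the signed and
unsigned 32-bit limits against a 3 GiB row group and gives a
back-of-the-envelope estimate of the raw (uncompressed) bytes saved by
4-byte offsets. The function names and the default of one offset field per
column chunk are illustrative assumptions, not taken from the proposal;
real savings depend on the actual schema, on alignment, and on LZ4.

```python
GIB = 2**30
INT32_MAX = 2**31 - 1    # largest signed 32-bit value, just under 2 GiB
UINT32_MAX = 2**32 - 1   # largest unsigned 32-bit value, just under 4 GiB


def fits_limit(row_group_bytes: int, limit: int) -> bool:
    """Return True if a row group of this size is addressable under `limit`."""
    return row_group_bytes <= limit


def raw_offset_savings(num_row_groups: int, num_columns: int,
                       offsets_per_chunk: int = 1) -> int:
    """Estimate bytes saved by storing each offset in 4 bytes instead of 8.

    `offsets_per_chunk` is a hypothetical knob: how many offset fields each
    column chunk carries in the footer.
    """
    return num_row_groups * num_columns * offsets_per_chunk * (8 - 4)


if __name__ == "__main__":
    print(fits_limit(3 * GIB, INT32_MAX))    # a 3 GiB row group vs. i32
    print(fits_limit(3 * GIB, UINT32_MAX))   # the same row group vs. u32
    print(raw_offset_savings(100, 1000))     # 100 row groups x 1000 columns
```

This only bounds the uncompressed difference; it says nothing about how
much of it survives compression, which is what the question above asks.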

Thanks,
Adam

On Tue, 14 Oct 2025 at 21:02, Alkis Evlogimenos
 wrote:

> Hi all,
>
> From the comments on the [EXTERNAL] Parquet metadata
> <
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0
> >
> document,
> it appears there's a general consensus on most aspects, with the exception
> of the relative 32-bit offsets for column chunks.
>
> I'm starting this thread to discuss this topic further and work towards a
> resolution. Adam Reeve suggested raising the limitation to 2^32, and he
> confirmed that Java does not have any issues with this. I am open to this
> change as it increases the limit without introducing any drawbacks.
>
> However, some still feel that a 2^32-byte limit for a row group is too
> restrictive. I'd like to understand these specific use cases better. From
> my perspective, for most engines, the row group is the primary unit of
> skipping, making very large row groups less desirable. In our fleet's
> workloads, it's rare to see row groups larger than 100MB, as anything
> larger tends to make statistics-based skipping ineffective.
>
> Cheers,
>


Re: [DISCUSS] flatbuf footer: offsets

2025-10-17 Thread Jan Finis
Hi Alkis,

One more very simple argument for why these offsets should be i64:
What if you want to store a single value larger than 4 GB? I know this
sounds absurd at first, but some use cases might need to store data that
can sometimes be very large (e.g. blob data, or insanely complex geo data).
It would be a shame if that meant they could not use Parquet at
all.

Thus, my opinion here is that we can limit to i32 all fields that the file
writer has under control, e.g., the number of rows within a row group, but
we shouldn't limit any values that a file writer doesn't have under
control, as they fully depend on the input data.

Note though that this means that the number of values in a column chunk
could also exceed the i32 range, if a user has nested data with more than
4 billion entries. With such data, the file writer again couldn't do
anything to avoid writing a row group with more values than fit in an i32,
as a single row may not span multiple row groups. That being said, I think
that nested data with more than 4 billion entries is less likely than a
single large blob of 4 billion bytes.
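
The distinction between writer-controlled and data-dependent quantities can
be sketched as follows. `plan_row_groups` is a hypothetical helper, not part
of any Parquet library: it caps the number of rows per row group, which the
writer can always do, and reports the resulting byte sizes, which it cannot
cap because a row must stay within one row group.

```python
UINT32_MAX = 2**32 - 1
GIB = 2**30


def plan_row_groups(row_sizes, max_rows):
    """Split rows into row groups of at most `max_rows` rows each.

    `row_sizes` holds the encoded size in bytes of each row.
    Returns a list of (row_count, total_bytes) per row group.
    """
    groups = []
    for start in range(0, len(row_sizes), max_rows):
        chunk = row_sizes[start:start + max_rows]
        groups.append((len(chunk), sum(chunk)))
    return groups


if __name__ == "__main__":
    # Three rows; the middle one holds a single 5 GiB blob value.
    groups = plan_row_groups([10, 5 * GIB, 20], max_rows=1)
    for rows, size in groups:
        # Even with one row per group, the blob's group exceeds 32 bits.
        print(rows, size, size > UINT32_MAX)
```

The row count stays within any limit the writer picks, while the byte size
of the second row group is dictated entirely by the input data.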

I know that smaller row groups are what most or all engines prefer, but we
have to make sure the format also works for edge cases.

Cheers,
Jan

On Wed, 15 Oct 2025 at 05:05, Adam Reeve wrote:

> Hi Alkis
>
> Thanks for all your work on this proposal.
>
> I'd be in favour of keeping the offsets as i64 and not reducing the maximum
> row group size, even if this results in slightly larger footers. I've heard
> from some of our users within G-Research that they do have files with row
> groups > 2 GiB. This is often when they use lower-level APIs to write
> Parquet that don't automatically split data into row groups, and they
> either write a single row group for simplicity or have some logical
> partitioning of data into row groups. They might also have wide tables with
> many columns, or wide array/tensor valued columns that lead to large row
> groups.
>
> In many workflows we don't read Parquet with a query engine that supports
> filters and skipping row groups, but just read all rows, or directly
> specify the row groups to read if there is some known logical partitioning
> into row groups. I'm sure we could work around a 2 or 4 GiB row group size
> limitation if we had to, but it's a new constraint that reduces the
> flexibility of the format and makes more work for users who now need to
> ensure they don't hit this limit.
>
> Do you have any measurements of how much of a difference 4 byte offsets
> make to footer sizes in your data, with and without the optional LZ4
> compression?
>
> Thanks,
> Adam
>
> On Tue, 14 Oct 2025 at 21:02, Alkis Evlogimenos
>  wrote:
>
> > Hi all,
> >
> > From the comments on the [EXTERNAL] Parquet metadata
> > <
> >
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0
> > >
> > document,
> > it appears there's a general consensus on most aspects, with the
> exception
> > of the relative 32-bit offsets for column chunks.
> >
> > I'm starting this thread to discuss this topic further and work towards a
> > resolution. Adam Reeve suggested raising the limitation to 2^32, and he
> > confirmed that Java does not have any issues with this. I am open to this
> > change as it increases the limit without introducing any drawbacks.
> >
> > However, some still feel that a 2^32-byte limit for a row group is too
> > restrictive. I'd like to understand these specific use cases better. From
> > my perspective, for most engines, the row group is the primary unit of
> > skipping, making very large row groups less desirable. In our fleet's
> > workloads, it's rare to see row groups larger than 100MB, as anything
> > larger tends to make statistics-based skipping ineffective.
> >
> > Cheers,
> >
>


Re: [DISCUSS] flatbuf footer

2025-10-17 Thread Corwin Joy
For us, the exciting thing about the flatbuf footer approach is the
potential for fast random access. For wide tables, the metadata becomes
huge, and there is a lot of overhead to access a particular rowgroup. (See
previous discussions, e.g., https://github.com/apache/arrow/issues/38149).
Even if we can get a faster thrift parser, this is still limited, because
you have to parse the entire metadata, which is inherently slow. Pulling
information for a selected rowgroup is a lot faster.
Right now, we have a workaround: we create an external index to get fast
random access. (https://github.com/G-Research/PalletJack). But, having a
fast internal random access index like the proposed flatbuf footer would be
a big step forward.
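
The random-access idea can be illustrated with a toy layout: a fixed-width
offset table followed by length-prefixed records, so that record i is found
with two small reads instead of a full parse. This is a minimal sketch
under stated assumptions; the layout and the names `build_indexed_footer`
and `read_record` are made up for illustration and are neither the
FlatBuffers wire format nor the PalletJack index format.

```python
import struct


def build_indexed_footer(records):
    """Serialize records as: u32 count, count x u32 offsets, then payloads.

    Each payload is stored as a u32 length followed by the raw bytes.
    """
    header_size = 4 + 4 * len(records)
    offsets = []
    pos = header_size
    for rec in records:
        offsets.append(pos)
        pos += 4 + len(rec)
    out = bytearray(struct.pack("<I", len(records)))
    for off in offsets:
        out += struct.pack("<I", off)
    for rec in records:
        out += struct.pack("<I", len(rec)) + rec
    return bytes(out)


def read_record(footer, i):
    """Fetch record i by jumping through the offset table."""
    (count,) = struct.unpack_from("<I", footer, 0)
    if not 0 <= i < count:
        raise IndexError(i)
    (off,) = struct.unpack_from("<I", footer, 4 + 4 * i)
    (length,) = struct.unpack_from("<I", footer, off)
    return footer[off + 4:off + 4 + length]


if __name__ == "__main__":
    footer = build_indexed_footer([b"rg0", b"row-group-1", b"rg2"])
    print(read_record(footer, 1))
```

The cost of `read_record` does not depend on how many records precede the
requested one, which is the property that matters for wide tables.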

On Fri, Oct 17, 2025 at 8:50 AM Andrew Lamb  wrote:

> Thanks Alkis, that is interesting data.
>
> > We found that the reported numbers were not reproducible on AWS instances
>
> I just updated the benchmark results[1] to include results from
> AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
> run on my 2023 Mac laptop)
>
> > You can find the summary of our findings in a separate tab in the
> proposal document:
>
> Thank you, these are interesting. Can you share instructions on how to
> reproduce the reported numbers? I am interested to review the code used to
> generate these results (esp the C++ thrift code)
>
> Thanks
> Andrew
>
>
> [1]:
>
> https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux
>
>
> On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
>  wrote:
>
> > Thank you Andrew for putting the code in open source so that we can repro
> > it.
> >
> > We have run the rust benchmarks and also run the flatbuf proposal with
> our
> > C++ thrift parser, the flatbuf footer with Thrift conversion, the
> > flatbuf footer without Thrift conversion, and the flatbuf footer
> > without Thrift conversion and without verification. You can find the
> > summary of our findings in a separate tab in the proposal document:
> >
> >
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s
> >
> > The TLDR is that flatbuf is 5x faster with the Thrift conversion vs the
> > optimized Thrift parsing. It also remains faster than the Thrift parser
> > even if the Thrift parser skips statistics. Furthermore if Thrift
> > conversion is skipped, the speedup is 50x, and if verification is skipped
> > it goes beyond 150x.
> >
> >
> > On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb 
> > wrote:
> >
> > > Hello,
> > >
> > > I did some benchmarking for the new parser[2] we are working on in
> > > arrow-rs.
> > >
> > > This benchmark achieves nearly an order of magnitude improvement (7x)
> > > parsing Parquet metadata with no changes to the Parquet format, by
> simply
> > > writing a more efficient thrift decoder (which can also skip
> statistics).
> > >
> > > While we have not implemented a similar decoder in other languages such
> > as
> > > C/C++ or Java, given the similarities in the existing thrift libraries
> > and
> > > usage, we expect similar improvements are possible in those languages
> as
> > > well.
> > >
> > > Here are some inline images:
> > > [image: image.png]
> > > [image: image.png]
> > >
> > >
> > > You can find full details here [1]
> > >
> > > Andrew
> > >
> > >
> > > [1]: https://github.com/alamb/parquet_footer_parsing
> > > [2]: https://github.com/apache/arrow-rs/issues/5854
> > >
> > >
> > > On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl  wrote:
> > >
> > >> > Concerning Thrift optimization, while a 2-3x improvement might be
> > >> > achievable, Flatbuffers are currently demonstrating a 10x
> improvement.
> > >> > Andrew, do you have a more precise estimate for the speedup we could
> > >> expect
> > >> > in C++?
> > >>
> > >> Given my past experience on cuDF, I'd estimate about 2X there as well.
> > > >> cuDF has its own metadata parser that I once benchmarked against the
> > >> thrift generated parser.
> > >>
> > >> And I'd point out that beyond the initial 2X improvement, rolling your
> > >> own parser frees you of having to parse out every structure in the
> > metadata.
> > >>
> > >
> >
>


Re: [DISCUSS] flatbuf footer

2025-10-17 Thread Andrew Lamb
Thanks Alkis, that is interesting data.

> We found that the reported numbers were not reproducible on AWS instances

I just updated the benchmark results[1] to include results from
AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
run on my 2023 Mac laptop)

> You can find the summary of our findings in a separate tab in the
> proposal document:

Thank you, these are interesting. Can you share instructions on how to
reproduce the reported numbers? I am interested to review the code used to
generate these results (esp the C++ thrift code)

Thanks
Andrew


[1]:
https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux


On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
 wrote:

> Thank you Andrew for putting the code in open source so that we can repro
> it.
>
> We have run the rust benchmarks and also run the flatbuf proposal with our
> C++ thrift parser, the flatbuf footer with Thrift conversion, the
> flatbuf footer without Thrift conversion, and the flatbuf footer
> without Thrift conversion and without verification. You can find the
> summary of our findings in a separate tab in the proposal document:
>
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s
>
> The TLDR is that flatbuf is 5x faster with the Thrift conversion vs the
> optimized Thrift parsing. It also remains faster than the Thrift parser
> even if the Thrift parser skips statistics. Furthermore if Thrift
> conversion is skipped, the speedup is 50x, and if verification is skipped
> it goes beyond 150x.
>
>
> On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb 
> wrote:
>
> > Hello,
> >
> > I did some benchmarking for the new parser[2] we are working on in
> > arrow-rs.
> >
> > This benchmark achieves nearly an order of magnitude improvement (7x)
> > parsing Parquet metadata with no changes to the Parquet format, by simply
> > writing a more efficient thrift decoder (which can also skip statistics).
> >
> > While we have not implemented a similar decoder in other languages such
> as
> > C/C++ or Java, given the similarities in the existing thrift libraries
> and
> > usage, we expect similar improvements are possible in those languages as
> > well.
> >
> > Here are some inline images:
> > [image: image.png]
> > [image: image.png]
> >
> >
> > You can find full details here [1]
> >
> > Andrew
> >
> >
> > [1]: https://github.com/alamb/parquet_footer_parsing
> > [2]: https://github.com/apache/arrow-rs/issues/5854
> >
> >
> > On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl  wrote:
> >
> >> > Concerning Thrift optimization, while a 2-3x improvement might be
> >> > achievable, Flatbuffers are currently demonstrating a 10x improvement.
> >> > Andrew, do you have a more precise estimate for the speedup we could
> >> expect
> >> > in C++?
> >>
> >> Given my past experience on cuDF, I'd estimate about 2X there as well.
> > >> cuDF has its own metadata parser that I once benchmarked against the
> >> thrift generated parser.
> >>
> >> And I'd point out that beyond the initial 2X improvement, rolling your
> >> own parser frees you of having to parse out every structure in the
> metadata.
> >>
> >
>


Re: [DISCUSS] flatbuf footer

2025-10-17 Thread Andrew Bell
On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
 wrote:

> Thank you Andrew for putting the code in open source so that we can repro
> it.
>
> The TLDR is that flatbuf is 5x faster with the Thrift conversion vs the
> optimized Thrift parsing. It also remains faster than the Thrift parser
> even if the Thrift parser skips statistics. Furthermore if Thrift
> conversion is skipped, the speedup is 50x, and if verification is skipped
> it goes beyond 150x.


Can you explain a bit about the differences/changes in the parser that
provide such a speedup?

-- 
Andrew Bell
[email protected]


Re: [DISCUSS] flatbuf footer

2025-10-17 Thread Alkis Evlogimenos
Thank you Andrew for putting the code in open source so that we can repro
it.

We have run the Rust benchmarks and also compared four configurations on
our side: our C++ Thrift parser, the flatbuf footer with Thrift
conversion, the flatbuf footer without Thrift conversion, and the flatbuf
footer without Thrift conversion and without verification. You can find the
summary of our findings in a separate tab in the proposal document:
https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s

The TLDR is that flatbuf is 5x faster with the Thrift conversion vs the
optimized Thrift parsing. It also remains faster than the Thrift parser
even if the Thrift parser skips statistics. Furthermore if Thrift
conversion is skipped, the speedup is 50x, and if verification is skipped
it goes beyond 150x.
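
Taking the three speedups above at face value, the share of time each
removed stage accounted for can be worked out with simple arithmetic. This
is a sketch under two assumptions that are not stated in the findings: all
three speedups are relative to the same Thrift baseline, and stage costs
are additive. `stage_shares` is a hypothetical helper name, and 150x is
used as a lower bound for "beyond 150x".

```python
from fractions import Fraction


def stage_shares(speedups):
    """Given speedups vs. one baseline, ordered from slowest to fastest
    configuration, return the fraction of each configuration's time that
    disappears when moving to the next one."""
    times = [Fraction(1, s) for s in speedups]  # baseline time = 1
    return [(a - b) / a for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    conversion_share, verification_share = stage_shares([5, 50, 150])
    # Share of the "with Thrift conversion" time spent on conversion.
    print(conversion_share)      # 9/10
    # Share of the "no conversion" time spent on verification.
    print(verification_share)    # 2/3
```

Under those assumptions, conversion would account for about 90% of the
decode time in the first flatbuf configuration, and verification for about
two-thirds of what remains.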


On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb  wrote:

> Hello,
>
> I did some benchmarking for the new parser[2] we are working on in
> arrow-rs.
>
> This benchmark achieves nearly an order of magnitude improvement (7x)
> parsing Parquet metadata with no changes to the Parquet format, by simply
> writing a more efficient thrift decoder (which can also skip statistics).
>
> While we have not implemented a similar decoder in other languages such as
> C/C++ or Java, given the similarities in the existing thrift libraries and
> usage, we expect similar improvements are possible in those languages as
> well.
>
> Here are some inline images:
> [image: image.png]
> [image: image.png]
>
>
> You can find full details here [1]
>
> Andrew
>
>
> [1]: https://github.com/alamb/parquet_footer_parsing
> [2]: https://github.com/apache/arrow-rs/issues/5854
>
>
> On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl  wrote:
>
>> > Concerning Thrift optimization, while a 2-3x improvement might be
>> > achievable, Flatbuffers are currently demonstrating a 10x improvement.
>> > Andrew, do you have a more precise estimate for the speedup we could
>> expect
>> > in C++?
>>
>> Given my past experience on cuDF, I'd estimate about 2X there as well.
>> cuDF has its own metadata parser that I once benchmarked against the
>> thrift generated parser.
>>
>> And I'd point out that beyond the initial 2X improvement, rolling your
>> own parser frees you of having to parse out every structure in the metadata.
>>
>


Re: [DISCUSS] flatbuf footer

2025-09-25 Thread Andrew Lamb
> Andrew, do you have a more precise estimate for the speedup we could
> expect in C++?

I do not yet, but I will try and find out. I have filed an issue[1] to
track the question / will try and enlist some help.

It will be fun to benchmaxx our new parser

Andrew

[1]: https://github.com/apache/arrow-rs/issues/8441

On Wed, Sep 24, 2025 at 6:38 AM Alkis Evlogimenos
 wrote:

> Thank you all for taking the time to go through the doc and your feedback.
> I'd like to address some of the key points raised:
>
> Regarding nested Flatbuffers, there's no parsing benefit to using them. In
> the current prototype, approximately two-thirds of the decoding cost comes
> from converting the Flatbuffer to `FileMetadata` (the Thrift object) to
> simplify the rollout process. Even with this conversion, we're observing a
> greater than 10x improvement in footer decoding time for footers that
> perform poorly with Thrift (at the p999 percentile). Removing the
> `FileMetadata` translation should easily provide another 2x speedup.
>
> Concerning Thrift optimization, while a 2-3x improvement might be
> achievable, Flatbuffers are currently demonstrating a 10x improvement.
> Andrew, do you have a more precise estimate for the speedup we could expect
> in C++? It's also important to note that Thrift's format does not allow for
> random access, meaning we will always have to parse the entire footer,
> regardless of which columns are requested.
>
> I will work on getting numbers for LZ4 compressed versus raw footers, but
> please be aware that this will take some time.
>
> Finally, the 32-bit narrowing of row group sizes appears to be the most
> contentious aspect of the design. I suggest we discuss this live during our
> next Parquet sync. For the record, shrinking the offsets is the second most
> significant optimization for Flatbuffer footer size, with statistics being
> the first.
>
> See you all in the next sync.
>
>
> On Wed, Sep 17, 2025 at 10:02 AM Antoine Pitrou 
> wrote:
>
> >
> > Hi Andrew,
> >
> > I haven't heard of anything like this for C++, but it is an intriguing
> > idea.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Tue, 16 Sep 2025 16:44:14 -0400
> > Andrew Lamb 
> > wrote:
> > > Has anyone spent time optimizing the thrift decoder (e.g. not just use
> > > whatever a general purpose thrift compiler generates, but custom code a
> > > parser just for Parquet metadata)?
> > >
> > > Ed is in the process of implementing just such a decoder in arrow-rs[1]
> > and
> > > has seen a 2-3x performance improvement (with no change to the format)
> in
> > > > early benchmark results. This is in line with our earlier work on the
> > > topic[2] where we estimated there is a 2-4x performance improvement
> with
> > > implementation improvements alone.
> > >
> > > Andrew
> > >
> > > [1]: https://github.com/apache/arrow-rs/issues/5854
> > > [2]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
> > >
> > > On Tue, Sep 16, 2025 at 4:12 AM Antoine Pitrou <
> > [email protected]> wrote:
> > >
> > > >
> > > > Hi again,
> > > >
> > > > Ok, a quick summary of my current feedback on this:
> > > >
> > > > - decoding speed measurements are given, but not footer size
> > > >   measurements; it would be interesting to have both
> > > >
> > > > - it's not obvious whether the stated numbers are for reading all
> > > >   columns or a subset of them
> > > >
> > > > - optional LZ4 compression is mentioned, but no numbers are given for
> > > >   it; it would be nice if numbers were available for both
> uncompressed
> > > >   and compressed footers
> > > >
> > > > - the numbers seem quite underwhelming currently, I think most of us
> > > >   were expecting massive speed improvements given past discussions
> > > >
> > > > - I'm firmly against narrowing sizes to 32 bits; making the footer
> more
> > > >   compact is useful, but not to the point of reducing usefulness or
> > > >   generality
> > > >
> > > >
> > > > A more general proposal: given the slightly underwhelming perf
> > > > numbers, has nested Flatbuffers been considered as an alternative?
> > > >
> > > > For example, the RowGroup table could become:
> > > > ```
> > > > table ColumnChunk {
> > > >   file_path: string;
> > > >   meta_data: ColumnMetadata;
> > > >   // etc.
> > > > }
> > > >
> > > > struct EncodedColumnChunk {
> > > > >   // Flatbuffers-encoded ColumnChunk, to be decoded/validated individually
> > > >   column: [ubyte];
> > > > }
> > > >
> > > > table RowGroup {
> > > >   columns: [EncodedColumnChunk];
> > > >   total_byte_size: int;
> > > >   num_rows: int;
> > > >   sorting_columns: [SortingColumn];
> > > >   file_offset: long;
> > > >   total_compressed_size: int;
> > > >   ordinal: short = null;
> > > > }
> > > > ```
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > >
> > > > On Thu, 11 Sep 2025 08:41:34 +0200
> > > > Alkis Evlogimenos
> > > > 
> > > > wrote:
> > > > > Hi all. I am sharing as a separate thread the pr

Re: [DISCUSS] flatbuf footer

2025-09-24 Thread Ed Seidl
> Concerning Thrift optimization, while a 2-3x improvement might be
> achievable, Flatbuffers are currently demonstrating a 10x improvement.
> Andrew, do you have a more precise estimate for the speedup we could expect
> in C++? 

Given my past experience on cuDF, I'd estimate about 2X there as well. cuDF has 
its own metadata parser that I once benchmarked against the thrift generated 
parser.

And I'd point out that beyond the initial 2X improvement, rolling your own 
parser frees you of having to parse out every structure in the metadata.


Re: [DISCUSS] flatbuf footer

2025-09-24 Thread Antoine Pitrou
On Wed, 24 Sep 2025 12:37:13 +0200, Alkis Evlogimenos wrote:
> Thank you all for taking the time to go through the doc and your feedback.
> I'd like to address some of the key points raised:
> 
> Regarding nested Flatbuffers, there's no parsing benefit to using them. In
> the current prototype, approximately two-thirds of the decoding cost comes
> from converting the Flatbuffer to `FileMetadata` (the Thrift object) to
> simplify the rollout process. Even with this conversion, we're observing a
> greater than 10x improvement in footer decoding time for footers that
> perform poorly with Thrift (at the p999 percentile). Removing the
> `FileMetadata` translation should easily provide another 2x speedup.

1. Your own numbers show p50 (median) performance at around 1x, not
10x. It's nice that p999 (!!) performance is so good, but
that probably doesn't paint a representative picture of overall
performance.

2. It would be useful to have p05 and p01 performance results, by
the way. For now we know only about the best results, not the worst,
which is a bit surprising.

3. As you said in one of the comments: "even without Thrift, we still
have to verify the flatbuf which means we still have to walk all the
bytes". Nested Flatbuffers would avoid verifying the flatbuf data for
unused columns or indices, for example.
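
Point 3 can be sketched with a toy container that stores each column's
metadata as an opaque blob and verifies a blob only when it is requested.
This is an illustration only: the CRC check stands in for a real
FlatBuffers verifier pass, and the `LazyColumns` class is hypothetical, not
an API of the FlatBuffers library or of the proposal.

```python
import zlib


class LazyColumns:
    """Holds per-column blobs and verifies each one on first access only."""

    def __init__(self, payloads):
        # Store (payload, checksum) pairs; the checksum stands in for
        # whatever structural verification a real decoder would run.
        self._blobs = [(p, zlib.crc32(p)) for p in payloads]
        self._verified = set()

    @property
    def verified_count(self):
        return len(self._verified)

    def get(self, index):
        payload, checksum = self._blobs[index]
        if index not in self._verified:
            if zlib.crc32(payload) != checksum:
                raise ValueError(f"column {index} failed verification")
            self._verified.add(index)
        return payload


if __name__ == "__main__":
    columns = LazyColumns([b"col-a", b"col-b", b"col-c", b"col-d"])
    columns.get(2)
    columns.get(2)  # second access does not verify again
    print(columns.verified_count, "of 4 columns verified")
```

A reader that touches one column out of many pays for verifying one blob,
whereas a single flat buffer has to be walked in full before first use.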

> Finally, the 32-bit narrowing of row group sizes appears to be the most
> contentious aspect of the design. I suggest we discuss this live during our
> next Parquet sync.

Well, not everyone can often make it to the Parquet syncs. Important
spec discussions should be accessible to anyone regardless of their
personal/professional schedules.

> For the record, shrinking the offsets is the second most
> significant optimization for Flatbuffer footer size, with statistics being
> the first.

I'm curious whether LZ4 would make the optimization less significant.

Regards

Antoine.




Re: [DISCUSS] flatbuf footer

2025-09-24 Thread Alkis Evlogimenos
Thank you all for taking the time to go through the doc and your feedback.
I'd like to address some of the key points raised:

Regarding nested Flatbuffers, there's no parsing benefit to using them. In
the current prototype, approximately two-thirds of the decoding cost comes
from converting the Flatbuffer to `FileMetadata` (the Thrift object) to
simplify the rollout process. Even with this conversion, we're observing a
greater than 10x improvement in footer decoding time for footers that
perform poorly with Thrift (at the p999 percentile). Removing the
`FileMetadata` translation should easily provide another 2x speedup.

Concerning Thrift optimization, while a 2-3x improvement might be
achievable, Flatbuffers are currently demonstrating a 10x improvement.
Andrew, do you have a more precise estimate for the speedup we could expect
in C++? It's also important to note that Thrift's format does not allow for
random access, meaning we will always have to parse the entire footer,
regardless of which columns are requested.
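
The reason for the lack of random access can be seen in a small sketch of
variable-length integer decoding, the kind of encoding Thrift's compact
protocol uses for integers. Because each value occupies a data-dependent
number of bytes, the position of the n-th value is only known after
decoding everything before it. The helper names are illustrative, and the
sketch covers only unsigned varints, not the full protocol.

```python
def read_uvarint(buf, pos):
    """Decode one unsigned LEB128 varint starting at `pos`.

    Returns (value, position just past the varint).
    """
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            return result, pos
        shift += 7


def nth_value(buf, n):
    """Return the n-th varint (0-based); must decode all earlier ones."""
    pos = 0
    value = None
    for _ in range(n + 1):
        value, pos = read_uvarint(buf, pos)
    return value


if __name__ == "__main__":
    # Encodes 1, 300, 127: one byte, two bytes, one byte.
    data = bytes([0x01, 0xAC, 0x02, 0x7F])
    print(nth_value(data, 2))  # 127, reached only after decoding 1 and 300
```

A custom parser can skip the work of materializing earlier values, but it
still has to step over their bytes to find where later ones begin.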

I will work on getting numbers for LZ4 compressed versus raw footers, but
please be aware that this will take some time.

Finally, the 32-bit narrowing of row group sizes appears to be the most
contentious aspect of the design. I suggest we discuss this live during our
next Parquet sync. For the record, shrinking the offsets is the second most
significant optimization for Flatbuffer footer size, with statistics being
the first.

See you all in the next sync.


On Wed, Sep 17, 2025 at 10:02 AM Antoine Pitrou  wrote:

>
> Hi Andrew,
>
> I haven't heard of anything like this for C++, but it is an intriguing
> idea.
>
> Regards
>
> Antoine.
>
>
> On Tue, 16 Sep 2025 16:44:14 -0400
> Andrew Lamb 
> wrote:
> > Has anyone spent time optimizing the thrift decoder (e.g. not just use
> > whatever a general purpose thrift compiler generates, but custom code a
> > parser just for Parquet metadata)?
> >
> > Ed is in the process of implementing just such a decoder in arrow-rs[1]
> and
> > has seen a 2-3x performance improvement (with no change to the format) in
> > > early benchmark results. This is in line with our earlier work on the
> > topic[2] where we estimated there is a 2-4x performance improvement with
> > implementation improvements alone.
> >
> > Andrew
> >
> > [1]: https://github.com/apache/arrow-rs/issues/5854
> > [2]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
> >
> > On Tue, Sep 16, 2025 at 4:12 AM Antoine Pitrou <
> [email protected]> wrote:
> >
> > >
> > > Hi again,
> > >
> > > Ok, a quick summary of my current feedback on this:
> > >
> > > - decoding speed measurements are given, but not footer size
> > >   measurements; it would be interesting to have both
> > >
> > > - it's not obvious whether the stated numbers are for reading all
> > >   columns or a subset of them
> > >
> > > - optional LZ4 compression is mentioned, but no numbers are given for
> > >   it; it would be nice if numbers were available for both uncompressed
> > >   and compressed footers
> > >
> > > - the numbers seem quite underwhelming currently, I think most of us
> > >   were expecting massive speed improvements given past discussions
> > >
> > > - I'm firmly against narrowing sizes to 32 bits; making the footer more
> > >   compact is useful, but not to the point of reducing usefulness or
> > >   generality
> > >
> > >
> > > A more general proposal: given the slightly underwhelming perf
> > > numbers, has nested Flatbuffers been considered as an alternative?
> > >
> > > For example, the RowGroup table could become:
> > > ```
> > > table ColumnChunk {
> > >   file_path: string;
> > >   meta_data: ColumnMetadata;
> > >   // etc.
> > > }
> > >
> > > struct EncodedColumnChunk {
> > > >   // Flatbuffers-encoded ColumnChunk, to be decoded/validated individually
> > >   column: [ubyte];
> > > }
> > >
> > > table RowGroup {
> > >   columns: [EncodedColumnChunk];
> > >   total_byte_size: int;
> > >   num_rows: int;
> > >   sorting_columns: [SortingColumn];
> > >   file_offset: long;
> > >   total_compressed_size: int;
> > >   ordinal: short = null;
> > > }
> > > ```
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> > > On Thu, 11 Sep 2025 08:41:34 +0200
> > > Alkis Evlogimenos
> > > 
> > > wrote:
> > > > Hi all. I am sharing as a separate thread the proposal for the footer
> > > > change we have been working on:
> > > >
> > >
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
>
> > > > .
> > > >
> > > > The proposal outlines the technical aspects of the design and the
> > > > > experimental results of shadow testing this in production workloads. I
> > > > > would like to discuss the proposal's most salient points in the next sync:
> > > > 1. the use of flatbuffers as footer serialization format
> > > > > 2. the additional limitations imposed on parquet files (row group size
> > > > > limit, row group max num row limit)
> > > >
> > 

Re: [DISCUSS] flatbuf footer

2025-09-20 Thread Antoine Pitrou


Hi Andrew,

I haven't heard of anything like this for C++, but it is an intriguing
idea.

Regards

Antoine.


On Tue, 16 Sep 2025 16:44:14 -0400
Andrew Lamb 
wrote:
> Has anyone spent time optimizing the Thrift decoder (e.g. not just using
> whatever a general-purpose Thrift compiler generates, but hand-coding a
> parser just for Parquet metadata)?
> 
> Ed is in the process of implementing just such a decoder in arrow-rs[1] and
> has seen a 2-3x performance improvement (with no change to the format) in
> early benchmark results. This is in line with our earlier work on the
> topic[2] where we estimated there is a 2-4x performance improvement with
> implementation improvements alone.
> 
> Andrew
> 
> [1]: https://github.com/apache/arrow-rs/issues/5854
> [2]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
> 
> On Tue, Sep 16, 2025 at 4:12 AM Antoine Pitrou 
>  wrote:
> 
> >
> > Hi again,
> >
> > Ok, a quick summary of my current feedback on this:
> >
> > - decoding speed measurements are given, but not footer size
> >   measurements; it would be interesting to have both
> >
> > - it's not obvious whether the stated numbers are for reading all
> >   columns or a subset of them
> >
> > - optional LZ4 compression is mentioned, but no numbers are given for
> >   it; it would be nice if numbers were available for both uncompressed
> >   and compressed footers
> >
> > - the numbers seem quite underwhelming currently, I think most of us
> >   were expecting massive speed improvements given past discussions
> >
> > - I'm firmly against narrowing sizes to 32 bits; making the footer more
> >   compact is useful, but not to the point of reducing usefulness or
> >   generality
> >
> >
> > A more general proposal: given the slightly underwhelming perf
> > numbers, has nested Flatbuffers been considered as an alternative?
> >
> > For example, the RowGroup table could become:
> > ```
> > table ColumnChunk {
> >   file_path: string;
> >   meta_data: ColumnMetadata;
> >   // etc.
> > }
> >
> > struct EncodedColumnChunk {
> >   // Flatbuffers-encoded ColumnChunk, to be decoded/validated individually
> >   column: [ubyte];
> > }
> >
> > table RowGroup {
> >   columns: [EncodedColumnChunk];
> >   total_byte_size: int;
> >   num_rows: int;
> >   sorting_columns: [SortingColumn];
> >   file_offset: long;
> >   total_compressed_size: int;
> >   ordinal: short = null;
> > }
> > ```
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > On Thu, 11 Sep 2025 08:41:34 +0200
> > Alkis Evlogimenos
> > 
> > wrote:  
> > > Hi all. I am sharing as a separate thread the proposal for the footer
> > > change we have been working on:
> > >  
> > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> >   
> > > .
> > >
> > > The proposal outlines the technical aspects of the design and the
> > > experimental results of shadow testing this in production workloads. I
> > > would like to discuss the proposal's most salient points in the next sync:
> > > 1. the use of flatbuffers as footer serialization format
> > > 2. the additional limitations imposed on parquet files (row group size
> > > limit, row group max num row limit)
> > >
> > > I would prefer comments on the google doc to facilitate async discussion.
> > >
> > > Thank you,
> > >  
> >
> >
> >
> >  
> 





Re: [DISCUSS] flatbuf footer

2025-09-17 Thread Gang Wu
I just found this thread went to my spam folder so I just want to bump
it up before reading the details.

On Thu, Sep 11, 2025 at 2:42 PM Alkis Evlogimenos
 wrote:
>
> Hi all. I am sharing as a separate thread the proposal for the footer
> change we have been working on:
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> .
>
> The proposal outlines the technical aspects of the design and the
> experimental results of shadow testing this in production workloads. I
> would like to discuss the proposal's most salient points in the next sync:
> 1. the use of flatbuffers as footer serialization format
> 2. the additional limitations imposed on parquet files (row group size
> limit, row group max num row limit)
>
> I would prefer comments on the google doc to facilitate async discussion.
>
> Thank you,


Re: [DISCUSS] flatbuf footer

2025-09-16 Thread Andrew Lamb
Has anyone spent time optimizing the Thrift decoder (e.g. not just using
whatever a general-purpose Thrift compiler generates, but hand-coding a
parser just for Parquet metadata)?

Ed is in the process of implementing just such a decoder in arrow-rs[1] and
has seen a 2-3x performance improvement (with no change to the format) in
early benchmark results. This is in line with our earlier work on the
topic[2] where we estimated there is a 2-4x performance improvement with
implementation improvements alone.

Andrew

[1]: https://github.com/apache/arrow-rs/issues/5854
[2]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/

On Tue, Sep 16, 2025 at 4:12 AM Antoine Pitrou  wrote:

>
> Hi again,
>
> Ok, a quick summary of my current feedback on this:
>
> - decoding speed measurements are given, but not footer size
>   measurements; it would be interesting to have both
>
> - it's not obvious whether the stated numbers are for reading all
>   columns or a subset of them
>
> - optional LZ4 compression is mentioned, but no numbers are given for
>   it; it would be nice if numbers were available for both uncompressed
>   and compressed footers
>
> - the numbers seem quite underwhelming currently, I think most of us
>   were expecting massive speed improvements given past discussions
>
> - I'm firmly against narrowing sizes to 32 bits; making the footer more
>   compact is useful, but not to the point of reducing usefulness or
>   generality
>
>
> A more general proposal: given the slightly underwhelming perf
> numbers, has nested Flatbuffers been considered as an alternative?
>
> For example, the RowGroup table could become:
> ```
> table ColumnChunk {
>   file_path: string;
>   meta_data: ColumnMetadata;
>   // etc.
> }
>
> struct EncodedColumnChunk {
>   // Flatbuffers-encoded ColumnChunk, to be decoded/validated individually
>   column: [ubyte];
> }
>
> table RowGroup {
>   columns: [EncodedColumnChunk];
>   total_byte_size: int;
>   num_rows: int;
>   sorting_columns: [SortingColumn];
>   file_offset: long;
>   total_compressed_size: int;
>   ordinal: short = null;
> }
> ```
>
> Regards
>
> Antoine.
>
>
>
> On Thu, 11 Sep 2025 08:41:34 +0200
> Alkis Evlogimenos
> 
> wrote:
> > Hi all. I am sharing as a separate thread the proposal for the footer
> > change we have been working on:
> > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> > .
> >
> > The proposal outlines the technical aspects of the design and the
> > experimental results of shadow testing this in production workloads. I
> > would like to discuss the proposal's most salient points in the next sync:
> > 1. the use of flatbuffers as footer serialization format
> > 2. the additional limitations imposed on parquet files (row group size
> > limit, row group max num row limit)
> >
> > I would prefer comments on the google doc to facilitate async discussion.
> >
> > Thank you,
> >
>
>
>
>


Re: [DISCUSS] flatbuf footer

2025-09-16 Thread Antoine Pitrou


Hi again,

Ok, a quick summary of my current feedback on this:

- decoding speed measurements are given, but not footer size
  measurements; it would be interesting to have both

- it's not obvious whether the stated numbers are for reading all
  columns or a subset of them

- optional LZ4 compression is mentioned, but no numbers are given for
  it; it would be nice if numbers were available for both uncompressed
  and compressed footers

- the numbers seem quite underwhelming currently, I think most of us
  were expecting massive speed improvements given past discussions

- I'm firmly against narrowing sizes to 32 bits; making the footer more
  compact is useful, but not to the point of reducing usefulness or
  generality


A more general proposal: given the slightly underwhelming perf
numbers, has nested Flatbuffers been considered as an alternative?

For example, the RowGroup table could become:
```
table ColumnChunk {
  file_path: string;
  meta_data: ColumnMetadata;
  // etc.
}

struct EncodedColumnChunk {
  // Flatbuffers-encoded ColumnChunk, to be decoded/validated individually
  column: [ubyte];
}

table RowGroup {
  columns: [EncodedColumnChunk];
  total_byte_size: int;
  num_rows: int;
  sorting_columns: [SortingColumn];
  file_offset: long;
  total_compressed_size: int;
  ordinal: short = null;
}
```
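
To make the intent of the nested layout concrete, here is a minimal Python
sketch of deferred per-column decoding. It is only an illustration under
stated assumptions: JSON stands in for the FlatBuffers encoding of each
ColumnChunk, and the names LazyRowGroup, encode_column_chunk,
decode_column_chunk, path_in_schema and num_values are hypothetical and not
taken from the proposal. One schema detail to keep in mind: FlatBuffers
structs can only hold fixed-size data (scalars and other structs), so a
wrapper carrying a [ubyte] vector would likely need to be declared as a table.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


def encode_column_chunk(meta: dict) -> bytes:
    # Stand-in (assumption) for FlatBuffers serialization of one ColumnChunk.
    return json.dumps(meta, sort_keys=True).encode("utf-8")


def decode_column_chunk(blob: bytes) -> dict:
    # Stand-in (assumption) for verifying and decoding one nested buffer.
    return json.loads(blob.decode("utf-8"))


@dataclass
class LazyRowGroup:
    columns: List[bytes]  # one opaque encoded blob per column chunk
    num_rows: int
    decoded: Dict[int, dict] = field(default_factory=dict)

    def column(self, index: int) -> dict:
        # Decode a column chunk on first access only; other blobs stay raw.
        if index not in self.decoded:
            self.decoded[index] = decode_column_chunk(self.columns[index])
        return self.decoded[index]


# A wide row group: 1000 columns, of which the reader needs just one.
row_group = LazyRowGroup(
    columns=[
        encode_column_chunk({"path_in_schema": f"c{i}", "num_values": 1000})
        for i in range(1000)
    ],
    num_rows=1000,
)
wanted = row_group.column(42)
print(wanted["path_in_schema"], len(row_group.decoded))  # c42 1
```

The point of the sketch is that decoding and validation cost scales with the
number of columns actually touched rather than with the width of the table.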

Regards

Antoine.



On Thu, 11 Sep 2025 08:41:34 +0200
Alkis Evlogimenos

wrote:
> Hi all. I am sharing as a separate thread the proposal for the footer
> change we have been working on:
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> .
> 
> The proposal outlines the technical aspects of the design and the
> experimental results of shadow testing this in production workloads. I
> would like to discuss the proposal's most salient points in the next sync:
> 1. the use of flatbuffers as footer serialization format
> 2. the additional limitations imposed on parquet files (row group size
> limit, row group max num row limit)
> 
> I would prefer comments on the google doc to facilitate async discussion.
> 
> Thank you,
> 





Re: [DISCUSS] flatbuf footer

2025-09-15 Thread Steve Loughran
I commented, mostly on the must/may/shall section: it's just as important to
call out the MUST NOT requirements.

I'm worried about the "should not substantially degrade performance of old
readers" requirement. I'd put that in the MUST NOT group and define
"substantially". If this slows down existing readers beyond a slightly larger
end-of-file range to read before parsing, it won't be welcome and so is less
likely to be adopted.

I also added a security requirement; maybe it should have its own section,
primarily as a matter of due diligence, in which illegal/invalid values are
discussed, such as references to different columns referring to overlapping
files. But add that clients are NOT required to check this where the check
is expensive.

It would be good for all readers to add an option to validate the thrift
and flatbuf footers to make sure they are consistent, to stop somebody trying
to sneak something malicious deeper into the pipeline where they know that
the front end only checks the thrift values. A full scan of the whole
footer for consistency of offsets again has to be an option. What does
matter is that if my code reads a file from an untrusted source which does
have an inconsistent footer (columns declared as overlapping) this is not
going to generate any exploit. You'd make full-footer-validation part of
the process for ingress of external sources, and from then on consider it
well-formed and consistent across all runtimes.
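
As a rough illustration of what such an optional full-footer validation could
look like, here is a small Python sketch. It assumes "overlapping" means
overlapping byte ranges within the file, and that each footer can be reduced
to a list of (column, offset, length) entries; ChunkRange,
find_range_problems and footers_agree are hypothetical names, not part of any
Parquet implementation.

```python
from typing import List, NamedTuple


class ChunkRange(NamedTuple):
    column: str
    offset: int  # byte offset of the column chunk in the file
    length: int  # size of the column chunk in bytes


def find_range_problems(chunks: List[ChunkRange], file_size: int) -> List[str]:
    # Sweep over chunks sorted by offset, tracking the furthest end seen so
    # far. Each overlapping chunk is reported once, against the earlier chunk
    # that reaches furthest, rather than against every chunk it overlaps.
    problems = []
    max_end, owner = 0, None
    for chunk in sorted(chunks, key=lambda c: (c.offset, c.length)):
        end = chunk.offset + chunk.length
        if chunk.offset < 0 or chunk.length < 0 or end > file_size:
            problems.append(f"{chunk.column} out of bounds")
            continue
        if owner is not None and chunk.offset < max_end:
            problems.append(f"{owner} overlaps {chunk.column}")
        if end > max_end:
            max_end, owner = end, chunk.column
    return problems


def footers_agree(thrift_chunks: List[ChunkRange],
                  flatbuf_chunks: List[ChunkRange]) -> bool:
    # Both footers must describe exactly the same set of chunk ranges.
    return sorted(thrift_chunks) == sorted(flatbuf_chunks)


thrift = [ChunkRange("a", 4, 100), ChunkRange("b", 104, 50)]
flatbuf = [ChunkRange("a", 4, 100), ChunkRange("b", 90, 50)]  # tampered copy

print(footers_agree(thrift, flatbuf))     # False
print(find_range_problems(thrift, 200))   # []
print(find_range_problems(flatbuf, 200))  # ['a overlaps b']
```

The sort makes this O(n log n) in the number of column chunks, which is why
it fits better as an ingress-time option than as a check on every read.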

Steve

(why yes, I am getting more into cybersecurity :)




On Thu, 11 Sept 2025 at 07:43, Alkis Evlogimenos
 wrote:

> Hi all. I am sharing as a separate thread the proposal for the footer
> change we have been working on:
>
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> .
>
> The proposal outlines the technical aspects of the design and the
> experimental results of shadow testing this in production workloads. I
> would like to discuss the proposal's most salient points in the next sync:
> 1. the use of flatbuffers as footer serialization format
> 2. the additional limitations imposed on parquet files (row group size
> limit, row group max num row limit)
>
> I would prefer comments on the google doc to facilitate async discussion.
>
> Thank you,
>


Re: [DISCUSS] flatbuf footer

2025-09-15 Thread Antoine Pitrou


Hello,

I haven't read everything in detail yet, but I'm going to say upfront
that I'm -1 on limiting sizes to 32 bits rather than the current 64
bits, unless it brings really sizable benefits (which I doubt, given
the affected fields).
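
For a sense of scale, a tiny Python sketch of what the narrowing implies,
assuming the affected fields become signed 32-bit integers (which is what
FlatBuffers `int` is). Which fields are affected is defined in the proposal
document and is not restated here; fits_int32 is a hypothetical helper.

```python
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit field
INT64_MAX = 2**63 - 1  # largest value of a signed 64-bit field
GIB = 2**30


def fits_int32(size_in_bytes: int) -> bool:
    # True if a non-negative size can be stored in a signed 32-bit field.
    return 0 <= size_in_bytes <= INT32_MAX


print(INT32_MAX / GIB)      # just under 2.0, i.e. a limit of about 2 GiB
print(fits_int32(3 * GIB))  # False: a 3 GiB size would not be representable
print(3 * GIB <= INT64_MAX) # True: the current 64-bit fields can hold it
```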

Regards

Antoine.


On Thu, 11 Sep 2025 08:41:34 +0200
Alkis Evlogimenos

wrote:
> Hi all. I am sharing as a separate thread the proposal for the footer
> change we have been working on:
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> .
> 
> The proposal outlines the technical aspects of the design and the
> experimental results of shadow testing this in production workloads. I
> would like to discuss the proposal's most salient points in the next sync:
> 1. the use of flatbuffers as footer serialization format
> 2. the additional limitations imposed on parquet files (row group size
> limit, row group max num row limit)
> 
> I would prefer comments on the google doc to facilitate async discussion.
> 
> Thank you,
>