Re: [DISCUSS] flatbuf footer: offsets

2025-11-03 Thread Alkis Evlogimenos
Assuming LZ4 decompression at 2 GB/s (per core) and network bandwidth at
1 GB/s, and taking as an example the 367 MB Thrift footer from the proposal,
the tradeoff is as follows:
T = Thrift, F32 = flatbuf with 32-bit offsets, F64 = flatbuf with 64-bit offsets

T (367 MB): 50 ms latency + 370 ms transfer --> 420 ms (ignoring parse time)
F32 (113 MB raw / 50 MB LZ4): 50 ms latency + 50 ms transfer + 56 ms decompression --> 156 ms
F64 (155 MB raw / 52 MB LZ4): 50 ms latency + 52 ms transfer + 78 ms decompression --> 180 ms
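
As a sanity check, a quick sketch of the arithmetic above (Python; the
latency, throughput, and size figures are the assumptions stated in this
thread, not measurements):

    def footer_fetch_ms(raw_mb, wire_mb, latency_ms=50, net_gb_s=1.0, lz4_gb_s=2.0):
        # 1 GB/s == 1 MB/ms, so dividing megabytes by GB/s yields milliseconds.
        transfer_ms = wire_mb / net_gb_s
        decompress_ms = raw_mb / lz4_gb_s if wire_mb != raw_mb else 0.0
        return latency_ms + transfer_ms + decompress_ms

    print(footer_fetch_ms(367, 367))  # T:   ~417 ms (420 ms above, rounded)
    print(footer_fetch_ms(113, 50))   # F32: ~156 ms
    print(footer_fetch_ms(155, 52))   # F64: ~180 ms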

Going with 64-bit offsets leaves some performance on the table, and it will
make LZ4 compression pretty much required for most footers above 256 KB.
That said, 64-bit offsets are still much faster to transfer than Thrift,
even ignoring Thrift's horrendous parse times.

For simplicity I am still slightly in favor of 64-bit offsets, but I am open
to arguments for 32-bit relative offsets plus alignment, which would bring
the maximum row group size to 64 GB.

Thoughts?


Re: [DISCUSS] flatbuf footer: offsets

2025-10-28 Thread Antoine Pitrou


Hi,

I expect LZ4 to be optional, but enabled by default by most writers.
LZ4 decompression is extremely fast, typically several GB/s on a modern
CPU.

Regards

Antoine.


Re: [DISCUSS] flatbuf footer: offsets

2025-10-27 Thread Jan Finis
You are right that even without LZ4, we would still need I/O for the whole
footer. And I guess LZ4 is way faster than Thrift, so flatbuf+LZ4 would be
an improvement over Thrift. If you want superb partial decoding, we would
indeed need to somehow support reading only part of the footer from
storage. In the end, it's a trade-off. The more flexibility we want w.r.t.
partial reads, the more complexity we have to introduce. Maybe flatbuf
alone is already the sweet spot here and we shouldn't introduce additional
complexity. LZ4 compression would after all still be optional, right?

Someone mentioned that they have footers with millions of columns. Maybe
they should comment on how much partial reading would be required for their
use case. I guess the answer will be "the more support for partial
reading/decoding the better".

You could argue that if you have such a wide file, you should just not use
LZ4, and that's probably a valid argument.

Cheers,
Jan



Re: [DISCUSS] flatbuf footer: offsets

2025-10-27 Thread Antoine Pitrou


Hmmm... does it?

I may be mistaken, but I had the impression that what you call "read
only the parts of the footer I'm interested in" is actually "*decode*
only the parts of the footer I'm interested in".

That is, you still read the entire footer, which is a larger IO than
doing smaller reads, but it's also a single IO rather than several
smaller ones.

Of course, if we want to make things more flexible, we can have
individual Flatbuffers metadata pieces for each column, each
LZ4-compressed. And embed two sizes at the end of the file: the size of
the "core footer" metadata (without columns) and the size of the "full
footer" metadata (with columns); so that readers can choose their
preferred strategy.
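
A minimal sketch of what such a trailer could look like (the layout, field
order, and magic below are hypothetical, purely for illustration):

    import os
    import struct

    MAGIC = b"PARX"  # placeholder, not the actual footer magic

    def read_footer(f, want_columns):
        # Assumed layout at the end of the file:
        #   ...per-column pieces... | core footer | core_size:u32 | full_size:u32 | MAGIC
        # with the core footer sitting at the tail of the full footer region.
        f.seek(-12, os.SEEK_END)
        core_size, full_size = struct.unpack("<II", f.read(8))
        assert f.read(4) == MAGIC
        size = full_size if want_columns else core_size
        f.seek(-(12 + size), os.SEEK_END)
        return f.read(size)  # then LZ4-decompress and parse as Flatbuffers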

Regards

Antoine.


Re: [DISCUSS] flatbuf footer: offsets

2025-10-25 Thread Jan Finis
Note that LZ4 compression destroys the whole "I can read only the parts of
the footer I'm interested in", so I wouldn't say that LZ4 can be the
solution to everything.

Cheers,
Jan


Re: [DISCUSS] flatbuf footer: offsets

2025-10-25 Thread Antoine Pitrou

That's an interesting suggestion. I would be fine with it personally,
provided the multiplier is either large enough (say, 64) or embedded in
the footer.

That said, I would first wait for the outcome of the experiment with
LZ4 compression. If it negates the additional cost of 64-bit offsets,
then we should not bother with this multiplier mechanism.

Regards

Antoine.


Re: [DISCUSS] flatbuf footer: offsets

2025-10-24 Thread Julien Le Dem
I had an idea about this topic.
What if we say the offset is always a multiple of 16? (I'm saying 16, but
it works with 8 or 32 or any other power of 2).
Then we store in the footer the offset divided by 16.
That means you need to pad each row group by up to 16 bytes.
But now the max size of the file is 32 GB.

Personally, I still don't like having arbitrary limits, but 32 GB seems a lot
less like a restricting limit than 2 GB.
If we get crazy, we add this to the footer as metadata, and the writer gets
to pick whether offsets are multiplied by 32, 64 or 128 if, ten years from
now, we start having much bigger files.
The size of the padding becomes negligible relative to the size of the file.
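
A sketch of the scheme (assuming the footer field stays a signed 32-bit
integer and the multiplier is a fixed 16; names are illustrative):

    ALIGN = 16  # the multiplier; could be 8, 32, ... or recorded in the footer itself

    def encode_offset(absolute):
        # The writer pads each row group so it starts on an ALIGN-byte boundary,
        # then stores the offset divided by ALIGN in the footer.
        assert absolute % ALIGN == 0
        scaled = absolute // ALIGN
        assert scaled < 2**31  # still fits a signed 32-bit footer field
        return scaled

    def decode_offset(scaled):
        # Maximum addressable position: 2**31 * 16 bytes = 32 GiB.
        return scaled * ALIGN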

Thoughts?


Re: [DISCUSS] flatbuf footer: offsets

2025-10-21 Thread Alkis Evlogimenos
We've analyzed a large footer from our production environment to understand
byte distribution across its fields. The detailed analysis is available in
the proposal document here:
https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.o2lsuuyi8rw6#heading=h.26i914tjp4fk
.

To illustrate the impact of 64-bit fields, we conducted an experiment where
all proposed 32-bit fields in the Flatbuf footer were changed to 64-bit.
This resulted in a *40% increase* in footer size.

That said, LZ4 manages to compress this away. We will do some more testing
with 64-bit offsets/numvals/sizes and report back. If it all goes well, we
can resolve this by going 64-bit.
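
For anyone who wants to reproduce the size comparison on their own footers,
something along these lines should do (using the third-party lz4 Python
package, `pip install lz4`):

    import lz4.frame

    def footer_sizes(footer_bytes):
        raw = len(footer_bytes)
        compressed = len(lz4.frame.compress(footer_bytes))
        return raw, compressed, compressed / raw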


Re: [DISCUSS] flatbuf footer: offsets

2025-10-18 Thread Adam Reeve
Hi Alkis

Thanks for all your work on this proposal.

I'd be in favour of keeping the offsets as i64 and not reducing the maximum
row group size, even if this results in slightly larger footers. I've heard
from some of our users within G-Research that they do have files with row
groups > 2 GiB. This is often when they use lower-level APIs to write
Parquet that don't automatically split data into row groups, and they
either write a single row group for simplicity or have some logical
partitioning of data into row groups. They might also have wide tables with
many columns, or wide array/tensor valued columns that lead to large row
groups.

In many workflows we don't read Parquet with a query engine that supports
filters and skipping row groups, but just read all rows, or directly
specify the row groups to read if there is some known logical partitioning
into row groups. I'm sure we could work around a 2 or 4 GiB row group size
limitation if we had to, but it's a new constraint that reduces the
flexibility of the format and makes more work for users who now need to
ensure they don't hit this limit.

Do you have any measurements of how much of a difference 4 byte offsets
make to footer sizes in your data, with and without the optional LZ4
compression?

Thanks,
Adam


Re: [DISCUSS] flatbuf footer: offsets

2025-10-17 Thread Jan Finis
Hi Alkis,

one more very simple argument why you want these offsets to be i64:
What if you want to store a single value larger than 4 GB? I know this
sounds absurd at first, but some use cases might want to store data that
can sometimes be very large (e.g. blob data, or insanely complex geo data).
And it would be a shame if that would mean that they cannot use Parquet at
all.

Thus, my opinion here is that we can limit to i32 all fields that the file
writer has under control, e.g., the number of rows within a row group, but
we shouldn't limit any values that a file writer doesn't have under
control, as they fully depend on the input data.

Note though that this means that the number of values in a column chunk
could also exceed the i32 range, if a user has nested data with more than 4
billion entries. With such data, the file writer again couldn't do anything
to avoid writing a row group with more values than an i32 can hold, as a
single row may not span multiple row groups. That being said, I think that
nested data with more than 4 billion entries is less likely than a single
large blob of 4 billion bytes.

I know that smaller row groups is what most / all engines prefer, but we
have to make sure the format also works for edge cases.

Cheers,
Jan


[DISCUSS] flatbuf footer: offsets

2025-10-14 Thread Alkis Evlogimenos
Hi all,

From the comments on the [EXTERNAL] Parquet metadata
<https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0>
document,
it appears there's a general consensus on most aspects, with the exception
of the relative 32-bit offsets for column chunks.

I'm starting this thread to discuss this topic further and work towards a
resolution. Adam Reeve suggested raising the limitation to 2^32, and he
confirmed that Java does not have any issues with this. I am open to this
change as it increases the limit without introducing any drawbacks.
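
To make the scheme concrete, a minimal illustration of how relative 32-bit
offsets would resolve (the names are mine, not the proposal's schema):

    def column_chunk_position(row_group_file_offset, relative_offset):
        # Row group start offsets remain 64-bit; each column chunk stores a
        # 32-bit offset relative to its row group, so the 2^32 limit applies
        # to the row group size, not to the file size.
        assert 0 <= relative_offset < 2**32
        return row_group_file_offset + relative_offset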

However, some still feel that a 2^32-byte limit for a row group is too
restrictive. I'd like to understand these specific use cases better. From
my perspective, for most engines, the row group is the primary unit of
skipping, making very large row groups less desirable. In our fleet's
workloads, it's rare to see row groups larger than 100 MB, as anything
larger tends to make statistics-based skipping ineffective.

Cheers,