Re: Wide tables in V4

2025-06-13 Thread Péter Váry
Hi everyone,

I did some experiments with splitting up wide Parquet files into multiple
column families. You can check the PR here:
https://github.com/apache/iceberg/pull/13306.

What the test does:

   - Creates tables with 100/1000/10000 columns, where the column type is
   double
   - Generates random data into these columns
   - Reads and writes records to these tables
  - Using the current implementation of the reader/writer
  - Using multiple (2, 5, 10) column families
  - Using single and multiple threads for parallelizing the
  reading/writing when there are multiple column families

I have used my local machine to run the tests, and used my local disk to
store the files.
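For reference, the harness is shaped roughly like the following minimal JMH
sketch (the parameters mirror the result table below; the real code lives in
the PR above, and readAllRows() here is only a hypothetical stand-in for the
Iceberg reader calls):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.SingleShotTime)
public class WideTableBenchmarkSketch {
  @Param({"100", "1000", "10000"}) public int columns;      // table width
  @Param({"0", "1", "2", "5", "10"}) public int families;   // 0 = current single-file reader/writer
  @Param({"true", "false"}) public boolean multiThreaded;   // parallelize across column families

  @Benchmark
  public void read(Blackhole bh) {
    bh.consume(readAllRows(columns, families, multiThreaded));
  }

  private long readAllRows(int columns, int families, boolean multiThreaded) {
    // open one Parquet reader per column family (or a single reader when families == 0),
    // optionally on a thread pool, and zip the per-family iterators back into full rows
    return 0L; // placeholder for the actual Iceberg reader calls
  }
}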

Here is what I have learned:

   - I had to be very strict about random data generation: if I reused any
   records, the Parquet writer exploited the duplication to decrease the size
   of the data files. This was especially prominent when I had a high number
   of column families. It highlights the possible gains coming from vertically
   splitting the tables, but I wanted to avoid it in testing as it is very
   dependent on the use-case
   - Reading performance gains kick in after a few hundred columns. We even
   gain them when the reading is not parallelized. Parallelization does not
   add much on top - most probably my environment is IO bound
   - Write performance gains kick in sooner, but need parallelization to
   avoid losing performance. The gains are much more substantial - as more
   CPUs are used for compression.

In these tests, *I have seen more than 15 percent read performance
improvements, and up to 50 percent write improvements*.

For me this signals that vertically splitting wide tables into multiple files
could help us use more of the available CPU/IO, and could deliver substantial
gains on top of the functional features it makes possible.

If you think we should test different scenarios, feel free to use the code
in the PR, or share your thoughts.

Thanks,
Peter

The results in detail:

   - families:0, multiThreaded:true can be ignored (these runs were skipped)
   - values produced by the current readers/writers are marked with *bold*


Benchmark                            (columns)  (families)  (multiThreaded)  Mode  Cnt    Score    Error  Units
MultiThreadedParquetBenchmark.read         100           0             true    ss   20   ≈ 10⁻⁷            s/op
*MultiThreadedParquetBenchmark.read        100           0            false    ss   20    3.739 ±  0.096   s/op*
MultiThreadedParquetBenchmark.read         100           1             true    ss   20    3.883 ±  0.062   s/op
MultiThreadedParquetBenchmark.read         100           1            false    ss   20    3.968 ±  0.070   s/op
MultiThreadedParquetBenchmark.read         100           2             true    ss   20    4.063 ±  0.080   s/op
MultiThreadedParquetBenchmark.read         100           2            false    ss   20    4.036 ±  0.082   s/op
MultiThreadedParquetBenchmark.read         100           5             true    ss   20    4.093 ±  0.083   s/op
MultiThreadedParquetBenchmark.read         100           5            false    ss   20    4.090 ±  0.070   s/op
MultiThreadedParquetBenchmark.read         100          10             true    ss   20    4.267 ±  0.087   s/op
MultiThreadedParquetBenchmark.read         100          10            false    ss   20    4.206 ±  0.075   s/op
MultiThreadedParquetBenchmark.read        1000           0             true    ss   20   ≈ 10⁻⁷            s/op
*MultiThreadedParquetBenchmark.read       1000           0            false    ss   20    5.276 ±  0.408   s/op*
MultiThreadedParquetBenchmark.read        1000           1             true    ss   20    5.202 ±  0.403   s/op
MultiThreadedParquetBenchmark.read        1000           1            false    ss   20    5.224 ±  0.397   s/op
MultiThreadedParquetBenchmark.read        1000           2             true    ss   20    4.881 ±  0.281   s/op
MultiThreadedParquetBenchmark.read        1000           2            false    ss   20    4.794 ±  0.295   s/op
MultiThreadedParquetBenchmark.read        1000           5             true    ss   20    5.096 ±  0.259   s/op
MultiThreadedParquetBenchmark.read        1000           5            false    ss   20    5.181 ±  0.288   s/op
MultiThreadedParquetBenchmark.read        1000          10             true    ss   20    5.408 ±  0.252   s/op
MultiThreadedParquetBenchmark.read        1000          10            false    ss   20    5.336 ±  0.185   s/op
MultiThreadedParquetBenchmark.read       10000           0             true    ss   20   ≈ 10⁻⁷            s/op
*MultiThreadedParquetBenchmark.read      10000           0            false    ss   20    8.692 ±  1.246   s/op*
MultiThreadedParquetBenchmark.read       10000           1             true    ss   20    8.337 ±  0.415   s/op
MultiThreadedParquetBenchmark.read       10000           1            false    ss   20    9.073 ±  0.864   s/op
MultiThreadedParquetBenchmark.r

Re: Wide tables in V4

2025-06-06 Thread Micah Kornfield
At a high level, we should probably work out whether supporting wide tables
with performant appends is something we want to invest effort into, and focus
on the lower-level questions once that is resolved.  I think it would be great
to make this work; the main question is whether any PMC/community members feel
it would introduce too much complexity to proceed with further design/analysis.

Some more detailed replies to what has been discussed in the thread:

I might be wrong, but page skipping relies on page headers which are stored
> in-line with the data itself. When downloading data from blob stores this
> could be less than ideal.


No, Parquet supports page indices [1].  I think it is reasonable to also think
about improvements to Parquet for large blobs so these can be handled better.
There is also general interest in evolving Parquet to better support some of
these use-cases, so if there are clear items that can be pushed down to the
file level, let's have those conversations.
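For illustration, a minimal sketch with parquet-java (1.11+) showing that the
column and offset indexes are read from a dedicated footer section, without
touching the in-line page headers (the file path argument is just an example):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.internal.column.columnindex.ColumnIndex;
import org.apache.parquet.internal.column.columnindex.OffsetIndex;

public class PageIndexSketch {
  public static void main(String[] args) throws IOException {
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
      for (BlockMetaData rowGroup : reader.getFooter().getBlocks()) {
        for (ColumnChunkMetaData column : rowGroup.getColumns()) {
          ColumnIndex columnIndex = reader.readColumnIndex(column);   // per-page min/max and null counts
          OffsetIndex offsetIndex = reader.readOffsetIndex(column);   // per-page offsets and first row indexes
          // a reader can use these to pick exactly the pages it needs for a row range
        }
      }
    }
  }
}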


> Would it not be more a data file/parquet "issue" ? Especially with the
> data file API you are proposing, I think Iceberg should "delegate" to
> the data file layer (Parquet here) and Iceberg could be "agnostic".


I think we should maybe table the discussion on exactly what belongs in each
layer until we have more data.  Roughly, the concerns expressed boil down
into a few main buckets:

1.  Read vs Write amplification (I think one could run some rough
experiments with low-level Parquet APIs to see the impact of splitting out
columns into individual objects to answer both sides of this).
 - For large blobs, I think memory pressure becomes a real concern here
as well.

2.  Complexity:
 - If multiple files are needed for performance, what advantages do we
gain from having effectively two-level manifests?  What does it do to
Iceberg metadata to have to track both (V4 is actually a great place to
look at this since it seems like we are looking at major metadata overhauls
anyway; if it is too ambitious we can perhaps postpone some of the work to
v5)?
 - What are the implications for things like time-travel, maintenance,
etc. in these cases?  I would guess this probably needs a little bit more
detailed design considering the two options (pushing some concerns down to
Parquet vs handling everything in Iceberg metadata).


Some of the complexity questions can be answered by prototyping the APIs
necessary to make this work. Specifically, I think we would at least need:

1.  A `newAppendColumns` API added to the transaction [2].
Lance's APIs might provide some inspiration [3] here.
2a.  New abstractions to handle columns for the same rows split across
files.
2b.  New file-level APIs for
   - Appending columns
   - Deleting files (if we decide on multiple files for a row-range
and it is pushed down to the file level, the deletion logic needs to be
delegated to the file level as well).

Items 2a/2b depend on the ultimate approach taken, but trying to sketch
these out, and how they relate to the transaction API, might help inform the
decision on complexity.
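To make item 1 concrete, here is a purely hypothetical sketch of what such an
operation could look like; `newAppendColumns`/`AppendColumns` do not exist in
Iceberg today, and the method names below are only illustrative:

import java.util.List;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.PendingUpdate;
import org.apache.iceberg.Snapshot;

// Hypothetical sketch only: illustrates what a column-append operation against
// existing row ranges might need to express.
interface AppendColumns extends PendingUpdate<Snapshot> {
  // the new column-family file carrying the added columns
  AppendColumns appendFile(DataFile columnFamilyFile);

  // the existing data files whose rows the new columns line up with, in row order
  AppendColumns toRowsOf(List<DataFile> existingFiles);
}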

Other feature interactions probably need a more careful analysis when
proposing the spec changes.

Cheers,
Micah

[1] https://parquet.apache.org/docs/file-format/pageindex/
[2]
https://iceberg.apache.org/javadoc/1.9.1/org/apache/iceberg/Transaction.html
[3]
https://lancedb.github.io/lance/introduction/schema_evolution.html#adding-new-columns

On Fri, Jun 6, 2025 at 10:39 AM Jean-Baptiste Onofré 
wrote:

> Hi Peter
>
> Thanks for your message. It's an interesting topic.
>
> Would it not be more a data file/parquet "issue" ? Especially with the
> data file API you are proposing, I think Iceberg should "delegate" to
> the data file layer (Parquet here) and Iceberg could be "agnostic".
>
> Regards
> JB
>
> On Mon, May 26, 2025 at 6:28 AM Péter Váry 
> wrote:
> >
> > Hi Team,
> >
> > In machine learning use-cases, it's common to encounter tables with a
> very high number of columns - sometimes even in the range of several
> thousand. I've seen cases with up to 15,000 columns. Storing such wide
> tables in a single Parquet file is often suboptimal, as Parquet can become
> a bottleneck, even when only a subset of columns is queried.
> >
> > A common approach to mitigate this is to split the data across multiple
> Parquet files. With the upcoming File Format API, we could introduce a
> layer that combines these files into a single iterator, enabling efficient
> reading of wide and very wide tables.
> >
> > To support this, we would need to revise the metadata specification.
> Instead of the current `_file` column, we could introduce a _files column
> containing:
> > - `_file_column_ids`: the column IDs present in each file
> > - `_file_path`: the path to the corresponding file
> >
> > Has there been any p

Re: Wide tables in V4

2025-06-06 Thread Jean-Baptiste Onofré
Hi Peter

Thanks for your message. It's an interesting topic.

Would it not be more a data file/parquet "issue" ? Especially with the
data file API you are proposing, I think Iceberg should "delegate" to
the data file layer (Parquet here) and Iceberg could be "agnostic".

Regards
JB

On Mon, May 26, 2025 at 6:28 AM Péter Váry  wrote:
>
> Hi Team,
>
> In machine learning use-cases, it's common to encounter tables with a very 
> high number of columns - sometimes even in the range of several thousand. 
> I've seen cases with up to 15,000 columns. Storing such wide tables in a 
> single Parquet file is often suboptimal, as Parquet can become a bottleneck, 
> even when only a subset of columns is queried.
>
> A common approach to mitigate this is to split the data across multiple 
> Parquet files. With the upcoming File Format API, we could introduce a layer 
> that combines these files into a single iterator, enabling efficient reading 
> of wide and very wide tables.
>
> To support this, we would need to revise the metadata specification. Instead 
> of the current `_file` column, we could introduce a _files column containing:
> - `_file_column_ids`: the column IDs present in each file
> - `_file_path`: the path to the corresponding file
>
> Has there been any prior discussion around this idea?
> Is anyone else interested in exploring this further?
>
> Best regards,
> Peter


Re: Wide tables in V4

2025-06-05 Thread Péter Váry
For the record, link from a user requesting this feature:
https://github.com/apache/iceberg/issues/11634

On Mon, Jun 2, 2025, 12:34 Péter Váry  wrote:

> Hi Bart,
>
> Thanks for your answer!
> I’ve pulled out some text from your thorough and well-organized response
> to make it easier to highlight my comments.
>
> > It would be well possible to tune parquet writers to write very large
> row groups when a large string column dominates. [..]
>
> What would you do, if there are more "optimal" sizes, let's say a string
> column where dictionary encoding could be optimal, and maybe some other
> differently sized columns?
>
> > You know that a row group is very large, so you might then shard it by
> row ranges. Each parallel reader would have to filter out the rows that
> weren't assigned to it. With Parquet page skipping, each reader could avoid
> reading the large-string column pages for rows that weren't assigned to
> it.
>
> I might be wrong, but page skipping relies on page headers which are
> stored in-line with the data itself. When downloading data from blob stores
> this could be less than ideal. This makes the idea of storing row-group
> boundaries in the Iceberg metadata feel more appealing to me. Of course, we
> need to perform row-index-range-based skipping for some files, but
> page-level skipping could also help optimize it - if we decide it's
> necessary.
>
> > If you use column-specific files, then you actually need to read the
> parquet footers of *all the separate column files*. That's 2x the number
> of I/Os.
>
> Agree, this's a valid point, until the footer fits into a single read -
> which is true when the configuration is correct.
>
> > There's a third option, which is to use column-specific files (or
> groups of columns in a file) that form a single Parquet structure with
> cross-file references (which is already in the Parquet standard, albeit not
> implemented anywhere).
>
> We have talked about this internally, but we saw several disadvantages:
> - It is not implemented anywhere - which means if we start using it
> everyone needs a new reader
> - If I understand correctly the cross-file references are for column
> chunks - we want to avoid too much fragmentation
> - It becomes hard to ensure that the file is really immutable
> - We still have to optimize the page alignment for reads.
>
> > I agree that it's an interesting idea, but it does add a lot of
> complexity, and I'm not convinced that it's better from a performance
> standpoint (metadata size increase, more I/Os). If we can get away with a
> better row group sizing policy, wouldn't that be preferable?
>
> That's a great question regarding the complexity. I'm still working
> through all the implications myself, but I believe we can encapsulate this
> behind the Iceberg File Format API. That way, it becomes available across
> all file formats and shields the rest of the codebase from the underlying
> complexity.
>
> Your point about performance is valid, especially in the context of full
> table scans. However, with these very wide tables, full scans are quite
> rare. If the column families are well-designed, we can actually improve
> performance across many columns/queries - not just a select few.
>
> Additionally, this approach enables frequently requested features like
> adding or updating column families without rewriting the entire table.
>
> Thanks,
> Peter
>
> Bart Samwel wrote (on Mon, Jun 2, 2025, 10:21):
>
>> On Fri, May 30, 2025 at 8:35 PM Péter Váry 
>> wrote:
>>
>>> Consider this example
>>> Imagine a table with one large string column and many small numeric
>>> columns.
>>>
>>> Scenario 1: Single File
>>>
>>>- All columns are written into a single file.
>>>- The RowGroup size is small due to the large string column
>>>dominating the layout.
>>>
>>> This is an assumption that may not be necessary. It would be well
>> possible to tune parquet writers to write very large row groups when a
>> large string column dominates. Such a string column would probably not get
>> dictionary encoded anyway, so it would effectively end up with a couple of
>> values per 1MB Parquet page. The other columns would get decent-sized
>> pages, and the overall row group size would be appropriate for getting good
>> compression on those smaller columns.
>>
>> What would be the downside of this approach?
>>
>>- When you're only reading the integer columns it is exactly the same
>>as when the columns would have been in a file by themselves. You just 
>> don't
>>read the large column chunk.
>>- I think it adds some complexity to the distributed/parallel reading
>>of the row groups when the large string column is included in the selected
>>set of columns. You know that a row group is very large, so you might then
>>shard it by row ranges. Each parallel reader would have to filter out the
>>rows that weren't assigned to it. With Parquet page skipping, each reader
>>could avoid reading t

Re: Wide tables in V4

2025-06-02 Thread Péter Váry
Hi Bart,

Thanks for your answer!
I’ve pulled out some text from your thorough and well-organized response to
make it easier to highlight my comments.

> It would be well possible to tune parquet writers to write very large row
groups when a large string column dominates. [..]

What would you do if there are multiple "optimal" sizes - say, a string
column where dictionary encoding could be optimal, and maybe some other
differently sized columns?

> You know that a row group is very large, so you might then shard it by
row ranges. Each parallel reader would have to filter out the rows that
weren't assigned to it. With Parquet page skipping, each reader could avoid
reading the large-string column pages for rows that weren't assigned to it.

I might be wrong, but page skipping relies on page headers which are stored
in-line with the data itself. When downloading data from blob stores this
could be less than ideal. This makes the idea of storing row-group
boundaries in the Iceberg metadata feel more appealing to me. Of course, we
need to perform row-index-range-based skipping for some files, but
page-level skipping could also help optimize it - if we decide it's
necessary.

> If you use column-specific files, then you actually need to read the
parquet footers of *all the separate column files*. That's 2x the number of
I/Os.

Agreed, this is a valid point, unless the footer fits into a single read -
which is true when the configuration is correct.

> There's a third option, which is to use column-specific files (or groups
of columns in a file) that form a single Parquet structure with cross-file
references (which is already in the Parquet standard, albeit not
implemented anywhere).

We have talked about this internally, but we saw several disadvantages:
- It is not implemented anywhere - which means if we start using it
everyone needs a new reader
- If I understand correctly the cross-file references are for column chunks
- we want to avoid too much fragmentation
- It becomes hard to ensure that the file is really immutable
- We still have to optimize the page alignment for reads.

> I agree that it's an interesting idea, but it does add a lot of
complexity, and I'm not convinced that it's better from a performance
standpoint (metadata size increase, more I/Os). If we can get away with a
better row group sizing policy, wouldn't that be preferable?

That's a great question regarding the complexity. I'm still working through
all the implications myself, but I believe we can encapsulate this behind
the Iceberg File Format API. That way, it becomes available across all file
formats and shields the rest of the codebase from the underlying complexity.
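As a purely illustrative sketch of that encapsulation (assuming the File
Format API hands us one aligned record iterator per column family, and that
the combined schema lists the family columns in concatenation order; none of
these classes exist today):

import java.util.Iterator;
import java.util.List;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.GenericRecord;
import org.apache.iceberg.data.Record;

class ColumnFamilyZipReader implements Iterator<Record> {
  private final Schema combinedSchema;
  private final List<Iterator<Record>> familyReaders;   // one per physical file, aligned by row index

  ColumnFamilyZipReader(Schema combinedSchema, List<Iterator<Record>> familyReaders) {
    this.combinedSchema = combinedSchema;
    this.familyReaders = familyReaders;
  }

  @Override
  public boolean hasNext() {
    return familyReaders.get(0).hasNext();
  }

  @Override
  public Record next() {
    GenericRecord row = GenericRecord.create(combinedSchema);
    int pos = 0;
    for (Iterator<Record> family : familyReaders) {
      Record part = family.next();                       // each family contributes its columns for this row
      for (int i = 0; i < part.size(); i++) {
        row.set(pos++, part.get(i, Object.class));
      }
    }
    return row;
  }
}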

Your point about performance is valid, especially in the context of full
table scans. However, with these very wide tables, full scans are quite
rare. If the column families are well-designed, we can actually improve
performance across many columns/queries - not just a select few.

Additionally, this approach enables frequently requested features like
adding or updating column families without rewriting the entire table.

Thanks,
Peter

Bart Samwel wrote (on Mon, Jun 2, 2025, 10:21):

> On Fri, May 30, 2025 at 8:35 PM Péter Váry 
> wrote:
>
>> Consider this example
>> Imagine a table with one large string column and many small numeric
>> columns.
>>
>> Scenario 1: Single File
>>
>>- All columns are written into a single file.
>>- The RowGroup size is small due to the large string column
>>dominating the layout.
>>
>> This is an assumption that may not be necessary. It would be well
> possible to tune parquet writers to write very large row groups when a
> large string column dominates. Such a string column would probably not get
> dictionary encoded anyway, so it would effectively end up with a couple of
> values per 1MB Parquet page. The other columns would get decent-sized
> pages, and the overall row group size would be appropriate for getting good
> compression on those smaller columns.
>
> What would be the downside of this approach?
>
>- When you're only reading the integer columns it is exactly the same
>as when the columns would have been in a file by themselves. You just don't
>read the large column chunk.
>- I think it adds some complexity to the distributed/parallel reading
>of the row groups when the large string column is included in the selected
>set of columns. You know that a row group is very large, so you might then
>shard it by row ranges. Each parallel reader would have to filter out the
>rows that weren't assigned to it. With Parquet page skipping, each reader
>could avoid reading the large-string column pages for rows that weren't
>assigned to it.
>
> Ultimately I think the parallel reading problem here is *nearly* the same
> regardless of whether you use one XL row group or separate files. You need
> to know the exact row group / page boundaries within each file in order to
> decide how 

Re: Wide tables in V4

2025-06-02 Thread Bart Samwel
On Fri, May 30, 2025 at 8:35 PM Péter Váry 
wrote:

> Consider this example
> Imagine a table with one large string column and many small numeric
> columns.
>
> Scenario 1: Single File
>
>- All columns are written into a single file.
>- The RowGroup size is small due to the large string column dominating
>the layout.
>
This is an assumption that may not be necessary. It would be well possible
to tune parquet writers to write very large row groups when a large string
column dominates. Such a string column would probably not get dictionary
encoded anyway, so it would effectively end up with a couple of values per
1MB Parquet page. The other columns would get decent-sized pages, and the
overall row group size would be appropriate for getting good compression on
those smaller columns.

What would be the downside of this approach?

   - When you're only reading the integer columns it is exactly the same as
   when the columns would have been in a file by themselves. You just don't
   read the large column chunk.
   - I think it adds some complexity to the distributed/parallel reading of
   the row groups when the large string column is included in the selected set
   of columns. You know that a row group is very large, so you might then
   shard it by row ranges. Each parallel reader would have to filter out the
   rows that weren't assigned to it. With Parquet page skipping, each reader
   could avoid reading the large-string column pages for rows that weren't
   assigned to it.

Ultimately I think the parallel reading problem here is *nearly* the same
regardless of whether you use one XL row group or separate files. You need
to know the exact row group / page boundaries within each file in order to
decide how to shard the read. And then you need to do row-index-range based
skipping on at least *some* of the input columns.

   - With XL row groups, in order to shard the row group into evenly sized
   chunks, you need to actually read the parquet footer first, because you
   need to know the row group boundaries within each file, and ideally even
   the page boundaries within each row group so that you can align your row
   ranges with those boundaries.
   - If you use column-specific files, then you actually need to read the
   parquet footers of *all the separate column files*. That's 2x the number
   of I/Os. These I/Os can be done in parallel, but they will contribute to
   throttling on cloud object stores.

So with XL row groups, distributed read planning can be done in one I/O, while
column-specific files require more I/Os. Either that, or you need to store
*even more* information in the metadata (namely all of these boundaries). The
column-specific files also require more I/Os to read later (because you end
up having to read two footers), which adds up especially if you read the
large string column, which means you parallelize the read into many small
chunks.
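To make the alignment point concrete, a minimal sketch (assuming parquet-java;
the surrounding planner and targetRowsPerShard are hypothetical) of cutting
row-range shards for one row group at the page boundaries recorded in the
offset index of a chosen column:

import java.util.ArrayList;
import java.util.List;
import org.apache.parquet.internal.column.columnindex.OffsetIndex;

class RowRangeSharder {
  static List<long[]> shards(OffsetIndex offsetIndex, long rowGroupRowCount, long targetRowsPerShard) {
    List<long[]> ranges = new ArrayList<>();             // each element is {firstRowInclusive, lastRowExclusive}
    long shardStart = 0;
    for (int page = 0; page < offsetIndex.getPageCount(); page++) {
      long pageFirstRow = offsetIndex.getFirstRowIndex(page);
      if (pageFirstRow - shardStart >= targetRowsPerShard) {
        ranges.add(new long[] {shardStart, pageFirstRow});   // cut the shard at a page boundary
        shardStart = pageFirstRow;
      }
    }
    ranges.add(new long[] {shardStart, rowGroupRowCount});   // the tail of the row group
    return ranges;
  }
}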

>
>- The numeric columns are not compacted efficiently.
>
> Scenario 2: Column-Specific Files
>
>- One file is written for the string column, and another for the
>numeric columns.
>- The RowGroup size for the string column remains small, but the
>numeric columns benefit from optimal RowGroup sizing.
>
There's a third option, which is to use column-specific files (or groups
of columns in a file) that form a single Parquet structure with cross-file
references (which is already in the Parquet standard, albeit not
implemented anywhere). This approach has several advantages over the other
options:

   1. All of the metadata required for distributed reads is in one place
   (one parquet footer), making distributed read planning require fewer I/Os,
   and reducing the pressure to move all of that information to the
   table-level metadata as well.
   2. Flexible structure. Different files can have different distribution
   of columns over files, and you don't have to remember the per-file
   distribution in the metadata.
   3. More scalable: you can have a file per column if you want, if your
   column sizes are wildly variable, without bloating the table-level metadata
   with information about more files.
   4. You can add/replace an entire column just by writing one extra file
   (with the new column contents, plus a new footer for the entire file that
   simply points to the old files for the existing data that wasn't modified).
   5. Relatively simple to implement in existing Parquet readers compared
   to "read multiple parquets and zip them together".



> Query Performance Impact:
>
>- If a query only reads one of the numeric columns:
>   - Scenario 1: Requires reading many small column chunks.
>   - Scenario 2: Reads a single, continuous column chunk - much more
>   efficient.
>
> Queries only reading columns which are stored in a single file will have
> improvements. Cross file queries will have over-reading which might, or
> might not be balanced out by reading bigger continuous chunks. Full table
>

Re: Wide tables in V4

2025-05-30 Thread Péter Váry
Consider this example
Imagine a table with one large string column and many small numeric columns.

Scenario 1: Single File

   - All columns are written into a single file.
   - The RowGroup size is small due to the large string column dominating
   the layout.
   - The numeric columns are not compacted efficiently.

Scenario 2: Column-Specific Files

   - One file is written for the string column, and another for the numeric
   columns.
   - The RowGroup size for the string column remains small, but the numeric
   columns benefit from optimal RowGroup sizing.

Query Performance Impact:

   - If a query only reads one of the numeric columns:
  - Scenario 1: Requires reading many small column chunks.
  - Scenario 2: Reads a single, continuous column chunk - much more
  efficient.

Queries reading only columns which are stored in a single file will see
improvements. Cross-file queries will over-read, which might or might not be
balanced out by reading bigger continuous chunks. Full table scans will
definitely have a performance penalty, but that is not the goal here.

> And aren't Parquet pages already providing these unaligned sizes?

Parquet pages do offer some flexibility in size, but they operate at a
lower level and are still bound by the RowGroup structure. What I’m
proposing is a higher-level abstraction that allows us to group columns
into independently optimized Physical Files, each with its own RowGroup
sizing strategy. This could allow us to better optimize for queries where
only a small number of columns are projected from a wide table.
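A minimal sketch of the write side with plain parquet-java, assuming two
hypothetical column families; the file names, schemas, and sizes below are
illustrative only, not a proposed implementation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class PerFamilyRowGroupSizes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    MessageType stringFamily = MessageTypeParser.parseMessageType(
        "message string_family { required binary payload (UTF8); }");
    MessageType numericFamily = MessageTypeParser.parseMessageType(
        "message numeric_family { required double f1; required double f2; }");

    // small row groups: keeps memory pressure low for the dominant string column
    ParquetWriter<Group> stringWriter = ExampleParquetWriter.builder(new Path("family-strings.parquet"))
        .withConf(conf)
        .withType(stringFamily)
        .withRowGroupSize(8 * 1024 * 1024)
        .build();

    // large row groups: better compression and scan efficiency for the narrow numeric columns
    ParquetWriter<Group> numericWriter = ExampleParquetWriter.builder(new Path("family-numerics.parquet"))
        .withConf(conf)
        .withType(numericFamily)
        .withRowGroupSize(512 * 1024 * 1024)
        .build();

    // ... write the same rows, in the same order, to both writers, then close them
    stringWriter.close();
    numericWriter.close();
  }
}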


Bart Samwel wrote (on Fri, May 30, 2025, 16:03):

>
>
> On Fri, May 30, 2025 at 3:33 PM Péter Váry 
> wrote:
>
>> One key advantage of introducing Physical Files is the flexibility to
>> vary RowGroup sizes across columns. For instance, wide string columns could
>> benefit from smaller RowGroups to reduce memory pressure, while numeric
>> columns could use larger RowGroups to improve compression and scan
>> efficiency. Rather than enforcing strict row group alignment across all
>> columns, we can explore optimizing read split sizes and write-time RowGroup
>> sizes independently - striking a balance that maximizes performance and
>> storage costs for different data types and queries.
>>
>
> That actually sounds very complicated if you want to split file reads in a
> distributed system. If you want to read across column groups, then you
> always end up over-reading on one of them if they are not aligned.
>
> And aren't Parquet pages already providing these unaligned sizes?
>
> Gang Wu wrote (on Fri, May 30, 2025, 8:09):
>>
>>> IMO, the main drawback for the view solution is the complexity of
>>> maintaining consistency across tables if we want to use features like time
>>> travel, incremental scan, branch & tag, encryption, etc.
>>>
>>> On Fri, May 30, 2025 at 12:55 PM Bryan Keller  wrote:
>>>
 Fewer commit conflicts meaning the tables representing column families
 are updated independently, rather than having to serialize commits to a
 single table. Perhaps with a wide table solution the commit logic could be
 enhanced to support things like concurrent overwrites to independent column
 families, but it seems like it would be fairly involved.


 On May 29, 2025, at 7:16 PM, Steven Wu  wrote:

 Bryan, interesting approach to split horizontally across multiple
 tables.

 A few potential down sides
 * operational overhead. tables need to be managed consistently and
 probably in some coordinated way
 * complex read
 * maybe fragile to enforce correctness (during join). It is robust to
 enforce the stitching correctness at file group level in file reader and
 writer if built in the table format.

 > fewer commit conflicts

 Can you elaborate on this one? Are those tables populated by streaming
 or batch pipelines?

 On Thu, May 29, 2025 at 5:03 PM Bryan Keller  wrote:

> Hi everyone,
>
> We have been investigating a wide table format internally for a
> similar use case, i.e. we have wide ML tables with features generated by
> different pipelines and teams but want a unified view of the data. We are
> comparing that against separate tables joined together using a 
> shuffle-less
> join (e.g. storage partition join), along with a corresponding view.
>
> The join/view approach seems to give us much of we need, with some
> added benefits like splitting up the metadata, fewer commit conflicts, and
> ability to share, nest, and swap "column families". The downsides are 
> table
> management is split across multiple tables, it requires engine support of
> shuffle-less joins for best performance, and even then, scans probably
> won't be as optimal.
>
> I'm curious if anyone had further thoughts on the two?
>
> -Bryan
>
>
>
> On May 29,

Re: Wide tables in V4

2025-05-30 Thread Péter Váry
> A larger problem for splitting columns across files is that there are a
lot of assumptions about how data is laid out in both readers and writers.
For example, aligning row groups and correctly handling split calculation
is very complicated if you're trying to split rows across files.  Other
features are also impacted like deletes, which reference the file to which
they apply and would need to account for deletes applying to multiple files
and needing to update those references if columns are added.

I agree that there are many assumptions baked into the Content File layout,
and we should preserve that abstraction. We can introduce a lower-level
abstraction: the Physical File. This allows us to retain most of the
existing assumptions at the Content File level, while delegating the
responsibility of determining which Physical Files to access to the reader.
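Purely as a sketch of the shape of that abstraction (no PhysicalFile type
exists in Iceberg today; the names and methods are illustrative only):

import java.util.List;
import java.util.Set;

interface PhysicalFile {
  String path();                 // location of the Parquet file holding one column family
  Set<Integer> columnIds();      // Iceberg field ids stored in this file
  long recordCount();            // must match the logical Content File's row count
}

interface PhysicalLayout {
  // the Physical Files that must be opened (and zipped by row index) for a given projection
  List<PhysicalFile> filesFor(Set<Integer> projectedColumnIds);
}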

One key advantage of introducing Physical Files is the flexibility to vary
RowGroup sizes across columns. For instance, wide string columns could
benefit from smaller RowGroups to reduce memory pressure, while numeric
columns could use larger RowGroups to improve compression and scan
efficiency. Rather than enforcing strict row group alignment across all
columns, we can explore optimizing read split sizes and write-time RowGroup
sizes independently - striking a balance that maximizes performance and
storage costs for different data types and queries.

On the topic of the view-based wide tables:
Using readers to join Physical Files closely resembles the shuffle-less join
approach mentioned by Bryan, with the added benefit that any number of
columns can be updated independently. In contrast, the multitable approach
restricts independent updates to predefined column families (tables behind
the view).


Gang Wu wrote (on Fri, May 30, 2025, 8:09):

> IMO, the main drawback for the view solution is the complexity of
> maintaining consistency across tables if we want to use features like time
> travel, incremental scan, branch & tag, encryption, etc.
>
> On Fri, May 30, 2025 at 12:55 PM Bryan Keller  wrote:
>
>> Fewer commit conflicts meaning the tables representing column families
>> are updated independently, rather than having to serialize commits to a
>> single table. Perhaps with a wide table solution the commit logic could be
>> enhanced to support things like concurrent overwrites to independent column
>> families, but it seems like it would be fairly involved.
>>
>>
>> On May 29, 2025, at 7:16 PM, Steven Wu  wrote:
>>
>> Bryan, interesting approach to split horizontally across multiple tables.
>>
>> A few potential down sides
>> * operational overhead. tables need to be managed consistently and
>> probably in some coordinated way
>> * complex read
>> * maybe fragile to enforce correctness (during join). It is robust to
>> enforce the stitching correctness at file group level in file reader and
>> writer if built in the table format.
>>
>> > fewer commit conflicts
>>
>> Can you elaborate on this one? Are those tables populated by streaming or
>> batch pipelines?
>>
>> On Thu, May 29, 2025 at 5:03 PM Bryan Keller  wrote:
>>
>>> Hi everyone,
>>>
>>> We have been investigating a wide table format internally for a similar
>>> use case, i.e. we have wide ML tables with features generated by different
>>> pipelines and teams but want a unified view of the data. We are comparing
>>> that against separate tables joined together using a shuffle-less join
>>> (e.g. storage partition join), along with a corresponding view.
>>>
>>> The join/view approach seems to give us much of we need, with some added
>>> benefits like splitting up the metadata, fewer commit conflicts, and
>>> ability to share, nest, and swap "column families". The downsides are table
>>> management is split across multiple tables, it requires engine support of
>>> shuffle-less joins for best performance, and even then, scans probably
>>> won't be as optimal.
>>>
>>> I'm curious if anyone had further thoughts on the two?
>>>
>>> -Bryan
>>>
>>>
>>>
>>> On May 29, 2025, at 8:18 AM, Péter Váry 
>>> wrote:
>>>
>>> I received feedback from Alkis regarding their Parquet optimization
>>> work. Their internal testing shows promising results for reducing metadata
>>> size and improving parsing performance. They plan to formalize a proposal
>>> for these Parquet enhancements in the near future.
>>>
>>> Meanwhile, I'm putting together our horizontal sharding proposal as a
>>> complementary approach. Even with the Parquet metadata improvements,
>>> horizontal sharding would provide additional benefits for:
>>>
>>>- More efficient column-level updates
>>>- Streamlined column additions
>>>- Better handling of dominant columns that can cause RowGroup size
>>>imbalances (placing these in separate files could significantly improve
>>>performance)
>>>
>>> Thanks, Peter
>>>
>>>
>>>
>>> Péter Váry wrote (on Wed, May 28, 2025, 15:39):
>>>
 I would be happy to put

Re: Wide tables in V4

2025-05-30 Thread Bart Samwel
On Fri, May 30, 2025 at 3:33 PM Péter Váry 
wrote:

> One key advantage of introducing Physical Files is the flexibility to vary
> RowGroup sizes across columns. For instance, wide string columns could
> benefit from smaller RowGroups to reduce memory pressure, while numeric
> columns could use larger RowGroups to improve compression and scan
> efficiency. Rather than enforcing strict row group alignment across all
> columns, we can explore optimizing read split sizes and write-time RowGroup
> sizes independently - striking a balance that maximizes performance and
> storage costs for different data types and queries.
>

That actually sounds very complicated if you want to split file reads in a
distributed system. If you want to read across column groups, then you
always end up over-reading on one of them if they are not aligned.

And aren't Parquet pages already providing these unaligned sizes?

Gang Wu wrote (on Fri, May 30, 2025, 8:09):
>
>> IMO, the main drawback for the view solution is the complexity of
>> maintaining consistency across tables if we want to use features like time
>> travel, incremental scan, branch & tag, encryption, etc.
>>
>> On Fri, May 30, 2025 at 12:55 PM Bryan Keller  wrote:
>>
>>> Fewer commit conflicts meaning the tables representing column families
>>> are updated independently, rather than having to serialize commits to a
>>> single table. Perhaps with a wide table solution the commit logic could be
>>> enhanced to support things like concurrent overwrites to independent column
>>> families, but it seems like it would be fairly involved.
>>>
>>>
>>> On May 29, 2025, at 7:16 PM, Steven Wu  wrote:
>>>
>>> Bryan, interesting approach to split horizontally across multiple
>>> tables.
>>>
>>> A few potential down sides
>>> * operational overhead. tables need to be managed consistently and
>>> probably in some coordinated way
>>> * complex read
>>> * maybe fragile to enforce correctness (during join). It is robust to
>>> enforce the stitching correctness at file group level in file reader and
>>> writer if built in the table format.
>>>
>>> > fewer commit conflicts
>>>
>>> Can you elaborate on this one? Are those tables populated by streaming
>>> or batch pipelines?
>>>
>>> On Thu, May 29, 2025 at 5:03 PM Bryan Keller  wrote:
>>>
 Hi everyone,

 We have been investigating a wide table format internally for a similar
 use case, i.e. we have wide ML tables with features generated by different
 pipelines and teams but want a unified view of the data. We are comparing
 that against separate tables joined together using a shuffle-less join
 (e.g. storage partition join), along with a corresponding view.

 The join/view approach seems to give us much of we need, with some
 added benefits like splitting up the metadata, fewer commit conflicts, and
 ability to share, nest, and swap "column families". The downsides are table
 management is split across multiple tables, it requires engine support of
 shuffle-less joins for best performance, and even then, scans probably
 won't be as optimal.

 I'm curious if anyone had further thoughts on the two?

 -Bryan



 On May 29, 2025, at 8:18 AM, Péter Váry 
 wrote:

 I received feedback from Alkis regarding their Parquet optimization
 work. Their internal testing shows promising results for reducing metadata
 size and improving parsing performance. They plan to formalize a proposal
 for these Parquet enhancements in the near future.

 Meanwhile, I'm putting together our horizontal sharding proposal as a
 complementary approach. Even with the Parquet metadata improvements,
 horizontal sharding would provide additional benefits for:

- More efficient column-level updates
- Streamlined column additions
- Better handling of dominant columns that can cause RowGroup size
imbalances (placing these in separate files could significantly improve
performance)

 Thanks, Peter



 Péter Váry wrote (on Wed, May 28, 2025, 15:39):

> I would be happy to put together a proposal based on the inputs got
> here.
>
> Thanks everyone for your thoughts!
> I will try to incorporate all of this.
>
> Thanks, Peter
>
> Daniel Weeks wrote (on Tue, May 27, 2025, 20:07):
>
>> I feel like we have two different issues we're talking about here
>> that aren't necessarily tied (though solutions may address both): 1) wide
>> tables, 2) adding columns
>>
>> Wide tables are definitely a problem where parquet has limitations.
>> I'm optimistic about the ongoing work to help improve parquet 
>> footers/stats
>> in this area that Fokko mentioned.  There are always limitations in how
>> this scales as wide rows lead to small row groups and the cost to
>> reconstitute a ro

Re: Wide tables in V4

2025-05-29 Thread Gang Wu
IMO, the main drawback for the view solution is the complexity of
maintaining consistency across tables if we want to use features like time
travel, incremental scan, branch & tag, encryption, etc.

On Fri, May 30, 2025 at 12:55 PM Bryan Keller  wrote:

> Fewer commit conflicts meaning the tables representing column families are
> updated independently, rather than having to serialize commits to a single
> table. Perhaps with a wide table solution the commit logic could be
> enhanced to support things like concurrent overwrites to independent column
> families, but it seems like it would be fairly involved.
>
>
> On May 29, 2025, at 7:16 PM, Steven Wu  wrote:
>
> Bryan, interesting approach to split horizontally across multiple tables.
>
> A few potential down sides
> * operational overhead. tables need to be managed consistently and
> probably in some coordinated way
> * complex read
> * maybe fragile to enforce correctness (during join). It is robust to
> enforce the stitching correctness at file group level in file reader and
> writer if built in the table format.
>
> > fewer commit conflicts
>
> Can you elaborate on this one? Are those tables populated by streaming or
> batch pipelines?
>
> On Thu, May 29, 2025 at 5:03 PM Bryan Keller  wrote:
>
>> Hi everyone,
>>
>> We have been investigating a wide table format internally for a similar
>> use case, i.e. we have wide ML tables with features generated by different
>> pipelines and teams but want a unified view of the data. We are comparing
>> that against separate tables joined together using a shuffle-less join
>> (e.g. storage partition join), along with a corresponding view.
>>
>> The join/view approach seems to give us much of we need, with some added
>> benefits like splitting up the metadata, fewer commit conflicts, and
>> ability to share, nest, and swap "column families". The downsides are table
>> management is split across multiple tables, it requires engine support of
>> shuffle-less joins for best performance, and even then, scans probably
>> won't be as optimal.
>>
>> I'm curious if anyone had further thoughts on the two?
>>
>> -Bryan
>>
>>
>>
>> On May 29, 2025, at 8:18 AM, Péter Váry 
>> wrote:
>>
>> I received feedback from Alkis regarding their Parquet optimization work.
>> Their internal testing shows promising results for reducing metadata size
>> and improving parsing performance. They plan to formalize a proposal for
>> these Parquet enhancements in the near future.
>>
>> Meanwhile, I'm putting together our horizontal sharding proposal as a
>> complementary approach. Even with the Parquet metadata improvements,
>> horizontal sharding would provide additional benefits for:
>>
>>- More efficient column-level updates
>>- Streamlined column additions
>>- Better handling of dominant columns that can cause RowGroup size
>>imbalances (placing these in separate files could significantly improve
>>performance)
>>
>> Thanks, Peter
>>
>>
>>
>> Péter Váry wrote (on Wed, May 28, 2025, 15:39):
>>
>>> I would be happy to put together a proposal based on the inputs got here.
>>>
>>> Thanks everyone for your thoughts!
>>> I will try to incorporate all of this.
>>>
>>> Thanks, Peter
>>>
>>> Daniel Weeks wrote (on Tue, May 27, 2025, 20:07):
>>>
 I feel like we have two different issues we're talking about here that
 aren't necessarily tied (though solutions may address both): 1) wide
 tables, 2) adding columns

 Wide tables are definitely a problem where parquet has limitations. I'm
 optimistic about the ongoing work to help improve parquet footers/stats in
 this area that Fokko mentioned.  There are always limitations in how this
 scales as wide rows lead to small row groups and the cost to reconstitute a
 row gets more expensive, but for cases that are read heavy and projecting
 subsets of columns should significantly improve performance.

 Adding columns to an existing dataset is something that comes up
 periodically, but there's a lot of complexity involved in this.  Parquet
 does support referencing columns in separate files per the spec, but
 there's no implementation that takes advantage of this to my knowledge.
 This does allow for approaches where you separate/rewrite just the footers
 or various other tricks, but these approaches get complicated quickly and
 the number of readers that can consume those representations would
 initially be very limited.

 A larger problem for splitting columns across files is that there are a
 lot of assumptions about how data is laid out in both readers and writers.
 For example, aligning row groups and correctly handling split calculation
 is very complicated if you're trying to split rows across files.  Other
 features are also impacted like deletes, which reference the file to which
 they apply and would need to account for deletes applying to multiple fil

Re: Wide tables in V4

2025-05-29 Thread Bryan Keller
By fewer commit conflicts I mean the tables representing column families are 
updated independently, rather than having to serialize commits to a single 
table. Perhaps with a wide table solution the commit logic could be enhanced to 
support things like concurrent overwrites to independent column families, but 
it seems like it would be fairly involved.


> On May 29, 2025, at 7:16 PM, Steven Wu  wrote:
> 
> Bryan, interesting approach to split horizontally across multiple tables. 
> 
> A few potential down sides
> * operational overhead. tables need to be managed consistently and probably 
> in some coordinated way
> * complex read
> * maybe fragile to enforce correctness (during join). It is robust to enforce 
> the stitching correctness at file group level in file reader and writer if 
> built in the table format.
> 
> > fewer commit conflicts
> 
> Can you elaborate on this one? Are those tables populated by streaming or 
> batch pipelines?
> 
> On Thu, May 29, 2025 at 5:03 PM Bryan Keller  > wrote:
>> Hi everyone,
>> 
>> We have been investigating a wide table format internally for a similar use 
>> case, i.e. we have wide ML tables with features generated by different 
>> pipelines and teams but want a unified view of the data. We are comparing 
>> that against separate tables joined together using a shuffle-less join (e.g. 
>> storage partition join), along with a corresponding view.
>> 
>> The join/view approach seems to give us much of we need, with some added 
>> benefits like splitting up the metadata, fewer commit conflicts, and ability 
>> to share, nest, and swap "column families". The downsides are table 
>> management is split across multiple tables, it requires engine support of 
>> shuffle-less joins for best performance, and even then, scans probably won't 
>> be as optimal.
>> 
>> I'm curious if anyone had further thoughts on the two?
>> 
>> -Bryan
>> 
>> 
>> 
>>> On May 29, 2025, at 8:18 AM, Péter Váry >> > wrote:
>>> 
>>> I received feedback from Alkis regarding their Parquet optimization work. 
>>> Their internal testing shows promising results for reducing metadata size 
>>> and improving parsing performance. They plan to formalize a proposal for 
>>> these Parquet enhancements in the near future.
>>> 
>>> Meanwhile, I'm putting together our horizontal sharding proposal as a 
>>> complementary approach. Even with the Parquet metadata improvements, 
>>> horizontal sharding would provide additional benefits for:
>>> More efficient column-level updates
>>> Streamlined column additions
>>> Better handling of dominant columns that can cause RowGroup size imbalances 
>>> (placing these in separate files could significantly improve performance)
>>> Thanks, Peter
>>> 
>>> 
>>> 
>>> Péter Váry wrote (on Wed, May 28, 2025, 15:39):
 I would be happy to put together a proposal based on the inputs got here.
 
 Thanks everyone for your thoughts!
 I will try to incorporate all of this.
 
 Thanks, Peter 
 
 Daniel Weeks wrote (on Tue, May 27, 2025, 20:07):
> I feel like we have two different issues we're talking about here that 
> aren't necessarily tied (though solutions may address both): 1) wide 
> tables, 2) adding columns
> 
> Wide tables are definitely a problem where parquet has limitations. I'm 
> optimistic about the ongoing work to help improve parquet footers/stats 
> in this area that Fokko mentioned.  There are always limitations in how 
> this scales as wide rows lead to small row groups and the cost to 
> reconstitute a row gets more expensive, but for cases that are read heavy 
> and projecting subsets of columns should significantly improve 
> performance.
> 
> Adding columns to an existing dataset is something that comes up 
> periodically, but there's a lot of complexity involved in this.  Parquet 
> does support referencing columns in separate files per the spec, but 
> there's no implementation that takes advantage of this to my knowledge.  
> This does allow for approaches where you separate/rewrite just the 
> footers or various other tricks, but these approaches get complicated 
> quickly and the number of readers that can consume those representations 
> would initially be very limited.
> 
> A larger problem for splitting columns across files is that there are a 
> lot of assumptions about how data is laid out in both readers and 
> writers.  For example, aligning row groups and correctly handling split 
> calculation is very complicated if you're trying to split rows across 
> files.  Other features are also impacted like deletes, which reference 
> the file to which they apply and would need to account for deletes 
> applying to multiple files and needing to update those ref

Re: Wide tables in V4

2025-05-29 Thread Steven Wu
Bryan, interesting approach to split horizontally across multiple tables.

A few potential downsides:
* Operational overhead: tables need to be managed consistently and probably
in some coordinated way.
* More complex reads.
* Maybe fragile to enforce correctness (during the join). It is more robust
to enforce the stitching correctness at the file-group level in the file
reader and writer if it is built into the table format.

> fewer commit conflicts

Can you elaborate on this one? Are those tables populated by streaming or
batch pipelines?

On Thu, May 29, 2025 at 5:03 PM Bryan Keller  wrote:

> Hi everyone,
>
> We have been investigating a wide table format internally for a similar
> use case, i.e. we have wide ML tables with features generated by different
> pipelines and teams but want a unified view of the data. We are comparing
> that against separate tables joined together using a shuffle-less join
> (e.g. storage partition join), along with a corresponding view.
>
> The join/view approach seems to give us much of we need, with some added
> benefits like splitting up the metadata, fewer commit conflicts, and
> ability to share, nest, and swap "column families". The downsides are table
> management is split across multiple tables, it requires engine support of
> shuffle-less joins for best performance, and even then, scans probably
> won't be as optimal.
>
> I'm curious if anyone had further thoughts on the two?
>
> -Bryan
>
>
>
> On May 29, 2025, at 8:18 AM, Péter Váry 
> wrote:
>
> I received feedback from Alkis regarding their Parquet optimization work.
> Their internal testing shows promising results for reducing metadata size
> and improving parsing performance. They plan to formalize a proposal for
> these Parquet enhancements in the near future.
>
> Meanwhile, I'm putting together our horizontal sharding proposal as a
> complementary approach. Even with the Parquet metadata improvements,
> horizontal sharding would provide additional benefits for:
>
>- More efficient column-level updates
>- Streamlined column additions
>- Better handling of dominant columns that can cause RowGroup size
>imbalances (placing these in separate files could significantly improve
>performance)
>
> Thanks, Peter
>
>
>
> Péter Váry wrote (on Wed, May 28, 2025, 15:39):
>
>> I would be happy to put together a proposal based on the inputs got here.
>>
>> Thanks everyone for your thoughts!
>> I will try to incorporate all of this.
>>
>> Thanks, Peter
>>
>> Daniel Weeks wrote (on Tue, May 27, 2025, 20:07):
>>
>>> I feel like we have two different issues we're talking about here that
>>> aren't necessarily tied (though solutions may address both): 1) wide
>>> tables, 2) adding columns
>>>
>>> Wide tables are definitely a problem where parquet has limitations. I'm
>>> optimistic about the ongoing work to help improve parquet footers/stats in
>>> this area that Fokko mentioned.  There are always limitations in how this
>>> scales as wide rows lead to small row groups and the cost to reconstitute a
>>> row gets more expensive, but for cases that are read heavy and projecting
>>> subsets of columns should significantly improve performance.
>>>
>>> Adding columns to an existing dataset is something that comes up
>>> periodically, but there's a lot of complexity involved in this.  Parquet
>>> does support referencing columns in separate files per the spec, but
>>> there's no implementation that takes advantage of this to my knowledge.
>>> This does allow for approaches where you separate/rewrite just the footers
>>> or various other tricks, but these approaches get complicated quickly and
>>> the number of readers that can consume those representations would
>>> initially be very limited.
>>>
>>> A larger problem for splitting columns across files is that there are a
>>> lot of assumptions about how data is laid out in both readers and writers.
>>> For example, aligning row groups and correctly handling split calculation
>>> is very complicated if you're trying to split rows across files.  Other
>>> features are also impacted like deletes, which reference the file to which
>>> they apply and would need to account for deletes applying to multiple files
>>> and needing to update those references if columns are added.
>>>
>>> I believe there are a lot of interesting approaches to addressing these
>>> use cases, but we'd really need a thorough proposal that explores all of
>>> these scenarios.  The last thing we would want is to introduce
>>> incompatibilities within the format that result in incompatible features.
>>>
>>> -Dan
>>>
>>> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <
>>> [email protected]> wrote:
>>>
 Point definitely taken. We really should probably POC some of
 these ideas and see what we are actually dealing with. (He said without
 volunteering to do the work :P)

 On Tue, May 27, 2025 at 11:55 AM Selcuk Aya
  wrote:

> Yes having to rewrite the whole file i

Re: Wide tables in V4

2025-05-29 Thread Bryan Keller
Hi everyone,

We have been investigating a wide table format internally for a similar use 
case, i.e. we have wide ML tables with features generated by different 
pipelines and teams but want a unified view of the data. We are comparing that 
against separate tables joined together using a shuffle-less join (e.g. storage 
partition join), along with a corresponding view.

The join/view approach seems to give us much of what we need, with some added 
benefits like splitting up the metadata, fewer commit conflicts, and the ability 
to share, nest, and swap "column families". The downsides are that table 
management is split across multiple tables, it requires engine support of 
shuffle-less joins for best performance, and even then, scans probably won't 
be as optimal.

I'm curious if anyone had further thoughts on the two?

-Bryan



> On May 29, 2025, at 8:18 AM, Péter Váry  wrote:
> 
> I received feedback from Alkis regarding their Parquet optimization work. 
> Their internal testing shows promising results for reducing metadata size and 
> improving parsing performance. They plan to formalize a proposal for these 
> Parquet enhancements in the near future.
> 
> Meanwhile, I'm putting together our horizontal sharding proposal as a 
> complementary approach. Even with the Parquet metadata improvements, 
> horizontal sharding would provide additional benefits for:
> More efficient column-level updates
> Streamlined column additions
> Better handling of dominant columns that can cause RowGroup size imbalances 
> (placing these in separate files could significantly improve performance)
> Thanks, Peter
> 
> 
> 
> Péter Váry wrote (on Wed, May 28, 2025, 15:39):
>> I would be happy to put together a proposal based on the inputs got here.
>> 
>> Thanks everyone for your thoughts!
>> I will try to incorporate all of this.
>> 
>> Thanks, Peter 
>> 
>> Daniel Weeks mailto:[email protected]>> ezt írta 
>> (időpont: 2025. máj. 27., K, 20:07):
>>> I feel like we have two different issues we're talking about here that 
>>> aren't necessarily tied (though solutions may address both): 1) wide 
>>> tables, 2) adding columns
>>> 
>>> Wide tables are definitely a problem where parquet has limitations. I'm 
>>> optimistic about the ongoing work to help improve parquet footers/stats in 
>>> this area that Fokko mentioned.  There are always limitations in how this 
>>> scales as wide rows lead to small row groups and the cost to reconstitute a 
>>> row gets more expensive, but for cases that are read heavy and projecting 
>>> subsets of columns should significantly improve performance.
>>> 
>>> Adding columns to an existing dataset is something that comes up 
>>> periodically, but there's a lot of complexity involved in this.  Parquet 
>>> does support referencing columns in separate files per the spec, but 
>>> there's no implementation that takes advantage of this to my knowledge.  
>>> This does allow for approaches where you separate/rewrite just the footers 
>>> or various other tricks, but these approaches get complicated quickly and 
>>> the number of readers that can consume those representations would 
>>> initially be very limited.
>>> 
>>> A larger problem for splitting columns across files is that there are a lot 
>>> of assumptions about how data is laid out in both readers and writers.  For 
>>> example, aligning row groups and correctly handling split calculation is 
>>> very complicated if you're trying to split rows across files.  Other 
>>> features are also impacted like deletes, which reference the file to which 
>>> they apply and would need to account for deletes applying to multiple files 
>>> and needing to update those references if columns are added.
>>> 
>>> I believe there are a lot of interesting approaches to addressing these use 
>>> cases, but we'd really need a thorough proposal that explores all of these 
>>> scenarios.  The last thing we would want is to introduce incompatibilities 
>>> within the format that result in incompatible features.
>>> 
>>> -Dan
>>> 
>>> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer >> > wrote:
 Point definitely taken. We really should probably POC some of these ideas 
 and see what we are actually dealing with. (He said without volunteering 
 to do the work :P)
 
 On Tue, May 27, 2025 at 11:55 AM Selcuk Aya 
  wrote:
> Yes having to rewrite the whole file is not ideal but I believe most of 
> the cost of rewriting a file comes from decompression, encoding, stats 
> calculations etc. If you are adding new values for some columns but are 
> keeping the rest of the columns the same in the file, then a bunch of 
> rewrite cost can be optimized away. I am not saying this is better than 
> writing to a separate file, I am not sure how much worse it is though.
> 
> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer 
> mailto:russell.spit...@gmai

Re: Wide tables in V4

2025-05-29 Thread Péter Váry
I received feedback from Alkis regarding their Parquet optimization work.
Their internal testing shows promising results for reducing metadata size
and improving parsing performance. They plan to formalize a proposal for
these Parquet enhancements in the near future.

Meanwhile, I'm putting together our horizontal sharding proposal as a
complementary approach. Even with the Parquet metadata improvements,
horizontal sharding would provide additional benefits for:

   - More efficient column-level updates
   - Streamlined column additions
   - Better handling of dominant columns that can cause RowGroup size
   imbalances (placing these in separate files could significantly improve
   performance)

Thanks, Peter



Péter Váry  ezt írta (időpont: 2025. máj. 28.,
Sze, 15:39):

> I would be happy to put together a proposal based on the inputs got here.
>
> Thanks everyone for your thoughts!
> I will try to incorporate all of this.
>
> Thanks, Peter
>
> Daniel Weeks  ezt írta (időpont: 2025. máj. 27., K,
> 20:07):
>
>> I feel like we have two different issues we're talking about here that
>> aren't necessarily tied (though solutions may address both): 1) wide
>> tables, 2) adding columns
>>
>> Wide tables are definitely a problem where parquet has limitations. I'm
>> optimistic about the ongoing work to help improve parquet footers/stats in
>> this area that Fokko mentioned.  There are always limitations in how this
>> scales as wide rows lead to small row groups and the cost to reconstitute a
>> row gets more expensive, but for cases that are read heavy and projecting
>> subsets of columns should significantly improve performance.
>>
>> Adding columns to an existing dataset is something that comes up
>> periodically, but there's a lot of complexity involved in this.  Parquet
>> does support referencing columns in separate files per the spec, but
>> there's no implementation that takes advantage of this to my knowledge.
>> This does allow for approaches where you separate/rewrite just the footers
>> or various other tricks, but these approaches get complicated quickly and
>> the number of readers that can consume those representations would
>> initially be very limited.
>>
>> A larger problem for splitting columns across files is that there are a
>> lot of assumptions about how data is laid out in both readers and writers.
>> For example, aligning row groups and correctly handling split calculation
>> is very complicated if you're trying to split rows across files.  Other
>> features are also impacted like deletes, which reference the file to which
>> they apply and would need to account for deletes applying to multiple files
>> and needing to update those references if columns are added.
>>
>> I believe there are a lot of interesting approaches to addressing these
>> use cases, but we'd really need a thorough proposal that explores all of
>> these scenarios.  The last thing we would want is to introduce
>> incompatibilities within the format that result in incompatible features.
>>
>> -Dan
>>
>> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <
>> [email protected]> wrote:
>>
>>> Point definitely taken. We really should probably POC some of
>>> these ideas and see what we are actually dealing with. (He said without
>>> volunteering to do the work :P)
>>>
>>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya
>>>  wrote:
>>>
 Yes having to rewrite the whole file is not ideal but I believe most of
 the cost of rewriting a file comes from decompression, encoding, stats
 calculations etc. If you are adding new values for some columns but are
 keeping the rest of the columns the same in the file, then a bunch of
 rewrite cost can be optimized away. I am not saying this is better than
 writing to a separate file, I am not sure how much worse it is though.

 On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <
 [email protected]> wrote:

> I think that "after the fact" modification is one of the requirements
> here, IE: Updating a single column without rewriting the whole file.
> If we have to write new metadata for the file aren't we in the same
> boat as having to rewrite the whole file?
>
> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya
>  wrote:
>
>> If files represent column projections of a table rather than the
>> whole columns in the table, then any read that reads across these files
>> needs to identify what constitutes a row. Lance DB for example has 
>> vertical
>> partitioning across columns but also horizontal partitioning across rows
>> such that in each horizontal partitioning(fragment), the same number of
>> rows exist in each vertical partition,  which I think is necessary to 
>> make
>> whole/partial row construction cheap. If this is the case, there is no
>> reason not to achieve the same data layout inside a single columnar file
>> with a lean header. I think the only valid argumen

Re: Wide tables in V4

2025-05-28 Thread Péter Váry
I would be happy to put together a proposal based on the inputs I got here.

Thanks everyone for your thoughts!
I will try to incorporate all of this.

Thanks, Peter

Daniel Weeks  ezt írta (időpont: 2025. máj. 27., K,
20:07):

> I feel like we have two different issues we're talking about here that
> aren't necessarily tied (though solutions may address both): 1) wide
> tables, 2) adding columns
>
> Wide tables are definitely a problem where parquet has limitations. I'm
> optimistic about the ongoing work to help improve parquet footers/stats in
> this area that Fokko mentioned.  There are always limitations in how this
> scales as wide rows lead to small row groups and the cost to reconstitute a
> row gets more expensive, but for cases that are read heavy and projecting
> subsets of columns should significantly improve performance.
>
> Adding columns to an existing dataset is something that comes up
> periodically, but there's a lot of complexity involved in this.  Parquet
> does support referencing columns in separate files per the spec, but
> there's no implementation that takes advantage of this to my knowledge.
> This does allow for approaches where you separate/rewrite just the footers
> or various other tricks, but these approaches get complicated quickly and
> the number of readers that can consume those representations would
> initially be very limited.
>
> A larger problem for splitting columns across files is that there are a
> lot of assumptions about how data is laid out in both readers and writers.
> For example, aligning row groups and correctly handling split calculation
> is very complicated if you're trying to split rows across files.  Other
> features are also impacted like deletes, which reference the file to which
> they apply and would need to account for deletes applying to multiple files
> and needing to update those references if columns are added.
>
> I believe there are a lot of interesting approaches to addressing these
> use cases, but we'd really need a thorough proposal that explores all of
> these scenarios.  The last thing we would want is to introduce
> incompatibilities within the format that result in incompatible features.
>
> -Dan
>
> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <
> [email protected]> wrote:
>
>> Point definitely taken. We really should probably POC some of these ideas
>> and see what we are actually dealing with. (He said without volunteering to
>> do the work :P)
>>
>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya
>>  wrote:
>>
>>> Yes having to rewrite the whole file is not ideal but I believe most of
>>> the cost of rewriting a file comes from decompression, encoding, stats
>>> calculations etc. If you are adding new values for some columns but are
>>> keeping the rest of the columns the same in the file, then a bunch of
>>> rewrite cost can be optimized away. I am not saying this is better than
>>> writing to a separate file, I am not sure how much worse it is though.
>>>
>>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <
>>> [email protected]> wrote:
>>>
 I think that "after the fact" modification is one of the requirements
 here, IE: Updating a single column without rewriting the whole file.
 If we have to write new metadata for the file aren't we in the same
 boat as having to rewrite the whole file?

 On Tue, May 27, 2025 at 11:27 AM Selcuk Aya
  wrote:

> If files represent column projections of a table rather than the whole
> columns in the table, then any read that reads across these files needs to
> identify what constitutes a row. Lance DB for example has vertical
> partitioning across columns but also horizontal partitioning across rows
> such that in each horizontal partitioning(fragment), the same number of
> rows exist in each vertical partition,  which I think is necessary to make
> whole/partial row construction cheap. If this is the case, there is no
> reason not to achieve the same data layout inside a single columnar file
> with a lean header. I think the only valid argument for a separate file is
> adding a new set of columns to an existing table, but even then I am not
> sure a separate file is absolutely necessary for good performance.
>
> Selcuk
>
> On Tue, May 27, 2025 at 9:18 AM Devin Smith
>  wrote:
>
>> There's a `file_path` field in the parquet ColumnChunk structure,
>> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>
>> I'm not sure what tooling actually supports this though. Could be
>> interesting to see what the history of this is.
>> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3,
>> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
>>
>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <
>> [email protected]> wrote:
>>
>>> I have to agree tha

Re: Wide tables in V4

2025-05-27 Thread Daniel Weeks
I feel like we have two different issues we're talking about here that
aren't necessarily tied (though solutions may address both): 1) wide
tables, 2) adding columns

Wide tables are definitely a problem where parquet has limitations. I'm
optimistic about the ongoing work to help improve parquet footers/stats in
this area that Fokko mentioned.  There are always limitations in how this
scales as wide rows lead to small row groups and the cost to reconstitute a
row gets more expensive, but for cases that are read heavy and project
subsets of columns, this should significantly improve performance.

Adding columns to an existing dataset is something that comes up
periodically, but there's a lot of complexity involved in this.  Parquet
does support referencing columns in separate files per the spec, but
there's no implementation that takes advantage of this to my knowledge.
This does allow for approaches where you separate/rewrite just the footers
or various other tricks, but these approaches get complicated quickly and
the number of readers that can consume those representations would
initially be very limited.

A larger problem for splitting columns across files is that there are a lot
of assumptions about how data is laid out in both readers and writers.  For
example, aligning row groups and correctly handling split calculation is
very complicated if you're trying to split rows across files.  Other
features are also impacted like deletes, which reference the file to which
they apply and would need to account for deletes applying to multiple files
and needing to update those references if columns are added.
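As a rough illustration of the delete-file interaction (a sketch only; the structures below are invented for illustration and are not the Iceberg spec): today a position delete points at a single data file, but once one logical row is spread over several column-family files, that reference either has to fan out to every family file or move to some logical file/row identifier.

from dataclasses import dataclass

@dataclass(frozen=True)
class PositionDelete:
    data_file: str   # the data file this delete applies to
    pos: int         # row position within that file

# One logical row stored as two column-family files (hypothetical layout).
row_files = [
    "s3://bucket/table/data/cf0-00001.parquet",
    "s3://bucket/table/data/cf1-00001.parquet",
]

# Deleting logical row 42 either fans out to one delete per family file ...
deletes = [PositionDelete(path, 42) for path in row_files]

# ... or references a single logical id that every family's reader can resolve.
# Either way, adding another column family later means existing delete
# references have to be reconciled with the new file.
print(deletes)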

I believe there are a lot of interesting approaches to addressing these use
cases, but we'd really need a thorough proposal that explores all of these
scenarios.  The last thing we would want is to introduce incompatibilities
within the format that result in incompatible features.

-Dan

On Tue, May 27, 2025 at 10:02 AM Russell Spitzer 
wrote:

> Point definitely taken. We really should probably POC some of these ideas
> and see what we are actually dealing with. (He said without volunteering to
> do the work :P)
>
> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya
>  wrote:
>
>> Yes having to rewrite the whole file is not ideal but I believe most of
>> the cost of rewriting a file comes from decompression, encoding, stats
>> calculations etc. If you are adding new values for some columns but are
>> keeping the rest of the columns the same in the file, then a bunch of
>> rewrite cost can be optimized away. I am not saying this is better than
>> writing to a separate file, I am not sure how much worse it is though.
>>
>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <
>> [email protected]> wrote:
>>
>>> I think that "after the fact" modification is one of the requirements
>>> here, IE: Updating a single column without rewriting the whole file.
>>> If we have to write new metadata for the file aren't we in the same boat
>>> as having to rewrite the whole file?
>>>
>>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya
>>>  wrote:
>>>
 If files represent column projections of a table rather than the whole
 columns in the table, then any read that reads across these files needs to
 identify what constitutes a row. Lance DB for example has vertical
 partitioning across columns but also horizontal partitioning across rows
 such that in each horizontal partitioning(fragment), the same number of
 rows exist in each vertical partition,  which I think is necessary to make
 whole/partial row construction cheap. If this is the case, there is no
 reason not to achieve the same data layout inside a single columnar file
 with a lean header. I think the only valid argument for a separate file is
 adding a new set of columns to an existing table, but even then I am not
 sure a separate file is absolutely necessary for good performance.

 Selcuk

 On Tue, May 27, 2025 at 9:18 AM Devin Smith
  wrote:

> There's a `file_path` field in the parquet ColumnChunk structure,
> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>
> I'm not sure what tooling actually supports this though. Could be
> interesting to see what the history of this is.
> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3,
> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
>
> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <
> [email protected]> wrote:
>
>> I have to agree that while there can be some fixes in Parquet, we
>> fundamentally need a way to split a "row group"
>> or something like that between separate files. If that's something we
>> can do in the parquet project that would be great
>> but it feels like we need to start exploring more drastic options
>> than footer encoding.
>>
>> On Mon, May 26, 2025 at 

Re: Wide tables in V4

2025-05-27 Thread Selcuk Aya
Yes, having to rewrite the whole file is not ideal, but I believe most of the
cost of rewriting a file comes from decompression, encoding, stats
calculations etc. If you are adding new values for some columns but are
keeping the rest of the columns the same in the file, then a bunch of
rewrite cost can be optimized away. I am not saying this is better than
writing to a separate file, I am not sure how much worse it is though.

On Tue, May 27, 2025 at 9:40 AM Russell Spitzer 
wrote:

> I think that "after the fact" modification is one of the requirements
> here, IE: Updating a single column without rewriting the whole file.
> If we have to write new metadata for the file aren't we in the same boat
> as having to rewrite the whole file?
>
> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya
>  wrote:
>
>> If files represent column projections of a table rather than the whole
>> columns in the table, then any read that reads across these files needs to
>> identify what constitutes a row. Lance DB for example has vertical
>> partitioning across columns but also horizontal partitioning across rows
>> such that in each horizontal partitioning(fragment), the same number of
>> rows exist in each vertical partition,  which I think is necessary to make
>> whole/partial row construction cheap. If this is the case, there is no
>> reason not to achieve the same data layout inside a single columnar file
>> with a lean header. I think the only valid argument for a separate file is
>> adding a new set of columns to an existing table, but even then I am not
>> sure a separate file is absolutely necessary for good performance.
>>
>> Selcuk
>>
>> On Tue, May 27, 2025 at 9:18 AM Devin Smith
>>  wrote:
>>
>>> There's a `file_path` field in the parquet ColumnChunk structure,
>>> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>>
>>> I'm not sure what tooling actually supports this though. Could be
>>> interesting to see what the history of this is.
>>> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3,
>>> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
>>>
>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <
>>> [email protected]> wrote:
>>>
 I have to agree that while there can be some fixes in Parquet, we
 fundamentally need a way to split a "row group"
 or something like that between separate files. If that's something we
 can do in the parquet project that would be great
 but it feels like we need to start exploring more drastic options than
 footer encoding.

 On Mon, May 26, 2025 at 8:42 PM Gang Wu  wrote:

> I agree with Steven that there are limitations that Parquet cannot do.
>
> In addition to adding new columns by rewriting all files, files of
> wide tables may suffer from bad performance like below:
> - Poor compression of row groups because there are too many columns
> and even a small number of rows can reach the row group threshold.
> - Dominating columns (e.g. blobs) may contribute to 99% size of a row
> group, leading to unbalanced column chunks and deteriorate the row group
> compression.
> - Similar to adding new columns, partial update also requires
> rewriting all columns of the affected rows.
>
> IIRC, some table formats already support splitting columns into
> different files:
> - Lance manifest splits a fragment [1] into one or more data files.
> - Apache Hudi has the concept of column family [2].
> - Apache Paimon supports sequence groups [3] for partial update.
>
> Although Parquet can introduce the concept of logical file and
> physical file to manage the columns to file mapping, this looks like yet
> another manifest file design which duplicates the purpose of Iceberg.
> These might be something worth exploring in Iceberg.
>
> [1] https://lancedb.github.io/lance/format.html#fragments
> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
> [3]
> https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>
> Best,
> Gang
>
>
>
> On Tue, May 27, 2025 at 7:03 AM Steven Wu 
> wrote:
>
>> The Parquet metadata proposal (linked by Fokko) is mainly addressing
>> the read performance due to bloated metadata.
>>
>> What Peter described in the description seems useful for some ML
>> workload of feature engineering. A new set of features/columns are added 
>> to
>> the table. Currently, Iceberg  would require rewriting all data files to
>> combine old and new columns (write amplification). Similarly, in the past
>> the community also talked about the use cases of updating a single 
>> column,
>> which would require rewriting all data files.
>>
>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <
>> [email protected]> wrote:

Re: Wide tables in V4

2025-05-27 Thread Russell Spitzer
Point definitely taken. We really should probably POC some of these ideas
and see what we are actually dealing with. (He said without volunteering to
do the work :P)

On Tue, May 27, 2025 at 11:55 AM Selcuk Aya
 wrote:

> Yes having to rewrite the whole file is not ideal but I believe most of
> the cost of rewriting a file comes from decompression, encoding, stats
> calculations etc. If you are adding new values for some columns but are
> keeping the rest of the columns the same in the file, then a bunch of
> rewrite cost can be optimized away. I am not saying this is better than
> writing to a separate file, I am not sure how much worse it is though.
>
> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer 
> wrote:
>
>> I think that "after the fact" modification is one of the requirements
>> here, IE: Updating a single column without rewriting the whole file.
>> If we have to write new metadata for the file aren't we in the same boat
>> as having to rewrite the whole file?
>>
>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya
>>  wrote:
>>
>>> If files represent column projections of a table rather than the whole
>>> columns in the table, then any read that reads across these files needs to
>>> identify what constitutes a row. Lance DB for example has vertical
>>> partitioning across columns but also horizontal partitioning across rows
>>> such that in each horizontal partitioning(fragment), the same number of
>>> rows exist in each vertical partition,  which I think is necessary to make
>>> whole/partial row construction cheap. If this is the case, there is no
>>> reason not to achieve the same data layout inside a single columnar file
>>> with a lean header. I think the only valid argument for a separate file is
>>> adding a new set of columns to an existing table, but even then I am not
>>> sure a separate file is absolutely necessary for good performance.
>>>
>>> Selcuk
>>>
>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith
>>>  wrote:
>>>
 There's a `file_path` field in the parquet ColumnChunk structure,
 https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962

 I'm not sure what tooling actually supports this though. Could be
 interesting to see what the history of this is.
 https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3,
 https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw

 On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <
 [email protected]> wrote:

> I have to agree that while there can be some fixes in Parquet, we
> fundamentally need a way to split a "row group"
> or something like that between separate files. If that's something we
> can do in the parquet project that would be great
> but it feels like we need to start exploring more drastic options than
> footer encoding.
>
> On Mon, May 26, 2025 at 8:42 PM Gang Wu  wrote:
>
>> I agree with Steven that there are limitations that Parquet cannot do.
>>
>> In addition to adding new columns by rewriting all files, files of
>> wide tables may suffer from bad performance like below:
>> - Poor compression of row groups because there are too many columns
>> and even a small number of rows can reach the row group threshold.
>> - Dominating columns (e.g. blobs) may contribute to 99% size of a row
>> group, leading to unbalanced column chunks and deteriorate the row group
>> compression.
>> - Similar to adding new columns, partial update also requires
>> rewriting all columns of the affected rows.
>>
>> IIRC, some table formats already support splitting columns into
>> different files:
>> - Lance manifest splits a fragment [1] into one or more data files.
>> - Apache Hudi has the concept of column family [2].
>> - Apache Paimon supports sequence groups [3] for partial update.
>>
>> Although Parquet can introduce the concept of logical file and
>> physical file to manage the columns to file mapping, this looks like yet
>> another manifest file design which duplicates the purpose of Iceberg.
>> These might be something worth exploring in Iceberg.
>>
>> [1] https://lancedb.github.io/lance/format.html#fragments
>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>> [3]
>> https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>
>> Best,
>> Gang
>>
>>
>>
>> On Tue, May 27, 2025 at 7:03 AM Steven Wu 
>> wrote:
>>
>>> The Parquet metadata proposal (linked by Fokko) is mainly addressing
>>> the read performance due to bloated metadata.
>>>
>>> What Peter described in the description seems useful for some ML
>>> workload of feature engineering. A new set of features/columns are 
>>> added to
>>> the table. Currently, Iceberg  would require rewriting all data files to

Re: Wide tables in V4

2025-05-27 Thread Russell Spitzer
I think that "after the fact" modification is one of the requirements here,
IE: Updating a single column without rewriting the whole file.
If we have to write new metadata for the file aren't we in the same boat as
having to rewrite the whole file?

On Tue, May 27, 2025 at 11:27 AM Selcuk Aya
 wrote:

> If files represent column projections of a table rather than the whole
> columns in the table, then any read that reads across these files needs to
> identify what constitutes a row. Lance DB for example has vertical
> partitioning across columns but also horizontal partitioning across rows
> such that in each horizontal partitioning(fragment), the same number of
> rows exist in each vertical partition,  which I think is necessary to make
> whole/partial row construction cheap. If this is the case, there is no
> reason not to achieve the same data layout inside a single columnar file
> with a lean header. I think the only valid argument for a separate file is
> adding a new set of columns to an existing table, but even then I am not
> sure a separate file is absolutely necessary for good performance.
>
> Selcuk
>
> On Tue, May 27, 2025 at 9:18 AM Devin Smith
>  wrote:
>
>> There's a `file_path` field in the parquet ColumnChunk structure,
>> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>
>> I'm not sure what tooling actually supports this though. Could be
>> interesting to see what the history of this is.
>> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3,
>> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
>>
>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <
>> [email protected]> wrote:
>>
>>> I have to agree that while there can be some fixes in Parquet, we
>>> fundamentally need a way to split a "row group"
>>> or something like that between separate files. If that's something we
>>> can do in the parquet project that would be great
>>> but it feels like we need to start exploring more drastic options than
>>> footer encoding.
>>>
>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu  wrote:
>>>
 I agree with Steven that there are limitations that Parquet cannot do.

 In addition to adding new columns by rewriting all files, files of wide
 tables may suffer from bad performance like below:
 - Poor compression of row groups because there are too many columns and
 even a small number of rows can reach the row group threshold.
 - Dominating columns (e.g. blobs) may contribute to 99% size of a row
 group, leading to unbalanced column chunks and deteriorate the row group
 compression.
 - Similar to adding new columns, partial update also requires rewriting
 all columns of the affected rows.

 IIRC, some table formats already support splitting columns into
 different files:
 - Lance manifest splits a fragment [1] into one or more data files.
 - Apache Hudi has the concept of column family [2].
 - Apache Paimon supports sequence groups [3] for partial update.

 Although Parquet can introduce the concept of logical file and physical
 file to manage the columns to file mapping, this looks like yet another
 manifest file design which duplicates the purpose of Iceberg.
 These might be something worth exploring in Iceberg.

 [1] https://lancedb.github.io/lance/format.html#fragments
 [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
 [3]
 https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group

 Best,
 Gang



 On Tue, May 27, 2025 at 7:03 AM Steven Wu  wrote:

> The Parquet metadata proposal (linked by Fokko) is mainly addressing
> the read performance due to bloated metadata.
>
> What Peter described in the description seems useful for some ML
> workload of feature engineering. A new set of features/columns are added 
> to
> the table. Currently, Iceberg  would require rewriting all data files to
> combine old and new columns (write amplification). Similarly, in the past
> the community also talked about the use cases of updating a single column,
> which would require rewriting all data files.
>
> On Mon, May 26, 2025 at 2:42 PM Péter Váry <
> [email protected]> wrote:
>
>> Do you have the link at hand for the thread where this was discussed
>> on the Parquet list?
>> The docs seem quite old, and the PR stale, so I would like to
>> understand the situation better.
>> If it is possible to do this in Parquet, that would be great, but
>> Avro, ORC would still suffer.
>>
>> Amogh Jahagirdar <[email protected]> ezt írta (időpont: 2025. máj.
>> 26., H, 22:07):
>>
>>> Hey Peter,
>>>
>>> Thanks for bringing this issue up. I think I agree with Fokko; the
>>> issue of wide tables leading to Parquet metadata 

Re: Wide tables in V4

2025-05-27 Thread Selcuk Aya
If files represent column projections of a table rather than the whole
columns in the table, then any read that reads across these files needs to
identify what constitutes a row. Lance DB for example has vertical
partitioning across columns but also horizontal partitioning across rows
such that within each horizontal partition (fragment), the same number of
rows exists in each vertical partition, which I think is necessary to make
whole/partial row construction cheap. If this is the case, there is no
reason not to achieve the same data layout inside a single columnar file
with a lean header. I think the only valid argument for a separate file is
adding a new set of columns to an existing table, but even then I am not
sure a separate file is absolutely necessary for good performance.
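A tiny sketch of that layout, with made-up file names and a hypothetical per-file reader interface: because every column group of a fragment holds exactly the same rows in the same order, a row is reassembled purely by position, with no join keys.

# Hypothetical fragment description: one horizontal slice of the table,
# vertically split into two column-group files with identical row order.
fragment = {
    "num_rows": 1_000_000,
    "column_groups": [
        {"file": "frag0/cols_0000_0499.parquet", "columns": list(range(0, 500))},
        {"file": "frag0/cols_0500_0999.parquet", "columns": list(range(500, 1000))},
    ],
}

def read_row(fragment, row_index, readers):
    """Reassemble logical row `row_index` by positional lookup in each file.

    `readers` maps file name -> an object with a read_row(index) method;
    this is an assumed interface, not an existing API.
    """
    row = {}
    for group in fragment["column_groups"]:
        row.update(readers[group["file"]].read_row(row_index))
    return row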

Selcuk

On Tue, May 27, 2025 at 9:18 AM Devin Smith 
wrote:

> There's a `file_path` field in the parquet ColumnChunk structure,
> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>
> I'm not sure what tooling actually supports this though. Could be
> interesting to see what the history of this is.
> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3,
> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
>
> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer 
> wrote:
>
>> I have to agree that while there can be some fixes in Parquet, we
>> fundamentally need a way to split a "row group"
>> or something like that between separate files. If that's something we can
>> do in the parquet project that would be great
>> but it feels like we need to start exploring more drastic options than
>> footer encoding.
>>
>> On Mon, May 26, 2025 at 8:42 PM Gang Wu  wrote:
>>
>>> I agree with Steven that there are limitations that Parquet cannot do.
>>>
>>> In addition to adding new columns by rewriting all files, files of wide
>>> tables may suffer from bad performance like below:
>>> - Poor compression of row groups because there are too many columns and
>>> even a small number of rows can reach the row group threshold.
>>> - Dominating columns (e.g. blobs) may contribute to 99% size of a row
>>> group, leading to unbalanced column chunks and deteriorate the row group
>>> compression.
>>> - Similar to adding new columns, partial update also requires rewriting
>>> all columns of the affected rows.
>>>
>>> IIRC, some table formats already support splitting columns into
>>> different files:
>>> - Lance manifest splits a fragment [1] into one or more data files.
>>> - Apache Hudi has the concept of column family [2].
>>> - Apache Paimon supports sequence groups [3] for partial update.
>>>
>>> Although Parquet can introduce the concept of logical file and physical
>>> file to manage the columns to file mapping, this looks like yet another
>>> manifest file design which duplicates the purpose of Iceberg.
>>> These might be something worth exploring in Iceberg.
>>>
>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>> [3]
>>> https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>
>>> Best,
>>> Gang
>>>
>>>
>>>
>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu  wrote:
>>>
 The Parquet metadata proposal (linked by Fokko) is mainly addressing
 the read performance due to bloated metadata.

 What Peter described in the description seems useful for some ML
 workload of feature engineering. A new set of features/columns are added to
 the table. Currently, Iceberg  would require rewriting all data files to
 combine old and new columns (write amplification). Similarly, in the past
 the community also talked about the use cases of updating a single column,
 which would require rewriting all data files.

 On Mon, May 26, 2025 at 2:42 PM Péter Váry 
 wrote:

> Do you have the link at hand for the thread where this was discussed
> on the Parquet list?
> The docs seem quite old, and the PR stale, so I would like to
> understand the situation better.
> If it is possible to do this in Parquet, that would be great, but
> Avro, ORC would still suffer.
>
> Amogh Jahagirdar <[email protected]> ezt írta (időpont: 2025. máj.
> 26., H, 22:07):
>
>> Hey Peter,
>>
>> Thanks for bringing this issue up. I think I agree with Fokko; the
>> issue of wide tables leading to Parquet metadata bloat and poor Thrift
>> deserialization performance is a long standing issue that I believe 
>> there's
>> motivation in the community to address. So to me it seems better to 
>> address
>> it in Parquet itself rather than Iceberg library facilitate a pattern 
>> which
>> works around the limitations.
>>
>> Thanks,
>> Amogh Jahagirdar
>>
>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong 
>> wrote:
>>
>>> Hi Peter,

Re: Wide tables in V4

2025-05-27 Thread Devin Smith
There's a `file_path` field in the parquet ColumnChunk structure,
https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962

I'm not sure what tooling actually supports this though. Could be
interesting to see what the history of this is.
https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3,
https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
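For anyone who wants to poke at this from Python: pyarrow exposes the field as ColumnChunkMetaData.file_path, so a quick check of whether a given file uses it looks roughly like the snippet below (path is a placeholder; in my experience the field is empty for files produced by mainstream writers).

import pyarrow.parquet as pq

md = pq.ParquetFile("example.parquet").metadata
for rg in range(md.num_row_groups):
    for col in range(md.num_columns):
        chunk = md.row_group(rg).column(col)
        # file_path is empty for self-contained files; a non-empty value would
        # mean the column chunk's data lives in a different file.
        print(rg, chunk.path_in_schema, chunk.file_path)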

On Tue, May 27, 2025 at 8:59 AM Russell Spitzer 
wrote:

> I have to agree that while there can be some fixes in Parquet, we
> fundamentally need a way to split a "row group"
> or something like that between separate files. If that's something we can
> do in the parquet project that would be great
> but it feels like we need to start exploring more drastic options than
> footer encoding.
>
> On Mon, May 26, 2025 at 8:42 PM Gang Wu  wrote:
>
>> I agree with Steven that there are limitations that Parquet cannot do.
>>
>> In addition to adding new columns by rewriting all files, files of wide
>> tables may suffer from bad performance like below:
>> - Poor compression of row groups because there are too many columns and
>> even a small number of rows can reach the row group threshold.
>> - Dominating columns (e.g. blobs) may contribute to 99% size of a row
>> group, leading to unbalanced column chunks and deteriorate the row group
>> compression.
>> - Similar to adding new columns, partial update also requires rewriting
>> all columns of the affected rows.
>>
>> IIRC, some table formats already support splitting columns into different
>> files:
>> - Lance manifest splits a fragment [1] into one or more data files.
>> - Apache Hudi has the concept of column family [2].
>> - Apache Paimon supports sequence groups [3] for partial update.
>>
>> Although Parquet can introduce the concept of logical file and physical
>> file to manage the columns to file mapping, this looks like yet another
>> manifest file design which duplicates the purpose of Iceberg.
>> These might be something worth exploring in Iceberg.
>>
>> [1] https://lancedb.github.io/lance/format.html#fragments
>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>> [3]
>> https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>
>> Best,
>> Gang
>>
>>
>>
>> On Tue, May 27, 2025 at 7:03 AM Steven Wu  wrote:
>>
>>> The Parquet metadata proposal (linked by Fokko) is mainly addressing
>>> the read performance due to bloated metadata.
>>>
>>> What Peter described in the description seems useful for some ML
>>> workload of feature engineering. A new set of features/columns are added to
>>> the table. Currently, Iceberg  would require rewriting all data files to
>>> combine old and new columns (write amplification). Similarly, in the past
>>> the community also talked about the use cases of updating a single column,
>>> which would require rewriting all data files.
>>>
>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry 
>>> wrote:
>>>
 Do you have the link at hand for the thread where this was discussed on
 the Parquet list?
 The docs seem quite old, and the PR stale, so I would like to
 understand the situation better.
 If it is possible to do this in Parquet, that would be great, but Avro,
 ORC would still suffer.

 Amogh Jahagirdar <[email protected]> ezt írta (időpont: 2025. máj. 26.,
 H, 22:07):

> Hey Peter,
>
> Thanks for bringing this issue up. I think I agree with Fokko; the
> issue of wide tables leading to Parquet metadata bloat and poor Thrift
> deserialization performance is a long standing issue that I believe 
> there's
> motivation in the community to address. So to me it seems better to 
> address
> it in Parquet itself rather than Iceberg library facilitate a pattern 
> which
> works around the limitations.
>
> Thanks,
> Amogh Jahagirdar
>
> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong 
> wrote:
>
>> Hi Peter,
>>
>> Thanks for bringing this up. Wouldn't it make more sense to fix this
>> in Parquet itself? It has been a long-running issue on Parquet, and there
>> is still active interest from the community. There is a PR to replace the
>> footer with FlatBuffers, which dramatically improves performance
>> . The underlying
>> proposal can be found here
>> 
>> .
>>
>> Kind regards,
>> Fokko
>>
>> Op ma 26 mei 2025 om 20:35 schreef yun zou <
>> [email protected]>:
>>
>>> +1, I am really interested in this topic. Performance has always
>>> been a problem when dealing with wide tables, not just read/write, but 
>>> also
>>> during compilation. Most of the ML use cases typically exhibit a 
>>> vectorized

Re: Wide tables in V4

2025-05-27 Thread Russell Spitzer
I have to agree that while there can be some fixes in Parquet, we
fundamentally need a way to split a "row group"
or something like that between separate files. If that's something we can
do in the parquet project that would be great
but it feels like we need to start exploring more drastic options than
footer encoding.

On Mon, May 26, 2025 at 8:42 PM Gang Wu  wrote:

> I agree with Steven that there are limitations that Parquet cannot do.
>
> In addition to adding new columns by rewriting all files, files of wide
> tables may suffer from bad performance like below:
> - Poor compression of row groups because there are too many columns and
> even a small number of rows can reach the row group threshold.
> - Dominating columns (e.g. blobs) may contribute to 99% size of a row
> group, leading to unbalanced column chunks and deteriorate the row group
> compression.
> - Similar to adding new columns, partial update also requires rewriting
> all columns of the affected rows.
>
> IIRC, some table formats already support splitting columns into different
> files:
> - Lance manifest splits a fragment [1] into one or more data files.
> - Apache Hudi has the concept of column family [2].
> - Apache Paimon supports sequence groups [3] for partial update.
>
> Although Parquet can introduce the concept of logical file and physical
> file to manage the columns to file mapping, this looks like yet another
> manifest file design which duplicates the purpose of Iceberg.
> These might be something worth exploring in Iceberg.
>
> [1] https://lancedb.github.io/lance/format.html#fragments
> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
> [3]
> https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>
> Best,
> Gang
>
>
>
> On Tue, May 27, 2025 at 7:03 AM Steven Wu  wrote:
>
>> The Parquet metadata proposal (linked by Fokko) is mainly addressing
>> the read performance due to bloated metadata.
>>
>> What Peter described in the description seems useful for some ML workload
>> of feature engineering. A new set of features/columns are added to the
>> table. Currently, Iceberg  would require rewriting all data files to
>> combine old and new columns (write amplification). Similarly, in the past
>> the community also talked about the use cases of updating a single column,
>> which would require rewriting all data files.
>>
>> On Mon, May 26, 2025 at 2:42 PM Péter Váry 
>> wrote:
>>
>>> Do you have the link at hand for the thread where this was discussed on
>>> the Parquet list?
>>> The docs seem quite old, and the PR stale, so I would like to understand
>>> the situation better.
>>> If it is possible to do this in Parquet, that would be great, but Avro,
>>> ORC would still suffer.
>>>
>>> Amogh Jahagirdar <[email protected]> ezt írta (időpont: 2025. máj. 26.,
>>> H, 22:07):
>>>
 Hey Peter,

 Thanks for bringing this issue up. I think I agree with Fokko; the
 issue of wide tables leading to Parquet metadata bloat and poor Thrift
 deserialization performance is a long standing issue that I believe there's
 motivation in the community to address. So to me it seems better to address
 it in Parquet itself rather than Iceberg library facilitate a pattern which
 works around the limitations.

 Thanks,
 Amogh Jahagirdar

 On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong 
 wrote:

> Hi Peter,
>
> Thanks for bringing this up. Wouldn't it make more sense to fix this
> in Parquet itself? It has been a long-running issue on Parquet, and there
> is still active interest from the community. There is a PR to replace the
> footer with FlatBuffers, which dramatically improves performance
> . The underlying proposal
> can be found here
> 
> .
>
> Kind regards,
> Fokko
>
> Op ma 26 mei 2025 om 20:35 schreef yun zou  >:
>
>> +1, I am really interested in this topic. Performance has always been
>> a problem when dealing with wide tables, not just read/write, but also
>> during compilation. Most of the ML use cases typically exhibit a 
>> vectorized
>> read/write pattern, I am also wondering if there is any way at the 
>> metadata
>> level to help the whole compilation and execution process. I do not have
>> any answer fo this yet, but I would be really interested in exploring 
>> this
>> further.
>>
>> Best Regards,
>> Yun
>>
>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang
>>  wrote:
>>
>>> Hi Peter, I am interested in this proposal. What's more, I am
>>> curious if there is a similar story on the write side as well (how to
>>> generate these splitted files) and specifically, are you targeting 
>>> feature
>>> backfill use c

Re: Wide tables in V4

2025-05-26 Thread Gang Wu
I agree with Steven that there are limitations to what Parquet can do.

In addition to adding new columns by rewriting all files, files of wide
tables may suffer from poor performance, for example:
- Poor compression of row groups, because there are so many columns that even
a small number of rows reaches the row group threshold (see the quick
arithmetic sketch after this list).
- Dominating columns (e.g. blobs) may contribute 99% of the size of a row
group, leading to unbalanced column chunks and deteriorating row group
compression.
- Similar to adding new columns, partial update also requires rewriting all
columns of the affected rows.
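Back-of-the-envelope arithmetic for the first point, assuming uncompressed 8-byte doubles and a typical 128 MB row-group target (both numbers are assumptions, not measurements):

row_group_target_bytes = 128 * 1024 * 1024
bytes_per_value = 8  # uncompressed double

for num_columns in (100, 1_000, 10_000):
    rows_per_group = row_group_target_bytes // (num_columns * bytes_per_value)
    chunk_kib = rows_per_group * bytes_per_value / 1024
    print(f"{num_columns:>6} columns -> ~{rows_per_group:>7} rows per row group, "
          f"~{chunk_kib:,.0f} KiB per column chunk")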

IIRC, some table formats already support splitting columns into different
files:
- Lance manifest splits a fragment [1] into one or more data files.
- Apache Hudi has the concept of column family [2].
- Apache Paimon supports sequence groups [3] for partial update.

Although Parquet could introduce the concept of a logical file and physical
files to manage the column-to-file mapping, that looks like yet another
manifest file design, which duplicates the purpose of Iceberg.
These might be something worth exploring in Iceberg.

[1] https://lancedb.github.io/lance/format.html#fragments
[2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
[3]
https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group

Best,
Gang



On Tue, May 27, 2025 at 7:03 AM Steven Wu  wrote:

> The Parquet metadata proposal (linked by Fokko) is mainly addressing
> the read performance due to bloated metadata.
>
> What Peter described in the description seems useful for some ML workload
> of feature engineering. A new set of features/columns are added to the
> table. Currently, Iceberg  would require rewriting all data files to
> combine old and new columns (write amplification). Similarly, in the past
> the community also talked about the use cases of updating a single column,
> which would require rewriting all data files.
>
> On Mon, May 26, 2025 at 2:42 PM Péter Váry 
> wrote:
>
>> Do you have the link at hand for the thread where this was discussed on
>> the Parquet list?
>> The docs seem quite old, and the PR stale, so I would like to understand
>> the situation better.
>> If it is possible to do this in Parquet, that would be great, but Avro,
>> ORC would still suffer.
>>
>> Amogh Jahagirdar <[email protected]> ezt írta (időpont: 2025. máj. 26.,
>> H, 22:07):
>>
>>> Hey Peter,
>>>
>>> Thanks for bringing this issue up. I think I agree with Fokko; the issue
>>> of wide tables leading to Parquet metadata bloat and poor Thrift
>>> deserialization performance is a long standing issue that I believe there's
>>> motivation in the community to address. So to me it seems better to address
>>> it in Parquet itself rather than Iceberg library facilitate a pattern which
>>> works around the limitations.
>>>
>>> Thanks,
>>> Amogh Jahagirdar
>>>
>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong 
>>> wrote:
>>>
 Hi Peter,

 Thanks for bringing this up. Wouldn't it make more sense to fix this in
 Parquet itself? It has been a long-running issue on Parquet, and there is
 still active interest from the community. There is a PR to replace the
 footer with FlatBuffers, which dramatically improves performance
 . The underlying proposal
 can be found here
 
 .

 Kind regards,
 Fokko

 Op ma 26 mei 2025 om 20:35 schreef yun zou >>> >:

> +1, I am really interested in this topic. Performance has always been
> a problem when dealing with wide tables, not just read/write, but also
> during compilation. Most of the ML use cases typically exhibit a 
> vectorized
> read/write pattern, I am also wondering if there is any way at the 
> metadata
> level to help the whole compilation and execution process. I do not have
> any answer fo this yet, but I would be really interested in exploring this
> further.
>
> Best Regards,
> Yun
>
> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang
>  wrote:
>
>> Hi Peter, I am interested in this proposal. What's more, I am curious
>> if there is a similar story on the write side as well (how to generate
>> these splitted files) and specifically, are you targeting feature 
>> backfill
>> use cases in ML use?
>>
>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <
>> [email protected]> wrote:
>>
>>> Hi Team,
>>>
>>> In machine learning use-cases, it's common to encounter tables with
>>> a very high number of columns - sometimes even in the range of several
>>> thousand. I've seen cases with up to 15,000 columns. Storing such wide
>>> tables in a single Parquet file is often suboptimal, as Parquet can 
>>> become
>>> a bottleneck, even when only a su

Re: Wide tables in V4

2025-05-26 Thread Steven Wu
The Parquet metadata proposal (linked by Fokko) mainly addresses
the read performance problems caused by bloated metadata.

What Peter described in the description seems useful for some ML workloads
around feature engineering. A new set of features/columns is added to the
table. Currently, Iceberg would require rewriting all data files to
combine old and new columns (write amplification). Similarly, in the past
the community also talked about the use case of updating a single column,
which would likewise require rewriting all data files.
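To illustrate that write amplification with a hedged sketch (table and column names are invented, and it assumes Spark SQL against an Iceberg catalog): the schema change itself is metadata-only, but the backfill that follows rewrites every data file it touches, because old and new columns have to end up in the same files.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("backfill-example").getOrCreate()

# Metadata-only: adding the column rewrites no data files.
spark.sql("ALTER TABLE db.features ADD COLUMN f_new DOUBLE")

# Backfill: with copy-on-write semantics, every file containing an affected row
# is rewritten in full, even though only one column actually changed.
spark.sql("UPDATE db.features SET f_new = f1 * 0.5")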

On Mon, May 26, 2025 at 2:42 PM Péter Váry 
wrote:

> Do you have the link at hand for the thread where this was discussed on
> the Parquet list?
> The docs seem quite old, and the PR stale, so I would like to understand
> the situation better.
> If it is possible to do this in Parquet, that would be great, but Avro,
> ORC would still suffer.
>
> Amogh Jahagirdar <[email protected]> ezt írta (időpont: 2025. máj. 26., H,
> 22:07):
>
>> Hey Peter,
>>
>> Thanks for bringing this issue up. I think I agree with Fokko; the issue
>> of wide tables leading to Parquet metadata bloat and poor Thrift
>> deserialization performance is a long standing issue that I believe there's
>> motivation in the community to address. So to me it seems better to address
>> it in Parquet itself rather than Iceberg library facilitate a pattern which
>> works around the limitations.
>>
>> Thanks,
>> Amogh Jahagirdar
>>
>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong 
>> wrote:
>>
>>> Hi Peter,
>>>
>>> Thanks for bringing this up. Wouldn't it make more sense to fix this in
>>> Parquet itself? It has been a long-running issue on Parquet, and there is
>>> still active interest from the community. There is a PR to replace the
>>> footer with FlatBuffers, which dramatically improves performance
>>> . The underlying proposal
>>> can be found here
>>> 
>>> .
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> Op ma 26 mei 2025 om 20:35 schreef yun zou :
>>>
 +1, I am really interested in this topic. Performance has always been a
 problem when dealing with wide tables, not just read/write, but also during
 compilation. Most of the ML use cases typically exhibit a vectorized
 read/write pattern, I am also wondering if there is any way at the metadata
 level to help the whole compilation and execution process. I do not have
 any answer fo this yet, but I would be really interested in exploring this
 further.

 Best Regards,
 Yun

 On Mon, May 26, 2025 at 9:14 AM Pucheng Yang
  wrote:

> Hi Peter, I am interested in this proposal. What's more, I am curious
> if there is a similar story on the write side as well (how to generate
> these splitted files) and specifically, are you targeting feature backfill
> use cases in ML use?
>
> On Mon, May 26, 2025 at 6:29 AM Péter Váry <
> [email protected]> wrote:
>
>> Hi Team,
>>
>> In machine learning use-cases, it's common to encounter tables with a
>> very high number of columns - sometimes even in the range of several
>> thousand. I've seen cases with up to 15,000 columns. Storing such wide
>> tables in a single Parquet file is often suboptimal, as Parquet can 
>> become
>> a bottleneck, even when only a subset of columns is queried.
>>
>> A common approach to mitigate this is to split the data across
>> multiple Parquet files. With the upcoming File Format API, we could
>> introduce a layer that combines these files into a single iterator,
>> enabling efficient reading of wide and very wide tables.
>>
>> To support this, we would need to revise the metadata specification.
>> Instead of the current `_file` column, we could introduce a _files column
>> containing:
>> - `_file_column_ids`: the column IDs present in each file
>> - `_file_path`: the path to the corresponding file
>>
>> Has there been any prior discussion around this idea?
>> Is anyone else interested in exploring this further?
>>
>> Best regards,
>> Peter
>>
>
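Reading Peter's sketch above, a purely illustrative example of what such a manifest entry and the corresponding scan planning could look like (the field names follow his email; everything else, including the overall shape, is an assumption rather than a spec proposal):

manifest_entry = {
    "status": "ADDED",
    "_files": [
        {"_file_path": "s3://bucket/table/data/cf0-00001.parquet",
         "_file_column_ids": [1, 2, 3]},
        {"_file_path": "s3://bucket/table/data/cf1-00001.parquet",
         "_file_column_ids": [4, 5, 6]},
    ],
}

def files_for_projection(entry, projected_ids):
    # A planner would only open the files whose column ids intersect the
    # projection, which is where the wide-table read savings would come from.
    return [f["_file_path"] for f in entry["_files"]
            if set(f["_file_column_ids"]) & set(projected_ids)]

print(files_for_projection(manifest_entry, {2, 5}))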


Re: Wide tables in V4

2025-05-26 Thread Péter Váry
Do you have the link at hand for the thread where this was discussed on the
Parquet list?
The docs seem quite old, and the PR stale, so I would like to understand
the situation better.
If it is possible to do this in Parquet, that would be great, but Avro and ORC
would still suffer.

Amogh Jahagirdar <[email protected]> ezt írta (időpont: 2025. máj. 26., H,
22:07):

> Hey Peter,
>
> Thanks for bringing this issue up. I think I agree with Fokko; the issue
> of wide tables leading to Parquet metadata bloat and poor Thrift
> deserialization performance is a long standing issue that I believe there's
> motivation in the community to address. So to me it seems better to address
> it in Parquet itself rather than Iceberg library facilitate a pattern which
> works around the limitations.
>
> Thanks,
> Amogh Jahagirdar
>
> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong  wrote:
>
>> Hi Peter,
>>
>> Thanks for bringing this up. Wouldn't it make more sense to fix this in
>> Parquet itself? It has been a long-running issue on Parquet, and there is
>> still active interest from the community. There is a PR to replace the
>> footer with FlatBuffers, which dramatically improves performance
>> . The underlying proposal
>> can be found here
>> 
>> .
>>
>> Kind regards,
>> Fokko
>>
>> Op ma 26 mei 2025 om 20:35 schreef yun zou :
>>
>>> +1, I am really interested in this topic. Performance has always been a
>>> problem when dealing with wide tables, not just read/write, but also during
>>> compilation. Most of the ML use cases typically exhibit a vectorized
>>> read/write pattern, so I am also wondering if there is any way at the metadata
>>> level to help the whole compilation and execution process. I do not have any
>>> answer for this yet, but I would be really interested in exploring this further.
>>>
>>> Best Regards,
>>> Yun
>>>
>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang 
>>> wrote:
>>>
 Hi Peter, I am interested in this proposal. What's more, I am curious
 if there is a similar story on the write side as well (how to generate
these split files), and specifically, are you targeting feature backfill
use cases in ML?

 On Mon, May 26, 2025 at 6:29 AM Péter Váry 
 wrote:

> Hi Team,
>
> In machine learning use-cases, it's common to encounter tables with a
> very high number of columns - sometimes even in the range of several
> thousand. I've seen cases with up to 15,000 columns. Storing such wide
> tables in a single Parquet file is often suboptimal, as Parquet can become
> a bottleneck, even when only a subset of columns is queried.
>
> A common approach to mitigate this is to split the data across
> multiple Parquet files. With the upcoming File Format API, we could
> introduce a layer that combines these files into a single iterator,
> enabling efficient reading of wide and very wide tables.
>
> To support this, we would need to revise the metadata specification.
> Instead of the current `_file` column, we could introduce a `_files` column
> containing:
> - `_file_column_ids`: the column IDs present in each file
> - `_file_path`: the path to the corresponding file
>
> Has there been any prior discussion around this idea?
> Is anyone else interested in exploring this further?
>
> Best regards,
> Peter
>
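
To make the "single iterator over multiple files" part of the proposal above
more concrete, here is a rough sketch of a positional merge over per-column-family
readers. It is not the File Format API (which does not yet exist in this form);
the Map-based record type and the constructor are placeholder assumptions.

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    // Sketch only: assumes every column-family file was written with the same
    // row count and row order, so row i can be reassembled purely by position.
    class CombinedRowIterator implements Iterator<Map<String, Object>> {
      private final List<Iterator<Map<String, Object>>> families;

      CombinedRowIterator(List<Iterator<Map<String, Object>>> families) {
        this.families = families;
      }

      @Override
      public boolean hasNext() {
        // All families advance in lock-step, so checking the first one is enough here.
        return !families.isEmpty() && families.get(0).hasNext();
      }

      @Override
      public Map<String, Object> next() {
        Map<String, Object> row = new HashMap<>();
        for (Iterator<Map<String, Object>> family : families) {
          row.putAll(family.next()); // positional join of the vertical slices
        }
        return row;
      }
    }

A real implementation would also have to decide how the per-family readers are
parallelized and how a column projection maps onto the individual families.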



Re: Wide tables in V4

2025-05-26 Thread Amogh Jahagirdar
Hey Peter,

Thanks for bringing this issue up. I think I agree with Fokko; the issue of
wide tables leading to Parquet metadata bloat and poor Thrift
deserialization performance is a long-standing issue that I believe there's
motivation in the community to address. So to me it seems better to address it
in Parquet itself rather than have the Iceberg library facilitate a pattern that
works around the limitations.

Thanks,
Amogh Jahagirdar

On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong  wrote:

> Hi Peter,
>
> Thanks for bringing this up. Wouldn't it make more sense to fix this in
> Parquet itself? It has been a long-running issue on Parquet, and there is
> still active interest from the community. There is a PR to replace the
> footer with FlatBuffers, which dramatically improves performance. The
> underlying proposal can be found here.
>
> Kind regards,
> Fokko
>
> On Mon, May 26, 2025 at 20:35, yun zou wrote:
>
>> +1, I am really interested in this topic. Performance has always been a
>> problem when dealing with wide tables, not just read/write, but also during
>> compilation. Most of the ML use cases typically exhibit a vectorized
>> read/write pattern, so I am also wondering if there is any way at the metadata
>> level to help the whole compilation and execution process. I do not have any
>> answer for this yet, but I would be really interested in exploring this further.
>>
>> Best Regards,
>> Yun
>>
>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang 
>> wrote:
>>
>>> Hi Peter, I am interested in this proposal. What's more, I am curious if
>>> there is a similar story on the write side as well (how to generate these
>>> split files), and specifically, are you targeting feature backfill use
>>> cases in ML?
>>>
>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry 
>>> wrote:
>>>
 Hi Team,

 In machine learning use-cases, it's common to encounter tables with a
 very high number of columns - sometimes even in the range of several
 thousand. I've seen cases with up to 15,000 columns. Storing such wide
 tables in a single Parquet file is often suboptimal, as Parquet can become
 a bottleneck, even when only a subset of columns is queried.

 A common approach to mitigate this is to split the data across multiple
 Parquet files. With the upcoming File Format API, we could introduce a
 layer that combines these files into a single iterator, enabling efficient
 reading of wide and very wide tables.

 To support this, we would need to revise the metadata specification.
Instead of the current `_file` column, we could introduce a `_files` column
 containing:
 - `_file_column_ids`: the column IDs present in each file
 - `_file_path`: the path to the corresponding file

 Has there been any prior discussion around this idea?
 Is anyone else interested in exploring this further?

 Best regards,
 Peter

>>>
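
To give a rough sense of the metadata-bloat concern raised above, here is a
back-of-the-envelope estimate. Every number in it is an assumption chosen for
illustration, not a measurement of any real file, but it shows how Thrift
footer size scales with columns times row groups:

    // Sketch only: illustrative arithmetic, not a measurement.
    class FooterSizeEstimate {
      public static void main(String[] args) {
        long columns = 15_000;           // wide-table case mentioned in this thread
        long rowGroups = 50;             // assumed number of row groups in one file
        long bytesPerColumnChunk = 100;  // assumed average ColumnChunk metadata size
        long footerBytes = columns * rowGroups * bytesPerColumnChunk;
        System.out.printf("Estimated footer size: ~%d MB%n", footerBytes / 1_000_000);
      }
    }

Under these assumptions the footer alone is on the order of 75 MB of Thrift
that must be deserialized before the first value can be read, which is why both
the footer rework in Parquet and vertical splitting look appealing.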


Re: Wide tables in V4

2025-05-26 Thread Fokko Driesprong
Hi Peter,

Thanks for bringing this up. Wouldn't it make more sense to fix this in
Parquet itself? It has been a long-running issue on Parquet, and there is
still active interest from the community. There is a PR to replace the
footer with FlatBuffers, which dramatically improves performance. The
underlying proposal can be found here.

Kind regards,
Fokko

On Mon, May 26, 2025 at 20:35, yun zou wrote:

> +1, I am really interested in this topic. Performance has always been a
> problem when dealing with wide tables, not just read/write, but also during
> compilation. Most of the ML use cases typically exhibit a vectorized
> read/write pattern, so I am also wondering if there is any way at the metadata
> level to help the whole compilation and execution process. I do not have any
> answer for this yet, but I would be really interested in exploring this further.
>
> Best Regards,
> Yun
>
> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang 
> wrote:
>
>> Hi Peter, I am interested in this proposal. What's more, I am curious if
>> there is a similar story on the write side as well (how to generate these
>> split files), and specifically, are you targeting feature backfill use
>> cases in ML?
>>
>> On Mon, May 26, 2025 at 6:29 AM Péter Váry 
>> wrote:
>>
>>> Hi Team,
>>>
>>> In machine learning use-cases, it's common to encounter tables with a
>>> very high number of columns - sometimes even in the range of several
>>> thousand. I've seen cases with up to 15,000 columns. Storing such wide
>>> tables in a single Parquet file is often suboptimal, as Parquet can become
>>> a bottleneck, even when only a subset of columns is queried.
>>>
>>> A common approach to mitigate this is to split the data across multiple
>>> Parquet files. With the upcoming File Format API, we could introduce a
>>> layer that combines these files into a single iterator, enabling efficient
>>> reading of wide and very wide tables.
>>>
>>> To support this, we would need to revise the metadata specification.
>>> Instead of the current `_file` column, we could introduce a `_files` column
>>> containing:
>>> - `_file_column_ids`: the column IDs present in each file
>>> - `_file_path`: the path to the corresponding file
>>>
>>> Has there been any prior discussion around this idea?
>>> Is anyone else interested in exploring this further?
>>>
>>> Best regards,
>>> Peter
>>>
>>


Re: Wide tables in V4

2025-05-26 Thread Pucheng Yang
Hi Peter, I am interested in this proposal. What's more, I am curious if
there is a similar story on the write side as well (how to generate these
split files), and specifically, are you targeting feature backfill use
cases in ML?

On Mon, May 26, 2025 at 6:29 AM Péter Váry 
wrote:

> Hi Team,
>
> In machine learning use-cases, it's common to encounter tables with a very
> high number of columns - sometimes even in the range of several thousand.
> I've seen cases with up to 15,000 columns. Storing such wide tables in a
> single Parquet file is often suboptimal, as Parquet can become a
> bottleneck, even when only a subset of columns is queried.
>
> A common approach to mitigate this is to split the data across multiple
> Parquet files. With the upcoming File Format API, we could introduce a
> layer that combines these files into a single iterator, enabling efficient
> reading of wide and very wide tables.
>
> To support this, we would need to revise the metadata specification.
> Instead of the current `_file` column, we could introduce a `_files` column
> containing:
> - `_file_column_ids`: the column IDs present in each file
> - `_file_path`: the path to the corresponding file
>
> Has there been any prior discussion around this idea?
> Is anyone else interested in exploring this further?
>
> Best regards,
> Peter
>


Re: Wide tables in V4

2025-05-26 Thread yun zou
+1, I am really interested in this topic. Performance has always been a
problem when dealing with wide tables, not just read/write, but also during
compilation. Most of the ML use cases typically exhibit a vectorized
read/write pattern, so I am also wondering if there is any way at the metadata
level to help the whole compilation and execution process. I do not have any
answer for this yet, but I would be really interested in exploring this further.

Best Regards,
Yun

On Mon, May 26, 2025 at 9:14 AM Pucheng Yang 
wrote:

> Hi Peter, I am interested in this proposal. What's more, I am curious if
> there is a similar story on the write side as well (how to generate these
> split files), and specifically, are you targeting feature backfill use
> cases in ML?
>
> On Mon, May 26, 2025 at 6:29 AM Péter Váry 
> wrote:
>
>> Hi Team,
>>
>> In machine learning use-cases, it's common to encounter tables with a
>> very high number of columns - sometimes even in the range of several
>> thousand. I've seen cases with up to 15,000 columns. Storing such wide
>> tables in a single Parquet file is often suboptimal, as Parquet can become
>> a bottleneck, even when only a subset of columns is queried.
>>
>> A common approach to mitigate this is to split the data across multiple
>> Parquet files. With the upcoming File Format API, we could introduce a
>> layer that combines these files into a single iterator, enabling efficient
>> reading of wide and very wide tables.
>>
>> To support this, we would need to revise the metadata specification.
>> Instead of the current `_file` column, we could introduce a `_files` column
>> containing:
>> - `_file_column_ids`: the column IDs present in each file
>> - `_file_path`: the path to the corresponding file
>>
>> Has there been any prior discussion around this idea?
>> Is anyone else interested in exploring this further?
>>
>> Best regards,
>> Peter
>>
>