Re: [DISCUSS] Update supported blob types in puffin spec
Hey Denys, I have gone through the proposal and it seems adding additional stats to partition stats spec is requested by both Trino and Hive community. I feel keeping column level stats per partition will definitely help the query planner to make better decisions. I saw that document didn't get enough reviews to move forward. I suggest adding this as a discussion item for the next community sync to take this forward. - Ajantha On Thu, Feb 6, 2025 at 6:10 PM Denys Kuzmenko wrote: > Hi All, > > After reviewing Iceberg's proposals for stats, checking the code and > reading the comments, I've created a DRAFT proposal for the partition-level > column stats. Would be great to continue discussion on that topic and share > the ideas. > > > https://docs.google.com/document/d/11Rp-irqb4L4Qpdxr6l83bA4IRsfw3AAyR8wokNe1r80/edit?usp=sharing > > Thank you, > Denys >
Re: [DISCUSS] Update supported blob types in puffin spec
Hi All, After reviewing Iceberg's proposals for stats, checking the code and reading the comments, I've created a DRAFT proposal for the partition-level column stats. Would be great to continue discussion on that topic and share the ideas. https://docs.google.com/document/d/11Rp-irqb4L4Qpdxr6l83bA4IRsfw3AAyR8wokNe1r80/edit?usp=sharing Thank you, Denys
Re: [DISCUSS] Update supported blob types in puffin spec
Thanks All for the reactions. I wanted to emphasize that Hive's StatsObject was shared as an example with the suggestion to adapt it for iceberg - `PartitionColumnStats` (i.e. use column ids and drop name/type, etc). As was mentioned by Rayan, column upper/lower bounds, counts, null value and NaN counts are tracked at file level. For the partition aggregates, we still need some compound object that could provide all values at once, precomputed, similar to basic stats - https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/PartitionStats.java WDYT?
Re: [DISCUSS] Update supported blob types in puffin spec
Thanks Denys for starting this discussion! Thanks Ryan, i agree it would be better to have engine agnostic data structures in the Blobs we maintain in the Iceberg project. At least for the "standard blob types". Note however that Puffin format is intentionally open-ended. An application can put additional sketch/blob types without having to have them standardized first. This is helpful for experimentation with new stuff. This also removes the need to standardize blob types that we don't expect to be universally useful. Best Piotr On Tue, 4 Feb 2025 at 18:49, [email protected] wrote: > Thanks for proposing this. > > My main concern is that this doesn't seem to be aimed at standardizing > this metadata, but rather a way to pass existing Hive structures in a > different way. I commented on the PR, but I'll carry it over here for this > discussion. > > Iceberg already supports tracking column lower bounds, upper bounds, > column counts, null value counts, and NaN counts per column in table > metadata. The existing stats use column ID rather than column name so that > stats are compatible with schema evolution. These are kept at the file > level and we are adding partition stats files to handle aggregates at the > partition level. We also use snapshot summaries for similar purposes at the > table level. The proposed Hive structure doesn't seem like it adds much, > and, if anything, would be a feature regression because it doesn't use > column IDs and has extra metadata that would not be accurate (like > `isPrimaryKey`). > > The Puffin format also has an NDV sketch that is already in use (thanks, > Piotr!) so that seems to duplicate existing functionality as well. > > The KLL sketch seems useful to me, but I would separate that from the Hive > blob. > > Ryan > > On Tue, Feb 4, 2025 at 8:16 AM Denys Kuzmenko > wrote: > >> Hi Gabor, >> >> Thanks for your feedback! >> >> > In that use case however, we'd lose the stats we got previously from HMS >> >> For Iceberg tables Hive computes and stores the same stats object in a >> puffin file, previously persisted to HMS. So, there shouldn't be any >> changes for Impala other than changing the stats source. >> >> > We could gather all the column stats needed by different engines, >> standardize them into the Iceberg repo >> >> That is an option I mentioned above and provided the Hive schema, >> currently used to store column statistics. >> I can create a google doc to continue the discussion in that direction. >> >> > Aren't partition status just a more granular way of column stats. >> >> In Iceberg 1.7 Ajantha added a helper method to compute the basic >> partition stats for the given snapshot. >> Collection computeStats(Table table, Snapshot snapshot) >> >> Hopefully, we'll get reader and writer support in 1.8: >> https://github.com/apache/iceberg/pull/11216 >> >> A similar functionality is needed for column stats. >> In the case of a partitioned table, we need to create 1 ColumnStatistics >> object per partition and store it as a separate blob in a puffin file. >> >> During the query planning, we'll compute and use aggregated stats based >> on a pruned partition list. >> >> Regards, >> Denys >> >
Re: [DISCUSS] Update supported blob types in puffin spec
Thanks for proposing this. My main concern is that this doesn't seem to be aimed at standardizing this metadata, but rather a way to pass existing Hive structures in a different way. I commented on the PR, but I'll carry it over here for this discussion. Iceberg already supports tracking column lower bounds, upper bounds, column counts, null value counts, and NaN counts per column in table metadata. The existing stats use column ID rather than column name so that stats are compatible with schema evolution. These are kept at the file level and we are adding partition stats files to handle aggregates at the partition level. We also use snapshot summaries for similar purposes at the table level. The proposed Hive structure doesn't seem like it adds much, and, if anything, would be a feature regression because it doesn't use column IDs and has extra metadata that would not be accurate (like `isPrimaryKey`). The Puffin format also has an NDV sketch that is already in use (thanks, Piotr!) so that seems to duplicate existing functionality as well. The KLL sketch seems useful to me, but I would separate that from the Hive blob. Ryan On Tue, Feb 4, 2025 at 8:16 AM Denys Kuzmenko wrote: > Hi Gabor, > > Thanks for your feedback! > > > In that use case however, we'd lose the stats we got previously from HMS > > For Iceberg tables Hive computes and stores the same stats object in a > puffin file, previously persisted to HMS. So, there shouldn't be any > changes for Impala other than changing the stats source. > > > We could gather all the column stats needed by different engines, > standardize them into the Iceberg repo > > That is an option I mentioned above and provided the Hive schema, > currently used to store column statistics. > I can create a google doc to continue the discussion in that direction. > > > Aren't partition status just a more granular way of column stats. > > In Iceberg 1.7 Ajantha added a helper method to compute the basic > partition stats for the given snapshot. > Collection computeStats(Table table, Snapshot snapshot) > > Hopefully, we'll get reader and writer support in 1.8: > https://github.com/apache/iceberg/pull/11216 > > A similar functionality is needed for column stats. > In the case of a partitioned table, we need to create 1 ColumnStatistics > object per partition and store it as a separate blob in a puffin file. > > During the query planning, we'll compute and use aggregated stats based on > a pruned partition list. > > Regards, > Denys >
Re: [DISCUSS] Update supported blob types in puffin spec
Hi Gabor, Thanks for your feedback! > In that use case however, we'd lose the stats we got previously from HMS For Iceberg tables Hive computes and stores the same stats object in a puffin file, previously persisted to HMS. So, there shouldn't be any changes for Impala other than changing the stats source. > We could gather all the column stats needed by different engines, standardize > them into the Iceberg repo That is an option I mentioned above and provided the Hive schema, currently used to store column statistics. I can create a google doc to continue the discussion in that direction. > Aren't partition status just a more granular way of column stats. In Iceberg 1.7 Ajantha added a helper method to compute the basic partition stats for the given snapshot. Collection computeStats(Table table, Snapshot snapshot) Hopefully, we'll get reader and writer support in 1.8: https://github.com/apache/iceberg/pull/11216 A similar functionality is needed for column stats. In the case of a partitioned table, we need to create 1 ColumnStatistics object per partition and store it as a separate blob in a puffin file. During the query planning, we'll compute and use aggregated stats based on a pruned partition list. Regards, Denys
Re: [DISCUSS] Update supported blob types in puffin spec
Hi Denys, Thanks for raising this! I think extending the Puffin spec with additional columns stats would make sense. I saw the PR for the Puffin spec at some point late last year and I also had it in my plans to revive it in a way. My motivation is that Impala currently uses a lot of stats from HMS, but the community is actively working on having an HMS-free option for Impala that would use purely REST catalog. In that use case however, we'd lose the stats we got previously from HMS and it seemed a reasonable approach to extend the Puffin spec with the required ones. There is something I'm hesitant wrt the current proposal in the PR, though: the mentioned ColumnStatictics class lives within the Hive repo and the Iceberg lib has no support to read/write it. Basically Iceberg would standardize those stats inot Puffin but then would have no functionality to use them. I have a slightly different approach in mind: I think we could gather all the column stats needed by different engines, standardize them into the Iceberg repo similarly to partition stats, add read/write support also within the Iceberg repo and this way we could add them into the Puffin spec. What do you think? Talking about partition stats, just thinking out loud here: Aren't partition status just a more granular way of column stats. I know initially partition stats don't have everything that is now in ColStatistics that you shared with us, but wouldn't it make sense to gradually extend them with the missing ones. That way aggregating partition stats could give us the column stats if it's feasible. Again, this is just me thinking out loud here. Let me know what you think! Gabor On Tue, Feb 4, 2025 at 1:36 PM Denys Kuzmenko wrote: > sorry, valid Doc PR link: > https://github.com/apache/iceberg-docs/pull/269 >
Re: [DISCUSS] Update supported blob types in puffin spec
There is an option to standardize Hive's ColStatistics object schema and use
Iceberg:
class ColStatistics {
static class Range {
Number minValue;
Number maxValue;
}
String colName;
String colType;
long countDistinct;
long numNulls;
double avgColLen;
long numTrues;
long numFalses;
Range range;
boolean isPrimaryKey;
boolean isEstimated;
boolean isFilteredColumn;
byte[] bitVectors;
byte[] histogram;
}
Re: [DISCUSS] Update supported blob types in puffin spec
sorry, valid Doc PR link: https://github.com/apache/iceberg-docs/pull/269
