Re: [DISCUSS] Update supported blob types in puffin spec

2025-03-26 Thread Ajantha Bhat
Hey Denys,
I have gone through the proposal and it seems adding additional stats to
partition stats spec is requested by both Trino and Hive community.

I feel keeping column level stats per partition will definitely help the
query planner to make better decisions.
I saw that document didn't get enough reviews to move forward. I suggest
adding this as a discussion item for the next community sync to take this
forward.

- Ajantha

On Thu, Feb 6, 2025 at 6:10 PM Denys Kuzmenko  wrote:

> Hi All,
>
> After reviewing Iceberg's proposals for stats, checking the code and
> reading the comments, I've created a DRAFT proposal for the partition-level
> column stats. Would be great to continue discussion on that topic and share
> the ideas.
>
>
> https://docs.google.com/document/d/11Rp-irqb4L4Qpdxr6l83bA4IRsfw3AAyR8wokNe1r80/edit?usp=sharing
>
> Thank you,
> Denys
>


Re: [DISCUSS] Update supported blob types in puffin spec

2025-02-06 Thread Denys Kuzmenko
Hi All,

After reviewing Iceberg's proposals for stats, checking the code and reading 
the comments, I've created a DRAFT proposal for the partition-level column 
stats. Would be great to continue discussion on that topic and share the ideas.

https://docs.google.com/document/d/11Rp-irqb4L4Qpdxr6l83bA4IRsfw3AAyR8wokNe1r80/edit?usp=sharing

Thank you,
Denys


Re: [DISCUSS] Update supported blob types in puffin spec

2025-02-04 Thread Denys Kuzmenko
Thanks All for the reactions.

I wanted to emphasize that Hive's StatsObject was shared as an example with the 
suggestion to adapt it for iceberg - `PartitionColumnStats` (i.e. use column 
ids and drop name/type, etc).

As was mentioned by Rayan, column upper/lower bounds, counts, null value and 
NaN counts are tracked at file level. 
For the partition aggregates, we still need some compound object that could 
provide all values at once, precomputed, similar to basic stats - 
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/PartitionStats.java

WDYT?


Re: [DISCUSS] Update supported blob types in puffin spec

2025-02-04 Thread Piotr Findeisen
Thanks  Denys for starting this discussion!

Thanks Ryan, i agree it would be better to have engine agnostic data
structures in the Blobs we maintain in the Iceberg project. At least for
the "standard blob types".
Note however that Puffin format is intentionally open-ended. An application
can put additional sketch/blob types without having to have them
standardized first. This is helpful for experimentation with new stuff.
This also removes the need to standardize blob types that we don't expect
to be universally useful.

Best
Piotr



On Tue, 4 Feb 2025 at 18:49, [email protected]  wrote:

> Thanks for proposing this.
>
> My main concern is that this doesn't seem to be aimed at standardizing
> this metadata, but rather a way to pass existing Hive structures in a
> different way. I commented on the PR, but I'll carry it over here for this
> discussion.
>
> Iceberg already supports tracking column lower bounds, upper bounds,
> column counts, null value counts, and NaN counts per column in table
> metadata. The existing stats use column ID rather than column name so that
> stats are compatible with schema evolution. These are kept at the file
> level and we are adding partition stats files to handle aggregates at the
> partition level. We also use snapshot summaries for similar purposes at the
> table level. The proposed Hive structure doesn't seem like it adds much,
> and, if anything, would be a feature regression because it doesn't use
> column IDs and has extra metadata that would not be accurate (like
> `isPrimaryKey`).
>
> The Puffin format also has an NDV sketch that is already in use (thanks,
> Piotr!) so that seems to duplicate existing functionality as well.
>
> The KLL sketch seems useful to me, but I would separate that from the Hive
> blob.
>
> Ryan
>
> On Tue, Feb 4, 2025 at 8:16 AM Denys Kuzmenko 
> wrote:
>
>> Hi Gabor,
>>
>> Thanks for your feedback!
>>
>> > In that use case however, we'd lose the stats we got previously from HMS
>>
>> For Iceberg tables Hive computes and stores the same stats object in a
>> puffin file, previously persisted to HMS. So, there shouldn't be any
>> changes for Impala other than changing the stats source.
>>
>> > We could gather all the column stats needed by different engines,
>> standardize them into the Iceberg repo
>>
>> That is an option I mentioned above and provided the Hive schema,
>> currently used to store column statistics.
>> I can create a google doc to continue the discussion in that direction.
>>
>> > Aren't partition status just a more granular way of column stats.
>>
>> In Iceberg 1.7 Ajantha added a helper method to compute the basic
>> partition stats for the given snapshot.
>> Collection computeStats(Table table, Snapshot snapshot)
>>
>> Hopefully, we'll get reader and writer support in 1.8:
>> https://github.com/apache/iceberg/pull/11216
>>
>> A similar functionality is needed for column stats.
>> In the case of a partitioned table, we need to create 1 ColumnStatistics
>> object per partition and store it as a separate blob in a puffin file.
>>
>> During the query planning, we'll compute and use aggregated stats based
>> on a pruned partition list.
>>
>> Regards,
>> Denys
>>
>


Re: [DISCUSS] Update supported blob types in puffin spec

2025-02-04 Thread [email protected]
Thanks for proposing this.

My main concern is that this doesn't seem to be aimed at standardizing this
metadata, but rather a way to pass existing Hive structures in a different
way. I commented on the PR, but I'll carry it over here for this discussion.

Iceberg already supports tracking column lower bounds, upper bounds, column
counts, null value counts, and NaN counts per column in table metadata. The
existing stats use column ID rather than column name so that stats are
compatible with schema evolution. These are kept at the file level and we
are adding partition stats files to handle aggregates at the partition
level. We also use snapshot summaries for similar purposes at the table
level. The proposed Hive structure doesn't seem like it adds much, and, if
anything, would be a feature regression because it doesn't use column IDs
and has extra metadata that would not be accurate (like `isPrimaryKey`).

The Puffin format also has an NDV sketch that is already in use (thanks,
Piotr!) so that seems to duplicate existing functionality as well.

The KLL sketch seems useful to me, but I would separate that from the Hive
blob.

Ryan

On Tue, Feb 4, 2025 at 8:16 AM Denys Kuzmenko  wrote:

> Hi Gabor,
>
> Thanks for your feedback!
>
> > In that use case however, we'd lose the stats we got previously from HMS
>
> For Iceberg tables Hive computes and stores the same stats object in a
> puffin file, previously persisted to HMS. So, there shouldn't be any
> changes for Impala other than changing the stats source.
>
> > We could gather all the column stats needed by different engines,
> standardize them into the Iceberg repo
>
> That is an option I mentioned above and provided the Hive schema,
> currently used to store column statistics.
> I can create a google doc to continue the discussion in that direction.
>
> > Aren't partition status just a more granular way of column stats.
>
> In Iceberg 1.7 Ajantha added a helper method to compute the basic
> partition stats for the given snapshot.
> Collection computeStats(Table table, Snapshot snapshot)
>
> Hopefully, we'll get reader and writer support in 1.8:
> https://github.com/apache/iceberg/pull/11216
>
> A similar functionality is needed for column stats.
> In the case of a partitioned table, we need to create 1 ColumnStatistics
> object per partition and store it as a separate blob in a puffin file.
>
> During the query planning, we'll compute and use aggregated stats based on
> a pruned partition list.
>
> Regards,
> Denys
>


Re: [DISCUSS] Update supported blob types in puffin spec

2025-02-04 Thread Denys Kuzmenko
Hi Gabor,

Thanks for your feedback!

> In that use case however, we'd lose the stats we got previously from HMS

For Iceberg tables Hive computes and stores the same stats object in a puffin 
file, previously persisted to HMS. So, there shouldn't be any changes for 
Impala other than changing the stats source. 

> We could gather all the column stats needed by different engines, standardize 
> them into the Iceberg repo

That is an option I mentioned above and provided the Hive schema, currently 
used to store column statistics. 
I can create a google doc to continue the discussion in that direction.

> Aren't partition status just a more granular way of column stats. 

In Iceberg 1.7 Ajantha added a helper method to compute the basic partition 
stats for the given snapshot. 
Collection computeStats(Table table, Snapshot snapshot)

Hopefully, we'll get reader and writer support in 1.8: 
https://github.com/apache/iceberg/pull/11216

A similar functionality is needed for column stats. 
In the case of a partitioned table, we need to create 1 ColumnStatistics object 
per partition and store it as a separate blob in a puffin file.

During the query planning, we'll compute and use aggregated stats based on a 
pruned partition list.

Regards,
Denys


Re: [DISCUSS] Update supported blob types in puffin spec

2025-02-04 Thread Gabor Kaszab
Hi Denys,

Thanks for raising this! I think extending the Puffin spec with additional
columns stats would make sense.

I saw the PR for the Puffin spec at some point late last year and I also
had it in my plans to revive it in a way. My motivation is that Impala
currently uses a lot of stats from HMS, but the community is actively
working on having an HMS-free option for Impala that would use purely REST
catalog. In that use case however, we'd lose the stats we got previously
from HMS and it seemed a reasonable approach to extend the Puffin spec with
the required ones.

There is something I'm hesitant wrt the current proposal in the PR, though:
the mentioned ColumnStatictics class lives within the Hive repo and the
Iceberg lib has no support to read/write it. Basically Iceberg would
standardize those stats inot Puffin but then would have no functionality to
use them.

I have a slightly different approach in mind: I think we could gather all
the column stats needed by different engines, standardize them into the
Iceberg repo similarly to partition stats, add read/write support also
within the Iceberg repo and this way we could add them into the Puffin
spec. What do you think?

Talking about partition stats, just thinking out loud here: Aren't
partition status just a more granular way of column stats. I know initially
partition stats don't have everything that is now in ColStatistics that you
shared with us, but wouldn't it make sense to gradually extend them with
the missing ones. That way aggregating partition stats could give us the
column stats if it's feasible. Again, this is just me thinking out loud
here.

Let me know what you think!
Gabor

On Tue, Feb 4, 2025 at 1:36 PM Denys Kuzmenko  wrote:

> sorry, valid Doc PR link:
> https://github.com/apache/iceberg-docs/pull/269
>


Re: [DISCUSS] Update supported blob types in puffin spec

2025-02-04 Thread Denys Kuzmenko
There is an option to standardize Hive's ColStatistics object schema and use 
Iceberg:

class ColStatistics {

  static class Range {
Number minValue;
Number maxValue;
  }
  
  String colName;
  String colType;
  long countDistinct;
  long numNulls;
  double avgColLen;
  long numTrues;
  long numFalses;
  Range range;
  boolean isPrimaryKey;
  boolean isEstimated;
  boolean isFilteredColumn;
  byte[] bitVectors;
  byte[] histogram;
}


Re: [DISCUSS] Update supported blob types in puffin spec

2025-02-04 Thread Denys Kuzmenko
sorry, valid Doc PR link:
https://github.com/apache/iceberg-docs/pull/269