Re: [DISCUSS] Add metadata stats/metrics management on the REST Spec

2025-02-19 Thread Péter Váry
@Fokko: your point is absolutely valid. We don't want to burden the active
catalog instance with returning such a big data set. Otherwise the main
responsibility of the catalog could suffer.

OTOH there is some info which exists only on the catalog side which is not
available elsewhere. This is especially true for catalogs which are doing
query planning.
For example, I would love to see the query statistics for a table, and how
often specific files are accessed/returned in a plan. This would help
compaction scheduling/planning highlight the hot spots where applying
compaction could really make a difference.
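
As a minimal sketch of the idea above, assuming the catalog keeps a record of which files each scan plan returned: the `hot_files` helper and the sample plans are hypothetical, not part of any Iceberg or Polaris API.

```python
from collections import Counter


def hot_files(plans, top_n=3):
    """Rank data files by the number of scan plans that returned them."""
    counts = Counter()
    for plan in plans:
        # Count each file at most once per plan, even if it is listed twice.
        counts.update(set(plan))
    return counts.most_common(top_n)


# Hypothetical planning history: each inner list is the set of files in one plan.
plans = [
    ["p=1/a.parquet", "p=1/b.parquet"],
    ["p=1/a.parquet", "p=2/c.parquet"],
    ["p=1/a.parquet", "p=1/b.parquet"],
]
ranking = hot_files(plans, top_n=2)
print(ranking)
```

A compaction scheduler could use such a ranking to prioritize the partitions that hold the most frequently planned files.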

Thanks,
Peter

Fokko Driesprong wrote (on Wed, Feb 19, 2025, 14:20):

> Hey JB,
>
> Thanks for the additional context. My main question is, why wouldn't the
> TMS directly query the metadata? Since the TMS should have access to the
> data (otherwise it cannot compact it). This would be much faster and more
> efficient. I share Daniel's concern that these requests could easily run
> into the gigabytes (assuming JSON?).
>
> Kind regards,
> Fokko
>
>
>
> On Wed, Feb 19, 2025 at 12:12, Jean-Baptiste Onofré wrote:
>
>> Hi folks,
>>
>> I realized that my first email on this thread needs context to be
>> better understood :)
>>
>> In Apache Polaris TMS (Table Maintenance Service), we "scoped" where
>> Polaris can help to trigger table maintenance jobs:
>> 1. Is table maintenance enabled (in Polaris)?
>> 2. Policies exposed by Polaris (e.g. data retention policy, compaction
>> policy, ...)
>> 3. Polaris events (e.g. tables/views/namespaces updates)
>> 4. Table metadata (via Iceberg REST)
>> 4.1. Table schema/partition spec/properties, etc
>> 4.2. Iceberg table stats and metrics. Only the stats and metrics
>> defined in the Iceberg table spec (e.g., partition stats, snapshot
>> summaries) are available at this moment.
>>
>> Specifically about 4.2, the Table Maintenance Service would need more
>> than that.
>>
>> My proposal about adding metrics endpoint to the REST spec is to
>> expose extra metrics for TMS and engine. I'm thinking of:
>> - metrics helping the compaction decisions and snapshots GC
>> - "extra" metrics which are very helpful for TMS (e.g. file size
>> distribution without partitions)
>>
>> I would like to propose a "two steps" approach:
>> 1. Add a "wild" metrics endpoint gathering all metrics for TMS/engines
>> but the exposed metrics are decided by the Catalog impl
>> 2. Enforce metrics list in the spec with a clear schema and
>> standardized metrics names.
>>
>> I will move forward with a proposal draft about that if there is no
>> objection.
>>
>> Thoughts ?
>>
>> Regards
>> JB
>>
>> On Tue, Jan 21, 2025 at 3:40 PM Jean-Baptiste Onofré 
>> wrote:
>> >
>> > Hi folks,
>> >
>> > I know we don't want to "expose" the whole metadata tables in the REST
>> > api, but I would like to discuss adding metadata stats and metrics
>> > management.
>> > We are discussing this as part of the Apache Polaris TMS proposal.
>> >
>> > The purpose is:
>> > 1. To add interfaces to manage metadata stats and metrics (partition
>> > stats, snapshot summaries, relay Parquet stats exposed via REST, ...)
>> > 2. The catalog implementation can deal with table properties, but can
>> > also extend to "extra" stats and metrics if needed
>> > 3. Query planners can use these metadata stats and metrics to perform
>> > better query plans. It could also be used by the server side planning
>> > to provide "pre-plan check"
>> >
>> > Before going to a proposal document, I would like to get first
>> > feedback from the community (if it makes sense or not).
>> >
>> > Thoughts ?
>> >
>> > Thanks !
>> > Regards
>> > JB
>>
>


Re: [DISCUSS] Add metadata stats/metrics management on the REST Spec

2025-02-19 Thread Jean-Baptiste Onofré
Hi Fokko

That's an approach I considered, but the problem is that the TMS/query
engine goes through the REST API. So, if the metadata.json exposed by REST
doesn't contain the metrics, how can I get them?
If your proposal is to extend the metadata.json with extra metrics,
that could be an option.
My proposal is rather to have an extra endpoint to get metrics that are
"unrelated" to a table or that extend the metadata.json, along with a way
to retrieve only the metrics needed by the TMS.

Regards
JB

On Wed, Feb 19, 2025 at 2:20 PM Fokko Driesprong wrote:
>
> Hey JB,
>
> Thanks for the additional context. My main question is, why wouldn't the TMS 
> directly query the metadata? Since the TMS should have access to the data 
> (otherwise it cannot compact it). This would be much faster and more 
> efficient. I share Daniel's concern that these requests could easily run into 
> the gigabytes (assuming JSON?).
>
> Kind regards,
> Fokko
>
>
>
> On Wed, Feb 19, 2025 at 12:12, Jean-Baptiste Onofré wrote:
>>
>> Hi folks,
>>
>> I realized that my first email on this thread needs context to be
>> better understood :)
>>
>> In Apache Polaris TMS (Table Maintenance Service), we "scoped" where
>> Polaris can help to trigger table maintenance jobs:
>> 1. Is table maintenance enabled (in Polaris)?
>> 2. Policies exposed by Polaris (e.g. data retention policy, compaction
>> policy, ...)
>> 3. Polaris events (e.g. tables/views/namespaces updates)
>> 4. Table metadata (via Iceberg REST)
>> 4.1. Table schema/partition spec/properties, etc
>> 4.2. Iceberg table stats and metrics. Only the stats and metrics
>> defined in the Iceberg table spec (e.g., partition stats, snapshot
>> summaries) are available at this moment.
>>
>> Specifically about 4.2, the Table Maintenance Service would need more than 
>> that.
>>
>> My proposal about adding metrics endpoint to the REST spec is to
>> expose extra metrics for TMS and engine. I'm thinking of:
>> - metrics helping the compaction decisions and snapshots GC
>> - "extra" metrics which are very helpful for TMS (e.g. file size
>> distribution without partitions)
>>
>> I would like to propose a "two steps" approach:
>> 1. Add a "wild" metrics endpoint gathering all metrics for TMS/engines
>> but the exposed metrics are decided by the Catalog impl
>> 2. Enforce metrics list in the spec with a clear schema and
>> standardized metrics names.
>>
>> I will move forward with a proposal draft about that if there is no 
>> objection.
>>
>> Thoughts ?
>>
>> Regards
>> JB
>>
>> On Tue, Jan 21, 2025 at 3:40 PM Jean-Baptiste Onofré  
>> wrote:
>> >
>> > Hi folks,
>> >
>> > I know we don't want to "expose" the whole metadata tables in the REST
>> > api, but I would like to discuss adding metadata stats and metrics
>> > management.
>> > We are discussing this as part of the Apache Polaris TMS proposal.
>> >
>> > The purpose is:
>> > 1. To add interfaces to manage metadata stats and metrics (partition
>> > stats, snapshot summaries, relay Parquet stats exposed via REST, ...)
>> > 2. The catalog implementation can deal with table properties, but can
>> > also extend to "extra" stats and metrics if needed
>> > 3. Query planners can use these metadata stats and metrics to perform
>> > better query plans. It could also be used by the server side planning
>> > to provide "pre-plan check"
>> >
>> > Before going to a proposal document, I would like to get first
>> > feedback from the community (if it makes sense or not).
>> >
>> > Thoughts ?
>> >
>> > Thanks !
>> > Regards
>> > JB


Re: [DISCUSS] Add metadata stats/metrics management on the REST Spec

2025-02-19 Thread Fokko Driesprong
Hey JB,

Thanks for the additional context. My main question is: why wouldn't the
TMS directly query the metadata? The TMS should already have access to the
data (otherwise it cannot compact it). This would be much faster and more
efficient. I share Daniel's concern that these requests could easily run
into the gigabytes (assuming JSON?).
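
To make the size concern concrete, here is a rough back-of-the-envelope sketch. The entry layout, field names, and the file count of 10 million are all assumptions chosen for illustration; they do not come from the REST spec or from any measured table.

```python
import json

# Hypothetical per-file stats entry. The field names and values are
# assumptions for illustration, not a schema defined by the REST spec.
sample_entry = {
    "file_path": "s3://bucket/warehouse/db/table/data/part-00000-0123456789abcdef.parquet",
    "record_count": 1000000,
    "file_size_in_bytes": 134217728,
    "lower_bounds": {"1": "2025-01-01", "2": "0"},
    "upper_bounds": {"1": "2025-01-31", "2": "999999"},
}


def estimate_payload_bytes(entry, num_files):
    """Estimate the JSON payload size if every file had an entry like `entry`."""
    entry_bytes = len(json.dumps(entry).encode("utf-8"))
    return entry_bytes * num_files


entry_bytes = len(json.dumps(sample_entry).encode("utf-8"))
# Assumed table size: 10 million data files.
total_bytes = estimate_payload_bytes(sample_entry, 10_000_000)
print(f"{entry_bytes} bytes per entry, about {total_bytes / 1e9:.1f} GB in total")
```

Under these assumptions, the uncompressed response for a single table already exceeds a gigabyte, which is the scale being discussed here.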

Kind regards,
Fokko



On Wed, Feb 19, 2025 at 12:12, Jean-Baptiste Onofré wrote:

> Hi folks,
>
> I realized that my first email on this thread needs context to be
> better understood :)
>
> In Apache Polaris TMS (Table Maintenance Service), we "scoped" where
> Polaris can help to trigger table maintenance jobs:
> 1. Is table maintenance enabled (in Polaris)?
> 2. Policies exposed by Polaris (e.g. data retention policy, compaction
> policy, ...)
> 3. Polaris events (e.g. tables/views/namespaces updates)
> 4. Table metadata (via Iceberg REST)
> 4.1. Table schema/partition spec/properties, etc
> 4.2. Iceberg table stats and metrics. Only the stats and metrics
> defined in the Iceberg table spec (e.g., partition stats, snapshot
> summaries) are available at this moment.
>
> Specifically about 4.2, the Table Maintenance Service would need more than
> that.
>
> My proposal about adding metrics endpoint to the REST spec is to
> expose extra metrics for TMS and engine. I'm thinking of:
> - metrics helping the compaction decisions and snapshots GC
> - "extra" metrics which are very helpful for TMS (e.g. file size
> distribution without partitions)
>
> I would like to propose a "two steps" approach:
> 1. Add a "wild" metrics endpoint gathering all metrics for TMS/engines
> but the exposed metrics are decided by the Catalog impl
> 2. Enforce metrics list in the spec with a clear schema and
> standardized metrics names.
>
> I will move forward with a proposal draft about that if there is no
> objection.
>
> Thoughts ?
>
> Regards
> JB
>
> On Tue, Jan 21, 2025 at 3:40 PM Jean-Baptiste Onofré 
> wrote:
> >
> > Hi folks,
> >
> > I know we don't want to "expose" the whole metadata tables in the REST
> > api, but I would like to discuss adding metadata stats and metrics
> > management.
> > We are discussing this as part of the Apache Polaris TMS proposal.
> >
> > The purpose is:
> > 1. To add interfaces to manage metadata stats and metrics (partition
> > stats, snapshot summaries, relay Parquet stats exposed via REST, ...)
> > 2. The catalog implementation can deal with table properties, but can
> > also extend to "extra" stats and metrics if needed
> > 3. Query planners can use these metadata stats and metrics to perform
> > better query plans. It could also be used by the server side planning
> > to provide "pre-plan check"
> >
> > Before going to a proposal document, I would like to get first
> > feedback from the community (if it makes sense or not).
> >
> > Thoughts ?
> >
> > Thanks !
> > Regards
> > JB
>


Re: [DISCUSS] Add metadata stats/metrics management on the REST Spec

2025-02-19 Thread Jean-Baptiste Onofré
Hi folks,

I realized that my first email on this thread needs context to be
better understood :)

In Apache Polaris TMS (Table Maintenance Service), we "scoped" where
Polaris can help to trigger table maintenance jobs:
1. Is table maintenance enabled (in Polaris)?
2. Policies exposed by Polaris (e.g. data retention policy, compaction
policy, ...)
3. Polaris events (e.g. tables/views/namespaces updates)
4. Table metadata (via Iceberg REST)
4.1. Table schema/partition spec/properties, etc
4.2. Iceberg table stats and metrics. Only the stats and metrics
defined in the Iceberg table spec (e.g., partition stats, snapshot
summaries) are available at this moment.

Specifically about 4.2, the Table Maintenance Service would need more than that.

My proposal to add a metrics endpoint to the REST spec is meant to
expose extra metrics for the TMS and engines. I'm thinking of:
- metrics that help with compaction decisions and snapshot GC
- "extra" metrics which are very helpful for TMS (e.g. file size
distribution without partitions)
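
As an example of such an "extra" metric, a file size distribution could be computed roughly as follows. This is only a sketch: the bucket boundaries, labels, and function name are hypothetical and would be up to the catalog implementation.

```python
from bisect import bisect_right

MIB = 1024 * 1024

# Hypothetical bucket upper bounds (exclusive) and matching labels.
BOUNDS = [8 * MIB, 64 * MIB, 256 * MIB]
LABELS = ["<8MiB", "8-64MiB", "64-256MiB", ">=256MiB"]


def file_size_distribution(sizes):
    """Count data files per size bucket, ignoring partition boundaries."""
    hist = {label: 0 for label in LABELS}
    for size in sizes:
        # bisect_right returns the index of the first bound greater than size.
        hist[LABELS[bisect_right(BOUNDS, size)]] += 1
    return hist


sizes = [1 * MIB, 4 * MIB, 8 * MIB, 100 * MIB, 300 * MIB]
distribution = file_size_distribution(sizes)
print(distribution)
```

A large share of files in the smallest bucket would be one possible signal for the TMS to schedule compaction.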

I would like to propose a two-step approach:
1. Add a "wild" metrics endpoint gathering all metrics for TMS/engines,
where the exposed metrics are decided by the Catalog impl
2. Enforce a metrics list in the spec with a clear schema and
standardized metric names.
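
The difference between the two steps can be sketched as follows. The metric names and the `validate_metrics` helper are hypothetical placeholders, since the actual list and schema are exactly what the proposal would still have to define.

```python
# Hypothetical standardized metric names with their expected value types.
# The real list would be defined by the spec in step 2.
STANDARD_METRICS = {
    "example-data-file-count": int,
    "example-total-size-bytes": int,
    "example-avg-file-size-bytes": float,
}


def validate_metrics(payload, strict):
    """Return a list of validation errors for a metrics payload.

    strict=False models step 1: the catalog decides what it exposes.
    strict=True models step 2: only standardized names and types pass.
    """
    errors = []
    if not strict:
        return errors
    for name, value in payload.items():
        expected = STANDARD_METRICS.get(name)
        if expected is None:
            errors.append(f"unknown metric: {name}")
        elif not isinstance(value, expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    return errors


payload = {"example-data-file-count": 10, "vendor-hotness": 0.7}
step1_errors = validate_metrics(payload, strict=False)
step2_errors = validate_metrics(payload, strict=True)
print(step1_errors, step2_errors)
```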

I will move forward with a proposal draft about that if there is no objection.

Thoughts ?

Regards
JB

On Tue, Jan 21, 2025 at 3:40 PM Jean-Baptiste Onofré wrote:
>
> Hi folks,
>
> I know we don't want to "expose" the whole metadata tables in the REST
> api, but I would like to discuss adding metadata stats and metrics
> management.
> We are discussing this as part of the Apache Polaris TMS proposal.
>
> The purpose is:
> 1. To add interfaces to manage metadata stats and metrics (partition
> stats, snapshot summaries, relay Parquet stats exposed via REST, ...)
> 2. The catalog implementation can deal with table properties, but can
> also extend to "extra" stats and metrics if needed
> 3. Query planners can use these metadata stats and metrics to perform
> better query plans. It could also be used by the server side planning
> to provide "pre-plan check"
>
> Before going to a proposal document, I would like to get first
> feedback from the community (if it makes sense or not).
>
> Thoughts ?
>
> Thanks !
> Regards
> JB


Re: [DISCUSS] Add metadata stats/metrics management on the REST Spec

2025-01-21 Thread Jean-Baptiste Onofré
Hi Dan,

The goal is to expose stats & metrics from the metadata
(relaying partition stats, etc.), and to give a REST Catalog
implementation the option to extend them with additional metrics/stats.
The purpose of the REST Catalog interface is to expose these to the
query planner, so that it can use these stats/metrics to do better
planning.

So, it's a raw idea for now. I would love to brainstorm with the community.

Thanks !
Regards
JB

On Tue, Jan 21, 2025 at 6:42 PM Daniel Weeks wrote:
>
> Hey JB,
>
> I'm not sure I fully understand what the proposal is, but I also realise it's 
> probably not completely fleshed out yet.
>
> When you say "manage metadata", the first concern that I have is whether you 
> mean to just query/get the info or to also modify it.  Table metadata is 
> immutable and requires a commit to change, so I would assume you largely are 
> interested in just getting access to the data.  Currently, snapshot summaries 
> are included with table load and I'm not clear on how we would expose 
> parquet/file stats since file level stats could be huge and largely depend on 
> the filters/projections to prune.  I think partition stats is probably 
> something to consider, but I'm not sure how much faster that would be and the 
> size of partitions could really complicate the protocol.
>
> I think server-side pre/plan apis would be able to address a lot of these 
> types of situations, but I'm just concerned that we would end up rebuilding 
> that same functionality to address all of the issues with exposing this 
> information more directly.
>
> I'm interested if there are more concrete proposals, but I'm a little 
> hesitant because of these challenges.
>
> -Dan
>
> On Tue, Jan 21, 2025 at 6:40 AM Jean-Baptiste Onofré  
> wrote:
>>
>> Hi folks,
>>
>> I know we don't want to "expose" the whole metadata tables in the REST
>> api, but I would like to discuss adding metadata stats and metrics
>> management.
>> We are discussing this as part of the Apache Polaris TMS proposal.
>>
>> The purpose is:
>> 1. To add interfaces to manage metadata stats and metrics (partition
>> stats, snapshot summaries, relay Parquet stats exposed via REST, ...)
>> 2. The catalog implementation can deal with table properties, but can
>> also extend to "extra" stats and metrics if needed
>> 3. Query planners can use these metadata stats and metrics to perform
>> better query plans. It could also be used by the server side planning
>> to provide "pre-plan check"
>>
>> Before going to a proposal document, I would like to get first
>> feedback from the community (if it makes sense or not).
>>
>> Thoughts ?
>>
>> Thanks !
>> Regards
>> JB


Re: [DISCUSS] Add metadata stats/metrics management on the REST Spec

2025-01-21 Thread Daniel Weeks
Hey JB,

I'm not sure I fully understand what the proposal is, but I also realise
it's probably not completely fleshed out yet.

When you say "manage metadata", the first concern that I have is whether
you mean to just query/get the info or to also modify it.  Table metadata
is immutable and requires a commit to change, so I would assume you largely
are interested in just getting access to the data.  Currently, snapshot
summaries are included with table load and I'm not clear on how we would
expose parquet/file stats since file level stats could be huge and
largely depend on the filters/projections to prune.  I think partition
stats is probably something to consider, but I'm not sure how much faster
that would be and the size of partitions could really complicate the
protocol.
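
To illustrate why file-level stats depend on the filter, here is a minimal sketch of min/max pruning. The file entries, the `bounds` layout, and the `prune_files` helper are hypothetical and only model the general technique, not the Iceberg implementation.

```python
def prune_files(files, column, low, high):
    """Keep files whose [min, max] range for `column` overlaps [low, high]."""
    kept = []
    for f in files:
        col_min, col_max = f["bounds"][column]
        # A file can be skipped only if its range lies entirely outside the filter.
        if col_max >= low and col_min <= high:
            kept.append(f["path"])
    return kept


# Hypothetical file-level stats: per-column (min, max) bounds.
files = [
    {"path": "a.parquet", "bounds": {"id": (0, 99)}},
    {"path": "b.parquet", "bounds": {"id": (100, 199)}},
    {"path": "c.parquet", "bounds": {"id": (200, 299)}},
]
selected = prune_files(files, "id", 150, 250)
print(selected)
```

Without a filter, every entry would have to be returned, which is where the size concern comes from.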

I think server-side pre/plan apis would be able to address a lot of these
types of situations, but I'm just concerned that we would end up rebuilding
that same functionality to address all of the issues with exposing this
information more directly.

I'm interested if there are more concrete proposals, but I'm a little
hesitant because of these challenges.

-Dan

On Tue, Jan 21, 2025 at 6:40 AM Jean-Baptiste Onofré 
wrote:

> Hi folks,
>
> I know we don't want to "expose" the whole metadata tables in the REST
> api, but I would like to discuss adding metadata stats and metrics
> management.
> We are discussing this as part of the Apache Polaris TMS proposal.
>
> The purpose is:
> 1. To add interfaces to manage metadata stats and metrics (partition
> stats, snapshot summaries, relay Parquet stats exposed via REST, ...)
> 2. The catalog implementation can deal with table properties, but can
> also extend to "extra" stats and metrics if needed
> 3. Query planners can use these metadata stats and metrics to perform
> better query plans. It could also be used by the server side planning
> to provide "pre-plan check"
>
> Before going to a proposal document, I would like to get first
> feedback from the community (if it makes sense or not).
>
> Thoughts ?
>
> Thanks !
> Regards
> JB
>