Re: [DISCUSS] Add metadata stats/metrics management on the REST Spec
@Fokko: your point is absolutely valid. We don't want to burden the active catalog instance with returning such a big data set; otherwise the main responsibility of the catalog could suffer.

On the other hand, some information exists only on the catalog side and is not available elsewhere. This is especially true for catalogs that do query planning. For example, I would love to see the query statistics for a table, and how often specific files are accessed/returned in a plan. This would help compaction scheduling/planning highlight the hot spots where applying compaction could really make a difference.

Thanks,
Peter

On Wed, Feb 19, 2025 at 14:20, Fokko Driesprong wrote:
> Hey JB,
>
> Thanks for the additional context. My main question is, why wouldn't the
> TMS directly query the metadata, since the TMS should have access to the
> data (otherwise it cannot compact it)? This would be much faster and more
> efficient. I share Daniel's concern that these requests could easily run
> into the gigabytes (assuming JSON?).
>
> Kind regards,
> Fokko
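To make Peter's point about file access frequency concrete, here is a minimal sketch of how a catalog that does query planning might rank "hot" files. The `plans` structure (a list of plans, each a list of data file paths) and the paths themselves are assumptions made for illustration; nothing like this is defined in the thread or in the REST spec.

```python
from collections import Counter


def hot_files(plans, top_n=3):
    """Count in how many plans each data file appears, most frequent first.

    `plans` is a hypothetical structure: one list of data file paths per
    query plan produced by the catalog.
    """
    # set(plan) counts a file at most once per plan.
    counts = Counter(path for plan in plans for path in set(plan))
    return counts.most_common(top_n)


# Illustrative data only.
plans = [
    ["s3://bucket/t/data/a.parquet", "s3://bucket/t/data/b.parquet"],
    ["s3://bucket/t/data/a.parquet"],
    ["s3://bucket/t/data/a.parquet", "s3://bucket/t/data/c.parquet"],
]

print(hot_files(plans, top_n=1))
```

A maintenance service could use such a ranking to prioritize compaction where files are read most often, rather than treating all partitions equally.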
Re: [DISCUSS] Add metadata stats/metrics management on the REST Spec
Hi Fokko,

That's an approach I considered, but the problem is that the TMS/query engine goes via REST. So, if the metadata.json exposed by REST doesn't contain the metrics, how can I get them?

If your proposal is to extend the metadata.json with extra metrics, that could be an option. My proposal is rather to have an extra endpoint to get metrics that are "unrelated" to a table or that extend the metadata.json, along with a way to retrieve only the metrics needed by the TMS.

Regards
JB

On Wed, Feb 19, 2025 at 2:20 PM Fokko Driesprong wrote:
> Hey JB,
>
> Thanks for the additional context. My main question is, why wouldn't the
> TMS directly query the metadata, since the TMS should have access to the
> data (otherwise it cannot compact it)? This would be much faster and more
> efficient. I share Daniel's concern that these requests could easily run
> into the gigabytes (assuming JSON?).
>
> Kind regards,
> Fokko
Re: [DISCUSS] Add metadata stats/metrics management on the REST Spec
Hey JB,

Thanks for the additional context. My main question is, why wouldn't the TMS directly query the metadata, since the TMS should have access to the data (otherwise it cannot compact it)? This would be much faster and more efficient. I share Daniel's concern that these requests could easily run into the gigabytes (assuming JSON?).

Kind regards,
Fokko

On Wed, Feb 19, 2025 at 12:12, Jean-Baptiste Onofré wrote:
> Hi folks,
>
> I realized that my first email on this thread needs context to be
> better understood :)
>
> In Apache Polaris TMS (Table Maintenance Service), we "scoped" where
> Polaris can help to trigger table maintenance jobs:
> 1. Is table maintenance enabled (in Polaris)?
> 2. Policies exposed by Polaris (e.g. data retention policy, compaction
> policy, ...)
> 3. Polaris events (e.g. tables/views/namespaces updates)
> 4. Table metadata (via Iceberg REST)
> 4.1. Table schema/partition spec/properties, etc.
> 4.2. Iceberg table stats and metrics. Only the stats and metrics
> defined in the Iceberg table spec (e.g., partition stats, snapshot
> summaries) are available at this moment.
>
> Specifically about 4.2, the Table Maintenance Service would need more than
> that.
>
> My proposal to add a metrics endpoint to the REST spec is meant to
> expose extra metrics for the TMS and engines. I'm thinking of:
> - metrics helping the compaction decisions and snapshot GC
> - "extra" metrics which are very helpful for the TMS (e.g. file size
> distribution without partitions)
>
> I would like to propose a "two steps" approach:
> 1. Add a "wild" metrics endpoint gathering all metrics for TMS/engines,
> where the exposed metrics are decided by the catalog implementation
> 2. Enforce a metrics list in the spec with a clear schema and
> standardized metric names.
>
> I will move forward with a proposal draft about that if there is no
> objection.
>
> Thoughts?
>
> Regards
> JB
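Fokko's concern that responses "could easily run into the gigabytes (assuming JSON?)" can be checked with a back-of-the-envelope estimate. The sketch below is only that: the per-file entry, its field names, the 20-column table, and the file counts are all assumptions chosen for illustration, not a format proposed in the thread.

```python
import json

# Hypothetical per-file stats entry for a table with 20 columns.
# Field names and values are illustrative only.
sample_entry = {
    "file_path": "s3://bucket/warehouse/db/table/data/part-00000.parquet",
    "record_count": 1_000_000,
    "file_size_in_bytes": 134_217_728,
    "column_sizes": {str(i): 1_000_000 for i in range(1, 21)},
    "lower_bounds": {str(i): "0000000000" for i in range(1, 21)},
    "upper_bounds": {str(i): "9999999999" for i in range(1, 21)},
}


def estimated_response_bytes(num_files, entry=sample_entry):
    """Rough size of a JSON response carrying one stats entry per data file."""
    return num_files * len(json.dumps(entry))


for num_files in (10_000, 1_000_000, 10_000_000):
    gib = estimated_response_bytes(num_files) / 2**30
    print(f"{num_files:>10,} files -> {gib:.2f} GiB")
```

With an entry of this assumed shape, a table with a million data files already produces a response on the order of a gigabyte, before any filtering or projection is applied.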
Re: [DISCUSS] Add metadata stats/metrics management on the REST Spec
Hi folks,

I realized that my first email on this thread needs context to be better understood :)

In Apache Polaris TMS (Table Maintenance Service), we "scoped" where Polaris can help to trigger table maintenance jobs:
1. Is table maintenance enabled (in Polaris)?
2. Policies exposed by Polaris (e.g. data retention policy, compaction policy, ...)
3. Polaris events (e.g. tables/views/namespaces updates)
4. Table metadata (via Iceberg REST)
  4.1. Table schema/partition spec/properties, etc.
  4.2. Iceberg table stats and metrics. Only the stats and metrics defined in the Iceberg table spec (e.g., partition stats, snapshot summaries) are available at this moment.

Specifically about 4.2, the Table Maintenance Service would need more than that.

My proposal to add a metrics endpoint to the REST spec is meant to expose extra metrics for the TMS and engines. I'm thinking of:
- metrics helping the compaction decisions and snapshot GC
- "extra" metrics which are very helpful for the TMS (e.g. file size distribution without partitions)

I would like to propose a "two steps" approach:
1. Add a "wild" metrics endpoint gathering all metrics for TMS/engines, where the exposed metrics are decided by the catalog implementation
2. Enforce a metrics list in the spec with a clear schema and standardized metric names.

I will move forward with a proposal draft about that if there is no objection.

Thoughts?

Regards
JB

On Tue, Jan 21, 2025 at 3:40 PM Jean-Baptiste Onofré wrote:
> Hi folks,
>
> I know we don't want to "expose" the whole metadata tables in the REST
> API, but I would like to discuss adding metadata stats and metrics
> management.
> We are discussing this as part of the Apache Polaris TMS proposal.
>
> The purpose is:
> 1. To add interfaces to manage metadata stats and metrics (partition
> stats, snapshot summaries, relay Parquet stats exposed via REST, ...)
> 2. The catalog implementation can deal with table properties, but can
> also extend to "extra" stats and metrics if needed
> 3. Query planners can use these metadata stats and metrics to produce
> better query plans. It could also be used by the server-side planning
> to provide a "pre-plan check"
>
> Before going to a proposal document, I would like to get first
> feedback from the community (if it makes sense or not).
>
> Thoughts?
>
> Thanks!
> Regards
> JB
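The "two steps" approach in JB's message can be sketched as follows: in step 1 the catalog returns a free-form metrics payload, and in step 2 a standardized list of metric names and types is enforced while catalog-specific extras remain allowed. The metric names, types, and the `validate_metrics` helper below are hypothetical, invented for illustration; the thread does not define any of them.

```python
# Hypothetical standardized metric names and value types (step 2).
STANDARD_METRICS = {
    "data-file-count": int,
    "avg-data-file-size-bytes": float,
    "small-file-ratio": float,
}


def validate_metrics(payload):
    """Split a free-form metrics payload (step 1) into standardized metrics
    and catalog-specific extras, rejecting standardized names whose value
    has the wrong type (step 2)."""
    standard, extra = {}, {}
    for name, value in payload.items():
        expected = STANDARD_METRICS.get(name)
        if expected is None:
            # Unknown name: left to the catalog implementation.
            extra[name] = value
        elif isinstance(value, expected):
            standard[name] = value
        else:
            raise TypeError(f"{name!r} must be of type {expected.__name__}")
    return standard, extra


# Illustrative payload mixing standardized and catalog-specific metrics.
payload = {
    "data-file-count": 1200,
    "small-file-ratio": 0.4,
    "vendor.hot-partitions": ["day=2025-02-19"],
}
standard, extra = validate_metrics(payload)
print(standard)
print(extra)
```

Keeping the extras in a separate bucket lets step 1 payloads stay valid once step 2 is introduced, since a client that only understands the standardized names can ignore the rest.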
Re: [DISCUSS] Add metadata stats/metrics management on the REST Spec
Hi Dan,

The goal is to expose stats & metrics from the metadata (relaying partition stats, etc.), and to give a REST Catalog implementation the option to extend them with additional metrics/stats.

The purpose of exposing this through the REST Catalog interface is for the query planner to be able to use these stats/metrics to do better planning.

So, it's a raw idea for now. I would love to brainstorm with the community.

Thanks!
Regards
JB

On Tue, Jan 21, 2025 at 6:42 PM Daniel Weeks wrote:
> Hey JB,
>
> I'm not sure I fully understand what the proposal is, but I also realise it's
> probably not completely fleshed out yet.
>
> When you say "manage metadata", the first concern that I have is whether you
> mean to just query/get the info or to also modify it. Table metadata is
> immutable and requires a commit to change, so I would assume you largely are
> interested in just getting access to the data. Currently, snapshot summaries
> are included with table load and I'm not clear on how we would expose
> parquet/file stats since file level stats could be huge and largely depend on
> the filters/projections to prune. I think partition stats is probably
> something to consider, but I'm not sure how much faster that would be and the
> size of partitions could really complicate the protocol.
>
> I think server-side pre/plan APIs would be able to address a lot of these
> types of situations, but I'm just concerned that we would end up rebuilding
> that same functionality to address all of the issues with exposing this
> information more directly.
>
> I'm interested if there are more concrete proposals, but I'm a little
> hesitant because of these challenges.
>
> -Dan
Re: [DISCUSS] Add metadata stats/metrics management on the REST Spec
Hey JB,

I'm not sure I fully understand what the proposal is, but I also realise it's probably not completely fleshed out yet.

When you say "manage metadata", the first concern that I have is whether you mean to just query/get the info or to also modify it. Table metadata is immutable and requires a commit to change, so I would assume you largely are interested in just getting access to the data. Currently, snapshot summaries are included with table load and I'm not clear on how we would expose parquet/file stats since file level stats could be huge and largely depend on the filters/projections to prune. I think partition stats is probably something to consider, but I'm not sure how much faster that would be and the size of partitions could really complicate the protocol.

I think server-side pre/plan APIs would be able to address a lot of these types of situations, but I'm just concerned that we would end up rebuilding that same functionality to address all of the issues with exposing this information more directly.

I'm interested if there are more concrete proposals, but I'm a little hesitant because of these challenges.

-Dan

On Tue, Jan 21, 2025 at 6:40 AM Jean-Baptiste Onofré wrote:
> Hi folks,
>
> I know we don't want to "expose" the whole metadata tables in the REST
> API, but I would like to discuss adding metadata stats and metrics
> management.
> We are discussing this as part of the Apache Polaris TMS proposal.
>
> The purpose is:
> 1. To add interfaces to manage metadata stats and metrics (partition
> stats, snapshot summaries, relay Parquet stats exposed via REST, ...)
> 2. The catalog implementation can deal with table properties, but can
> also extend to "extra" stats and metrics if needed
> 3. Query planners can use these metadata stats and metrics to produce
> better query plans. It could also be used by the server-side planning
> to provide a "pre-plan check"
>
> Before going to a proposal document, I would like to get first
> feedback from the community (if it makes sense or not).
>
> Thoughts?
>
> Thanks!
> Regards
> JB
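As a small illustration of Dan's remark that snapshot summaries are already included with table load, the sketch below derives an average data file size from the current snapshot's summary. The `metadata` dictionary is a trimmed, hand-written stand-in for the table metadata in a load-table response, with made-up values; the keys follow the Iceberg table spec as far as I understand it, where summary values are strings and fields such as `total-data-files` and `total-files-size` are optional, so a real client would need to handle their absence.

```python
# Hand-written, trimmed example of table metadata; values are illustrative.
metadata = {
    "current-snapshot-id": 2,
    "snapshots": [
        {
            "snapshot-id": 1,
            "summary": {
                "operation": "append",
                "total-data-files": "10",
                "total-files-size": "1000000",
            },
        },
        {
            "snapshot-id": 2,
            "summary": {
                "operation": "append",
                "total-data-files": "40",
                "total-files-size": "2000000",
            },
        },
    ],
}


def average_data_file_size(metadata):
    """Average data file size in bytes for the current snapshot."""
    current_id = metadata["current-snapshot-id"]
    snapshot = next(
        s for s in metadata["snapshots"] if s["snapshot-id"] == current_id
    )
    summary = snapshot["summary"]  # summary values are strings
    files = int(summary["total-data-files"])
    return int(summary["total-files-size"]) / files if files else 0.0


print(average_data_file_size(metadata))
```

A table-wide average like this needs no new endpoint, but it says nothing about how file sizes are distributed, which is the kind of extra metric the proposal is after.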
