Hi. We're using the dynamic bucketing feature to store data in a fairly large table with primary keys (PKs). Over time, our workload processes a lot of deletes plus inserts of newly created PKs, so the number of buckets keeps growing and the buckets themselves become quite "fragmented".
The dynamic bucket index only ever grows as new events are received. Even after the actual data in a bucket has been deleted and compacted, the index still retains every PK ever observed. That leads to the following situation: we have a lot of buckets with fewer than 2M rows (the default limit per bucket), because around 50% of the inserted entities were removed afterwards, yet each of those buckets still holds 2M entries in the index. So new buckets are constantly created for every new PK, since the assigners consider the existing buckets already full.

Is there any procedure we can apply to recompute the bucket index and remove entries that are no longer present in any snapshot? If no such procedure exists at the moment, would it be something interesting to implement in the future?

For this scenario (a table with PKs and a high rate of deletes and inserts), there seem to be only three options to avoid creating an uncontrolled number of partially filled buckets (a configuration sketch for each is at the end of this message):

- Dynamic bucketing with a cap on the number of buckets (the *dynamic-bucket.max-buckets* property). This limits the bucket count and guarantees the buckets are completely filled, but it still builds quite large indexes to keep in memory, since they grow continuously with every new PK observed and are never purged when the data is deleted. In terms of memory usage in the dynamic bucket assigners, this doesn't seem feasible.
- Dynamic bucketing plus partitioning, so that partitions holding older keys can be dropped when possible. It seems that, at least, the dynamic bucket assigner evicts the in-memory indexes of inactive partitions <https://github.com/apache/paimon/blob/452c3bafe642d5c11a49ea66845c1c22c7bbe2f3/paimon-core/src/main/java/org/apache/paimon/index/HashBucketAssigner.java#L135>. But this approach is not always applicable, since the data cannot always be partitioned in a meaningful way.
- Static bucketing, which avoids the need for a bucket index altogether. Thinking about this approach: is there a recommended maximum number of keys per bucket when using static bucketing? The default for dynamic bucketing is 2M, but is there a recommended limit we should not exceed with static bucketing? Does the same 2M limit apply, even though there is no bucket index to manage and keep in memory?

Thanks in advance. We're trying to find the best approach for scenarios with a high rate of delete operations, and to understand how those deletes affect the dynamic bucket index maintenance.

Regards.

PS: at New Relic we're now using Paimon in production for multiple use cases. Would it be possible to apply for a Slack invitation to join the Paimon Slack community? Thanks!
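PPS: for reference, this is roughly how the three variants from the list above can be declared through Paimon's Java catalog API. It's a minimal sketch under assumptions: the warehouse path, database/table/column names, and the concrete bucket counts are made up for illustration, and *dynamic-bucket.max-buckets* may not exist in older Paimon versions.

```java
import org.apache.paimon.catalog.Catalog;
import org.apache.paimon.catalog.CatalogContext;
import org.apache.paimon.catalog.CatalogFactory;
import org.apache.paimon.catalog.Identifier;
import org.apache.paimon.fs.Path;
import org.apache.paimon.schema.Schema;
import org.apache.paimon.types.DataTypes;

public class BucketModeExamples {
    public static void main(String[] args) throws Exception {
        // Filesystem catalog; the warehouse path is a placeholder.
        Catalog catalog = CatalogFactory.createCatalog(
                CatalogContext.create(new Path("/tmp/paimon-warehouse")));
        catalog.createDatabase("demo", true);

        // Option 1: dynamic bucketing with a cap on the bucket count.
        // The in-memory index still records every PK ever observed.
        Schema cappedDynamic = Schema.newBuilder()
                .column("pk", DataTypes.STRING())
                .column("payload", DataTypes.STRING())
                .primaryKey("pk")
                .option("bucket", "-1")                         // -1 = dynamic bucket mode
                .option("dynamic-bucket.target-row-num", "2000000")
                .option("dynamic-bucket.max-buckets", "256")    // illustrative cap
                .build();
        catalog.createTable(Identifier.create("demo", "capped_dynamic"), cappedDynamic, true);

        // Option 2: partitioned dynamic bucketing. The PK includes the
        // partition key, so each partition has its own independent index
        // and the assigner can evict state for inactive partitions.
        Schema partitionedDynamic = Schema.newBuilder()
                .column("pk", DataTypes.STRING())
                .column("event_day", DataTypes.STRING())
                .column("payload", DataTypes.STRING())
                .partitionKeys("event_day")
                .primaryKey("pk", "event_day")
                .option("bucket", "-1")
                .build();
        catalog.createTable(Identifier.create("demo", "partitioned_dynamic"), partitionedDynamic, true);

        // Option 3: static (fixed) bucketing; no bucket index is kept at all.
        Schema staticBucket = Schema.newBuilder()
                .column("pk", DataTypes.STRING())
                .column("payload", DataTypes.STRING())
                .primaryKey("pk")
                .option("bucket", "128")                        // fixed count, sized up front
                .build();
        catalog.createTable(Identifier.create("demo", "static_bucket"), staticBucket, true);

        catalog.close();
    }
}
```

If the 2M-rows-per-bucket guideline also applied to static buckets (which is part of the question above), we would size the fixed `bucket` option as expected live keys divided by 2M, e.g. roughly 256M live keys → 128 buckets.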