Hi. We're using the dynamic bucketing feature to store data in quite a
large table with PKs. Over time, in our scenario, we process a lot of
deletes and inserts of newly created PKs, so the number of buckets is
becoming quite large and the buckets themselves quite "fragmented".

The dynamic bucket index always grows as new events are received. Even if
the actual data in a bucket is deleted and compacted, the index still
retains the PKs of every entry it has ever observed, which leads to a
situation like this: we have a lot of buckets with fewer than 2M rows (the
default limit per bucket), because around 50% of the inserted entities were
removed afterwards, yet each bucket still has 2M entries in the index, so
new buckets are constantly being created for every new PK because the
assigners consider the existing buckets already full.

Is there any procedure we can apply to recompute the bucket index and
remove the entries that are no longer present in any of the snapshots? If
no such procedure exists at the moment, would it be something interesting
to implement in the future?

To handle this situation (a table with PKs and a high rate of deletes and
inserts), it seems there are only three options to avoid creating an
uncontrolled number of partially filled buckets (a rough configuration
sketch of each option follows the list):

   - using dynamic bucketing and limiting the maximum number of buckets
   (the *dynamic-bucket.max-buckets* property): this approach bounds the
   number of buckets and guarantees that they are completely filled, but it
   still keeps quite large indexes in memory, as they grow continuously with
   every new PK observed and are not purged when the data is deleted. This
   approach doesn't seem feasible in terms of memory usage in the dynamic
   bucket assigners.


   - using dynamic bucketing and enabling partitioning, so that partitions
   for older keys can be dropped when possible. It seems that, at least, the
   dynamic bucket assigner drops the in-memory indexes for non-active
   partitions
<https://github.com/apache/paimon/blob/452c3bafe642d5c11a49ea66845c1c22c7bbe2f3/paimon-core/src/main/java/org/apache/paimon/index/HashBucketAssigner.java#L135>.
   But this approach cannot always be implemented, as sometimes it's not
   possible to classify the data into partitions in a meaningful way.


   - using static bucketing, to avoid requiring a bucket index at all.
   Thinking about this approach: is there a recommended maximum number of
   keys per bucket when using static bucketing? The default for dynamic
   bucketing is 2M rows per bucket, but when using static bucketing is there
   a recommended limit we should not exceed? Does the same 2M limit apply,
   even though there is no bucket index to manage and keep in memory?
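
For context, this is roughly how we picture the three configurations above
through Paimon's Java Schema builder. This is just a sketch: the column
names, the "dt" partition field, and the concrete values we pass to
*dynamic-bucket.max-buckets* and *bucket* are placeholders, not what we
actually run, so please correct us if we're misreading any of these options:

    import org.apache.paimon.schema.Schema;
    import org.apache.paimon.types.DataTypes;

    public class BucketingOptionsSketch {

        // Option 1: dynamic bucketing with a cap on the total number of buckets.
        static Schema dynamicWithMaxBuckets() {
            return Schema.newBuilder()
                    .column("id", DataTypes.BIGINT())
                    .column("payload", DataTypes.STRING())
                    .primaryKey("id")
                    .option("bucket", "-1")                             // dynamic bucket mode
                    .option("dynamic-bucket.target-row-num", "2000000") // 2M rows per bucket (default)
                    .option("dynamic-bucket.max-buckets", "256")        // cap on buckets (placeholder value)
                    .build();
        }

        // Option 2: dynamic bucketing plus partitioning, so the assigner can drop
        // the in-memory index of partitions that stop receiving writes.
        static Schema dynamicPartitioned() {
            return Schema.newBuilder()
                    .column("id", DataTypes.BIGINT())
                    .column("dt", DataTypes.STRING())
                    .column("payload", DataTypes.STRING())
                    .partitionKeys("dt")
                    .primaryKey("dt", "id")   // PK includes the partition key in this sketch
                    .option("bucket", "-1")   // dynamic bucket mode
                    .build();
        }

        // Option 3: static (fixed) bucketing, with no dynamic bucket index at all.
        static Schema fixedBuckets() {
            return Schema.newBuilder()
                    .column("id", DataTypes.BIGINT())
                    .column("payload", DataTypes.STRING())
                    .primaryKey("id")
                    .option("bucket", "128")  // fixed bucket count (placeholder value)
                    .build();
        }
    }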

Thanks in advance. We're trying to find the best approach for scenarios
with a high rate of delete operations and to understand how they affect
dynamic bucket index maintenance.

Regards.


PS: at New Relic we're now using Paimon in production for multiple use
cases; would it be possible to request a Slack invitation to join the
Paimon Slack community?
Thanks!
