RE: Rebalancing of an Ignite cluster

jrovira Thu, 05 Dec 2024 04:13:03 -0800

Thank you for your answer, Stephen.


We use the metrics to measure the system, but for our current development we
need to know exactly when a rebalance starts and finishes. We use the local
cache on each server node to feed an external process that is local to each
node (a divide-and-conquer algorithm 😉). To coordinate those local processes,
we implemented an Ignite Service that listens to the events I described. We are
trying to understand the sequence of events involved so that we cover all
cases. We are especially worried about the failure events (PART_DATA_LOST,
PART_MISSED, etc.) and want to handle them properly. Do you know anything about
them, and why we don't receive the REBALANCE_STARTED event (that's weird ☹)?
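
To make the coordination problem concrete, here is a minimal sketch of a
per-node tracker that tolerates a missing REBALANCE_STARTED and remembers
failure events. It is plain Java and does not call the Ignite API: the class
RebalanceTracker, the Kind and State enums and the onEvent method are all
hypothetical names, and it assumes that PART_LOADED and PART_MISSED are
reported for the node that receives the data.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of the Ignite API. */
class RebalanceTracker {
    /** Illustrative enum mirroring the EVT_CACHE_REBALANCE_* names used in this thread. */
    enum Kind { STARTED, STOPPED, PART_SUPPLIED, PART_LOADED, PART_UNLOADED, PART_DATA_LOST, PART_MISSED }

    enum State { IDLE, REBALANCING, FINISHED, FINISHED_WITH_PROBLEMS }

    private final Map<String, State> states = new HashMap<>();
    private final Map<String, Integer> problems = new HashMap<>();

    /** Feed every received event here, keyed by the node that reported it. */
    synchronized void onEvent(String nodeId, Kind kind) {
        switch (kind) {
            case STARTED:
                states.put(nodeId, State.REBALANCING);
                break;
            case PART_LOADED:
                // Receiving data implies a rebalance is running, even if STARTED never arrived.
                states.put(nodeId, State.REBALANCING);
                break;
            case PART_MISSED:
                // Assumption: reported for the receiving node, so it also implies a running rebalance.
                problems.merge(nodeId, 1, Integer::sum);
                states.put(nodeId, State.REBALANCING);
                break;
            case PART_DATA_LOST:
                // Remember the problem; the state itself is unchanged.
                problems.merge(nodeId, 1, Integer::sum);
                break;
            case STOPPED: {
                boolean clean = problems.getOrDefault(nodeId, 0) == 0;
                states.put(nodeId, clean ? State.FINISHED : State.FINISHED_WITH_PROBLEMS);
                problems.remove(nodeId);
                break;
            }
            default:
                // PART_SUPPLIED and PART_UNLOADED come from supplying nodes; state is unchanged.
                break;
        }
    }

    synchronized State stateOf(String nodeId) {
        return states.getOrDefault(nodeId, State.IDLE);
    }

    /** True while at least one node is known to be rebalancing. */
    synchronized boolean isRebalancing() {
        return states.containsValue(State.REBALANCING);
    }
}
```

A PART_LOADED that arrives after STOPPED would mark the node as rebalancing
again, so a real implementation would probably need a topology version or a
similar ordering hint to discard late events.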

 

About the other question, you are right: it is better to let the cluster
decide when to move the service.

 

From: Stephen Darlington <sdarling...@apache.org> 
Sent: Wednesday, 4 December 2024 18:32
To: user@ignite.apache.org
Subject: Re: Rebalancing of an Ignite cluster

 

There are a bunch of "rebalance" cache metrics
(https://ignite.apache.org/docs/latest/monitoring-metrics/new-metrics#caches).
Do they not do what you need?

 

I don't know enough about the internals to guide you on the ordering, but I 
would say that it's not a good idea to subscribe to CACHE_ENTRY_* events. There 
could be millions of them.

 

AFAIK, you can't configure a cluster singleton service as you suggest. The idea 
is that the cluster manages it for you. You can limit which nodes it runs on 
using cluster groups. It's more common to use node singletons so that your
service scales.

 

 

On Wed, 4 Dec 2024 at 11:18, <jrov...@identy.io> wrote:

Hi.

 

We are currently working on a feature to track the rebalancing of our Ignite
cluster by listening to all the relevant events: when it starts and ends, how
many partitions are sent, etc.

 

We implemented a singleton Ignite Service and deployed it on the cluster.

 

When the service starts, it listens for the following remote events:

            EVT_NODE_JOINED
            EVT_NODE_LEFT
            EVT_NODE_FAILED
            EVT_NODE_SEGMENTED
            EVT_CACHE_REBALANCE_STARTED
            EVT_CACHE_REBALANCE_STOPPED
            EVT_CACHE_REBALANCE_PART_SUPPLIED
            EVT_CACHE_REBALANCE_PART_LOADED
            EVT_CACHE_REBALANCE_PART_UNLOADED
            EVT_CACHE_REBALANCE_PART_DATA_LOST
            EVT_CACHE_REBALANCE_PART_MISSED

 

Some questions about our approach:

 

1.      Is the following sequence of events correct?

*       Baseline: A,B
*       setBaselineTopology: A, B, C, D

 

*       C and D send event EVT_CACHE_REBALANCE_STARTED

 

*       A and B send multiple EVT_CACHE_REBALANCE_PART_SUPPLIED with the 
partitions to rebalance

 

*       C and D receive data from A and B
*       C and D send multiple CACHE_ENTRY_CREATED, CACHE_ENTRY_DESTROYED, 
CACHE_REBALANCE_OBJECT_LOADED to create new entries
*       C and D send multiple CACHE_REBALANCE_PART_LOADED when the data is 
loaded into the new nodes

 

*       C and D send event EVT_CACHE_REBALANCE_STOPPED

 

*       A and B send multiple CACHE_ENTRY_CREATED, CACHE_ENTRY_DESTROYED, 
CACHE_REBALANCE_OBJECT_UNLOADED to remove old entries
*       A and B send multiple EVT_CACHE_REBALANCE_PART_UNLOADED when the 
partitions are removed

 

 

2.      As you can see, I receive some CACHE_ENTRY_DESTROYED events while C and
D are receiving the data. Why are they destroying entries? Something similar
happens when A and B are removing partitions: I receive some
CACHE_ENTRY_CREATED events.

 

3.      Sometimes I do not receive an EVT_CACHE_REBALANCE_STARTED event from a
node that is supposed to start rebalancing. How is this possible?

 

4.      About the use of services: when the node that hosts a service is
removed from the cluster, the service is moved to another node. Is there a way
to do this manually? I mean, for example, if I know that the node is processing
a lot of data and I want to transfer the service to a different node, is it
possible to do so?

 

5.      We want to know whether a rebalance is in progress. Is there a way to
ask the cluster about this? We currently listen to the events and keep this
state in a variable, but if the service dies… the variable is lost.
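
One way to keep the state from question 5 alive when the service dies is to
hold it behind a small store interface instead of a plain field, so that a
redeployed service instance can reload it. The sketch below is plain Java with
hypothetical names (DurableRebalanceFlag, Store, InMemoryStore); the in-memory
store only keeps the example runnable, and backing it with something that
outlives the service, such as a replicated cache, is an assumption that the
thread does not confirm.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: keeps the "rebalance in progress" state outside the service instance. */
class DurableRebalanceFlag {
    /** Where the state lives; a real deployment would need a store that outlives the service. */
    interface Store {
        Set<String> load();
        void save(Set<String> nodesRebalancing);
    }

    /** In-memory stand-in used only to keep the sketch runnable. */
    static class InMemoryStore implements Store {
        private Set<String> saved = new HashSet<>();

        public synchronized Set<String> load() {
            return new HashSet<>(saved);
        }

        public synchronized void save(Set<String> nodesRebalancing) {
            saved = new HashSet<>(nodesRebalancing);
        }
    }

    private final Store store;
    private final Set<String> active;

    DurableRebalanceFlag(Store store) {
        this.store = store;
        this.active = store.load(); // recover whatever a previous instance left behind
    }

    /** Call when a node reports that its rebalance started. */
    synchronized void rebalanceStarted(String nodeId) {
        if (active.add(nodeId)) {
            store.save(active);
        }
    }

    /** Call when a node reports that its rebalance stopped. */
    synchronized void rebalanceStopped(String nodeId) {
        if (active.remove(nodeId)) {
            store.save(active);
        }
    }

    synchronized boolean isRebalancing() {
        return !active.isEmpty();
    }
}
```

Events fired while no service instance is running are still missed with this
approach, so the restored state can be stale until the next event arrives.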

 

Thank you.
