Hi Evangelos and Matt,

As far as I know, there were issues with client node joins in previous Ignite versions. In newer versions a joining client should not cause any spikes.
In fact, PME is (unfortunately) a widely known beast in the Ignite world. Fundamentally, PME can (and should) run smoothly as long as new server nodes do not join the cluster very frequently. I will give some details about what happens when a new server node joins the cluster. I hope it helps to answer question 3 from the first message in this thread.

As its name hints, PME (partition map exchange) is the process in which all nodes agree on the data distribution in the cluster after an event that leads to a redistribution. A node joining is one such event. The data distribution is the knowledge that partition i is located on node j. For correct cluster operation all nodes must agree on the same distribution (consensus). So it is all about a consistent data distribution.

Consequently, some data has to be rebalanced after the nodes come to an agreement on a distribution. Ignite uses a clever trick to allow operations while data is being rebalanced. When a new node joins:

1. PME occurs and the nodes agree on the same data distribution. In that distribution all primary partitions stay on the same nodes they belonged to before PME. In addition, temporary backup partitions are assigned to the new node, which will become the primary node for those partitions (keep reading).
2. Rebalance starts and delivers data to the temporary backup partitions* mentioned above. The cluster is fully operational meanwhile.
3. Once rebalance completes, another PME happens. Now the temporary backups become primary (and the other, now redundant, partitions are marked for unload).

* It is worth noting that a partition that was empty and is being loaded during rebalance is marked as MOVING. It is not readable because it does not contain all the data yet, but all writes go to this partition as well, so that it is up to date when rebalance completes.

(In Ignite the described trick is sometimes called "late affinity assignment".)

So PME should not be very heavy, because it is mainly about establishing an agreement on data distribution.
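
To make the three steps above a bit more concrete, here is a toy, single-process Python model of them. It is only an illustration and not Ignite code: the Cluster class, its method names and the node names A, B, C are all made up for this sketch, and real PME and rebalance are distributed and far more involved.

```python
OWNING, MOVING = "OWNING", "MOVING"


class Cluster:
    """Hypothetical toy model of late affinity assignment (not an Ignite API)."""

    def __init__(self, primaries):
        # primaries: partition id -> node that is primary for it
        self.primary = dict(primaries)
        self.state = {(p, n): OWNING for p, n in primaries.items()}
        self.data = {(p, n): {} for p, n in primaries.items()}
        self.pending = {}  # partition id -> node that will become primary

    def put(self, part, key, value):
        # Writes go to every copy of the partition, including MOVING ones.
        for (p, node) in list(self.state):
            if p == part:
                self.data[(p, node)][key] = value

    def get(self, part, key):
        # Reads are served by the current primary, which is never MOVING.
        return self.data[(part, self.primary[part])].get(key)

    def first_pme(self, new_node, moving_parts):
        # Step 1: primaries stay where they are; the new node only gets
        # empty temporary backup copies, marked MOVING.
        for part in moving_parts:
            self.state[(part, new_node)] = MOVING
            self.data[(part, new_node)] = {}
            self.pending[part] = new_node

    def rebalance(self):
        # Step 2: copy existing entries into the temporary backups.
        for part, node in self.pending.items():
            source = self.data[(part, self.primary[part])]
            for key, value in source.items():
                self.data[(part, node)].setdefault(key, value)
            self.state[(part, node)] = OWNING

    def second_pme(self):
        # Step 3: temporary backups become primary, old copies are dropped.
        for part, node in self.pending.items():
            old = self.primary[part]
            self.primary[part] = node
            del self.state[(part, old)]
            del self.data[(part, old)]
        self.pending.clear()


cluster = Cluster({0: "A", 1: "A", 2: "B", 3: "B"})
cluster.put(0, "k1", "v1")
cluster.first_pme("C", [0, 2])                # node C joins
primary_after_first_pme = cluster.primary[0]  # still "A"
cluster.put(0, "k2", "v2")                    # also lands in C's MOVING copy
cluster.rebalance()
cluster.second_pme()                          # now C is primary for 0 and 2
```

The point of the sketch is that neither PME call moves any data: both only change the agreed assignment, while the copying happens in between with the cluster still serving reads and writes.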
The heavier data rebalance process happens while the cluster is fully operational. But PME still requires a silence period while the agreement is being established. As you might know, PME and write operations use a mechanism similar to a read-write lock. Write operations are guarded by that lock in shared mode. PME acquires that lock in exclusive mode. So at any moment we can have either several running write operations or only one running PME. This means that PME has to wait for all write operations to complete before it can start. It also blocks all new write operations from starting. Therefore long-running transactions blocking PME can lead to a prolonged "silence" period.

Thu, 25 Apr 2019 at 00:58, Evangelos Morakis <[email protected]>:
>
> Matt thank you for your reply,
> Indeed I saw your question too yesterday. In regards to points 3-4 of my
> question I suppose that as you mention, if one shuts down gracefully the
> client node and if the number of threads responsible for rebalancing the
> data gets tweaked, then I guess the amount of time the cluster blocks could
> be managed. For point 2 I think it’s necessary for someone from the dev team
> to provide a bit more insight as to what Ignite’s behavior is in regards to
> client nodes joining/leaving the cluster as I fail to understand why PME is
> triggered for such nodes given their natural exclusion from computations and
> the lack of storage of cache data in them. Indeed if the case is that PME is
> triggered for client nodes when joining/leaving, scenarios where remote
> clients come and go on demand become difficult to accommodate at best, and
> this sounds very restrictive. I simply need to know more on this otherwise it
> would not be possible to develop a working strategy for accommodating clients
> that come, do a bit of work, and then they leave until next time.
>
> Kind regards
>
> Dr. Evangelos Morakis
> Software Architect
>
> > On 24 Apr 2019, at 21:21, MattNohelty <[email protected]> wrote:
> >
> > I have these same questions and posted about this yesterday
> > (http://apache-ignite-users.70518.x6.nabble.com/What-happens-when-a-client-gets-disconnected-td27959.html).
> > Based on my understanding:
> >
> > 1) Yes, PME will always happen when a server node joins
> >
> > 2) This is my biggest question. I'm currently using 2.4 and it appears PME
> > is happening when a client connects or disconnects but I received one
> > response that seemed to indicate that PME should not happen in this case in
> > the newest versions of Ignite. I agree with your reasoning that these
> > rebalancing processes do not seem necessary as all the data is on the server
> > nodes which is what prompted my initial question.
> >
> > 3) The responses I received do say that the cluster blocks while this
> > happens and I've seen evidence of this as well. I've only seen substantial
> > blocking though when a client node is disconnected ungracefully. When they
> > start or stop properly, we do not observe substantial blocking on the other
> > clients.
> >
> > This behavior has caused some issues for us recently and it seems very
> > problematic that one client node crashing can cause issues on all other
> > client nodes. Granted, we are still on Ignite 2.4 so maybe this has been
> > corrected in 2.7, but I would really like to understand what the expected
> > behavior should be.
> >
> > --
> > Sent from: http://apache-ignite-users.70518.x6.nabble.com/

--
Best regards,
Ivan Pavlukhin
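
P.S. The read-write lock analogy from my reply can be sketched as a small Python model. Again, this is only an illustration under my own assumptions and not how Ignite implements it: the ExchangeGate class and its methods are hypothetical names, and the model is single-threaded and returns False where a real operation would block and wait.

```python
class ExchangeGate:
    """Hypothetical toy model of a read-write-lock-like guard (not Ignite code)."""

    def __init__(self):
        self.active_writes = 0    # holders of the lock in shared mode
        self.pme_waiting = False  # PME has asked for exclusive mode
        self.pme_running = False  # PME holds the lock in exclusive mode

    def try_begin_write(self):
        # Shared mode: refused as soon as a PME is waiting or running.
        if self.pme_waiting or self.pme_running:
            return False
        self.active_writes += 1
        return True

    def end_write(self):
        self.active_writes -= 1
        self._maybe_start_pme()

    def request_pme(self):
        # Exclusive mode: can start only when no write is in flight.
        self.pme_waiting = True
        self._maybe_start_pme()
        return self.pme_running

    def finish_pme(self):
        self.pme_running = False

    def _maybe_start_pme(self):
        if self.pme_waiting and self.active_writes == 0:
            self.pme_waiting = False
            self.pme_running = True


gate = ExchangeGate()
long_tx = gate.try_begin_write()          # True: a long transaction starts
pme_started = gate.request_pme()          # False: PME must wait for it
new_write = gate.try_begin_write()        # False: the "silence" period has begun
gate.end_write()                          # the long transaction finally commits
running_after_commit = gate.pme_running   # True: PME could start only now
gate.finish_pme()
write_after_pme = gate.try_begin_write()  # True: writes are allowed again
```

In this model the length of the "silence" period is the time between request_pme() and finish_pme(), so it includes the remaining duration of the longest transaction that was already running.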
