Semen Boikov created IGNITE-4798:
------------------------------------

             Summary: Cluster does not finish rebalancing after nodes leaving
                 Key: IGNITE-4798
                 URL: https://issues.apache.org/jira/browse/IGNITE-4798
             Project: Ignite
          Issue Type: Bug
            Reporter: Denis Kholodov

Hi Valentin,

I managed to reproduce the stability issue we've been having in production in a relatively sterile environment. The logs and stack traces are accessible here:
https://drive.google.com/open?id=0B1YMrCiHZq1PMWJsblBYSXhaX1k

The situation is:
1. Start up a cluster of 223 nodes.
2. Wait for everything to stabilize (took about 2 minutes).
3. Shut down 112 nodes.
4. Wait for everything to stabilize.

From that point on, I can't connect client nodes to the cluster:

2017-02-15 23:13:16.396 WARN o.a.i.i.p.c.GridCachePartitionExchangeManager main ctx: actor: - Failed to wait for initial partition map exchange. Possible reasons are:
  ^-- Transactions in deadlock.
  ^-- Long running transactions (ignore if this is the case).
  ^-- Unreleased explicit locks.

Other cache operations are also stuck.

Let me know what other information I can provide.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)