Alexey Goncharuk created IGNITE-3616:
----------------------------------------

Summary: Drop failed nodes from topology after a configured timeout
Key: IGNITE-3616
URL: https://issues.apache.org/jira/browse/IGNITE-3616
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 1.5.0.final
Reporter: Alexey Goncharuk

If an OOME or assertion error happens on a node, it is not uncommon for partition exchange to get stuck, blocking the whole cluster. We should provide a mechanism to drop non-responsive nodes automatically.

When partition exchange times out, the coordinator should:
 - always print out the IDs/IPs of the non-responsive nodes
 - apply a configurable kill timeout for non-responsive nodes (-1 means disabled)
 - the timeout should be at least a minute after the 1st non-responsive node message is printed
 - when the timeout expires, we should kill the nodes and automatically collect their thread dumps (on a best-effort basis)
 - we should print out a message asking users to provide these thread dumps to us via Jira or the dev list
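
The timeout rules above could be sketched roughly as follows. This is only an illustration of the proposed semantics, not existing Ignite code: the class name KillTimeoutPolicy and all of its members are hypothetical, and clamping a too-small value up to one minute (rather than rejecting it) is an assumption on my part.

```java
/** Hypothetical helper illustrating the proposed kill-timeout rules; not part of the Ignite API. */
final class KillTimeoutPolicy {
    /** Value that disables automatic killing, as proposed in the ticket. */
    static final long DISABLED = -1L;

    /** Minimum delay after the first "non-responsive node" message: one minute. */
    static final long MIN_DELAY_MS = 60_000L;

    private final long configuredMs;

    KillTimeoutPolicy(long configuredMs) {
        if (configuredMs < 0 && configuredMs != DISABLED)
            throw new IllegalArgumentException("Kill timeout must be -1 or non-negative: " + configuredMs);
        this.configuredMs = configuredMs;
    }

    boolean enabled() {
        return configuredMs != DISABLED;
    }

    /** Configured value raised to the one-minute floor (assumption: clamp instead of reject). */
    long effectiveMs() {
        return enabled() ? Math.max(configuredMs, MIN_DELAY_MS) : DISABLED;
    }

    /**
     * @param firstWarningMs time the first non-responsive node message was printed
     * @param nowMs current time, on the same clock as firstWarningMs
     * @return true if the non-responsive nodes should be killed now
     */
    boolean shouldKill(long firstWarningMs, long nowMs) {
        return enabled() && nowMs - firstWarningMs >= effectiveMs();
    }
}
```

The coordinator would call shouldKill on each exchange-timeout check and, once it returns true, try to collect thread dumps before dropping the nodes. That part is left out here because it depends on internals the ticket does not specify.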