Thanks for the patience with my questions - just trying to understand the system better.

3) I was referring to
https://apacheignite.readme.io/docs/zookeeper-discovery#section-failures-and-split-brain-handling.
How come it doesn't get the node to shut down?
4) Are there any docs/JIRAs that explain how the update counters are used, and
why they are required in the partition map state?
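
For reference, our discovery setup looks roughly like this (a trimmed-down
sketch, not our exact production config - the hosts and timeout values below
are placeholders):

    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

    // ZooKeeper-based discovery (placeholder connection string).
    ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();
    zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181");

    IgniteConfiguration cfg = new IgniteConfiguration();
    cfg.setDiscoverySpi(zkSpi);

    // Nodes that don't respond within this window are treated as failed
    // (placeholder value).
    cfg.setFailureDetectionTimeout(10_000);

    // Native persistence is enabled for the default data region.
    DataStorageConfiguration dsCfg = new DataStorageConfiguration();
    dsCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
    cfg.setDataStorageConfiguration(dsCfg);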

Cheers,
Eugene

On Wed, Sep 12, 2018 at 10:04 AM Ilya Lantukh <[email protected]> wrote:

> 3) Such mechanics will be implemented in IEP-25 (linked above).
> 4) Partition map states include update counters, which are incremented on
> every cache update and play an important role in new state calculation. So,
> technically, every cache operation can lead to a partition map change, and
> for obvious reasons we can't route them all through the coordinator. Ignite
> is a more complex system than Akka or Kafka, and such simple solutions
> won't work here (in the general case). However, it is true that PME could
> be simplified or completely avoided in certain cases, and the community is
> currently working on such optimizations
> (https://issues.apache.org/jira/browse/IGNITE-9558, for example).
>
> On Wed, Sep 12, 2018 at 9:08 AM, eugene miretsky <[email protected]> wrote:
>
>> 2b) I had a few situations where the cluster went into a state where PME
>> constantly failed and could never recover. I think the root cause was
>> that a transaction got stuck and didn't time out/roll back. I will try
>> to reproduce it again and get back to you.
>> 3) If a node is down, I would expect it to be detected and removed from
>> the cluster. In that case, PME should not even be attempted with that
>> node, so PME should fail very rarely (any faulty node would be removed
>> before it has a chance to fail PME).
>> 4) Don't all partition map changes go through the coordinator? I believe
>> a lot of distributed systems work this way (all decisions are made by the
>> coordinator/leader) - in Akka the leader is responsible for all cluster
>> membership changes, and in Kafka the controller does the leader election.
>>
>> On Tue, Sep 11, 2018 at 11:11 AM Ilya Lantukh <[email protected]> wrote:
>>
>>> 1) It is.
>>> 2a) Ignite has retry mechanics for all messages, including PME-related
>>> ones.
>>> 2b) In this situation PME will hang, but it isn't a "deadlock".
>>> 3) Sorry, I didn't understand your question. If a node is down but
>>> DiscoverySpi doesn't detect it, that isn't a PME-related problem.
>>> 4) How can you ensure that the partition maps on the coordinator are
>>> the *latest* ones without "freezing" the cluster state for some time?
>>>
>>> On Sat, Sep 8, 2018 at 3:21 AM, eugene miretsky <[email protected]> wrote:
>>>
>>>> Thanks!
>>>>
>>>> We are using persistence, so I am not sure shutting down nodes would
>>>> be the desired outcome for us, since we would need to modify the
>>>> baseline topology.
>>>>
>>>> A couple more follow-up questions:
>>>>
>>>> 1) Is PME triggered when client nodes join as well? We are using the
>>>> Spark client, so new nodes are created/destroyed all the time.
>>>> 2) It sounds to me like there is a potential for the cluster to get
>>>> into a deadlock if
>>>> a) a single PME message is lost (PME never finishes, there are no
>>>> retries, and all future operations are blocked on the pending PME), or
>>>> b) one of the nodes has a long-running/stuck pending operation.
>>>> 3) Under what circumstances can PME fail while DiscoverySpi fails to
>>>> detect that the node is down? We are using ZookeeperSpi, so I would
>>>> expect the split-brain resolver to shut the node down.
>>>> 4) Why is PME needed? Doesn't the coordinator know the latest
>>>> topology/partition map of the cluster through regular gossip?
>>>>
>>>> Cheers,
>>>> Eugene
>>>>
>>>> On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <[email protected]> wrote:
>>>>
>>>>> Hi Eugene,
>>>>>
>>>>> 1) PME happens when the topology is modified (TopologyVersion is
>>>>> incremented). The most common events that trigger it are: node
>>>>> start/stop/fail, cluster activation/deactivation, and dynamic cache
>>>>> start/stop.
>>>>> 2) It is done by a separate ExchangeWorker. Events that trigger PME
>>>>> are transferred using DiscoverySpi instead of CommunicationSpi.
>>>>> 3) All nodes wait for all pending cache operations to finish and then
>>>>> send their local partition maps to the coordinator (the oldest node).
>>>>> The coordinator then calculates the new global partition maps and
>>>>> sends them to every node.
>>>>> 4) All cache operations.
>>>>> 5) Exchange is never retried. The Ignite community is currently
>>>>> working on PME failure handling that should kick all problematic
>>>>> nodes out after a timeout is reached (see
>>>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
>>>>> for details), but it isn't done yet.
>>>>> 6) You shouldn't consider a PME failure an error by itself, but
>>>>> rather the result of some other error. The most common reason for a
>>>>> PME hang-up is a pending cache operation that couldn't finish. Check
>>>>> your logs - they should list pending transactions and atomic updates.
>>>>> Search for the "Found long running" substring.
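>>>>>
>>>>> If the root cause is a transaction that never finishes, you can also
>>>>> make Ignite roll such transactions back instead of letting them block
>>>>> PME indefinitely. A minimal sketch (the timeout values are
>>>>> placeholders, not recommendations):
>>>>>
>>>>>     import org.apache.ignite.configuration.IgniteConfiguration;
>>>>>     import org.apache.ignite.configuration.TransactionConfiguration;
>>>>>
>>>>>     TransactionConfiguration txCfg = new TransactionConfiguration();
>>>>>
>>>>>     // Roll back any transaction that runs longer than this (placeholder).
>>>>>     txCfg.setDefaultTxTimeout(30_000);
>>>>>
>>>>>     // Available since Ignite 2.5: forcibly roll back transactions that
>>>>>     // are blocking a pending partition map exchange (placeholder).
>>>>>     txCfg.setTxTimeoutOnPartitionMapExchange(20_000);
>>>>>
>>>>>     IgniteConfiguration cfg = new IgniteConfiguration();
>>>>>     cfg.setTransactionConfiguration(txCfg);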
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <[email protected]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Our cluster occasionally fails with "partition map exchange failure"
>>>>>> errors. I have searched around, and it seems that a lot of people
>>>>>> have had a similar issue in the past. My high-level understanding is
>>>>>> that when one of the nodes fails (out of memory, exception, GC,
>>>>>> etc.), nodes fail to exchange partition maps. However, I have a few
>>>>>> questions:
>>>>>> 1) When does partition map exchange happen? Periodically, when a
>>>>>> node joins, etc.?
>>>>>> 2) Is it done in the same thread as the communication SPI, or in a
>>>>>> separate worker?
>>>>>> 3) How does the exchange happen? Via a coordinator, peer to peer,
>>>>>> etc.?
>>>>>> 4) What does the exchange block?
>>>>>> 5) When is the exchange retried?
>>>>>> 6) How do we resolve the error? The only thing I have seen online is
>>>>>> to decrease failureDetectionTimeout.
>>>>>>
>>>>>> Our settings are:
>>>>>> - Zookeeper SPI
>>>>>> - Persistence enabled
>>>>>>
>>>>>> Cheers,
>>>>>> Eugene
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Ilya
>>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Ilya
>>
>
>
> --
> Best regards,
> Ilya
