Hi Pavel,

The issue we are discussing is PME failing because one node cannot communicate with another node; that's what IEP-25 is trying to solve. But in that case (where one node is either down, or there is a communication problem between two nodes) I would expect the split-brain resolver to kick in and shut down one of the nodes. I would also expect the dead node to be removed from the cluster and to no longer take part in PME.
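For reference, the behaviour I would expect is roughly what the configuration below aims at: a node that ends up on the losing side of a split (or gets segmented from the rest of the cluster) stops itself instead of hanging PME. This is only a sketch of my understanding - the connection string, root path and timeouts are placeholders, and I'm assuming SegmentationPolicy.STOP is the relevant knob here:

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.plugin.segmentation.SegmentationPolicy;
    import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

    public class ZkDiscoveryStartup {
        public static void main(String[] args) {
            // ZooKeeper-based discovery; requires the ignite-zookeeper module.
            // Connection string, root path and timeouts are placeholders.
            ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();
            zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181");
            zkSpi.setZkRootPath("/ignite");
            zkSpi.setSessionTimeout(30_000);
            zkSpi.setJoinTimeout(10_000);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setDiscoverySpi(zkSpi);

            // If this node is segmented from the rest of the cluster,
            // stop it so it no longer takes part in PME.
            cfg.setSegmentationPolicy(SegmentationPolicy.STOP);

            Ignition.start(cfg);
        }
    }

Is that the right expectation for ZookeeperDiscoverySpi, or does the split-brain handling work differently?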
On Wed, Sep 12, 2018 at 11:25 AM Pavel Kovalenko <[email protected]> wrote:

> Hi Eugene,
>
> Sorry, but I didn't catch the meaning of your question about Zookeeper
> Discovery. Could you please re-phrase it?
>
> On Wed, Sep 12, 2018 at 17:54, Ilya Lantukh <[email protected]> wrote:
>
>> Pavel K., can you please answer about Zookeeper discovery?
>>
>> On Wed, Sep 12, 2018 at 5:49 PM, eugene miretsky <[email protected]> wrote:
>>
>>> Thanks for the patience with my questions - just trying to understand
>>> the system better.
>>>
>>> 3) I was referring to
>>> https://apacheignite.readme.io/docs/zookeeper-discovery#section-failures-and-split-brain-handling.
>>> How come it doesn't get the node to shut down?
>>> 4) Are there any docs/JIRAs that explain how counters are used, and why
>>> they are required in the state?
>>>
>>> Cheers,
>>> Eugene
>>>
>>> On Wed, Sep 12, 2018 at 10:04 AM Ilya Lantukh <[email protected]> wrote:
>>>
>>>> 3) Such mechanics will be implemented in IEP-25 (linked above).
>>>> 4) Partition map states include update counters, which are incremented
>>>> on every cache update and play an important role in new state calculation.
>>>> So, technically, every cache operation can lead to a partition map change,
>>>> and for obvious reasons we can't route them all through the coordinator.
>>>> Ignite is a more complex system than Akka or Kafka, and such simple
>>>> solutions won't work here (in the general case). However, it is true that
>>>> PME could be simplified or completely avoided in certain cases, and the
>>>> community is currently working on such optimizations
>>>> (https://issues.apache.org/jira/browse/IGNITE-9558 for example).
>>>>
>>>> On Wed, Sep 12, 2018 at 9:08 AM, eugene miretsky <[email protected]> wrote:
>>>>
>>>>> 2b) I had a few situations where the cluster went into a state where
>>>>> PME constantly failed and could never recover. I think the root cause was
>>>>> that a transaction got stuck and didn't time out/roll back. I will try to
>>>>> reproduce it again and get back to you.
>>>>> 3) If a node is down, I would expect it to be detected and removed from
>>>>> the cluster. In that case, PME should not even be attempted with that
>>>>> node. Hence you would expect PME to fail very rarely (any faulty node
>>>>> will be removed before it has a chance to fail PME).
>>>>> 4) Don't all partition map changes go through the coordinator? I believe
>>>>> a lot of distributed systems work this way (all decisions are made by the
>>>>> coordinator/leader) - in Akka the leader is responsible for making all
>>>>> cluster membership changes, and in Kafka the controller does the leader
>>>>> election.
>>>>>
>>>>> On Tue, Sep 11, 2018 at 11:11 AM Ilya Lantukh <[email protected]> wrote:
>>>>>
>>>>>> 1) It is.
>>>>>> 2a) Ignite has retry mechanics for all messages, including PME-related
>>>>>> ones.
>>>>>> 2b) In this situation PME will hang, but it isn't a "deadlock".
>>>>>> 3) Sorry, I didn't understand your question. If a node is down but
>>>>>> DiscoverySpi doesn't detect it, it isn't a PME-related problem.
>>>>>> 4) How can you ensure that partition maps on the coordinator are *latest*
>>>>>> without "freezing" cluster state for some time?
>>>>>>
>>>>>> On Sat, Sep 8, 2018 at 3:21 AM, eugene miretsky <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> We are using persistence, so I am not sure if shutting down nodes
>>>>>>> will be the desired outcome for us, since we would need to modify the
>>>>>>> baseline topology.
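A note on the stuck-transaction scenario in 2b) above: one mitigation we are considering is capping how long transactions can run, so that a hung transaction gets rolled back instead of blocking the exchange indefinitely. A rough sketch of what I mean, assuming a version where txTimeoutOnPartitionMapExchange is available; the timeout values are placeholders, not a recommendation:

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.configuration.TransactionConfiguration;

    public class TxTimeoutConfig {
        public static void main(String[] args) {
            TransactionConfiguration txCfg = new TransactionConfiguration();

            // Upper bound for any transaction (placeholder value).
            txCfg.setDefaultTxTimeout(30_000);

            // Roll back transactions that are still pending when a partition
            // map exchange starts, instead of letting them block the exchange
            // (placeholder value).
            txCfg.setTxTimeoutOnPartitionMapExchange(20_000);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setTransactionConfiguration(txCfg);

            Ignition.start(cfg);
        }
    }
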
>>>>>>> A couple more follow-up questions.
>>>>>>>
>>>>>>> 1) Is PME triggered when client nodes join as well? We are using the
>>>>>>> Spark client, so new nodes are created and destroyed all the time.
>>>>>>> 2) It sounds to me like there is a potential for the cluster to get
>>>>>>> into a deadlock if
>>>>>>> a) a single PME message is lost (PME never finishes, there are no
>>>>>>> retries, and all future operations are blocked on the pending PME), or
>>>>>>> b) one of the nodes has a long-running/stuck pending operation.
>>>>>>> 3) Under what circumstances can PME fail while DiscoverySpi fails to
>>>>>>> detect the node being down? We are using ZookeeperSpi, so I would
>>>>>>> expect the split-brain resolver to shut down the node.
>>>>>>> 4) Why is PME needed? Doesn't the coordinator know the latest
>>>>>>> topology/partition map of the cluster through regular gossip?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eugene
>>>>>>>
>>>>>>> On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Eugene,
>>>>>>>>
>>>>>>>> 1) PME happens when the topology is modified (TopologyVersion is
>>>>>>>> incremented). The most common events that trigger it are: node
>>>>>>>> start/stop/fail, cluster activation/deactivation, and dynamic cache
>>>>>>>> start/stop.
>>>>>>>> 2) It is done by a separate ExchangeWorker. Events that trigger PME
>>>>>>>> are transferred using DiscoverySpi instead of CommunicationSpi.
>>>>>>>> 3) All nodes wait for all pending cache operations to finish and
>>>>>>>> then send their local partition maps to the coordinator (the oldest
>>>>>>>> node). The coordinator then calculates the new global partition maps
>>>>>>>> and sends them to every node.
>>>>>>>> 4) All cache operations.
>>>>>>>> 5) Exchange is never retried. The Ignite community is currently
>>>>>>>> working on PME failure handling that should kick out all problematic
>>>>>>>> nodes after a timeout is reached (see
>>>>>>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
>>>>>>>> for details), but it isn't done yet.
>>>>>>>> 6) You shouldn't consider a PME failure an error by itself, but
>>>>>>>> rather the result of some other error. The most common reason for a
>>>>>>>> PME hang-up is a pending cache operation that couldn't finish. Check
>>>>>>>> your logs - they should list pending transactions and atomic updates.
>>>>>>>> Search for the "Found long running" substring.
>>>>>>>>
>>>>>>>> Hope this helps.
>>>>>>>>
>>>>>>>> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Our cluster occasionally fails with "partition map exchange
>>>>>>>>> failure" errors. I have searched around, and it seems that a lot of
>>>>>>>>> people have had a similar issue in the past. My high-level
>>>>>>>>> understanding is that when one of the nodes fails (out of memory,
>>>>>>>>> exception, GC, etc.) the nodes fail to exchange partition maps.
>>>>>>>>> However, I have a few questions:
>>>>>>>>> 1) When does partition map exchange happen? Periodically, when a
>>>>>>>>> node joins, etc.?
>>>>>>>>> 2) Is it done in the same thread as the communication SPI, or is it
>>>>>>>>> a separate worker?
>>>>>>>>> 3) How does the exchange happen? Via a coordinator, peer to peer,
>>>>>>>>> etc.?
>>>>>>>>> 4) What does the exchange block?
>>>>>>>>> 5) When is the exchange retried?
>>>>>>>>> 6) How to resolve the error?
>>>>>>>>> The only thing I have seen online is to decrease
>>>>>>>>> failureDetectionTimeout.
>>>>>>>>>
>>>>>>>>> Our settings are:
>>>>>>>>> - Zookeeper SPI
>>>>>>>>> - Persistence enabled
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Eugene
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Ilya
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Ilya
>>>>
>>>> --
>>>> Best regards,
>>>> Ilya
>>
>> --
>> Best regards,
>> Ilya
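
P.S. To make it easier to correlate an exchange with whatever triggered it, one thing we may add on our side is a listener that logs the trigger events Ilya lists in 1) above (node start/stop/fail, cache start/stop). A rough sketch; these event types are disabled by default and have to be enabled explicitly, and the listener below is illustrative only:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.events.Event;
    import org.apache.ignite.events.EventType;
    import org.apache.ignite.lang.IgnitePredicate;

    public class ExchangeTriggerLogger {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // These event types are disabled by default and must be enabled explicitly.
            cfg.setIncludeEventTypes(
                EventType.EVT_NODE_JOINED,
                EventType.EVT_NODE_LEFT,
                EventType.EVT_NODE_FAILED,
                EventType.EVT_CACHE_STARTED,
                EventType.EVT_CACHE_STOPPED);

            Ignite ignite = Ignition.start(cfg);

            // Print each trigger event so it can be lined up with
            // "partition map exchange" messages in the Ignite log.
            IgnitePredicate<Event> lsnr = evt -> {
                System.out.println("Possible PME trigger: " + evt.name() + " - " + evt.message());
                return true; // keep listening
            };

            ignite.events().localListen(lsnr,
                EventType.EVT_NODE_JOINED,
                EventType.EVT_NODE_LEFT,
                EventType.EVT_NODE_FAILED,
                EventType.EVT_CACHE_STARTED,
                EventType.EVT_CACHE_STOPPED);
        }
    }

The idea is just to line these timestamps up with the "partition map exchange" and "Found long running" messages in the Ignite logs.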
