Eugene,

When Zookeeper Discovery is enabled and a communication problem occurs
between some nodes, a subset of the problematic nodes is automatically
killed to reach a cluster state where every node can communicate with
every other node without problems. So, you're absolutely right: dead
nodes will be removed from the cluster and will not participate in PME.
IEP-25 tries to solve a more general problem, one that is specific to PME.
Network problems are only one kind of problem that can happen during PME.
A node may break down before it even tries to send a message because of an
unexpected exception (e.g. NullPointerException, RuntimeException,
AssertionError). In general, IEP-25 tries to defend us from any kind of
unexpected problem, to make sure that PME is not blocked in such cases and
the cluster continues to live.
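For reference, a minimal sketch of what enabling Zookeeper Discovery looks
like in Java config (assuming Ignite 2.5+ with the ignite-zookeeper module
on the classpath; the connection string and timeout values are placeholders):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

    public class ZkDiscoveryStart {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Use ZooKeeper-based discovery instead of the default TCP discovery.
            ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();
            zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181"); // placeholder hosts
            zkSpi.setSessionTimeout(30_000); // placeholder: session loss => node considered failed
            zkSpi.setJoinTimeout(10_000);    // placeholder

            cfg.setDiscoverySpi(zkSpi);

            // With Zookeeper Discovery, communication failures between nodes are
            // resolved by stopping a subset of nodes, as described above.
            Ignite ignite = Ignition.start(cfg);
        }
    }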


Wed, Sep 12, 2018 at 18:53, eugene miretsky <[email protected]>:

> Hi Pavel,
>
> The issue we are discussing is PME failing because one node cannot
> communicate with another node; that's what IEP-25 is trying to solve. But
> in that case (where one node is either down, or there is a communication
> problem between two nodes), I would expect the split-brain resolver to
> kick in and shut down one of the nodes. I would also expect the dead node
> to be removed from the cluster and to no longer take part in PME.
>
>
>
> On Wed, Sep 12, 2018 at 11:25 AM Pavel Kovalenko <[email protected]>
> wrote:
>
>> Hi Eugene,
>>
>> Sorry, but I didn't catch the meaning of your question about Zookeeper
>> Discovery. Could you please re-phrase it?
>>
>> Wed, Sep 12, 2018 at 17:54, Ilya Lantukh <[email protected]>:
>>
>>> Pavel K., can you please answer about Zookeeper discovery?
>>>
>>> On Wed, Sep 12, 2018 at 5:49 PM, eugene miretsky <
>>> [email protected]> wrote:
>>>
>>>> Thanks for your patience with my questions - just trying to understand
>>>> the system better.
>>>>
>>>> 3) I was referring to
>>>> https://apacheignite.readme.io/docs/zookeeper-discovery#section-failures-and-split-brain-handling.
>>>> How come it doesn't get the node to shut down?
>>>> 4) Are there any docs/JIRAs that explain how counters are used, and why
>>>> they are required in the state?
>>>>
>>>> Cheers,
>>>> Eugene
>>>>
>>>>
>>>> On Wed, Sep 12, 2018 at 10:04 AM Ilya Lantukh <[email protected]>
>>>> wrote:
>>>>
>>>>> 3) Such mechanics will be implemented in IEP-25 (linked above).
>>>>> 4) Partition map states include update counters, which are incremented
>>>>> on every cache update and play an important role in new state
>>>>> calculation. So, technically, every cache operation can lead to a
>>>>> partition map change, and for obvious reasons we can't route them all
>>>>> through the coordinator. Ignite is a more complex system than Akka or
>>>>> Kafka, and such simple solutions won't work here (in the general case).
>>>>> However, it is true that PME could be simplified or completely avoided
>>>>> in certain cases, and the community is currently working on such
>>>>> optimizations (https://issues.apache.org/jira/browse/IGNITE-9558 for
>>>>> example).
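>>>>>
>>>>> As a purely conceptual illustration (a toy model, not Ignite's actual
>>>>> internals): each partition carries a per-partition update counter that
>>>>> advances on every update applied to it, so comparing counters tells
>>>>> which copy of a partition has seen more updates:
>>>>>
>>>>>     import java.util.Map;
>>>>>     import java.util.concurrent.ConcurrentHashMap;
>>>>>
>>>>>     // Toy model of per-partition update counters on a single node.
>>>>>     public class PartitionCounters {
>>>>>         private final Map<Integer, Long> counters = new ConcurrentHashMap<>();
>>>>>
>>>>>         // Called for every cache update applied to the given partition.
>>>>>         public long onUpdate(int partition) {
>>>>>             return counters.merge(partition, 1L, Long::sum);
>>>>>         }
>>>>>
>>>>>         // Used during reconciliation: a higher counter means this copy
>>>>>         // has seen more updates for that partition.
>>>>>         public long counter(int partition) {
>>>>>             return counters.getOrDefault(partition, 0L);
>>>>>         }
>>>>>     }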
>>>>>
>>>>> On Wed, Sep 12, 2018 at 9:08 AM, eugene miretsky <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> 2b) I had a few situations where the cluster went into a state where
>>>>>> PME constantly failed and could never recover. I think the root cause
>>>>>> was that a transaction got stuck and didn't time out / roll back. I
>>>>>> will try to reproduce it again and get back to you.
>>>>>> 3) If a node is down, I would expect it to be detected and removed
>>>>>> from the cluster. In that case, PME should not even be attempted with
>>>>>> that node. Hence I would expect PME to fail very rarely (any faulty
>>>>>> node would be removed before it has a chance to fail PME).
>>>>>> 4) Don't all partition map changes go through the coordinator? I
>>>>>> believe a lot of distributed systems work this way (all decisions are
>>>>>> made by the coordinator/leader): in Akka the leader is responsible for
>>>>>> making all cluster membership changes, and in Kafka the controller
>>>>>> does the leader election.
>>>>>>
>>>>>> On Tue, Sep 11, 2018 at 11:11 AM Ilya Lantukh <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> 1) It is.
>>>>>>> 2a) Ignite has retry mechanics for all messages, including
>>>>>>> PME-related ones.
>>>>>>> 2b) In this situation PME will hang, but it isn't a "deadlock".
>>>>>>> 3) Sorry, I didn't understand your question. If a node is down but
>>>>>>> DiscoverySpi doesn't detect it, it isn't a PME-related problem.
>>>>>>> 4) How can you ensure that the partition maps on the coordinator are
>>>>>>> the *latest* without "freezing" the cluster state for some time?
>>>>>>>
>>>>>>> On Sat, Sep 8, 2018 at 3:21 AM, eugene miretsky <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> We are using persistence, so I am not sure shutting down nodes is
>>>>>>>> the desired outcome for us, since we would then need to modify the
>>>>>>>> baseline topology.
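>>>>>>>>
>>>>>>>> For illustration, this is roughly what resetting the baseline to the
>>>>>>>> current server topology would look like (a sketch assuming Ignite
>>>>>>>> 2.4+; "client-config.xml" is a placeholder path, and whether to do
>>>>>>>> this automatically is a separate decision):
>>>>>>>>
>>>>>>>>     import org.apache.ignite.Ignite;
>>>>>>>>     import org.apache.ignite.Ignition;
>>>>>>>>
>>>>>>>>     public class ResetBaseline {
>>>>>>>>         public static void main(String[] args) {
>>>>>>>>             // Join the cluster (placeholder client configuration file).
>>>>>>>>             Ignite ignite = Ignition.start("client-config.xml");
>>>>>>>>
>>>>>>>>             // Make the baseline match the server nodes that are alive
>>>>>>>>             // right now; data is then rebalanced to the new baseline.
>>>>>>>>             ignite.cluster().setBaselineTopology(
>>>>>>>>                 ignite.cluster().topologyVersion());
>>>>>>>>         }
>>>>>>>>     }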
>>>>>>>>
>>>>>>>> A couple more follow-up questions:
>>>>>>>>
>>>>>>>> 1) Is PME triggered when client nodes join as well? We are using the
>>>>>>>> Spark client, so new client nodes are created/destroyed all the time.
>>>>>>>> 2) It sounds to me like there is a potential for the cluster to get
>>>>>>>> into a deadlock if
>>>>>>>>    a) a single PME message is lost (PME never finishes, there are no
>>>>>>>> retries, and all future operations are blocked on the pending PME)
>>>>>>>>    b) one of the nodes has a long-running/stuck pending operation
>>>>>>>> 3) Under what circumstances can PME fail while DiscoverySpi fails to
>>>>>>>> detect the node being down? We are using the Zookeeper SPI, so I
>>>>>>>> would expect the split-brain resolver to shut down the node.
>>>>>>>> 4) Why is PME needed? Doesn't the coordinator know the latest
>>>>>>>> topology/partition map of the cluster through regular gossip?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Eugene
>>>>>>>>
>>>>>>>> On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Eugene,
>>>>>>>>>
>>>>>>>>> 1) PME happens when the topology is modified (TopologyVersion is
>>>>>>>>> incremented). The most common events that trigger it are: node
>>>>>>>>> start/stop/fail, cluster activation/deactivation, and dynamic cache
>>>>>>>>> start/stop.
>>>>>>>>> 2) It is done by a separate ExchangeWorker. Events that trigger PME
>>>>>>>>> are transferred using DiscoverySpi instead of CommunicationSpi.
>>>>>>>>> 3) All nodes wait for all pending cache operations to finish and
>>>>>>>>> then send their local partition maps to the coordinator (the oldest
>>>>>>>>> node). The coordinator then calculates the new global partition maps
>>>>>>>>> and sends them to every node.
>>>>>>>>> 4) All cache operations.
>>>>>>>>> 5) Exchange is never retried. The Ignite community is currently
>>>>>>>>> working on PME failure handling that should kick all problematic
>>>>>>>>> nodes out after a timeout is reached (see
>>>>>>>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
>>>>>>>>> for details), but it isn't done yet.
>>>>>>>>> 6) You shouldn't consider a PME failure as an error by itself, but
>>>>>>>>> rather as a result of some other error. The most common reason for a
>>>>>>>>> PME hang-up is a pending cache operation that couldn't finish. Check
>>>>>>>>> your logs - they should list pending transactions and atomic
>>>>>>>>> updates. Search for the "Found long running" substring (see the
>>>>>>>>> sketch below for one way to bound such transactions).
>>>>>>>>>
>>>>>>>>> Hope this helps.
>>>>>>>>>
>>>>>>>>> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> Our cluster occasionally fails with "partition map exchange
>>>>>>>>>> failure" errors. I have searched around, and it seems that a lot of
>>>>>>>>>> people have had similar issues in the past. My high-level
>>>>>>>>>> understanding is that when one of the nodes fails (out of memory,
>>>>>>>>>> exception, GC, etc.), the nodes fail to exchange partition maps.
>>>>>>>>>> However, I have a few questions:
>>>>>>>>>> 1) When does partition map exchange happen? Periodically, when a
>>>>>>>>>> node joins, etc.?
>>>>>>>>>> 2) Is it done in the same thread as the communication SPI, or in a
>>>>>>>>>> separate worker?
>>>>>>>>>> 3) How does the exchange happen? Via a coordinator, peer to peer,
>>>>>>>>>> etc.?
>>>>>>>>>> 4) What does the exchange block?
>>>>>>>>>> 5) When is the exchange retried?
>>>>>>>>>> 6) How do we resolve the error? The only thing I have seen online
>>>>>>>>>> is to decrease failureDetectionTimeout.
>>>>>>>>>>
>>>>>>>>>> Our settings are
>>>>>>>>>> - Zookeeper SPI
>>>>>>>>>> - Persistence enabled
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Eugene
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best regards,
>>>>>>>>> Ilya
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Ilya
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Ilya
>>>>>
>>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Ilya
>>>
>>
