Hi Pavel,

The issue we are discussing is PME failing because one node cannot communicate with another node; that's what IEP-25 is trying to solve. But in that case (where one node is either down, or there is a communication problem between two nodes) I would expect the split-brain resolver to kick in and shut down one of the nodes. I would also expect the dead node to be removed from the cluster and to no longer take part in PME.
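For reference, the behaviour I would expect is roughly what the configuration below aims at: a node that ends up on the losing side of a split (or gets segmented from the rest of the cluster) stops itself instead of hanging PME. This is only a sketch of my understanding - the connection string, root path and timeouts are placeholders, and I'm assuming SegmentationPolicy.STOP is the relevant knob here:

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.plugin.segmentation.SegmentationPolicy;
    import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

    public class ZkDiscoveryStartup {
        public static void main(String[] args) {
            // ZooKeeper-based discovery; requires the ignite-zookeeper module.
            // Connection string, root path and timeouts are placeholders.
            ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();
            zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181");
            zkSpi.setZkRootPath("/ignite");
            zkSpi.setSessionTimeout(30_000);
            zkSpi.setJoinTimeout(10_000);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setDiscoverySpi(zkSpi);

            // If this node is segmented from the rest of the cluster,
            // stop it so it no longer takes part in PME.
            cfg.setSegmentationPolicy(SegmentationPolicy.STOP);

            Ignition.start(cfg);
        }
    }

Is that the right expectation for ZookeeperDiscoverySpi, or does the split-brain handling work differently?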
On Wed, Sep 12, 2018 at 11:25 AM Pavel Kovalenko <[email protected]> wrote:

> Hi Eugene,
>
> Sorry, but I didn't catch the meaning of your question about Zookeeper
> Discovery. Could you please re-phrase it?
>
> On Wed, Sep 12, 2018 at 17:54, Ilya Lantukh <[email protected]> wrote:
>
>> Pavel K., can you please answer about Zookeeper discovery?
>>
>> On Wed, Sep 12, 2018 at 5:49 PM, eugene miretsky <[email protected]> wrote:
>>
>>> Thanks for the patience with my questions - just trying to understand
>>> the system better.
>>>
>>> 3) I was referring to
>>> https://apacheignite.readme.io/docs/zookeeper-discovery#section-failures-and-split-brain-handling.
>>> How come it doesn't get the node to shut down?
>>> 4) Are there any docs/JIRAs that explain how counters are used, and why
>>> they are required in the state?
>>>
>>> Cheers,
>>> Eugene
>>>
>>> On Wed, Sep 12, 2018 at 10:04 AM Ilya Lantukh <[email protected]> wrote:
>>>
>>>> 3) Such mechanics will be implemented in IEP-25 (linked above).
>>>> 4) Partition map states include update counters, which are incremented
>>>> on every cache update and play an important role in new state calculation.
>>>> So, technically, every cache operation can lead to a partition map change,
>>>> and for obvious reasons we can't route them all through the coordinator.
>>>> Ignite is a more complex system than Akka or Kafka, and such simple
>>>> solutions won't work here (in the general case). However, it is true that
>>>> PME could be simplified or completely avoided in certain cases, and the
>>>> community is currently working on such optimizations
>>>> (https://issues.apache.org/jira/browse/IGNITE-9558 for example).
>>>>
>>>> On Wed, Sep 12, 2018 at 9:08 AM, eugene miretsky <[email protected]> wrote:
>>>>
>>>>> 2b) I had a few situations where the cluster went into a state where
>>>>> PME constantly failed and could never recover. I think the root cause was
>>>>> that a transaction got stuck and didn't time out/roll back. I will try to
>>>>> reproduce it again and get back to you.
>>>>> 3) If a node is down, I would expect it to be detected and removed from
>>>>> the cluster. In that case, PME should not even be attempted with that
>>>>> node. Hence you would expect PME to fail very rarely (any faulty node
>>>>> will be removed before it has a chance to fail PME).
>>>>> 4) Don't all partition map changes go through the coordinator? I believe
>>>>> a lot of distributed systems work this way (all decisions are made by the
>>>>> coordinator/leader) - in Akka the leader is responsible for making all
>>>>> cluster membership changes, and in Kafka the controller does the leader
>>>>> election.
>>>>>
>>>>> On Tue, Sep 11, 2018 at 11:11 AM Ilya Lantukh <[email protected]> wrote:
>>>>>
>>>>>> 1) It is.
>>>>>> 2a) Ignite has retry mechanics for all messages, including PME-related
>>>>>> ones.
>>>>>> 2b) In this situation PME will hang, but it isn't a "deadlock".
>>>>>> 3) Sorry, I didn't understand your question. If a node is down but
>>>>>> DiscoverySpi doesn't detect it, it isn't a PME-related problem.
>>>>>> 4) How can you ensure that partition maps on the coordinator are *latest*
>>>>>> without "freezing" cluster state for some time?
>>>>>>
>>>>>> On Sat, Sep 8, 2018 at 3:21 AM, eugene miretsky <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> We are using persistence, so I am not sure if shutting down nodes
>>>>>>> will be the desired outcome for us, since we would need to modify the
>>>>>>> baseline topology.
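A note on the stuck-transaction scenario in 2b) above: one mitigation we are considering is capping how long transactions can run, so that a hung transaction gets rolled back instead of blocking the exchange indefinitely. A rough sketch of what I mean, assuming a version where txTimeoutOnPartitionMapExchange is available; the timeout values are placeholders, not a recommendation:

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.configuration.TransactionConfiguration;

    public class TxTimeoutConfig {
        public static void main(String[] args) {
            TransactionConfiguration txCfg = new TransactionConfiguration();

            // Upper bound for any transaction (placeholder value).
            txCfg.setDefaultTxTimeout(30_000);

            // Roll back transactions that are still pending when a partition
            // map exchange starts, instead of letting them block the exchange
            // (placeholder value).
            txCfg.setTxTimeoutOnPartitionMapExchange(20_000);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setTransactionConfiguration(txCfg);

            Ignition.start(cfg);
        }
    }
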
>>>>>>> A couple more follow-up questions.
>>>>>>>
>>>>>>> 1) Is PME triggered when client nodes join as well? We are using the
>>>>>>> Spark client, so new nodes are created and destroyed all the time.
>>>>>>> 2) It sounds to me like there is a potential for the cluster to get
>>>>>>> into a deadlock if
>>>>>>> a) a single PME message is lost (PME never finishes, there are no
>>>>>>> retries, and all future operations are blocked on the pending PME), or
>>>>>>> b) one of the nodes has a long-running/stuck pending operation.
>>>>>>> 3) Under what circumstances can PME fail while DiscoverySpi fails to
>>>>>>> detect the node being down? We are using ZookeeperSpi, so I would
>>>>>>> expect the split-brain resolver to shut down the node.
>>>>>>> 4) Why is PME needed? Doesn't the coordinator know the latest
>>>>>>> topology/partition map of the cluster through regular gossip?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eugene
>>>>>>>
>>>>>>> On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Eugene,
>>>>>>>>
>>>>>>>> 1) PME happens when the topology is modified (TopologyVersion is
>>>>>>>> incremented). The most common events that trigger it are: node
>>>>>>>> start/stop/fail, cluster activation/deactivation, and dynamic cache
>>>>>>>> start/stop.
>>>>>>>> 2) It is done by a separate ExchangeWorker. Events that trigger PME
>>>>>>>> are transferred using DiscoverySpi instead of CommunicationSpi.
>>>>>>>> 3) All nodes wait for all pending cache operations to finish and
>>>>>>>> then send their local partition maps to the coordinator (the oldest
>>>>>>>> node). The coordinator then calculates the new global partition maps
>>>>>>>> and sends them to every node.
>>>>>>>> 4) All cache operations.
>>>>>>>> 5) Exchange is never retried. The Ignite community is currently
>>>>>>>> working on PME failure handling that should kick out all problematic
>>>>>>>> nodes after a timeout is reached (see
>>>>>>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
>>>>>>>> for details), but it isn't done yet.
>>>>>>>> 6) You shouldn't consider a PME failure an error by itself, but
>>>>>>>> rather the result of some other error. The most common reason for a
>>>>>>>> PME hang-up is a pending cache operation that couldn't finish. Check
>>>>>>>> your logs - they should list pending transactions and atomic updates.
>>>>>>>> Search for the "Found long running" substring.
>>>>>>>>
>>>>>>>> Hope this helps.
>>>>>>>>
>>>>>>>> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Our cluster occasionally fails with "partition map exchange
>>>>>>>>> failure" errors. I have searched around, and it seems that a lot of
>>>>>>>>> people have had a similar issue in the past. My high-level
>>>>>>>>> understanding is that when one of the nodes fails (out of memory,
>>>>>>>>> exception, GC, etc.) the nodes fail to exchange partition maps.
>>>>>>>>> However, I have a few questions:
>>>>>>>>> 1) When does partition map exchange happen? Periodically, when a
>>>>>>>>> node joins, etc.?
>>>>>>>>> 2) Is it done in the same thread as the communication SPI, or is it
>>>>>>>>> a separate worker?
>>>>>>>>> 3) How does the exchange happen? Via a coordinator, peer to peer,
>>>>>>>>> etc.?
>>>>>>>>> 4) What does the exchange block?
>>>>>>>>> 5) When is the exchange retried?
>>>>>>>>> 6) How to resolve the error?
>>>>>>>>> The only thing I have seen online is to decrease
>>>>>>>>> failureDetectionTimeout.
>>>>>>>>>
>>>>>>>>> Our settings are:
>>>>>>>>> - Zookeeper SPI
>>>>>>>>> - Persistence enabled
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Eugene
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Ilya
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Ilya
>>>>
>>>> --
>>>> Best regards,
>>>> Ilya
>>
>> --
>> Best regards,
>> Ilya
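
P.S. To make it easier to correlate an exchange with whatever triggered it, one thing we may add on our side is a listener that logs the trigger events Ilya lists in 1) above (node start/stop/fail, cache start/stop). A rough sketch; these event types are disabled by default and have to be enabled explicitly, and the listener below is illustrative only:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.events.Event;
    import org.apache.ignite.events.EventType;
    import org.apache.ignite.lang.IgnitePredicate;

    public class ExchangeTriggerLogger {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // These event types are disabled by default and must be enabled explicitly.
            cfg.setIncludeEventTypes(
                EventType.EVT_NODE_JOINED,
                EventType.EVT_NODE_LEFT,
                EventType.EVT_NODE_FAILED,
                EventType.EVT_CACHE_STARTED,
                EventType.EVT_CACHE_STOPPED);

            Ignite ignite = Ignition.start(cfg);

            // Print each trigger event so it can be lined up with
            // "partition map exchange" messages in the Ignite log.
            IgnitePredicate<Event> lsnr = evt -> {
                System.out.println("Possible PME trigger: " + evt.name() + " - " + evt.message());
                return true; // keep listening
            };

            ignite.events().localListen(lsnr,
                EventType.EVT_NODE_JOINED,
                EventType.EVT_NODE_LEFT,
                EventType.EVT_NODE_FAILED,
                EventType.EVT_CACHE_STARTED,
                EventType.EVT_CACHE_STOPPED);
        }
    }

The idea is just to line these timestamps up with the "partition map exchange" and "Found long running" messages in the Ignite logs.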
