Hi Eugene,

Sorry, but I didn't catch the meaning of your question about Zookeeper Discovery. Could you please rephrase it?
Wed, Sep 12, 2018 at 17:54, Ilya Lantukh <[email protected]>:

> Pavel K., can you please answer about Zookeeper discovery?
>
> On Wed, Sep 12, 2018 at 5:49 PM, eugene miretsky <[email protected]> wrote:
>
>> Thanks for the patience with my questions - just trying to understand
>> the system better.
>>
>> 3) I was referring to
>> https://apacheignite.readme.io/docs/zookeeper-discovery#section-failures-and-split-brain-handling.
>> How come it doesn't get the node to shut down?
>> 4) Are there any docs/JIRAs that explain how counters are used, and why
>> they are required in the state?
>>
>> Cheers,
>> Eugene
>>
>> On Wed, Sep 12, 2018 at 10:04 AM Ilya Lantukh <[email protected]> wrote:
>>
>>> 3) Such mechanics will be implemented in IEP-25 (linked above).
>>> 4) Partition map states include update counters, which are incremented
>>> on every cache update and play an important role in calculating the new
>>> state. So, technically, every cache operation can lead to a partition
>>> map change, and for obvious reasons we can't route them all through the
>>> coordinator. Ignite is a more complex system than Akka or Kafka, and
>>> such simple solutions won't work here (in the general case). However,
>>> it is true that PME could be simplified or avoided entirely in certain
>>> cases, and the community is currently working on such optimizations
>>> (https://issues.apache.org/jira/browse/IGNITE-9558, for example).
>>>
>>> On Wed, Sep 12, 2018 at 9:08 AM, eugene miretsky <[email protected]> wrote:
>>>
>>>> 2b) I had a few situations where the cluster went into a state where
>>>> PME constantly failed and could never recover. I think the root cause
>>>> was that a transaction got stuck and didn't time out/roll back. I will
>>>> try to reproduce it again and get back to you.
>>>> 3) If a node is down, I would expect that to be detected and the node
>>>> to be removed from the cluster. In such a case, PME should not even be
>>>> attempted with that node. Hence you would expect PME to fail very
>>>> rarely (any faulty node will be removed before it has a chance to fail
>>>> PME).
>>>> 4) Don't all partition map changes go through the coordinator? I
>>>> believe a lot of distributed systems work this way (all decisions are
>>>> made by the coordinator/leader) - in Akka the leader is responsible
>>>> for making all cluster membership changes, and in Kafka the controller
>>>> does the leader election.
>>>>
>>>> On Tue, Sep 11, 2018 at 11:11 AM Ilya Lantukh <[email protected]> wrote:
>>>>
>>>>> 1) It is.
>>>>> 2a) Ignite has retry mechanics for all messages, including
>>>>> PME-related ones.
>>>>> 2b) In this situation PME will hang, but it isn't a "deadlock".
>>>>> 3) Sorry, I didn't understand your question. If a node is down but
>>>>> DiscoverySpi doesn't detect it, that isn't a PME-related problem.
>>>>> 4) How can you ensure that the partition maps on the coordinator are
>>>>> the *latest* without "freezing" the cluster state for some time?
>>>>>
>>>>> On Sat, Sep 8, 2018 at 3:21 AM, eugene miretsky <[email protected]> wrote:
>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> We are using persistence, so I am not sure shutting down nodes would
>>>>>> be the desired outcome for us, since we would then need to modify
>>>>>> the baseline topology.
>>>>>>
>>>>>> A couple more follow-up questions:
>>>>>>
>>>>>> 1) Is PME triggered when client nodes join as well? We are using the
>>>>>> Spark client, so new nodes are created and destroyed all the time.
>>>>>> 2) It sounds to me like there is a potential for the cluster to get
>>>>>> into a deadlock if
>>>>>> a) a single PME message is lost (PME never finishes, there are no
>>>>>> retries, and all future operations are blocked on the pending PME), or
>>>>>> b) one of the nodes has a long-running/stuck pending operation.
>>>>>> 3) Under what circumstances can PME fail while DiscoverySpi fails to
>>>>>> detect the node being down? We are using ZookeeperSpi, so I would
>>>>>> expect the split-brain resolver to shut down the node.
>>>>>> 4) Why is PME needed at all? Doesn't the coordinator already know
>>>>>> the latest topology/partition map of the cluster through regular
>>>>>> gossip?
>>>>>>
>>>>>> Cheers,
>>>>>> Eugene
>>>>>>
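Regarding the baseline topology point above: with native persistence, a
server node that leaves the cluster stays in the baseline until the
baseline is changed explicitly. Below is a minimal sketch of how that
change could be done programmatically - the config path and class name are
placeholders, and it assumes the cluster is already active:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    public class ResetBaseline {
        public static void main(String[] args) {
            // Start (or connect as) a server node; the config path is a placeholder.
            Ignite ignite = Ignition.start("config/ignite-server.xml");

            // Reset the baseline to the current topology version, i.e. to exactly
            // the server nodes that are online right now.
            ignite.cluster().setBaselineTopology(ignite.cluster().topologyVersion());
        }
    }

Note that client nodes (such as the Spark ones) never become part of the
baseline, so their joining and leaving does not require this step.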
>>>>>> On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Eugene,
>>>>>>>
>>>>>>> 1) PME happens when the topology is modified (TopologyVersion is
>>>>>>> incremented). The most common events that trigger it are: node
>>>>>>> start/stop/fail, cluster activation/deactivation, and dynamic cache
>>>>>>> start/stop.
>>>>>>> 2) It is done by a separate ExchangeWorker. Events that trigger PME
>>>>>>> are transferred using DiscoverySpi instead of CommunicationSpi.
>>>>>>> 3) All nodes wait for all pending cache operations to finish and
>>>>>>> then send their local partition maps to the coordinator (the oldest
>>>>>>> node). The coordinator then calculates the new global partition
>>>>>>> maps and sends them to every node.
>>>>>>> 4) All cache operations.
>>>>>>> 5) Exchange is never retried. The Ignite community is currently
>>>>>>> working on PME failure handling that should kick out all problematic
>>>>>>> nodes after a timeout is reached (see
>>>>>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
>>>>>>> for details), but it isn't done yet.
>>>>>>> 6) You shouldn't consider a PME failure an error by itself, but
>>>>>>> rather the result of some other error. The most common reason for a
>>>>>>> PME hang-up is a pending cache operation that couldn't finish. Check
>>>>>>> your logs - they should list pending transactions and atomic
>>>>>>> updates. Search for the "Found long running" substring.
>>>>>>>
>>>>>>> Hope this helps.
>>>>>>>
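On the stuck-transaction scenario from 2b and 6 above: transaction timeouts
can bound how long such an operation may hold up an exchange. A minimal
sketch, assuming programmatic configuration - the timeout values are
illustrative, and setTxTimeoutOnPartitionMapExchange is, as far as I know,
only available from Ignite 2.5 onwards:

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.configuration.TransactionConfiguration;

    public class TxTimeoutConfig {
        public static void main(String[] args) {
            TransactionConfiguration txCfg = new TransactionConfiguration();
            // Time out (and roll back) transactions that run longer than 30 seconds.
            txCfg.setDefaultTxTimeout(30_000);
            // Roll back transactions that are still holding locks when an exchange
            // starts, instead of letting them block PME indefinitely.
            txCfg.setTxTimeoutOnPartitionMapExchange(20_000);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setTransactionConfiguration(txCfg);

            Ignition.start(cfg);
        }
    }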
>>>>>>> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Our cluster occasionally fails with "partition map exchange
>>>>>>>> failure" errors. I have searched around, and it seems that a lot
>>>>>>>> of people have had a similar issue in the past. My high-level
>>>>>>>> understanding is that when one of the nodes fails (out of memory,
>>>>>>>> exception, GC, etc.), nodes fail to exchange partition maps.
>>>>>>>> However, I have a few questions:
>>>>>>>> 1) When does partition map exchange happen? Periodically, when a
>>>>>>>> node joins, etc.?
>>>>>>>> 2) Is it done in the same thread as the communication SPI, or in
>>>>>>>> a separate worker?
>>>>>>>> 3) How does the exchange happen? Via a coordinator, peer to peer,
>>>>>>>> etc.?
>>>>>>>> 4) What does the exchange block?
>>>>>>>> 5) When is the exchange retried?
>>>>>>>> 6) How to resolve the error? The only thing I have seen online is
>>>>>>>> to decrease failureDetectionTimeout.
>>>>>>>>
>>>>>>>> Our settings are:
>>>>>>>> - Zookeeper SPI
>>>>>>>> - Persistence enabled
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Eugene
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Ilya
>>>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Ilya
>>>>>
>>>
>>> --
>>> Best regards,
>>> Ilya
>>>
>
> --
> Best regards,
> Ilya
>
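For reference, a minimal sketch of the kind of configuration the
failureDetectionTimeout suggestion refers to, assuming the ignite-zookeeper
module is on the classpath; the connection string, timeout values and class
name are placeholders, not recommendations:

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

    public class ZkDiscoveryConfig {
        public static void main(String[] args) {
            ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();
            // Placeholder ZooKeeper ensemble address.
            zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181");
            // ZooKeeper session timeout; an unresponsive node's session
            // expires after this long.
            zkSpi.setSessionTimeout(30_000);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setDiscoverySpi(zkSpi);
            // Time after which an unresponsive node is considered failed.
            cfg.setFailureDetectionTimeout(10_000);

            Ignition.start(cfg);
        }
    }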
