Thanks for the patience with my questions - just trying to understand the system better.

3) I was referring to
https://apacheignite.readme.io/docs/zookeeper-discovery#section-failures-and-split-brain-handling.
How come it doesn't get the node to shut down?
4) Are there any docs/JIRAs that explain how the update counters are used, and
why they are required in the partition map state?
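
For reference, our discovery setup looks roughly like this (a trimmed-down
sketch, not our exact production config - the hosts and timeout values below
are placeholders):

    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

    // ZooKeeper-based discovery (placeholder connection string).
    ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();
    zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181");

    IgniteConfiguration cfg = new IgniteConfiguration();
    cfg.setDiscoverySpi(zkSpi);

    // Nodes that don't respond within this window are treated as failed
    // (placeholder value).
    cfg.setFailureDetectionTimeout(10_000);

    // Native persistence is enabled for the default data region.
    DataStorageConfiguration dsCfg = new DataStorageConfiguration();
    dsCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
    cfg.setDataStorageConfiguration(dsCfg);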

Cheers,
Eugene

On Wed, Sep 12, 2018 at 10:04 AM Ilya Lantukh <[email protected]> wrote:

> 3) Such mechanics will be implemented in IEP-25 (linked above).
> 4) Partition map states include update counters, which are incremented on
> every cache update and play an important role in new state calculation. So,
> technically, every cache operation can lead to a partition map change, and
> for obvious reasons we can't route them all through the coordinator. Ignite
> is a more complex system than Akka or Kafka, and such simple solutions
> won't work here (in the general case). However, it is true that PME could
> be simplified or completely avoided in certain cases, and the community is
> currently working on such optimizations
> (https://issues.apache.org/jira/browse/IGNITE-9558, for example).
>
> On Wed, Sep 12, 2018 at 9:08 AM, eugene miretsky <[email protected]> wrote:
>
>> 2b) I had a few situations where the cluster went into a state where PME
>> constantly failed and could never recover. I think the root cause was
>> that a transaction got stuck and didn't time out/roll back. I will try
>> to reproduce it again and get back to you.
>> 3) If a node is down, I would expect it to be detected and removed from
>> the cluster. In that case, PME should not even be attempted with that
>> node, so PME should fail very rarely (any faulty node would be removed
>> before it has a chance to fail PME).
>> 4) Don't all partition map changes go through the coordinator? I believe
>> a lot of distributed systems work this way (all decisions are made by the
>> coordinator/leader) - in Akka the leader is responsible for all cluster
>> membership changes, and in Kafka the controller does the leader election.
>>
>> On Tue, Sep 11, 2018 at 11:11 AM Ilya Lantukh <[email protected]> wrote:
>>
>>> 1) It is.
>>> 2a) Ignite has retry mechanics for all messages, including PME-related
>>> ones.
>>> 2b) In this situation PME will hang, but it isn't a "deadlock".
>>> 3) Sorry, I didn't understand your question. If a node is down but
>>> DiscoverySpi doesn't detect it, that isn't a PME-related problem.
>>> 4) How can you ensure that the partition maps on the coordinator are
>>> the *latest* ones without "freezing" the cluster state for some time?
>>>
>>> On Sat, Sep 8, 2018 at 3:21 AM, eugene miretsky <[email protected]> wrote:
>>>
>>>> Thanks!
>>>>
>>>> We are using persistence, so I am not sure shutting down nodes would
>>>> be the desired outcome for us, since we would need to modify the
>>>> baseline topology.
>>>>
>>>> A couple more follow-up questions:
>>>>
>>>> 1) Is PME triggered when client nodes join as well? We are using the
>>>> Spark client, so new nodes are created/destroyed all the time.
>>>> 2) It sounds to me like there is a potential for the cluster to get
>>>> into a deadlock if
>>>> a) a single PME message is lost (PME never finishes, there are no
>>>> retries, and all future operations are blocked on the pending PME), or
>>>> b) one of the nodes has a long-running/stuck pending operation.
>>>> 3) Under what circumstances can PME fail while DiscoverySpi fails to
>>>> detect that the node is down? We are using ZookeeperSpi, so I would
>>>> expect the split-brain resolver to shut the node down.
>>>> 4) Why is PME needed? Doesn't the coordinator know the latest
>>>> topology/partition map of the cluster through regular gossip?
>>>>
>>>> Cheers,
>>>> Eugene
>>>>
>>>> On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <[email protected]> wrote:
>>>>
>>>>> Hi Eugene,
>>>>>
>>>>> 1) PME happens when the topology is modified (TopologyVersion is
>>>>> incremented). The most common events that trigger it are: node
>>>>> start/stop/fail, cluster activation/deactivation, and dynamic cache
>>>>> start/stop.
>>>>> 2) It is done by a separate ExchangeWorker. Events that trigger PME
>>>>> are transferred using DiscoverySpi instead of CommunicationSpi.
>>>>> 3) All nodes wait for all pending cache operations to finish and then
>>>>> send their local partition maps to the coordinator (the oldest node).
>>>>> The coordinator then calculates the new global partition maps and
>>>>> sends them to every node.
>>>>> 4) All cache operations.
>>>>> 5) Exchange is never retried. The Ignite community is currently
>>>>> working on PME failure handling that should kick all problematic
>>>>> nodes out after a timeout is reached (see
>>>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
>>>>> for details), but it isn't done yet.
>>>>> 6) You shouldn't consider a PME failure an error by itself, but
>>>>> rather the result of some other error. The most common reason for a
>>>>> PME hang-up is a pending cache operation that couldn't finish. Check
>>>>> your logs - they should list pending transactions and atomic updates.
>>>>> Search for the "Found long running" substring.
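>>>>>
>>>>> If the root cause is a transaction that never finishes, you can also
>>>>> make Ignite roll such transactions back instead of letting them block
>>>>> PME indefinitely. A minimal sketch (the timeout values are
>>>>> placeholders, not recommendations):
>>>>>
>>>>>     import org.apache.ignite.configuration.IgniteConfiguration;
>>>>>     import org.apache.ignite.configuration.TransactionConfiguration;
>>>>>
>>>>>     TransactionConfiguration txCfg = new TransactionConfiguration();
>>>>>
>>>>>     // Roll back any transaction that runs longer than this (placeholder).
>>>>>     txCfg.setDefaultTxTimeout(30_000);
>>>>>
>>>>>     // Available since Ignite 2.5: forcibly roll back transactions that
>>>>>     // are blocking a pending partition map exchange (placeholder).
>>>>>     txCfg.setTxTimeoutOnPartitionMapExchange(20_000);
>>>>>
>>>>>     IgniteConfiguration cfg = new IgniteConfiguration();
>>>>>     cfg.setTransactionConfiguration(txCfg);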
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <[email protected]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Our cluster occasionally fails with "partition map exchange failure"
>>>>>> errors. I have searched around, and it seems that a lot of people
>>>>>> have had a similar issue in the past. My high-level understanding is
>>>>>> that when one of the nodes fails (out of memory, exception, GC,
>>>>>> etc.), nodes fail to exchange partition maps. However, I have a few
>>>>>> questions:
>>>>>> 1) When does partition map exchange happen? Periodically, when a
>>>>>> node joins, etc.?
>>>>>> 2) Is it done in the same thread as the communication SPI, or in a
>>>>>> separate worker?
>>>>>> 3) How does the exchange happen? Via a coordinator, peer to peer,
>>>>>> etc.?
>>>>>> 4) What does the exchange block?
>>>>>> 5) When is the exchange retried?
>>>>>> 6) How do we resolve the error? The only thing I have seen online is
>>>>>> to decrease failureDetectionTimeout.
>>>>>>
>>>>>> Our settings are:
>>>>>> - Zookeeper SPI
>>>>>> - Persistence enabled
>>>>>>
>>>>>> Cheers,
>>>>>> Eugene
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Ilya
>>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Ilya
>>
>
>
> --
> Best regards,
> Ilya
