Thanks!

We are using persistence, so I am not sure shutting down nodes is the
desired outcome for us, since we would need to modify the baseline
topology.
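
For reference, this is roughly the manual step we'd have to script after
permanently removing a node (a minimal sketch in Java; the config path is
hypothetical, and it assumes the removed node has already left the topology):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    public class BaselineAdjust {
        public static void main(String[] args) {
            // Connect with a client config (hypothetical path).
            Ignition.setClientMode(true);
            try (Ignite ignite = Ignition.start("ignite-client.xml")) {
                // With persistence, partitions stay assigned to the removed
                // node until the baseline is reset to the current topology.
                long topVer = ignite.cluster().topologyVersion();
                ignite.cluster().setBaselineTopology(topVer);
            }
        }
    }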

A couple more follow-up questions:

1) Is PME triggered when client nodes join as well? We are using the Spark
client, so new client nodes are created/destroyed every time (see the
sketch after this list).
2) It sounds to me like there is a potential for the cluster to get into a
deadlock if
   a) a single PME message is lost (PME never finishes, there are no
retries, and all future operations are blocked on the pending PME)
   b) one of the nodes has a long-running/stuck pending operation
3) Under what circumstances can PME fail while DiscoverySpi fails to detect
the node being down? We are using ZookeeperSpi, so I would expect the
split-brain resolver to shut down the node.
4) Why is PME needed? Doesn't the coordinator already know the latest
topology/partition map of the cluster through regular gossip?
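
To make question 1 concrete: each Spark executor effectively does something
like the following (a sketch; the config path is hypothetical), so every job
start/stop adds a client node to the topology and then removes it again.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    public class SparkStyleClient {
        public static void main(String[] args) {
            // Start an Ignite client node, the way the Spark integration
            // does inside each executor.
            Ignition.setClientMode(true);
            try (Ignite client = Ignition.start("ignite-client.xml")) {
                // ... run cache/SQL operations for the job ...
                System.out.println("Joined topology version: "
                    + client.cluster().topologyVersion());
            } // closing the node removes it from the topology again
        }
    }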

Cheers,
Eugene

On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <[email protected]> wrote:

> Hi Eugene,
>
> 1) PME happens when topology is modified (TopologyVersion is incremented).
> The most common events that trigger it are: node start/stop/fail, cluster
> activation/deactivation, dynamic cache start/stop.
> 2) It is done by a separate ExchangeWorker. Events that trigger PME are
> transferred using DiscoverySpi instead of CommunicationSpi.
> 3) All nodes wait for all pending cache operations to finish and then send
> their local partition maps to the coordinator (oldest node). Then
> coordinator calculates new global partition maps and sends them to every
> node.
> 4) All cache operations.
> 5) Exchange is never retried. The Ignite community is currently working on
> PME failure handling that should kick out problematic nodes after a timeout
> is reached (see
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
> for details), but it isn't done yet.
> 6) You shouldn't consider a PME failure as an error by itself, but rather
> as a result of some other error. The most common reason for a PME hang-up
> is a pending cache operation that couldn't finish. Check your logs - they
> should list pending transactions and atomic updates. Search for the "Found
> long running" substring.
>
> Hope this helps.
>
> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <
> [email protected]> wrote:
>
>> Hello,
>>
>> Our cluster occasionally fails with "partition map exchange failure"
>> errors. I have searched around, and it seems that a lot of people have had
>> a similar issue in the past. My high-level understanding is that when one
>> of the nodes fails (out of memory, exception, GC, etc.), nodes fail to
>> exchange partition maps. However, I have a few questions:
>> 1) When does partition map exchange happen? Periodically, when a node
>> joins, etc.
>> 2) Is it done in the same thread as communication SPI, or is it a
>> separate worker?
>> 3) How does the exchange happen? Via a coordinator, peer to peer, etc?
>> 4) What does the exchange block?
>> 5) When is the exchange retried?
>> 6) How to resolve the error? The only thing I have seen online is to
>> decrease failureDetectionTimeout.
>>
>> Our settings are
>> - Zookeeper SPI
>> - Persistence enabled
>>
>> Cheers,
>> Eugene
>>
>
>
>
> --
> Best regards,
> Ilya
>
