There is a typo here:

> Lost partitions are expected behaviour in case of partition because you
> have only 1 backup and lost two nodes.

I meant that lost partitions are expected behaviour in the case of
partitioned caches when the number of offline nodes is greater than the
number of backups. In your case there is 1 backup and there are 2 offline
nodes.

On Tue, 22 Nov 2022 at 21:56, Ilya Shishkov <[email protected]> wrote:

> Hi,
>
> 1) What can I do to recover from the lost partitions problem after
> shutting down several nodes?
>
> I thought that in the case of a graceful shutdown this problem must be
> solved.
>
> Now I can recover by returning *one* of the offline nodes to the cluster
> (starting the service) and running the *reset_lost_partitions* command
> for the broken cache. After this the cache becomes available.
>
> Are the caches with lost partitions replicated or partitioned? Lost
> partitions are expected behaviour in case of partition because you have
> only 1 backup and lost two nodes. If you want the cluster data to remain
> fully available when 2 nodes are offline, you should set 2 backups for
> partitioned caches, for example as in the sketch below.
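>
> A minimal sketch of such a configuration via the Java API (the cache
> name and value types here are hypothetical):
>
>     import org.apache.ignite.cache.CacheMode;
>     import org.apache.ignite.cache.PartitionLossPolicy;
>     import org.apache.ignite.configuration.CacheConfiguration;
>
>     // Two backups per partition let the cache survive the simultaneous
>     // loss of two baseline nodes.
>     CacheConfiguration<Integer, String> ccfg =
>         new CacheConfiguration<>("myCache");
>     ccfg.setCacheMode(CacheMode.PARTITIONED);
>     ccfg.setBackups(2);
>     // Fail reads and writes to lost partitions instead of silently
>     // serving incomplete results.
>     ccfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);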
>
> As for graceful shutdown: why do you expect that data would not be lost?
> If you have 1 backup and 1 offline node, then there are some partitions
> without backups, because those copies remain inaccessible while their
> owner node is offline. So, if you shut down another node with such
> partitions, they will be lost.
>
> So, for persistent clusters, if you are in a situation where you have to
> work for a long time without backups (i.e. with offline nodes, BUT
> without partition loss), you should trigger a rebalance. It can be done
> manually, or automatically by changing the baseline, as in the sketch
> below. After rebalancing, the number of data copies will be restored.
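>
> A minimal sketch of the manual variant via the Java API (it assumes a
> started server node referenced by the variable "ignite"):
>
>     import org.apache.ignite.IgniteCluster;
>
>     // Resetting the baseline to the current topology version drops the
>     // offline nodes from the baseline and triggers rebalancing, so the
>     // remaining nodes restore the missing backup copies.
>     IgniteCluster cluster = ignite.cluster();
>     cluster.setBaselineTopology(cluster.topologyVersion());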
>
> Now you should bring back at least one of the nodes in order to make the
> partitions available. But if you need a full set of primary and backup
> partitions, you need all baseline nodes in the cluster.
>
> 2) What can I do to prevent this problem in a scenario with automatic
> cluster deployment? Should I add the *reset_lost_partitions* command
> after activation or redeploy?
>
> I don't fully understand what you mean, but there are no problems with
> automatic deployments. In most cases a situation with partition loss
> tells you that the cluster is in an invalid state.
>
> On Tue, 22 Nov 2022 at 19:49, Айсина Роза Мунеровна <
> [email protected]> wrote:
>
>> Hi Sumit!
>>
>> Thanks for your reply!
>>
>> Yeah, I have used the reset_lost_partitions utility many times.
>>
>> The problem is that this function requires all baseline nodes to be
>> present. If I shut down a node, auto adjustment does not remove this
>> node from the baseline topology, and reset_lost_partitions ends with an
>> error saying that all partition owners have left the grid and partition
>> data has been lost.
>>
>> So I remove them manually, and this operation succeeds, but with loss
>> of data on the offline nodes.
>>
>> What I am trying to understand is why graceful shutdown does not handle
>> this situation in the case of backup caches and persistence.
>> How can we automatically bring up Ignite nodes if, after a redeploy,
>> data is lost because the cluster can't handle the lost partitions
>> problem?
>>
>> Best regards,
>> Rose.
>>
>> On 22 Nov 2022, at 5:44 PM, Sumit Deshinge <[email protected]>
>> wrote:
>>
>> Please check if this helps:
>> https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy#handling-partition-loss
>> Also, any reason baseline auto adjustment is disabled?
>>
>> On Tue, Nov 22, 2022 at 6:38 PM Айсина Роза Мунеровна <
>> [email protected]> wrote:
>>
>>> Hola again!
>>>
>>> I discovered that enabling graceful shutdown via
>>> *-DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true* does not work.
>>>
>>> In the service logs I see that nothing happens when *SIGTERM* comes :(
>>> Eventually the stop action timed out and *SIGKILL* was sent, which
>>> caused an ungraceful shutdown.
>>> The timeout is set to *10 minutes*.
>>>
>>> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Starting Apache Ignite
>>> In-Memory Computing Platform Service...
>>> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Started Apache Ignite
>>> In-Memory Computing Platform Service.
>>> Nov 22 12:29:25 yc-ignite-lab-02 systemd[1]: Stopping Apache Ignite
>>> In-Memory Computing Platform Service...
>>> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]:
>>> [email protected]: State 'stop-final-sigterm' timed out.
>>> Killing.
>>> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]:
>>> [email protected]: Killing process 11135 (java) with
>>> signal SIGKILL.
>>> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]:
>>> [email protected]: Failed with result 'timeout'.
>>> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: Stopped Apache Ignite
>>> In-Memory Computing Platform Service.
>>>
>>> I also enabled the *DEBUG* level and see that nothing happens after
>>> rebalancing starts (this is the end of the log):
>>>
>>> [2022-11-22T12:29:25,957][INFO ][shutdown-hook][G] Invoking shutdown
>>> hook...
>>> [2022-11-22T12:29:25,958][DEBUG][shutdown-hook][G] Shutdown is in
>>> progress (ignoring): Shutdown in progress
>>> [2022-11-22T12:29:25,959][INFO ][shutdown-hook][G] Ensuring that caches
>>> have sufficient backups and local rebalance completion...
>>>
>>> I forgot to add that the service is started with *service.sh*, not
>>> *ignite.sh*.
>>>
>>> Please help!
>>>
>>> On 22 Nov 2022, at 1:17 PM, Айсина Роза Мунеровна <
>>> [email protected]> wrote:
>>>
>>> Hola!
>>>
>>> I have a problem recovering from a cluster crash when persistence is
>>> enabled.
>>>
>>> Our setup is:
>>> - 5 VM nodes with 40 GB RAM and 200 GB disk,
>>> - persistence is enabled (on a separate disk on each VM),
>>> - all cluster actions are made through Ansible playbooks,
>>> - all caches are either partitioned with backups = 1 or replicated,
>>> - the cluster starts as a service by running ignite.sh,
>>> - baseline auto adjust is disabled.
>>>
>>> Also, following the docs about partition loss policy, I have added
>>> *-DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true* to *JVM_OPTS* to wait for
>>> partition rebalancing on shutdown.
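>>>
>>> (If I read the docs right, the programmatic equivalent of this system
>>> property since Ignite 2.9 is the node shutdown policy; a minimal
>>> sketch:)
>>>
>>>     import org.apache.ignite.ShutdownPolicy;
>>>     import org.apache.ignite.configuration.IgniteConfiguration;
>>>
>>>     // GRACEFUL blocks node stop until the partitions hosted on this
>>>     // node have enough copies on the remaining nodes.
>>>     IgniteConfiguration cfg = new IgniteConfiguration();
>>>     cfg.setShutdownPolicy(ShutdownPolicy.GRACEFUL);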
>>>
>>> The problem we have: after shutting down several nodes (2 of 5) one
>>> after another, an exception about lost partitions is raised.
>>>
>>> *Caused by:
>>> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
>>> Failed to execute query because cache partition has been lost. Part
>>> [cacheName=PUBLIC_StoreProductFeatures, part=512]*
>>>
>>> But in the logs of the dead nodes I see that all shutdown hooks were
>>> called as expected on both nodes:
>>>
>>> [2022-11-22T09:24:19,614][INFO ][shutdown-hook][G] Invoking shutdown
>>> hook...
>>> [2022-11-22T09:24:19,615][INFO ][shutdown-hook][G] Ensuring that caches
>>> have sufficient backups and local rebalance completion...
>>>
>>> And the baseline topology looks like this (with 2 offline nodes, as
>>> expected):
>>>
>>> Cluster state: active
>>> Current topology version: 23
>>> Baseline auto adjustment disabled: softTimeout=30000
>>>
>>> Current topology version: 23 (Coordinator:
>>> ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1,
>>> Order=3)
>>>
>>> Baseline nodes:
>>>     ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b,
>>> Address=172.17.0.1, State=ONLINE, Order=3
>>>     ConsistentId=4f67fccb-211b-4514-916b-a6286d1bb71b,
>>> Address=172.17.0.1, State=ONLINE, Order=21
>>>     ConsistentId=d980fa1c-e955-428a-bac9-d67dbfebb75e,
>>> Address=172.17.0.1, State=ONLINE, Order=5
>>>     ConsistentId=f151bd52-c173-45d7-952d-45cbe1d5fe97, State=OFFLINE
>>>     ConsistentId=f6862354-b175-4a0c-a94c-20253a944996, State=OFFLINE
>>> --------------------------------------------------------------------------------
>>> Number of baseline nodes: 5
>>>
>>> Other nodes not found.
>>>
>>> So my questions are:
>>>
>>> 1) What can I do to recover from the lost partitions problem after
>>> shutting down several nodes? I thought that in the case of a graceful
>>> shutdown this problem must be solved.
>>>
>>> Now I can recover by returning *one* of the offline nodes to the
>>> cluster (starting the service) and running the *reset_lost_partitions*
>>> command for the broken cache. After this the cache becomes available.
>>>
>>> 2) What can I do to prevent this problem in a scenario with automatic
>>> cluster deployment? Should I add the *reset_lost_partitions* command
>>> after activation or redeploy, e.g. as in the sketch below?
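>>>
>>> For example, something like this sketch via the Java API (assuming a
>>> node reference "ignite"; the cache name is taken from the exception
>>> above):
>>>
>>>     import java.util.Collections;
>>>
>>>     // Reset the lost-partition state of the broken cache once at
>>>     // least one former partition owner is back in the cluster.
>>>     ignite.resetLostPartitions(
>>>         Collections.singleton("PUBLIC_StoreProductFeatures"));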
>>>
>>> Please help.
>>> Thanks in advance!
>>>
>>> Best regards,
>>> Rose.
>>
>> --
>> Regards,
>> Sumit Deshinge