Hola again!

I discovered that enabling graceful shutdown via does not work.

In the service logs I see that nothing happens when SIGTERM arrives :(
Eventually the stop action times out and SIGKILL is sent, which causes an 
ungraceful shutdown.
The timeout is set to 10 minutes.

Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Starting Apache Ignite In-Memory 
Computing Platform Service...
Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Started Apache Ignite In-Memory 
Computing Platform Service.
Nov 22 12:29:25 yc-ignite-lab-02 systemd[1]: Stopping Apache Ignite In-Memory 
Computing Platform Service...
Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: [email protected]: 
State 'stop-final-sigterm' timed out. Killing.
Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: [email protected]: 
Killing process 11135 (java) with signal SIGKILL.
Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: [email protected]: 
Failed with result 'timeout'.
Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: Stopped Apache Ignite In-Memory 
Computing Platform Service.

I also enabled the DEBUG log level and see that nothing happens after 
rebalancing has started (this is the end of the log):

[2022-11-22T12:29:25,957][INFO ][shutdown-hook][G] Invoking shutdown hook...
[2022-11-22T12:29:25,958][DEBUG][shutdown-hook][G] Shutdown is in progress 
(ignoring): Shutdown in progress
[2022-11-22T12:29:25,959][INFO ][shutdown-hook][G] Ensuring that caches have 
sufficient backups and local rebalance completion...

I forgot to add that the service is started with service.sh, not ignite.sh.

Please help!

On 22 Nov 2022, at 1:17 PM, Айсина Роза Мунеровна <[email protected]> 
wrote:

Hola!
I have a problem recovering from a cluster crash when persistence is enabled.

Our setup is:
- 5 VM nodes with 40 GB RAM and a 200 GB disk each,
- persistence is enabled (on a separate disk on each VM),
- all cluster actions are performed through Ansible playbooks,
- all caches are either partitioned with backups = 1 or replicated,
- the cluster runs as a service that launches ignite.sh,
- baseline auto adjust is disabled.

Also, following the docs on the partition loss policy, I added 
-DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true to JVM_OPTS so that shutdown waits 
for partition rebalancing.

The problem we have: after shutting down several nodes (2 of 5) one after 
another, an exception about lost partitions is raised.

Caused by: 
org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed 
to execute query because cache partition has been lostPart 
[cacheName=PUBLIC_StoreProductFeatures, part=512]

But in the logs of the dead nodes I see that the shutdown hooks are called as 
expected on both nodes:

[2022-11-22T09:24:19,614][INFO ][shutdown-hook][G] Invoking shutdown hook...
[2022-11-22T09:24:19,615][INFO ][shutdown-hook][G] Ensuring that caches have 
sufficient backups and local rebalance completion...


And baseline topology looks like this (with 2 offline nodes as expected):

Cluster state: active
Current topology version: 23
Baseline auto adjustment disabled: softTimeout=30000

Current topology version: 23 (Coordinator: 
ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, Order=3)

Baseline nodes:
    ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, 
State=ONLINE, Order=3
    ConsistentId=4f67fccb-211b-4514-916b-a6286d1bb71b, Address=172.17.0.1, 
State=ONLINE, Order=21
    ConsistentId=d980fa1c-e955-428a-bac9-d67dbfebb75e, Address=172.17.0.1, 
State=ONLINE, Order=5
    ConsistentId=f151bd52-c173-45d7-952d-45cbe1d5fe97, State=OFFLINE
    ConsistentId=f6862354-b175-4a0c-a94c-20253a944996, State=OFFLINE
--------------------------------------------------------------------------------
Number of baseline nodes: 5

Other nodes not found.
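
For automation, the OFFLINE baseline nodes can be pulled out of this output with a few lines of Python. This is only a sketch: it assumes the output keeps the `ConsistentId=..., State=...` layout shown above, the sample text is a shortened copy of that output, and the function name `offline_nodes` is made up for illustration.

```python
import re

# Shortened sample of the baseline output above, including one wrapped line.
BASELINE_OUTPUT = """\
Current topology version: 23 (Coordinator:
ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, Order=3)

Baseline nodes:
    ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1,
State=ONLINE, Order=3
    ConsistentId=f151bd52-c173-45d7-952d-45cbe1d5fe97, State=OFFLINE
    ConsistentId=f6862354-b175-4a0c-a94c-20253a944996, State=OFFLINE
"""

# A node entry may be wrapped across lines, so let "." match newlines too.
NODE_RE = re.compile(r"ConsistentId=([0-9a-f-]+),.*?State=(ONLINE|OFFLINE)", re.DOTALL)


def offline_nodes(text):
    """Return the consistent IDs of baseline nodes reported as OFFLINE."""
    # Only look at the part after the "Baseline nodes:" header, so that the
    # coordinator line (which has no State field) is not picked up.
    _, _, nodes_part = text.partition("Baseline nodes:")
    return [cid for cid, state in NODE_RE.findall(nodes_part) if state == "OFFLINE"]


print(offline_nodes(BASELINE_OUTPUT))
```

A deployment script could use the returned list to decide which services need to be started before any recovery step is attempted.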


So my questions are:

1) What can I do to recover from the lost partitions problem after shutting 
down several nodes? I thought that a graceful shutdown was supposed to prevent 
this problem.

Currently I can recover by returning one of the offline nodes to the cluster 
(starting the service) and running the reset_lost_partitions command for the 
broken cache. After this the cache becomes available.
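
A toy model of how a partition can end up lost with backups = 1 when two nodes are offline at the same time. It is a sketch only: the round-robin owner assignment is an assumption for illustration and is not Ignite's real affinity function, and `lost_partitions` is a hypothetical helper, not an Ignite API.

```python
def lost_partitions(owners, stopped):
    """Return the partitions whose every owner is in the stopped set."""
    stopped = set(stopped)
    return sorted(p for p, nodes in owners.items() if set(nodes) <= stopped)


# Assumed layout: 5 nodes, 10 partitions, each with a primary and one backup
# (backups = 1), placed on neighbouring nodes in round-robin order.
NODES = 5
owners = {p: [p % NODES, (p + 1) % NODES] for p in range(10)}

print(lost_partitions(owners, stopped={0}))     # one node down
print(lost_partitions(owners, stopped={0, 1}))  # two neighbouring nodes down
print(lost_partitions(owners, stopped={0, 2}))  # two non-neighbouring nodes down
```

In this model a single stopped node never loses a partition, but two stopped nodes do whenever they hold both copies of the same partition and the copies are not moved elsewhere in between.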

2) What can I do to prevent this problem in a scenario with automated cluster 
deployment? Should I add the reset_lost_partitions command after activation or 
redeploy?
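
For reference, this is roughly how the reset step could be scripted if it were added to the deployment. It is a sketch under assumptions: the `--cache reset_lost_partitions cache1,cache2` argument form is my understanding of the control.sh syntax and should be checked against the docs for the Ignite version in use, the `control.sh` location is a placeholder, and `reset_lost_partitions_cmd` is a made-up helper name. The snippet only builds and prints the command; it does not run it.

```python
import shlex


def reset_lost_partitions_cmd(cache_names, control_script="control.sh"):
    """Build the argv list for resetting lost partitions on the given caches.

    control_script is a placeholder; point it at the real control.sh location.
    """
    if not cache_names:
        raise ValueError("at least one cache name is required")
    # Cache names are passed as a single comma-separated argument.
    return [control_script, "--cache", "reset_lost_partitions", ",".join(cache_names)]


cmd = reset_lost_partitions_cmd(["PUBLIC_StoreProductFeatures"])
print(shlex.join(cmd))
# To execute it for real, pass the list to subprocess.run(cmd, check=True).
```

Building the command as a list rather than a string avoids shell quoting problems if it is later passed to `subprocess.run`.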

Please help.
Thanks in advance!

Best regards,
Rose.

--

Роза Айсина
Senior Software Developer
СберМаркет | Delivery from your favorite stores



Email: [email protected]<mailto:[email protected]>
Mob:
Web: sbermarket.ru<https://sbermarket.ru/>
App: 
iOS<https://apps.apple.com/ru/app/%D1%81%D0%B1%D0%B5%D1%80%D0%BC%D0%B0%D1%80%D0%BA%D0%B5%D1%82-%D0%B4%D0%BE%D1%81%D1%82%D0%B0%D0%B2%D0%BA%D0%B0-%D0%BF%D1%80%D0%BE%D0%B4%D1%83%D0%BA%D1%82%D0%BE%D0%B2/id1166642457>
 and 
Android<https://play.google.com/store/apps/details?id=ru.instamart&hl=en&gl=ru>








CONFIDENTIALITY NOTICE: This email and any files attached to it are 
confidential. If you are not the intended recipient you are notified that 
using, copying, distributing or taking any action in reliance on the contents 
of this information is strictly prohibited. If you have received this email in 
error please notify the sender and delete this email.
