There is a typo here:

> Lost partitions are expected behaviour in case of partition because you
> have only 1 backup and lost two nodes.

I meant that lost partitions are expected behaviour in the case of
partitioned caches when the number of offline nodes is greater than the
number of backups. In your case there is 1 backup and there are 2 offline
nodes.

On Tue, 22 Nov 2022 at 21:56, Ilya Shishkov <[email protected]> wrote:

> Hi,
>
> 1) What can I do to recover from the lost partitions problem after
> shutting down several nodes?
>
> I thought that in the case of a graceful shutdown this problem must be
> solved.
>
> Now I can recover by returning *one* of the offline nodes to the cluster
> (starting the service) and running the *reset_lost_partitions* command
> for the broken cache. After this the cache becomes available.
>
> Are the caches with lost partitions replicated or partitioned? Lost
> partitions are expected behaviour in case of partition because you have
> only 1 backup and lost two nodes. If you want the cluster data to remain
> fully available when 2 nodes are offline, you should set 2 backups for
> partitioned caches, for example as in the sketch below.
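>
> A minimal sketch of such a configuration via the Java API (the cache
> name and value types here are hypothetical):
>
>     import org.apache.ignite.cache.CacheMode;
>     import org.apache.ignite.cache.PartitionLossPolicy;
>     import org.apache.ignite.configuration.CacheConfiguration;
>
>     // Two backups per partition let the cache survive the simultaneous
>     // loss of two baseline nodes.
>     CacheConfiguration<Integer, String> ccfg =
>         new CacheConfiguration<>("myCache");
>     ccfg.setCacheMode(CacheMode.PARTITIONED);
>     ccfg.setBackups(2);
>     // Fail reads and writes to lost partitions instead of silently
>     // serving incomplete results.
>     ccfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);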
>
> As for graceful shutdown: why do you expect that data would not be lost?
> If you have 1 backup and 1 offline node, then there are some partitions
> without backups, because those copies remain inaccessible while their
> owner node is offline. So, if you shut down another node with such
> partitions, they will be lost.
>
> So, for persistent clusters, if you are in a situation where you have to
> work for a long time without backups (i.e. with offline nodes, BUT
> without partition loss), you should trigger a rebalance. It can be done
> manually, or automatically by changing the baseline, as in the sketch
> below. After rebalancing, the number of data copies will be restored.
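>
> A minimal sketch of the manual variant via the Java API (it assumes a
> started server node referenced by the variable "ignite"):
>
>     import org.apache.ignite.IgniteCluster;
>
>     // Resetting the baseline to the current topology version drops the
>     // offline nodes from the baseline and triggers rebalancing, so the
>     // remaining nodes restore the missing backup copies.
>     IgniteCluster cluster = ignite.cluster();
>     cluster.setBaselineTopology(cluster.topologyVersion());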
>
> Now you should bring back at least one of the nodes in order to make the
> partitions available. But if you need a full set of primary and backup
> partitions, you need all baseline nodes in the cluster.
>
> 2) What can I do to prevent this problem in a scenario with automatic
> cluster deployment? Should I add the *reset_lost_partitions* command
> after activation or redeploy?
>
> I don't fully understand what you mean, but there are no problems with
> automatic deployments. In most cases a situation with partition loss
> tells you that the cluster is in an invalid state.
>
> On Tue, 22 Nov 2022 at 19:49, Айсина Роза Мунеровна <
> [email protected]> wrote:
>
>> Hi Sumit!
>>
>> Thanks for your reply!
>>
>> Yeah, I have used the reset_lost_partitions utility many times.
>>
>> The problem is that this function requires all baseline nodes to be
>> present. If I shut down a node, auto adjustment does not remove this
>> node from the baseline topology, and reset_lost_partitions ends with an
>> error saying that all partition owners have left the grid and partition
>> data has been lost.
>>
>> So I remove them manually, and this operation succeeds, but with loss
>> of data on the offline nodes.
>>
>> What I am trying to understand is why graceful shutdown does not handle
>> this situation in the case of backup caches and persistence.
>> How can we automatically bring up Ignite nodes if, after a redeploy,
>> data is lost because the cluster can't handle the lost partitions
>> problem?
>>
>> Best regards,
>> Rose.
>>
>> On 22 Nov 2022, at 5:44 PM, Sumit Deshinge <[email protected]>
>> wrote:
>>
>> Please check if this helps:
>> https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy#handling-partition-loss
>> Also, any reason baseline auto adjustment is disabled?
>>
>> On Tue, Nov 22, 2022 at 6:38 PM Айсина Роза Мунеровна <
>> [email protected]> wrote:
>>
>>> Hola again!
>>>
>>> I discovered that enabling graceful shutdown via
>>> *-DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true* does not work.
>>>
>>> In the service logs I see that nothing happens when *SIGTERM* comes :(
>>> Eventually the stop action timed out and *SIGKILL* was sent, which
>>> caused an ungraceful shutdown.
>>> The timeout is set to *10 minutes*.
>>>
>>> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Starting Apache Ignite
>>> In-Memory Computing Platform Service...
>>> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Started Apache Ignite
>>> In-Memory Computing Platform Service.
>>> Nov 22 12:29:25 yc-ignite-lab-02 systemd[1]: Stopping Apache Ignite
>>> In-Memory Computing Platform Service...
>>> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]:
>>> [email protected]: State 'stop-final-sigterm' timed out.
>>> Killing.
>>> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]:
>>> [email protected]: Killing process 11135 (java) with
>>> signal SIGKILL.
>>> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]:
>>> [email protected]: Failed with result 'timeout'.
>>> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: Stopped Apache Ignite
>>> In-Memory Computing Platform Service.
>>>
>>> I also enabled the *DEBUG* level and see that nothing happens after
>>> rebalancing starts (this is the end of the log):
>>>
>>> [2022-11-22T12:29:25,957][INFO ][shutdown-hook][G] Invoking shutdown
>>> hook...
>>> [2022-11-22T12:29:25,958][DEBUG][shutdown-hook][G] Shutdown is in
>>> progress (ignoring): Shutdown in progress
>>> [2022-11-22T12:29:25,959][INFO ][shutdown-hook][G] Ensuring that caches
>>> have sufficient backups and local rebalance completion...
>>>
>>> I forgot to add that the service is started with *service.sh*, not
>>> *ignite.sh*.
>>>
>>> Please help!
>>>
>>> On 22 Nov 2022, at 1:17 PM, Айсина Роза Мунеровна <
>>> [email protected]> wrote:
>>>
>>> Hola!
>>>
>>> I have a problem recovering from a cluster crash when persistence is
>>> enabled.
>>>
>>> Our setup is:
>>> - 5 VM nodes with 40 GB RAM and 200 GB disk,
>>> - persistence is enabled (on a separate disk on each VM),
>>> - all cluster actions are made through Ansible playbooks,
>>> - all caches are either partitioned with backups = 1 or replicated,
>>> - the cluster starts as a service by running ignite.sh,
>>> - baseline auto adjust is disabled.
>>>
>>> Also, following the docs about partition loss policy, I have added
>>> *-DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true* to *JVM_OPTS* to wait for
>>> partition rebalancing on shutdown.
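>>>
>>> (If I read the docs right, the programmatic equivalent of this system
>>> property since Ignite 2.9 is the node shutdown policy; a minimal
>>> sketch:)
>>>
>>>     import org.apache.ignite.ShutdownPolicy;
>>>     import org.apache.ignite.configuration.IgniteConfiguration;
>>>
>>>     // GRACEFUL blocks node stop until the partitions hosted on this
>>>     // node have enough copies on the remaining nodes.
>>>     IgniteConfiguration cfg = new IgniteConfiguration();
>>>     cfg.setShutdownPolicy(ShutdownPolicy.GRACEFUL);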
>>>
>>> The problem we have: after shutting down several nodes (2 of 5) one
>>> after another, an exception about lost partitions is raised.
>>>
>>> *Caused by:
>>> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
>>> Failed to execute query because cache partition has been lost. Part
>>> [cacheName=PUBLIC_StoreProductFeatures, part=512]*
>>>
>>> But in the logs of the dead nodes I see that all shutdown hooks were
>>> called as expected on both nodes:
>>>
>>> [2022-11-22T09:24:19,614][INFO ][shutdown-hook][G] Invoking shutdown
>>> hook...
>>> [2022-11-22T09:24:19,615][INFO ][shutdown-hook][G] Ensuring that caches
>>> have sufficient backups and local rebalance completion...
>>>
>>> And the baseline topology looks like this (with 2 offline nodes, as
>>> expected):
>>>
>>> Cluster state: active
>>> Current topology version: 23
>>> Baseline auto adjustment disabled: softTimeout=30000
>>>
>>> Current topology version: 23 (Coordinator:
>>> ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1,
>>> Order=3)
>>>
>>> Baseline nodes:
>>>     ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b,
>>> Address=172.17.0.1, State=ONLINE, Order=3
>>>     ConsistentId=4f67fccb-211b-4514-916b-a6286d1bb71b,
>>> Address=172.17.0.1, State=ONLINE, Order=21
>>>     ConsistentId=d980fa1c-e955-428a-bac9-d67dbfebb75e,
>>> Address=172.17.0.1, State=ONLINE, Order=5
>>>     ConsistentId=f151bd52-c173-45d7-952d-45cbe1d5fe97, State=OFFLINE
>>>     ConsistentId=f6862354-b175-4a0c-a94c-20253a944996, State=OFFLINE
>>> --------------------------------------------------------------------------------
>>> Number of baseline nodes: 5
>>>
>>> Other nodes not found.
>>>
>>> So my questions are:
>>>
>>> 1) What can I do to recover from the lost partitions problem after
>>> shutting down several nodes? I thought that in the case of a graceful
>>> shutdown this problem must be solved.
>>>
>>> Now I can recover by returning *one* of the offline nodes to the
>>> cluster (starting the service) and running the *reset_lost_partitions*
>>> command for the broken cache. After this the cache becomes available.
>>>
>>> 2) What can I do to prevent this problem in a scenario with automatic
>>> cluster deployment? Should I add the *reset_lost_partitions* command
>>> after activation or redeploy, e.g. as in the sketch below?
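>>>
>>> For example, something like this sketch via the Java API (assuming a
>>> node reference "ignite"; the cache name is taken from the exception
>>> above):
>>>
>>>     import java.util.Collections;
>>>
>>>     // Reset the lost-partition state of the broken cache once at
>>>     // least one former partition owner is back in the cluster.
>>>     ignite.resetLostPartitions(
>>>         Collections.singleton("PUBLIC_StoreProductFeatures"));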
>>>
>>> Please help.
>>> Thanks in advance!
>>>
>>> Best regards,
>>> Rose.
>>
>> --
>> Regards,
>> Sumit Deshinge