Hello, Rose!

It sure is hard to figure out why a pod restarts, but "Killed" in the logs makes me think it might be due to the out-of-memory (OOM) killer. You can try searching for OOM killer messages in journalctl (or whichever log utility you have on your particular system).
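If the containers really are being OOM-killed, a usual culprit is the JVM heap plus Ignite's off-heap data region (plus checkpoint buffer overhead) exceeding the pod's memory limit. Below is a minimal sketch of capping the default data region from Java; the 2 GiB figure and the region name are illustrative assumptions, not values from your setup:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class MemoryCappedNode {
    public static void main(String[] args) {
        // Cap the default off-heap data region so that JVM heap + off-heap +
        // checkpoint buffers stay below the pod's memory limit.
        // The 2 GiB figure is purely illustrative.
        DataRegionConfiguration region = new DataRegionConfiguration()
            .setName("Default_Region")
            .setPersistenceEnabled(true)
            .setMaxSize(2L * 1024 * 1024 * 1024);

        DataStorageConfiguration storage = new DataStorageConfiguration()
            .setDefaultDataRegionConfiguration(region);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(storage);

        Ignition.start(cfg);
    }
}

The same cap can of course be expressed in the Spring XML configuration; the point is simply that heap plus data region max size must stay comfortably under the container limit.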
As for lost partitions, could you please elaborate on how you used control.sh to reset lost partitions? I tried reproducing your scenario (but without the OOM) and I got it to work with the reset lost partitions command. Sadly, at the moment Ignite doesn't reset lost partitions automatically, but we do have a ticket addressing this issue: https://issues.apache.org/jira/browse/IGNITE-15653.

Kind regards,
Semyon.

> Hello Semyon!
>
> Thank you for your reply!
>
> 2. Yeah, at some point all nodes go down and, as they are created by a StatefulSet, k8s recreates each one with the same pod name (ignite-0, ignite-1). I can be sure because the number of restarts changes from 0 to 1. I just fully followed the deployment instructions from the official docs (how to deploy Ignite on k8s).
> 3. I attached the cluster configuration to this email. Persistence is enabled and the files are stored on the disk of a k8s node (with nodeSelectors I make sure that all pods look into the same folder on disk).
> 4. I am not sure which ID you mean.
>
> What I see in the Grafana logs (sorry, a lot of text below):
>
> ignite-1: the ID changes from b54b6cde to 7776249c when the node restarts (note: no errors at all before the restart):
>
> // old ID
> {"log":" ^-- Node [id=b54b6cde, uptime=17 days, 09:20:13.458]\n","stream":"stderr","time":"2022-07-17T04:50:38.730606703Z"}
>
> // here it restarts with the same consistentId and a new node ID
> {"log":"[04:51:01,500][INFO][main][PdsFoldersResolver] Consistent ID used for local node is [932603da-955c-4b1a-8717-2ce2e875b20c] according to persistence data storage folders\n","stream":"stderr","time":"2022-07-17T04:51:01.500301744Z"}
>
> // sees the second node
> {"log":"[04:51:44,707][INFO][tcp-disco-msg-worker-[]-#2-#55][TcpDiscoverySpi] New next node [newNext=TcpDiscoveryNode [id=6daf16ec-368f-44d5-9751-fa7b440eb264, consistentId=b849e31e-152b-4f1b-afd7-3bc107181677, addrs=ArrayList [10.233.96.234, 127.0.0.1], sockAddrs=HashSet [ignite-0.ignite.rec-matchtv.svc.cluster.local/10.233.96.234:47500, /127.0.0.1:47500], discPort=47500, order=764, intOrder=384, lastExchangeTime=1658033499670, loc=false, ver=2.12.0#20220108-sha1:b1289f75, isClient=false]]\n","stream":"stderr","time":"2022-07-17T04:51:44.708116152Z"}
>
> {"log":"\u003e\u003e\u003e Local node [ID=7776249C-FB90-4761-B3BD-79244C97EAB7, order=773, clientMode=false]\n","stream":"stderr","time":"2022-07-17T04:51:49.08998977Z"}
>
> ignite-0: sees that the old ignite-1 node failed, lost partitions were detected, and a new node connected with the same consistentId:
>
> {"log":"[04:50:54,905][WARNING][disco-event-worker-#61][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=b54b6cde-9d34-473f-afdf-9f0a7d843e2f, consistentId=932603da-955c-4b1a-8717-2ce2e875b20c, addrs=ArrayList [10.233.96.120, 127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500, ignite-1.ignite.rec-matchtv.svc.cluster.local/10.233.96.120:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1658008371506, loc=false, ver=2.12.0#20220108-sha1:b1289f75, isClient=false]\n","stream":"stderr","time":"2022-07-17T04:50:54.905678158Z"}
>
> {"log":"[04:50:55,172][WARNING][sys-#3337][GridDhtPartitionTopologyImpl] Detected lost partitions [grp=PipelineConfig, parts=[2, 12-21, 23, ...], topVer=AffinityTopologyVersion [topVer=772, minorTopVer=0]]\n","stream":"stderr","time":"2022-07-17T04:50:55.172591305Z"}
>
> {"log":"[04:51:39,521][INFO][tcp-disco-msg-worker-[crd]-#2-#55][TcpDiscoverySpi] New next node [newNext=TcpDiscoveryNode [id=7776249c-fb90-4761-b3bd-79244c97eab7, consistentId=932603da-955c-4b1a-8717-2ce2e875b20c, addrs=ArrayList [10.233.96.120, 127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500, ignite-1.ignite.rec-matchtv.svc.cluster.local/10.233.96.120:47500], discPort=47500, order=0, intOrder=388, lastExchangeTime=1658033499392, loc=false, ver=2.12.0#20220108-sha1:b1289f75, isClient=false]]\n","stream":"stderr","time":"2022-07-17T04:51:39.52124616Z"}
>
> Note that they do not go down simultaneously: ignite-1 goes down, restarts, and ignite-0 sees the fresh node and makes a connection. Moreover, ignite-0 itself restarted significantly earlier, at 2022-07-16 21:51:30, with the same sequence of events, and also had no errors at all except:
> {"log":"Killed\n","stream":"stderr","time":"2022-07-16T21:51:54.018444887Z"}
> just before the restart.
>
> "Corrupted" errors started to be raised at 2022-07-17 05:37:19 on ignite-1, with the same id=7776249c and uptime=00:46:00.268:
>
> {"log":"[05:37:18,805][SEVERE][client-connector-#436][ClientListenerNioListener] Failed to process client request [req=o.a.i.i.processors.platform.client.cache.ClientCacheScanQueryRequest@6e7493d, msg=class o.a.i.i.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostParts [cacheName=PipelineConfig, part=0]]\n","stream":"stderr","time":"2022-07-17T05:37:18.807327604Z"}
>
> On ignite-0 (which restarted earlier) the corrupted errors appear earlier, at 2022-07-17 03:41:09, with id=6daf16ec and uptime=05:48:01.522, without any apparent reason, and at that time there are no errors on ignite-1.
>
> Honestly, I don't understand what is going on. Long story short, this is what happens:
>
> 2022-07-16 21:51:30 ignite-0 restarts
> 2022-07-17 03:41:09 ignite-0 has corrupted errors without any preceding events or stack traces, and ignite-1 lives happily without errors
> 2022-07-17 04:51:01 ignite-1 restarts
> 2022-07-17 05:37:19 ignite-1 starts to have corrupted errors
>
> What causes the ignite-0 restart, I don't know :( Everything seems fine and I can't find any events on the k8s cluster. Also the same question remains: why is persistent data lost when the consistent ID is the same and the folders on disk are the same, and why does this happen not immediately but only after several hours?
>
> Please help 🙏
>
> Best regards,
> Rose.
>
> From: Данилов Семён <[email protected]>
> Sent: Tuesday, July 19, 2022 10:44:22 AM
> To: [email protected]
> Subject: Re: How to fix lost partitions gracefully?
>
> Hello Rose,
>
> You're right, having persistence enabled should prevent the cluster from losing partitions, given that all nodes are online, of course. So if a node (or the whole cluster) goes down, partitions should not be lost after the restart.
>
> I have a couple of questions:
>
> 1. Do I understand correctly that you observe server nodes go down and k8s recreate them?
> 2. Can you provide your cluster configuration?
> 3. Can you check that the nodes that are started are the same nodes that went down? A (re-)started node should have the same consistent ID as the node that went down. If it doesn't, then it's a brand new node with no persistence.
>
> Regards, Semyon.
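On the consistent ID point, here is a minimal sketch of pinning it explicitly, together with the work directory, so that a recreated pod always comes back as the same node and reuses its persistence folders. The POD_NAME environment variable (typically injected via the downward API) and the /persistence/ignite/work path are assumptions for illustration, not taken from the attached configuration:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class PinnedConsistentIdNode {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration()
            // Pin the consistent ID explicitly, e.g. to the StatefulSet pod name
            // (POD_NAME is a hypothetical env var injected via the downward API),
            // so a recreated pod comes back as the same node.
            .setConsistentId(System.getenv("POD_NAME"))
            // Keep the work directory on the mounted volume so a restarted pod
            // finds the same persistence folders (path is illustrative).
            .setWorkDirectory("/persistence/ignite/work");

        Ignition.start(cfg);
    }
}

With an explicit consistentId, PdsFoldersResolver should no longer have to pick a storage folder on its own after a restart.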
>> Hi Stephen!
>>
>> Thanks for your reply!
>>
>> 2. Well, that's the problem: I can't figure out why all server nodes go down. Nobody uses this cluster except my two apps with client nodes. And nothing happens before the unexpected shutdown and recreation of the server pods. The k8s cluster seems fine as well.
>>
>> 3. Also, I have persistence enabled (with data saved on the disk of a single k8s node). Why can't the server pods restore their caches from persistence automatically when they are recreated? I thought this is the main goal of persistence: to save the data.
>>
>> 4. Unfortunately, resetting partitions didn't help :( The control script returned a 0 exit code, but it was still impossible to retrieve data from the corrupted cache (same error). So I deleted the cache data, redeployed the whole Ignite cluster, and now everything works fine. But it's very costly to do this every time the Ignite server nodes are recreated, which shouldn't be a "stop-the-world" problem since the data is persisted.
>>
>> 5. I guess that backing up partitions will not help, as both nodes went down at the same time. It seems to me that in that case all partitions would be lost, including those that were backed up.
>>
>> Best regards,
>> Rose.
>>
>> From: Stephen Darlington <[email protected]>
>> Sent: Monday, July 18, 2022 5:54 PM
>> To: user
>> Subject: Re: How to fix lost partitions gracefully?
>>
>> Client nodes disconnecting is not the problem here. You have server nodes going down.
>>
>> Caches are split into partitions, which are then distributed across the nodes in your cluster. If one of your data nodes goes down, and you have not configured any backup partitions, then you will lose some partitions and the data in them.
>>
>> There’s a script you can run to “reset lost partitions”: control-script
>>
>> Of course this does not magically bring the data back.
>>
>> You perhaps need to consider more nodes and configure your caches with at least one backup.
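For reference, a minimal sketch of what the advice above (at least one backup, plus resetting lost partitions) could look like in Java rather than via control.sh. The cache name PipelineConfig comes from the logs; the loss policy choice and the placement of the reset call are illustrative assumptions:

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class BackupAndLossPolicyExample {
    public static void main(String[] args) {
        CacheConfiguration<String, Object> cacheCfg =
            new CacheConfiguration<String, Object>("PipelineConfig")
                // One backup copy per partition: losing a single server node
                // no longer loses data.
                .setBackups(1)
                // Operations on lost partitions fail with CacheInvalidStateException
                // (the error seen in the logs) until the loss is reset.
                .setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setCacheConfiguration(cacheCfg);

        try (Ignite ignite = Ignition.start(cfg)) {
            // Programmatic counterpart of resetting lost partitions via control.sh.
            // Placed here only for illustration; in practice it is invoked once the
            // cluster is back to its full baseline topology.
            ignite.resetLostPartitions(Collections.singleton("PipelineConfig"));
        }
    }
}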
>>> On 18 Jul 2022, at 12:49, Айсина Роза <[email protected]> wrote:
>>>
>>> Hello!
>>>
>>> We have an Ignite standalone cluster in a k8s environment with 2 server nodes and several clients: a Java Spring application and a Spark application.
>>>
>>> Both apps raise client nodes to connect to the cluster every two hours (a rolling-update redeploy of both apps happens).
>>>
>>> The whole setup is in k8s in one namespace.
>>>
>>> There is strange behavior we see sporadically after several weeks.
>>>
>>> The cache both apps use often becomes corrupted with the following exception:
>>>
>>> [10:57:43,951][SEVERE][client-connector-#2796][ClientListenerNioListener] Failed to process client request [req=o.a.i.i.processors.platform.client.cache.ClientCacheScanQueryRequest@78481268, msg=class o.a.i.i.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostParts [cacheName=PipelineConfig, part=0]]
>>>
>>> javax.cache.CacheException: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostParts [cacheName=PipelineConfig, part=0]
>>>
>>> I went through the server logs from both Ignite nodes and found some events that I cannot understand.
>>>
>>> I attached the logs: one filtered with the keyword "Exception" to locate errors, and the other the original logs from when the first lost-partitions error happens.
>>>
>>> It seems that this error is causing this behavior: Failed to shutdown socket
>>>
>>> After this, all interactions with the cluster become impossible.
>>>
>>> Also there are many errors like this: Client disconnected abruptly due to network connection loss or because the connection was left open on application shutdown.
>>>
>>> So I have two questions:
>>>
>>> 2. Can you please help to investigate the main reason for the lost partitions error and how to handle it automatically? Right now I manually redeploy the whole cluster and then all applications connected to it, which is insane and very human-dependent.
>>> 3. Are there any special actions I need to take to gracefully handle client nodes when the apps are shutting down? Is it possible that frequent (every 2h) connect-then-die events from client nodes cause this behavior?
>>>
>>> Thank you in advance! Looking forward to any help! 🙏
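Regarding the question about gracefully handling client nodes when the apps shut down: closing the client node cleanly, instead of letting the pod die with the connection open, avoids the "Client disconnected abruptly" warnings. A minimal sketch, assuming the apps start thick client nodes as described above:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class GracefulClientNode {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration()
            .setClientMode(true); // thick client node, as the apps use

        // Closing the node on shutdown makes the servers see a clean leave
        // instead of an abrupt network-level disconnect on every redeploy.
        try (Ignite client = Ignition.start(cfg)) {
            // ... application work with client.cache(...) etc. ...
        }

        // In a long-running app the same effect can be had with a shutdown hook:
        // Runtime.getRuntime().addShutdownHook(new Thread(() -> Ignition.stopAll(true)));
    }
}

As Stephen notes above, the client churn itself is not what loses partitions, but a clean shutdown keeps the server logs quieter and rules that factor out.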
