Hi Alex,

We are using Ignite v2.15.

I will track down the additional log information and reply on this thread.

Raymond.

On Wed, Jul 19, 2023 at 2:55 AM Alex Plehanov <plehanov.a...@gmail.com> wrote:

> Hello,
>
> Which Ignite version do you use?
> Please share exception details after "Exception during start processors,
> node will be stopped and close connections" (there should be a reason in
> the log, why the page delta can't be applied).
>
> On Tue, Jul 18, 2023 at 05:05, Raymond Wilson <raymond_wil...@trimble.com> wrote:
>
>> Hi,
>>
>> We run a dev/alpha stack of our application in Azure Kubernetes.
>> Persistent storage is contained in Azure Files NAS storage volumes, one per
>> server node.
>>
>> We ran an upgrade of Kubernetes today (from 1.24.9 to 1.26.3). During the
>> update various pods were stopped and restarted as is normal for an update.
>> This included nodes running the dev/alpha stack.
>>
>> At least one node (of a cluster of four server nodes in the cluster)
>> failed to restart after the update, with the following logging:
>>
>> 2023-07-18 01:23:55.171 [1] INF Restoring checkpoint after logical
>> recovery, will start physical recovery from back pointer: WALPointer
>> [idx=2431, fileOff=209031823, len=29]
>> 2023-07-18 01:23:55.205 [28] ERR Failed to apply page delta.
>> rec=[PagesListRemovePageRecord [rmvdPageId=0101000100000057,
>> pageId=0101000100000004, grpId=-1476359018, super=PageDeltaRecord
>> [grpId=-1476359018, pageId=0101000100000004, super=WALRecord [size=41,
>> chainSize=0, pos=WALPointer [idx=2431, fileOff=209169155, len=41],
>> type=PAGES_LIST_REMOVE_PAGE]]]]
>> 2023-07-18 01:23:55.217 [1] INF Cleanup cache stores [total=0,
>> left=0, cleanFiles=false]
>> 2023-07-18 01:23:55.218 [1] ERR Got exception while starting (will
>> rollback startup routine).
>> 2023-07-18 01:23:55.218 [1] ERR Exception during start processors,
>> node will be stopped and close connections
>>
>> I know Apache Ignite is very good at surviving 'Big Red Switch'
>> scenarios, and we have our data regions configured with the strictest
>> update protocol (full sync after each write), however it's possible the NAS
>> implementation does something different!
>>
>> I think if we delete the WAL files from the nodes that won't restart then
>> the node may be happy, though we will lose any updates since the last
>> checkpoint (but then, it has low use and checkpoints are every 30-45
>> seconds or so, so this won't be significant).
>>
>> Is this an error anyone else has noticed?
>> Has anyone else had similar issues with Azure Files when using strict
>> update/sync semantics?
>>
>> Thanks,
>> Raymond.
>>
>> --
>> <http://www.trimble.com/>
>> Raymond Wilson
>> Trimble Distinguished Engineer, Civil Construction Software (CCS)
>> 11 Birmingham Drive | Christchurch, New Zealand
>> raymond_wil...@trimble.com
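
For anyone triaging a similar restart failure, the two WALPointer values in the quoted log can be compared directly. The following is a minimal sketch, not part of Ignite: the regex and the helper names `wal_pointers` and `offset_from_start` are illustrative assumptions, and it expects log text with the `>>` quote markers and mail line-wrapping already removed. It only reports how far past the physical-recovery start pointer the failing page delta record sits; it does not explain why the delta could not be applied.

```python
import re

# Matches pointers as printed in the log, e.g.
# "WALPointer [idx=2431, fileOff=209031823, len=29]".
WAL_POINTER = re.compile(
    r"WALPointer\s*\[idx=(\d+),\s*fileOff=(\d+),\s*len=(\d+)\]"
)


def wal_pointers(text):
    """Return (idx, fileOff, len) tuples for every WALPointer found in text."""
    return [tuple(int(g) for g in m.groups()) for m in WAL_POINTER.finditer(text)]


def offset_from_start(start, failed):
    """Byte distance between two pointers in the same WAL segment, else None."""
    if start[0] != failed[0]:
        return None  # different segment index: offsets are not comparable
    return failed[1] - start[1]


# The two relevant log lines from the thread, unwrapped onto single lines.
log = """
INF Restoring checkpoint after logical recovery, will start physical recovery from back pointer: WALPointer [idx=2431, fileOff=209031823, len=29]
ERR Failed to apply page delta. rec=[PagesListRemovePageRecord [rmvdPageId=0101000100000057, pageId=0101000100000004, grpId=-1476359018, super=PageDeltaRecord [grpId=-1476359018, pageId=0101000100000004, super=WALRecord [size=41, chainSize=0, pos=WALPointer [idx=2431, fileOff=209169155, len=41], type=PAGES_LIST_REMOVE_PAGE]]]]
"""

start, failed = wal_pointers(log)
print(offset_from_start(start, failed))  # 209169155 - 209031823 = 137332
```

For the log above, both pointers are in segment 2431 and the failing record starts 137332 bytes after the recovery start pointer, so the failure occurs during physical recovery within that one segment.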