Re: Possible WAL corruption on running system during K8s update

Raymond Wilson Tue, 18 Jul 2023 10:43:58 -0700

Hi Alex,

We are using Ignite v2.15.


I will track down the additional log information and reply on this thread.

Raymond.


On Wed, Jul 19, 2023 at 2:55 AM Alex Plehanov <[email protected]>
wrote:

> Hello,
>
> Which Ignite version do you use?
> Please share exception details after "Exception during start processors,
> node will be stopped and close connections" (there should be a reason in
> the log, why the page delta can't be applied).
>
> On Tue, 18 Jul 2023 at 05:05, Raymond Wilson <[email protected]> wrote:
>
>> Hi,
>>
>> We run a dev/alpha stack of our application in Azure Kubernetes.
>> Persistent storage is contained in Azure Files NAS storage volumes, one per
>> server node.
>>
>> We ran an upgrade of Kubernetes today (from 1.24.9 to 1.26.3). During the
>> update various pods were stopped and restarted as is normal for an update.
>> This included nodes running the dev/alpha stack.
>>
>> At least one node (of a cluster of four server nodes) failed to restart
>> after the update, with the following logging:
>>
>>  2023-07-18 01:23:55.171 [1] INF    Restoring checkpoint after logical recovery, will start physical recovery from back pointer: WALPointer [idx=2431, fileOff=209031823, len=29]
>>  2023-07-18 01:23:55.205  [28] ERR    Failed to apply page delta. rec=[PagesListRemovePageRecord [rmvdPageId=0101000100000057, pageId=0101000100000004, grpId=-1476359018, super=PageDeltaRecord [grpId=-1476359018, pageId=0101000100000004, super=WALRecord [size=41, chainSize=0, pos=WALPointer [idx=2431, fileOff=209169155, len=41], type=PAGES_LIST_REMOVE_PAGE]]]]
>>  2023-07-18 01:23:55.217 [1] INF    Cleanup cache stores [total=0, left=0, cleanFiles=false]
>>  2023-07-18 01:23:55.218 [1] ERR    Got exception while starting (will rollback startup routine).
>>  2023-07-18 01:23:55.218 [1] ERR    Exception during start processors, node will be stopped and close connections
>>
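The two WALPointer values in the log above can be compared mechanically. What follows is only a minimal Python sketch, assuming the pointer text has exactly the format shown in the log; the helper name parse_wal_pointer is hypothetical and is not part of any Ignite API.

```python
import re

# Matches the WALPointer fields as they appear in the quoted log lines.
WAL_POINTER = re.compile(r"WALPointer\s*\[idx=(\d+), fileOff=(\d+), len=(\d+)\]")


def parse_wal_pointer(text):
    """Return (idx, file_off, length) for the first WALPointer in text, or None."""
    match = WAL_POINTER.search(text)
    return tuple(int(group) for group in match.groups()) if match else None


# Pointer strings copied from the log excerpt in this thread.
back_pointer = parse_wal_pointer("WALPointer [idx=2431, fileOff=209031823, len=29]")
failed_record = parse_wal_pointer("pos=WALPointer [idx=2431, fileOff=209169155, len=41]")

# Same segment index means both positions are in the same WAL segment.
same_segment = back_pointer[0] == failed_record[0]
# Byte distance from the checkpoint back pointer to the record that failed.
offset_delta = failed_record[1] - back_pointer[1]
print(same_segment, offset_delta)  # prints: True 137332
```

Both pointers carry idx=2431, and the record that failed to apply starts 137332 bytes after the checkpoint back pointer.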
>> I know Apache Ignite is very good at surviving 'Big Red Switch'
>> scenarios, and we have our data regions configured with the strictest
>> update protocol (full sync after each write). However, it's possible the
>> NAS implementation does something different!
>>
>> I think that if we delete the WAL files from the nodes that won't restart,
>> then those nodes may start, though we would lose any updates made since the
>> last checkpoint (the stack has low use and checkpoints run every 30-45
>> seconds or so, so the loss should not be significant).
>>
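If the WAL deletion idea above is tried at all, moving the files aside instead of deleting them keeps the option of restoring them later. This is not a recommendation to remove WAL files, only a minimal Python sketch under stated assumptions: the node is stopped, both directory paths are hypothetical parameters supplied by the operator, and nothing here relies on Ignite's actual directory layout.

```python
import shutil
import time
from pathlib import Path


def move_wal_aside(wal_dir, backup_root):
    """Move every entry in wal_dir into a new timestamped folder under backup_root.

    Both paths are hypothetical parameters; nothing here is specific to Ignite.
    Returns the backup folder so the files can be restored later.
    """
    wal_dir, backup_root = Path(wal_dir), Path(backup_root)
    backup_dir = backup_root / f"wal-backup-{time.strftime('%Y%m%dT%H%M%S')}"
    # Fail rather than mix files into an existing backup folder.
    backup_dir.mkdir(parents=True, exist_ok=False)
    for entry in sorted(wal_dir.iterdir()):
        shutil.move(str(entry), str(backup_dir / entry.name))
    return backup_dir
```

A call would look like move_wal_aside("/path/to/wal", "/path/to/backups"), with both paths replaced by the real locations on the affected node's volume.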
>> Is this an error anyone else has noticed?
>> Has anyone else had similar issues with Azure Files when using strict
>> update/sync semantics?
>>
>> Thanks,
>> Raymond.
>>
>> --
>> <http://www.trimble.com/>
>> Raymond Wilson
>> Trimble Distinguished Engineer, Civil Construction Software (CCS)
>> 11 Birmingham Drive | Christchurch, New Zealand
>> [email protected]
>>
>>
>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>>
>

-- 
<http://www.trimble.com/>
Raymond Wilson
Trimble Distinguished Engineer, Civil Construction Software (CCS)
11 Birmingham Drive | Christchurch, New Zealand
[email protected]

<https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
