Hi,
One node of our 8-node server cluster went down suddenly, and the error it
printed is not very helpful for finding the cause. The relevant log entries are
given below; please help us find the reason.
################################################################################
[2024-03-16T07:18:52,489][INFO
][db-checkpoint-thread-#556%EDIFCustomerMAINDR%][GridCacheDatabaseSharedManager]
Checkpoint started [checkpointId=90c6ab62-a441-4cd3-8b6d-fbef550fbe4c,
startPtr=FileWALPointer [idx=5463768, fileOff=51537417, len=56705],
checkpointLockWait=0ms, checkpointLockHoldTime=21ms,
walCpRecordFsyncDuration=4ms, pages=62810, reason='timeout']
[2024-03-16T07:18:53,565][INFO
][db-checkpoint-thread-#556%EDIFCustomerMAINDR%][GridCacheDatabaseSharedManager]
Checkpoint finished [cpId=90c6ab62-a441-4cd3-8b6d-fbef550fbe4c, pages=62810,
markPos=FileWALPointer [idx=5463768, fileOff=51537417, len=56705],
walSegmentsCleared=4, walSegmentsCovered=[5463764 - 5463767],
markDuration=43ms, pagesWrite=491ms, fsync=585ms, total=1119ms]
[2024-03-16T07:19:11,191][INFO
][wal-file-archiver%EDIFCustomerMAINDR-#466%EDIFCustomerMAINDR%][FileWriteAheadLogManager]
Starting to copy WAL segment [absIdx=5463768, segIdx=8,
origFile=/datastore1/wal/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000000000008.wal,
dstFile=/datastore1/archive/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000005463768.wal]
[2024-03-16T07:19:11,318][INFO
][wal-file-archiver%EDIFCustomerMAINDR-#466%EDIFCustomerMAINDR%][FileWriteAheadLogManager]
Copied file
[src=/datastore1/wal/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000000000008.wal,
dst=/datastore1/archive/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000005463768.wal]
[2024-03-16T07:19:33,988][INFO
][wal-file-archiver%EDIFCustomerMAINDR-#466%EDIFCustomerMAINDR%][FileWriteAheadLogManager]
Starting to copy WAL segment [absIdx=5463769, segIdx=9,
origFile=/datastore1/wal/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000000000009.wal,
dstFile=/datastore1/archive/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000005463769.wal]
[2024-03-16T07:19:34,121][INFO
][wal-file-archiver%EDIFCustomerMAINDR-#466%EDIFCustomerMAINDR%][FileWriteAheadLogManager]
Copied file
[src=/datastore1/wal/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000000000009.wal,
dst=/datastore1/archive/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000005463769.wal]
[2024-03-16T07:20:18,852][INFO
][wal-file-archiver%EDIFCustomerMAINDR-#466%EDIFCustomerMAINDR%][FileWriteAheadLogManager]
Starting to copy WAL segment [absIdx=5463770, segIdx=0,
origFile=/datastore1/wal/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000000000000.wal,
dstFile=/datastore1/archive/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000005463770.wal]
[2024-03-16T07:20:19,000][INFO
][wal-file-archiver%EDIFCustomerMAINDR-#466%EDIFCustomerMAINDR%][FileWriteAheadLogManager]
Copied file
[src=/datastore1/wal/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000000000000.wal,
dst=/datastore1/archive/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000005463770.wal]
[2024-03-16T07:21:15,373][INFO
][wal-file-archiver%EDIFCustomerMAINDR-#466%EDIFCustomerMAINDR%][FileWriteAheadLogManager]
Starting to copy WAL segment [absIdx=5463771, segIdx=1,
origFile=/datastore1/wal/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000000000001.wal,
dstFile=/datastore1/archive/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000005463771.wal]
[2024-03-16T07:21:15,500][INFO
][wal-file-archiver%EDIFCustomerMAINDR-#466%EDIFCustomerMAINDR%][FileWriteAheadLogManager]
Copied file
[src=/datastore1/wal/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000000000001.wal,
dst=/datastore1/archive/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000005463771.wal]
[2024-03-16T07:21:52,493][INFO
][db-checkpoint-thread-#556%EDIFCustomerMAINDR%][GridCacheDatabaseSharedManager]
Checkpoint started [checkpointId=d12adda8-2202-4192-bbeb-f5083d3eabc9,
startPtr=FileWALPointer [idx=5463772, fileOff=44231024, len=56705],
checkpointLockWait=0ms, checkpointLockHoldTime=25ms,
walCpRecordFsyncDuration=5ms, pages=61176, reason='timeout']
[2024-03-16T07:21:53,433][INFO
][db-checkpoint-thread-#556%EDIFCustomerMAINDR%][GridCacheDatabaseSharedManager]
Checkpoint finished [cpId=d12adda8-2202-4192-bbeb-f5083d3eabc9, pages=61176,
markPos=FileWALPointer [idx=5463772, fileOff=44231024, len=56705],
walSegmentsCleared=4, walSegmentsCovered=[5463768 - 5463771],
markDuration=45ms, pagesWrite=485ms, fsync=454ms, total=984ms]
[2024-03-16T07:22:12,689][INFO
][wal-file-archiver%EDIFCustomerMAINDR-#466%EDIFCustomerMAINDR%][FileWriteAheadLogManager]
Starting to copy WAL segment [absIdx=5463772, segIdx=2,
origFile=/datastore1/wal/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000000000002.wal,
dstFile=/datastore1/archive/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000005463772.wal]
[2024-03-16T07:22:12,855][INFO
][wal-file-archiver%EDIFCustomerMAINDR-#466%EDIFCustomerMAINDR%][FileWriteAheadLogManager]
Copied file
[src=/datastore1/wal/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000000000002.wal,
dst=/datastore1/archive/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000005463772.wal]
[2024-03-16T07:22:46,007][INFO
][wal-file-archiver%EDIFCustomerMAINDR-#466%EDIFCustomerMAINDR%][FileWriteAheadLogManager]
Starting to copy WAL segment [absIdx=5463773, segIdx=3,
origFile=/datastore1/wal/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000000000003.wal,
dstFile=/datastore1/archive/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000005463773.wal]
[2024-03-16T07:22:46,135][INFO
][wal-file-archiver%EDIFCustomerMAINDR-#466%EDIFCustomerMAINDR%][FileWriteAheadLogManager]
Copied file
[src=/datastore1/wal/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000000000003.wal,
dst=/datastore1/archive/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000005463773.wal]
[2024-03-16T07:23:21,043][INFO
][wal-file-archiver%EDIFCustomerMAINDR-#466%EDIFCustomerMAINDR%][FileWriteAheadLogManager]
Starting to copy WAL segment [absIdx=5463774, segIdx=4,
origFile=/datastore1/wal/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000000000004.wal,
dstFile=/datastore1/archive/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000005463774.wal]
[2024-03-16T07:23:21,176][INFO
][wal-file-archiver%EDIFCustomerMAINDR-#466%EDIFCustomerMAINDR%][FileWriteAheadLogManager]
Copied file
[src=/datastore1/wal/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000000000004.wal,
dst=/datastore1/archive/node00-34dda35b-606f-4124-936e-7b1a01531043/0000000005463774.wal]
[2024-03-16T07:23:30,071][ERROR][sys-stripe-101-#102%EDIFCustomerMAINDR%][]
Critical system error detected. Will be handled accordingly to configured
handler [hnd=StopNodeOrHaltFailureHandler
[tryStop=false, timeout=0, super=AbstractFailureHandler
[ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_TERMINATION, err=java.lang.ClassCastException:
o.a.i.i.processors.cache.distributed.near.GridNearGetRequest cannot be cast to
o.a.i.i.GridJobExecuteRequest]]
java.lang.ClassCastException: org.apache.ignite.internal.processors.cache.distributed.near.GridNearGetRequest cannot be cast to org.apache.ignite.internal.GridJobExecuteRequest
    at org.apache.ignite.internal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:1923) ~[ignite-core-2.7.6.jar:2.7.6]
    at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569) ~[ignite-core-2.7.6.jar:2.7.6]
    at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197) ~[ignite-core-2.7.6.jar:2.7.6]
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:127) ~[ignite-core-2.7.6.jar:2.7.6]
    at org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1093) ~[ignite-core-2.7.6.jar:2.7.6]
    at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:505) [ignite-core-2.7.6.jar:2.7.6]
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.6.jar:2.7.6]
    at java.lang.Thread.run(Thread.java:750) [?:1.8.0_351]
[2024-03-16T07:23:30,080][WARN
][sys-stripe-101-#102%EDIFCustomerMAINDR%][FailureProcessor] No deadlocked
threads detected.
[2024-03-16T07:23:32,895][WARN
][jvm-pause-detector-worker][IgniteKernal%EDIFCustomerMAINDR] Possible too long
JVM pause: 2809 milliseconds.
[2024-03-16T07:23:32,912][WARN
][sys-stripe-101-#102%EDIFCustomerMAINDR%][FailureProcessor] Thread dump at
2024/03/16 07:23:32 IST
Thread [name="JMX server connection timeout 185576", id=185576, state=TIMED_WAITING, blockCnt=191, waitCnt=192]
    Lock [object=[I@20ce31fa, ownerName=null, ownerId=-1]
        at java.lang.Object.wait(Native Method)
        at com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:168)
        at java.lang.Thread.run(Thread.java:750)

Thread [name="sys-#89619%EDIFCustomerMAINDR%", id=185575, state=TIMED_WAITING, blockCnt=0, waitCnt=1]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2fa3cce5, ownerName=null, ownerId=-1]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
..............................
[2024-03-16T07:23:32,919][ERROR][sys-stripe-101-#102%EDIFCustomerMAINDR%][] JVM
will be halted immediately due to the failure: [failureCtx=FailureContext
[type=SYSTEM_WORKER_TERMINATION, err=java.lang.ClassCastException:
o.a.i.i.processors.cache.distributed.near.GridNearGetRequest cannot be cast to
o.a.i.i.GridJobExecuteRequest]]
#################################################################################################
Thanks,
Gangaiah