Hello!

Can you please find an actual stack trace from partition-exchanger thread
in that log?

One that starts with Thread [name="partition-exchanger" ?

Regards,
-- 
Ilya Kasnacheev


вт, 10 нояб. 2020 г. в 15:54, Kamlesh Joshi <[email protected]>:

> Hi Igniters,
>
>
>
> Sometimes 'Blocked system-critical thread' errors are getting printed in
> prod cluster logs. As per below logs, it's saying exchange-worker blocked
> for 60s. There is no node join/left or cluster activation or cache
> creation, still why partition exchange is triggered. Even it is triggered,
>  it is blocked for *60s *which is huge time from prod perspective.
>
>
>
> Below are error details,
>
> [2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G]
> Blocked system-critical thread has been detected. This can lead to
> cluster-wide undefined behaviour [threadName=partition-exchanger,
> *blockedFor=60s*]
>
> [2020-11-09T10:55:18,888][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][G]
> Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391,
> state=TIMED_WAITING, blockCnt=16, waitCnt=2469961]
>
>
>
> Cluster is responding but these errors priting, we are not understanding
> what's the cause, could you please help us.
>
>
>
> Below is log snippet,
>
>
>
> [2020-11-09T10:51:21,458][INFO
> ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager]
> Checkpoint started [checkpointId=3bd28c8f-d9c9-4110-ab8c-24cf3a5c44a3,
> startPtr=FileWALPointer [idx=1279664, fileOff=28148165, len=49557],
> checkpointLockWait=0ms, checkpointLockHoldTime=34ms,
> walCpRecordFsyncDuration=7ms, pages=84152, reason='timeout']
>
> [2020-11-09T10:51:23,499][INFO
> ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager]
> Checkpoint finished [cpId=3bd28c8f-d9c9-4110-ab8c-24cf3a5c44a3,
> pages=84152, markPos=FileWALPointer [idx=1279664, fileOff=28148165,
> len=49557], walSegmentsCleared=5, walSegmentsCovered=[1279658 - 1279663],
> markDuration=79ms, pagesWrite=1195ms, fsync=845ms, total=2119ms]
>
> [2020-11-09T10:51:36,788][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Starting to copy WAL segment [absIdx=1279664, segIdx=4,
> origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000004.wal,
> dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279664.wal]
>
> [2020-11-09T10:51:36,954][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Copied file
> [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000004.wal,
> dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279664.wal]
>
> [2020-11-09T10:52:02,018][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Starting to copy WAL segment [absIdx=1279665, segIdx=5,
> origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000005.wal,
> dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279665.wal]
>
> [2020-11-09T10:52:02,200][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Copied file
> [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000005.wal,
> dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279665.wal]
>
> [2020-11-09T10:52:36,541][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Starting to copy WAL segment [absIdx=1279666, segIdx=6,
> origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000006.wal,
> dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279666.wal]
>
> [2020-11-09T10:52:36,703][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Copied file
> [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000006.wal,
> dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279666.wal]
>
> [2020-11-09T10:53:11,068][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Starting to copy WAL segment [absIdx=1279667, segIdx=7,
> origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000007.wal,
> dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279667.wal]
>
> [2020-11-09T10:53:11,239][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Copied file
> [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000007.wal,
> dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279667.wal]
>
> [2020-11-09T10:53:45,694][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Starting to copy WAL segment [absIdx=1279668, segIdx=8,
> origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000008.wal,
> dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279668.wal]
>
> [2020-11-09T10:53:45,868][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Copied file
> [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000008.wal,
> dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279668.wal]
>
> [2020-11-09T10:54:21,411][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Starting to copy WAL segment [absIdx=1279669, segIdx=9,
> origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000009.wal,
> dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279669.wal]
>
> [2020-11-09T10:54:21,480][INFO
> ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager]
> Checkpoint started [checkpointId=33221fc2-5ab0-4929-9d23-8940e0552272,
> startPtr=FileWALPointer [idx=1279670, fileOff=5561792, len=49557],
> checkpointLockWait=0ms, checkpointLockHoldTime=51ms,
> walCpRecordFsyncDuration=5ms, pages=86602, reason='timeout']
>
> [2020-11-09T10:54:21,578][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Copied file
> [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000009.wal,
> dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279669.wal]
>
> [2020-11-09T10:54:23,624][INFO
> ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager]
> Checkpoint finished [cpId=33221fc2-5ab0-4929-9d23-8940e0552272,
> pages=86602, markPos=FileWALPointer [idx=1279670, fileOff=5561792,
> len=49557], walSegmentsCleared=6, walSegmentsCovered=[1279664 - 1279669],
> markDuration=94ms, pagesWrite=1219ms, fsync=925ms, total=2238ms]
>
> [2020-11-09T10:54:45,727][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Starting to copy WAL segment [absIdx=1279670, segIdx=0,
> origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000000.wal,
> dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279670.wal]
>
> [2020-11-09T10:54:45,886][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Copied file
> [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000000.wal,
> dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279670.wal]
>
> [2020-11-09T10:55:17,146][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Starting to copy WAL segment [absIdx=1279671, segIdx=1,
> origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000001.wal,
> dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279671.wal]
>
> [2020-11-09T10:55:17,343][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Copied file
> [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000001.wal,
> dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279671.wal]
>
> [2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G]
> Blocked system-critical thread has been detected. This can lead to
> cluster-wide undefined behaviour [threadName=partition-exchanger,
> blockedFor=60s]
>
> [2020-11-09T10:55:18,888][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][G]
> Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391,
> state=TIMED_WAITING, blockCnt=16, waitCnt=2469961]
>
>     Lock
> [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@223507fb,
> ownerName=null, ownerId=-1]
>
>
>
> [2020-11-09T10:55:18,889][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][]
> Critical system error detected. Will be handled accordingly to configured
> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
> super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
> [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
> finished=false, heartbeatTs=1604899458881]]]
>
> org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger,
> igniteInstanceName=EDIFCustomerCC, finished=false,
> heartbeatTs=1604899458881]
>
>                 at
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831)
> [ignite-core-2.7.6.jar:2.7.6]
>
>                 at
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)
> [ignite-core-2.7.6.jar:2.7.6]
>
>                 at
> org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)
> [ignite-core-2.7.6.jar:2.7.6]
>
>                 at
> org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297)
> [ignite-core-2.7.6.jar:2.7.6]
>
>                 at
> org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:513)
> [ignite-core-2.7.6.jar:2.7.6]
>
>                 at
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
> [ignite-core-2.7.6.jar:2.7.6]
>
>                 at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
>
> [2020-11-09T10:55:18,894][WARN
> ][sys-stripe-118-#119%EDIFCustomerCC%][FailureProcessor] No deadlocked
> threads detected.
>
> [2020-11-09T10:55:21,404][WARN
> ][jvm-pause-detector-worker][IgniteKernal%EDIFCustomerCC] Possible too long
> JVM pause: 2477 milliseconds.
>
> [2020-11-09T10:55:21,422][WARN
> ][sys-stripe-118-#119%EDIFCustomerCC%][FailureProcessor] Thread dump at
> 2020/11/09 10:55:21 IST
>
> Thread [name="sys-#213675%EDIFCustomerCC%", id=369652,
> state=TIMED_WAITING, blockCnt=0, waitCnt=1]
>
>     Lock
> [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42c969fd,
> ownerName=null, ownerId=-1]
>
>         at sun.misc.Unsafe.park(Native Method)
>
>         at
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
>
>         at
> java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
>         at java.lang.Thread.run(Thread.java:748)
>
>
>
> Thread [name="sys-#213674%EDIFCustomerCC%", id=369651,
> state=TIMED_WAITING, blockCnt=0, waitCnt=1]
>
>     Lock
> [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42c969fd,
> ownerName=null, ownerId=-1]
>
>         at sun.misc.Unsafe.park(Native Method)
>
>         at
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
>
>         at
> java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
>         at java.lang.Thread.run(Thread.java:748)
>
>
>
>
>
> *Thanks and Regards,*
>
> *Kamlesh Joshi*
>
>
>
>
> "*Confidentiality Warning*: This message and any attachments are intended
> only for the use of the intended recipient(s), are confidential and may be
> privileged. If you are not the intended recipient, you are hereby notified
> that any review, re-transmission, conversion to hard copy, copying,
> circulation or other use of this message and any attachments is strictly
> prohibited. If you are not the intended recipient, please notify the sender
> immediately by return email and delete this message and any attachments
> from your system.
>
> *Virus Warning:* Although the company has taken reasonable precautions to
> ensure no viruses are present in this email. The company cannot accept
> responsibility for any loss or damage arising from the use of this email or
> attachment."
>

Reply via email to