Hello!
Oct 12 17:49:41 nalrcsvridbq02 kernel: watchdog: BUG: soft lockup - CPU#0
stuck for 38s! [kworker/u256:0:3703404]
This is bad. The kernel reports that CPU#0 was stuck for 38 seconds.
That is long enough to exceed the failure detection timeout and kill an instance,
or at least for the other nodes to segment this node from the cluster.
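
If you want to check the rest of your syslog for the same problem, here is a minimal sketch in Python. It looks for soft lockup messages like the one above and compares the stuck time against a timeout. The function name and the 10-second value are my assumptions (10 000 ms is, as far as I know, the Ignite default for failureDetectionTimeout); substitute whatever your configuration actually sets.

```python
import re

# Matches kernel watchdog messages such as
# "watchdog: BUG: soft lockup - CPU#0 stuck for 38s! [kworker/u256:0:3703404]"
SOFT_LOCKUP = re.compile(r"soft lockup - CPU#(\d+)\s+stuck for (\d+)s!")

# Assumption: the Ignite default failureDetectionTimeout of 10 000 ms.
# Replace with the value from your own configuration.
FAILURE_DETECTION_TIMEOUT_S = 10


def find_soft_lockups(log_text, timeout_s=FAILURE_DETECTION_TIMEOUT_S):
    """Return (cpu, stuck_seconds, exceeds_timeout) for each soft lockup found."""
    # Mail clients and pagers wrap long log lines, so collapse all
    # whitespace into single spaces before matching.
    flat = " ".join(log_text.split())
    return [
        (int(cpu), int(secs), int(secs) > timeout_s)
        for cpu, secs in SOFT_LOCKUP.findall(flat)
    ]


sample = (
    "Oct 12 17:49:41 nalrcsvridbq02 kernel: watchdog: BUG: soft lockup - CPU#0\n"
    "stuck for 38s! [kworker/u256:0:3703404]\n"
)
for cpu, secs, exceeds in find_soft_lockups(sample):
    print(f"CPU#{cpu} stuck for {secs}s, exceeds timeout: {exceeds}")
```

Any stall reported as exceeding the timeout is long enough for the other nodes to consider this one failed, whatever Ignite itself was doing at the time.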
Regards,
--
Ilya Kasnacheev
Wed, 14 Oct 2020 at 16:00, bbellrose:
> Oct 12 17:47:39 nalrcsvridbq02 Ignite[2031634]: [17:47:39] Possible failure
> suppressed accordingly to a configured handler
> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]],
> failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class
> o.a.i.IgniteException: GridWorker [name=nio-acceptor-tcp-comm,
> igniteInstanceName=RailConnect Ignite QA Grid, finished=false,
> heartbeatTs=1602539219763]]]
> Oct 12 17:48:20 nalrcsvridbq02 Ignite[2031634]: [17:48:20] Possible failure
> suppressed accordingly to a configured handler
> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]],
> failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class
> o.a.i.IgniteException: GridWorker [name=grid-nio-worker-tcp-comm-1,
> igniteInstanceName=RailConnect Ignite QA Grid, finished=false,
> heartbeatTs=1602539260020]]]
> Oct 12 17:48:20 nalrcsvridbq02 Ignite[2031634]: [17:48:20] Possible failure
> suppressed accordingly to a configured handler
> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]],
> failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class
> o.a.i.IgniteException: GridWorker [name=ttl-cleanup-worker,
> igniteInstanceName=RailConnect Ignite QA Grid, finished=false,
> heartbeatTs=1602539260020]]]
> Oct 12 17:48:20 nalrcsvridbq02 Ignite[2031634]: [17:48:20] Possible failure
> suppressed accordingly to a configured handler
> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]],
> failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class
> o.a.i.IgniteException: GridWorker [name=db-checkpoint-thread,
> igniteInstanceName=RailConnect Ignite QA Grid, finished=false,
> heartbeatTs=1602539260006]]]
> Oct 12 17:49:00 nalrcsvridbq02 chronyd[1216]: Forward time jump detected!
> Oct 12 17:49:00 nalrcsvridbq02 chronyd[1216]: Can't synchronise: no
> selectable sources
> Oct 12 17:49:00 nalrcsvridbq02 process-agent[2039258]: 2020-10-12 17:49:00
> EDT | PROCESS | INFO | (collector.go:209 in func1) | Delivery queues:
> process[size=0, weight=0], pod[size=0, weight=0]
> Oct 12 17:49:00 nalrcsvridbq02 agent[2039257]: 2020-10-12 17:49:00 EDT |
> CORE | ERROR | (pkg/forwarder/worker.go:178 in process) | Error while
> processing transaction: error while sending transaction, rescheduling it:
> Post https://7-22-1-app.agent.datadoghq.com/api/v1/series?api_key=*44602:
> net/http: request canceled (Client.Timeout exceeded while awaiting headers)
> Oct 12 17:49:00 nalrcsvridbq02 trace-agent[2039259]: 2020-10-12 17:49:00 EDT
> | TRACE | INFO | (pkg/trace/info/stats.go:101 in LogStats) | No data received
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: [17:49:00] Topology
> snapshot
> [ver=21, locNode=2eca41b3, servers=1, clients=0, state=ACTIVE, CPUs=2,
> offheap=2.0GB, heap=0.25GB]
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: [17:49:00] ^-- Baseline
> [id=0, size=2, online=1, offline=1]
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: [17:49:00] (err) Failed to
> execute compound future reducer: GridNearTxFinishFuture
> [futId=7a28b630571-b3eac955-0171-4b45-b048-84653e88427e, tx=GridNearTxLocal
> [mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping
> [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey
> [key=KeyCacheObject
> [hasValBytes=true], cacheId=-27866919], val=BinaryObject
> [idHash=1523169004,
> hash=1743117496][op=CREATE, val=], prevVal=[op=NOOP, val=null],
> oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
> conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null,
> filters=CacheEntryPredicate[] [], filtersPassed=false, filtersSet=true,
> entry=GridDhtDetachedCacheEntry [super=GridDistributedCacheEntry
> [super=GridCacheMapEntry [key=KeyCacheObject [hasValBytes=true], val=null,
> ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=684422756,
> extras=null, flags=0]]], prepared=0, locked=false,
> nodeId=3f8b7981-ee81-4ca4-9f52-5c6f03cb8cee, locMapped=false,
> expiryPlc=null, transferExpiryPlc=false, flags=2, partUpdateCntr=0,
>