Hi all,
We have recently started running into an issue. We run 2 client instances and
26 Apache Ignite instances, all on AWS r4.2xlarge nodes. When a thread tries to
fetch an atomicLong or atomicReference, it gets stuck and never returns. This
usually happens on 1 or 2 Ignite instances at a time. I am not sure why this
happens, so any help would be really appreciated. We use Ignite 2.7.5.
This is the thread dump taken while trying to get an atomicReference:
"main" #1 prio=5 os_prio=0 cpu=3528.41ms elapsed=1067.33s allocated=312M
defined_classes=9309 tid=0x00007f4ce4046fc0 nid=0x1537 waiting on condition
[0x00007f4cece90000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
    - parking to wait for <0x00007f4cbfe7c7d0> (a java.util.concurrent.CountDownLatch$Sync)
    at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly([email protected]/AbstractQueuedSynchronizer.java:1039)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly([email protected]/AbstractQueuedSynchronizer.java:1345)
    at java.util.concurrent.CountDownLatch.await([email protected]/CountDownLatch.java:232)
    at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7612)
    at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.awaitInitialization(DataStructuresProcessor.java:1147)
    at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.getAtomic(DataStructuresProcessor.java:506)
    at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.atomicReference(DataStructuresProcessor.java:744)
    at org.apache.ignite.internal.IgniteKernal.atomicReference(IgniteKernal.java:3743)
    at org.apache.ignite.internal.IgniteKernal.atomicReference(IgniteKernal.java:3732)
    at company.explore.cache.persist.SavedAudienceLocationProvider.getSavedAudienceLocation(SavedAudienceLocationProvider.java:27)
    at company.explore.listeners.lifecycle.LifecycleListener.configureSavedAudienceLocation(LifecycleListener.java:45)
    at company.explore.listeners.lifecycle.LifecycleListener.onLifecycleEvent(LifecycleListener.java:38)
    at org.apache.ignite.internal.IgniteKernal.notifyLifecycleBeans(IgniteKernal.java:725)
    at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1156)
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2038)
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1730)
    - locked <0x00007f4cbf072a38> (a org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance)
    at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
    at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1076)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:962)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:861)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:731)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:700)
    at org.apache.ignite.Ignition.start(Ignition.java:348)
    at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:301)
Because the main thread is stuck inside node startup, any Ignition.ignite()
call blocks as well, so the job never goes through:
"pub-#22" #48 prio=5 os_prio=0 cpu=5.76ms elapsed=1036.50s allocated=421K
defined_classes=6 tid=0x00007f4ce4cf3990 nid=0x1607 waiting on condition
[0x00007f40375f6000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
    - parking to wait for <0x00007f4cbf16d9e0> (a java.util.concurrent.CountDownLatch$Sync)
    at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly([email protected]/AbstractQueuedSynchronizer.java:1039)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly([email protected]/AbstractQueuedSynchronizer.java:1345)
    at java.util.concurrent.CountDownLatch.await([email protected]/CountDownLatch.java:232)
    at org.apache.ignite.internal.util.IgniteUtils.awaitQuiet(IgniteUtils.java:7657)
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.grid(IgnitionEx.java:1671)
    at org.apache.ignite.internal.IgnitionEx.grid(IgnitionEx.java:1389)
    at org.apache.ignite.internal.IgnitionEx.grid(IgnitionEx.java:1258)
    at org.apache.ignite.Ignition.ignite(Ignition.java:489)
    at company.explore.dataload.person.LoadPersonAttributeJob.call(LoadPersonAttributeJob.java:58)
    at company.explore.dataload.person.LoadPersonAttributeJob.call(LoadPersonAttributeJob.java:31)
    at org.apache.ignite.internal.processors.closure.GridClosureProcessor$C2.execute(GridClosureProcessor.java:1855)
    at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:568)
    at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6817)
    at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:562)
    at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
    at java.lang.Thread.run([email protected]/Thread.java:834)
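To illustrate how I read the two dumps above, here is a minimal sketch that
models the chain of waits with plain java.util.concurrent latches. This is only
my interpretation, not Ignite's actual code: the class name and the latch names
(initLatch, startLatch) are made up for illustration. The "starter" stands in
for the main thread waiting in awaitInitialization, and the "worker" stands in
for the job thread waiting inside Ignition.ignite().

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class StartupLatchChain {
    /**
     * Returns true if the worker saw the "node started" signal within timeoutMs.
     * If initCompletes is false, the initialization latch is never released,
     * which mimics the stuck state in the dumps.
     */
    public static boolean workerGetsNode(boolean initCompletes, long timeoutMs) throws Exception {
        CountDownLatch initLatch = new CountDownLatch(1);  // hypothetical: data structures initialized
        CountDownLatch startLatch = new CountDownLatch(1); // hypothetical: node fully started

        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Starter: startup cannot finish until initialization has finished.
            Runnable starter = () -> {
                try {
                    initLatch.await();
                    startLatch.countDown();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };
            pool.submit(starter);

            // Worker: waits for startup, but with a timeout so the demo terminates.
            Callable<Boolean> worker = () -> startLatch.await(timeoutMs, TimeUnit.MILLISECONDS);
            java.util.concurrent.Future<Boolean> seen = pool.submit(worker);

            if (initCompletes) {
                initLatch.countDown();
            }
            return seen.get();
        } finally {
            pool.shutdownNow(); // interrupts the starter if it is still waiting
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("init completes:       " + workerGetsNode(true, 2000));
        System.out.println("init never completes: " + workerGetsNode(false, 200));
    }
}
```

In the second case the worker can only ever time out, which matches what we
see: as long as the first wait is not released, every thread behind it stays
parked.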
Similarly, this is a thread waiting on a CountDownLatch while trying to get an
atomicLong:
"pub-#489" #608 prio=5 os_prio=0 cpu=16.80ms elapsed=7076.10s allocated=2409K
defined_classes=17 tid=0x00007f48c8014c60 nid=0x5bd5 waiting on condition
[0x00007f48359e1000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
    - parking to wait for <0x00007f518aba6060> (a java.util.concurrent.CountDownLatch$Sync)
    at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly([email protected]/AbstractQueuedSynchronizer.java:1039)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly([email protected]/AbstractQueuedSynchronizer.java:1345)
    at java.util.concurrent.CountDownLatch.await([email protected]/CountDownLatch.java:232)
    at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7612)
    at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.awaitInitialization(DataStructuresProcessor.java:1147)
    at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.getAtomic(DataStructuresProcessor.java:506)
    at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.atomicLong(DataStructuresProcessor.java:463)
    at org.apache.ignite.internal.IgniteKernal.atomicLong(IgniteKernal.java:3716)
    at org.apache.ignite.internal.IgniteKernal.atomicLong(IgniteKernal.java:3705)
    at company.explore.cache.persist.person.SerializationStatus.getSerializeCounter(SerializationStatus.java:86)
    at company.explore.cache.persist.person.SerializationStatus.startNodeSerialization(SerializationStatus.java:21)
    at company.explore.cache.persist.personv2.PersonSerializationJob.serializePeopleData(PersonSerializationJob.java:98)
    at company.explore.cache.persist.personv2.PersonSerializationJob.run(PersonSerializationJob.java:75)
    at org.apache.ignite.internal.processors.closure.GridClosureProcessor$C4.execute(GridClosureProcessor.java:1944)
    at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:568)
    at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6817)
    at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:562)
    at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
    at java.lang.Thread.run([email protected]/Thread.java:834)
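As a stopgap while the root cause is unknown, the lookup could be bounded so
that a job thread does not park forever. Below is a minimal sketch using only
the JDK. BoundedCall and callWithTimeout are hypothetical names, not part of
Ignite or of our code, and I have not verified this against a real cluster. It
relies on the blocked call reacting to interruption, which the dump suggests
(the wait goes through acquireSharedInterruptibly), but that is an assumption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedCall {
    /**
     * Runs the call on a helper thread and waits at most timeoutMs for its result.
     * Throws TimeoutException if the call has not returned in time; the helper
     * thread is then interrupted via shutdownNow().
     */
    public static <T> T callWithTimeout(Callable<T> call, long timeoutMs) throws Exception {
        ExecutorService helper = Executors.newSingleThreadExecutor();
        try {
            Future<T> result = helper.submit(call);
            return result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            helper.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a lookup that never returns; in our case this would be
        // something like () -> ignite.atomicLong(name, 0, true).
        CountDownLatch neverReleased = new CountDownLatch(1);
        try {
            callWithTimeout(() -> {
                neverReleased.await();
                return "unreachable";
            }, 200);
        } catch (TimeoutException e) {
            System.out.println("lookup did not return within 200 ms");
        }
    }
}
```

This would not fix the hang itself; it would only let the job fail fast and
log the problem instead of holding a public pool thread indefinitely.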
These issues only started appearing in the past 2 months or so; the system had
been very stable for a long time before that. Since the issue is not
consistent, I am not sure how to create a reproducer project, but I can provide
logs or anything else if needed.
The full thread dumps are too large to include inline, so I have posted them on
pastebin. Please find the links below:
Atomic Reference related thread dump: pastebin.com/ydNMFSEP
Atomic Long related thread dump: pastebin.com/psJgwi3F
Any help is much appreciated. Thanks!
Best Regards,
Paul