Aleksey Plekhanov created IGNITE-8166:
-----------------------------------------

             Summary: stopGrid() hangs in some cases when node is invalidated and PDS is enabled
                 Key: IGNITE-8166
                 URL: https://issues.apache.org/jira/browse/IGNITE-8166
             Project: Ignite
          Issue Type: Bug
    Affects Versions: 2.5
            Reporter: Aleksey Plekhanov


Node invalidation via FailureProcessor can hang {{exchange-worker}} and {{stopGrid()}} when PDS is enabled.

Reproducer (it is racy and sometimes finishes without hanging):
{code:java}
import java.util.HashMap;
import java.util.Map;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureContext;
import org.apache.ignite.failure.FailureHandler;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.internal.IgniteEx;
import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;

public class StopNodeHangsTest extends GridCommonAbstractTest {
    /** Offheap size for memory policy. */
    private static final int SIZE = 10 * 1024 * 1024;

    /** Page size. */
    static final int PAGE_SIZE = 2048;

    /** Number of entries. */
    static final int ENTRIES = 2_000;

    /** {@inheritDoc} */
    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
        IgniteConfiguration cfg = super.getConfiguration(igniteInstanceName);

        DataStorageConfiguration dsCfg = new DataStorageConfiguration();

        // Small persistent data region, so that checkpoints are involved.
        DataRegionConfiguration dfltPlcCfg = new DataRegionConfiguration();

        dfltPlcCfg.setName("dfltPlc");
        dfltPlcCfg.setInitialSize(SIZE);
        dfltPlcCfg.setMaxSize(SIZE);
        dfltPlcCfg.setPersistenceEnabled(true);

        dsCfg.setDefaultDataRegionConfiguration(dfltPlcCfg);
        dsCfg.setPageSize(PAGE_SIZE);

        cfg.setDataStorageConfiguration(dsCfg);

        // The handler only reports the node as invalidated; it does not stop it.
        cfg.setFailureHandler(new FailureHandler() {
            @Override public boolean onFailure(Ignite ignite, FailureContext failureCtx) {
                return true;
            }
        });

        return cfg;
    }

    public void testStopNodeHangs() throws Exception {
        cleanPersistenceDir();

        IgniteEx ignite0 = startGrid(0);
        IgniteEx ignite1 = startGrid(1);

        ignite1.cluster().active(true);

        awaitPartitionMapExchange();

        IgniteCache<Integer, Object> cache = ignite1.getOrCreateCache("TEST");

        Map<Integer, Object> entries = new HashMap<>();

        for (int i = 0; i < ENTRIES; i++)
            entries.put(i, new byte[PAGE_SIZE * 2 / 3]);

        cache.putAll(entries);

        // Invalidate node 1 through the failure processor.
        ignite1.context().failure().process(new FailureContext(FailureType.CRITICAL_ERROR, null));

        stopGrid(0);
        stopGrid(1); // Hangs here when the race is hit.
    }
}
{code}

{{stopGrid(1)}} waits until the exchange finishes. The {{exchange-worker}} thread waits in {{GridCacheDatabaseSharedManager#checkpointReadLock}} for {{CheckpointProgressSnapshot#cpBeginFut}}, but this future is never completed: {{db-checkpoint-thread}} gets an exception in {{GridCacheDatabaseSharedManager.Checkpointer#markCheckpointBegin}}, thrown by {{FileWriteAheadLogManager#checkNode}}, and leaves {{markCheckpointBegin}} before the future is completed ({{curr.cpBeginFut.onDone();}}).
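
The hang mechanism described above can be illustrated outside Ignite. The following is a minimal standalone sketch, not Ignite code and not a proposed patch: it uses {{java.util.concurrent.CompletableFuture}} as a stand-in for Ignite's internal future, and the class name {{CheckpointFutureSketch}} and the methods {{markBeginLeaky}}, {{markBeginSafe}} and {{await}} are hypothetical names introduced only for illustration. It shows that a waiter never sees completion when an exception skips the completing call, and that completing the future exceptionally on the failure path is one possible way to release the waiter.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CheckpointFutureSketch {
    /** Hypothetical stand-in for a node check that throws once the node is invalidated. */
    public static void checkNode(boolean invalidated) {
        if (invalidated)
            throw new IllegalStateException("node is invalidated");
    }

    /** Mirrors the reported flow: an exception leaves the method before the future is completed. */
    public static void markBeginLeaky(CompletableFuture<Void> cpBeginFut, boolean invalidated) {
        checkNode(invalidated);

        cpBeginFut.complete(null); // Skipped when checkNode throws.
    }

    /** Assumed remedy: complete the future on every path, exceptionally on failure. */
    public static void markBeginSafe(CompletableFuture<Void> cpBeginFut, boolean invalidated) {
        try {
            checkNode(invalidated);

            cpBeginFut.complete(null);
        }
        catch (RuntimeException e) {
            cpBeginFut.completeExceptionally(e); // Waiters are released with the error.

            throw e;
        }
    }

    /** Waits with a timeout (the real waiter has none) and reports how the wait ended. */
    public static String await(CompletableFuture<Void> fut, long timeoutMs) {
        try {
            fut.get(timeoutMs, TimeUnit.MILLISECONDS);

            return "done";
        }
        catch (ExecutionException e) {
            return "failed";
        }
        catch (TimeoutException e) {
            return "timeout";
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();

            return "interrupted";
        }
    }

    public static void main(String[] args) {
        CompletableFuture<Void> leaky = new CompletableFuture<>();

        try {
            markBeginLeaky(leaky, true);
        }
        catch (IllegalStateException ignored) {
            // The checkpoint thread would die here; the future stays incomplete.
        }

        System.out.println("leaky: " + await(leaky, 200)); // prints "leaky: timeout"

        CompletableFuture<Void> safe = new CompletableFuture<>();

        try {
            markBeginSafe(safe, true);
        }
        catch (IllegalStateException ignored) {
            // Same failure, but the future was completed exceptionally first.
        }

        System.out.println("safe: " + await(safe, 200)); // prints "safe: failed"
    }
}
```

In the sketch the waiter uses a timeout only so that the program terminates; a waiter without a timeout, as in the reported case, would block indefinitely on the leaky variant.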