Aleksey Plekhanov created IGNITE-12845:
------------------------------------------

             Summary: GridNioServer can infinitely lose some events 
                 Key: IGNITE-12845
                 URL: https://issues.apache.org/jira/browse/IGNITE-12845
             Project: Ignite
          Issue Type: Bug
            Reporter: Aleksey Plekhanov


With the selector optimization enabled ({{IGNITE_NO_SELECTOR_OPTS = false}}, the 
default), {{GridNioServer}} can lose some events for a channel (depending on JDK 
version and OS). This can cause connected applications to hang. Reproducer: 

 
{code:java}
    public void testConcurrentLoad() throws Exception {
        startGrid(0);

        try (IgniteClient client = Ignition.startClient(
            new ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
            ClientCache<Integer, Integer> cache = client.getOrCreateCache(DEFAULT_CACHE_NAME);

            GridTestUtils.runMultiThreaded(
                () -> {
                    for (int i = 0; i < 1000; i++)
                        cache.put(i, i);
                }, 5, "run-async");
        }
    }
{code}
This reproducer eventually hangs on macOS (tested with JDK 8, 11, 12, 13, 14) 
and on some Linux environments (for example, it passed more than 100 times on a 
desktop Linux system with JDK 8, but hangs on TeamCity agents with JDK 8 and 11). 
It has never hung on Windows (passed more than 100 times). It passes on all 
systems and JDK versions when the system property {{IGNITE_NO_SELECTOR_OPTS = 
true}} is set.

 

The root cause: the optimized {{SelectedSelectionKeySet}} always returns {{false}} 
from its {{contains()}} method. The {{contains()}} method is used by the 
{{sun.nio.ch.SelectorImpl.processReadyEvents()}} method:

 
{code:java}
if (selectedKeys.contains(ski)) {
    if (ski.translateAndUpdateReadyOps(rOps)) {
        return 1;
    }
} else {
    ski.translateAndSetReadyOps(rOps);
    if ((ski.nioReadyOps() & ski.nioInterestOps()) != 0) {
        selectedKeys.add(ski);
        return 1;
    }
}
{code}
So, with a fair implementation, if a selection key is already contained in the 
selected keys set, its ready operations flags are updated. With 
{{SelectedSelectionKeySet}}, however, the ready operations flags are always 
overwritten and the selection key is added again even if it is already in the 
set. Some {{SelectorImpl}} implementations can pass several events for one 
selection key to the {{processReadyEvents}} method (for example, the macOS 
implementation {{KQueueSelectorImpl}} works this way). In this case, duplicate 
selection keys are added to {{selectedKeys}} and all events except the last are 
lost.
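
The effect can be reproduced in isolation with a toy model. The sketch below is 
not Ignite or JDK code: {{FakeKey}}, {{AppendOnlyKeySet}} and 
{{processReadyEvent}} are hypothetical stand-ins, and the interest-ops check 
from the JDK snippet is left out for brevity. It assumes that "update" merges 
the new ready ops into the existing ones while "set" overwrites them, and it 
delivers a read event followed by a write event for the same key.

```java
import java.util.AbstractSet;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Toy model of the selected-keys bookkeeping; not Ignite or JDK code. */
final class SelectedKeysDemo {
    // Same bit values as java.nio.channels.SelectionKey.OP_READ / OP_WRITE.
    static final int OP_READ = 1;
    static final int OP_WRITE = 4;

    /** Hypothetical stand-in for a selection key: only tracks ready ops. */
    static final class FakeKey {
        int readyOps;
    }

    /** Hypothetical array-backed "set" whose contains() always returns false. */
    static final class AppendOnlyKeySet extends AbstractSet<FakeKey> {
        private final List<FakeKey> keys = new ArrayList<>();

        @Override public boolean add(FakeKey key) { return keys.add(key); }
        @Override public boolean contains(Object o) { return false; }
        @Override public Iterator<FakeKey> iterator() { return keys.iterator(); }
        @Override public int size() { return keys.size(); }
    }

    /** Simplified version of the branch structure in processReadyEvents. */
    static void processReadyEvent(Set<FakeKey> selectedKeys, FakeKey key, int rOps) {
        if (selectedKeys.contains(key))
            key.readyOps |= rOps;   // "update": merge with ops already reported
        else {
            key.readyOps = rOps;    // "set": overwrite whatever was there
            selectedKeys.add(key);
        }
    }

    /** Delivers READ then WRITE for one key; returns {set size, final readyOps}. */
    static int[] simulate(Set<FakeKey> selectedKeys) {
        FakeKey key = new FakeKey();

        processReadyEvent(selectedKeys, key, OP_READ);
        processReadyEvent(selectedKeys, key, OP_WRITE);

        return new int[] {selectedKeys.size(), key.readyOps};
    }

    public static void main(String[] args) {
        int[] fair = simulate(new HashSet<>());
        int[] opt = simulate(new AppendOnlyKeySet());

        // Prints "fair: size=1 readyOps=5" (READ | WRITE, one entry).
        System.out.println("fair: size=" + fair[0] + " readyOps=" + fair[1]);
        // Prints "optimized: size=2 readyOps=4" (READ lost, key duplicated).
        System.out.println("optimized: size=" + opt[0] + " readyOps=" + opt[1]);
    }
}
```

With the always-false {{contains()}}, the same key ends up in the set twice and 
only the write flag survives, which matches both symptoms listed for 
{{GridNioServer}}.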

Two bad things happen in {{GridNioServer}} for the reasons described above:
 # Some event flags are lost and the worker doesn't perform the corresponding 
action (in the attached reproducer the "channel is ready for reading" event is 
lost and, after some point in time, the workers never read the channel again).
 # Selection keys are duplicated with the same event flags (in the attached 
reproducer it's the "channel is ready for writing" event). This duplication 
leads to wrong processing of the {{GridSelectorNioSessionImpl#procWrite}} flag, 
which will be {{false}} in some cases while the selection key's 
{{interestedOps}} still contains {{OP_WRITE}}, and this operation is never 
excluded.

Possible solutions:
 * A fair implementation of the {{SelectedSelectionKeySet.contains}} method 
(this solves all the problems but can be resource-consuming).
 * Always set {{GridSelectorNioSessionImpl#procWrite}} to {{true}} when adding 
{{OP_WRITE}} to {{interestedOps}} (for example, in the 
{{AbstractNioClientWorker.registerWrite()}} method). In this case, some 
"channel is ready for reading" events (but not data) can still be lost, but not 
indefinitely, and the data will eventually be read.
 * Exclude {{OP_WRITE}} from {{interestedOps}} even if 
{{GridSelectorNioSessionImpl#procWrite}} is {{false}} when there are no write 
requests in the queue (see the {{GridNioServer.stopPollingForWrite()}} method). 
This solution has the same shortcomings as the previous one. 
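
For the first option, one possible shape of a fair key set is sketched below. 
It is an illustration under assumptions, not the actual 
{{SelectedSelectionKeySet}}: the class name {{FairKeySet}} is hypothetical, and 
it keeps the cheap array-style iteration while tracking membership in an 
identity-based set. The extra map lookup on every {{add()}} and {{contains()}} 
call is the kind of resource cost mentioned above.

```java
import java.util.AbstractSet;
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch, not the Ignite class: list-backed key set with a truthful contains(). */
final class FairKeySet<E> extends AbstractSet<E> {
    /** Keys in insertion order, for fast iteration by the worker. */
    private final List<E> keys = new ArrayList<>();

    /** Identity-based membership index (keys are compared by reference). */
    private final Set<E> members = Collections.newSetFromMap(new IdentityHashMap<>());

    /** Adds the key only if it is not already present, so no duplicates appear. */
    @Override public boolean add(E key) {
        if (!members.add(key))
            return false;

        keys.add(key);

        return true;
    }

    /** Reports real membership, so a selector can take its "update" branch. */
    @Override public boolean contains(Object o) {
        return members.contains(o);
    }

    /** Read-only iteration; the set is meant to be emptied with clear(). */
    @Override public Iterator<E> iterator() {
        return Collections.unmodifiableList(keys).iterator();
    }

    @Override public int size() {
        return keys.size();
    }

    /** Drops all keys, e.g. after a selection round has been processed. */
    @Override public void clear() {
        keys.clear();
        members.clear();
    }
}
```

Adding the same key twice to such a set leaves one entry, and {{contains()}} 
returns {{true}} for it until {{clear()}} is called.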

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
