[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150146#comment-17150146 ]

Alexey Goncharuk commented on IGNITE-12845:
---

Looks good to me, thanks!

> GridNioServer can infinitely lose some events
> --
>
>                 Key: IGNITE-12845
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12845
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Aleksey Plekhanov
>            Assignee: Aleksey Plekhanov
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> With enabled optimization (IGNITE_NO_SELECTOR_OPTS = false, by default)
> {{GridNioServer}} can lose some events for a channel (depending on JDK
> version and OS). It can lead to connected applications hang. Reproducer:
> {code:java}
> public void testConcurrentLoad() throws Exception {
>     startGrid(0);
>     try (IgniteClient client = Ignition.startClient(new
>         ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>         ClientCache cache = client.getOrCreateCache(DEFAULT_CACHE_NAME);
>         GridTestUtils.runMultiThreaded(
>             () -> {
>                 for (int i = 0; i < 1000; i++)
>                     cache.put(i, i);
>             }, 5, "run-async");
>     }
> }
> {code}
> This reproducer hangs eventually on MacOS (tested with JDK 8, 11, 12, 13,
> 14), hangs on Windows with some JDK versions (tested with JDK 11, 14), but
> passes on Windows with JDK 8, Linux systems, or when system property
> {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>
> The root cause: optimized {{SelectedSelectionKeySet}} always returns
> {{false}} for {{contains()}} method. The {{contains()}} method used by
> {{sun.nio.ch.SelectorImpl.processReadyEvents()}} method:
> {code:java}
> if (selectedKeys.contains(ski)) {
>     if (ski.translateAndUpdateReadyOps(rOps)) {
>         return 1;
>     }
> } else {
>     ski.translateAndSetReadyOps(rOps);
>     if ((ski.nioReadyOps() & ski.nioInterestOps()) != 0) {
>         selectedKeys.add(ski);
>         return 1;
>     }
> }
> {code}
> So, for fair implementation, if a selection key is contained in the selected
> keys set, then ready operations flags are updated, but for
> {{SelectedSelectionKeySet}} ready operations flags will be always overridden
> and new selector key will be added even if it's already contained in the set.
>
> Some {{SelectorImpl}} implementations can pass several events for one
> selector key to {{processReadyEvents}} method (for example, MacOs
> implementation {{KQueueSelectorImpl}} works in such a way). In this case,
> duplicated selector keys will be added to {{selectedKeys}} and all events
> except last will be lost.
>
> Two bad things happen in {{GridNioServer}} due to described above reasons:
> # Some event flags are lost and the worker doesn't process corresponding
> action (for attached reproducer "channel is ready for reading" event is lost
> and the workers never read the channel after some point in time).
> # Duplicated selector keys with the same event flags (for attached
> reproducer it's "channel is ready for writing" event, this duplication leads
> to wrong processing of {{GridSelectorNioSessionImpl#procWrite}} flag, which
> will be {{false}} in some cases, but at the same time selector key's
> {{interestedOps}} will contain {{OP_WRITE}} operation and this operation
> never be excluded)
>
> Possible solutions:
> * Fair implementation of {{SelectedSelectionKeySet.contains}} method (this
> will solve all problems but can be resource consuming)
> * Always set {{GridSelectorNioSessionImpl#procWrite}} to {{true}} when
> adding {{OP_WRITE}} to {{interestedOps}} (for example in
> {{AbstractNioClientWorker.registerWrite()}} method). In this case, some
> "channel is ready for reading" events (but not data) still can be lost, but
> not infinitely, and eventually data will be read. If events will be reordered
> (first "channel is ready for writing", after it "channel is ready for
> reading") then write to the channel will be only processed after all data
> will be read.
> * Exclude {{OP_WRITE}} from {{interestedOps}} even if
> {{GridSelectorNioSessionImpl#procWrite}} is {{false}} when there are no write
> requests in the queue (see {{GridNioServer.stopPollingForWrite()}} method).
> This solution has the same shortcomings as the previous one.
> * Hybrid approach. Use some probabilistic implementation for {{contains}}
> method (bloom filter or just check the last element) and use one of two
> previous solutions as a workaround, for cases when we incorrectly return
> {{false}} for {{contains}}.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
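
The root cause described in the ticket can be illustrated with a small self-contained model. This is only a sketch under stated assumptions: {{Key}}, {{KeySet}} and {{processReadyEvent}} are hypothetical stand-ins written for this illustration, not the JDK's or Ignite's classes, and the branch logic is simplified from the {{processReadyEvents()}} fragment quoted in the ticket (the interest-ops check is omitted). It delivers a "ready for reading" event followed by a "ready for writing" event for the same key, once with a set whose {{contains()}} always returns {{false}} and once with a fair one.

```java
import java.util.AbstractSet;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class LostEventDemo {
    // Same bit values as java.nio.channels.SelectionKey.OP_READ / OP_WRITE.
    public static final int OP_READ = 1;
    public static final int OP_WRITE = 4;

    /** Hypothetical stand-in for a selection key; it only tracks the ready-ops bits. */
    public static final class Key {
        int readyOps;
    }

    /** List-backed set. With fair == false, contains() always answers false. */
    public static final class KeySet extends AbstractSet<Key> {
        private final List<Key> keys = new ArrayList<>();
        private final boolean fair;

        KeySet(boolean fair) { this.fair = fair; }

        @Override public boolean add(Key k) { return keys.add(k); }
        @Override public boolean contains(Object o) { return fair && keys.contains(o); }
        @Override public Iterator<Key> iterator() { return keys.iterator(); }
        @Override public int size() { return keys.size(); }
    }

    /** Simplified model of the branch quoted from processReadyEvents(). */
    public static void processReadyEvent(Set<Key> selected, Key key, int rOps) {
        if (selected.contains(key))
            key.readyOps |= rOps;   // "Update": merge the new bits into the existing ones.
        else {
            key.readyOps = rOps;    // "Set": overwrite whatever was there before.
            selected.add(key);
        }
    }

    /** Delivers READ and then WRITE for one key; returns {selected-set size, final ready ops}. */
    public static int[] simulate(boolean fair) {
        KeySet selected = new KeySet(fair);
        Key key = new Key();
        processReadyEvent(selected, key, OP_READ);
        processReadyEvent(selected, key, OP_WRITE);
        return new int[] {selected.size(), key.readyOps};
    }

    public static void main(String[] args) {
        int[] broken = simulate(false);
        int[] fair = simulate(true);
        System.out.println("always-false contains: size=" + broken[0] + ", readyOps=" + broken[1]);
        System.out.println("fair contains:         size=" + fair[0] + ", readyOps=" + fair[1]);
    }
}
```

In this model the always-false variant ends up with the key stored twice and only the write bit set, which mirrors both symptoms listed in the ticket (a lost read event and a duplicated key), while the fair variant keeps one entry with both bits set.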
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139264#comment-17139264 ]

Ignite TC Bot commented on IGNITE-12845:

{panel:title=Branch: [pull/7879/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *-- Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=5396155buildTypeId=IgniteTests24Java8_RunAll]
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139263#comment-17139263 ]

Aleksey Plekhanov commented on IGNITE-12845:

[~agoncharuk], as communication SPI maintainer, can you please review the patch?

I didn't add any new tests because some existing tests already exhibit this problem: they hang on MacOS and on Windows with JDK 11+ (ComputeTaskTest#testExecuteTaskConcurrentLoad, AsyncChannelTest#testConcurrentRequests, AsyncChannelTest#testConcurrentQueries).
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121352#comment-17121352 ]

Aleksey Plekhanov commented on IGNITE-12845:

[~ptupitsyn], most likely the NIO bug does not affect Linux systems (but it certainly affects MacOS). I found another bug in the Java thin client compute implementation (IGNITE-13106). I now think that the team-city hangs (which I mentioned in the original ticket description) were due to the compute bug, not the NIO bug (originally I tested compute, but later wrote a simplified reproducer with cache.put). Please have a look at IGNITE-13106, perhaps the .NET client has the same problems.

> GridNioServer can infinitely lose some events
> --
>
>                 Key: IGNITE-12845
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12845
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Aleksey Plekhanov
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> With enabled optimization (IGNITE_NO_SELECTOR_OPTS = false, by default)
> {{GridNioServer}} can lose some events for a channel (depending on JDK
> version and OS). It can lead to connected applications hang. Reproducer:
> {code:java}
> public void testConcurrentLoad() throws Exception {
>     startGrid(0);
>     try (IgniteClient client = Ignition.startClient(new
>         ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>         ClientCache cache = client.getOrCreateCache(DEFAULT_CACHE_NAME);
>         GridTestUtils.runMultiThreaded(
>             () -> {
>                 for (int i = 0; i < 1000; i++)
>                     cache.put(i, i);
>             }, 5, "run-async");
>     }
> }
> {code}
> This reproducer hangs eventually on MacOS (tested with JDK 8, 11, 12, 13,
> 14), hangs on some Linux environments (for example passed more than 100 times
> on desktop Linux system with JDK 8, but hangs on team-city agents with JDK 8,
> 11) and never hanged (passed more than 100 times) on windows system, but
> passes on all systems and JDK versions when system property
> {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>
> The root cause: optimized {{SelectedSelectionKeySet}} always returns
> {{false}} for {{contains()}} method. The {{contains()}} method used by
> {{sun.nio.ch.SelectorImpl.processReadyEvents()}} method:
> {code:java}
> if (selectedKeys.contains(ski)) {
>     if (ski.translateAndUpdateReadyOps(rOps)) {
>         return 1;
>     }
> } else {
>     ski.translateAndSetReadyOps(rOps);
>     if ((ski.nioReadyOps() & ski.nioInterestOps()) != 0) {
>         selectedKeys.add(ski);
>         return 1;
>     }
> }
> {code}
> So, for fair implementation, if a selection key is contained in the selected
> keys set, then ready operations flags are updated, but for
> {{SelectedSelectionKeySet}} ready operations flags will be always overridden
> and new selector key will be added even if it's already contained in the set.
>
> Some {{SelectorImpl}} implementations can pass several events for one
> selector key to {{processReadyEvents}} method (for example, MacOs
> implementation {{KQueueSelectorImpl}} works in such a way). In this case,
> duplicated selector keys will be added to {{selectedKeys}} and all events
> except last will be lost.
>
> Two bad things happen in {{GridNioServer}} due to described above reasons:
> # Some event flags are lost and the worker doesn't process corresponding
> action (for attached reproducer "channel is ready for reading" event is lost
> and the workers never read the channel after some point in time).
> # Duplicated selector keys with the same event flags (for attached
> reproducer it's "channel is ready for writing" event, this duplication leads
> to wrong processing of {{GridSelectorNioSessionImpl#procWrite}} flag, which
> will be {{false}} in some cases, but at the same time selector key's
> {{interestedOps}} will contain {{OP_WRITE}} operation and this operation
> never be excluded)
>
> Possible solutions:
> * Fair implementation of {{SelectedSelectionKeySet.contains}} method (this
> will solve all problems but can be resource consuming)
> * Always set {{GridSelectorNioSessionImpl#procWrite}} to {{true}} when
> adding {{OP_WRITE}} to {{interestedOps}} (for example in
> {{AbstractNioClientWorker.registerWrite()}} method). In this case, some
> "channel is ready for reading" events (but not data) still can be lost, but
> not infinitely, and eventually data will be read. If events will be reordered
> (first "channel is ready for writing", after it "channel is ready for
> reading") then write to the channel will be only processed after all data
> will be read.
> * Exclude {{OP_WRITE}} from {{interestedOps}} even if
> {{GridSelectorNioSessionImpl#procWrite}} is {{false}} when there are no write
>
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121055#comment-17121055 ]

Aleksey Plekhanov commented on IGNITE-12845:

[~ptupitsyn], can you please check locally your case against the attached pull-request?
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120439#comment-17120439 ]

Pavel Tupitsyn commented on IGNITE-12845:
-

[~alex_pl] I'm not really sure, but I think I faced this locally a few times while running thin client compute tests: the compute task never receives any completed/cancelled event and just hangs forever. OpenJDK 1.8.0_252, Ubuntu 20.04.
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119477#comment-17119477 ]

Aleksey Plekhanov commented on IGNITE-12845:

[~ptupitsyn] not yet. I've tried approaches 2 and 3 from the ticket description and they seem to work. If nobody takes this ticket before then, I think I can take it in a week or two. Did you face the same problem? Can you share details (OS, JDK)?
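
As a rough illustration of approach 3 from the ticket description (exclude {{OP_WRITE}} from the interest ops whenever the write queue is empty, even if {{procWrite}} is {{false}}), the sketch below uses a toy session model. Everything in it is an assumption made for illustration: the {{Session}} class, both {{stopPolling...}} methods and the way the flag gates the "guarded" variant are hypothetical and are not taken from the Ignite code base or from the attached pull request.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class WriteInterestDemo {
    // Same bit values as java.nio.channels.SelectionKey.OP_READ / OP_WRITE.
    public static final int OP_READ = 1;
    public static final int OP_WRITE = 4;

    /** Hypothetical session model; the field names only echo the ticket text. */
    public static final class Session {
        int interestOps = OP_READ;
        boolean procWrite;
        final Queue<String> writeQueue = new ArrayDeque<>();
    }

    /** Assumed "guarded" behaviour: OP_WRITE is cleared only while procWrite is true. */
    public static void stopPollingGuarded(Session ses) {
        if (ses.procWrite && ses.writeQueue.isEmpty()) {
            ses.procWrite = false;
            ses.interestOps &= ~OP_WRITE;
        }
    }

    /** Approach 3 in this model: clear OP_WRITE whenever the queue is empty, ignoring the flag. */
    public static void stopPollingUnconditional(Session ses) {
        if (ses.writeQueue.isEmpty()) {
            ses.procWrite = false;
            ses.interestOps &= ~OP_WRITE;
        }
    }

    /** Builds the inconsistent state from the ticket: OP_WRITE registered, procWrite == false. */
    public static Session inconsistentSession() {
        Session ses = new Session();
        ses.interestOps |= OP_WRITE;
        ses.procWrite = false;
        return ses;
    }

    public static void main(String[] args) {
        Session guarded = inconsistentSession();
        stopPollingGuarded(guarded);
        System.out.println("guarded keeps OP_WRITE: " + ((guarded.interestOps & OP_WRITE) != 0));

        Session fixed = inconsistentSession();
        stopPollingUnconditional(fixed);
        System.out.println("unconditional keeps OP_WRITE: " + ((fixed.interestOps & OP_WRITE) != 0));
    }
}
```

In this toy model the guarded variant leaves {{OP_WRITE}} registered forever once the flag and the interest ops disagree, whereas the unconditional variant recovers from that state. As the ticket notes, such a workaround does not stop events from being lost; it only keeps the loss from becoming permanent.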
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119356#comment-17119356 ]

Pavel Tupitsyn commented on IGNITE-12845:
-----------------------------------------

[~alex_pl] [~ivandasch] any updates on this issue?

--
This message was sent by Atlassian Jira (v8.3.4#803005)
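The root cause described in the ticket (a selected-key set whose {{contains()}} always returns {{false}}) can be modelled without a real selector. The sketch below is a simplified illustration, not Ignite or JDK code: `FakeKey`, `AlwaysMissSet`, and `LostEventsDemo` are hypothetical names, and the interest-ops check from the quoted JDK snippet is left out for brevity.

```java
import java.util.AbstractSet;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a selection key: it tracks only a ready-ops bit mask. */
final class FakeKey {
    static final int OP_READ = 1;   // same values as SelectionKey.OP_READ / OP_WRITE
    static final int OP_WRITE = 4;

    int readyOps;
}

/** Hypothetical set that, like the optimized set in the ticket, never reports membership. */
final class AlwaysMissSet extends AbstractSet<FakeKey> {
    private final List<FakeKey> keys = new ArrayList<>();

    @Override public boolean add(FakeKey k) { return keys.add(k); }
    @Override public boolean contains(Object o) { return false; }
    @Override public Iterator<FakeKey> iterator() { return keys.iterator(); }
    @Override public int size() { return keys.size(); }
}

final class LostEventsDemo {
    /** Simplified model of the branch quoted from processReadyEvents(). */
    static void processReadyEvent(Set<FakeKey> selected, FakeKey key, int rOps) {
        if (selected.contains(key))
            key.readyOps |= rOps;   // update: earlier flags are kept
        else {
            key.readyOps = rOps;    // set: earlier flags are overwritten
            selected.add(key);
        }
    }

    /** Delivers READ and then WRITE for one key; returns {readyOps, set size}. */
    static int[] deliverReadThenWrite(Set<FakeKey> selected) {
        FakeKey key = new FakeKey();
        processReadyEvent(selected, key, FakeKey.OP_READ);
        processReadyEvent(selected, key, FakeKey.OP_WRITE);
        return new int[] {key.readyOps, selected.size()};
    }

    public static void main(String[] args) {
        int[] fair = deliverReadThenWrite(new HashSet<>());
        int[] optimized = deliverReadThenWrite(new AlwaysMissSet());
        System.out.println("fair: readyOps=" + fair[0] + ", size=" + fair[1]);
        System.out.println("optimized: readyOps=" + optimized[0] + ", size=" + optimized[1]);
    }
}
```

With the fair set, the key ends up with both flags and appears once. With the always-miss set, the READ flag is overwritten by WRITE and the same key is stored twice, which matches the two symptoms listed in the ticket.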
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074571#comment-17074571 ]

Ivan Daschinskiy commented on IGNITE-12845:
-------------------------------------------

[~alex_pl] Yes, it now seems to me that you are right. For now we can do a quick bugfix (your proposal 2 or 3) and plan to rewrite SelectedSelectionKeySet without rushing. I will create a separate ticket and start a discussion on the dev-list soon.
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074547#comment-17074547 ]

Ivan Daschinskiy commented on IGNITE-12845:
-------------------------------------------

[~alex_pl] From oracle jdk8:
{code:java}
public Set<SelectionKey> keys() {
    if (!this.isOpen() && !Util.atBugLevel("1.4")) {
        throw new ClosedSelectorException();
    } else {
        return this.publicKeys;
    }
}
{code}
Yes, the current netty implementation doesn't contain a contains method. I was just replying to your point about reset(). I suggest improving the netty implementation by adding a hash table of selector array indices. This solves the contains problem completely.
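The idea of pairing the key array with an index, so that {{contains}} becomes exact, might look roughly like the sketch below. It is only an illustration under stated assumptions: `IndexedKeySet` is a hypothetical class, it uses an `IdentityHashMap` as the index rather than any structure from Ignite or netty, and it has not been measured for the performance and GC concerns raised in this thread.

```java
import java.util.AbstractSet;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical array-backed set with an identity index, so contains() is exact. */
final class IndexedKeySet<T> extends AbstractSet<T> {
    private Object[] keys = new Object[16];
    private int size;

    /** Maps each stored element to its slot in the keys array. */
    private final Map<T, Integer> idx = new IdentityHashMap<>();

    @Override public boolean add(T key) {
        if (key == null || idx.containsKey(key))
            return false;                           // reject nulls and duplicates

        if (size == keys.length)
            keys = Arrays.copyOf(keys, size * 2);   // grow only when full

        keys[size] = key;
        idx.put(key, size++);

        return true;
    }

    @Override public boolean contains(Object o) { return idx.containsKey(o); }

    @Override public int size() { return size; }

    /** Clears the used prefix of the array and reuses it instead of allocating a new one. */
    void reset() {
        Arrays.fill(keys, 0, size, null);
        idx.clear();
        size = 0;
    }

    @Override public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int pos;

            @Override public boolean hasNext() { return pos < size; }

            @SuppressWarnings("unchecked")
            @Override public T next() {
                if (pos >= size)
                    throw new NoSuchElementException();

                return (T) keys[pos++];
            }
        };
    }

    public static void main(String[] args) {
        IndexedKeySet<Object> set = new IndexedKeySet<>();
        Object key = new Object();

        System.out.println(set.add(key));       // first insert succeeds
        System.out.println(set.add(key));       // duplicate is rejected
        System.out.println(set.contains(key));  // membership is reported correctly
    }
}
```

The index costs extra memory and a map lookup on every `add`, which is the kind of trade-off the thread suggests validating with benchmarks before changing the NIO path.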
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074524#comment-17074524 ]

Aleksey Plekhanov commented on IGNITE-12845:
--------------------------------------------

[~ivandasch], the new netty implementation still returns {{false}} for {{contains()}}. As far as I understand, just moving to the new version doesn't resolve the current bug; we should change the way we use it (unsubscribe from write events if we have no plans to write anything, which is what I propose to do in this ticket).

{{AbstractNioClientWorker#checkIdle}} iterates through {{keys()}}, not the {{selectedKeys()}} set that was injected.
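Unsubscribing from write events once nothing is left to write can be sketched as follows. This is a hypothetical model and not the Ignite code: `WriteInterestSketch` keeps a plain interest-ops mask and a write queue, and only the `SelectionKey.OP_READ` and `SelectionKey.OP_WRITE` constants come from the JDK.

```java
import java.nio.channels.SelectionKey;
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical session model: an interest-ops mask plus a queue of pending writes. */
final class WriteInterestSketch {
    private int interestOps = SelectionKey.OP_READ;
    private final Queue<byte[]> writeQueue = new ArrayDeque<>();

    /** Queues data and subscribes to write readiness. */
    void send(byte[] data) {
        writeQueue.add(data);
        interestOps |= SelectionKey.OP_WRITE;
    }

    /** Handles a "ready for writing" event; safe to call again for a duplicated key. */
    void onWritable() {
        writeQueue.poll();                          // pretend the head request was written

        if (writeQueue.isEmpty())
            interestOps &= ~SelectionKey.OP_WRITE;  // nothing left: stop polling for writes
    }

    int interestOps() { return interestOps; }

    public static void main(String[] args) {
        WriteInterestSketch ses = new WriteInterestSketch();

        ses.send(new byte[] {1});
        ses.onWritable();
        ses.onWritable();   // duplicated event for the same key

        System.out.println(ses.interestOps() == SelectionKey.OP_READ);
    }
}
```

Because the decision depends only on whether the queue is empty, a duplicated "ready for writing" event cannot leave `OP_WRITE` set forever. On a real key the same mask update would go through `SelectionKey.interestOps(int)`.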
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074472#comment-17074472 ]

Ivan Daschinskiy commented on IGNITE-12845:
-------------------------------------------

[~alex_pl] The main reason, AFAIK, is to solve memory leak problems, i.e. [see this SO discussion|https://stackoverflow.com/questions/34645752/when-nettys-io-netty-channel-nio-selectedselectionkeyset-hold-too-much-selectio]. If we clear the set properly before every select or selectNow, and nullify entries when iterating, we solve the original problem too.
My solution is just to add an index (a simple int[] array) to the existing data structure and move this outdated implementation toward the [new netty one|https://github.com/netty/netty/blob/c74b3f3a3b73fee125048b0f486fc9c19fb3bc14/transport/src/main/java/io/netty/channel/nio/SelectedSelectionKeySet.java].

As you can see, the current netty implementation resets the set by nullifying the array, and this has not been a problem for netty since 2016.
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074455#comment-17074455 ]

Aleksey Plekhanov commented on IGNITE-12845:
--------------------------------------------

Creating a new array puts pressure on the GC, which is what we are trying to avoid with our own implementation. I agree about the dev-list.
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074440#comment-17074440 ]

Ivan Daschinskiy commented on IGNITE-12845:
-------------------------------------------

[~alex_pl] reset: just put the array reference on the stack, set the reference to null, then assign a freshly created array to the field? Do you think that is costly? The current implementation is obviously incorrect, and some code in GridNioServer iterates over the set and simply doesn't work. This should be fixed.
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074377#comment-17074377 ]

Aleksey Plekhanov commented on IGNITE-12845:

[~ivandasch], the reset operation will still be O(MAX_SIZE). NIO is a critical part of Ignite; any change to the underlying storage of {{SelectedSelectionKeySet}} is risky and must be carefully tested for performance. I think the bug should be fixed as simply as possible and the ticket should be targeted to 2.8.1. I propose the 2nd or 3rd solution from the ticket description. It's simple, it solves the problem (it's not the optimal solution, but at least the problem will no longer be critical), and it doesn't affect performance.

Also, a ticket for the improvement can be created and targeted to 2.9 or a later release. Feel free to create such a ticket. Perhaps the solution should be discussed on the dev list to involve more participants in the discussion.

I.e. let's fix the bug in the "Bug" ticket targeted to 2.8.1 and make the improvement in an "Improvement" ticket targeted to 2.9. WDYT?

> GridNioServer can infinitely lose some events
> ---------------------------------------------
>
>                 Key: IGNITE-12845
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12845
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Aleksey Plekhanov
>            Priority: Major
>
> With the optimization enabled (IGNITE_NO_SELECTOR_OPTS = false, the default),
> {{GridNioServer}} can lose some events for a channel (depending on JDK
> version and OS). This can cause connected applications to hang.
> Reproducer:
> {code:java}
> public void testConcurrentLoad() throws Exception {
>     startGrid(0);
>     try (IgniteClient client = Ignition.startClient(new ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>         ClientCache cache = client.getOrCreateCache(DEFAULT_CACHE_NAME);
>         GridTestUtils.runMultiThreaded(
>             () -> {
>                 for (int i = 0; i < 1000; i++)
>                     cache.put(i, i);
>             }, 5, "run-async");
>     }
> }
> {code}
> This reproducer eventually hangs on MacOS (tested with JDK 8, 11, 12, 13,
> 14) and on some Linux environments (for example, it passed more than 100
> times on a desktop Linux system with JDK 8, but hangs on TeamCity agents
> with JDK 8 and 11). It has never hung on a Windows system (passed more than
> 100 times), and it passes on all systems and JDK versions when the system
> property {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>
> The root cause: the optimized {{SelectedSelectionKeySet}} always returns
> {{false}} from its {{contains()}} method. The {{contains()}} method is used
> by the {{sun.nio.ch.SelectorImpl.processReadyEvents()}} method:
> {code:java}
> if (selectedKeys.contains(ski)) {
>     if (ski.translateAndUpdateReadyOps(rOps)) {
>         return 1;
>     }
> } else {
>     ski.translateAndSetReadyOps(rOps);
>     if ((ski.nioReadyOps() & ski.nioInterestOps()) != 0) {
>         selectedKeys.add(ski);
>         return 1;
>     }
> }
> {code}
> So, for a fair implementation, if a selection key is already contained in the
> selected keys set, its ready operations flags are updated, but for
> {{SelectedSelectionKeySet}} the ready operations flags are always overwritten
> and a new selector key is added even if it is already contained in the set.
> Some {{SelectorImpl}} implementations can pass several events for one
> selector key to the {{processReadyEvents}} method (for example, the MacOS
> implementation {{KQueueSelectorImpl}} works this way). In this case,
> duplicated selector keys are added to {{selectedKeys}} and all events
> except the last are lost.
> Two bad things happen in {{GridNioServer}} for the reasons described above:
> # Some event flags are lost and the worker doesn't process the corresponding
> action (for the attached reproducer, the "channel is ready for reading" event
> is lost and the workers never read the channel after some point in time).
> # Duplicated selector keys with the same event flags are added (for the
> attached reproducer it's the "channel is ready for writing" event; this
> duplication leads to wrong processing of the
> {{GridSelectorNioSessionImpl#procWrite}} flag, which will be {{false}} in
> some cases, while at the same time the selector key's {{interestedOps}} will
> contain the {{OP_WRITE}} operation and this operation is never excluded).
> Possible solutions:
> * Fair implementation of the {{SelectedSelectionKeySet.contains}} method
> (this will solve all problems but can be resource consuming).
> * Always set {{GridSelectorNioSessionImpl#procWrite}} to {{true}} when
> adding {{OP_WRITE}} to {{interestedOps}} (for example, in the
> {{AbstractNioClientWorker.registerWrite()}} method). In this case, some
> "channel is ready for reading" events (but not data) can still be lost, but
> not
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073357#comment-17073357 ]

Ivan Daschinskiy commented on IGNITE-12845:

[~alex_pl] If the implementation uses Robin Hood hashing and a good hash function (we can apply murmur32 to hashCode()), many sources say that the load factor can be 0.8 without any effect on performance. See, for example, https://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/

> GridNioServer can infinitely lose some events
> ---------------------------------------------
>
>                 Key: IGNITE-12845
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12845
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Aleksey Plekhanov
>            Priority: Major
>
> With the optimization enabled (IGNITE_NO_SELECTOR_OPTS = false, the default),
> {{GridNioServer}} can lose some events for a channel (depending on JDK
> version and OS). This can cause connected applications to hang. Reproducer:
> {code:java}
> public void testConcurrentLoad() throws Exception {
>     startGrid(0);
>     try (IgniteClient client = Ignition.startClient(new ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>         ClientCache cache = client.getOrCreateCache(DEFAULT_CACHE_NAME);
>         GridTestUtils.runMultiThreaded(
>             () -> {
>                 for (int i = 0; i < 1000; i++)
>                     cache.put(i, i);
>             }, 5, "run-async");
>     }
> }
> {code}
> This reproducer eventually hangs on MacOS (tested with JDK 8, 11, 12, 13,
> 14) and on some Linux environments (for example, it passed more than 100
> times on a desktop Linux system with JDK 8, but hangs on TeamCity agents
> with JDK 8 and 11). It has never hung on a Windows system (passed more than
> 100 times), and it passes on all systems and JDK versions when the system
> property {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>
> The root cause: the optimized {{SelectedSelectionKeySet}} always returns
> {{false}} from its {{contains()}} method.
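To make the Robin Hood suggestion concrete, here is a rough sketch of an open-addressing set with Robin Hood insertion, a 0.8 load factor and the MurmurHash3 32-bit finalizer applied to `hashCode()`. It is an illustration only, not the Ignite or Netty implementation: the class name `RobinHoodSetSketch` and all of its methods are hypothetical, there is no element removal, and `clear()` is a plain array fill whose cost is proportional to the table capacity.

```java
import java.util.Arrays;

final class RobinHoodSetSketch {
    private Object[] slots = new Object[16]; // capacity is always a power of two
    private int size;

    /** MurmurHash3 32-bit finalizer (fmix32): spreads the bits of hashCode(). */
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    private static int home(Object o, int mask) {
        return mix(o.hashCode()) & mask;
    }

    /** How far the element stored at idx sits from its home slot. */
    private static int distance(Object o, int idx, int mask) {
        return (idx - home(o, mask)) & mask;
    }

    boolean contains(Object o) {
        int mask = slots.length - 1;
        int idx = home(o, mask);
        for (int d = 0; ; d++) {
            Object cur = slots[idx];
            // An empty slot, or a resident closer to its home than we are to ours,
            // means the element cannot be further along the probe sequence.
            if (cur == null || distance(cur, idx, mask) < d)
                return false;
            if (cur.equals(o))
                return true;
            idx = (idx + 1) & mask;
        }
    }

    boolean add(Object o) {
        if (contains(o))
            return false;
        if ((size + 1) * 5 > slots.length * 4) // keep the load factor at or below 0.8
            grow();
        insert(o);
        size++;
        return true;
    }

    /** Robin Hood insertion: a "richer" resident gives up its slot to a "poorer" newcomer. */
    private void insert(Object o) {
        int mask = slots.length - 1;
        int idx = home(o, mask);
        for (int d = 0; ; d++) {
            Object cur = slots[idx];
            if (cur == null) {
                slots[idx] = o;
                return;
            }
            int curDist = distance(cur, idx, mask);
            if (curDist < d) { // swap and continue inserting the displaced element
                slots[idx] = o;
                o = cur;
                d = curDist;
            }
            idx = (idx + 1) & mask;
        }
    }

    private void grow() {
        Object[] old = slots;
        slots = new Object[old.length * 2];
        for (Object o : old)
            if (o != null)
                insert(o);
    }

    void clear() {
        Arrays.fill(slots, null); // cost is proportional to capacity, not to size
        size = 0;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        RobinHoodSetSketch set = new RobinHoodSetSketch();
        for (int i = 0; i < 1000; i++)
            set.add(i);
        System.out.println("size=" + set.size() + ", contains(7)=" + set.contains(7)
            + ", contains(1000)=" + set.contains(1000));
    }
}
```

Note that iteration over such a table has to scan every slot, empty or not, so a sparsely filled table pays for its full capacity on each pass.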
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073175#comment-17073175 ]

Aleksey Plekhanov commented on IGNITE-12845:

[~ivandasch], I'm not sure about an open-addressing hash set: iterating over such a set can be very inefficient when the set is almost empty.

> GridNioServer can infinitely lose some events
> ---------------------------------------------
>
>                 Key: IGNITE-12845
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12845
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Aleksey Plekhanov
>            Priority: Major
>
> With the optimization enabled (IGNITE_NO_SELECTOR_OPTS = false, the default),
> {{GridNioServer}} can lose some events for a channel (depending on JDK
> version and OS). This can cause connected applications to hang. Reproducer:
> {code:java}
> public void testConcurrentLoad() throws Exception {
>     startGrid(0);
>     try (IgniteClient client = Ignition.startClient(new ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>         ClientCache cache = client.getOrCreateCache(DEFAULT_CACHE_NAME);
>         GridTestUtils.runMultiThreaded(
>             () -> {
>                 for (int i = 0; i < 1000; i++)
>                     cache.put(i, i);
>             }, 5, "run-async");
>     }
> }
> {code}
> This reproducer eventually hangs on MacOS (tested with JDK 8, 11, 12, 13,
> 14) and on some Linux environments (for example, it passed more than 100
> times on a desktop Linux system with JDK 8, but hangs on TeamCity agents
> with JDK 8 and 11). It has never hung on a Windows system (passed more than
> 100 times), and it passes on all systems and JDK versions when the system
> property {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>
> The root cause: the optimized {{SelectedSelectionKeySet}} always returns
> {{false}} from its {{contains()}} method.
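The "just check the last element" variant of the hybrid approach from the issue description can be sketched as follows. This is a hedged illustration under simplifying assumptions, not the actual fix: `Key` and `LastCheckedSet` are hypothetical, and the interest-ops check from `processReadyEvents()` is omitted. It shows both the case the cheap check catches (consecutive events for one key) and the case it misses (events for the same key separated by another key), which is why the description pairs it with a workaround.

```java
import java.nio.channels.SelectionKey;
import java.util.ArrayList;
import java.util.List;

final class LastElementCheckSketch {
    /** Hypothetical stand-in for a selector key; it only tracks ready-ops flags. */
    static final class Key {
        int readyOps;
    }

    /** Append-only key list whose contains() inspects only the most recently added key. */
    static final class LastCheckedSet {
        private final List<Key> keys = new ArrayList<>();

        boolean contains(Key k) {
            return !keys.isEmpty() && keys.get(keys.size() - 1) == k;
        }

        void add(Key k) { keys.add(k); }

        int size() { return keys.size(); }
    }

    /** Same update-or-overwrite branching as the quoted processReadyEvents() fragment. */
    static void processReadyEvent(LastCheckedSet selected, Key key, int rOps) {
        if (selected.contains(key))
            key.readyOps |= rOps;   // merge flags for a key that is known to be present
        else {
            key.readyOps = rOps;    // overwrite flags and append the key
            selected.add(key);
        }
    }

    /** READ then WRITE for the same key, back to back; returns {set size, ready ops}. */
    static int[] consecutive() {
        LastCheckedSet set = new LastCheckedSet();
        Key a = new Key();
        processReadyEvent(set, a, SelectionKey.OP_READ);
        processReadyEvent(set, a, SelectionKey.OP_WRITE);
        return new int[] {set.size(), a.readyOps};
    }

    /** READ for a, READ for b, then WRITE for a; returns {set size, ready ops of a}. */
    static int[] interleaved() {
        LastCheckedSet set = new LastCheckedSet();
        Key a = new Key();
        Key b = new Key();
        processReadyEvent(set, a, SelectionKey.OP_READ);
        processReadyEvent(set, b, SelectionKey.OP_READ);
        processReadyEvent(set, a, SelectionKey.OP_WRITE);
        return new int[] {set.size(), a.readyOps};
    }

    public static void main(String[] args) {
        int[] c = consecutive();
        int[] i = interleaved();
        System.out.println("consecutive: size=" + c[0] + ", readyOps=" + c[1]);
        System.out.println("interleaved: size=" + i[0] + ", readyOps=" + i[1]);
    }
}
```

In the consecutive case the key is stored once with both flags merged; in the interleaved case the check gives a false negative, the key is stored twice and its `OP_READ` flag is overwritten.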
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073124#comment-17073124 ]

Ivan Daschinskiy commented on IGNITE-12845:

[~alex_pl] I think that implementing SelectedSelectionKeySet as an open-addressing hash set can solve this problem completely.
* Calling remove will null out the array element, as the current implementation does.
* We can use the simple selector usage pattern (iterate and remove).

Yes, we would have to clear this set before every selectNow() or select(), but we can wrap Selector and do the whole job right (see the [current implementation of set in netty|https://github.com/netty/netty/commit/795f318c3c11ec0520e7acd963ad4b310c287c20#diff-47ddf03d4cdcb32be935ca412f455ee5] for example).

Also, as part of the fix, I suggest using Unsafe for instrumentation and falling back to reflection as a last resort, as Netty does.

Do you mind if I assign this ticket to myself? Do you have any objections to my suggestions?

> GridNioServer can infinitely lose some events
> ---------------------------------------------
>
>                 Key: IGNITE-12845
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12845
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Aleksey Plekhanov
>            Priority: Major
>
> With the optimization enabled (IGNITE_NO_SELECTOR_OPTS = false, the default),
> {{GridNioServer}} can lose some events for a channel (depending on JDK
> version and OS). This can cause connected applications to hang.
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071597#comment-17071597 ]

Aleksey Plekhanov commented on IGNITE-12845:

[~antonovsergey93], sorry, for JDK 8 it's inside the derived class implementation (KQueueSelectorImpl for Mac OS, for example), but later it was moved to SelectorImpl (at least since JDK 11).

> GridNioServer can infinitely lose some events
> ---------------------------------------------
>
>                 Key: IGNITE-12845
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12845
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Aleksey Plekhanov
>            Priority: Major
>
> With the optimization enabled (IGNITE_NO_SELECTOR_OPTS = false, the default),
> {{GridNioServer}} can lose some events for a channel (depending on JDK
> version and OS). This can cause connected applications to hang. Reproducer:
> {code:java}
> public void testConcurrentLoad() throws Exception {
>     startGrid(0);
>     try (IgniteClient client = Ignition.startClient(new ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>         ClientCache cache = client.getOrCreateCache(DEFAULT_CACHE_NAME);
>         GridTestUtils.runMultiThreaded(
>             () -> {
>                 for (int i = 0; i < 1000; i++)
>                     cache.put(i, i);
>             }, 5, "run-async");
>     }
> }
> {code}
> This reproducer eventually hangs on MacOS (tested with JDK 8, 11, 12, 13,
> 14) and on some Linux environments (for example, it passed more than 100
> times on a desktop Linux system with JDK 8, but hangs on TeamCity agents
> with JDK 8 and 11). It has never hung on a Windows system (passed more than
> 100 times), and it passes on all systems and JDK versions when the system
> property {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>
> The root cause: the optimized {{SelectedSelectionKeySet}} always returns
> {{false}} from its {{contains()}} method.
[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events
[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071590#comment-17071590 ]

Sergey Antonov commented on IGNITE-12845:

[~alex_pl] I didn't find any {{Set#contains(Object)}} usages in {{sun.nio.ch.SelectorImpl}} in JDK 8 (1.8.0_191).

> GridNioServer can infinitely lose some events
> ---------------------------------------------
>
>                 Key: IGNITE-12845
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12845
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Aleksey Plekhanov
>            Priority: Major
>
> With the optimization enabled (IGNITE_NO_SELECTOR_OPTS = false, the default),
> {{GridNioServer}} can lose some events for a channel (depending on JDK
> version and OS). This can cause connected applications to hang. Reproducer:
> {code:java}
> public void testConcurrentLoad() throws Exception {
>     startGrid(0);
>     try (IgniteClient client = Ignition.startClient(new ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>         ClientCache cache = client.getOrCreateCache(DEFAULT_CACHE_NAME);
>         GridTestUtils.runMultiThreaded(
>             () -> {
>                 for (int i = 0; i < 1000; i++)
>                     cache.put(i, i);
>             }, 5, "run-async");
>     }
> }
> {code}
> This reproducer eventually hangs on MacOS (tested with JDK 8, 11, 12, 13,
> 14) and on some Linux environments (for example, it passed more than 100
> times on a desktop Linux system with JDK 8, but hangs on TeamCity agents
> with JDK 8 and 11). It has never hung on a Windows system (passed more than
> 100 times), and it passes on all systems and JDK versions when the system
> property {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>
> The root cause: the optimized {{SelectedSelectionKeySet}} always returns
> {{false}} from its {{contains()}} method.