[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-07-02 Thread Alexey Goncharuk (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150146#comment-17150146
 ] 

Alexey Goncharuk commented on IGNITE-12845:
---

Looks good to me, thanks!

> GridNioServer can infinitely lose some events 
> --
>
> Key: IGNITE-12845
> URL: https://issues.apache.org/jira/browse/IGNITE-12845
> Project: Ignite
>  Issue Type: Bug
>Reporter: Aleksey Plekhanov
>Assignee: Aleksey Plekhanov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> With the selector optimization enabled ({{IGNITE_NO_SELECTOR_OPTS = false}}, the 
> default), {{GridNioServer}} can lose some events for a channel (depending on 
> the JDK version and OS). This can cause connected applications to hang. Reproducer: 
> {code:java}
> public void testConcurrentLoad() throws Exception {
>     startGrid(0);
>
>     try (IgniteClient client = Ignition.startClient(
>         new ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>         ClientCache<Integer, Integer> cache = client.getOrCreateCache(DEFAULT_CACHE_NAME);
>
>         // Five threads put the same 1000 keys concurrently through one client.
>         GridTestUtils.runMultiThreaded(
>             () -> {
>                 for (int i = 0; i < 1000; i++)
>                     cache.put(i, i);
>             }, 5, "run-async");
>     }
> }
> {code}
> This reproducer eventually hangs on macOS (tested with JDK 8, 11, 12, 13, 
> 14) and on Windows with some JDK versions (tested with JDK 11, 14), but 
> passes on Windows with JDK 8, on Linux systems, and whenever the system property 
> {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
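
A minimal sketch of the workaround mentioned above, assuming the property is read as an ordinary JVM system property and that it has to be set before the node (and its NIO server) starts. Only the property name comes from the ticket; the class and method names are hypothetical.

```java
public class SelectorOptsWorkaround {
    /** Property name as written in the ticket description. */
    static final String PROP = "IGNITE_NO_SELECTOR_OPTS";

    /** Returns true when the selector optimization is switched off via the system property. */
    static boolean selectorOptsDisabled() {
        // Boolean.getBoolean reads a JVM system property and parses it as a boolean.
        return Boolean.getBoolean(PROP);
    }

    public static void main(String[] args) {
        // Assumption: this runs before any node is started in the same JVM.
        System.setProperty(PROP, "true");

        System.out.println(PROP + " disabled optimization: " + selectorOptsDisabled());
    }
}
```

Passing `-DIGNITE_NO_SELECTOR_OPTS=true` on the JVM command line should have the same effect without any code change.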
> The root cause: the optimized {{SelectedSelectionKeySet}} always returns 
> {{false}} from its {{contains()}} method, and {{contains()}} is called by 
> {{sun.nio.ch.SelectorImpl.processReadyEvents()}}:
> {code:java}
> if (selectedKeys.contains(ski)) {
>     if (ski.translateAndUpdateReadyOps(rOps)) {
>         return 1;
>     }
> } else {
>     ski.translateAndSetReadyOps(rOps);
>     if ((ski.nioReadyOps() & ski.nioInterestOps()) != 0) {
>         selectedKeys.add(ski);
>         return 1;
>     }
> }
> {code}
> So, with a fair implementation, if a selection key is already in the selected 
> keys set, its ready-operation flags are updated (merged). With 
> {{SelectedSelectionKeySet}}, the ready-operation flags are always overwritten 
> and the selection key is added again even if the set already contains it. 
> Some {{SelectorImpl}} implementations can pass several events for one 
> selection key to {{processReadyEvents}} (for example, the macOS 
> implementation {{KQueueSelectorImpl}} works this way). In that case, 
> duplicate selection keys are added to {{selectedKeys}} and all events 
> except the last are lost.
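
The effect described above can be reproduced in isolation. The following is a simplified simulation, not JDK or Ignite code: `Key`, `AppendOnlyKeySet` and `processReadyEvent` are hypothetical stand-ins that only mirror the branch structure of the quoted snippet (merge flags when `contains()` is true, overwrite them otherwise).

```java
import java.util.AbstractSet;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class LostEventsDemo {
    // Same numeric values as SelectionKey.OP_READ and SelectionKey.OP_WRITE.
    static final int OP_READ = 1, OP_WRITE = 4;

    /** Hypothetical stand-in for a selection key. */
    static final class Key {
        int interestOps = OP_READ | OP_WRITE;
        int readyOps;
    }

    /** Stand-in for the optimized set: appends blindly, contains() is always false. */
    static final class AppendOnlyKeySet extends AbstractSet<Key> {
        private final List<Key> keys = new ArrayList<>();

        @Override public boolean add(Key k) { return keys.add(k); }
        @Override public boolean contains(Object o) { return false; }
        @Override public Iterator<Key> iterator() { return keys.iterator(); }
        @Override public int size() { return keys.size(); }
    }

    /** Simplified version of the two branches in the quoted snippet. */
    static void processReadyEvent(Set<Key> selected, Key key, int rOps) {
        if (selected.contains(key))
            key.readyOps |= rOps & key.interestOps;   // "update": merge with existing flags
        else {
            key.readyOps = rOps & key.interestOps;    // "set": overwrite existing flags
            if (key.readyOps != 0)
                selected.add(key);
        }
    }

    /** Delivers READ and then WRITE for one key; returns {set size, final readyOps}. */
    static int[] run(Set<Key> selected) {
        Key key = new Key();
        processReadyEvent(selected, key, OP_READ);
        processReadyEvent(selected, key, OP_WRITE);
        return new int[] {selected.size(), key.readyOps};
    }

    public static void main(String[] args) {
        int[] fair = run(new HashSet<>());
        int[] opt = run(new AppendOnlyKeySet());
        System.out.println("fair:      size=" + fair[0] + " readyOps=" + fair[1]);
        System.out.println("optimized: size=" + opt[0] + " readyOps=" + opt[1]);
    }
}
```

With the fair set the key appears once and carries both flags (5 = READ | WRITE); with the append-only set the key appears twice and only the last flag (4 = WRITE) survives, so the read event is lost.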
> As a result, two bad things happen in {{GridNioServer}}:
>  # Some event flags are lost and the worker doesn't perform the corresponding 
> action (in the attached reproducer the "channel is ready for reading" event is lost 
> and, after some point, the workers never read the channel again).
>  # Duplicate selection keys with the same event flags are processed (in the attached 
> reproducer it's the "channel is ready for writing" event). This duplication leads 
> to incorrect handling of the {{GridSelectorNioSessionImpl#procWrite}} flag, which 
> will be {{false}} in some cases while the selection key's 
> {{interestedOps}} still contains {{OP_WRITE}}, and this operation is 
> never removed. 
> Possible solutions:
>  * A fair implementation of {{SelectedSelectionKeySet.contains}} (this 
> solves all the problems but can be resource-consuming).
>  * Always set {{GridSelectorNioSessionImpl#procWrite}} to {{true}} when 
> adding {{OP_WRITE}} to {{interestedOps}} (for example, in 
> {{AbstractNioClientWorker.registerWrite()}}). In this case, some 
> "channel is ready for reading" events (but not data) can still be lost, but 
> not forever, and the data will eventually be read. If events are reordered 
> (first "channel is ready for writing", then "channel is ready for 
> reading"), the write to the channel will be processed only after all data 
> has been read.
>  * Remove {{OP_WRITE}} from {{interestedOps}} even if 
> {{GridSelectorNioSessionImpl#procWrite}} is {{false}} when there are no write 
> requests in the queue (see {{GridNioServer.stopPollingForWrite()}}). 
> This solution has the same shortcomings as the previous one. 
>  * A hybrid approach: use a probabilistic implementation of {{contains}} 
> (a Bloom filter, or just a check of the last element) and use one of the two 
> previous solutions as a workaround for cases when {{contains}} incorrectly returns 
> {{false}}. 
>  
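
The "just check the last element" variant of the hybrid approach listed above could look roughly like the sketch below. This is an illustration only, not the patch from the pull request: `LastElementKeySet` is a hypothetical class, and it deliberately allows false negatives for every key except the most recently added one.

```java
import java.util.AbstractSet;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class LastElementKeySet<K> extends AbstractSet<K> {
    private Object[] keys = new Object[16];
    private int size;

    /** Appends without any duplicate check, like an array-backed selected-keys set. */
    @Override public boolean add(K k) {
        if (size == keys.length)
            keys = Arrays.copyOf(keys, size * 2);

        keys[size++] = k;

        return true;
    }

    /** O(1) check of the most recently added key only; older keys yield false negatives. */
    @Override public boolean contains(Object o) {
        return size > 0 && keys[size - 1] == o;
    }

    @Override public int size() {
        return size;
    }

    @Override public Iterator<K> iterator() {
        return new Iterator<K>() {
            private int idx;

            @Override public boolean hasNext() {
                return idx < size;
            }

            @SuppressWarnings("unchecked")
            @Override public K next() {
                if (idx >= size)
                    throw new NoSuchElementException();

                return (K)keys[idx++];
            }
        };
    }
}
```

This catches the case where several events for one key arrive back to back, which is the pattern the ticket attributes to {{KQueueSelectorImpl}}. Events for the same key that are interleaved with other keys would still be treated as new, which is why the ticket pairs this idea with one of the two workarounds.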



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-06-18 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139264#comment-17139264
 ] 

Ignite TC Bot commented on IGNITE-12845:


{panel:title=Branch: [pull/7879/head] Base: [master] : No blockers 
found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *-- Run :: All* 
Results|https://ci.ignite.apache.org/viewLog.html?buildId=5396155&buildTypeId=IgniteTests24Java8_RunAll]


[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-06-18 Thread Aleksey Plekhanov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139263#comment-17139263
 ] 

Aleksey Plekhanov commented on IGNITE-12845:


[~agoncharuk], as communication SPI maintainer, can you please review the patch?

I didn't add any new tests because some existing tests already expose 
this problem: they hang on macOS and on Windows with JDK 11+ 
(ComputeTaskTest#testExecuteTaskConcurrentLoad, 
AsyncChannelTest#testConcurrentRequests, AsyncChannelTest#testConcurrentQueries).


[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-06-01 Thread Aleksey Plekhanov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121352#comment-17121352
 ] 

Aleksey Plekhanov commented on IGNITE-12845:


[~ptupitsyn], most likely the NIO bug does not affect Linux systems (but it 
certainly affects macOS). I found another bug in the Java thin client compute 
implementation (IGNITE-13106), and I now think that the TeamCity hangs (which I 
mentioned in the original ticket description) were caused by the compute bug, not 
the NIO bug (originally I tested compute, but later wrote a simplified reproducer 
with cache.put). Please have a look at IGNITE-13106; perhaps the .NET client has 
the same problem. 

> GridNioServer can infinitely lose some events 
> --
>
> Key: IGNITE-12845
> URL: https://issues.apache.org/jira/browse/IGNITE-12845
> Project: Ignite
>  Issue Type: Bug
>Reporter: Aleksey Plekhanov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> With the selector optimization enabled ({{IGNITE_NO_SELECTOR_OPTS = false}}, the 
> default), {{GridNioServer}} can lose some events for a channel (depending on 
> the JDK version and OS). This can cause connected applications to hang. Reproducer: 
> {code:java}
> public void testConcurrentLoad() throws Exception {
>     startGrid(0);
>
>     try (IgniteClient client = Ignition.startClient(
>         new ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>         ClientCache<Integer, Integer> cache = client.getOrCreateCache(DEFAULT_CACHE_NAME);
>
>         // Five threads put the same 1000 keys concurrently through one client.
>         GridTestUtils.runMultiThreaded(
>             () -> {
>                 for (int i = 0; i < 1000; i++)
>                     cache.put(i, i);
>             }, 5, "run-async");
>     }
> }
> {code}
> This reproducer eventually hangs on macOS (tested with JDK 8, 11, 12, 13, 
> 14) and on some Linux environments (for example, it passed more than 100 times 
> on a desktop Linux system with JDK 8, but hangs on TeamCity agents with JDK 8 
> and 11). It never hung on a Windows system (passed more than 100 times), and it 
> passes on all systems and JDK versions when the system property 
> {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>  

[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-06-01 Thread Aleksey Plekhanov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121055#comment-17121055
 ] 

Aleksey Plekhanov commented on IGNITE-12845:


[~ptupitsyn], can you please check locally your case against the attached 
pull-request?


[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-05-31 Thread Pavel Tupitsyn (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120439#comment-17120439
 ] 

Pavel Tupitsyn commented on IGNITE-12845:
-

[~alex_pl] I'm not really sure, but I think I faced this locally a few times while 
running thin client compute tests: the compute task never receives any 
completed/cancelled event and just hangs forever. OpenJDK 1.8.0_252, Ubuntu 
20.04.


[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-05-29 Thread Aleksey Plekhanov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119477#comment-17119477
 ] 

Aleksey Plekhanov commented on IGNITE-12845:


[~ptupitsyn] not yet. I've tried approaches 2 and 3 from the ticket description and 
they seem to work. If nobody takes this ticket first, I think I can take 
it in a week or two.

Did you face the same problem? Can you share details (OS, JDK)?


[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-05-29 Thread Pavel Tupitsyn (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119356#comment-17119356
 ] 

Pavel Tupitsyn commented on IGNITE-12845:
-

[~alex_pl] [~ivandasch] any updates on this issue?

> GridNioServer can infinitely lose some events 
> --
>
> Key: IGNITE-12845
> URL: https://issues.apache.org/jira/browse/IGNITE-12845
> Project: Ignite
>  Issue Type: Bug
>Reporter: Aleksey Plekhanov
>Priority: Major
>
> With the selector optimization enabled ({{IGNITE_NO_SELECTOR_OPTS = false}}, 
> the default), {{GridNioServer}} can lose some events for a channel (depending 
> on JDK version and OS). This can cause connected applications to hang. 
> Reproducer: 
> {code:java}
> public void testConcurrentLoad() throws Exception {
>     startGrid(0);
>     try (IgniteClient client = Ignition.startClient(
>         new ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>         ClientCache cache = client.getOrCreateCache(DEFAULT_CACHE_NAME);
>         GridTestUtils.runMultiThreaded(
>             () -> {
>                 for (int i = 0; i < 1000; i++)
>                     cache.put(i, i);
>             }, 5, "run-async");
>     }
> }
> {code}
> This reproducer eventually hangs on macOS (tested with JDK 8, 11, 12, 13, 
> 14) and on some Linux environments (for example, it passed more than 100 
> times on a desktop Linux system with JDK 8, but hangs on TeamCity agents with 
> JDK 8 and 11). It has never hung on Windows (passed more than 100 times), and 
> it passes on all systems and JDK versions when the system property 
> {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>  
> The root cause: the optimized {{SelectedSelectionKeySet}} always returns 
> {{false}} from its {{contains()}} method. The {{contains()}} method is used by 
> the {{sun.nio.ch.SelectorImpl.processReadyEvents()}} method:
> {code:java}
> if (selectedKeys.contains(ski)) {
>     if (ski.translateAndUpdateReadyOps(rOps)) {
>         return 1;
>     }
> } else {
>     ski.translateAndSetReadyOps(rOps);
>     if ((ski.nioReadyOps() & ski.nioInterestOps()) != 0) {
>         selectedKeys.add(ski);
>         return 1;
>     }
> }
> {code}
> So, with a fair implementation, if a selection key is already contained in 
> the selected keys set, its ready operation flags are updated. With 
> {{SelectedSelectionKeySet}}, the ready operation flags are always overwritten 
> and the selection key is added again even if it is already in the set. 
> Some {{SelectorImpl}} implementations can pass several events for one 
> selection key to the {{processReadyEvents}} method (for example, the macOS 
> implementation {{KQueueSelectorImpl}} works this way). In this case, 
> duplicate selection keys are added to {{selectedKeys}} and all events 
> except the last are lost.
> Two bad things happen in {{GridNioServer}} as a result:
>  # Some event flags are lost and the worker doesn't perform the corresponding 
> action (for the attached reproducer, the "channel is ready for reading" event 
> is lost and the workers never read the channel after some point in time).
>  # Selection keys are duplicated with the same event flags (for the attached 
> reproducer, it's the "channel is ready for writing" event). This duplication 
> leads to wrong handling of the {{GridSelectorNioSessionImpl#procWrite}} flag, 
> which will be {{false}} in some cases while the selection key's 
> {{interestedOps}} still contains {{OP_WRITE}}, and this operation is never 
> excluded.
> Possible solutions:
>  * A fair implementation of the {{SelectedSelectionKeySet.contains}} method 
> (this solves all the problems but can be resource-consuming).
>  * Always set {{GridSelectorNioSessionImpl#procWrite}} to {{true}} when 
> adding {{OP_WRITE}} to {{interestedOps}} (for example, in the 
> {{AbstractNioClientWorker.registerWrite()}} method). In this case, some 
> "channel is ready for reading" events (but not data) can still be lost, but 
> not indefinitely, and eventually the data will be read. If events are 
> reordered (first "channel is ready for writing", then "channel is ready for 
> reading"), the write to the channel will be processed only after all data 
> has been read.
>  * Exclude {{OP_WRITE}} from {{interestedOps}} even if 
> {{GridSelectorNioSessionImpl#procWrite}} is {{false}} when there are no write 
> requests in the queue (see the {{GridNioServer.stopPollingForWrite()}} 
> method). This solution has the same shortcomings as the previous one. 
>  * Hybrid approach: use a probabilistic implementation of the {{contains}} 
> method (a Bloom filter, or just check the last element) and use one of the 
> two previous solutions as a workaround for cases when {{contains}} 
> incorrectly returns {{false}}. 
>  
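
The event loss described in the issue can be reproduced in a simplified model. 
The sketch below is only an illustration under stated assumptions: 
LostEventDemo, Key and AlwaysFalseSet are hypothetical stand-ins, not the real 
JDK or Ignite classes, and the "update" branch is modelled as a bitwise OR of 
the ready flags.

```java
import java.util.AbstractSet;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Simplified model; none of these types are the real JDK or Ignite classes. */
final class LostEventDemo {
    static final int OP_READ = 1;
    static final int OP_WRITE = 4;

    /** Hypothetical stand-in for a selection key. */
    static final class Key {
        int interestOps = OP_READ | OP_WRITE;
        int readyOps;
    }

    /** Array-backed "set" whose contains() always returns false. */
    static final class AlwaysFalseSet extends AbstractSet<Key> {
        private final List<Key> keys = new ArrayList<>();

        @Override public boolean add(Key k) { return keys.add(k); }
        @Override public boolean contains(Object o) { return false; }
        @Override public Iterator<Key> iterator() { return keys.iterator(); }
        @Override public int size() { return keys.size(); }
    }

    /** Mirrors the branch structure of the quoted processReadyEvents() fragment. */
    static void processReadyEvent(Set<Key> selectedKeys, Key key, int rOps) {
        if (selectedKeys.contains(key))
            key.readyOps |= rOps;   // update: earlier flags survive
        else {
            key.readyOps = rOps;    // set: earlier flags are overwritten
            if ((key.readyOps & key.interestOps) != 0)
                selectedKeys.add(key);
        }
    }

    /** Delivers READ and then WRITE for one key; returns {set size, final readyOps}. */
    static int[] run(Set<Key> selectedKeys) {
        Key key = new Key();
        processReadyEvent(selectedKeys, key, OP_READ);
        processReadyEvent(selectedKeys, key, OP_WRITE);
        return new int[] {selectedKeys.size(), key.readyOps};
    }
}
```

In this model, a java.util.HashSet ends up with one entry whose ready flags are 
OP_READ | OP_WRITE, while AlwaysFalseSet ends up with two entries for the same 
key and only OP_WRITE set, so the read event is gone.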



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-04-03 Thread Ivan Daschinskiy (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074571#comment-17074571
 ] 

Ivan Daschinskiy commented on IGNITE-12845:
---

[~alex_pl] Yes, it now seems to me that you are right: for now we can do a 
quick bugfix (your proposal 2 or 3) and plan to rewrite 
SelectedSelectionKeySet without rush. I will create a separate ticket and 
start a discussion on the dev-list soon.


[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-04-03 Thread Ivan Daschinskiy (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074547#comment-17074547
 ] 

Ivan Daschinskiy commented on IGNITE-12845:
---

[~alex_pl]
From Oracle JDK 8:
{code:java}
public Set keys() {
    if (!this.isOpen() && !Util.atBugLevel("1.4")) {
        throw new ClosedSelectorException();
    } else {
        return this.publicKeys;
    }
}
{code}

Yes, the current netty implementation doesn't implement the contains method. I 
was just replying to your point about reset(). 
I suggest improving the netty implementation by adding a hash table for 
selector array indices. This solves the contains problem completely. 
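
A minimal sketch of what such an index could look like, assuming a plain 
java.util.HashMap from key to array position. IndexedKeySet is a hypothetical 
name, not a netty or Ignite class, and the boxed Integer values do add some 
allocation, which is a trade-off against the GC concerns raised in this thread.

```java
import java.util.AbstractSet;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical array-backed set with a hash index, so contains() is truthful. */
final class IndexedKeySet<T> extends AbstractSet<T> {
    private Object[] keys = new Object[16];
    private int size;

    /** Maps each stored key to its position in the array. */
    private final Map<Object, Integer> idx = new HashMap<>();

    @Override public boolean add(T key) {
        if (key == null || idx.containsKey(key))
            return false;               // reject nulls and duplicates

        if (size == keys.length)
            keys = Arrays.copyOf(keys, size * 2);

        idx.put(key, size);
        keys[size++] = key;

        return true;
    }

    @Override public boolean contains(Object o) {
        return idx.containsKey(o);
    }

    @Override public int size() {
        return size;
    }

    /** Clears used slots in place; the backing array is reused. */
    @Override public void clear() {
        Arrays.fill(keys, 0, size, null);
        idx.clear();
        size = 0;
    }

    @Override public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int pos;

            @Override public boolean hasNext() {
                return pos < size;
            }

            @SuppressWarnings("unchecked")
            @Override public T next() {
                if (pos >= size)
                    throw new NoSuchElementException();

                return (T)keys[pos++];
            }
        };
    }
}
```

With a set like this, the contains() branch of processReadyEvents() would see 
an already selected key and update its flags instead of adding it again.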



[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-04-03 Thread Aleksey Plekhanov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074524#comment-17074524
 ] 

Aleksey Plekhanov commented on IGNITE-12845:


[~ivandasch], the new netty implementation still returns {{false}} for 
{{contains()}}. As far as I understand, just moving to the new version doesn't 
resolve the current bug; we should change the way we use it (unsubscribe from 
write events if we have no plans to write anything, which is what I propose to 
do in this ticket). {{AbstractNioClientWorker#checkIdle}} iterates through 
{{keys()}}, not through the injected {{selectedKeys()}}.
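
Unsubscribing from write events could look roughly like the sketch below. It 
uses only the standard java.nio.channels.SelectionKey API; WriteInterest, 
stopPollingIfIdle and the pendingWrites queue are hypothetical names for 
illustration, not the actual GridNioServer code.

```java
import java.nio.channels.SelectionKey;
import java.util.Queue;

/** Hypothetical helper; not part of GridNioServer. */
final class WriteInterest {
    /** Returns the interest mask with OP_WRITE cleared. */
    static int withoutWrite(int interestOps) {
        return interestOps & ~SelectionKey.OP_WRITE;
    }

    /** Drops OP_WRITE from the key when the (assumed) write queue is empty. */
    static void stopPollingIfIdle(SelectionKey key, Queue<?> pendingWrites) {
        if (pendingWrites.isEmpty() && key.isValid())
            key.interestOps(withoutWrite(key.interestOps()));
    }
}
```

The point of the sketch is that the decision depends only on the write queue 
being empty, not on a separate flag that can get out of sync with the key.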


[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-04-03 Thread Ivan Daschinskiy (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074472#comment-17074472
 ] 

Ivan Daschinskiy commented on IGNITE-12845:
---

[~alex_pl] The main reason, AFAIK, is to solve memory leak problems, e.g. [see 
this SO 
discussion|https://stackoverflow.com/questions/34645752/when-nettys-io-netty-channel-nio-selectedselectionkeyset-hold-too-much-selectio].
If we clear the set properly before every select or selectNow, and nullify 
entries when iterating, we solve the original problem too. My solution is just 
to add an index (a simple int[] array) to the existing data structure and move 
this outdated implementation toward the [new netty 
one|https://github.com/netty/netty/blob/c74b3f3a3b73fee125048b0f486fc9c19fb3bc14/transport/src/main/java/io/netty/channel/nio/SelectedSelectionKeySet.java].
As you can see, the current netty implementation resets the set by nullifying 
the array, and this has not been a problem at all for netty since 2016.
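
For illustration, resetting by nulling the used slots in place could be 
sketched as follows. ReusableKeyArray is a hypothetical class written for this 
example, not the netty or Ignite implementation; the sketch only shows that 
the backing array is reused and old references are released.

```java
import java.util.Arrays;

/** Hypothetical array holder; not the netty or Ignite class. */
final class ReusableKeyArray {
    private Object[] keys = new Object[4];
    private int size;

    /** Appends a key, doubling the backing array when it is full. */
    void add(Object key) {
        if (size == keys.length)
            keys = Arrays.copyOf(keys, size << 1);

        keys[size++] = key;
    }

    /** Clears the used slots in place, so no new array is allocated. */
    void reset() {
        Arrays.fill(keys, 0, size, null);
        size = 0;
    }

    int size() {
        return size;
    }

    int capacity() {
        return keys.length;
    }

    Object slot(int i) {
        return keys[i];
    }
}
```

After reset() the capacity stays the same and the old slots no longer hold 
references, so stale keys can be collected without allocating a new array.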


[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-04-03 Thread Aleksey Plekhanov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074455#comment-17074455
 ] 

Aleksey Plekhanov commented on IGNITE-12845:


Creating a new array puts pressure on the GC, which is what we are trying to 
avoid with our own implementation.

I agree about the dev-list.




[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-04-03 Thread Ivan Daschinskiy (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074440#comment-17074440
 ] 

Ivan Daschinskiy commented on IGNITE-12845:
---

[~alex_pl] reset: just put the array reference on the stack, set the field to 
null, then assign a freshly created array to the field? Do you think that is 
costly?

The current implementation is obviously incorrect, and some code in 
GridNioServer iterates over the set and simply doesn't work.
This should be fixed.

> GridNioServer can infinitely lose some events 
> --
>
> Key: IGNITE-12845
> URL: https://issues.apache.org/jira/browse/IGNITE-12845
> Project: Ignite
>  Issue Type: Bug
>Reporter: Aleksey Plekhanov
>Priority: Major
>
> With enabled optimization (IGNITE_NO_SELECTOR_OPTS = false, by default) 
> {{GridNioServer}} can lose some events for a channel (depending on JDK 
> version and OS). It can lead to connected applications hang. Reproducer: 
> {code:java}
> public void testConcurrentLoad() throws Exception {
> startGrid(0);
> try (IgniteClient client = Ignition.startClient(new 
> ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
> ClientCache cache = 
> client.getOrCreateCache(DEFAULT_CACHE_NAME);
> GridTestUtils.runMultiThreaded(
> () -> {
> for (int i = 0; i < 1000; i++)
> cache.put(i, i);
> }, 5, "run-async");
> }
> }
> {code}
> This reproducer hangs eventually on MacOS (tested with JDK 8, 11, 12, 13, 
> 14), hangs on some Linux environments (for example passed more than 100 times 
> on desktop Linux system with JDK 8, but hangs on team-city agents with JDK 8, 
> 11) and never hanged (passed more than 100 times) on windows system, but 
> passes on all systems and JDK versions when system property 
> {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>  
> The root cause: optimized {{SelectedSelectionKeySet}} always returns 
> {{false}} for {{contains()}} method. The {{contains()}} method used by 
> {{sun.nio.ch.SelectorImpl.processReadyEvents()}} method:
> {code:java}
> if (selectedKeys.contains(ski)) {
>     if (ski.translateAndUpdateReadyOps(rOps)) {
>         return 1;
>     }
> } else {
>     ski.translateAndSetReadyOps(rOps);
>     if ((ski.nioReadyOps() & ski.nioInterestOps()) != 0) {
>         selectedKeys.add(ski);
>         return 1;
>     }
> }
> {code}
> So, with a fair implementation, if a selection key is already contained in 
> the selected keys set, its ready operation flags are updated. With 
> {{SelectedSelectionKeySet}}, however, the ready operation flags are always 
> overwritten and a new selection key is added even if it's already contained 
> in the set. Some {{SelectorImpl}} implementations can pass several events for 
> one selection key to the {{processReadyEvents}} method (for example, the macOS 
> implementation {{KQueueSelectorImpl}} works this way). In this case, 
> duplicate selection keys are added to {{selectedKeys}} and all events except 
> the last are lost.
> Two bad things happen in {{GridNioServer}} for the reasons described above:
>  # Some event flags are lost and the worker doesn't process the corresponding 
> action (for the attached reproducer, the "channel is ready for reading" event 
> is lost and the workers never read the channel after some point in time).
>  # Selection keys are duplicated with the same event flags (for the attached 
> reproducer it's the "channel is ready for writing" event). This duplication 
> leads to wrong processing of the {{GridSelectorNioSessionImpl#procWrite}} 
> flag, which will be {{false}} in some cases, while at the same time the 
> selection key's {{interestedOps}} will contain the {{OP_WRITE}} operation, 
> and this operation will never be excluded.
> Possible solutions:
>  * A fair implementation of the {{SelectedSelectionKeySet.contains}} method 
> (this will solve all problems but can be resource-consuming).
>  * Always set {{GridSelectorNioSessionImpl#procWrite}} to {{true}} when 
> adding {{OP_WRITE}} to {{interestedOps}} (for example, in the 
> {{AbstractNioClientWorker.registerWrite()}} method). In this case, some 
> "channel is ready for reading" events (but not data) can still be lost, but 
> not indefinitely, and eventually the data will be read. If the events are 
> reordered (first "channel is ready for writing", then "channel is ready for 
> reading"), the write to the channel will only be processed after all data 
> has been read.
>  * Exclude {{OP_WRITE}} from {{interestedOps}} even if 
> {{GridSelectorNioSessionImpl#procWrite}} is {{false}} when there are no write 
> requests in the queue (see the {{GridNioServer.stopPollingForWrite()}} 
> method). This solution has the same shortcomings as the previous one. 
>  * Hybrid approach. Use some probabilistic implementation for {{contains}} 
> method (bloom 

[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-04-03 Thread Aleksey Plekhanov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074377#comment-17074377
 ] 

Aleksey Plekhanov commented on IGNITE-12845:


[~ivandasch], the reset operation will still be O(MAX_SIZE).

NIO is a critical part of Ignite; any change to the underlying storage of 
{{SelectedSelectionKeySet}} is risky and must be carefully tested for 
performance. I think the bug should be fixed as simply as possible and the 
ticket should be targeted to 2.8.1. I propose the second or third solution 
from the ticket description. It's simple, it solves the problem (it's not the 
optimal solution, but at least the problem will no longer be critical), and it 
doesn't affect performance. A ticket for the improvement can also be created 
and targeted to 2.9 or a later release. Feel free to create such a ticket. 
Perhaps the solution should be discussed on the dev list to involve more 
participants in the discussion.

I.e. let's fix the bug in the "Bug" ticket targeted to 2.8.1 and make the 
improvement in an "Improvement" ticket targeted to 2.9. WDYT?
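
The second and third options from the ticket description might look roughly like the sketch below. It is a simplified model and not Ignite code: the class `WriteInterestSketch` and its fields are hypothetical stand-ins for the session state, and clearing `procWrite` in `stopPollingForWrite()` is an assumption. Only the `SelectionKey` constants come from the JDK.

```java
import java.nio.channels.SelectionKey;
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical stand-in for a NIO session; not GridSelectorNioSessionImpl.
final class WriteInterestSketch {
    int interestOps = SelectionKey.OP_READ;
    boolean procWrite;
    final Queue<byte[]> writeQueue = new ArrayDeque<>();

    /** Second option: adding OP_WRITE always sets procWrite to true. */
    void registerWrite(byte[] msg) {
        writeQueue.add(msg);
        interestOps |= SelectionKey.OP_WRITE;
        procWrite = true;
    }

    /**
     * Third option: drop OP_WRITE whenever the queue is empty,
     * regardless of the current procWrite value.
     */
    void stopPollingForWrite() {
        if (writeQueue.isEmpty()) {
            interestOps &= ~SelectionKey.OP_WRITE;
            procWrite = false; // Assumption: the flag is cleared together with the interest bit.
        }
    }
}
```

Either way, `OP_WRITE` cannot stay in the interest set forever while the write queue is empty, which is the stuck state described in the ticket.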

[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-04-01 Thread Ivan Daschinskiy (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073357#comment-17073357
 ] 

Ivan Daschinskiy commented on IGNITE-12845:
---

[~alex_pl] If the implementation uses Robin Hood hashing and a good hash 
function (we can apply murmur32 to hashCode()), many sources say that the load 
factor can be 0.8 without any effect on performance. See, for example, 
https://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/
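
A minimal sketch of what such a set could look like, assuming a power-of-two capacity, no removals, and the murmur3 32-bit finalizer as the mixing function. The class `RobinHoodSet` is hypothetical and only illustrates the technique; it is not a proposed patch.

```java
// Hypothetical open-addressing set with Robin Hood probing (no removals).
final class RobinHoodSet {
    private Object[] slots;
    private int mask;
    private int size;

    /** @param capacityPow2 Initial capacity, must be a power of two. */
    RobinHoodSet(int capacityPow2) {
        slots = new Object[capacityPow2];
        mask = capacityPow2 - 1;
    }

    /** murmur3 32-bit finalizer, applied on top of hashCode(). */
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    private int home(Object o) {
        return mix(o.hashCode()) & mask;
    }

    /** Distance of the element stored at idx from its home slot. */
    private int dist(Object o, int idx) {
        return (idx - home(o)) & mask;
    }

    boolean contains(Object key) {
        int idx = home(key);

        for (int d = 0; d <= mask; d++, idx = (idx + 1) & mask) {
            Object cur = slots[idx];

            // Robin Hood invariant: a resident closer to home than our
            // probe distance means the key cannot be further along.
            if (cur == null || dist(cur, idx) < d)
                return false;

            if (cur.equals(key))
                return true;
        }

        return false;
    }

    boolean add(Object key) {
        if (contains(key))
            return false;

        if ((size + 1) * 5 > slots.length * 4) // Keep load factor <= 0.8.
            grow();

        insert(key);
        size++;

        return true;
    }

    private void insert(Object key) {
        int idx = home(key);
        int d = 0;

        while (true) {
            Object cur = slots[idx];

            if (cur == null) {
                slots[idx] = key;
                return;
            }

            int curDist = dist(cur, idx);

            if (curDist < d) { // Take the slot from the "richer" element.
                slots[idx] = key;
                key = cur;
                d = curDist;
            }

            idx = (idx + 1) & mask;
            d++;
        }
    }

    private void grow() {
        Object[] old = slots;

        slots = new Object[old.length * 2];
        mask = slots.length - 1;

        for (Object o : old)
            if (o != null)
                insert(o);
    }

    int size() {
        return size;
    }
}
```

Clearing such a set still means nulling or replacing the whole slot array, so the reset cost depends on capacity rather than on the number of stored keys.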

[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-04-01 Thread Aleksey Plekhanov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073175#comment-17073175
 ] 

Aleksey Plekhanov commented on IGNITE-12845:


[~ivandasch], 

I'm not sure about an open-addressing hash set: iterating over such a set can 
be very inefficient when the set is almost empty. 
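
The concern can be illustrated with a small sketch that only counts slot visits. The helper class `SparseIterationSketch` is hypothetical; it compares scanning a hash table's slot array with scanning a dense array that stores its elements contiguously.

```java
import java.util.List;

// Hypothetical helper; it only counts slot visits to illustrate iteration cost.
final class SparseIterationSketch {
    /** Iterating an open-addressing table must scan every slot, empty or not. */
    static int scanOpenAddressing(Object[] table, List<Object> out) {
        int visited = 0;

        for (Object slot : table) {
            visited++;

            if (slot != null)
                out.add(slot);
        }

        return visited;
    }

    /** Iterating a dense array touches only the first {@code size} slots. */
    static int scanDense(Object[] items, int size, List<Object> out) {
        int visited = 0;

        for (int i = 0; i < size; i++) {
            visited++;
            out.add(items[i]);
        }

        return visited;
    }
}
```

With one ready key in a 1024-slot table, the first method visits 1024 slots while the second visits one.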

> GridNioServer can infinitely lose some events 
> --
>
> Key: IGNITE-12845
> URL: https://issues.apache.org/jira/browse/IGNITE-12845
> Project: Ignite
>  Issue Type: Bug
>Reporter: Aleksey Plekhanov
>Priority: Major
>
> With enabled optimization (IGNITE_NO_SELECTOR_OPTS = false, by default) 
> {{GridNioServer}} can lose some events for a channel (depending on JDK 
> version and OS). It can lead to connected applications hang. Reproducer: 
> {code:java}
> public void testConcurrentLoad() throws Exception {
>     startGrid(0);
>
>     try (IgniteClient client = Ignition.startClient(
>         new ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>         ClientCache cache = client.getOrCreateCache(DEFAULT_CACHE_NAME);
>
>         GridTestUtils.runMultiThreaded(
>             () -> {
>                 for (int i = 0; i < 1000; i++)
>                     cache.put(i, i);
>             }, 5, "run-async");
>     }
> }
> {code}
> This reproducer hangs eventually on MacOS (tested with JDK 8, 11, 12, 13, 
> 14), hangs on some Linux environments (for example passed more than 100 times 
> on desktop Linux system with JDK 8, but hangs on team-city agents with JDK 8, 
> 11) and never hanged (passed more than 100 times) on windows system, but 
> passes on all systems and JDK versions when system property 
> {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>  
> The root cause: optimized {{SelectedSelectionKeySet}} always returns 
> {{false}} for {{contains()}} method. The {{contains()}} method used by 
> {{sun.nio.ch.SelectorImpl.processReadyEvents()}} method:
> {code:java}
> if (selectedKeys.contains(ski)) {
>     if (ski.translateAndUpdateReadyOps(rOps)) {
>         return 1;
>     }
> } else {
>     ski.translateAndSetReadyOps(rOps);
>     if ((ski.nioReadyOps() & ski.nioInterestOps()) != 0) {
>         selectedKeys.add(ski);
>         return 1;
>     }
> }
> {code}
> So, for fair implementation, if a selection key is contained in the selected 
> keys set, then ready operations flags are updated, but for 
> {{SelectedSelectionKeySet}} ready operations flags will be always overridden 
> and new selector key will be added even if it's already contained in the set. 
> Some {{SelectorImpl}} implementations can pass several events for one 
> selector key to {{processReadyEvents}} method (for example, MacOs 
> implementation {{KQueueSelectorImpl}} works in such a way). In this case, 
> duplicated selector keys will be added to {{selectedKeys}} and all events 
> except last will be lost.
> Two bad things happen in {{GridNioServer}} due to described above reasons:
>  # Some event flags are lost and the worker doesn't process corresponding 
> action (for attached reproducer "channel is ready for reading" event is lost 
> and the workers never read the channel after some point in time).
>  # Duplicated selector keys with the same event flags (for attached 
> reproducer it's "channel is ready for writing" event, this duplication leads 
> to wrong processing of {{GridSelectorNioSessionImpl#procWrite}} flag, which 
> will be {{false}} in some cases, but at the same time selector key's 
> {{interestedOps}} will contain {{OP_WRITE}} operation and this operation 
> never be excluded) 
> Possible solutions:
>  * Fair implementation of {{SelectedSelectionKeySet.contains}} method (this 
> will solve all problems but can be resource consuming)
>  * Always set {{GridSelectorNioSessionImpl#procWrite}} to {{true}} when 
> adding {{OP_WRITE}} to {{interestedOps}} (for example in 
> {{AbstractNioClientWorker.registerWrite()}} method). In this case, some 
> "channel is ready for reading" events (but not data) still can be lost, but 
> not infinitely, and eventually data will be read. If events will be reordered 
> (first "channel is ready for writing", after it "channel is ready for 
> reading") then write to the channel will be only processed after all data 
> will be read.
>  * Exclude {{OP_WRITE}} from {{interestedOps}} even if 
> {{GridSelectorNioSessionImpl#procWrite}} is {{false}} when there are no write 
> requests in the queue (see {{GridNioServer.stopPollingForWrite()}} method). 
> This solution has the same shortcomings as the previous one. 
>  * Hybrid approach. Use some probabilistic implementation for {{contains}} 
> method (bloom filter or just check the last element) and use one of two 
> previous solutions as a workaround, for cases when we incorrectly return 
> {{false}} for 

[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-04-01 Thread Ivan Daschinskiy (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073124#comment-17073124
 ] 

Ivan Daschinskiy commented on IGNITE-12845:
---

[~alex_pl] I think that implementing SelectedSelectionKeySet as an 
open-addressing hash set can solve this problem completely. 
*  Calling remove will null out the array element, as the current 
implementation does.
*  We can use the simple selector usage pattern (iterate and remove).

Yes, we should clear this set before every selectNow() or select(), but we can 
wrap the Selector and do all the work correctly 
(see the [current implementation of the set in 
netty|https://github.com/netty/netty/commit/795f318c3c11ec0520e7acd963ad4b310c287c20#diff-47ddf03d4cdcb32be935ca412f455ee5]
 for example).

Also, as part of the fix, I suggest using Unsafe for instrumentation and 
falling back to reflection as a last resort, as Netty does.

Do you mind if I assign this ticket to myself? Do you have any objections to 
my suggestions?
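
A minimal sketch of the wrapping idea, assuming the custom key set exposes some way to clear itself. The class `ResettingSelector` and the `reset` callback are hypothetical; only the `java.nio.channels.Selector` methods being overridden are real JDK API.

```java
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.spi.SelectorProvider;
import java.util.Set;

// Hypothetical wrapper: runs a reset callback before every select call.
final class ResettingSelector extends Selector {
    private final Selector delegate;
    private final Runnable reset;

    ResettingSelector(Selector delegate, Runnable reset) {
        this.delegate = delegate;
        this.reset = reset;
    }

    @Override public boolean isOpen() {
        return delegate.isOpen();
    }

    @Override public SelectorProvider provider() {
        return delegate.provider();
    }

    @Override public Set<SelectionKey> keys() {
        return delegate.keys();
    }

    @Override public Set<SelectionKey> selectedKeys() {
        return delegate.selectedKeys();
    }

    @Override public int selectNow() throws IOException {
        reset.run(); // Clear the custom key set before selecting.
        return delegate.selectNow();
    }

    @Override public int select(long timeout) throws IOException {
        reset.run();
        return delegate.select(timeout);
    }

    @Override public int select() throws IOException {
        reset.run();
        return delegate.select();
    }

    @Override public Selector wakeup() {
        delegate.wakeup();
        return this;
    }

    @Override public void close() throws IOException {
        delegate.close();
    }
}
```

Worker code would then only ever call the wrapper, so the clearing step cannot be forgotten at individual call sites.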

[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-03-31 Thread Aleksey Plekhanov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071597#comment-17071597
 ] 

Aleksey Plekhanov commented on IGNITE-12845:


[~antonovsergey93], sorry, in JDK 8 it's inside the derived class 
implementations (KQueueSelectorImpl for Mac OS, for example), but it was later 
moved to SelectorImpl (at least since JDK 11).

[jira] [Commented] (IGNITE-12845) GridNioServer can infinitely lose some events

2020-03-31 Thread Sergey Antonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071590#comment-17071590
 ] 

Sergey Antonov commented on IGNITE-12845:
-

[~alex_pl] I didn't find any {{Set#contains(Object)}} usages in 
{{sun.nio.ch.SelectorImpl}} in JDK 8 (1.8.0_191). 

> GridNioServer can infinitely lose some events 
> --
>
> Key: IGNITE-12845
> URL: https://issues.apache.org/jira/browse/IGNITE-12845
> Project: Ignite
>  Issue Type: Bug
>Reporter: Aleksey Plekhanov
>Priority: Major
>
> With enabled optimization (IGNITE_NO_SELECTOR_OPTS = false, by default) 
> {{GridNioServer}} can lose some events for a channel (depending on JDK 
> version and OS). It can lead to connected applications hang. Reproducer: 
> {code:java}
> public void testConcurrentLoad() throws Exception {
>     startGrid(0);
>
>     try (IgniteClient client = Ignition.startClient(
>         new ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>         ClientCache cache = client.getOrCreateCache(DEFAULT_CACHE_NAME);
>
>         GridTestUtils.runMultiThreaded(
>             () -> {
>                 for (int i = 0; i < 1000; i++)
>                     cache.put(i, i);
>             }, 5, "run-async");
>     }
> }
> {code}
> This reproducer hangs eventually on MacOS (tested with JDK 8, 11, 12, 13, 
> 14), hangs on some Linux environments (for example passed more than 100 times 
> on desktop Linux system with JDK 8, but hangs on team-city agents with JDK 8, 
> 11) and never hanged (passed more than 100 times) on windows system, but 
> passes on all systems and JDK versions when system property 
> {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>  
> The root cause: optimized {{SelectedSelectionKeySet}} always returns 
> {{false}} for {{contains()}} method. The {{contains()}} method used by 
> {{sun.nio.ch.SelectorImpl.processReadyEvents()}} method:
> {code:java}
> if (selectedKeys.contains(ski)) {
>     if (ski.translateAndUpdateReadyOps(rOps)) {
>         return 1;
>     }
> } else {
>     ski.translateAndSetReadyOps(rOps);
>     if ((ski.nioReadyOps() & ski.nioInterestOps()) != 0) {
>         selectedKeys.add(ski);
>         return 1;
>     }
> }
> {code}
> So, for fair implementation, if a selection key is contained in the selected 
> keys set, then ready operations flags are updated, but for 
> {{SelectedSelectionKeySet}} ready operations flags will be always overridden 
> and new selector key will be added even if it's already contained in the set. 
> Some {{SelectorImpl}} implementations can pass several events for one 
> selector key to {{processReadyEvents}} method (for example, MacOs 
> implementation {{KQueueSelectorImpl}} works in such a way). In this case, 
> duplicated selector keys will be added to {{selectedKeys}} and all events 
> except last will be lost.
> Two bad things happen in {{GridNioServer}} due to described above reasons:
>  # Some event flags are lost and the worker doesn't process corresponding 
> action (for attached reproducer "channel is ready for reading" event is lost 
> and the workers never read the channel after some point in time).
>  # Duplicated selector keys with the same event flags (for attached 
> reproducer it's "channel is ready for writing" event, this duplication leads 
> to wrong processing of {{GridSelectorNioSessionImpl#procWrite}} flag, which 
> will be {{false}} in some cases, but at the same time selector key's 
> {{interestedOps}} will contain {{OP_WRITE}} operation and this operation 
> never be excluded) 
> Possible solutions:
>  * Fair implementation of {{SelectedSelectionKeySet.contains}} method (this 
> will solve all problems but can be resource consuming)
>  * Always set {{GridSelectorNioSessionImpl#procWrite}} to {{true}} when 
> adding {{OP_WRITE}} to {{interestedOps}} (for example in 
> {{AbstractNioClientWorker.registerWrite()}} method). In this case, some 
> "channel is ready for reading" events (but not data) still can be lost, but 
> not infinitely, and eventually data will be read.
>  * Exclude {{OP_WRITE}} from {{interestedOps}} even if 
> {{GridSelectorNioSessionImpl#procWrite}} is {{false}} when there are no write 
> requests in the queue (see {{GridNioServer.stopPollingForWrite()}} method). 
> This solution has the same shortcomings as the previous one. 
>  
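
The root cause described in the ticket can be reproduced in isolation with a simplified model. In the sketch below, `Key`, `KeySet` and `processReadyEvent` are hypothetical stand-ins that only mirror the branch structure of the quoted `processReadyEvents` snippet; they are not JDK or Ignite classes.

```java
import java.nio.channels.SelectionKey;
import java.util.AbstractSet;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class LostEventSketch {
    /** Simplified stand-in for a selection key. */
    static final class Key {
        int readyOps;
    }

    /** List-backed set whose contains() is either fair or always false. */
    static final class KeySet extends AbstractSet<Key> {
        private final List<Key> items = new ArrayList<>();
        private final boolean fairContains;

        KeySet(boolean fairContains) {
            this.fairContains = fairContains;
        }

        @Override public boolean add(Key k) {
            return items.add(k);
        }

        @Override public boolean contains(Object o) {
            return fairContains && items.contains(o);
        }

        @Override public Iterator<Key> iterator() {
            return items.iterator();
        }

        @Override public int size() {
            return items.size();
        }
    }

    /** Mirrors the update-or-set branches of the quoted snippet. */
    static void processReadyEvent(KeySet selected, Key key, int rOps) {
        if (selected.contains(key))
            key.readyOps |= rOps;  // Update: merge with the existing flags.
        else {
            key.readyOps = rOps;   // Set: overwrite the existing flags.
            selected.add(key);
        }
    }

    /** Delivers a read event and then a write event for the same key. */
    static Key deliverReadThenWrite(KeySet selected) {
        Key key = new Key();

        processReadyEvent(selected, key, SelectionKey.OP_READ);
        processReadyEvent(selected, key, SelectionKey.OP_WRITE);

        return key;
    }
}
```

With a fair `contains()` the key ends up in the set once with both flags set. With `contains()` always returning `false` the key is added twice and only the `OP_WRITE` flag survives, which matches the lost "channel is ready for reading" event from the ticket.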



--
This message was sent by Atlassian Jira
(v8.3.4#803005)