[jira] [Commented] (IGNITE-10808) Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage
    [ https://issues.apache.org/jira/browse/IGNITE-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914111#comment-16914111 ]

Denis Mekhanikov commented on IGNITE-10808:
-------------------------------------------

[~sergey-chugunov] Thanks a lot for the review! The tests don't show any blockers.

[~DmitriyGovorukhin], could you help with a merge?

> Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage
> --------------------------------------------------------------------------
>
>                 Key: IGNITE-10808
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10808
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.7
>            Reporter: Stanislav Lukyanov
>            Assignee: Denis Mekhanikov
>            Priority: Major
>              Labels: discovery
>             Fix For: 2.8
>
>         Attachments: IgniteMetricsOverflowTest.java
>
>
> A node receives a new metrics update message every `metricsUpdateFrequency`
> milliseconds, and the message is put at the top of the queue (because it
> is a high priority message).
> If processing one message takes more than `metricsUpdateFrequency`, then
> multiple `TcpDiscoveryMetricsUpdateMessage` instances will be in the queue.
> A long enough delay (e.g. caused by a network glitch or GC) may lead to the
> queue building up tens of metrics update messages that are essentially
> useless to process. Finally, if processing a message takes, on average, a
> little more than `metricsUpdateFrequency` (even for a relatively short
> period of time, say, for a minute due to network issues), then the message
> worker will end up processing only the metrics updates and the cluster will
> essentially hang.
> Reproducer is attached. In the test, the queue first builds up and is then
> torn down very slowly, causing "Failed to wait for PME" messages.
> Need to change ServerImpl's SocketReader not to put another metrics update
> message at the top of the queue if it already has one (or to replace the one
> at the top with the new one).

--
This message was sent by Atlassian Jira (v8.3.2#803003)
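The build-up described in the quoted issue text can be illustrated with a toy model. This is only a sketch, not Ignite code: the class `MetricsBacklogModel`, the method `pendingAt`, and all numbers are hypothetical, and it assumes a single worker, a fixed processing time per message, and no other traffic in the queue.

```java
/** Toy model of the backlog described in the issue text; not Ignite code. */
final class MetricsBacklogModel {
    /**
     * Counts messages that have arrived by durationMs but are not yet fully
     * processed, assuming one arrival every freqMs and a single worker that
     * needs processMs per message.
     */
    static int pendingAt(long freqMs, long processMs, long durationMs) {
        int pending = 0;
        long prevDone = 0; // time at which the worker finished the previous message

        for (long arrival = freqMs; arrival <= durationMs; arrival += freqMs) {
            // A message is started once it has arrived and the previous one is done.
            long done = Math.max(arrival, prevDone) + processMs;

            if (done > durationMs)
                pending++;

            prevDone = done;
        }

        return pending;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: one update every 2000 ms.
        System.out.println(pendingAt(2000, 1000, 59_000));  // worker keeps up: 0
        System.out.println(pendingAt(2000, 2200, 60_000));  // 10% too slow, 1 minute: 4
        System.out.println(pendingAt(2000, 2200, 600_000)); // 10% too slow, 10 minutes: 29
    }
}
```

Under these assumptions, even a small gap between the processing time and the update period leaves a backlog that grows linearly with time, which matches the "tens of metrics update messages" mentioned in the description.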
[jira] [Commented] (IGNITE-10808) Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage
    [ https://issues.apache.org/jira/browse/IGNITE-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914104#comment-16914104 ]

Ignite TC Bot commented on IGNITE-10808:
----------------------------------------

{panel:title=Branch: [pull/5771/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *-- Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=4527057buildTypeId=IgniteTests24Java8_RunAll]
[jira] [Commented] (IGNITE-10808) Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage
    [ https://issues.apache.org/jira/browse/IGNITE-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913204#comment-16913204 ]

Sergey Chugunov commented on IGNITE-10808:
------------------------------------------

[~dmekhanikov],

I reviewed your change one more time, and it looks good to me now.

I've triggered TC once again to make sure the latest refactoring didn't introduce any problems; if the run is green, we are good to merge the change.

Thank you for your efforts!
[jira] [Commented] (IGNITE-10808) Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage
    [ https://issues.apache.org/jira/browse/IGNITE-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912282#comment-16912282 ]

Denis Mekhanikov commented on IGNITE-10808:
-------------------------------------------

[~sergey-chugunov], I reverted the changes for {{TcpDiscoveryClientAckResponse}}. It is now the only high priority message.

If we decide to remove high priority discovery messages completely, then let's do it in a different ticket.

I also did some refactoring. Could you take another look?
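For readers following the priority discussion: the issue text says a high priority message is put at the top of the queue, and the comment above says {{TcpDiscoveryClientAckResponse}} is now the only such message. A minimal sketch of that ordering, assuming a plain double-ended queue; the class `PriorityDequeSketch` and the string message names are hypothetical and do not come from Ignite:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of head-vs-tail insertion by priority; not the Ignite implementation. */
final class PriorityDequeSketch {
    private final Deque<String> queue = new ArrayDeque<>();

    /** High priority messages jump to the head; everything else waits at the tail. */
    void add(String msg, boolean highPriority) {
        if (highPriority)
            queue.addFirst(msg);
        else
            queue.addLast(msg);
    }

    /** Returns the next message to process, or null if the queue is empty. */
    String poll() {
        return queue.pollFirst();
    }

    public static void main(String[] args) {
        PriorityDequeSketch q = new PriorityDequeSketch();

        q.add("node-joined", false);
        q.add("metrics-update", false); // normal priority in this sketch
        q.add("client-ack", true);      // the only high priority message here

        System.out.println(q.poll()); // client-ack
        System.out.println(q.poll()); // node-joined
        System.out.println(q.poll()); // metrics-update
    }
}
```

With metrics updates at normal priority, they can no longer overtake other queued messages, so a slow worker still makes progress on the rest of the queue.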
[jira] [Commented] (IGNITE-10808) Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage
    [ https://issues.apache.org/jira/browse/IGNITE-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911258#comment-16911258 ]

Sergey Chugunov commented on IGNITE-10808:
------------------------------------------

[~dmekhanikov],

Along with the MetricsUpdate message, your change also affects TcpDiscoveryClientAckResponse, which will no longer be processed with priority over other messages. This may be risky.

Could you check what the consequences are for client nodes' stability if acks are delivered to them with some delay?

Thanks.
[jira] [Commented] (IGNITE-10808) Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage
    [ https://issues.apache.org/jira/browse/IGNITE-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908005#comment-16908005 ]

Sergey Chugunov commented on IGNITE-10808:
------------------------------------------

[~DmitriyGovorukhin], [~dmekhanikov], sure, I'll take a look.
[jira] [Commented] (IGNITE-10808) Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage
    [ https://issues.apache.org/jira/browse/IGNITE-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907346#comment-16907346 ]

Dmitriy Govorukhin commented on IGNITE-10808:
---------------------------------------------

[~sergey-chugunov] Could you please help with the review?
[jira] [Commented] (IGNITE-10808) Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage
    [ https://issues.apache.org/jira/browse/IGNITE-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904978#comment-16904978 ]

Ignite TC Bot commented on IGNITE-10808:
----------------------------------------

{panel:title=Branch: [pull/5771/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *-- Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=4457883buildTypeId=IgniteTests24Java8_RunAll]
[jira] [Commented] (IGNITE-10808) Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage
    [ https://issues.apache.org/jira/browse/IGNITE-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730647#comment-16730647 ]

ASF GitHub Bot commented on IGNITE-10808:
-----------------------------------------

GitHub user dmekhanikov opened a pull request:

    https://github.com/apache/ignite/pull/5771

    IGNITE-10808 Ensure RingMessageWorker's progress.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gridgain/apache-ignite ignite-10808

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/ignite/pull/5771.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5771

----
commit 8ccbec84568dc4058ea27dbfbbab0fa2e34477d9
Author: Denis Mekhanikov
Date:   2018-12-26T15:37:45Z

    IGNITE-10808 Add a metrics overflow test.

commit b4ee4478f5e9819957ca52b962af42ab3d17452a
Author: Denis Mekhanikov
Date:   2018-12-28T16:03:20Z

    IGNITE-10808 Drop stale TcpDiscoveryMetricsUpdateMessage upon arrival of fresh ones.

commit 1faec4bc05834dbe1ed7eaf2b5a3265ad503c385
Author: Denis Mekhanikov
Date:   2018-12-29T10:39:05Z

    IGNITE-10808 Modify test.

commit 632dad49954247e1e16f6781a0be88b7e075c399
Author: Denis Mekhanikov
Date:   2018-12-29T11:26:52Z

    IGNITE-10808 Track RingMessageWorker's progress.
[jira] [Commented] (IGNITE-10808) Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage
    [ https://issues.apache.org/jira/browse/IGNITE-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728702#comment-16728702 ]

Stanislav Lukyanov commented on IGNITE-10808:
---------------------------------------------

There are two parts to this problem:

1) The queue may grow indefinitely if metrics updates are generated faster than they're processed.
This can be solved by removing all of the updates but the latest one. When a new metrics update is added to the queue, we should check whether there is already another metrics update in the queue. If there is, replace the old one with the new one (at the same place in the queue). We should be careful to replace only metrics updates that are on their first ring pass; messages on the second ring pass should be left in the queue.

2) The metrics updates may take up too much of the discovery worker's capacity, leading to starvation-type issues.
This can be solved by making metrics updates normal priority instead of high priority. To avoid triggering failure detection, we need to make sure that all messages, not only metrics updates, reset the failure detection timer.
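The first part of the proposal in the comment above (replace a queued first-pass metrics update in place, leave second-pass ones alone) can be sketched as follows. This is an illustration under stated assumptions, not the code from the pull request: `CoalescingQueueSketch`, `Msg`, and the `firstRingPass` flag are hypothetical names, and the sketch ignores concurrency.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of "keep only the latest first-pass metrics update"; not Ignite code. */
final class CoalescingQueueSketch {
    /** Hypothetical stand-in for a discovery message. */
    static final class Msg {
        final String type;
        final int id;
        final boolean firstRingPass;

        Msg(String type, int id, boolean firstRingPass) {
            this.type = type;
            this.id = id;
            this.firstRingPass = firstRingPass;
        }
    }

    private final List<Msg> queue = new ArrayList<>();

    /** Adds a message, replacing a queued first-pass metrics update in place if one exists. */
    void add(Msg msg) {
        if (isFirstPassMetrics(msg)) {
            for (int i = 0; i < queue.size(); i++) {
                if (isFirstPassMetrics(queue.get(i))) {
                    queue.set(i, msg); // same position, fresher payload

                    return;
                }
            }
        }

        // Second-pass metrics updates and all other messages are simply appended.
        queue.add(msg);
    }

    private static boolean isFirstPassMetrics(Msg msg) {
        return msg.firstRingPass && "metrics".equals(msg.type);
    }

    int size() {
        return queue.size();
    }

    Msg get(int idx) {
        return queue.get(idx);
    }

    public static void main(String[] args) {
        CoalescingQueueSketch q = new CoalescingQueueSketch();

        q.add(new Msg("metrics", 1, true));
        q.add(new Msg("custom", 2, true));
        q.add(new Msg("metrics", 3, true));  // replaces id 1 at index 0
        q.add(new Msg("metrics", 4, false)); // second ring pass: appended, never replaced

        System.out.println(q.size() + " " + q.get(0).id); // 3 3
    }
}
```

With this rule the queue holds at most one first-pass metrics update at any time, so its length no longer depends on how far the worker has fallen behind.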