[jira] [Commented] (GEODE-10144) Regression in geode-native test CqPlusAuthInitializeTest.reAuthenticateWithDurable
[ https://issues.apache.org/jira/browse/GEODE-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525824#comment-17525824 ] Anthony Baker commented on GEODE-10144: --- I think the action here is to 1) update the SecurityManager implementation in the test to consistently evaluate tokens (ie never authorize an expired token) 2) Remove CQ/RI since this is not supported for reauthentication + old protocol versions > Regression in geode-native test > CqPlusAuthInitializeTest.reAuthenticateWithDurable > -- > > Key: GEODE-10144 > URL: https://issues.apache.org/jira/browse/GEODE-10144 > Project: Geode > Issue Type: Bug > Components: client/server, native client >Affects Versions: 1.15.0 >Reporter: Blake Bender >Assignee: Blake Bender >Priority: Major > Labels: NativeClient > Fix For: 1.15.0 > > > This test is failing across the board in the `geode-native` PR pipeline. > Main develop pipeline is green only because nothing can get through the PR > pipeline to clear checkin gates. We have green CI runs with 1.15. build 918, > then it started failing when we picked up build 924. > > [~moleske] tracked this back to this commit: > [https://github.com/apache/geode/commit/2554f42b925f2b9b8ca7eee14c7a887436b1d9db|https://github.com/apache/geode/commit/2554f42b925f2b9b8ca7eee14c7a887436b1d9db]. > See his notes in `geode-native` PR # 947 > ([https://github.com/apache/geode-native/pull/947]) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (GEODE-10144) Regression in geode-native test CqPlusAuthInitializeTest.reAuthenticateWithDurable
[ https://issues.apache.org/jira/browse/GEODE-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521963#comment-17521963 ] Blake Bender commented on GEODE-10144: -- This has been regressed again as of build 1.15.0.1097. > Regression in geode-native test > CqPlusAuthInitializeTest.reAuthenticateWithDurable > -- > > Key: GEODE-10144 > URL: https://issues.apache.org/jira/browse/GEODE-10144 > Project: Geode > Issue Type: Bug > Components: client/server >Affects Versions: 1.15.0 >Reporter: Blake Bender >Assignee: Jinmei Liao >Priority: Major > Labels: blocks-1.15.0, needsTriage > Fix For: 1.15.0 > > > This test is failing across the board in the `geode-native` PR pipeline. > Main develop pipeline is green only because nothing can get through the PR > pipeline to clear checkin gates. We have green CI runs with 1.15. build 918, > then it started failing when we picked up build 924. > > [~moleske] tracked this back to this commit: > [https://github.com/apache/geode/commit/2554f42b925f2b9b8ca7eee14c7a887436b1d9db|https://github.com/apache/geode/commit/2554f42b925f2b9b8ca7eee14c7a887436b1d9db]. > See his notes in `geode-native` PR # 947 > ([https://github.com/apache/geode-native/pull/947]) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (GEODE-10144) Regression in geode-native test CqPlusAuthInitializeTest.reAuthenticateWithDurable
[ https://issues.apache.org/jira/browse/GEODE-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511339#comment-17511339 ] Barrett Oglesby commented on GEODE-10144: - This issue comes down to a few factors: - The default PoolFactory::DEFAULT_SUBSCRIPTION_ACK_INTERVAL of 100 seconds means all the events stay on the queue for entire test. This means that every time the client disconnects, the server starts over again processing the queue from the beginning. - The NC client is version GFE 9.0 (earlier) so no ClientReAuthenticateMessage is sent to it when a AuthenticationExpiredException occurs. The server waits anyway in case new credentials are sent through another operation. - With the new changes, the server waits 5 seconds to be notified of a re-auth. If no re-auth occurs, it waits the entire 5 seconds. - With the old code, the server waits 200 ms before attempting to process the event again (which includes asking for authorization again). The SimulatedExpirationSecurityManager randomly decides whether to authorize the event. 99% of the time, it returns true. So, the second request will almost always return true. - So without any external event (like new credentials): -- With the the old code, the Message Dispatcher processes the event successfully after 200 ms with no client disconnect -- With the new code, the Message Dispatcher waits 5 seconds and then disconnects the client > Regression in geode-native test > CqPlusAuthInitializeTest.reAuthenticateWithDurable > -- > > Key: GEODE-10144 > URL: https://issues.apache.org/jira/browse/GEODE-10144 > Project: Geode > Issue Type: Bug > Components: client/server >Affects Versions: 1.15.0 >Reporter: Blake Bender >Assignee: Jinmei Liao >Priority: Major > Labels: blocks-1.15.0, needsTriage > Fix For: 1.15.0 > > > This test is failing across the board in the `geode-native` PR pipeline. > Main develop pipeline is green only because nothing can get through the PR > pipeline to clear checkin gates. We have green CI runs with 1.15. build 918, > then it started failing when we picked up build 924. > > [~moleske] tracked this back to this commit: > [https://github.com/apache/geode/commit/2554f42b925f2b9b8ca7eee14c7a887436b1d9db|https://github.com/apache/geode/commit/2554f42b925f2b9b8ca7eee14c7a887436b1d9db]. > See his notes in `geode-native` PR # 947 > ([https://github.com/apache/geode-native/pull/947]) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (GEODE-10144) Regression in geode-native test CqPlusAuthInitializeTest.reAuthenticateWithDurable
[ https://issues.apache.org/jira/browse/GEODE-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510954#comment-17510954 ] Barrett Oglesby commented on GEODE-10144: - Here is why this test passes with the previous server code. In the previous server code, after the Client Message Dispatcher caught an AuthenticationExpiredException, it slept for 200 ms before trying again. It does this for up to 5 seconds before giving up. Each time it retries, it asks for authorization again. Here is a case where SimulatedExpirationSecurityManager.authorize throws an AuthenticationExpiredException: {noformat} [warn 2022/03/22 14:59:04.110 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to dispatchMessage operation=AFTER_CREATE; key=key130 [warn 2022/03/22 14:59:04.110 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX SimulatedExpirationSecurityManager.authorize about to throw AuthenticationExpiredException [warn 2022/03/22 14:59:04.110 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher caught AuthenticationExpiredException [warn 2022/03/22 14:59:04.110 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher skipped sending ClientReAuthenticateMessage clientVersion=GFE 9.0 {noformat} The Client Message Dispatcher sleeps for 200 ms: {noformat} [warn 2022/03/22 14:59:04.110 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to sleep 1 for 200 ms [warn 2022/03/22 14:59:04.311 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher done sleep 1 {noformat} When it wakes up, it checks for authorization again. This time, the SimulatedExpirationSecurityManager returns true, so the message is sent: {noformat} [warn 2022/03/22 14:59:04.311 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to dispatchMessage operation=AFTER_CREATE; key=key130 [warn 2022/03/22 14:59:04.311 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 14:59:04.311 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher done dispatchMessage operation=AFTER_CREATE; key=key130 {noformat} This path is not relying on outside operations to notify the Client Message Dispatcher. The SimulatedExpirationSecurityManager authorizes the operation after the sleep. So at the end of the run where there are no client operations, the Client Message Dispatcher is most likely only going to sleep 200 ms. There is never going to be a 5 second wait. I did see a few times where the Client Message Dispatcher slept twice through the loop (so 400 ms): {noformat} [warn 2022/03/22 14:59:23.924 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to dispatchMessage operation=AFTER_UPDATE; key=key4820 [warn 2022/03/22 14:59:23.924 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX SimulatedExpirationSecurityManager.authorize about to throw AuthenticationExpiredException [warn 2022/03/22 14:59:23.924 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher caught AuthenticationExpiredException [warn 2022/03/22 14:59:23.924 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher skipped sending ClientReAuthenticateMessage clientVersion=GFE 9.0 [warn 2022/03/22 14:59:23.924 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to sleep 1 for 200 ms [warn 2022/03/22 14:59:24.124 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher done sleep 1 [warn 2022/03/22 14:59:24.124 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to dispatchMessage operation=AFTER_UPDATE; key=key4820 [warn 2022/03/22 14:59:24.124 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX SimulatedExpirationSecurityManager.authorize about to throw AuthenticationExpiredException [warn 2022/03/22 14:59:24.124 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher caught AuthenticationExpiredException [warn 2022/03/22 14:59:24.124 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about
[jira] [Commented] (GEODE-10144) Regression in geode-native test CqPlusAuthInitializeTest.reAuthenticateWithDurable
[ https://issues.apache.org/jira/browse/GEODE-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510953#comment-17510953 ] Barrett Oglesby commented on GEODE-10144: - Even though this JIRA is resolved, I did some analysis on it. I see whats going on in this test. At the beginning of the test client cache operations are occurring simultaneously with message dispatching from the server to the client. Here is a ServerConnection processing puts: {noformat} [warn 2022/03/22 15:40:01.096 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX Put70.cmdExecute operation=UPDATE; key=key50 [warn 2022/03/22 15:40:01.096 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 15:40:01.099 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX Put70.cmdExecute operation=UPDATE; key=key51 [warn 2022/03/22 15:40:01.099 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 15:40:01.101 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX Put70.cmdExecute operation=UPDATE; key=key52 [warn 2022/03/22 15:40:01.102 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 15:40:01.104 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX Put70.cmdExecute operation=UPDATE; key=key53 [warn 2022/03/22 15:40:01.104 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 15:40:01.106 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX Put70.cmdExecute operation=UPDATE; key=key54 [warn 2022/03/22 15:40:01.107 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x50] XXX SimulatedExpirationSecurityManager.authorize about to return true {noformat} At the same time, the Client Message Dispatcher is dispatching events to the client: {noformat} [warn 2022/03/22 15:40:01.098 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to dispatchMessage operation=AFTER_CREATE; key=key50 [warn 2022/03/22 15:40:01.098 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 15:40:01.099 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher done dispatchMessage operation=AFTER_CREATE; key=key50 [warn 2022/03/22 15:40:01.101 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to dispatchMessage operation=AFTER_CREATE; key=key51 [warn 2022/03/22 15:40:01.101 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 15:40:01.101 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher done dispatchMessage operation=AFTER_CREATE; key=key51 [warn 2022/03/22 15:40:01.103 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to dispatchMessage operation=AFTER_CREATE; key=key52 [warn 2022/03/22 15:40:01.104 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 15:40:01.104 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher done dispatchMessage operation=AFTER_CREATE; key=key52 [warn 2022/03/22 15:40:01.106 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher about to dispatchMessage operation=AFTER_CREATE; key=key53 [warn 2022/03/22 15:40:01.106 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX SimulatedExpirationSecurityManager.authorize about to return true [warn 2022/03/22 15:40:01.106 PDT CqPlusAuthInitializeTest_reAuthenticateWithDurable_server_0 tid=0x51] XXX MessageDispatcher.runDispatcher done dispatchMessage operation=AFTER_CREATE; key=key53 {noformat} While the Client Message Dispatcher, it requests authorization which fails. Since the NC is not the latest version, no ClientReAuthenticateMessage is sent to it. The dispatcher waits anyway in case another operation updates the credentials: {noformat} [warn 2022/03/22 15:40:01.109 PDT CqPlusAuthInitializeTest_reAuthen
[jira] [Commented] (GEODE-10144) Regression in geode-native test CqPlusAuthInitializeTest.reAuthenticateWithDurable
[ https://issues.apache.org/jira/browse/GEODE-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510939#comment-17510939 ] Jinmei Liao commented on GEODE-10144: - commit reverted > Regression in geode-native test > CqPlusAuthInitializeTest.reAuthenticateWithDurable > -- > > Key: GEODE-10144 > URL: https://issues.apache.org/jira/browse/GEODE-10144 > Project: Geode > Issue Type: Bug > Components: client/server >Affects Versions: 1.15.0 >Reporter: Blake Bender >Assignee: Jinmei Liao >Priority: Major > Labels: blocks-1.15.0, needsTriage > > This test is failing across the board in the `geode-native` PR pipeline. > Main develop pipeline is green only because nothing can get through the PR > pipeline to clear checkin gates. We have green CI runs with 1.15. build 918, > then it started failing when we picked up build 924. > > [~moleske] tracked this back to this commit: > [https://github.com/apache/geode/commit/2554f42b925f2b9b8ca7eee14c7a887436b1d9db|https://github.com/apache/geode/commit/2554f42b925f2b9b8ca7eee14c7a887436b1d9db]. > See his notes in `geode-native` PR # 947 > ([https://github.com/apache/geode-native/pull/947]) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (GEODE-10144) Regression in geode-native test CqPlusAuthInitializeTest.reAuthenticateWithDurable
[ https://issues.apache.org/jira/browse/GEODE-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510485#comment-17510485 ] Blake Bender commented on GEODE-10144: -- Notes from investigation 2022-03-21: * Description of test: Test creates a durable cache, provides an instrumented AuthInitialize implementation to the client and a SecurityManager to the server. Test registers a listener that counts events, then does: 5000 puts, 5000 updates, 5000 destroys, then checks that the listener got 5000 LOCAL_CREATE, 5000 LOCAL_UPDATE, and 5000 LOCAL_DESTROY messages. Next, test shuts down the cache and creates a 2nd durable cache with the same durable-client-id, registers CQ again, and expects all events to still be in the queue and to receive them all again. After conversation with Barry, the 2nd part of the test may or may not be valid, but is at the very least a strange thing to do. * Prior to the commit in question, we discovered that the first half of the test would always pass, but the 2nd half was a little flaky when running a debug build of geode-native. Due to performance of unoptimized code and debug-level logging, the test would sometimes overrun the default subscription ack interval (100 seconds), thus losing some events from the queue and failing the event count. * After the bad commit, we've observed that no LOCAL_DESTROY events are received in the first part of the test. We are looking into why this is the case. > Regression in geode-native test > CqPlusAuthInitializeTest.reAuthenticateWithDurable > -- > > Key: GEODE-10144 > URL: https://issues.apache.org/jira/browse/GEODE-10144 > Project: Geode > Issue Type: Bug > Components: client/server >Affects Versions: 1.15.0 >Reporter: Blake Bender >Assignee: Jinmei Liao >Priority: Major > Labels: blocks-1.15.0, needsTriage > > This test is failing across the board in the `geode-native` PR pipeline. > Main develop pipeline is green only because nothing can get through the PR > pipeline to clear checkin gates. We have green CI runs with 1.15. build 918, > then it started failing when we picked up build 924. > > [~moleske] tracked this back to this commit: > [https://github.com/apache/geode/commit/2554f42b925f2b9b8ca7eee14c7a887436b1d9db|https://github.com/apache/geode/commit/2554f42b925f2b9b8ca7eee14c7a887436b1d9db]. > See his notes in `geode-native` PR # 947 > ([https://github.com/apache/geode-native/pull/947]) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (GEODE-10144) Regression in geode-native test CqPlusAuthInitializeTest.reAuthenticateWithDurable
[ https://issues.apache.org/jira/browse/GEODE-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510479#comment-17510479 ] Blake Bender commented on GEODE-10144: -- [~jinmeiliao] Be sure to check in with [~boglesby] before diving into this one, he and I have been working on it since yesterday. > Regression in geode-native test > CqPlusAuthInitializeTest.reAuthenticateWithDurable > -- > > Key: GEODE-10144 > URL: https://issues.apache.org/jira/browse/GEODE-10144 > Project: Geode > Issue Type: Bug > Components: client/server >Affects Versions: 1.15.0 >Reporter: Blake Bender >Assignee: Jinmei Liao >Priority: Major > Labels: blocks-1.15.0, needsTriage > > This test is failing across the board in the `geode-native` PR pipeline. > Main develop pipeline is green only because nothing can get through the PR > pipeline to clear checkin gates. We have green CI runs with 1.15. build 918, > then it started failing when we picked up build 924. > > [~moleske] tracked this back to this commit: > [https://github.com/apache/geode/commit/2554f42b925f2b9b8ca7eee14c7a887436b1d9db|https://github.com/apache/geode/commit/2554f42b925f2b9b8ca7eee14c7a887436b1d9db]. > See his notes in `geode-native` PR # 947 > ([https://github.com/apache/geode-native/pull/947]) -- This message was sent by Atlassian Jira (v8.20.1#820001)