[GitHub] [geode] boglesby commented on pull request #7378: GEODE-10056: Improve gateway-receiver load balance
boglesby commented on pull request #7378:
URL: https://github.com/apache/geode/pull/7378#issuecomment-1077908228

That's a pretty cool idea. I'm not sure whether the CacheServerMXBean has that behavior, but I guess it could be added. In any event, I think this change is good. I'm approving this change, but you need to address the ParallelGatewaySenderConnectionLoadBalanceDistributedTest failure.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@geode.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [geode] boglesby commented on pull request #7378: GEODE-10056: Improve gateway-receiver load balance
boglesby commented on pull request #7378:
URL: https://github.com/apache/geode/pull/7378#issuecomment-1071176231

I ran a few tests with some extra logging on these changes. They look good.

The receiver exchanges profiles with the locator:

```
[warn 2022/03/16 14:16:12.440 PDT locator-ln tid=0x50] XXX LocatorLoadSnapshot.updateConnectionLoadMap location=192.168.1.5:5370; load=0.0
[warn 2022/03/16 14:16:12.441 PDT locator-ln tid=0x50] XXX LocatorLoadSnapshot.updateConnectionLoadMap current load for location=192.168.1.5:5370; group=__recv__group; inputLoad=0.0; currentLoad=0.0
[warn 2022/03/16 14:16:12.441 PDT locator-ln tid=0x50] XXX LocatorLoadSnapshot.updateConnectionLoadMap updated load for location=192.168.1.5:5370; group=__recv__group; inputLoad=0.0; newLoad=0.0
```

The connectionLoadMap shows 2 groups, namely the null group (default) and the __recv__group group (gateway receiver), each with load=0.0:

```
[warn 2022/03/16 14:16:13.777 PDT locator-ln tid=0x43] XXX LocatorLoadSnapshot.logConnectionLoadMap The connectionLoadMap contains the following 2 entries:
  group=null
    location=192.168.1.5:56224; load=0.0
  group=__recv__group
    location=192.168.1.5:5370; load=0.0
```

Sender connects to the receiver:

With the default of 5 dispatcher threads, 5 connections are made to the receiver. The load goes from 0.0 to 0.006246:

```
[warn 2022/03/16 14:16:53.836 PDT locator-ln tid=0x47] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadBeforeUpdate=0.0
[warn 2022/03/16 14:16:53.836 PDT locator-ln tid=0x47] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadAfterUpdate=0.00125
[warn 2022/03/16 14:16:53.836 PDT locator-ln tid=0x5c] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadBeforeUpdate=0.00125
[warn 2022/03/16 14:16:53.836 PDT locator-ln tid=0x5c] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadAfterUpdate=0.0025
[warn 2022/03/16 14:16:53.837 PDT locator-ln tid=0x5b] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadBeforeUpdate=0.0025
[warn 2022/03/16 14:16:53.837 PDT locator-ln tid=0x5b] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadAfterUpdate=0.00375
[warn 2022/03/16 14:16:53.837 PDT locator-ln tid=0x5a] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadBeforeUpdate=0.00375
[warn 2022/03/16 14:16:53.837 PDT locator-ln tid=0x5a] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadAfterUpdate=0.005
[warn 2022/03/16 14:16:53.838 PDT locator-ln tid=0x59] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadBeforeUpdate=0.005
[warn 2022/03/16 14:16:53.838 PDT locator-ln tid=0x59] XXX LocatorLoadSnapshot.getServerForConnection group=__recv__group; server=192.168.1.5:5370; loadAfterUpdate=0.006246
```

The connectionLoadMap shows the same 2 groups but now the __recv__group group load is 0.006246 for the gateway receiver:

```
[warn 2022/03/16 14:16:55.831 PDT locator-ln tid=0x43] XXX LocatorLoadSnapshot.logConnectionLoadMap The connectionLoadMap contains the following 2 entries:
  group=null
    location=192.168.1.5:56224; load=0.0
  group=__recv__group
    location=192.168.1.5:5370; load=0.006246
```

Update the load:

Periodically, the server sends an updated load to the locator.

```
[warn 2022/03/16 14:16:57.464 PDT locator-ln :41002 unshared ordered sender uid=5 dom #1 local port=45635 remote port=56270> tid=0x5e] XXX LocatorLoadSnapshot.updateConnectionLoadMap current load for location=192.168.1.5:5370; group=__recv__group; inputLoad=0.00625; currentLoad=0.006246
[warn 2022/03/16 14:16:57.464 PDT locator-ln :41002 unshared ordered sender uid=5 dom #1 local port=45635 remote port=56270> tid=0x5e] XXX LocatorLoadSnapshot.updateConnectionLoadMap updated load for location=192.168.1.5:5370; group=__recv__group; inputLoad=0.00625; newLoad=0.00625
[warn 2022/03/16 14:16:57.832 PDT locator-ln tid=0x43] XXX LocatorLoadSnapshot.logConnectionLoadMap The connectionLoadMap contains the following 2 entries:
  group=null
    location=192.168.1.5:56224; load=0.0
  group=__recv__group
    location=192.168.1.5:5370; load=0.00625
```

Update the load after ping connection has been made:

After another connection is made, the load is updated again.

```
[warn 2022/03/16 14:17:02.466 PDT locator-ln
```
[GitHub] [geode] boglesby commented on pull request #7378: GEODE-10056: Improve gateway-receiver load balance
boglesby commented on pull request #7378:
URL: https://github.com/apache/geode/pull/7378#issuecomment-1068616384

I'm not sure how to resolve the race condition you mention, but I see similar behavior with client/server connections. If a burst of connections is requested and none of those are made before the next load is received from the server, then the locator's load for that server gets reset back to zero.

A burst of connections (10 in this case) causes the load to go from 0.0 to 0.01248:

```
[warn 2022/03/15 14:38:37.905 PDT locator tid=0x24] XXX LocatorLoadSnapshot.getServerForConnection potentialServers={192.168.1.5:51249@192.168.1.5(server1:30200):41001=LoadHolder[0.0, 192.168.1.5:51249, loadPollInterval=5000, 0.00125]}
[warn 2022/03/15 14:38:37.906 PDT locator tid=0x24] XXX LocatorLoadSnapshot.getServerForConnection selectedServer=192.168.1.5:51249; loadBeforeUpdate=0.0
[warn 2022/03/15 14:38:37.907 PDT locator tid=0x24] XXX LoadHolder.incConnections location=192.168.1.5:51249; load=0.00125
[warn 2022/03/15 14:38:37.907 PDT locator tid=0x24] XXX LocatorLoadSnapshot.getServerForConnection selectedServer=192.168.1.5:51249; loadAfterUpdate=0.00125
...
[warn 2022/03/15 14:38:38.005 PDT locator tid=0x24] XXX LocatorLoadSnapshot.getServerForConnection potentialServers={192.168.1.5:51249@192.168.1.5(server1:30200):41001=LoadHolder[0.01124, 192.168.1.5:51249, loadPollInterval=5000, 0.00125]}
[warn 2022/03/15 14:38:38.005 PDT locator tid=0x24] XXX LocatorLoadSnapshot.getServerForConnection selectedServer=192.168.1.5:51249; loadBeforeUpdate=0.01124
[warn 2022/03/15 14:38:38.005 PDT locator tid=0x24] XXX LoadHolder.incConnections location=192.168.1.5:51249; load=0.01248
[warn 2022/03/15 14:38:38.005 PDT locator tid=0x24] XXX LocatorLoadSnapshot.getServerForConnection selectedServer=192.168.1.5:51249; loadAfterUpdate=0.01248
```

If none of those connections are made before the next load is sent by that server, its load goes from 0.01248 to 0.0:

```
[warn 2022/03/15 14:39:25.140 PDT locator :41001 unshared ordered sender uid=5 dom #1 local port=55139 remote port=51286> tid=0x56] XXX LocatorLoadSnapshot.updateLoad about to update connectionLoadMap location=192.168.1.5:51249; load=0.0; loadPerConnection=0.00125
[warn 2022/03/15 14:39:25.140 PDT locator :41001 unshared ordered sender uid=5 dom #1 local port=55139 remote port=51286> tid=0x56] XXX LocatorLoadSnapshot.updateMap location=192.168.1.5:51249; loadBeforeUpdate=0.01248
[warn 2022/03/15 14:39:25.141 PDT locator :41001 unshared ordered sender uid=5 dom #1 local port=55139 remote port=51286> tid=0x56] XXX LocatorLoadSnapshot.updateMap location=192.168.1.5:51249; loadAfterUpdate=0.0
[warn 2022/03/15 14:39:25.141 PDT locator :41001 unshared ordered sender uid=5 dom #1 local port=55139 remote port=51286> tid=0x56] XXX LocatorLoadSnapshot.updateLoad done update connectionLoadMap location=192.168.1.5:51249
```

The load for the next request starts at 0.0 again:

```
[warn 2022/03/15 14:39:33.475 PDT locator tid=0x54] XXX LocatorLoadSnapshot.getServerForConnection potentialServers={192.168.1.5:51249@192.168.1.5(server1:30200):41001=LoadHolder[0.0, 192.168.1.5:51249, loadPollInterval=5000, 0.00125]}
[warn 2022/03/15 14:39:33.475 PDT locator tid=0x54] XXX LocatorLoadSnapshot.getServerForConnection selectedServer=192.168.1.5:51249; loadBeforeUpdate=0.0
[warn 2022/03/15 14:39:33.475 PDT locator tid=0x54] XXX LoadHolder.incConnections location=192.168.1.5:51249; load=0.00125
[warn 2022/03/15 14:39:33.475 PDT locator tid=0x54] XXX LocatorLoadSnapshot.getServerForConnection selectedServer=192.168.1.5:51249; loadAfterUpdate=0.00125
...
```

One thing to note is that the load is only sent every load-poll-interval (default=5 seconds) if it has changed. If it hasn't changed, then it only gets sent every update frequency (which is 10 * 5 seconds by default). There is an integer property to control that frequency too:

```
private static final int FORCE_LOAD_UPDATE_FREQUENCY = getInteger(
    GeodeGlossary.GEMFIRE_PREFIX + "BridgeServer.FORCE_LOAD_UPDATE_FREQUENCY", 10);
```

The load-poll-interval is configurable, but currently only for the cache server, not the gateway receiver. It probably wouldn't be too hard to add this support to the gateway receiver.

Also, there is a gfsh load-balance gateway-sender command that could help alleviate this condition.

I'm still reviewing the PR.
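
To make the reset behavior in the logs above easier to reason about, here is a minimal Java sketch of it. It is only a model under stated assumptions, not Geode's implementation: `LocatorLoadSketch`, `LoadEstimate`, and `forcedUpdateIntervalMillis` are hypothetical names, and the sketch assumes the locator simply adds `loadPerConnection` per handed-out connection and overwrites its estimate when the server reports a load. The values it prints will therefore not necessarily match the logged ones exactly.

```java
public class LocatorLoadSketch {

    // Hypothetical stand-in for the locator's per-server load entry.
    // This is NOT Geode's LoadHolder; it only models the behavior seen in the logs.
    public static final class LoadEstimate {
        private final float loadPerConnection;
        private float load;

        public LoadEstimate(float initialLoad, float loadPerConnection) {
            this.load = initialLoad;
            this.loadPerConnection = loadPerConnection;
        }

        // The locator hands this server out for a connection request and
        // optimistically bumps its estimate (assumed: plain addition).
        public void incConnections() {
            load += loadPerConnection;
        }

        // The server reports its measured load; the estimate is overwritten,
        // discarding any increments for connections not yet established.
        public void updateFromServer(float reportedLoad) {
            load = reportedLoad;
        }

        public float getLoad() {
            return load;
        }
    }

    // Longest gap between load messages when the load has not changed:
    // the poll interval multiplied by the force-update frequency.
    public static long forcedUpdateIntervalMillis(long loadPollIntervalMillis,
                                                  int forceLoadUpdateFrequency) {
        return loadPollIntervalMillis * forceLoadUpdateFrequency;
    }

    public static void main(String[] args) {
        LoadEstimate estimate = new LoadEstimate(0.0f, 0.00125f);

        // Burst of 10 connection requests, none established yet.
        for (int i = 0; i < 10; i++) {
            estimate.incConnections();
        }
        System.out.println("load after burst: " + estimate.getLoad());

        // The server still measures zero connections, so it reports 0.0.
        estimate.updateFromServer(0.0f);
        System.out.println("load after server report: " + estimate.getLoad());

        // Defaults quoted above: 5000 ms poll interval, frequency of 10.
        System.out.println("forced update interval (ms): "
            + forcedUpdateIntervalMillis(5000L, 10));
    }
}
```

With the defaults quoted above, an unchanged load is re-sent at most every 50 seconds in this model, while a changed load goes out on the next 5-second poll. That is why a burst of requested-but-unconnected clients can have its increments wiped by a single report.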