[ 
https://issues.apache.org/jira/browse/GEODE-4990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juan José Ramos Cassella resolved GEODE-4990.
---------------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

> Race Condition When Using Cluster Configuration Service
> -------------------------------------------------------
>
>                 Key: GEODE-4990
>                 URL: https://issues.apache.org/jira/browse/GEODE-4990
>             Project: Geode
>          Issue Type: Bug
>          Components: configuration, locator
>    Affects Versions: 1.4.0
>            Reporter: Juan José Ramos Cassella
>            Assignee: Juan José Ramos Cassella
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.6.0
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> There's a small window of time on which the locator 
> ({{enable-cluster-configuration=true}}) is receiving TCP requests from 
> starting up servers ({{use-cluster-configuration=true}}) but the cluster 
> configuration service is not yet fully loaded, because of this the servers 
> fail to startup and throw the following exception:
> {noformat}
> Exception in thread "main" org.apache.geode.cache.execute.FunctionException: 
> java.lang.NullPointerException
>       at 
> org.apache.geode.internal.cache.execute.FunctionStreamingResultCollector.getResult(FunctionStreamingResultCollector.java:232)
>       at 
> org.apache.geode.internal.cache.ClusterConfigurationLoader.requestConfigurationFromLocators(ClusterConfigurationLoader.java:285)
>       at 
> org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:1036)
>       at 
> org.apache.geode.internal.cache.GemFireCacheImpl.<init>(GemFireCacheImpl.java:835)
>       at 
> org.apache.geode.internal.cache.GemFireCacheImpl.basicCreate(GemFireCacheImpl.java:775)
>       at 
> org.apache.geode.internal.cache.GemFireCacheImpl.create(GemFireCacheImpl.java:764)
>       at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:175)
>       at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:222)
>       at 
> org.apache.geode.distributed.internal.DefaultServerLauncherCacheProvider.createCache(DefaultServerLauncherCacheProvider.java:52)
>       at 
> org.apache.geode.distributed.ServerLauncher.createCache(ServerLauncher.java:844)
>       at 
> org.apache.geode.distributed.ServerLauncher.start(ServerLauncher.java:762)
>       at 
> org.apache.geode.distributed.ServerLauncher.run(ServerLauncher.java:692)
>       at 
> org.apache.geode.distributed.ServerLauncher.main(ServerLauncher.java:226)
> Caused by: java.lang.NullPointerException
>       at 
> org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunction.execute(GetClusterConfigurationFunction.java:55)
>       at 
> org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:186)
>       at 
> org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:382)
>       at 
> org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:448)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at 
> org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:1099)
>       at 
> org.apache.geode.distributed.internal.DistributionManager.access$000(DistributionManager.java:108)
>       at 
> org.apache.geode.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:970)
>       at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The logs from the locator, when using {{log-level=fine}} show the following:
> {noformat}
> [fine 2018/04/03 15:24:40.501 IST locator1 <Function Execution Processor1> 
> tid=0x51] Exception occurred on remote member while executing Function: 
> org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunction
> java.lang.NullPointerException
>       at 
> org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunction.execute(GetClusterConfigurationFunction.java:55)
>       at 
> org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:186)
>       at 
> org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:382)
>       at 
> org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:448)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at 
> org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:1099)
>       at 
> org.apache.geode.distributed.internal.DistributionManager.access$000(DistributionManager.java:108)
>       at 
> org.apache.geode.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:970)
>       at java.lang.Thread.run(Thread.java:748)
> [fine 2018/04/03 15:24:40.502 IST locator1 <Function Execution Processor1> 
> tid=0x51] Replying with exception: ReplyMessage processorId=2 from null with 
> exception org.apache.geode.distributed.internal.ReplyException: 
> java.lang.NullPointerException
> org.apache.geode.distributed.internal.ReplyException: 
> java.lang.NullPointerException
>       at 
> org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:216)
>       at 
> org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:382)
>       at 
> org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:448)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at 
> org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:1099)
>       at 
> org.apache.geode.distributed.internal.DistributionManager.access$000(DistributionManager.java:108)
>       at 
> org.apache.geode.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:970)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>       at 
> org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunction.execute(GetClusterConfigurationFunction.java:55)
>       at 
> org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:186)
>       ... 8 more
> {noformat}
> It doesn't matter wether the user has configured a value for the parameter 
> {{locator-wait-time}} or not, the value is only used when establishing the 
> TCP connection and, since the locator is already running, the server 
> continues the startup process and tries to get the cluster configuration 
> service, failing afterwards due to the {{NullPointerException}}.
> The problematic code can be seen below:
> {code:title=GetClusterConfigurationFunction.java|borderStyle=solid}
> public class GetClusterConfigurationFunction implements InternalFunction {
>   private static final Logger logger = LogService.getLogger();
>   @Override
>   public void execute(FunctionContext context) {
>     InternalClusterConfigurationService clusterConfigurationService =
>         InternalLocator.getLocator().getSharedConfiguration();
>     Set<String> groups = (Set<String>) context.getArguments();
>     logger.info("Received request for configuration  : {}", groups);
>     try {
>     // [XXXXX]: clusterConfigurationService is null as it hasn't been loaded 
> yet.
>       ConfigurationResponse response =
>           clusterConfigurationService.createConfigurationResponse(groups);
>       context.getResultSender().lastResult(response);
>     } catch (IOException e) {
>       logger.error("Unable to retrieve the cluster configuration", e);
>       context.getResultSender().lastResult(e);
>     }
>   }
> }
> {code}
> The servers, on the other hand, already have the logic in place to wait and 
> retry the retrieval of the cluster configuration service when the response is 
> {{null}}:
> {code:title=ClusterConfigurationLoader.java|borderStyle=solid}
> public ConfigurationResponse requestConfigurationFromLocators(String 
> groupList,
>       Set<InternalDistributedMember> locatorList)
>       throws ClusterConfigurationNotAvailableException, UnknownHostException {
>     Set<String> groups = getGroups(groupList);
>     GetClusterConfigurationFunction function = new 
> GetClusterConfigurationFunction();
>     ConfigurationResponse response = null;
>     int attempts = 6;
>     OUTER: while (attempts > 0) {
>       for (InternalDistributedMember locator : locatorList) {
>         logger.info("Attempting to retrieve cluster configuration from {} - 
> {} attempts remaining",
>             locator.getName(), attempts);
>         response = requestConfigurationFromOneLocator(locator, groups);
>         if (response != null) {
>           break OUTER;
>         }
>       }
>       try {
>         Thread.sleep(10000);
>       } catch (InterruptedException ex) {
>         break;
>       }
>       attempts--;
>     }
>     // if the response is null
>     if (response == null) {
>       throw new ClusterConfigurationNotAvailableException(
>           "Unable to retrieve cluster configuration from the locator.");
>     }
>     return response;
>   }
> {code}
> That said, the fix seems to be quite simple and it implies modifying 
> {{GetClusterConfigurationFunction}} to return {{null}} whenever the cluster 
> configuration is enabled but not fully loaded yet, and throw an exception 
> whenever the cluster configuration service is not enabled on the locator 
> (allowing us to fail fast when the server is requesting the cluster 
> configuration from a locator that doesn't have it enabled).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to