Juan José Ramos Cassella created GEODE-4990:
-----------------------------------------------

             Summary: Race Condition When Using Cluster Configuration Service
                 Key: GEODE-4990
                 URL: https://issues.apache.org/jira/browse/GEODE-4990
             Project: Geode
          Issue Type: Bug
          Components: configuration, locator
            Reporter: Juan José Ramos Cassella


There's a small window of time on which the locator 
({{enable-cluster-configuration=true}}) is receiving TCP requests from starting 
up servers ({{use-cluster-configuration=true}}) but the cluster configuration 
service is not yet fully loaded, because of this the servers fail to startup 
and throw the following exception:

{noformat}
Exception in thread "main" org.apache.geode.cache.execute.FunctionException: 
java.lang.NullPointerException
        at 
org.apache.geode.internal.cache.execute.FunctionStreamingResultCollector.getResult(FunctionStreamingResultCollector.java:232)
        at 
org.apache.geode.internal.cache.ClusterConfigurationLoader.requestConfigurationFromLocators(ClusterConfigurationLoader.java:285)
        at 
org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:1036)
        at 
org.apache.geode.internal.cache.GemFireCacheImpl.<init>(GemFireCacheImpl.java:835)
        at 
org.apache.geode.internal.cache.GemFireCacheImpl.basicCreate(GemFireCacheImpl.java:775)
        at 
org.apache.geode.internal.cache.GemFireCacheImpl.create(GemFireCacheImpl.java:764)
        at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:175)
        at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:222)
        at 
org.apache.geode.distributed.internal.DefaultServerLauncherCacheProvider.createCache(DefaultServerLauncherCacheProvider.java:52)
        at 
org.apache.geode.distributed.ServerLauncher.createCache(ServerLauncher.java:844)
        at 
org.apache.geode.distributed.ServerLauncher.start(ServerLauncher.java:762)
        at 
org.apache.geode.distributed.ServerLauncher.run(ServerLauncher.java:692)
        at 
org.apache.geode.distributed.ServerLauncher.main(ServerLauncher.java:226)
Caused by: java.lang.NullPointerException
        at 
org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunction.execute(GetClusterConfigurationFunction.java:55)
        at 
org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:186)
        at 
org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:382)
        at 
org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:448)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at 
org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:1099)
        at 
org.apache.geode.distributed.internal.DistributionManager.access$000(DistributionManager.java:108)
        at 
org.apache.geode.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:970)
        at java.lang.Thread.run(Thread.java:748)
{noformat}

The logs from the locator, when using {{log-level=fine}} show the following:

{noformat}
[fine 2018/04/03 15:24:40.501 IST locator1 <Function Execution Processor1> 
tid=0x51] Exception occurred on remote member while executing Function: 
org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunction
java.lang.NullPointerException
        at 
org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunction.execute(GetClusterConfigurationFunction.java:55)
        at 
org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:186)
        at 
org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:382)
        at 
org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:448)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at 
org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:1099)
        at 
org.apache.geode.distributed.internal.DistributionManager.access$000(DistributionManager.java:108)
        at 
org.apache.geode.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:970)
        at java.lang.Thread.run(Thread.java:748)

[fine 2018/04/03 15:24:40.502 IST locator1 <Function Execution Processor1> 
tid=0x51] Replying with exception: ReplyMessage processorId=2 from null with 
exception org.apache.geode.distributed.internal.ReplyException: 
java.lang.NullPointerException
org.apache.geode.distributed.internal.ReplyException: 
java.lang.NullPointerException
        at 
org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:216)
        at 
org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:382)
        at 
org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:448)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at 
org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:1099)
        at 
org.apache.geode.distributed.internal.DistributionManager.access$000(DistributionManager.java:108)
        at 
org.apache.geode.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:970)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
        at 
org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunction.execute(GetClusterConfigurationFunction.java:55)
        at 
org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:186)
        ... 8 more
{noformat}

It doesn't matter wether the user has configured a value for the parameter 
{{locator-wait-time}} or not, the value is only used when establishing the TCP 
connection and, since the locator is already running, the server continues the 
startup process and tries to get the cluster configuration service, failing 
afterwards due to the {{NullPointerException}}.
The problematic code can be seen below:

{code:title=GetClusterConfigurationFunction.java|borderStyle=solid}
public class GetClusterConfigurationFunction implements InternalFunction {
  private static final Logger logger = LogService.getLogger();

  @Override
  public void execute(FunctionContext context) {
    InternalClusterConfigurationService clusterConfigurationService =
        InternalLocator.getLocator().getSharedConfiguration();

    Set<String> groups = (Set<String>) context.getArguments();

    logger.info("Received request for configuration  : {}", groups);

    try {
    // [XXXXX]: clusterConfigurationService is null as it hasn't been loaded 
yet.
      ConfigurationResponse response =
          clusterConfigurationService.createConfigurationResponse(groups);
      context.getResultSender().lastResult(response);
    } catch (IOException e) {
      logger.error("Unable to retrieve the cluster configuration", e);
      context.getResultSender().lastResult(e);
    }
  }
}
{code}

The servers, on the other hand, already have the logic in place to wait and 
retry the retrieval of the cluster configuration service when the response is 
{{null}}:

{code:title=ClusterConfigurationLoader.java|borderStyle=solid}
public ConfigurationResponse requestConfigurationFromLocators(String groupList,
      Set<InternalDistributedMember> locatorList)
      throws ClusterConfigurationNotAvailableException, UnknownHostException {

    Set<String> groups = getGroups(groupList);
    GetClusterConfigurationFunction function = new 
GetClusterConfigurationFunction();

    ConfigurationResponse response = null;

    int attempts = 6;
    OUTER: while (attempts > 0) {
      for (InternalDistributedMember locator : locatorList) {
        logger.info("Attempting to retrieve cluster configuration from {} - {} 
attempts remaining",
            locator.getName(), attempts);
        response = requestConfigurationFromOneLocator(locator, groups);
        if (response != null) {
          break OUTER;
        }
      }

      try {
        Thread.sleep(10000);
      } catch (InterruptedException ex) {
        break;
      }

      attempts--;
    }

    // if the response is null
    if (response == null) {
      throw new ClusterConfigurationNotAvailableException(
          "Unable to retrieve cluster configuration from the locator.");
    }

    return response;
  }
{code}

That said, the fix seems to be quite simple and it implies modifying 
{{GetClusterConfigurationFunction}} to return {{null}} whenever the cluster 
configuration is enabled but not fully loaded yet, and throw an exception 
whenever the cluster configuration service is not enabled on the locator 
(allowing us to fail fast when the server is requesting the cluster 
configuration from a locator that doesn't have it enabled).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to