Andrzej Bialecki created SOLR-14505: ---------------------------------------
Summary: Intermittent NPE in ZkController.registerLiveNodesListener() Key: SOLR-14505 URL: https://issues.apache.org/jira/browse/SOLR-14505 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Affects Versions: 8.3.1 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Reported by [~cjcowie] in SOLR-13072 & mailing lists: "Running on Solr 8.3.1 {code:java} 2020-05-19 03:44:40.220 INFO (main) [ ] o.a.s.c.ZkContainer Zookeeper client=XXXXXXXXX:9983/xxxx_cluster 2020-05-19 03:44:40.238 INFO (main) [ ] o.a.s.c.c.SolrZkClient Using ZkCredentialsProvider: xxxxxxx.zookeeper.auth.internal.EncodedZkCredentialsProvider 2020-05-19 03:44:40.241 INFO (main) [ ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper 2020-05-19 03:44:40.245 INFO (zkConnectionManagerCallback-11-thread-1) [ ] o.a.s.c.c.ConnectionManager zkClient has connected 2020-05-19 03:44:40.245 INFO (main) [ ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper 2020-05-19 03:44:40.359 INFO (main) [ ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper 2020-05-19 03:44:40.361 INFO (zkConnectionManagerCallback-13-thread-1) [ ] o.a.s.c.c.ConnectionManager zkClient has connected 2020-05-19 03:44:40.361 INFO (main) [ ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper 2020-05-19 03:44:40.417 INFO (main) [ ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (0) -> (1) 2020-05-19 03:44:56.606 INFO (zkCallback-12-thread-2) [ ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (1) -> (0) 2020-05-19 03:44:56.614 ERROR (main) [ ] o.a.s.s.SolrDispatchFilter Could not start Solr. Check solr/home property and the logs 2020-05-19 03:44:56.639 ERROR (main) [ ] o.a.s.c.SolrCore null:java.lang.NullPointerException at org.apache.solr.cloud.ZkController.lambda$registerLiveNodesListener$10(ZkController.java:1020) at org.apache.solr.common.cloud.ZkStateReader.registerLiveNodesListener(ZkStateReader.java:880) at org.apache.solr.cloud.ZkController.registerLiveNodesListener(ZkController.java:1035) at org.apache.solr.cloud.ZkController.init(ZkController.java:917) at org.apache.solr.cloud.ZkController.<init>(ZkController.java:473) at org.apache.solr.core.ZkContainer.initZooKeeper(ZkContainer.java:115) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:631)at xxxxxxx.solr.servlet.RobustSolrDispatchFilter.createCoreContainer(RobustSolrDispatchFilter.java:71) {code} I couldn't find any bug reports in JIRA for the NPE. Here's the full log [https://drive.google.com/open?id=1hQrF25blNgKLXijOMYJ30wn-Lfy6uKVm] The NPE is coming from [https://github.com/apache/lucene-solr/blob/c18666ad05afc02979c150aacd4810cff02e43f3/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L1020] _byte[] json = Utils.toJSON(Collections.singletonMap("timestamp", cloudManager.getTimeSource().getEpochTimeNs()));_ so I don't know whether it's the cloudManager or the time source that's null. That bit of the ZkController was added by https://issues.apache.org/jira/browse/SOLR-13072 and I see it is only hit if _zkStateReader.getAutoScalingConfig().hasTriggerForEvents(TriggerEventType.NODELOST);_ We have never (knowingly) configured autoscaling and we don't use it, but I see the autoscaling files are present in ZK. Is the autoscaling.json etc created by default when it is absent in ZooKeeper? The interesting bit of the log above, aside from the NPE, is this: _2020-05-19 03:44:40.417 INFO (main) [ ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (0) -> (1)_ _2020-05-19 03:44:56.606 INFO (zkCallback-12-thread-2) [ ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (1) -> (0)_ Which suggests to me that there's either a race condition or the problem is caused by some zookeeper outage during startup. Since there's a 16 second gap between those messages. It's possible that the problem is in some way caused by our own code in xxxxxxx.solr.servlet.RobustSolrDispatchFilter which wraps the SolrDispatchFIlter, and creates a SolrZkClient in a try/with resources (so should be autoclosed), but that all happens before createCoreContainer is called. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org