[jira] [Comment Edited] (SOLR-14504) ZkController LiveNodesListener has NullPointerException in startup race

2020-05-21 Thread Colvin Cowie (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113064#comment-17113064
 ] 

Colvin Cowie edited comment on SOLR-14504 at 5/21/20, 11:01 AM:


Without making big changes to ZkController's actual implementation, I don't 
really see a way to write a useful (automated) regression test for this.

[~ab] what are you thoughts on just fixing it without a test? Or can you see a 
good way to test it?


was (Author: cjcowie):
Without making big changes to ZkController's actual implementation, I don't 
really see a way to write a useful (automated) regression test for this.

[~ab] what are you thoughts on just fixing it withtout a test? Or can you see a 
good way to test it?

> ZkController LiveNodesListener has NullPointerException in startup race
> ---
>
> Key: SOLR-14504
> URL: https://issues.apache.org/jira/browse/SOLR-14504
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.7, 7.7.1, 7.7.2, 8.0, 8.1, 8.2, 7.7.3, 8.1.1, 8.3, 
> 8.4, 8.3.1, 8.5, 8.4.1, 8.5.1
>Reporter: Colvin Cowie
>Priority: Minor
> Attachments: SOLR-14504.patch
>
>
> If a NODELOST event happens before the cloudManager is initialized then a 
> NullPointerException will occur on this line 
> [https://github.com/apache/lucene-solr/blob/c18666ad05afc02979c150aacd4810cff02e43f3/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L1020]
> {code:java}
> byte[] json = Utils.toJSON(Collections.singletonMap("timestamp", 
> cloudManager.getTimeSource().getEpochTimeNs())); {code}
> Rather than accessing cloudManager directly, getSolrCloudManager() should be 
> called.
>  
> This happens very rarely, but if it happens it will stop Solr starting, 
> result in "CoreContainer is either not initialized or shutting down". Snippet 
> from 8.3.1
> {noformat}
> 2020-05-19 03:44:40.241 INFO  (main) [   ] o.a.s.c.c.ConnectionManager 
> Waiting for client to connect to ZooKeeper
> 2020-05-19 03:44:40.245 INFO  (zkConnectionManagerCallback-11-thread-1) [   ] 
> o.a.s.c.c.ConnectionManager zkClient has connected
> 2020-05-19 03:44:40.245 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client 
> is connected to ZooKeeper
> 2020-05-19 03:44:40.359 INFO  (main) [   ] o.a.s.c.c.ConnectionManager 
> Waiting for client to connect to ZooKeeper
> 2020-05-19 03:44:40.361 INFO  (zkConnectionManagerCallback-13-thread-1) [   ] 
> o.a.s.c.c.ConnectionManager zkClient has connected
> 2020-05-19 03:44:40.361 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client 
> is connected to ZooKeeper
> 2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader Updated 
> live nodes from ZooKeeper... (0) -> (1)
> 2020-05-19
>  03:44:56.606 INFO  (zkCallback-12-thread-2) [   ] 
> o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (1) -> 
> (0)
> 2020-05-19 03:44:56.614 ERROR (main) [   ] 
> o.a.s.s.SolrDispatchFilter Could not start Solr. Check solr/home 
> property and the logs
> 2020-05-19 03:44:56.639 ERROR (main) [   ] o.a.s.c.SolrCore 
> null:java.lang.NullPointerException
>   at 
> org.apache.solr.cloud.ZkController.lambda$registerLiveNodesListener$10(ZkController.java:1020)
>   at 
> org.apache.solr.common.cloud.ZkStateReader.registerLiveNodesListener(ZkStateReader.java:880)
>   at 
> org.apache.solr.cloud.ZkController.registerLiveNodesListener(ZkController.java:1035)
>   at org.apache.solr.cloud.ZkController.init(ZkController.java:917)
>   at org.apache.solr.cloud.ZkController.(ZkController.java:473)
>   at org.apache.solr.core.ZkContainer.initZooKeeper(ZkContainer.java:115)
>   at 
> org.apache.solr.core.CoreContainer.load(CoreContainer.java:631){noformat}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-14504) ZkController LiveNodesListener has NullPointerException in startup race

2020-05-21 Thread Colvin Cowie (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113034#comment-17113034
 ] 

Colvin Cowie edited comment on SOLR-14504 at 5/21/20, 10:16 AM:


So with breakpoints on in a debugger it's easy to reproduce this, but I'm not 
seeing a _nice_ way to write a test for it, at least not in the 
ZkControllerTest. There's a lot of stuff happening in the constructor of the 
ZkController (both registerLiveNodesListener and getSolrCloudManager are 
called, so the race only exists between those calls) and 
registerLiveNodesListener is private.

 

I've gotten part way to testing it by constructing the ZkController on another 
thread, but then the NullPointerException can just be lost in a callback 
thread's stack... The NullPointerException causing the Core container load to 
fail can only happen when a node lost event occurs in the initial {{at 
org.apache.solr.cloud.ZkController.registerLiveNodesListener(ZkController.java:1035)}}

 

In fact it seems as though the NPE on startup can only happen if a node is lost 
between the first call to getLiveNodes() and the second call to getLiveNodes()  
in org.apache.solr.common.cloud.ZkStateReader#registerLiveNodesListener
{code:java}
if (listener.onChange(new TreeSet<>(getClusterState().getLiveNodes()), new 
TreeSet<>(getClusterState().getLiveNodes( {{code}
since otherwise the LiveNodesListener will just exit before it gets to the 
point it uses the cloud manager at.

 

So the window for it to happen seems to be very small


was (Author: cjcowie):
So with breakpoints on in a debugger it's easy to reproduce this, but I'm not 
seeing a _nice_ way to write a test for it, at least not in the 
ZkControllerTest. There's a lot of stuff happening in the constructor of the 
ZkController (both registerLiveNodesListener and getSolrCloudManager are 
called, so the race only exists between those calls) and 
registerLiveNodesListener is private.

 

I've gotten part way to testing it by constructing the ZkController on another 
thread, but then the NullPointerException is just lost in a callback thread's 
stack... The NullPointerException causing the Core container load to fail can 
only happen when a node lost event occurs in the initial {{at 
org.apache.solr.cloud.ZkController.registerLiveNodesListener(ZkController.java:1035)}}

 

In fact it seems as though the NPE on startup can only happen if a node is lost 
between the first call to getLiveNodes() and the second call to getLiveNodes()  
in org.apache.solr.common.cloud.ZkStateReader#registerLiveNodesListener
{code:java}
if (listener.onChange(new TreeSet<>(getClusterState().getLiveNodes()), new 
TreeSet<>(getClusterState().getLiveNodes( {{code}
since otherwise the LiveNodesListener will just exit before it gets to the 
point it uses the cloud manager at

> ZkController LiveNodesListener has NullPointerException in startup race
> ---
>
> Key: SOLR-14504
> URL: https://issues.apache.org/jira/browse/SOLR-14504
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.7, 7.7.1, 7.7.2, 8.0, 8.1, 8.2, 7.7.3, 8.1.1, 8.3, 
> 8.4, 8.3.1, 8.5, 8.4.1, 8.5.1
>Reporter: Colvin Cowie
>Priority: Minor
>
> If a NODELOST event happens before the cloudManager is initialized then a 
> NullPointerException will occur on this line 
> [https://github.com/apache/lucene-solr/blob/c18666ad05afc02979c150aacd4810cff02e43f3/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L1020]
> {code:java}
> byte[] json = Utils.toJSON(Collections.singletonMap("timestamp", 
> cloudManager.getTimeSource().getEpochTimeNs())); {code}
> Rather than accessing cloudManager directly, getSolrCloudManager() should be 
> called.
>  
> This happens very rarely, but if it happens it will stop Solr starting, 
> result in "CoreContainer is either not initialized or shutting down". Snippet 
> from 8.3.1
> {noformat}
> 2020-05-19 03:44:40.241 INFO  (main) [   ] o.a.s.c.c.ConnectionManager 
> Waiting for client to connect to ZooKeeper
> 2020-05-19 03:44:40.245 INFO  (zkConnectionManagerCallback-11-thread-1) [   ] 
> o.a.s.c.c.ConnectionManager zkClient has connected
> 2020-05-19 03:44:40.245 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client 
> is connected to ZooKeeper
> 2020-05-19 03:44:40.359 INFO  (main) [   ] o.a.s.c.c.ConnectionManager 
> Waiting for client to connect to ZooKeeper
> 2020-05-19 03:44:40.361 INFO  (zkConnectionManagerCallback-13-thread-1) [   ] 
> o.a.s.c.c.ConnectionManager zkClient has connected
> 2020-05-19 03:44:40.361 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client 
> is connected to ZooKeeper
> 2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader Updated 
> live nodes 

[jira] [Comment Edited] (SOLR-14504) ZkController LiveNodesListener has NullPointerException in startup race

2020-05-21 Thread Colvin Cowie (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113034#comment-17113034
 ] 

Colvin Cowie edited comment on SOLR-14504 at 5/21/20, 10:15 AM:


So with breakpoints on in a debugger it's easy to reproduce this, but I'm not 
seeing a _nice_ way to write a test for it, at least not in the 
ZkControllerTest. There's a lot of stuff happening in the constructor of the 
ZkController (both registerLiveNodesListener and getSolrCloudManager are 
called, so the race only exists between those calls) and 
registerLiveNodesListener is private.

 

I've gotten part way to testing it by constructing the ZkController on another 
thread, but then the NullPointerException is just lost in a callback thread's 
stack... The NullPointerException causing the Core container load to fail can 
only happen when a node lost event occurs in the initial {{at 
org.apache.solr.cloud.ZkController.registerLiveNodesListener(ZkController.java:1035)}}

 

In fact it seems as though the NPE on startup can only happen if a node is lost 
between the first call to getLiveNodes() and the second call to getLiveNodes()  
in org.apache.solr.common.cloud.ZkStateReader#registerLiveNodesListener
{code:java}
if (listener.onChange(new TreeSet<>(getClusterState().getLiveNodes()), new 
TreeSet<>(getClusterState().getLiveNodes( {{code}
since otherwise the LiveNodesListener will just exit before it gets to the 
point it uses the cloud manager at


was (Author: cjcowie):
So with breakpoints on in a debugger it's easy to reproduce this, but I'm not 
seeing a _nice_ way to write a test for it, at least not in the 
ZkControllerTest. There's a lot of stuff happening in the constructor of the 
ZkController (both registerLiveNodesListener and getSolrCloudManager are 
called, so the race only exists between those calls) and 
registerLiveNodesListener is private.

 

I've gotten part way to testing it by constructing the ZkController on another 
thread, but then the NullPointerException is just lost in a callback thread's 
stack... The NullPointerException causing the Core container load to fail can 
only happen when a node lost event occurs in the initial {{at 
org.apache.solr.cloud.ZkController.registerLiveNodesListener(ZkController.java:1035)}}

 

 

> ZkController LiveNodesListener has NullPointerException in startup race
> ---
>
> Key: SOLR-14504
> URL: https://issues.apache.org/jira/browse/SOLR-14504
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.7, 7.7.1, 7.7.2, 8.0, 8.1, 8.2, 7.7.3, 8.1.1, 8.3, 
> 8.4, 8.3.1, 8.5, 8.4.1, 8.5.1
>Reporter: Colvin Cowie
>Priority: Minor
>
> If a NODELOST event happens before the cloudManager is initialized then a 
> NullPointerException will occur on this line 
> [https://github.com/apache/lucene-solr/blob/c18666ad05afc02979c150aacd4810cff02e43f3/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L1020]
> {code:java}
> byte[] json = Utils.toJSON(Collections.singletonMap("timestamp", 
> cloudManager.getTimeSource().getEpochTimeNs())); {code}
> Rather than accessing cloudManager directly, getSolrCloudManager() should be 
> called.
>  
> This happens very rarely, but if it happens it will stop Solr starting, 
> result in "CoreContainer is either not initialized or shutting down". Snippet 
> from 8.3.1
> {noformat}
> 2020-05-19 03:44:40.241 INFO  (main) [   ] o.a.s.c.c.ConnectionManager 
> Waiting for client to connect to ZooKeeper
> 2020-05-19 03:44:40.245 INFO  (zkConnectionManagerCallback-11-thread-1) [   ] 
> o.a.s.c.c.ConnectionManager zkClient has connected
> 2020-05-19 03:44:40.245 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client 
> is connected to ZooKeeper
> 2020-05-19 03:44:40.359 INFO  (main) [   ] o.a.s.c.c.ConnectionManager 
> Waiting for client to connect to ZooKeeper
> 2020-05-19 03:44:40.361 INFO  (zkConnectionManagerCallback-13-thread-1) [   ] 
> o.a.s.c.c.ConnectionManager zkClient has connected
> 2020-05-19 03:44:40.361 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client 
> is connected to ZooKeeper
> 2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader Updated 
> live nodes from ZooKeeper... (0) -> (1)
> 2020-05-19
>  03:44:56.606 INFO  (zkCallback-12-thread-2) [   ] 
> o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (1) -> 
> (0)
> 2020-05-19 03:44:56.614 ERROR (main) [   ] 
> o.a.s.s.SolrDispatchFilter Could not start Solr. Check solr/home 
> property and the logs
> 2020-05-19 03:44:56.639 ERROR (main) [   ] o.a.s.c.SolrCore 
> null:java.lang.NullPointerException
>   at 
> org.apache.solr.cloud.ZkController.lambda$registerLiveNodesListener$10(ZkController.java:1020)
>   at 
>