Hi,

ZooKeeper exposes its configuration in that znode regardless of whether users have enabled dynamic reconfiguration. Solr does not support dynamic reconfiguration and does not act on it. But the UI does compare ZooKeeper's actual configuration to the static ZK_HOST string provided on the Solr side and looks for problems. So say you have 3 zk nodes but your ZK_HOST lists only one. If that one ZooKeeper host responds, the UI will use the zk-provided configuration as the authoritative source of truth and obtain status for every single node. It will then display a warning on that page that you should configure all addresses for fault tolerance.
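To illustrate what that check does, here is a simplified, standalone sketch of the comparison. This is not the actual ZookeeperStatusHandler code; the class name and the parsing are mine, purely to show the idea:

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Rough sketch of the ensemble-vs-ZK_HOST comparison the Admin UI performs.
// Illustrative only - not the real ZookeeperStatusHandler implementation.
public class ZkEnsembleCheckSketch {

  public static void main(String[] args) {
    // What the operator put in ZK_HOST (here: only one of three nodes).
    Set<String> zkHostEntries = Set.of("192.168.0.109:2181");

    // What the /zookeeper/config znode reports (taken from the thread below).
    List<String> zkConfig = List.of(
        "server.1=0.0.0.0:2888:3888:participant;0.0.0.0:2181",
        "server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181",
        "server.3=192.168.0.2:2888:3888:participant;0.0.0.0:2181",
        "version=0");

    // Derive host:clientPort for every server the ensemble knows about.
    Set<String> ensembleNodes = zkConfig.stream()
        .filter(line -> line.startsWith("server."))
        .map(line -> {
          String spec = line.substring(line.indexOf('=') + 1);
          String host = spec.substring(0, spec.indexOf(':'));
          String clientPort = spec.substring(spec.lastIndexOf(':') + 1);
          return host + ":" + clientPort;
        })
        .collect(Collectors.toSet());

    // The UI then asks every ensemble node for its status and warns if
    // ZK_HOST does not list all of them.
    if (!zkHostEntries.containsAll(ensembleNodes)) {
      System.out.println("WARN: ZK_HOST lists " + zkHostEntries
          + " but the ensemble reports " + ensembleNodes
          + " - list all hosts for fault tolerance.");
    }
  }
}

With your config, one of the derived addresses comes out as 0.0.0.0:2181, and that is the node the UI then hangs on while opening the socket.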
An improvement we could perhaps do is to explicitly detect the 0.0.0.0 IP and issue a separate warning in the UI for it, and fall back to only checking status for the host/ip names in ZK_HOST, since we know the others won't resolve. Then you would not get the 30s delay; a rough sketch of what that could look like is at the very bottom of this mail, below the quoted thread. If you want to open a JIRA and submit a PR I'll help review.

Jan

> On 3 Feb 2023, at 06:30, michael dürr <[email protected]> wrote:
>
> Hi Jan,
>
> Thanks for answering!
>
>> I don't know how you run these zk's dockerized, but I'd look for a workaround where you can configure the correct address in zk's configuration. Then Solr will be happy.
>
> There exists a workaround where you can assign certain address ranges to a docker host so that each of its containers runs with an external IP the container can bind to:
>
> https://solr.apache.org/guide/solr/latest/deployment-guide/docker-networking.html
>
> Anyway, this is pretty cumbersome and not really necessary, as the communication between the zookeeper nodes works perfectly well when they bind locally to "0.0.0.0".
> The problem only occurs when using the Admin UI because, as mentioned in my former email, Solr tries to connect to "0.0.0.0" as it reads that IP from the /zookeeper/config znode of the zookeeper ensemble.
> That causes Solr to call ZookeeperStatusHandler.getZkRawResponse(String zkHostPort, String fourLetterWordCommand) with "0.0.0.0:2181".
> This does not happen in any other scenario, as Solr won't use "0.0.0.0" (dynamic configuration is disabled by default) but only the IPs from the configured zkHost string (which does not mention "0.0.0.0" but only valid IPs).
>
> So I'm just wondering why the Solr UI tries to use IPs read from the /zookeeper/config znode even when the zookeeper config explicitly disables dynamic reconfiguration (reconfigEnabled=false).
> I'd expect Solr to respect the zookeeper config and not try to resolve addresses other than those configured in the zkHost string.
>
>> Are you saying that in 8.11, the test with zkcli.sh to 0.0.0.0:2181 returns immediately instead of after 30s?
>
> Yes, I tested it now with 8.11.1 and each connection attempt immediately returns an exception (trace below).
> The zkcli call as a whole still only returns after a 30 second timeout, but in contrast to Solr 9.1 the individual connection attempts in 8.11 fail immediately instead of blocking, so the Solr UI also stays responsive.
>
> root@solr1:/opt/solr# export ZK_HOST=0.0.0.0:2181
>
> root@solr1:/opt/solr# time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
> INFO - 2023-02-03 06:19:30.748; org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect to ZooKeeper
> WARN - 2023-02-03 06:19:30.759; org.apache.zookeeper.ClientCnxn; Session 0x0 for sever 0.0.0.0/0.0.0.0:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. =>
> java.net.ConnectException: Connection refused
>         at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777) ~[?:?]
>         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:344) ~[zookeeper-3.6.2.jar:3.6.2]
>         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275) ~[zookeeper-3.6.2.jar:3.6.2]
> WARN - 2023-02-03 06:19:31.867; org.apache.zookeeper.ClientCnxn; Session 0x0 for sever 0.0.0.0/0.0.0.0:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. =>
>
> === same exception multiple times and then... ===
>
> WARN - 2023-02-03 06:20:01.692; org.apache.zookeeper.ClientCnxn; An exception was thrown while closing send thread for session 0x0. =>
> java.net.ConnectException: Connection refused
>         at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777) ~[?:?]
>         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:344) ~[zookeeper-3.6.2.jar:3.6.2]
>         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275) ~[zookeeper-3.6.2.jar:3.6.2]
> Exception in thread "main" org.apache.solr.common.SolrException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 0.0.0.0:2181 within 30000 ms
>         at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:195)
>         at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:119)
>         at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:109)
>         at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:196)
> Caused by: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 0.0.0.0:2181 within 30000 ms
>         at org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:251)
>         at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:186)
>         ... 3 more
>
> real    0m31.693s
> user    0m2.375s
> sys     0m0.199s
>
> Thanks,
> Michael
>
> On Thu, Feb 2, 2023 at 12:32 PM Jan Høydahl <[email protected]> wrote:
>
>> Hi,
>>
>> Following up on this. I'd still say that the issue here seems to be that your zookeeper config lists 0.0.0.0 as the IP address for client connections.
>>
>>>>> The problem is related to the fact that we run solr and the zookeeper ensemble dockerized. As we cannot bind zookeeper from docker to its host's external ip address, we have to use "0.0.0.0" as the server address
>>
>> I don't know how you run these zk's dockerized, but I'd look for a workaround where you can configure the correct address in zk's configuration. Then Solr will be happy.
>>
>> Are you saying that in 8.11, the test with zkcli.sh to 0.0.0.0:2181 returns immediately instead of after 30s?
>>
>> Jan
>>
>>> On 15 Dec 2022, at 07:10, michael dürr <[email protected]> wrote:
>>>
>>> Hi Jan,
>>>
>>> Thanks for answering!
>>>
>>> I'm pretty sure the reason is related to the problem that Solr tries to connect to "0.0.0.0", as it reads that IP from the /zookeeper/config znode of the zookeeper ensemble.
>>> The connection I'm talking about is when ZookeeperStatusHandler.getZkRawResponse(String zkHostPort, String fourLetterWordCommand) tries to open a socket to "0.0.0.0:2181".
>>> After a while the connect fails, but as said this takes a long time. I did not debug deeper as that is already JDK code.
>>> The timings for the valid zookeeper addresses (i.e. those from the static configuration string) are listed later. What causes problems is the attempt to connect to 0.0.0.0:2181:
>>>
>>> /opt/solr-9.1.0$ export ZK_HOST=0.0.0.0:2181
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
>>> WARN - 2022-12-15 06:57:44.828; org.apache.solr.common.cloud.SolrZkClient; Using default ZkCredentialsInjector. ZkCredentialsInjector is not secure, it creates an empty list of credentials which leads to 'OPEN_ACL_UNSAFE' ACLs to Zookeeper nodes
>>> INFO - 2022-12-15 06:57:44.852; org.apache.solr.common.cloud.ConnectionManager; Waiting up to 30000ms for client to connect to ZooKeeper
>>> Exception in thread "main" org.apache.solr.common.SolrException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 0.0.0.0:2181 within 30000 ms
>>>         at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:225)
>>>         at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:137)
>>>         at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:120)
>>>         at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:260)
>>> Caused by: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 0.0.0.0:2181 within 30000 ms
>>>         at org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:297)
>>>         at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:216)
>>>         ... 3 more
>>>
>>> real    0m31.728s
>>> user    0m3.284s
>>> sys     0m0.226s
>>>
>>> Of course this call fails, but that was not a problem before (Solr 8.11.1): the call also failed, but it returned fast.
>>>
>>> Here are the timings you are interested in for each of my 3 zookeeper nodes (adjusted to my setup). The interesting part is the result of fetching /zookeeper/config, as it shows the server configurations that include the "0.0.0.0" addresses:
>>>
>>> /opt/solr-9.1.0$ export ZK_HOST=192.168.0.109:2181
>>>
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
>>> server.1=0.0.0.0:2888:3888:participant;0.0.0.0:2181
>>> server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181
>>> server.3=192.168.0.2:2888:3888:participant;0.0.0.0:2181
>>> version=0
>>>
>>> real    0m0.810s
>>> user    0m3.142s
>>> sys     0m0.148s
>>>
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd ls /solr/live_nodes
>>> /solr/live_nodes (2)
>>> /solr/live_nodes/192.168.0.222:8983_solr (0)
>>> /solr/live_nodes/192.168.0.223:8983_solr (0)
>>>
>>> real    0m0.838s
>>> user    0m3.166s
>>> sys     0m0.210s
>>>
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
>>> # Licensed to the Apache Software Foundation (ASF) under one or more
>>> # ...
>>>
>>> real    0m0.836s
>>> user    0m3.121s
>>> sys     0m0.173s
>>>
>>> /opt/solr-9.1.0$ export ZK_HOST=192.168.0.126:2181
>>>
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
>>> server.1=192.168.0.109:2888:3888:participant;0.0.0.0:2181
>>> server.2=0.0.0.0:2888:3888:participant;0.0.0.0:2181
>>> server.3=192.168.0.2:2888:3888:participant;0.0.0.0:2181
>>> version=0
>>>
>>> real    0m0.843s
>>> user    0m3.300s
>>> sys     0m0.183s
>>>
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd ls /solr/live_nodes
>>> /solr/live_nodes (2)
>>> /solr/live_nodes/192.168.0.222:8983_solr (0)
>>> /solr/live_nodes/192.168.0.223:8983_solr (0)
>>>
>>> real    0m0.807s
>>> user    0m3.035s
>>> sys     0m0.164s
>>>
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
>>> # Licensed to the Apache Software Foundation (ASF) under one or more
>>> # ...
>>>
>>> real    0m0.859s
>>> user    0m3.354s
>>> sys     0m0.177s
>>>
>>> export ZK_HOST=192.168.0.2:2181
>>>
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
>>> server.1=192.168.0.109:2888:3888:participant;0.0.0.0:2181
>>> server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181
>>> server.3=0.0.0.0:2888:3888:participant;0.0.0.0:2181
>>> version=0
>>>
>>> real    0m0.790s
>>> user    0m2.838s
>>> sys     0m0.154s
>>>
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd ls /solr/live_nodes
>>> /solr/live_nodes (2)
>>> /solr/live_nodes/192.168.0.222:8983_solr (0)
>>> /solr/live_nodes/192.168.0.223:8983_solr (0)
>>>
>>> real    0m0.861s
>>> user    0m3.201s
>>> sys     0m0.169s
>>>
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
>>> # Licensed to the Apache Software Foundation (ASF) under one or more
>>> # ...
>>>
>>> real    0m0.779s
>>> user    0m3.081s
>>> sys     0m0.184s
>>>
>>> Thanks,
>>> Michael
>>>
>>> On Wed, Dec 14, 2022 at 10:08 PM Jan Høydahl <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> We always check how the zookeeper ensemble is configured, and this check does not depend on whether dynamic reconfiguration is possible or not; it is simply to detect the common mistake that a 3-node ensemble is addressed with only one of the hosts in the static config, or with wrong host names.
>>>>
>>>> Sounds like your problem is not with how Solr talks to ZK, but with how you have configured your network. You say
>>>>
>>>>> But this will cause the socket connect to block when resolving
>>>>> "0.0.0.0" which makes everything very slow.
>>>>
>>>> Can you elaborate on exactly which connection you are talking about here, and why/where it is blocking? Can you perhaps attempt a few commands from the command line to illustrate your point?
>>>>
>>>> Assuming you are on Linux and have the 'time' command available, try this:
>>>>
>>>> export ZK_HOST=my-zookeeper:2181
>>>> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
>>>> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd ls /live_nodes
>>>> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /configs/_default/stopwords.txt
>>>>
>>>> What kind of timings do you see?
>>>>
>>>> Jan
>>>>
>>>>> On 14 Dec 2022, at 13:23, michael dürr <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Since we have updated to Solr 9.1, the Admin UI has become pretty slow.
>>>>>
>>>>> The problem is related to the fact that we run Solr and the zookeeper ensemble dockerized. As we cannot bind zookeeper from docker to its host's external IP address, we have to use "0.0.0.0" as the server address, which causes problems when Solr tries to get the zookeeper status (via /solr/admin/zookeeper/status).
>>>>>
>>>>> Some debugging showed that ZookeeperStatusHandler.getZkStatus() always tries to get the dynamic configuration from zookeeper in order to check whether it contains all hosts of Solr's static zookeeper configuration string. But this causes the socket connect to block when resolving "0.0.0.0", which makes everything very slow.
>>>>>
>>>>> Checking whether zookeeper allows dynamic reconfiguration based on the existence of the znode /zookeeper/config does not seem to be a good approach, as this znode exists even when the zookeeper ensemble does not allow dynamic reconfiguration (reconfigEnabled=false).
>>>>>
>>>>> Can anybody suggest some simple action to avoid that blocking (i.e. the dynamic configuration check) so that the status request returns fast again?
>>>>>
>>>>> It would be nice to have a configuration parameter that disables this check independent of the zookeeper ensemble status, especially as reconfigEnabled=false is the default setting for zookeeper.
>>>>>
>>>>> Thanks,
>>>>> Michael
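P.S. As promised above, here is a very rough, untested sketch of the fallback I have in mind. It is standalone Java, not current Solr code; isBindAllAddress and hostsToProbe are made-up names for whatever ZookeeperStatusHandler would actually use:

import java.util.List;
import java.util.stream.Collectors;

// Untested sketch of the proposed fallback - NOT existing Solr code.
public class ZkStatusFallbackSketch {

  // 0.0.0.0 (and :: for IPv6) are bind-all addresses; a remote client can
  // never reach a node through them, so probing them only burns the timeout.
  static boolean isBindAllAddress(String hostPort) {
    String host = hostPort.substring(0, hostPort.lastIndexOf(':'));
    return host.equals("0.0.0.0") || host.equals("::") || host.equals("[::]");
  }

  static List<String> hostsToProbe(List<String> zkHostEntries,      // from ZK_HOST
                                   List<String> ensembleEntries) {  // from /zookeeper/config
    List<String> bindAll = ensembleEntries.stream()
        .filter(ZkStatusFallbackSketch::isBindAllAddress)
        .collect(Collectors.toList());
    if (!bindAll.isEmpty()) {
      // Separate UI warning instead of a 30s connect timeout per node.
      System.out.println("WARN: ensemble advertises bind-all client addresses "
          + bindAll + "; falling back to the hosts configured in ZK_HOST.");
      return zkHostEntries;
    }
    return ensembleEntries;
  }

  public static void main(String[] args) {
    List<String> zkHost = List.of(
        "192.168.0.109:2181", "192.168.0.126:2181", "192.168.0.2:2181");
    List<String> ensemble = List.of(
        "0.0.0.0:2181", "0.0.0.0:2181", "0.0.0.0:2181");
    System.out.println("Probing: " + hostsToProbe(zkHost, ensemble));
  }
}

Whether the detection and the warning text belong directly in getZkStatus() or in a small helper like this is of course something to sort out in the JIRA.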
