Hi,

Zookeeper exposes its configuration in that znode regardless of whether users 
have enabled dynamic reconfiguration. Solr does not support dynamic 
reconfiguration and does not care about it. But the UI does compare Zookeeper's 
actual configuration to the static ZK_HOST string provided on the Solr side and 
looks for problems. Say you have 3 zk nodes but your ZK_HOST lists only one. If 
that one zookeeper host responds, the UI uses the zk-provided configuration as 
the authoritative source of truth and obtains status for every single node. It 
then displays a warning on that page that you should configure all addresses 
for fault tolerance.
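
For illustration, here is a rough sketch in Java of the kind of comparison the 
status page makes. This is not the actual ZookeeperStatusHandler code; the 
class and method names are made up, and the /zookeeper/config lines just mirror 
the format shown further down this thread:

import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch only: compare the hosts in the static ZK_HOST string with the client
// addresses advertised in the /zookeeper/config znode and warn on mismatch.
public class ZkEnsembleCheckSketch {

  // Hosts from ZK_HOST, e.g. "zk1:2181,zk2:2181/solr" -> {zk1:2181, zk2:2181}
  static Set<String> hostsFromZkHost(String zkHost) {
    String withoutChroot = zkHost.split("/")[0];
    return Arrays.stream(withoutChroot.split(","))
        .map(String::trim)
        .collect(Collectors.toSet());
  }

  // Client addresses from /zookeeper/config lines like
  // "server.1=192.168.0.109:2888:3888:participant;192.168.0.109:2181"
  // (the part after ';' is the client address)
  static Set<String> clientAddrsFromZkConfig(List<String> configLines) {
    return configLines.stream()
        .filter(l -> l.startsWith("server."))
        .map(l -> l.substring(l.indexOf(';') + 1).trim())
        .collect(Collectors.toSet());
  }

  public static void main(String[] args) {
    Set<String> configured = hostsFromZkHost("192.168.0.109:2181/solr");
    Set<String> ensemble = clientAddrsFromZkConfig(List.of(
        "server.1=192.168.0.109:2888:3888:participant;192.168.0.109:2181",
        "server.2=192.168.0.126:2888:3888:participant;192.168.0.126:2181",
        "server.3=192.168.0.2:2888:3888:participant;192.168.0.2:2181"));
    if (!configured.containsAll(ensemble)) {
      System.out.println("WARNING: ZK_HOST lists " + configured
          + " but the ensemble reports " + ensemble
          + "; configure all addresses for fault tolerance.");
    }
  }
}

With ZK_HOST naming only one of the three client addresses, the check flags the 
mismatch, which is the fault-tolerance warning described above.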

An improvement we could perhaps make is to explicitly detect the 0.0.0.0 
address, issue a separate warning for it in the UI, and fall back to checking 
status only for the host/IP names in ZK_HOST, since we know the others won't 
resolve. Then you would not get the 30s delay.
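
A minimal sketch of that fallback, assuming a hypothetical helper (these names 
are not part of the current ZookeeperStatusHandler API; the real change would 
presumably live in the getZkStatus() method mentioned earlier in this thread):

import java.util.ArrayList;
import java.util.List;

// Sketch only: if the ensemble advertises 0.0.0.0 as a client address, record
// a dedicated warning and probe only the hosts from ZK_HOST, so the status
// page never attempts a connection to 0.0.0.0:2181 that is doomed to time out.
public class ZeroAddressFallbackSketch {

  static final String WILDCARD = "0.0.0.0";

  static List<String> hostsToProbe(List<String> ensembleHosts,
                                   List<String> zkHostHosts,
                                   List<String> warnings) {
    boolean hasWildcard = ensembleHosts.stream().anyMatch(h -> h.startsWith(WILDCARD));
    if (hasWildcard) {
      warnings.add("ZooKeeper advertises " + WILDCARD + " as a client address; "
          + "falling back to the hosts configured in ZK_HOST.");
      return zkHostHosts;   // skip addresses we know will not resolve
    }
    return ensembleHosts;   // normal case: trust the ensemble configuration
  }

  public static void main(String[] args) {
    List<String> warnings = new ArrayList<>();
    List<String> probe = hostsToProbe(
        List.of("0.0.0.0:2181", "0.0.0.0:2181", "0.0.0.0:2181"),
        List.of("192.168.0.109:2181", "192.168.0.126:2181", "192.168.0.2:2181"),
        warnings);
    System.out.println("probing: " + probe);
    warnings.forEach(w -> System.out.println("WARN: " + w));
  }
}

Run with the addresses from this thread, it prints the fallback host list and 
the extra warning instead of hanging on 0.0.0.0:2181.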

If you want to open a JIRA and submit a PR, I'll help review.

Jan

> On 3 Feb 2023, at 06:30, michael dürr <[email protected]> wrote:
> 
> Hi Jan,
> 
> Thanks for answering!
> 
>> I don't know how you run these zk's dockerized, but I'd look for a
>> workaround where you can configure the correct address in zk's
>> configuration. Then Solr will be happy.
> 
> There is a workaround where you can assign certain address ranges to a
> docker host so that each of its containers runs with an external ip the
> container can bind to:
> 
> https://solr.apache.org/guide/solr/latest/deployment-guide/docker-networking.html
> 
> Anyway, this is pretty cumbersome and not really necessary, as the
> communication between zookeeper nodes works perfectly when binding locally to
> "0.0.0.0".
> The problem only occurs when using the admin ui, because (as mentioned in my
> former email) solr tries to connect to "0.0.0.0", which it reads from the
> /zookeeper/config znode of the zookeeper ensemble.
> That causes solr to call ZookeeperStatusHandler.getZkRawResponse(String
> zkHostPort, String fourLetterWordCommand) with "0.0.0.0:2181".
> This does not happen in any other scenario, as solr won't use "0.0.0.0"
> (dynamic configuration is disabled by default) but only the IPs from the
> configured zkHost string (which does not mention "0.0.0.0" but only valid
> IPs).
> 
> So I'm just wondering why the solr ui tries to use IPs read from the
> /zookeeper/config znode even when the zookeeper config explicitly disables
> dynamic reconfiguration (reconfigEnabled=false).
> I'd expect solr to respect the zookeeper config and not try to resolve
> addresses other than those configured in the zkHost string.
> 
>> Are you saying that in 8.11, the test with zkcli.sh to 0.0.0.0:2181 returns
>> immediately instead of after 30s?
> 
> Yes, I tested it now with 8.11.1: the connection attempts immediately raise
> an exception (trace below). The call as a whole still returns only after a
> 30 second timeout, but in contrast to solr 9.1, the attempts in 8.11 fail
> immediately with an exception, so the solr ui also responds immediately.
> 
> root@solr1:/opt/solr#  export ZK_HOST=0.0.0.0:2181
> 
> root@solr1:/opt/solr#  time server/scripts/cloud-scripts/zkcli.sh -z
> $ZK_HOST -cmd get /zookeeper/config
> INFO  - 2023-02-03 06:19:30.748;
> org.apache.solr.common.cloud.ConnectionManager; Waiting for client to
> connect to ZooKeeper
> WARN  - 2023-02-03 06:19:30.759; org.apache.zookeeper.ClientCnxn; Session
> 0x0 for sever 0.0.0.0/0.0.0.0:2181, Closing socket connection. Attempting
> reconnect except it is a SessionExpiredException. =>
> java.net.ConnectException: Connection refused
>       at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native
> Method)
> java.net.ConnectException: Connection refused
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
>       at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
> ~[?:?]
>       at
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:344)
> ~[zookeeper-3.6.2.jar:3.6.2]
>       at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275)
> ~[zookeeper-3.6.2.jar:3.6.2]
> WARN  - 2023-02-03 06:19:31.867; org.apache.zookeeper.ClientCnxn; Session
> 0x0 for sever 0.0.0.0/0.0.0.0:2181, Closing socket connection. Attempting
> reconnect except it is a SessionExpiredException. =>
> 
> === same exception multiple times and then... ===
> 
> WARN  - 2023-02-03 06:20:01.692; org.apache.zookeeper.ClientCnxn; An
> exception was thrown while closing send thread for session 0x0. =>
> java.net.ConnectException: Connection refused
>       at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native
> Method)
> java.net.ConnectException: Connection refused
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
>       at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
> ~[?:?]
>       at
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:344)
> ~[zookeeper-3.6.2.jar:3.6.2]
>       at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275)
> ~[zookeeper-3.6.2.jar:3.6.2]
> Exception in thread "main" org.apache.solr.common.SolrException:
> java.util.concurrent.TimeoutException: Could not connect to ZooKeeper
> 0.0.0.0:2181 within 30000 ms
>       at
> org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:195)
>       at
> org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:119)
>       at
> org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:109)
>       at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:196)
> Caused by: java.util.concurrent.TimeoutException: Could not connect to
> ZooKeeper 0.0.0.0:2181 within 30000 ms
>       at
> org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:251)
> 
>       at
> org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:186)
>       ... 3 more
> 
> real    0m31.693s
> user    0m2.375s
> sys     0m0.199s
> 
> Thanks,
> Michael
> 
> On Thu, Feb 2, 2023 at 12:32 PM Jan Høydahl <[email protected]> wrote:
> 
>> Hi,
>> 
>> Following up on this. I'd still say that the issue here seems to be that
>> your zookeeper config lists 0.0.0.0 as ip address for client connections.
>> 
>>>>> The problem is related to the fact that we run solr and the zookeeper
>>>>> ensemble dockerized. As we cannot bind zookeeper from docker to its host's
>>>>> external ip address, we have to use "0.0.0.0" as the server address
>> 
>> I don't know how you run these zk's dockerized, but I'd look for a
>> workaround where you can configure the correct address in zk's
>> configuration. Then Solr will be happy.
>> 
>> Are you saying that in 8.11, the test with zkcli.sh to 0.0.0.0:2181
>> returns immediately instead of after 30s?
>> 
>> Jan
>> 
>>> On 15 Dec 2022, at 07:10, michael dürr <[email protected]> wrote:
>>> 
>>> Hi Jan,
>>> 
>>> Thanks for answering!
>>> 
>>> I'm pretty sure the reason is related to the problem that solr tries to
>>> connect to "0.0.0.0" as it reads that IP from the /zookeeper/config znode
>>> of the zookeeper ensemble.
>>> The connection I'm talking about is when
>>> ZookeeperStatusHandler.getZkRawResponse(String zkHostPort, String
>>> fourLetterWordCommand) tries to open a Socket to "0.0.0.0:2181".
>>> After a while the connect fails, but as said this takes a long time. I did
>>> not debug deeper, as this is already JDK code at that point.
>>> 
>>> The timings for the valid zookeeper addresses (i.e. those from the static
>>> configuration string) are listed later. What causes problems is the attempt
>>> to connect to 0.0.0.0:2181:
>>> 
>>> /opt/solr-9.1.0$ export ZK_HOST=0.0.0.0:2181
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
>>> -cmd get /zookeeper/config
>>> WARN  - 2022-12-15 06:57:44.828; org.apache.solr.common.cloud.SolrZkClient;
>>> Using default ZkCredentialsInjector. ZkCredentialsInjector is not secure,
>>> it creates an empty list of credentials which leads to 'OPEN_ACL_UNSAFE'
>>> ACLs to Zookeeper nodes
>>> INFO  - 2022-12-15 06:57:44.852;
>>> org.apache.solr.common.cloud.ConnectionManager; Waiting up to 30000ms for
>>> client to connect to ZooKeeper
>>> Exception in thread "main" org.apache.solr.common.SolrException:
>>> java.util.concurrent.TimeoutException: Could not connect to ZooKeeper
>>> 0.0.0.0:2181 within 30000 ms
>>>       at
>>> org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:225)
>>>       at
>>> org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:137)
>>>       at
>>> org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:120)
>>>       at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:260)
>>> Caused by: java.util.concurrent.TimeoutException: Could not connect to
>>> ZooKeeper 0.0.0.0:2181 within 30000 ms
>>>       at
>>> org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:297)
>>>       at
>>> org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:216)
>>>       ... 3 more
>>> 
>>> real    0m31.728s
>>> user    0m3.284s
>>> sys     0m0.226s
>>> 
>>> Of course this will fail, but it was not a problem before (solr 8.11.1):
>>> the call also failed but returned fast.
>>> 
>>> Here are the timings you are interested in for each of my 3 zookeeper nodes
>>> (adjusted to my setup). The interesting part is the result of fetching
>>> /zookeeper/config, as it shows the server configurations that include the
>>> "0.0.0.0" addresses:
>>> 
>>> /opt/solr-9.1.0$ export ZK_HOST=192.168.0.109:2181
>>> 
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
>>> -cmd get /zookeeper/config
>>> server.1=0.0.0.0:2888:3888:participant;0.0.0.0:2181
>>> server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181
>>> server.3=192.168.0.2:2888:3888:participant;0.0.0.0:2181
>>> version=0
>>> 
>>> real    0m0.810s
>>> user    0m3.142s
>>> sys     0m0.148s
>>> 
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
>>> -cmd ls /solr/live_nodes
>>> /solr/live_nodes (2)
>>> /solr/live_nodes/192.168.0.222:8983_solr (0)
>>> /solr/live_nodes/192.168.0.223:8983_solr (0)
>>> 
>>> real    0m0.838s
>>> user    0m3.166s
>>> sys     0m0.210s
>>> 
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
>>> -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
>>> # Licensed to the Apache Software Foundation (ASF) under one or more
>>> # ...
>>> 
>>> real    0m0.836s
>>> user    0m3.121s
>>> sys     0m0.173s
>>> 
>>> /opt/solr-9.1.0$ export ZK_HOST=192.168.0.126:2181
>>> 
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
>>> -cmd get /zookeeper/config
>>> server.1=192.168.0.109:2888:3888:participant;0.0.0.0:2181
>>> server.2=0.0.0.0:2888:3888:participant;0.0.0.0:2181
>>> server.3=192.168.0.2:2888:3888:participant;0.0.0.0:2181
>>> version=0
>>> 
>>> real    0m0.843s
>>> user    0m3.300s
>>> sys     0m0.183s
>>> 
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
>>> -cmd ls /solr/live_nodes
>>> /solr/live_nodes (2)
>>> /solr/live_nodes/192.168.0.222:8983_solr (0)
>>> /solr/live_nodes/192.168.0.223:8983_solr (0)
>>> 
>>> real    0m0.807s
>>> user    0m3.035s
>>> sys     0m0.164s
>>> 
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
>>> -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
>>> # Licensed to the Apache Software Foundation (ASF) under one or more
>>> # ...
>>> 
>>> real    0m0.859s
>>> user    0m3.354s
>>> sys     0m0.177s
>>> 
>>> export ZK_HOST=192.168.0.2:2181
>>> 
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
>>> -cmd get /zookeeper/config
>>> server.1=192.168.0.109:2888:3888:participant;0.0.0.0:2181
>>> server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181
>>> server.3=0.0.0.0:2888:3888:participant;0.0.0.0:2181
>>> version=0
>>> 
>>> real    0m0.790s
>>> user    0m2.838s
>>> sys     0m0.154s
>>> 
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
>>> -cmd ls /solr/live_nodes
>>> /solr/live_nodes (2)
>>> /solr/live_nodes/192.168.0.222:8983_solr (0)
>>> /solr/live_nodes/192.168.0.223:8983_solr (0)
>>> 
>>> real    0m0.861s
>>> user    0m3.201s
>>> sys     0m0.169s
>>> 
>>> /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST
>>> -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
>>> # Licensed to the Apache Software Foundation (ASF) under one or more
>>> # ...
>>> 
>>> real    0m0.779s
>>> user    0m3.081s
>>> sys     0m0.184s
>>> 
>>> Thanks,
>>> Michael
>>> 
>>> On Wed, Dec 14, 2022 at 10:08 PM Jan Høydahl <[email protected]> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> We always check how the zookeeper ensemble is configured. This check does
>>>> not depend on whether dynamic reconfiguration is possible or not; it is
>>>> simply there to detect the common mistake that a 3-node ensemble is
>>>> addressed with only one of the hosts in the static config, or with wrong
>>>> host names.
>>>> 
>>>> Sounds like your problem is not with how Solr talks to ZK, but with how
>>>> you have configured your network. You say
>>>> 
>>>>> But this will cause the socket connect to block when resolving
>>>>> "0.0.0.0" which makes everything very slow.
>>>> 
>>>> Can you elaborate on exactly which connection you are talking about
>>>> here, and why/where it is blocking? Can you perhaps attempt a few commands
>>>> from the command line to illustrate your point?
>>>> 
>>>> Assuming you are on Linux and have the 'time' command available, try this:
>>>> 
>>>> export ZK_HOST=my-zookeeper:2181
>>>> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get
>>>> /zookeeper/config
>>>> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd ls /live_nodes
>>>> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get
>>>> /configs/_default/stopwords.txt
>>>> 
>>>> What kind of timings do you see?
>>>> 
>>>> Jan
>>>> 
>>>>> On 14 Dec 2022, at 13:23, michael dürr <[email protected]> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> Since we have updated to Solr 9.1, the admin ui has become pretty slow.
>>>>> 
>>>>> The problem is related to the fact that we run solr and the zookeeper
>>>>> ensemble dockerized. As we cannot bind zookeeper from docker to its host's
>>>>> external ip address, we have to use "0.0.0.0" as the server address, which
>>>>> causes problems when solr tries to get the zookeeper status (via
>>>>> /solr/admin/zookeeper/status).
>>>>> 
>>>>> Some debugging showed that ZookeeperStatusHandler.getZkStatus() always
>>>>> tries to get the dynamic configuration from zookeeper in order to check
>>>>> whether it contains all hosts of solr's static zookeeper configuration
>>>>> string. But this will cause the socket connect to block when resolving
>>>>> "0.0.0.0" which makes everything very slow.
>>>>> 
>>>>> Checking whether zookeeper allows dynamic reconfiguration based on the
>>>>> existence of the znode /zookeeper/config does not seem to be a good
>>>>> approach, as this znode will exist even when the zookeeper ensemble does
>>>>> not allow dynamic reconfiguration (reconfigEnabled=false).
>>>>> 
>>>>> Can anybody suggest some simple action to avoid that blocking (i.e. the
>>>>> dynamic configuration check) in order to get the status request to return
>>>>> fast again?
>>>>> 
>>>>> It would be nice to have a configuration parameter that disables this
>>>>> check independent of the zookeeper ensemble status, especially as
>>>>> reconfigEnabled=false is the default setting for zookeeper.
>>>>> 
>>>>> Thanks,
>>>>> Michael
>>>> 
>>>> 
>> 
>> 
