Dear all,

I'm experiencing a strange behaviour with a SolrCloud cluster.

Cluster description

I have a cluster with a total of 38 nodes. All nodes are installed with the following features:
* OS: Debian GNU/Linux 9.13 (stretch)
* JRE: openjdk version "11.0.6" 2020-01-14
* Apache Solr: Apache Solr 8.11.2

The cluster nodes are divided as follows:

Nodes used for indexing
* solrindex-01
* solrindex-02

Nodes used for queries
* solrquery-01
* solrquery-02

Cluster nodes with collections
* solrnode-01
* ...
* solrnode-34

Configuration of the collection

In the cluster I have a collection (e.g. testcollection) split across the nodes into shards, one shard per month (e.g. shard_202201, shard_202202, ...).

Problem

From time to time the solrquery-01 node is no longer able to query the entire collection; in particular, it is unable to contact some replicas of the collection hosted on other nodes of the cluster. The problem does not resolve itself: the Apache Solr service on the solrquery-01 node has to be restarted.

In particular, if I query a specific replica from the solrquery-01 node, the request remains pending until it times out.

Query:
http://solrquery-01:8080/solr/volocomapi_search/select?q=UniqueReference:DOC_EBF3D4C11F1239852490280F583D052FC214A10D6E716BD98C19CBC599E5EFED&debug=true&shards=http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/

Response:
[inline image: image001.jpg]

By executing the same query from another node (e.g. solrnode-01), the query is successful.
Query:
http://solrnode-01:8080/solr/volocomapi_search/select?q=UniqueReference:DOC_EBF3D4C11F1239852490280F583D052FC214A10D6E716BD98C19CBC599E5EFED&debug=true&shards=http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/

Response:
[inline image: image002.jpg]

The same happens if I run the query against a different replica.

Query:
http://solrquery-01:8080/solr/volocomapi_search/select?q=UniqueReference:DOC_EBF3D4C11F1239852490280F583D052FC214A10D6E716BD98C19CBC599E5EFED&debug=true&shards=http://solrnode-23.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n573/

Response:
[inline image: image003.jpg]

Checking the network traffic with tcpdump on the solrquery-01 machine does not show any connection, whereas it does on the solrnode-01 machine.

tcpdump on the solrquery-01 machine:
[inline image: image004.jpg]

tcpdump on the solrnode-01 machine:
[inline image: image005.jpg]

Question

Do you have any suggestions on how to investigate this issue further, or on possible solutions?

Thank you in advance,
Matteo
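
For anyone who wants to repeat the comparison described above, here is a minimal sketch in Python. It is not part of the tests reported in this message. It sends the same shard-restricted query through both coordinating nodes and prints the outcome and elapsed time for each replica. The helper names build_query_url and probe and the 10-second timeout are assumptions made for illustration; the host names, collection name, query and replica URLs are the ones quoted above.

```python
import time
import urllib.error
import urllib.parse
import urllib.request

# Query and replica URLs copied from the message above.
QUERY = ("UniqueReference:DOC_EBF3D4C11F1239852490280F583D052F"
         "C214A10D6E716BD98C19CBC599E5EFED")
REPLICAS = [
    "http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/",
    "http://solrnode-23.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n573/",
]


def build_query_url(coordinator, collection, query, replica_url):
    """Build a /select URL that restricts the query to one replica via 'shards'."""
    params = urllib.parse.urlencode(
        {"q": query, "debug": "true", "shards": replica_url}
    )
    return f"http://{coordinator}/solr/{collection}/select?{params}"


def probe(url, timeout=10.0):
    """Send one GET request; return (outcome, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            outcome = f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The node answered, but with an error status.
        outcome = f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, refused connections and socket timeouts.
        outcome = f"failed: {exc}"
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    # Assumed timeout of 10 s; adjust to the timeout configured on the cluster.
    for coordinator in ("solrquery-01:8080", "solrnode-01:8080"):
        for replica in REPLICAS:
            url = build_query_url(coordinator, "volocomapi_search", QUERY, replica)
            outcome, elapsed = probe(url, timeout=10.0)
            print(f"{coordinator} -> {replica}: {outcome} after {elapsed:.1f}s")
```

Running it periodically from a third machine would record when solrquery-01 starts timing out for a given replica while solrnode-01 still answers.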