Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-23 Thread via GitHub


dsmiley merged PR #2935:
URL: https://github.com/apache/solr/pull/2935


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-22 Thread via GitHub


dsmiley commented on PR #2935:
URL: https://github.com/apache/solr/pull/2935#issuecomment-2608306662

   I'll merge this in about a day as-is.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-22 Thread via GitHub


mlbiscoc commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1925986935


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -216,38 +236,42 @@ private SimpleOrderedMap submitClusterStateRequest(
 
   @Override
   public Set getLiveNodes() {
-if (liveNodes == null) {
-  throw new RuntimeException(
-  "We don't know of any live_nodes to fetch the"
-  + " latest live_nodes information from. "
-  + "If you think your Solr cluster is up and is accessible,"
-  + " you could try re-creating a new CloudSolrClient using 
working"
-  + " solrUrl(s) or zkHost(s).");
-}
 if (TimeUnit.SECONDS.convert((System.nanoTime() - liveNodesTimestamp), 
TimeUnit.NANOSECONDS)
 > getCacheTimeout()) {
-  for (String nodeName : liveNodes) {
-String baseUrl = Utils.getBaseUrlForNodeName(nodeName, urlScheme);
-try (SolrClient client = getSolrClient(baseUrl)) {
-  Set liveNodes = fetchLiveNodes(client);
-  this.liveNodes = (liveNodes);
-  liveNodesTimestamp = System.nanoTime();
-  return liveNodes;
-} catch (Exception e) {
-  log.warn("Attempt to fetch cluster state from {} failed.", baseUrl, 
e);
-}
-  }
+
+  if (liveNodes.stream()
+  .anyMatch((node) -> 
updateLiveNodes(URLUtil.getBaseUrlForNodeName(node, urlScheme
+return this.liveNodes;
+
+  log.warn(
+  "Attempt to fetch cluster state from all known live nodes {} failed. 
Trying backup nodes",
+  liveNodes);
+
+  if (configuredNodes.stream().anyMatch((node) -> 
updateLiveNodes(node.toString(

Review Comment:
   We could but to me it makes sense to try live nodes first unless this class 
was explicitly created as your JIRA proposal pointed, in that it only uses the 
passed URLs and nothing else. 
   
   I think it should be used as backup and go for fetching from live nodes 
first and assume by its name it is "live". If we do the opt in as a feature 
flag for dynamic node discovery then that gives them control. I'm fine with 
flipping it with configured first though if you feel strongly about it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-22 Thread via GitHub


dsmiley commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1925844451


##
solr/CHANGES.txt:
##
@@ -176,6 +176,10 @@ Bug Fixes
 * SOLR-17629: If SQLHandler failed to open the underlying stream (e.g. Solr 
returns an error; could be user/syntax problem),
   it needs to close the stream to cleanup resources but wasn't. (David Smiley)
 
+* SOLR-17519: SolrJ fix to CloudSolrClient and HTTP ClusterState where it can 
fail to request cluster state
+  if its current live nodes list is stale. HTTP ClusterState now retains the 
initial configured list of passed URLs as backup

Review Comment:
   I don't think Solr users know about "HTTP ClusterState"; it's not likely in 
their vocabulary.  On the other hand, wording like "SolrJ CloudSolrClient 
configured with Solr URLs" is likely to be understood.



##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -216,38 +236,42 @@ private SimpleOrderedMap submitClusterStateRequest(
 
   @Override
   public Set getLiveNodes() {
-if (liveNodes == null) {
-  throw new RuntimeException(
-  "We don't know of any live_nodes to fetch the"
-  + " latest live_nodes information from. "
-  + "If you think your Solr cluster is up and is accessible,"
-  + " you could try re-creating a new CloudSolrClient using 
working"
-  + " solrUrl(s) or zkHost(s).");
-}
 if (TimeUnit.SECONDS.convert((System.nanoTime() - liveNodesTimestamp), 
TimeUnit.NANOSECONDS)
 > getCacheTimeout()) {
-  for (String nodeName : liveNodes) {
-String baseUrl = Utils.getBaseUrlForNodeName(nodeName, urlScheme);
-try (SolrClient client = getSolrClient(baseUrl)) {
-  Set liveNodes = fetchLiveNodes(client);
-  this.liveNodes = (liveNodes);
-  liveNodesTimestamp = System.nanoTime();
-  return liveNodes;
-} catch (Exception e) {
-  log.warn("Attempt to fetch cluster state from {} failed.", baseUrl, 
e);
-}
-  }
+
+  if (liveNodes.stream()
+  .anyMatch((node) -> 
updateLiveNodes(URLUtil.getBaseUrlForNodeName(node, urlScheme
+return this.liveNodes;
+
+  log.warn(
+  "Attempt to fetch cluster state from all known live nodes {} failed. 
Trying backup nodes",
+  liveNodes);
+
+  if (configuredNodes.stream().anyMatch((node) -> 
updateLiveNodes(node.toString(

Review Comment:
   Shouldn't we check configuredNodes *first* so as to give the user the 
possibility of some control, basically addressing the conversation points in 
JIRA?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-22 Thread via GitHub


mlbiscoc commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1925380385


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -387,6 +398,19 @@ public String getPolicyNameByCollection(String coll) {
 + "ZkClientClusterStateProvider can be used for this."); // TODO
   }
 
+  private List setConfiguredNodes(List solrUrls) {

Review Comment:
   You're right bad naming. It's only used in the `init` so Instead I just 
removed the function as I don't think we need another utility method and it was 
somewhat small enough inlined the whole thing into the stream where it was 
called. Can move it into a private static utility if you prefer otherwise.



##
solr/solrj/src/test/org/apache/solr/client/solrj/impl/ClusterStateProviderTest.java:
##
@@ -204,8 +204,8 @@ public void 
testClusterStateProviderDownedInitialLiveNodes() throws Exception {
   var jettyNode1 = cluster.getJettySolrRunner(0);
   var jettyNode2 = cluster.getJettySolrRunner(1);
 
-  String nodeName1 = 
getNodeNameFromSolrUrl(jettyNode1.getBaseUrl().toString());
-  String nodeName2 = 
getNodeNameFromSolrUrl(jettyNode2.getBaseUrl().toString());
+  String nodeName1 = 
getNodeNameForBaseUrl(jettyNode1.getBaseUrl().toString());

Review Comment:
   Yeah it was deliberate. Was mostly to keep consistency between the names. 
`getNodeNameForBaseUrl` and `getBaseUrlForNodeName` but can switch it back if 
you prefer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-21 Thread via GitHub


dsmiley commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1924605334


##
solr/solrj/src/test/org/apache/solr/client/solrj/impl/ClusterStateProviderTest.java:
##
@@ -204,8 +204,8 @@ public void 
testClusterStateProviderDownedInitialLiveNodes() throws Exception {
   var jettyNode1 = cluster.getJettySolrRunner(0);
   var jettyNode2 = cluster.getJettySolrRunner(1);
 
-  String nodeName1 = 
getNodeNameFromSolrUrl(jettyNode1.getBaseUrl().toString());
-  String nodeName2 = 
getNodeNameFromSolrUrl(jettyNode2.getBaseUrl().toString());
+  String nodeName1 = 
getNodeNameForBaseUrl(jettyNode1.getBaseUrl().toString());

Review Comment:
   Was From -> For deliberate?  IMO it was named fine.



##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -387,6 +398,19 @@ public String getPolicyNameByCollection(String coll) {
 + "ZkClientClusterStateProvider can be used for this."); // TODO
   }
 
+  private List setConfiguredNodes(List solrUrls) {

Review Comment:
   this doesn't actually "set" anything; it's a utility function mapping input 
to output converting to a URL.  It could even be static.   If you declare a 
static method that only converts a URL with the try-catch, it'd be simple for 
the caller (init) to invoke it with the streams mapping in a one-liner.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-21 Thread via GitHub


mlbiscoc commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1924556359


##
solr/test-framework/src/java/org/apache/solr/cloud/MiniSolrCloudCluster.java:
##
@@ -530,14 +530,14 @@ public JettySolrRunner stopJettySolrRunner(int index) 
throws Exception {
   }
 
   /**
-   * Add a previously stopped node back to the cluster
+   * Add a previously stopped node back to the cluster and reuse its port
*
* @param jetty a {@link JettySolrRunner} previously returned by {@link 
#stopJettySolrRunner(int)}
* @return the started node
* @throws Exception on error
*/
   public JettySolrRunner startJettySolrRunner(JettySolrRunner jetty) throws 
Exception {
-jetty.start(false);
+jetty.start();

Review Comment:
   Thanks for find that. Thats fair. For now, I overloaded the function with a 
`reusePort` to not break anything and maybe change it to just reuse port later. 
Is it fine to do that for now, or should I just be calling jetty itself and 
reuse the port in the tests?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-21 Thread via GitHub


mlbiscoc commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1924555335


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -51,6 +56,7 @@ public abstract class BaseHttpClusterStateProvider implements 
ClusterStateProvid
   private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
   private String urlScheme;
+  private List configuredNodes;

Review Comment:
   Changed to `List` and changed `updateLiveNodes` accordingly. Lmk if how 
I changed it makes sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-21 Thread via GitHub


mlbiscoc commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1924552833


##
solr/solrj/src/java/org/apache/solr/common/util/URLUtil.java:
##
@@ -98,4 +102,59 @@ private static String removeTrailingSlashIfPresent(String 
url) {
 
 return url;
   }
+
+  /**
+   * Construct a V1 base url for the Solr node, given its name (e.g., 
'app-node-1:8983_solr') and a
+   * URL scheme.
+   *
+   * @param nodeName name of the Solr node
+   * @param urlScheme scheme for the base url ('http' or 'https')
+   * @return url that looks like {@code https://app-node-1:8983/solr}
+   * @throws IllegalArgumentException if the provided node name is malformed
+   */
+  public static String getBaseUrlForNodeName(final String nodeName, final 
String urlScheme) {
+return getBaseUrlForNodeName(nodeName, urlScheme, false);
+  }
+
+  /**
+   * Construct a V1 or a V2 base url for the Solr node, given its name (e.g.,
+   * 'app-node-1:8983_solr') and a URL scheme.
+   *
+   * @param nodeName name of the Solr node
+   * @param urlScheme scheme for the base url ('http' or 'https')
+   * @param isV2 whether a V2 url should be constructed
+   * @return url that looks like {@code https://app-node-1:8983/api} (V2) or 
{@code
+   * https://app-node-1:8983/solr} (V1)
+   * @throws IllegalArgumentException if the provided node name is malformed
+   */
+  public static String getBaseUrlForNodeName(
+  final String nodeName, final String urlScheme, boolean isV2) {
+final int colonAt = nodeName.indexOf(':');
+if (colonAt == -1) {
+  throw new IllegalArgumentException(
+  "nodeName does not contain expected ':' separator: " + nodeName);
+}
+
+final int _offset = nodeName.indexOf('_', colonAt);
+if (_offset < 0) {
+  throw new IllegalArgumentException(
+  "nodeName does not contain expected '_' separator: " + nodeName);
+}
+final String hostAndPort = nodeName.substring(0, _offset);
+return urlScheme + "://" + hostAndPort + "/" + (isV2 ? "api" : "solr");
+  }
+
+  /**
+   * Construct base Solr URL to a Solr node name
+   *
+   * @param solrUrl Given a base Solr URL string (e.g., 
'https://app-node-1:8983/solr')
+   * @return Node name that looks like {@code app-node-1:8983_solr}
+   * @throws MalformedURLException if the provided URL string is malformed
+   * @throws URISyntaxException if the provided URL string could not be parsed 
as a URI reference.
+   */
+  public static String getNodeNameForBaseUrl(String solrUrl)
+  throws MalformedURLException, URISyntaxException {
+URL url = new URI(solrUrl).toURL();
+return url.getAuthority() + url.getPath().replace('/', '_');

Review Comment:
   After changing configured nodes to List, this function isn't used 
anymore except for tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-21 Thread via GitHub


dsmiley commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1924427707


##
solr/test-framework/src/java/org/apache/solr/cloud/MiniSolrCloudCluster.java:
##
@@ -530,14 +530,14 @@ public JettySolrRunner stopJettySolrRunner(int index) 
throws Exception {
   }
 
   /**
-   * Add a previously stopped node back to the cluster
+   * Add a previously stopped node back to the cluster and reuse its port
*
* @param jetty a {@link JettySolrRunner} previously returned by {@link 
#stopJettySolrRunner(int)}
* @return the started node
* @throws Exception on error
*/
   public JettySolrRunner startJettySolrRunner(JettySolrRunner jetty) throws 
Exception {
-jetty.start(false);
+jetty.start();

Review Comment:
   When faced with curiosities like this, it's useful to do a bit of code/JIRA 
archeology.  `git blame` (via IntelliJ's margin, BTW) shows the current setting 
(to pick a random port) was *deliberate* and had its own JIRA -- 
https://issues.apache.org/jira/browse/SOLR-9474 and I found a related one: 
https://issues.apache.org/jira/browse/SOLR-9469 .  I read the conversation 
there.  Changing it is worthy of its own JIRA & CHANGES.txt entry, following a 
deeper attempt to understand the causes behind why it hasn't been reusing a 
port thus far.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-21 Thread via GitHub


dsmiley commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1924412143


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -51,6 +56,7 @@ public abstract class BaseHttpClusterStateProvider implements 
ClusterStateProvid
   private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
   private String urlScheme;
+  private List configuredNodes;

Review Comment:
   We don't need a `liveNodeURLs` field.  We know how to convert on the fly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-21 Thread via GitHub


HoustonPutman commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1924410045


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -51,6 +56,7 @@ public abstract class BaseHttpClusterStateProvider implements 
ClusterStateProvid
   private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
   private String urlScheme;
+  private List configuredNodes;

Review Comment:
   yeah, keeping `liveNodes` as string and `configuredNodes` as URLs, with the 
method handling each sounds good to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-21 Thread via GitHub


mlbiscoc commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1924301197


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -51,6 +56,7 @@ public abstract class BaseHttpClusterStateProvider implements 
ClusterStateProvid
   private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
   private String urlScheme;
+  private List configuredNodes;

Review Comment:
   I basically tried to keep `configuredNodes` the same format as `liveNodes` 
as they are used the same with `updateLiveNodes`. I don't like the idea of 
another `liveNodeURLs` list attribute and think just changing liveNodes to be 
`List` would make sense but the caller now get a different list URLs and 
not nodeNames.
   
   I guess I'll just go the route of updateLiveNodes functions handling both 
`NodeName` strings and `Base URL` inputs unless you feel strongly about having 
an additional `liveNodeURLs` list



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-21 Thread via GitHub


mlbiscoc commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1924267502


##
solr/test-framework/src/java/org/apache/solr/cloud/MiniSolrCloudCluster.java:
##
@@ -530,14 +530,14 @@ public JettySolrRunner stopJettySolrRunner(int index) 
throws Exception {
   }
 
   /**
-   * Add a previously stopped node back to the cluster
+   * Add a previously stopped node back to the cluster and reuse its port
*
* @param jetty a {@link JettySolrRunner} previously returned by {@link 
#stopJettySolrRunner(int)}
* @return the started node
* @throws Exception on error
*/
   public JettySolrRunner startJettySolrRunner(JettySolrRunner jetty) throws 
Exception {
-jetty.start(false);
+jetty.start();

Review Comment:
   The tests I incorporated where I take down and bring back nodes, the URLs 
changed because port number changed. So I looked at this method noticed this 
was the cause. It didn't make sense to me change the port when you are passing 
the same jetty back and tests didn't seem to be effected so I didn't think 
there was an impact to change. I can move it out if you'd like, I just need to 
then start jetty in the tests instead of from this function to keep the same 
ports.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-21 Thread via GitHub


dsmiley commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1924189617


##
solr/test-framework/src/java/org/apache/solr/cloud/MiniSolrCloudCluster.java:
##
@@ -530,14 +530,14 @@ public JettySolrRunner stopJettySolrRunner(int index) 
throws Exception {
   }
 
   /**
-   * Add a previously stopped node back to the cluster
+   * Add a previously stopped node back to the cluster and reuse its port
*
* @param jetty a {@link JettySolrRunner} previously returned by {@link 
#stopJettySolrRunner(int)}
* @return the started node
* @throws Exception on error
*/
   public JettySolrRunner startJettySolrRunner(JettySolrRunner jetty) throws 
Exception {
-jetty.start(false);
+jetty.start();

Review Comment:
   I suspect the results of this could be subtle, and nobody would guess this 
JIRA/PR-title would have anything to do with this.  Thus should be merged 
separately.  What did you find that led you to incorporate this change into 
this PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-21 Thread via GitHub


HoustonPutman commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1924191070


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -51,6 +56,7 @@ public abstract class BaseHttpClusterStateProvider implements 
ClusterStateProvid
   private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
   private String urlScheme;
+  private List configuredNodes;

Review Comment:
   Yeah, if the idea is to have 1 method that handles both, then I would 
instead have a `liveNodeURLs` list in addition to `liveNodes`, so that the 
`updateLiveNodes ` method takes in the urls.
   
   But the method can probably split to share logic, but handle both strings 
and urls as inputs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-21 Thread via GitHub


dsmiley commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1924183653


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -51,6 +56,7 @@ public abstract class BaseHttpClusterStateProvider implements 
ClusterStateProvid
   private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
   private String urlScheme;
+  private List configuredNodes;

Review Comment:
   Perhaps configuredNodes should be a List?  Why convert to a live node 
format?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-21 Thread via GitHub


dsmiley commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1924183653


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -51,6 +56,7 @@ public abstract class BaseHttpClusterStateProvider implements 
ClusterStateProvid
   private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
   private String urlScheme;
+  private List configuredNodes;

Review Comment:
   Perhaps configuredNodes should be a `List`?  Why convert to a live node 
format?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-21 Thread via GitHub


dsmiley commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1924087626


##
solr/solrj/src/java/org/apache/solr/common/util/Utils.java:
##
@@ -748,61 +745,6 @@ public static String applyUrlScheme(final String url, 
final String urlScheme) {
 return (at == -1) ? (urlScheme + "://" + url) : urlScheme + 
url.substring(at);
   }
 
-  /**
-   * Construct a V1 base url for the Solr node, given its name (e.g., 
'app-node-1:8983_solr') and a
-   * URL scheme.
-   *
-   * @param nodeName name of the Solr node
-   * @param urlScheme scheme for the base url ('http' or 'https')
-   * @return url that looks like {@code https://app-node-1:8983/solr}
-   * @throws IllegalArgumentException if the provided node name is malformed
-   */
-  public static String getBaseUrlForNodeName(final String nodeName, final 
String urlScheme) {

Review Comment:
   These methods are called in a bunch of places; we can't simply remove them 
in 9x.  They can stay and call the new location and be deprecated.



##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -248,15 +240,19 @@ > getCacheTimeout()) {
   + "succeeded in obtaining the cluster state from none of them."
   + "If you think your Solr cluster is up and is accessible,"
   + " you could try re-creating a new CloudSolrClient using 
working"
-  + " solrUrl(s) or zkHost(s).");
+  + " solrUrl(s).");
 } else {
   return this.liveNodes; // cached copy is fresh enough
 }
   }
 
+  private boolean updateLiveNodes(List liveNodes) {
+return updateLiveNodes(Set.copyOf(liveNodes));
+  }
+
   private boolean updateLiveNodes(Set liveNodes) {

Review Comment:
   why does updateLiveNodes take a Set as an argument; can't it simply be a 
Collection?  The Set is forcing you to do a conversion and that which changes 
the order, and the order may be deliberate.  It's not worth overloading the 
method for either.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-21 Thread via GitHub


mlbiscoc commented on PR #2935:
URL: https://github.com/apache/solr/pull/2935#issuecomment-2605025917

   > You know what... See URLUtil. Shouldn't _that_ be where these methods go? 
The existing method can be deprecated and call the one at the better location. 
Also has a test where your tests of this method can be added to. (big code 
base; so many util classes)
   > 
   > This PR is pretty close to merging now.
   
   I ended up moving all those functions into `URLUril` which resulted in a 
much larger code change. Not sure if you meant for me to use the `Deprecated` 
annotation but instead just moved it and refactored the callers.
   
   Also didn't add in 
[SOLR-17607](https://issues.apache.org/jira/browse/SOLR-17607) because it broke 
a few more tests that needed more changes so just held the changes locally and 
will probably just make it in another PR and link it to that JIRA unless you 
prefer me to do it here in one go. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-20 Thread via GitHub


dsmiley commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1922783901


##
solr/solrj/src/test/org/apache/solr/client/solrj/impl/ClusterStateProviderTest.java:
##
@@ -51,6 +54,15 @@ public static void setupCluster() throws Exception {
 .resolve("streaming")
 .resolve("conf"))
 .configure();
+cluster.waitForAllNodes(30);
+System.setProperty("solr.solrj.cache.timeout.sec", "1");

Review Comment:
   Christine, Solr's test infrastructure auto-restores all system properties to 
as they were from test suite to test suite thanks to 
SystemPropertiesRestoreRule.  But not test-method to test-method, albeit that's 
not pertinent for the line you commented (as it's in a BeforeClass).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-20 Thread via GitHub


cpoerschke commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1922536226


##
solr/solrj/src/test/org/apache/solr/client/solrj/impl/ClusterStateProviderTest.java:
##
@@ -51,6 +54,15 @@ public static void setupCluster() throws Exception {
 .resolve("streaming")
 .resolve("conf"))
 .configure();
+cluster.waitForAllNodes(30);
+System.setProperty("solr.solrj.cache.timeout.sec", "1");

Review Comment:
   (am random pull request browser/scanner only here) -- maybe have matching 
`System.clearProperty` during test cleanup to avoid carry-ever effects between 
tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-17 Thread via GitHub


dsmiley commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1920955860


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -413,6 +430,19 @@ public String getQuorumHosts() {
 return String.join(",", this.liveNodes);
   }
 
+  public Set getNodeNamesFromSolrUrls(List urls) {
+return urls.stream()
+.map(
+(url) -> {
+  try {
+return getNodeNameFromSolrUrl(url);
+  } catch (MalformedURLException | URISyntaxException e) {
+throw new RuntimeException("Failed to parse base Solr URL " + 
url, e);

Review Comment:
   IllegalArgumentException seems appropriate



##
solr/solrj/src/java/org/apache/solr/common/util/Utils.java:
##
@@ -786,6 +789,20 @@ public static String getBaseUrlForNodeName(
 return urlScheme + "://" + hostAndPort + "/" + (isV2 ? "api" : "solr");
   }
 
+  /**
+   * Construct base Solr URL to a Solr node name
+   *
+   * @param solrUrl Given a base Solr URL string (e.g., 
'https://app-node-1:8983/solr')
+   * @return URL that looks like {@code app-node-1:8983_solr}
+   * @throws MalformedURLException if the provided URL string is malformed
+   * @throws URISyntaxException if the provided URL string could not be parsed 
as a URI reference.
+   */
+  public static String getNodeNameFromSolrUrl(String solrUrl)
+  throws MalformedURLException, URISyntaxException {
+URL url = new URI(solrUrl).toURL();
+return url.getAuthority() + url.getPath().replace('/', '_');
+  }

Review Comment:
   Perfect.   (never mind my redundant code review comment)



##
solr/test-framework/src/java/org/apache/solr/cloud/MiniSolrCloudCluster.java:
##
@@ -530,14 +530,14 @@ public JettySolrRunner stopJettySolrRunner(int index) 
throws Exception {
   }
 
   /**
-   * Add a previously stopped node back to the cluster
+   * Add a previously stopped node back to the cluster and reuse its port
*
* @param jetty a {@link JettySolrRunner} previously returned by {@link 
#stopJettySolrRunner(int)}
* @return the started node
* @throws Exception on error
*/
   public JettySolrRunner startJettySolrRunner(JettySolrRunner jetty) throws 
Exception {
-jetty.start(false);
+jetty.start();

Review Comment:
   That makes sense... I'll call it out in the commit log.  I wonder if there's 
any down-side to this.



##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -51,6 +55,7 @@ public abstract class BaseHttpClusterStateProvider implements 
ClusterStateProvid
   private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
   private String urlScheme;
+  private Set backupNodes;

Review Comment:
   Also maybe leave as a list as it was provided.  Why not use them in the 
order provided.



##
solr/solrj/src/java/org/apache/solr/common/util/Utils.java:
##
@@ -786,6 +789,20 @@ public static String getBaseUrlForNodeName(
 return urlScheme + "://" + hostAndPort + "/" + (isV2 ? "api" : "solr");
   }
 
+  /**
+   * Construct base Solr URL to a Solr node name
+   *
+   * @param solrUrl Given a base Solr URL string (e.g., 
'https://app-node-1:8983/solr')
+   * @return URL that looks like {@code app-node-1:8983_solr}
+   * @throws MalformedURLException if the provided URL string is malformed
+   * @throws URISyntaxException if the provided URL string could not be parsed 
as a URI reference.
+   */

Review Comment:
   fantastic javadocs!  I wish someone documented the existing reverse 
direction method



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-17 Thread via GitHub


dsmiley commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1920949678


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -226,17 +232,15 @@ public Set getLiveNodes() {
 }
 if (TimeUnit.SECONDS.convert((System.nanoTime() - liveNodesTimestamp), 
TimeUnit.NANOSECONDS)
 > getCacheTimeout()) {
-  for (String nodeName : liveNodes) {
-String baseUrl = Utils.getBaseUrlForNodeName(nodeName, urlScheme);
-try (SolrClient client = getSolrClient(baseUrl)) {
-  Set liveNodes = fetchLiveNodes(client);
-  this.liveNodes = (liveNodes);
-  liveNodesTimestamp = System.nanoTime();
-  return liveNodes;
-} catch (Exception e) {
-  log.warn("Attempt to fetch cluster state from {} failed.", baseUrl, 
e);
-}
-  }
+
+  if (updateLiveNodes(liveNodes)) return this.liveNodes;
+
+  log.warn(
+  "Attempt to fetch cluster state from all known live nodes {} failed. 
Trying backup nodes",
+  liveNodes);
+
+  if (updateLiveNodes(backupNodes)) return this.liveNodes;
+
   throw new RuntimeException(
   "Tried fetching live_nodes using all the node names we knew of, i.e. 
"
   + liveNodes

Review Comment:
   and the configured nodes.  Also don't recommend zkHosts :-)



##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -226,17 +232,15 @@ public Set getLiveNodes() {
 }

Review Comment:
   Just above this, there's a liveNodes null check.  That seems bogus now



##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -51,6 +55,7 @@ public abstract class BaseHttpClusterStateProvider implements 
ClusterStateProvid
   private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
   private String urlScheme;
+  private Set backupNodes;

Review Comment:
   Can we name this configuredNodes since that's what it is?  Yes we use it as 
a backup but we could adapt how we use them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-17 Thread via GitHub


mlbiscoc commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1920793611


##
solr/solrj/src/java/org/apache/solr/common/util/Utils.java:
##
@@ -786,6 +789,20 @@ public static String getBaseUrlForNodeName(
 return urlScheme + "://" + hostAndPort + "/" + (isV2 ? "api" : "solr");
   }
 
+  /**
+   * Construct base Solr URL to a Solr node name
+   *
+   * @param solrUrl Given a base Solr URL string (e.g., 
'https://app-node-1:8983/solr')
+   * @return URL that looks like {@code app-node-1:8983_solr}
+   * @throws MalformedURLException if the provided URL string is malformed
+   * @throws URISyntaxException if the provided URL string could not be parsed 
as a URI reference.
+   */
+  public static String getNodeNameFromSolrUrl(String solrUrl)
+  throws MalformedURLException, URISyntaxException {
+URL url = new URI(solrUrl).toURL();
+return url.getAuthority() + url.getPath().replace('/', '_');
+  }

Review Comment:
   I couldn't find anywhere where this is a function like this (Maybe you 
someone knows?). Regardless, I think if it exists somewhere, it should sit here 
because the `getBaseUrlForNodeName` function right above it does the conversion 
`nodeName->Url` and putting `url->nodeName` right below it makes sense to me. 
Added unit tests and just added unit tests for `getBaseUrlForNodeName` because 
it didn't exist before.



##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -226,17 +232,15 @@ public Set getLiveNodes() {
 }
 if (TimeUnit.SECONDS.convert((System.nanoTime() - liveNodesTimestamp), 
TimeUnit.NANOSECONDS)
 > getCacheTimeout()) {
-  for (String nodeName : liveNodes) {
-String baseUrl = Utils.getBaseUrlForNodeName(nodeName, urlScheme);
-try (SolrClient client = getSolrClient(baseUrl)) {
-  Set liveNodes = fetchLiveNodes(client);
-  this.liveNodes = (liveNodes);
-  liveNodesTimestamp = System.nanoTime();
-  return liveNodes;
-} catch (Exception e) {
-  log.warn("Attempt to fetch cluster state from {} failed.", baseUrl, 
e);
-}
-  }
+
+  if (updateLiveNodes(liveNodes)) return this.liveNodes;
+
+  log.warn(
+  "Attempt to fetch cluster state from all known live nodes {} failed. 
Trying backup nodes",
+  liveNodes);
+
+  if (updateLiveNodes(backupNodes)) return this.liveNodes;

Review Comment:
   Backup is checked if liveNodes is exhausted. There are a few places that 
liveNodes is used for fetching but backup isn't now. Was thinking of adding it, 
but that requires a larger refactor from just this bug and writing additional 
tests. Happy to do it though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-10 Thread via GitHub


mlbiscoc commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1911140452


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -413,6 +418,30 @@ public String getQuorumHosts() {
 return String.join(",", this.liveNodes);
   }
 
+  /** Live nodes should always have the latest set of live nodes but never 
remove initial set */
+  private void setLiveNodes(Set nodes) {
+Set liveNodes = new HashSet<>(nodes);
+liveNodes.addAll(this.initialNodes);
+this.liveNodes = Set.copyOf(liveNodes);

Review Comment:
   My idea was to not do another loop through if liveNodes was exhausted. 
`initalNodes` would always exist in liveNodes with this set method. Let me 
refactor again from your suggestion comment



##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -413,6 +418,30 @@ public String getQuorumHosts() {
 return String.join(",", this.liveNodes);
   }
 
+  /** Live nodes should always have the latest set of live nodes but never 
remove initial set */
+  private void setLiveNodes(Set nodes) {
+Set liveNodes = new HashSet<>(nodes);
+liveNodes.addAll(this.initialNodes);
+this.liveNodes = Set.copyOf(liveNodes);
+  }
+
+  public Set getNodeNamesFromSolrUrls(List urls)
+  throws URISyntaxException, MalformedURLException {
+Set set = new HashSet<>();
+for (String url : urls) {
+  String nodeNameFromSolrUrl = getNodeNameFromSolrUrl(url);
+  set.add(nodeNameFromSolrUrl);
+}
+return Collections.unmodifiableSet(set);
+  }
+
+  /** URL to cluster state node name (http://127.0.0.1:12345/solr to 
127.0.0.1:12345_solr) */
+  public String getNodeNameFromSolrUrl(String solrUrl)
+  throws MalformedURLException, URISyntaxException {
+URL url = new URI(solrUrl).toURL();
+return url.getAuthority() + url.getPath().replace('/', '_');
+  }

Review Comment:
   Hmm maybe... I found `getBaseUrlForNodeName` but it's going the other way 
nodeName->url. Let me look around more



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-10 Thread via GitHub


dsmiley commented on PR #2935:
URL: https://github.com/apache/solr/pull/2935#issuecomment-2583773292

   I simplified characterization of how I think this should be done:
   * on initialization, copy the configured URLs to a field for safe keeping.  
Convert to URL if you like (validates if malformed)
   * on initialization, liveNodes shall be empty; we haven't fetched it yet.  
Ideally we don't even try to on initialization.
   * when getLiveNodes is called and it's not out of date, return it.  If it's 
out of date, loop live nodes _and then the initial configured URLs_ for the 
first server that will respond to fetch the current live nodes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-10 Thread via GitHub


dsmiley commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1911044868


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -229,10 +236,9 @@ > getCacheTimeout()) {
   for (String nodeName : liveNodes) {

Review Comment:
   If we exhaust liveNodes, shouldn't we then try the initial configured nodes, 
and then only failing that, throw an exception?



##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -413,6 +418,30 @@ public String getQuorumHosts() {
 return String.join(",", this.liveNodes);
   }
 
+  /** Live nodes should always have the latest set of live nodes but never 
remove initial set */
+  private void setLiveNodes(Set nodes) {
+Set liveNodes = new HashSet<>(nodes);
+liveNodes.addAll(this.initialNodes);
+this.liveNodes = Set.copyOf(liveNodes);
+  }
+
+  public Set getNodeNamesFromSolrUrls(List urls)
+  throws URISyntaxException, MalformedURLException {
+Set set = new HashSet<>();
+for (String url : urls) {
+  String nodeNameFromSolrUrl = getNodeNameFromSolrUrl(url);
+  set.add(nodeNameFromSolrUrl);
+}
+return Collections.unmodifiableSet(set);

Review Comment:
   Could easily be converted to a single expression with streams



##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -413,6 +418,30 @@ public String getQuorumHosts() {
 return String.join(",", this.liveNodes);
   }
 
+  /** Live nodes should always have the latest set of live nodes but never 
remove initial set */
+  private void setLiveNodes(Set nodes) {
+Set liveNodes = new HashSet<>(nodes);
+liveNodes.addAll(this.initialNodes);
+this.liveNodes = Set.copyOf(liveNodes);
+  }
+
+  public Set getNodeNamesFromSolrUrls(List urls)
+  throws URISyntaxException, MalformedURLException {
+Set set = new HashSet<>();
+for (String url : urls) {
+  String nodeNameFromSolrUrl = getNodeNameFromSolrUrl(url);
+  set.add(nodeNameFromSolrUrl);
+}
+return Collections.unmodifiableSet(set);
+  }
+
+  /** URL to cluster state node name (http://127.0.0.1:12345/solr to 
127.0.0.1:12345_solr) */
+  public String getNodeNameFromSolrUrl(String solrUrl)
+  throws MalformedURLException, URISyntaxException {
+URL url = new URI(solrUrl).toURL();
+return url.getAuthority() + url.getPath().replace('/', '_');
+  }

Review Comment:
   can you find other code in Solr doing this?  Surely it's somewhere.



##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -413,6 +418,30 @@ public String getQuorumHosts() {
 return String.join(",", this.liveNodes);
   }
 
+  /** Live nodes should always have the latest set of live nodes but never 
remove initial set */
+  private void setLiveNodes(Set nodes) {
+Set liveNodes = new HashSet<>(nodes);
+liveNodes.addAll(this.initialNodes);
+this.liveNodes = Set.copyOf(liveNodes);
+  }
+
+  public Set getNodeNamesFromSolrUrls(List urls)
+  throws URISyntaxException, MalformedURLException {
+Set set = new HashSet<>();
+for (String url : urls) {
+  String nodeNameFromSolrUrl = getNodeNameFromSolrUrl(url);
+  set.add(nodeNameFromSolrUrl);
+}
+return Collections.unmodifiableSet(set);
+  }
+
+  /** URL to cluster state node name (http://127.0.0.1:12345/solr to 
127.0.0.1:12345_solr) */
+  public String getNodeNameFromSolrUrl(String solrUrl)
+  throws MalformedURLException, URISyntaxException {
+URL url = new URI(solrUrl).toURL();
+return url.getAuthority() + url.getPath().replace('/', '_');
+  }

Review Comment:
   should be a static method that with a unit test



##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -413,6 +418,30 @@ public String getQuorumHosts() {
 return String.join(",", this.liveNodes);
   }
 
+  /** Live nodes should always have the latest set of live nodes but never 
remove initial set */
+  private void setLiveNodes(Set nodes) {
+Set liveNodes = new HashSet<>(nodes);
+liveNodes.addAll(this.initialNodes);
+this.liveNodes = Set.copyOf(liveNodes);

Review Comment:
   Why is this 3 lines of extra copy-ing instead of nothing more than: 
`this.liveNodes = Set.copyOf(nodes);` ?  That is, why are we touching / using 
initialNodes at all here?
   Maybe an IllegalArgumentException if nodes.isEmpty.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: issues-unsubscr.

Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-10 Thread via GitHub


dsmiley commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1910989910


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -61,6 +67,7 @@ public abstract class BaseHttpClusterStateProvider implements 
ClusterStateProvid
   private int cacheTimeout = 
EnvUtils.getPropertyAsInteger("solr.solrj.cache.timeout.sec", 5);
 
   public void init(List solrUrls) throws Exception {
+this.initialNodes = getNodeNamesFromSolrUrls(solrUrls);

Review Comment:
   great idea to do basic URL malformed checks like this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-10 Thread via GitHub


mlbiscoc commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1910697779


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -61,6 +67,7 @@ public abstract class BaseHttpClusterStateProvider implements 
ClusterStateProvid
   private int cacheTimeout = 
EnvUtils.getPropertyAsInteger("solr.solrj.cache.timeout.sec", 5);
 
   public void init(List solrUrls) throws Exception {
+this.initialNodes = getNodeNamesFromSolrUrls(solrUrls);

Review Comment:
   > I think the idea of this JIRA issue is that we'll take the Solr URLs as 
configured and use this is the initial / backup liveNodes. I think this is a 
very simple idea to to document/understand/implement.
   
   That makes sense. This leaves me with a few questions then. Shouldn't this 
take a list of `URL` or `URI` java object to verify actual non malformed URLs 
instead of a list of strings? I created the functions below to convert these 
Strings into cluster state nodeNames for `liveNodes`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-10 Thread via GitHub


mlbiscoc commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1910698804


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -413,6 +418,30 @@ public String getQuorumHosts() {
 return String.join(",", this.liveNodes);
   }
 
+  /** Live nodes should always have the latest set of live nodes but never 
remove initial set */
+  private void setLiveNodes(Set nodes) {
+Set liveNodes = new HashSet<>(nodes);
+liveNodes.addAll(this.initialNodes);
+this.liveNodes = Set.copyOf(liveNodes);
+  }
+
+  public Set getNodeNamesFromSolrUrls(List urls)
+  throws URISyntaxException, MalformedURLException {
+Set set = new HashSet<>();
+for (String url : urls) {
+  String nodeNameFromSolrUrl = getNodeNameFromSolrUrl(url);
+  set.add(nodeNameFromSolrUrl);
+}
+return Collections.unmodifiableSet(set);
+  }
+
+  /** URL to cluster state node name (http://127.0.0.1:12345/solr to 
127.0.0.1:12345_solr) */
+  public String getNodeNameFromSolrUrl(String solrUrl)
+  throws MalformedURLException, URISyntaxException {
+URL url = new URI(solrUrl).toURL();
+return url.getAuthority() + url.getPath().replace('/', '_');
+  }
+

Review Comment:
   Not sure if these methods belong here but it is required to convert the 
initial set of String urls into cluster state node names



##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -61,6 +67,7 @@ public abstract class BaseHttpClusterStateProvider implements 
ClusterStateProvid
   private int cacheTimeout = 
EnvUtils.getPropertyAsInteger("solr.solrj.cache.timeout.sec", 5);
 
   public void init(List solrUrls) throws Exception {
+this.initialNodes = getNodeNamesFromSolrUrls(solrUrls);

Review Comment:
   > I think the idea of this JIRA issue is that we'll take the Solr URLs as 
configured and use this is the initial / backup liveNodes. I think this is a 
very simple idea to to document/understand/implement.
   
   That makes sense. This leaves me with a few questions then. Shouldn't this 
take a list of `URL` or `URI` java object to verify actual non malformed URLs 
instead of a list of URLs? I created the functions below to convert these 
Strings into cluster state nodeNames for `liveNodes`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-10 Thread via GitHub


mlbiscoc commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1910689266


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -51,7 +52,9 @@ public abstract class BaseHttpClusterStateProvider implements 
ClusterStateProvid
   private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
   private String urlScheme;
+  private Set initialNodes;
   volatile Set liveNodes;
+  volatile Set knownNodes;

Review Comment:
   Thats fair. Removed known nodes and put everything into live nodes while 
using the initial set inside all the time.



##
solr/solrj/src/test/org/apache/solr/client/solrj/impl/ClusterStateProviderTest.java:
##
@@ -183,4 +204,99 @@ public void testClusterStateProvider() throws 
SolrServerException, IOException {
   clusterStateZk.getCollection("col2"), 
equalTo(clusterStateHttp.getCollection("col2")));
 }
   }
+
+  @Test
+  public void testClusterStateProviderDownedLiveNodes() throws Exception {
+try (var cspZk = zkClientClusterStateProvider();
+var cspHttp = http2ClusterStateProvider()) {
+  Set expectedLiveNodes = cspZk.getClusterState().getLiveNodes();
+  Set actualLiveNodes = cspHttp.getLiveNodes();
+  assertEquals(2, actualLiveNodes.size());
+  assertEquals(expectedLiveNodes, actualLiveNodes);
+
+  cluster.stopJettySolrRunner(jettyNode1);
+  waitForCSPCacheTimeout();
+
+  expectedLiveNodes = cspZk.getClusterState().getLiveNodes();
+  actualLiveNodes = cspHttp.getLiveNodes();
+  assertEquals(1, actualLiveNodes.size());
+  assertEquals(expectedLiveNodes, actualLiveNodes);
+
+  cluster.startJettySolrRunner(jettyNode1);
+  cluster.stopJettySolrRunner(jettyNode2);
+  waitForCSPCacheTimeout();
+
+  // Should still be reachable because known hosts doesn't remove initial 
nodes
+  expectedLiveNodes = cspZk.getClusterState().getLiveNodes();
+  actualLiveNodes = cspHttp.getLiveNodes();
+  assertEquals(1, actualLiveNodes.size());
+  assertEquals(expectedLiveNodes, actualLiveNodes);
+}
+  }
+
+  @Test
+  public void testClusterStateProviderDownedKnownHosts() throws Exception {
+
+try (var cspHttp = http2ClusterStateProvider()) {
+
+  String jettyNode1Url = 
normalizeJettyUrl(jettyNode1.getBaseUrl().toString());
+  String jettyNode2Url = 
normalizeJettyUrl(jettyNode2.getBaseUrl().toString());
+  Set expectedKnownNodes = Set.of(jettyNode1Url, jettyNode2Url);
+  Set actualKnownNodes = cspHttp.getKnownNodes();
+
+  assertEquals(2, actualKnownNodes.size());
+  assertEquals(expectedKnownNodes, actualKnownNodes);
+
+  cluster.stopJettySolrRunner(jettyNode1);
+  waitForCSPCacheTimeout();
+
+  // Known hosts should never remove the initial set of live nodes
+  actualKnownNodes = cspHttp.getKnownNodes();
+  assertEquals(2, actualKnownNodes.size());
+  assertEquals(expectedKnownNodes, actualKnownNodes);
+}
+  }
+
+  @Test
+  public void testClusterStateProviderKnownHostsWithNewHost() throws Exception 
{
+
+try (var cspHttp = http2ClusterStateProvider()) {
+
+  var jettyNode3 = cluster.startJettySolrRunner();
+  String jettyNode1Url = 
normalizeJettyUrl(jettyNode1.getBaseUrl().toString());
+  String jettyNode2Url = 
normalizeJettyUrl(jettyNode2.getBaseUrl().toString());
+  String jettyNode3Url = 
normalizeJettyUrl(jettyNode3.getBaseUrl().toString());
+  Set expectedKnownNodes = Set.of(jettyNode1Url, jettyNode2Url, 
jettyNode3Url);
+  waitForCSPCacheTimeout();
+
+  Set actualKnownNodes = cspHttp.getKnownNodes();
+  assertEquals(3, actualKnownNodes.size());
+  assertEquals(expectedKnownNodes, actualKnownNodes);
+
+  cluster.stopJettySolrRunner(jettyNode1);
+  waitForCSPCacheTimeout();
+
+  actualKnownNodes = cspHttp.getKnownNodes();
+  assertEquals(3, actualKnownNodes.size());
+  assertEquals(expectedKnownNodes, actualKnownNodes);
+
+  cluster.stopJettySolrRunner(jettyNode3);
+  expectedKnownNodes = Set.of(jettyNode1Url, jettyNode2Url);
+  waitForCSPCacheTimeout();
+
+  // New nodes are removable from known hosts
+  actualKnownNodes = cspHttp.getKnownNodes();
+  assertEquals(2, actualKnownNodes.size());
+  assertEquals(expectedKnownNodes, actualKnownNodes);
+}
+  }
+
+  private void waitForCSPCacheTimeout() throws InterruptedException {
+Thread.sleep(6000);

Review Comment:
   Cache timeout setting is in seconds and pulls an int. Can't make it 1ms. 
This probably should have had better granularity instead of increments of 
seconds.
   
   I could change cacheTimeout to take milliseconds as the int instead but that 
would change peoples system property not using the default. So if they set the 
cache timeout to 5 seconds, it now turns into 5ms if they are not aware of this 
change.



-- 
This is an automated messag

Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-06 Thread via GitHub


dsmiley commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1904581414


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -51,7 +52,9 @@ public abstract class BaseHttpClusterStateProvider implements 
ClusterStateProvid
   private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
   private String urlScheme;
+  private Set initialNodes;
   volatile Set liveNodes;
+  volatile Set knownNodes;

Review Comment:
   There's no guarantee that getLiveNodes is going to return a list of 
reachable nodes.  A moment after returning it, a node could become unreachable. 
 So it's "best effort" and the caller has to deal with failures by trying the 
other nodes in the list and/or getting a possibly refreshed list.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-06 Thread via GitHub


mlbiscoc commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1904552713


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -51,7 +52,9 @@ public abstract class BaseHttpClusterStateProvider implements 
ClusterStateProvid
   private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
   private String urlScheme;
+  private Set initialNodes;
   volatile Set liveNodes;
+  volatile Set knownNodes;

Review Comment:
   I originally did it with just `liveNodes` but liveNodes is returned to the 
client which from it's name I was thinking they would assume they are "live". 
If we just place all the nodes into liveNodes thats not necessarily true, 
right? If the current set hold all the initially passed nodes, it isn't 
guaranteed it's live which is why I switched to known nodes while live is what 
was actually fetched from ZooKeeper.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-05 Thread via GitHub


dsmiley commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1903717289


##
solr/solrj/src/test/org/apache/solr/client/solrj/impl/ClusterStateProviderTest.java:
##
@@ -34,15 +35,20 @@
 import org.apache.solr.common.cloud.ClusterState;
 import org.apache.solr.common.cloud.DocCollection;
 import org.apache.solr.common.util.NamedList;
+import org.apache.solr.embedded.JettySolrRunner;
 import org.hamcrest.Matchers;
+import org.junit.After;
 import org.junit.BeforeClass;
 import org.junit.Test;
 
 public class ClusterStateProviderTest extends SolrCloudTestCase {
 
+  private static JettySolrRunner jettyNode1;
+  private static JettySolrRunner jettyNode2;

Review Comment:
   could add a convenience method if you like.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail [solr]

2025-01-05 Thread via GitHub


dsmiley commented on code in PR #2935:
URL: https://github.com/apache/solr/pull/2935#discussion_r1903643954


##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -65,6 +68,8 @@ public void init(List solrUrls) throws Exception {
   urlScheme = solrUrl.startsWith("https") ? "https" : "http";
   try (SolrClient initialClient = getSolrClient(solrUrl)) {
 this.liveNodes = fetchLiveNodes(initialClient);
+this.initialNodes = Set.copyOf(liveNodes);

Review Comment:
   I think the idea of this JIRA issue is that we'll take the Solr URLs as 
configured and use *this* is the initial / backup liveNodes.  I think this is a 
very simple idea to to document/understand/implement.



##
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java:
##
@@ -51,7 +52,9 @@ public abstract class BaseHttpClusterStateProvider implements 
ClusterStateProvid
   private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
   private String urlScheme;
+  private Set initialNodes;
   volatile Set liveNodes;
+  volatile Set knownNodes;

Review Comment:
   right away, I'm surprised.  I can understand needing two sets, but 3?  IMO 
knownNodes doesn't need to exists; that's the same as liveNodes.  I see below 
you've changed many references from liveNodes to knownNodes but I recommend 
against doing that because the exact wording of "liveNodes" is extremely 
pervasive in SolrCloud (well known concept) so let's not have a vary it in just 
this class or any class.



##
solr/solrj/src/test/org/apache/solr/client/solrj/impl/ClusterStateProviderTest.java:
##
@@ -34,15 +35,20 @@
 import org.apache.solr.common.cloud.ClusterState;
 import org.apache.solr.common.cloud.DocCollection;
 import org.apache.solr.common.util.NamedList;
+import org.apache.solr.embedded.JettySolrRunner;
 import org.hamcrest.Matchers;
+import org.junit.After;
 import org.junit.BeforeClass;
 import org.junit.Test;
 
 public class ClusterStateProviderTest extends SolrCloudTestCase {
 
+  private static JettySolrRunner jettyNode1;
+  private static JettySolrRunner jettyNode2;

Review Comment:
   static fields in tests that refer to Solr should be null'ed out, so there's 
some burden to them.  Here... I think you could avoid these altogether and 
simply call `cluster.getJettySolrRunner(0)` which isn't bad!



##
solr/solrj/src/test/org/apache/solr/client/solrj/impl/ClusterStateProviderTest.java:
##
@@ -183,4 +204,99 @@ public void testClusterStateProvider() throws 
SolrServerException, IOException {
   clusterStateZk.getCollection("col2"), 
equalTo(clusterStateHttp.getCollection("col2")));
 }
   }
+
+  @Test
+  public void testClusterStateProviderDownedLiveNodes() throws Exception {
+try (var cspZk = zkClientClusterStateProvider();
+var cspHttp = http2ClusterStateProvider()) {
+  Set expectedLiveNodes = cspZk.getClusterState().getLiveNodes();
+  Set actualLiveNodes = cspHttp.getLiveNodes();
+  assertEquals(2, actualLiveNodes.size());
+  assertEquals(expectedLiveNodes, actualLiveNodes);
+
+  cluster.stopJettySolrRunner(jettyNode1);
+  waitForCSPCacheTimeout();
+
+  expectedLiveNodes = cspZk.getClusterState().getLiveNodes();
+  actualLiveNodes = cspHttp.getLiveNodes();
+  assertEquals(1, actualLiveNodes.size());
+  assertEquals(expectedLiveNodes, actualLiveNodes);
+
+  cluster.startJettySolrRunner(jettyNode1);
+  cluster.stopJettySolrRunner(jettyNode2);
+  waitForCSPCacheTimeout();
+
+  // Should still be reachable because known hosts doesn't remove initial 
nodes
+  expectedLiveNodes = cspZk.getClusterState().getLiveNodes();
+  actualLiveNodes = cspHttp.getLiveNodes();
+  assertEquals(1, actualLiveNodes.size());
+  assertEquals(expectedLiveNodes, actualLiveNodes);
+}
+  }
+
+  @Test
+  public void testClusterStateProviderDownedKnownHosts() throws Exception {
+
+try (var cspHttp = http2ClusterStateProvider()) {
+
+  String jettyNode1Url = 
normalizeJettyUrl(jettyNode1.getBaseUrl().toString());
+  String jettyNode2Url = 
normalizeJettyUrl(jettyNode2.getBaseUrl().toString());
+  Set expectedKnownNodes = Set.of(jettyNode1Url, jettyNode2Url);
+  Set actualKnownNodes = cspHttp.getKnownNodes();
+
+  assertEquals(2, actualKnownNodes.size());
+  assertEquals(expectedKnownNodes, actualKnownNodes);
+
+  cluster.stopJettySolrRunner(jettyNode1);
+  waitForCSPCacheTimeout();
+
+  // Known hosts should never remove the initial set of live nodes
+  actualKnownNodes = cspHttp.getKnownNodes();
+  assertEquals(2, actualKnownNodes.size());
+  assertEquals(expectedKnownNodes, actualKnownNodes);
+}
+  }
+
+  @Test
+  public void testClusterStateProviderKnownHostsWithNewHost() throws Exception 
{
+
+try (var cspHttp = http2ClusterStateProvider())