[jira] [Commented] (SOLR-13100) harden/manage connectionpool used for intra-cluster communication when we know nodes go down

2019-01-02 Thread Mark Miller (JIRA)


[ https://issues.apache.org/jira/browse/SOLR-13100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732635#comment-16732635 ]

Mark Miller commented on SOLR-13100:


You might be able to. I mention that more of the impl is internal now because 
I think I worked around this when we did some of the impl, and the way I did 
it then doesn't work now.

You may be able to tie invalidating the right connections to a node leaving 
live nodes for 7x without much of a race, but I'm not sure it's as 
straightforward as it seems.
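
To make that concrete, a rough sketch of the pool-side purge with Apache 
HttpClient 4.x might look like the following. The 4.x pool does not expose a 
public per-route purge, so this evicts everything currently idle or expired; 
the {{onNodeLeftLiveNodes}} hook and how it would get wired into the 
shard-handler pools are assumptions for illustration, not existing code.

{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

/**
 * Hypothetical sketch: purge pooled connections when cluster state says a node is gone.
 * HttpClient 4.x has no public "close connections for this route" call, so the best we
 * can do through the public API is evict everything that is currently idle or expired.
 */
public class LiveNodesAwarePoolPurger {

  private final PoolingHttpClientConnectionManager connMgr;

  public LiveNodesAwarePoolPurger(PoolingHttpClientConnectionManager connMgr) {
    this.connMgr = connMgr;
  }

  /** Called (by some live_nodes listener) when a node disappears from the cluster. */
  public void onNodeLeftLiveNodes(String nodeName) {
    // Drop connections whose keep-alive has expired.
    connMgr.closeExpiredConnections();
    // Coarse: evicts every idle connection, not just those routed to nodeName.
    // Connections that are currently in flight are untouched, so this narrows
    // but does not fully close the race.
    connMgr.closeIdleConnections(0, TimeUnit.MILLISECONDS);
  }
}
{code}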

 

 

> harden/manage connectionpool used for intra-cluster communication when we 
> know nodes go down
> 
>
> Key: SOLR-13100
> URL: https://issues.apache.org/jira/browse/SOLR-13100
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Priority: Major
>
> I'm spinning this idea off of some comments I made in SOLR-13028...
> In that issue, in discussion of some test failures that can happen after a 
> node is shutdown/restarted (new emphasis added)...
> {quote}
> The bit where the test fails is that it:
> # shuts down a jetty instance
> # starts the jetty instance again
> # does some waiting for all the collections to be "active" and all the 
> replicas to be "live"
> # tries to send an auto-scaling 'set-cluster-preferences' config change to 
> the cluster
> The bit of test code where it does this creates an entirely new 
> CloudSolrClient, ignoring the existing one except for the ZKServer address, 
> w/an explicit comment that the reason it's doing this is because the 
> connection pool on the existing CloudSolrClient might have a stale connection 
> to the old (i.e. dead) instance of the restarted jetty...
>   ...
> ...doing this ensures that the cloudClient doesn't try to query the "dead" 
> server directly (on a stale connection) but IIUC this issue of stale 
> connections to the dead server instance is still problematic - and the root 
> cause of this failure - because after the CloudSolrClient picks a random node 
> to send the request to, _on the remote solr side, that node then has to 
> dispatch a request to each and every node, and at that point the node doing 
> the distributed dispatch may also have a stale connection pool pointing at a 
> server instance that's no longer listening._
> {quote}
> *The point of this issue is to explore if/how we can -- in general -- better 
> deal with pooled connections in situations where the cluster state knows that 
> an existing node has gone down, or been restarted.*
> SOLR-13028 is a particular example of when/how stale pooled connection info 
> can cause test problems -- and the bulk of the discussion in that issue is 
> about how that specific code path (in dealing with an intra-cluster autoscaling 
> handler command dispatch) can be improved to do a retry in the event of 
> NoHttpResponseException (see the retry sketch after this description) -- but 
> not every place where solr nodes need to talk to each other can blindly retry 
> on every possible connection exception; and even when we can, it would be 
> better if we could minimize the risk of the request failing in a way that 
> would require a retry.
> *So why not improve our HTTP connection pool to be aware of our clusterstate 
> and purge connections when we know nodes have been shutdown/lost?*
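
For the retry approach mentioned in the description, a minimal HttpClient 4.x 
sketch could look roughly like this; the class name is made up for 
illustration, and whether a given request is actually safe to replay still has 
to be decided at each call site, which is exactly the caveat raised above.

{code:java}
import java.io.IOException;

import org.apache.http.NoHttpResponseException;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.protocol.HttpContext;

/**
 * Sketch: retry only when the server closed a pooled connection on us
 * (NoHttpResponseException), and only a bounded number of times. This is
 * only appropriate for requests that are idempotent or otherwise safe to replay.
 */
public class StaleConnectionRetryExample {

  public static CloseableHttpClient buildClient() {
    HttpRequestRetryHandler retryOnStaleConnection = new HttpRequestRetryHandler() {
      @Override
      public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
        // Retry up to twice, but only for the "server dropped the connection" case.
        return exception instanceof NoHttpResponseException && executionCount <= 2;
      }
    };
    return HttpClients.custom()
        .setRetryHandler(retryOnStaleConnection)
        .build();
  }
}
{code}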





[jira] [Commented] (SOLR-13100) harden/manage connectionpool used for intra-cluster communication when we know nodes go down

2019-01-02 Thread Hoss Man (JIRA)


[ https://issues.apache.org/jira/browse/SOLR-13100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732442#comment-16732442 ]

Hoss Man commented on SOLR-13100:
-

bq. ... otherwise it's done on a timer with a thread and somehow shutting down 
Jetty and starting it again does not play quite correctly with our stale 
connection avoidance strategy (try to make client manage the connection 
lifecycle instead of the server).

But in the case of nodes talking to other nodes, we control the client(s) *AND* 
the server(s), and (unless I'm misunderstanding something) every (solr) node 
has a zk watch on the {{live_nodes}} zkNode for every other (solr) node ... so 
why can't each solr node X invalidate its connection pool if it sees node Y 
has gone away (or ideally just the pooled connections to node Y)?
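
To make the question concrete, a rough sketch with a plain ZooKeeper child 
watch on {{/live_nodes}} might look like the following; this is not how 
ZkStateReader is actually wired up, and the {{purgeConnectionsTo}} callback is 
just a placeholder for whatever pool invalidation we settle on.

{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

/**
 * Sketch: watch the children of /live_nodes and report nodes that disappear,
 * so the HTTP connection pool(s) can drop connections to them.
 */
public class LiveNodesWatcher implements Watcher {

  private final ZooKeeper zk;
  private final Consumer<String> purgeConnectionsTo; // placeholder invalidation hook
  private volatile Set<String> lastSeen = new HashSet<>();

  public LiveNodesWatcher(ZooKeeper zk, Consumer<String> purgeConnectionsTo) {
    this.zk = zk;
    this.purgeConnectionsTo = purgeConnectionsTo;
  }

  public void start() throws KeeperException, InterruptedException {
    refresh();
  }

  @Override
  public void process(WatchedEvent event) {
    try {
      refresh(); // re-read the children and re-arm the watch
    } catch (Exception e) {
      // real code would handle connection loss / session expiration here
    }
  }

  private synchronized void refresh() throws KeeperException, InterruptedException {
    List<String> children = zk.getChildren("/live_nodes", this);
    Set<String> current = new HashSet<>(children);
    for (String node : lastSeen) {
      if (!current.contains(node)) {
        // node left live_nodes: invalidate its pooled connections
        purgeConnectionsTo.accept(node);
      }
    }
    lastSeen = current;
  }
}
{code}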




[jira] [Commented] (SOLR-13100) harden/manage connectionpool used for intra-cluster communication when we know nodes go down

2019-01-02 Thread Mark Miller (JIRA)


[ https://issues.apache.org/jira/browse/SOLR-13100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732378#comment-16732378 ]

Mark Miller commented on SOLR-13100:


I can spend a little more time here, but a quick reply.

 

HTTP 1/1.1 has a race condition around stale connections that cannot be 
overcome - though it can be defended against, and we have put a lot of effort 
there. Restarting a test Jetty instance is still a problem area - I think you 
can work around it by making explicit calls to the pool to check for expired 
connections at the right time, but otherwise it's done on a timer with a thread 
and somehow shutting down Jetty and starting it again does not play quite 
correctly with our stale connection avoidance strategy (try to make client 
manage the connection lifecycle instead of the server). We used to have more 
control over some of the inner workings around sweeping for idle connections 
and such, but now I think we use more internal impls with a newer http client 
version. I think the speed of an integration test makes this an issue; it's 
less likely to be seen in the real world.

HTTP/2 has no such problem with this race - you can just check whether a 
connection is still valid - so I think this just goes away with 8.
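
For reference, the explicit-call and timer-based defenses described above 
roughly map onto these HttpClient 4.x knobs; where exactly Solr invokes them is 
internal, so treat this as a sketch of the mechanism rather than the actual 
wiring.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

/**
 * Sketch of the stale-connection defenses available in HttpClient 4.x:
 * - validateAfterInactivity: re-check a pooled connection before reuse once it
 *   has been idle longer than the given number of milliseconds (narrows, but
 *   does not eliminate, the HTTP/1.1 stale-connection race).
 * - a background sweep (or an explicit call at a known-safe point, e.g. right
 *   after restarting a test Jetty) that evicts expired and long-idle connections.
 */
public class StaleConnectionDefenses {

  public static ScheduledExecutorService startSweeper(PoolingHttpClientConnectionManager connMgr) {
    connMgr.setValidateAfterInactivity(1000); // validate connections idle for > 1s before reuse

    ScheduledExecutorService sweeper = Executors.newSingleThreadScheduledExecutor();
    sweeper.scheduleAtFixedRate(() -> {
      connMgr.closeExpiredConnections();
      connMgr.closeIdleConnections(30, TimeUnit.SECONDS);
    }, 5, 5, TimeUnit.SECONDS);
    return sweeper;
  }
}
{code}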




[jira] [Commented] (SOLR-13100) harden/manage connectionpool used for intra-cluster communication when we know nodes go down

2019-01-02 Thread Hoss Man (JIRA)


[ https://issues.apache.org/jira/browse/SOLR-13100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732342#comment-16732342 ]

Hoss Man commented on SOLR-13100:
-

In addition to SOLR-13028, SOLR-13038 is another example where a test failure 
demonstrates a "real world" problem with these stale connections being used.

There is also some older context/discussion in SOLR-6944 - although it's mostly 
about the retry logic, with less specific discussion of the connection pool 
management.

[~markrmil...@gmail.com] - as the person who's spent the most time looking at 
this in the past, it would be really helpful if you could cohesively & 
comprehensively summarize your past experiments/experiences in this area -- 
lots of issues seem to have danced around this idea in the past w/o any 
definitive discussion ... it would be great if we had some clear cut guidance 
like "X won't work because of Y" or "Z might work but I haven't tried it 
because of Q" or "V seemed promising but at the time we had to worry about W 
and I don't know if that's still an issue because of U"
