Re: Overseer, expiring queued messages

2024-02-03 Thread David Smiley
Indeed; "async" doesn't support cancellation. Deleting the request ID doesn't cancel. BTW I haven't been meaning to talk about cancellation; I've been talking about not starting something if it's already too late. And I haven't been talking about anything that would impact "async" calls; I'm

Re: Overseer, expiring queued messages

2024-02-02 Thread Ilan Ginzburg
A cluster-wide timeout makes sense and is simpler if it is only used by the Overseer (or whatever entity processes a request) to decide not to start processing (that delay would not be request-specific but would depend on the load put by other concurrent activity in the cluster). If we consider a

Re: Overseer, expiring queued messages

2024-02-01 Thread David Smiley
On Thu, Feb 1, 2024 at 1:53 PM Ilan Ginzburg wrote:
>
> I'd be in favor of the Overseer dropping synchronous requests for which the
> requestor is no longer waiting (ephemeral ZK node is gone).

I agree! As you know, we've customized Solr to do exactly that for collection creation. We suspect a

Re: Overseer, expiring queued messages

2024-02-01 Thread Ilan Ginzburg
I'd be in favor of the Overseer dropping synchronous requests for which the requestor is no longer waiting (ephemeral ZK node is gone). For sync or async requests, we could let the caller set a timeout after which the processing should not start if it hasn't already, or for async messages allow a
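
The two checks described in this message (requestor no longer waiting, and a caller-supplied timeout for starting) could be combined roughly as below. This is only a sketch: `StartPolicy`, `Decision`, and all parameter names are hypothetical and are not Solr APIs, and the question of how the Overseer learns whether the ephemeral ZK node still exists is left to the caller.

```java
import java.util.OptionalLong;

/** Hypothetical helper for illustration; not part of Solr. */
final class StartPolicy {
    enum Decision { START, DROP_REQUESTOR_GONE, DROP_TIMED_OUT }

    private StartPolicy() {}

    /**
     * Decides whether processing of a dequeued request should begin.
     *
     * @param async               true for an "async" request (no client is blocked waiting)
     * @param requestorNodeExists whether the requestor's ephemeral ZK node is still present
     * @param startDeadlineMillis optional caller-supplied deadline (epoch millis) after
     *                            which processing should not start
     * @param nowMillis           current time in epoch millis
     */
    static Decision decide(boolean async,
                           boolean requestorNodeExists,
                           OptionalLong startDeadlineMillis,
                           long nowMillis) {
        // A synchronous requestor that has gone away will never read the result.
        if (!async && !requestorNodeExists) {
            return Decision.DROP_REQUESTOR_GONE;
        }
        // Applies to sync and async alike: do not start once the deadline has passed.
        if (startDeadlineMillis.isPresent() && nowMillis > startDeadlineMillis.getAsLong()) {
            return Decision.DROP_TIMED_OUT;
        }
        return Decision.START;
    }
}
```

Note that the check only gates the start of processing; it says nothing about cancelling work already in progress.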

Re: Overseer, expiring queued messages

2024-01-31 Thread 6harat
Thanks David for starting this thread. We have also seen this behavior from the Overseer, resulting in "orphan collections" or "more than 1 replica created" due to timeouts, especially when our cluster is scaled up during peak traffic days. While I am still at a nascent stage of my understanding of Solr

Overseer, expiring queued messages

2024-01-31 Thread David Smiley
I have a proposal and am curious what folks think. When the Overseer dequeues an admin command message to process, imagine it being enhanced to examine the "ctime" (creation time) of the ZK message node to determine how long it has been enqueued, and thus roughly how long the client has been
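
As a rough illustration of the ctime check described above, the sketch below computes how long a message has been enqueued and whether it exceeds a limit. `QueuedMessageExpiry`, its methods, and the `maxQueueTime` parameter are hypothetical names, not Solr code. With ZooKeeper, the creation time would presumably come from `Stat.getCtime()` on the message node; here it is passed in as a plain epoch-millisecond value so the example has no dependencies.

```java
import java.time.Duration;

/** Hypothetical helper for illustration; not part of Solr. */
final class QueuedMessageExpiry {
    private QueuedMessageExpiry() {}

    /**
     * How long the message has been enqueued, given the node's creation time.
     * The two timestamps may come from different clocks (ZK server vs. the
     * processing node), so the result is approximate; negative skew is clamped to zero.
     */
    static Duration enqueuedFor(long ctimeMillis, long nowMillis) {
        return Duration.ofMillis(Math.max(0L, nowMillis - ctimeMillis));
    }

    /** True if the message waited longer than maxQueueTime and should not be started. */
    static boolean isExpired(long ctimeMillis, long nowMillis, Duration maxQueueTime) {
        return enqueuedFor(ctimeMillis, nowMillis).compareTo(maxQueueTime) > 0;
    }
}
```

A caller would evaluate `isExpired(...)` right after dequeuing and before doing any work, which matches the thread's later emphasis on not starting late work rather than cancelling running work.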