[ 
https://issues.apache.org/jira/browse/SOLR-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilan Ginzburg reopened SOLR-14462:
----------------------------------

Reopening to backport for 8.6

> Autoscaling placement wrong with concurrent collection creations
> ----------------------------------------------------------------
>
>                 Key: SOLR-14462
>                 URL: https://issues.apache.org/jira/browse/SOLR-14462
>             Project: Solr
>          Issue Type: Bug
>          Components: AutoScaling
>    Affects Versions: master (9.0)
>            Reporter: Ilan Ginzburg
>            Assignee: Ilan Ginzburg
>            Priority: Major
>         Attachments: PolicyHelperNewLogs.txt, policylogs.txt
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Under concurrent collection creation, wrong Autoscaling placement decisions 
> can lead to severely unbalanced clusters.
>  Sequential creation of the same collections is handled correctly and the 
> cluster is balanced.
> *TL;DR;* under high load, the way sessions that cache future changes to 
> Zookeeper are managed causes placement decisions of multiple concurrent 
> Collection API calls to ignore each other and be based on an identical 
> “initial” cluster state, possibly leading to identical placement decisions 
> and, as a consequence, cluster imbalance.
> *Some context first* for those less familiar with how Autoscaling deals with 
> cluster state change: a PolicyHelper.Session is created with a snapshot of 
> the Zookeeper cluster state and is used to track cluster state changes that 
> have already been decided but not yet persisted to Zookeeper, so that 
> Collection API commands can make the right placement decisions.
>  A Collection API command either uses an existing cached Session (which 
> includes changes computed by previous command(s)) or creates a new Session 
> initialized from the Zookeeper cluster state (i.e. with only state changes 
> already persisted).
>  When a Collection API command requires a Session (one is needed for any 
> cluster state update computation) and one exists but is currently in use, 
> the command can wait up to 10 seconds. If the Session becomes available 
> within that time, it is reused; otherwise, a new one is created.
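> To make this concrete, here is a minimal sketch of that acquire-or-create 
> pattern (the SessionCache/Session names and structure below are illustrative 
> assumptions, not the actual PolicyHelper code):
> {code:java}
> // Illustrative sketch only: one cached Session is reused if it frees up within
> // 10 seconds, otherwise the caller falls back to a fresh Session built from the
> // cluster state already persisted in Zookeeper.
> class SessionCache {
>   static final long SESSION_WAIT_MS = 10_000L;
>
>   private Session cached;
>
>   synchronized Session acquire() throws InterruptedException {
>     long deadline = System.currentTimeMillis() + SESSION_WAIT_MS;
>     while (cached != null && cached.computing) {
>       long remaining = deadline - System.currentTimeMillis();
>       if (remaining <= 0) {
>         return new Session(); // timed out: new Session from persisted ZK state only
>       }
>       wait(remaining); // woken up when the current user "returns" the Session
>     }
>     if (cached == null) {
>       cached = new Session(); // snapshot of the ZK cluster state
>     }
>     cached.computing = true; // exclusive access while computing placements
>     return cached;
>   }
>
>   synchronized void returnSession(Session s) {
>     s.computing = false; // Session becomes reusable by the next command
>     notifyAll();
>   }
>
>   static class Session {
>     volatile boolean computing;
>   }
> }
> {code}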
> The Session lifecycle is as follows: it is created in COMPUTING state by a 
> Collection API command and is initialized with a snapshot of cluster state 
> from Zookeeper (this does not require a Zookeeper read, as the code runs on 
> the Overseer, which maintains a cache of cluster state). The command has 
> exclusive access to the Session and can change the state of the Session. 
> When the command is done changing the Session, the Session is “returned” and 
> its state changes to EXECUTING while the command continues to run to persist 
> the state to Zookeeper and interact with the nodes, but no longer interacts 
> with the Session. Another command can then grab a Session in EXECUTING 
> state, change its state back to COMPUTING and compute new changes that take 
> the previous changes into account. When all commands that have used the 
> Session have completed their work, the Session is “released” and destroyed 
> (at this stage, Zookeeper contains all the state changes that were computed 
> using that Session).
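> The lifecycle just described can be summarized by a small state sketch (the 
> enum and transition methods below are illustrative only, not the real 
> implementation):
> {code:java}
> // Hypothetical summary of the Session lifecycle described above.
> enum SessionState { COMPUTING, EXECUTING, RELEASED }
>
> final class SessionLifecycle {
>   private SessionState state = SessionState.COMPUTING; // created by a command, exclusive access
>   private int activeUsers = 1;                          // commands that computed changes with it
>
>   // The command finished computing placements; it keeps running (ZK writes, node
>   // interactions) but no longer touches the Session.
>   synchronized void returned() {
>     state = SessionState.EXECUTING;
>   }
>
>   // Another command grabs the EXECUTING Session to compute on top of previous changes.
>   synchronized void reuse() {
>     if (state != SessionState.EXECUTING) {
>       throw new IllegalStateException("Session is busy");
>     }
>     state = SessionState.COMPUTING;
>     activeUsers++;
>   }
>
>   // Each command releases the Session once its work (including ZK persistence) is
>   // done; the Session is destroyed when the last user releases it.
>   synchronized void release() {
>     if (--activeUsers == 0) {
>       state = SessionState.RELEASED;
>     }
>   }
> }
> {code}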
> The issue arises when multiple Collection API commands are executed at once. 
> A first Session is created and commands start using it one by one. In a 
> simple 1 shard 1 replica collection creation test run with 100 parallel 
> Collection API requests (see debug logs from PolicyHelper in the attached 
> file policylogs.txt), this Session update phase (Session in COMPUTING status 
> in SessionWrapper) takes about 250-300ms (MacBook Pro).
> This means that about 40 commands (roughly the 10 second wait divided by 
> 250ms per use) can run in turn using the same Session (45 in the sample 
> run). The commands that have been waiting for too long time out after 10 
> seconds, more or less all at the same time (spread at the rate at which they 
> were received by the OverseerCollectionMessageHandler, approx. one per 100ms 
> in the sample run), and most or all of them independently decide to create a 
> new Session. These new Sessions are based on Zookeeper state; they might or 
> might not include some of the changes from the first 40 commands (depending 
> on whether these commands got their changes written to Zookeeper by the time 
> of the 10 second timeout; a few might have made it, see below).
> These new Sessions (54 sessions in addition to the initial one) are based on 
> more or less the same state, so all remaining commands make placement 
> decisions that do not take each other into account (and likely not much of 
> the first 44 placement decisions either).
> The sample run whose relevant logs are attached led, for the 100 single 
> shard single replica collection creations, to 82 collections on the Overseer 
> node, and 5 and 13 collections on the two other nodes of a 3 node cluster. 
> Given that the initial Session was used 45 times (once initially then reused 
> 44 times), one would have expected at least the first 45 collections to be 
> evenly distributed, i.e. 15 replicas on each node. This was not the case, 
> possibly a sign of other issues (other runs even ended up placing 0 of the 
> 100 replicas on one of the nodes).
> From the client perspective, HTTP admin collection CREATE requests averaged 
> 19.5 seconds each and lasted between 7 and 28 seconds (100 parallel 
> threads). This is likely an indication that the last 55 collection creations 
> didn't see much of the state updates done by the first 45 creations (the 
> client-observed delay is longer than the actual Overseer command execution 
> time by the HTTP round trip plus the time spent in the Collections API 
> Zookeeper queue).
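> For reference, the load pattern above can be reproduced with a SolrJ snippet 
> along these lines (a rough sketch; the Zookeeper address, collection names 
> and the "_default" configset are placeholders, not taken from the original 
> test):
> {code:java}
> import java.util.Collections;
> import java.util.Optional;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
> import org.apache.solr.client.solrj.impl.CloudSolrClient;
> import org.apache.solr.client.solrj.request.CollectionAdminRequest;
>
> public class ParallelCreateTest {
>   public static void main(String[] args) throws Exception {
>     // 100 client threads each issue one single shard / single replica CREATE.
>     try (CloudSolrClient client = new CloudSolrClient.Builder(
>         Collections.singletonList("localhost:9983"), Optional.empty()).build()) {
>       ExecutorService pool = Executors.newFixedThreadPool(100);
>       for (int i = 0; i < 100; i++) {
>         final int n = i;
>         pool.submit(() -> CollectionAdminRequest
>             .createCollection("coll_" + n, "_default", 1, 1)
>             .process(client));
>       }
>       pool.shutdown();
>       pool.awaitTermination(10, TimeUnit.MINUTES);
>     }
>   }
> }
> {code}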
> *A possible fix* is to not observe any delay before creating a new Session 
> when the currently cached Session is busy (i.e. COMPUTING). It will be 
> somewhat less optimal in low load cases (this is likely not an issue: future 
> creations will compensate for slight imbalance and suboptimal placement) but 
> will speed up Collection API calls (no waiting) and will prevent multiple 
> waiting commands from all creating new Sessions based on an identical 
> Zookeeper state in cases such as the one described here. For long (minutes 
> and more) autoscaling computations it will likely not make a big difference.
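> Reusing the names of the hypothetical SessionCache sketch above, the 
> proposed behavior could look roughly like this (an assumption of what the 
> change might be, not a committed patch):
> {code:java}
> // If the cached Session is busy COMPUTING, build a new Session immediately instead
> // of waiting up to 10 seconds, so waiting commands never all time out together and
> // then all rebuild Sessions from the same Zookeeper snapshot.
> synchronized Session acquireNoWait() {
>   if (cached != null && !cached.computing) {
>     cached.computing = true; // reuse the Session and the changes it already tracks
>     return cached;
>   }
>   return new Session();      // busy or absent: start from persisted ZK state, no wait
> }
> {code}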
> If more than a single Session were cached (and reused), fewer in-flight 
> updates would be lost.
>  Maybe, rather than caching the new updated cluster state after each change, 
> the changes themselves (the deltas) should be tracked. This might make it 
> possible to propagate changes between Sessions, or to reconcile the cluster 
> state read from Zookeeper with the stream of changes stored in a Session by 
> identifying which deltas made it to Zookeeper, which ones are new from 
> Zookeeper (originating from an update in another Session) and which are 
> still pending.
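> A very rough sketch of what such delta tracking could look like (all names 
> below are hypothetical):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Set;
>
> // Instead of caching a fully updated cluster state, a Session could keep the
> // list of placements it has decided but not yet persisted, so pending deltas
> // can be replayed on top of a fresh Zookeeper snapshot or shared with another
> // Session.
> final class DeltaSession {
>   // Each delta is a decided-but-not-yet-persisted placement, e.g. "coll_7/shard1 -> node2".
>   private final List<String> pendingDeltas = new ArrayList<>();
>
>   void record(String delta) {
>     pendingDeltas.add(delta);
>   }
>
>   // Deltas already visible in the cluster state read from Zookeeper are dropped;
>   // the rest are still pending and can be replayed on top of a fresh snapshot or
>   // propagated to another Session.
>   List<String> reconcile(Set<String> persistedInZk) {
>     pendingDeltas.removeIf(persistedInZk::contains);
>     return new ArrayList<>(pendingDeltas);
>   }
> }
> {code}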



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
