[jira] [Updated] (SOLR-14462) Autoscaling placement wrong with concurrent collection creations
[ https://issues.apache.org/jira/browse/SOLR-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilan Ginzburg updated SOLR-14462: - Fix Version/s: 8.6 > Autoscaling placement wrong with concurrent collection creations > > > Key: SOLR-14462 > URL: https://issues.apache.org/jira/browse/SOLR-14462 > Project: Solr > Issue Type: Bug > Components: AutoScaling >Affects Versions: master (9.0), 8.6 >Reporter: Ilan Ginzburg >Assignee: Ilan Ginzburg >Priority: Major > Fix For: 8.6 > > Attachments: PolicyHelperNewLogs.txt, policylogs.txt > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Under concurrent collection creation, wrong Autoscaling placement decisions > can lead to severely unbalanced clusters. > Sequential creation of the same collections is handled correctly and the > cluster is balanced. > *TL;DR;* under high load, the way sessions that cache future changes to > Zookeeper are managed cause placement decisions of multiple concurrent > Collection API calls to ignore each other, be based on identical “initial” > cluster state, possibly leading to identical placement decisions and as a > consequence cluster imbalance. > *Some context first* for those less familiar with how Autoscaling deals with > cluster state change: a PolicyHelper.Session is created with a snapshot of > the Zookeeper cluster state and is used to track already decided but not yet > persisted to Zookeeper cluster state changes so that Collection API commands > can make the right placement decisions. > A Collection API command either uses an existing cached Session (that > includes changes computed by previous command(s)) or creates a new Session > initialized from the Zookeeper cluster state (i.e. with only state changes > already persisted). > When a Collection API command requires a Session - and one is needed for any > cluster state update computation - if one exists but is currently in use, the > command can wait up to 10 seconds. If the session becomes available, it is > reused. Otherwise, a new one is created. > The Session lifecycle is as follows: it is created in COMPUTING state by a > Collection API command and is initialized with a snapshot of cluster state > from Zookeeper (does not require a Zookeeper read, this is running on > Overseer that maintains a cache of cluster state). The command has exclusive > access to the Session and can change the state of the Session. When the > command is done changing the Session, the Session is “returned” and its state > changes to EXECUTING while the command continues to run to persist the state > to Zookeeper and interact with the nodes, but no longer interacts with the > Session. Another command can then grab a Session in EXECUTING state, change > its state to COMPUTING to compute new changes taking into account previous > changes. When all commands having used the session have completed their work, > the session is “released” and destroyed (at this stage, Zookeeper contains > all the state changes that were computed using that Session). > The issue arises when multiple Collection API commands are executed at once. > A first Session is created and commands start using it one by one. In a > simple 1 shard 1 replica collection creation test run with 100 parallel > Collection API requests (see debug logs from PolicyHelper in file > policy.logs), this Session update phase (Session in COMPUTING status in > SessionWrapper) takes about 250-300ms (MacBook Pro). > This means that about 40 commands can run by using in turn the same Session > (45 in the sample run). The commands that have been waiting for too long time > out after 10 seconds, more or less all at the same time (at the rate at which > they have been received by the OverseerCollectionMessageHandler, approx one > per 100ms in the sample run) and most/all independently decide to create a > new Session. These new Sessions are based on Zookeeper state, they might or > might not include some of the changes from the first 40 commands (depending > on if these commands got their changes written to Zookeeper by the time of > the 10 seconds timeout, a few might have made it, see below). > These new Sessions (54 sessions in addition to the initial one) are based on > more or less the same state, so all remaining commands are making placement > decisions that do not take into account each other (and likely not much of > the first 44 placement decisions either). > The sample run whose relevant logs are attached led for the 100 single shard > single replica collection creations to 82 collections on the Overseer node, > and 5 and 13 collections on the two other nodes of a 3 nodes cluster. Given > that the initial session was used 45 times (
[jira] [Updated] (SOLR-14462) Autoscaling placement wrong with concurrent collection creations
[ https://issues.apache.org/jira/browse/SOLR-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilan Ginzburg updated SOLR-14462: - Affects Version/s: 8.6 > Autoscaling placement wrong with concurrent collection creations > > > Key: SOLR-14462 > URL: https://issues.apache.org/jira/browse/SOLR-14462 > Project: Solr > Issue Type: Bug > Components: AutoScaling >Affects Versions: master (9.0), 8.6 >Reporter: Ilan Ginzburg >Assignee: Ilan Ginzburg >Priority: Major > Attachments: PolicyHelperNewLogs.txt, policylogs.txt > > Time Spent: 2h 50m > Remaining Estimate: 0h > > Under concurrent collection creation, wrong Autoscaling placement decisions > can lead to severely unbalanced clusters. > Sequential creation of the same collections is handled correctly and the > cluster is balanced. > *TL;DR;* under high load, the way sessions that cache future changes to > Zookeeper are managed cause placement decisions of multiple concurrent > Collection API calls to ignore each other, be based on identical “initial” > cluster state, possibly leading to identical placement decisions and as a > consequence cluster imbalance. > *Some context first* for those less familiar with how Autoscaling deals with > cluster state change: a PolicyHelper.Session is created with a snapshot of > the Zookeeper cluster state and is used to track already decided but not yet > persisted to Zookeeper cluster state changes so that Collection API commands > can make the right placement decisions. > A Collection API command either uses an existing cached Session (that > includes changes computed by previous command(s)) or creates a new Session > initialized from the Zookeeper cluster state (i.e. with only state changes > already persisted). > When a Collection API command requires a Session - and one is needed for any > cluster state update computation - if one exists but is currently in use, the > command can wait up to 10 seconds. If the session becomes available, it is > reused. Otherwise, a new one is created. > The Session lifecycle is as follows: it is created in COMPUTING state by a > Collection API command and is initialized with a snapshot of cluster state > from Zookeeper (does not require a Zookeeper read, this is running on > Overseer that maintains a cache of cluster state). The command has exclusive > access to the Session and can change the state of the Session. When the > command is done changing the Session, the Session is “returned” and its state > changes to EXECUTING while the command continues to run to persist the state > to Zookeeper and interact with the nodes, but no longer interacts with the > Session. Another command can then grab a Session in EXECUTING state, change > its state to COMPUTING to compute new changes taking into account previous > changes. When all commands having used the session have completed their work, > the session is “released” and destroyed (at this stage, Zookeeper contains > all the state changes that were computed using that Session). > The issue arises when multiple Collection API commands are executed at once. > A first Session is created and commands start using it one by one. In a > simple 1 shard 1 replica collection creation test run with 100 parallel > Collection API requests (see debug logs from PolicyHelper in file > policy.logs), this Session update phase (Session in COMPUTING status in > SessionWrapper) takes about 250-300ms (MacBook Pro). > This means that about 40 commands can run by using in turn the same Session > (45 in the sample run). The commands that have been waiting for too long time > out after 10 seconds, more or less all at the same time (at the rate at which > they have been received by the OverseerCollectionMessageHandler, approx one > per 100ms in the sample run) and most/all independently decide to create a > new Session. These new Sessions are based on Zookeeper state, they might or > might not include some of the changes from the first 40 commands (depending > on if these commands got their changes written to Zookeeper by the time of > the 10 seconds timeout, a few might have made it, see below). > These new Sessions (54 sessions in addition to the initial one) are based on > more or less the same state, so all remaining commands are making placement > decisions that do not take into account each other (and likely not much of > the first 44 placement decisions either). > The sample run whose relevant logs are attached led for the 100 single shard > single replica collection creations to 82 collections on the Overseer node, > and 5 and 13 collections on the two other nodes of a 3 nodes cluster. Given > that the initial session was used 45 times (once initially then reuse
[jira] [Updated] (SOLR-14462) Autoscaling placement wrong with concurrent collection creations
[ https://issues.apache.org/jira/browse/SOLR-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilan Ginzburg updated SOLR-14462: - Attachment: PolicyHelperNewLogs.txt > Autoscaling placement wrong with concurrent collection creations > > > Key: SOLR-14462 > URL: https://issues.apache.org/jira/browse/SOLR-14462 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: AutoScaling >Affects Versions: master (9.0) >Reporter: Ilan Ginzburg >Assignee: Noble Paul >Priority: Major > Attachments: PolicyHelperNewLogs.txt, policylogs.txt > > > Under concurrent collection creation, wrong Autoscaling placement decisions > can lead to severely unbalanced clusters. > Sequential creation of the same collections is handled correctly and the > cluster is balanced. > *TL;DR;* under high load, the way sessions that cache future changes to > Zookeeper are managed cause placement decisions of multiple concurrent > Collection API calls to ignore each other, be based on identical “initial” > cluster state, possibly leading to identical placement decisions and as a > consequence cluster imbalance. > *Some context first* for those less familiar with how Autoscaling deals with > cluster state change: a PolicyHelper.Session is created with a snapshot of > the Zookeeper cluster state and is used to track already decided but not yet > persisted to Zookeeper cluster state changes so that Collection API commands > can make the right placement decisions. > A Collection API command either uses an existing cached Session (that > includes changes computed by previous command(s)) or creates a new Session > initialized from the Zookeeper cluster state (i.e. with only state changes > already persisted). > When a Collection API command requires a Session - and one is needed for any > cluster state update computation - if one exists but is currently in use, the > command can wait up to 10 seconds. If the session becomes available, it is > reused. Otherwise, a new one is created. > The Session lifecycle is as follows: it is created in COMPUTING state by a > Collection API command and is initialized with a snapshot of cluster state > from Zookeeper (does not require a Zookeeper read, this is running on > Overseer that maintains a cache of cluster state). The command has exclusive > access to the Session and can change the state of the Session. When the > command is done changing the Session, the Session is “returned” and its state > changes to EXECUTING while the command continues to run to persist the state > to Zookeeper and interact with the nodes, but no longer interacts with the > Session. Another command can then grab a Session in EXECUTING state, change > its state to COMPUTING to compute new changes taking into account previous > changes. When all commands having used the session have completed their work, > the session is “released” and destroyed (at this stage, Zookeeper contains > all the state changes that were computed using that Session). > The issue arises when multiple Collection API commands are executed at once. > A first Session is created and commands start using it one by one. In a > simple 1 shard 1 replica collection creation test run with 100 parallel > Collection API requests (see debug logs from PolicyHelper in file > policy.logs), this Session update phase (Session in COMPUTING status in > SessionWrapper) takes about 250-300ms (MacBook Pro). > This means that about 40 commands can run by using in turn the same Session > (45 in the sample run). The commands that have been waiting for too long time > out after 10 seconds, more or less all at the same time (at the rate at which > they have been received by the OverseerCollectionMessageHandler, approx one > per 100ms in the sample run) and most/all independently decide to create a > new Session. These new Sessions are based on Zookeeper state, they might or > might not include some of the changes from the first 40 commands (depending > on if these commands got their changes written to Zookeeper by the time of > the 10 seconds timeout, a few might have made it, see below). > These new Sessions (54 sessions in addition to the initial one) are based on > more or less the same state, so all remaining commands are making placement > decisions that do not take into account each other (and likely not much of > the first 44 placement decisions either). > The sample run whose relevant logs are attached led for the 100 single shard > single replica collection creations to 82 collections on the Overseer node, > and 5 and 13 collections on the two other nodes of a 3 nodes cluster. Given > that the initial session was used 45 times (on