[jira] [Updated] (SOLR-14462) Autoscaling placement wrong with concurrent collection creations

2020-06-29 Thread Ilan Ginzburg (Jira)


[ https://issues.apache.org/jira/browse/SOLR-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilan Ginzburg updated SOLR-14462:
---------------------------------
Fix Version/s: 8.6

> Autoscaling placement wrong with concurrent collection creations
> -----------------------------------------------------------------
>
> Key: SOLR-14462
> URL: https://issues.apache.org/jira/browse/SOLR-14462
> Project: Solr
>  Issue Type: Bug
>  Components: AutoScaling
>Affects Versions: master (9.0), 8.6
>Reporter: Ilan Ginzburg
>Assignee: Ilan Ginzburg
>Priority: Major
> Fix For: 8.6
>
> Attachments: PolicyHelperNewLogs.txt, policylogs.txt
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Under concurrent collection creation, wrong Autoscaling placement decisions 
> can lead to severely unbalanced clusters.
>  Sequential creation of the same collections is handled correctly and the 
> cluster is balanced.
> *TL;DR:* under high load, the way Sessions that cache future changes to 
> Zookeeper are managed causes the placement decisions of multiple concurrent 
> Collection API calls to ignore each other and to be based on an identical 
> “initial” cluster state, possibly leading to identical placement decisions 
> and, as a consequence, cluster imbalance.
> *Some context first* for those less familiar with how Autoscaling deals with 
> cluster state changes: a PolicyHelper.Session is created with a snapshot of 
> the Zookeeper cluster state and is used to track cluster state changes that 
> have already been decided but not yet persisted to Zookeeper, so that 
> Collection API commands can make the right placement decisions.
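> As a minimal sketch of that idea (the class and field names below are 
> hypothetical, not the actual PolicyHelper code):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Map;
> 
> // Illustration only: a Session couples a cluster state snapshot with the
> // placement changes computed on top of it but not yet written to Zookeeper.
> final class PlacementSession {
>     final Map<String, Object> baseSnapshot;                // cluster state read once, at creation
>     final List<String> pendingChanges = new ArrayList<>(); // decided, not yet persisted
> 
>     PlacementSession(Map<String, Object> baseSnapshot) {
>         this.baseSnapshot = baseSnapshot;
>     }
> }
> {code}
> A placement decision must see the base state plus every earlier decision 
> recorded in the Session; losing the pending changes is exactly what produces 
> the imbalance described below.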
>  A Collection API command either uses an existing cached Session (that 
> includes changes computed by previous command(s)) or creates a new Session 
> initialized from the Zookeeper cluster state (i.e. with only state changes 
> already persisted).
>  When a Collection API command requires a Session (one is needed for any 
> cluster state update computation) and an existing Session is currently in 
> use, the command waits for it for up to 10 seconds. If the Session becomes 
> available in time, it is reused; otherwise, a new one is created.
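> A hedged sketch of that acquire-or-create step, building on the 
> PlacementSession sketch above (the 10 second wait comes from this 
> description; everything else, including the names, is illustrative):
> {code:java}
> import java.util.Map;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.locks.Condition;
> import java.util.concurrent.locks.ReentrantLock;
> 
> final class SessionCache {
>     private static final long WAIT_MS = 10_000; // the 10 second wait described above
>     private final ReentrantLock lock = new ReentrantLock();
>     private final Condition returned = lock.newCondition();
>     private PlacementSession cached;            // null until the first command arrives
>     private boolean inUse;
> 
>     PlacementSession acquire() throws InterruptedException {
>         lock.lock();
>         try {
>             long nanosLeft = TimeUnit.MILLISECONDS.toNanos(WAIT_MS);
>             while (cached != null && inUse && nanosLeft > 0) {
>                 nanosLeft = returned.awaitNanos(nanosLeft); // wait for the Session to be returned
>             }
>             if (cached == null || inUse) {
>                 // Timed out (or no Session existed): start from a fresh snapshot.
>                 // Under load, many commands hit this branch at the same time and
>                 // all compute placements from the same "initial" state.
>                 cached = new PlacementSession(readCachedClusterState());
>             }
>             inUse = true;
>             return cached;
>         } finally {
>             lock.unlock();
>         }
>     }
> 
>     void release() {
>         lock.lock();
>         try {
>             inUse = false;
>             returned.signalAll();
>         } finally {
>             lock.unlock();
>         }
>     }
> 
>     private Map<String, Object> readCachedClusterState() {
>         return Map.of(); // placeholder for the Overseer's cluster state cache
>     }
> }
> {code}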
> The Session lifecycle is as follows: it is created in COMPUTING state by a 
> Collection API command and initialized with a snapshot of the cluster state 
> from Zookeeper (this does not require a Zookeeper read; the code runs on the 
> Overseer, which maintains a cache of the cluster state). The command has 
> exclusive access to the Session and can change its state. When the command 
> is done changing the Session, the Session is “returned” and its state 
> changes to EXECUTING while the command continues to run, persisting the 
> state to Zookeeper and interacting with the nodes, but no longer touching 
> the Session. Another command can then grab a Session in EXECUTING state and 
> switch it back to COMPUTING to compute new changes that take the previous 
> changes into account. When all commands that used the Session have completed 
> their work, the Session is “released” and destroyed (at this stage, 
> Zookeeper contains all the state changes that were computed using that 
> Session).
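> The lifecycle above, reduced to a hypothetical state machine (names and 
> structure are illustrative, not the real implementation):
> {code:java}
> enum SessionState { COMPUTING, EXECUTING }
> 
> final class SessionLifecycle {
>     private SessionState state = SessionState.COMPUTING; // created by the first command
>     private int activeUsers = 1;
> 
>     // The computing command is done mutating the Session; it keeps running
>     // (persisting to Zookeeper, contacting nodes) but no longer touches it.
>     synchronized void returnSession() {
>         state = SessionState.EXECUTING;
>     }
> 
>     // Another command grabs the EXECUTING Session to layer its own changes
>     // on top of the ones already recorded in it.
>     synchronized void grab() {
>         if (state != SessionState.EXECUTING) {
>             throw new IllegalStateException("Session is busy computing");
>         }
>         state = SessionState.COMPUTING;
>         activeUsers++;
>     }
> 
>     // Once every command that used the Session has fully completed, the
>     // Session is destroyed: Zookeeper now holds all the changes computed
>     // through it.
>     synchronized boolean releaseUser() {
>         return --activeUsers == 0;
>     }
> }
> {code}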
> The issue arises when multiple Collection API commands are executed at once. 
> A first Session is created and commands start using it one by one. In a 
> simple test run creating 100 single shard, single replica collections with 
> 100 parallel Collection API requests (see the debug logs from PolicyHelper 
> in the attached policylogs.txt), this Session update phase (Session in 
> COMPUTING status in SessionWrapper) takes about 250-300ms (MacBook Pro).
> This means that about 40 commands can run using the same Session in turn 
> (45 in the sample run). The commands that have been waiting too long time 
> out after 10 seconds, more or less all at the same time (at the rate at 
> which they were received by the OverseerCollectionMessageHandler, approx. 
> one per 100ms in the sample run), and most or all of them independently 
> decide to create a new Session. These new Sessions are based on the 
> Zookeeper state; they might or might not include some of the changes from 
> the first 40 commands (depending on whether those commands got their changes 
> written to Zookeeper by the time of the 10 second timeout; a few might have 
> made it, see below).
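> (Rough arithmetic behind the “about 40” figure: one COMPUTING phase takes 
> roughly 250ms, so a 10 second wait covers about 10,000 / 250 = 40 sequential 
> uses of the cached Session before the remaining commands give up together.)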
> These new Sessions (54 of them in addition to the initial one) are based on 
> more or less the same state, so all remaining commands make placement 
> decisions that do not take each other into account (and likely take little 
> account of the first 44 placement decisions either).
> In the sample run whose relevant logs are attached, the 100 single shard 
> single replica collection creations led to 82 collections on the Overseer 
> node, and 5 and 13 collections on the two other nodes of a 3 node cluster. 
> Given that the initial Session was used 45 times (once initially then reuse

[jira] [Updated] (SOLR-14462) Autoscaling placement wrong with concurrent collection creations

2020-06-29 Thread Ilan Ginzburg (Jira)


[ https://issues.apache.org/jira/browse/SOLR-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilan Ginzburg updated SOLR-14462:
---------------------------------
Affects Version/s: 8.6


[jira] [Updated] (SOLR-14462) Autoscaling placement wrong with concurrent collection creations

2020-05-11 Thread Ilan Ginzburg (Jira)


[ https://issues.apache.org/jira/browse/SOLR-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilan Ginzburg updated SOLR-14462:
---------------------------------
Attachment: PolicyHelperNewLogs.txt
