Bertrand Delacretaz wrote:
> That would be a pity, as I suppose you're starting to like Sling now ;-)
Mannnn you have no idea haha! I've got almost every dev in the office excited about this now. However, it seems our hands are tied.

I wrote local consistency test scripts which POST and immediately GET a property, checking for consistency. Results on a 2-member Sling cluster with a localhost MongoDB:

- 0% consistency with a 50 ms delay between POST and GET
- 35% to 50% consistency with a 1 second delay between POST and GET
- 90% consistency with a 2 second delay
- 98% to 100% consistency after a 3 second delay

So yes, you are all correct. True, we could use sticky sessions to avoid the inconsistency... but only until we scale our server farm up or down, which we do daily. So sticky sessions don't really solve anything for us. If you already understand how scaling nullifies the benefit of sticky sessions, you can skip the next paragraph:

Each time we scale, users lose their "stickiness." We have thousands of write users ("authors"), hundreds of them concurrent. Compare that to typical AEM projects, which have fewer than 10 authors and rarely more than 1 concurrently (I've got several global-scale AEM implementations under my belt). For us, it's a requirement that we add or remove app servers multiple times per day, optimizing between AWS costs and performance. Each time we remove an instance, its users move to a new Sling instance and experience the inconsistency. Each time we add an instance, we invalidate all stickiness, so users get re-assigned to a new Sling instance and experience the inconsistency. If we don't do this invalidation and re-assignment on scale-up, it can potentially take hours for a scale-up to help an overloaded cluster where all users are permanently stuck to their current app server instance. As you can see, we need to deal with the inconsistency problem regardless of whether we use sticky sessions.
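For reference, the measurement logic behind those numbers can be sketched roughly like this (a minimal Python sketch, not my actual script; the `write`/`read` callables are assumptions standing in for the HTTP POST and GET against two different cluster members):

```python
import time
import uuid

def consistency_rate(write, read, delay_s, trials=20):
    """Measure read-your-writes consistency: write a unique value,
    wait delay_s seconds, then read it back and check it matches.

    `write(key, value)` and `read(key)` are caller-supplied; against a
    real cluster they would wrap an HTTP POST to one Sling instance and
    a GET from another. Returns the fraction of trials in which the
    read observed the write.
    """
    hits = 0
    for _ in range(trials):
        key, value = "prop", uuid.uuid4().hex  # unique value per trial
        write(key, value)
        time.sleep(delay_s)
        if read(key) == value:
            hits += 1
    return hits / trials
```

Against a toy in-process store with a fixed replication lag, a delay shorter than the lag yields 0.0 and a delay longer than the lag yields 1.0, which is the shape of the curve I measured above.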
I have some ideas, but none are appealing, and I would benefit greatly from your collective knowledge:

1) Live with the race condition. If this delay to "catch up" to the latest revision is mostly predictable -- if it doesn't grow as the repo grows in size, and doesn't change due to other variables -- we can measure it and then account for it reliably with user feedback (a loading screen or whatever). This *might* be a race condition we can live with. My results above show as much as 3 or 4 seconds to catch up. I need to know what determines the duration of this revision catch-up time. Is it a function of repo size? Does the delay grow as the repo grows? As usage increases? As the number of Sling instances in the cluster grows? As network latency grows (I'm testing everything on one machine, with practically no latency compared to a distributed production deployment)? Is there any Sling dev, familiar with the algorithm Sling uses to select a "newer" revision, who could answer this for me? ...perhaps it's just polling on a predictable interval! :)

2) The browser knows what revision it's on. The browser could learn the JCR revision it's on after every POST or PUT, perhaps via a response header. When its future requests land on a Sling instance at an older revision, it could wait until that instance catches up. This sounds like a horrible example of client code depending on underlying implementation details, and we're not at all excited about the chaos of implementing it. That said, can we programmatically check the revision that the current Sling instance is reading from?

3) "Pause" during scale-up or scale-down. Each time we add or remove a Sling instance, all users see a "pause" screen while their new Sling instance catches up. This is essentially the same as the race condition in #1, except we'd constrain users to experiencing it only when we scale up or down.
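To make idea #2 concrete, the client-side wait could be sketched like this. Note the revision probe is purely hypothetical -- I don't know of any Sling/Oak endpoint or response header that exposes the current head revision to clients -- so `current_revision` below is an assumed stand-in for whatever such a probe would be, and revisions are assumed to be totally ordered tokens:

```python
import time

def wait_for_revision(current_revision, min_revision,
                      timeout_s=5.0, poll_s=0.1):
    """Block until the instance we're talking to reports a repository
    revision >= min_revision, or give up after timeout_s seconds.

    `current_revision()` is a caller-supplied probe; in idea #2 it
    would GET a lightweight endpoint and parse a hypothetical header
    (e.g. "X-Repo-Revision") carrying the revision the client recorded
    from its last POST/PUT. Returns True once the instance has caught
    up, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if current_revision() >= min_revision:
            return True
        time.sleep(poll_s)
    return False
```

The same loop, run server-side during a scaling event instead of per-request, is essentially idea #3.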
However, we are *extremely* unhappy to impact our users just because we're scaling up or down, especially when we must do so frequently. Anybody have any other ideas?

Other questions:

1) When a brand new Sling instance discovers an existing JCR (Mongo), does it automatically and immediately go to the latest head revision? Or does it progress through the revisions, taking time to catch up to the latest?

2) Is there any reason, BESIDES JCR CONSISTENCY, why a Sling cluster must be deployed with sticky sessions? What other problems would we introduce by not having them?

I seem to have used this email to track my own thoughts more than anything; my sincere thanks if you've taken the time to read the whole thing.

--
View this message in context: http://apache-sling.73963.n3.nabble.com/Not-sticky-sessions-with-Sling-tp4069530p4069709.html
Sent from the Sling - Users mailing list archive at Nabble.com.