Chetan is making things crystal clear for us. 

Our next steps are:

1) Learn what the MAXIMUM "inconsistency window" could be. 
Can the delay exceed 5 seconds? 10 seconds? 60? What determines
this? Only server load? I'll ask on the JCR forum and also experiment. 

2) Design and test a solution almost exactly as Bertrand described. 
Sling responds to POST/PUT/DELETE with a JCR revision. Sling will behave
differently when a Request contains a JCR revision more recent than its
own current revision. I have no idea what I'm getting into or how hard
this will be. 
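
To make the round trip concrete, here is a rough client-side sketch of the idea: remember the revision returned by the last write and send it back on later requests. Everything in it is hypothetical, including the class name and the "X-Sling-Revision" header; nothing like this exists in Sling today as far as I know.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical client-side helper; not part of Sling. */
final class RevisionEcho {
    // Assumption: the header name is made up for illustration.
    static final String HEADER = "X-Sling-Revision";

    // Revision returned by the most recent write, or null if none was seen yet.
    private String lastSeen;

    /** Call with the response headers of every POST/PUT/DELETE. */
    void onWriteResponse(Map<String, String> responseHeaders) {
        String rev = responseHeaders.get(HEADER);
        if (rev != null) {
            lastSeen = rev;
        }
    }

    /** Returns a copy of the request headers with the remembered revision attached. */
    Map<String, String> decorate(Map<String, String> requestHeaders) {
        Map<String, String> out = new HashMap<>(requestHeaders);
        if (lastSeen != null) {
            out.put(HEADER, lastSeen);
        }
        return out;
    }
}
```

The server side would then compare the echoed value with its own revision before serving the request.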

Bertrand, I'd feel selfish taking you up on your offer to build this for me.
Yet I'd be a fool to not at least partner with you to get it done. Should we
correspond outside this mail list? 
Perhaps you could point me to the files you would edit to get this done and
I could try to do it myself? I imagine a solution where you can configure,
through OSGi, which of the following Sling will do:

A) Ignore JCR revision in Request, and function as it does today (Default
setting)
B) Block until it has caught up to JCR revision in Request
C) Call some other custom handler? This way we can do custom things like
send a redirect to enhance the user experience during a block. In a product
like ours, 5 or 10 second blocks aren't acceptable without user feedback. 
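
A minimal Java sketch of how options A, B and C might fit together, just to make the proposal concrete. All names here (RevisionGate, Policy, Outcome) are hypothetical, and it assumes a revision can be reduced to a comparable long, which may well not be true of real JCR revisions. In practice the policy and timeout would come from OSGi configuration rather than a constructor.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch; none of these names exist in Sling. */
final class RevisionGate {

    /** The three behaviours proposed above: A, B and C. */
    enum Policy { IGNORE, BLOCK, CUSTOM }

    /** What the caller should do with the request. */
    enum Outcome { PROCEED, TIMED_OUT, DELEGATE }

    private final Policy policy;
    private final LongSupplier localRevision; // assumption: revisions are comparable longs
    private final long timeoutMillis;
    private final long pollMillis;

    RevisionGate(Policy policy, LongSupplier localRevision, long timeoutMillis, long pollMillis) {
        this.policy = policy;
        this.localRevision = localRevision;
        this.timeoutMillis = timeoutMillis;
        this.pollMillis = pollMillis;
    }

    Outcome check(long requestRevision) {
        // Option A, or this instance has already caught up: serve the request as usual.
        if (policy == Policy.IGNORE || localRevision.getAsLong() >= requestRevision) {
            return Outcome.PROCEED;
        }
        // Option C: hand off to a custom handler (for example, one that sends a redirect).
        if (policy == Policy.CUSTOM) {
            return Outcome.DELEGATE;
        }
        // Option B: poll until this instance catches up or the timeout expires.
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (localRevision.getAsLong() < requestRevision) {
            if (System.nanoTime() - deadline >= 0) {
                return Outcome.TIMED_OUT;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return Outcome.TIMED_OUT;
            }
        }
        return Outcome.PROCEED;
    }
}
```

The explicit timeout in option B is my own addition: without one, a request carrying a bogus or very new revision could block indefinitely.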

I also don't know how to determine the current Sling instance's Revision, or
how to compute whether one revision is "more recent" than another.
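
For illustration only, here is how "more recent" could be computed if a revision were a simple timestamp plus counter. That format is an assumption made up for this sketch: I have not checked how the repository actually encodes or orders revisions, and real ones may carry more information (and clocks on different instances may disagree). RevisionToken is a hypothetical name.

```java
/** Hypothetical revision model: "<timestampMillis>-<counter>", for example "1000-2". */
final class RevisionToken implements Comparable<RevisionToken> {
    final long timestamp; // assumption: wall-clock millis when the change was committed
    final int counter;    // assumption: tie-breaker for changes within the same millisecond

    RevisionToken(long timestamp, int counter) {
        this.timestamp = timestamp;
        this.counter = counter;
    }

    /** Parses the made-up "<timestamp>-<counter>" format. */
    static RevisionToken parse(String s) {
        String[] parts = s.split("-");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected <timestamp>-<counter>: " + s);
        }
        return new RevisionToken(Long.parseLong(parts[0]), Integer.parseInt(parts[1]));
    }

    /** Orders by timestamp first, then by counter. */
    @Override
    public int compareTo(RevisionToken other) {
        int byTime = Long.compare(timestamp, other.timestamp);
        return byTime != 0 ? byTime : Integer.compare(counter, other.counter);
    }

    /** True only if this revision is strictly later than the other. */
    boolean isNewerThan(RevisionToken other) {
        return compareTo(other) > 0;
    }
}
```

With something like this, a Sling instance would need to serve a request normally only when the Request's revision is not newer than its own.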

---------

Responding to a couple other minor points:


Felix Meschberger-3 wrote
> I suggest you go with something else, which does *not* need the repository
> for persistence. This means you might want to investigate your own
> authentication handler ...

Thank you Felix :) I've actually done this work recently and it's working
great! We have "stateless" authentication now, but are now dealing with the
unacceptable inconsistency that Chetan warned about. 
That's the question on the table: in a write-operation-heavy application,
how do we provide a "read-your-writes" consistent experience on an
eventually-consistent solution (a Sling cluster), when traditional
sticky sessions aren't a valid option because the userbase is large
enough to demand server scaling several times throughout the day?


chetan mehrotra wrote
> I can understand issue around when existing Sling server is removed
> from the pool. However adding a new instance should not cause existing
> users to be reassigned

When adding an instance, we purposely invalidate all sticky sessions and
users will get re-assigned to a new Sling instance, so that the new server
actually improves performance.
Imagine a farm of 4 app servers that has been SLAMMED and isn't performing
well. Adding 1 or 100 new servers to that farm won't improve performance if
every user is "stuck" to the previous 4 servers.
If we don't do this invalidation and re-assignment on scaling up, it can
potentially take hours for a scale-up to positively impact an overloaded
cluster. 


Bertrand Delacretaz wrote
> But Lance could patch [1] to experiment with different values, right?
> ....
> [1]
> http://svn.apache.org/repos/asf/jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/DocumentNodeStore.java

Thank you for pointing me to the code Bertrand :) Given the new information
from Chetan, I'm losing interest in changing that value. Perhaps setting
asyncDelay to 0 or some small number will cause it to perform slower but be
more consistent... 
However, my tentative assessment is that the interval would just be
"checked" more often, but it will also get skipped more often, due to "local
cache invalidation, computing the external changes for observation" as
Chetan put it. 
I would love to be wrong about this and I'll ask on the JCR forum.



--
View this message in context: 
http://apache-sling.73963.n3.nabble.com/Not-sticky-sessions-with-Sling-tp4069530p4069730.html
Sent from the Sling - Users mailing list archive at Nabble.com.
