Hi Bernd,

It's great having you and Mathias working on this - I'll try to find a way to combine the two patches.
I think it's best to have the minimum amount of modifications in jk1 to get things working. For jk2 I think we should do what's most flexible and powerful - even if it requires bigger changes.

Costin

On Mon, 6 May 2002, Bernd Koecke wrote:

> Hi Costin,
>
> here is my patch for jk_lb_worker.c, jk_util.c and jk_util.h. It was diffed against the CVS.
>
> How does it work?
>
> 1) A new config param for lb_worker exists (main_worker_mode). Two values are possible: reject and balance.
>
> 2) This results in two new flags in the lb_worker struct. If there is at least one worker with an lb_value of 0, the flag in_main_worker_mode is set to JK_TRUE, and to JK_FALSE if there is none.
>
> The second flag is reject. It is JK_TRUE if the config param above was set to reject, JK_FALSE otherwise or if in_main_worker_mode is JK_FALSE.
>
> Behavior:
>
> If there is no worker with lb_value == 0, everything works as it used to. All workers with lb_value == 0 are moved to the beginning of the worker list. It is possible to have more than one such worker, so you can keep a standby server in case the first main Tomcat goes down.
>
> If a request with a session id comes in, it is routed to its Tomcat. If that fails, it is routed to the main worker.
>
> If a request without a session id comes in, it is routed to the first of the main workers; if that fails, it goes to the next main worker. Only when no main workers are left does the reject flag matter: if it is set, the request gets an error; if it is not set, the lb_worker tries to route the request to one of the other workers.
>
> With reject == JK_TRUE, all requests without a session id, or with an id belonging to a node that has been shut down, are routed only to the main workers. This gives you a graceful shutdown, e.g. for node1: if the lb in front of the cluster stops sending requests without a session id to node1, it will only receive requests with a jvmRoute of TC1. After those sessions time out, the node can be updated, shut down and started up again. Even after startup it won't get a request without a session - the lb in front of the cluster must make the first move and send the Apache on node one a fresh request.
>
> Example workers.properties for node one:
>
> workers.tomcat_home=<route to Catalina>
> workers.java_home=$(JAVA_HOME)
> ps=/
> worker.list=router
>
> worker.TC1.port=8009
> worker.TC1.host=node1.domain.tld
> worker.TC1.type=ajp13
> worker.TC1.lbfactor=0
>
> worker.TC2.port=8009
> worker.TC2.host=node2.domain.tld
> worker.TC2.type=ajp13
> worker.TC2.lbfactor=1
>
> worker.router.type=lb
> worker.router.balanced_workers=TC1,TC2
> worker.router.main_worker_mode=reject
>
> For node two, the lbfactor of TC1 is 1 and that of TC2 is 0.
>
> I haven't had a closer look at Mathias's patch; it seems to be much shorter. We should take the best of both - this patch is only a suggestion. I'm not an experienced C programmer (I know my copy statements in validate are ugly, but I'm not very familiar with moving memory around :) ).
>
> Bernd
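As a reading aid, here is a minimal sketch of the selection order Bernd describes. All names, types and fields are illustrative stand-ins, not the actual code from the patch:

    #include <string.h>

    /* Illustrative types only - not the real jk_lb_worker.c definitions. */
    typedef struct {
        const char *name;           /* jvmRoute, e.g. "TC1" */
        double      lb_value;       /* 0 marks a main worker */
        int         in_error_state;
    } worker_rec;

    typedef struct {
        worker_rec *workers;        /* main workers sorted to the front */
        int         num_workers;
        int         in_main_worker_mode; /* any worker with lb_value == 0? */
        int         reject;         /* main_worker_mode=reject configured */
    } lb_state;

    /* Returns the chosen worker, or NULL to signal "answer with an error". */
    static worker_rec *select_worker(lb_state *lb, const char *route)
    {
        int i;
        worker_rec *best = NULL;

        /* 1) A request with a session id goes to the worker owning the route. */
        if (route) {
            for (i = 0; i < lb->num_workers; i++)
                if (!strcmp(lb->workers[i].name, route) &&
                    !lb->workers[i].in_error_state)
                    return &lb->workers[i];
            /* That worker is down: fall through to the main workers. */
        }

        /* 2) Try the main workers in list order; a second one is a standby. */
        for (i = 0; i < lb->num_workers && lb->workers[i].lb_value == 0; i++)
            if (!lb->workers[i].in_error_state)
                return &lb->workers[i];

        /* 3) No main worker left: reject, or fall back to normal balancing. */
        if (lb->in_main_worker_mode && lb->reject)
            return NULL;

        for (i = 0; i < lb->num_workers; i++)
            if (!lb->workers[i].in_error_state &&
                (best == NULL || lb->workers[i].lb_value < best->lb_value))
                best = &lb->workers[i];
        return best;
    }

The main point is that main workers are tried strictly before any balancing happens, and the reject flag only matters once every main worker is down.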
> [EMAIL PROTECTED] wrote:
>
> > Bernd,
> >
> > At this moment I believe we should add flags and stop using the '0' value in the config file.
> >
> > Internally (in the code) it doesn't matter, we can keep 0 or use the flag (I prefer the second).
> >
> > I'm waiting for your patch - it seems there is another bug that must be fixed before we can tag - but I hope we can finish all changes in the next few days.
> >
> > Costin
> >
> > On Mon, 6 May 2002, Bernd Koecke wrote:
> >
> >> thanks for committing my patch :). After thinking about it, I found the same problem as Mathias. It's a problem for my environment too - we have the same problem with shutdown and recovery here. I'm looking into jk2 now. The question for jk1 is: what do we want to do if the main worker fails because of an error?
> >>
> >> The normal intention of lb is to switch to another worker in such a case. But for the special use of a main worker we don't want that (at least it would be an error in my environment here :) ). My suggestion is to add an additional flag to the lb_worker struct which records that we have a main worker, e.g. main_worker_mode. Because of this flag we send only requests with a session id to one of the other workers. And we could change the behavior after an error on another worker, checking its state only when we get a request carrying its session route. This would be easy if we put the main worker at the beginning of the worker list and/or use the flag. But we need the flag if we want to use more than one main worker.
> >>
> >> But what should happen if the main worker is in error state? In my patch some weeks ago I added an additional flag which causes the module to reject a request if it comes in without a session id while the main worker is down. If this flag wasn't set, or was not set to reject, the module chooses one of the other workers. For our environment here, rejecting the request is ok: if a request without a session reaches a switched-off node, we have a problem with our separate load balancer, and that should never happen. We could make this rejection the default whenever there is a main worker, but a separate flag is more flexible.
> >>
> >> I will build a patch against CVS to make my intention clearer.
> >>
> >> Bernd
> >>
> >> [EMAIL PROTECTED] wrote:
> >>
> >>> Hi Mathias,
> >>>
> >>> I think we understand your use case, it is not very uncommon. In fact, as I mentioned a few times, it is the 'main' use case for Apache (multi-process) when using the JNI worker. In that case Apache acts as a 'natural' load balancer, with requests going to various processes (more or less randomly). As in your case, requests without a session should always go to the worker that lives in the same process.
> >>>
> >>> The main reason for using '0' for the "local" worker is that in jk2 I want to switch from float to int - there is no reason (AFAIK) to do all the float computation; even a short int is enough for the purpose of implementing a round-robin with weights.
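A weighted round-robin really does need nothing beyond integer arithmetic. Here is a minimal sketch of one such scheme (a "smooth" weighted round-robin, offered as an assumption about the direction, not actual jk2 code):

    /* Weighted round-robin with plain ints. Each call adds every worker's
     * weight to its running counter, picks the largest counter, and charges
     * the winner the total weight. Over time each worker is chosen in
     * proportion to its weight - no floats, no division. */
    typedef struct {
        const char *name;
        int weight;    /* configured lbfactor, e.g. 1..10 */
        int current;   /* running counter, starts at 0 */
    } rr_worker;

    static rr_worker *next_worker(rr_worker w[], int n)
    {
        int i, total = 0;
        rr_worker *best = NULL;

        for (i = 0; i < n; i++) {
            w[i].current += w[i].weight;
            total += w[i].weight;
            if (best == NULL || w[i].current > best->current)
                best = &w[i];
        }
        if (best)
            best->current -= total;
        return best;
    }

With weights {5, 1}, six successive calls pick the first worker five times and the second once, exactly the 5:1 ratio the weights ask for.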
> >>> BTW, one extension I'm trying to make is support for multiple local workers - I'm still thinking about how to do that. This will cover the case of a few big boxes, each with several Tomcat instances (if you have many GB of RAM and many processors, it is sometimes better to run several VMs instead of a single large process). In this case you still want some remote Tomcats for failover, but most of the load should go to the local workers.
> >>>
> >>> For jk2 I already fixed the selection of the 'recovering' worker: after the timeout the worker goes through normal selection instead of being chosen automatically.
> >>>
> >>> For jk1 - I'm waiting for patches :-) I wouldn't make a big change - the current fix seemed like a good one.
> >>>
> >>> I agree that changing the meaning of 0 may be confusing (is it documented? my workers.properties says it should never be used). We can fix that by using an additional flag instead of special values.
> >>>
> >>> Another special note - jk2 will also support 'graceful shutdown', which means your case (replacing a webapp) will be handled in a different way. You should be able to add/remove workers without restarting Apache (and, I hope, mostly automated).
> >>>
> >>> Let me know what you think - with patches if possible :-)
> >>>
> >>> Costin
> >>>
> >>>> The setup I use is the following: a load balancer (Alteon) sits in front of several Apache servers, each hosted on a machine which also hosts a Tomcat. Let's call those Apache servers A1, A2 and A3 and the associated Tomcat servers T1, T2 and T3.
> >>>>
> >>>> I have been using Paul's patch, which I modified so that the lb_value field of fault-tolerant workers would never be changed to a value other than INF.
> >>>>
> >>>> The basic setup is that Ai can talk to all Tj, but for requests not associated with a session, Ti is used unless it is unavailable. Sessions belonging to Tk are correctly routed. The load-balancing worker definition is different for each of the three Ai: the lbfactor is set to 0 for the workers connecting to Tk for all k != i, and to 1.0 for the worker connecting to Ti.
> >>>>
> >>>> This setup gives sticky sessions independently of which Apache handles the request, which is a good thing since the Alteon cannot extract the ';jsessionid=.....' part from the URL in a way that allows dispatching requests to the proper Ai (cookies are dealt with correctly, though).
> >>>>
> >>>> This works perfectly except when we roll out a new release of our webapps. In that case it would be ideal to be able to make the load balancer ignore one Apache server, deploy the new version of the webapp on that server, then switch this server back on and the other two off, so the service interruption is as short as possible for the customers. The immediate idea, if Ai/Ti is to be the first server to get the new webapp, is to stop Ti so Ai will not be selected by the load balancer. This does not work: with Paul's patch, Ti is the preferred server BUT if Ti fails then another Tk will be selected by Ai, so the load balancer will never declare Ai failed and will continue to send requests to it (we did manage to make it behave that way by specifying a test URL which includes a jvmRoute to Ti, but this uses up lots of SLB groups on the Alteon).
> >>>>
> >>>> Bernd's patch allows Ai to reject requests if Ti is stopped. The load balancer will therefore quickly declare Ai inactive and stop sending it requests, which makes rolling out the new webapp very easy: set up the new webapp, restart Ti, restart Ai, and as soon as the load balancer sees Ai, shut down the other two Ak. Current sessions are still routed to the old webapp, and new sessions see the new version. When there are no more sessions on the old version, shut down each Tk (k != i) and deploy the new webapp there.
> >>>>
> >>>> My remark concerning the possible selection of recovering workers ahead of the local worker (the one with lb_value set to 0) is about the load balancer then being unable to declare Ai inactive.
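To make the topology concrete, A1's balancer definition in the setup Mathias describes might look like the sketch below. Host names are hypothetical, and the semantics are those of his modified Paul's patch, where lbfactor 0 pins a remote worker's lb_value at INF so it is only reached via its session route:

    worker.list=router

    # local Tomcat - preferred for requests without a session
    worker.T1.port=8009
    worker.T1.host=node1.example.com
    worker.T1.type=ajp13
    worker.T1.lbfactor=1

    # remote Tomcats - reached only through their session routes
    worker.T2.port=8009
    worker.T2.host=node2.example.com
    worker.T2.type=ajp13
    worker.T2.lbfactor=0

    worker.T3.port=8009
    worker.T3.host=node3.example.com
    worker.T3.type=ajp13
    worker.T3.lbfactor=0

    worker.router.type=lb
    worker.router.balanced_workers=T1,T2,T3

A2 and A3 would mirror this, each giving lbfactor 1 to its own local Tomcat.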
> >>>>
> >>>> I hope I have been clear enough and that everybody got the point; if not, I'd be glad to explain more thoroughly.
> >>>>
> >>>> Mathias.
> >>>>
> >>>> Paul Frieden wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> I'm afraid that I am no longer subscribed to the devel list. I would be happy to add my advice on this issue, but I don't have time to keep up with the entire devel list. If there is anything I can do, please just mail me directly.
> >>>>>
> >>>>> I chose the value 0 for a worker because the code used the inverse of the value specified; the value 0 then resulted in essentially infinite preference. I used that approach purely because it was the smallest change possible, and the least likely to change the expected behavior for anybody else - the path of least astonishment and whatnot. I would be concerned about changing the current behavior now, because people probably want a drop-in replacement. If there is going to be a change in the algorithm and behavior, a different approach may be better.
> >>>>>
> >>>>> I would also like to note how we were using this code. In our environment, we have an external dedicated load balancer and three web servers. The main problem that we ran into was with AOL users. AOL uses a proxy that randomizes the source IP of requests, which means you can no longer count on the source IP to tell the load balancer which server to send future requests to. We used this code to allow sessions that arrive at the wrong web server to be redirected to the Tomcat on the correct server. This neatly side-steps the whole issue of changing IPs, because Apache is able to make the decision based on the session ID.
> >>>>>
> >>>>> The reliability issue was a nice side effect for us, in that it caught a failed server more quickly than the load balancer did, and prevented the user from having a connection time out or seeing an error message.
> >>>>>
> >>>>> I hope this provides some insight into why I changed the code that I did, and why that behavior worked well for us.
> >>>>>
> >>>>> Paul
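Paul's 'inverse' remark is worth unpacking. Assuming the usual jk1 scheme, where the scan picks the smallest running lb_value and then charges the winner the inverse of its configured lbfactor, an lbfactor of 0 pushes a worker straight to the infinite end of the scale. A rough, purely illustrative sketch (not the actual jk1 code):

    #include <float.h>

    /* Smallest running lb_value wins; the winner is charged 1/lbfactor.
     * A worker with lbfactor 0 is pinned at "infinity" and so never wins
     * the scan for sessionless requests. */
    static int choose_min(double lb_value[], const double lbfactor[], int n)
    {
        int i, best = 0;

        for (i = 1; i < n; i++)
            if (lb_value[i] < lb_value[best])
                best = i;

        if (lbfactor[best] > 0.0)
            lb_value[best] += 1.0 / lbfactor[best];
        else
            lb_value[best] = DBL_MAX;   /* lbfactor 0: pinned at infinity */

        return best;
    }

Mathias's modification, keeping such workers pinned at INF, makes sure nothing ever resets them back into contention.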
> >>>>>
> >>>>> [EMAIL PROTECTED] wrote:
> >>>>>
> >>>>>> Hi Mathias,
> >>>>>>
> >>>>>> I think it would be better to discuss this on tomcat-dev.
> >>>>>>
> >>>>>> The 'error' worker will not be chosen unless the timeout expires. When the timeout expires, we'll indeed select it (in preference to the default) - this is easy to fix if it creates problems, but I don't see why it would be a problem.
> >>>>>>
> >>>>>> If it is working, the next request will be served normally by the default. If not, it'll go back to error state.
> >>>>>>
> >>>>>> In jk2 I removed that - error workers are no longer selected. But for jk1 I would rather leave the old behavior intact.
> >>>>>>
> >>>>>> Note that the reason for choosing 0 (in jk2) as the default is that I want to switch from floats to ints; I'm not convinced floats are good for performance (or needed).
> >>>>>>
> >>>>>> Again - I'm just learning and trying. If you have any ideas I would be happy to hear them; patches are more than welcome.
> >>>>>>
> >>>>>> Costin
> >>>>>>
> >>>>>> On Sat, 4 May 2002, Mathias Herberts wrote:
> >>>>>>
> >>>>>>> Hi, I just joined the tomcat-dev list and saw your patch to jk_lb_worker.c (making it version 1.9).
> >>>>>>>
> >>>>>>> If I understand your patch correctly, it offers the same behavior as Paul's patch but with the opposite semantics for an lbfactor of 0.0 in the worker's definition, i.e. a value of 0.0 now means ALWAYS USE THIS WORKER FOR REQUESTS WITH NO SESSIONS instead of NEVER USE THIS WORKER FOR REQUESTS WITH NO SESSIONS. This seems fine to me.
> >>>>>>>
> >>>>>>> What disturbs me is what happens when one worker is in error state and not yet recovering. In get_most_suitable_worker, such a worker will be selected whatever its lb_value, meaning a recovering worker will have priority over one with an lb_value of 0.0, and this seems to break the behavior we had achieved with your patch.
> >>>>>>>
> >>>>>>> Did I miss something or is this really a problem?
> >>>>>>>
> >>>>>>> Mathias.
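The hazard Mathias describes can be made concrete. In rough outline (illustrative names and timeout value, not the real get_most_suitable_worker code), the error-state branch returns before any lb_value comparison happens:

    #include <time.h>

    /* Sketch of the selection hazard: once a worker's recovery timeout has
     * expired, it is returned immediately - before lb_value is ever
     * compared - so it beats even an always-preferred 0.0 worker. */
    typedef struct {
        double lb_value;        /* 0.0 = always-preferred local worker */
        int    in_error_state;
        int    in_recovering;
        time_t error_time;
    } wrec;

    #define RECOVER_WAIT 60     /* assumed timeout, in seconds */

    static int pick(wrec w[], int n, time_t now)
    {
        int i, best = -1;

        for (i = 0; i < n; i++) {
            if (w[i].in_error_state) {
                if (!w[i].in_recovering &&
                    now - w[i].error_time > RECOVER_WAIT) {
                    w[i].in_recovering = 1;
                    return i;   /* recovering worker wins unconditionally */
                }
            } else if (best < 0 || w[i].lb_value < w[best].lb_value) {
                best = i;       /* otherwise: smallest lb_value wins */
            }
        }
        return best;            /* -1 if every worker is in error */
    }

Under this shape of loop, a recovering remote worker intercepts the very sessionless request that should have gone to the local worker, which is exactly why the load balancer in front can never see the node as down.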