Dear all,

The patch worked perfectly. See below the results:
2014-04-10 11:27:46,592 (Thread-2) TaskAssignmentStage INFO: Sending Message fcaaf416-bd62-43e8-98a3-e9e20201b58e to pf1.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:OFFLINE to:BOOTSTRAP
2014-04-10 11:27:48,045 (Thread-2) TaskAssignmentStage INFO: Sending Message cfa8985b-1521-42e8-8f38-f58f6852b2ff to pf1.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:BOOTSTRAP to:SLAVE
2014-04-10 11:27:49,717 (Thread-2) TaskAssignmentStage INFO: Sending Message 76f358eb-97bc-4457-b260-add038643d65 to pf1.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:SLAVE to:MASTER
2014-04-10 11:29:35,272 (Thread-2) TaskAssignmentStage INFO: Sending Message 5c38654b-10fe-4a08-9fe4-41d8b712dd99 to pf2.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:OFFLINE to:BOOTSTRAP
2014-04-10 11:29:44,702 (Thread-2) TaskAssignmentStage INFO: Sending Message 28e41ec2-daeb-4675-b1ac-5e8b094c0012 to pf2.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:BOOTSTRAP to:SLAVE
2014-04-10 11:29:46,442 (Thread-2) TaskAssignmentStage INFO: Sending Message 3107ceba-380d-4d0f-bad1-964fe166868b to pf1.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:MASTER to:SLAVE
2014-04-10 11:29:47,021 (Thread-2) TaskAssignmentStage INFO: Sending Message 0fa6e490-b65e-4818-a027-5490c6ebbfd2 to pf2.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:SLAVE to:MASTER

Thank you very much again for the quick solution!

Regards,
Vlad

On Wed, Apr 9, 2014 at 9:42 PM, Vlad Balan <[email protected]> wrote:

> Thank you for the very quick fix! I will test it and let you know of the
> result!
>
> Regards,
> Vlad
>
> On Apr 9, 2014, at 9:16 PM, kishore g <[email protected]> wrote:
>
> Hi Vlad,
>
> Here is the diff https://reviews.apache.org/r/20196/diff for the fix and
> the test case, if you want to give it a try. Apply it on the master.
> thanks,
> Kishore G
>
> On Wed, Apr 9, 2014 at 1:40 PM, Kanak Biscuitwala <[email protected]> wrote:
>
>> Based on the results of the conversation, we found the following:
>>
>> 1. 0.6.x doesn't support partition constraints. Created
>> https://issues.apache.org/jira/browse/HELIX-426
>> 2. 0.7.x doesn't honor partition constraints correctly. Created
>> https://issues.apache.org/jira/browse/HELIX-425
>>
>> We will try to fix these tomorrow.
>>
>> Kanak
>> ________________________________
>> > Date: Wed, 9 Apr 2014 12:51:10 -0700
>> > Subject: Re: keeping the master node up during bootstrap
>> > From: [email protected]
>> > To: [email protected]
>> >
>> > Sure! I'll join the channel!
>> >
>> > On Wed, Apr 9, 2014 at 12:41 PM, kishore g <[email protected]> wrote:
>> > Hi Vlad,
>> >
>> > I have some questions. Can you join the IRC channel #apachehelix?
>> >
>> > thanks,
>> > Kishore G
>> >
>> > On Wed, Apr 9, 2014 at 11:35 AM, [email protected] <[email protected]> wrote:
>> > Upon some further testing, it seems that the controller does not
>> > execute the events in the right sequence.
>> >
>> > Here are the results of some of my testing. Assume that we have a
>> > partition NEWPROFILE_5 with the ideal state:
>> >
>> > "NEWPROFILE_5" : {
>> >   "pf1.apps-pf.dev.docker_12000" : "SLAVE",
>> >   "pf2.apps-pf.dev.docker_12000" : "MASTER"
>> > }
>> >
>> > I boot the host pf1 and, a few minutes later, the host pf2.
>> > In the controller logs, when doing a grep for NEWPROFILE_5, I see:
>> >
>> > 2014-04-08 17:04:35,309 (Thread-2) TaskAssignmentStage INFO: Sending Message 69b4eddf-ac5f-4726-9d6b-bac742ad082e to pf1.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:SLAVE to:MASTER
>> > 2014-04-08 17:27:08,187 (Thread-2) TaskAssignmentStage INFO: Sending Message a221b1ac-0807-425e-9062-6507e45b0bfb to pf1.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:OFFLINE to:BOOTSTRAP
>> > 2014-04-08 17:27:10,164 (Thread-2) TaskAssignmentStage INFO: Sending Message 73ed85fd-49c9-46a5-b262-687d612c7d06 to pf1.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:BOOTSTRAP to:SLAVE
>> > 2014-04-08 17:27:11,868 (Thread-2) TaskAssignmentStage INFO: Sending Message fb21aecc-68cf-4b9f-9718-aa6ed535c29d to pf1.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:SLAVE to:MASTER
>> > 2014-04-08 17:28:22,978 (Thread-2) TaskAssignmentStage INFO: Sending Message ea441d18-b1f3-4ceb-96a2-3262cab1dfbe to pf2.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:OFFLINE to:BOOTSTRAP
>> > 2014-04-08 17:28:22,978 (Thread-2) TaskAssignmentStage INFO: Sending Message f36b4d64-c790-413b-b9fa-915b9539d28c to pf1.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:MASTER to:SLAVE
>> > 2014-04-08 17:28:26,065 (Thread-2) TaskAssignmentStage INFO: Sending Message 201429e1-e810-4017-b3ef-fb5930ac2192 to pf2.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:BOOTSTRAP to:SLAVE
>> > 2014-04-08 17:28:28,238 (Thread-2) TaskAssignmentStage INFO: Sending Message 4a1fb64c-1063-4e49-a995-946d2dd25733 to pf2.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:SLAVE to:MASTER
>> >
>> > That is, the controller issues an offline->bootstrap command to pf2,
>> > but then issues a master->slave command to pf1 before bringing pf2 up
>> > as a slave as well
>> > (the last step before promotion to master). Since the
>> > bootstrap->slave transition that follows takes time, the system spends
>> > time without a master for the partition.
>> >
>> > The state model definition was:
>> >
>> > public static StateModelDefinition defineStateModel() {
>> >   StateModelDefinition.Builder builder =
>> >       new StateModelDefinition.Builder(KVHelixDefinitions.STATE_MODEL_NAME);
>> >   // Add states and their rank to indicate priority. The lower the rank,
>> >   // the higher the priority.
>> >   builder.addState(MASTER, 1);
>> >   builder.addState(SLAVE, 2);
>> >   builder.addState(BOOTSTRAP, 3);
>> >   builder.addState(OFFLINE);
>> >   builder.addState(DROPPED);
>> >   // Set the initial state when the node starts.
>> >   builder.initialState(OFFLINE);
>> >
>> >   // Add transitions between the states.
>> >   builder.addTransition(OFFLINE, BOOTSTRAP, 3);
>> >   builder.addTransition(BOOTSTRAP, SLAVE, 2);
>> >   builder.addTransition(SLAVE, MASTER, 1);
>> >   builder.addTransition(MASTER, SLAVE, 4);
>> >   builder.addTransition(SLAVE, OFFLINE, 5);
>> >   builder.addTransition(OFFLINE, DROPPED, 6);
>> >
>> >   // Set constraints on states.
>> >   // Static constraint:
>> >   builder.upperBound(MASTER, 1);
>> >   // Dynamic constraint; R means it should be derived based on the
>> >   // replication factor.
>> >   builder.dynamicUpperBound(SLAVE, "R");
>> >
>> >   StateModelDefinition stateModelDefinition = builder.build();
>> >
>> >   assert(stateModelDefinition.isValid());
>> >
>> >   return stateModelDefinition;
>> > }
>> >
>> > I have tried reversing the values of the transition priorities.
>> > In this case, the controller log file looked as follows:
>> >
>> > 2014-04-09 11:17:52,831 (Thread-2) TaskAssignmentStage INFO: Sending Message 2b29a319-c1c6-4042-b1ad-3e3c1b5092a7 to pf1.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:OFFLINE to:BOOTSTRAP
>> > 2014-04-09 11:17:55,672 (Thread-2) MessageGenerationStage INFO: Message hasn't been removed for pf1.apps-pf.dev.docker_12000 to transitNEWPROFILE_5 to BOOTSTRAP, desiredState: MASTER
>> > 2014-04-09 11:17:57,047 (Thread-2) TaskAssignmentStage INFO: Sending Message b1ca701d-65f1-46b9-9ae4-286400d6d266 to pf1.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:BOOTSTRAP to:SLAVE
>> > 2014-04-09 11:17:58,888 (Thread-2) TaskAssignmentStage INFO: Sending Message fe10228f-8f5b-4133-964a-5f6c7e60b0e6 to pf1.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:SLAVE to:MASTER
>> > 2014-04-09 11:23:26,117 (Thread-2) TaskAssignmentStage INFO: Sending Message 6252a4e6-0ab8-490a-a51d-c47195c434b5 to pf1.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:MASTER to:SLAVE
>> > 2014-04-09 11:23:26,117 (Thread-2) TaskAssignmentStage INFO: Sending Message 18bbf028-cb51-4162-8226-a6564a121986 to pf2.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:OFFLINE to:BOOTSTRAP
>> > 2014-04-09 11:23:33,462 (Thread-2) MessageGenerationStage INFO: Message hasn't been removed for pf2.apps-pf.dev.docker_12000 to transitNEWPROFILE_5 to BOOTSTRAP, desiredState: MASTER
>> > 2014-04-09 11:23:33,892 (Thread-2) TaskAssignmentStage INFO: Sending Message c7fc4983-9d71-4dc4-bfee-2ad69e4de411 to pf2.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:BOOTSTRAP to:SLAVE
>> > 2014-04-09 11:23:35,933 (Thread-2) TaskAssignmentStage INFO: Sending Message 75e715ed-3d53-4e39-b1e7-44695e4bfa03 to pf2.apps-pf.dev.docker_12000 transit NEWPROFILE_5|[] from:SLAVE to:MASTER
>> > That is, the master->slave transition for pf1 was executed before
>> > taking any action on pf2, clearly the opposite of the right order.
>> >
>> > On Tue, Apr 8, 2014 at 2:19 PM, Kanak Biscuitwala <[email protected]> wrote:
>> >
>> > Looks good, thanks for sharing!
>> >
>> > Kanak
>> > ________________________________
>> >> Date: Tue, 8 Apr 2014 14:08:28 -0700
>> >> Subject: Re: keeping the master node up during bootstrap
>> >> From: [email protected]
>> >> To: [email protected]
>> >>
>> >> My modified code looks like:
>> >>
>> >> /* Set up a Helix cluster for the KVStore */
>> >> public static void setupCluster() {
>> >>   assert(cluster != null);
>> >>   clusterSetup.addCluster(cluster, true);
>> >>
>> >>   ConstraintItemBuilder constraintItemBuilder = new ConstraintItemBuilder();
>> >>
>> >>   constraintItemBuilder
>> >>       .addConstraintAttribute(ConstraintAttribute.MESSAGE_TYPE.toString(), "STATE_TRANSITION")
>> >>       .addConstraintAttribute(ConstraintAttribute.PARTITION.toString(), ".*")
>> >>       .addConstraintAttribute(ConstraintAttribute.CONSTRAINT_VALUE.toString(), "1");
>> >>
>> >>   clusterSetup.getClusterManagementTool().setConstraint(cluster,
>> >>       ClusterConstraints.ConstraintType.MESSAGE_CONSTRAINT,
>> >>       "constraint1", constraintItemBuilder.build());
>> >> }
>> >>
>> >> I will try to see whether it works in every situation.
>> >>
>> >> Regards,
>> >> Vlad
>> >>
>> >> On Tue, Apr 8, 2014 at 8:59 AM, Vlad Balan <[email protected]> wrote:
>> >> Hi Kishore,
>> >>
>> >> I managed to implement the bootstrapping using the constraint and it
>> >> appears to be running as expected. I will post my code shortly.
>> >>
>> >> Regards,
>> >> Vlad
>> >>
>> >> On Apr 8, 2014, at 8:27 AM, kishore g <[email protected]> wrote:
>> >>
>> >> Hi Vlad,
>> >>
>> >> Did you get a chance to play with the constraint? I can write some sample
>> >> code today to try this.
>> >>
>> >> Thanks,
>> >> Kishore G
>> >>
>> >> On Thu, Apr 3, 2014 at 5:45 PM, [email protected] <[email protected]> wrote:
>> >>
>> >> Thank you Kanak and Kishore! I will try enforcing the per-partition
>> >> constraint and let you know if somehow it does not work. I was looking
>> >> at the throttling documentation, but somehow missed that a
>> >> per-partition constraint was an option!
>> >>
>> >> Regards,
>> >> Vlad
>> >>
>> >> On Thu, Apr 3, 2014 at 5:42 PM, kishore g <[email protected]> wrote:
>> >> Hi Vlad,
>> >>
>> >> You can try setting the transition priority order and a constraint that
>> >> there should be only one transition per partition across the cluster.
>> >>
>> >> So the transition priority could be something like:
>> >>
>> >> Slave->Master
>> >> Offline->Bootstrap
>> >> Bootstrap->Slave
>> >> Master->Slave
>> >>
>> >> For the rest, I'm not sure the order matters.
>> >>
>> >> Also set the max transitions constraint to 1 per partition.
>> >>
>> >> The reason I put Slave->Master before Offline->Bootstrap is to ensure
>> >> that availability is given more importance. For example, suppose you have
>> >> 3 nodes N1, N2, N3, where N1 is master, N2 is slave, and N3 is down. If N1
>> >> goes down and N3 comes up at the same time, we probably don't want to
>> >> wait for N3 to bootstrap before promoting N2 to master.
>> >>
>> >> I haven't tested this, but assuming the constraints enforcement works,
>> >> this should do the trick.
>> >>
>> >> Does this make sense? Let me know if this does not work; we can add a
>> >> test case.
>> >>
>> >> thanks,
>> >> Kishore G
>> >>
>> >> On Thu, Apr 3, 2014 at 4:57 PM, [email protected] <[email protected]> wrote:
>> >>
>> >> Dear all,
>> >>
>> >> I am trying to construct a state model with the following transition
>> >> diagram:
>> >>
>> >> OFFLINE -> BOOTSTRAPPING -> SLAVE <-> MASTER
>> >> (with SLAVE also able to go back to OFFLINE)
>> >>
>> >> That is, an offline node can go into a bootstrapping state, from the
>> >> bootstrapping state it can go into a slave state, from slave it can go
>> >> to master, from master to slave, and from slave it can go offline.
>> >>
>> >> Assume that I have two nodes pf1 and pf2 and a partition partition_0
>> >> with the following ideal state:
>> >>
>> >> partition_0: pf2: MASTER, pf1: SLAVE
>> >>
>> >> and that currently pf1 is serving as master. When pf2 boots, Helix
>> >> will issue, almost simultaneously, two commands:
>> >>
>> >> for pf1: transition from MASTER to SLAVE
>> >> for pf2: transition from BOOTSTRAPPING to SLAVE
>> >>
>> >> My understanding is that this happens because Helix tries to execute
>> >> as many commands as possible in parallel, and the ideal state has pf2
>> >> as master. However, the transition from BOOTSTRAPPING to SLAVE for pf2
>> >> involves a long data-copy step, so I would like to keep pf1 as master
>> >> in the meanwhile. I tried prioritizing the transition from BOOTSTRAPPING
>> >> to SLAVE over the transition from MASTER to SLAVE, but Helix still
>> >> issues them in parallel (as it should).
>> >>
>> >> I was wondering what my options would be in order to keep the master up
>> >> while the future master is bootstrapping. Could throttling of the number
>> >> of transitions be enforced at the partition level? Could I somehow
>> >> specify that a state with a slave and a bootstrapping node is
>> >> undesirable?
>> >>
>> >> As a note, I have also looked at the rsync-replicated file system
>> >> example. The reason for not using the OnlineOffline or the MasterSlave
>> >> model in my application is that I would like the bootstrapping node to
>> >> receive updates from clients, i.e. be visible during the bootstrap. For
>> >> this reason, I am introducing the new BOOTSTRAPPING phase in between
>> >> OFFLINE and SLAVE.
>> >>
>> >> Regards,
>> >> Vlad
>> >>
>> >> PS: The state model definition is as follows:
>> >>
>> >> builder.addState(MASTER, 1);
>> >> builder.addState(SLAVE, 2);
>> >> builder.addState(BOOTSTRAP, 3);
>> >> builder.addState(OFFLINE);
>> >> builder.addState(DROPPED);
>> >>
>> >> // Set the initial state when the node starts.
>> >> builder.initialState(OFFLINE);
>> >>
>> >> // Add transitions between the states.
>> >> builder.addTransition(OFFLINE, BOOTSTRAP, 4);
>> >> builder.addTransition(BOOTSTRAP, SLAVE, 5);
>> >> builder.addTransition(SLAVE, MASTER, 6);
>> >> builder.addTransition(MASTER, SLAVE, 3);
>> >> builder.addTransition(SLAVE, OFFLINE, 2);
>> >> builder.addTransition(OFFLINE, DROPPED, 1);
>> >>
>> >> // Set constraints on states.
>> >> // Static constraint:
>> >> builder.upperBound(MASTER, 1);
>> >> // Dynamic constraint; R means it should be derived based on the
>> >> // replication factor.
>> >> builder.dynamicUpperBound(SLAVE, "R");
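[Editor's note] The mechanism this thread converges on can be summarized as: with transition priorities (lower number = higher priority) plus a MESSAGE_CONSTRAINT of one state transition per partition, the controller can issue only the single highest-priority pending transition for each partition per pipeline run, so MASTER->SLAVE on the old master is deferred until OFFLINE->BOOTSTRAP and BOOTSTRAP->SLAVE on the new node have completed. The sketch below is illustrative only; it is not code from this thread and not Helix's actual controller logic, and all names in it are hypothetical.

```java
import java.util.*;

// Illustrative sketch of the effect of a per-partition transition constraint
// combined with transition priorities (lower number = higher priority).
// NOT Helix internals; a toy model of the selection rule discussed above.
public class PartitionThrottleSketch {

  // A pending state-transition message for one replica of one partition.
  record Transition(String instance, String partition, String from, String to, int priority) {}

  // With at most one in-flight transition per partition, only the single
  // highest-priority (lowest-numbered) pending message per partition is sent;
  // the rest wait for a later pipeline run, after that transition completes.
  static List<Transition> schedule(List<Transition> pending) {
    Map<String, Transition> perPartition = new HashMap<>();
    for (Transition t : pending) {
      perPartition.merge(t.partition(), t,
          (a, b) -> a.priority() <= b.priority() ? a : b);
    }
    return new ArrayList<>(perPartition.values());
  }

  public static void main(String[] args) {
    // The scenario from the thread: pf2's OFFLINE->BOOTSTRAP (priority 3)
    // outranks pf1's MASTER->SLAVE (priority 4), so the demotion is held back
    // and the partition keeps a master while pf2 bootstraps.
    List<Transition> pending = List.of(
        new Transition("pf2", "NEWPROFILE_5", "OFFLINE", "BOOTSTRAP", 3),
        new Transition("pf1", "NEWPROFILE_5", "MASTER", "SLAVE", 4));
    System.out.println(schedule(pending));
  }
}
```

Under this rule, once pf2 reaches SLAVE the only remaining pending messages are the demotion and promotion, which then run in their priority order, matching the fixed log sequence at the top of the thread.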
