[kudu-CR] docs: workflow for master migration
David Ribeiro Alves has posted comments on this change.

Change subject: docs: workflow for master migration

Patch Set 4: Code-Review+2

--
To view, visit http://gerrit.cloudera.org:8080/4300
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9b9c66505e0efd1f4aef80884346507d4fe08d9c
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo
Gerrit-Reviewer: Adar Dembo
Gerrit-Reviewer: Alexey Serbin
Gerrit-Reviewer: David Ribeiro Alves
Gerrit-Reviewer: John Russell
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon
Gerrit-HasComments: No
[kudu-CR] docs: workflow for master migration
Kudu Jenkins has posted comments on this change.

Change subject: docs: workflow for master migration

Patch Set 3:

Build Started http://104.196.14.100/job/kudu-gerrit/3426/
[kudu-CR] docs: workflow for master migration
Hello Kudu Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/4300

to look at the new patch set (#3).

Change subject: docs: workflow for master migration

docs: workflow for master migration

Change-Id: I9b9c66505e0efd1f4aef80884346507d4fe08d9c
---
M docs/administration.adoc
1 file changed, 160 insertions(+), 0 deletions(-)

    git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/00/4300/3
[kudu-CR] docs: workflow for master migration
Adar Dembo has posted comments on this change.

Change subject: docs: workflow for master migration

Patch Set 1:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/4300/1//COMMIT_MSG
Commit Message:

Line 7: docs: workflow for master migration
> A question: will this (or something like this) work to migrate, say, from 3
It won't work without more steps for migration from three to five. Specifically, once the three masters have started (after their Raft configs have been rewritten from the command line), you'd need to wait until all three have caught up to one another; otherwise, copying the tablet to the two new ones can incur data loss if one of the original three dies thereafter. On top of that, once you have three masters, you probably don't want the outage that using this workflow entails. Better to "do it right" with Raft config changes once that's implemented.

Anyway, I'll doc that it doesn't work.

http://gerrit.cloudera.org:8080/#/c/4300/1/docs/administration.adoc
File docs/administration.adoc:

Line 236: recovering from permanent master failures greatly, and is highly recommended. The alias should be
> My "how" referred to "How is the user supposed to do this. What is the goal
I don't know, I guess we just disagree on this. In my experience, step-by-step product documentation is intentionally dry. When reading it, I don't expect to learn why something is the way it is; I just expect to solve a problem by following instructions.

For this particular step, I think it's important to provide some kind of "carrot" to incentivize users to go through DNS changes. Without that, all a user knows is that it's optional; they don't know whether it's important or not. But at the same time, we don't want to swamp them with technical details. I view it as a balancing act that (I agree with you) leaves the more technical users in the dark, but focuses the doc for everyone else.

If it helps, the "recover from permanent master failure" doc (still in progress) will talk about this in a little more detail.

Line 241: colocated with other services, though not with another master from the same configuration.
> what other services? Are we advising that people co-locate the master with
This is my CM experience talking; "other services" refers to any other data system or load-intensive process that may be deployed in the cluster. I'll clarify a bit.

Line 244: * Identify and record the directory where the master's data will live.
> IMO identify leans more towards "finding the identity" of something vs "cho
Alright, I'll change it.

Line 246: * Optional: configure a DNS cname or /etc/hsots alias to the master's hostname (e.g. `master-2`,
> same as above
See above.

Line 251: . Shut down the entire cluster.
> Does it mean shutting down the machines or just Kudu processes? If the lat
Yeah, I'll clarify that we're talking about the processes here, not the machines. There's no actual "graceful" shutdown for Kudu though, so I'll elide that word to avoid confusion.

I've omitted the part about disabling Kudu services. I think it's implied in "maintenance window", plus the "undisabling" sentence proved to be unnecessarily verbose and confusing.

PS1, Line 264:
> Nit: an extra space.
Done

PS1, Line 264: DNS cnames
> Nit: DNS names. Those could be A records, right?
Hmm, OK. I guess it could be either cnames or A records. I'll clarify.

Line 284: . Start the existing master.
> If recommending disabling Kudu services, then 'Enable and start ...'
Yeah, this is the part that I think gets too verbose, which is why I omitted the "disable" from earlier.

Line 314: are working properly, consider performing the following sanity checks:
> Yeah, that would be my suggestion too. First the user should make sure that
OK, I'll add checking that the /masters page on each web UI looks the same (and that one master was elected leader), and using ksck.
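
As a rough illustration of the /masters check proposed in the Line 314 reply above, the comparison could be scripted along the following lines. This is only a sketch: it assumes the per-master views have already been collected by hand (for example, copied from each web UI's /masters page) into plain dictionaries. The function name `check_master_views`, the role strings, and the host names are hypothetical and not part of Kudu.

```python
from typing import Dict, List


def check_master_views(views: Dict[str, Dict[str, str]]) -> List[str]:
    """Return a list of problems found; an empty list means the views agree.

    `views` maps each master that was queried to the {master: role} mapping
    it reported. This data format is an assumption made for the sketch.
    """
    if not views:
        return ["no master views were provided"]

    problems: List[str] = []

    # Every master should report exactly the same membership and roles.
    reference_host, reference = next(iter(views.items()))
    for host, view in views.items():
        if view != reference:
            problems.append(f"{host} disagrees with {reference_host}")

    # Exactly one master should have been elected leader.
    leaders = [m for m, role in reference.items() if role == "LEADER"]
    if len(leaders) != 1:
        problems.append(f"expected 1 leader, found {len(leaders)}")

    return problems


if __name__ == "__main__":
    # Hypothetical three-master deployment where all views agree.
    view = {"master-1": "LEADER", "master-2": "FOLLOWER", "master-3": "FOLLOWER"}
    print(check_master_views({h: dict(view) for h in view}))
```

A check like this only confirms that the masters agree with one another; it does not replace running ksck against the cluster.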
[kudu-CR] docs: workflow for master migration
David Ribeiro Alves has posted comments on this change.

Change subject: docs: workflow for master migration

Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/4300/1/docs/administration.adoc
File docs/administration.adoc:

Line 236: recovering from permanent master failures greatly, and is highly recommended. The alias should be
> To clarify, is your question "how does it simplify recovering from permanen
My "how" referred to "How is the user supposed to do this? What are the goal and steps?" Someone with no context won't know how or why this "simplifies recovering from permanent master failures greatly". It just seems like, with the removal of the "complex and distracting" explanation, you settled on something that is worse than not having anything at all. I don't feel strongly about a particular route (full explanation, pointing to the design doc, or removing it altogether); I just find that leaving only this is confusing.

Line 241: colocated with other services, though not with another master from the same configuration.
What other services? Are we advising that people co-locate the master with a tablet server? Make that clear.

Line 244: * Identify and record the directory where the master's data will live.
> I wanted "record" in there to make it clear that the choice being made need
IMO "identify" leans more towards "finding the identity" of something vs. "choosing the identity" of something. Breaking that symmetry is exactly my point, since above it's the former case and here it's the latter.

Line 314: are working properly, consider performing the following sanity checks:
> Is there a way to list existing masters in the system and status of each?
Yeah, that would be my suggestion too. First the user should make sure that the system state is the expected one and then yes, try it.

Maybe point to a tool (like ksck) that also does scans, as scanning usually requires writing code and we wouldn't want an admin to have to write custom code to make sure the migration worked.
[kudu-CR] docs: workflow for master migration
Alexey Serbin has posted comments on this change.

Change subject: docs: workflow for master migration

Patch Set 1:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/4300/1//COMMIT_MSG
Commit Message:

Line 7: docs: workflow for master migration
A question: will this (or something like this) work to migrate, say, from a 3-master configuration to a 5-master configuration? If yes, and it's verified that it works, consider mentioning this in the document. Or maybe it should be a separate document?

http://gerrit.cloudera.org:8080/#/c/4300/1/docs/administration.adoc
File docs/administration.adoc:

Line 251: . Shut down the entire cluster.
Does it mean shutting down the machines or just Kudu processes? If the latter, consider changing to 'Gracefully shut down Kudu processes on the entire cluster.'

Besides, I would consider disabling Kudu services (kudu-master and kudu-tserver) for the span of the migration procedure. The reason is to make sure the processes do not start automatically if machines reboot spuriously before the procedure is completed.

PS1, Line 264:
Nit: an extra space.

PS1, Line 264: DNS cnames
Nit: DNS names. Those could be A records, right?

Line 284: . Start the existing master.
If recommending disabling Kudu services, then 'Enable and start ...'

Line 301: . Start all of the new masters.
If recommending disabling Kudu services, then 'Enable and start ...'

Line 311: . Start all of the tablet servers.
If recommending disabling Kudu services, then 'Enable and start ...'

Line 314: are working properly, consider performing the following sanity checks:
> Obviously there are a lot of ways that this workflow could be botched. But,
Is there a way to list existing masters in the system and status of each? If yes, it might make sense to check that the desired and the actual lists match.
[kudu-CR] docs: workflow for master migration
Adar Dembo has posted comments on this change.

Change subject: docs: workflow for master migration

Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/4300/1/docs/administration.adoc
File docs/administration.adoc:

Line 202: For true high availability and to avoid a single point of failure, Kudu clusters should be created
> remove "true"
Done

Line 236: recovering from permanent master failures greatly, and is highly recommended. The alias should be
> how? are you linking this to somewhere else, or are you adding this info on
To clarify, is your question "how does it simplify recovering from permanent master failures"? Or "how do I configure a DNS cname or /etc/hosts alias"?

To the first question: originally I had an explanation embedded in this section, but I found it to be too complex and distracting from the rest of the workflow, so I removed it. The actual explanation can be found in the "handling permanent failure in masters" design doc, but I don't think it makes sense to link to a design doc from product documentation.

To the second question: configuring DNS is out of scope for this document because we expect administrators to know how to do that already, or to find someone in their organization who does.

Line 244: * Identify and record the directory where the master's data will live.
> this is confusing, the user can "choose" these now since it's running new d
I wanted "record" in there to make it clear that the choice being made needs to be remembered for later on. After that, "choose" and "identify" seem synonymous enough to me, and I think it's good to evoke symmetry with respect to the previous step (where we identified and recorded the data directory of the existing master).

Line 282: ** `port` is the master's previously recorded RPC port number
> it would be good to have examples for an actual correct command, maybe some
I've added an example for each command.

Line 314: are working properly, consider performing the following sanity checks:
> how does this validate that the new masters are working properly? if the us
Obviously there are a lot of ways that this workflow could be botched. But if the user at least rewrote the Raft configuration on the existing master such that it includes multiple entries, a scan will fail if there aren't healthy masters behind those entries (because the client will look for a leader master and won't find one, since the one working master can't get a majority of votes).

What would you propose as a better means of post-migration validation?
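
The majority argument in the Line 314 reply above can be sketched with a little arithmetic. The snippet below is illustrative only: it applies the standard Raft rule that a leader needs votes from a strict majority of the configured peers, and the function names `majority` and `can_elect_leader` are hypothetical, not Kudu APIs.

```python
def majority(config_size: int) -> int:
    """Smallest number of votes that is a strict majority of the config."""
    return config_size // 2 + 1


def can_elect_leader(config_size: int, healthy_masters: int) -> bool:
    """True if enough masters are reachable to win a leader election."""
    return healthy_masters >= majority(config_size)


if __name__ == "__main__":
    # Raft config rewritten to list three masters, but only the original
    # master is actually running: one vote is less than the two required.
    print(can_elect_leader(config_size=3, healthy_masters=1))
    # All three masters healthy: a leader can be elected and scans can proceed.
    print(can_elect_leader(config_size=3, healthy_masters=3))
```

Under this rule, a botched migration that leaves only one working master behind a three-entry configuration has no leader, which is why a scan would be expected to fail in that case.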
[kudu-CR] docs: workflow for master migration
Kudu Jenkins has posted comments on this change.

Change subject: docs: workflow for master migration

Patch Set 2:

Build Started http://104.196.14.100/job/kudu-gerrit/3221/
[kudu-CR] docs: workflow for master migration
David Ribeiro Alves has posted comments on this change.

Change subject: docs: workflow for master migration

Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/4300/1/docs/administration.adoc
File docs/administration.adoc:

Line 202: For true high availability and to avoid a single point of failure, Kudu clusters should be created
Remove "true".

Line 236: recovering from permanent master failures greatly, and is highly recommended. The alias should be
How? Are you linking this to somewhere else, or are you adding this info in a later patch? If it's the latter, I would recommend adding this info there.

Line 244: * Identify and record the directory where the master's data will live.
This is confusing: the user can "choose" these now since it's running new daemons. What should be identified and recorded?

Line 246: * Optional: configure a DNS cname or /etc/hsots alias to the master's hostname (e.g. `master-2`,
Same as above.

Line 282: ** `port` is the master's previously recorded RPC port number
It would be good to have examples of an actual correct command, maybe somewhere else too.

Line 314: are working properly, consider performing the following sanity checks:
How does this validate that the new masters are working properly? If the user did stuff wrong and nothing changed, wouldn't this work anyway?
[kudu-CR] docs: workflow for master migration
Adar Dembo has posted comments on this change.

Change subject: docs: workflow for master migration

Patch Set 1:

I tested this manually on a 4-node cluster, using CM and /etc/hosts aliases. The cluster is alive and (seemingly) well, but I'll do more testing on it tomorrow to make sure.
[kudu-CR] docs: workflow for master migration
Kudu Jenkins has posted comments on this change.

Change subject: docs: workflow for master migration

Patch Set 1:

Build Started http://104.196.14.100/job/kudu-gerrit/3205/
[kudu-CR] docs: workflow for master migration
Hello Todd Lipcon,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/4300

to review the following change.

Change subject: docs: workflow for master migration

docs: workflow for master migration

Change-Id: I9b9c66505e0efd1f4aef80884346507d4fe08d9c
---
M docs/administration.adoc
1 file changed, 122 insertions(+), 0 deletions(-)

    git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/00/4300/1