[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354963#comment-17354963 ]

Duo Zhang commented on HBASE-7386:

Any updates here? Thanks.

> Investigate providing some supervisor support for znode deletion
>
> Key: HBASE-7386
> URL: https://issues.apache.org/jira/browse/HBASE-7386
> Project: HBase
> Issue Type: Task
> Components: master, regionserver, scripts
> Reporter: Gregory Chanan
> Assignee: Michael Stack
> Priority: Blocker
> Fix For: 3.0.0-alpha-1
> Attachments: HBASE-7386-bin-v2.patch, HBASE-7386-bin-v3.patch,
> HBASE-7386-bin.patch, HBASE-7386-conf-v2.patch, HBASE-7386-conf-v3.patch,
> HBASE-7386-conf.patch, HBASE-7386-master-00.patch,
> HBASE-7386-master-01.patch, HBASE-7386-src.patch, HBASE-7386-v0.patch,
> supervisordconfigs-v0.patch
>
> There are a couple of JIRAs for deleting the znode on a process failure:
> HBASE-5844 (RS)
> HBASE-5926 (Master)
> which are pretty neat; on process failure, they delete the znode of the
> underlying process so HBase can recover faster.
> These JIRAs were implemented via the startup scripts; i.e. the script hangs
> around and waits for the process to exit, then deletes the znode.
> There are a few problems associated with this approach, as listed in the
> JIRA comments below:
> 1) Hides startup output in script
> https://issues.apache.org/jira/browse/HBASE-5844?focusedCommentId=13463401=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13463401
> 2) Two hbase processes listed per launched daemon
> https://issues.apache.org/jira/browse/HBASE-5844?focusedCommentId=13463409=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13463409
> 3) Not run by a real supervisor
> https://issues.apache.org/jira/browse/HBASE-5844?focusedCommentId=13463409=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13463409
> 4) Weird output after kill -9 of the actual process in standalone mode
> https://issues.apache.org/jira/browse/HBASE-5926?focusedCommentId=13506801=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13506801
> 5) Can kill existing RS if called again
> https://issues.apache.org/jira/browse/HBASE-5844?focusedCommentId=13463401=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13463401
> 6) Hides stdout/stderr[6]
> https://issues.apache.org/jira/browse/HBASE-5844?focusedCommentId=13506832=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13506832
> I suspect running it via something like supervisord can solve these issues
> if we provide the right support.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
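The wrapper approach described in the issue (a script that hangs around, waits for the process to exit, then deletes the znode) can be sketched roughly as below. This is only an illustration in Python, not the code from HBASE-5844 or HBASE-5926: `run_and_clean_up` and the `cleanup` callback are hypothetical names, and the actual znode deletion is left to whatever the callback does.

```python
import subprocess
import sys


def run_and_clean_up(command, cleanup):
    """Run `command` in the foreground and call `cleanup` once it has exited.

    `cleanup` is a hypothetical hook that receives the child's return code;
    in the HBase case it would be the place to delete the stale znode.
    """
    completed = subprocess.run(command)
    cleanup(completed.returncode)
    return completed.returncode


# Demo: a child process that exits with status 3 stands in for a daemon
# that died unexpectedly.
exit_codes = []
code = run_and_clean_up(
    [sys.executable, "-c", "import sys; sys.exit(3)"],
    exit_codes.append,
)
print("child exited with", code, "cleanup saw", exit_codes)
```

Because the wrapper stays alive for the whole lifetime of the child, it is also the source of complaint 2) above: two processes show up per launched daemon.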
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446039#comment-16446039 ]

Hadoop QA commented on HBASE-7386:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 14s | Docker mode activated. |
|| || || || Prechecks ||
| 0 | shelldocs | 0m 3s | Shelldocs was not available. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
|| || || || master Compile Tests ||
|| || || || Patch Compile Tests ||
| +1 | pylint | 0m 1s | There were no new pylint issues. |
| -1 | shellcheck | 0m 1s | The patch generated 44 new + 59 unchanged - 17 fixed = 103 total (was 76) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
|| || || || Other Tests ||
| +1 | asflicense | 0m 8s | The patch does not generate ASF License warnings. |
| | | 0m 36s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:d8b550f |
| JIRA Issue | HBASE-7386 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12878542/HBASE-7386-master-01.patch |
| Optional Tests | asflicense shellcheck shelldocs pylint |
| uname | Linux 054cbf364276 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 8219ec7493 |
| maven | version: Apache Maven 3.5.3 (3383c37e1f9e9b3bc3df5050c29c8aff9f295297; 2018-02-24T19:49:05Z) |
| shellcheck | v0.4.4 |
| pylint | v1.6.5 |
| shellcheck | https://builds.apache.org/job/PreCommit-HBASE-Build/12575/artifact/patchprocess/diff-patch-shellcheck.txt |
| Max. process+thread count | 47 (vs. ulimit of 1) |
| modules | C: . U: . |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/12575/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.

> Investigate providing some supervisor support for znode deletion
>
> Key: HBASE-7386
> URL: https://issues.apache.org/jira/browse/HBASE-7386
> Project: HBase
> Issue Type: Task
> Components: master, regionserver, scripts
> Reporter: Gregory Chanan
> Assignee: stack
> Priority: Blocker
> Fix For: 3.0.0
> Attachments: HBASE-7386-bin-v2.patch, HBASE-7386-bin-v3.patch,
> HBASE-7386-bin.patch, HBASE-7386-conf-v2.patch, HBASE-7386-conf-v3.patch,
> HBASE-7386-conf.patch, HBASE-7386-master-00.patch,
> HBASE-7386-master-01.patch, HBASE-7386-src.patch, HBASE-7386-v0.patch,
> supervisordconfigs-v0.patch
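The shellcheck row in the QA report above reads as simple bookkeeping: the warnings left unchanged are the previous total minus the ones the patch fixed, and the new total adds the newly introduced warnings on top. A small sketch of that arithmetic follows; `tally` is a hypothetical helper written for this note, not part of Apache Yetus.

```python
def tally(new, fixed, previous_total):
    """Return (unchanged, total) for a 'N new + U unchanged - F fixed' line."""
    unchanged = previous_total - fixed
    total = new + unchanged
    return unchanged, total


# Figures taken from the shellcheck row of the report:
# "44 new + 59 unchanged - 17 fixed = 103 total (was 76)"
unchanged, total = tally(new=44, fixed=17, previous_total=76)
print(unchanged, total)
```

Read this way, the patch removes 17 existing warnings but introduces 44, so the net change is +27, which is why the row is voted -1.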
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446030#comment-16446030 ]

Hadoop QA commented on HBASE-7386:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 13s | Docker mode activated. |
|| || || || Prechecks ||
| 0 | shelldocs | 0m 4s | Shelldocs was not available. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
|| || || || master Compile Tests ||
|| || || || Patch Compile Tests ||
| +1 | pylint | 0m 1s | There were no new pylint issues. |
| -1 | shellcheck | 0m 2s | The patch generated 44 new + 59 unchanged - 17 fixed = 103 total (was 76) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
|| || || || Other Tests ||
| +1 | asflicense | 0m 8s | The patch does not generate ASF License warnings. |
| | | 0m 36s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:d8b550f |
| JIRA Issue | HBASE-7386 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12878542/HBASE-7386-master-01.patch |
| Optional Tests | asflicense shellcheck shelldocs pylint |
| uname | Linux fb37c345af50 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 8219ec7493 |
| maven | version: Apache Maven 3.5.3 (3383c37e1f9e9b3bc3df5050c29c8aff9f295297; 2018-02-24T19:49:05Z) |
| shellcheck | v0.4.4 |
| pylint | v1.6.5 |
| shellcheck | https://builds.apache.org/job/PreCommit-HBASE-Build/12574/artifact/patchprocess/diff-patch-shellcheck.txt |
| Max. process+thread count | 48 (vs. ulimit of 1) |
| modules | C: . U: . |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/12574/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.

> Investigate providing some supervisor support for znode deletion
>
> Key: HBASE-7386
> URL: https://issues.apache.org/jira/browse/HBASE-7386
> Project: HBase
> Issue Type: Task
> Components: master, regionserver, scripts
> Reporter: Gregory Chanan
> Assignee: stack
> Priority: Blocker
> Fix For: 3.0.0
> Attachments: HBASE-7386-bin-v2.patch, HBASE-7386-bin-v3.patch,
> HBASE-7386-bin.patch, HBASE-7386-conf-v2.patch, HBASE-7386-conf-v3.patch,
> HBASE-7386-conf.patch, HBASE-7386-master-00.patch,
> HBASE-7386-master-01.patch, HBASE-7386-src.patch, HBASE-7386-v0.patch,
> supervisordconfigs-v0.patch
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106090#comment-16106090 ]

Samir Ahmic commented on HBASE-7386:

[~stack] did you maybe have a chance to take the scripts for a test drive? Any suggestions on how we can improve this, chief?

> Investigate providing some supervisor support for znode deletion
>
> Key: HBASE-7386
> URL: https://issues.apache.org/jira/browse/HBASE-7386
> Project: HBase
> Issue Type: Task
> Components: master, regionserver, scripts
> Reporter: Gregory Chanan
> Assignee: stack
> Priority: Blocker
> Fix For: 3.0.0
> Attachments: HBASE-7386-bin.patch, HBASE-7386-bin-v2.patch,
> HBASE-7386-bin-v3.patch, HBASE-7386-conf.patch, HBASE-7386-conf-v2.patch,
> HBASE-7386-conf-v3.patch, HBASE-7386-master-00.patch,
> HBASE-7386-master-01.patch, HBASE-7386-src.patch, HBASE-7386-v0.patch,
> supervisordconfigs-v0.patch
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097618#comment-16097618 ]

Hadoop QA commented on HBASE-7386:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 19s | Docker mode activated. |
| 0 | shelldocs | 0m 4s | Shelldocs was not available. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | pylint | 0m 3s | There were no new pylint issues. |
| -1 | shellcheck | 0m 4s | The patch generated 46 new + 480 unchanged - 18 fixed = 526 total (was 498) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | hadoopcheck | 29m 25s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha4. |
| +1 | asflicense | 0m 14s | The patch does not generate ASF License warnings. |
| | | 30m 20s | |

|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:bdc94b1 |
| JIRA Issue | HBASE-7386 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12878542/HBASE-7386-master-01.patch |
| Optional Tests | asflicense shellcheck shelldocs pylint |
| uname | Linux 24d9842af88e 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / e9d8a7b |
| shellcheck | v0.4.6 |
| pylint | v1.7.2 |
| shellcheck | https://builds.apache.org/job/PreCommit-HBASE-Build/7763/artifact/patchprocess/diff-patch-shellcheck.txt |
| modules | C: . U: . |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/7763/console |
| Powered by | Apache Yetus 0.4.0 http://yetus.apache.org |

This message was automatically generated.

> Investigate providing some supervisor support for znode deletion
>
> Key: HBASE-7386
> URL: https://issues.apache.org/jira/browse/HBASE-7386
> Project: HBase
> Issue Type: Task
> Components: master, regionserver, scripts
> Reporter: Gregory Chanan
> Assignee: stack
> Priority: Blocker
> Fix For: 3.0.0
> Attachments: HBASE-7386-bin.patch, HBASE-7386-bin-v2.patch,
> HBASE-7386-bin-v3.patch, HBASE-7386-conf.patch, HBASE-7386-conf-v2.patch,
> HBASE-7386-conf-v3.patch, HBASE-7386-master-00.patch,
> HBASE-7386-master-01.patch, HBASE-7386-src.patch, HBASE-7386-v0.patch,
> supervisordconfigs-v0.patch
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16096776#comment-16096776 ]

Samir Ahmic commented on HBASE-7386:

Thanks [~stack]. Why Python supervisor? Well, we originally started this story around it, and after some time testing it, at least for me, choosing a mature and well-proven process control system instead of writing custom bash scripts has multiple advantages.

To be honest, the work here extends the original issue of just removing stale znodes to creating a watchdog over HBase processes and providing an alternative option for managing the cluster. But once we started tackling the supervisor approach, why not offer folks the chance to worry less when an RS process dies, because it will be automatically restarted :)

Also, Python supervisor has a set of very cool features such as auto-restart, event listeners (which may execute arbitrary code based on process state), and so on, and folks may start creating their own listeners for different purposes.

Btw, I will address the shellcheck and pylint issues in the next patch.

> Investigate providing some supervisor support for znode deletion
>
> Key: HBASE-7386
> URL: https://issues.apache.org/jira/browse/HBASE-7386
> Project: HBase
> Issue Type: Task
> Components: master, regionserver, scripts
> Reporter: Gregory Chanan
> Assignee: stack
> Priority: Blocker
> Fix For: 3.0.0
> Attachments: HBASE-7386-bin.patch, HBASE-7386-bin-v2.patch,
> HBASE-7386-bin-v3.patch, HBASE-7386-conf.patch, HBASE-7386-conf-v2.patch,
> HBASE-7386-conf-v3.patch, HBASE-7386-master-00.patch, HBASE-7386-src.patch,
> HBASE-7386-v0.patch, supervisordconfigs-v0.patch
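The event-listener feature mentioned in the comment above could, for instance, be used to trigger znode cleanup when a process exits unexpectedly. The sketch below follows supervisor's documented listener protocol (the listener prints `READY`, reads a header line whose `len` token gives the payload size, then answers `RESULT 2` followed by `OK`). It is an illustration only and not the code in the attached patches: `handle_one_event` and the `on_unexpected_exit` callback are hypothetical names, and what the callback does (for example deleting a znode) is left open.

```python
import io


def parse_tokens(text):
    """Turn 'key:value key:value ...' into a dict, as used in headers and payloads."""
    return dict(token.split(":", 1) for token in text.split())


def handle_one_event(stdin, stdout, on_unexpected_exit):
    """Process a single supervisor event read from `stdin`.

    `on_unexpected_exit` is a hypothetical hook called with the process name
    when a PROCESS_STATE_EXITED event reports expected:0.
    """
    stdout.write("READY\n")              # tell supervisord we can take an event
    stdout.flush()
    header = parse_tokens(stdin.readline())
    # Payloads are ASCII here, so a character count equals the byte count.
    payload = stdin.read(int(header["len"]))
    if header["eventname"] == "PROCESS_STATE_EXITED":
        info = parse_tokens(payload)
        if info.get("expected") == "0":
            on_unexpected_exit(info["processname"])
    stdout.write("RESULT 2\nOK")         # acknowledge the event
    stdout.flush()
    return header["eventname"]


# Demo with in-memory streams instead of a running supervisord.
demo_payload = "processname:regionserver groupname:hbase from_state:RUNNING expected:0 pid:4242"
demo_header = (
    "ver:3.0 server:supervisor serial:1 pool:znode poolserial:1 "
    "eventname:PROCESS_STATE_EXITED len:%d\n" % len(demo_payload)
)
crashed = []
demo_out = io.StringIO()
handle_one_event(io.StringIO(demo_header + demo_payload), demo_out, crashed.append)
print(crashed)
```

A real listener would call `handle_one_event(sys.stdin, sys.stdout, ...)` in a loop and be registered in an `[eventlistener:x]` section of the supervisord configuration.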
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16096438#comment-16096438 ]

stack commented on HBASE-7386:

This looks really nice, [~asamir]. Let me give it a closer review. Thank you.

Why Python supervisor scripts and not, say, bash?

Thanks, boss.

> Investigate providing some supervisor support for znode deletion
>
> Key: HBASE-7386
> URL: https://issues.apache.org/jira/browse/HBASE-7386
> Project: HBase
> Issue Type: Task
> Components: master, regionserver, scripts
> Reporter: Gregory Chanan
> Assignee: stack
> Priority: Blocker
> Fix For: 3.0.0
> Attachments: HBASE-7386-bin.patch, HBASE-7386-bin-v2.patch,
> HBASE-7386-bin-v3.patch, HBASE-7386-conf.patch, HBASE-7386-conf-v2.patch,
> HBASE-7386-conf-v3.patch, HBASE-7386-master-00.patch, HBASE-7386-src.patch,
> HBASE-7386-v0.patch, supervisordconfigs-v0.patch
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095330#comment-16095330 ]

Hadoop QA commented on HBASE-7386:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 18s | Docker mode activated. |
| 0 | shelldocs | 0m 4s | Shelldocs was not available. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | pylint | 0m 1s | The patch generated 6 new + 0 unchanged - 0 fixed = 6 total (was 0) |
| -1 | shellcheck | 0m 7s | The patch generated 153 new + 489 unchanged - 9 fixed = 642 total (was 498) |
| -1 | whitespace | 0m 0s | The patch 3 line(s) with tabs. |
| +1 | hadoopcheck | 31m 40s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha4. |
| -1 | asflicense | 0m 12s | The patch generated 2 ASF License warnings. |
| | | 32m 32s | |

|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:757bf37 |
| JIRA Issue | HBASE-7386 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12878238/HBASE-7386-master-00.patch |
| Optional Tests | asflicense shellcheck shelldocs pylint |
| uname | Linux 1c2618903bc3 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh |
| git revision | master / bc93b66 |
| shellcheck | v0.4.6 |
| pylint | v1.7.2 |
| pylint | https://builds.apache.org/job/PreCommit-HBASE-Build/7737/artifact/patchprocess/diff-patch-pylint.txt |
| shellcheck | https://builds.apache.org/job/PreCommit-HBASE-Build/7737/artifact/patchprocess/diff-patch-shellcheck.txt |
| whitespace | https://builds.apache.org/job/PreCommit-HBASE-Build/7737/artifact/patchprocess/whitespace-tabs.txt |
| asflicense | https://builds.apache.org/job/PreCommit-HBASE-Build/7737/artifact/patchprocess/patch-asflicense-problems.txt |
| modules | C: . U: . |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/7737/console |
| Powered by | Apache Yetus 0.4.0 http://yetus.apache.org |

This message was automatically generated.

> Investigate providing some supervisor support for znode deletion
>
> Key: HBASE-7386
> URL: https://issues.apache.org/jira/browse/HBASE-7386
> Project: HBase
> Issue Type: Task
> Components: master, regionserver, scripts
> Reporter: Gregory Chanan
> Assignee: stack
> Priority: Blocker
> Fix For: 3.0.0
> Attachments: HBASE-7386-bin.patch, HBASE-7386-bin-v2.patch,
> HBASE-7386-bin-v3.patch, HBASE-7386-conf.patch, HBASE-7386-conf-v2.patch,
> HBASE-7386-conf-v3.patch, HBASE-7386-master-00.patch, HBASE-7386-src.patch,
> HBASE-7386-v0.patch, supervisordconfigs-v0.patch
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084730#comment-16084730 ] Samir Ahmic commented on HBASE-7386: [~stack] i have done some testing with last patches against master branch and good news is that most of code(with small changes) and functionality works fine. So original idea to improve MTTR by removing stale master and rs znodes plus watchdog which will restart process in case of unexpected failure is still valid. My original scripts here are written with idea to be optional route in managing hbase processes using supervisor, and that approach opens couple of questions which i would like to discuss: # Amount of code added and options to reduce it (i will anyway try to reduce it to minimum) probably some code can be integrated in exiting scripts to avoid copying # Where are we going to add new scripts supervisord folder inside bin dir was may original idea and same thing goes for config files supervisord folder in conf dir # Testing: i will cover supervisor 3.3.2 version(last stable) and some older version that are installed trough system packet manages # And finally would it be better to implement our own Java supervisor which would do similar thing as python implementation Based on what we decide i will continue work here, if we go with python supervisor i can have patch ready for testing in couple of days. 
> Investigate providing some supervisor support for znode deletion
>
>                 Key: HBASE-7386
>                 URL: https://issues.apache.org/jira/browse/HBASE-7386
>             Project: HBase
>          Issue Type: Task
>          Components: master, regionserver, scripts
>            Reporter: Gregory Chanan
>            Assignee: stack
>            Priority: Blocker
>         Attachments: HBASE-7386-bin.patch, HBASE-7386-bin-v2.patch,
> HBASE-7386-bin-v3.patch, HBASE-7386-conf.patch, HBASE-7386-conf-v2.patch,
> HBASE-7386-conf-v3.patch, HBASE-7386-src.patch, HBASE-7386-v0.patch,
> supervisordconfigs-v0.patch
>
> There a couple of JIRAs for deleting the znode on a process failure:
> HBASE-5844 (RS)
> HBASE-5926 (Master)
> which are pretty neat; on process failure, they delete the znode of the
> underlying process so HBase can recover faster.
> These JIRAs were implemented via the startup scripts; i.e. the script hangs
> around and waits for the process to exit, then deletes the znode.
> There are a few problems associated with this approach, as listed in the
> below JIRAs:
> 1) Hides startup output in script
> https://issues.apache.org/jira/browse/HBASE-5844?focusedCommentId=13463401&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13463401
> 2) two hbase processes listed per launched daemon
> https://issues.apache.org/jira/browse/HBASE-5844?focusedCommentId=13463409&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13463409
> 3) Not run by a real supervisor
> https://issues.apache.org/jira/browse/HBASE-5844?focusedCommentId=13463409&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13463409
> 4) Weird output after kill -9 actual process in standalone mode
> https://issues.apache.org/jira/browse/HBASE-5926?focusedCommentId=13506801&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13506801
> 5) Can kill existing RS if called again
> https://issues.apache.org/jira/browse/HBASE-5844?focusedCommentId=13463401&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13463401
> 6) Hides stdout/stderr[6]
> https://issues.apache.org/jira/browse/HBASE-5844?focusedCommentId=13506832&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13506832
> I suspect running in via something like supervisor.d can solve these issues
> if we provide the right support.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073079#comment-16073079 ]

stack commented on HBASE-7386:

Thank you [~asamir]
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072789#comment-16072789 ]

Samir Ahmic commented on HBASE-7386:

[~stack] anything that improves MTTR and cluster operability deserves a decent retry :). I will take a look at the work done here and check whether the same or a similar concept can be applied to current HBase clusters.
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071439#comment-16071439 ]

stack commented on HBASE-7386:

What do you think of this, [~asamir]? Is it salvageable?
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005054#comment-14005054 ]

Shengzhe Yao commented on HBASE-7386:

Any updates? Are we going to merge the latest patch? This is a pretty cool feature; please let me know if there is anything I can help with :)
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869578#comment-13869578 ]

Samir Ahmic commented on HBASE-7386:

Update: with [HBASE-10310] fixed, master to backup master failover time is 4s when the cluster is controlled with supervisor and 3s when it is controlled with the standard scripts.
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869629#comment-13869629 ]

Nicolas Liochon commented on HBASE-7386:

Thanks a lot for the fix of HBASE-10310, Samir.

I went through your patch. It's a difficult read when you don't know supervisor ;-).
The definition of 'PROCESS_STATE_UNKNOWN' is a little scary (as we kill the region server when we reach this state).
There are some typos ('Test is supevisored installed' instead of supervisord).
I'm not sure about stuff like 'subprocess.call('/bin/mail -s HBASE_PROCESS_EVENT %s %s'%(email, tmp_file), shell=True)': it seems machine dependent; there is no /bin/mail on my Ubuntu desktop.
Do we have to use python?
It would be good to have a review from someone who knows supervisor...
As well, this should be documented in the HBase reference guide, imho.
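The portability concern about '/bin/mail' and shell=True can be illustrated with a short sketch. This is not code from the patch: the function name send_event_mail and its parameters are hypothetical, and it assumes a mailx-style binary that accepts "-s SUBJECT RECIPIENT" and reads the message body from stdin. It looks the binary up on the PATH instead of hard-coding its location, skips the notification when no such binary exists, and passes an argument list so that no shell is involved.

```python
import shutil
import subprocess


def send_event_mail(recipient, body_path, mail_binary="mail",
                    subject="HBASE_PROCESS_EVENT"):
    """Mail the file at body_path to recipient.

    Returns False when no mail binary is found on the PATH or when the
    mail command exits with a non-zero status. All names here are
    illustrative assumptions, not part of the attached patches.
    """
    mail_path = shutil.which(mail_binary)
    if mail_path is None:
        # No mail client on this machine: treat the notification as optional.
        return False
    with open(body_path, "rb") as body:
        # Argument list instead of a shell string, so the recipient and
        # subject are never interpreted by a shell.
        completed = subprocess.run(
            [mail_path, "-s", subject, recipient],
            stdin=body,
            check=False,
        )
    return completed.returncode == 0
```

On a machine without a mail client the call simply returns False, which would let a listener keep the notification step optional rather than fail.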
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869915#comment-13869915 ]

Samir Ahmic commented on HBASE-7386:

Thanks for the review [~nkeywal].

I agree about 'PROCESS_STATE_UNKNOWN'. I checked it in the supervisor source code and it looks like it is used when supervisor is unable to determine the state of a process. I will remove it from the event listener since it can cause issues.

I was planning to make mail notification optional, or even to create a separate event listener that handles email notifications. '/bin/mail' is the simplest solution, and following that example folks could develop their own solution. How do you think this should be handled?

bq. Do we have to use python?
According to the documentation: "Event listener can be written in any language supported by the platform you’re using to run supervisor. There is special library support for Python in the form of a supervisor.childutils module, which makes creating event listeners in Python slightly easier than in other languages." Any suggestions for what we should use instead of python? Java?

When we complete this work it should be documented, probably under "15. Apache HBase Operational Management"?
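As a rough illustration of why the listener language is open, here is a minimal sketch of supervisor's event listener protocol using only the Python standard library (no supervisor.childutils). It is not the patch's zk_cleaner.py: the names run_listener, needs_cleanup and log_cleanup are hypothetical, the cleanup callback is a placeholder, and the set of states reflects the discussion above (PROCESS_STATE_UNKNOWN left out). The listener announces READY on stdout, reads a header line of key:value tokens, reads a payload of "len" characters, and acknowledges with "RESULT 2\nOK".

```python
import sys

# States that trigger znode cleanup in this sketch. PROCESS_STATE_UNKNOWN is
# deliberately excluded, following the review discussion.
CLEANUP_EVENTS = frozenset({"PROCESS_STATE_STOPPING", "PROCESS_STATE_EXITED"})


def parse_tokens(text):
    """Parse supervisor's space-separated 'key:value' tokens into a dict."""
    return dict(token.split(":", 1) for token in text.split())


def needs_cleanup(headers):
    """Return True when the event name is one we clean up after."""
    return headers.get("eventname") in CLEANUP_EVENTS


def log_cleanup(process_name):
    # Hypothetical placeholder for the real znode deletion. It writes to
    # stderr because stdout is reserved for the listener protocol.
    sys.stderr.write("would delete znode for %s\n" % process_name)


def run_listener(stdin=sys.stdin, stdout=sys.stdout, on_cleanup=log_cleanup):
    """Process events until stdin is closed."""
    while True:
        # Tell supervisor we are ready for the next event.
        stdout.write("READY\n")
        stdout.flush()
        header_line = stdin.readline()
        if not header_line:
            break
        headers = parse_tokens(header_line)
        # The payload follows the header and is exactly 'len' characters long.
        payload = parse_tokens(stdin.read(int(headers["len"])))
        if needs_cleanup(headers):
            on_cleanup(payload["processname"])
        # Acknowledge the event so supervisor can send the next one.
        stdout.write("RESULT 2\nOK")
        stdout.flush()
```

A script registered as an event listener in the supervisord configuration would call run_listener() as its entry point. The same READY/RESULT exchange could be written in Java or shell; Python only saves some parsing.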
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868220#comment-13868220 ]

stack commented on HBASE-7386:

Related, from our boys at Xiaomi https://github.com/XiaoMi/minos
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865272#comment-13865272 ]

Nicolas Liochon commented on HBASE-7386:

bq. i have verify that supervisor approach improves master failover in my testing this time is ~7s when using supervisor and when using standard scripts it is ~40s

This is strange. Do you know why? What is the test scenario?
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865479#comment-13865479 ]

Samir Ahmic commented on HBASE-7386:

From what I could see, it is all about removing the master znode in ZooKeeper. In the supervisor scenario the master znode is deleted by autorestart, and with the standard scripts we don't delete the master znode. Is the master znode ephemeral? It should be gone when the master dies. The test scenario is very simple:
* distributed cluster, 0.96
* start the master and the backup master on different machines
* date; kill -9 the master and watch the logs on the backup master to see when it becomes active
* I have also used a Python-based script that watches the '/hbase/master' znode and detects changes
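The timing part of that test scenario can be sketched as a small polling helper. This is an illustration only, not the script used in the test: measure_failover and its parameters are hypothetical, and the read_active_master callable is assumed to be backed by some ZooKeeper client that returns the current content of '/hbase/master' (or None while the znode is absent). The clock and sleep functions are injectable so the logic can be exercised without a cluster.

```python
import time


def measure_failover(read_active_master, old_master, poll_interval=0.1,
                     timeout=120.0, clock=time.monotonic, sleep=time.sleep):
    """Return the seconds elapsed until a different master becomes active.

    read_active_master is a hypothetical callable returning the current
    active master (any comparable value) or None when no master znode
    exists. Raises TimeoutError if nothing changes within timeout seconds.
    """
    start = clock()
    while clock() - start < timeout:
        current = read_active_master()
        # A missing znode is not a failover yet; wait for a new master.
        if current is not None and current != old_master:
            return clock() - start
        sleep(poll_interval)
    raise TimeoutError("no new active master within %.1f seconds" % timeout)
```

Starting the helper immediately before the kill -9 gives a number comparable to the failover times quoted in this thread, to within one polling interval.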
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865513#comment-13865513 ]

Nicolas Liochon commented on HBASE-7386:

bq. standard scripts we don't delete master znode

We should; that's what HBASE-5926 is about. It used to work for sure.

It's better to delete it just after the server death, as the restart may never happen...
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865571#comment-13865571 ]

Samir Ahmic commented on HBASE-7386:

bq. It's better to delete it just after the server death, as the restart may never happen...

Are you suggesting that I modify the 'zk_cleaner.py' listener script to delete the master znode when it detects that the master is in one of these states ('PROCESS_STATE_STOPPING', 'PROCESS_STATE_EXITED', 'PROCESS_STATE_UNKNOWN')? I'm already doing this for regionservers, so it should be a few lines of code.

Investigate providing some supervisor support for znode deletion
Key: HBASE-7386
URL: https://issues.apache.org/jira/browse/HBASE-7386
Project: HBase
Issue Type: Task
Components: master, regionserver, scripts
Reporter: Gregory Chanan
Assignee: stack
Priority: Blocker
Attachments: HBASE-7386-bin-v2.patch, HBASE-7386-bin.patch, HBASE-7386-conf-v2.patch, HBASE-7386-conf.patch, HBASE-7386-src.patch, HBASE-7386-v0.patch, supervisordconfigs-v0.patch
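The listener idea discussed above can be sketched roughly as follows. This is not the 'zk_cleaner.py' attached to the issue; it is a minimal illustration of supervisor's event listener protocol (READY / RESULT handshake over stdin/stdout), assuming the master runs as a supervisor program named hbase-MASTER. The function names and the `hbase zkcli rmr` cleanup command (an idea mentioned elsewhere in this thread) are illustrative assumptions.

```python
import sys

# Supervisor event types that, per the comment above, would trigger znode cleanup.
CLEANUP_EVENTS = {
    "PROCESS_STATE_STOPPING",
    "PROCESS_STATE_EXITED",
    "PROCESS_STATE_UNKNOWN",
}


def parse_tokens(line):
    """Turn 'key:value key:value' (supervisor's header/payload format) into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def should_clean(headers, payload, watched):
    """True when the event reports a watched process entering a cleanup state."""
    return (
        headers.get("eventname") in CLEANUP_EVENTS
        and payload.get("processname") in watched
    )


def cleanup_command(znode):
    """Hypothetical cleanup call, modelled on the 'hbase zkcli rmr' idea in this thread."""
    return ["hbase", "zkcli", "rmr", znode]


def listen(stdin=sys.stdin, stdout=sys.stdout, watched=("hbase-MASTER",)):
    """Minimal event loop following supervisor's READY / RESULT listener protocol."""
    while True:
        stdout.write("READY\n")
        stdout.flush()
        header_line = stdin.readline()
        if not header_line:
            break  # supervisord closed the pipe
        headers = parse_tokens(header_line)
        payload = parse_tokens(stdin.read(int(headers["len"])))
        if should_clean(headers, payload, watched):
            # A real script would run cleanup_command(...) here, e.g. via subprocess.
            sys.stderr.write("would clean znode for %s\n" % payload["processname"])
        stdout.write("RESULT 2\nOK")
        stdout.flush()


if __name__ == "__main__":
    listen()
```

Such a script would be registered in an [eventlistener:x] section subscribed to PROCESS_STATE events; the znode path passed to the cleanup command depends on the cluster's ZooKeeper layout and is left to the caller.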
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865594#comment-13865594 ]

Nicolas Liochon commented on HBASE-7386:

Maybe ;-). Just that if the process exits, we can clean the ZK node immediately, ideally w/o relying on a separate watchdog. What's PROCESS_STATE_UNKNOWN?

Investigate providing some supervisor support for znode deletion
Key: HBASE-7386
URL: https://issues.apache.org/jira/browse/HBASE-7386
Project: HBase
Issue Type: Task
Components: master, regionserver, scripts
Reporter: Gregory Chanan
Assignee: stack
Priority: Blocker
Attachments: HBASE-7386-bin-v2.patch, HBASE-7386-bin.patch, HBASE-7386-conf-v2.patch, HBASE-7386-conf.patch, HBASE-7386-src.patch, HBASE-7386-v0.patch, supervisordconfigs-v0.patch
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865644#comment-13865644 ]

Samir Ahmic commented on HBASE-7386:

bq. What's the PROCESS_STATE_UNKNOWN?

According to the supervisor documentation [http://supervisord.org/subprocess.html]: "The process is in an unknown state (supervisord programming error)."

Investigate providing some supervisor support for znode deletion
Key: HBASE-7386
URL: https://issues.apache.org/jira/browse/HBASE-7386
Project: HBase
Issue Type: Task
Components: master, regionserver, scripts
Reporter: Gregory Chanan
Assignee: stack
Priority: Blocker
Attachments: HBASE-7386-bin-v2.patch, HBASE-7386-bin.patch, HBASE-7386-conf-v2.patch, HBASE-7386-conf.patch, HBASE-7386-src.patch, HBASE-7386-v0.patch, supervisordconfigs-v0.patch
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857994#comment-13857994 ]

Samir Ahmic commented on HBASE-7386:

Thanks for the review and comments, [~stack].

In respect of usage and documentation, these scripts follow the same logic as the scripts in the bin directory. For example, here is the output of the start-supervisord-hbase.sh command:
{code}
$ $HBASE_HOME/bin/supervisord/start-supervisord-hbase.sh
localhost: hbase-ZK: started
hbase-MASTER: started
localhost: hbase-RS: started
{code}
So, considering usage, these relations hold:
start-hbase.sh ~= start-supervisord-hbase.sh
stop-hbase.sh ~= stop-supervisord-hbase.sh
hbase-daemon.sh ~= hbase-supervisord.sh

I agree there is a danger that the scripts 'rot', but I also believe that this approach can solve a number of issues for ops people and generally improve HBase MTTR. What is your suggestion for addressing the 'rot' issue? graceful_stop.sh from the bin dir can be modified to avoid copy/paste. I will also check the rest of the scripts to try to reduce the amount of copy/paste.

migrate_to_supervisord.sh will switch a running cluster that was started with the scripts from the bin directory over to supervisor. It stops the hbase daemons on the nodes using hbase-daemon.sh and then starts them using the hbase-supervisord.sh script (revert_to_scripts.sh does the opposite).

For the master, the znode is removed by the autostart method (patch in HMasterCommandLine.java) at the moment of starting. We have autorestart=true in the supervisor config, so if the master process dies unexpectedly, supervisor will kick off an autorestart, and at that moment the znode will be removed, giving enough time for the backup master to become active. The alternative is to craft a listener script, similar to mail_notification.py, that removes the master znode when it detects that the process is exiting.

Regarding RS znodes, the scripts do not remove them yet. I was thinking about a listener script (similar to mail_notification.py) calling 'hbase zkcli rmr RSznode', or we can modify HRegionServerCommandLine.java and add 'autorestart' like in HMasterCommandLine.java. What is your suggestion for addressing this?

Basically, all these scripts are wrappers around the 'supervisord' and 'supervisorctl' commands, which are python based. I hope I have clarified some details.

Cheers

Investigate providing some supervisor support for znode deletion
Key: HBASE-7386
URL: https://issues.apache.org/jira/browse/HBASE-7386
Project: HBase
Issue Type: Task
Components: master, regionserver, scripts
Reporter: Gregory Chanan
Assignee: stack
Priority: Blocker
Attachments: HBASE-7386-bin.patch, HBASE-7386-conf.patch, HBASE-7386-src.patch, HBASE-7386-v0.patch, supervisordconfigs-v0.patch
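To make the autorestart=true setup described above concrete, here is a rough sketch that renders supervisord [program:x] sections. It is not the configuration from the attached patches: the program name is taken from the sample output in the comment, while the command path, the log directory, and the helper names are hypothetical. The option names (command, autostart, autorestart, stdout_logfile, stderr_logfile) are standard supervisord program settings.

```python
import configparser
import io


def program_section(name, command, log_dir):
    """Build the options for one [program:x] section."""
    return {
        "command": command,
        "autostart": "true",
        "autorestart": "true",  # restart the daemon if it exits unexpectedly
        "stdout_logfile": "%s/%s.out" % (log_dir, name),
        "stderr_logfile": "%s/%s.err" % (log_dir, name),
    }


def render(programs, log_dir):
    """Render a mapping of program name -> command line as supervisord config text."""
    parser = configparser.ConfigParser(interpolation=None)
    for name, command in programs.items():
        parser["program:" + name] = program_section(name, command, log_dir)
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical install path and log directory, for illustration only.
    print(render({"hbase-MASTER": "/opt/hbase/bin/hbase master start"},
                 "/var/log/hbase"))
```

Because supervisord captures stdout and stderr per program, a layout like this would also address the hidden-output complaints in the issue description, assuming the daemons are run in the foreground.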
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857589#comment-13857589 ]

stack commented on HBASE-7386:

[~asamir] Excellent! Thank you for working on this. If you add a little detail on how it looks when in place and working, I can add a section to the reference guide that points at these scripts (I can commit the autorestart changes, no problem).

We 'could' commit these scripts. It might encourage folks to go the supervised route. The only issue is that they may 'rot'. Do you have to 'copy' the graceful_stop.sh script? Can you not point at the original (any needed changes we can make, no problem)? Ditto for any other changes you'd need so you could reduce the amount of copy/paste.

FYI, these lines are unnecessary:
'+# * Copyright 2007 The Apache Software Foundation'
You probably don't want this in the scripts:
+email=sa...@personal.com
(call out the necessary config in the README?)

How do these migrate scripts work? They look nice if they are doing what I think they are doing. I see mail notification, but do you actually do the removal of a znode? When you say python supervise, you just mean the script that supervise has as a callback is in python?

Nice job [~asamir]

Investigate providing some supervisor support for znode deletion
Key: HBASE-7386
URL: https://issues.apache.org/jira/browse/HBASE-7386
Project: HBase
Issue Type: Task
Components: master, regionserver, scripts
Reporter: Gregory Chanan
Assignee: stack
Priority: Blocker
Attachments: HBASE-7386-bin.patch, HBASE-7386-conf.patch, HBASE-7386-src.patch, HBASE-7386-v0.patch, supervisordconfigs-v0.patch
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856860#comment-13856860 ]

Samir Ahmic commented on HBASE-7386:

Based on what was discussed here, I have created a set of scripts and configs that support running hbase processes under python supervisor control. I did not touch any scripts from the bin directory, as I see this as an optional solution for managing hbase processes. The only change in the hbase source is adding an 'autorestart' option in HMasterCommandLine.java. The scripts do not require any config changes or additional env variables. All testing was done on the 0.96 branch and partially on 0.94, using supervisor version 3.0; the OS was CentOS 5.8 and 5.10. All suggestions and criticism are welcome :)

Cheers

Investigate providing some supervisor support for znode deletion
Key: HBASE-7386
URL: https://issues.apache.org/jira/browse/HBASE-7386
Project: HBase
Issue Type: Task
Components: master, regionserver, scripts
Reporter: Gregory Chanan
Assignee: stack
Priority: Blocker
Attachments: HBASE-7386-bin.patch, HBASE-7386-conf.patch, HBASE-7386-src.patch, HBASE-7386-v0.patch, supervisordconfigs-v0.patch
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836520#comment-13836520 ]

Samir Ahmic commented on HBASE-7386:

Hi [~stack],

Is someone working on this currently? This looks like a great idea. I would like to continue this work if we still think this is a good direction.

Cheers

Investigate providing some supervisor support for znode deletion
Key: HBASE-7386
URL: https://issues.apache.org/jira/browse/HBASE-7386
Project: HBase
Issue Type: Task
Components: master, regionserver, scripts
Reporter: Gregory Chanan
Assignee: stack
Priority: Blocker
Attachments: HBASE-7386-v0.patch, supervisordconfigs-v0.patch
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536656#comment-13536656 ]

Gregory Chanan commented on HBASE-7386:

Thanks for the feedback; I agree with what both of you are saying. I'm working on a start-hbase-like script.

bq. What to do for the case where a shop has chosen other than supervisord to monitor their processes? I suppose we could let them do the conversion from 'supervise' to 'god', etc.?

I don't think we should worry about it for the first pass. I figure if a shop is familiar with a different tool, they should be able to write their own scripts/configs with the supervisord ones we provide as a guide.

Investigate providing some supervisor support for znode deletion
Key: HBASE-7386
URL: https://issues.apache.org/jira/browse/HBASE-7386
Project: HBase
Issue Type: Task
Components: master, regionserver, scripts
Reporter: Gregory Chanan
Assignee: Gregory Chanan
Fix For: 0.96.0
Attachments: HBASE-7386-v0.patch, supervisordconfigs-v0.patch
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535517#comment-13535517 ]

Gregory Chanan commented on HBASE-7386:

I'm going to tackle the Master case, HBASE-5926, first. The proposal is to rip out the script modification of HBASE-5926 and replace it with supervisor.d support. The java code from HBASE-5926 would of course stay; only the restart/cleanup code in the scripts would go.

The goal is to solve the issues listed above:
#2 is solved by running via supervisor (jps reports only one HMaster process when HMaster is started via supervisor.d).
#3 is self-evidently solved by running on a real supervisor.
#5 is clearly not relevant, as it is only related to the RS.

So, we are left with #1, #4, and #6.
#1 and #6 are solved because the previous script behavior returns when not run in supervisor, and supervisor can redirect stdout/stderr when it is.
#4 is solved because the old script behavior returns when not run in supervisor, and I don't think it makes any sense to run a standalone HBase via supervisor.

Example patch coming soon.

Investigate providing some supervisor support for znode deletion
Key: HBASE-7386
URL: https://issues.apache.org/jira/browse/HBASE-7386
Project: HBase
Issue Type: Task
Components: master, regionserver, scripts
Reporter: Gregory Chanan
Assignee: Gregory Chanan
Fix For: 0.96.0
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535562#comment-13535562 ]

Gregory Chanan commented on HBASE-7386:

Thanks for pointing that out, Enis. What I'm doing here is compatible with that, I think: rather than deleting the znode, we just expire the session. It should be a small change.

Investigate providing some supervisor support for znode deletion
Key: HBASE-7386
URL: https://issues.apache.org/jira/browse/HBASE-7386
Project: HBase
Issue Type: Task
Components: master, regionserver, scripts
Reporter: Gregory Chanan
Assignee: Gregory Chanan
Fix For: 0.96.0
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535587#comment-13535587 ]

Gregory Chanan commented on HBASE-7386:
---------------------------------------

Here is how I verified this all works (apply both patches and set HBASE_HOME, HBASE_LOG_DIR in your environment but nothing else). Also make sure supervisord is installed.
{noformat}
1. $HBASE_HOME/bin/hbase-daemons.sh --config start zookeeper
2. cd $HBASE_HOME/conf/supervisord
3. supervisord -c supervisord.conf (call jps and record the pid of the HMaster)
4. $HBASE_HOME/bin/hbase-daemons.sh --config --hosts localhost.txt start regionserver (localhost.txt is a text file with the line localhost)
5. $HBASE_HOME/bin/./local-master-backup.sh start 1
6. kill -9 [PID_OF_HMASTER_RECORDED_AT_STEP_3]
{noformat}
and notice the backup master takes over right away.

Also use start-hbase.sh and notice it has the same behavior as before (does not swallow output).

> Investigate providing some supervisor support for znode deletion
>
>         Attachments: HBASE-7386-v0.patch, supervisordconfigs-v0.patch
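Steps 3 and 6 of the procedure above (record the HMaster pid from jps, then kill -9 it) can be scripted. The helper below is a small sketch, not part of either patch; hmaster_pid is a hypothetical name, and it assumes jps prints one "<pid> <MainClass>" pair per line.

```shell
#!/bin/sh
# Sketch only: print the pid of the first JVM whose main class is HMaster.
# Reads jps-style lines ("<pid> <MainClass>") on stdin.
hmaster_pid() {
  awk '$2 == "HMaster" { print $1; exit }'
}

# Usage (illustrative), covering steps 3 and 6:
#   pid=$(jps | hmaster_pid)
#   [ -n "$pid" ] && kill -9 "$pid"
```

If a backup master is already running on the same host, jps lists more than one HMaster, so record the pid before step 5 as the original steps do.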
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535590#comment-13535590 ]

Gregory Chanan commented on HBASE-7386:
---------------------------------------

Open Questions:
1) Do we think this is a useful direction?
2) If so, what level of supervisor scripts do we provide?
   None?
   Samples?
   Actual scripts that can be run [i.e. mimicking hbase-daemon.sh -- probably want to move all the config setup to a common file]?
   Actual scripts + ssh scripting to actually run with a start-hbase.sh, i.e. start-supervised-hbase.sh?
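To make the "actual scripts that can be run" option in question 2 a little more concrete, here is a sketch of generating a supervisord program section from the same environment variables the verification steps rely on (HBASE_HOME, HBASE_LOG_DIR). This is not the content of supervisordconfigs-v0.patch; hbase_program_section, the program name, and the log file name are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: emit a supervisord [program:x] section for one HBase daemon.
# $1 = daemon name, e.g. master or regionserver.
# Assumes HBASE_HOME and HBASE_LOG_DIR are set in the environment.
hbase_program_section() {
  daemon=$1
  if [ -z "${HBASE_HOME:-}" ] || [ -z "${HBASE_LOG_DIR:-}" ]; then
    echo "HBASE_HOME and HBASE_LOG_DIR must be set" >&2
    return 1
  fi
  # "bin/hbase <daemon> start" keeps the JVM in the foreground, which is
  # what a supervisor needs in order to track the process directly.
  cat <<EOF
[program:hbase-${daemon}]
command=${HBASE_HOME}/bin/hbase ${daemon} start
stdout_logfile=${HBASE_LOG_DIR}/hbase-${daemon}.out
redirect_stderr=true
EOF
}

# Usage (illustrative):
#   hbase_program_section master > hbase-master.conf
```

Generating the section from the environment is one way to keep the config setup in a common place; whether that is preferable to static config files is exactly the open question.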
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535721#comment-13535721 ]

nkeywal commented on HBASE-7386:
--------------------------------

bq. 1) Do we think this is a useful direction?
Personally, I do. We just need to have a default way to start/stop and ideally monitor HBase. Scripts are the simplest way. Supervisor is better.

bq. 2) If so, what level of supervisor scripts do we provide?
Samples are doomed to be broken. So I would go for Actual scripts + ssh scripting to actually run with a start-hbase.sh, i.e. start-supervised-hbase.sh.
Is there any reason not to use supervisor by default (licenses, supported platforms)?
[jira] [Commented] (HBASE-7386) Investigate providing some supervisor support for znode deletion
[ https://issues.apache.org/jira/browse/HBASE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535744#comment-13535744 ]

stack commented on HBASE-7386:
------------------------------

I'd say just repurpose 'autorestart', especially if it is now broken. What was there previously was mickey mouse. This is the real deal.

bq. ... and does not respect any other environment settings (e.g. HBASE_CONF_DIR).
Would this be fixed if we ...through all the config files in hbase-daemon and do something appropriate?

On questions:

1. Yes, this is a valid direction. Perhaps we could extract the stuff you hacked out into a 'wrapper' script, a poor man's supervise, such that it was there as an option... you could run it if you wanted poor man's supervise but otherwise, scripts ran as they used to. But this would likely be wasted effort... effort better spent getting it so optionally, if supervisord was installed, you could just run with it.

2. I agree with nkeywal that templates/samples inevitably rot. Unused software also rots, so if we provide supervisord scripts and they are not used, they will go bad. How much work is involved in making it so you could do ./bin/start-supervisord-hbase.sh? Would be coolio if you could do ./bin/start-hbase.sh and ./bin/start-supervisord-hbase.sh if supervisor is available (likely on most systems I'd say) and then in doc. we encourage folks to do the latter.

What to do for the case where a shop has chosen other than supervisord to monitor their processes? I suppose we could let them do the conversion from 'supervise' to 'god', etc.?

This is great stuff G.
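The "if supervisord was installed, you could just run with it" idea from the comment above could look roughly like this. It is a sketch under assumptions: pick_start_script is a hypothetical helper, and bin/start-supervisord-hbase.sh is the script name floated in the comment, not an existing script.

```shell
#!/bin/sh
# Sketch only: print the start script to use. Picks the supervised variant
# when a supervisord executable is on PATH, and the plain script otherwise.
# bin/start-supervisord-hbase.sh is a proposed name, not an existing script.
pick_start_script() {
  if command -v supervisord >/dev/null 2>&1; then
    echo "bin/start-supervisord-hbase.sh"
  else
    echo "bin/start-hbase.sh"
  fi
}

# Usage (illustrative):
#   "$HBASE_HOME/$(pick_start_script)"
```

A shop that uses a different process monitor would need its own check here, which is the conversion question raised at the end of the comment.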