Re: [gridengine users] can't delete an exec host
Try qconf -de host_list

Cheers,

On Thu, Sep 7, 2017 at 3:22 AM, Michael Stauffer wrote:
> On Wed, Sep 6, 2017 at 12:42 PM, Reuti wrote:
>> Just to mention, that it's safe to restart the qmaster or reboot even the
>> machine the qmaster is running on. Nothing will happen to the running jobs
>> on the exechosts.
>
> OK good to know. I've done that before and seen them finish, although some
> googling suggested people have seen jobs get killed. Does a qmaster
> restart, however, empty the queue? I imagine a reboot would too, unless the
> queue is stored in a file?
>
> -M
Re: [gridengine users] can't delete an exec host
On 06.09.2017 at 19:22, Michael Stauffer wrote:

> On Wed, Sep 6, 2017 at 12:42 PM, Reuti wrote:
>> Just to mention, that it's safe to restart the qmaster or reboot even the
>> machine the qmaster is running on. Nothing will happen to the running jobs
>> on the exechosts.
>
> OK good to know. I've done that before and seen them finish, although some
> googling suggested people have seen jobs get killed.

No. NB: They will get killed, though, if you shut down the "sgeexecd" on an exechost with the conventional "stop" argument. Supplying the argument "softstop" instead will allow them to continue, although they are then no longer supervised by the "sgeexecd". Sometimes this can be handy, in case a user gave an expected h_rt which is too short for the job and it's necessary to let the job continue to run.

> Does a qmaster restart, however, empty the queue?

No.

> I imagine a reboot would too, unless the queue is stored in a file?

All vital information is stored in flat files or BDB. The only thing that is lost is the list of completed jobs, aka zombies (which you can see with `qstat -s z`; the number kept can be set with the `qconf -mconf` entry "finished_jobs").

-- Reuti
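For reference, the commands named in this reply can be sketched as below. This is only an illustration: the location of the "sgeexecd" startup script under $SGE_ROOT/$SGE_CELL/common, the fallback values /opt/sge and default, and the helper name execd_script are assumptions, not something stated in the thread. The softstop call is left commented out because it stops a daemon.

```shell
#!/bin/sh
# Sketch only. SGE_ROOT/SGE_CELL fallbacks are assumptions; adjust to your install.
SGE_ROOT="${SGE_ROOT:-/opt/sge}"
SGE_CELL="${SGE_CELL:-default}"

# Hypothetical helper: path of the execd startup script for a given root and cell.
execd_script() {
    printf '%s/%s/common/sgeexecd\n' "$1" "$2"
}

if command -v qstat >/dev/null 2>&1; then
    # On an exec host: stop the execd but let running jobs continue.
    # Uncomment deliberately; a plain "stop" would kill the jobs.
    # "$(execd_script "$SGE_ROOT" "$SGE_CELL")" softstop

    # Finished ("zombie") jobs the qmaster still remembers.
    qstat -s z

    # Current value of finished_jobs; change it with `qconf -mconf`.
    qconf -sconf | grep finished_jobs
else
    echo "Grid Engine client commands not found; nothing to show." >&2
fi
```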
Re: [gridengine users] can't delete an exec host
On Wed, Sep 6, 2017 at 12:42 PM, Reuti wrote:

> > 1. restart SGE master
> >
> > Is it safe to do this with jobs queued and running? I think it's not
> > reliable, i.e. jobs can get killed and de-queued?
>
> Just to mention, that it's safe to restart the qmaster or reboot even the
> machine the qmaster is running on. Nothing will happen to the running jobs
> on the exechosts.

OK good to know. I've done that before and seen them finish, although some googling suggested people have seen jobs get killed. Does a qmaster restart, however, empty the queue? I imagine a reboot would too, unless the queue is stored in a file?

-M
Re: [gridengine users] can't delete an exec host
> On 06.09.2017 at 17:33, Michael Stauffer wrote:
>
> On Wed, Sep 6, 2017 at 11:16 AM, Feng Zhang wrote:
>> It seems SGE master did not get refreshed with new hostgroup. Maybe you can
>> try:
>>
>> 1. restart SGE master
>
> Is it safe to do this with jobs queued and running? I think it's not
> reliable, i.e. jobs can get killed and de-queued?

Just to mention, that it's safe to restart the qmaster or reboot even the machine the qmaster is running on. Nothing will happen to the running jobs on the exechosts.

-- Reuti

> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
Re: [gridengine users] can't delete an exec host
That was it, thanks! The node had failed so I didn't think there'd be anything running on there, but two jobs were stuck in the basic.q on that node. I've killed them and now can remove host compute-2-4.

-M

On Wed, Sep 6, 2017 at 11:41 AM, Feng Zhang wrote:
> Are there any running jobs on the queue instance basic.q@compute-2-4?
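A sketch of the cleanup described in this message, for anyone who hits the same error. The thread does not say how the jobs were killed; using `qdel -f` (forced deletion, which normally needs manager or operator rights) is an assumption, as is the helper name job_ids. Review the job list before running anything like this.

```shell
#!/bin/sh
host=compute-2-4.local   # failed node from the thread
queue=basic.q            # queue that still referenced it

# Hypothetical helper: extract job IDs from plain `qstat` output on stdin.
# Assumes the default layout: two header lines, numeric job ID in column 1.
job_ids() {
    awk 'NR > 2 && $1 ~ /^[0-9]+$/ { print $1 }' | sort -u
}

if command -v qstat >/dev/null 2>&1; then
    # Jobs of all users that still occupy the queue instance on the dead node.
    ids=$(qstat -u '*' -q "$queue@$host" | job_ids)
    for id in $ids; do
        # -f removes the job from the qmaster even if the execd does not answer.
        qdel -f "$id"
    done
    # With no jobs left in the queue instance, the host object can be deleted.
    qconf -de "$host"
else
    echo "Grid Engine commands not found; skipping." >&2
fi
```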
Re: [gridengine users] can't delete an exec host
Are there any running jobs on the queue instance basic.q@compute-2-4?

On Wed, Sep 6, 2017 at 11:33 AM, Michael Stauffer wrote:
> I've done this, but it's not refreshing (been about 10 minutes now). I'm
> still getting the error when I try to delete exec host compute-2-4, and
> qhost is still showing basic.q on the nodes in @basichosts.
>
> Interestingly, host compute-2-4 was removed from another queue
> (qlogin.basic.q) that also uses @basichosts, so it's something about basic.q
> that's stuck.
>
> Is there some way to refresh things other than restarting qmaster?
>
> -M

--
Best,

Feng
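One way to answer that question from the command line is sketched below. Queue instances are addressed as queue@host; the helper name queue_instance is hypothetical, and the fully qualified host name compute-2-4.local is taken from the original post.

```shell
#!/bin/sh
# Hypothetical helper: build a queue instance name in queue@host form.
queue_instance() {
    printf '%s@%s\n' "$1" "$2"
}

qi=$(queue_instance basic.q compute-2-4.local)

if command -v qstat >/dev/null 2>&1; then
    # Jobs of all users in this one queue instance.
    qstat -u '*' -q "$qi"
    # Full listing for the same instance, including its state column.
    qstat -f -u '*' -q "$qi"
else
    echo "qstat not found; would query $qi" >&2
fi
```

If either listing shows jobs, the queue instance is still in use and the host stays referenced by the cluster queue.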
Re: [gridengine users] can't delete an exec host
On Wed, Sep 6, 2017 at 11:16 AM, Feng Zhang wrote:

> It seems SGE master did not get refreshed with new hostgroup. Maybe you
> can try:
>
> 1. restart SGE master

Is it safe to do this with jobs queued and running? I think it's not reliable, i.e. jobs can get killed and de-queued?

> or
>
> 2. change basic.q, "hostlist" to any node, like "compute-1-0.local",
> wait till it gets refreshed; then change it back to "@basichosts".

I've done this, but it's not refreshing (been about 10 minutes now). I'm still getting the error when I try to delete exec host compute-2-4, and qhost is still showing basic.q on the nodes in @basichosts.

Interestingly, host compute-2-4 was removed from another queue (qlogin.basic.q) that also uses @basichosts, so it's something about basic.q that's stuck.

Is there some way to refresh things other than restarting qmaster?

-M
Re: [gridengine users] can't delete an exec host
It seems SGE master did not get refreshed with new hostgroup. Maybe you can try:

1. restart SGE master

or

2. change basic.q, "hostlist" to any node, like "compute-1-0.local", wait till it gets refreshed; then change it back to "@basichosts".

On Wed, Sep 6, 2017 at 10:29 AM, Michael Stauffer wrote:
> I'm having trouble deleting an execution host. I've removed it from the
> host group, but when I try to delete with qconf, it says it's still part of
> 'basic.q'.
>
> [root@chead ~]# qconf -de compute-2-4.local
> Host object "compute-2-4.local" is still referenced in cluster queue
> "basic.q".

--
Best,

Feng
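Option 2 above could be scripted roughly as follows. This is a sketch under assumptions: it uses `qconf -rattr` to replace the whole hostlist attribute, which the thread does not mention (editing the queue interactively with `qconf -mq basic.q` would work as well), and the helper name hostlist_of is hypothetical. While the hostlist points at a single node, the queue is not offered on the other nodes.

```shell
#!/bin/sh
queue=basic.q
group=@basichosts
tmp_host=compute-1-0.local

# Hypothetical helper: print the first hostlist value from `qconf -sq` output
# on stdin. Continuation lines of a long hostlist are ignored.
hostlist_of() {
    awk '$1 == "hostlist" { print $2; exit }'
}

if command -v qconf >/dev/null 2>&1; then
    # Point the queue at one node only.
    qconf -rattr queue hostlist "$tmp_host" "$queue"
    # Check what the qmaster now reports.
    qconf -sq "$queue" | hostlist_of
    # Switch back to the host group.
    qconf -rattr queue hostlist "$group" "$queue"
else
    echo "qconf not found; skipping." >&2
fi
```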
[gridengine users] can't delete an exec host
SoGE 8.1.8

Hi,

I'm having trouble deleting an execution host. I've removed it from the host group, but when I try to delete with qconf, it says it's still part of 'basic.q'. Here's the relevant output. Anyone have any suggestions?

[root@chead ~]# qconf -de compute-2-4.local
Host object "compute-2-4.local" is still referenced in cluster queue "basic.q".

[root@chead ~]# qconf -sq basic.q
qname                 basic.q
hostlist              @basichosts
seq_no                0
load_thresholds       np_load_avg=1.74
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH
ckpt_list             NONE
pe_list               make mpich mpi orte unihost serial
rerun                 FALSE
slots                 8,[compute-1-2.local=3],[compute-1-0.local=7], \
                      [compute-1-1.local=7],[compute-1-3.local=7], \
                      [compute-1-5.local=8],[compute-1-6.local=8], \
                      [compute-1-7.local=8],[compute-1-8.local=8], \
                      [compute-1-9.local=8],[compute-1-10.local=8], \
                      [compute-1-11.local=8],[compute-1-12.local=8], \
                      [compute-1-13.local=8],[compute-1-14.local=8], \
                      [compute-1-15.local=8]
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                19G
h_vmem                19G

[root@chead ~]# qconf -shgrp @basichosts
group_name @basichosts
hostlist compute-1-0.local compute-1-2.local compute-1-3.local \
         compute-1-5.local compute-1-6.local compute-1-7.local \
         compute-1-8.local compute-1-9.local compute-1-10.local \
         compute-1-11.local compute-1-12.local compute-1-13.local \
         compute-1-14.local compute-1-15.local compute-2-0.local \
         compute-2-2.local compute-2-5.local compute-2-7.local \
         compute-2-8.local compute-2-9.local compute-2-11.local \
         compute-2-12.local compute-2-13.local compute-2-15.local \
         compute-2-6.local

Thanks

-M