Re: [gridengine users] can't delete an exec host

2017-10-06 Thread Derrick Lin
Try

qconf -de host_list

Cheers,


Re: [gridengine users] can't delete an exec host

2017-09-06 Thread Reuti


On 06.09.2017 at 19:22, Michael Stauffer wrote:

> OK good to know. I've done that before and seen them finish, although some 
> googling suggested people have seen jobs get killed.

No. 

NB: They will get killed, however, if you shut down the "sgeexecd" on an
exechost with the conventional "stop" argument. Supplying the argument
"softstop" instead allows them to continue, although they are then no longer
supervised by the "sgeexecd". This can be handy when a user requested an
h_rt that is too short for the job and the job needs to be allowed to keep
running.
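
To illustrate the difference, here is a minimal sketch that only prints the
command for either mode rather than running it. It assumes the execd startup
script sits at $SGE_ROOT/$SGE_CELL/common/sgeexecd, which is common but not
guaranteed; the helper name execd_stop_cmd and the /opt/sge and "default"
fallbacks are hypothetical.

```shell
# Hypothetical helper: print (do not run) the command that stops sge_execd.
# Assumes the startup script lives at $SGE_ROOT/$SGE_CELL/common/sgeexecd.
# "softstop" leaves running jobs alive; "stop" takes the jobs down as well.
execd_stop_cmd() {
    mode="${1:-softstop}"
    case "$mode" in
        stop|softstop) ;;
        *) echo "usage: execd_stop_cmd [stop|softstop]" >&2; return 2 ;;
    esac
    # /opt/sge and "default" are placeholder fallbacks, not known values.
    echo "${SGE_ROOT:-/opt/sge}/${SGE_CELL:-default}/common/sgeexecd $mode"
}
```

Calling execd_stop_cmd softstop prints the softstop invocation, which can
then be reviewed and run as root on the exechost in question.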


> Does a qmaster restart, however, empty the queue?

No.


>  I imagine a reboot would too, unless the queue is stored in a file?

All vital information is stored in flat files or BDB. The only thing that is
lost is the list of completed jobs, aka zombies (which you can see with
`qstat -s z`; how many of them are kept can be set with the "finished_jobs"
entry in `qconf -mconf`).
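
A small sketch of checking that limit, assuming `qconf -sconf global` prints
the setting as a "finished_jobs <number>" line; the helper name
finished_jobs_limit is hypothetical.

```shell
# Hypothetical helper: print the finished_jobs value from the global
# cluster configuration (how many finished jobs the qmaster remembers).
finished_jobs_limit() {
    qconf -sconf global | awk '$1 == "finished_jobs" { print $2 }'
}

# The finished jobs themselves can be listed for all users with:
#   qstat -s z -u '*'
```

If the helper prints nothing, the entry is probably not set explicitly in the
global configuration.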

-- Reuti



Re: [gridengine users] can't delete an exec host

2017-09-06 Thread Michael Stauffer
On Wed, Sep 6, 2017 at 12:42 PM, Reuti  wrote:

> Just to mention, that it's safe to restart the qmaster or reboot even the
> machine the qmaster is running on. Nothing will happen to the running jobs
> on the exechosts.
>

OK good to know. I've done that before and seen them finish, although some
googling suggested people have seen jobs get killed. Does a qmaster
restart, however, empty the queue? I imagine a reboot would too, unless the
queue is stored in a file?

-M



Re: [gridengine users] can't delete an exec host

2017-09-06 Thread Reuti

> On 06.09.2017 at 17:33, Michael Stauffer wrote:
>
> On Wed, Sep 6, 2017 at 11:16 AM, Feng Zhang wrote:
> > It seems SGE master did not get refreshed with new hostgroup. Maybe you can
> > try:
> >
> > 1. restart SGE master
>
> Is it safe to do this with jobs queued and running? I think it's not
> reliable, i.e. jobs can get killed and de-queued?

Just to mention that it's safe to restart the qmaster, or even to reboot the
machine the qmaster is running on. Nothing will happen to the running jobs on
the exechosts.

-- Reuti





Re: [gridengine users] can't delete an exec host

2017-09-06 Thread Michael Stauffer
That was it, thanks! The node had failed so I didn't think there'd be
anything running on there, but two jobs were stuck in the basic.q on that
node. I've killed them and now can remove host compute-2-4.
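
The check that resolved this can be sketched as follows. The helper name
jobs_on_instance is hypothetical, and the parsing assumes qstat's default
table layout (two header lines, then one row per job with the job ID in the
first column).

```shell
# Hypothetical helper: list the IDs of jobs that qstat reports for one
# queue instance, e.g. basic.q@compute-2-4.local.
jobs_on_instance() {
    qstat -u '*' -q "$1" |
        awk 'NR > 2 && $1 ~ /^[0-9]+$/ { print $1 }' |
        sort -un
}

# Possible follow-up once the listed jobs are known to be dead:
#   qdel -f <job_id>               # forced removal; may need manager rights
#   qconf -de compute-2-4.local    # then retry deleting the exec host
```

Running jobs_on_instance basic.q@compute-2-4.local before qconf -de shows
whether anything is still registered on the failed node.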

-M

On Wed, Sep 6, 2017 at 11:41 AM, Feng Zhang  wrote:

> Are there any running jobs on the queue instance basic.q@compute-2-4?

Re: [gridengine users] can't delete an exec host

2017-09-06 Thread Feng Zhang
Are there any running jobs on the queue instance basic.q@compute-2-4?




-- 
Best,

Feng
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] can't delete an exec host

2017-09-06 Thread Michael Stauffer
On Wed, Sep 6, 2017 at 11:16 AM, Feng Zhang  wrote:

> It seems SGE master did not get refreshed with new hostgroup. Maybe you
> can try:
>
> 1. restart SGE master
>

Is it safe to do this with jobs queued and running? I think it's not
reliable, i.e. jobs can get killed and de-queued?


> or
>
> 2. change basic.q, "hostlist" to any node, like "compute-1-0.local",
> wait till it gets refreshed; then change it back to "@basichosts".
>

I've done this, but it's not refreshing (been about 10 minutes now). I'm
still getting the error when I try to delete exec host compute-2-4, and
qhost is still showing basic.q on the nodes in @basichosts.

Interestingly, host compute-2-4 was removed from another queue
(qlogin.basic.q) that also uses @basichosts, so it's something about
basic.q that's stuck.

Is there some way to refresh things other than restarting qmaster?

-M





Re: [gridengine users] can't delete an exec host

2017-09-06 Thread Feng Zhang
It seems the SGE master did not pick up the updated host group. Maybe you
can try:

1. restart the SGE master

or

2. change the "hostlist" of basic.q to any single node, such as
"compute-1-0.local", wait until it gets refreshed, then change it back to
"@basichosts".
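
Option 2 can be sketched with qconf -rattr, which replaces an attribute's
whole value. The helper name toggle_hostlist and the 30-second default pause
are hypothetical; how long a refresh takes is not stated anywhere in this
thread.

```shell
# Hypothetical helper: point a cluster queue's hostlist at a temporary
# value, wait, then restore the original value.
# Arguments: queue, original hostlist, temporary hostlist, pause in seconds.
toggle_hostlist() {
    queue="$1"; original="$2"; temporary="$3"; pause="${4:-30}"
    # -rattr replaces the whole hostlist rather than appending to it.
    qconf -rattr queue hostlist "$temporary" "$queue" || return 1
    sleep "$pause"
    qconf -rattr queue hostlist "$original" "$queue"
}

# Example call (not executed here):
#   toggle_hostlist basic.q @basichosts compute-1-0.local
```

Note that while the temporary value is in place the queue is configured on
that single node only, so this is best tried when that is acceptable.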






-- 
Best,

Feng


[gridengine users] can't delete an exec host

2017-09-06 Thread Michael Stauffer
SoGE 8.1.8

Hi,

I'm having trouble deleting an execution host. I've removed it from the
host group, but when I try to delete it with qconf, it says it's still part
of 'basic.q'. Here's the relevant output. Does anyone have any suggestions?

[root@chead ~]# qconf -de compute-2-4.local
Host object "compute-2-4.local" is still referenced in cluster queue
"basic.q".

[root@chead ~]# qconf -sq basic.q
qname                 basic.q
hostlist              @basichosts
seq_no                0
load_thresholds       np_load_avg=1.74
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH
ckpt_list             NONE
pe_list               make mpich mpi orte unihost serial
rerun                 FALSE
slots                 8,[compute-1-2.local=3],[compute-1-0.local=7], \
                      [compute-1-1.local=7],[compute-1-3.local=7], \
                      [compute-1-5.local=8],[compute-1-6.local=8], \
                      [compute-1-7.local=8],[compute-1-8.local=8], \
                      [compute-1-9.local=8],[compute-1-10.local=8], \
                      [compute-1-11.local=8],[compute-1-12.local=8], \
                      [compute-1-13.local=8],[compute-1-14.local=8], \
                      [compute-1-15.local=8]
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                19G
h_vmem                19G

[root@chead ~]# qconf -shgrp @basichosts
group_name @basichosts
hostlist compute-1-0.local compute-1-2.local compute-1-3.local \
 compute-1-5.local compute-1-6.local compute-1-7.local \
 compute-1-8.local compute-1-9.local compute-1-10.local \
 compute-1-11.local compute-1-12.local compute-1-13.local \
 compute-1-14.local compute-1-15.local compute-2-0.local \
 compute-2-2.local compute-2-5.local compute-2-7.local \
 compute-2-8.local compute-2-9.local compute-2-11.local \
 compute-2-12.local compute-2-13.local compute-2-15.local \
 compute-2-6.local

Thanks

-M