Re: [gridengine users] Node refuse to run job

Prentice Bisbal Thu, 16 Feb 2012 06:10:34 -0800

I recently created a website for myself, and threw this documentation up
there, where it might be a little easier to read:


http://prentice.bisbal.co/hpc/sge/cannot_run_on_host

--
Prentice


On 02/10/2012 09:52 AM, Prentice Bisbal wrote:
> Jerome,
>
> I had a similar problem a couple of years ago. I would get this error:
>
> cannot run on host node64.aurora until clean up of an previous run has
> finished
>
> (Aurora's my cluster's name, so I use that as my top-level domain on my
> cluster nodes)
>
> Fixing this problem is a bit tedious. Fortunately, I did a little
> write-up of the problem and all the steps I took to fix it on my
> internal wiki. Since access to that is restricted, here's a
> cut-and-paste of my write-up on how to fix it. This should fix your
> problem.
>
> ********************************************
>
> I experienced this error in Feb 2010 when a user's jobs were consuming
> all the memory on the execution nodes, causing the jobs to crash, and
> apparently leaving the execution hosts in an uncertain state after the
> jobs crashed.
>
> A large parallel job was in the 'qw' state for almost a week, and when I
> checked why with 'qstat -j <jobid>', I saw errors like this in the
> scheduling_info section of the output:
>
> cannot run on host node64.aurora until clean up of an previous run has 
> finished
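
The affected hosts can be pulled out of that output with a small filter
instead of being read off by eye. This is only a sketch:
hosts_awaiting_cleanup is a made-up helper name, and it assumes the
message text matches the two forms quoted in this thread (host name with
or without double quotes around it).

```shell
# Hypothetical helper: read 'qstat -j <jobid>' output on stdin and print
# each host named in a "cannot run on host ... until clean up" message,
# one per line, without duplicates.
hosts_awaiting_cleanup() {
  sed -n 's/.*cannot run on host "\{0,1\}\([^" ]*\)"\{0,1\} until clean up.*/\1/p' |
    sort -u
}
```

Usage would be along the lines of 'qstat -j <jobid> | hosts_awaiting_cleanup',
and the resulting names can stand in for <list of nodes> in the loops
further down.
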
>
> I inspected the hosts listed, and none of them had any jobs running on
> them. The SGE Users mailing list suggested the following remedies, but
> none of them worked:
>
>    1. restart sge_execd on the hosts
>    2. restart sge_qmaster on the master
>    3. restart the entire cluster
>
> Since this was a production cluster with many jobs running, I didn't
> have the luxury of trying that last one. I did approximate it by
> shutting down sge_qmaster and then rebooting the afflicted nodes, with
> no effect.
>
> Finally, I fixed this by deleting the hosts from SGE and then re-adding
> them. For the sake of future victims of a problem like this, here's what
> I did, since there are a few minor gotchas:
>
> 0. Disable all queues on the affected hosts, and make sure no jobs are
> running on them before starting.
>
> for host in <list of nodes>; do  
>  qmod -d \*@$host
> done
>
> 1. Write the configs of all the hosts to be deleted to text files:
>
> for host in <list of nodes>; do
>  qconf -se $host > $host.txt
> done
>
> 2. Edit each text file and remove the entries for “load_values” and
> “processors”. These are values calculated by SGE, and will generate
> errors when you try to add the execution hosts back to the config later
> on. Since the load_values entry spans multiple lines and may be a
> different number of lines on different hosts, you can't do a simple sed
> operation to remove the lines. I used vi *.txt to open them all at once.
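
If hand-editing every file is impractical, the removal in step 2 can
probably be scripted with awk, which copes with a variable number of
lines where a one-line sed does not. A minimal sketch, assuming the files
written by 'qconf -se' continue a long field by ending each line but the
last with a backslash (check this against your own files first);
strip_computed_fields is a made-up helper name:

```shell
# Hypothetical helper: print the host config in file $1 without the
# "load_values" and "processors" entries, including any continuation
# lines. Assumption: a trailing backslash marks a continued line.
strip_computed_fields() {
  awk '
    # Inside a dropped entry: drop this line too, and keep dropping
    # only while the current line itself ends with a backslash.
    skip { skip = ($0 ~ /\\[ \t]*$/); next }
    # Start of an entry to drop.
    $1 == "load_values" || $1 == "processors" {
      skip = ($0 ~ /\\[ \t]*$/)
      next
    }
    # Everything else is passed through unchanged.
    { print }
  ' "$1"
}
```

For example, 'strip_computed_fields node64.aurora.txt > cleaned.txt'
writes a cleaned copy and leaves the original file alone, so the two can
be compared with diff before anything is fed back to qconf.
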
>
> 3. Edit any host groups or queues that reference the nodes you are about
> to delete. You will have to edit the hostgroup @allhosts at a minimum:
>
> qconf -mhgrp @allhosts
>
> 4. Delete the affected hosts from SGE:
>
> for host in <list of nodes>; do
>  qconf -de $host
> done
>
> 5. Add them back:
>
> for host in <list of nodes>; do
>  qconf -Ae $host.txt
> done
>
> 6. Edit the hostgroups or queues you modified in step 3 to add the hosts
> back:
>
> qconf -mhgrp @allhosts
>
> That should be it. Be sure to check that the hosts are part of all the
> queues they should be, and that none of the queues are in error. Enable
> any queues that need it.
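
Since the order of the steps matters on a production cluster, it may help
to print the whole sequence for review before running any of it. The
sketch below only echoes commands and never calls SGE itself;
print_readd_plan is a made-up name, and the commands it prints are the
ones from the write-up above, with the manual edits shown as comments.

```shell
# Hypothetical dry-run helper: print, in order, the commands from the
# write-up for the hosts given as arguments. Nothing is executed.
print_readd_plan() {
  for host in "$@"; do
    printf "qmod -d '*@%s'\n" "$host"               # step 0
  done
  for host in "$@"; do
    printf 'qconf -se %s > %s.txt\n' "$host" "$host" # step 1
  done
  echo '# manual: remove load_values and processors from each .txt file'
  echo '# manual: qconf -mhgrp @allhosts (take the hosts out)'
  for host in "$@"; do
    printf 'qconf -de %s\n' "$host"                  # step 4
  done
  for host in "$@"; do
    printf 'qconf -Ae %s.txt\n' "$host"              # step 5
  done
  echo '# manual: qconf -mhgrp @allhosts (put the hosts back)'
}
```

Running 'print_readd_plan node64.aurora node65.aurora' prints the plan
for those two hosts; the output is meant to be read and pasted step by
step, not piped straight into a shell.
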
>
> – Prentice
>
>
>
>
>
>
> On 02/09/2012 12:51 PM, Jerome wrote:
>> Dear all,
>>
>> I have SGE version GE 6.2u2_1 on a Rocks cluster.
>> For the past few days, a node has refused to run a job. Using
>> "qstat -j jid", I notice this line at the end of the output:
>>
>> cannot run on host "compute-2-15.local" until clean up of an previous
>> run has finished
>>
>> I checked node 2-15, but the jobs directory is totally empty. To be
>> sure, I reinstalled the node from scratch, and the problem persists.
>> It seems to be the master that is causing this issue. Can someone
>> help me find the bad information file that I have to modify to let
>> my node run the job?
>>
>> Best regards.
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
>