I recently created a website for myself, and threw this documentation up there, where it might be a little easier to read:
http://prentice.bisbal.co/hpc/sge/cannot_run_on_host

--
Prentice

On 02/10/2012 09:52 AM, Prentice Bisbal wrote:
> Jerome,
>
> I had a similar problem a couple of years ago. I would get this error:
>
>     cannot run on host node64.aurora until clean up of an previous run has
>     finished
>
> (Aurora's my cluster's name, so I use that as my top-level domain on my
> cluster nodes)
>
> Fixing this problem is a bit tedious. Fortunately, I did a little write-up
> of the problem and all the steps I took to fix it on my internal wiki.
> Since access to that is restricted, here's a cut-and-paste of my write-up
> on how to fix it. This should fix your problem.
>
> ********************************************
>
> I experienced this error in Feb 2010 when a user's jobs were consuming
> all the memory on the execution nodes, causing the jobs to crash, and
> apparently leaving the execution hosts in an uncertain state after the
> jobs crashed.
>
> A large parallel job was in the 'qw' state for almost a week, and when I
> checked why with 'qstat -j <jobid>', I saw errors like this in the
> scheduling_info section of the output:
>
>     cannot run on host node64.aurora until clean up of an previous run has
>     finished
>
> I inspected the hosts listed, and none of them had any jobs running on
> them. The SGE Users mailing list suggested the following remedies, but
> none of them worked:
>
>    1. restart sge_execd on the hosts
>    2. restart sge_qmaster on the master
>    3. restart the entire cluster
>
> Since this was a production cluster with many jobs running, I didn't
> have the luxury of trying that last one. I did approximate it by
> shutting down sge_qmaster and then rebooting the afflicted nodes, with
> no effect.
>
> Finally, I fixed this by deleting the hosts from SGE and then re-adding
> them. For the sake of future victims of a problem like this, here's what
> I did, since there's a few minor gotchas:
>
> 0. Disable all queues on the affected hosts, and make sure no jobs are
> running on them before starting.
>
>     for host in <list of nodes>; do
>         qmod -d \*@$host
>     done
>
> 1. Write the configs of all the hosts to be deleted to text files:
>
>     for host in <list of nodes>; do
>         qconf -se $host > $host.txt
>     done
>
> 2. Edit each text file and remove the entries for “load_values” and
> “processors”. These are values calculated by SGE, and will generate
> errors when you try to add the execution hosts back to the config later
> on. Since the load_values entry spans multiple lines and may be a
> different number of lines on different hosts, you can't do a simple sed
> operation to remove the lines. I used vi *.txt to open them all at once.
>
> 3. Edit any host groups or queues that reference the nodes you are about
> to delete, and remove those nodes from them. You will have to edit the
> hostgroup @allhosts at a minimum:
>
>     qconf -mhgrp @allhosts
>
> 4. Delete the affected hosts from SGE:
>
>     for host in <list of nodes>; do
>         qconf -de $host
>     done
>
> 5. Add them back:
>
>     for host in <list of nodes>; do
>         qconf -Ae $host.txt
>     done
>
> 6. Edit the hostgroups or queues you modified in step 3 to add the hosts
> back:
>
>     qconf -mhgrp @allhosts
>
> That should be it. Be sure to check that the hosts are part of all the
> queues they should be, and that none of the queues are in error. Enable
> any queues that need it.
>
> -- Prentice
>
>
> On 02/09/2012 12:51 PM, Jerome wrote:
>> Dear all,
>>
>> I have SGE version GE 6.2u2_1 on a Rocks cluster.
>> For a few days, a node has refused to run a job. Using "qstat -j jid", I
>> noticed this line at the end of the output:
>>
>> cannot run on host "compute-2-15.local" until clean up of an previous
>> run has finished
>>
>> I checked node 2-15, but the jobs directory is totally empty. To
>> be sure, I reinstalled the node from scratch, and the
>> problem persists.
>> It seems to be the master that is causing this issue. Can someone help
>> me find the file with the bad information that I have to modify to
>> let my node run the job?
>>
>> Best regards.
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
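
Step 2 of the quoted procedure (removing the "load_values" and "processors" entries by hand) is the part that is awkward to automate, because load_values wraps over a varying number of lines. The following is only a rough sketch of how awk might handle it, not something from the original write-up. It assumes that every wrapped line of an entry except the last ends with a backslash, which is how qconf -se output is usually laid out; the function name strip_calculated and the .clean suffix are made up for illustration. Check the output by eye before feeding it to qconf -Ae.

```shell
#!/bin/sh
# Sketch: drop the SGE-calculated "load_values" and "processors" entries
# from a host config saved with `qconf -se <host> > <host>.txt`.
# ASSUMPTION: a wrapped entry ends every line except its last with "\".
# The name strip_calculated is hypothetical, not part of SGE.
strip_calculated() {
    awk '
        # Inside a wrapped entry that is being dropped: keep skipping
        # until a line that does not end with a backslash.
        skipping {
            if ($0 !~ /\\[ \t]*$/) skipping = 0
            next
        }
        # First line of an entry to drop; note whether it wraps.
        $1 == "load_values" || $1 == "processors" {
            if ($0 ~ /\\[ \t]*$/) skipping = 1
            next
        }
        # Everything else passes through unchanged.
        { print }
    ' "$1"
}

# When file names are given, write a cleaned copy next to each one
# (for example node64.aurora.txt -> node64.aurora.txt.clean).
if [ "$#" -gt 0 ]; then
    for f in "$@"; do
        strip_calculated "$f" > "$f.clean"
    done
fi
```

Saved under a name of your choosing and run as `sh <script> *.txt` in the directory holding the files from step 1, this would leave the originals untouched and write the cleaned copies alongside them, so the two can be compared with diff before step 5.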
