Hi all, yes, the nodes were in an error state. Yesterday the fastest solution was to reconfigure SGE on the nodes and on the master (via the inst_sge command; a sketch of that invocation follows the output below). After that, the system came back online without errors:
[root@hactar ~]# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@compute-1-1              BIP   0/0/24         0.03     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-10             BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-11             BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-12             BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-13             BIP   0/0/24         0.05     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-14             BIP   0/0/24         0.08     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-2              BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-3              BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-4              BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-5              BIP   0/0/24         0.05     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-6              BIP   0/0/24         0.02     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-7              BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-8              BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-9              BIP   0/0/24         0.00     linux-x64
[root@hactar ~]#

Now the MPI job submission runs fine. The wrong behavior was probably due to a configuration error in the install phase.

Thanks,
D.
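For the archives, a minimal sketch of what such a reconfiguration can look like. The inst_sge flags below are standard installer options, but the exact invocation, installation path, and cell name depend on the local setup, so treat this as an assumption-laden illustration rather than a record of the exact commands used:

    # On the qmaster host, from the SGE installation root:
    cd $SGE_ROOT
    ./inst_sge -m        # (re)run the qmaster installation/configuration

    # On each execution host (compute-1-1 ... compute-1-14):
    cd $SGE_ROOT
    ./inst_sge -x        # (re)run the execd installation/configuration

    # Afterwards, check that no queue instance carries the E state:
    qstat -f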
> On 16 Jun 2015, at 09:33, William Hay <[email protected]> wrote:
>
> On Mon, 15 Jun 2015 17:27:47 +0000
> Daniele Gregori <[email protected]> wrote:
>
>> [root@hactar ~]# qstat -f
>> queuename                      qtype resv/used/tot. load_avg arch          states
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-1              BIP   0/0/24         0.18     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-10             BIP   0/0/24         0.13     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-11             BIP   0/0/24         0.03     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-12             BIP   0/0/24         0.12     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-13             BIP   0/0/24         0.03     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-14             BIP   0/0/24         0.10     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-2              BIP   0/0/24         0.12     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-3              BIP   0/0/24         0.10     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-4              BIP   0/0/24         0.16     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-5              BIP   0/0/24         0.12     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-6              BIP   0/0/24         0.07     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-7              BIP   0/0/24         0.05     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-8              BIP   0/0/24         0.04     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-9              BIP   0/0/24         0.09     linux-x64     E
>
> Well, the above reveals the proximate cause of your problem: your queues are
> all in an error state. This usually happens when something goes wrong when a
> job starts and grid engine decides that the cause is related to the node
> rather than the job.
>
> If you run qstat -qs E -explain E it will probably point at the job that
> triggered the problem. A clue to what happened may also appear in the output
> of the job that triggered the problem, or in the execd messages file of the
> affected node.
>
> If that doesn't tell you what the problem is, you can enable KEEP_ACTIVE in
> the execd_params of the SGE configuration; it will retain the job's active
> directory after the job terminates/exits. The next time a job pushes a queue
> into an error state, you can examine the additional logfiles left in the
> active directory. As the man page says, this is a debug option, so turn it
> off again when you've finished diagnosing/fixing.
>
> You can clear the error state with qmod -cq <queue name>, but if you haven't
> identified and fixed the root of the problem it will likely recur.
>
> --
> William Hay <[email protected]>
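Spelling out the commands William describes, for anyone searching the archives later. The queue and host names are examples taken from this thread, and the spool path assumes the default cell with local spooling:

    # Show why queue instances are in the E(rror) state:
    qstat -qs E -explain E

    # Look for clues in the execd messages file of an affected node,
    # e.g. compute-1-1 (path assumes a default install layout):
    less $SGE_ROOT/default/spool/compute-1-1/messages

    # Enable KEEP_ACTIVE while debugging: edit the global configuration
    # with qconf and set it in execd_params, e.g.
    qconf -mconf
    #    execd_params    KEEP_ACTIVE=TRUE
    # (turn it off again when finished, as it is a debug option)

    # After fixing the root cause, clear the error state:
    qmod -cq all.q@compute-1-1    # a single queue instance
    qmod -cq all.q                # or every instance of the queue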
