Hi all,
   yes, the nodes were in an error state. Yesterday the fastest solution was to 
reconfigure SGE on the nodes and the master (with the inst_sge command). After 
that, the system came back online without errors:

[root@hactar ~]# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@compute-1-1              BIP   0/0/24         0.03     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-10             BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-11             BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-12             BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-13             BIP   0/0/24         0.05     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-14             BIP   0/0/24         0.08     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-2              BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-3              BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-4              BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-5              BIP   0/0/24         0.05     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-6              BIP   0/0/24         0.02     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-7              BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-8              BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------------------------
all.q@compute-1-9              BIP   0/0/24         0.00     linux-x64
[root@hactar ~]#

MPI job submission now works well. The earlier misbehavior may have been due 
to a configuration error during the install phase.
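
In case it helps anyone else, here is a minimal sketch of how the queue instances in error state could be picked out of `qstat -f` output. The function name `error_queues` is my own invention, and it assumes the column layout shown above, where the state flags, if any, are the sixth field of a queue line:

```shell
# Sketch only: list queue instances whose state column contains E (error).
# Reads `qstat -f` output on stdin, so it can also be tried on saved output.
# Assumes queue lines have the queue name (containing "@") in field 1 and
# the state flags, when present, in field 6.
error_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# Usage on a live cluster (not run here):
#   qstat -f | error_queues
#   qstat -qs E -explain E          # ask SGE why the queues are in error
#   qmod -cq all.q@compute-1-1      # clear one instance once the cause is fixed
```

Clearing the state only makes sense once the underlying cause has been dealt with; otherwise the queues are likely to drop back into error on the next job.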

Thanks

D.


> On 16 Jun 2015, at 09:33, William Hay <[email protected]> wrote:
> 
> On Mon, 15 Jun 2015 17:27:47 +0000
> Daniele Gregori <[email protected]> wrote:
> 
>> [root@hactar ~]# qstat -f
>> queuename                      qtype resv/used/tot. load_avg arch          states
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-1              BIP   0/0/24         0.18     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-10             BIP   0/0/24         0.13     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-11             BIP   0/0/24         0.03     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-12             BIP   0/0/24         0.12     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-13             BIP   0/0/24         0.03     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-14             BIP   0/0/24         0.10     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-2              BIP   0/0/24         0.12     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-3              BIP   0/0/24         0.10     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-4              BIP   0/0/24         0.16     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-5              BIP   0/0/24         0.12     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-6              BIP   0/0/24         0.07     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-7              BIP   0/0/24         0.05     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-8              BIP   0/0/24         0.04     linux-x64     E
>> ---------------------------------------------------------------------------------
>> all.q@compute-1-9              BIP   0/0/24         0.09     linux-x64     E
> 
> Well, the above reveals the proximate cause of your problem: your queues are
> all in an error state. This usually happens when something goes wrong as a job
> starts and Grid Engine decides that the cause is related to the node rather
> than the job.
> 
> If you run qstat -qs E -explain E it will probably point at the job that
> triggered the problem. A clue to what happened may appear in the output of
> that job or in the execd messages file of the node with the problem.
> 
> If that doesn't tell you what the problem is, you can enable KEEP_ACTIVE in
> the execd_params of the SGE config; it will retain the job's active directory
> after the job terminates/exits. The next time a job triggers a queue into an
> error state, you can examine the additional logfiles left in the active
> directory. As the man page says, this is a debug option, so turn it off again
> when you've finished diagnosing/fixing.
> 
> You can clear the error state with qmod -cq <queue name>, but if you haven't
> identified and fixed the root of the problem it will likely recur.
> 
> 
> -- 
> William Hay <[email protected]>


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
