Re: [gridengine users] Jobs on qw state and exec node on au state

Bill Bryce Mon, 30 May 2016 11:17:25 -0700

Okay,

Can you run any qconf commands, such as ‘qconf -sconf’?  Also have a look at 
the messages files for the qmaster and the execution daemons.  They should be 
under $SGE_ROOT/default/spool/, which contains directories for the master and 
the exec hosts (if you have this installed in a shared filesystem 
environment).  You can check both the qmaster messages file and the execd 
messages files in those directories.
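
As a rough sketch of that check, the shell function below prints the tail of 
every messages file found under a spool directory.  The function name 
show_sge_messages and the <spool>/<daemon or host>/messages layout are 
assumptions for illustration; packaged installs may spool somewhere else, so 
pass the directory that matches your setup.

```shell
# Print the last lines of every "messages" file under a spool directory.
# Assumption: each daemon logs to <spool>/<name>/messages, where <name> is
# "qmaster" or an exec host name. Pass another directory if yours differs.
show_sge_messages() {
    spool_dir="${1:-${SGE_ROOT:-}/default/spool}"
    lines="${2:-20}"

    if [ ! -d "$spool_dir" ]; then
        echo "spool directory not found: $spool_dir" >&2
        return 1
    fi

    for f in "$spool_dir"/*/messages; do
        # Skip the unexpanded glob when no messages file exists.
        [ -f "$f" ] || continue
        echo "==> $f"
        tail -n "$lines" "$f"
    done
}
```

For example, show_sge_messages "$SGE_ROOT/default/spool" 50 would show the 
last 50 lines logged by the qmaster and by each exec host that spools there.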

A question: do you have the qmaster running on one host or on many?  I noticed 
that your ps output for compute010 shows a qmaster running there as well as an 
execd.

Another thing to check is whether all nodes can contact the qmaster machine, 
i.e. that the networking is configured properly.  Also make sure that the host 
naming is correct: either configure DNS properly or maintain an /etc/hosts 
file on all nodes so that the IP-to-host-name mapping is consistent across the 
cluster.  Grid Engine is very picky about host names.
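
A minimal sketch of the /etc/hosts side of that check, assuming plain 
hosts-format files.  The function names hosts_ips_for and check_host_entry are 
made up for this example.  It reports a name that is missing or mapped to more 
than one address, and it also flags loopback addresses, on the assumption that 
every node should see the same routable address for a given host.

```shell
# Print every IP address that a hosts-format file assigns to a name.
hosts_ips_for() {
    name="$1"
    file="${2:-/etc/hosts}"
    awk -v n="$name" '
        { sub(/#.*/, "") }    # strip comments; awk re-splits the fields
        { for (i = 2; i <= NF; i++) if ($i == n) print $1 }
    ' "$file" | sort -u
}

# Succeed only if the file maps the name to exactly one non-loopback address.
check_host_entry() {
    name="$1"
    file="${2:-/etc/hosts}"
    ips=$(hosts_ips_for "$name" "$file")
    count=$(printf '%s\n' "$ips" | awk 'NF { c++ } END { print c + 0 }')

    if [ "$count" -eq 0 ]; then
        echo "$name: no entry in $file"
        return 1
    fi
    if [ "$count" -gt 1 ]; then
        echo "$name: maps to $count addresses in $file"
        return 1
    fi
    case "$ips" in
        127.*|::1)
            echo "$name: maps to loopback address $ips in $file"
            return 1
            ;;
    esac
    echo "$name: $ips"
}
```

Running check_host_entry compute010 on the head node and on each compute 
node, then comparing the printed addresses, would show whether the mapping 
really is the same everywhere.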



> On May 30, 2016, at 1:36 PM, Radhouane Aniba <[email protected]> wrote:
> 
> Hi Bill
> 
> Yes, I am sure.
> 
> This is what I get when I log in to one of the nodes and run:
> 
> ubuntu@compute010:~$ ps -ef | grep sge_
> sgeadmin  1254     1  0 May28 ?        00:00:39 /usr/lib/gridengine/sge_qmaster
> sgeadmin  1446     1  0 May28 ?        00:00:22 /usr/lib/gridengine/sge_execd
> ubuntu    2552  2527  0 17:36 pts/0    00:00:00 grep --color=auto sge_
> 
> 
> On Mon, May 30, 2016 at 10:33 AM, Bill Bryce <[email protected]> wrote:
> Hi Rad,
> 
> Are you sure that the execution daemons are running on your compute nodes?  
> Can you log in to one of the nodes, say ‘compute001’, and run ps to look for 
> the execd?  When an execd is functioning normally it reports load, memory 
> use, etc.; none of your nodes are showing that.
> 
> Regards,
> 
> Bill.
> 
>> On May 30, 2016, at 1:20 PM, Radhouane Aniba <[email protected]> wrote:
>> 
>> Hello all,
>> 
>> I am trying to submit a simple "hello world" job to test a Grid Engine 
>> installation (I have used it before with no problems).
>> 
>> The problem is that my job waits in the queue forever.
>> 
>> The qhost command shows a weird state for the compute nodes:
>> 
>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>> -------------------------------------------------------------------------------
>> global                  -               -     -       -       -       -       -
>> compute001              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute002              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute003              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute004              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute005              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute006              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute007              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute008              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute009              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute010              lx26-amd64      4     -   31.4G       -     0.0       -
>> compute011              lx26-amd64      4     -   31.4G       -     0.0
>> 
>> Normally, even when the compute nodes are idle, the LOAD and MEMUSE columns 
>> show some values.
>> 
>> I am not an SGE person, but I am familiar with all the commands; any help 
>> would be much appreciated.
>> 
>> The qstat -f command shows all my nodes in au state. I've been reading a lot 
>> about it and I understand it's an alarm state (overloaded?).
>> 
>> The only heavy activity I had on the head node was a script downloading 19T 
>> of data; could the head node be the problem and not the compute nodes?
>> 
>> sge_execd is running on all the compute/exec nodes :/
>> 
>> --
>> Rad
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
> 
> William Bryce | VP Products
> Univa Corporation, Toronto
> E: [email protected] | D: 647-9742841 | Toll-Free (800) 370-5320
> W: Univa.com <http://univa.com/> | FB: facebook.com/univa.corporation | 
> T: twitter.com/Grid_Engine
> 
> 
> 
> --
> Radhouane Aniba
> Bioinformatics Scientist
> BC Cancer Agency, Vancouver, Canada

