So typically with Grid Engine you need to select one machine as the ‘master’ 
machine in the cluster (you can have backups, but they run a 
‘shadow_master’, so don’t worry about that for now).  The qmaster needs to be on 
one host that all the nodes can communicate with over the network.  Each node 
must have a fully qualified name in order to communicate in a Grid Engine 
cluster, so the fact that you can ping one host from another does not mean 
that you have the host names set up properly.  For example, if I have a host 
compute010 it has to have a fully qualified name such as compute010.example.com 
(example.com is used here because it always works in examples, but don’t use it 
in your real cluster - you should have a real domain at your site), and that 
fully qualified name needs to map to an IP address for the machine.

This means that when I am on machine compute010 I can run:

# hostname

# cat /etc/hosts

and the names should match.

If I do a

# ping compute010.example.com

it should return the IP address of the machine.
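The resolution checks above can be sketched as one small script (a sketch only - `check_host` is a helper I am making up here for illustration, and it assumes a Linux node with `getent` available):

```shell
#!/bin/sh
# Sketch: check that a host name resolves to a real, non-loopback address.
# check_host prints "no-resolution", "loopback", or "ok <IP>".
check_host() {
    addr=$(getent hosts "$1" | awk '{print $1; exit}')
    case "$addr" in
        "")        echo "no-resolution" ;;
        127.*|::1) echo "loopback" ;;   # unusable for Grid Engine communication
        *)         echo "ok $addr" ;;
    esac
}

check_host localhost          # resolves, but only to the loopback address
check_host "$(hostname -f)"   # a healthy node prints "ok <its real IP>"
```

If the second call prints "loopback", that is exactly the bad 127.0.0.1-to-hostname mapping discussed below; if it prints "no-resolution", fix DNS or /etc/hosts first.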

And you definitely cannot map the loopback address to the hostname, i.e. 
127.0.0.1 to compute010 - that won’t work.  The messages file below from host 
compute010 indicates that it can’t communicate with the master on 
frontend001.  So if the physical networking is not messed up and the machines 
have IP addresses, then either the name resolution is messed up, something 
else is running on the port that Grid Engine wants to use, or you have a 
firewall between the hosts that is blocking the communication.

Basically, check everything you can between the two hosts to make sure they can 
communicate properly.
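To test the port/firewall angle from a node, something like this can help (again a sketch: frontend001 is the master host from the logs below, and 6444 is the conventional sge_qmaster port - confirm against /etc/services or your installation settings; the /dev/tcp trick needs bash):

```shell
#!/bin/bash
# Sketch: probe a TCP port using bash's built-in /dev/tcp (no nc needed).
# Prints "open" if a connection succeeds, "closed" otherwise
# (refused, filtered by a firewall, or host unreachable).
check_port() {
    if (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null; then
        echo open
    else
        echo closed
    fi
}

check_port frontend001 6444   # qmaster port, as seen from a compute node
```

If the port shows closed from the node but open when probed on the master itself, suspect a firewall in between; if it is closed even on the master, nothing is listening there - which would match the “Connection refused” lines in the messages file below.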

Regards,

Bill.

> On May 30, 2016, at 2:24 PM, Radhouane Aniba <arad...@gmail.com> wrote:
> 
> Ok, here is what I have.
> 
> Connected to one node, compute010, qconf -sconf gives me this:
> 
> 
> #global:
> execd_spool_dir              /var/spool/gridengine/execd
> mailer                       /usr/bin/mail
> xterm                        /usr/bin/xterm
> load_sensor                  none
> prolog                       none
> epilog                       none
> shell_start_mode             posix_compliant
> login_shells                 bash,sh,ksh,csh,tcsh
> min_uid                      0
> min_gid                      0
> user_lists                   none
> xuser_lists                  none
> projects                     none
> xprojects                    none
> enforce_project              false
> enforce_user                 auto
> load_report_time             00:00:40
> max_unheard                  00:05:00
> reschedule_unknown           00:00:00
> loglevel                     log_warning
> administrator_mail           root
> set_token_cmd                none
> pag_cmd                      none
> token_extend_time            none
> shepherd_cmd                 none
> qmaster_params               none
> execd_params                 none
> reporting_params             accounting=true reporting=false \
>                              flush_time=00:00:15 joblog=false sharelog=00:00:00
> finished_jobs                100
> gid_range                    65400-65500
> max_aj_instances             2000
> max_aj_tasks                 75000
> max_u_jobs                   0
> max_jobs                     0
> auto_user_oticket            0
> auto_user_fshare             0
> auto_user_default_project    none
> auto_user_delete_time        86400
> delegated_file_staging       false
> reprioritize                 0
> rlogin_daemon                /usr/sbin/sshd -i
> rlogin_command               /usr/bin/ssh
> qlogin_daemon                /usr/sbin/sshd -i
> qlogin_command               /usr/share/gridengine/qlogin-wrapper
> rsh_daemon                   /usr/sbin/sshd -i
> rsh_command                  /usr/bin/ssh
> jsv_url                      none
> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
> 
> 
> The messages file in the spool directory:
> 
> 
> ubuntu@compute010:~$ more /var/spool/gridengine/execd/compute010/messages
> 05/02/2016 18:10:11|  main|compute010|E|can't find connection
> 05/02/2016 18:10:11|  main|compute010|E|can't get configuration from qmaster 
> -- backgrounding
> 05/04/2016 16:58:28|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 05/18/2016 17:10:36|  main|compute010|W|can't register at qmaster 
> "frontend001": abort qmaster registration due to communication errors
> 05/18/2016 17:37:55|  main|compute010|I|controlled shutdown 6.2u5
> 05/18/2016 17:46:28|  main|compute010|E|can't find connection
> 05/18/2016 17:46:28|  main|compute010|E|can't get configuration from qmaster 
> -- backgrounding
> 05/18/2016 17:46:31|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 05/20/2016 14:27:40|  main|compute010|I|controlled shutdown 6.2u5
> 05/22/2016 17:00:28|  main|compute010|E|can't find connection
> 05/22/2016 17:00:28|  main|compute010|E|can't get configuration from qmaster 
> -- backgrounding
> 05/22/2016 17:01:38|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 05/28/2016 03:59:31|  main|compute010|I|controlled shutdown 6.2u5
> 05/28/2016 03:59:49|  main|compute010|W|local configuration compute010 not 
> defined - using global configuration
> 05/28/2016 03:59:49|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 05/30/2016 17:41:50|  main|compute010|W|can't register at qmaster 
> "compute010": abort qmaster registration due to communication errors
> 05/30/2016 17:41:50|  main|compute010|E|commlib error: got select error 
> (Connection refused)
> 05/30/2016 17:42:14|  main|compute010|I|controlled shutdown 6.2u5
> 05/30/2016 17:58:58|  main|compute010|W|local configuration compute010 not 
> defined - using global configuration
> 05/30/2016 17:58:58|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 
> I had the qmaster running on all nodes before, with no problem (master and 
> executors).
> When I kill sge_qmaster on the node, sge_execd is not working anymore 
> because it's not able to connect to the master.
> 
> a ping from the node to the frontend node shows that it is visible though
> 
> :/
> 
> On Mon, May 30, 2016 at 11:14 AM, Bill Bryce <bbr...@univa.com> wrote:
> Okay,
> 
> can you run any qconf commands, such as ‘qconf -sconf’?  Try having a look at 
> the messages files for the execution daemons.  They should be in
> 
> $SGE_ROOT/default/spool/ and in there are directories for the master and exec 
> hosts (if you have this installed in a shared filesystem environment).  You 
> can check both the qmaster messages file and the execd messages files in 
> those directories.
> 
> A question: do you have the qmaster running on one host or on many?  I 
> noticed that you have the ps output for compute010 and it is running a 
> qmaster.
> 
> Other things you can check: see if all nodes can contact the qmaster 
> machine, i.e. that the networking is configured properly.  You can also make 
> sure that the host naming is correct; either configure DNS properly or 
> configure an /etc/hosts file for all nodes so the IP-to-host-name mapping is 
> consistent across the cluster.  Grid Engine is very picky about host names.
> 
> 
> 
>> On May 30, 2016, at 1:36 PM, Radhouane Aniba <arad...@gmail.com> wrote:
>> 
>> Hi Bill
>> 
>> Yes I am sure
>> 
>> This is what I have when I login to one of the nodes and do
>> 
>> ubuntu@compute010:~$ ps -ef | grep sge_
>> sgeadmin  1254     1  0 May28 ?        00:00:39 
>> /usr/lib/gridengine/sge_qmaster
>> sgeadmin  1446     1  0 May28 ?        00:00:22 /usr/lib/gridengine/sge_execd
>> ubuntu    2552  2527  0 17:36 pts/0    00:00:00 grep --color=auto sge_
>> 
>> 
>> On Mon, May 30, 2016 at 10:33 AM, Bill Bryce <bbr...@univa.com> wrote:
>> Hi Rad,
>> 
>> Are you sure that the execution daemons are running on your compute nodes?  
>> Can you login to one of the nodes say ‘compute001’ and do a ps looking for 
>> the execd?  When an execd is functioning normally it provides the load and 
>> memory, etc… none of your nodes are showing that.
>> 
>> Regards,
>> 
>> Bill.
>> 
>>> On May 30, 2016, at 1:20 PM, Radhouane Aniba <arad...@gmail.com> wrote:
>>> 
>>> Hello all,
>>> 
>>> I am trying to submit a simple "hello world" job to test a Grid Engine 
>>> cluster (I have used it before with no problems).
>>> 
>>> The problem is that my job is waiting in the queue forever.
>>> 
>>> The qhost command shows a weird state for the compute nodes:
>>> 
>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>> -------------------------------------------------------------------------------
>>> global                  -               -     -       -       -       -       -
>>> compute001              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute002              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute003              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute004              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute005              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute006              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute007              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute008              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute009              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute010              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute011              lx26-amd64      4     -   31.4G       -     0.0       -
>>> In normal times, even when the compute nodes are not used, I used to have 
>>> some information in the LOAD and MEMUSE columns.
>>> 
>>> I am not an SGE person but I am familiar with all the commands; any help 
>>> would be much appreciated.
>>> 
>>> the qstat -f command shows all my nodes in the au state. I've been reading 
>>> a lot about it and I understand it's an alarm state (overloaded?)
>>> 
>>> the only heavy activity I had on the head node was a script downloading 19T 
>>> of data; could the head node be the problem and not the compute nodes?
>>> 
>>> sge_execd is working on all the compute/exec nodes :/
>>> 
>>> --
>>> Rad
>>> _______________________________________________
>>> users mailing list
>>> users@gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
>> 
>> William Bryce | VP Products
>> Univa Corporation, Toronto
>> E: bbr...@univa.com | D: 647-9742841 | Toll-Free (800) 370-5320
>> W: Univa.com | FB: facebook.com/univa.corporation | T: twitter.com/Grid_Engine
>> 
>> 
>> 
>> --
>> Radhouane Aniba
>> Bioinformatics Scientist
>> BC Cancer Agency, Vancouver, Canada
> 
