Re: [gridengine users] Jobs on qw state and exec node on au state

Radhouane Aniba Mon, 30 May 2016 12:16:59 -0700

Hello Bill, thank you for your reply.

Everything looks OK as far as I can tell:


ubuntu@compute010:~$ hostname
compute010

ubuntu@compute010:~$ cat /etc/hosts
# THIS FILE IS CONTROLLED BY ANSIBLE
# any local modifications will be overwritten!
#

# This file is managed by Ansible.
127.0.0.1 localhost.localdomain localhost

10.0.0.102 compute001
10.0.0.100 compute002
10.0.0.101 compute003
10.0.0.103 compute004
10.0.0.105 compute005
10.0.0.104 compute006
10.0.0.108 compute007
10.0.0.121 compute008
10.0.0.106 compute009
10.0.0.110 compute010
10.0.0.107 compute011
10.0.0.112 compute012
10.0.0.116 compute013
10.0.0.118 compute014
10.0.0.113 compute015
10.0.0.115 compute016
10.0.0.111 compute017
10.0.0.120 compute018
10.0.0.109 compute019
10.0.0.119 compute020
10.0.0.122 compute021
10.0.0.117 frontend001

I am not sure what's wrong, to be honest. I'll keep looking.
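
The two checks on the output above (the hostname has an entry in /etc/hosts, and it is not mapped to the loopback address) could also be scripted. This is only a minimal sketch, assuming the plain "IP name [name ...]" hosts format shown above; the function names `parse_hosts` and `check_host` and the `SAMPLE` text are made up for illustration.

```python
import ipaddress


def parse_hosts(text):
    """Map each name found in hosts-file text to the list of IPs it is listed under."""
    names = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        ip, hostnames = fields[0], fields[1:]
        for name in hostnames:
            names.setdefault(name, []).append(ip)
    return names


def check_host(hosts_text, hostname):
    """Return a list of problems for `hostname`; an empty list means none were found."""
    ips = parse_hosts(hosts_text).get(hostname, [])
    if not ips:
        return [f"{hostname} has no entry"]
    problems = []
    for ip in ips:
        if ipaddress.ip_address(ip).is_loopback:
            problems.append(f"{hostname} maps to loopback address {ip}")
    if len(set(ips)) > 1:
        problems.append(f"{hostname} maps to several addresses: {sorted(set(ips))}")
    return problems


# Hypothetical excerpt in the same shape as the file above.
SAMPLE = """\
127.0.0.1 localhost.localdomain localhost
10.0.0.110 compute010
10.0.0.117 frontend001
"""

print(check_host(SAMPLE, "compute010"))
```

On a real node one would pass the contents of /etc/hosts and the output of `hostname` instead of `SAMPLE`. Note that this only checks the file; it does not tell you whether Grid Engine itself resolves the name the same way.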



On Mon, May 30, 2016 at 12:00 PM, Bill Bryce <[email protected]> wrote:

> So typically with Grid Engine you need to select one machine as the
> ‘master’ machine in the cluster (you can have backups, but they run a
> ‘shadow_master’, so don’t worry about that for now).  The qmaster needs to
> be on one host that all the nodes can communicate with over the network.
> Each node must have a fully qualified name in order to communicate in a
> Grid Engine cluster.  So although you can ping one host from another, that
> does not mean that you have the host names set up properly.  For example, if
> I have a host compute010, it has to have a fully qualified name such as
> compute010.example.com (example.com is used here because it always works
> in examples, but don’t use it in your real cluster; you should have a
> domain of your own at your site).  And that fully qualified name needs to
> map to an IP address for the machine.
>
> This means that when I am on machine compute010 I can run:
>
> # hostname
>
> # cat /etc/hosts
>
> and the names should match.
>
> If I run
>
> # ping compute010.example.com
>
> it should return the IP address of the machine.
>
> And you definitely cannot map the loopback address to the hostname, i.e.
> 127.0.0.1 to compute010; that won’t work.  The messages file below from
> host compute010 indicates that it can’t communicate with the master on
> frontend001.  So if the physical networking is fine and the machines have
> IP addresses, then one of three things is wrong: the name resolution is
> broken, something else is running on the port that Grid Engine wants to
> use, or a firewall is blocking communication between the hosts.
>
> Basically, check everything you can between two hosts to make sure they can
> communicate properly.
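
A ping only shows that the host answers; it says nothing about the port. One way to test the port as well is a plain TCP connect, sketched below. The helper name `can_connect` is made up. The port number 6444 in the usage note is an assumption (it is the commonly used sge_qmaster port); the real value on a given cluster comes from `$SGE_QMASTER_PORT` or the `sge_qmaster` entry in /etc/services.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run from an exec node, something like `can_connect("frontend001", 6444)` returning False would point at name resolution, a firewall, or no qmaster listening there, but it cannot tell those three apart on its own.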
>
> Regards,
>
> Bill.
>
> On May 30, 2016, at 2:24 PM, Radhouane Aniba <[email protected]> wrote:
>
> OK, here is what I have.
>
> I connected to one node, compute010.
>
> qconf -sconf gives me this:
>
>
> #global:
> execd_spool_dir              /var/spool/gridengine/execd
> mailer                       /usr/bin/mail
> xterm                        /usr/bin/xterm
> load_sensor                  none
> prolog                       none
> epilog                       none
> shell_start_mode             posix_compliant
> login_shells                 bash,sh,ksh,csh,tcsh
> min_uid                      0
> min_gid                      0
> user_lists                   none
> xuser_lists                  none
> projects                     none
> xprojects                    none
> enforce_project              false
> enforce_user                 auto
> load_report_time             00:00:40
> max_unheard                  00:05:00
> reschedule_unknown           00:00:00
> loglevel                     log_warning
> administrator_mail           root
> set_token_cmd                none
> pag_cmd                      none
> token_extend_time            none
> shepherd_cmd                 none
> qmaster_params               none
> execd_params                 none
> reporting_params             accounting=true reporting=false \
>                              flush_time=00:00:15 joblog=false sharelog=00:00:00
> finished_jobs                100
> gid_range                    65400-65500
> max_aj_instances             2000
> max_aj_tasks                 75000
> max_u_jobs                   0
> max_jobs                     0
> auto_user_oticket            0
> auto_user_fshare             0
> auto_user_default_project    none
> auto_user_delete_time        86400
> delegated_file_staging       false
> reprioritize                 0
> rlogin_daemon                /usr/sbin/sshd -i
> rlogin_command               /usr/bin/ssh
> qlogin_daemon                /usr/sbin/sshd -i
> qlogin_command               /usr/share/gridengine/qlogin-wrapper
> rsh_daemon                   /usr/sbin/sshd -i
> rsh_command                  /usr/bin/ssh
> jsv_url                      none
> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>
>
> The messages file in the spool directory:
>
>
> ubuntu@compute010:~$ more /var/spool/gridengine/execd/compute010/messages
> 05/02/2016 18:10:11|  main|compute010|E|can't find connection
> 05/02/2016 18:10:11|  main|compute010|E|can't get configuration from qmaster -- backgrounding
> 05/04/2016 16:58:28|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 05/18/2016 17:10:36|  main|compute010|W|can't register at qmaster "frontend001": abort qmaster registration due to communication errors
> 05/18/2016 17:37:55|  main|compute010|I|controlled shutdown 6.2u5
> 05/18/2016 17:46:28|  main|compute010|E|can't find connection
> 05/18/2016 17:46:28|  main|compute010|E|can't get configuration from qmaster -- backgrounding
> 05/18/2016 17:46:31|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 05/20/2016 14:27:40|  main|compute010|I|controlled shutdown 6.2u5
> 05/22/2016 17:00:28|  main|compute010|E|can't find connection
> 05/22/2016 17:00:28|  main|compute010|E|can't get configuration from qmaster -- backgrounding
> 05/22/2016 17:01:38|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 05/28/2016 03:59:31|  main|compute010|I|controlled shutdown 6.2u5
> 05/28/2016 03:59:49|  main|compute010|W|local configuration compute010 not defined - using global configuration
> 05/28/2016 03:59:49|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 05/30/2016 17:41:50|  main|compute010|W|can't register at qmaster "compute010": abort qmaster registration due to communication errors
> 05/30/2016 17:41:50|  main|compute010|E|commlib error: got select error (Connection refused)
> 05/30/2016 17:42:14|  main|compute010|I|controlled shutdown 6.2u5
> 05/30/2016 17:58:58|  main|compute010|W|local configuration compute010 not defined - using global configuration
> 05/30/2016 17:58:58|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
>
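
To pull just the warnings and errors out of a messages file like the one above, a small parser is enough. The sketch below assumes the `date time|thread|host|level|message` layout seen in this log and treats any line that does not start with a timestamp as a continuation of the previous entry (which also copes with mail-wrapped lines). The names `parse_messages`, `problems`, and `SAMPLE` are made up for illustration.

```python
import re
from datetime import datetime

# date time|thread|host|level|message, as seen in the execd messages file above
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|\s*([^|]*)\|([^|]*)\|([A-Z])\|(.*)$"
)


def parse_messages(text):
    """Parse messages-file text into dicts; wrapped lines are re-joined."""
    entries = []
    for raw in text.splitlines():
        line = raw.lstrip("> ").rstrip()  # tolerate email quote prefixes
        m = LINE_RE.match(line)
        if m:
            when, thread, host, level, msg = m.groups()
            entries.append({
                "time": datetime.strptime(when, "%m/%d/%Y %H:%M:%S"),
                "thread": thread.strip(),
                "host": host,
                "level": level,
                "message": msg,
            })
        elif line and entries:
            # No timestamp: treat as the tail of the previous message.
            entries[-1]["message"] += " " + line
    return entries


def problems(entries):
    """Keep only warning (W) and error (E) entries."""
    return [e for e in entries if e["level"] in ("W", "E")]


# Three entries copied from the log above, in their wrapped form.
SAMPLE = """\
05/30/2016 17:41:50|  main|compute010|W|can't register at qmaster
"compute010": abort qmaster registration due to communication errors
05/30/2016 17:41:50|  main|compute010|E|commlib error: got select error
(Connection refused)
05/30/2016 17:58:58|  main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
"""

for entry in problems(parse_messages(SAMPLE)):
    print(entry["time"], entry["level"], entry["message"])
```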
> I had the qmaster running on all nodes before (master and executors), with
> no problem.
> When I kill sge_qmaster on the node, sge_execd stops working because it's
> not able to connect to the master.
>
> A ping from the node to the frontend node shows that it is reachable, though.
>
> :/
>
> On Mon, May 30, 2016 at 11:14 AM, Bill Bryce <[email protected]> wrote:
>
>> Okay,
>>
>> Can you run any qconf commands, such as ‘qconf -sconf’?  Try having a look
>> at the messages files for the execution daemons.  They should be in
>>
>> $SGE_ROOT/default/spool/ and in there are directories for the master and
>> exec hosts (if you have this installed in a shared filesystem
>> environment).  You can check both the qmaster messages file and the execd
>> messages files in those directories.
>>
>> A question.  Do you have the qmaster running on one host or on many?  I
>> noticed that you have the ps output for compute010 and it is running a
>> qmaster.
>>
>> Another thing you can check is whether all nodes can contact the qmaster
>> machine, i.e. that the networking is configured properly.  You can also make
>> sure that the host naming is correct: either configure DNS properly or
>> configure an /etc/hosts file for all nodes so the IP-to-host-name mapping is
>> consistent across the cluster.  Grid Engine is very picky about host names.
>>
>>
>>
>> On May 30, 2016, at 1:36 PM, Radhouane Aniba <[email protected]> wrote:
>>
>> Hi Bill,
>>
>> Yes, I am sure.
>>
>> This is what I get when I log in to one of the nodes and run:
>>
>> ubuntu@compute010:~$ ps -ef | grep sge_
>> sgeadmin  1254     1  0 May28 ?        00:00:39 /usr/lib/gridengine/sge_qmaster
>> sgeadmin  1446     1  0 May28 ?        00:00:22 /usr/lib/gridengine/sge_execd
>> ubuntu    2552  2527  0 17:36 pts/0    00:00:00 grep --color=auto sge_
>>
>>
>> On Mon, May 30, 2016 at 10:33 AM, Bill Bryce <[email protected]> wrote:
>>
>>> Hi Rad,
>>>
>>> Are you sure that the execution daemons are running on your compute
>>> nodes?  Can you log in to one of the nodes, say ‘compute001’, and do a ps
>>> looking for the execd?  When an execd is functioning normally it reports
>>> the load, memory, etc.; none of your nodes are showing that.
>>>
>>> Regards,
>>>
>>> Bill.
>>>
>>> On May 30, 2016, at 1:20 PM, Radhouane Aniba <[email protected]> wrote:
>>>
>>> Hello all,
>>>
>>> I am trying to submit a simple "hello world" job to test a Grid Engine
>>> cluster (I have used it before with no problems).
>>>
>>> The problem is that my job waits in the queue forever.
>>>
>>> The qhost command shows a weird state for the compute nodes:
>>>
>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>> -------------------------------------------------------------------------------
>>> global                  -               -     -       -       -       -       -
>>> compute001              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute002              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute003              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute004              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute005              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute006              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute007              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute008              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute009              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute010              lx26-amd64      4     -   31.4G       -     0.0       -
>>> compute011              lx26-amd64      4     -   31.4G       -     0.0
>>>
>>> Normally, even when the compute nodes are idle, I see some values in the
>>> LOAD and MEMUSE columns.
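
With more than a handful of nodes it can help to list the hosts with an empty LOAD column automatically. The sketch below assumes the column order of the qhost output above (LOAD is the fourth field) and unwrapped lines. The function name `hosts_without_load` is made up, and the `compute002` row in `SAMPLE` carries invented values purely to show what a reporting host might look like.

```python
def hosts_without_load(qhost_output):
    """Return the host names whose LOAD column is '-' in qhost output."""
    silent = []
    for line in qhost_output.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and the 'global' pseudo-host.
        if len(fields) < 4 or fields[0] in ("HOSTNAME", "global"):
            continue
        host, load = fields[0], fields[3]
        if load == "-":
            silent.append(host)
    return silent


# compute001 is copied from the output above; compute002's numbers are invented.
SAMPLE = """\
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
compute001              lx26-amd64      4     -   31.4G       -     0.0       -
compute002              lx26-amd64      4  0.01   31.4G    1.2G     0.0     0.0
"""

print(hosts_without_load(SAMPLE))
```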
>>>
>>> I am not an SGE person, but I am familiar with all the commands. Any
>>> help would be much appreciated.
>>>
>>> The qstat -f command shows all my nodes in au state. I've been reading
>>> a lot about it, and I understood it's an alarm state (overloaded?).
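
As far as I can tell from the qstat man page, the state column is a string of single-letter flags, so "au" would be two flags rather than one state. The lookup table below is a paraphrase from memory and should be checked against the man page of the installed version; the names `QUEUE_STATE_FLAGS` and `explain_state` are made up for illustration.

```python
# Assumed meanings, paraphrased from the qstat(1) man page; verify locally.
QUEUE_STATE_FLAGS = {
    "a": "alarm: a load threshold is exceeded",
    "A": "alarm: a suspend threshold is exceeded",
    "u": "unknown: the sge_execd on that host cannot be contacted",
    "s": "suspended",
    "S": "suspended by subordination",
    "d": "disabled",
    "D": "disabled by calendar",
    "C": "suspended by calendar",
    "E": "error",
}


def explain_state(state):
    """Expand a queue state string such as 'au' into one description per flag."""
    return [QUEUE_STATE_FLAGS.get(ch, f"{ch}: not in this table") for ch in state]


for description in explain_state("au"):
    print(description)
```

If that reading is right, the "u" part is about the qmaster not hearing from the execd, which is a different thing from the node being overloaded.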
>>>
>>> The only heavy activity I had on the head node was a script downloading
>>> 19T of data. Could the head node be the problem and not the compute nodes?
>>> sge_execd is running on all the compute/exec nodes :/
>>>
>>> --
>>> *Rad*
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>>>
>>>
>>> William Bryce | VP Products
>>> Univa Corporation, Toronto
>>> E: [email protected] | D: 647-9742841 | Toll-Free (800) 370-5320
>>> W: Univa.com <http://univa.com/> | FB: facebook.com/univa.corporation |
>>> T: twitter.com/Grid_Engine
>>>
>>>
>>
>>
>> --
>> *Radhouane Aniba*
>> *Bioinformatics Scientist*
>> *BC Cancer Agency, Vancouver, Canada*
>>
>>
>>
>>
>
>
> --
> *Radhouane Aniba*
> *Bioinformatics Scientist*
> *BC Cancer Agency, Vancouver, Canada*
>
>
>
>


-- 
*Radhouane Aniba*
*Bioinformatics Scientist*
*BC Cancer Agency, Vancouver, Canada*

