Typically with Grid Engine you select one machine as the ‘master’ machine in the cluster (you can have backups, but they run a ‘shadow_master’, so don’t worry about that for now). The qmaster needs to be on one host that all the nodes can communicate with over the network. Each node must have a fully qualified name in order to communicate in a Grid Engine cluster. So although you can ping one host from another, that does not mean the host names are set up properly. For example, if I have a host compute010, it has to have a fully qualified name such as compute010.example.com (example.com is used here because it always works in examples, but don’t use it in your real cluster; you should have a domain at your site). And that fully qualified name needs to map to an IP address for the machine.
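As a rough illustration of the naming rule above, here is a minimal Python sketch that checks whether a name is fully qualified and maps to a usable address. It is not part of Grid Engine; the function names (`parse_hosts`, `check_name`, `check_local_host`) and the sample hosts content are made up for this example, and 192.0.2.10 is a documentation-only address. A name that maps to a loopback address is reported as a problem, since other hosts could not reach the machine through it.

```python
import ipaddress
import socket


def parse_hosts(text):
    """Map every name in /etc/hosts-style text to the addresses listed for it."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), []).append(address)
    return table


def check_name(name, table):
    """Return a list of problems for one host name; an empty list means none found."""
    problems = []
    if "." not in name:
        problems.append(f"{name} is not fully qualified")
    addresses = table.get(name.lower(), [])
    if not addresses:
        problems.append(f"{name} has no address")
    for address in addresses:
        if ipaddress.ip_address(address).is_loopback:
            problems.append(f"{name} maps to loopback address {address}")
    return problems


def check_local_host():
    """Run the same checks against the live resolver for this machine."""
    name = socket.getfqdn()
    try:
        address = socket.gethostbyname(name)
    except OSError:  # resolver errors are subclasses of OSError
        return [f"{name} does not resolve"]
    return check_name(name, {name.lower(): [address]})


# Hypothetical hosts file content, for illustration only.
SAMPLE_HOSTS = """\
127.0.0.1   localhost
192.0.2.10  compute010.example.com compute010
"""

if __name__ == "__main__":
    table = parse_hosts(SAMPLE_HOSTS)
    print(check_name("compute010.example.com", table))
    print(check_local_host())
```

Running `check_local_host()` on each node and on the master gives a quick first look. Note that the whole 127.0.0.0/8 range counts as loopback here, so a 127.0.1.1 entry for the hostname (which Debian and Ubuntu installers often add) is flagged as well.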
This means that when I am on machine compute010 I can run:

    # hostname
    # cat /etc/hosts

and the names should match. If I do a

    # ping compute010.example.com

it should return the IP address of the machine. And you definitely cannot map the loopback address to the hostname, i.e. 127.0.0.1 to compute010; that won’t work.

The messages file below from host compute010 indicates that it can’t communicate with the master on frontend001. So if the physical networking is fine and the machines have IP addresses, then either the name resolution is broken, something else is running on the port that Grid Engine wants to use, or a firewall is blocking communication between the hosts. Basically, check everything you can between the two hosts to make sure they can communicate properly.

Regards,

Bill.

> On May 30, 2016, at 2:24 PM, Radhouane Aniba <arad...@gmail.com> wrote:
>
> Ok here is what I have
>
> connected to one node compute010
>
> qconf -sconf gives me this
>
> #global:
> execd_spool_dir              /var/spool/gridengine/execd
> mailer                       /usr/bin/mail
> xterm                        /usr/bin/xterm
> load_sensor                  none
> prolog                       none
> epilog                       none
> shell_start_mode             posix_compliant
> login_shells                 bash,sh,ksh,csh,tcsh
> min_uid                      0
> min_gid                      0
> user_lists                   none
> xuser_lists                  none
> projects                     none
> xprojects                    none
> enforce_project              false
> enforce_user                 auto
> load_report_time             00:00:40
> max_unheard                  00:05:00
> reschedule_unknown           00:00:00
> loglevel                     log_warning
> administrator_mail           root
> set_token_cmd                none
> pag_cmd                      none
> token_extend_time            none
> shepherd_cmd                 none
> qmaster_params               none
> execd_params                 none
> reporting_params             accounting=true reporting=false \
>                              flush_time=00:00:15 joblog=false sharelog=00:00:00
> finished_jobs                100
> gid_range                    65400-65500
> max_aj_instances             2000
> max_aj_tasks                 75000
> max_u_jobs                   0
> max_jobs                     0
> auto_user_oticket            0
> auto_user_fshare             0
> auto_user_default_project    none
> auto_user_delete_time        86400
> delegated_file_staging       false
> reprioritize                 0
> rlogin_daemon                /usr/sbin/sshd -i
> rlogin_command               /usr/bin/ssh
> qlogin_daemon                /usr/sbin/sshd -i
> qlogin_command               /usr/share/gridengine/qlogin-wrapper
> rsh_daemon                   /usr/sbin/sshd -i
> rsh_command                  /usr/bin/ssh
> jsv_url                      none
> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>
> the message in spool :
>
> ubuntu@compute010:~$ more /var/spool/gridengine/execd/compute010/messages
> 05/02/2016 18:10:11| main|compute010|E|can't find connection
> 05/02/2016 18:10:11| main|compute010|E|can't get configuration from qmaster -- backgrounding
> 05/04/2016 16:58:28| main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 05/18/2016 17:10:36| main|compute010|W|can't register at qmaster "frontend001": abort qmaster registration due to communication errors
> 05/18/2016 17:37:55| main|compute010|I|controlled shutdown 6.2u5
> 05/18/2016 17:46:28| main|compute010|E|can't find connection
> 05/18/2016 17:46:28| main|compute010|E|can't get configuration from qmaster -- backgrounding
> 05/18/2016 17:46:31| main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 05/20/2016 14:27:40| main|compute010|I|controlled shutdown 6.2u5
> 05/22/2016 17:00:28| main|compute010|E|can't find connection
> 05/22/2016 17:00:28| main|compute010|E|can't get configuration from qmaster -- backgrounding
> 05/22/2016 17:01:38| main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 05/28/2016 03:59:31| main|compute010|I|controlled shutdown 6.2u5
> 05/28/2016 03:59:49| main|compute010|W|local configuration compute010 not defined - using global configuration
> 05/28/2016 03:59:49| main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
> 05/30/2016 17:41:50| main|compute010|W|can't register at qmaster "compute010": abort qmaster registration due to communication errors
> 05/30/2016 17:41:50| main|compute010|E|commlib error: got select error (Connection refused)
> 05/30/2016 17:42:14| main|compute010|I|controlled shutdown 6.2u5
> 05/30/2016 17:58:58| main|compute010|W|local configuration compute010 not defined - using global configuration
> 05/30/2016 17:58:58| main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
>
> I had the qmaster running on all nodes before, with no problem (master and
> executors)
> when I kill sge_qmaster on the node, the sge_execd is not working anymore
> because it's not able to connect to the master
>
> a ping on the node to the frontend node shows that it is visible though
>
> :/
>
> On Mon, May 30, 2016 at 11:14 AM, Bill Bryce <bbr...@univa.com> wrote:
> Okay,
>
> Can you run any qconf commands such as ‘qconf -sconf’? Try having a look at
> the messages files for the execution daemons. They should be in
> $SGE_ROOT/default/spool/ and in there are directories for the master and exec
> hosts (if you have this installed in a shared filesystem environment). You
> can check both the qmaster messages file and the execd messages files in
> those directories.
>
> A question. Do you have the qmaster running on one host or on many? I
> noticed that you have the ps output for compute010 and it is running a
> qmaster.
>
> Another thing you can check is whether all nodes can contact the qmaster
> machine, i.e. the networking is configured properly. You can also make sure
> that the host naming is correct: either configure DNS properly or configure an
> /etc/hosts file for all nodes so the IP to host name mapping is consistent
> across the cluster. Grid Engine is very picky about host names.
>
>> On May 30, 2016, at 1:36 PM, Radhouane Aniba <arad...@gmail.com> wrote:
>>
>> Hi Bill
>>
>> Yes I am sure
>>
>> This is what I have when I login to one of the nodes and do
>>
>> ubuntu@compute010:~$ ps -ef | grep sge_
>> sgeadmin  1254     1  0 May28 ?        00:00:39 /usr/lib/gridengine/sge_qmaster
>> sgeadmin  1446     1  0 May28 ?        00:00:22 /usr/lib/gridengine/sge_execd
>> ubuntu    2552  2527  0 17:36 pts/0    00:00:00 grep --color=auto sge_
>>
>> On Mon, May 30, 2016 at 10:33 AM, Bill Bryce <bbr...@univa.com> wrote:
>> Hi Rad,
>>
>> Are you sure that the execution daemons are running on your compute nodes?
>> Can you login to one of the nodes, say ‘compute001’, and do a ps looking for
>> the execd? When an execd is functioning normally it provides the load and
>> memory, etc. None of your nodes are showing that.
>>
>> Regards,
>>
>> Bill.
>>
>>> On May 30, 2016, at 1:20 PM, Radhouane Aniba <arad...@gmail.com> wrote:
>>>
>>> Hello all,
>>>
>>> I am trying to submit a simple "hello world" to test a gridengine (I used
>>> it before with no problems)
>>>
>>> The problem is that my job is waiting in the queue forever
>>>
>>> The qhost command shows a weird state of the compute nodes
>>>
>>> HOSTNAME     ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>> -------------------------------------------------------------------------------
>>> global       -              -     -       -       -       -       -
>>> compute001   lx26-amd64     4     -   31.4G       -     0.0       -
>>> compute002   lx26-amd64     4     -   31.4G       -     0.0       -
>>> compute003   lx26-amd64     4     -   31.4G       -     0.0       -
>>> compute004   lx26-amd64     4     -   31.4G       -     0.0       -
>>> compute005   lx26-amd64     4     -   31.4G       -     0.0       -
>>> compute006   lx26-amd64     4     -   31.4G       -     0.0       -
>>> compute007   lx26-amd64     4     -   31.4G       -     0.0       -
>>> compute008   lx26-amd64     4     -   31.4G       -     0.0       -
>>> compute009   lx26-amd64     4     -   31.4G       -     0.0       -
>>> compute010   lx26-amd64     4     -   31.4G       -     0.0       -
>>> compute011   lx26-amd64     4     -   31.4G       -     0.0       -
>>>
>>> In normal times even when the compute nodes are not used I used to have
>>> some information on the load and memuse columns
>>>
>>> I am not an SGE person but I am familiar with all the commands, any help
>>> would be much appreciated
>>>
>>> the qstat -f command shows all my nodes in au state. I've been reading a
>>> lot about it and I understood it's an alarm state (overloaded ?)
>>>
>>> the only heavy activity I had on the head node was a script downloading 19T
>>> of data, could the headnode be the problem and not the compute nodes ?
>>>
>>> sge_execd is working on all the compute/exec nodes :/
>>>
>>> --
>>> Rad
>>
>> --
>> Radhouane Aniba
>> Bioinformatics Scientist
>> BC Cancer Agency, Vancouver, Canada

William Bryce | VP Products
Univa Corporation, Toronto
E: bbr...@univa.com | D: 647-9742841 | Toll-Free (800) 370-5320
W: Univa.com | FB: facebook.com/univa.corporation | T: twitter.com/Grid_Engine
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users