Hi,

On 07.04.2012, at 17:14, Skip Coombe wrote:
> Hi Reuti,
> 
> On Sat, Apr 7, 2012 at 8:27 AM, Reuti <[email protected]> wrote:
> > Hi,
> > 
> > On 07.04.2012, at 03:54, Skip Coombe wrote:
> > 
> > > (Sorry for incomplete message)
> > > 
> > > I set up 2 hosts in one cluster on CentOS 5.4
> > > 
> > > Linux version 2.6.18-308.1.1.el5 ([email protected])
> > > (gcc version 4.1.2 20080704 (Red Hat 4.1.2-52)) #1 SMP Wed Mar 7 04:16:51 EST 2012
> > > Linux elm.tdi.local 2.6.18-308.1.1.el5 #1 SMP Wed Mar 7 04:16:51 EST 2012
> > > x86_64 x86_64 x86_64 GNU/Linux
> > > 
> > > ge2011.11 installed with ge2011.11-x64.tar with mostly default values
> > > except db=classic (same domain)
> > > 
> > > (all cmds on elm.tdi.local)
> > > 
> > > $ qconf -sel
> > > elm.tdi.local
> > > oak.tdi.local
> > > 
> > > but
> > > 
> > > $ qconf -sconf oak
> > > configuration oak.tdi.local not defined
> > 
> > If all the machines have the same OS, you don't need any local configuration
> > at all. In fact, it often leads to confusion about which settings are used
> > in the end.
> > 
> > $ qconf -dconf elm
> > 
> > The remove the one from eln. Then for both machines the global configuration
> > is used (qconf -sconf).

Should read: This removes the one for elm.
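A short sketch of the cleanup (assuming elm is the only host that got a local
configuration):

$ qconf -sconfl     # list hosts having a local configuration
$ qconf -dconf elm  # delete elm's local configuration
$ qconf -sconf      # the global configuration, used by both hosts afterwards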
> I will experiment with this, although I am unclear about the process. The
> only configuration I did after identical installation on elm and oak was to
> add oak as an execution host on elm with qconf -Ae ${oak-execution-host-spec}.
> 
> > How did you install SGE on each of them? Do they share a common directory
> > on both machines? Did you start the execd on oak by /etc/init.d/sgeexecd
> > by hand?
> 
> Both installations were done identically using scripts install_qmaster
> followed by install_execd from a common pathname (/opt/SGE/ge2011.11) but on
> separate hosts.
> 
> /opt/SGE/ge2011.11/bin/linux-x64/sge_qmaster
> /opt/SGE/ge2011.11/bin/linux-x64/sge_execd
> 
> were both started by the installation scripts and are running on both hosts.

This sounds like you got two independent clusters. For SGE you need only one
machine for the qmaster* - it can even be a separate machine without an
execution daemon. In your case it also hosts an execution daemon, which is
fine.

Then the second execution machine is usually set up to share:

- /opt/SGE/ge2011.11
- /home

On this second machine you only need to start the execution daemon, but it's
fine to install it with the script install_execd so that it also gets started
when the machine boots.

In case you don't want to share "/opt/SGE/ge2011.11", it's at least necessary
to copy "/opt/SGE/ge2011.11/default/common" to the other machine. The file
act_qmaster therein will tell the execution daemon where the qmaster lives.
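If you go for the copy instead of sharing, a minimal sketch to run on oak (the
scp source path is an assumption based on your prefix, and it assumes the
startup script was installed as /etc/init.d/sgeexecd):

$ scp -r elm:/opt/SGE/ge2011.11/default/common /opt/SGE/ge2011.11/default/
$ cat /opt/SGE/ge2011.11/default/common/act_qmaster   # should read elm.tdi.local
$ /etc/init.d/sgeexecd start                          # start the execution daemon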
There is no built-in file staging in SGE (except for the submitted job script)
to the execution host. If you want to check the created output files, it's
therefore necessary to share /home.

For communication between the machines, also ports 6444 and 6445 should be
open in case you run a firewall thereon.
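E.g. assuming the stock iptables service of CentOS 5, something like this
(untested sketch, on both hosts):

$ iptables -I INPUT -p tcp --dport 6444 -j ACCEPT   # sge_qmaster
$ iptables -I INPUT -p tcp --dport 6445 -j ACCEPT   # sge_execd
$ service iptables save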
--
Reuti

*) Unless you want to implement a failover setup.

> Skip
> 
> > --
> > Reuti
> > 
> > > I issued "qsub sleeper.sh 300" 6 times and expected to see 2 jobs being
> > > executed on each host, but
> > > 
> > > $ qstat -f
> > > queuename                      qtype resv/used/tot. load_avg arch      states
> > > ---------------------------------------------------------------------------------
> > > [email protected]            BIP   0/2/2          0.19     linux-x64
> > >      29 0.55500 Sleeper    skip         r     04/06/2012 15:31:32     1
> > >      30 0.55500 Sleeper    skip         r     04/06/2012 15:31:32     1
> > > ---------------------------------------------------------------------------------
> > > [email protected]            BIP   0/0/1          -NA-     -NA-      au
> > > 
> > > ############################################################################
> > >  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> > > ############################################################################
> > >      31 0.55500 Sleeper    skip         qw    04/06/2012 15:31:26     1
> > >      32 0.55500 Sleeper    skip         qw    04/06/2012 15:31:27     1
> > >      33 0.55500 Sleeper    skip         qw    04/06/2012 15:31:28     1
> > >      34 0.55500 Sleeper    skip         qw    04/06/2012 15:31:29     1
> > > 
> > > [skip@elm jobs]$ qstat -F
> > > queuename                      qtype resv/used/tot. load_avg arch      states
> > > ---------------------------------------------------------------------------------
> > > [email protected]            BIP   0/2/2          0.19     linux-x64
> > >    hl:load_avg=0.190000
> > >    hl:load_short=0.290000
> > >    hl:load_medium=0.190000
> > >    hl:load_long=0.150000
> > >    hl:arch=linux-x64
> > >    hl:num_proc=2
> > >    hl:mem_free=2.969G
> > >    hl:swap_free=5.750G
> > >    hl:virtual_free=8.718G
> > >    hl:mem_total=3.796G
> > >    hl:swap_total=5.750G
> > >    hl:virtual_total=9.546G
> > >    hl:mem_used=846.965M
> > >    hl:swap_used=160.000K
> > >    hl:virtual_used=847.121M
> > >    hl:cpu=1.000000
> > >    hl:m_topology=SCC
> > >    hl:m_topology_inuse=SCC
> > >    hl:m_socket=1
> > >    hl:m_core=2
> > >    hl:np_load_avg=0.095000
> > >    hl:np_load_short=0.145000
> > >    hl:np_load_medium=0.095000
> > >    hl:np_load_long=0.075000
> > >    qf:qname=all.q
> > >    qf:hostname=elm.tdi.local
> > >    qc:slots=0
> > >    qf:tmpdir=/tmp
> > >    qf:seq_no=0
> > >    qf:rerun=0.000000
> > >    qf:calendar=NONE
> > >    qf:s_rt=infinity
> > >    qf:h_rt=infinity
> > >    qf:s_cpu=infinity
> > >    qf:h_cpu=infinity
> > >    qf:s_fsize=infinity
> > >    qf:h_fsize=infinity
> > >    qf:s_data=infinity
> > >    qf:h_data=infinity
> > >    qf:s_stack=infinity
> > >    qf:h_stack=infinity
> > >    qf:s_core=infinity
> > >    qf:h_core=infinity
> > >    qf:s_rss=infinity
> > >    qf:h_rss=infinity
> > >    qf:s_vmem=infinity
> > >    qf:h_vmem=infinity
> > >    qf:min_cpu_interval=00:05:00
> > >      29 0.55500 Sleeper    skip         r     04/06/2012 15:31:32     1
> > >      30 0.55500 Sleeper    skip         r     04/06/2012 15:31:32     1
> > > ---------------------------------------------------------------------------------
> > > [email protected]            BIP   0/0/1          -NA-     -NA-      au
> > >    qf:qname=all.q
> > >    qf:hostname=oak.tdi.local
> > >    qc:slots=1
> > >    qf:tmpdir=/tmp
> > >    qf:seq_no=0
> > >    qf:rerun=0.000000
> > >    qf:calendar=NONE
> > >    qf:s_rt=infinity
> > >    qf:h_rt=infinity
> > >    qf:s_cpu=infinity
> > >    qf:h_cpu=infinity
> > >    qf:s_fsize=infinity
> > >    qf:h_fsize=infinity
> > >    qf:s_data=infinity
> > >    qf:h_data=infinity
> > >    qf:s_stack=infinity
> > >    qf:h_stack=infinity
> > >    qf:s_core=infinity
> > >    qf:h_core=infinity
> > >    qf:s_rss=infinity
> > >    qf:h_rss=infinity
> > >    qf:s_vmem=infinity
> > >    qf:h_vmem=infinity
> > >    qf:min_cpu_interval=00:05:00
> > > 
> > > ############################################################################
> > >  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> > > ############################################################################
> > >      31 0.55500 Sleeper    skip         qw    04/06/2012 15:31:26     1
> > >      32 0.55500 Sleeper    skip         qw    04/06/2012 15:31:27     1
> > >      33 0.55500 Sleeper    skip         qw    04/06/2012 15:31:28     1
> > >      34 0.55500 Sleeper    skip         qw    04/06/2012 15:31:29     1
> > > 
> > > Also, qmon cluster conf (on elm) only shows elm, but has both hosts in the
> > > execution hosts list and has a host group containing both named "@allhosts".
> > > 
> > > I'm probably overlooking something obvious. Any help will be appreciated.
> > > 
> > > Skip Coombe
> > > [email protected]
> > > 
> > > --
> > > Skip Coombe
> > > [email protected]
> > > 919.442.VLSI

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users