Dear List,
I've installed OGE on a mini-cluster of iMacs running OS X 10.6.8, and it
seems to be working but with one semi-major glitch. (Why iMacs, you
ask...well, they are what I inherited from a guy that moved his lab...5 iMacs
and various other boxes.)
I compiled the OGE source locally, and that went great after I tweaked it to
find darwin-x64 and whatnot. Installation went great, following the wonderful
install vids that have been posted for GE on Mac OS X. I have qmaster running
on dhcp80fff96b, with three execution hosts (dhcp80fff96b, dhcp80fff9b6, and
dhcp80fff9d0), and an NFS share between them (where GE resides). Passwordless
ssh is enabled for the GE owner, so the boxes should be able to communicate.
This shouldn't be necessary for the operation of OGE. It *might* be needed
for the installation, but even that can be avoided by installing locally on
each host.
Thanks. I was thinking of MPI jobs and communicating between nodes.
So, this is where the problems arise: in all.q, the execution host on the
master node running qmaster throws an E status.
<cut>
dhcp80fff96b:~ akitchen$ qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
[email protected] 0/0/2 0.02 darwin-x64 E
---------------------------------------------------------------------------------
[email protected] 0/0/2 0.00 darwin-x64
---------------------------------------------------------------------------------
[email protected] 0/0/2 0.00 darwin-x64
<cut>
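For anyone who would rather script this check than eyeball it, here is a
minimal Python sketch that picks out the queue instances whose states column
contains E from `qstat -f` text. It assumes the column layout shown above (a
qtype value such as BIP is tolerated if present). The function name
`queues_in_error` and the host names in the sample are made up for
illustration.

```python
import re

# Matches the resv/used/tot. column of a queue instance line, e.g. "0/0/2".
SLOTS = re.compile(r"^\d+/\d+/\d+$")


def queues_in_error(qstat_f_output):
    """Return the queue instance names whose state column contains 'E'."""
    flagged = []
    for line in qstat_f_output.splitlines():
        fields = line.split()
        # Queue instance lines start with "queue@host"; skip everything else
        # (header, dashed separators, job lines).
        if not fields or "@" not in fields[0]:
            continue
        idx = next((i for i, f in enumerate(fields) if SLOTS.match(f)), None)
        if idx is None:
            continue
        # After the slots column come load_avg, arch, then an optional states field.
        states = fields[idx + 3] if len(fields) > idx + 3 else ""
        if "E" in states:
            flagged.append(fields[0])
    return flagged


# Hypothetical sample in the same shape as the output above.
sample = """\
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------
all.q@node-a.example.org BIP 0/0/2 0.02 darwin-x64 E
---------------------------------------------------
all.q@node-b.example.org BIP 0/0/2 0.00 darwin-x64
"""
print(queues_in_error(sample))
```

In practice the text would come from running `qstat -f` and capturing its
standard output.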
I can submit jobs and they will be successfully farmed out to the external
execution hosts, so it would seem that everything is fine and dandy.
Meanwhile, the execution daemon is working on the master node.
<cut>
dhcp80fff96b:~ akitchen$ qping dhcp80fff96b.state.edu 6445 execd 1
11/09/2012 17:08:25 endpoint dhcp80fff96b.state.edu/execd/1 at port 6445 is up since 89828 seconds
<cut>
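qping talks to the GE endpoint itself. A cruder check that only asks whether
a TCP port accepts connections at all can be sketched in Python as below.
This is an illustration, not part of GE: `port_is_open` is a hypothetical
helper, and the host/port pairs in the usage comment are simply the ones from
this thread (6444 for qmaster, 6445 for execd).

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


# Example usage against the hosts in this thread (run from any cluster node):
#   port_is_open("dhcp80fff96b.state.edu", 6444)   # qmaster
#   port_is_open("dhcp80fff96b.state.edu", 6445)   # execd
```

A False result for 6444 from the master node itself would point at qmaster
not listening (or not yet listening) rather than at the execd.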
I've tried just about everything (even rebooting the master node), and
nothing seems to solve this. I've looked in the spool messages to
troubleshoot, and I get a cryptic "commlib error".
<cut>
11/07/2012 15:27:47| main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
11/08/2012 10:43:00| main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
11/08/2012 10:43:02| main|dhcp80fff96b|E|commlib error: got read error (closing "dhcp80fff96b.state.edu/qmaster/1")
11/08/2012 10:43:03| main|dhcp80fff96b|W|can't register at qmaster "dhcp80fff96b.state.edu": abort qmaster registration due to communication errors
11/08/2012 10:43:03| main|dhcp80fff96b|E|commlib error: can't connect to service (Connection refused)
Are ports 6444 and 6445 excluded from the firewalls?
Do all machines always get the same address?
-- Reuti
Yes, all machines have stable IPs and they resolve to the same address
whenever queried. The local firewalls are all disabled (the machines sit
behind the uni's firewall), so that shouldn't be a problem. All machines
also have 6444/6445 reserved for qmaster/execd, respectively.
Thanks for the help!
Cheers,
Drew
11/08/2012 10:43:35| main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
11/08/2012 10:52:45| main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
11/08/2012 12:31:14| main|dhcp80fff96b|I|controlled shutdown 2011.11p1
11/08/2012 12:31:14| main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
<cut>
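To pull only the warnings and errors out of a messages file like the one
above, a small Python sketch along these lines may help. It assumes the
`date time|thread|host|level|message` layout seen in the excerpt, with
month-first dates, and it treats any line that does not fit that layout as
the wrapped tail of the previous message. `parse_messages` and `problems`
are hypothetical names, not GE tools.

```python
from datetime import datetime


def parse_messages(text):
    """Parse execd 'messages' text into a list of dicts, one per log entry."""
    entries = []
    for line in text.splitlines():
        parts = line.split("|", 4)
        if len(parts) != 5:
            # Assumption: a line without four '|' separators is a wrapped
            # continuation of the previous entry's message.
            if entries and line.strip():
                entries[-1]["message"] += " " + line.strip()
            continue
        stamp, thread, host, level, message = (p.strip() for p in parts)
        try:
            when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue  # not a timestamped entry after all
        entries.append({"when": when, "thread": thread, "host": host,
                        "level": level, "message": message})
    return entries


def problems(entries):
    """Keep only entries logged at level E (error) or W (warning)."""
    return [e for e in entries if e["level"] in ("E", "W")]
```

Calling `problems(parse_messages(open(path).read()))` on the execd spool
messages file (path left to the reader) would list just the commlib errors
and the registration warning with their timestamps.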
Otherwise, everything seems to be running fine. I've scrounged around and
found a couple Mac Minis that I'd like to add to the mini-cluster, but I'd
rather figure this out before adding them (and maybe shifting qmaster to one
of them).
Any help would be greatly appreciated!
Cheers and best,
Drew
P.S. Here is some more info for anyone curious....
dhcp80fff96b:~ akitchen$ hostname
dhcp80fff96b.state.edu
dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostname
Hostname: dhcp80fff96b.state.edu
Aliases: ANTH-M014 dhcp80fff96b
Host Address(es): XXX.XXX.XXX.107
dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostbyaddr XXX.XXX.XXX.107
Hostname: dhcp80fff96b.state.edu
Aliases: ANTH-M014 dhcp80fff96b
Host Address(es): XXX.XXX.XXX.107
dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostbyname dhcp80fff96b.state.edu
Hostname: dhcp80fff96b.state.edu
Aliases: ANTH-M014 dhcp80fff96b
Host Address(es): XXX.XXX.XXX.107
dhcp80fff96b:~ akitchen$ cat /etc/hosts
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
fe80::1%lo0 localhost
XXX.XXX.XXX.107 dhcp80fff96b.state.edu ANTH-M014 dhcp80fff96b
XXX.XXX.XXX.182 dhcp80fff9b6.state.edu ANTH-M036 dhcp80fff9b6
XXX.XXX.XXX.208 dhcp80fff9d0.state.edu ANTH-M013 dhcp80fff9d0
dhcp80fff96b:~ akitchen$ qconf -shgrp @allhosts
group_name @allhosts
hostlist dhcp80fff96b.state.edu dhcp80fff9d0.state.edu \
dhcp80fff9b6.state.edu
dhcp80fff96b:~ akitchen$ qconf -sel
dhcp80fff96b.state.edu
dhcp80fff9b6.state.edu
dhcp80fff9d0.state.edu
dhcp80fff96b:~ akitchen$ qconf -ss
dhcp80fff96b.state.edu
dhcp80fff96b:~ akitchen$ qconf -sh
dhcp80fff96b.state.edu
dhcp80fff9b6.state.edu
dhcp80fff9d0.state.edu
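
The gethostname / gethostbyname / gethostbyaddr output in this P.S. is a
forward-and-reverse lookup check done by hand. A rough Python equivalent,
handy for repeating it on every node, is sketched below. It uses the
system resolver via the standard `socket` module, not GE's own utilities,
so results could differ from what GE sees; `resolution_report` is a
hypothetical helper and the resolver arguments exist only so the check can
be exercised without a network.

```python
import socket


def resolution_report(name, forward=socket.gethostbyname_ex,
                      reverse=socket.gethostbyaddr):
    """Resolve `name`, reverse-resolve each address, and list any mismatches."""
    canonical, aliases, addresses = forward(name)
    mismatches = []
    for addr in addresses:
        back_name, _back_aliases, _back_addrs = reverse(addr)
        # A reverse lookup that returns a different primary name is the kind
        # of inconsistency worth ruling out on each host.
        if back_name.lower() != canonical.lower():
            mismatches.append((addr, back_name))
    return {"canonical": canonical, "aliases": aliases,
            "addresses": addresses, "mismatches": mismatches}


# Example usage on a cluster node:
#   print(resolution_report(socket.gethostname()))
```

An empty `mismatches` list on every node for every host name in
`qconf -sel` would match the consistent output shown above.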
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users