On 12.11.2012 at 21:31, Drew Kitchen wrote:

>>> Dear List,
>>>
>>> I've installed OGE on a mini-cluster of iMacs running OS X 10.6.8, and it
>>> seems to be working but with one semi-major glitch. (Why iMacs, you
>>> ask...well, they are what I inherited from a guy that moved his lab...5
>>> iMacs and various other boxes.)
>>>
>>> I compiled the OGE source locally, and that went great after I tweaked it
>>> to find darwin-x64 and whatnot. Installation went great, following the
>>> wonderful install vids that have been posted for GE on Mac OS X. I have
>>> qmaster running on dhcp80fff96b, with three execution hosts (dhcp80fff96b,
>>> dhcp80fff9b6, and dhcp80fff90d), and an NFS share between them (where GE
>>> resides). Passwordless ssh is enabled for the GE owner, so the boxes
>>> should be able to communicate.
>>
>> This shouldn't be necessary for the operation of OGE - just for the
>> installation it *might* be necessary (but you can also do it without by
>> local installations).
>
> Thanks. I was thinking of MPI jobs and communicating between nodes.
>
>>> So, this is where the problems arise: in all.q, the execution host on the
>>> master node running qmaster throws an E status.
>>>
>>> <cut>
>>> dhcp80fff96b:~ akitchen$ qstat -f
>>> queuename            qtype resv/used/tot. load_avg arch          states
>>> ---------------------------------------------------------------------------------
>>> [email protected]           0/0/2          0.02     darwin-x64    E

NB: Does the error reappear when you reset it with `qmod -cq all.q@dhcp80fff96b`?

-- Reuti

>>> ---------------------------------------------------------------------------------
>>> [email protected]           0/0/2          0.00     darwin-x64
>>> ---------------------------------------------------------------------------------
>>> [email protected]           0/0/2          0.00     darwin-x64
>>> <cut>
>>>
>>> I can submit jobs and they will be successfully farmed out to the external
>>> execution hosts, so it would seem that everything is fine and dandy.
>>> Meanwhile, the execution daemon is working on the master node.
>>>
>>> <cut>
>>> dhcp80fff96b:~ akitchen$ qping dhcp80fff96b.state.edu 6445 execd 1
>>> 11/09/2012 17:08:25 endpoint dhcp80fff96b.state.edu/execd/1 at port 6445 is up since 89828 seconds
>>> <cut>
>>>
>>> I've tried just about everything (even rebooting the master node), and
>>> nothing seems to solve this. I've looked in the spool messages to
>>> troubleshoot, and I get a cryptic "commlib error".
>>>
>>> <cut>
>>> 11/07/2012 15:27:47| main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
>>> 11/08/2012 10:43:00| main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
>>> 11/08/2012 10:43:02| main|dhcp80fff96b|E|commlib error: got read error (closing "dhcp80fff96b.state.edu/qmaster/1")
>>> 11/08/2012 10:43:03| main|dhcp80fff96b|W|can't register at qmaster "dhcp80fff96b.state.edu": abort qmaster registration due to communication errors
>>> 11/08/2012 10:43:03| main|dhcp80fff96b|E|commlib error: can't connect to service (Connection refused)
>>
>> The ports 6444 and 6445 are excluded from the firewalls?
>>
>> All machines get always the same address?
>>
>> -- Reuti
>
> Yes, all machines have stable IPs and they get the same address when queried.
> All firewalls are disabled (they exist under the uni's firewall), so that
> shouldn't be a problem. All machines also have 6444/6445 reserved for
> qmaster/execd, respectively.
>
> Thanks for the help!
>
> Cheers,
> Drew
>
>>> 11/08/2012 10:43:35| main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
>>> 11/08/2012 10:52:45| main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
>>> 11/08/2012 12:31:14| main|dhcp80fff96b|I|controlled shutdown 2011.11p1
>>> 11/08/2012 12:31:14| main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
>>> <cut>
>>>
>>> Otherwise, everything seems to be running fine. I've scrounged around and
>>> found a couple Mac Minis that I'd like to add to the mini-cluster, but I'd
>>> rather figure this out before adding them (and maybe shifting qmaster to
>>> one of them).
>>>
>>> Any help would be greatly appreciated!
>>>
>>> Cheers and best,
>>> Drew
>>>
>>> P.S. Here is some more info for anyone curious....
>>>
>>>
>>> dhcp80fff96b:~ akitchen$ hostname
>>> dhcp80fff96b.state.edu
>>>
>>> dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostname
>>> Hostname: dhcp80fff96b.state.edu
>>> Aliases: ANTH-M014 dhcp80fff96b
>>> Host Address(es): XXX.XXX.XXX.107
>>>
>>> dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostbyaddr XXX.XXX.XXX.107
>>> Hostname: dhcp80fff96b.state.edu
>>> Aliases: ANTH-M014 dhcp80fff96b
>>> Host Address(es): XXX.XXX.XXX.107
>>>
>>> dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostbyname dhcp80fff96b.state.edu
>>> Hostname: dhcp80fff96b.state.edu
>>> Aliases: ANTH-M014 dhcp80fff96b
>>> Host Address(es): XXX.XXX.XXX.107
>>>
>>> dhcp80fff96b:~ akitchen$ cat /etc/hosts
>>> ##
>>> # Host Database
>>> #
>>> # localhost is used to configure the loopback interface
>>> # when the system is booting. Do not change this entry.
>>> ##
>>> 127.0.0.1 localhost
>>> 255.255.255.255 broadcasthost
>>> ::1 localhost
>>> fe80::1%lo0 localhost
>>> XXX.XXX.XXX.107 dhcp80fff96b.state.edu ANTH-M014 dhcp80fff96b
>>> XXX.XXX.XXX.182 dhcp80fff9b6.state.edu ANTH-M036 dhcp80fff9b6
>>> XXX.XXX.XXX.208 dhcp80fff9d0.state.edu ANTH-M013 dhcp80fff9d0
>>>
>>> dhcp80fff96b:~ akitchen$ qconf -shgrp @allhosts
>>> group_name @allhosts
>>> hostlist dhcp80fff96b.state.edu dhcp80fff9d0.state.edu \
>>>          dhcp80fff9b6.state.edu
>>>
>>> dhcp80fff96b:~ akitchen$ qconf -sel
>>> dhcp80fff96b.state.edu
>>> dhcp80fff9b6.state.edu
>>> dhcp80fff9d0.state.edu
>>>
>>> dhcp80fff96b:~ akitchen$ qconf -ss
>>> dhcp80fff96b.state.edu
>>>
>>> dhcp80fff96b:~ akitchen$ qconf -sh
>>> dhcp80fff96b.state.edu
>>> dhcp80fff9b6.state.edu
>>> dhcp80fff9d0.state.edu
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
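
Since the thread turns on whether ports 6444 (qmaster) and 6445 (execd) are reachable between the hosts, a quick way to check this independently of Grid Engine might look like the sketch below. It only tests whether a plain TCP connection can be opened; it does not speak the commlib protocol, so it is not a replacement for `qping`. The function names `port_open` and `check_cluster` are made up for this illustration, and the host names in the usage note are simply the ones quoted in the thread.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a plain TCP connection to host:port can be opened.

    Any failure (name resolution, refused connection, timeout) counts as
    "not reachable" and yields False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # socket.gaierror and socket.timeout are both OSError subclasses.
        return False


def check_cluster(hosts, ports, timeout=2.0):
    """Map every (host, port) pair to a reachability flag."""
    return {
        (host, port): port_open(host, port, timeout)
        for host in hosts
        for port in ports
    }
```

Run from each node in turn, something like `check_cluster(["dhcp80fff96b.state.edu", "dhcp80fff9b6.state.edu", "dhcp80fff9d0.state.edu"], [6444, 6445])` would show which daemons accept connections from that node. Port 6444 is expected to answer only on the qmaster host, so a False for that port on a pure execution host is normal.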
