Dear List,

I've installed OGE on a mini-cluster of iMacs running OS X 10.6.8, and it seems to be
working but with one semi-major glitch. (Why iMacs, you ask...well, they are what I
inherited from a guy that moved his lab...5 iMacs and various other boxes.)

I compiled the OGE source locally, and that went great after I tweaked it to find
darwin-x64 and whatnot. Installation went great, following the wonderful install vids
that have been posted for GE on Mac OS X. I have qmaster running on dhcp80fff96b, with
three execution hosts (dhcp80fff96b, dhcp80fff9b6, and dhcp80fff9d0), and an NFS share
between them (where GE resides). Passwordless ssh is enabled for the GE owner, so the
boxes should be able to communicate.
Passwordless ssh shouldn't be necessary for the operation of OGE. It *might* be
needed for the installation, but you can avoid even that by installing locally on
each host.
Thanks. I was thinking of MPI jobs and communicating between nodes.

So, this is where the problems arise: in all.q, the execution host on the master node
running qmaster throws an E status.

<cut>
dhcp80fff96b:~ akitchen$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
[email protected]   0/0/2 0.02     darwin-x64    E
NB: Does the error reappear when you reset it with `qmod -cq all.q@dhcp80fff96b`?

-- Reuti
Doh! Thanks Reuti--I insanely forgot to just clear the error state and see if it kept throwing an error. Silly silly me...

Thanks again, Reuti!

Cheers,
Drew

---------------------------------------------------------------------------------
[email protected]   0/0/2 0.00     darwin-x64
---------------------------------------------------------------------------------
[email protected]   0/0/2 0.00     darwin-x64
<cut>
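
Reuti's suggestion to clear the error state and see whether it comes back can be scripted. This is only a minimal sketch, assuming the usual six-column `qstat -f` layout (queue name, qtype, slots, load, arch, states); `error_queues` is a hypothetical helper name, not something that ships with Grid Engine.

```shell
# Hypothetical helper (not part of Grid Engine): print the name of every
# queue instance whose states column contains "E".
# Reads `qstat -f` output on stdin. Assumes queue lines have six fields,
# with resv/used/tot. (e.g. 0/0/2) in the third; queue lines without a
# state have only five fields and are skipped.
error_queues() {
    awk 'NF == 6 && $3 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ && $6 ~ /E/ { print $1 }'
}

# Example use on a live cluster (commented out; needs a running qmaster):
#   qstat -f | error_queues | while read -r q; do
#       qstat -f -explain E -q "$q"   # show the recorded reason for E
#       qmod -cq "$q"                 # clear the error state
#   done
```

If the queue drops back into E after being cleared, the reason printed by `-explain E` is usually the next thing to look at.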

I can submit jobs and they will be successfully farmed out to the external execution
hosts, so it would seem that everything is fine and dandy. Meanwhile, the execution
daemon is working on the master node.

<cut>
dhcp80fff96b:~ akitchen$ qping dhcp80fff96b.state.edu 6445 execd 1
11/09/2012 17:08:25 endpoint dhcp80fff96b.state.edu/execd/1 at port 6445 is up since 89828 seconds
<cut>

I've tried just about everything (even rebooting the master node), and nothing seems to
solve this. I've looked in the spool messages to troubleshoot, and I get a cryptic
"commlib error".

<cut>
11/07/2012 15:27:47|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
11/08/2012 10:43:00|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
11/08/2012 10:43:02|  main|dhcp80fff96b|E|commlib error: got read error (closing "dhcp80fff96b.state.edu/qmaster/1")
11/08/2012 10:43:03|  main|dhcp80fff96b|W|can't register at qmaster "dhcp80fff96b.state.edu": abort qmaster registration due to communication errors
11/08/2012 10:43:03|  main|dhcp80fff96b|E|commlib error: can't connect to service (Connection refused)
Are ports 6444 and 6445 excluded from the firewalls?

Do all machines always get the same address?

-- Reuti
Yes, all machines have stable IPs and they get the same address when queried.
All firewalls are disabled (the machines sit behind the uni's firewall), so that
shouldn't be a problem. All machines also have 6444/6445 reserved for
qmaster/execd, respectively.

Thanks for the help!

Cheers,
Drew

11/08/2012 10:43:35|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
11/08/2012 10:52:45|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
11/08/2012 12:31:14|  main|dhcp80fff96b|I|controlled shutdown 2011.11p1
11/08/2012 12:31:14|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
<cut>
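
For anyone digging through a longer messages file, the noise from the repeated startup lines can be filtered out. A minimal sketch, assuming the pipe-separated layout shown in the excerpt (timestamp|thread|host|level|message); `errors_and_warnings` is a hypothetical helper name, and the spool path in the comment is an assumption about the install layout.

```shell
# Hypothetical helper: keep only error (E) and warning (W) entries from a
# Grid Engine messages file read on stdin. Assumes the level is the fourth
# pipe-separated field, as in the excerpt above.
errors_and_warnings() {
    awk -F'|' '$4 == "E" || $4 == "W"'
}

# Example (the spool path is an assumption; adjust it to your install):
#   errors_and_warnings < "$SGE_ROOT/$SGE_CELL/spool/dhcp80fff96b/messages"
```

Comparing the timestamps of the surviving lines against the "starting up" entries shows whether the errors only occur around daemon restarts.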

Otherwise, everything seems to be running fine. I've scrounged around and found a couple
of Mac Minis that I'd like to add to the mini-cluster, but I'd rather figure this out
before adding them (and maybe shifting qmaster to one of them).

Any help would be greatly appreciated!

Cheers and best,
Drew

P.S. Here is some more info for anyone curious....


dhcp80fff96b:~ akitchen$ hostname
dhcp80fff96b.state.edu

dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostname
Hostname: dhcp80fff96b.state.edu
Aliases:  ANTH-M014 dhcp80fff96b
Host Address(es): XXX.XXX.XXX.107

dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostbyaddr XXX.XXX.XXX.107
Hostname: dhcp80fff96b.state.edu
Aliases:  ANTH-M014 dhcp80fff96b
Host Address(es): XXX.XXX.XXX.107

dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostbyname dhcp80fff96b.state.edu
Hostname: dhcp80fff96b.state.edu
Aliases:  ANTH-M014 dhcp80fff96b
Host Address(es): XXX.XXX.XXX.107

dhcp80fff96b:~ akitchen$ cat /etc/hosts
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1    localhost
255.255.255.255    broadcasthost
::1             localhost
fe80::1%lo0    localhost
XXX.XXX.XXX.107 dhcp80fff96b.state.edu ANTH-M014 dhcp80fff96b
XXX.XXX.XXX.182 dhcp80fff9b6.state.edu ANTH-M036 dhcp80fff9b6
XXX.XXX.XXX.208 dhcp80fff9d0.state.edu ANTH-M013 dhcp80fff9d0

dhcp80fff96b:~ akitchen$ qconf -shgrp @allhosts
group_name @allhosts
hostlist dhcp80fff96b.state.edu dhcp80fff9d0.state.edu \
         dhcp80fff9b6.state.edu

dhcp80fff96b:~ akitchen$ qconf -sel
dhcp80fff96b.state.edu
dhcp80fff9b6.state.edu
dhcp80fff9d0.state.edu

dhcp80fff96b:~ akitchen$ qconf -ss
dhcp80fff96b.state.edu

dhcp80fff96b:~ akitchen$ qconf -sh
dhcp80fff96b.state.edu
dhcp80fff9b6.state.edu
dhcp80fff9d0.state.edu
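
The listings above can be cross-checked mechanically: every name reported by `qconf -sel` should also be resolvable on each box. A minimal sketch, assuming resolution goes through a hosts-format file; `missing_from_hosts` is a hypothetical helper name. It does exact string matching only and does not consult DNS, so a host it reports may still resolve by other means.

```shell
# Hypothetical helper: read host names on stdin (one per line, e.g. from
# `qconf -sel`) and print those that appear nowhere in the given
# hosts-format file, neither as canonical name nor as alias.
# Full-line comments in the hosts file are ignored; matching is exact
# and case-sensitive.
missing_from_hosts() {
    hosts_file=$1
    while read -r host; do
        [ -n "$host" ] || continue
        awk -v h="$host" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
            END { exit (found ? 0 : 1) }
        ' "$hosts_file" || printf '%s\n' "$host"
    done
}

# Example (run on each node; prints nothing when every host has an entry):
#   qconf -sel | missing_from_hosts /etc/hosts
```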
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
