On 12.11.2012 at 21:31, Drew Kitchen wrote:

>>> Dear List,
>>> 
>>> I've installed OGE on a mini-cluster of iMacs running OS X 10.6.8, and it seems to be
>>> working but with one semi-major glitch. (Why iMacs, you ask...well, they are what I
>>> inherited from a guy that moved his lab...5 iMacs and various other boxes.)
>>> 
>>> I compiled the OGE source locally, and that went great after I tweaked it to find
>>> darwin-x64 and whatnot. Installation went great, following the wonderful install vids
>>> that have been posted for GE on Mac OS X. I have qmaster running on dhcp80fff96b, with
>>> three execution hosts (dhcp80fff96b, dhcp80fff9b6, and dhcp80fff9d0), and an NFS share
>>> between them (where GE resides). Passwordless ssh is enabled for the GE owner, so the
>>> boxes should be able to communicate.
>> This shouldn't be necessary for operating OGE. It *might* be needed for the
>> installation only (and you can avoid even that by installing locally on each host).
> 
> Thanks. I was thinking of MPI jobs and communicating between nodes.
> 
>>> So, this is where the problems arise: in all.q, the execution host on the master node
>>> running qmaster throws an E status.
>>> 
>>> <cut>
>>> dhcp80fff96b:~ akitchen$ qstat -f
>>> queuename                      qtype resv/used/tot. load_avg arch          states
>>> ---------------------------------------------------------------------------------
>>> [email protected]   0/0/2 0.02     darwin-x64    E

NB: Does the error reappear when you reset it with `qmod -cq all.q@dhcp80fff96b`?

-- Reuti
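
A minimal sketch of how the reset could be scripted, assuming `qstat -f` prints the state flags as the last column, as in the output quoted here. The helper name `queues_in_error` is made up for illustration; `qstat -f -explain E` and `qmod -cq` are the stock commands for showing the error reason and clearing the error state.

```shell
# Hypothetical helper: print the queue instances whose state column contains "E".
# Reads `qstat -f` output on stdin. Assumes queue-instance lines have "@" in the
# first field and six fields when a state is set
# (queuename, qtype, resv/used/tot., load_avg, arch, states).
queues_in_error() {
    awk '$1 ~ /@/ && NF >= 6 && $NF ~ /E/ { print $1 }'
}

# Usage on a live cluster (not executed here):
#   qstat -f -explain E        # show why a queue instance went into E
#   qstat -f | queues_in_error | while read -r qi; do qmod -cq "$qi"; done
```

If the E state comes back right after clearing it, the reason printed by `-explain E` is the thing to look at first.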


>>> ---------------------------------------------------------------------------------
>>> [email protected]   0/0/2 0.00     darwin-x64
>>> ---------------------------------------------------------------------------------
>>> [email protected]   0/0/2 0.00     darwin-x64
>>> <cut>
>>> 
>>> I can submit jobs and they will be successfully farmed out to the external execution
>>> hosts, so it would seem that everything is fine and dandy. Meanwhile, the execution
>>> daemon is working on the master node.
>>> 
>>> <cut>
>>> dhcp80fff96b:~ akitchen$ qping dhcp80fff96b.state.edu 6445 execd 1
>>> 11/09/2012 17:08:25 endpoint dhcp80fff96b.state.edu/execd/1 at port 6445 is up since 89828 seconds
>>> <cut>
>>> 
>>> I've tried just about everything (even rebooting the master node), and nothing seems to
>>> solve this. I've looked in the spool messages to troubleshoot, and I get a cryptic
>>> "commlib error".
>>> 
>>> <cut>
>>> 11/07/2012 15:27:47|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
>>> 11/08/2012 10:43:00|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
>>> 11/08/2012 10:43:02|  main|dhcp80fff96b|E|commlib error: got read error (closing "dhcp80fff96b.state.edu/qmaster/1")
>>> 11/08/2012 10:43:03|  main|dhcp80fff96b|W|can't register at qmaster "dhcp80fff96b.state.edu": abort qmaster registration due to communication errors
>>> 11/08/2012 10:43:03|  main|dhcp80fff96b|E|commlib error: can't connect to service (Connection refused)
>> Are ports 6444 and 6445 open in the firewalls?
>> 
>> Do all machines always get the same address?
>> 
>> -- Reuti
> 
> Yes, all machines have stable IPs and they get the same address when queried. 
> All local firewalls are disabled (the machines sit behind the uni's firewall), so that 
> shouldn't be a problem. All machines also have 6444/6445 reserved for 
> qmaster/execd, respectively.
> 
> Thanks for the help!
> 
> Cheers,
> Drew
> 
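
A small sketch for double-checking the port reservation on each host, assuming the ports are registered in `/etc/services` under the IANA names `sge_qmaster` and `sge_execd` (they can also be set through the `SGE_QMASTER_PORT` and `SGE_EXECD_PORT` environment variables, in which case this file is not consulted). The helper name `service_port` is hypothetical.

```shell
# Hypothetical helper: print the TCP port registered for a service name
# in a services(5)-style file (defaults to /etc/services).
service_port() {
    name=$1
    file=${2:-/etc/services}
    awk -v n="$name" '$1 == n && $2 ~ /\/tcp$/ { sub(/\/tcp$/, "", $2); print $2; exit }' "$file"
}

# Usage (not executed here):
#   service_port sge_qmaster       # 6444 expected, per the reply above
#   service_port sge_execd         # 6445 expected, per the reply above
# Reachability from another node can then be probed with netcat, e.g.:
#   nc -z -w 3 dhcp80fff96b.state.edu "$(service_port sge_qmaster)"
```

Running this on every node and comparing the results would show whether all hosts agree on the two ports.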
>>> 11/08/2012 10:43:35|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
>>> 11/08/2012 10:52:45|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
>>> 11/08/2012 12:31:14|  main|dhcp80fff96b|I|controlled shutdown 2011.11p1
>>> 11/08/2012 12:31:14|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
>>> <cut>
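
To pull just the errors and warnings out of a long messages file, a filter along these lines may help. It is a sketch that assumes the pipe-separated layout seen in the log lines quoted above (timestamp, thread, host, level, message) with one entry per line. The helper name is made up, and the location of the messages file depends on how the spool directory was configured.

```shell
# Hypothetical helper: keep only entries whose level field (4th, "|"-separated)
# is E (error) or W (warning). Reads a messages file on stdin.
errors_and_warnings() {
    awk -F'|' '$4 == "E" || $4 == "W"'
}

# Usage (not executed here; the path is a placeholder):
#   errors_and_warnings < /path/to/execd/spool/messages
```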
>>> 
>>> Otherwise, everything seems to be running fine. I've scrounged around and found a couple
>>> Mac Minis that I'd like to add to the mini-cluster, but I'd rather figure this out
>>> before adding them (and maybe shifting qmaster to one of them).
>>> 
>>> Any help would be greatly appreciated!
>>> 
>>> Cheers and best,
>>> Drew
>>> 
>>> P.S. Here is some more info for anyone curious....
>>> 
>>> 
>>> dhcp80fff96b:~ akitchen$ hostname
>>> dhcp80fff96b.state.edu
>>> 
>>> dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostname
>>> Hostname: dhcp80fff96b.state.edu
>>> Aliases:  ANTH-M014 dhcp80fff96b
>>> Host Address(es): XXX.XXX.XXX.107
>>> 
>>> dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostbyaddr XXX.XXX.XXX.107
>>> Hostname: dhcp80fff96b.state.edu
>>> Aliases:  ANTH-M014 dhcp80fff96b
>>> Host Address(es): XXX.XXX.XXX.107
>>> 
>>> dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostbyname 
>>> dhcp80fff96b.state.edu
>>> Hostname: dhcp80fff96b.state.edu
>>> Aliases:  ANTH-M014 dhcp80fff96b
>>> Host Address(es): XXX.XXX.XXX.107
>>> 
>>> dhcp80fff96b:~ akitchen$ cat /etc/hosts
>>> ##
>>> # Host Database
>>> #
>>> # localhost is used to configure the loopback interface
>>> # when the system is booting.  Do not change this entry.
>>> ##
>>> 127.0.0.1    localhost
>>> 255.255.255.255    broadcasthost
>>> ::1             localhost
>>> fe80::1%lo0    localhost
>>> XXX.XXX.XXX.107 dhcp80fff96b.state.edu ANTH-M014 dhcp80fff96b
>>> XXX.XXX.XXX.182 dhcp80fff9b6.state.edu ANTH-M036 dhcp80fff9b6
>>> XXX.XXX.XXX.208 dhcp80fff9d0.state.edu ANTH-M013 dhcp80fff9d0
>>> 
>>> dhcp80fff96b:~ akitchen$ qconf -shgrp @allhosts
>>> group_name @allhosts
>>> hostlist dhcp80fff96b.state.edu dhcp80fff9d0.state.edu \
>>>         dhcp80fff9b6.state.edu
>>> 
>>> dhcp80fff96b:~ akitchen$ qconf -sel
>>> dhcp80fff96b.state.edu
>>> dhcp80fff9b6.state.edu
>>> dhcp80fff9d0.state.edu
>>> 
>>> dhcp80fff96b:~ akitchen$ qconf -ss
>>> dhcp80fff96b.state.edu
>>> 
>>> dhcp80fff96b:~ akitchen$ qconf -sh
>>> dhcp80fff96b.state.edu
>>> dhcp80fff9b6.state.edu
>>> dhcp80fff9d0.state.edu
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
> 
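
The name-resolution checks in the P.S. were run on the master only. Here is a sketch for repeating them on the other nodes and comparing the results, assuming the `Hostname:` line format shown in the quoted `gethostname` and `gethostbyname` output. The helper names are made up for illustration.

```shell
# Hypothetical helper: extract the value of the "Hostname:" line from the
# output of the utilbin gethostname/gethostbyname/gethostbyaddr tools (stdin).
resolved_name() {
    awk -F': *' '$1 == "Hostname" { print $2; exit }'
}

# Hypothetical helper: succeed only if two captured outputs name the same host.
same_hostname() {
    a=$(printf '%s\n' "$1" | resolved_name)
    b=$(printf '%s\n' "$2" | resolved_name)
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage on an execution host (not executed here):
#   local_view=$(/GridEngine/utilbin/darwin-x64/gethostname)
#   master_view=$(/GridEngine/utilbin/darwin-x64/gethostbyname dhcp80fff96b.state.edu)
#   same_hostname "$local_view" "$master_view" || echo "names differ"
```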

