On Wed, Mar 28, 2012 at 5:05 PM, Robert Chase <[email protected]> wrote:
> I followed your advice and copied the directory from one of the compute
> nodes to the new submit node. I opened the firewall on ports 536 and 537
> and added execd and qmaster to the /etc/services file. I'm getting the
> following error messages when I use qping:
>
> [root@galdev common]# qping q.bwh.harvard.edu 536 qmaster 1
> endpoint q/qmaster/1 at port 536: can't find connection
> got select error: Connection refused
> got select error: closing "q/qmaster/1"
> endpoint q/qmaster/1 at port 536: can't find connection
> endpoint q/qmaster/1 at port 536: can't find connection

Is the qmaster really listening on port 536? Note that we have standard
port numbers for qmaster & execd (6444 & 6445). From your output it looks
like nothing is listening on that port.
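Something like this, run on the qmaster host, would show what sge_qmaster
is actually bound to (a rough sketch; it assumes a Linux box with netstat
available):

    # On the qmaster host (q.bwh.harvard.edu in this thread):
    netstat -tlnp | grep sge_qmaster       # port the daemon listens on

    # Ports the clients resolve; the environment wins over /etc/services:
    echo $SGE_QMASTER_PORT $SGE_EXECD_PORT
    grep -E 'qmaster|execd' /etc/services  # standard entries use 6444/6445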
Rayson

> When I try to use qsub I get the following error:
>
> [root@galdev jobs]# qsub simple.sh
> error: commlib error: got select error (Connection refused)
> Unable to run job: unable to send message to qmaster using port 536 on
> host "q.bwh.harvard.edu": got send error.
> Exiting.
>
> Any help would be greatly appreciated.
>
> -Robert Chase
>
> On Wed, Mar 28, 2012 at 6:51 AM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D."
> <[email protected]> wrote:
>>
>> sorry again
>> one can always add a login node in a Rocks cluster that will act as a
>> submit node to SGE
>> regards
>>
>> On 3/28/2012 6:21 AM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D." wrote:
>>>
>>> On 3/28/2012 5:53 AM, Reuti wrote:
>>>>
>>>> On 27.03.2012 at 23:27, Hung-sheng Tsao wrote:
>>>>
>>>>> Maybe just copy the /opt/gridengine from one of the compute nodes.
>>>>> Add this as a submit host from the frontend.
>>>>
>>>> It may be good to add an explanation: to me it looks like the original
>>>> poster installed a separate SGE cluster on just one machine, including
>>>> the qmaster daemon, and hence it's just running locally, which
>>>> explains the job id of 1.
>>>
>>> sorry, if one just copies /opt/gridengine from a compute node, then it
>>> will have the full directories /opt/gridengine/default/common and
>>> /opt/gridengine/bin
>>> yes, there is also default/spool, which one could delete
>>>
>>> the daemon should not run!
>>>
>>> of course one will need the home directory, uid etc. from the Rocks
>>> frontend
>>>
>>> IMHO, it is much simpler than installing a new version of SGE
>>> of course, if the submit host is not running the same CentOS/Red Hat
>>> as the compute nodes, that is another story
>>> regards
>>>
>>>> To add a submit host to an existing cluster it isn't necessary to
>>>> have any daemon running on it, and installing a different version of
>>>> SGE will most likely not work either, as the internal protocol
>>>> changes between releases. I suggest the following (a command-level
>>>> sketch of these steps follows after this message):
>>>>
>>>> - Stop the daemons you started on the new submit host
>>>> - Remove the compilation you did
>>>> - Share the users from the existing cluster by NIS/LDAP (unless you
>>>>   want to define them all by hand on the new machine too)
>>>> - Mount /home from the existing cluster
>>>> - Mount /usr/sge or /opt/grid or wherever you have SGE installed in
>>>>   the existing cluster
>>>> - Add the machine in question as a submit host in the original
>>>>   cluster
>>>> - Source $SGE_ROOT/default/common/settings.sh during login on the
>>>>   submit machine
>>>>
>>>> Then you should be able to submit jobs from this machine.
>>>>
>>>> As there is no builtin file staging in SGE, it's most common to share
>>>> /home.
>>>>
>>>> ==
>>>>
>>>> Nevertheless, it could be done with a separate single-machine cluster
>>>> (with a different version of SGE) and file staging (which you have to
>>>> implement on your own), but that's too much overhead for adding just
>>>> this particular machine IMO. It is a suitable setup for combining
>>>> clusters by the use of a transfer queue. I did it once and used the
>>>> job context to name the files that had to be copied back and forth,
>>>> and then copied them myself in a starter method.
>>>>
>>>> -- Reuti
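Putting Reuti's list into commands: a minimal sketch for the new submit
host, assuming the cluster's SGE tree lives in /opt/gridengine, that the
head node (HEADNODE below, a placeholder) exports it and /home over NFS,
and that the local build installed the usual SysV init scripts:

    # 1. On the new submit host: stop the daemons from the local build
    #    (script names/locations vary between installs):
    /etc/init.d/sgemaster stop
    /etc/init.d/sgeexecd stop

    # 2. Mount the shared directories from the existing cluster:
    mount HEADNODE:/home /home
    mount HEADNODE:/opt/gridengine /opt/gridengine

    # 3. On the head node: register the new machine as a submit host:
    qconf -as submitnode.example.com

    # 4. Back on the submit host: pick up SGE_ROOT, SGE_CELL and the
    #    cell's port settings (add this to the login profile as well):
    . /opt/gridengine/default/common/settings.sh

    # 5. Test:
    qsub simple.sh

Once settings.sh is sourced, qsub and qping talk to the existing qmaster
on its real port (6444 by default) instead of port 536 from the local
install.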
>>>>> LT
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Mar 27, 2012, at 4:36 PM, Robert Chase <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> A number of years ago, our group created a Rocks cluster consisting
>>>>>> of a head node, a data node and eight execution nodes. The eight
>>>>>> execution nodes can only be accessed by the head node.
>>>>>>
>>>>>> My goal is to add a submit node to the existing cluster. I have
>>>>>> downloaded GE2011.11 and compiled it from source without errors.
>>>>>> When I try the command:
>>>>>>
>>>>>> qsub simple.sh
>>>>>>
>>>>>> I get the error:
>>>>>>
>>>>>> Unable to run job: warning: root your job is not allowed to run in
>>>>>> any queue
>>>>>>
>>>>>> When I look at qstat I get:
>>>>>>
>>>>>> job-ID  prior    name       user  state  submit/start at      queue  slots  ja-task-ID
>>>>>> ---------------------------------------------------------------------------------------
>>>>>>      1  0.55500  simple.sh  root  qw     03/27/2012 09:41:11            1
>>>>>>
>>>>>> I have added the new submit node to the list of submit nodes on the
>>>>>> head node using the command
>>>>>>
>>>>>> qconf -as
>>>>>>
>>>>>> When I run qconf -ss on the new submit node, I see the head node,
>>>>>> the data node and the new submit node.
>>>>>>
>>>>>> When I run qconf -ss on the head node, I see the head node, the
>>>>>> data node, the new submit node and all eight execution nodes.
>>>>>>
>>>>>> When I run qhost on the new submit node, I get:
>>>>>>
>>>>>> HOSTNAME  ARCH  NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>>>>> -----------------------------------------------------------
>>>>>> global    -     -     -     -       -       -       -
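The qhost output is the tell here: a lone "global" line means no
execution host has ever registered with the qmaster this client is
talking to. A quick check of which qmaster each machine actually points
at (assuming the default cell name on both):

    # Run on the head node and on the new submit node:
    echo $SGE_ROOT $SGE_CELL
    cat $SGE_ROOT/${SGE_CELL:-default}/common/act_qmaster
    # If the two act_qmaster files name different hosts, these are two
    # independent clusters, which would explain all of the above.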
>>>>>> Other posts have asked about the output of qconf -sq all.q:
>>>>>>
>>>>>> [root@HEADNODE jobs]# qconf -sq all.q
>>>>>> qname                 all.q
>>>>>> hostlist              @allhosts
>>>>>> seq_no                0
>>>>>> load_thresholds       np_load_avg=1.75
>>>>>> suspend_thresholds    NONE
>>>>>> nsuspend              1
>>>>>> suspend_interval      00:05:00
>>>>>> priority              0
>>>>>> min_cpu_interval      00:05:00
>>>>>> processors            UNDEFINED
>>>>>> qtype                 BATCH INTERACTIVE
>>>>>> ckpt_list             NONE
>>>>>> pe_list               make mpi mpich multicore orte
>>>>>> rerun                 FALSE
>>>>>> slots                 1,[compute-0-0.local=16],[compute-0-1.local=16], \
>>>>>>                       [compute-0-2.local=16],[compute-0-3.local=16], \
>>>>>>                       [compute-0-4.local=16],[compute-0-6.local=16], \
>>>>>>                       [compute-0-7.local=16]
>>>>>> tmpdir                /tmp
>>>>>> shell                 /bin/csh
>>>>>> prolog                NONE
>>>>>> epilog                NONE
>>>>>> shell_start_mode      posix_compliant
>>>>>> starter_method        NONE
>>>>>> suspend_method        NONE
>>>>>> resume_method         NONE
>>>>>> terminate_method      NONE
>>>>>> notify                00:00:60
>>>>>> owner_list            NONE
>>>>>> user_lists            NONE
>>>>>> xuser_lists           NONE
>>>>>> subordinate_list      NONE
>>>>>> complex_values        NONE
>>>>>> projects              NONE
>>>>>> xprojects             NONE
>>>>>> calendar              NONE
>>>>>> initial_state         default
>>>>>> s_rt                  INFINITY
>>>>>> h_rt                  INFINITY
>>>>>> s_cpu                 INFINITY
>>>>>> h_cpu                 INFINITY
>>>>>> s_fsize               INFINITY
>>>>>> h_fsize               INFINITY
>>>>>> s_data                INFINITY
>>>>>> h_data                INFINITY
>>>>>> s_stack               INFINITY
>>>>>> h_stack               INFINITY
>>>>>> s_core                INFINITY
>>>>>> h_core                INFINITY
>>>>>> s_rss                 INFINITY
>>>>>> h_rss                 INFINITY
>>>>>> s_vmem                INFINITY
>>>>>> h_vmem                INFINITY
>>>>>>
>>>>>> [root@SUBMITNODE jobs]# qconf -sq all.q
>>>>>> qname                 all.q
>>>>>> hostlist              @allhosts
>>>>>> seq_no                0
>>>>>> load_thresholds       np_load_avg=1.75
>>>>>> suspend_thresholds    NONE
>>>>>> nsuspend              1
>>>>>> suspend_interval      00:05:00
>>>>>> priority              0
>>>>>> min_cpu_interval      00:05:00
>>>>>> processors            UNDEFINED
>>>>>> qtype                 BATCH INTERACTIVE
>>>>>> ckpt_list             NONE
>>>>>> pe_list               make
>>>>>> rerun                 FALSE
>>>>>> slots                 1
>>>>>> tmpdir                /tmp
>>>>>> shell                 /bin/csh
>>>>>> prolog                NONE
>>>>>> epilog                NONE
>>>>>> shell_start_mode      posix_compliant
>>>>>> starter_method        NONE
>>>>>> suspend_method        NONE
>>>>>> resume_method         NONE
>>>>>> terminate_method      NONE
>>>>>> notify                00:00:60
>>>>>> owner_list            NONE
>>>>>> user_lists            NONE
>>>>>> xuser_lists           NONE
>>>>>> subordinate_list      NONE
>>>>>> complex_values        NONE
>>>>>> projects              NONE
>>>>>> xprojects             NONE
>>>>>> calendar              NONE
>>>>>> initial_state         default
>>>>>> s_rt                  INFINITY
>>>>>> h_rt                  INFINITY
>>>>>> s_cpu                 INFINITY
>>>>>> h_cpu                 INFINITY
>>>>>> s_fsize               INFINITY
>>>>>> h_fsize               INFINITY
>>>>>> s_data                INFINITY
>>>>>> h_data                INFINITY
>>>>>> s_stack               INFINITY
>>>>>> h_stack               INFINITY
>>>>>> s_core                INFINITY
>>>>>> h_core                INFINITY
>>>>>> s_rss                 INFINITY
>>>>>> h_rss                 INFINITY
>>>>>> s_vmem                INFINITY
>>>>>> h_vmem                INFINITY
>>>>>>
>>>>>> I would like to know how to get qsub working.
>>>>>>
>>>>>> Thanks,
>>>>>> -Robert Paul Chase
>>>>>> Channing Labs
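The two dumps make the diagnosis concrete: the head node's all.q spans
the eight compute nodes (pe_list with mpi/orte, 16 slots each), while the
submit node's all.q is a stock default queue (pe_list make, slots 1),
i.e. a second, independent cluster. To spot every divergence at once,
save each dump to a file and diff them (file names are just examples):

    # On the head node:
    qconf -sq all.q > /tmp/all.q.head
    # On the submit node, then copy the file over (e.g. with scp):
    qconf -sq all.q > /tmp/all.q.submit
    # pe_list and slots are the lines that differ in this thread:
    diff /tmp/all.q.head /tmp/all.q.submit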
>> --
>> Hung-Sheng Tsao Ph.D.
>> Founder & Principal
>> HopBit GridComputing LLC
>> cell: 9734950840
>>
>> http://laotsao.blogspot.com/
>> http://laotsao.wordpress.com/
>> http://blogs.oracle.com/hstsao/

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
