Hello,

I followed your advice and copied the /opt/gridengine directory from one of the compute nodes to the new submit node. I opened the firewall on ports 536 and 537 and added the execd and qmaster services to the /etc/services file.
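For reference, here is what I added to /etc/services on the submit node; I used what I believe are the stock SGE service names, so adjust them if the installation expects something different:

    # /etc/services additions (assumed stock SGE service names)
    sge_qmaster     536/tcp     # port the qmaster listens on
    sge_execd       537/tcp     # port the execution daemons listen on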
I'm getting the following error messages when I use qping:

[root@galdev common]# qping q.bwh.harvard.edu 536 qmaster 1
endpoint q/qmaster/1 at port 536: can't find connection
got select error: Connection refused
got select error: closing "q/qmaster/1"
endpoint q/qmaster/1 at port 536: can't find connection
endpoint q/qmaster/1 at port 536: can't find connection

When I try to use qsub, I get the following error:

[root@galdev jobs]# qsub simple.sh
error: commlib error: got select error (Connection refused)
Unable to run job: unable to send message to qmaster using port 536 on host "q.bwh.harvard.edu": got send error.
Exiting.

Any help would be greatly appreciated.

-Robert Chase

On Wed, Mar 28, 2012 at 6:51 AM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D." <[email protected]> wrote:
>
> sorry again
> one can always add a login node in Rocks Cluster that will act as a submit node
> for SGE
> regards
>
>
> On 3/28/2012 6:21 AM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D." wrote:
>>
>> On 3/28/2012 5:53 AM, Reuti wrote:
>>>
>>> On 27.03.2012 at 23:27, Hung-sheng Tsao wrote:
>>>
>>>> Maybe just copy the /opt/gridengine from one of the compute nodes
>>>> and add this as a submit host from the frontend.
>>>
>>> It may be good to add an explanation: to me it looks like the original
>>> poster installed a separate SGE cluster on just one machine, including the
>>> qmaster daemon, and hence it's just running locally, which explains the
>>> job ID of 1.
>>>
>> sorry, if one just copies /opt/gridengine from the compute nodes,
>> then it will have the full directory tree of /opt/gridengine/default/common
>> and /opt/gridengine/bin.
>> yes, there is also default/spool, which one could delete.
>>
>> the daemon should not run!
>>
>> of course one will need the home directories, uids, etc. from the Rocks
>> frontend.
>>
>> IMHO, it is much simpler than installing a new version of SGE.
>> of course, if the submit host is not running the same CentOS/Red Hat as the
>> compute nodes, that is another story.
>> regards
>>
>>> To add a submit host to an existing cluster it isn't necessary to have
>>> any daemon running on it, and installing a different version of SGE will
>>> most likely not work either, as the internal protocol changes between
>>> releases. I suggest to:
>>>
>>> - Stop the daemons you started on the new submit host
>>> - Remove the compilation you did
>>> - Share the users from the existing cluster by NIS/LDAP (unless you want
>>>   to define them all by hand on the new machine too)
>>> - Mount /home from the existing cluster
>>> - Mount /usr/sge or /opt/grid or wherever you have SGE installed in the
>>>   existing cluster
>>> - Add the machine in question as a submit host in the original cluster
>>> - Source $SGE_ROOT/default/common/settings.sh during login on the
>>>   submit machine
>>>
>>> Then you should be able to submit jobs from this machine.
>>>
>>> As there is no built-in file staging in SGE, it's most common to share
>>> /home.
>>>
>>> ==
>>>
>>> Nevertheless, it could be done to have a separate single-machine cluster
>>> (with a different version of SGE) and use file staging (which you would have
>>> to implement on your own), but it's too much overhead for adding just this
>>> particular machine IMO. It is a suitable setup for combining clusters by
>>> means of a transfer queue, though. I did it once and used the job context to
>>> name the files which have to be copied back and forth, copying them myself
>>> in a starter method.
>>>
>>> -- Reuti
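As I read Reuti's list, the shared-installation route boils down to roughly the following after stopping the locally started daemons. The NFS sources and mount points are my guesses at our layout, so they will likely need adjusting:

    # on the new submit node: use the cluster's shared copies instead of a local install
    mount headnode:/home /home                        # assumed export of user home directories
    mount headnode:/opt/gridengine /opt/gridengine    # assumed export of the cluster's $SGE_ROOT

    # on the existing qmaster (head node): register the new machine as a submit host
    qconf -as <submit-node-hostname>

    # back on the submit node: pick up the cluster's environment at login
    . /opt/gridengine/default/common/settings.sh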
>>>> LT
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Mar 27, 2012, at 4:36 PM, Robert Chase <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> A number of years ago, our group created a Rocks cluster consisting of
>>>>> a head node, a data node and eight execution nodes. The eight execution
>>>>> nodes can only be accessed from the head node.
>>>>>
>>>>> My goal is to add a submit node to the existing cluster. I have
>>>>> downloaded GE2011.11 and compiled it from source without errors. When I try
>>>>> the command:
>>>>>
>>>>> qsub simple.sh
>>>>>
>>>>> I get the error:
>>>>>
>>>>> Unable to run job: warning: root your job is not allowed to run in any queue
>>>>>
>>>>> When I look at qstat I get:
>>>>>
>>>>> job-ID  prior    name       user  state  submit/start at      queue  slots  ja-task-ID
>>>>> ---------------------------------------------------------------------------------------
>>>>>      1  0.55500  simple.sh  root  qw     03/27/2012 09:41:11             1
>>>>>
>>>>> I have added the new submit node to the list of submit hosts on the
>>>>> head node using the command
>>>>>
>>>>> qconf -as
>>>>>
>>>>> When I run qconf -ss on the new submit node, I see the head node, the
>>>>> data node and the new submit node.
>>>>>
>>>>> When I run qconf -ss on the head node, I see the head node, the data
>>>>> node, the new submit node and all eight execution nodes.
>>>>>
>>>>> When I run qhost on the new submit node, I get
>>>>>
>>>>> HOSTNAME  ARCH  NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>>>> ------------------------------------------------------------
>>>>> global    -     -     -     -       -       -       -
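(A cross-check that may help here, assuming the standard SGE layout: the client commands find their qmaster through $SGE_ROOT/$SGE_CELL/common/act_qmaster, so comparing this on the head node and on the new submit node shows whether the two machines are actually pointing at the same qmaster.)

    # run on both the head node and the new submit node and compare
    echo $SGE_ROOT
    cat $SGE_ROOT/default/common/act_qmaster   # hostname of the qmaster being contacted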
>>>>>
>>>>> Other posts have asked about the output of qconf -sq all.q:
>>>>>
>>>>> [root@HEADNODE jobs]# qconf -sq all.q
>>>>> qname                 all.q
>>>>> hostlist              @allhosts
>>>>> seq_no                0
>>>>> load_thresholds       np_load_avg=1.75
>>>>> suspend_thresholds    NONE
>>>>> nsuspend              1
>>>>> suspend_interval      00:05:00
>>>>> priority              0
>>>>> min_cpu_interval      00:05:00
>>>>> processors            UNDEFINED
>>>>> qtype                 BATCH INTERACTIVE
>>>>> ckpt_list             NONE
>>>>> pe_list               make mpi mpich multicore orte
>>>>> rerun                 FALSE
>>>>> slots                 1,[compute-0-0.local=16],[compute-0-1.local=16], \
>>>>>                       [compute-0-2.local=16],[compute-0-3.local=16], \
>>>>>                       [compute-0-4.local=16],[compute-0-6.local=16], \
>>>>>                       [compute-0-7.local=16]
>>>>> tmpdir                /tmp
>>>>> shell                 /bin/csh
>>>>> prolog                NONE
>>>>> epilog                NONE
>>>>> shell_start_mode      posix_compliant
>>>>> starter_method        NONE
>>>>> suspend_method        NONE
>>>>> resume_method         NONE
>>>>> terminate_method      NONE
>>>>> notify                00:00:60
>>>>> owner_list            NONE
>>>>> user_lists            NONE
>>>>> xuser_lists           NONE
>>>>> subordinate_list      NONE
>>>>> complex_values        NONE
>>>>> projects              NONE
>>>>> xprojects             NONE
>>>>> calendar              NONE
>>>>> initial_state         default
>>>>> s_rt                  INFINITY
>>>>> h_rt                  INFINITY
>>>>> s_cpu                 INFINITY
>>>>> h_cpu                 INFINITY
>>>>> s_fsize               INFINITY
>>>>> h_fsize               INFINITY
>>>>> s_data                INFINITY
>>>>> h_data                INFINITY
>>>>> s_stack               INFINITY
>>>>> h_stack               INFINITY
>>>>> s_core                INFINITY
>>>>> h_core                INFINITY
>>>>> s_rss                 INFINITY
>>>>> h_rss                 INFINITY
>>>>> s_vmem                INFINITY
>>>>> h_vmem                INFINITY
>>>>>
>>>>> [root@SUBMITNODE jobs]# qconf -sq all.q
>>>>> qname                 all.q
>>>>> hostlist              @allhosts
>>>>> seq_no                0
>>>>> load_thresholds       np_load_avg=1.75
>>>>> suspend_thresholds    NONE
>>>>> nsuspend              1
>>>>> suspend_interval      00:05:00
>>>>> priority              0
>>>>> min_cpu_interval      00:05:00
>>>>> processors            UNDEFINED
>>>>> qtype                 BATCH INTERACTIVE
>>>>> ckpt_list             NONE
>>>>> pe_list               make
>>>>> rerun                 FALSE
>>>>> slots                 1
>>>>> tmpdir                /tmp
>>>>> shell                 /bin/csh
>>>>> prolog                NONE
>>>>> epilog                NONE
>>>>> shell_start_mode      posix_compliant
>>>>> starter_method        NONE
>>>>> suspend_method        NONE
>>>>> resume_method         NONE
>>>>> terminate_method      NONE
>>>>> notify                00:00:60
>>>>> owner_list            NONE
>>>>> user_lists            NONE
>>>>> xuser_lists           NONE
>>>>> subordinate_list      NONE
>>>>> complex_values        NONE
>>>>> projects              NONE
>>>>> xprojects             NONE
>>>>> calendar              NONE
>>>>> initial_state         default
>>>>> s_rt                  INFINITY
>>>>> h_rt                  INFINITY
>>>>> s_cpu                 INFINITY
>>>>> h_cpu                 INFINITY
>>>>> s_fsize               INFINITY
>>>>> h_fsize               INFINITY
>>>>> s_data                INFINITY
>>>>> h_data                INFINITY
>>>>> s_stack               INFINITY
>>>>> h_stack               INFINITY
>>>>> s_core                INFINITY
>>>>> h_core                INFINITY
>>>>> s_rss                 INFINITY
>>>>> h_rss                 INFINITY
>>>>> s_vmem                INFINITY
>>>>> h_vmem                INFINITY
>>>>>
>>>>> I would like to know how to get qsub working.
>>>>>
>>>>> Thanks,
>>>>> -Robert Paul Chase
>>>>> Channing Labs
>
> --
> Hung-Sheng Tsao Ph.D.
> Founder & Principal
> HopBit GridComputing LLC
> cell: 9734950840
>
> http://laotsao.blogspot.com/
> http://laotsao.wordpress.com/
> http://blogs.oracle.com/hstsao/
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
