Re: [gridengine users] Adding GE2011.11 submit host to existing rocks cluster

Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D. Wed, 28 Mar 2012 15:55:36 -0700


Hi,
did you run

. /opt/gridengine/default/common/settings.sh

(i.e. source settings.sh from /opt/gridengine/default/common)?
Did you include the qmaster's FQDN in /etc/hosts?
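A minimal sketch of the two checks asked about above, assuming the Rocks default SGE_ROOT of /opt/gridengine and the cell "default". The function names load_sge_env and has_hosts_entry are hypothetical helpers, not part of SGE.

```shell
#!/usr/bin/env bash
# Minimal sketch; assumes SGE_ROOT=/opt/gridengine and cell "default".
# load_sge_env and has_hosts_entry are hypothetical helper names.

# Source the SGE environment (SGE_ROOT, SGE_CELL, PATH, ...) into the
# current shell; return 1 if settings.sh is not readable.
load_sge_env() {
    local settings="${1:-/opt/gridengine/default/common/settings.sh}"
    if [[ -r "$settings" ]]; then
        . "$settings"
    else
        echo "cannot read $settings" >&2
        return 1
    fi
}

# Succeed if NAME appears as a hostname or alias on a non-comment line
# of HOSTS_FILE (default /etc/hosts).
has_hosts_entry() {
    local name="$1" hosts_file="${2:-/etc/hosts}"
    awk -v name="$name" '
        /^[[:space:]]*#/ { next }   # ignore commented-out entries
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit(found ? 0 : 1) }
    ' "$hosts_file"
}
```

Usage on the submit host might look like `load_sge_env && has_hosts_entry q.bwh.harvard.edu || echo "qmaster FQDN missing from /etc/hosts"`.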


On 3/28/2012 5:39 PM, Hung-Sheng Tsao (LaoTsao) Ph.D wrote:

Hi,
is this submit host on the public or the private network of the Rocks cluster?
Does this node run the same OS as the compute nodes?
-LT


Sent from my iPad

On Mar 28, 2012, at 17:05, Robert Chase <[email protected]> wrote:

Hello,

I followed your advice and copied the directory from one of the compute nodes to the new submit node. I opened the firewall on ports 536 and 537 and added execd and qmaster to the /etc/services file. I'm getting the following error messages when I use qping:

[root@galdev common]# qping q.bwh.harvard.edu 536 qmaster 1

endpoint q/qmaster/1 at port 536: can't find connection
got select error: Connection refused
got select error: closing "q/qmaster/1"
endpoint q/qmaster/1 at port 536: can't find connection
endpoint q/qmaster/1 at port 536: can't find connection

When I try to use qsub I get the following error...

[root@galdev jobs]# qsub simple.sh
error: commlib error: got select error (Connection refused)

Unable to run job: unable to send message to qmaster using port 536 on host "q.bwh.harvard.edu": got send error.

Exiting.

Any help would be greatly appreciated.

-Robert Chase
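A minimal sketch for narrowing down the "Connection refused" above, using the ports from this thread (536 for qmaster, 537 for execd). The service names sge_qmaster and sge_execd are the conventional SGE entries and are an assumption for this installation; service_port and probe_tcp are hypothetical helpers. The TCP probe relies on bash's /dev/tcp redirection, not on SGE.

```shell
#!/usr/bin/env bash
# Minimal sketch; the service names sge_qmaster/sge_execd are assumed.
# service_port and probe_tcp are hypothetical helper names.

# Print the TCP port registered for SERVICE in FILE (default /etc/services).
service_port() {
    local service="$1" file="${2:-/etc/services}"
    awk -v svc="$service" '
        $1 == svc && $2 ~ /\/tcp$/ { split($2, p, "/"); print p[1]; exit }
    ' "$file"
}

# Succeed if a TCP connection to HOST:PORT opens within 3 seconds.
# A quick failure (rather than a timeout) usually means the connection
# was refused: nothing listening at that address, or a firewall rule
# rejecting it somewhere on the path.
probe_tcp() {
    local host="$1" port="$2"
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, `probe_tcp q.bwh.harvard.edu "$(service_port sge_qmaster)" || echo "qmaster port not reachable"` separates a plain network or firewall problem from an SGE-level one before trying qping again.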

On Wed, Mar 28, 2012 at 6:51 AM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D." <[email protected]> wrote:



    Sorry again: one can always add a login node to the Rocks
    cluster, which will act as a submit node for SGE.
    regards


    On 3/28/2012 6:21 AM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D." wrote:



        On 3/28/2012 5:53 AM, Reuti wrote:

            On 27.03.2012 at 23:27, Hung-sheng Tsao wrote:

                Maybe just copy /opt/gridengine from one of the
                compute nodes, then add this machine as a submit host
                from the frontend.

            It may be good to add an explanation: to me it looks like
            the original poster installed a separate SGE cluster on
            just one machine, including the qmaster daemon, so it is
            running purely locally, which explains why the job ID is 1.
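One quick way to confirm this diagnosis, sketched under the assumption of a standard SGE layout: client commands such as qsub find their qmaster through the file $SGE_ROOT/$SGE_CELL/common/act_qmaster. On a stand-alone local build it would name the submit host itself rather than the Rocks frontend. The helpers current_qmaster and points_at are hypothetical, and whether the file holds a short name or an FQDN depends on the installation.

```shell
#!/usr/bin/env bash
# Minimal sketch; current_qmaster and points_at are hypothetical helpers.

# Print the qmaster host named in act_qmaster for ROOT/CELL
# (defaults: $SGE_ROOT or /opt/gridengine, $SGE_CELL or "default").
current_qmaster() {
    local root="${1:-${SGE_ROOT:-/opt/gridengine}}"
    local cell="${2:-${SGE_CELL:-default}}"
    local file="$root/$cell/common/act_qmaster"
    [[ -r "$file" ]] || { echo "cannot read $file" >&2; return 1; }
    head -n 1 "$file"
}

# Succeed only if act_qmaster names EXPECTED; remaining args go to
# current_qmaster.
points_at() {
    local expected="$1"; shift
    [[ "$(current_qmaster "$@")" == "$expected" ]]
}
```

Running `current_qmaster` on the new submit host after sourcing settings.sh should print the frontend's name; if it prints the submit host's own name, its clients are talking to a local qmaster.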

        Sorry: if one just copies /opt/gridengine from a compute node,
        one gets the full directories /opt/gridengine/default/common
        and /opt/gridengine/bin. Yes, there is also default/spool,
        which one could delete.

        The daemons should not run!

        Of course, one will need the home directories, UIDs, etc. from
        the Rocks frontend.

        IMHO, this is much simpler than installing a new version of SGE.
        Of course, if the submit host is not running the same
        CentOS/Red Hat release as the compute nodes, that is another
        story.
        regards
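The copy described above can be sketched as a tar stream relayed through the frontend, since the compute nodes are reachable only from the head node. This is a sketch under assumptions: the host names are placeholders, copy_sge_tree is a hypothetical helper, root ssh access to the frontend and from there to the node is assumed, and default/spool is excluded as suggested. Nothing runs unless RUN=1; by default it only prints the pipeline. Do not start the copied daemons on the submit host.

```shell
#!/usr/bin/env bash
# Minimal sketch; copy_sge_tree is a hypothetical helper and the host
# names passed to it are placeholders. Prints the pipeline unless RUN=1.

copy_sge_tree() {
    local frontend="$1" node="$2"
    # Archive /opt/gridengine on the compute node, skipping the spool area.
    local remote="ssh $node tar -C /opt --exclude=gridengine/default/spool -cf - gridengine"
    # Relay the stream through the frontend and unpack it locally under /opt.
    local cmd="ssh $frontend '$remote' | tar -C /opt -xpf -"
    if [[ "${RUN:-0}" == 1 ]]; then
        bash -c "$cmd"
    else
        echo "$cmd"     # dry run: show the pipeline only
    fi
}
```

For example, `copy_sge_tree frontend.example.org compute-0-0` shows what would run; prefixing it with `RUN=1` performs the copy.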


            To add a submit host to an existing cluster, it isn't
            necessary to have any daemon running on it, and
            installing a different version of SGE will most likely
            not work either, as the internal protocol changes between
            releases. I suggest:

            - Stop the daemons you started on the new submit host
            - Remove the build you compiled
            - Share the users from the existing cluster by NIS/LDAP
            (unless you want to define them all by hand on the new
            machine too)
            - Mount /home from the existing cluster
            - Mount /usr/sge or /opt/grid, wherever you have SGE
            installed in the existing cluster
            - Add the machine in question as a submit host in the
            original cluster
            - On the submit machine, source
            $SGE_ROOT/default/common/settings.sh during login

            Then you should be able to submit jobs from this machine.

            As there is no builtin file staging in SGE, it's most
            common to share /home.
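Several of the steps above can be sketched as shell functions. This is a sketch under assumptions: the host names and NFS export paths (FRONTEND:/export/home, and an export of the SGE tree itself) are placeholders that the frontend must actually export to the submit host, and on_frontend, on_submit_host and run are hypothetical helpers. User sharing (NIS/LDAP) and stopping the local daemons are not shown. Commands are only printed unless RUN=1.

```shell
#!/usr/bin/env bash
# Minimal sketch; helper names, hosts and export paths are placeholders.
# Commands are printed rather than executed unless RUN=1.

run() {
    if [[ "${RUN:-0}" == 1 ]]; then "$@"; else printf '%s\n' "$*"; fi
}

# On the frontend (qmaster host): register the new submit host.
on_frontend() {
    local submit_host="$1"
    run qconf -as "$submit_host"
    run qconf -ss                    # the new host should now be listed
}

# On the submit host: mount the shared trees, then source settings.sh at login.
on_submit_host() {
    local frontend="$1" sge_root="${2:-/opt/gridengine}"
    run mount -t nfs "$frontend:/export/home" /home
    run mount -t nfs "$frontend:$sge_root" "$sge_root"
    run sh -c "echo '. $sge_root/default/common/settings.sh' > /etc/profile.d/sge.sh"
}
```

Calling `on_frontend galdev.example.org` on the head node and `on_submit_host frontend.example.org` on the new machine prints the commands for review first; rerun with `RUN=1` once they look right.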

            ==

            Nevertheless, it would be possible to have a separate
            single-machine cluster (with a different version of SGE) and
            use file staging (which you have to implement on your own),
            but it's too much overhead for adding just this particular
            machine IMO. This kind of setup is suitable for combining
            clusters by means of a transfer queue. I did it once and
            used the job context to name the files that have to be
            copied back and forth, and then copied them myself in a
            starter method.

            -- Reuti


                LT

                Sent from my iPhone

                On Mar 27, 2012, at 4:36 PM, Robert
                Chase <[email protected]> wrote:

                    Hello,

                    A number of years ago, our group created a Rocks
                    cluster consisting of a head node, a data node
                    and eight execution nodes. The eight execution
                    nodes can only be accessed by the head node.

                    My goal is to add a submit node to the existing
                    cluster. I have downloaded GE2011.11 and compiled
                    from source without errors. When I try the command:

                    qsub simple.sh

                    I get the error:

                    Unable to run job: warning: root your job is not
                    allowed to run in any queue

                    When I look at qstat I get:

                    job-ID  prior    name       user  state  submit/start at      queue  slots  ja-task-ID
                    --------------------------------------------------------------------------------------
                         1  0.55500  simple.sh  root  qw     03/27/2012 09:41:11             1


                    I have added the new submit node to the list of
                    submit nodes on the head node using the command

                    qconf -as

                    When I run qconf -ss on the new submit node I see
                    the head node, the data node and the new submit node.

                    When I run qconf -ss on the head node, I see the
                    head node, the data node, the new submit node and
                    all eight execution nodes.

                    When I run qhost on the new submit node, I get

                    HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
                    -------------------------------------------------------------------------------
                    global                  -               -     -       -       -       -       -
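The qhost output above lists only the "global" pseudo-host, which is one likely explanation for the qsub error: if the qmaster the submit host talks to knows no execution hosts, no queue instance exists that could take the job. A minimal sketch for checking this from qhost output; count_exec_hosts is a hypothetical helper, and it assumes qhost's usual layout of a header line, a dashed rule, then one line per host.

```shell
#!/usr/bin/env bash
# Minimal sketch; count_exec_hosts is a hypothetical helper.
# Reads qhost output on stdin and prints the number of real execution
# hosts, ignoring the "global" pseudo-host.

count_exec_hosts() {
    awk '
        /^-+$/ { body = 1; next }          # host lines follow the dashed rule
        body && NF > 0 && $1 != "global" { n++ }
        END { print n + 0 }
    '
}
```

Usage: `qhost | count_exec_hosts`. A result of 0 on the submit host, while the frontend reports its eight compute nodes, suggests the submit host is using its own local qmaster. For a specific pending job, `qalter -w v <job_id>` asks the scheduler to explain why it cannot be dispatched.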



                    Other posts have asked about the output of qconf
                    -sq all.q...

                    [root@HEADNODE jobs]# qconf -sq all.q
                    qname                 all.q
                    hostlist              @allhosts
                    seq_no                0
                    load_thresholds       np_load_avg=1.75
                    suspend_thresholds    NONE
                    nsuspend              1
                    suspend_interval      00:05:00
                    priority              0
                    min_cpu_interval      00:05:00
                    processors            UNDEFINED
                    qtype                 BATCH INTERACTIVE
                    ckpt_list             NONE
                    pe_list               make mpi mpich multicore orte
                    rerun                 FALSE
                    slots                 1,[compute-0-0.local=16],[compute-0-1.local=16], \
                                          [compute-0-2.local=16],[compute-0-3.local=16], \
                                          [compute-0-4.local=16],[compute-0-6.local=16], \
                                          [compute-0-7.local=16]
                    tmpdir                /tmp
                    shell                 /bin/csh
                    prolog                NONE
                    epilog                NONE
                    shell_start_mode      posix_compliant
                    starter_method        NONE
                    suspend_method        NONE
                    resume_method         NONE
                    terminate_method      NONE
                    notify                00:00:60
                    owner_list            NONE
                    user_lists            NONE
                    xuser_lists           NONE
                    subordinate_list      NONE
                    complex_values        NONE
                    projects              NONE
                    xprojects             NONE
                    calendar              NONE
                    initial_state         default
                    s_rt                  INFINITY
                    h_rt                  INFINITY
                    s_cpu                 INFINITY
                    h_cpu                 INFINITY
                    s_fsize               INFINITY
                    h_fsize               INFINITY
                    s_data                INFINITY
                    h_data                INFINITY
                    s_stack               INFINITY
                    h_stack               INFINITY
                    s_core                INFINITY
                    h_core                INFINITY
                    s_rss                 INFINITY
                    h_rss                 INFINITY
                    s_vmem                INFINITY
                    h_vmem                INFINITY


                    [root@SUBMITNODE jobs]# qconf -sq all.q
                    qname                 all.q
                    hostlist              @allhosts
                    seq_no                0
                    load_thresholds       np_load_avg=1.75
                    suspend_thresholds    NONE
                    nsuspend              1
                    suspend_interval      00:05:00
                    priority              0
                    min_cpu_interval      00:05:00
                    processors            UNDEFINED
                    qtype                 BATCH INTERACTIVE
                    ckpt_list             NONE
                    pe_list               make
                    rerun                 FALSE
                    slots                 1
                    tmpdir                /tmp
                    shell                 /bin/csh
                    prolog                NONE
                    epilog                NONE
                    shell_start_mode      posix_compliant
                    starter_method        NONE
                    suspend_method        NONE
                    resume_method         NONE
                    terminate_method      NONE
                    notify                00:00:60
                    owner_list            NONE
                    user_lists            NONE
                    xuser_lists           NONE
                    subordinate_list      NONE
                    complex_values        NONE
                    projects              NONE
                    xprojects             NONE
                    calendar              NONE
                    initial_state         default
                    s_rt                  INFINITY
                    h_rt                  INFINITY
                    s_cpu                 INFINITY
                    h_cpu                 INFINITY
                    s_fsize               INFINITY
                    h_fsize               INFINITY
                    s_data                INFINITY
                    h_data                INFINITY
                    s_stack               INFINITY
                    h_stack               INFINITY
                    s_core                INFINITY
                    h_core                INFINITY
                    s_rss                 INFINITY
                    h_rss                 INFINITY
                    s_vmem                INFINITY
                    h_vmem                INFINITY

                    I would like to know how to get qsub working.

                    Thanks,
                    -Robert Paul Chase
                    Channing Labs
                    _______________________________________________
                    users mailing list
                    [email protected]
                    https://gridengine.org/mailman/listinfo/users


--
Hung-Sheng Tsao Ph D.
Founder & Principal
HopBit GridComputing LLC
cell: 9734950840

http://laotsao.blogspot.com/
http://laotsao.wordpress.com/
http://blogs.oracle.com/hstsao/

