On 28.03.2012 at 23:05, Robert Chase wrote:

> I followed your advice and copied the directory from one of the compute
> nodes to the new submit node. I opened the firewall on ports 536 and 537
> and added execd and qmaster to the /etc/services file. I'm getting the
> following error messages when I use qping:
>
> [root@galdev common]# qping q.bwh.harvard.edu 536 qmaster 1
> endpoint q/qmaster/1 at port 536: can't find connection
> got select error: Connection refused
> got select error: closing "q/qmaster/1"
> endpoint q/qmaster/1 at port 536: can't find connection
> endpoint q/qmaster/1 at port 536: can't find connection

Is the firewall on both ends disabled?

-- Reuti
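[For reference, a sketch of the pieces being checked here, assuming the
non-standard ports 536/537 this cluster uses (a stock SGE install defaults
to 6444/6445) and an iptables-based RHEL/CentOS firewall; names and ports
are taken from the prompts above and may need adapting:]

    # /etc/services entries on the submit node and the qmaster host
    sge_qmaster    536/tcp    # qmaster port for this cluster
    sge_execd      537/tcp    # execd port for this cluster

    # open the ports on an iptables firewall (or disable it for a test)
    iptables -I INPUT -p tcp --dport 536 -j ACCEPT
    iptables -I INPUT -p tcp --dport 537 -j ACCEPT
    service iptables save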
> When I try to use qsub I get the following error:
>
> [root@galdev jobs]# qsub simple.sh
> error: commlib error: got select error (Connection refused)
> Unable to run job: unable to send message to qmaster using port 536 on
> host "q.bwh.harvard.edu": got send error.
> Exiting.
>
> Any help would be greatly appreciated.
>
> -Robert Chase
>
> On Wed, Mar 28, 2012 at 6:51 AM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D."
> <[email protected]> wrote:
>
> Sorry again: one can always add a login node in a Rocks cluster that
> will act as a submit node for SGE.
>
> Regards
>
> On 3/28/2012 6:21 AM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D." wrote:
>
> On 3/28/2012 5:53 AM, Reuti wrote:
>
> On 27.03.2012 at 23:27, Hung-sheng Tsao wrote:
>
> Maybe just copy /opt/gridengine from one of the compute nodes and add
> this machine as a submit host from the frontend.
>
> It may be good to add an explanation: to me it looks like the original
> poster installed a separate SGE cluster on just one machine, including
> the qmaster daemon, so it's just running locally, which explains the
> job ID of 1.
>
> Sorry: if one just copies /opt/gridengine from a compute node, it will
> have the full directory tree, including /opt/gridengine/default/common
> and /opt/gridengine/bin. Yes, there is also default/spool, which one
> could delete.
>
> The daemon should not run!
>
> Of course one will need the home directories, UIDs etc. from the Rocks
> frontend.
>
> IMHO, it is much simpler than installing a new version of SGE. Of
> course, if the submit host is not running the same CentOS/RedHat as the
> compute nodes, that is another story.
>
> Regards
>
> To add a submit host to an existing cluster it isn't necessary to have
> any daemon running on it, and installing a different version of SGE will
> most likely not work either, as the internal protocol changes between
> releases. I suggest:
>
> - Stop the daemons you started on the new submit host.
> - Remove the compilation you did.
> - Share the users from the existing cluster by NIS/LDAP (unless you want
>   to define them all by hand on the new machine too).
> - Mount /home from the existing cluster.
> - Mount /usr/sge or /opt/grid, or wherever you have SGE installed in the
>   existing cluster.
> - Add the machine in question as a submit host in the original cluster.
> - Source $SGE_ROOT/default/common/settings.sh during login on the submit
>   machine.
>
> Then you should be able to submit jobs from this machine (a command-level
> sketch of these steps follows after this message).
>
> As there is no built-in file staging in SGE, it's most common to share
> /home.
>
> ==
>
> Nevertheless, it could be done with a separate single-machine cluster
> (with a different version of SGE) and file staging (which you would have
> to implement on your own), but that's too much overhead for adding just
> this particular machine, IMO. Such a setup is suitable for combining
> clusters by the use of a transfer queue; I did it once and used the job
> context to name the files which have to be copied back and forth, then
> copied them myself in a starter method.
>
> -- Reuti
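[Reuti's list, spelled out as commands. The hostnames, NFS exports and
the /opt/gridengine path are placeholders matching the Rocks layout
discussed in this thread (the submit host name is taken from the shell
prompts above), not a verified recipe:]

    # on the new submit host: stop any locally started SGE daemons first,
    # then mount the cluster's home directories and SGE installation
    mount headnode:/home /home
    mount headnode:/opt/gridengine /opt/gridengine

    # on the existing qmaster (head node): register the new submit host
    qconf -as galdev.bwh.harvard.edu

    # back on the submit host: pull in SGE_ROOT, PATH etc. at login
    . /opt/gridengine/default/common/settings.sh

    # jobs should now go to the existing qmaster
    qsub simple.sh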
> LT
>
> Sent from my iPhone
>
> On Mar 27, 2012, at 4:36 PM, Robert Chase <[email protected]> wrote:
>
> Hello,
>
> A number of years ago, our group created a Rocks cluster consisting of a
> head node, a data node and eight execution nodes. The eight execution
> nodes can only be accessed by the head node.
>
> My goal is to add a submit node to the existing cluster. I have
> downloaded GE2011.11 and compiled it from source without errors.
> When I try the command:
>
> qsub simple.sh
>
> I get the error:
>
> Unable to run job: warning: root your job is not allowed to run in any queue
>
> When I look at qstat I get:
>
> job-ID  prior    name       user  state  submit/start at      queue  slots  ja-task-ID
> ---------------------------------------------------------------------------------------
>      1  0.55500  simple.sh  root  qw     03/27/2012 09:41:11             1
>
> I have added the new submit node to the list of submit hosts on the head
> node using the command:
>
> qconf -as
>
> When I run qconf -ss on the new submit node, I see the head node, the
> data node and the new submit node.
>
> When I run qconf -ss on the head node, I see the head node, the data
> node, the new submit node and all eight execution nodes.
>
> When I run qhost on the new submit node, I get:
>
> HOSTNAME  ARCH  NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -----------------------------------------------------------
> global    -     -     -     -       -       -       -
>
> Other posts have asked about the output of qconf -sq all.q:
>
> [root@HEADNODE jobs]# qconf -sq all.q
> qname                 all.q
> hostlist              @allhosts
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               make mpi mpich multicore orte
> rerun                 FALSE
> slots                 1,[compute-0-0.local=16],[compute-0-1.local=16], \
>                       [compute-0-2.local=16],[compute-0-3.local=16], \
>                       [compute-0-4.local=16],[compute-0-6.local=16], \
>                       [compute-0-7.local=16]
> tmpdir                /tmp
> shell                 /bin/csh
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            NONE
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
>
> [root@SUBMITNODE jobs]# qconf -sq all.q
> (identical to the head node's output above, except for these two lines:)
> pe_list               make
> slots                 1
>
> I would like to know how to get qsub working.
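[One way to see which qmaster a machine is actually bound to, given the
two differing all.q outputs above; this assumes the standard SGE cell
layout, with SGE_CELL defaulting to "default":]

    # the active qmaster recorded for this host's SGE cell
    cat $SGE_ROOT/${SGE_CELL:-default}/common/act_qmaster

    # run this on both the head node and the submit node: if the two
    # names differ, the submit node is talking to its own local qmaster,
    # which would also explain the job ID of 1 and the reduced all.q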
> Thanks,
> -Robert Paul Chase
> Channing Labs

> --
> Hung-Sheng Tsao Ph.D.
> Founder & Principal
> HopBit GridComputing LLC
> cell: 9734950840
>
> http://laotsao.blogspot.com/
> http://laotsao.wordpress.com/
> http://blogs.oracle.com/hstsao/

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
