which version of Rocks?
-LT

Sent from my iPad

On Mar 28, 2012, at 17:39, "Hung-Sheng Tsao (LaoTsao) Ph.D" <[email protected]> 
wrote:

> hi
> is this submit host on the public net or the private net of the Rocks cluster?
> is this node running the same OS as the compute nodes?
> -LT
> 
> 
> Sent from my iPad
> 
> On Mar 28, 2012, at 17:05, Robert Chase <[email protected]> wrote:
> 
>> Hello,
>> 
>> I followed your advice and copied the directory from one of the compute 
>> nodes to the new submit node. I opened the firewall on ports 536 and 537 and 
>> added execd and qmaster to the /etc/services file. I'm getting the following 
>> error messages when I use qping:
>> 
>> [root@galdev common]# qping q.bwh.harvard.edu 536 qmaster 1
>> endpoint q/qmaster/1 at port 536: can't find connection
>> got select error: Connection refused
>> got select error: closing "q/qmaster/1"
>> endpoint q/qmaster/1 at port 536: can't find connection
>> endpoint q/qmaster/1 at port 536: can't find connection
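>> 
>> For reference, the /etc/services entries and firewall changes described
>> above might look roughly like the sketch below; the service names and the
>> iptables commands are assumptions for a CentOS-style host, not taken from
>> this cluster:
>> 
>>   # /etc/services (standard SGE service names assumed)
>>   sge_qmaster     536/tcp
>>   sge_execd       537/tcp
>> 
>>   # open the ports (on the qmaster host and/or submit host as needed)
>>   iptables -I INPUT -p tcp --dport 536 -j ACCEPT
>>   iptables -I INPUT -p tcp --dport 537 -j ACCEPT
>>   service iptables save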
>> 
>> When I try to use qsub I get the following error...
>> 
>> [root@galdev jobs]# qsub simple.sh
>> error: commlib error: got select error (Connection refused)
>> Unable to run job: unable to send message to qmaster using port 536 on host 
>> "q.bwh.harvard.edu": got send error.
>> Exiting.
>> 
>> Any help would be greatly appreciated.
>> 
>> -Robert Chase
>> 
>> On Wed, Mar 28, 2012 at 6:51 AM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D." 
>> <[email protected]> wrote:
>> 
>> sorry, again:
>> one can always add a login node in a Rocks cluster that will act as a submit
>> node for SGE
>> regards
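>> 
>> A minimal sketch of that approach, assuming a Rocks release whose
>> insert-ethers offers a Login appliance type (the exact appliance name can
>> differ between versions):
>> 
>>   # on the frontend; then PXE-boot the new node so it gets picked up
>>   insert-ethers --appliance "Login"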
>> 
>> 
>> On 3/28/2012 6:21 AM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D." wrote:
>> 
>> 
>> On 3/28/2012 5:53 AM, Reuti wrote:
>> On 27.03.2012 at 23:27, Hung-sheng Tsao wrote:
>> 
>> Maybe just copy the /opt/gridengine directory from one of the compute nodes
>> and add this host as a submit host from the frontend.
>> It may be good to add an explanation: to me it looks like the original
>> poster installed a separate SGE cluster on just one machine, including the
>> qmaster daemon, and hence it's only running locally, which explains the job
>> ID being 1.
>> sorry: if one just copies /opt/gridengine from a compute node,
>> then it will have the full directory tree, /opt/gridengine/default/common and
>> /opt/gridengine/bin.
>> Yes, there is also default/spool, which one could delete.
>> 
>> The daemons should not run on the submit host!
>> 
>> Of course one will need the home directories, UIDs, etc. from the Rocks
>> frontend.
>> 
>> IMHO, it is much simpler than installing a new version of SGE.
>> Of course, if the submit host is not running the same CentOS/RedHat as the
>> compute nodes, that is another story.
>> regards
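>> 
>> A rough sketch of the copy approach described above; compute-0-0 as the
>> source node and passwordless root ssh are assumptions:
>> 
>>   # on the new submit host
>>   rsync -a compute-0-0:/opt/gridengine/ /opt/gridengine/
>>   rm -rf /opt/gridengine/default/spool   # per-host spool, not needed here
>>   # do NOT start sge_qmaster or sge_execd on this host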
>> 
>> 
>> To add a submit host to an existing cluster it isn't necessary to have any
>> daemon running on it, and installing a different version of SGE will most
>> likely not work either, as the internal protocol changes between releases.
>> I suggest the following:
>> 
>> - Stop the daemons you started on the new submit host
>> - Remove the compilation you did
>> - Share the users from the existing cluster by NIS/LDAP (unless you want to 
>> define them all by hand on the new machine too)
>> - Mount /home from the existing cluster
>> - Mount /usr/sge or /opt/grid, wherever you have SGE installed in the
>> existing cluster
>> - Add the machine in question as submit host in the original cluster
>> - Source $SGE_ROOT/default/common/settings.sh during login on the submit
>> machine
>> 
>> Then you should be able to submit jobs from this machine.
>> 
>> As there is no builtin file staging in SGE, it's most common to share /home.
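>> 
>> A rough shell sketch of the steps above; the frontend name "frontend", the
>> NFS export paths, and the submit host name "galdev" (taken from the prompts
>> earlier in the thread) are assumptions:
>> 
>>   # on the new submit host: stop any SGE daemons you started, then
>>   mount -t nfs frontend:/home /home
>>   mount -t nfs frontend:/opt/gridengine /opt/gridengine
>> 
>>   # on the qmaster host (frontend): register the machine as a submit host
>>   qconf -as galdev
>> 
>>   # on the submit host, e.g. in /etc/profile.d/sge.sh:
>>   . /opt/gridengine/default/common/settings.sh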
>> 
>> ==
>> 
>> Nevertheless, it could be done to have a separate single-machine cluster
>> (with a different version of SGE) and use file staging (which you have to
>> implement on your own), but that's too much overhead for adding just this
>> particular machine IMO. It's a suitable setup for combining clusters by the
>> use of a transfer queue, though. I did it once and used the job context to
>> name the files which have to be copied back and forth, and then copied them
>> on my own in a starter method.
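>> 
>> A very rough sketch of that job-context idea; the variable names
>> stagein/stageout, the parsing, and the copy step are hypothetical, not the
>> actual implementation described above:
>> 
>>   # submit side: record the files to stage in the job context
>>   qsub -ac stagein=/data/in.dat,stageout=/data/out.dat job.sh
>> 
>>   # starter method (configured as starter_method in the queue):
>>   #!/bin/sh
>>   CTX=$(qstat -j "$JOB_ID" | awk '/^context:/ {print $2}')
>>   # ... copy the files named in $CTX back and forth here ...
>>   exec "$@"    # then run the actual job script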
>> 
>> -- Reuti
>> 
>> 
>> LT
>> 
>> Sent from my iPhone
>> 
>> On Mar 27, 2012, at 4:36 PM, Robert Chase <[email protected]> wrote:
>> 
>> Hello,
>> 
>> A number of years ago, our group created a Rocks cluster consisting of a 
>> head node, a data node and eight execution nodes. The eight execution nodes 
>> can only be accessed by the head node.
>> 
>> My goal is to add a submit node to the existing cluster. I have downloaded 
>> GE2011.11 and compiled from source without errors. When I try the command:
>> 
>> qsub simple.sh
>> 
>> I get the error:
>> 
>> Unable to run job: warning: root your job is not allowed to run in any queue
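>> 
>> (simple.sh is presumably just a trivial test job, something along these
>> lines; contents assumed, not taken from the thread:
>> 
>>   #!/bin/sh
>>   #$ -S /bin/sh
>>   #$ -cwd
>>   echo "hello from $HOSTNAME"
>>   sleep 30
>> )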
>> 
>> When I look at qstat I get:
>> 
>> job-ID  prior   name       user         state submit/start at     queue                       slots ja-task-ID
>> ---------------------------------------------------------------------------------------------------------------
>>      1 0.55500 simple.sh  root         qw    03/27/2012 09:41:11                                 1
>> 
>> I have added the new submit node to the list of submit nodes on the head 
>> node using the command
>> 
>> qconf -as
>> 
>> When I run qconf -ss on the new submit node I see the head node, the data 
>> node and the new submit node.
>> 
>> When I run qconf -ss on the head node, I see the head node, the data node, 
>> the new submit node and all eight execution nodes.
>> 
>> When I run qhost on the new submit node, I get
>> 
>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>> -------------------------------------------------------------------------------
>> global                  -               -     -       -       -       -       -
>> 
>> 
>> Other posts have asked about the output of qconf -sq all.q...
>> 
>> [root@HEADNODE jobs]# qconf -sq all.q
>> qname                 all.q
>> hostlist              @allhosts
>> seq_no                0
>> load_thresholds       np_load_avg=1.75
>> suspend_thresholds    NONE
>> nsuspend              1
>> suspend_interval      00:05:00
>> priority              0
>> min_cpu_interval      00:05:00
>> processors            UNDEFINED
>> qtype                 BATCH INTERACTIVE
>> ckpt_list             NONE
>> pe_list               make mpi mpich multicore orte
>> rerun                 FALSE
>> slots                 1,[compute-0-0.local=16],[compute-0-1.local=16], \
>>                      [compute-0-2.local=16],[compute-0-3.local=16], \
>>                      [compute-0-4.local=16],[compute-0-6.local=16], \
>>                      [compute-0-7.local=16]
>> tmpdir                /tmp
>> shell                 /bin/csh
>> prolog                NONE
>> epilog                NONE
>> shell_start_mode      posix_compliant
>> starter_method        NONE
>> suspend_method        NONE
>> resume_method         NONE
>> terminate_method      NONE
>> notify                00:00:60
>> owner_list            NONE
>> user_lists            NONE
>> xuser_lists           NONE
>> subordinate_list      NONE
>> complex_values        NONE
>> projects              NONE
>> xprojects             NONE
>> calendar              NONE
>> initial_state         default
>> s_rt                  INFINITY
>> h_rt                  INFINITY
>> s_cpu                 INFINITY
>> h_cpu                 INFINITY
>> s_fsize               INFINITY
>> h_fsize               INFINITY
>> s_data                INFINITY
>> h_data                INFINITY
>> s_stack               INFINITY
>> h_stack               INFINITY
>> s_core                INFINITY
>> h_core                INFINITY
>> s_rss                 INFINITY
>> h_rss                 INFINITY
>> s_vmem                INFINITY
>> h_vmem                INFINITY
>> 
>> 
>> [root@SUBMITNODE jobs]# qconf -sq all.q
>> qname                 all.q
>> hostlist              @allhosts
>> seq_no                0
>> load_thresholds       np_load_avg=1.75
>> suspend_thresholds    NONE
>> nsuspend              1
>> suspend_interval      00:05:00
>> priority              0
>> min_cpu_interval      00:05:00
>> processors            UNDEFINED
>> qtype                 BATCH INTERACTIVE
>> ckpt_list             NONE
>> pe_list               make
>> rerun                 FALSE
>> slots                 1
>> tmpdir                /tmp
>> shell                 /bin/csh
>> prolog                NONE
>> epilog                NONE
>> shell_start_mode      posix_compliant
>> starter_method        NONE
>> suspend_method        NONE
>> resume_method         NONE
>> terminate_method      NONE
>> notify                00:00:60
>> owner_list            NONE
>> user_lists            NONE
>> xuser_lists           NONE
>> subordinate_list      NONE
>> complex_values        NONE
>> projects              NONE
>> xprojects             NONE
>> calendar              NONE
>> initial_state         default
>> s_rt                  INFINITY
>> h_rt                  INFINITY
>> s_cpu                 INFINITY
>> h_cpu                 INFINITY
>> s_fsize               INFINITY
>> h_fsize               INFINITY
>> s_data                INFINITY
>> h_data                INFINITY
>> s_stack               INFINITY
>> h_stack               INFINITY
>> s_core                INFINITY
>> h_core                INFINITY
>> s_rss                 INFINITY
>> h_rss                 INFINITY
>> s_vmem                INFINITY
>> h_vmem                INFINITY
>> 
>> I would like to know how to get qsub working.
>> 
>> Thanks,
>> -Robert Paul Chase
>> Channing Labs
>> 
>> -- 
>> Hung-Sheng Tsao Ph D.
>> Founder&  Principal
>> HopBit GridComputing LLC
>> cell: 9734950840
>> 
>> http://laotsao.blogspot.com/
>> http://laotsao.wordpress.com/
>> http://blogs.oracle.com/hstsao/
>> 
>> 
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
