On Wed, Mar 28, 2012 at 5:05 PM, Robert Chase <[email protected]> wrote:
> I followed your advice and copied the directory from one of the compute
> nodes to the new submit node. I opened the firewall on ports 536 and 537
> and added execd and qmaster to the /etc/services file. I'm getting the
> following error messages when I use qping:
>
> [root@galdev common]# qping q.bwh.harvard.edu 536 qmaster 1
> endpoint q/qmaster/1 at port 536: can't find connection
> got select error: Connection refused
> got select error: closing "q/qmaster/1"
> endpoint q/qmaster/1 at port 536: can't find connection
> endpoint q/qmaster/1 at port 536: can't find connection

Is the qmaster really listening on port 536? Note that we have standard
port numbers for qmaster & execd (6444 & 6445). From your output it looks
like nothing is listening on that port.
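Something like this, run on the qmaster host, would show what sge_qmaster
is actually bound to (a rough sketch; it assumes a Linux box with netstat
available):

    # On the qmaster host (q.bwh.harvard.edu in this thread):
    netstat -tlnp | grep sge_qmaster       # port the daemon listens on

    # Ports the clients resolve; the environment wins over /etc/services:
    echo $SGE_QMASTER_PORT $SGE_EXECD_PORT
    grep -E 'qmaster|execd' /etc/services  # standard entries use 6444/6445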
Rayson

> When I try to use qsub I get the following error:
>
> [root@galdev jobs]# qsub simple.sh
> error: commlib error: got select error (Connection refused)
> Unable to run job: unable to send message to qmaster using port 536 on
> host "q.bwh.harvard.edu": got send error.
> Exiting.
>
> Any help would be greatly appreciated.
>
> -Robert Chase
>
> On Wed, Mar 28, 2012 at 6:51 AM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D."
> <[email protected]> wrote:
>>
>> sorry again
>> one can always add a login node in a Rocks cluster that will act as a
>> submit node to SGE
>> regards
>>
>> On 3/28/2012 6:21 AM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D." wrote:
>>>
>>> On 3/28/2012 5:53 AM, Reuti wrote:
>>>>
>>>> On 27.03.2012 at 23:27, Hung-sheng Tsao wrote:
>>>>
>>>>> Maybe just copy the /opt/gridengine from one of the compute nodes.
>>>>> Add this as a submit host from the frontend.
>>>>
>>>> It may be good to add an explanation: to me it looks like the original
>>>> poster installed a separate SGE cluster on just one machine, including
>>>> the qmaster daemon, and hence it's just running locally, which
>>>> explains the job id of 1.
>>>
>>> sorry, if one just copies /opt/gridengine from a compute node, then it
>>> will have the full directories /opt/gridengine/default/common and
>>> /opt/gridengine/bin
>>> yes, there is also default/spool, which one could delete
>>>
>>> the daemon should not run!
>>>
>>> of course one will need the home directory, uid etc. from the Rocks
>>> frontend
>>>
>>> IMHO, it is much simpler than installing a new version of SGE
>>> of course, if the submit host is not running the same CentOS/Red Hat
>>> as the compute nodes, that is another story
>>> regards
>>>
>>>> To add a submit host to an existing cluster it isn't necessary to
>>>> have any daemon running on it, and installing a different version of
>>>> SGE will most likely not work either, as the internal protocol
>>>> changes between releases. I suggest the following (a command-level
>>>> sketch of these steps follows after this message):
>>>>
>>>> - Stop the daemons you started on the new submit host
>>>> - Remove the compilation you did
>>>> - Share the users from the existing cluster by NIS/LDAP (unless you
>>>>   want to define them all by hand on the new machine too)
>>>> - Mount /home from the existing cluster
>>>> - Mount /usr/sge or /opt/grid or wherever you have SGE installed in
>>>>   the existing cluster
>>>> - Add the machine in question as a submit host in the original
>>>>   cluster
>>>> - Source $SGE_ROOT/default/common/settings.sh during login on the
>>>>   submit machine
>>>>
>>>> Then you should be able to submit jobs from this machine.
>>>>
>>>> As there is no builtin file staging in SGE, it's most common to share
>>>> /home.
>>>>
>>>> ==
>>>>
>>>> Nevertheless, it could be done with a separate single-machine cluster
>>>> (with a different version of SGE) and file staging (which you have to
>>>> implement on your own), but that's too much overhead for adding just
>>>> this particular machine IMO. It is a suitable setup for combining
>>>> clusters by the use of a transfer queue. I did it once and used the
>>>> job context to name the files that had to be copied back and forth,
>>>> and then copied them myself in a starter method.
>>>>
>>>> -- Reuti
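Putting Reuti's list into commands: a minimal sketch for the new submit
host, assuming the cluster's SGE tree lives in /opt/gridengine, that the
head node (HEADNODE below, a placeholder) exports it and /home over NFS,
and that the local build installed the usual SysV init scripts:

    # 1. On the new submit host: stop the daemons from the local build
    #    (script names/locations vary between installs):
    /etc/init.d/sgemaster stop
    /etc/init.d/sgeexecd stop

    # 2. Mount the shared directories from the existing cluster:
    mount HEADNODE:/home /home
    mount HEADNODE:/opt/gridengine /opt/gridengine

    # 3. On the head node: register the new machine as a submit host:
    qconf -as submitnode.example.com

    # 4. Back on the submit host: pick up SGE_ROOT, SGE_CELL and the
    #    cell's port settings (add this to the login profile as well):
    . /opt/gridengine/default/common/settings.sh

    # 5. Test:
    qsub simple.sh

Once settings.sh is sourced, qsub and qping talk to the existing qmaster
on its real port (6444 by default) instead of port 536 from the local
install.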
>>>>> LT
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Mar 27, 2012, at 4:36 PM, Robert Chase <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> A number of years ago, our group created a Rocks cluster consisting
>>>>>> of a head node, a data node and eight execution nodes. The eight
>>>>>> execution nodes can only be accessed by the head node.
>>>>>>
>>>>>> My goal is to add a submit node to the existing cluster. I have
>>>>>> downloaded GE2011.11 and compiled it from source without errors.
>>>>>> When I try the command:
>>>>>>
>>>>>> qsub simple.sh
>>>>>>
>>>>>> I get the error:
>>>>>>
>>>>>> Unable to run job: warning: root your job is not allowed to run in
>>>>>> any queue
>>>>>>
>>>>>> When I look at qstat I get:
>>>>>>
>>>>>> job-ID  prior    name       user  state  submit/start at      queue  slots  ja-task-ID
>>>>>> ---------------------------------------------------------------------------------------
>>>>>>      1  0.55500  simple.sh  root  qw     03/27/2012 09:41:11            1
>>>>>>
>>>>>> I have added the new submit node to the list of submit nodes on the
>>>>>> head node using the command
>>>>>>
>>>>>> qconf -as
>>>>>>
>>>>>> When I run qconf -ss on the new submit node, I see the head node,
>>>>>> the data node and the new submit node.
>>>>>>
>>>>>> When I run qconf -ss on the head node, I see the head node, the
>>>>>> data node, the new submit node and all eight execution nodes.
>>>>>>
>>>>>> When I run qhost on the new submit node, I get:
>>>>>>
>>>>>> HOSTNAME  ARCH  NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>>>>> -----------------------------------------------------------
>>>>>> global    -     -     -     -       -       -       -
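The qhost output is the tell here: a lone "global" line means no
execution host has ever registered with the qmaster this client is
talking to. A quick check of which qmaster each machine actually points
at (assuming the default cell name on both):

    # Run on the head node and on the new submit node:
    echo $SGE_ROOT $SGE_CELL
    cat $SGE_ROOT/${SGE_CELL:-default}/common/act_qmaster
    # If the two act_qmaster files name different hosts, these are two
    # independent clusters, which would explain all of the above.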
>>>>>> Other posts have asked about the output of qconf -sq all.q:
>>>>>>
>>>>>> [root@HEADNODE jobs]# qconf -sq all.q
>>>>>> qname                 all.q
>>>>>> hostlist              @allhosts
>>>>>> seq_no                0
>>>>>> load_thresholds       np_load_avg=1.75
>>>>>> suspend_thresholds    NONE
>>>>>> nsuspend              1
>>>>>> suspend_interval      00:05:00
>>>>>> priority              0
>>>>>> min_cpu_interval      00:05:00
>>>>>> processors            UNDEFINED
>>>>>> qtype                 BATCH INTERACTIVE
>>>>>> ckpt_list             NONE
>>>>>> pe_list               make mpi mpich multicore orte
>>>>>> rerun                 FALSE
>>>>>> slots                 1,[compute-0-0.local=16],[compute-0-1.local=16], \
>>>>>>                       [compute-0-2.local=16],[compute-0-3.local=16], \
>>>>>>                       [compute-0-4.local=16],[compute-0-6.local=16], \
>>>>>>                       [compute-0-7.local=16]
>>>>>> tmpdir                /tmp
>>>>>> shell                 /bin/csh
>>>>>> prolog                NONE
>>>>>> epilog                NONE
>>>>>> shell_start_mode      posix_compliant
>>>>>> starter_method        NONE
>>>>>> suspend_method        NONE
>>>>>> resume_method         NONE
>>>>>> terminate_method      NONE
>>>>>> notify                00:00:60
>>>>>> owner_list            NONE
>>>>>> user_lists            NONE
>>>>>> xuser_lists           NONE
>>>>>> subordinate_list      NONE
>>>>>> complex_values        NONE
>>>>>> projects              NONE
>>>>>> xprojects             NONE
>>>>>> calendar              NONE
>>>>>> initial_state         default
>>>>>> s_rt                  INFINITY
>>>>>> h_rt                  INFINITY
>>>>>> s_cpu                 INFINITY
>>>>>> h_cpu                 INFINITY
>>>>>> s_fsize               INFINITY
>>>>>> h_fsize               INFINITY
>>>>>> s_data                INFINITY
>>>>>> h_data                INFINITY
>>>>>> s_stack               INFINITY
>>>>>> h_stack               INFINITY
>>>>>> s_core                INFINITY
>>>>>> h_core                INFINITY
>>>>>> s_rss                 INFINITY
>>>>>> h_rss                 INFINITY
>>>>>> s_vmem                INFINITY
>>>>>> h_vmem                INFINITY
>>>>>>
>>>>>> [root@SUBMITNODE jobs]# qconf -sq all.q
>>>>>> qname                 all.q
>>>>>> hostlist              @allhosts
>>>>>> seq_no                0
>>>>>> load_thresholds       np_load_avg=1.75
>>>>>> suspend_thresholds    NONE
>>>>>> nsuspend              1
>>>>>> suspend_interval      00:05:00
>>>>>> priority              0
>>>>>> min_cpu_interval      00:05:00
>>>>>> processors            UNDEFINED
>>>>>> qtype                 BATCH INTERACTIVE
>>>>>> ckpt_list             NONE
>>>>>> pe_list               make
>>>>>> rerun                 FALSE
>>>>>> slots                 1
>>>>>> tmpdir                /tmp
>>>>>> shell                 /bin/csh
>>>>>> prolog                NONE
>>>>>> epilog                NONE
>>>>>> shell_start_mode      posix_compliant
>>>>>> starter_method        NONE
>>>>>> suspend_method        NONE
>>>>>> resume_method         NONE
>>>>>> terminate_method      NONE
>>>>>> notify                00:00:60
>>>>>> owner_list            NONE
>>>>>> user_lists            NONE
>>>>>> xuser_lists           NONE
>>>>>> subordinate_list      NONE
>>>>>> complex_values        NONE
>>>>>> projects              NONE
>>>>>> xprojects             NONE
>>>>>> calendar              NONE
>>>>>> initial_state         default
>>>>>> s_rt                  INFINITY
>>>>>> h_rt                  INFINITY
>>>>>> s_cpu                 INFINITY
>>>>>> h_cpu                 INFINITY
>>>>>> s_fsize               INFINITY
>>>>>> h_fsize               INFINITY
>>>>>> s_data                INFINITY
>>>>>> h_data                INFINITY
>>>>>> s_stack               INFINITY
>>>>>> h_stack               INFINITY
>>>>>> s_core                INFINITY
>>>>>> h_core                INFINITY
>>>>>> s_rss                 INFINITY
>>>>>> h_rss                 INFINITY
>>>>>> s_vmem                INFINITY
>>>>>> h_vmem                INFINITY
>>>>>>
>>>>>> I would like to know how to get qsub working.
>>>>>>
>>>>>> Thanks,
>>>>>> -Robert Paul Chase
>>>>>> Channing Labs
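The two dumps make the diagnosis concrete: the head node's all.q spans
the eight compute nodes (pe_list with mpi/orte, 16 slots each), while the
submit node's all.q is a stock default queue (pe_list make, slots 1),
i.e. a second, independent cluster. To spot every divergence at once,
save each dump to a file and diff them (file names are just examples):

    # On the head node:
    qconf -sq all.q > /tmp/all.q.head
    # On the submit node, then copy the file over (e.g. with scp):
    qconf -sq all.q > /tmp/all.q.submit
    # pe_list and slots are the lines that differ in this thread:
    diff /tmp/all.q.head /tmp/all.q.submit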
>> --
>> Hung-Sheng Tsao Ph.D.
>> Founder & Principal
>> HopBit GridComputing LLC
>> cell: 9734950840
>>
>> http://laotsao.blogspot.com/
>> http://laotsao.wordpress.com/
>> http://blogs.oracle.com/hstsao/

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
