On 15.01.2013, at 02:06, John Weiner wrote:

> Dear Experts:
> 
> I am a newbie to linux clusters and have only yeoman competence in 
> information technology generally so my culture and intuition are not deep.  
> Some help on a perplexing problem would be greatly appreciated.
> 
> About a month ago we installed Rocks v. 6.1 on a small cluster consisting of 
> a FrontEnd and two compute nodes.  The installation proceeded without error 
> and parallel processing on the cluster works fine.  The SGE queue works fine 
> as well.  SGE is a package installed with Rocks software.
> 
> We have just installed another Rocks 6.1 cluster, using different hardware, 
> on a FrontEnd and 5 Compute nodes.  After some adjustments to the BIOS on the 
> motherboards of the compute nodes, the installation looks normal, 1 FrontEnd 
> and compute-0-0, compute-0-1…compute-0-5.  The compute nodes consist of a 
> SuperMicro motherboard with dual Intel E5-2650 processors with
> hyper-threading.  Each compute node has a total of 16 physical cores; with 
> hyper-threading the "effective" number of cores is 32.
> 
>> When parallel jobs, using MPI,

By MPI you mean Open MPI (as you use a PE "orte" below)? Is there only one 
`mpirun` installed, or at least the correct one called in the job script?
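
To rule out a mix of MPI installations, you can check which `mpirun` the job 
script picks up (the paths below are only typical locations, not necessarily 
yours):

```shell
# Which mpirun is first in $PATH, and which MPI does it belong to?
which mpirun
mpirun --version     # Open MPI prints "mpirun (Open MPI) x.y.z"

# List other mpirun binaries that may be installed (paths are examples only)
ls -l /opt/openmpi/bin/mpirun /usr/lib64/openmpi/bin/mpirun 2>/dev/null
```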


>> are submitted "by hand", typing out the explicit commands at the command 
>> line, the system works without any problem.  When the very same job is 
>> submitted to the SGE queue, an error is generated, and although qstat 
>> indicates a running program, in fact it is not.  qstat -f shows that the job 
>> was not distributed among the four compute nodes as specified by mpi.exe 
>> command.

It works the other way round: SGE grants access to the requested number of 
slots, and Open MPI has to use these and only these.

http://www.open-mpi.org/faq/?category=building#build-rte-sge

http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge
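
With a working tight integration the job script doesn't need any host list at 
all; a minimal sketch (the application name is made up):

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
# Under a tight integration Open MPI reads the granted allocation
# ($NSLOTS and $PE_HOSTFILE) from SGE by itself - no -np, no
# -hostfile, no machinefile needed:
mpirun ./my_mpi_app    # hypothetical binary
```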

The granted allocation can be viewed by:

$ qstat -g t


>> The command line for submitting the job to the SGE queue is
>> 
>> qsub -pe orte 64 shellfile.sh  (there are 64 cores specified for the job on 
>> 4 compute nodes)

Did you compile Open MPI with --with-sge?
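
Whether the SGE support was compiled in can be checked with `ompi_info`; a 
gridengine component should show up:

```shell
# Should print the gridengine ras component if Open MPI was
# configured with --with-sge:
ompi_info | grep gridengine
```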


>> In this case job 43 was started, but the program does not run on the 
>> specified nodes with the specified cores.
>> 
>> The error from shellfile.sh.e43 is:
>> 
>> error: executing task of job 43 failed: execution daemon on host 
>> "compute-0-0" didn't accept task
>> error: executing task of job 43 failed: execution daemon on host 
>> "compute-0-1" didn't accept task

Did you set up a PE for a tight integration of Open MPI?

One possible cause is that at least one `qrsh -inherit ...` call more than 
allowed by the granted slot count was made to a slave machine of the parallel 
job.
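
For reference, a PE set up for a tight integration usually looks like this 
(shown with `qconf -sp orte`; allocation_rule and slots may differ in your 
setup):

```shell
$ qconf -sp orte
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
```

The key settings are `control_slaves TRUE` (which allows the `qrsh -inherit` 
calls at all) and `job_is_first_task FALSE`.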

-- Reuti

NB: Nowadays often only one `qrsh -inherit ...` call is made to each slave 
machine at all, as additional processes are started as forks (you can observe 
this with `ps -e f`).


>> The job had been submitted to compute-0-0, compute-0-1, compute-0-2, 
>> compute-0-3
>> 
>> What does "execution daemon on host "compute-0-0" didn't accept task" mean?
> 
> Since SGE works without problems on the earlier cluster, I don't understand 
> where the error is here.
>> 
>> Any suggestions would be much appreciated.
> 
> John
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
> 

