Hello,

For everybody interested: the affinity problem that was discussed here has been fixed in the main OMPI trunk. The fix will be included in version 1.8.1.
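For reference, the core of the rankfile workaround quoted below can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name `make_rankfile` and its argument format ("host:cpu cpu ...") are hypothetical, and the host and CPU lists are passed in already expanded rather than being read from Slurm, nodeset, or the cpuset files.

```shell
# Sketch only: make_rankfile is a hypothetical helper, not part of Slurm or Open MPI.
# Usage: make_rankfile NPROCS "host:cpu cpu ..." ["host:cpu cpu ..." ...]
# Prints one "rank N=host slot=cpu" line per rank and stops after NPROCS ranks.
make_rankfile() {
    nprocs=$1
    shift
    rank=0
    for entry in "$@"; do
        host=${entry%%:*}          # text before the first ':'
        cpus=${entry#*:}           # space-separated cpu ids after the first ':'
        for cpu in $cpus; do
            # stop as soon as every requested rank has a slot
            if [ "$rank" -ge "$nprocs" ]; then
                return 0
            fi
            echo "rank ${rank}=${host} slot=${cpu}"
            rank=$((rank + 1))
        done
    done
}
```

With made-up host names, `make_rankfile 3 "n1:0 1" "n2:4 5"` fills both cpus on n1 first and then takes cpu 4 on n2, which mirrors the fill-by-node order of the original loop.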
2014-03-01 0:45 GMT+07:00 Ralph Castain <[email protected]>:

>
> On Feb 28, 2014, at 9:13 AM, L. Shawn Matott <[email protected]> wrote:
>
> > Danny,
> >
> > That's good to know. Which of the steps causes the loss of functionality
> > (rankfile, ssh as plm, or mpirun instead of srun)?
>
> Again, to clarify - you gain a lot of functionality in terms of mapping,
> binding, and other areas. In exchange, you lose atomicity in accounting and
> memory limits. Note, however, that you can enforce memory limits on the
> individual procs using mpirun if you so choose, so the only actual loss
> is the individual process accounting. Mpirun will provide those numbers as
> well, if you want, but will not add them to the Slurm accounting database.
>
> > --- Shawn
> >
> > -----Original Message-----
> > From: Danny Auble
> > Sent: Friday, February 28, 2014 12:09 PM
> > To: slurm-dev
> > Subject: [slurm-dev] Re: openmpi misbehaves when started under slurm
> >
> > Just a notice to those attempting to run this way, Slurm will not be
> > able to monitor the step or keep accounting or enforce memory limits
> > when running this way.
> >
> > On 02/28/2014 09:01 AM, L. Shawn Matott wrote:
> >>
> >> On our cluster we use SLURM v2.6.3 with cpusets enabled. We sometimes see
> >> problems with openmpi and incorrect cpu pinning. As a workaround we use the
> >> following bit of bash code to manually assemble an openmpi rankfile, switch
> >> from slurm to ssh as the process launch module, and finally launch using
> >> mpirun instead of srun. Hope this is helpful to someone.....
> >>
> >> ----
> >> L. Shawn Matott, PhD
> >> Computational Scientist
> >> University at Buffalo,
> >> Center for Computational Research
> >> 701 Ellicott Street, Buffalo, New York 14203
> >>
> >> # ================================================================================================
> >> # create rank file to explicitly bind cores
> >> echo "creating hostfile and rankfile"
> >> uid=`id -u`
> >> jid=$SLURM_JOB_ID
> >> nodes=`nodeset -e $SLURM_NODELIST`
> >>
> >> # trigger creation of cpuset information and save to working dir
> >> srun bash -c "cat /cgroup/cpuset/slurm/uid_${uid}/job_${jid}/cpuset.cpus > cpus.\`hostname\`.$SLURM_JOB_ID"
> >>
> >> RANKFILE=rankfile.$$
> >> NODEFILE=nodefile.$$
> >>
> >> rm -f $RANKFILE
> >> rm -f $NODEFILE
> >> rank=0
> >> for i in ${nodes}; do
> >>   # extract space-separated list of assigned cpus
> >>   cpus=`cat cpus.${i}.${SLURM_JOB_ID}`
> >>   cpus=`nodeset -Re $cpus`
> >>   # add cpu assignments to the rank file
> >>   for j in ${cpus}; do
> >>     echo "rank ${rank}=$i slot=$j" >> $RANKFILE
> >>     echo "$i" >> $NODEFILE
> >>     rank=`expr $rank + 1`
> >>     if [ "$rank" == "$SLURM_NPROCS" ]; then
> >>       break;
> >>     fi
> >>   done
> >>   if [ "$rank" == "$SLURM_NPROCS" ]; then
> >>     break;
> >>   fi
> >> done
> >>
> >> # use ssh instead of slurm as the launcher
> >> # the rankfile that was just created will ensure cpusets are still honored.
> >> export OMPI_MCA_plm=rsh
> >>
> >> # launch application using mpirun
> >> # (--hostfile names the node list; a bare -h would only print mpirun's help)
> >> echo "Launching application using mpirun"
> >> mpirun \
> >>   --hostfile $NODEFILE \
> >>   --rankfile $RANKFILE \
> >>   --prefix $OMPI \
> >>   --n $SLURM_NPROCS \
> >>   --display-map \
> >>   --verbose $EXE $ARGS
> >> # ================================================================================================

--
Best regards,
Artem Y. Polyakov
