Re: [OMPI users] SLURM and OpenMPI
Ralph, Thanks for your reply. Let me know if I can help in any way. fds -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph H Castain Sent: Thursday, June 19, 2008 10:24 AM To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org> Subject: Re: [OMPI users] SLURM and OpenMPI Well, if the only system I cared about was slurm, there are some things I could possibly do to make things better, but at the expense of our support for other environments - which is unacceptable. There are a few technical barriers to doing this without the orteds on slurm, and a major licensing issue that prohibits us from calling any slurm APIs. How all that gets resolved is unclear. Frankly, one reason we don't put more emphasis on it is that we don't see a significant launch time difference between the two modes, and we truly do want to retain the ability to utilize different error response strategies (which slurm will not allow - you can only follow theirs). So I would say we simply have different objectives than what you stated, and different concerns that make a deeper slurm integration less favorable. May still happen, but not anytime soon. Ralph On 6/19/08 8:08 AM, "Sacerdoti, Federico" <federico.sacerd...@deshawresearch.com> wrote: > Ralph thanks for your quick response. > > Regarding your fourth paragraph, slurm will not let you run on a > no-longer-valid allocation, an srun will correctly exit non-zero with a > useful failure reason. So perhaps openmpi 1.3 with your changes will > just work, I look forward to testing it. > > E.g. > $ srun hostname > srun: error: Unable to confirm allocation for job 745346: Invalid job id > specified > srun: Check SLURM_JOBID environment variable for expired or invalid job. > > > Regarding srun to launch the jobs directly (no orteds), I am sad to hear > the idea is not in favor. 
We have found srun to be extremely scalable > (tested up to 4096 MPI processes) and very good at cleaning up after an > error or node failure. It seems you could simplify orterun quite a bit > by relying on slurm (or whatever resource manager) to handle job > cleanup after failures; it is their responsibility after all, and they > have better knowledge about the health and availability of nodes than > any launcher can hope for. > > I helped write an mvapich launcher used internally called mvrun, which > was used for several years. I wrote a lot of logic to run down and stop > all processes when one had failed, which I understand you have as well. > We came to the conclusion that slurm was in a better position to handle > such failures, and in fact did it more effectively. For example if slurm > detects a node has failed, it will stop the job, allocate an additional > free node to make up the deficit, then relaunch. It more difficult (to > put it mildly) for a job launcher to do that. > > Thanks again, > Federico > > -Original Message- > From: Ralph H Castain [mailto:r...@lanl.gov] > Sent: Tuesday, June 17, 2008 1:09 PM > To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org> > Subject: Re: [OMPI users] SLURM and OpenMPI > > I can believe 1.2.x has problems in that regard. Some of that has > nothing to > do with slurm and reflects internal issues with 1.2. > > We have made it much more resistant to those problems in the upcoming > 1.3 > release, but there is no plan to retrofit those changes to 1.2. Part of > the > problem was that we weren't using the --kill-on-bad-exit flag when we > called > srun internally, which has been fixed for 1.3. > > BTW: we actually do use srun to launch the daemons - we just call it > internally from inside orterun. The only real difference is that we use > orterun to setup the cmd line and then tell the daemons what they need > to > do. 
The issues you are seeing relate to our ability to detect that srun > has > failed, and/or that one or more daemons have failed to launch or do > something they were supposed to do. The 1.2 system has problems in that > regard, which was one motivation for the 1.3 overhaul. > > I would argue that slurm allowing us to attempt to launch on a > no-longer-valid allocation is a slurm issue, not OMPI's. As I said, we > use > srun to launch the daemons - the only reason we hang is that srun is not > returning with an error. I've seen this on other systems as well, but > have > no real answer - if slurm doesn't indicate an error has occurred, I'm > not > sure what I can do about it. > > We are unlikely to use srun to directly launch jobs (i.e., to have slurm > directly launch the job from an srun cmd line without mpirun) anytime > soon. > It isn't clear there is enough benefit to justify the rather large
Re: [OMPI users] SLURM and OpenMPI
Well, if the only system I cared about was slurm, there are some things I could possibly do to make things better, but at the expense of our support for other environments - which is unacceptable. There are a few technical barriers to doing this without the orteds on slurm, and a major licensing issue that prohibits us from calling any slurm APIs. How all that gets resolved is unclear. Frankly, one reason we don't put more emphasis on it is that we don't see a significant launch time difference between the two modes, and we truly do want to retain the ability to utilize different error response strategies (which slurm will not allow - you can only follow theirs). So I would say we simply have different objectives than what you stated, and different concerns that make a deeper slurm integration less favorable. May still happen, but not anytime soon. Ralph On 6/19/08 8:08 AM, "Sacerdoti, Federico" <federico.sacerd...@deshawresearch.com> wrote: > Ralph thanks for your quick response. > > Regarding your fourth paragraph, slurm will not let you run on a > no-longer-valid allocation, an srun will correctly exit non-zero with a > useful failure reason. So perhaps openmpi 1.3 with your changes will > just work, I look forward to testing it. > > E.g. > $ srun hostname > srun: error: Unable to confirm allocation for job 745346: Invalid job id > specified > srun: Check SLURM_JOBID environment variable for expired or invalid job. > > > Regarding srun to launch the jobs directly (no orteds), I am sad to hear > the idea is not in favor. We have found srun to be extremely scalable > (tested up to 4096 MPI processes) and very good at cleaning up after an > error or node failure. It seems you could simplify orterun quite a bit > by relying on slurm (or whatever resource manager) to handle job > cleanup after failures; it is their responsibility after all, and they > have better knowledge about the health and availability of nodes than > any launcher can hope for. 
> > I helped write an mvapich launcher used internally called mvrun, which > was used for several years. I wrote a lot of logic to run down and stop > all processes when one had failed, which I understand you have as well. > We came to the conclusion that slurm was in a better position to handle > such failures, and in fact did it more effectively. For example if slurm > detects a node has failed, it will stop the job, allocate an additional > free node to make up the deficit, then relaunch. It more difficult (to > put it mildly) for a job launcher to do that. > > Thanks again, > Federico > > -Original Message- > From: Ralph H Castain [mailto:r...@lanl.gov] > Sent: Tuesday, June 17, 2008 1:09 PM > To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org> > Subject: Re: [OMPI users] SLURM and OpenMPI > > I can believe 1.2.x has problems in that regard. Some of that has > nothing to > do with slurm and reflects internal issues with 1.2. > > We have made it much more resistant to those problems in the upcoming > 1.3 > release, but there is no plan to retrofit those changes to 1.2. Part of > the > problem was that we weren't using the --kill-on-bad-exit flag when we > called > srun internally, which has been fixed for 1.3. > > BTW: we actually do use srun to launch the daemons - we just call it > internally from inside orterun. The only real difference is that we use > orterun to setup the cmd line and then tell the daemons what they need > to > do. The issues you are seeing relate to our ability to detect that srun > has > failed, and/or that one or more daemons have failed to launch or do > something they were supposed to do. The 1.2 system has problems in that > regard, which was one motivation for the 1.3 overhaul. > > I would argue that slurm allowing us to attempt to launch on a > no-longer-valid allocation is a slurm issue, not OMPI's. As I said, we > use > srun to launch the daemons - the only reason we hang is that srun is not > returning with an error. 
I've seen this on other systems as well, but > have > no real answer - if slurm doesn't indicate an error has occurred, I'm > not > sure what I can do about it. > > We are unlikely to use srun to directly launch jobs (i.e., to have slurm > directly launch the job from an srun cmd line without mpirun) anytime > soon. > It isn't clear there is enough benefit to justify the rather large > effort, > especially considering what would be required to maintain scalability. > Decisions on all that are still pending, though, which means any > significant > change in that regard wouldn't be released until sometime next year. > > Ralph > > On 6/17/08 10:39 AM, "Sacerdoti, Federico" > <
Re: [OMPI users] SLURM and OpenMPI
I can believe 1.2.x has problems in that regard. Some of that has nothing to do with slurm and reflects internal issues with 1.2. We have made it much more resistant to those problems in the upcoming 1.3 release, but there is no plan to retrofit those changes to 1.2. Part of the problem was that we weren't using the --kill-on-bad-exit flag when we called srun internally, which has been fixed for 1.3. BTW: we actually do use srun to launch the daemons - we just call it internally from inside orterun. The only real difference is that we use orterun to setup the cmd line and then tell the daemons what they need to do. The issues you are seeing relate to our ability to detect that srun has failed, and/or that one or more daemons have failed to launch or do something they were supposed to do. The 1.2 system has problems in that regard, which was one motivation for the 1.3 overhaul. I would argue that slurm allowing us to attempt to launch on a no-longer-valid allocation is a slurm issue, not OMPI's. As I said, we use srun to launch the daemons - the only reason we hang is that srun is not returning with an error. I've seen this on other systems as well, but have no real answer - if slurm doesn't indicate an error has occurred, I'm not sure what I can do about it. We are unlikely to use srun to directly launch jobs (i.e., to have slurm directly launch the job from an srun cmd line without mpirun) anytime soon. It isn't clear there is enough benefit to justify the rather large effort, especially considering what would be required to maintain scalability. Decisions on all that are still pending, though, which means any significant change in that regard wouldn't be released until sometime next year. Ralph On 6/17/08 10:39 AM, "Sacerdoti, Federico" <federico.sacerd...@deshawresearch.com> wrote: > Ralph, > > I was wondering what the status of this feature was (using srun to > launch orted daemons)? 
I have two new bug reports to add from our > experience using orterun from 1.2.6 on our 4000 CPU infiniband cluster. > > 1. Orterun will happily hang if it is asked to run on an invalid slurm > job, e.g. if the job has exceeded its timelimit. This would be trivially > fixed if you used srun to launch, as they would fail with non-zero exit > codes. > > 2. A very simple orterun invocation hangs instead of exiting with an > error. In this case the executable does not exist, and we would expect > orterun to exit non-zero. This has caused > headaches with some workflow management script that automatically start > jobs. > > salloc -N2 -p swdev orterun dummy-binary-I-dont-exist > [hang] > > orterun dummy-binary-I-dont-exist > [hang] > > Thanks, > Federico > > -Original Message- > From: Sacerdoti, Federico > Sent: Friday, March 21, 2008 5:41 PM > To: 'Open MPI Users' > Subject: RE: [OMPI users] SLURM and OpenMPI > > > Ralph wrote: > "I don't know if I would say we "interfere" with SLURM - I would say > that we > are only lightly integrated with SLURM at this time. We use SLURM as a > resource manager to assign nodes, and then map processes onto those > nodes > according to the user's wishes. We chose to do this because srun applies > its > own load balancing algorithms if you launch processes directly with it, > which leaves the user with little flexibility to specify their desired > rank/slot mapping. We chose to support the greater flexibility." > > Ralph, we wrote a launcher for mvapich that uses srun to launch but > keeps tight control of where processes are started. The way we did it > was to force srun to launch a single process on a particular node. > > The launcher calls many of these: > srun --jobid $JOBID -N 1 -n 1 -w host005 CMD ARGS > > Hope this helps (and we are looking forward to a tighter orterun/slurm > integration as you know). 
> > Regards, > Federico > > -Original Message- > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Ralph Castain > Sent: Thursday, March 20, 2008 6:41 PM > To: Open MPI Users <us...@open-mpi.org> > Cc: Ralph Castain > Subject: Re: [OMPI users] SLURM and OpenMPI > > Hi there > > I am no slurm expert. However, it is our understanding that > SLURM_TASKS_PER_NODE means the number of slots allocated to the job, not > the > number of tasks to be executed on each node. So the 4(x2) tells us that > we > have 4 slots on each of two nodes to work with. You got 4 slots on each > node > because you used the -N option, which told slurm to assign all slots on > that > node to this job - I assume you have 4 processors on your nodes. OpenMPI > parses that string to get the allocation, then maps the number of > specified > processes against it. > &
Re: [OMPI users] SLURM and OpenMPI
On Thu, 20 Mar 2008 16:40:41 -0600
Ralph Castain wrote:

> I am no slurm expert. However, it is our understanding that
> SLURM_TASKS_PER_NODE means the number of slots allocated to the job,
> not the number of tasks to be executed on each node. So the 4(x2)
> tells us that we have 4 slots on each of two nodes to work with. You
> got 4 slots on each node because you used the -N option, which told
> slurm to assign all slots on that node to this job - I assume you
> have 4 processors on your nodes. OpenMPI parses that string to get
> the allocation, then maps the number of specified processes against
> it.

That was also my interpretation, and I was absolutely sure I had read it a couple of days ago in the srun man page. In the meantime I have changed my opinion, because it now says "Number of tasks to be initiated on each node", as Tim quoted. I have no idea how Tim managed to change the man page on my computer ;-)

There is another variable documented:

SLURM_CPUS_ON_NODE
    Count of processors available to the job on this node. Note the select/linear plugin allocates entire nodes to jobs, so the value indicates the total count of CPUs on the node. The select/cons_res plugin allocates individual processors to jobs, so this number indicates the number of processors on this node allocated to the job.

Anyway, back to reality: I made some further tests, and the only way to change the value of SLURM_TASKS_PER_NODE was to tell slurm in slurm.conf that node x has only y CPUs. The variable documented as SLURM_CPUS_ON_NODE (in 1.0.15 and 1.2.22) doesn't seem to exist in either version. In 1.2.22 there seems to be SLURM_JOB_CPUS_PER_NODE, which has the same value as SLURM_TASKS_PER_NODE. In a couple of days I'll try the other allocator plugin, which allocates on a per-CPU basis instead of a per-node basis. After that it would probably be a good idea for somebody (me?) to sum up our thread and ask the slurm guys for their opinion.

> It is possible that the interpretation of SLURM_TASKS_PER_NODE is
> different when used to allocate as opposed to directly launch
> processes. Our typical usage is for someone to do:
>
> srun -N 2 -A
> mpirun -np 2 helloworld
>
> In other words, we use srun to create an allocation, and then run
> mpirun separately within it.
>
> I am therefore unsure what the "-n 2" will do here. If I believe the
> documentation, it would seem to imply that srun will attempt to
> launch two copies of "mpirun -np 2 helloworld", yet your output
> doesn't seem to support that interpretation. It would appear that the
> "-n 2" is being ignored and only one copy of mpirun is being
> launched. I'm no slurm expert, so perhaps that interpretation is
> incorrect.

That indeed happens when you call "srun -N 2 mpirun -np 2 helloworld", but "srun -N 2 -b mpirun -np 2 helloworld" submits it as a batch job, i.e. "mpirun -np 2 helloworld" is executed only once on one of the allocated nodes, and the environment variables are set appropriately -- or at least should be set appropriately -- so that a subsequent srun or an mpirun inside the command does the right thing.

Werner
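For readers following the discussion of the "4(x2)" value, here is a minimal sketch of how such a string can be expanded into one count per node. It assumes the compact form in which "N(xR)" means "N, repeated for R consecutive nodes" and entries are separated by commas. The function name parse_tasks_per_node is hypothetical, and this is not Open MPI's actual parsing code.

```python
import re


def parse_tasks_per_node(value):
    """Expand a string such as '4(x2),3' into [4, 4, 3].

    Hypothetical helper for illustration; it only models the
    'count' and 'count(xrepeat)' entries discussed in the thread.
    """
    counts = []
    for item in value.split(","):
        match = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", item.strip())
        if match is None:
            raise ValueError(f"unrecognised entry: {item!r}")
        count = int(match.group(1))
        repeat = int(match.group(2)) if match.group(2) else 1
        counts.extend([count] * repeat)
    return counts


print(parse_tasks_per_node("4(x2)"))  # [4, 4]
```

Whether each number is read as "slots available" or "tasks to start" is exactly the point under debate above; the expansion itself is the same either way.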
Re: [OMPI users] SLURM and OpenMPI
Ralph wrote: "I don't know if I would say we "interfere" with SLURM - I would say that we are only lightly integrated with SLURM at this time. We use SLURM as a resource manager to assign nodes, and then map processes onto those nodes according to the user's wishes. We chose to do this because srun applies its own load balancing algorithms if you launch processes directly with it, which leaves the user with little flexibility to specify their desired rank/slot mapping. We chose to support the greater flexibility." Ralph, we wrote a launcher for mvapich that uses srun to launch but keeps tight control of where processes are started. The way we did it was to force srun to launch a single process on a particular node. The launcher calls many of these: srun --jobid $JOBID -N 1 -n 1 -w host005 CMD ARGS Hope this helps (and we are looking forward to a tighter orterun/slurm integration as you know). Regards, Federico -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain Sent: Thursday, March 20, 2008 6:41 PM To: Open MPI Users <us...@open-mpi.org> Cc: Ralph Castain Subject: Re: [OMPI users] SLURM and OpenMPI Hi there I am no slurm expert. However, it is our understanding that SLURM_TASKS_PER_NODE means the number of slots allocated to the job, not the number of tasks to be executed on each node. So the 4(x2) tells us that we have 4 slots on each of two nodes to work with. You got 4 slots on each node because you used the -N option, which told slurm to assign all slots on that node to this job - I assume you have 4 processors on your nodes. OpenMPI parses that string to get the allocation, then maps the number of specified processes against it. It is possible that the interpretation of SLURM_TASKS_PER_NODE is different when used to allocate as opposed to directly launch processes. 
Our typical usage is for someone to do: srun -N 2 -A mpirun -np 2 helloworld In other words, we use srun to create an allocation, and then run mpirun separately within it. I am therefore unsure what the "-n 2" will do here. If I believe the documentation, it would seem to imply that srun will attempt to launch two copies of "mpirun -np 2 helloworld", yet your output doesn't seem to support that interpretation. It would appear that the "-n 2" is being ignored and only one copy of mpirun is being launched. I'm no slurm expert, so perhaps that interpretation is incorrect. Assuming that the -n 2 is ignored in this situation, your command line: > srun -N 2 -n 2 -b mpirun -np 2 helloworld will cause mpirun to launch two processes, mapped byslot against the slurm allocation of two nodes, each having 4 slots. Thus, both processes will be launched on the first node, which is what you observed. Similarly, the command line > srun -N 2 -n 2 -b mpirun helloworld doesn't specify the #procs to mpirun. In that case, mpirun will launch a process on every available slot in the allocation. Given this command, that means 4 procs will be launched on each of the 2 nodes, for a total of 8 procs. Ranks 0-3 will be placed on the first node, ranks 4-7 on the second. Again, this is what you observed. I don't know if I would say we "interfere" with SLURM - I would say that we are only lightly integrated with SLURM at this time. We use SLURM as a resource manager to assign nodes, and then map processes onto those nodes according to the user's wishes. We chose to do this because srun applies its own load balancing algorithms if you launch processes directly with it, which leaves the user with little flexibility to specify their desired rank/slot mapping. We chose to support the greater flexibility. Using the SLURM-defined mapping will require launching without our mpirun. 
This capability is still under development, and there are issues with doing that in slurm environments which need to be addressed. It is at a lower priority than providing such support for TM right now, so I wouldn't expect it to become available for several months at least. Alternatively, it may be possible for mpirun to get the SLURM-defined mapping and use it to launch the processes. If we can get it somehow, there is no problem launching it as specified - the problem is how to get the map! Unfortunately, slurm's licensing prevents us from using its internal APIs, so obtaining the map is not an easy thing to do. Anyone who wants to help accelerate that timetable is welcome to contact me. We know the technical issues - this is mostly a problem of (a) priorities versus my available time, and (b) similar considerations on the part of the slurm folks to do the work themselves. Ralph On 3/20/08 3:48 PM, "Tim Prins" <tpr...@open-mpi.org> wrote: > Hi Werner, > > Open MPI does things a little bit differently than other MPIs when it > comes to supporting SLURM. See > http://www.open-mpi.org/faq/?category=s
Re: [OMPI users] SLURM and OpenMPI
Hi there

I am no slurm expert. However, it is our understanding that SLURM_TASKS_PER_NODE means the number of slots allocated to the job, not the number of tasks to be executed on each node. So the 4(x2) tells us that we have 4 slots on each of two nodes to work with. You got 4 slots on each node because you used the -N option, which told slurm to assign all slots on that node to this job - I assume you have 4 processors on your nodes. OpenMPI parses that string to get the allocation, then maps the number of specified processes against it.

It is possible that the interpretation of SLURM_TASKS_PER_NODE is different when used to allocate as opposed to directly launch processes. Our typical usage is for someone to do:

srun -N 2 -A
mpirun -np 2 helloworld

In other words, we use srun to create an allocation, and then run mpirun separately within it.

I am therefore unsure what the "-n 2" will do here. If I believe the documentation, it would seem to imply that srun will attempt to launch two copies of "mpirun -np 2 helloworld", yet your output doesn't seem to support that interpretation. It would appear that the "-n 2" is being ignored and only one copy of mpirun is being launched. I'm no slurm expert, so perhaps that interpretation is incorrect.

Assuming that the -n 2 is ignored in this situation, your command line:

> srun -N 2 -n 2 -b mpirun -np 2 helloworld

will cause mpirun to launch two processes, mapped byslot against the slurm allocation of two nodes, each having 4 slots. Thus, both processes will be launched on the first node, which is what you observed.

Similarly, the command line

> srun -N 2 -n 2 -b mpirun helloworld

doesn't specify the #procs to mpirun. In that case, mpirun will launch a process on every available slot in the allocation. Given this command, that means 4 procs will be launched on each of the 2 nodes, for a total of 8 procs. Ranks 0-3 will be placed on the first node, ranks 4-7 on the second. Again, this is what you observed.
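The by-slot placement described above can be illustrated with a small sketch. It assumes the allocation is already known as a list of slot counts per node ([4, 4] for the case in this thread). The names map_by_slot and map_by_node are hypothetical, the round-robin variant is included only for contrast, and none of this is Open MPI's actual mapper code.

```python
def map_by_slot(slots, np=None):
    """Return placement[rank] = node index, filling each node in turn."""
    if np is None:
        np = sum(slots)  # no process count given: one process per slot
    if np > sum(slots):
        raise ValueError("oversubscription is not modelled in this sketch")
    placement = []
    for node, count in enumerate(slots):
        for _ in range(count):
            if len(placement) == np:
                return placement
            placement.append(node)
    return placement


def map_by_node(slots, np=None):
    """Return placement[rank] = node index, cycling across the nodes."""
    if np is None:
        np = sum(slots)
    if np > sum(slots):
        raise ValueError("oversubscription is not modelled in this sketch")
    remaining = list(slots)
    placement = []
    while len(placement) < np:
        for node in range(len(remaining)):
            if len(placement) == np:
                break
            if remaining[node] > 0:
                remaining[node] -= 1
                placement.append(node)
    return placement


print(map_by_slot([4, 4], np=2))  # [0, 0]
print(map_by_slot([4, 4]))        # [0, 0, 0, 0, 1, 1, 1, 1]
print(map_by_node([4, 4], np=2))  # [0, 1]
```

With two nodes of four slots each, by-slot placement of two processes puts both on the first node, and leaving the process count unset yields ranks 0-3 on the first node and 4-7 on the second, matching the observed output.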
I don't know if I would say we "interfere" with SLURM - I would say that we are only lightly integrated with SLURM at this time. We use SLURM as a resource manager to assign nodes, and then map processes onto those nodes according to the user's wishes. We chose to do this because srun applies its own load balancing algorithms if you launch processes directly with it, which leaves the user with little flexibility to specify their desired rank/slot mapping. We chose to support the greater flexibility.

Using the SLURM-defined mapping will require launching without our mpirun. This capability is still under development, and there are issues with doing that in slurm environments which need to be addressed. It is at a lower priority than providing such support for TM right now, so I wouldn't expect it to become available for several months at least.

Alternatively, it may be possible for mpirun to get the SLURM-defined mapping and use it to launch the processes. If we can get it somehow, there is no problem launching it as specified - the problem is how to get the map! Unfortunately, slurm's licensing prevents us from using its internal APIs, so obtaining the map is not an easy thing to do.

Anyone who wants to help accelerate that timetable is welcome to contact me. We know the technical issues - this is mostly a problem of (a) priorities versus my available time, and (b) similar considerations on the part of the slurm folks to do the work themselves.

Ralph

On 3/20/08 3:48 PM, "Tim Prins" wrote:

> Hi Werner,
>
> Open MPI does things a little bit differently than other MPIs when it
> comes to supporting SLURM. See
> http://www.open-mpi.org/faq/?category=slurm
> for general information about running with Open MPI on SLURM.
>
> After trying the commands you sent, I am actually a bit surprised by the
> results. I would have expected this mode of operation to work. But
> looking at the environment variables that SLURM is setting for us, I can
> see why it doesn't.
> > On a cluster with 4 cores/node, I ran: > [tprins@odin ~]$ cat mprun.sh > #!/bin/sh > printenv > [tprins@odin ~]$ srun -N 2 -n 2 -b mprun.sh > srun: jobid 55641 submitted > [tprins@odin ~]$ cat slurm-55641.out |grep SLURM_TASKS_PER_NODE > SLURM_TASKS_PER_NODE=4(x2) > [tprins@odin ~]$ > > Which seems to be wrong, since the srun man page says that > SLURM_TASKS_PER_NODE is the "Number of tasks to be initiated on each > node". This seems to imply that the value should be "1(x2)". So maybe > this is a SLURM problem? If this value were correctly reported, Open MPI > should work fine for what you wanted to do. > > Two other things: > 1. You should probably use the command line option '--npernode' for > mpirun instead of setting the rmaps_base_n_pernode directly. > 2. In regards to your second example below, Open MPI by default maps 'by > slot'. That is, it will fill all available slots on the first node > before moving to the second. You can change this, see: >
Re: [OMPI users] SLURM and OpenMPI
Hi Werner, Open MPI does things a little bit differently than other MPIs when it comes to supporting SLURM. See http://www.open-mpi.org/faq/?category=slurm for general information about running with Open MPI on SLURM. After trying the commands you sent, I am actually a bit surprised by the results. I would have expected this mode of operation to work. But looking at the environment variables that SLURM is setting for us, I can see why it doesn't. On a cluster with 4 cores/node, I ran: [tprins@odin ~]$ cat mprun.sh #!/bin/sh printenv [tprins@odin ~]$ srun -N 2 -n 2 -b mprun.sh srun: jobid 55641 submitted [tprins@odin ~]$ cat slurm-55641.out |grep SLURM_TASKS_PER_NODE SLURM_TASKS_PER_NODE=4(x2) [tprins@odin ~]$ Which seems to be wrong, since the srun man page says that SLURM_TASKS_PER_NODE is the "Number of tasks to be initiated on each node". This seems to imply that the value should be "1(x2)". So maybe this is a SLURM problem? If this value were correctly reported, Open MPI should work fine for what you wanted to do. Two other things: 1. You should probably use the command line option '--npernode' for mpirun instead of setting the rmaps_base_n_pernode directly. 2. In regards to your second example below, Open MPI by default maps 'by slot'. That is, it will fill all available slots on the first node before moving to the second. You can change this, see: http://www.open-mpi.org/faq/?category=running#mpirun-scheduling I have copied Ralph on this mail to see if he has a better response. Tim Werner Augustin wrote: Hi, At our site here at the University of Karlsruhe we are running two large clusters with SLURM and HP-MPI. For our new cluster we want to keep SLURM and switch to OpenMPI. While testing I got the following problem: with HP-MPI I do something like srun -N 2 -n 2 -b mpirun -srun helloworld and get Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13. Hi, here is process 1 of 2, running MPI version 2.0 on xc3n14. 
when I try the same with OpenMPI (version 1.2.4) srun -N 2 -n 2 -b mpirun helloworld I get Hi, here is process 1 of 8, running MPI version 2.0 on xc3n13. Hi, here is process 0 of 8, running MPI version 2.0 on xc3n13. Hi, here is process 5 of 8, running MPI version 2.0 on xc3n14. Hi, here is process 2 of 8, running MPI version 2.0 on xc3n13. Hi, here is process 4 of 8, running MPI version 2.0 on xc3n14. Hi, here is process 3 of 8, running MPI version 2.0 on xc3n13. Hi, here is process 6 of 8, running MPI version 2.0 on xc3n14. Hi, here is process 7 of 8, running MPI version 2.0 on xc3n14. and with srun -N 2 -n 2 -b mpirun -np 2 helloworld Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13. Hi, here is process 1 of 2, running MPI version 2.0 on xc3n13. which is still wrong, because it uses only one of the two allocated nodes. OpenMPI uses the SLURM_NODELIST and SLURM_TASKS_PER_NODE environment variables, starts with slurm one orted per node and tasks upto the maximum number of slots on every node. So basically it also does some 'resource management' and interferes with slurm. OK, I can fix that with a mpirun wrapper script which calls mpirun with the right -np and the right rmaps_base_n_pernode setting, but it gets worse. We want to allocate computing power on a per cpu base instead of per node, i.e. different user might share a node. In addition slurm allows to schedule according to memory usage. Therefore it is important that on every node there is exactly the number of tasks running that slurm wants. The only solution I came up with is to generate for every job a detailed hostfile and call mpirun --hostfile. Any suggestions for improvement? I've found a discussion thread "slurm and all-srun orterun" in the mailinglist archive concerning the same problem, where Ralph Castain announced that he is working on two new launch methods which would fix my problems. 
Unfortunately his email address is deleted from the archive, so it would be really nice if the friendly elf mentioned there is still around and could forward my mail to him. Thanks in advance, Werner Augustin
[OMPI users] SLURM and OpenMPI
Hi,

At our site here at the University of Karlsruhe we are running two large clusters with SLURM and HP-MPI. For our new cluster we want to keep SLURM and switch to OpenMPI. While testing I ran into the following problem.

With HP-MPI I do something like

    srun -N 2 -n 2 -b mpirun -srun helloworld

and get

    Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
    Hi, here is process 1 of 2, running MPI version 2.0 on xc3n14.

When I try the same with OpenMPI (version 1.2.4)

    srun -N 2 -n 2 -b mpirun helloworld

I get

    Hi, here is process 1 of 8, running MPI version 2.0 on xc3n13.
    Hi, here is process 0 of 8, running MPI version 2.0 on xc3n13.
    Hi, here is process 5 of 8, running MPI version 2.0 on xc3n14.
    Hi, here is process 2 of 8, running MPI version 2.0 on xc3n13.
    Hi, here is process 4 of 8, running MPI version 2.0 on xc3n14.
    Hi, here is process 3 of 8, running MPI version 2.0 on xc3n13.
    Hi, here is process 6 of 8, running MPI version 2.0 on xc3n14.
    Hi, here is process 7 of 8, running MPI version 2.0 on xc3n14.

and with

    srun -N 2 -n 2 -b mpirun -np 2 helloworld

I get

    Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
    Hi, here is process 1 of 2, running MPI version 2.0 on xc3n13.

which is still wrong, because it uses only one of the two allocated nodes.

OpenMPI reads the SLURM_NODELIST and SLURM_TASKS_PER_NODE environment variables, uses slurm to start one orted per node, and then starts tasks up to the maximum number of slots on every node. So basically it also does some 'resource management' and interferes with slurm. OK, I can fix that with an mpirun wrapper script which calls mpirun with the right -np and the right rmaps_base_n_pernode setting, but it gets worse. We want to allocate computing power on a per-CPU basis instead of per node, i.e. different users might share a node. In addition, slurm allows scheduling according to memory usage. Therefore it is important that every node runs exactly the number of tasks that slurm wants.

The only solution I came up with is to generate a detailed hostfile for every job and call mpirun --hostfile. Any suggestions for improvement?

I've found a discussion thread "slurm and all-srun orterun" in the mailing list archive concerning the same problem, where Ralph Castain announced that he is working on two new launch methods which would fix my problems. Unfortunately his email address is deleted from the archive, so it would be really nice if the friendly elf mentioned there is still around and could forward my mail to him.

Thanks in advance,
Werner Augustin
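The per-job hostfile idea above could be sketched roughly as follows. This is only an illustration, not something from the thread: the function names are made up, and it assumes the node names are already available as a plain list (on a real system they would have to be expanded from SLURM_NODELIST first). It expands the compact SLURM_TASKS_PER_NODE form, e.g. "2(x2),1", and emits one "host slots=N" line per node, which is the hostfile syntax that mpirun --hostfile accepts.

```python
import re


def expand_tasks_per_node(spec):
    """Expand a SLURM_TASKS_PER_NODE string such as '2(x2),1' into [2, 2, 1]."""
    counts = []
    for item in spec.split(","):
        match = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", item.strip())
        if match is None:
            raise ValueError(f"unrecognized entry: {item!r}")
        tasks = int(match.group(1))
        repeat = int(match.group(2) or 1)  # no '(xN)' suffix means one node
        counts.extend([tasks] * repeat)
    return counts


def hostfile_lines(nodes, tasks_per_node):
    """Pair each node with its task count in Open MPI hostfile syntax.

    nodes is a hypothetical, already expanded list of host names.
    """
    counts = expand_tasks_per_node(tasks_per_node)
    if len(counts) != len(nodes):
        raise ValueError("node list and task counts differ in length")
    return [f"{node} slots={count}" for node, count in zip(nodes, counts)]


if __name__ == "__main__":
    # Node names taken from the example output above; one task per node.
    for line in hostfile_lines(["xc3n13", "xc3n14"], "1(x2)"):
        print(line)
```

A wrapper script could write these lines to a temporary file and pass it to mpirun --hostfile together with -np set to the sum of the counts. Whether that is enough to keep mpirun from using more slots than slurm granted depends on the Open MPI version, so treat it as a starting point.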
Re: [OMPI users] Slurm and Openmpi
Thanks for the help. I renamed the nodes, and now slurm and openmpi seem to be playing nicely with each other.

Bob

On 1/19/07, Jeff Squyres wrote:
> I think the SLURM code in Open MPI is making an assumption that is failing in your case: we assume that your nodes will have a specific naming convention:
>
>     mycluster.example.com    --> head node
>     mycluster01.example.com  --> cluster node 1
>     mycluster02.example.com  --> cluster node 2
>     ...etc.
>
> OMPI is therefore parsing the SLURM environment and not correctly grokking the "master,wolf1" string because, to be honest, I didn't even know that SLURM supported that scenario. I.e., I thought SLURM required the naming convention I listed above. In hindsight, that's a pretty silly assumption, but to be fair, you're the first user that ever came to us with this problem (i.e., we use pretty much the same string parsing in LAM/MPI, which has had SLURM support for several years). Oops!
>
> We can fix this, but I don't know if it'll make the v1.2 cutoff or not. :-\
>
> Thanks for bringing this to our attention!
>
> On Jan 19, 2007, at 1:50 PM, Robert Bicknell wrote:
> > Thanks for your response. The program that I have been using for testing purposes is a simple hello:
> >
> >     #include <stdio.h>
> >     #include <unistd.h>
> >     #include <sys/resource.h>
> >     #include <mpi.h>
> >
> >     int main(int argc, char **argv)
> >     {
> >         char name[BUFSIZ];
> >         int length;
> >         int rank;
> >         struct rlimit rlim;
> >         FILE *output;
> >
> >         MPI_Init(&argc, &argv);
> >         MPI_Get_processor_name(name, &length);
> >         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >         rank = 0;
> >         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >
> >         // while(1) {
> >         printf("%s: hello world from rank %d\n", name, rank);
> >         sleep(1);
> >         // }
> >         MPI_Finalize();
> >         return 0;
> >     }
> >
> > If I run this program outside a slurm environment I get the following:
> >
> >     mpirun -np 4 -mca btl tcp,self -host wolf1,master ./hello
> >
> >     master: hello world from rank 1
> >     wolf1: hello world from rank 0
> >     wolf1: hello world from rank 2
> >     master: hello world from rank 3
> >
> > This is exactly what I expect. Now if I create a slurm environment using the following:
> >
> >     srun -n 4 -A
> >
> > The output of printenv|grep SLURM gives me:
> >
> >     SLURM_NODELIST=master,wolf1
> >     SLURM_SRUN_COMM_PORT=58929
> >     SLURM_MEM_BIND_TYPE=
> >     SLURM_CPU_BIND_VERBOSE=quiet
> >     SLURM_MEM_BIND_LIST=
> >     SLURM_CPU_BIND_LIST=
> >     SLURM_NNODES=2
> >     SLURM_JOBID=66135
> >     SLURM_TASKS_PER_NODE=2(x2)
> >     SLURM_SRUN_COMM_HOST=master
> >     SLURM_CPU_BIND_TYPE=
> >     SLURM_MEM_BIND_VERBOSE=quiet
> >     SLURM_NPROCS=4
> >
> > This seems to indicate that both master and wolf1 have been allocated and that each node should run 2 tasks, which is correct since both master and wolf1 are dual processor machines.
> >
> > Now if I run:
> >
> >     mpirun -np 4 -mca btl tcp,self ./hello
> >
> > The output is:
> >
> >     master: hello world from rank 1
> >     master: hello world from rank 2
> >     master: hello world from rank 3
> >     master: hello world from rank 0
> >
> > All four processes are running on master and none on wolf1.
> >
> > If I try the following and specify the hosts, I get the following error message.
> >
> >     mpirun -np 4 -host wolf1,master -mca btl tcp,self ./hello
> >
> >     --
> >     Some of the requested hosts are not included in the current allocation for the
> >     application:
> >       ./hello
> >     The requested hosts were:
> >       wolf1,master
> >
> >     Verify that you have mapped the allocated resources properly using the
> >     --host specification.
> >     --
> >     [master:28022] [0,0,0] ORTE_ERROR_LOG: Out of resource in file rmgr_urm.c at line 377
> >     [master:28022] mpirun: spawn failed with errno=-2
> >
> > I'm at a loss to figure out how to get this working correctly. Any help would be greatly appreciated.
> >
> > Bob
> >
> > On 1/19/07, Ralph Castain <r...@lanl.gov> wrote:
> > > Open MPI and SLURM should work together just fine right out-of-the-box. The typical command progression is:
> > >
> > >     srun -n x -A
> > >     mpirun -n y .
> > >
> > > If you are doing those commands and still see everything running on the head node, then two things could be happening:
> > >
> > > (a) you really aren't getting an allocation from slurm. Perhaps you don't have slurm set up correctly and aren't actually seeing the allocation in your environment. Do a "printenv | grep SLURM" and see if you find the following variables:
> > >
> > >     SLURM_NPROCS=8
> > >     SLURM_CPU_BIND_VERBOSE=quiet
> > >     SLURM_CPU_BIND_TYPE=
> > >     SLURM_CPU_BIND_LIST=
> > >     SLURM_MEM_BIND_VERBOSE=quiet
> > >     SLURM_MEM_BIND_TYPE=
> > >     SLURM_MEM_BIND_LIST=
> > >     SLURM_JOBID=47225
> > >     SLURM_NNODES=2
> > >     SLURM_NODELIST=odin[013-014]
> > >     SLURM_TASKS_PER_NODE=4(x2)
> > >     SLURM_SRUN_COMM_PORT=43206
> > >     SLURM_SRUN_COMM_HOST=odin
> > >
> > > Obviously, the values will be different, but we really need the TASKS_PER_NODE and NODELIST ones to be there.
> > >
> > > (b) the master node is being included in your nodelist and you aren't running enough mpi processes to need more nodes (i.e., the number of slots on the master node is greater than
Re: [OMPI users] Slurm and Openmpi
I think the SLURM code in Open MPI is making an assumption that is failing in your case: we assume that your nodes will have a specific naming convention:

    mycluster.example.com    --> head node
    mycluster01.example.com  --> cluster node 1
    mycluster02.example.com  --> cluster node 2
    ...etc.

OMPI is therefore parsing the SLURM environment and not correctly grokking the "master,wolf1" string because, to be honest, I didn't even know that SLURM supported that scenario. I.e., I thought SLURM required the naming convention I listed above. In hindsight, that's a pretty silly assumption, but to be fair, you're the first user that ever came to us with this problem (i.e., we use pretty much the same string parsing in LAM/MPI, which has had SLURM support for several years). Oops!

We can fix this, but I don't know if it'll make the v1.2 cutoff or not. :-\

Thanks for bringing this to our attention!

On Jan 19, 2007, at 1:50 PM, Robert Bicknell wrote:
> Thanks for your response. The program that I have been using for testing purposes is a simple hello:
>
>     #include <stdio.h>
>     #include <unistd.h>
>     #include <sys/resource.h>
>     #include <mpi.h>
>
>     int main(int argc, char **argv)
>     {
>         char name[BUFSIZ];
>         int length;
>         int rank;
>         struct rlimit rlim;
>         FILE *output;
>
>         MPI_Init(&argc, &argv);
>         MPI_Get_processor_name(name, &length);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         rank = 0;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>         // while(1) {
>         printf("%s: hello world from rank %d\n", name, rank);
>         sleep(1);
>         // }
>         MPI_Finalize();
>         return 0;
>     }
>
> If I run this program outside a slurm environment I get the following:
>
>     mpirun -np 4 -mca btl tcp,self -host wolf1,master ./hello
>
>     master: hello world from rank 1
>     wolf1: hello world from rank 0
>     wolf1: hello world from rank 2
>     master: hello world from rank 3
>
> This is exactly what I expect. Now if I create a slurm environment using the following:
>
>     srun -n 4 -A
>
> The output of printenv|grep SLURM gives me:
>
>     SLURM_NODELIST=master,wolf1
>     SLURM_SRUN_COMM_PORT=58929
>     SLURM_MEM_BIND_TYPE=
>     SLURM_CPU_BIND_VERBOSE=quiet
>     SLURM_MEM_BIND_LIST=
>     SLURM_CPU_BIND_LIST=
>     SLURM_NNODES=2
>     SLURM_JOBID=66135
>     SLURM_TASKS_PER_NODE=2(x2)
>     SLURM_SRUN_COMM_HOST=master
>     SLURM_CPU_BIND_TYPE=
>     SLURM_MEM_BIND_VERBOSE=quiet
>     SLURM_NPROCS=4
>
> This seems to indicate that both master and wolf1 have been allocated and that each node should run 2 tasks, which is correct since both master and wolf1 are dual processor machines.
>
> Now if I run:
>
>     mpirun -np 4 -mca btl tcp,self ./hello
>
> The output is:
>
>     master: hello world from rank 1
>     master: hello world from rank 2
>     master: hello world from rank 3
>     master: hello world from rank 0
>
> All four processes are running on master and none on wolf1.
>
> If I try the following and specify the hosts, I get the following error message.
>
>     mpirun -np 4 -host wolf1,master -mca btl tcp,self ./hello
>
>     --
>     Some of the requested hosts are not included in the current allocation for the
>     application:
>       ./hello
>     The requested hosts were:
>       wolf1,master
>
>     Verify that you have mapped the allocated resources properly using the
>     --host specification.
>     --
>     [master:28022] [0,0,0] ORTE_ERROR_LOG: Out of resource in file rmgr_urm.c at line 377
>     [master:28022] mpirun: spawn failed with errno=-2
>
> I'm at a loss to figure out how to get this working correctly. Any help would be greatly appreciated.
>
> Bob
>
> On 1/19/07, Ralph Castain wrote:
> > Open MPI and SLURM should work together just fine right out-of-the-box. The typical command progression is:
> >
> >     srun -n x -A
> >     mpirun -n y .
> >
> > If you are doing those commands and still see everything running on the head node, then two things could be happening:
> >
> > (a) you really aren't getting an allocation from slurm. Perhaps you don't have slurm set up correctly and aren't actually seeing the allocation in your environment. Do a "printenv | grep SLURM" and see if you find the following variables:
> >
> >     SLURM_NPROCS=8
> >     SLURM_CPU_BIND_VERBOSE=quiet
> >     SLURM_CPU_BIND_TYPE=
> >     SLURM_CPU_BIND_LIST=
> >     SLURM_MEM_BIND_VERBOSE=quiet
> >     SLURM_MEM_BIND_TYPE=
> >     SLURM_MEM_BIND_LIST=
> >     SLURM_JOBID=47225
> >     SLURM_NNODES=2
> >     SLURM_NODELIST=odin[013-014]
> >     SLURM_TASKS_PER_NODE=4(x2)
> >     SLURM_SRUN_COMM_PORT=43206
> >     SLURM_SRUN_COMM_HOST=odin
> >
> > Obviously, the values will be different, but we really need the TASKS_PER_NODE and NODELIST ones to be there.
> >
> > (b) the master node is being included in your nodelist and you aren't running enough mpi processes to need more nodes (i.e., the number of slots on the master node is greater than or equal to the num procs you requested). You can force Open MPI to not run on your master node by including "--nolocal" on your command line. Of course, if the master node is the only thing on the nodelist, this will cause mpirun to abort as there is nothing else for us to use.
> >
> > Hope that helps
> > Ralph
> >
> > On 1/18/07 11:03 PM, "Robert Bicknell" wrote:
> > > I'm
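The two nodelist shapes that appear in this thread, a bracketed range such as "odin[013-014]" and a plain comma-separated list such as "master,wolf1", can both be handled by a small expander. The sketch below is only an illustration of that idea and is not Open MPI's actual parsing code; the function names are invented, and it covers only the simple "prefix[ranges]" and plain-name forms.

```python
import re


def split_top_level(nodelist):
    """Split on commas that are not inside square brackets."""
    parts, depth, current = [], 0, []
    for ch in nodelist:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return [p for p in parts if p]


def expand_nodelist(nodelist):
    """Expand 'odin[013-014]' or 'master,wolf1' into a list of host names."""
    hosts = []
    for part in split_top_level(nodelist):
        match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", part)
        if match is None:
            hosts.append(part)  # plain name such as 'master'
            continue
        prefix, ranges = match.groups()
        for item in ranges.split(","):
            low, _, high = item.partition("-")
            high = high or low  # single index such as '7'
            width = len(low)  # keep zero padding, e.g. 013
            for number in range(int(low), int(high) + 1):
                hosts.append(f"{prefix}{number:0{width}d}")
    return hosts


if __name__ == "__main__":
    print(expand_nodelist("master,wolf1"))
    print(expand_nodelist("odin[013-014]"))
```

The point of splitting at the top level first is that no naming convention is assumed: any entry without brackets is taken as a literal host name.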
Re: [OMPI users] Slurm and Openmpi
Open MPI and SLURM should work together just fine right out-of-the-box. The typical command progression is:

    srun -n x -A
    mpirun -n y .

If you are doing those commands and still see everything running on the head node, then two things could be happening:

(a) you really aren't getting an allocation from slurm. Perhaps you don't have slurm set up correctly and aren't actually seeing the allocation in your environment. Do a "printenv | grep SLURM" and see if you find the following variables:

    SLURM_NPROCS=8
    SLURM_CPU_BIND_VERBOSE=quiet
    SLURM_CPU_BIND_TYPE=
    SLURM_CPU_BIND_LIST=
    SLURM_MEM_BIND_VERBOSE=quiet
    SLURM_MEM_BIND_TYPE=
    SLURM_MEM_BIND_LIST=
    SLURM_JOBID=47225
    SLURM_NNODES=2
    SLURM_NODELIST=odin[013-014]
    SLURM_TASKS_PER_NODE=4(x2)
    SLURM_SRUN_COMM_PORT=43206
    SLURM_SRUN_COMM_HOST=odin

Obviously, the values will be different, but we really need the TASKS_PER_NODE and NODELIST ones to be there.

(b) the master node is being included in your nodelist and you aren't running enough mpi processes to need more nodes (i.e., the number of slots on the master node is greater than or equal to the num procs you requested). You can force Open MPI to not run on your master node by including "--nolocal" on your command line. Of course, if the master node is the only thing on the nodelist, this will cause mpirun to abort as there is nothing else for us to use.

Hope that helps
Ralph

On 1/18/07 11:03 PM, "Robert Bicknell" wrote:
> I'm trying to get slurm and openmpi to work together on a two-node Debian cluster. Slurm and openmpi seem to work fine separately, but when I try to run an MPI program in a slurm allocation, all the processes get run on the master node, and not distributed to the second node. What am I doing wrong?
>
> Bob
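The two checks in this message can be pictured with a toy model. The sketch below is an assumption-laden illustration, not Open MPI's mapper: the function names, the slot counts, and the simple "fill one node before moving on" rule are all hypothetical stand-ins. It shows (a) a test for the two SLURM variables that must be present and (b) why a master node with enough slots absorbs every process unless it is excluded.

```python
import os

REQUIRED = ("SLURM_NODELIST", "SLURM_TASKS_PER_NODE")


def missing_slurm_vars(env):
    """Return the required SLURM variables that are absent or empty in env."""
    return [name for name in REQUIRED if not env.get(name)]


def fill_by_slot(slots, nprocs, nolocal=False, local=None):
    """Place nprocs ranks node by node, filling each node's slots first.

    slots is an ordered list of (node, slot_count) pairs. With nolocal set,
    the node named by local is skipped, mimicking the effect described for
    --nolocal above.
    """
    usable = [(n, s) for n, s in slots if not (nolocal and n == local)]
    if not usable:
        raise RuntimeError("no nodes left to run on")
    placement = []
    for node, count in usable:
        for _ in range(count):
            if len(placement) == nprocs:
                return placement
            placement.append(node)
    if len(placement) < nprocs:
        raise RuntimeError("not enough slots for the requested processes")
    return placement


if __name__ == "__main__":
    print("missing:", missing_slurm_vars(os.environ))
    # Hypothetical allocation: 4 slots on each of two nodes, 4 processes.
    nodes = [("master", 4), ("wolf1", 4)]
    print(fill_by_slot(nodes, 4))
    print(fill_by_slot(nodes, 4, nolocal=True, local="master"))
```

In this model, four processes on a four-slot master never reach the second node, and excluding the master when it is the only node raises an error, which matches the abort described in case (b).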
[OMPI users] Slurm and Openmpi
I'm trying to get slurm and openmpi to work together on a two-node Debian cluster. Slurm and openmpi seem to work fine separately, but when I try to run an MPI program in a slurm allocation, all the processes get run on the master node, and not distributed to the second node. What am I doing wrong?

Bob