Re: [OMPI users] Option -cpus-per-proc 2 not working with given machinefile?
On 28.02.2013 at 19:50, Reuti wrote:

> On 28.02.2013 at 19:21, Ralph Castain wrote:
>
>> On Feb 28, 2013, at 9:53 AM, Reuti wrote:
>>
>>> On 28.02.2013 at 17:54, Ralph Castain wrote:
>>>
>>>> Hmmm... the problem is that we are mapping procs using the provided slots instead of dividing the slots by cpus-per-proc. So we put too many on the first node, and the backend daemon aborts the job because it lacks sufficient processors for cpus-per-proc=2.
>>>
>>> Ok, this I would understand. But why is it then working if no maximum number of slots is given? Will it then just fill the node up to the found number of cores inside, subtract this correctly each time a new process is started, and jump to the next machine if necessary?
>>
>> Not exactly. If no max slots is given, then we assume a value of one. This effectively converts byslot mapping to bynode - i.e., we place one proc on a node, and that meets its #slots, so we place the next proc on the next node. So you wind up balancing across the two nodes.
>
> Ok, now I understand the behavior - Thx.
>
>> If you specify slots=64, then we'll try to place all 64 procs on the first node because we are using byslot mapping by default. You could make it work by just adding -bynode to your command line.
>>
>>> It is of course for now a feasible workaround to get the intended behavior by supplying just an additional hostfile.
>>
>> Or use bynode mapping
>
> You mean a line like:
>
> mpiexec -cpus-per-proc 2 -bynode -report-bindings ./mpihello
>
> For me this results in the same error as without "-bynode" at all.
>
>>> But regarding my recent eMail I also wonder about the difference between running on the command line and inside SGE. In the latter case the overall universe is correct.
>>
>> If you don't provide a slots value in the hostfile, we assume 1 - and so the universe size is 2, and you are heavily oversubscribed. Inside SGE, we see 128 slots assigned to you, and you are not oversubscribed.
>
> Yes, but the "fake" hostfile I provide on the command line to `mpiexec` has only the plain names inside. Somehow this changes the way the processes are distributed to "-bynode", but not the overall slot count - interesting.

Oops: I meant "overall universe size" - the slot count (i.e. the number of processes) I reduce myself via the -np option.

-- Reuti

> -- Reuti
>
>> HTH
>> Ralph
>>
>>> -- Reuti
>>>
>>>> Given that there are no current plans for a 1.6.5, this may not get fixed.
>>>>
>>>> On Feb 27, 2013, at 3:15 PM, Reuti wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have an issue using the option -cpus-per-proc 2. As I have Bulldozer machines and I want only one process per FP core, I thought using -cpus-per-proc 2 would be the way to go. Initially I had this issue inside GridEngine but then tried it outside any queuing system and faced exactly the same behavior.
>>>>>
>>>>> @) Each machine has 4 CPUs, each having 16 integer cores, hence 64 integer cores per machine in total. The used Open MPI is 1.6.4.
>>>>>
>>>>> a) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>>>>>
>>>>> and a hostfile containing only the two lines listing the machines:
>>>>>
>>>>> node006
>>>>> node007
>>>>>
>>>>> This works as I would like it (see working.txt) when initiated on node006.
>>>>>
>>>>> b) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>>>>>
>>>>> But changing the hostfile so that it has a slot count which might mimic the behavior in case of a parsed machinefile out of any queuing system:
>>>>>
>>>>> node006 slots=64
>>>>> node007 slots=64
>>>>>
>>>>> This fails with:
>>>>>
>>>>> --
>>>>> An invalid physical processor ID was returned when attempting to bind
>>>>> an MPI process to a unique processor on node:
>>>>>
>>>>> Node: node006
>>>>>
>>>>> This usually means that you requested binding to more processors than
>>>>> exist (e.g., trying to bind N MPI processes to M processors, where N >
>>>>> M), or that the node has an unexpectedly different topology.
>>>>>
>>>>> Double check that you have enough unique processors for all the
>>>>> MPI processes that you are launching on this host, and that all nodes
>>>>> have identical topologies.
>>>>>
>>>>> You job will now abort.
>>>>> --
>>>>>
>>>>> (see failed.txt)
>>>>>
>>>>> b1) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 32 ./mpihello
>>>>>
>>>>> This works and the found universe is 128 as expected (see only32.txt).
>>>>>
>>>>> c) Maybe the used machinefile is not parsed in the correct way, so I checked:
Re: [OMPI users] Option -cpus-per-proc 2 not working with given machinefile?
On 28.02.2013 at 19:21, Ralph Castain wrote:

> On Feb 28, 2013, at 9:53 AM, Reuti wrote:
>
>> On 28.02.2013 at 17:54, Ralph Castain wrote:
>>
>>> Hmmm... the problem is that we are mapping procs using the provided slots instead of dividing the slots by cpus-per-proc. So we put too many on the first node, and the backend daemon aborts the job because it lacks sufficient processors for cpus-per-proc=2.
>>
>> Ok, this I would understand. But why is it then working if no maximum number of slots is given? Will it then just fill the node up to the found number of cores inside, subtract this correctly each time a new process is started, and jump to the next machine if necessary?
>
> Not exactly. If no max slots is given, then we assume a value of one. This effectively converts byslot mapping to bynode - i.e., we place one proc on a node, and that meets its #slots, so we place the next proc on the next node. So you wind up balancing across the two nodes.

Ok, now I understand the behavior - Thx.

> If you specify slots=64, then we'll try to place all 64 procs on the first node because we are using byslot mapping by default. You could make it work by just adding -bynode to your command line.
>
>> It is of course for now a feasible workaround to get the intended behavior by supplying just an additional hostfile.
>
> Or use bynode mapping

You mean a line like:

mpiexec -cpus-per-proc 2 -bynode -report-bindings ./mpihello

For me this results in the same error as without "-bynode" at all.

>> But regarding my recent eMail I also wonder about the difference between running on the command line and inside SGE. In the latter case the overall universe is correct.
>
> If you don't provide a slots value in the hostfile, we assume 1 - and so the universe size is 2, and you are heavily oversubscribed. Inside SGE, we see 128 slots assigned to you, and you are not oversubscribed.

Yes, but the "fake" hostfile I provide on the command line to `mpiexec` has only the plain names inside. Somehow this changes the way the processes are distributed to "-bynode", but not the overall slot count - interesting.

-- Reuti

> HTH
> Ralph
>
>> -- Reuti
>>
>>> Given that there are no current plans for a 1.6.5, this may not get fixed.
>>>
>>> On Feb 27, 2013, at 3:15 PM, Reuti wrote:
>>>
>>>> Hi,
>>>>
>>>> I have an issue using the option -cpus-per-proc 2. As I have Bulldozer machines and I want only one process per FP core, I thought using -cpus-per-proc 2 would be the way to go. Initially I had this issue inside GridEngine but then tried it outside any queuing system and faced exactly the same behavior.
>>>>
>>>> @) Each machine has 4 CPUs, each having 16 integer cores, hence 64 integer cores per machine in total. The used Open MPI is 1.6.4.
>>>>
>>>> a) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>>>>
>>>> and a hostfile containing only the two lines listing the machines:
>>>>
>>>> node006
>>>> node007
>>>>
>>>> This works as I would like it (see working.txt) when initiated on node006.
>>>>
>>>> b) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>>>>
>>>> But changing the hostfile so that it has a slot count which might mimic the behavior in case of a parsed machinefile out of any queuing system:
>>>>
>>>> node006 slots=64
>>>> node007 slots=64
>>>>
>>>> This fails with:
>>>>
>>>> --
>>>> An invalid physical processor ID was returned when attempting to bind
>>>> an MPI process to a unique processor on node:
>>>>
>>>> Node: node006
>>>>
>>>> This usually means that you requested binding to more processors than
>>>> exist (e.g., trying to bind N MPI processes to M processors, where N >
>>>> M), or that the node has an unexpectedly different topology.
>>>>
>>>> Double check that you have enough unique processors for all the
>>>> MPI processes that you are launching on this host, and that all nodes
>>>> have identical topologies.
>>>>
>>>> You job will now abort.
>>>> --
>>>>
>>>> (see failed.txt)
>>>>
>>>> b1) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 32 ./mpihello
>>>>
>>>> This works and the found universe is 128 as expected (see only32.txt).
>>>>
>>>> c) Maybe the used machinefile is not parsed in the correct way, so I checked:
>>>>
>>>> c1) mpiexec -hostfile machines -np 64 ./mpihello => works
>>>>
>>>> c2) mpiexec -hostfile machines -np 128 ./mpihello => works
>>>>
>>>> c3) mpiexec -hostfile machines -np 129 ./mpihello => fails as expected
>>>>
>>>> So, it got the slot counts in the correct way. What do I miss?
>>>>
>>>> -- Reuti
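The c1)-c3) checks match the hostfile semantics described in this thread: each host contributes its `slots` value (default 1) to the universe, and an `-np` beyond the total is rejected. A minimal sketch of that accounting (a hypothetical helper for illustration, not Open MPI's actual parser):

```python
def parse_hostfile(lines):
    """Parse 'node006' or 'node006 slots=64' lines; a missing slots value
    defaults to 1, as discussed in the thread."""
    hosts = {}
    for line in lines:
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue
        slots = 1
        for opt in parts[1:]:
            if opt.startswith("slots="):
                slots = int(opt.split("=", 1)[1])
        hosts[parts[0]] = slots
    return hosts

machines = ["node006 slots=64", "node007 slots=64"]
total_slots = sum(parse_hostfile(machines).values())

assert total_slots == 128      # the universe size reported in b1)
assert 64 <= total_slots       # c1) -np 64  => works
assert 128 <= total_slots      # c2) -np 128 => works
assert not 129 <= total_slots  # c3) -np 129 => fails as expected
```

This is exactly why c1)-c3) behave as they do: the slot counts are parsed correctly, so the bug is not in hostfile parsing but in the mapping step.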
Re: [OMPI users] Option -cpus-per-proc 2 not working with given machinefile?
On Feb 28, 2013, at 9:53 AM, Reuti wrote:

> On 28.02.2013 at 17:54, Ralph Castain wrote:
>
>> Hmmm... the problem is that we are mapping procs using the provided slots instead of dividing the slots by cpus-per-proc. So we put too many on the first node, and the backend daemon aborts the job because it lacks sufficient processors for cpus-per-proc=2.
>
> Ok, this I would understand. But why is it then working if no maximum number of slots is given? Will it then just fill the node up to the found number of cores inside, subtract this correctly each time a new process is started, and jump to the next machine if necessary?

Not exactly. If no max slots is given, then we assume a value of one. This effectively converts byslot mapping to bynode - i.e., we place one proc on a node, and that meets its #slots, so we place the next proc on the next node. So you wind up balancing across the two nodes.

If you specify slots=64, then we'll try to place all 64 procs on the first node because we are using byslot mapping by default. You could make it work by just adding -bynode to your command line.

> It is of course for now a feasible workaround to get the intended behavior by supplying just an additional hostfile.

Or use bynode mapping

> But regarding my recent eMail I also wonder about the difference between running on the command line and inside SGE. In the latter case the overall universe is correct.

If you don't provide a slots value in the hostfile, we assume 1 - and so the universe size is 2, and you are heavily oversubscribed. Inside SGE, we see 128 slots assigned to you, and you are not oversubscribed.

HTH
Ralph

> -- Reuti
>
>> Given that there are no current plans for a 1.6.5, this may not get fixed.
>>
>> On Feb 27, 2013, at 3:15 PM, Reuti wrote:
>>
>>> Hi,
>>>
>>> I have an issue using the option -cpus-per-proc 2. As I have Bulldozer machines and I want only one process per FP core, I thought using -cpus-per-proc 2 would be the way to go. Initially I had this issue inside GridEngine but then tried it outside any queuing system and faced exactly the same behavior.
>>>
>>> @) Each machine has 4 CPUs, each having 16 integer cores, hence 64 integer cores per machine in total. The used Open MPI is 1.6.4.
>>>
>>> a) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>>>
>>> and a hostfile containing only the two lines listing the machines:
>>>
>>> node006
>>> node007
>>>
>>> This works as I would like it (see working.txt) when initiated on node006.
>>>
>>> b) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>>>
>>> But changing the hostfile so that it has a slot count which might mimic the behavior in case of a parsed machinefile out of any queuing system:
>>>
>>> node006 slots=64
>>> node007 slots=64
>>>
>>> This fails with:
>>>
>>> --
>>> An invalid physical processor ID was returned when attempting to bind
>>> an MPI process to a unique processor on node:
>>>
>>> Node: node006
>>>
>>> This usually means that you requested binding to more processors than
>>> exist (e.g., trying to bind N MPI processes to M processors, where N >
>>> M), or that the node has an unexpectedly different topology.
>>>
>>> Double check that you have enough unique processors for all the
>>> MPI processes that you are launching on this host, and that all nodes
>>> have identical topologies.
>>>
>>> You job will now abort.
>>> --
>>>
>>> (see failed.txt)
>>>
>>> b1) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 32 ./mpihello
>>>
>>> This works and the found universe is 128 as expected (see only32.txt).
>>>
>>> c) Maybe the used machinefile is not parsed in the correct way, so I checked:
>>>
>>> c1) mpiexec -hostfile machines -np 64 ./mpihello => works
>>>
>>> c2) mpiexec -hostfile machines -np 128 ./mpihello => works
>>>
>>> c3) mpiexec -hostfile machines -np 129 ./mpihello => fails as expected
>>>
>>> So, it got the slot counts in the correct way.
>>>
>>> What do I miss?
>>>
>>> -- Reuti
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
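Ralph's description of byslot vs. bynode mapping with -cpus-per-proc can be sketched as a toy mapper (an illustrative model of the behavior described in this thread, not Open MPI's actual code; the function and variable names are invented):

```python
def map_procs(hosts, np, cpus_per_proc, cores_per_node, bynode=False):
    """Toy mapper. hosts = list of (name, slots). Returns {name: nprocs},
    or raises when a node would need more cores than it has - the
    "invalid physical processor ID" situation from the thread."""
    placement = {name: 0 for name, _ in hosts}
    assigned = 0
    if bynode:
        # bynode: round-robin, one proc per node per pass
        i = 0
        while assigned < np:
            placement[hosts[i % len(hosts)][0]] += 1
            assigned += 1
            i += 1
    else:
        # byslot (the default): fill a node up to its slot count first
        for name, slots in hosts:
            while assigned < np and placement[name] < slots:
                placement[name] += 1
                assigned += 1
        if assigned < np:
            raise RuntimeError("not enough slots")
    # binding step: each proc needs cpus_per_proc cores on its node
    for name, nprocs in placement.items():
        if nprocs * cpus_per_proc > cores_per_node:
            raise RuntimeError(f"{name}: needs {nprocs * cpus_per_proc} "
                               f"cores, has only {cores_per_node}")
    return placement

# Case a) no slots given -> each host counts as 1 slot, which effectively
# behaves bynode: 32 procs * 2 cores on each 64-core node fits.
assert map_procs([("node006", 1), ("node007", 1)], 64, 2, 64,
                 bynode=True) == {"node006": 32, "node007": 32}

# Case b) slots=64 -> byslot puts all 64 procs on node006;
# 64 * 2 = 128 cores needed but only 64 exist -> abort.
try:
    map_procs([("node006", 64), ("node007", 64)], 64, 2, 64)
    assert False, "expected the binding step to fail"
except RuntimeError:
    pass
```

Per Ralph's description, -bynode should sidestep case b) by balancing 32 procs per node; Reuti reports that in 1.6.4 it did not, which is the bug under discussion.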
Re: [OMPI users] Option -cpus-per-proc 2 not working with given machinefile?
On 28.02.2013 at 17:54, Ralph Castain wrote:

> Hmmm... the problem is that we are mapping procs using the provided slots instead of dividing the slots by cpus-per-proc. So we put too many on the first node, and the backend daemon aborts the job because it lacks sufficient processors for cpus-per-proc=2.

Ok, this I would understand. But why is it then working if no maximum number of slots is given? Will it then just fill the node up to the found number of cores inside, subtract this correctly each time a new process is started, and jump to the next machine if necessary?

It is of course for now a feasible workaround to get the intended behavior by supplying just an additional hostfile.

But regarding my recent eMail I also wonder about the difference between running on the command line and inside SGE. In the latter case the overall universe is correct.

-- Reuti

> Given that there are no current plans for a 1.6.5, this may not get fixed.
>
> On Feb 27, 2013, at 3:15 PM, Reuti wrote:
>
>> Hi,
>>
>> I have an issue using the option -cpus-per-proc 2. As I have Bulldozer machines and I want only one process per FP core, I thought using -cpus-per-proc 2 would be the way to go. Initially I had this issue inside GridEngine but then tried it outside any queuing system and faced exactly the same behavior.
>>
>> @) Each machine has 4 CPUs, each having 16 integer cores, hence 64 integer cores per machine in total. The used Open MPI is 1.6.4.
>>
>> a) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>>
>> and a hostfile containing only the two lines listing the machines:
>>
>> node006
>> node007
>>
>> This works as I would like it (see working.txt) when initiated on node006.
>>
>> b) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>>
>> But changing the hostfile so that it has a slot count which might mimic the behavior in case of a parsed machinefile out of any queuing system:
>>
>> node006 slots=64
>> node007 slots=64
>>
>> This fails with:
>>
>> --
>> An invalid physical processor ID was returned when attempting to bind
>> an MPI process to a unique processor on node:
>>
>> Node: node006
>>
>> This usually means that you requested binding to more processors than
>> exist (e.g., trying to bind N MPI processes to M processors, where N >
>> M), or that the node has an unexpectedly different topology.
>>
>> Double check that you have enough unique processors for all the
>> MPI processes that you are launching on this host, and that all nodes
>> have identical topologies.
>>
>> You job will now abort.
>> --
>>
>> (see failed.txt)
>>
>> b1) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 32 ./mpihello
>>
>> This works and the found universe is 128 as expected (see only32.txt).
>>
>> c) Maybe the used machinefile is not parsed in the correct way, so I checked:
>>
>> c1) mpiexec -hostfile machines -np 64 ./mpihello => works
>>
>> c2) mpiexec -hostfile machines -np 128 ./mpihello => works
>>
>> c3) mpiexec -hostfile machines -np 129 ./mpihello => fails as expected
>>
>> So, it got the slot counts in the correct way.
>>
>> What do I miss?
>>
>> -- Reuti
Re: [OMPI users] Calling MPI_send MPI_recv from a fortran subroutine
Oh! It works now. Thanks a lot, and sorry about my negligence.

2013/3/1 Ake Sandgren:

> On Fri, 2013-03-01 at 01:24 +0900, Pradeep Jha wrote:
>
>> Sorry for those mistakes. I addressed all three problems:
>> - I put "implicit none" at the top of the main program
>> - I initialized tag.
>> - changed MPI_INT to MPI_INTEGER
>> - "send_length" should be just "send", it was a typo.
>>
>> But the code is still hanging in sendrecv. The present form is below:
>
> "tag" isn't initialized to anything, so it may very well be totally different in all the processes.
> ALWAYS initialize variables before using them.
>
>> main.f
>>
>> program main
>>
>> implicit none
>>
>> include 'mpif.h'
>>
>> integer me, np, ierror
>>
>> call MPI_init( ierror )
>> call MPI_comm_rank( mpi_comm_world, me, ierror )
>> call MPI_comm_size( mpi_comm_world, np, ierror )
>>
>> call sendrecv(me, np)
>>
>> call mpi_finalize( ierror )
>>
>> stop
>> end
>>
>> sendrecv.f
>>
>> subroutine sendrecv(me, np)
>>
>> include 'mpif.h'
>>
>> integer np, me, sender, tag
>> integer, dimension(mpi_status_size) :: status
>>
>> integer, dimension(1) :: recv, send
>>
>> if (me.eq.0) then
>>
>> do sender = 1, np-1
>> call mpi_recv(recv, 1, mpi_integer, sender, tag,
>> & mpi_comm_world, status, ierror)
>>
>> end do
>> end if
>>
>> if ((me.ge.1).and.(me.lt.np)) then
>> send(1) = me*12
>>
>> call mpi_send(send, 1, mpi_integer, 0, tag,
>> &mpi_comm_world, ierror)
>> end if
>>
>> return
>> end
Re: [OMPI users] Option -cpus-per-proc 2 not working with given machinefile?
Hmmm... the problem is that we are mapping procs using the provided slots instead of dividing the slots by cpus-per-proc. So we put too many on the first node, and the backend daemon aborts the job because it lacks sufficient processors for cpus-per-proc=2.

Given that there are no current plans for a 1.6.5, this may not get fixed.

On Feb 27, 2013, at 3:15 PM, Reuti wrote:

> Hi,
>
> I have an issue using the option -cpus-per-proc 2. As I have Bulldozer machines and I want only one process per FP core, I thought using -cpus-per-proc 2 would be the way to go. Initially I had this issue inside GridEngine but then tried it outside any queuing system and faced exactly the same behavior.
>
> @) Each machine has 4 CPUs, each having 16 integer cores, hence 64 integer cores per machine in total. The used Open MPI is 1.6.4.
>
> a) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>
> and a hostfile containing only the two lines listing the machines:
>
> node006
> node007
>
> This works as I would like it (see working.txt) when initiated on node006.
>
> b) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>
> But changing the hostfile so that it has a slot count which might mimic the behavior in case of a parsed machinefile out of any queuing system:
>
> node006 slots=64
> node007 slots=64
>
> This fails with:
>
> --
> An invalid physical processor ID was returned when attempting to bind
> an MPI process to a unique processor on node:
>
> Node: node006
>
> This usually means that you requested binding to more processors than
> exist (e.g., trying to bind N MPI processes to M processors, where N >
> M), or that the node has an unexpectedly different topology.
>
> Double check that you have enough unique processors for all the
> MPI processes that you are launching on this host, and that all nodes
> have identical topologies.
>
> You job will now abort.
> --
>
> (see failed.txt)
>
> b1) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 32 ./mpihello
>
> This works and the found universe is 128 as expected (see only32.txt).
>
> c) Maybe the used machinefile is not parsed in the correct way, so I checked:
>
> c1) mpiexec -hostfile machines -np 64 ./mpihello => works
>
> c2) mpiexec -hostfile machines -np 128 ./mpihello => works
>
> c3) mpiexec -hostfile machines -np 129 ./mpihello => fails as expected
>
> So, it got the slot counts in the correct way.
>
> What do I miss?
>
> -- Reuti
Re: [OMPI users] High cpu usage
Hi,

First, I don't see any CPU utilization here, only %time (of each function relative to the others in a process/application). Generally, high CPU utilization can have many reasons. Two that come to my mind:

1. It depends on the network stack; e.g. the "tcp" way will use more CPU than the "openib" way.
2. Polling is generally good for performance, but comes with the penalty of high CPU utilization.

Also, I am not sure whether a context switch counts as CPU utilization; if so, gettimeofday could be a reason, as each call involves a user-to-kernel switch and back to user.

-- Joba

Sent from my iPhone

On Feb 28, 2013, at 7:34 AM, Bokassa wrote:

> Hi,
> I notice that a simple MPI program in which rank 0 sends 4 bytes to each rank and receives a reply uses a considerable amount of CPU in system calls:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  61.10    0.016719           3      5194           gettimeofday
>  20.77    0.005683           2      2596           epoll_wait
>  18.13    0.004961           2      2595           sched_yield
>   0.00    0.000000           0         4           write
>   0.00    0.000000           0         4           stat
>   0.00    0.000000           0         2           readv
>   0.00    0.000000           0         2           writev
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.027363                 10397           total
>
> and
>
> Process 2512 attached - interrupt to quit
> 16:32:17.793039 sched_yield() = 0 <0.78>
> 16:32:17.793276 gettimeofday({1362065537, 793330}, NULL) = 0 <0.70>
> 16:32:17.793460 epoll_wait(4, {}, 32, 0) = 0 <0.000114>
> 16:32:17.793712 gettimeofday({1362065537, 793773}, NULL) = 0 <0.97>
> 16:32:17.793914 sched_yield() = 0 <0.89>
> 16:32:17.794107 gettimeofday({1362065537, 794157}, NULL) = 0 <0.83>
> 16:32:17.794292 epoll_wait(4, {}, 32, 0) = 0 <0.72>
> 16:32:17.794457 gettimeofday({1362065537, 794541}, NULL) = 0 <0.000115>
> 16:32:17.794695 sched_yield() = 0 <0.79>
> 16:32:17.794877 gettimeofday({1362065537, 794927}, NULL) = 0 <0.81>
> 16:32:17.795062 epoll_wait(4, {}, 32, 0) = 0 <0.79>
> 16:32:17.795244 gettimeofday({1362065537, 795294}, NULL) = 0 <0.82>
> 16:32:17.795432 sched_yield() = 0 <0.96>
> 16:32:17.795761 gettimeofday({1362065537, 795814}, NULL) = 0 <0.79>
> 16:32:17.795940 epoll_wait(4, {}, 32, 0) = 0 <0.80>
> 16:32:17.796123 gettimeofday({1362065537, 796191}, NULL) = 0 <0.000121>
> 16:32:17.796388 sched_yield() = 0 <0.000127>
> 16:32:17.796635 gettimeofday({1362065537, 796722}, NULL) = 0 <0.000121>
> 16:32:17.796951 epoll_wait(4, {}, 32, 0) = 0 <0.89>
>
> What is the purpose of this behavior?
>
> Thanks,
> David
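The epoll_wait/gettimeofday/sched_yield cycle in the strace is the signature of a polling progress loop: the library polls with a zero timeout so it never blocks, which burns CPU even while idle. A toy sketch of such a loop (illustrative only, not Open MPI's actual progress engine):

```python
import os
import select
import time

def progress_loop(pollset, is_complete, yield_when_idle=True):
    """Toy polling progress loop: poll with zero timeout (never block),
    check the clock, optionally yield - mimicking the epoll_wait/
    gettimeofday/sched_yield pattern visible in the strace above."""
    spins = 0
    while not is_complete():
        pollset.poll(0)              # like epoll_wait(..., timeout=0)
        time.time()                  # like the gettimeofday() calls
        if yield_when_idle and hasattr(os, "sched_yield"):
            os.sched_yield()         # let other runnable tasks in
        spins += 1
    return spins

# Demo: spin on an idle pipe until an artificial completion flag trips.
r, w = os.pipe()
ps = select.poll()
ps.register(r, select.POLLIN)
state = {"calls": 0}

def is_complete():
    state["calls"] += 1
    return state["calls"] > 1000

spins = progress_loop(ps, is_complete)
os.close(r)
os.close(w)
assert spins == 1000
```

Because the loop never blocks, the process shows near-100% CPU even when no messages are in flight; that is the deliberate trade-off for low message latency. The sched_yield calls in the trace suggest the yield-on-idle behavior (controlled in Open MPI by the `mpi_yield_when_idle` MCA parameter, and typically switched on automatically when a node is oversubscribed) is already active here.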
Re: [OMPI users] Option -cpus-per-proc 2 not working with given machinefile?
On 28.02.2013 at 17:29, Ralph Castain wrote:

> On Feb 28, 2013, at 6:17 AM, Reuti wrote:
>
>> On 28.02.2013 at 08:58, Reuti wrote:
>>
>>> On 28.02.2013 at 06:55, Ralph Castain wrote:
>>>
>>>> I don't off-hand see a problem, though I do note that your "working" version incorrectly reports the universe size as 2!
>>>
>>> Yes, it was 2 in the case when it was working by giving only two hostnames without any dedicated slot count. What should it be in this case - "unknown", "infinity"?
>>
>> As an add-on:
>>
>> a) I tried it again on the command line and still get:
>>
>> Total: 64
>> Universe: 2
>>
>> with a hostfile
>>
>> node006
>> node007
>
> My bad - since no slots were given, we default to a value of 1 for each node, so this is correct.
>
>> b) In a job script under SGE, with Open MPI compiled --with-sge, I get after mangling the hostfile:
>>
>> #!/bin/sh
>> #$ -pe openmpi* 128
>> #$ -l exclusive
>> cut -f 1 -d" " $PE_HOSTFILE > $TMPDIR/machines
>> mpiexec -cpus-per-proc 2 -report-bindings -hostfile $TMPDIR/machines -np 64 ./mpihello
>>
>> Here:
>>
>> Total: 64
>> Universe: 128
>
> This would be correct, as SGE is allocating a total of 128 slots (or pe's).

Yep, this is the case. But the hostfile I give in addition contains only the two hostnames (no slot count). And if I don't supply this mangled file in addition, it won't start up but gives the error:

--
An invalid physical processor ID was returned when attempting to bind
an MPI process to a unique processor.

This usually means that you requested binding to more processors than
exist (e.g., trying to bind N MPI processes to M processors, where N > M).

Double check that you have enough unique processors for all the
MPI processes that you are launching on this host.

You job will now abort.
--

What I just note: in this error there is no hostname given when running inside SGE. But there is one given if started from the command line, like:

--
An invalid physical processor ID was returned when attempting to bind
an MPI process to a unique processor on node:

Node: node006

This usually means that you requested binding to more processors than
exist (e.g., trying to bind N MPI processes to M processors, where N >
M), or that the node has an unexpectedly different topology.

Double check that you have enough unique processors for all the
MPI processes that you are launching on this host, and that all nodes
have identical topologies.

You job will now abort.
--

-- Reuti

>> Maybe the found allocation by SGE and the one from the command line argument are getting mixed here.
>>
>> -- Reuti
>>
>>> -- Reuti
>>>
>>>> I'll have to take a look at this and get back to you on it.
>>>>
>>>> On Feb 27, 2013, at 3:15 PM, Reuti wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have an issue using the option -cpus-per-proc 2. As I have Bulldozer machines and I want only one process per FP core, I thought using -cpus-per-proc 2 would be the way to go. Initially I had this issue inside GridEngine but then tried it outside any queuing system and faced exactly the same behavior.
>>>>>
>>>>> @) Each machine has 4 CPUs, each having 16 integer cores, hence 64 integer cores per machine in total. The used Open MPI is 1.6.4.
>>>>>
>>>>> a) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>>>>>
>>>>> and a hostfile containing only the two lines listing the machines:
>>>>>
>>>>> node006
>>>>> node007
>>>>>
>>>>> This works as I would like it (see working.txt) when initiated on node006.
>>>>>
>>>>> b) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>>>>>
>>>>> But changing the hostfile so that it has a slot count which might mimic the behavior in case of a parsed machinefile out of any queuing system:
>>>>>
>>>>> node006 slots=64
>>>>> node007 slots=64
>>>>>
>>>>> This fails with:
>>>>>
>>>>> --
>>>>> An invalid physical processor ID was returned when attempting to bind
>>>>> an MPI process to a unique processor on node:
>>>>>
>>>>> Node: node006
>>>>>
>>>>> This usually means that you requested binding to more processors than
>>>>> exist (e.g., trying to bind N MPI processes to M processors, where N >
>>>>> M), or that the node has an unexpectedly different topology.
>>>>>
>>>>> Double check that you have enough unique processors for all the
>>>>> MPI processes that you are launching on this host, and that all nodes
>>>>> have identical topologies.
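The `cut -f 1 -d" " $PE_HOSTFILE` step in the job script keeps only the hostname column of SGE's pe-hostfile, producing the plain machine list passed to mpiexec. The same transformation, sketched in Python (the sample $PE_HOSTFILE content below is assumed for illustration, following SGE's "host nslots queue processor-range" line format):

```python
def pe_hostfile_to_machines(text):
    """Keep only the first (hostname) column of each $PE_HOSTFILE line,
    mirroring:  cut -f 1 -d" " $PE_HOSTFILE > $TMPDIR/machines"""
    return [line.split()[0] for line in text.splitlines() if line.strip()]

# Hypothetical $PE_HOSTFILE content; node names follow the thread,
# the remaining columns are assumed for illustration.
sample = ("node006 64 all.q@node006 UNDEFINED\n"
          "node007 64 all.q@node007 UNDEFINED\n")
assert pe_hostfile_to_machines(sample) == ["node006", "node007"]
```

The resulting file carries no `slots=` entries, which is why supplying it flips the mapper into the one-slot-per-host behavior discussed earlier in the thread.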
Re: [OMPI users] Calling MPI_send MPI_recv from a fortran subroutine
On Fri, 2013-03-01 at 01:24 +0900, Pradeep Jha wrote:

> Sorry for those mistakes. I addressed all three problems:
> - I put "implicit none" at the top of the main program
> - I initialized tag.
> - changed MPI_INT to MPI_INTEGER
> - "send_length" should be just "send", it was a typo.
>
> But the code is still hanging in sendrecv. The present form is below:

"tag" isn't initialized to anything, so it may very well be totally different in all the processes.
ALWAYS initialize variables before using them.

> main.f
>
> program main
>
> implicit none
>
> include 'mpif.h'
>
> integer me, np, ierror
>
> call MPI_init( ierror )
> call MPI_comm_rank( mpi_comm_world, me, ierror )
> call MPI_comm_size( mpi_comm_world, np, ierror )
>
> call sendrecv(me, np)
>
> call mpi_finalize( ierror )
>
> stop
> end
>
> sendrecv.f
>
> subroutine sendrecv(me, np)
>
> include 'mpif.h'
>
> integer np, me, sender, tag
> integer, dimension(mpi_status_size) :: status
>
> integer, dimension(1) :: recv, send
>
> if (me.eq.0) then
>
> do sender = 1, np-1
> call mpi_recv(recv, 1, mpi_integer, sender, tag,
> & mpi_comm_world, status, ierror)
>
> end do
> end if
>
> if ((me.ge.1).and.(me.lt.np)) then
> send(1) = me*12
>
> call mpi_send(send, 1, mpi_integer, 0, tag,
> &mpi_comm_world, ierror)
> end if
>
> return
> end
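Why an uninitialized tag makes the program hang: an MPI receive only completes when an incoming message's (source, tag) pair matches what the receiver posted, so if each process reads a different garbage value for `tag`, rank 0's receives never match. A toy model of that matching rule (a conceptual sketch, not MPI's implementation; `ANY_TAG` stands in for MPI_ANY_TAG):

```python
ANY_TAG = -1  # stand-in for MPI_ANY_TAG

def try_match(posted_recv, incoming):
    """Return True if a posted receive (source, tag) matches an
    incoming message (source, tag), per MPI's matching rule."""
    src_ok = posted_recv[0] == incoming[0]
    tag_ok = posted_recv[1] == ANY_TAG or posted_recv[1] == incoming[1]
    return src_ok and tag_ok

# Both sides use the same initialized tag: the receive completes.
assert try_match((1, 0), (1, 0))

# Uninitialized tag: sender happens to use 0, receiver reads garbage 7.
# The receive never matches -> the program hangs in mpi_recv.
assert not try_match((1, 7), (1, 0))

# MPI_ANY_TAG on the receiver side matches any tag.
assert try_match((1, ANY_TAG), (1, 7))
```

Initializing `tag` to the same value on every rank (e.g. `tag = 0` before the send/recv calls) is the one-line fix that makes the Fortran example above complete.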
Re: [OMPI users] Calling MPI_send MPI_recv from a fortran subroutine
I don't see tag being set to any value.

On Feb 28, 2013, at 8:24 AM, Pradeep Jha wrote:

> Sorry for those mistakes. I addressed all three problems:
> - I put "implicit none" at the top of the main program
> - I initialized tag.
> - changed MPI_INT to MPI_INTEGER
> - "send_length" should be just "send", it was a typo.
>
> But the code is still hanging in sendrecv. The present form is below:
>
> main.f
>
> program main
>
> implicit none
>
> include 'mpif.h'
>
> integer me, np, ierror
>
> call MPI_init( ierror )
> call MPI_comm_rank( mpi_comm_world, me, ierror )
> call MPI_comm_size( mpi_comm_world, np, ierror )
>
> call sendrecv(me, np)
>
> call mpi_finalize( ierror )
>
> stop
> end
>
> sendrecv.f
>
> subroutine sendrecv(me, np)
>
> include 'mpif.h'
>
> integer np, me, sender, tag
> integer, dimension(mpi_status_size) :: status
>
> integer, dimension(1) :: recv, send
>
> if (me.eq.0) then
>
> do sender = 1, np-1
> call mpi_recv(recv, 1, mpi_integer, sender, tag,
> & mpi_comm_world, status, ierror)
>
> end do
> end if
>
> if ((me.ge.1).and.(me.lt.np)) then
> send(1) = me*12
>
> call mpi_send(send, 1, mpi_integer, 0, tag,
> &mpi_comm_world, ierror)
> end if
>
> return
> end
>
> 2013/3/1 Jeff Squyres (jsquyres):
>
>> On Feb 28, 2013, at 9:59 AM, Pradeep Jha wrote:
>>
>>> Is it possible to call the MPI_send and MPI_recv commands inside a subroutine and not the main program?
>>
>> Yes.
>>
>>> I have written a minimal program for what I am trying to do. It is compiling fine but it is not working. The program just hangs in the "sendrecv" subroutine. Any ideas how can I do it?
>>
>> You seem to have several errors in the sendrecv subroutine. I would strongly encourage you to use "implicit none" to avoid many of these errors. Here's a few errors I see offhand:
>>
>> - tag is not initialized
>> - what's send_length(1)?
>> - use MPI_INTEGER, not MPI_INT (MPI_INT = C int, MPI_INTEGER = Fortran INTEGER)
>>
>>> main.f
>>>
>>> program main
>>>
>>> include 'mpif.h'
>>>
>>> integer me, np, ierror
>>>
>>> call MPI_init( ierror )
>>> call MPI_comm_rank( mpi_comm_world, me, ierror )
>>> call MPI_comm_size( mpi_comm_world, np, ierror )
>>>
>>> call sendrecv(me, np)
>>>
>>> call mpi_finalize( ierror )
>>>
>>> stop
>>> end
>>>
>>> sendrecv.f
>>>
>>> subroutine sendrecv(me, np)
>>>
>>> include 'mpif.h'
>>>
>>> integer np, me, sender
>>> integer, dimension(mpi_status_size) :: status
>>>
>>> integer, dimension(1) :: recv, send
>>>
>>> if (me.eq.0) then
>>>
>>> do sender = 1, np-1
>>> call mpi_recv(recv, 1, mpi_int, sender, tag,
>>> & mpi_comm_world, status, ierror)
>>>
>>> end do
>>> end if
>>>
>>> if ((me.ge.1).and.(me.lt.np)) then
>>> send_length(1) = me*12
>>>
>>> call mpi_send(send, 1, mpi_int, 0, tag,
>>> &mpi_comm_world, ierror)
>>> end if
>>>
>>> return
>>> end
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Option -cpus-per-proc 2 not working with given machinefile?
On Feb 28, 2013, at 6:17 AM, Reuti wrote:

> Am 28.02.2013 um 08:58 schrieb Reuti:
>
>> Am 28.02.2013 um 06:55 schrieb Ralph Castain:
>>
>>> I don't off-hand see a problem, though I do note that your "working"
>>> version incorrectly reports the universe size as 2!
>>
>> Yes, it was 2 in the case when it was working by giving only two hostnames
>> without any dedicated slot count. What should it be in this case -
>> "unknown", "infinity"?
>
> As an add-on:
>
> a) I tried it again on the command line and still get:
>
> Total: 64
> Universe: 2
>
> with a hostfile
>
> node006
> node007

My bad - since no slots were given, we default to a value of 1 for each node, so this is correct.

> b) In a job script under SGE and Open MPI compiled --with-sge I get, after
> mangling the hostfile:
>
> #!/bin/sh
> #$ -pe openmpi* 128
> #$ -l exclusive
> cut -f 1 -d" " $PE_HOSTFILE > $TMPDIR/machines
> mpiexec -cpus-per-proc 2 -report-bindings -hostfile $TMPDIR/machines -np 64 ./mpihello
>
> Here:
>
> Total: 64
> Universe: 128

This would be correct, as SGE is allocating a total of 128 slots (or PEs).

> Maybe the allocation found by SGE and the one from the command line argument
> are getting mixed here.
>
> -- Reuti
>
> [...]
Re: [OMPI users] Calling MPI_send MPI_recv from a fortran subroutine
Sorry for those mistakes. I addressed all three problems:
- I put "implicit none" at the top of the main program
- I initialized tag.
- changed MPI_INT to MPI_INTEGER
- "send_length" should be just "send", it was a typo.

But the code is still hanging in sendrecv. The present form is below:

main.f

      program main
      implicit none
      include 'mpif.h'

      integer me, np, ierror

      call MPI_init( ierror )
      call MPI_comm_rank( mpi_comm_world, me, ierror )
      call MPI_comm_size( mpi_comm_world, np, ierror )

      call sendrecv(me, np)

      call mpi_finalize( ierror )

      stop
      end

sendrecv.f

      subroutine sendrecv(me, np)
      include 'mpif.h'

      integer np, me, sender, tag
      integer, dimension(mpi_status_size) :: status
      integer, dimension(1) :: recv, send

      if (me.eq.0) then
         do sender = 1, np-1
            call mpi_recv(recv, 1, mpi_integer, sender, tag,
     &                    mpi_comm_world, status, ierror)
         end do
      end if

      if ((me.ge.1).and.(me.lt.np)) then
         send(1) = me*12
         call mpi_send(send, 1, mpi_integer, 0, tag,
     &                 mpi_comm_world, ierror)
      end if

      return
      end

2013/3/1 Jeff Squyres (jsquyres):
> On Feb 28, 2013, at 9:59 AM, Pradeep Jha wrote:
>
>> Is it possible to call the MPI_send and MPI_recv commands inside a
>> subroutine and not the main program?
>
> Yes.
>
>> I have written a minimal program for what I am trying to do. It is
>> compiling fine but it is not working. The program just hangs in the
>> "sendrecv" subroutine. Any ideas how can I do it?
>
> You seem to have several errors in the sendrecv subroutine. I would strongly
> encourage you to use "implicit none" to avoid many of these errors. Here are
> a few errors I see offhand:
>
> - tag is not initialized
> - what's send_length(1)?
> - use MPI_INTEGER, not MPI_INT (MPI_INT = C int, MPI_INTEGER = Fortran INTEGER)
>
> [...]
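For reference, a minimal corrected sketch of the subroutine: in the "present form" above, tag is declared but never assigned inside sendrecv.f, so rank 0 and the senders may use different garbage values and the receive never matches - which would explain the hang. Initializing tag on all ranks (or receiving with MPI_ANY_TAG) resolves it. The explicit tag value of 0 and the added "implicit none"/ierror declaration are editorial assumptions; the MPI calls follow the original fixed-form source:

```fortran
      subroutine sendrecv(me, np)
      implicit none
      include 'mpif.h'

      integer np, me, sender, tag, ierror
      integer status(MPI_STATUS_SIZE)
      integer recv(1), send(1)

c     tag must have the same value on sender and receiver,
c     so initialize it explicitly (0 is an arbitrary choice)
      tag = 0

      if (me.eq.0) then
         do sender = 1, np-1
            call MPI_RECV(recv, 1, MPI_INTEGER, sender, tag,
     &                    MPI_COMM_WORLD, status, ierror)
         end do
      end if

      if (me.ge.1) then
         send(1) = me*12
         call MPI_SEND(send, 1, MPI_INTEGER, 0, tag,
     &                 MPI_COMM_WORLD, ierror)
      end if

      return
      end
```

Alternatively, rank 0 could receive with MPI_ANY_TAG and read the actual tag from status(MPI_TAG), which sidesteps the matching problem entirely.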
Re: [OMPI users] Calling MPI_send MPI_recv from a fortran subroutine
On Feb 28, 2013, at 9:59 AM, Pradeep Jha wrote:

> Is it possible to call the MPI_send and MPI_recv commands inside a subroutine
> and not the main program?

Yes.

> I have written a minimal program for what I am trying to do. It is compiling
> fine but it is not working. The program just hangs in the "sendrecv"
> subroutine. Any ideas how can I do it?

You seem to have several errors in the sendrecv subroutine. I would strongly encourage you to use "implicit none" to avoid many of these errors. Here are a few errors I see offhand:

- tag is not initialized
- what's send_length(1)?
- use MPI_INTEGER, not MPI_INT (MPI_INT = C int, MPI_INTEGER = Fortran INTEGER)

> main.f
>
>       program main
>
>       include 'mpif.h'
>
>       integer me, np, ierror
>
>       call MPI_init( ierror )
>       call MPI_comm_rank( mpi_comm_world, me, ierror )
>       call MPI_comm_size( mpi_comm_world, np, ierror )
>
>       call sendrecv(me, np)
>
>       call mpi_finalize( ierror )
>
>       stop
>       end
>
> sendrecv.f
>
>       subroutine sendrecv(me, np)
>
>       include 'mpif.h'
>
>       integer np, me, sender
>       integer, dimension(mpi_status_size) :: status
>       integer, dimension(1) :: recv, send
>
>       if (me.eq.0) then
>          do sender = 1, np-1
>             call mpi_recv(recv, 1, mpi_int, sender, tag,
>      &                    mpi_comm_world, status, ierror)
>          end do
>       end if
>
>       if ((me.ge.1).and.(me.lt.np)) then
>          send_length(1) = me*12
>          call mpi_send(send, 1, mpi_int, 0, tag,
>      &                 mpi_comm_world, ierror)
>       end if
>
>       return
>       end

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI users] High cpu usage
Hi,

I notice that a simple MPI program, in which rank 0 sends 4 bytes to each rank and receives a reply, spends a considerable amount of CPU time in system calls:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 61.10    0.016719           3      5194           gettimeofday
 20.77    0.005683           2      2596           epoll_wait
 18.13    0.004961           2      2595           sched_yield
  0.00    0.000000           0         4           write
  0.00    0.000000           0         4           stat
  0.00    0.000000           0         2           readv
  0.00    0.000000           0         2           writev
------ ----------- ----------- --------- --------- ----------------
100.00    0.027363                 10397           total

and

Process 2512 attached - interrupt to quit
16:32:17.793039 sched_yield()           = 0 <0.000078>
16:32:17.793276 gettimeofday({1362065537, 793330}, NULL) = 0 <0.000070>
16:32:17.793460 epoll_wait(4, {}, 32, 0) = 0 <0.000114>
16:32:17.793712 gettimeofday({1362065537, 793773}, NULL) = 0 <0.000097>
16:32:17.793914 sched_yield()           = 0 <0.000089>
16:32:17.794107 gettimeofday({1362065537, 794157}, NULL) = 0 <0.000083>
16:32:17.794292 epoll_wait(4, {}, 32, 0) = 0 <0.000072>
16:32:17.794457 gettimeofday({1362065537, 794541}, NULL) = 0 <0.000115>
16:32:17.794695 sched_yield()           = 0 <0.000079>
16:32:17.794877 gettimeofday({1362065537, 794927}, NULL) = 0 <0.000081>
16:32:17.795062 epoll_wait(4, {}, 32, 0) = 0 <0.000079>
16:32:17.795244 gettimeofday({1362065537, 795294}, NULL) = 0 <0.000082>
16:32:17.795432 sched_yield()           = 0 <0.000096>
16:32:17.795761 gettimeofday({1362065537, 795814}, NULL) = 0 <0.000079>
16:32:17.795940 epoll_wait(4, {}, 32, 0) = 0 <0.000080>
16:32:17.796123 gettimeofday({1362065537, 796191}, NULL) = 0 <0.000121>
16:32:17.796388 sched_yield()           = 0 <0.000127>
16:32:17.796635 gettimeofday({1362065537, 796722}, NULL) = 0 <0.000121>
16:32:17.796951 epoll_wait(4, {}, 32, 0) = 0 <0.000089>

What is the purpose of this behavior?

Thanks,
David
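The sched_yield/epoll_wait/gettimeofday loop above is Open MPI's progress engine polling for message arrival; the sched_yield calls indicate it is running in "degraded" (yielding) mode rather than spinning flat out. A hedged sketch of how this can be inspected and controlled in Open MPI 1.x via the mpi_yield_when_idle MCA parameter (parameter name assumed for this release series; the trade-off is CPU usage versus message latency):

```
# Inspect the current progress-engine setting:
ompi_info --param mpi all | grep -i yield

# Force processes to yield the CPU while waiting (lower CPU usage, higher latency):
mpiexec --mca mpi_yield_when_idle 1 -np 4 ./a.out

# Aggressive polling (the default when nodes are not oversubscribed):
mpiexec --mca mpi_yield_when_idle 0 -np 4 ./a.out
```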
Re: [OMPI users] MPI_Abort under slurm
Thanks Ralph, you were right - I was not aware of --kill-on-bad-exit and KillOnBadExit. Setting it to 1 shuts down the entire MPI job when MPI_Abort() is called. I was thinking this MPI protocol message was just transported by Slurm and then each task would exit. Oh well, I should not guess the implementation. :-)

Thanks again,
David
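For reference, the two Slurm knobs mentioned above can be set per run or cluster-wide; a short sketch (option and parameter names as discussed in the thread, exact invocation assumed):

```
# Per run: terminate the whole job step as soon as any task exits
# with a non-zero exit code (e.g. after MPI_Abort):
srun --kill-on-bad-exit=1 -n 64 ./mpi_app

# Cluster-wide default, set in slurm.conf:
#   KillOnBadExit=1
```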
[OMPI users] Calling MPI_send MPI_recv from a fortran subroutine
Is it possible to call the MPI_send and MPI_recv commands inside a subroutine and not the main program? I have written a minimal program for what I am trying to do. It is compiling fine but it is not working. The program just hangs in the "sendrecv" subroutine. Any ideas how can I do it?

main.f

      program main

      include 'mpif.h'

      integer me, np, ierror

      call MPI_init( ierror )
      call MPI_comm_rank( mpi_comm_world, me, ierror )
      call MPI_comm_size( mpi_comm_world, np, ierror )

      call sendrecv(me, np)

      call mpi_finalize( ierror )

      stop
      end

sendrecv.f

      subroutine sendrecv(me, np)

      include 'mpif.h'

      integer np, me, sender
      integer, dimension(mpi_status_size) :: status
      integer, dimension(1) :: recv, send

      if (me.eq.0) then
         do sender = 1, np-1
            call mpi_recv(recv, 1, mpi_int, sender, tag,
     &                    mpi_comm_world, status, ierror)
         end do
      end if

      if ((me.ge.1).and.(me.lt.np)) then
         send_length(1) = me*12
         call mpi_send(send, 1, mpi_int, 0, tag,
     &                 mpi_comm_world, ierror)
      end if

      return
      end
Re: [OMPI users] Option -cpus-per-proc 2 not working with given machinefile?
Am 28.02.2013 um 08:58 schrieb Reuti:

> Am 28.02.2013 um 06:55 schrieb Ralph Castain:
>
>> I don't off-hand see a problem, though I do note that your "working" version
>> incorrectly reports the universe size as 2!
>
> Yes, it was 2 in the case when it was working by giving only two hostnames
> without any dedicated slot count. What should it be in this case - "unknown",
> "infinity"?

As an add-on:

a) I tried it again on the command line and still get:

Total: 64
Universe: 2

with a hostfile

node006
node007

b) In a job script under SGE and Open MPI compiled --with-sge I get, after mangling the hostfile:

#!/bin/sh
#$ -pe openmpi* 128
#$ -l exclusive
cut -f 1 -d" " $PE_HOSTFILE > $TMPDIR/machines
mpiexec -cpus-per-proc 2 -report-bindings -hostfile $TMPDIR/machines -np 64 ./mpihello

Here:

Total: 64
Universe: 128

Maybe the allocation found by SGE and the one from the command line argument are getting mixed here.

-- Reuti

>> I'll have to take a look at this and get back to you on it.
>>
>> On Feb 27, 2013, at 3:15 PM, Reuti wrote:
>>
>> [...]
Re: [OMPI users] Option -cpus-per-proc 2 not working with given machinefile?
Am 28.02.2013 um 06:55 schrieb Ralph Castain:

> I don't off-hand see a problem, though I do note that your "working" version
> incorrectly reports the universe size as 2!

Yes, it was 2 in the case when it was working by giving only two hostnames without any dedicated slot count. What should it be in this case - "unknown", "infinity"?

-- Reuti

> I'll have to take a look at this and get back to you on it.
>
> On Feb 27, 2013, at 3:15 PM, Reuti wrote:
>
> [...]
Re: [OMPI users] Option -cpus-per-proc 2 not working with given machinefile?
I don't off-hand see a problem, though I do note that your "working" version incorrectly reports the universe size as 2!

I'll have to take a look at this and get back to you on it.

On Feb 27, 2013, at 3:15 PM, Reuti wrote:

> Hi,
>
> I have an issue using the option -cpus-per-proc 2. As I have Bulldozer
> machines and I want only one process per FP core, I thought using
> -cpus-per-proc 2 would be the way to go. Initially I had this issue inside
> GridEngine but then tried it outside any queuing system and face exactly the
> same behavior.
>
> @) Each machine has 4 CPUs with each having 16 integer cores, hence 64
> integer cores per machine in total. The Open MPI used is 1.6.4.
>
> a) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>
> and a hostfile containing only the two lines listing the machines:
>
> node006
> node007
>
> This works as I would like it (see working.txt) when initiated on node006.
>
> b) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>
> But changing the hostfile so that it has a slot count which might mimic the
> behavior in case of a parsed machinefile out of any queuing system:
>
> node006 slots=64
> node007 slots=64
>
> This fails with:
>
> --
> An invalid physical processor ID was returned when attempting to bind
> an MPI process to a unique processor on node:
>
> Node: node006
>
> This usually means that you requested binding to more processors than
> exist (e.g., trying to bind N MPI processes to M processors, where N >
> M), or that the node has an unexpectedly different topology.
>
> Double check that you have enough unique processors for all the
> MPI processes that you are launching on this host, and that all nodes
> have identical topologies.
>
> Your job will now abort.
> --
>
> (see failed.txt)
>
> b1) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 32 ./mpihello
>
> This works and the found universe is 128 as expected (see only32.txt).
>
> c) Maybe the used machinefile is not parsed in the correct way, so I checked:
>
> c1) mpiexec -hostfile machines -np 64 ./mpihello => works
> c2) mpiexec -hostfile machines -np 128 ./mpihello => works
> c3) mpiexec -hostfile machines -np 129 ./mpihello => fails as expected
>
> So, it got the slot counts in the correct way.
>
> What do I miss?
>
> -- Reuti
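To recap the thread, a hedged summary of which invocations behaved as intended on the two 64-core nodes (node006/node007) per the observations above - the mapper places procs by slots before dividing by cpus-per-proc, so a hostfile with large slot counts overfills the first node:

```
# hostfile "machines" WITHOUT slot counts - worked as intended
# (32 procs per node, 2 cores each):
#   node006
#   node007
mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello

# hostfile with "slots=64" per node - fails with -np 64 (all 64 procs
# are mapped byslot onto node006, which lacks 128 cores);
# halving -np, as in b1 above, works:
#   node006 slots=64
#   node007 slots=64
mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 32 ./mpihello
```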