Re: [OMPI users] Receiving MPI messages of unknown size
Hi Lars,

I wonder if you could always use blocking message passing for the preliminary send/receive pair that transmits the message size/header, then use non-blocking mode for the actual message. If the "message size/header" part transmits a small buffer, the preliminary send/recv pair will use the "eager" communication mode and return quickly, so I would guess it may not reduce performance. For a group of several messages, the preliminary send/recv pair could transmit a small (to ensure "eager mode") array of message sizes, maybe along with the message tags and sender ranks, instead of only one size.

Just a thought.

Gus Correa
Gustavo Correa, Lamont-Doherty Earth Observatory - Columbia University, Palisades, NY, 10964-8000 - USA

Lars Andersson wrote:

Hi, I'm trying to solve a problem of passing serializable, arbitrarily sized objects around using MPI and non-blocking communication. The problem I'm facing is what to do at the receiving end when expecting an object of unknown size, without blocking while waiting for it.

When using blocking message passing, I have simply solved the problem by first sending a small, fixed-size header containing the size of the rest of the data, which is sent in the following MPI message. When using non-blocking message passing, this doesn't seem to be such a good idea, since we can't post the main data transfer until we have received the message header... It seems to take away most of the advantages of non-blocking I/O in the first place.

I've been thinking about solving this using MPI_Probe / MPI_Iprobe, but I'm worried about performance.

Question 1: Will MPI_Probe or the underlying MPI implementation actually receive the full message data (assuming a reasonably sized message, say less than 10 MB) before MPI_Probe returns? Or will there be a significant data transfer delay (for large messages) when calling MPI_Recv after a successful MPI_Probe?

What I want is something like this:
1) Post one or several non-blocking, variable-sized message receives.
2) Do other, non-MPI work while any incoming messages are fully received into buffers on the local machine.
3) Complete the receives posted in 1). I don't want to wait unnecessarily here for data transfers that could have taken place during 2).

Problems: I can't post non-blocking MPI_Irecv() calls in 1), because I don't know the sizes of the incoming messages. If I simply do nothing in 1) and call MPI_Probe in 3), I'm worried that I won't get good compute/transfer overlap, because the messages won't actually be received locally until I post a Probe or Recv in 3).

Question 2: How can I achieve the communication sequence described in 1), 2), 3) above, with overlapping data transfer and local computation during 2)?

Question 3: A temporary kludge solution to the problem above might be to allocate a temporary receive buffer of some arbitrary, constant maximum size BUFSIZE in 1) for each non-blocking receive operation, make sure messages sent are not larger than BUFSIZE, and post MPI_Irecv(buffer, BUFSIZE, ...) calls in 1). I haven't been able to figure out whether it's actually correct and portable to receive less data than specified in the count argument to MPI_Irecv. What if the message sent from the other end is 10 bytes and BUFSIZE = count = 20? Would that be OK?

If anyone can shed any light on this, I'd be grateful. FYI, we're using a cluster of 2-8 core x86-64 machines running Linux, connected by ordinary 1 Gbit Ethernet.

Best regards,
Lars Andersson

___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
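Gus's idea in this reply, a small fixed-size header carrying an array of sizes, tags and sender ranks, could be sketched roughly as follows. This is only an illustration in plain Python with no MPI calls: the names `MAX_ENTRIES`, `pack_header` and `unpack_header` and the three-integer entry layout are assumptions made up for the example, not anything from the thread or from Open MPI.

```python
import struct

MAX_ENTRIES = 8  # hypothetical cap so the header stays small and fixed-size
# One entry = (size in bytes, tag, sender rank), three 32-bit signed ints.
ENTRY = struct.Struct("<iii")
COUNT = struct.Struct("<i")
HEADER_BYTES = COUNT.size + MAX_ENTRIES * ENTRY.size


def pack_header(entries):
    """Pack a list of (size, tag, rank) tuples into a fixed-size header."""
    if len(entries) > MAX_ENTRIES:
        raise ValueError("too many messages for one header")
    buf = bytearray(HEADER_BYTES)  # zero-padded to a constant length
    COUNT.pack_into(buf, 0, len(entries))
    for i, entry in enumerate(entries):
        ENTRY.pack_into(buf, COUNT.size + i * ENTRY.size, *entry)
    return bytes(buf)


def unpack_header(buf):
    """Recover the list of (size, tag, rank) tuples from a header."""
    (n,) = COUNT.unpack_from(buf, 0)
    return [ENTRY.unpack_from(buf, COUNT.size + i * ENTRY.size) for i in range(n)]
```

Because the header always has the same length, the receiver can take it with one blocking receive of `HEADER_BYTES` and then post one non-blocking receive per entry with the exact size. Whether a header this small really travels in "eager" mode depends on the MPI implementation and transport settings, so that part remains an assumption.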
[OMPI users] Receiving MPI messages of unknown size
Hi, I'm trying to solve a problem of passing serializable, arbitrarily sized objects around using MPI and non-blocking communication. The problem I'm facing is what to do at the receiving end when expecting an object of unknown size, without blocking while waiting for it.

When using blocking message passing, I have simply solved the problem by first sending a small, fixed-size header containing the size of the rest of the data, which is sent in the following MPI message. When using non-blocking message passing, this doesn't seem to be such a good idea, since we can't post the main data transfer until we have received the message header... It seems to take away most of the advantages of non-blocking I/O in the first place.

I've been thinking about solving this using MPI_Probe / MPI_Iprobe, but I'm worried about performance.

Question 1: Will MPI_Probe or the underlying MPI implementation actually receive the full message data (assuming a reasonably sized message, say less than 10 MB) before MPI_Probe returns? Or will there be a significant data transfer delay (for large messages) when calling MPI_Recv after a successful MPI_Probe?

What I want is something like this:
1) Post one or several non-blocking, variable-sized message receives.
2) Do other, non-MPI work while any incoming messages are fully received into buffers on the local machine.
3) Complete the receives posted in 1). I don't want to wait unnecessarily here for data transfers that could have taken place during 2).

Problems: I can't post non-blocking MPI_Irecv() calls in 1), because I don't know the sizes of the incoming messages. If I simply do nothing in 1) and call MPI_Probe in 3), I'm worried that I won't get good compute/transfer overlap, because the messages won't actually be received locally until I post a Probe or Recv in 3).

Question 2: How can I achieve the communication sequence described in 1), 2), 3) above, with overlapping data transfer and local computation during 2)?

Question 3: A temporary kludge solution to the problem above might be to allocate a temporary receive buffer of some arbitrary, constant maximum size BUFSIZE in 1) for each non-blocking receive operation, make sure messages sent are not larger than BUFSIZE, and post MPI_Irecv(buffer, BUFSIZE, ...) calls in 1). I haven't been able to figure out whether it's actually correct and portable to receive less data than specified in the count argument to MPI_Irecv. What if the message sent from the other end is 10 bytes and BUFSIZE = count = 20? Would that be OK?

If anyone can shed any light on this, I'd be grateful. FYI, we're using a cluster of 2-8 core x86-64 machines running Linux, connected by ordinary 1 Gbit Ethernet.

Best regards,
Lars Andersson
Re: [OMPI users] top question
Simon, it is a lot more difficult than it appears. You're right, select/poll can do it for any file descriptor, and shared mutexes/ conditions (despite the performance impact) can do it for shared memory. However, in the case where you have to support both simultaneously, what is the right approach, i.e. the one that doesn't impact the current performance? We're open to smart solutions ... george. On Jun 3, 2009, at 11:49 , Number Cruncher wrote: Jeff Squyres wrote: We get this question so much that I really need to add it to the FAQ. :-\ Open MPI currently always spins for completion for exactly the reason that Scott cites: lower latency. Arguably, when using TCP, we could probably get a bit better performance by blocking and allowing the kernel to make more progress than a single quick pass through the sockets progress engine, but that involves some other difficulties such as simultaneously allowing shared memory progress. We have ideas how to make this work, but it has unfortunately remained at a lower priority: the performance difference isn't that great, and we've been focusing on the other, lower latency interconnects (shmem, MX, verbs, etc.). Whilst I understand that you have other priorities, and I grateful for the leverage I get by using OpenMPI, I would like to offer an alternative use case, which I believe may become more common. We're developing parallel software which is designed to be used *interactively* as well as in batch mode. We want the same SIMD code running on a user's quad-core workstation as on a 1,000-node cluster. For the former case (single workstation), it would be *much* more user friendly and interactive, for the back-end MPI code not to be spinning at 100% when it's just waiting for the next front-end command. The GUI thread doesn't get a look in. I can't imagine the difficulties involved, but if the POSIX calls select() and pthread_cond_wait() can do it for TCP and shared-memory threads respectively, it can't be impossible! 
Just my .2c, Simon
Re: [OMPI users] Pb in configure script when using ifort with "-fast" + link of opal_wrapper
Dear Rainer, Jeff, Gus and list,

Thanks for your suggestions, I will test them tomorrow. I did not check your mails earlier because I was busy trying the gcc/gfortran way. I have other problems:
- For static linking, I am missing plenty of ibv_* routines. I saw on the net that they should be in a libibverbs library, but I cannot find it.
- Dynamic linking is OK, but when I run a simple test program on my machine (i7 920) with an mpd-hosts file containing a single line with the name of the machine and slots=4, the program only executes if I give my password, although I do have a .rhosts file with the name of my machine in my home directory.

-- Sincerely yours, Michel DEVEL
Re: [OMPI users] Pb in configure script when using ifort with "-fast" + link of opal_wrapper
Hi Michel, Jeff, Rainer, list,

I have AMD Opteron Shanghai and Intel 10.1017. I had trouble with the Intel -fast flag also. According to the ifort man page/help:

-fast = -xT -O3 -ipo -no-prec-div -static

(Each compiler vendor has a different -fast; PGI is another thing.)

Intel doesn't allow SSE-type optimization beyond W (SSE+SSE2) for AMD processors (an old and contentious issue, Google it for more info). So I changed -xT to -xW (the highest level allowed, also recommended by AMD). I had trouble with ipo before (missing symbols during link), so I reduced it to ip. Moreover, -static definitely cannot work with InfiniBand and the tons of other shared libraries, of course, hence I simply removed it. However, as suggested by Rainer, -static-intel may be OK if all you want is to avoid sending the Intel LD_LIBRARY_PATH with your mpiexec command. (I haven't tried it, though.)

The flags became:

-xW -O3 -ip -no-prec-div

I used the same flags for ifort (FFLAGS, FCFLAGS), icc (CFLAGS) and icpc (CXXFLAGS) to build OpenMPI 1.3.2, and it works. For "Genuine Intel" processors you can upgrade -xW to whatever is appropriate.

My $0.02.

Gus Correa
Gustavo Correa, Lamont-Doherty Earth Observatory - Columbia University, Palisades, NY, 10964-8000 - USA

Jeff Squyres wrote:

Rainer and I are still iterating on the trunk solution (we moved to an hg branch just for convenience for the moment). Note that the Fortran flags aren't too important to OMPI. We *only* use them in configure. OMPI doesn't contain any Fortran 77 code at all, and the F90 module is extremely minimalistic (generally one-line subroutines to call the C counterpart). So a workaround for the moment -- until we can figure out the problem -- might be to remove the -fast from the FFLAGS and FCFLAGS.
On Jun 3, 2009, at 11:34 AM, Rainer Keller wrote: Dear Michel, per the naming convention test in configure: ifort -fast will turn on -xHOST -O3 -ipo -no-prec-div -static, of which -ipo turns on interprocedural optimizations for multiple files. Here the compiled object file does not contain the symbols searched for in the configure-tests. Looking into the simple test-case in configure and the options that one has to figure out the naming convention using compilation (-c), I don't see an other other than disabling -fast & -ipo for intel-fortan compilers. Please check trunk in commit r21363. On Wednesday 03 June 2009 09:29:09 am DEVEL Michel wrote: > In fact I forgot to put back to '-fast -C' the FCFLAGS variable (from > '-O3 -C'). There is still an error (many opal_*_* subroutines not found > during the ipo step) at the same place, coming from the fact that > "ld: attempted static link of dynamic object > `../../../opal/.libs/libopen-pal.so' > although I put --enable-static in the configure step... > Any idea of how to make the static libraries ? In order to statically link at least the intel-libraries, please add -static-intel (in previous intel compilers called -i-static) to LDFLAGS With best regards, Rainer -- Rainer Keller, PhD Tel: +1 (865) 241-6293 Oak Ridge National Lab Fax: +1 (865) 241-4811 PO Box 2008 MS 6164 Email: kel...@ornl.gov Oak Ridge, TN 37831-2008AIM/Skype: rusraink ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Openmpi and processor affinity
The changes Jeff mentioned are not in the 1.3 branch - not sure if they will come over there or not. I'm a little concerned in this thread that someone is reporting the process affinity binding changing - that shouldn't be happening, and my guess is that something outside of our control may be changing it. One other thing to consider that has been an issue around here, and will be an even bigger issue with the change to bind at app start. If your app is threaded, we will bind *all* threads to the same processor, thus potentially hampering performance. We have found that multi-threaded apps often provide better performance if users do *not* set processor affinity via MPI, but instead embed binding calls inside the individual threads so they can be placed on separate processors. All depends on the exact nature of the application, of course! HTH Ralph On Wed, Jun 3, 2009 at 10:02 AM, Jeff Squyres wrote: > On Jun 3, 2009, at 11:40 AM, Ashley Pittman wrote: > > Wasn't there a discussion about this recently on the list, OMPI binds >> during MPI_Init() so it's possible for memory to be allocated on the >> wrong quad, the discussion was about moving the binding to the orte >> process as I recall? >> >> > Yes. It's been fixed in OMPI devel trunk. I'm not sure it made it to the > v1.3 branch, but it's definitely not in a released version yet. > > I *thought* that HPL did all allocation after MPI_INIT. But I could be > wrong. If so, then using numactl to bind before the MPI app starts will > likely give better results -- you're right (until we get our fixes in such > that we bind pre-main). > > Regardless, if something is *changing* the affinity after MPI_INIT, then > there's little OMPI can do about that. > > >From my testing of process affinity you tend to get much more consistent >> results with it on and much more unpredictable results with it off, I'd >> questing that it's working properly if you are seeing a 88-93% range in >> the results. >> >> Ashley Pittman. 
Re: [OMPI users] Pb in configure script when using ifort with "-fast" + link of opal_wrapper
Rainer and I are still iterating on the trunk solution (we moved to an hg branch just for convenience for the moment). Note that the Fortran flags aren't too important to OMPI. We *only* use them in configure. OMPI doesn't contain any Fortran 77 code at all, and the F90 module is extremely minimalistic (generally one-line subroutines to call the C counterpart). So a workaround for the moment -- until we can figure out the problem -- might be to remove the -fast from the FFLAGS and FCFLAGS. On Jun 3, 2009, at 11:34 AM, Rainer Keller wrote: Dear Michel, per the naming convention test in configure: ifort -fast will turn on -xHOST -O3 -ipo -no-prec-div -static, of which -ipo turns on interprocedural optimizations for multiple files. Here the compiled object file does not contain the symbols searched for in the configure-tests. Looking into the simple test-case in configure and the options that one has to figure out the naming convention using compilation (-c), I don't see an other other than disabling -fast & -ipo for intel-fortan compilers. Please check trunk in commit r21363. On Wednesday 03 June 2009 09:29:09 am DEVEL Michel wrote: > In fact I forgot to put back to '-fast -C' the FCFLAGS variable (from > '-O3 -C'). There is still an error (many opal_*_* subroutines not found > during the ipo step) at the same place, coming from the fact that > "ld: attempted static link of dynamic object > `../../../opal/.libs/libopen-pal.so' > although I put --enable-static in the configure step... > Any idea of how to make the static libraries ? 
In order to statically link at least the Intel libraries, please add -static-intel (called -i-static in previous Intel compilers) to LDFLAGS. With best regards, Rainer -- Rainer Keller, PhD Tel: +1 (865) 241-6293 Oak Ridge National Lab Fax: +1 (865) 241-4811 PO Box 2008 MS 6164 Email: kel...@ornl.gov Oak Ridge, TN 37831-2008 AIM/Skype: rusraink

-- Jeff Squyres Cisco Systems
Re: [OMPI users] Openmpi and processor affinity
On Jun 3, 2009, at 11:40 AM, Ashley Pittman wrote:

> Wasn't there a discussion about this recently on the list? OMPI binds during MPI_Init(), so it's possible for memory to be allocated on the wrong quad; the discussion was about moving the binding to the orte process, as I recall.

Yes. It's been fixed in OMPI devel trunk. I'm not sure it made it to the v1.3 branch, but it's definitely not in a released version yet.

I *thought* that HPL did all allocation after MPI_INIT. But I could be wrong. If so, then using numactl to bind before the MPI app starts will likely give better results -- you're right (until we get our fixes in such that we bind pre-main).

Regardless, if something is *changing* the affinity after MPI_INIT, then there's little OMPI can do about that.

> From my testing of process affinity, you tend to get much more consistent results with it on and much more unpredictable results with it off. I'd question whether it's working properly if you are seeing an 88-93% range in the results.
>
> Ashley Pittman.

-- Jeff Squyres Cisco Systems
Re: [OMPI users] top question
Jeff Squyres wrote:

> We get this question so much that I really need to add it to the FAQ. :-\ Open MPI currently always spins for completion for exactly the reason that Scott cites: lower latency. Arguably, when using TCP, we could probably get a bit better performance by blocking and allowing the kernel to make more progress than a single quick pass through the sockets progress engine, but that involves some other difficulties such as simultaneously allowing shared memory progress. We have ideas how to make this work, but it has unfortunately remained at a lower priority: the performance difference isn't that great, and we've been focusing on the other, lower latency interconnects (shmem, MX, verbs, etc.).

Whilst I understand that you have other priorities, and I am grateful for the leverage I get by using OpenMPI, I would like to offer an alternative use case, which I believe may become more common.

We're developing parallel software which is designed to be used *interactively* as well as in batch mode. We want the same SIMD code running on a user's quad-core workstation as on a 1,000-node cluster. For the former case (single workstation), it would be *much* more user friendly and interactive for the back-end MPI code not to be spinning at 100% when it's just waiting for the next front-end command. The GUI thread doesn't get a look in.

I can't imagine the difficulties involved, but if the POSIX calls select() and pthread_cond_wait() can do it for TCP and shared-memory threads respectively, it can't be impossible!

Just my .2c, Simon
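To make the select() half of Simon's argument concrete, here is a minimal sketch of a back end that sleeps in the kernel until a command arrives on a file descriptor instead of spinning. It is plain Python using the standard `select` module, with a pipe standing in for the front-end connection; the function name `wait_for_command` and the pipe setup are assumptions for illustration only. It says nothing about how Open MPI's progress engine works, and as George's reply in this thread notes, it does not address waiting on file descriptors and shared memory at the same time.

```python
import os
import select
import threading
import time


def wait_for_command(fd, timeout_s):
    """Sleep in the kernel until fd is readable or the timeout expires."""
    readable, _, _ = select.select([fd], [], [], timeout_s)
    return bool(readable)


if __name__ == "__main__":
    r, w = os.pipe()
    # Simulated front end: sends a command after a short delay.
    threading.Timer(0.2, os.write, args=(w, b"run\n")).start()
    start = time.process_time()
    got = wait_for_command(r, 5.0)
    cpu = time.process_time() - start
    print(got, os.read(r, 16), f"CPU seconds used while waiting: {cpu:.3f}")
```

The CPU time printed at the end is the point of the exercise: a process blocked in select() should consume very little of it, which is the behaviour Simon is asking for on an interactive workstation.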
Re: [OMPI users] Openmpi and processor affinity
On Wed, 2009-06-03 at 11:27 -0400, Jeff Squyres wrote:
> On Jun 3, 2009, at 10:48 AM, wrote:
>
> > For HPL, try writing a bash script that pins processes to their local memory controllers using numactl before kicking off HPL. This is particularly helpful when spawning more than 1 thread per process. The last line of your script should look like "numactl -c $cpu_bind -m $ mem_bind $*".
> >
> > Believe it or not, I hit 94.5% HPL efficiency using this tactic on a 16 node cluster. Using processor affinity (various MPIs) my results were inconsistent and ranged between 88-93%
>
> If you're using multi-threaded HPL, that might be useful. But if you're not, I'd be surprised if you got any different results than Open MPI binding itself. If there really is a difference, we should figure out why. More specifically, calling numactl yourself should be pretty much exactly what we do in OMPI (via API, not via calling numactl).

Wasn't there a discussion about this recently on the list? OMPI binds during MPI_Init(), so it's possible for memory to be allocated on the wrong quad; the discussion was about moving the binding to the orte process, as I recall.

From my testing of process affinity, you tend to get much more consistent results with it on and much more unpredictable results with it off. I'd question whether it's working properly if you are seeing an 88-93% range in the results.

Ashley Pittman.
Re: [OMPI users] Pb in configure script when using ifort with "-fast" + link of opal_wrapper
Dear Michel, per the naming convention test in configure: ifort -fast will turn on -xHOST -O3 -ipo -no-prec-div -static, of which -ipo turns on interprocedural optimizations for multiple files. Here the compiled object file does not contain the symbols searched for in the configure-tests. Looking into the simple test-case in configure and the options that one has to figure out the naming convention using compilation (-c), I don't see an other other than disabling -fast & -ipo for intel-fortan compilers. Please check trunk in commit r21363. On Wednesday 03 June 2009 09:29:09 am DEVEL Michel wrote: > In fact I forgot to put back to '-fast -C' the FCFLAGS variable (from > '-O3 -C'). There is still an error (many opal_*_* subroutines not found > during the ipo step) at the same place, coming from the fact that > "ld: attempted static link of dynamic object > `../../../opal/.libs/libopen-pal.so' > although I put --enable-static in the configure step... > Any idea of how to make the static libraries ? In order to statically link at least the intel-libraries, please add -static-intel (in previous intel compilers called -i-static) to LDFLAGS With best regards, Rainer -- Rainer Keller, PhD Tel: +1 (865) 241-6293 Oak Ridge National Lab Fax: +1 (865) 241-4811 PO Box 2008 MS 6164 Email: kel...@ornl.gov Oak Ridge, TN 37831-2008AIM/Skype: rusraink
Re: [OMPI users] Openmpi and processor affinity
Hi Jeff, Yes, this technique is particularly helpful for multi-threaded and works consistently across the various MPIs I test. Thanks, jacob > -Original Message- > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Jeff Squyres > Sent: Wednesday, June 03, 2009 10:27 AM > To: Open MPI Users > Subject: Re: [OMPI users] Openmpi and processor affinity > > On Jun 3, 2009, at 10:48 AM, wrote: > > > For HPL, try writing a bash script that pins processes to their > > local memory controllers using numactl before kicking off HPL. This > > is particularly helpful when spawning more than 1 thread per > > process. The last line of your script should look like "numactl -c > > $cpu_bind -m $ mem_bind $*". > > > > Believe it or not, I hit 94.5% HPL efficiency using this tactic on a > > 16 node cluster. Using processor affinity (various MPIs) my results > > were inconsistent and ranged between 88-93% > > > > If you're using multi-threaded HPL, that might be useful. But if > you're not, I'd be surprised if you got any different results than > Open MPI binding itself. If there really is a difference, we should > figure out why. More specifically, calling numactl yourself should be > pretty much exactly what we do in OMPI (via API, not via calling > numactl). > > -- > Jeff Squyres > Cisco Systems > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Openmpi and processor affinity
On Jun 3, 2009, at 10:48 AM, wrote: For HPL, try writing a bash script that pins processes to their local memory controllers using numactl before kicking off HPL. This is particularly helpful when spawning more than 1 thread per process. The last line of your script should look like "numactl -c $cpu_bind -m $ mem_bind $*". Believe it or not, I hit 94.5% HPL efficiency using this tactic on a 16 node cluster. Using processor affinity (various MPIs) my results were inconsistent and ranged between 88-93% If you're using multi-threaded HPL, that might be useful. But if you're not, I'd be surprised if you got any different results than Open MPI binding itself. If there really is a difference, we should figure out why. More specifically, calling numactl yourself should be pretty much exactly what we do in OMPI (via API, not via calling numactl). -- Jeff Squyres Cisco Systems
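The wrapper quoted above is described as a bash script ending in a numactl call. As a rough sketch of the same idea, the Python below builds the equivalent command line for one process. Several things here are assumptions rather than anything stated in the thread: the round-robin mapping from local rank to NUMA node, the `numa_nodes=2` value, the use of the `OMPI_COMM_WORLD_LOCAL_RANK` environment variable to find the local rank, and the choice of numactl's long options `--cpunodebind` and `--membind` in place of the short `-c` / `-m` forms quoted in the message.

```python
import os


def numactl_command(app_argv, local_rank, numa_nodes):
    """Build a numactl command line binding CPU and memory to one NUMA node."""
    node = local_rank % numa_nodes  # assumed round-robin placement
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *app_argv]


if __name__ == "__main__":
    # OMPI_COMM_WORLD_LOCAL_RANK is assumed to be set by the launcher.
    rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    print(" ".join(numactl_command(["./xhpl"], rank, numa_nodes=2)))
```

Binding memory and CPU to the same node is what "pins processes to their local memory controllers" amounts to; a real script would pass the resulting list to something like `os.execvp` and would need a node mapping that matches the actual machine topology.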
Re: [OMPI users] Hypre
I'm afraid I have no experience with Hypre -- sorry! :-( Do they have a support web site / mailing list somewhere? You might have better luck contacting them about their software.

On Jun 3, 2009, at 11:05 AM, naveed wrote:

> Hi, I wanted to know if anyone has used the Hypre library to solve Ax = b for a system of equations. I have problems reading in a matrix file. I went through the user manual, but couldn't get much out of it. I would like to know the best file format for reading large sparse matrices with Hypre. Looking forward to any kind of help related to Hypre. Best regards, Ahnav.

-- Jeff Squyres Cisco Systems
Re: [OMPI users] top question
tsi...@coas.oregonstate.edu wrote:

> Thanks for the explanation. I am using GigEth + Open MPI and the buffered MPI_Bsend. I had already noticed that top behaved differently on another cluster with InfiniBand + MPICH. So the only option to find out how much time each process is waiting around seems to be to profile the code. Will gprof show me anything useful or will I have to use a more sophisticated (any free ones?) parallel profiler?

Another frequently asked question! I can try to add a FAQ entry/category. There are a number of free options, including:

TAU http://www.cs.uoregon.edu/research/tau/home.php
mpiP http://mpip.sourceforge.net/
FPMPI http://www.mcs.anl.gov/research/projects/fpmpi/WWW/index.html
IPM http://ipm-hpc.sourceforge.net/
Sun Studio http://developers.sun.com/sunstudio/

The only one I've really used is Sun Studio. Jumpshot *might* work with Open MPI, I forget. Or it might be more of an MPICH tool.
[OMPI users] Hypre
Hi, I wanted to know if anyone has used the Hypre library to solve Ax = b for a system of equations. I have problems reading in a matrix file. I went through the user manual, but couldn't get much out of it. I would like to know the best file format for reading large sparse matrices with Hypre. Looking forward to any kind of help related to Hypre. Best regards, Ahnav.
Re: [OMPI users] Openmpi and processor affinity
Hi Iftikhar, For HPL, try writing a bash script that pins processes to their local memory controllers using numactl before kicking off HPL. This is particularly helpful when spawning more than 1 thread per process. The last line of your script should look like "numactl -c $cpu_bind -m $ mem_bind $*". Believe it or not, I hit 94.5% HPL efficiency using this tactic on a 16 node cluster. Using processor affinity (various MPIs) my results were inconsistent and ranged between 88-93% Thanks, jacob > -Original Message- > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Iftikhar Rathore > Sent: Tuesday, June 02, 2009 10:25 PM > To: Open MPI Users > Subject: Re: [OMPI users] Openmpi and processor affinity > > Guss > Thanks for the reply and it was a typo (Im sick). I have updated to > 1.3.2 since my last post and have tried checking cpu affinity by using > f and j it shows processes spread on all 8 cores in the beginning, but > it does eventually shows all processes running on 0, > > My P and Q's are made for an 890 run, I have done this run with other > mpi implementation before. I have made sure that I am using the right > mpirun, but as Jeff pointed out I may have a mixed build and I am > investigating it more and will post my findings. > > Regards > > > On Tue, 2009-06-02 at 20:58 -0400, Gus Correa wrote: > > Hi Iftikhar > > > > Iftikhar Rathore wrote: > > > Hi > > > We are using openmpi version 1.2.8 (packaged with ofed-1.4). I am > trying > > > to run hpl-2.0 (linpak). We have two intel quad core CPU's in all > our > > > server (8 total cores) and all hosts in the hostfile have lines > that > > > look like "10.100.0.227 slots=8max_slots=8". > > > > Is this a typo on your email or on your hostfile? > > > > > look like "10.100.0.227 slots=8max_slots=8". 
> > > > There should be blank space between the number of slots and max_slots: > > > > 10.100.0.227 slots=8 max_slots=8 > > > > Another possibility is that you may be inadvertently using another > > mpirun on the system. > > > > A third possibility: > > Does your HPL.dat file require 896 processors? > > The product P x Q on each (P,Q) pair should match 896. > > If it is less, HPL will run on less processors, i.e., on P x Q only. > > (If it is more, HPL will issue an error message and stop.) > > Is this what is happening? > > > > A fourth one ...: > > Are you sure processor affinity is not correct? > > Do the processes drift across the cores? > > Typing 1 on top is not enough to clarify this. > > To see the process-to-core map on top, > > type "f" (for fields), > > then "j" (to display the CPU/core number), > > and wait for several minutes to see if processor/core (header "P") > > and the process ID (header "PID"), > > drift or not. > > > > Even when I launch less processes than the available/requested cores > > "--mca mpi_paffinity_alone 1" works right here, > > as I just checked, with P=4 and Q=1 on HPL.dat > > and with -np 8 on mpiexec. > > > > ** > > > > I recently ran a bunch of HPL tests with --mca mpi_paffinity_alone 1 > > and OpenMPI 1.3.2, built from source, and the processor affinity > seems > > to work (i.e., the processes stick to the cores). > > Building from source quite simple, and would give you the latest > OpenMPI. > > > > I don't know if 1.2.8 (which you are using) > > has a problem with mpi_paffinity_alone, > > but the OpenMPI developers may clarify this. > > > > > > I hope this helps, > > Gus Correa > > - > > Gustavo Correa > > Lamont-Doherty Earth Observatory - Columbia University > > Palisades, NY, 10964-8000 - USA > > - > > > > > > > > Now when I use mpirun (even with --mca mpi_paffinity_alone 1) it > does > > > not keep the affinity, the processes seem to gravitate towards > first > > > four cores (using top and hitting 1). 
I know I do have MCA > paffinity > > > available. > > > > > > [root@devi DLR_WB_88]# ompi_info | grep paffinity > > > [devi.cisco.com:26178] mca: base: component_find: unable to open > btl openib: file not found (ignored) > > >MCA paffinity: linux (MCA v1.0, API v1.0, Component > v1.2.8) > > > > > > The command line I am using is: > > > > > > # mpirun -nolocal -np 896 -v --mca mpi_paffinity_alone 1 -hostfile > /mnt/apps/hosts/896_8slots /mnt/apps/bin/xhpl > > > > > > Am I doing something wrong and is there a way to confirm cpu > affinity besides hitting "1" on top. > > > > > > > > > [root@devi DLR_WB_88]# mpirun -V > > > mpirun (Open MPI) 1.2.8 > > > > > > Report bugs to http://www.open-mpi.org/community/help/ > > > > > > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > -- > Iftikhar Rathore > Technical Marketing Engineer > Server Access Virtualization BU. > Cisco Systems, Inc. > > Phone: +1 408 853 5322
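Gus's quoted checklist says that the product P x Q in HPL.dat should match the number of launched processes (896 in this thread), that HPL runs on only P x Q processes if the product is smaller, and that it stops with an error if the product is larger. A small sketch of that arithmetic follows; the function names are made up for the example and none of this is HPL's own code.

```python
import math


def hpl_process_count(p, q, np_launched):
    """Return how many ranks HPL would use under the rule quoted above."""
    needed = p * q
    if needed > np_launched:
        raise ValueError(
            f"grid {p}x{q} needs {needed} ranks, only {np_launched} launched"
        )
    return needed


def grids_for(np_total):
    """List all (P, Q) pairs with P <= Q and P * Q == np_total."""
    return [
        (p, np_total // p)
        for p in range(1, math.isqrt(np_total) + 1)
        if np_total % p == 0
    ]
```

For example, `hpl_process_count(4, 1, 8)` reflects Gus's own test with P=4, Q=1 and -np 8, where only four of the eight launched processes do HPL work, and `grids_for(896)` enumerates the grid shapes that would use all 896 processes.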
Re: [OMPI users] top question
Thanks for the explanation.

I am using GigEth + Open MPI and the buffered MPI_Bsend. I had already noticed that top behaved differently on another cluster with InfiniBand + MPICH.

So the only option to find out how much time each process spends waiting seems to be to profile the code. Will gprof show me anything useful, or will I have to use a more sophisticated parallel profiler (any free ones)?

Cheers,
Tiago
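A low-tech alternative to a full profiler is to wrap the communication calls in a wall-clock timer and accumulate the time per category. The sketch below is in Python for brevity; the names WaitClock and section are made up for this illustration, and time.sleep stands in for a blocking MPI call. In a C or Fortran MPI code the same bookkeeping could be done by calling MPI_Wtime() before and after the calls of interest.

```python
import time
from contextlib import contextmanager


class WaitClock:
    """Accumulates wall-clock seconds spent in named sections of a program."""

    def __init__(self):
        self.totals = {}

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed

    def fraction(self, name):
        """Share of the total recorded time that was spent in `name`."""
        total = sum(self.totals.values())
        return self.totals.get(name, 0.0) / total if total else 0.0


if __name__ == "__main__":
    clock = WaitClock()
    for _ in range(3):
        with clock.section("compute"):
            sum(i * i for i in range(100_000))  # placeholder for real work
        with clock.section("comm"):
            time.sleep(0.01)  # placeholder for a blocking send/receive
    print({name: round(secs, 4) for name, secs in clock.totals.items()})
    print("fraction in comm:", round(clock.fraction("comm"), 2))
```

Because this measures wall-clock time rather than CPU time, it is not fooled by a library that polls while waiting, which is the reason top reports 100% in the first place.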
Re: [OMPI users] Pb in configure script when using ifort with "-fast" + link of opal_wrapper
Ralph Castain wrote:
> I assume you re-did the ./configure command?

Thanks for your answer. Yes.

> Did you also remember to "make clean" before doing your "make all
> install"?

No, but now that I have done it, the result is the same:
"ld: attempted static link of dynamic object `../../../opal/.libs/libopen-pal.so'"

> Also, I note that your prefix looks really strange - it looks like you
> are trying to install OMPI where the Intel compiler is located? Are
> you sure you want to do that?

Well yes, but maybe it is a silly thing. I wanted to do that out of laziness, to avoid having to write a script that adds the directories to $PATH, $LD_LIBRARY_PATH and so on. Furthermore, I would like to keep a version compiled with gcc and gfortran in /usr/local.

--
Sincerely yours,
Michel DEVEL
Re: [OMPI users] Pb in configure script when using ifort with "-fast" + link of opal_wrapper
I assume you re-did the ./configure command?

Did you also remember to "make clean" before doing your "make all install"?

Also, I note that your prefix looks really strange - it looks like you are trying to install OMPI where the Intel compiler is located? Are you sure you want to do that?

On Wed, Jun 3, 2009 at 7:29 AM, DEVEL Michel wrote:
> Hi again,
>
> In fact I forgot to put back to '-fast -C' the FCFLAGS variable (from '-O3
> -C'). There is still an error (many opal_*_* subroutines not found during
> the ipo step) at the same place, coming from the fact that
> "ld: attempted static link of dynamic object
> `../../../opal/.libs/libopen-pal.so'
> although I put --enable-static in the configure step...
>
> Any idea of how to make the static libraries ?
> --
> Sincerely yours,
> Michel DEVEL
Re: [OMPI users] top question
We get this question so much that I really need to add it to the FAQ. :-\

Open MPI currently always spins for completion for exactly the reason that Scott cites: lower latency.

Arguably, when using TCP, we could probably get a bit better performance by blocking and allowing the kernel to make more progress than a single quick pass through the sockets progress engine, but that involves some other difficulties such as simultaneously allowing shared memory progress. We have ideas how to make this work, but it has unfortunately remained at a lower priority: the performance difference isn't that great, and we've been focusing on the other, lower latency interconnects (shmem, MX, verbs, etc.).

On Jun 3, 2009, at 8:37 AM, Scott Atchley wrote:
> On Jun 3, 2009, at 6:05 AM, tsi...@coas.oregonstate.edu wrote:
>
> > Top always shows all the parallel processes at 100% in the %CPU
> > field, although some of the time these must be waiting for a
> > communication to complete. How can I see actual processing as
> > opposed to waiting at a barrier?
> >
> > Thanks,
> > Tiago
>
> Using what interconnect?
>
> For performance reasons (lower latency), the app and/or OMPI may be
> polling on the completion. Are you using blocking or non-blocking
> communication?
>
> Scott

--
Jeff Squyres
Cisco Systems
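To make the spin-versus-block trade-off concrete, here is a small Python sketch. It is not Open MPI code: the function names are hypothetical, and a timer thread stands in for a message arriving. It measures how much CPU time the process consumes while waiting. Polling in a tight loop reacts as soon as the event is set but burns a full core the whole time, which is what top reports as 100%; blocking in the kernel uses almost no CPU.

```python
import threading
import time


def cpu_used_while_waiting(wait_fn, delay=0.2):
    """Return CPU seconds this process consumes while wait_fn waits."""
    done = threading.Event()
    timer = threading.Timer(delay, done.set)  # stands in for a message arriving
    timer.start()
    cpu_start = time.process_time()
    wait_fn(done)
    cpu_used = time.process_time() - cpu_start
    timer.join()
    return cpu_used


def spin_wait(done):
    """Poll for completion in a tight loop: low latency, but keeps a core busy."""
    while not done.is_set():
        pass


def blocking_wait(done):
    """Sleep until woken: nearly no CPU use, at the cost of wake-up latency."""
    done.wait()


if __name__ == "__main__":
    print("spin  CPU seconds:", round(cpu_used_while_waiting(spin_wait), 3))
    print("block CPU seconds:", round(cpu_used_while_waiting(blocking_wait), 3))
```

On an otherwise idle machine the first figure should be close to the delay and the second close to zero, although the exact values depend on the system.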
Re: [OMPI users] Pb in configure script when using ifort with "-fast" + link of opal_wrapper
Hi again,

In fact I forgot to put the FCFLAGS variable back to '-fast -C' (from '-O3 -C'). There is still an error at the same place (many opal_*_* subroutines not found during the ipo step), caused by:
"ld: attempted static link of dynamic object `../../../opal/.libs/libopen-pal.so'"
although I put --enable-static in the configure step...

Any idea how to build the static libraries?

--
Sincerely yours,
Michel DEVEL
[OMPI users] Pb in configure script when using ifort with "-fast" + link of opal_wrapper
Dear Open MPI users and developers,

I have just tried installing Open MPI by compiling it rather than just using an RPM, because I want to use it with the ifort compiler. I have noticed a problem in the configure script (present at least in versions 1.3.1 and 1.3.2) in the determination of the Fortran naming convention. I tried to use

./configure --prefix=/opt/intel/Compiler/11.0/074/ --with-sge --enable-static CC='icc' CFLAGS=' -fast -C' LDFLAGS='-fast -C' AR='ar' F77='ifort' FC='ifort' FFLAGS=' -fast -C' FCFLAGS=' -fast -C' CXX='icpc'

but the test to determine the Fortran naming convention (single underscore in the ifort case) fails because of the -fast flag. If I do "ifort -c -C -fast conftest.f", then "nm -B conftest.o" gives " w __ildata_included ", whereas it correctly gives " T foo_bar_ " if I use "ifort -c -C -O3 conftest.f".

I inserted ompi_cv_f77_external_symbol="single underscore" at line 35244 of the configure script (as if this variable had been cached) to get around this bug, which is not clean at all but works in my case. With this change, the configure script completes successfully.
However, "make all" then fails at the linking of opal_wrapper with the following messages:

/bin/sh ../../../libtool --tag=CC --mode=link icc -DNDEBUG -fast -C -finline-functions -fno-strict-aliasing -restrict -pthread -fvisibility=hidden -export-dynamic -fast -C -o opal_wrapper opal_wrapper.o ../../../opal/libopen-pal.la -lnsl -lutil
libtool: link: icc -DNDEBUG -fast -C -finline-functions -fno-strict-aliasing -restrict -pthread -fvisibility=hidden -fast -C -o .libs/opal_wrapper opal_wrapper.o -Wl,--export-dynamic ../../../opal/.libs/libopen-pal.so -lm -lnsl -lutil -pthread -Wl,-rpath -Wl,/opt/intel/Compiler/11.0/074/lib
*** glibc detected *** /opt/intel/Compiler/11.0/074/bin/intel64/mcpcom: double free or corruption (!prev): 0x02d06c70 ***
======= Backtrace: =========
/lib64/libc.so.6[0x2b8a83f7d118]
/lib64/libc.so.6(cfree+0x76)[0x2b8a83f7ec76]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x10b43e7]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x1104a68]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x11145ae]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x11172d5]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x11168b7]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x110f181]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x10ffe06]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x10ade6b]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0xfe7960]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x847c06]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom[0x452935]
/lib64/libc.so.6(__libc_start_main+0xe6)[0x2b8a83f27586]
/opt/intel/Compiler/11.0/074/bin/intel64/mcpcom(regcomp+0x3a)[0x40557a]
======= Memory map: ========
0040-01dff000 r-xp 08:07 402335 /opt/intel/Compiler/11.0/074/bin/intel64/mcpcom
01efe000-0202f000 rwxp 019fe000 08:07 402335 /opt/intel/Compiler/11.0/074/bin/intel64/mcpcom
0202f000-02e6b000 rwxp 0202f000 00:00 0 [heap]
2b8a83599000-2b8a835b7000 r-xp 08:07 1262134
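The naming-convention test described in this message boils down to compiling a small Fortran routine and looking at how its symbol appears in the "nm -B" output. The Python sketch below mimics that classification so the failure with -fast is easy to see. It is not the logic of the real configure script: the function name is hypothetical, and apart from "single underscore" (taken from the message) the labels are assumptions made for illustration.

```python
def classify_fortran_symbol(nm_output, routine="foo_bar"):
    """Guess the Fortran name-mangling convention from `nm -B` output.

    Returns a label, or None if no defined text symbol matches `routine`.
    """
    candidates = {
        routine + "__": "double underscore",
        routine + "_": "single underscore",
        routine: "no underscore",
        routine.upper(): "upper case",
    }
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        # The last two fields are the symbol type and the symbol name;
        # an address column may or may not precede them.
        symbol_type, name = fields[-2], fields[-1]
        if symbol_type == "T" and name in candidates:
            return candidates[name]
    return None


if __name__ == "__main__":
    print(classify_fortran_symbol(" T foo_bar_ "))          # built with -O3
    print(classify_fortran_symbol(" w __ildata_included "))  # built with -fast
```

With the -fast output there is no defined foo_bar symbol to classify at all, so any test of this kind has nothing to match. That is consistent with the workaround above of setting the cached answer by hand.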
Re: [OMPI users] top question
On Jun 3, 2009, at 6:05 AM, tsi...@coas.oregonstate.edu wrote:
> Top always shows all the parallel processes at 100% in the %CPU
> field, although some of the time these must be waiting for a
> communication to complete. How can I see actual processing as
> opposed to waiting at a barrier?
>
> Thanks,
> Tiago

Using what interconnect?

For performance reasons (lower latency), the app and/or OMPI may be polling on the completion. Are you using blocking or non-blocking communication?

Scott
[OMPI users] top question
Top always shows all the parallel processes at 100% in the %CPU field, although some of the time these must be waiting for a communication to complete. How can I see actual processing as opposed to waiting at a barrier?

Thanks,
Tiago
Re: [OMPI users] Exit Program Without Calling MPI_Finalize For Special Case
I'm afraid there is no way to do this in 1.3.2 (or any OMPI distributed release) with MPI applications. The OMPI trunk does provide continuous re-spawn of failed processes, mapping them to other nodes and considering fault relationships between nodes, but this only works if they are -not- MPI processes. I can detail that for you, if you would like.

The problem with MPI processes is that restart is a much larger problem than just re-spawning a process. The entire MPI system becomes out-of-sync when one process fails - messages in-flight can be lost, collectives hang, etc. Even if you rewire the connections after re-spawning the process, you still have the problem of re-synchronizing the MPI communications - recovering lost messages, determining if a collective is already in operation and waiting for this process to respond, etc.

Hence, our default response is to simply terminate the job, letting the user restart it from some prior checkpoint.

Of course, the issue of how to recover from a single process failure remains the subject of considerable research. I assume you are engaging in such research?

On Jun 2, 2009, at 10:49 PM, Tee Wen Kai wrote:
> Hi,
>
> I am writing a program for a central controller that will spawn
> processes depending on the user's selection. When there is a fault in
> the spawned processes, for example when the computer running a spawned
> process suddenly goes down, the controller should react and respawn
> the processes on available machines. However, when a computer goes
> down, all communications hang. To resolve this, the controller sends a
> SIGTERM signal to kill the spawned processes. In the spawned program,
> I have written a signal handler to handle such cases. However, when I
> include MPI_Finalize in the handler, there are error messages when the
> processes exit because some communication is not complete. Thus, I
> modified my program so that when the processes need to exit through
> the handler, MPI_Finalize is not called.
>
> I am using Open MPI 1.2.8 and this works. However, version 1.2.8 has
> other bugs: for example, processes spawned using MPI_Comm_spawn do not
> terminate the orted services when they exit, leading to a lot of orted
> services when processes are spawned over and over again. Thus, I
> started evaluating version 1.3.2. Version 1.3.2 fixes that bug, but
> the whole program exits once a process exits without calling
> MPI_Finalize.
>
> Therefore, I seek your help or suggestions on how I should overcome
> this, or on the proper way to quit processes when they are stuck
> because one process is down.
>
> Thank you.
>
> Regards,
> Wenkai
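For processes that are not MPI processes, the basic restart-on-failure idea can be sketched in a few lines of Python. This is only an illustration of the concept and far simpler than what the OMPI trunk does: it runs on one machine, does no node mapping, and knows nothing about fault relationships. The function name and the max_restarts parameter are invented for the example.

```python
import subprocess
import sys


def run_with_respawn(cmd, max_restarts=3):
    """Run cmd; if it exits with a non-zero code, start it again.

    Gives up after max_restarts restarts. Returns a tuple
    (final_return_code, number_of_restarts).
    """
    restarts = 0
    while True:
        code = subprocess.run(cmd).returncode
        if code == 0 or restarts >= max_restarts:
            return code, restarts
        restarts += 1


if __name__ == "__main__":
    # A worker that always fails, to exercise the restart path.
    failing_worker = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_with_respawn(failing_worker, max_restarts=2))
```

This works only because each worker is independent. As explained above, an MPI process cannot simply be restarted this way, since the surviving processes would still hold stale connections and possibly half-completed collectives.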
[OMPI users] Exit Program Without Calling MPI_Finalize For Special Case
Hi,

I am writing a program for a central controller that will spawn processes depending on the user's selection. When there is a fault in the spawned processes, for example when the computer running a spawned process suddenly goes down, the controller should react and respawn the processes on available machines. However, when a computer goes down, all communications hang. To resolve this, the controller sends a SIGTERM signal to kill the spawned processes. In the spawned program, I have written a signal handler to handle such cases. However, when I include MPI_Finalize in the handler, there are error messages when the processes exit because some communication is not complete. Thus, I modified my program so that when the processes need to exit through the handler, MPI_Finalize is not called.

I am using Open MPI 1.2.8 and this works. However, version 1.2.8 has other bugs: for example, processes spawned using MPI_Comm_spawn do not terminate the orted services when they exit, leading to a lot of orted services when processes are spawned over and over again. Thus, I started evaluating version 1.3.2. Version 1.3.2 fixes that bug, but the whole program exits once a process exits without calling MPI_Finalize.

Therefore, I seek your help or suggestions on how I should overcome this, or on the proper way to quit processes when they are stuck because one process is down.

Thank you.

Regards,
Wenkai