[OMPI users] openmpi 1.4.1

2010-05-06 Thread David Logan
Oops, found the problem: I hadn't restarted PBS after changing the node lists, and the job had been put onto a node with a faulty Myrinet connection on the switch. Regards > Hi All, I am receiving an error message [grid-admin@ng2 ~]$ cat dml_test.err [hydra010:22914] [btl_gm_proc.c:191] error

[OMPI users] openmpi 1.4.1

2010-05-06 Thread David Logan
Hi All, I am receiving an error message [grid-admin@ng2 ~]$ cat dml_test.err [hydra010:22914] [btl_gm_proc.c:191] error in converting global to local id [hydra002:07435] [btl_gm_proc.c:191] error in converting global to local id [hydra009:31492] [btl_gm_proc.c:191] error in converting global to

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Gus Correa
Hi Jeff Answers inline. Jeff Squyres wrote: On May 6, 2010, at 2:01 PM, Gus Correa wrote: 1) Now I can see and use the btl_sm_num_fifos component: I had committed already "btl = ^sm" to the openmpi-mca-params.conf file. This apparently hides the btl_sm_num_fifos from ompi_info. After I

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Samuel K. Gutierrez
Hi Gus, Doh! I didn't see the kernel-related messages after the segfault message. Definitely some weirdness here that is beyond your control... Sorry about that. -- Samuel K. Gutierrez Los Alamos National Laboratory On May 6, 2010, at 3:28 PM, Gus Correa wrote: Hi Samuel Samuel K.

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Jeff Squyres
On May 6, 2010, at 2:01 PM, Gus Correa wrote: > 1) Now I can see and use the btl_sm_num_fifos component: > > I had committed already "btl = ^sm" to the openmpi-mca-params.conf > file. This apparently hides the btl_sm_num_fifos from ompi_info. > > After I switched to no options in

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Gus Correa
Hi Samuel Samuel K. Gutierrez wrote: Hi Gus, This may not help, but it's worth a try. If it's not too much trouble, can you please reconfigure your Open MPI installation with --enable-debug and then rebuild? After that, may we see the stack trace from a core file that is produced after

Re: [OMPI users] MPI_Bsend vs. MPI_Ibsend (2)

2010-05-06 Thread Eugene Loh
First, to minimize ambiguity, it may make sense to distinguish explicitly between two buffers: the send buffer (specified in the MPI_Send or MPI_Bsend call) and the attached buffer (specified in some MPI_Buffer_attach call). Jovana Knezevic wrote: On the other hand, a slight confusion
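To make the two-buffer distinction concrete, here is a toy model in C. It is not the MPI API and not how Open MPI implements MPI_Bsend; the type toy_attached_buffer and the functions toy_buffer_attach and toy_bsend are hypothetical. It assumes the simplest strategy, copying the send buffer into the attached buffer before returning, which is why the application may reuse its send buffer as soon as the call returns.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy model: NOT the MPI API and NOT Open MPI's implementation.
   The "attached buffer" is memory handed over in advance
   (in real MPI, via MPI_Buffer_attach). */
typedef struct {
    unsigned char *mem;
    size_t size;
    size_t used;
} toy_attached_buffer;

void toy_buffer_attach(toy_attached_buffer *ab, void *mem, size_t size) {
    assert(ab != NULL && mem != NULL);
    ab->mem = mem;
    ab->size = size;
    ab->used = 0;
}

/* Copies the "send buffer" (the data passed to the send call) into the
   attached buffer and returns a pointer to the staged copy, or NULL if the
   attached buffer is too small (real MPI reports an error instead). */
const unsigned char *toy_bsend(toy_attached_buffer *ab,
                               const void *sendbuf, size_t len) {
    if (ab->size - ab->used < len)
        return NULL;
    unsigned char *slot = ab->mem + ab->used;
    memcpy(slot, sendbuf, len);
    ab->used += len;
    /* A real library would transmit from 'slot' later, in the background. */
    return slot;
}
```

For example, after `toy_bsend(&ab, data, sizeof data)` returns, overwriting `data` leaves the staged copy intact. A real implementation also reclaims space in the attached buffer as messages drain, which this sketch omits.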

Re: [OMPI users] MPI_Bsend vs. MPI_Ibsend (2)

2010-05-06 Thread Richard Treumann
Bsend does not guarantee to use the attached buffer. Return from MPI_Ibsend does not guarantee you can modify the application send buffer. Maybe the implementation would try to optimize by scheduling a nonblocking send from the application buffer that bypasses the copy to the attached buffer.
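A minimal sketch of the optimization Richard describes, again as a hypothetical toy: toy_request, toy_ibsend and toy_wait are invented names, with toy_wait standing in for MPI_Wait. The nonblocking call only records where the data lives, and the copy happens at completion, so modifying the send buffer between the two calls changes what is sent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy model, NOT the MPI API: a nonblocking send that defers
   the copy out of the application buffer until the request completes. */
typedef struct {
    const void *sendbuf;   /* application send buffer, still in use */
    size_t len;
    unsigned char *dest;   /* where the data ends up ("the wire") */
    int complete;
} toy_request;

/* Starts the "send" and returns at once; no data has been copied yet. */
void toy_ibsend(toy_request *req, const void *sendbuf, size_t len,
                unsigned char *dest) {
    assert(req != NULL && sendbuf != NULL && dest != NULL);
    req->sendbuf = sendbuf;
    req->len = len;
    req->dest = dest;
    req->complete = 0;
}

/* Completes the request (cf. MPI_Wait); only after this returns may the
   application safely modify its send buffer. */
void toy_wait(toy_request *req) {
    if (!req->complete) {
        memcpy(req->dest, req->sendbuf, req->len);
        req->complete = 1;
    }
}
```

This is why the buffer passed to MPI_Ibsend should be left untouched until MPI_Wait, or an MPI_Test that reports completion, says the request is done.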

[OMPI users] MPI_Bsend vs. MPI_Ibsend (2)

2010-05-06 Thread Jovana Knezevic
Thank you all! Regarding the posted Recv, I am aware that neither send nor buffered send tells the sender if it is posted. Regarding the distinction between blocking and non-blocking calls in general, everything is clear as well. On the other hand, a slight confusion when Buffered send is

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Eugene Loh
Gus Correa wrote: 2) However, running with "sm" still breaks, unfortunately: I get the same errors that I reported in my very first email, if I increase the number of processes to 16, to explore the hyperthreading range. This is using "sm" (i.e. not excluded in the mca config file), and

Re: [OMPI users] Can NWChem be run with OpenMPI over an InfiniBand interconnect ... ??

2010-05-06 Thread Ralph Castain
Yeah, you just need to set the param specified in the warning message. We inserted that to ensure that people understand that IB doesn't play well with fork'd processes, so you need to be careful when doing so. On May 6, 2010, at 12:27 PM, Addepalli, Srirangam V wrote: > Hello Richard, >

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Ralph Castain
I know a few national labs that run OMPI w/Fedora 9, but that isn't on Nehalem hardware and is using gcc 4.3.x. However, I think the key issue really is the compiler. I have seen similar problems on multiple platforms and OS's whenever I use GCC 4.4.x - I -think- it has to do with the

Re: [OMPI users] Can NWChem be run with OpenMPI over an InfiniBand interconnect ... ??

2010-05-06 Thread Addepalli, Srirangam V
Hello Richard, Yes, NWChem can be run on IB using 1.4.1, if you have built Open MPI with IB support. Note: if your IB cards are QLogic, you need to compile NWChem with MPI-SPAWN. Rangam Settings for my build with MPI-SPAWN: export ARMCI_NETWORK=MPI-SPAWN export IB_HOME=/usr export

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Samuel K. Gutierrez
Hi Gus, This may not help, but it's worth a try. If it's not too much trouble, can you please reconfigure your Open MPI installation with --enable-debug and then rebuild? After that, may we see the stack trace from a core file that is produced after the segmentation fault? Thanks, --

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Gus Correa
Hi Jeff Thank you for your testimony. So now I have two important data points (you and Douglas Guptill) to support the argument here that installing Fedora on machines meant to do scientific and parallel computation is to ask for trouble. I use CentOS in our cluster, but this is a standalone

[OMPI users] Can NWChem be run with OpenMPI over an InfiniBand interconnect ... ??

2010-05-06 Thread Richard Walsh
All, I have built NWChem successfully and am trying to run it with an Intel-built version of OpenMPI 1.4.1. If I force it to run over the 1 GigE maintenance interconnect it works, but when I try it over the default InfiniBand communications network it fails with:

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Gus Correa
Hi Eugene Thanks for the detailed answer. * 1) Now I can see and use the btl_sm_num_fifos component: I had committed already "btl = ^sm" to the openmpi-mca-params.conf file. This apparently hides the btl_sm_num_fifos from ompi_info. After I switched to no options in
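The "^" in "btl = ^sm" marks an exclusion list. The C sketch below models that selection rule under simplifying assumptions (a plain comma-separated list selects only the named components; a leading "^" excludes them and selects everything else). component_selected is a hypothetical helper, not Open MPI's parser. It only shows why sm drops out of selection; that ompi_info then stops listing the btl_sm_* parameters is Gus's observation, not something the code models.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if 'name' appears as a whole entry in comma-separated 'list'. */
static int list_contains(const char *list, const char *name) {
    size_t n = strlen(name);
    const char *p = list;
    while (*p) {
        const char *end = strchr(p, ',');
        size_t len = end ? (size_t)(end - p) : strlen(p);
        if (len == n && strncmp(p, name, n) == 0)
            return 1;
        if (!end)
            break;
        p = end + 1;
    }
    return 0;
}

/* Simplified, hypothetical model of a component list such as the value of
   "btl" in openmpi-mca-params.conf: returns 1 if component 'name' would be
   selected under 'spec'. */
int component_selected(const char *spec, const char *name) {
    assert(spec != NULL && name != NULL);
    if (spec[0] == '^')
        return !list_contains(spec + 1, name);  /* exclusion list */
    return list_contains(spec, name);           /* inclusion list */
}
```

Under this model `component_selected("^sm", "sm")` is 0 while `component_selected("^sm", "tcp")` is 1, matching the intent of excluding only shared memory.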

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Jeff Squyres
On May 6, 2010, at 1:11 PM, Gus Correa wrote: > Just for the record, I am using: > Open MPI 1.4.2 (released 2 days ago), gcc 4.4.3 (g++, gfortran). > All on Fedora Core 12, kernel 2.6.32.11-99.fc12.x86_64 #1 SMP. Someone asked earlier in this thread -- I've used RHEL4.4 and RHEL5.4 on my

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Gus Correa
Hi Douglas Just for the record, I am using: Open MPI 1.4.2 (released 2 days ago), gcc 4.4.3 (g++, gfortran). All on Fedora Core 12, kernel 2.6.32.11-99.fc12.x86_64 #1 SMP. The machine is a white box with two-way quad-core Intel Xeon (Nehalem) E5540 @ 2.53GHz, 48GB RAM. Hyperthreading is

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Eugene Loh
Gus Correa wrote: Hi Eugene Thank you for answering one of my original questions. However, there seems to be a problem with the syntax. Is it really "-mca btl btl_sm_num_fifos=some_number"? No. Try "--mca btl_sm_num_fifos 4". Or, % setenv OMPI_MCA_btl_sm_num_fifos 4 % ompi_info -a | grep
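Eugene's two forms are equivalent because Open MPI also reads MCA parameters from environment variables named OMPI_MCA_ followed by the parameter name. A small C sketch of that naming rule; mca_env_name and mca_env_value are hypothetical helpers, and Open MPI's own lookup code is more involved.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Builds the environment variable name for an MCA parameter:
   "OMPI_MCA_" + parameter name, e.g.
   btl_sm_num_fifos -> OMPI_MCA_btl_sm_num_fifos.
   Returns 0 on success, -1 if 'out' is too small. */
int mca_env_name(const char *param, char *out, size_t outlen) {
    assert(param != NULL && out != NULL);
    int n = snprintf(out, outlen, "OMPI_MCA_%s", param);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Hypothetical helper: returns the value set in the environment for an
   MCA parameter, or NULL if it is unset. */
const char *mca_env_value(const char *param) {
    char name[256];
    if (mca_env_name(param, name, sizeof name) != 0)
        return NULL;
    return getenv(name);
}
```

So `--mca btl_sm_num_fifos 4` on the mpirun command line and `setenv OMPI_MCA_btl_sm_num_fifos 4` in the shell set the same parameter; neither form has a `btl` token or an `=` sign.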

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Douglas Guptill
Hello Gus: On Thu, May 06, 2010 at 11:26:57AM -0400, Gus Correa wrote: > Douglas: > Would you know which gcc you used to build your Open MPI? > Or did you use Intel icc instead? Intel ifort and icc. I build OpenMPI with the same compiler, and same options, that I build my application with. I

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Gus Correa
Hi Eugene Thank you for answering one of my original questions. However, there seems to be a problem with the syntax. Is it really "-mca btl btl_sm_num_fifos=some_number"? (FYI, I am using Open MPI 1.4.2, a tarball from two days ago.) When I grep any component starting with btl_sm I get

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Gus Correa
Hi Ralph, Douglas Ralph: Yes, I am on the blacklist in your ticket (gcc 4.4.3): gcc --version gcc (GCC) 4.4.3 20100127 (Red Hat 4.4.3-4) Copyright (C) 2010 Free Software Foundation, Inc. Is it possible (and not too time consuming) to install an older gcc on this Fedora 12 box, and compile Open

Re: [OMPI users] MPI_Bsend vs. MPI_Ibsend

2010-05-06 Thread Richard Treumann
An MPI send (of any kind) is defined by "local completion semantics". When a send is complete, the send buffer may be reused. The only kind of send that gives any indication whether the receive is posted is the synchronous send. Neither standard send nor buffered send tells the sender if the recv

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Eugene Loh
Ralph Castain wrote: Yo Gus Just saw a ticket go by reminding us about continuing hang problems on shared memory when building with gcc 4.4.x - any chance you are in that category? You might have said something earlier in this thread Going back to the original e-mail in this thread:

Re: [OMPI users] MPI_Bsend vs. MPI_Ibsend

2010-05-06 Thread Bill Rankin
Actually the 'B' in MPI_Bsend() specifies that it is a blocking *buffered* send. So if I remember my standards correctly, this call requires: 1) you will have to explicitly manage the send buffers via MPI_Buffer_[attach|detach](), and 2) the send will block until a corresponding receive is
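For point 1), the attached buffer has to be large enough for every buffered message that may be pending at once: per the MPI standard, each needs its packed size (as reported by MPI_Pack_size) plus MPI_BSEND_OVERHEAD bytes. A minimal C sketch of that arithmetic; bsend_buffer_size is a hypothetical helper, and the overhead is passed as a parameter because MPI_BSEND_OVERHEAD is implementation-defined and comes from mpi.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: total size of an attached buffer that can hold
   'nmsgs' pending buffered messages at once. Each message needs its packed
   size (in real MPI, from MPI_Pack_size) plus a per-message overhead
   (in real MPI, MPI_BSEND_OVERHEAD). Writes the total to *out and returns
   0, or returns -1 if the sum would overflow size_t. */
int bsend_buffer_size(const size_t *packed_sizes, size_t nmsgs,
                      size_t overhead, size_t *out) {
    assert(out != NULL && (packed_sizes != NULL || nmsgs == 0));
    size_t total = 0;
    for (size_t i = 0; i < nmsgs; i++) {
        if (packed_sizes[i] > SIZE_MAX - overhead)
            return -1;                 /* message + overhead overflows */
        size_t need = packed_sizes[i] + overhead;
        if (need > SIZE_MAX - total)
            return -1;                 /* running total overflows */
        total += need;
    }
    *out = total;
    return 0;
}
```

The resulting size is what one would allocate and hand to MPI_Buffer_attach; for three pending messages of 800, 800 and 400 packed bytes and an assumed overhead of 100, it gives 2300 bytes.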

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread John Hearns
Gus, I'm not using OpenMPI, however OpenSUSE 11.2 with current updates seems to work fine on Nehalem. I'm curious that you say the Nvidia graphics driver does not install - have you tried running the install script manually, rather than downloading an RPM etc? I'm using version 195.36.15 and it

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Ralph Castain
Yo Gus Just saw a ticket go by reminding us about continuing hang problems on shared memory when building with gcc 4.4.x - any chance you are in that category? You might have said something earlier in this thread On May 5, 2010, at 5:54 PM, Douglas Guptill wrote: > On Wed, May 05, 2010

Re: [OMPI users] Fortran derived types

2010-05-06 Thread Richard Treumann
Assume your data is discontiguous in memory and making it contiguous is not practical (e.g. there is no way to make cells of a row and cells of a column both contiguous.) You have 3 options: 1) Use many small/contiguous messages 2) Allocate scratch space and pack/unpack 3) Use a derived
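Option 2 can be sketched in plain C without any MPI calls. The strided_layout struct and the pack_strided/unpack_strided functions are hypothetical; they describe the data the same way a vector datatype does (count blocks of blocklen elements, stride elements apart), so option 3 would hand an equivalent description to MPI_Type_vector instead of copying. C arrays are row-major, so here a column is the strided case; in Fortran, which is column-major, it is a row.

```c
#include <assert.h>
#include <stddef.h>

/* Strided layout in the style of a vector datatype. One column of a
   row-major nrows x ncols matrix is {nrows, 1, ncols}. */
typedef struct {
    size_t count;     /* number of blocks */
    size_t blocklen;  /* contiguous elements per block */
    size_t stride;    /* distance, in elements, between block starts */
} strided_layout;

/* Option 2: gather the strided elements into contiguous scratch space.
   Returns the number of elements written. */
size_t pack_strided(const double *src, strided_layout l, double *scratch) {
    assert(src != NULL && scratch != NULL);
    size_t k = 0;
    for (size_t b = 0; b < l.count; b++)
        for (size_t e = 0; e < l.blocklen; e++)
            scratch[k++] = src[b * l.stride + e];
    return k;
}

/* Inverse of pack_strided: scatter contiguous data back into place. */
size_t unpack_strided(const double *scratch, strided_layout l, double *dst) {
    assert(scratch != NULL && dst != NULL);
    size_t k = 0;
    for (size_t b = 0; b < l.count; b++)
        for (size_t e = 0; e < l.blocklen; e++)
            dst[b * l.stride + e] = scratch[k++];
    return k;
}
```

For a row-major 3 x 4 matrix stored in a flat array, column 2 is `{3, 1, 4}` starting at element 2, and packing it gathers elements 2, 6 and 10 into one contiguous buffer that can be sent as a single message.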

[OMPI users] opal_mutex_lock(): Resource deadlock avoided

2010-05-06 Thread Ake Sandgren
Hi! We have a code that trips on this fairly often. I've seen cases where it works, but mostly it gets stuck here. The actual MPI call is call mpi_file_open(...) I'm currently just wondering if there have been other reports of (or anyone has seen) deadlocks in the MPI-IO parts of the code, or if this most
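The text in the subject line is glibc's strerror message for EDEADLK. One way it arises is a thread relocking a POSIX error-checking mutex it already holds: instead of hanging, pthread_mutex_lock returns EDEADLK. The sketch below reproduces that with plain pthreads; relock_errorcheck_mutex is a hypothetical name, and whether Open MPI's opal_mutex_lock uses error-checking mutexes in a given build (for example one configured with --enable-debug) is an assumption, not something this excerpt states.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <string.h>

/* Locks an error-checking mutex twice from the same thread and returns the
   error code of the second attempt. With PTHREAD_MUTEX_ERRORCHECK, POSIX
   requires EDEADLK here instead of a silent hang; strerror(EDEADLK) is
   "Resource deadlock avoided" on glibc. */
int relock_errorcheck_mutex(void) {
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_lock(&m);   /* first lock succeeds */
    assert(rc == 0);
    rc = pthread_mutex_lock(&m);   /* same thread again: EDEADLK */

    pthread_mutex_unlock(&m);      /* release the first lock */
    pthread_mutex_destroy(&m);
    return rc;
}
```

Under that assumption, the error points to a thread trying to re-acquire a lock it already owns, which an ordinary mutex would turn into a hang.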

Re: [OMPI users] Fortran derived types

2010-05-06 Thread Paul Kapinos
Hi, > In general, even in your serial fortran code, you're already taking a performance hit using a derived type. That is not generally true. The right statement is: "it depends". Yes, sometimes derived data types and object orientation and so on can lead to some performance hit; but current

[OMPI users] MPI_Bsend vs. MPI_Ibsend

2010-05-06 Thread Jovana Knezevic
Dear all, Could anyone please clarify for me the difference between MPI_Bsend and MPI_Ibsend? Or, in other words, what exactly is "blocking" in MPI_Bsend, when the data is stored in the buffer and we "return"? :-) Another, but similar, question: What about the data-buffer - when can it be reused in

Re: [OMPI users] Fortran derived types

2010-05-06 Thread Terry Frankcombe
Hi Derek On Wed, 2010-05-05 at 13:05 -0400, Cole, Derek E wrote: > In general, even in your serial fortran code, you're already taking a > performance hit using a derived type. Do you have any numbers to back that up? Ciao Terry